ChatGPT and the oil spill

Johannes Stiehler
Technology
Random Rant
#LargeLanguageModels
#TextAI
Cover Image for ChatGPT and the oil spill

"Data is the new oil," said British mathematician Clive Humby in 2006. He probably didn't realize how right he was and how far this analogy can be taken.

  • Data is like oil, without processing it is useless.
  • Data is like oil, essential processes are driven by it.
  • Data is like oil, it is - in a sense - bad for the environment.
  • Data is like oil, it is valuable, though available in relatively large quantities.

But one of the key differences is that, unlike oil, data doesn't deplete. You can use the same data sets over and over again in different contexts.

Classical machine learning has always been about collecting and, more importantly, preparing data first. Automatic translation, for example, hinges on the existence of so-called "alined texts", i.e. documents that are available in two different languages and can ideally be compared sentence by sentence. On this basis, a machine learning model can then be trained to spit out the (often) correct counterpart also for new sentences. In the contest of such systems, the decisive factor was often who had the cleaner and larger data sets and not who had the better algorithm or the better technology.

The technical approaches are often based on freely available research anyway; it's the data that makes the difference.

With deep learning, this hunger for data has become much bigger. While previous methods worked quite well with 100 positive examples per category (e.g. 100 financial news vs 100 political news to train a classifier), deep learning methods needed at least one order of magnitude more. But then suddenly something crazy like image search or image categorization started to work.

With Large Language Models (and other forms of generative AI) we have ignited the next level of data craving. All the talk here is really about billions and then some. Granted, this input data is just uncategorized "tokens" instead of manually presorted documents, but the amounts of data used are still gigantic.

A key difference, however, is that in the case of ChatGPT and other such models, someone else has collected all sorts of multi-colored data for me and mixed it into a mash - an oil spill, so to speak.

As a result, the model can answer almost any question without any further data input, but conversely cannot guarantee truly reliable, accurate, or even reproducible answers for any question. The big data mush spits out an answer to almost everything, but the thinner the data was in the training set, the more questionable the answer becomes.

Huh, but ChatGPT was trained on all the knowledge in the world, wasn't it?

Even if it were (both "total" and "knowledge" and "world" are probably not appropriate terms), it needs not just one document on each topic, but a great many to end up with factually correct answers. You can see this very nicely if you go a bit to the "periphery" of the world knowledge: If there were few documents on a topic in the training set and if one perhaps even surprises the system with aberrant query languages such as German, then hallucinations are never far away:

Amtsgericht, 1. attempt

No, that is not correct, Georg Lohmeier wrote this series and the cases are all set in 1911/12.

But maybe the hyphen was confusing, in the original it is written without it:

Amtsgericht, 2. attempt

No, that is equally wrong. Interestingly, the model remains in the correct "environment", both Ernst Maria Lang and Max Neal are humorists from Bavaria in the broadest sense. But can the hyphen cause such a reinterpretation?

Let's rather try this again with a literal repetition (in the same chat window):

Amtsgericht, 3. attempt

Ah, another Bavarian humorist, again the wrong one. For all mentioned gentlemen (including the real author) Wikipedia pages exist and also other information in the net. Nevertheless, ChatGPT-4 (in the very latest manifestation from Sep 12, 2023) happily fantasizes in different directions.

If we move a little bit more into the center of the world of ChatGPT, the answers become correct, consistent and above all repeatable again:

Oily data, 1. attempt

Every repetition of this question produces (for me, as of today) the same answer. And even the English version is consistent and virtually a 1:1 translation of the German one:

Oily data, 2. attempt


But does it matter? Is anyone really interested in the "Königlich Bayerisches Amtsgericht"?

Probably yes, but the more substantial problem is that countless start-ups are founded in extremely narrow niches with the claim to build a great "AI-powered" product for that niche, which in reality often means that these products send prompts to ChatGPT in a more or less imaginative way and then process its output.

The aforementioned example documents vividly what quality can be expected from such approaches once you leave the lush core of OpenAi training data.

Johannes Stiehler
CO-Founder NEOMO GmbH
Johannes has spent his entire professional career working on software solutions that process, enrich and surface textual information.

There's more where this came from!

Subscribe to our newsletter

If you want to disconnect from the Twitter madness and LinkedIn bubble but still want our content, we are honoured and we got you covered: Our Newsletter will keep you posted on all that is noteworthy.

Please use the form below to subscribe.

NEOMO is committed to protecting and respecting your privacy and will only use your personal data to manage your account and provide the information you request. In order to provide you with the requested content, we need to store and process your personal data. If you consent to us storing your personal data for this purpose, please check the box below.

Follow us for insights, updates and random rants!

Whenever new content is available or something noteworthy is happening in the industry, we've got you covered.

Follow us on LinkedIn and Twitter to get the news and on YouTube for moving pictures.

Sharing is caring

If you like what we have to contribute, please help us get the word out by activating your own network.

More blog posts

Image

"Digitale Wissensbissen": The future of data analysis – A conversation with Christian Schömmer

Data Warehouse, Data Lake, Data Lakehouse - the terms are constantly escalating. But what do I really need for which purpose? Is my old (and expensive) database sufficient or would a “Data Lakehouse” really help my business? Especially in combination with Generative AI, the possibilities are as diverse as they are confusing. Together with Christian Schömmer, we sit down in front of the data house by the lake and get to the bottom of it.

Image

"Digitale Wissensbissen": Generative AI in Business-Critical Processes

After the somewhat critical view of generative AI in the last episode, this time we are looking at the specific application: can generative AI already be integrated into business processes and, if so, how exactly does it work? It turns out that if you follow two or three basic rules, most of the problems fade into the background and the cool possibilities of generative AI can be exploited with relatively little risk. We discuss in detail how we built a compliance application that maximizes the benefits of large language models without sacrificing human control and accountability. (Episode in German)

Image

Bleeding Edge - curse or blessing?

We rely on bleeding edge technologies to drive companies forward with innovative solutions. That's why, in discussions with customers and partners or in our webinars, we are always keen to explain the benefits and possibilities of modern technologies to companies. But we also use AI ourselves: by automating tendering processes, we have been able to save valuable resources and increase efficiency.