Exploring the Limitations of ChatGPT

The well-deserved hype around the commercial release of Generative AI platforms like ChatGPT has made it seem like these tools can do anything. While they certainly are impressive, companies should understand some of their limitations before deciding how and when to integrate these technologies into their business.

A look back on the first year of ChatGPT

As we approach the anniversary of the release of ChatGPT, it’s fitting to take a look back at the first year of OpenAI’s revolutionary platform. You have likely read many articles about what ChatGPT can do, which is why this article looks specifically at what I believe to be its largest current limitations. I have no doubt that some of these will be addressed through future updates of the platform, but others will be persistent issues rooted in the underlying technology that ChatGPT is based on.

As the hype about the “AI revolution” builds up, it can be extraordinarily difficult to maintain a realistic perspective of the current and future capabilities of Generative AI tools – especially if your only exposure to these technologies is through commercially available products like ChatGPT and Google’s Bard.

An engineering perspective

Having been around NLP for some eight years now, I’ve had a chance to watch Large Language Models (LLMs) evolve over most of the last decade – long before they had a useful user interface like ChatGPT and became one of the fastest-growing consumer applications in history.

ChatGPT is a great technology which already has – and will continue to have – a large impact on most fields of knowledge work. Most people who’ve watched NLP mature over the last decade will surely agree that the ability of GPT-X models to produce coherent output while remaining very sensitive to the user's specific prompts is outstanding when compared to previously available models of similar size.

However, the rapid adoption of these tools by the general public – ChatGPT reached an incredible 100m (!!) users in just the first two months after release – has caused many users to attempt tasks for which these tools will never be a suitable technology or, at the very least, for which models of the size of GPT-X will be incredibly inefficient and expensive.

So for the remainder of this article we will dive deeper into 5 areas where – from an engineering perspective – Generative AI platforms will not be an ideal solution. In each of these areas, we will take a look at alternative approaches that either complement generative language models or take a completely different approach to the same problem, with lower infrastructure costs or more reliable output.

Note: because the causes of the limitations in Generative AI are quite technical in nature, their solutions are also quite technical. I will attempt to keep this text approachable even to those without a strong technical background, but there is only so much that I can do in that regard without doubling the size of this article.

Top 5 limitations of Generative AI tools

1. Using ChatGPT as a knowledge base

Large enough sequence-to-sequence language models have been shown to work surprisingly well for information retrieval tasks. This goes hand-in-hand with training on publicly available snapshots of the web, where many relationships are captured and some of the important ones are repeatedly reinforced.

However, relying on a generative model alone for information retrieval has multiple drawbacks. First, we have no way to keep the model's information up-to-date. Whenever the president of a country changes, our model would have to be fine-tuned on new texts that present the new association in context.

Secondly, we have limited options to properly connect outputs to their original source. While some interpretability methods can provide you with an attribution of the generated output to a specific training example, these are still stochastic processes with no definite yes/no answer.

On top of all that, using models with more than 100 billion parameters as an information database can be disproportionately expensive, compute-wise.

An alternative approach

If we cannot use structured databases (discussed in the next point), we might want to use a language model specialized for information retrieval (IR). In such a scenario, the model produces a short vector representation of any piece of text.

Such IR models can be trained so that their resulting vector representations have the properties that we want our model to serve. For instance, for the purposes of search, we can optimize the model so that its vector representations of queries are similar to the ones of relevant search results.

A nice property of this approach is that we can greatly benefit from publicly available data for search and adapt an already well-performing pre-optimized model specifically to our user queries with relatively little data.

In dense information retrieval, as this approach is commonly referred to, the resulting model is first used to create an index, i.e. vectors of the whole search database. For example, indexing all of Wikipedia with a smaller model of roughly 100M parameters, like BERT, might take around 12 hours. Inference afterwards, however, requires encoding only the query, which takes less than a second, even on a CPU.
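To make the two phases concrete, here is a minimal sketch of the index-then-query workflow. It is an illustration only: the `embed` function is a toy word-count stand-in that I am assuming in place of a trained neural encoder, and the documents are made up. In a real dense retrieval setup, `embed` would call your IR model and return a fixed-length vector.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a trained encoder (assumption): a word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Offline step: encode every document once and store the vectors."""
    return [(doc, embed(doc)) for doc in documents]

def search(query: str, index: list[tuple[str, Counter]], top_k: int = 1) -> list[str]:
    """Online step: encode only the query, then rank the stored vectors."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Hypothetical knowledge base, for illustration only.
DOCUMENTS = [
    "Invoices are sent on the first day of each month.",
    "Passwords can be reset from the account settings page.",
    "Our office is closed on public holidays.",
]
INDEX = build_index(DOCUMENTS)
best = search("how do I reset my password", INDEX)
```

The expensive part (`build_index`) runs once, while each user request only pays for one call to `embed` plus a similarity ranking.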

If a reference to the source with an answer is not enough, you can radically decrease the chances of hallucination in the outputs of generative LMs (e.g. chatbots) with an approach called retrieval-augmented generation: you first retrieve relevant context using your search engine, and then you include the results in the LM's input context. Moving the memory function out of the model parameters further allows you to use much smaller (and cheaper) generative LMs with the same or better relevance of the answers.

From our experience, in a combination of Conversational LM + Information Retrieval, the IR component is the one that does the heavy lifting. Wrapping IR outputs into a good-looking response is the easier part, thanks to the pre-training of generative LMs. Therefore, if you have data, adjusting the search model to your needs can bring you large qualitative benefits compared to generic embedding solutions, whether open-source or behind a proprietary API.
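As a rough sketch of the second step, the snippet below shows one way to place retrieved passages into the LM's input. The prompt wording and the function name `build_rag_prompt` are my own assumptions for illustration; the passages would come from whatever search component you use.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Put retrieved passages into the LM input so the answer can rely on them."""
    # Number the sources so the generated answer can refer back to them.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered sources below. "
        "If they do not contain the answer, say that you do not know.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved passage, for illustration only.
prompt = build_rag_prompt(
    "When are invoices sent?",
    ["Invoices are sent on the first day of each month."],
)
```

Numbering the sources also gives you a simple handle for linking an answer back to where it came from.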

2. Integration into more complex process chains

With all the showcases of the universality of GPT-X models, you might be wondering how to integrate the new technology into your own processes, but without rebuilding all your automation from the ground up.

You might quickly find that in order to do that, you’ll need to abandon the smooth, natural language that ChatGPT generates. Instead, you will need the model to comply with a very specific input and output format, which the LM may have never seen before. After a while of prompt tuning, you can mostly achieve that with ChatGPT, but from time to time you will still come across errors in parsing the outputs or executing the generated SQL, which break your pipeline. Worse still, because the version of ChatGPT changes silently, you will have to revisit your niche prompts each time that happens, roughly every month, and you might only notice after you encounter errors in production (as we have experienced).

In short, if you want to include large language models in a chain of automation, you’ll need very specific inputs and outputs. That is not something that ChatGPT excels at, so you need to expect errors.
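One way to plan for such errors is to validate every model output before passing it on. The sketch below assumes the model was asked for a JSON object; the required keys are a hypothetical schema chosen purely for illustration.

```python
import json

REQUIRED_KEYS = {"intent", "customer_id"}  # hypothetical schema, for illustration

def parse_model_output(raw: str) -> dict:
    """Return the parsed record, or raise ValueError explaining what is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

def safe_parse(raw: str):
    """Pipeline-friendly wrapper: None signals 'retry or hand off to a human'."""
    try:
        return parse_model_output(raw)
    except ValueError:
        return None
```

Logging the rejected outputs is also a cheap way to notice when a silent model update has started to break your prompts.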

An alternative approach

ChatGPT can produce structured outputs like JSON or SQL because a portion of its web-scraped training data contains such data, and ChatGPT has the capacity to remember its most common patterns. But again, this ability is quite specific and does not require interactions with other tasks. Remember that GPT-X models are primarily tuned on human feedback given directly to the model's responses, which largely optimizes the model for smooth and good-looking outputs.

Thus, if your use case requires text-to-structure, structure-to-text or structure-to-structure generation, you might be better off fine-tuning a specialized, much smaller generation model. If you don't have enough of your own input-output pairs, you can utilize datasets for code generation from HuggingFace Datasets, e.g. Spider for SQL generation, or GitHub scrapes for a myriad of other structured languages.

A nice thing about tuning your own structured generation model is that you no longer need to bother with elaborately crafted prompts – you can simply fine-tune your model for generation in any specific prompting format that you desire to use.

But since the publicly available datasets are often from a domain different from yours, it is always a good idea to add in some of your own data whenever possible.

For those with more of a technical background, I’d advise you to formulate the training task in a form that is general enough to allow your model to generalize to your own data. Here’s a quick example to explain what I mean:

When we fine-tuned the text-to-SQL model for our own projects, we also included the description of the database structure in the training prompts. This way, the model learns that it should look for the structure in the input prompt, rather than memorize the structures of the training databases. When we use such a model in production, we also always have to provide it with the structure of the database that it's supposed to query, but the resulting model works even with structures that were never present in the training data.
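To illustrate the idea, here is a minimal sketch of how one training pair could be assembled with the schema in the input. The prompt layout, the function name and the example table are assumptions made for illustration, not necessarily the format used in the projects described above.

```python
def make_training_example(schema: str, question: str, sql: str) -> dict:
    """Build one input/target pair with the database schema inside the input."""
    source = f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    return {"input": source, "target": sql}

# Hypothetical table and question, for illustration only.
example = make_training_example(
    schema="CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);",
    question="What is the total value of all orders?",
    sql="SELECT SUM(total) FROM orders;",
)
```

At inference time the same layout is reused with your production schema, so the model reads table and column names from the prompt instead of recalling them from training.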

3. Applications requiring a knowledge base specific to your organization

Naturally, using your own knowledge base as the source of information for your chatbot has many different direct applications. You can serve the knowledge directly to the users using as many different sources of contextual information as possible, including the chat history, user history, user profile or user preferences.

However, the current version of ChatGPT is based on a scrape of the web from September 2021, and most of the user context is not going to be available in the training data at all. So in the best-case scenario, the general-purpose LM will just tell you "I do not know." And in the worst case it will hallucinate a response, i.e. it will just make something up.

An alternative approach

In order to integrate fresh and personalized knowledge into conversations with customers you will need to feed your generative model with that relevant knowledge.

For that, both information retrieval and structured generation described in (1) and (2) will come in handy: In (1), you can directly use the query to search in free-text knowledge bases. In (2), you rephrase the user query into a query to your database, execute it, and concatenate the result into the context of your generative model.

A nice property of incorporating text-to-structure generation into the pipeline is that you know what is going on under the hood since you can see and check the structured query. For instance, in cases where the user reports an irrelevant response, you can check what information was retrieved by the generation model from your database. If the problem happens again, you can further fine-tune the text-to-SQL model on a relatively small set of erroneous cases and quickly redeploy.
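The following sketch shows what such an inspectable pipeline might look like with an in-memory SQLite database. The `text_to_sql` function is a hard-coded, hypothetical stand-in for a fine-tuned model, and the table is invented for the example.

```python
import sqlite3

def text_to_sql(question: str) -> str:
    """Hypothetical stand-in for a fine-tuned text-to-SQL model."""
    return "SELECT status FROM orders WHERE id = 42"

def answer_context(question: str, conn: sqlite3.Connection) -> dict:
    """Generate SQL, run it, and keep the query so it can be inspected later."""
    sql = text_to_sql(question)
    rows = conn.execute(sql).fetchall()
    # The result is concatenated into the context of the generative model.
    context = f"Question: {question}\nDatabase result: {rows}"
    return {"sql": sql, "rows": rows, "context": context}

# Invented example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (42, 'shipped')")
trace = answer_context("Where is my order 42?", conn)
```

Storing `trace["sql"]` next to each response is what lets you check, after a user complaint, exactly which query was run and what it returned.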

4. Applications working with confidential data

If you do not choose otherwise, your prompts to GPT-X models will be collected and used to improve future iterations. You can opt out, but you still do not have full control over what is happening with possibly sensitive user data that you send outside.

Note: there is currently an Enterprise version of ChatGPT that solves this problem, but for now it's out of reach for many companies and for pretty much all individual customers.

For better or for worse, very large LMs are very good at remembering things. It does not require thousands of occurrences of a specific association for the LM to note the connection – even just one or two occurrences might be enough. You might have heard the story of Amazon employees who used ChatGPT and later found information from their conversations in the model's responses.

An option here might be to deploy your own instance of GPT-X (or another LLM) onto your Azure cloud. If you are fine with Azure, this solves the problem of sending data outside. But to no surprise, running your own copy of a gigantic language model is prohibitively expensive for pretty much everyone – and anyone who could afford it is probably better off with the Enterprise version of ChatGPT.

An alternative approach

If none of these options work for you, there are also open-source alternatives referred to as either conversational or instruction models. You can run and fine-tune these models yourself on any infrastructure that you like without sharing your data. You might have heard of Mistral or Llama, but you can also find smaller, well-performing instruction-tuned alternatives like FLAN-T5, or Tk-Instruct. You can find the most recent options and find links to download the corresponding models on public leaderboards.

Again, the integration of external sources of knowledge is the way to go here, rather than fine-tuning custom models on users' own sparse and ephemeral data. That is a good thing: you do not need to store the data anywhere else (for example, scattered across the model's parameters), and you still have full control over it, including manual modifications and compliance with GDPR.

5. Applications that require explicit or symbolic reasoning

While GPT-X models are much better than their predecessors at tasks requiring logical reasoning, it still remains very risky to use them in tasks requiring explicit calculations, or, more generally, symbolic manipulations.

To understand why that is the case, one must note that the language model does not understand the notion of numbers as something special, different from other words. Language models learn the representations of words like "dog" and "12" in the same way. But because "12" co-occurs in training contexts with other tokens like "+" and "4", the model might generalize what to do when it gets "12 + 4" as input. However, when the model gets a less likely combination, such as "786895 * 387 + 14895", it will bump into the inaccuracy of its representations, because it has not seen these co-occurrences enough times, if ever, and will consequently fail to generate the correct next token.

An alternative approach

As of today, GPT-4 already integrates several tools to support symbolic reasoning, including, for instance, API calls to Wolfram Alpha. The challenge now is to train models to correctly use these tools in appropriate situations.

It will take time for OpenAI to solve this problem because the integration of many diverse tools capable of symbolic reasoning is necessary to answer everyday user queries. Often a simple calculator will not be enough. It will also require something like Python code, memory functions, or logical verifiers.

But your task might not require integration of all of these different symbolic systems. For instance, if you only need to use a calculator, you can simply train your model to do that. That is also what we did in training language models to use a calculator in predictions, by using publicly-available datasets of math exam problems.

But your task might not even require multi-step reasoning, which is currently a bottleneck of smaller models. If correct answers to your targeted prompts can be obtained from a single request to an external system, text-to-structure generation might be sufficient for your use case. For example, you might generate a calculator prompt in a single prediction and return its answer directly to the user.
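As a minimal sketch of the external system in that last example, the function below evaluates a basic arithmetic expression without relying on a language model or on `eval()`. It assumes the model's only job is to emit the expression string; the name `calculate` and the supported operator set are my own choices for illustration.

```python
import ast
import operator

# Supported binary operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculate(expression: str):
    """Evaluate +, -, *, / over numbers by walking the parsed expression tree."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

# The expression a language model is likely to get wrong:
result = calculate("786895 * 387 + 14895")  # 304543260
```

Because the arithmetic happens outside the model, the answer does not depend on how often the model has seen these particular numbers together.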

If your goal is natural interaction with the user, you will need an integration similar to our Calc-aided models. Training such models requires exposure to the format of interaction in the training data.

If you already have APIs or tools that contain factual information that could help the model in the conversations of your users, think about how you can automatically or semi-automatically inject such interactions into your unlabeled data. If you can do that, you can fine-tune an open-source model for a better quality of your tool use than what general-purpose ChatGPT can give you. And again, think about how publicly-available data could help you. Perhaps it can also be transformed into a format that could aid the model to generalize well to your own application and you do not even need your own dataset(s).
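As one hedged illustration of what automatic injection could look like, the sketch below finds simple arithmetic statements in plain text and rewrites the ones it can verify into tool-call markup. The `<calc>` and `<result>` tags are a hypothetical format invented for this example, not the format of any particular model.

```python
import re

# Matches simple "a op b = c" statements with non-negative integer operands.
PATTERN = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")

def annotate(text: str) -> str:
    """Rewrite verified 'a op b = c' statements into hypothetical tool-call markup."""
    def replace(match):
        a, op, b, c = match.groups()
        value = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
        if value != int(c):
            return match.group(0)  # leave statements we cannot verify untouched
        return f"<calc>{a} {op} {b}</calc><result>{c}</result>"
    return PATTERN.sub(replace, text)

annotated = annotate("Each box holds 12 + 4 = 16 items.")
```

Checking each statement against the tool before annotating keeps incorrect arithmetic from the raw data out of the training targets.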

Conclusion

Once again, let me reiterate that this first wave of consumer Generative AI platforms is an amazing technological breakthrough. I have no doubt they will continue to improve at a remarkable rate, and that their application will have a tremendous impact on multiple domains of both research and knowledge work.

However, I believe that the best way to continue their improvement is a clear-eyed acknowledgement of their current limitations. This is especially true if you’re running a large company and planning on making significant financial investments into these technologies.

NLP engineers experience a complicated mix of emotions as the general public engages with their domain of expertise. On one hand, it means that there will be more commercial demand for more research and development. But on the other hand, one simply needs to accept that there will be a certain level of misunderstanding from those engaging without the technical background. Perhaps this is what it's like for astrophysicists to watch movies about space.

I look forward to following up on this article after more time has passed. But in the meantime, if you have specific questions about adopting these new technologies into your business, my team and I would be happy to give you more information. You can reach out to us here any time that you would like.