
One of the issues that keeps bubbling to the surface with the increasing use of ChatGPT is the occasional inclusion of obviously incorrect information in responses, a problem that has been accurately described as hallucination. Why does this occur, and can it be controlled?
When we were looking at a simple OpenAI API query, we bumped into the variable temperature. Other than noting that it can be set between 0 and 1, we merely said it controlled “the creativity of the response.” Here’s a lightly technical look at what that means.
Before moving on, we had better briefly remember that when an engineering mind thinks “temperature,” they are not thinking “it’s getting hot in here” so much as “raised entropy.” Consider the extra jiggling about of excited molecules as an increased range of (random) possibilities.
Temperature is not specific to OpenAI; it belongs more to the ideas of natural language processing (NLP). While large language models (LLMs) represent the current peak in text generation for a given context, this basic ability to work out the next word has been available with predictive text on your phone for decades.
To understand where the variations come from, let’s consider how a simplistic model learns from examples.
Consider a model ingesting its first-ever sentence:

To be or not to be.
It understands the sentence as a string of ordered words, with the full stop indicating the end. If this is the only sentence it knows, it won’t be doing any decent predicting. And if you do happen to type “To be … ” then it will only suggest Hamlet’s famous line.
So we will add one more line to the model:

To be young again.
Combining the two, we get the possibility of producing either line after the first “To be.” We recognize the full stop as the end of the phrase, so that can be shared by either option, just like the first two words.

The options that might be produced from a model based on the previous two inputs.
So the orange line represents a variation. Our model now understands two lines.
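To make this concrete, here is a minimal sketch in Python of the kind of prefix-based next-word lookup described above. The two-sentence corpus, the fixed two-word prefix and the variable names are purely illustrative; real models use far richer context than this.

```python
from collections import defaultdict

# A toy next-word model: for every two-word prefix seen in the corpus,
# remember which token followed it.
corpus = [
    "To be or not to be .",
    "To be young again .",
]

follows = defaultdict(list)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - 2):
        prefix = (tokens[i], tokens[i + 1])
        follows[prefix].append(tokens[i + 2])

# After ingesting both lines, the prefix "To be" has two possible continuations.
print(follows[("To", "be")])  # ['or', 'young']
```

With only Hamlet’s line ingested, the lookup for “To be” would return a single option; adding the second sentence is what creates the orange branch in the diagram.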
We must note that each word was treated as a token, a unit to be consumed, including the full stop. But words are not really discrete entities; we know that “doing” and “done” are different forms of the same verb, or that “ships” is the plural of “ship.” We also know that “disengage” is simply “engage” with a prefix attached.
In short, words themselves seem to be made of tokens. Within English-language models there are roughly 1.3 tokens per word, and the ratio differs from language to language. The other reason to have a feel for tokens is that this is how GPT models charge you, so price per token is something you need to keep an eye on.
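If you want a feel for the word-to-token ratio yourself, OpenAI’s tiktoken library will show you how a given model splits a piece of text. The sample sentence below is just an illustration; the exact count depends on the text and the model’s encoding.

```python
import tiktoken  # OpenAI's tokenizer library: pip install tiktoken

# Use the same tokenizer that gpt-3.5-turbo uses.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "To be or not to be, that is the question."
tokens = enc.encode(text)

# English text usually yields somewhat more tokens than words,
# which is where the rough 1.3 tokens-per-word figure comes from.
print(len(text.split()), "words ->", len(tokens), "tokens")
```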
What Are the Odds?
Training is the process where tokens and context are learned, until there are multiple options with varying probability of occurring. If we assume our simple model from above has taken in hundreds of examples from text, it will know that “To be frank” and “To be continued” are far more likely to occur than Shakespeare’s 400-year-old soliloquy.
If we were to draw a kind of bell curve over the candidate words after “To be …” we would naturally expect some to be very likely and others much less so. In the diagram below, each block represents a large number of examples, so possible words that don’t appear as options simply have too few example references.
Let us consider a possible top five:

A block of possible options based on the input “To be … “
If we add up all the blocks, we can express simply enough the chance of any word being selected. So “continued” would have six chances in 14, or about 42%, of appearing next, whereas “or” would have only about one in 14, or 7%. Even at this stage it is clear that some words are much less likely to appear.
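Here is that arithmetic as a small sketch. Only the counts for “continued” (six blocks) and “or” (one block) are stated in the text; the other three counts are assumed purely so that the blocks add up to 14.

```python
# Hypothetical block counts for the five options in the diagram.
# "continued" = 6 and "or" = 1 come from the text; the rest are assumed.
counts = {"continued": 6, "frank": 4, "honest": 2, "fair": 1, "or": 1}

total = sum(counts.values())  # 14 blocks in all
for word, count in counts.items():
    print(f"{word}: {count}/{total} = {count / total:.1%}")
# continued: 6/14 = 42.9%, or: 1/14 = 7.1% -- matching the rough figures above
```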
What if we flattened the curve? This would clearly still express the likely responses as higher probability, but it allows the less common options a better chance to be selected:

A flatter curve shows the possible options to follow the input “To be … “
This changes the likelihood of “continued” to 36% and moves “or” up to 9%. The odds have shortened for a wider variety of words.
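One simple way to produce this kind of flattening in code is to scale the scores behind the distribution by a divisor before normalizing them, which is the knob the next paragraph names. The sketch below reuses the assumed counts from earlier; its exact percentages differ from the diagram’s 36% and 9%, and it is an illustration of the idea rather than OpenAI’s precise implementation.

```python
import math

# Same hypothetical counts as before.
counts = {"continued": 6, "frank": 4, "honest": 2, "fair": 1, "or": 1}

def next_word_probs(counts, temperature):
    # Divide the log-scores by the temperature, then renormalize.
    # temperature = 1.0 leaves the distribution as-is; larger values flatten it.
    scores = {w: math.log(c) / temperature for w, c in counts.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

for t in (1.0, 2.0):
    probs = next_word_probs(counts, t)
    print(f"T={t}: continued={probs['continued']:.0%}, or={probs['or']:.0%}")
# T=1.0 reproduces roughly 43% vs 7%; T=2.0 flattens that toward about 31% vs 13%
```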
This is effectively what raising the temperature does. It flattens the curve, giving the less likely responses a boost. If the temperature is zero, the model will only choose the highest-probability token. Just as a reminder, when you call the OpenAI API directly, you get to set the temperature yourself:
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer xx-xxxxXX" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "What is TheNewStack?"}],
    "temperature": 0.7
  }'
Because we might be looking for an interesting and original response, a value of temperature nearer 1 makes sense.
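For completeness, here is roughly the same request made through OpenAI’s official Python client rather than curl. The exact client interface depends on which version of the openai package you have installed; this sketch assumes a recent one.

```python
from openai import OpenAI  # pip install openai

# The client reads the OPENAI_API_KEY environment variable by default.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is TheNewStack?"}],
    temperature=0.7,  # nearer 1 gives more varied, "creative" wording
)
print(response.choices[0].message.content)
```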
Now you may well say, “But surely this increases the chances that the model will respond with stuff that isn’t true?” We are then faced with the question of matching the task to the appropriate temperature. This is done by differentiating between “creative” output and “factual” output. If we use too high a temperature with factual material, we are likely to produce the dreaded hallucinations.
Temperature Veils the Source of Chatbot Responses
The great mission of ChatGPT is to fool you into thinking that AI has “thought” of an answer. It hasn’t. It is doing a much more sophisticated version of the above, with millions of ingested tokens, but it is still entirely guided by its pre-constructed LLM. That is why it can look authoritative and yet be absolute nonsense.
However, as we see in everyday use, ChatGPT works very well in most cases. This is because for every question you might have, someone has answered it, directly or inadvertently, somewhere on the internet. ChatGPT’s real task is to understand the context of the question and reflect that in the response.
When I read a weather report in my local newspaper, I am not “ripping them off” if I later use that information to answer a friend who wonders if it will be sunny tomorrow. Newspapers are (or were) intended as valid sources of information. But clearly, if I take large parts of text from an expert’s report and reclaim it as my own, this could be fraud.
There will be increasing legal pressure for models not to blurt out responses that make it absolutely obvious where the source material was taken from. And this is why hallucinations are likely to remain, as temperature is used to vary responses and veil their source. Oddly, the same principle was once used to defeat spam detection: by adding mistakes to spam email, senders made it hard to blacklist. Gmail overcame this through its sheer size and its ability to understand patterns in distribution.
Overall we recognize LLMs as socially positive. Eventually the law will formalize around the do’s and don’ts of the training process. But between now and then, there will be plenty of opportunities for the temperature to rise over LLMs misappropriating other creators’ content.
In generative AI, "temperature" refers to raised entropy. Here's what that means and why raising the temperature might result in more hallucinations.