Stepping back a bit, I do think it is instructive to understand just what LLMs do. Whilst there is a significant amount of mystery about some pretty basic functions, like how facts are stored, we have a clear enough idea of how they work to draw some firm lines about what is possible and what should not be attributed to them.
That they operate by predicting the next word is true, but that also misses a huge amount about what goes into such a prediction. The input text passes through a huge number of steps each time the LLM iterates to produce the next word. Some of these steps attach what amounts to significant annotation of meaning (the attention layers); others promote associations that led to successful predictions during training (the multi-layer perceptrons). The latter kind of step probably provides the encoding of knowledge, as far as that actually goes.
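To make that slightly more concrete, here is a toy sketch in Python/NumPy of those two kinds of step, one attention step and one MLP step, applied to a handful of stand-in token vectors. Everything in it is invented for illustration (the sizes, the random weights); it shows the shape of the computation, not any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5                     # toy sizes, purely illustrative

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Frozen random "weights" standing in for what training would have produced.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, 4 * d_model))
W2 = rng.normal(size=(4 * d_model, d_model))

def attention(x):
    # Each position scores the positions before it and pulls in a weighted
    # blend of their values: the "annotation of meaning" step.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    causal = np.tril(np.ones((len(x), len(x))))      # no looking at later words
    scores = np.where(causal == 1, scores, -np.inf)
    return softmax(scores) @ v

def mlp(x):
    # Applied to each position independently; this is where the associations
    # built during training (knowledge, such as it is) are thought to live.
    return np.maximum(x @ W1, 0) @ W2

x = rng.normal(size=(n_tokens, d_model))     # stand-in token embeddings
x = x + attention(x)                         # residual connections, as in real models
x = x + mlp(x)
print(x.shape)                               # still (n_tokens, d_model)
```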
What we can be sure of in this architecture is that there is little scope for inference. The only state is the prefix text passed in on each iteration. The internals can pass a kind of working state down the pipeline, but only in a very limited manner, and that passage is controlled by the weights fixed during training. The ability of the LLM to pass information widely between the huge number of running pipelines is also limited, largely a consequence of the tradeoffs that make highly parallel operation possible. It is a significant stretch to suggest that inference is happening. That they work as well as they do is quite amazing.
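The outer loop is worth seeing for how little it contains. A minimal sketch, assuming a hypothetical next_token() function that wraps the whole stack of layers above: the only thing carried from one iteration to the next is the growing prefix itself.

```python
def generate(prompt_tokens, next_token, max_new=50, stop=None):
    # next_token() is a hypothetical stand-in for the entire network.
    # There is no scratchpad and no memory between calls: everything the
    # model "knows" at each step is the frozen weights plus this list.
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        nxt = next_token(tokens)     # one full pass through the layers
        tokens.append(nxt)           # the prediction simply extends the prefix
        if nxt == stop:
            break
    return tokens
```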
One of the telling examples is the one where an LLM, when asked to add two two-digit numbers, provides the wrong answer, but when asked to explain how it arrived at the answer, provides a perfect description of how to perform the task. The two questions activate totally different paths, and it is clear there is no internal connection between the description of how adding two numbers is performed and actually providing an answer to a specific question. One has been built from textual phrases that acquired meaning by association with descriptions of how to do arithmetic, of which there were probably tens of thousands of instances fed into training; the other is just blindly associating numerical symbols that happened to be close to one another in text mentioning addition.
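A sketch of the point, with a hypothetical complete() function standing in for the model (the specific numbers are just examples): both requests go through the same text-completion machinery, and nothing in the architecture forces the two outputs to agree with each other, whereas an ordinary program would simply compute the sum.

```python
def adding_demo(complete):
    # complete() is a hypothetical stand-in for "run the LLM on this prompt".
    answer      = complete("What is 47 + 38?")
    explanation = complete("Explain, step by step, how to add 47 and 38.")
    # Two independent completions; there is no shared addition routine behind
    # them, so a correct explanation does not imply a correct answer.
    actual_sum = 47 + 38             # 85, computed rather than predicted
    return answer, explanation, actual_sum
```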
The above examples of improved quality and apparent insight are not difficult to understand. Somewhere out in the huge body of text ingested during training are commentaries that address the initial question. The training data used for ChatGPT probably included many dozens, if not hundreds, of texts discussing Shakespeare; many more than you might turn up on the Internet, for that matter. It isn’t as if the training of LLMs has been shy about mining copyrighted works.
The most likely output phrases will be composed from words chosen after a very significant amount of semantic information has been attracted to guide the result. Attention steps will attach “Shakespeare” and many other vectors encoding useful locations in the space. There is no actual understanding of the meanings, but the output will be derived from a large number of steps that ran with a lot of highly pertinent attached state information.

If you disagree with the answer it provides and ask again, the new prefix now includes words about disagreement. So the next round of predictions includes words that associate the encoding of “incorrect” with all the other phrases used for prediction, and the LLM will start to fire on information that also includes the notion of incorrect answers associated with the base question. A new commentary starts to be built, one that now includes nuances about what the possible different answers are and why the easy answer is not the full story. The LLM cannot generate the commentary out of thin air, but if it has been trained on texts that included these ideas, it can be guided to prefer output that includes them. The LLM doesn’t have any notion of an improved answer.
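As a sketch of why asking again shifts the answer, here is roughly what the loop around a chat looks like, again with a hypothetical complete() standing in for the model. Each turn, the whole transcript so far, including the words expressing disagreement, is flattened back into the prefix; the model has no other record of the conversation.

```python
def chat_turn(history, user_message, complete):
    # complete() is a hypothetical stand-in for the model; history is just
    # a list of (role, text) pairs, nothing more.
    history = history + [("user", user_message)]
    prefix = "\n".join(f"{role}: {text}" for role, text in history)
    reply = complete(prefix)         # the disagreement is now part of the prefix
    return history + [("assistant", reply)]

# history = chat_turn([], "What is Hamlet really about?", complete)
# history = chat_turn(history, "I don't think that's the whole story.", complete)
```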
The example of poor results for specific technical information is the counterpoint. Training probably only included a very small amount of text on the subject. It is unlikely to ever be possible to train an LLM adequately on very limited information like this. There will be too few examples of the information in useful context for training to be able to create solid associations. So it will do its best to assemble phrases, but that is all you can hope for.
This doesn’t diminish the impressive capability that these systems have. But IMHO their ability and utility sit firmly between dismissal of them as just token predictors and the breathless hype. Understanding how they operate is going to be very helpful in getting the best out of them. It is probably not dissimilar to Google-fu, and for many of the same reasons.
For technical questions LLMs might be more useful if they could directly offer up the technical source, which is really asking for citations. Perhaps one problem all the existing models have is that to do so would crack open the lid on very murky copyright and general IP questions. All well and good to cite Wikipedia or public domain papers. Not a good look when it becomes clear that the entire corpus of machine-readable text available, no matter its provenance or ownership, has been vacuumed up.
As for tools to generate large quantities of text, I remain very sceptical. The entire operating premise is one where you cannot trust what they generate, and human oversight is needed. If you have a situation where huge amounts of boilerplate text, or code, are needed, that just speaks to the poor quality of the tools or coding language being used. Which is sadly a very real problem.