Can LLMs (large language models) be used to decipher ancient languages?

I don’t see how that makes any difference - your sense organs still telegraph the sensation to your brain. When you smell vanilla, the vanilla isn’t going into your brain - you experience it indirectly.

At the rate of human perception, a film appears (mostly) continuous because the small difference from frame to frame (within a scene, at least) gives the impression of seamless movement, while a slideshow definitely doesn’t. In reality the mind integrates incoming sensory data at different rates and with different levels of attention, even though our experience seems continuous. From the standpoint of a hypothetically ‘conscious’ chatbot, it would experience the world (in terms of prompts or other inputs) in discrete jumps, even if it isn’t aware of the passage of time in between.

No analogy is perfect, of course, because human attention is more than just a film of discrete but rapid snapshots, and we should not try to reason just from analogy, but I think it is pretty evident that a chatbot does not have processes analogous to either the cortical column processing of somatosensory data or the complex process of integrating different types of sensory data into an integrated experience of the world.

You seem to be making the philosophical argument that the molecules that are sensed as taste and smell are “of the thing” but photons emitted or reflected by it, or acoustic pressure waves in the atmosphere, are not. Even if that is taken to be true of what is or is not conceptually ‘part of’ the object in question, the brain receives it all as various kinds of signals from different sense organs and doesn’t ‘directly’ experience any part of the world (except technically for the eyes, but even then there is a lot of ‘pre-processing’ of the sensory information before it even gets to the visual cortices and gets integrated in different ways). If you could separate the brain from the body a la the ‘brain in a jar’ thought experiment, and somehow manage to feed it the same fidelity of nervous system signals, it would not otherwise be aware that it is not embodied or directly experiencing the world. (In reality, creating the simulacrum of an embodied experience for a ‘brain in a jar’ is far more problematic because of how much of the ‘processing’ of body functions is done in the parasympathetic and enteric nervous systems, but that is a specific physiological issue, not a conceptual one.) All external senses of the peripheral nervous system are just providing data to the brain.

Stranger

You are probably right to. I really have no idea of the extent to which the FOAF actually learned the language.

I think there is some evidence that passive listening to a language, say exposing a kid to TV incessantly, does not result in language acquisition.

I would not be so sure: I learned English (after getting some basic notions in a language school) mostly from reading fantasy and SF books in English (I used to joke that I knew how to manage an interstellar empire in English but didn’t know how to order pizza).
That’s mostly passive.
ETA: though it’s fair to say I refined it here in the SD and later at work.

Those basic notions in language school, though, provided a framework to hang the rest on. And there are also a lot of cognates between English and Spanish, which would also help. Certainly books can get a person from a poor grasp of a language to a good grasp, but can they get a person from no grasp at all to a poor grasp? I think, in principle, yes, but that it would take more books than any human has time to read in a lifetime.

TV would probably be easier, since there, you also have the visuals. If I were to watch a Hungarian TV show, the language itself would be completely opaque to me, but I would often be able to make pretty good guesses as to what was being said based on what I saw on screen.

Technically yes, but practically no.

In AI/ML you can create an embedding model with just raw text – without knowing what the text means or including any context. The ELI5 explanation of an embeddings model is that it learns the meaning of words, including disambiguating words that can mean multiple things (e.g. a river bank or a financial bank). A more accurate explanation is that an embeddings model maps words into N-dimensional space such that similar words are clustered more closely. It doesn’t actually know the meaning of any words – just the relative similarity of words. This allows you to perform math operations on the embedding vectors like ‘king - man + woman = queen’.
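To make that vector arithmetic concrete, here’s a rough sketch using gensim with pre-trained GloVe vectors; the specific model name is just one commonly available choice, not anything from the post above:

```python
# Rough sketch of the 'king - man + woman ≈ queen' arithmetic using gensim and
# pre-trained GloVe vectors; the model name is one commonly available option.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")   # downloads the vectors on first use
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically shows up at or near the top of this list.
```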

If you train separate embeddings models on two languages (e.g. English and Japanese), the embedding vectors will not match. ‘Cat’ in English won’t have the same vector as ‘猫 Neko’ in Japanese. However, the relative relationships (king/queen, man/woman) might still exist. With an embeddings model of an unknown language, you could try to mine it for these relationships to find keystone words.
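As a hypothetical illustration of what ‘aligning’ two separately trained spaces looks like, one classical trick is an orthogonal Procrustes rotation computed from a small seed dictionary of trusted word pairs. The data below is random placeholder; with a truly unknown language, that seed dictionary is exactly what you don’t have:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

# Hypothetical sketch with placeholder data: X holds English vectors and Y holds
# Japanese vectors for a small seed dictionary of word pairs you already trust
# (the 'keystone words'). Procrustes finds the rotation that best maps one space
# onto the other.
rng = np.random.default_rng(0)
X = rng.random((20, 50))   # 20 English seed words, 50-dim embeddings
Y = rng.random((20, 50))   # their 20 Japanese counterparts, in the same order

R, _ = orthogonal_procrustes(X, Y)   # rotation minimizing ||X @ R - Y||
mapped = X @ R                       # English vectors expressed in the Japanese space
```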

You can train a single embeddings model on two or more languages such that the embeddings are aligned, but as others have said this requires a parallel corpus (aka Rosetta Stone). There are embeddings models trained on lots of languages using a parallel corpus with the intention of modeling patterns in human language. You could extend one of these models with the unknown language in the hopes that its embeddings would align with all of the other languages’ embeddings.
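As a sketch of what probing an existing multilingual embeddings model looks like, this uses sentence-transformers with one publicly available multilingual model; whether an ancient, unknown language could be bolted on and still align is, of course, the open question:

```python
# Sketch of probing an existing multilingual embeddings model; the model name is
# one publicly available option, and extending it to an unknown ancient language
# is the speculative part.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode(["The cat sleeps.", "Le chat dort.", "猫が眠っている。"])
print(util.cos_sim(emb, emb))   # sentences with the same meaning land close together
```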

However, for all of these techniques there are a lot of caveats that make them impractical for an ancient, unknown language:

  • You need a large corpus of the ancient language. Not just single words or sentences, but blocks of text.
  • The ancient language would have to be structurally similar to the languages you’ve trained on. You could include other similar, but known ancient languages in your model. However, I suspect those corpora are too small as well.
  • I think the topics of the ancient text would need to be varied – for example a bunch of prayers or laws might not be enough coverage.
  • I assume there would be some variation in how the language is used and how words are used that would confound the model, but I’m not sure.
  • I also think a logographic ancient language would add another layer of confusion.

Notes:

  1. A vanilla LLM consists of three AI/ML models working together: tokenization, embeddings, and attention (transformer); see the toy sketch after these notes. The embeddings model learns the semantic similarity of words and the attention model learns the disambiguation. This would make it hard to mine for relationships and would require even more data.
  2. With enough data the LLM could generate text in the ancient language, but you still wouldn’t know what it means. You wouldn’t be able to make a chat bot in the ancient language because that requires fine-tuning the LLM with labelled data.
  3. I used the term ‘words’ instead of ‘tokens’ for simplicity’s sake.
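For note 1, here is a toy numpy sketch of the three pieces in sequence (tokenization, an embedding lookup, one self-attention step). It is not any real model, just the shape of the pipeline:

```python
import numpy as np

# Toy sketch of note 1's three pieces: tokenization, an embedding lookup, and one
# self-attention step. Not a real model, just the shape of the pipeline.
vocab = {"the": 0, "cat": 1, "sat": 2}
tokens = [vocab[w] for w in "the cat sat".split()]      # "tokenization"

rng = np.random.default_rng(0)
d = 8
embedding_table = rng.normal(size=(len(vocab), d))
x = embedding_table[tokens]                             # (3, d) token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)                           # how much each token attends to the others
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
contextual = weights @ v                                # context-mixed representations
print(weights.round(2))
```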

This is an interesting point that I never considered.

I wonder if it is possible to improve the Latin-to-English translation by translating:

English → Latin → English’

then minimizing errors between English and English’?

You could also augment with additional languages:

English → French → Latin → French’ → English’
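As a sketch of the round-trip idea, assuming some translation function `translate(text, src, tgt)` as a hypothetical stand-in for whatever MT system you’re tuning, you could score how much of the original English survives the loop and then tune to maximize that score:

```python
from difflib import SequenceMatcher

def round_trip_score(text: str, translate, pivots: list[str]) -> float:
    """Translate English through a chain of pivot languages and back to English,
    then score how much of the original survives (1.0 = identical).
    `translate(text, src, tgt)` is a hypothetical stand-in for whatever
    translation system is being tuned."""
    current, src = text, "en"
    for tgt in pivots + ["en"]:
        current = translate(current, src, tgt)
        src = tgt
    return SequenceMatcher(None, text, current).ratio()

# The second chain above, English -> French -> Latin -> French' -> English':
# round_trip_score("The senate convened at dawn.", translate, ["fr", "la", "fr"])
```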

Reading is not passive. When you read, you are trying to understand the language. By passive, I mean just hearing the language in the background. Note that an infant hearing people talking around them is not passive either; they are desperate to join the conversation. Even my son, who didn’t start talking until past 2, clearly understood a lot before that and started talking in complete sentences once he started. Maybe watching TV and actively trying to use the situation to learn the language would work, but mere background noise does not.