My social media feeds are full of videos that people generated with text-to-video models such as Google’s Veo3. The visuals are stunningly good now, but in many cases the captions of what the characters are saying are wrong - they’re usually close enough to the correct text to be intelligible, but lots of letters get garbled. Considering what large language models can do now with natural language, I find it hard to believe that these models couldn’t accurately write down the text of what they make their AI-generated characters say, so I’m guessing that the authors of such videos intentionally mess up the captions as a joke or nod to the entire discussion about AI accuracy. Is that what’s going on, or are these captions actually the result of a technical glitch?
I think—but want to highlight that I don’t know for sure—that in such cases, the captions were generated simply as part of the image, not as text in its own right. So there’s no word-predicting engine like an LLM behind them; the model is just predicting pixels out of noise, as diffusion models do.
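Very roughly, and leaving out all the actual math, diffusion sampling looks something like the loop below. The update rule here is deliberately simplified and the denoiser is a stand-in, so treat it as a sketch of the idea rather than how Veo3 actually works; the point is only that the model handles pixels, never words.

```python
import torch

def sample_image(denoiser, steps=50, shape=(3, 64, 64)):
    """Loose sketch of diffusion sampling: start from pure noise and
    repeatedly subtract the model's guess of the noise.  (The real update
    rule has careful scaling terms; this just shows the shape of the loop.)"""
    x = torch.randn(shape)              # pure noise, no text anywhere
    for t in reversed(range(steps)):
        noise_guess = denoiser(x, t)    # the network predicts the noise in x
        x = x - noise_guess / steps     # peel a little of it away
    return x                            # final pixels; any "caption" is just more pixels

# Stand-in denoiser so the sketch runs; a real one is a trained neural network.
dummy_denoiser = lambda x, t: 0.1 * torch.randn_like(x)
img = sample_image(dummy_denoiser)
```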
You could, presumably, create a video and then give it a pass with a speech-to-text AI to create the captions, and future iterations of the technology may well build that in. But AFAIK, right now they’re often just generated as part of the image, hence aiming for optical rather than syntactical coherence.
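For what it’s worth, that second pass is easy to do yourself today with an off-the-shelf speech-to-text model. A minimal sketch using the open-source Whisper package (the filename is made up, and ffmpeg needs to be installed for it to read the video):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
# Whisper extracts the audio track from the video via ffmpeg and transcribes it.
result = model.transcribe("generated_clip.mp4")

# Each segment comes back with timestamps, which is exactly what you'd need
# to lay accurate captions over the video afterwards.
for seg in result["segments"]:
    print(f"{seg['start']:6.1f}s - {seg['end']:6.1f}s  {seg['text'].strip()}")
```

Because the captions would then come from the actual audio rather than from pixel prediction, they’d be limited by speech-recognition accuracy instead of image coherence.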
This is probably right.
Here’s an example of that problem.
(There are lots of talking-Bigfoot videos out there already, with both the audio and video AI-generated by Veo3.)
Thanks for the answers. TBH, I must say I’m surprised to hear that this is how it works. Surely the text of what the characters say must be present in the model in text form to generate the correct audio and video for the sound and mouth movements; so I would have guessed it would be both less error-prone and computationally more efficient to have a text-processing model deal with that and subtitle the video the old-fashioned way, rather than have the text-to-video model render the captions as pixels.
In the video I linked, the captioning only shows up occasionally, and most of those Bigfoot videos don’t have captioning at all, so I suspect that, in that case at least, any captioning appearing at all is an unwanted bug.
I don’t think that’s necessarily true: if you think of image and audio as a combined piece of information (or token), then predicting the next such token from prior ones doesn’t need any intermediate text generation; it just needs to have learned the typical combination rules of audiovisual data from training examples, the same way text-generating models learn the typical combination rules of word tokens from their training data. Any intermediary text generation would just be redundant. It’s just that this process isn’t perfect, and the imperfection is more readily noticeable in text that’s part of the image than in the imagery itself, since small visual inconsistencies are easily glossed over while textual misfits stand out.
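To make that concrete, here’s a toy sketch of what “predict the next audiovisual token” could look like. Everything in it (the vocabulary size, the dimensions, the tiny transformer) is made up for illustration; real video models are vastly bigger and structured differently, but note that no text vocabulary appears anywhere:

```python
import torch
import torch.nn as nn

VOCAB = 8192   # made-up size of a joint codebook of quantised video+audio tokens
DIM = 256

class TinyAVPredictor(nn.Module):
    """Predicts the next audiovisual token from the previous ones.
    There is no text vocabulary anywhere in the model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                        # tokens: (batch, seq)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=mask)   # causal: only look back
        return self.head(h)                           # logits over the next AV token

model = TinyAVPredictor()
clip_so_far = torch.randint(0, VOCAB, (1, 16))        # 16 prior audiovisual tokens
next_token_logits = model(clip_so_far)[:, -1]         # distribution for token 17
```

Whatever caption pixels end up in the frame are just whichever tokens look most plausible given the rest, which is why they come out almost-but-not-quite right.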
I see this or something like it in non-AI videos all the time, and it drives me nuts. I can’t remember the last time I saw a YouTube short of any type where the embedded captioning was 100% accurate. Simple words are misspelled, homophones or near-homophones are substituted, words are missing, etc. Either lazy creators never check the work their captioning software does (not talking about YouTube’s captioning), or they do it as some sort of engagement ploy, hoping pedants will leave a comment on a video they would otherwise have left alone.
Saw this one tonight with nearly perfect captioning.
ETA I see that it has captioning in the FB app but not in the web browser. So those aren’t the same hardsubs as in the OP.
The text of the script will be part of the prompt, but presumably the model generating the video is not optimised for text - a bit like how that Willy Wonka scam from a while back had text that sort of looked about right, but wasn’t.
If the model has been trained on scraped content, some of which had captions baked into the video, then some of the output will contain things that look like captions. I’m sort of surprised they don’t look more broken than they do.
Yeah, that’s probably a different thing - a lot of video editing software can now do animated subtitling based on speech recognition, but it makes mistakes. Anyone who cares about quality will try to review and correct those mistakes; anyone who only cares about pumping out a volume of content won’t.
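For anyone curious, the pipeline behind those auto-subtitle buttons is roughly: recognise the speech with timestamps, write out an .srt file, and burn it into the picture. A rough sketch with made-up filenames and hard-coded segments standing in for the recognition step:

```python
import subprocess

# Placeholder segments; in practice these come from a speech recogniser.
segments = [(0.0, 2.4, "Hey, it's me again."),
            (2.4, 5.1, "Back in the woods with another update.")]

def srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamps SRT files use."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("subs.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text) in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n\n")

# Hard-burn the subtitles into the frames with ffmpeg's subtitles filter.
subprocess.run(["ffmpeg", "-y", "-i", "clip.mp4",
                "-vf", "subtitles=subs.srt", "clip_subbed.mp4"], check=True)
```

If the creator never proofreads the segments before burning them in, every recognition mistake ends up permanently in the picture, which is exactly the sloppiness described above.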
This isn’t directly on point but may relate to some stuff YouTube itself does.
I watch a lot of human-made professional music videos. The ones for major mainstream acts you’ll hear on the radio / XM / streaming, etc. No AI. Many have closed captions. I often have CCs turned on to help me accurately understand the lyrics. Sometimes some vids have CCs available in multiple languages, but that’s rare.
I recently encountered a vid of a South African woman singing in her (to my ears) strongly accented English, which had only Spanish CCs available. And there’s an option in YouTube to machine-translate those CCs to English in real time.
Doing that, there are many places where her English singing is clearly understandable and does not match the English CCs being generated on the fly. By watching the Spanish CCs you can readily see that you’re looking at a round-trip error: whoever did the Spanish CCs (a human, I expect) made reasonable word choices to render English idioms in Spanish, and the machine translation back the other way ended up with either more literal English or a different English idiom in the same spot.
None of that was gibberish or Chinglish-like; it was all plausible but wrong.
For your listening & viewing pleasure: