And so it’s a good thing that we don’t need to solve it generally. You start with a human identifying the make and model of the car, and determining the approximate position and orientation of the car. Presumably, you already have a 3D model of that make and model of car. Now you have six parameters for the car (three for position, three for orientation), and all you need to do is find an optimal match in those six parameters, starting from a point that’s very close to the optimum. That’s not a difficult problem. And you only need that human input once: once you have one frame, you use the parameters from that frame as the initial approximation for the next frame. Once you have two frames, you use the parameters and their numerical derivative as the initial approximation for the third frame.
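To make the frame-to-frame seeding concrete, here's a minimal sketch in Python. The function name `initial_guess` and the parameter order (x, y, z, roll, pitch, yaw) are my own assumptions for illustration. It also assumes evenly spaced frames, and the matching step itself (fitting the 3D model to the image) is not shown.

```python
from typing import List, Optional, Sequence


def initial_guess(prev: Sequence[float],
                  prev2: Optional[Sequence[float]] = None) -> List[float]:
    """Starting point for the optimizer on the next frame.

    prev  -- the six fitted pose parameters from the most recent frame
             (assumed order: x, y, z, roll, pitch, yaw)
    prev2 -- the fitted parameters from the frame before that, if any
    """
    if prev2 is None:
        # Only one solved frame so far: reuse its parameters unchanged.
        return list(prev)
    # Two solved frames: add the finite-difference "velocity" (prev - prev2),
    # i.e. assume the car keeps moving the way it just did.
    # Note: angles near a wrap-around (e.g. 359 -> 1 degrees) would need
    # unwrapping first; that is left out of this sketch.
    return [p + (p - q) for p, q in zip(prev, prev2)]


# Hypothetical numbers: the car moved 1 unit in x and turned 5 degrees in yaw.
frame1 = [1.0, 0.0, 0.0, 0.0, 0.0, 5.0]
frame2 = [2.0, 0.0, 0.0, 0.0, 0.0, 10.0]
print(initial_guess(frame2, frame1))
```

The guess doesn't have to be right, only close enough that a local optimizer converges to the correct match.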
Now, if we didn’t know that the image was of a car, and didn’t know what a car was shaped like, and had no idea of its position or orientation and had to reconstruct all of that from what we were seeing, that would be the general-case problem that you’re talking about, which is extremely difficult. But we do know most of that already.
I’m brought these types of videos on a regular basis. In real life, you rarely have thousands of images (frames). Most of the time, you have a few dozen that are useful.
A common situation would be a vehicle passing a truck. The license plate of the passing vehicle will usually not be visible until the vehicle reaches a point beyond the left fender of the truck. It then rapidly decreases in size (horizontal pixels) until it is basically useless for identification. And many truck video recording systems record at 15 or even 5 FPS.
Sometimes I get a video of a truck following a vehicle for a significant period. That’s a bit different.
And, frankly, most of the time we don’t care about the license plate(s) because the vehicle is eventually involved in the accident and is a big pile of metal.
But “frame averaging” has been used successfully for many, many years, and both transform technology and AI have improved it greatly.
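For anyone unfamiliar with frame averaging, here's a rough sketch in plain Python. It assumes the frames are already aligned on the plate (which is the hard part) and that the noise is independent from frame to frame. The names `average_frames` and `rms_error` and the toy data are hypothetical, not any particular tool's method.

```python
import random


def average_frames(frames):
    """Pixel-wise mean of equally sized, already aligned grayscale frames.

    Each frame is a list of rows; each row is a list of pixel values.
    """
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def rms_error(image, truth):
    """Root-mean-square difference between two same-sized images."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(image, truth)
             for a, b in zip(row_a, row_b)]
    return (sum(diffs) / len(diffs)) ** 0.5


# Toy demo: a made-up 8x8 "true" image plus Gaussian noise in each frame.
rng = random.Random(0)
truth = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
frames = [[[p + rng.gauss(0, 10) for p in row] for row in truth]
          for _ in range(16)]

single_err = rms_error(frames[0], truth)
avg_err = rms_error(average_frames(frames), truth)
print(single_err, avg_err)  # the averaged error should be clearly smaller
```

With independent noise, averaging N frames cuts the noise by roughly the square root of N, so even a few dozen usable frames can make a visible difference.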
Oops! To clarify, in my previous post, I sort of wandered into digital video recording from vehicle cameras. I should have made this clear, as the OP was discussing analog video sources, such as VHS tapes. My bad.