Sharpening Blurry 1970s-Era VHS or Betamax Video Images

I just viewed on YouTube a five-minute video recorded in the mid-1970s. By today’s hi-def standards, the video quality is terrible. I believe this black-and-white video was originally shot on VHS or Betamax tape, then converted this year to digital and uploaded to YouTube. I know B&W slide film/transparencies hold up much better than color media, so I had high expectations before I saw it, but the quality … awful.

I’ve read today’s digital restoration technologies can dramatically sharpen low-res video images. I’m sure I would need to give the restoration company the original videotape for best results.

Question: How much sharper can restoration professionals make old blurry videotape images? I can see them achieving a slight improvement, but if the “information” isn’t there, I can’t see dramatic results. Would the restorer have to sharpen each video frame, one by one, or is there an automated system? In the 1987 movie “No Way Out,” we see super-sophisticated DoD technologies clarify an extremely blurry photo, making it razor-sharp. I doubt that technology exists today. I know Oliver Stone clarified the Zapruder film of JFK’s assassination, but the improvement wasn’t dramatically huge. If Hollywood can’t achieve miraculous results with their deep pockets, I don’t see a mom-and-pop restoration service in Oxnard, CA doing better with a crummy VHS tape for maybe $300.

It’s easy to sharpen an image as much as you want. The problem is that the sharpness you get probably won’t be the same sharpness that was present in the original image. There will be details that were lost entirely in the blurring, such that the information just isn’t there at all any more. The simplest methods will just leave no detail there at all. More complicated, AI-based methods will basically just guess at what the detail should be, with varying degrees of accuracy. Of course, you can also have humans doing the guessing, instead of AI, but then you’re looking at extreme amounts of work times however many frames you have.
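
To make that concrete, here’s a minimal sketch of one of those “simplest methods” (an unsharp mask, using OpenCV/NumPy; the file name is a placeholder). It exaggerates edges that are already in the frame; it cannot bring back detail that was never recorded:

```python
import cv2
import numpy as np

# Load one captured frame (hypothetical file name) as grayscale.
frame = cv2.imread("vhs_frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Unsharp mask: subtract a blurred copy to isolate edges, then add them back in.
# This boosts contrast at edges that survived the blur; it invents nothing new.
blurred = cv2.GaussianBlur(frame, (0, 0), sigmaX=3)
amount = 1.5
sharpened = np.clip(frame + amount * (frame - blurred), 0, 255).astype(np.uint8)

cv2.imwrite("vhs_frame_sharpened.png", sharpened)
```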

Not only is the image low-resolution to start with, but over time the videotape will decay. If you’re looking at a >40-year-old piece of Mylar tape, you can expect degradation of the media itself, even if it had the most perfect recording ever originally.

Here’s an analog-to-digital transfer place that explains it. If you look at their site, they don’t even offer conversion to DVD, because DVDs will eventually suffer the same problems as a videocassette.

DVDs suffer problems eventually, but they don’t suffer the same problems as a videocassette. An aged DVD can be copied exactly to another brand-new DVD, or to any other digital medium. Any physical medium (and there’s always some physical medium) has a finite lifetime, but the digital data on it need not.

Analyzing videos, including analog video sources, is something that people pay me to do.

The simple answer is that you can process/examine a video sequence and get quite a bit of detail, including answering many questions about the original scene…but it depends on exactly what your objective is. For example, it’s fairly easy (and has been for many years) to pull a license plate from a video of a parked car, even if the license plate is not very clear in still images. A moving car is much, much more difficult.

Keep in mind that, in older NTSC analog media, the horizontal lines are discrete, but the image data within the horizontal (scan) lines is analog and continuous. About the best one can hope to achieve is around 704 x 480 pixels (with a pixel aspect ratio (PAR) adjustment appropriate to the NTSC picture aspect ratio). Theoretically, one could get more than 704 pixels per line, but it’s not really practical.
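
If it helps to see where that 704 x 480 figure comes from, here’s the back-of-the-envelope version using the usual Rec. 601 digitization numbers (my approximations, not anything exact):

```python
# Rough arithmetic behind the ~704x480 ceiling for digitized NTSC (Rec. 601 assumptions).
luma_sample_rate_hz = 13.5e6    # Rec. 601 luma sampling rate
active_line_seconds = 52.7e-6   # approximate active (visible) portion of one scan line
active_lines_per_frame = 480    # approximate active lines per NTSC frame (2 fields x ~240)

samples_per_line = luma_sample_rate_hz * active_line_seconds
print(f"~{samples_per_line:.0f} samples per active line, {active_lines_per_frame} lines")
# -> ~711 samples per line; 704 and 720 are the common digital framings of that area.
```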

Of course, this assumes a pristine analog source. We rarely get one.

OTOH, there are lots of software tools that will help reduce interlacing artifacts, remove scan line artifacts, perform temporal smoothing, and so forth. Depending on your objective, these may be all that’s needed to have a satisfactory digital video file.
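
As a very simplified illustration of the temporal-smoothing part (a sketch with OpenCV; the file names are placeholders, and real tools use motion compensation so moving objects don’t ghost):

```python
import cv2
import numpy as np
from collections import deque

# Naive temporal smoothing: average each pixel over a sliding window of frames.
# Suppresses random tape/head noise on static areas; anything moving will smear.
cap = cv2.VideoCapture("digitized_tape.avi")             # hypothetical capture file
fps = cap.get(cv2.CAP_PROP_FPS) or 29.97
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter("smoothed.avi", cv2.VideoWriter_fourcc(*"MJPG"),
                      fps, (width, height))

window = deque(maxlen=5)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    window.append(frame.astype(np.float32))
    out.write(np.mean(window, axis=0).astype(np.uint8))

cap.release()
out.release()
```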

Great information, thanks. I’m astonished how BAD the video looks.

Do you think there will ever be a day, even two centuries from today, when AI can turn low-quality VHS video footage into razor-sharp images?

What AI will be able to do will be to guess at what the content of VHS tapes was, and then to create its own content from scratch that matches that guess.

There are various forms of AI video upscaling now.

Of course, that’s what our own brains are doing constantly. What we “see” is not just a video capture of the world. The raw data from our eyes is nowhere near as good as the world appears to us. Instead, we see a recreation of an approximate model of the world, continually refined by sensory input. It mostly works well but occasionally goes totally wrong, as with optical illusions.

We’re going to have very, very good video restoration in the next several years. And yes, the AI will be “guessing” at much of the content, but those guesses will be guided by the same kinds of things that our own brains are: how lighting works in the real world, the geometry of human faces, and so on. The guesses will be very good unless you go out of your way to fool them somehow.

I work with video material as evidence, so there are some real and logical limitations on exactly what can be done to “improve” the image. (I can’t, for example, use AI to clarify a picture and then testify that the person in the video now looks much more like the defendant than it did before. Courts frown on that. When I DO “enhance” a video, I have to prepare a report explaining and justifying exactly what I did and what impact it had on the image. In many cases, the “enhanced” version is tossed out, usually because it is not the best evidence.)

Going back to an analog (VHS or Beta) tape source, AI and upscalers can do a lot to eliminate scan line artifacts, which are always a problem. They can act as time base correctors to align the individual horizontal scan lines better. They can sharpen the edges of text, titles, etc., and adjust HSV values to produce a much more pleasing image at higher resolution.

The main challenge remains, though…a pixel is a rectangle of uniform HSV value. There is no further detail that can be extracted from that pixel. If a face or a license plate is, say, twelve pixels wide, there’s not much that can be done with it from an evidentiary POV. We can do frame averaging, we can track that portion of the image, we can try to match it to exemplars, but even AI is going to be either guessing or using a source image as a reference (e.g., this blob of pixels ought to be a license plate and so AI will make it look more like one).
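
To give a feel for what plain frame averaging amounts to, here’s a rough Python/OpenCV sketch (not actual forensic tooling; it assumes you already have same-sized grayscale crops of the region of interest from many frames, and it only models a simple x/y shift between them):

```python
import cv2
import numpy as np

def average_aligned(crops):
    """Align same-sized grayscale crops to the first one and average them.
    Phase correlation only estimates an x/y shift; rotation, scale and
    perspective changes would need a fuller motion model."""
    ref = crops[0].astype(np.float32)
    acc = ref.copy()
    for crop in crops[1:]:
        cur = crop.astype(np.float32)
        (dx, dy), _ = cv2.phaseCorrelate(ref, cur)
        shift = np.float32([[1, 0, -dx], [0, 1, -dy]])
        acc += cv2.warpAffine(cur, shift, (cur.shape[1], cur.shape[0]))
    averaged = acc / len(crops)
    # Noise drops roughly with the square root of the number of crops,
    # but no detail appears that wasn't in the originals.
    return np.clip(averaged, 0, 255).astype(np.uint8)
```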

For license plates specifically, I’d imagine that the problem is somewhat easier, since there is a finite and discrete list of things that the actual image can be. Though even there, you might still be limited to statements like “The image on screen is consistent with the license plate being TLD1382, but is not consistent with the plate being GHK3497”.

Very true. We can, of course, measure the values of the pixels in the area of the license plate and discard some combinations of characters. Most of the time, the engineers want to know the license plate of the vehicle that just passed the truck. This is more challenging, since the target plate is at an angle and is shifting from side to side (often at a low frame rate) as both the truck and the other vehicle are in motion. It’s much, much easier with a high res fixed camera optimized for the task, as used for ANPR.

I still convert tapes to digital for friends and family from time to time. Nowadays, I have to do all the service on my VCRs myself. This includes optimizing tracking and alignment for a specific tape. It’s worth it to make people I know happy. It’s not really worth it when it’s just people paying me to do it.

One type of AI training for increasing resolution is to take a large number of high-resolution photos or videos and downsample them to a lower resolution. Then you feed both versions into a neural net and let it study the differences between the two versions. A vast amount of training later, you feed the resulting AI a low-resolution photo or video and the AI will create a “prediction” of what an original higher-resolution version might have looked like. It isn’t useful in forensic settings, but it’s good enough when it doesn’t matter how many polka dots were on Aunt Fran’s dress in the background of the home video.
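
A bare-bones sketch of that data-preparation step (my own illustration; the folder name and downscale factor are placeholders, and the network plus training loop are the real work):

```python
import glob
import cv2

# Build (low-res, high-res) training pairs for a super-resolution model.
scale = 4  # downsample factor (placeholder choice)
pairs = []
for path in glob.glob("sharp_frames/*.png"):   # hypothetical folder of sharp source frames
    hi = cv2.imread(path)
    h, w = hi.shape[:2]
    lo = cv2.resize(hi, (w // scale, h // scale), interpolation=cv2.INTER_AREA)
    # During training, the network gets `lo` as input and is penalized for
    # producing anything that differs from `hi`.
    pairs.append((lo, hi))
```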

AI might be good for polka-dot dresses, but you’d neither need nor want it for something like license plates.

I’d disagree. What we want is to effectively stack a bunch of images of a license plate and use superresolution techniques to get the actual image. Dumb techniques work fantastically for things like space imagery, where one can be confident that the images stack nicely with perhaps only an XY offset. But consider a license plate from a video of a car moving around (say, a dashcam video). The plate might not always be visible from the same angle, or the same size, or under the same lighting conditions, and so on.

What’s needed is a model of the scene so that we can determine where the license plate actually is. Then, the transformation can be reversed and the plate imagery can be recovered.

The plate is attached rigidly to the car, and the car is likely to have far more detail available than the plate. Maybe the plate is only 10 pixels across, which isn’t enough to figure out the position very accurately (we need deep subpixel precision). But the car has more detail, and the position and orientation can be determined more accurately. Further, the car has significant frame-to-frame coherence, since it behaves according to physics instead of bouncing around randomly. A series of frames can determine the position of the car very accurately, even while moving, and thus the position of the license plate can be known.

Likewise, an internal model of the lighting conditions (including, say, the position of the sun) can be used to correct the lighting on the plate, again allowing it to be better aligned with imagery from other frames and for superresolution techniques to be used.

The plate itself is not a perfect plane, and the way the lettering reflects light is non-trivial. Reflections may catch the edges of the plate and give more or less brightness compared to a straight-on view.

All of these things and more would be incorporated into a sophisticated AI model. Even with a plate that’s too small to see, every frame of video available (for which there may be thousands) supplies more information. A system that looked only at the plate would be handicapped compared to one that evaluated the whole scene, since every element can be used as a clue in the predictive model.

We don’t know how to perform this kind of image processing without AI; rules-based techniques are just too limited. It would have to be a trained model. And probably one adapted from a broader dataset, like one used for self-driving (since much of the scene evaluation machinery would be the same).

I’m not sure what you mean by that… The math is straightforward. I could write a program to do it. Others with more relevant experience could do it much more quickly and easily than I, but we’ve had image processing software for decades.

And you’d need to use straightforward, deterministic methods, not AI, because any situation where you’re trying to read a license plate is a situation where you need to be confident that your license-plate-reader isn’t hallucinating.

To detect the orientation of a car in a scene? That’s inverse rendering, which is classically just about the hardest problem in computer science. It’s barely solved in the most primitive, simplistic cases, let alone real-world input. You’ve got a Turing Award in store if you know how to solve it generally.

Those methods are useless here. The problem is basically one of hypothesis testing: given a possible number for the plate, how confident are we that it’s right? A single frame of video with no additional data gives you very little confidence. But with thousands of frames, with the additional knowledge of where the plate is “supposed” to be and with adjustments for lighting, etc., one can start to be much more confident.

But it’s all inherently probabilistic in nature. Deterministic methods can’t work since we can never be 100% confident that our hypothesis is correct. There’s always some chance that video noise corrupted the image in a way that actually supports a different number. It just becomes fantastically unlikely with enough data.
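
In rough Python terms, that hypothesis-testing framing looks something like this (purely a sketch: render_candidate stands in for a real model that draws a plate at the observed pose, lighting, and resolution, and observed_crops for the registered plate crops pulled from each frame):

```python
import numpy as np

def log_likelihood(candidate_image, observation, noise_sigma=12.0):
    """Gaussian pixel-noise model: how plausible is this observed crop
    if the plate really looked like candidate_image?"""
    diff = candidate_image.astype(np.float32) - observation.astype(np.float32)
    return -np.sum(diff ** 2) / (2.0 * noise_sigma ** 2)

def rank_candidates(candidate_plates, observed_crops, render_candidate):
    """Accumulate evidence across every available frame for each candidate
    plate string; no single noisy frame decides the answer."""
    scores = {}
    for text in candidate_plates:
        rendered = render_candidate(text)   # hypothetical imaging model
        scores[text] = sum(log_likelihood(rendered, obs) for obs in observed_crops)
    # Highest total log-likelihood first; the gap between the top candidates
    # is what tells you how confident you can actually be.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```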

There are photogrammetric Blender plugins for finding correspondences between images/frames and reconstructing a 3D scene from a 2D video of it. As well, one can reconstruct a clear image from a sequence of distorted images (e.g., from a telescope affected by atmospheric turbulence). One can try to jazz some of that up with “AI” techniques, but where you actually might want to use a trained neural network is to identify digits/letters in a (possibly distorted, rotated, noisy, etc.) image. For forensic purposes, you certainly do not want to use AI to generate a fake “enhanced” video.

I don’t know what Blender does specifically, but those techniques are usually quite primitive. They use simple image processing techniques (or user input) to identify features to track (say, the corner of something), and then look at how the feature moves around over time. It works for simple things like superimposing a 3D object on top of other imagery, but it’s fairly unsophisticated.
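
The sort of “simple image processing techniques” I mean is roughly this (a sketch using OpenCV’s standard corner detector and Lucas-Kanade optical flow on two consecutive grayscale frames):

```python
import cv2

def track_features(prev_gray, next_gray):
    """Classic (non-AI) tracking: pick corner-like features in one frame and
    follow them into the next with pyramidal Lucas-Kanade optical flow."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=8)
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    found = status.ravel() == 1
    # Matched positions in the two frames; a match-move tool solves for camera
    # or object motion from correspondences like these.
    return pts[found], new_pts[found]
```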

You don’t want to fake it. But the goal is to take some imagery and figure out the most likely physical configuration that is consistent with that imagery, including details that you could not make out with just a single frame. It’s an underdetermined problem, but then so is all visual processing of this nature. Even when the detail is there, you can’t be absolutely certain it wasn’t a trick of the lens or a weird projection or a plate mounted to a drone that’s hovering between you and the car or something else. It’s just that these hypotheses are not very likely compared to the alternatives, and they become even less likely as you add input data.

If you could somehow capture a “frame” of your retina’s data stream right at the limit of your visual acuity, you’d find it useless. Your brain uses the imagery over time and guesses about the likely content to fill in the details. And yet you somehow trust your brain (probably because it does a good job most of the time). We’d want a sophisticated AI to do the same on video data.
