Video de-pixelation (CSI style!)

Is there software or published techniques for taking a low-res video and increasing its resolution?

I know this is possible, to some extent, because video contains a lot more information than just a photo. You have many frames, each of which is almost identical but just slightly different (in a non-random way). Online I’ve seen some references to “subtractive techniques.” Does anyone know more specifically? I realize it’ll only improve quality so much, but I think it should be a lot better than dumb interpolation.

This often goes by the name “super-resolution.” Here is one review article.

Interesting stuff. But seriously, all you need to do is get the frame grab up on screen and click and drag a box to zoom in on it. It’ll go all blocky at first, but then you can click “Enhance” or some such button, and all the detail will appear just like that. If you have really good software you can even rotate the image in 3-D.

I’ve seen it done countless times in movies, so it must be pretty easy by now.

Or, if the person’s face was out of view of the camera, you can zoom in on the chrome bumper of the car across the street and blow up the reflection. :rolleyes:

I don’t think so.
Film used at movie theatres is actually a series of photos and can be enlarged to theatre-size proportions and remain crystal clear.
I don’t know of any video source that can be enlarged that big and retain the resolution of film.

Thanks a lot!

I think his point was that you can use successive frames of video, each of which will have the pixels aligned in a slightly different way with the subject matter, and by analysing the way the pixels vary, get a higher-resolution picture than a still image of the same resolution.

I haven’t fully read the link Omphaloskeptic gave, but it appears that’s what the technique entails, more or less.

With pretty impressive results from the look of it, too. Not as ridiculously amazing as in the movies, but the reconstruction of the bell pepper image (fig. 6) seemed to work pretty damn well. All of a sudden, those “Enhancing” shots in shows almost don’t seem so silly anymore. (I said, “almost”!)

It’s worth noting that all of the examples in the linked paper were generated from a flat image, with the camera moving across it in some way. With a real, three-dimensional object, you’d get changes in perspective, too, which would make it much harder to apply such a method. It still might be good for license plates, but I wouldn’t trust it with a face.

I have often wondered, with a set of grainy video images (say, of a suspect in a robbery), why they didn’t use some sort of composite image to enhance the quality. Perhaps it is still too hard.

By the time we have computers capable of creating clear resolution from crappy pictures and video, everyone will have better cameras.
That’s my guess.

Hell, my new phone has a better camera than my camera for some things.

At sufficiently high framerates, perspective issues are not an insurmountable problem. Locally, the image of a rotating object looks like a linear transformation (a translation, possibly with shear and rescaling), except for the (usually not many) pixels where significant obscuration is happening. So if you can segment the image, then you can handle these transformations as part of a time-dependent but linear warping parameter.
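Here’s a minimal numpy sketch of that linearized-motion idea, assuming you already have matched points between two consecutive frames inside one segmented region (the function name and the toy points are made up for illustration):

```python
import numpy as np

def fit_local_affine(pts_prev, pts_next):
    """Least-squares fit of a 2-D affine warp  x' = A x + t
    to matched points from two consecutive frames.

    pts_prev, pts_next : (N, 2) arrays of corresponding (x, y) positions
    within one segmented region (N >= 3).
    Returns the 2x2 linear part A and the translation t.
    """
    pts_prev = np.asarray(pts_prev, dtype=float)
    pts_next = np.asarray(pts_next, dtype=float)
    # Build the design matrix [x, y, 1] so A and t come out of one solve.
    ones = np.ones((len(pts_prev), 1))
    design = np.hstack([pts_prev, ones])            # (N, 3)
    # Solve design @ params = pts_next in the least-squares sense.
    params, *_ = np.linalg.lstsq(design, pts_next, rcond=None)
    A = params[:2].T                                # rotation/shear/scale part
    t = params[2]                                   # translation part
    return A, t

# Toy usage: a region that rotated ~2 degrees and drifted half a pixel.
theta = np.deg2rad(2.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
pts0 = np.array([[10.0, 10.0], [30.0, 12.0], [20.0, 25.0], [15.0, 18.0]])
pts1 = pts0 @ R.T + np.array([0.5, 0.2])
A_est, t_est = fit_local_affine(pts0, pts1)
print(A_est, t_est)
```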

Traditionally, to save on storage space, security cameras take pictures at very low framerates, say 1-5 fps (I don’t know if this is still the case). This is probably too slow for the linearized approximation above, and it also means that you might only have a few frames of data to average instead of dozens or hundreds.

There are some quite impressive routines out there to perform something called deconvolution - that is, reversing some of the apparent loss of detail in out-of-focus images (if you assume that points of light have been blurred out into Gaussian circles, for example, you can reverse the effect). Sometimes it introduces artifacts into the deconvolved image, but the results can actually be pretty impressive.
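If you assume the blur really is a known Gaussian, a bare-bones frequency-domain version of this is only a few lines of numpy. This is just a sketch of the general idea, not what any particular program does; the damping constant k is made up and controls how much noise amplification you tolerate:

```python
import numpy as np

def gaussian_psf(shape, sigma):
    """Gaussian point-spread function, centred, normalized to sum to 1."""
    y, x = np.indices(shape)
    cy, cx = (shape[0] - 1) / 2.0, (shape[1] - 1) / 2.0
    psf = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
    return psf / psf.sum()

def wiener_deconvolve(blurred, psf, k=1e-2):
    """Frequency-domain deconvolution with a constant damping term k.

    Dividing by the PSF's spectrum directly blows up wherever that
    spectrum is tiny; the k in the denominator tames those frequencies.
    """
    H = np.fft.fft2(np.fft.ifftshift(psf), s=blurred.shape)
    G = np.fft.fft2(blurred)
    F_est = np.conj(H) * G / (np.abs(H) ** 2 + k)
    return np.real(np.fft.ifft2(F_est))

# Toy usage: blur a test image with a sigma=2 Gaussian, add a little noise,
# then try to undo the blur.
rng = np.random.default_rng(0)
image = np.zeros((64, 64))
image[20:40, 25:45] = 1.0
psf = gaussian_psf(image.shape, sigma=2.0)
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) *
                               np.fft.fft2(np.fft.ifftshift(psf))))
blurred += 0.01 * rng.standard_normal(image.shape)
restored = wiener_deconvolve(blurred, psf, k=1e-2)
```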

One program I know of that is able to do this is called ‘Image Analyzer’ - I think it’s freeware (or at least the version I have is), but I can’t find any current links for it.

Another way to do it is to superimpose successive frames of a moving image - if the subject is moving slightly, but not spinning or anything, you can infer a lot of detail by comparing pixels across a successive set of images, because any given piece of detail will cross a pixel boundary at some point.
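A rough numpy sketch of that superimposing trick, assuming you already know each frame’s sub-pixel shift (in real footage you’d have to estimate it first); the function name and the scale factor are just for illustration:

```python
import numpy as np

def shift_and_add(frames, shifts, scale=4):
    """Naive super-resolution by accumulation on a finer grid.

    frames : list of (H, W) low-res images of a (nearly) static scene
    shifts : list of (dy, dx) sub-pixel shifts per frame, in low-res pixels
    scale  : upsampling factor of the output grid
    """
    H, W = frames[0].shape
    acc = np.zeros((H * scale, W * scale))
    count = np.zeros_like(acc)
    ys, xs = np.indices((H, W))
    for frame, (dy, dx) in zip(frames, shifts):
        # Map each low-res sample to its nearest spot on the fine grid.
        fy = np.clip(np.round((ys + dy) * scale).astype(int), 0, H * scale - 1)
        fx = np.clip(np.round((xs + dx) * scale).astype(int), 0, W * scale - 1)
        np.add.at(acc, (fy, fx), frame)
        np.add.at(count, (fy, fx), 1)
    # Average wherever at least one sample landed; leave the rest at zero.
    return np.where(count > 0, acc / np.maximum(count, 1), 0.0)
```

With enough frames whose shifts aren’t whole-pixel multiples, the fine grid fills in; the holes that remain are why real methods interpolate or regularize rather than just averaging.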

It’s still nothing like the bullshit ‘enhance’ you see on CSI and the like.

True, but I was thinking of a more modelling-type approach, where a program would recognise the head and main features (eyes, nose, etc.), perhaps given a bit of human prompting, construct a crude 3-D model, and refine it further with each frame. I am sure that with even 8 grainy frames a much better picture of the suspect would emerge.

That’s basically what the paper linked to above was doing.

As for deconvolution, it’s a well-posed problem iff the point-spread function is known, and the image has infinite dynamic range, and there’s no noise. None of these conditions is ever true in practice, so deconvolution is sharply limited (or rather, blurrily limited). You can sometimes deduce some things from a priori knowledge of the nature of the image: For instance, in an astronomical photo, you can find a blurred image of a background star, and assume that the image of the star is the point-spread function. But ultimately, the reconstruction of each pixel of the image will depend on very small changes in the values of nearby pixels, and the small changes involved might be smaller than the count resolution of those pixels, or of other noise sources.
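Written out in the Fourier domain (standard textbook notation, nothing specific to the linked article), the problem looks like this:

```latex
% Blurred image g = PSF h convolved with the true image f, plus noise n.
% Convolution becomes multiplication in the Fourier domain, so the naive
% inverse divides by \hat{H}:
\[
  \hat{G}(\omega) = \hat{H}(\omega)\,\hat{F}(\omega) + \hat{N}(\omega)
  \quad\Longrightarrow\quad
  \frac{\hat{G}(\omega)}{\hat{H}(\omega)}
  = \hat{F}(\omega) + \frac{\hat{N}(\omega)}{\hat{H}(\omega)},
\]
% and wherever |\hat{H}(\omega)| is tiny (as it is at high frequencies for
% any realistic PSF), the noise term gets amplified without bound.
```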

Sure, it can be done, but it’s harder. Even with only three parameters in your linear transformation, that’s a huge increase in the size of your parameter space. I would be a lot more impressed with the results of that paper if they had done it, as would be necessary for most real-world applications, rather than avoiding the issue by using flat photos.

Sure, it’s harder. There may be as many as six parameters to estimate rather than two, and you might only be able to assume that the transformation is approximately linear in 8x8 blocks rather than the whole 320x240 image… but 8*8 is still 64, which is a lot bigger than 6. Video at ~30fps, even at low resolution, has an absurd amount of information. Using all of that information well is a hard problem.
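The arithmetic behind that, spelled out using only the numbers in this post:

```latex
% Per 8x8 block, per pair of frames:
%   measurements:  8 x 8 = 64 pixel values
%   unknowns:      at most 6 affine motion parameters
% And per second of 320x240 video at 30 fps:
\[
  \frac{8 \times 8}{6} \approx 10.7
  \qquad\text{and}\qquad
  320 \times 240 \times 30 \approx 2.3 \times 10^{6}\ \text{samples/s},
\]
% so the motion estimate is heavily overdetermined, and a second of video
% carries far more raw data than any single frame.
```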

Here is one result on this sort of segmented superresolution.

Have all super-resolution techniques assumed that only the camera is moving (i.e., that the entire scene has a single, uniform translation)? You surely need a technique that deduces and then compensates for local transformations, and that would handle both 3-D rotation and simple objects-moving-in-different-directions. Seems like kind of a basic necessity to me, and not too difficult (MPEG standards already do this analysis to some extent*).

*They actually only track translations (of blocks from 4x4 up to 16x16 pixels). This is computationally cheap these days. They don’t try to deduce rotation and scaling. However, reducing everything to translations seems pretty sufficient for high-fps video.
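A bare-bones version of that translations-only block matching, as a numpy sketch (the block size and search radius are arbitrary, and real encoders are far cleverer about the search than this brute-force loop):

```python
import numpy as np

def block_motion(prev, curr, block=16, radius=7):
    """MPEG-style motion estimation: for each block of `curr`, find the
    integer (dy, dx) translation into `prev` with the smallest sum of
    absolute differences (SAD).  Returns one motion vector per block.
    """
    H, W = curr.shape
    by, bx = H // block, W // block
    vectors = np.zeros((by, bx, 2), dtype=int)
    for i in range(by):
        for j in range(bx):
            y0, x0 = i * block, j * block
            target = curr[y0:y0 + block, x0:x0 + block]
            best, best_v = np.inf, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    # Skip candidate blocks that fall outside the frame.
                    if y1 < 0 or x1 < 0 or y1 + block > H or x1 + block > W:
                        continue
                    cand = prev[y1:y1 + block, x1:x1 + block]
                    sad = np.abs(target.astype(int) - cand.astype(int)).sum()
                    if sad < best:
                        best, best_v = sad, (dy, dx)
            vectors[i, j] = best_v
    return vectors
```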

EDIT: I think this is what Omphaloskeptic was saying, but the paper he linked to seemed to be talking about something else.