How does it work without some kind of AI recognition system? Here’s a picture as an example. A person can look at this picture and realize the beach ball is in the immediate foreground and the band is further back. But a computer might “think” that there’s a giant ball floating right next to the band or even that the ball is in the background and the band is standing in front of it. Meanwhile, there’s a disco ball which actually is behind the band. How would a computer “know” that one ball is in the foreground and the other is in the background?
OK, to start with, in that picture the back wall has a continuous pattern on it, with rather sharp definition, and so must be in focus (or close to it). The ball overlaps that continuous pattern, and has a fuzzy edge consistent with its being out of focus (it could also be an inherently fuzzy object, but a simple heuristic could reject that). So the ball must be at a significantly different depth than the wall, and since the boundary between the two is fuzzy, it must be the ball that’s in front of the wall, not vice versa. And since the ball is fuzzier than anything else in the image, it must be the closest object in the image, closer than the musicians.
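To make the "fuzzy edge" idea concrete, here is a minimal sketch of one way a program might score how sharp an edge is. Everything in it is an assumption for illustration: the function names, the 1-D row of pixel brightnesses, and the box blur standing in for real lens defocus. It is not how any particular product does it.

```python
def edge_sharpness(row):
    """Steepest brightness jump between neighboring pixels,
    as a fraction of the total contrast across the row (0.0 to 1.0)."""
    contrast = max(row) - min(row)
    if contrast == 0:
        return 0.0
    steepest = max(abs(b - a) for a, b in zip(row, row[1:]))
    return steepest / contrast


def box_blur(row, radius):
    """Simple 1-D box blur with edge clamping (a crude stand-in for defocus)."""
    n = len(row)
    out = []
    for i in range(n):
        window = [row[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


# Hypothetical data: a crisp edge in the wall pattern, and the same edge defocused.
wall_edge = [0] * 10 + [255] * 10
ball_edge = box_blur(wall_edge, 4)

print(edge_sharpness(wall_edge))  # 1.0: the whole transition happens in one pixel
print(edge_sharpness(ball_edge))  # much lower: the transition is smeared out
```

A low score along an object's outline, next to a high score in the pattern it overlaps, is the kind of evidence described above for "out of focus, and in front".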
Then there’s also a floor, with some degree of texture itself. It’d take some heuristics to recognize it as a floor, but since most pictures have horizontal surfaces in them, it’s probably not too hard to figure out which portion of the picture must show it. And the positions of other objects (and their shadows) on the floor help give some depth information, too.
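As a rough illustration of how a floor gives depth, here is a sketch using standard pinhole-camera geometry. It assumes a level camera (optical axis parallel to the floor), a known camera height, a known focal length in pixels, and a known horizon row, none of which come from the picture in question. The function name and numbers are made up for the example.

```python
def floor_distance(row, horizon_row, focal_px, camera_height_m):
    """Distance along the floor to a point seen at image row `row`.

    Rows are counted downward from the top of the image, so floor
    points appear below (at larger row numbers than) the horizon.
    """
    offset = row - horizon_row
    if offset <= 0:
        raise ValueError("point is at or above the horizon, so it is not on the floor")
    return focal_px * camera_height_m / offset


# Hypothetical numbers: 1000 px focal length, camera 1.5 m above the floor,
# horizon at row 500.
print(floor_distance(800, 500, 1000, 1.5))   # 5.0 (metres)
print(floor_distance(1100, 500, 1000, 1.5))  # 2.5 (metres)
```

So where an object's feet (or shadow) touch the floor in the image tells you roughly how far away it is: the lower in the frame, the closer.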
Get depth information for a few objects, and you can start to more precisely calibrate the out-of-focus blurring, and use that to estimate depths for other objects not easily placed on the floor.
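Here is a sketch of what that calibration could look like. It assumes a thin-lens model, under which the blur diameter of an object nearer than the focus plane is a straight-line function of 1/depth, so two objects with known depth are enough to fit the line. The function names and numbers are hypothetical, and a real program would need many more samples and a way to handle objects behind the focus plane.

```python
def calibrate(d1, c1, d2, c2):
    """Fit blur = m * (1 / depth) - k through two (depth, blur) samples.

    Assumes both objects are nearer to the camera than the focus plane.
    Returns (m, k); the focus distance is then m / k.
    """
    m = (c1 - c2) / (1.0 / d1 - 1.0 / d2)
    k = m / d1 - c1
    return m, k


def depth_from_blur(blur, m, k):
    """Invert the fitted line to estimate depth from a measured blur diameter."""
    return m / (blur + k)


# Hypothetical measurements: an object 2 m away shows 4 px of blur,
# and one 4 m away shows 1 px of blur (both placed using the floor).
m, k = calibrate(2.0, 4.0, 4.0, 1.0)
print(m / k)                          # 6.0: estimated focus distance in metres
print(depth_from_blur(10.0, m, k))    # 1.0: a very blurry object is about 1 m away
```

The point is just that once a couple of depths are pinned down some other way, blur stops being a vague "near or far" hint and becomes something you can put numbers on.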
Of course, if the computer has any capability for recognizing common objects like humans (which is becoming pretty commonplace nowadays), that can help, too.
And if this were merely one frame from a video, the program could also use information from adjacent frames to puzzle out some aspects that it couldn’t as easily get from a single frame. Most of the objects in a video won’t be moving from one frame to the next in three-dimensional space, so any apparent motion of those objects is due to the motion of the camera.
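For the video case, the underlying relation is the usual parallax one: for a static scene and a camera that slides sideways, nearby objects shift more between frames than distant ones. The sketch below assumes a pure sideways camera move of known size and a known focal length in pixels. Both are assumptions for illustration, and real footage would need the camera motion estimated first.

```python
def depth_from_parallax(shift_px, focal_px, baseline_m):
    """Depth of a static point from its apparent shift between two frames.

    shift_px:   how far the point moved in the image, in pixels
    focal_px:   focal length expressed in pixels
    baseline_m: how far the camera moved sideways between the frames
    """
    if shift_px <= 0:
        raise ValueError("no measurable shift, so depth cannot be estimated")
    return focal_px * baseline_m / shift_px


# Hypothetical numbers: 1000 px focal length, camera moved 0.5 m.
print(depth_from_parallax(100, 1000, 0.5))  # 5.0 (metres): large shift, close object
print(depth_from_parallax(25, 1000, 0.5))   # 20.0 (metres): small shift, far object
```

In the example picture, the beach ball would slide across the frame much faster than the band, and the disco ball slower still.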