Actually, we’re not too far off.
Pre-rendered cutscenes have reached a plateau of visual fidelity over the last year and a half. Some game engines (like CryEngine 2 and 3) are pretty much up to the standard of the average cutscene - aside from those of cutting-edge companies such as Ubisoft. Crysis played at the Ultra quality setting is almost indistinguishable from the “CGI” trailers produced for the game. (The caveat is, of course, that the Ultra quality setting was designed for computers years ahead of its release date.)
So, from a visual fidelity standpoint, we’re pretty much there.
That leaves two factors: animation and technique.
Animation is going to be a limitation, seeing as most cinematic trailers are highly contextual. That is to say, characters are not using stock attacks, but rather acting independently. Computer games have stock attacks because you need a predictable attack that can be bound to a button, which means you’re going to have repetitive animations and, at best, variations on a theme.
However, “custom” input configurations, like the new motion detection scheme Sony unveiled at E3 yesterday, will be able to blunt this. When your control scheme isn’t limited to a finite number of input commands, developers will be able to approach animation in a different way, linking the skeleton of the actor directly to the physical input of the player. Instead of pressing A to do a left-right swing with his sword, the player will swing the widget from left to right, making the skeleton replicate his exact move.
Which would be contextualized animation, if I’m making sense.
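To make the difference concrete, here's a rough sketch in Python. Everything in it is made up for illustration (`MotionSample`, `Bone`, `play_stock_attack`, `drive_bone_from_input` and the numbers are my own assumptions, not any real controller SDK or engine API). It just contrasts a button that replays fixed keyframes with a hand bone that copies whatever the player's controller reports.

```python
from dataclasses import dataclass


@dataclass
class MotionSample:
    """One hypothetical controller reading: a position in metres."""
    x: float
    y: float
    z: float


@dataclass
class Bone:
    """A single skeleton joint (hypothetical, position only)."""
    name: str
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


# Stock approach: pressing a button replays the same pre-authored keyframes.
STOCK_SWING = [
    MotionSample(-0.5, 1.0, 0.3),
    MotionSample(0.0, 1.0, 0.5),
    MotionSample(0.5, 1.0, 0.3),
]


def play_stock_attack(hand):
    """Move the hand through the canned swing; identical every time."""
    path = []
    for key in STOCK_SWING:
        hand.x, hand.y, hand.z = key.x, key.y, key.z
        path.append((hand.x, hand.y, hand.z))
    return path


def drive_bone_from_input(hand, samples, arm_scale=1.0):
    """Copy each controller sample onto the hand bone (1:1 mapping).

    arm_scale is an assumed factor for fitting the player's reach
    to the character's proportions.
    """
    path = []
    for s in samples:
        hand.x, hand.y, hand.z = s.x * arm_scale, s.y * arm_scale, s.z * arm_scale
        path.append((hand.x, hand.y, hand.z))
    return path
```

The stock swing traces the same path on every button press, while the input-driven version follows whatever the player actually did. A real game would also need filtering and inverse kinematics for the rest of the arm, which this leaves out.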
Technique is the last pillar. I couldn’t think of a better word, but what I mean is, basically, how the game treats the camera and what’s happening in the game. Newer games are rapidly integrating “cinematic” camera work into their actual gameplay. The best example, off the top of my head, is probably the Bourne Conspiracy game, released back in summer 2008.
Here’s a video to showcase what I mean: http://www.gamespot.com/ps3/action/robertludlumsthebourneconspiracy/video/6176997/robert-ludlums-the-bourne-conspiracy-gameplay-movie-1?tag=videos;title;17
(NSFW: Violence)
What you see is that the camera is reacting to what’s happening on screen. If your guy gets uppercut into the air and knocked down into a crate, the camera will twist to get a cinematic view of that. If an enemy strikes from a predictable angle, you get a split second to make a “counter” move (in this case a quick time event), and instead of him grabbing you by the hair and slamming you against the railing, you grab his wrist, pivot and throw him to the ground, or break his wrist. While these are stock fighting game tricks, the camerawork is what draws it closer to a movie, or a pre-rendered trailer.
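Here's a minimal sketch of what a camera that reacts to gameplay events could look like. I have no idea how the Bourne Conspiracy actually implements it, so treat this as a guess: the event names, the shot presets and the `ReactiveCamera` class are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CameraShot:
    """A hypothetical camera framing preset."""
    name: str
    distance: float  # metres from the action
    duration: float  # seconds to hold before returning to the default


DEFAULT_SHOT = CameraShot("over_shoulder", 2.5, 0.0)

# Assumed mapping from gameplay events to cinematic framings.
EVENT_SHOTS = {
    "knockdown": CameraShot("low_side_angle", 4.0, 1.5),
    "counter": CameraShot("close_up", 1.2, 0.8),
}


class ReactiveCamera:
    def __init__(self):
        self.shot = DEFAULT_SHOT
        self.time_left = 0.0

    def on_event(self, event):
        """Switch to a cinematic shot if this event has one; otherwise ignore it."""
        shot = EVENT_SHOTS.get(event)
        if shot is not None:
            self.shot = shot
            self.time_left = shot.duration

    def update(self, dt):
        """Advance by dt seconds and fall back to the default shot when time runs out."""
        if self.shot is not DEFAULT_SHOT:
            self.time_left -= dt
            if self.time_left <= 0:
                self.shot = DEFAULT_SHOT
                self.time_left = 0.0
        return self.shot
```

The point is that the fight logic only has to announce what happened ("knockdown", "counter") and the camera decides how to frame it, which is roughly what a cutscene director does by hand.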
(For the record, the Bourne Conspiracy wasn’t a great game. It was good, but its cinematic action style was where the real potential lay.)
So, to conclude, it’s my belief that when contextualized input (i.e., direct player control), contextualized animation (i.e., direct 1:1 transfer of player input) and the skills to code a contextualized camera reach the level visual fidelity is currently at, games will not only have reached the level of pre-rendered trailers, they will have exceeded it.