Why has CPU speed growth slowed down so drastically?

You miss my point. One could, in principle, create a game with graphics so detailed that you can count the pores on the enemy soldier’s face. Such a game would indeed require a great deal of processing power (whether on the CPU or on the graphics chip). But how much demand is there for being able to count the pores on a face, when you’re probably just going to be sending a rocket-propelled grenade through that face in the next half second? Games and other programs can be made more demanding, but what you get in return isn’t very much.

There is an almost infinite demand for CPU cycles. I’m currently working on a program that would be much more useful if it ran 200x faster. It has to process incoming data in real-time, and current technology limits it to relatively low data rates.

Computer graphics could be substantially improved in quality if we had real-time ray tracing. That isn’t about counting pores, it’s about realistic looking lighting.
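To give a feel for where the cycles go, here’s a toy sketch (my own illustration, not from any real renderer) of the innermost work a ray tracer does: one primary ray per pixel, tested against the scene geometry. Even this trivial single-sphere case is millions of intersection tests per frame, before you add shadow, reflection, and refraction rays or real scene complexity.

```cpp
// Minimal illustrative sketch of the per-pixel work in a ray tracer.
// Hypothetical code, not from any real engine.
#include <cstdio>

struct Vec3 { double x, y, z; };
static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Does a ray (origin o, direction d) hit a sphere of radius r at center c?
static bool hitSphere(Vec3 o, Vec3 d, Vec3 c, double r) {
    Vec3 oc = sub(o, c);
    double a = dot(d, d);
    double b = 2.0 * dot(oc, d);
    double k = dot(oc, oc) - r * r;
    return b * b - 4.0 * a * k >= 0.0;   // discriminant test
}

int main() {
    const int W = 1920, H = 1200;        // one frame at a modest resolution
    Vec3 center{0, 0, -3};
    long hits = 0;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            // One primary ray per pixel; real renderers add shadow,
            // reflection and refraction rays on top of this.
            Vec3 dir{(x - W / 2.0) / W, (y - H / 2.0) / H, -1.0};
            if (hitSphere({0, 0, 0}, dir, center, 1.0)) ++hits;
        }
    std::printf("%ld of %d primary rays hit the sphere\n", hits, W * H);
    return 0;
}
```

That’s 2.3 million tests for a single sphere and a single bounce at 1920x1200; a real scene has millions of triangles and many rays per pixel, 60 times a second, which is why real-time ray tracing keeps waiting on CPU and GPU growth.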

Conventional radio and television hardware could be replaced with software-defined radio (SDR) systems, which do in software what today’s specialized hardware does in dedicated circuitry. One box could replace all of your existing radio and television hardware, and be much more flexible and upgradeable. You could have a cell phone that would work on any network in the world, and also allow you to listen to radio or watch TV anywhere in the world.
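As a rough illustration of what “radio in software” means, here’s a toy sketch (my own, not from any SDR package) of the core loop: take digitized antenna samples, mix them with a software local oscillator to shift the wanted station down to baseband, then low-pass filter it. Every step runs per sample, at millions of samples per second, which is where the appetite for CPU cycles comes from.

```cpp
// Toy sketch of the inner loop of a software-defined radio:
// digital downconversion of one channel from a sampled RF/IF stream.
// Illustrative only; real SDRs use optimized filters and vector units.
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

std::vector<std::complex<double>>
downconvert(const std::vector<double>& samples,  // raw ADC samples
            double sampleRate,                   // e.g. 20e6 samples/s
            double stationFreq)                  // carrier to tune to
{
    const double pi = 3.14159265358979323846;
    std::vector<std::complex<double>> baseband(samples.size());

    // 1) Mix: multiply by a complex local oscillator at -stationFreq,
    //    shifting the wanted station down to 0 Hz.
    for (std::size_t n = 0; n < samples.size(); ++n) {
        double phase = -2.0 * pi * stationFreq * n / sampleRate;
        baseband[n] = samples[n] * std::complex<double>(std::cos(phase),
                                                        std::sin(phase));
    }

    // 2) Crude low-pass filter (moving average) to reject other stations.
    //    A real receiver would use a properly designed FIR/IIR filter.
    const std::size_t taps = 16;
    std::vector<std::complex<double>> filtered(baseband.size());
    std::complex<double> acc = 0.0;
    for (std::size_t n = 0; n < baseband.size(); ++n) {
        acc += baseband[n];
        if (n >= taps) acc -= baseband[n - taps];
        filtered[n] = acc / static_cast<double>(taps);
    }
    return filtered;   // ready for demodulation (FM, AM, ATSC, ...)
}
```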

We might finally get speaker-independent voice recognition and high-quality voice synthesis of the kind shown in the movie 2001.

I wasn’t talking so much about new features but the additional power of each new generation of CPUs. For example, today’s word processors can perform so much more than they could ten years ago just because they have vastly greater CPU resources. Without continued CPU growth it will become more difficult to add compelling functionality to word processors. Without the new functionality nobody will buy new versions of the software.

You’re kind of illustrating my point. :slight_smile: Simply breaking up tasks into separate threads won’t be enough. A word processor performs only one or two concurrent tasks. What is needed is a way to allow programmers to write software that will automatically break up smaller computing tasks across multiple cores. One avenue is through implicit parallelism.
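Today, explicit approaches still look something like the sketch below (my own illustration using standard C++ threading, not any particular product): the programmer has to spot the independent chunks and farm them out by hand, and prove to themselves that the chunks really are independent. The hope behind implicit parallelism is that a compiler or runtime would do that decomposition automatically.

```cpp
// Hand-rolled data parallelism: summing an array across however many
// cores are available. Illustrative sketch using standard C++ only.
#include <algorithm>
#include <future>
#include <numeric>
#include <thread>
#include <vector>

double parallelSum(const std::vector<double>& data) {
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (data.size() + workers - 1) / workers;
    std::vector<std::future<double>> parts;

    for (unsigned w = 0; w < workers; ++w) {
        std::size_t begin = w * chunk;
        if (begin >= data.size()) break;
        std::size_t end = std::min(begin + chunk, data.size());
        // Each chunk is summed on its own thread; the programmer, not the
        // compiler, had to establish that these chunks are independent.
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin,
                                   data.begin() + end, 0.0);
        }));
    }

    double total = 0.0;
    for (auto& p : parts) total += p.get();
    return total;
}
```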

Otherwise known as bloatware. :slight_smile: The problem is, what compute intensive functions does it make sense to add to a word processor? You can add all sorts of pretty graphics, and lots of features because you have lots of disk space, but we’re already getting to the point where 50% of the features are used by 1% of the users - and they make life more complicated for the other 99%.

To a certain extent all modern CPUs do this already at the instruction set level. I actually worked in this area a long time ago, on horizontal microcode compaction, which was the direct ancestor of VLIW architectures. The big issue is dataflow and access to global resources.
You do microcode compaction by first building a data dependency graph from the high level microcode, and then allocating nodes to slots in the horizontal microinstruction, taking resource dependencies and data dependencies into account. One project in my adviser’s group was a dataflow simulator. I figured that this was a perfect platform to see how much parallelism I could get, but it didn’t work out all that well because of data dependencies you can only catch on the fly.
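For anyone curious what “allocating nodes to slots” looks like in practice, here’s a stripped-down, hypothetical sketch of basic-block list scheduling: each micro-op goes into the earliest wide instruction whose required functional unit is free and whose predecessors have already issued. The real algorithms were far more sophisticated, but the shape is the same.

```cpp
// Toy list scheduler: compact a basic block of micro-ops into wide
// instructions ("rows"), respecting data dependencies and a one-op-per-
// functional-unit-per-row resource constraint. Illustrative only.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct MicroOp {
    std::string name;
    int unit;                 // which functional unit it needs (0..units-1)
    std::vector<int> deps;    // indices of ops that must issue earlier
};

std::vector<int> schedule(const std::vector<MicroOp>& ops, int units) {
    std::vector<int> row(ops.size(), -1);            // row assigned to each op
    std::vector<std::vector<bool>> busy;             // busy[row][unit]

    for (std::size_t i = 0; i < ops.size(); ++i) {   // ops in program order
        int earliest = 0;
        for (int d : ops[i].deps)                    // must follow its inputs
            earliest = std::max(earliest, row[d] + 1);
        int r = earliest;
        while (true) {
            if (r >= (int)busy.size())
                busy.resize(r + 1, std::vector<bool>(units, false));
            if (!busy[r][ops[i].unit]) break;        // unit free this row?
            ++r;                                     // else try the next row
        }
        busy[r][ops[i].unit] = true;
        row[i] = r;
    }
    return row;
}

int main() {
    // a = load; b = load; c = a + b; d = c << 1   (unit 0 = memory, 1 = ALU)
    std::vector<MicroOp> ops = {
        {"load a", 0, {}}, {"load b", 0, {}},
        {"add c", 1, {0, 1}}, {"shift d", 1, {2}}};
    std::vector<int> row = schedule(ops, 2);
    for (std::size_t i = 0; i < ops.size(); ++i)
        std::printf("%-8s -> instruction %d\n", ops[i].name.c_str(), row[i]);
}
```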

Itanium uses the compiler to exploit parallelism - in fact it has to (at least back when I was following it) because it doesn’t do a good job figuring it out internally. I’m not sure this has been tremendously successful, though.

There’s a tremendous amount of difference between 3 and 4 in terms of gameplay; for starters, 3 is set in WW2 and 4 is appropriately titled “Modern Warfare.” It’s not like all they do is add more graphics features and call it a sequel (at least not for the good sequels). The thing is, the newest games are designed to take advantage of the most recent technological advances. Most are scalable in the sense that if you don’t have a high-end system you can turn off a lot of the bells and whistles for the sake of keeping the animation framerates at a reasonably playable level, but that leaves you with a game that can actually look worse than its predecessor in some cases. If you’re an avid gamer, you want to tweak your setup to wring as much as you can out of your hardware and take advantage of those graphical goodies. And for a lot of avid gamers, bleeding edge tech may in fact be worth it.

That’s an aspect of computer architecture that I’ve been interested in. Why do things at run-time, and pay the penalty in time and circuitry, when you can do it at compile time? One drawback that I’ve seen mentioned is the lack of binary compatibility among processors in the same family due to the visibility of implementation details. Still, when I read a description of a modern processor, with all the effort expended on extracting parallelism from legacy code, I can’t help wonder if there is a better way to do things.

If you do it at run time you can parallelize dynamically; at compile time it is static. Internally you do speculative execution - that is, when you hit a branch you guess which way it is going (based on what happened the last time you hit it) and issue instructions for the path you think will be taken. If you’re wrong, you cancel them. Yes, it is complicated, and that is where a lot of those extra transistors get used.
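The “guess based on what happened last time” part is usually a table of small saturating counters. Here’s a minimal illustrative model (not any particular CPU’s predictor): each branch trains its own 2-bit counter, and the front end speculatively fetches down whichever path the counter currently favors.

```cpp
// Tiny model of a 2-bit saturating-counter branch predictor, the kind of
// structure a CPU consults before it knows the branch's real outcome.
// Illustrative sketch only; real predictors are far more elaborate.
#include <array>
#include <cstddef>
#include <cstdint>

class BranchPredictor {
    // One 2-bit counter per table entry:
    // 0 = strongly not-taken, 1 = weakly not-taken,
    // 2 = weakly taken,       3 = strongly taken.
    std::array<uint8_t, 1024> table{};   // all start at "strongly not-taken"

    std::size_t index(uint64_t pc) const { return pc % table.size(); }

public:
    // Prediction made at fetch time: start executing this path speculatively.
    bool predictTaken(uint64_t pc) const { return table[index(pc)] >= 2; }

    // Later, when the branch actually resolves, train the counter.
    // If the prediction was wrong, the speculatively issued instructions
    // are cancelled and fetch restarts on the correct path.
    void update(uint64_t pc, bool taken) {
        uint8_t& c = table[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};
```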

Our research was in compaction for basic blocks of microcode - that is between two branches. That’s fairly simple. Josh Fisher extended our work into compaction across blocks, which led to VLIW.

I’m not sure what you mean by binary compatibility. None of these techniques have any problems with old code. The ia64 architecture and instruction set was totally different from the ia32 one. The first Itanic, Merced, had a unit which took ia32 instructions and, IIRC, converted them to ia64 microcode and executed them. This didn’t work very well (in implementation), which is why the ia32 performance was so bad. There was talk of doing the conversion in software, which might actually have been better, and I seem to recall reading that this is what they finally did.

One obvious answer, in this case, is that you probably don’t know at compile time how many processors the program will be running on. You can still build in multithreading, etc., of course, but true optimization probably depends on the number of processors: with a single processor, the overhead associated with multithreading may well cost more than it’s worth.
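That compile-time-vs-run-time gap is exactly why a lot of code just asks the machine when it starts up. A rough sketch of the usual compromise (my own illustration in standard C++, with a made-up size threshold): query the core count at run time and fall back to the plain serial path when parallelism wouldn’t pay for its overhead.

```cpp
// Deciding at run time whether parallelism is worth the overhead.
// The size cutoff is a hypothetical number; the right value depends
// on the workload and the machine.
#include <future>
#include <numeric>
#include <thread>
#include <vector>

double sumSerial(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

double sumSmart(const std::vector<double>& v) {
    unsigned cores = std::thread::hardware_concurrency();  // known only now
    // On a single core, or for small inputs, thread startup and
    // synchronization cost more than they save: stay serial.
    if (cores <= 1 || v.size() < 100000)
        return sumSerial(v);

    // Otherwise split the work in two halves as a simple example;
    // a fuller implementation would split into one piece per core.
    auto mid = v.begin() + v.size() / 2;
    auto back = std::async(std::launch::async,
                           [&] { return std::accumulate(mid, v.end(), 0.0); });
    double front = std::accumulate(v.begin(), mid, 0.0);
    return front + back.get();
}
```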

The binary compatibility issue that he referred to was that the simplest form of VLIW instruction is made up of several micro-operations, one for each functional unit. The problem with this approach is that when you update the processor to a new one with more functional units, either the processor can’t decode the instructions at all, which is terrible, or it can decode them but the extra functional units are wasted because the instructions don’t have a micro-op specified for them - which means that upgrading your processor does nothing for you unless you recompile all of your apps. That isn’t possible (or is very difficult) in a lot of cases (the most obvious example being binary applications like Windows or Word - would Microsoft have to ship separate releases for every processor out there? Ouch.)
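To make the encoding problem concrete, here’s a hypothetical sketch of a fixed-slot VLIW bundle (not any real ISA). Because the number and kind of slots are baked into the instruction format, a binary compiled for the narrower machine physically has no bits with which to name the new functional unit, so it sits idle until everything is recompiled for the wider format.

```cpp
// Hypothetical fixed-format VLIW bundle: one micro-op slot per functional
// unit, filled in by the compiler. Illustrative, not any real ISA.
#include <cstdint>

struct MicroOpSlot {
    uint8_t opcode;             // what this functional unit should do (0 = nop)
    uint8_t dest, src1, src2;   // register operands
};

// A "version n" machine with exactly four functional units. Every bundle
// in every compiled binary has exactly these four slots, in this order.
struct BundleV1 {
    MicroOpSlot alu0;
    MicroOpSlot alu1;
    MicroOpSlot memory;
    MicroOpSlot branch;
};

// "Version n+1" adds a second memory unit. Old binaries, full of BundleV1
// bundles, contain no bits that could ever address `memory2`, so the new
// unit does nothing for them until the code is recompiled.
struct BundleV2 {
    MicroOpSlot alu0;
    MicroOpSlot alu1;
    MicroOpSlot memory;
    MicroOpSlot memory2;   // new slot - changes the bundle size and layout
    MicroOpSlot branch;
};
```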

ZPL is one attempt to straddle both worlds. Most of the optimization is done at compile-time with an eye on decent performance on single-processor CPUs.

Well, the code being relatively inefficient on a newer architecture isn’t usually considered to be a big problem, since the equivalent hardware is usually faster anyway. I’d hope they’d maintain instruction set compatibility across processor versions, at least, so I doubt that is a problem. It depends a lot on the level of parallelism the compiler produces. If you’re just talking about more copies of one FU (another multiplier, for instance) you could have the compiler find the parallelism from the data flow graph and let the instruction execution unit inside the processor handle resource conflicts. That would allow you to use all the new resources without recompilation. In any case, adding new functional units is a fairly significant change. Most new versions of a processor are just shrinks, and won’t have anything like that kind of difference.

Come to think of it, having a compiler produce code for proc version n that won’t run on n-1 because of explicit references to functional units that don’t exist in the older version would be a big problem, so I’m guessing that they do data optimization (and some across-basic-block optimization) only.

Why on earth would you bother with anything as tedious as that? No-one would even notice.

100% realistic smoke, flame, water, projectile physics, weather, sound and vegetation effects together with effective AI agents for 30+ infantry on your side, half-a-dozen AI squadmates, maybe a helicopter gunship or two and a view distance of a couple of kilometres - now you’re talking.

Even games like Far Cry 2 and Crysis have horribly simplified visual environments, for the simple reason that actually simulating a stand of trees with leaves fluttering in the breeze is brutal on the CPU. Heck, the dev team for Operation Flashpoint 2 apparently has a guy whose full-time job it is to build trees essentially by hand and then construct various damage models - that sort of effort is only necessary because proper simulation is impossible without monstrous computing power. Similarly strategy games like Supreme Commander will max out a quad-core CPU if you have the unit caps set high enough - the holy grail of RTS would be to build a game like Total War: Cannae with every one of the 140,000 soldiers simulated as an individual unit with morale, health, fatigue, momentum, LOS, command effect and so on, and wear/breakage calculations for every sword, shield and sandal strap.
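A back-of-the-envelope sketch (purely illustrative, not from any of those games) shows where the cycles would go: even a bare-bones per-soldier update touches every one of 140,000 individuals dozens of times a second, and naive line-of-sight or target selection scales with pairs of units, which is exactly why unit caps and hand-built approximations exist.

```cpp
// Rough cost sketch of individually simulating 140,000 soldiers.
// All fields and numbers are illustrative, not from any actual game.
#include <cstdio>
#include <vector>

struct Soldier {
    float x, y;                      // position
    float health, fatigue, morale;
    int   squad;                     // command/formation grouping
};

int main() {
    const std::size_t soldiers = 140000;
    const int ticksPerSecond = 30;

    std::vector<Soldier> army(soldiers, {0, 0, 100, 0, 100, 0});

    // Per-tick update (movement, fatigue, morale drift) is O(n) per tick.
    long long updatesPerSecond =
        static_cast<long long>(army.size()) * ticksPerSecond;

    // Naive line-of-sight / target selection is O(n^2) pairs per tick,
    // which is why real engines lean hard on spatial partitioning.
    long long naivePairsPerTick =
        static_cast<long long>(army.size()) *
        (static_cast<long long>(army.size()) - 1) / 2;

    std::printf("per-soldier updates per second: %lld\n", updatesPerSecond);
    std::printf("naive LOS pairs per tick:       %lld\n", naivePairsPerTick);
}
```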

The main reason games have stopped being so desperately resource-hungry is that development is mostly targeted at consoles (which are non-upgradeable) and the mass PC market. Developing big games nowadays is so expensive that it’s hard to turn a profit on the enthusiast PC market. Crysis was the last big PC-only shooter targeted at the hardcore and as far as I know you still (a year after release) need several thousand dollars worth of PC hardware to run that sucker with all the options turned right up. It cost $22 million to make, which is a lot of money to gamble on people liking the reviews so much they’re willing to spring for a hardware upgrade.

Some bits from an article about Intel’s new i7 core:

Notably, improving the raw clock speed doesn’t seem to be anyone’s priority.

I wouldn’t say it’s not a priority; it’s just that it’s far more difficult right now to increase clock speed than it is to add cores. Only a subset of computations can make use of parallel processing, so the clock speed problem will not go away.
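The textbook way to put that is Amdahl’s law. As a rough worked example (the 90% figure is just illustrative), a program that is 90% parallelizable tops out at a 10x speedup no matter how many cores you throw at it - the serial 10% is the part that only faster clocks or better single-thread design can help:

```latex
% Amdahl's law: p = parallelizable fraction, n = number of cores
S(n) = \frac{1}{(1 - p) + \tfrac{p}{n}}
% Example with p = 0.9:
S(4) = \frac{1}{0.1 + 0.9/4} \approx 3.1, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{0.1} = 10
```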

As evidence that it’s still a priority, IBM’s Power6 is clocked at 4.7 GHz. In the lab they’ve gotten it and the Cell up to 6 GHz (if I remember correctly), so they are clearly working on it, and I’m sure Intel is also.

Crysis is a beast, but I’d say this is exaggeration. I built a new core system (mb/cpu/ram/psu/gpu/case) for about $700 last spring and it’ll run Crysis on the high graphics setting at consistently over 30 FPS at 1600x1200.

As for raw CPU clocks - multiple cores isn’t the only way CPUs are improving. Modern single cores, clock for clock, can do more work than older ones through improved design, better memory bandwidth, and other factors.

I’m rusty about this stuff since it’s been 20 years since college (8086 or 80286 anyone?)

Power consumption was also reduced when the supply voltage was reduced on the CPU along with other peripheral chips. For example, Intel’s x86 chips operated at 5 volts, then 3.3 V, and what, around 1.2 volts for the core supply (Vcc) now?

You get three advantages: 1) reduced power consumption (power drops by more than half when going from 5.0 V to 3.3 V); 2) increased speed, since binary transitions take only about 66% of the time they used to; and finally 3) less EMI.
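For anyone who wants the arithmetic behind point 1: to a first approximation, dynamic (switching) power in CMOS scales with the square of the supply voltage, so the 5.0 V to 3.3 V step alone cuts it to well under half at the same clock. (This is the standard first-order model, not numbers for any particular chip.)

```latex
% First-order CMOS dynamic power model:
% alpha = activity factor, C = switched capacitance, f = clock frequency
P_{dyn} \approx \alpha \, C \, V_{dd}^{2} \, f
% Scaling the supply from 5.0 V to 3.3 V at the same frequency:
\frac{P_{3.3}}{P_{5.0}} = \left(\frac{3.3}{5.0}\right)^{2} \approx 0.44
```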

Shrinking the size of the transistors per Moore’s law offsets those power-reduction gains, though: after about 36 months you’re making transistors a quarter of the size, and packing four of them into the area one of the old, four-times-larger transistors used to occupy roughly doubles the power drawn from that same area.

Which probably does add up to several thousand dollars worth of PC. Don’t forget to count the cost of your time in putting that together.

Given that the settings go up to “Very High”, and that native resolution for an ordinary 24" widescreen is more like 1920x1200, I don’t think I’m exaggerating at all. Particularly since 30 fps is only a little bit above the minimum framerate where you start to notice the game jerking.

I normally play my games at 1920x1200 or 1680x1050 (giving a small border), and all the settings on very high - but not with Crysis or Far Cry 2 :frowning:

Yeah, I forgot to include the gold plated hard drive coolers in my cost assessment.

Ok, so $700 for all that, $25 for a DVD burner, $20 for a keyboard, $50 for a mouse, $75 for a 640 GB hard drive, $50 for a heatsink and fan and you’ve got $920. Something like 4 hours to throw together (my heatsink had some quirks that took a while to resolve. Typically a system would take 1-2). Still nowhere near “thousands of dollars”.

I may be mixing it up with another game, but I think “very high” is actually something that just turns on DX10 on the high settings. It’s pointlessly confusing.

And with 30 FPS - people typically don’t consider that smooth, because while your average fps may be 30, that often means you’ll be dropping down to 15-20 for brief periods of time. Crysis is remarkably good at maintaining a relatively steady framerate, so 30 fps feels surprisingly smooth. You say “ordinary 24-inch monitor” as if it were the predominant monitor size out there… in any case, 1920x1080 isn’t that many more pixels than 1600x1200.

Transcoding as we use it today is a relatively fresh technology, and it suffers far more from poorly optimized software than from CPU constraints. Yes, it’s annoying to have to transcode your video in real time to, say, an H.263-compatible format when it takes far less time to compress it from the source.

This is mostly due to a mass of “weakest link” scenarios, particularly for streamed content and “lowest common denominator” formats. Considering the future, one might hope that a 1:1 data format for video is made standard, eliminating the need for transcoding, or at least reducing it to the less complex downsizing task.

(This article might be of interest to you - http://www.discover.uottawa.ca/~leizj/Experiences/TransServer/TransServer.html)

The biggest CPU-related problem with transcoding, as I have been told, is currently overhead.