An end to Moore's law

Not much utility for the average home user, but there are home users who have problems that sit somewhere between a traditional CPU and a GPU (I’m one). The Tilera chips would work for me, but the dev kit costs too much (12k).

Heh. If you have a chip that is a process driver, and you don’t have awful initial yields, you are leaving money on the table. (I’m in the middle of a bring-up now.)
I don’t know what you mean by “doesn’t work”, but no one expects the first tapeout to be the final one. There is an iterative process where you find the source of the problem, get the fab to fix the thing in some cases, and FIB it (rework it with a focused ion beam) in others to be able to work around it.
At a meeting I went to last month Intel reported that about 90% of logic bugs get caught before first silicon, and a somewhat smaller percentage of circuit bugs. I have not seen very many bugs that should have been caught by DFM rules - which are clearly absolutely necessary.

I worked in a microprocessor design group at Intel a while ago, and you are incorrect. All parts of the design are done in RTL first - for verification, at least. The control parts get synthesized; the datapath parts get hand-designed starting from the RTL. There are a number of tools to compare the circuit-level design with the RTL as part of the verification process. There were places where aggressive circuit design was done. This may be far more automated today. As I said, library cells are laid out by hand, and that is where the optimization is done.

I know of microprocessors which have been totally synthesized and done with more or less an ASIC flow - and they were very successful, especially in terms of schedule.

Even when I was there it was clear they were worried about when the processor would run out of steam. There was an article somewhere a few days ago (I think Computerworld) where someone from Intel was quoted as saying that smartphones could never do everything well. They’ve bought all sorts of companies to try to expand beyond processors, and that hasn’t worked at all.
When most heavy computing gets done in the cloud, and the rest gets done on thin clients that roam with you or on smartphones, Intel is going to be in trouble. They’ll sell chips for servers, sure, but it won’t be the volume they sell now.

Yeah, I was being very loose with the terminology. Without RTL you don’t even know what you are designing.

I really just don’t get their strategy. Well, maybe they think they can get the Atom small and power-efficient enough that it can compete with ARM, but I really can’t see it. ARM/Linux is brewing as a serious game changer. Microsoft lost the mobile and embedded platforms, and they won’t ever get them back. So that is the most critical part of the x86 software ecosystem gone from Intel’s market. Desktops will stay x86 for the foreseeable future, but handhelds, mobiles, netbooks, embedded, automotive, home entertainment - these are already mostly lost or under threat. The Itanium remains something I feel rather sad about. It is a lovely ISA. It deserves to succeed, but it is looking more and more unhappy. Maybe they will pull something out of a hat; maybe. They just about sank SGI because of it. Yet an Altix 4700 is a lovely bit of kit.

That’s because OO.org is a piece of shit with half the number of features of MS Word. I know it’s fashionable to try to position OO.org as a competitor for Word, but, well, it really isn’t, and using the thing for more than ten minutes tells you why.

Actually, there’s a quite obvious use for these GPUs: games. I’ve been to a few talks now about parallel programming, and almost all of the speakers preface their talk by explaining that video games are really what’s driving things here. For instance, I’ve seen talks from people at Codeplay, who make specialist compilers for PlayStations and the Cell, and they’re trying to reposition their technology in light of the hundreds of cores that are going to be in everybody’s computer in a few years.

Slightly off-topic, but a few weeks ago I was at a talk by John Strother Moore, a famous computer scientist. His research area is automated reasoning, and he was hired by AMD after the FDIV bug was found in the Pentium to verify the floating-point unit on the K5 (I think?). AMD had originally miscalculated the percentage of the die that could be used for the FPU, and had scaled the portion down a few months prior to fabricating. Of course, they had to throw away a load of verified chip design and basically wing it, hoping for the best. Then the FDIV bug appeared, and they collectively shat themselves. They brought Moore in to verify the FPU in ACL2 in ten weeks, knowing nothing about the floating-point spec. He managed to get it done, fix a load of bugs in the design, and the FPU went out bug-free.

I worked with some radar engineers on pre-generating radar cross-section models. The models are useful for predicting engagement success, refining algorithms, and even building a fingerprint library to compare unidentified “hits” against. The process of rendering a radar cross section in high fidelity takes days or sometimes weeks of CPU time on desktop or server CPUs… but the process of figuring out what an object looks like in multiple RF bands is very similar to the process that a computer goes through when generating 3D graphics for a known viewpoint. They both involve doing very very very large matrix transformations, which are best solved with fast RAM, huge throughput, and many CPUs sharing the load.
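
To make that concrete, here’s a rough CUDA sketch (my own, nothing to do with the linked paper) of the kind of bulk transform both workloads boil down to: pushing a million points through a 4x4 matrix, one thread per point. All the names and sizes are just illustrative.

```cuda
#include <cuda_runtime.h>

// Apply a row-major 4x4 matrix to every point; one thread per point.
__global__ void transform_points(const float4 *in, float4 *out,
                                 const float *m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 p = in[i];
    out[i].x = m[0]*p.x  + m[1]*p.y  + m[2]*p.z  + m[3]*p.w;
    out[i].y = m[4]*p.x  + m[5]*p.y  + m[6]*p.z  + m[7]*p.w;
    out[i].z = m[8]*p.x  + m[9]*p.y  + m[10]*p.z + m[11]*p.w;
    out[i].w = m[12]*p.x + m[13]*p.y + m[14]*p.z + m[15]*p.w;
}

int main()
{
    const int n = 1 << 20;                  // about a million points
    float4 *d_in, *d_out;
    float  *d_m;
    cudaMalloc(&d_in,  n * sizeof(float4));
    cudaMalloc(&d_out, n * sizeof(float4));
    cudaMalloc(&d_m,   16 * sizeof(float));
    // ... cudaMemcpy the points and the transform matrix up here ...
    transform_points<<<(n + 255) / 256, 256>>>(d_in, d_out, d_m, n);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_m);
    return 0;
}
```

The real RCS codes obviously do far more than this per element, but the shape of the problem - the same small transform applied to an enormous number of independent elements - is exactly what a GPU is built for.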

Pretty much everyone in the field is salivating at the prospect of running radar cross-section code on either a desktop with a bunch of parallel graphics cards, or a cluster of gaming consoles like the PS3. The thinking is that with only a little tweaking, they’ll be able to get a pretty serious jump in power and a major cost savings over “industrial” strength servers.

(Linky) http://ceta.mit.edu/PIER/pier81/01.07121302.pdf

Unfortunately Sony is no longer selling PS3s with the ability to run Linux. I think anyone at the low end of the market (i.e. supers not built by someone like IBM) who is interested in the Cell is going to be using NVIDIA’s GPUs instead.

NVIDIA is doing the right things with their tweaks to the GPU to make it less like a GPU, but I wish they would move faster towards task parallelism. Or I wish Intel would put something on the market (Larrabee cancelled, new chip research announced, easily a couple of years out).

Games are driving this stuff more than anything, but I don’t think that there are enough gamers willing to pay a premium to drive a mass business. Moving everything to the cloud might. As an aside, I’ve seen some papers about using the massive parallelism inside a GPU to do some really interesting scientific computation. Clever stuff.

When I was there, post FDIV, Intel was really, really serious about verification. Just about every RTL designer became a verification engineer at RTL freeze, and there was a boatload of software tools for it. Given the size and complexity of modern CPUs, I think they do a really good job, because the bug list is not all that big, especially considering that simulation and verification can only find so much.

Yeah, I’ve seen the same papers. Libraries that let you do general-purpose computation on GPUs are starting to appear for programming languages, too. I think there are a few for Haskell now.
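
I haven’t used the Haskell ones myself, but as I understand it they ultimately generate kernels much like the hand-written CUDA below - the classic saxpy (y = a*x + y), which is roughly what a high-level “map over two arrays” turns into. This is just a sketch; the host-side wrapper and its name are made up.

```cuda
#include <cuda_runtime.h>

// y[i] = a * x[i] + y[i], one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Hypothetical host wrapper: copy up, launch, copy back.
void run_saxpy(int n, float a, const float *h_x, float *h_y)
{
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}
```

The whole appeal of the library approach is that you write the one-line map and the library handles the allocation, copying, and launch configuration for you.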

Yeah, the microprocessor industry is a good source of motivating examples when writing grant proposals for automated reasoning research. AMD uses ACL2, and Intel uses HOL Light for verifying processor designs, amongst other tools. John Harrison is always giving talks on how Intel verify their FPUs.

I would say HPC is driving the changes to the GPU more than games. Things like a more flexible memory model, relaxed SIMD restrictions, cache, ECC, and double-precision floating point aren’t needed as much in games as they are in a wide variety of scientific apps. ATI keeps adding more raw graphics processing power, while NVIDIA is moving away from extreme data parallelism and is really the only one being used for scientific apps.

Red Hat has pulled the plug on the Itanic. I’d love to know how much Intel lost on this deal.

I don’t know if the 8-core Nehalem will match the Itanium, but it seems like it will be close enough that there is no reason for the Itanium to exist anymore. It has been beaten by POWER consistently, and now Intel’s own chip is close to matching it. Seems like it’s a goner.

I suspect Red Hat pulled the plug because the one big user of Itanium Linux, which was SGI, stopped using Red Hat a couple of years ago. Red Hat dropping Itanium isn’t the same thing as Linux dropping Itanium. SGI went with SUSE Linux Enterprise Server, and although they still offer RHEL, it isn’t the preferred option.

Itanium has architectural features that should allow it to beat out the others. But there are also other issues that run interference. Itanium relies upon the compiler in ways that more conventional ISAs don’t. In some ways this is a repeat of the RISC story. Were it not for compiler advances RISC would never have been a viable competitor.

Of course the other great shame was that Alpha was killed off. In theory Intel were putting a lot of the technology and people they bought with Alpha into Itanium. But that was rather a long time ago now, and we still haven’t seen the payoff.

Any multicore solution that tries to match a single core on performance is missing a key part of the equation. Unless you can use those cores, and use them enough to justify them, there remains a big gap.

I recall reading something years ago - they said that the processor chips all came off the same assembly line and were stress-tested; if they functioned well across the full range of temperatures, they were marked military grade or top speed. If they gave errors, they were marketed for the lower speed they would tolerate. Irregularities in the manufacturing might put a constriction on a conductor path, for example, or make a transistor site too slow.

Sometimes the demand was higher for the slower chips, so they would relabel good chips as slower to meet the demand. This meant if you were lucky your 1.8GHz was actually a 3.0GHz and you could overclock it successfully.

Almost. They definitely all come off the same line, and from the same lots. They get speed binned, which applies a number of functional and structural tests to see how fast an individual chip can go. We understand the relationship of temperature and voltage to speed, so it is not necessary to test at a lot of corners, unless you are trying to understand failures. Many chips are stressed - burned in - but this is to find early reliability failures before shipment. There are many early-life failures (which you plot in a bathtub curve, with a high early fail rate decreasing to a small stable fail rate for a long time, and then increasing again when you reach end of life), and burning in a chip moves it to the stable part of the bathtub curve.
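
For the curious, a common textbook way of writing that bathtub shape (a generic reliability model, not anything specific to any vendor) is a hazard rate with three additive pieces - a decaying infant-mortality term, a constant useful-life term, and a rising wear-out term:

$$\lambda(t) \;=\; \underbrace{\lambda_0\, e^{-t/\tau}}_{\text{infant mortality}} \;+\; \underbrace{\lambda_c}_{\text{useful life}} \;+\; \underbrace{\lambda_w \left(\tfrac{t}{T}\right)^{\beta}}_{\text{wear-out}}, \qquad \beta > 1.$$

Burn-in effectively pushes a part far enough along $t$ that the first term has mostly died away by the time it ships.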

Yes, chips do get downbinned depending on demand, but there is no real way of telling, except, maybe, if you have access to device id information held internally. (And can decode it.) I know how this works for us, I’m not sure about Intel consumer parts.

You are never going to see as wide a range as you give. Sometimes the situations where a speed test fails are rather subtle, and you might get away with not sensitizing the slow path or not noticing when a failure happens.

The Merced management was more scared of Alpha than of anything else. I’m not at all surprised that Intel didn’t get much out of Alpha - there was a real “we’re the smartest” element there. In the early days a lot of HP designers came to work on Itanium. When they left there was no attempt to benchmark the way HP did things to see if there was anything to be learned. PA-RISC was not a bad design.

Why do you say good compilers were required for the success of RISC? (Besides good compilers being necessary for the success of any chip.) One big motivator for RISC was the observation that the expensive and complicated instructions in CISC machines were hard to generate code for. (I did my dissertation on a very similar problem.) There is a direct linkage between vertical microcode and RISC. Dave Patterson did his dissertation on microcode verification, and microcode was heavily used at IBM, where RISC was really invented. There is an even more direct connection between microcode and VLIW machines. Josh Fisher did his early work on compaction across basic blocks after we had solved the problem of compaction within basic blocks. I saw him give a very early talk on VLIW, before Multiflow, at a Microprogramming Workshop.

But I’m still convinced that what killed Itanium was it doing such a poor job on the x86 instruction set where the installed base was. That, and its cost, made the case for switching very difficult.

Not so much good compilers as advances in compiler technology. The classical reason was register colouring: give the compiler lots of GP registers and let it work out the best use. I remember looking at VAX code years ago; it was amusing to see some of the arcane instructions the DEC compilers used. Yet the way you lost so many registers, and indeed the poor performance of many of those instructions, made it all a bit moot. The wide variation in their costs across different machines was an issue too.

IBM’s heavy use of microcode of course enabled them to pitch ISA-identical machines at very different price points, at a time when competitors essentially had no compatibility between models. Giving the customer such a huge safety zone around their investment in software was probably one of the most powerful arguments IBM had. Worked, too.

It seems that the issue with compiling for Itanium is what you allude to. It just seems hard to get the compiler to be able to schedule enough work across the wide instructions. Works well for intense numeric kernels, harder for less structured code.

I suspect you are right about the x86 compatibility performance causing Itanium such grief. It was astonishingly poor. They could have simply put a previous generation x86 core in one corner of the die and done better.

Sometimes one really wonders how Intel remain in the preeminent position they hold.

And, more importantly, emulate old workhorses like the 1401. And faster than the original. That was exactly what Intel missed - the fast part.

I could tell you some stories, but they’d probably shoot me, even at this late date. :stuck_out_tongue: