Actually, neither. Memory latency is the killer. Bandwidth from memory is indeed huge, but the latency is also horrid: hundreds of clock cycles to complete a transaction. So a cache miss can basically stop the program dead in its tracks, and a pathological cache layout can bring a program to its knees. You can contrive memory layouts that show a 100-to-1 difference in performance for the same code just by messing with the way the cache works. Memory transactions typically fetch an entire cache line (or a multiple of one) in one go. Controllers have useful optimisations, such as grabbing the word the CPU is actually waiting on the moment it arrives in the controller’s buffer and stuffing it up the hierarchy as fast as possible, but you are still playing the odds, and a miss still costs a very significant fraction of a full memory transaction time on average.
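A minimal sketch of the effect (names and sizes are mine, purely illustrative): the same summation over a large matrix, touching memory either along cache lines or striding across them. On typical hardware the strided version can be many times slower for no reason other than how the cache behaves.

[code]
/* Sketch: same arithmetic, different memory access order.
   Walking row-by-row touches consecutive addresses, so each
   cache line fetched is fully used. Walking column-by-column
   strides across memory and tends to miss on nearly every access. */
#include <stddef.h>

#define N 4096

double sum_row_major(const double *a)      /* cache friendly */
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i * N + j];             /* consecutive addresses */
    return s;
}

double sum_col_major(const double *a)      /* cache hostile */
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i * N + j];             /* stride of N doubles per access */
    return s;
}
[/code]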
Laying out problems in memory with a view to how the cache operates, and, hand in hand with that, tuning the algorithms in use to those layouts, is probably the biggest place you will find speed in a modern system. But there are codes that simply don’t cooperate well. They have irregular or outright pathological access patterns, ones that seem to keep the caches from ever hitting their stride.
Sometimes it takes meticulous profiling and hand optimisation down to a very low level to sort these problems out. Adding cache prefetch hints and similar tricks can help considerably.
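For instance, GCC and Clang expose a software prefetch hint; where the access pattern is known but irregular (say, an indexed gather), issuing the prefetch a few iterations ahead can hide part of the miss latency. A rough sketch, where the lookahead distance of 8 is a made-up number you would tune by profiling:

[code]
/* Sketch: software prefetch for an indexed gather.
   __builtin_prefetch is a GCC/Clang built-in; the lookahead
   distance (8 here) is a tuning parameter, not a magic value. */
#include <stddef.h>

double gather_sum(const double *table, const int *idx, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&table[idx[i + 8]], 0 /* read */, 1);
        s += table[idx[i]];
    }
    return s;
}
[/code]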
Thanks for the explanation. Do I correctly understand that for the use case of games* you could simulate pretty much any physics problem?
The question about the ease of using GPUs for non-traditional computing is partly because I’ve been wondering why AMD has been so slow to use its GPUs for things other than traditional graphics.
I don’t understand how anyone whose business is real-time computer graphics could have learned that DirectX11 was going to enable tessellation and then gone on not to emphasize that feature in their GPUs. If I understand correctly, tessellation allows you to save a lot of memory bandwidth by transferring a light, simple mesh from the VRAM to the GPU and then using the GPU to add geometric complexity to whatever level is desired in real-time.
VXGI uses much the same principles as Flex by taking a fiendishly complex interaction of millions of points with each other and grouping them into fewer, bigger chunks. The result is not quite as accurate as, say, Iray, but it’s good enough to be a major improvement and it’s hundreds of thousands of times faster.
Seems obvious enough and not that difficult to implement so I’m wondering if there’s something I’m missing here.
*Given that they don’t need scientific accuracy or care if a butterfly batting its wings in Beijing causes a tornado in London.
Not really. Some physics will simply get you the wrong answer, or worse, an utterly wrong answer, unless you are very careful. Real-world meso-scale physics is probably OK, but some forces act very unfortunately.
A big part of getting big simulations right is working out how to cope with the approximations needed, without the problem going all wrong.
For a game you can get away with empirical approximations that get something that looks about right, but may actually not embody anything like the real physics of the problem. Turbulent flow would be a good example. (Actually turbulent flows are a good example of a huge number of the problems faced in real world computation as well. They are just plain awful.)
I don’t think it is really fair to say AMD has been silent on this. nVidia is probably the stranger company here, given their secretive nature and unwillingness to publish useful details about their offerings. The two big GPU compute platforms are CUDA and OpenCL. OpenCL runs on AMD as well as a large number of other platforms (including nVidia), whereas CUDA is nVidia-only, and will remain so as far as we know. Proprietary lock-in has not been the best long-term strategy for a while now. nVidia may be the company that loses out here, not AMD.
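As a small illustration of the portability point, the same OpenCL host code will enumerate AMD, nVidia and Intel devices alike. A minimal sketch, error handling omitted:

[code]
/* Sketch: list every OpenCL platform present (AMD, nVidia, Intel, ...).
   Link with -lOpenCL; error checking omitted for brevity. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint count = 0;
    clGetPlatformIDs(8, platforms, &count);

    for (cl_uint i = 0; i < count; i++) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof name, name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    return 0;
}
[/code]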
The lead time in designing a GPU is quite long: working up a new design in the abstract, simulating, measuring, tuning the design, RTL, layout, fab, testing. This isn’t quick. There isn’t a lot of agility, and even if it were apparent that a feature would be a good idea, it isn’t assured that it is worth the silicon it would take up in competition with other parts of the design. Otherwise you are talking about more area and greater cost, questions about how it fits the market, or perhaps another product entirely, and whether you can afford another model.
SGI had multiple resolution textures in about 1997 (I think they came in with the Infinite Reality.) The idea or implementation isn’t new.
Yes, sure. GPUs are also used for password cracking, Bitcoin mining, etc., but those applications are not really what GPUs are designed for or used for 99.99% of the time, and they can only be used if the software is specifically designed for use with graphics cards.
True about open standards. Hopefully AMD’s latest attempts with GPUOpen will be successful.
For what reasons do you think Nvidia has been getting the better of AMD for a while now?
Am I correct that the interconnects, both inside the die and to-and-from the die, are major areas for improvements?
Looking at the GTX 980 GPU and the R9 390X GPU which have roughly the same performance: the R9 390X has a memory bandwidth of 384 GB/s and the GTX 980 a memory bandwidth of 224 GB/s *. Now, I know that comparing some aspects like frequency or core count of completely different chips is misleading. However, if memory bandwidth is a major bottleneck, I’d think that would give us a good idea of what the GPU is capable of. Yet the 390X has about the same practical performance as the 980 despite having 70% more memory bandwidth.
The same phenomenon is found when comparing the GTX 970 with the R9 390. What could explain that?
The new generation of GPUs will incorporate VRAM inside the package. Are there advantages to including elements inside the package as opposed to outside, aside from distance?
Looking at the fact that HBM increases bandwidth by increasing bus width and the fact that (contrary to what might be expected) GPU dies have grown over time, can we expect future GPUs to have buses of tens of thousands of bits and dies of 1000mm2 or will other factors be sufficient to curb those trends?
I’d say NVidia is “winning” nowadays largely because they seem to be much, much better at courting companies (in the academic, industrial, and gaming fields). There’s a lot you can do when people optimize for your hardware, and you in turn optimize for their stuff.
It’s not that big companies don’t have some AMD contacts and test machines, of course they do, but Nvidia seems really aggressive about helping every game with pretty graphics and a proprietary engine get some novel feature like HairWorks or a heavily optimized form of anti-aliasing or something.
This works its way into the way they court schools, too. My university has some weird glasses-free holographic VR table donated by them, not to mention they provide us (as a lab/university, not students) cards at a huge discount, and we outright have our graphics lab and a freaking cafe in the electrical engineering building named after and sponsored/partially funded by them.
Which isn’t to say their stuff isn’t good, it is, but I think a large part of the reason it does work so well is that they work closely with so many people to make sure their specific needs are met.
I don’t think they’re less willing, and I have no idea why, but from my perspective they seem to be less successful at it. It may just be a feedback loop where Nvidia did it earlier or got lucky.
I don’t think it’s counter to consumer benefit at all, other than perhaps making owning an ATI card a slight pain because nothing is optimized for you.
These are all good questions. In general GPUs have made more rapid progress in recent years because their programming model and physical construction are massively parallel, while CPUs have approached architectural, clock speed and heat limits.
It is relatively straightforward to add more “lightweight cores” or threads to a GPU. OTOH GPUs have been more limited than CPUs by fabrication technology. GPUs have been stuck at 28 nm lithography for a long time, and are only this year moving to 14 nm. This will mean an approx. 2x improvement in performance per watt: AMD RX 480, 470 & 460 Polaris GPUs To Deliver The "Most Revolutionary Jump In Performance" Yet
Clock speed scaling (Dennard Scaling) mostly stopped around 2007, and there is no near-term solution: Dennard scaling - Wikipedia
Since then CPU performance developments have emphasized improving Instructions Per Clock cycle (IPC), adding additional cores, and adding specific hardware for certain cases (AVX instructions, Quick Sync, etc).
Essentially the heat problem is starting to affect how much of the chip can stay powered on. This can’t be fixed by a better chip cooler. Even though integration allows putting more systems on the chip, increasingly large portions must stay powered off (dark silicon).
Instructions Per Clock (IPC) improvements, which benefit single-thread performance, are also dwindling, since designers have already exploited all the straightforward methods. There are a few remaining tricks such as data speculation. Data speculation differs from control speculation, which is what today’s branch predictors already do. In theory data speculation could provide an additional 2x performance on single-threaded code, but it would require significant added complexity. See “Limits of Instruction Level Parallelism with Data Speculation”: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9196&rep=rep1&type=pdf
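To make the distinction concrete: control speculation guesses which way a branch goes, while data speculation would guess the value a load returns so dependent work can start before the load completes. The kind of code it targets is a dependent-load chain like the purely illustrative sketch below:

[code]
/* Sketch: pointer chasing. Each load's address depends on the previous
   load's result, so the chain serializes on memory latency. Data
   speculation would predict the loaded value and verify it later, much
   as branch prediction already does for control flow. */
struct node { struct node *next; double payload; };

double walk(const struct node *p)
{
    double s = 0.0;
    while (p) {
        s += p->payload;
        p = p->next;   /* the value of this load is the next address */
    }
    return s;
}
[/code]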
VLIW (Very Long Instruction Word) methods could give another 2x or so but would require new software and compilers. Intel’s unsuccessful Itanium was an attempt at this, but some researchers are still investigating the technique: http://millcomputing.com/
So CPUs face a very difficult future. Designers are hemmed in on all sides by increasingly difficult power, architecture and fabrication issues. These are not things some clever architect can work around. The power and fabrication issues would require totally different process technology, and there is no clear way forward. The superscalar architectural limits are intrinsic, and except for a few things like data speculation, designers have already used every trick in the book.
GPUs have a little more leeway since they are not limited by superscalar instruction decode, instruction dependency checking and other CPU architectural issues.
On both GPU and CPU sides, Amdahl’s Law limits available parallel speedup. In essence just a small % of serial code will poison the available speedup, even given infinite bus bandwidth:
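Amdahl’s Law puts the best possible speedup on N processors at 1 / ((1 - p) + p/N), where p is the fraction of the work that can run in parallel. A throwaway sketch to show how quickly a small serial fraction bites:

[code]
/* Sketch: Amdahl's Law, speedup = 1 / ((1 - p) + p / n).
   Even 5% serial code caps the speedup at 20x, no matter
   how many cores you throw at it. */
#include <stdio.h>

static double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double parallel_fraction[] = { 0.50, 0.90, 0.95, 0.99 };
    for (int i = 0; i < 4; i++) {
        double p = parallel_fraction[i];
        printf("p = %.2f: 16 cores -> %5.1fx, 1024 cores -> %5.1fx\n",
               p, amdahl(p, 16.0), amdahl(p, 1024.0));
    }
    return 0;
}
[/code]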
However GPUs are typically only used for massively parallel algorithms where this is less a factor.
Many common tasks cannot be parallelized enough to effectively harness a GPU. One example is H.264 video encoding, whose core algorithm is inherently sequential. So it’s not always possible to just re-write the task for a parallel approach.
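The tell-tale sign is a loop-carried dependency: each step needs the previous step’s result before it can begin, so there is nothing to hand out to thousands of GPU threads. A trivial sketch of the shape of the problem (not actual encoder code):

[code]
/* Sketch: a loop-carried dependency. state[i] depends on state[i-1],
   so iteration i cannot start until iteration i-1 has finished --
   analogous to a video frame predicted from the previous frame. */
#include <stddef.h>

void sequential_chain(double *state, size_t n)
{
    for (size_t i = 1; i < n; i++)
        state[i] = 0.5 * state[i - 1] + state[i];  /* needs previous result */
}
[/code]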
That said, the cumulative advances in CPUs have produced very fast performance. The Intel Xeon E7-8890 v3 has 18 cores, and produces 2,419-2,995 Linpack GFLOPS, about 3 TFLOPS. This is about 30,000 times faster than the original Cray-1 in 1975, and that supercomputer consumed 115,000 watts of electrical power.
Amdahl’s Law: If I understand correctly, this diminishing return can be countered by having the GPU do bigger jobs, right? By adding more cores to a GPU, we might not get significantly higher framerates, but we could still get higher pixel/texel/voxel/vertex/edge/polygon/particle counts, right?
Dark silicon: Could this be used to allow a CPU or GPU to alternate between several highly specialized microarchitectures so that any one subsection represents only a small fraction of the die’s area but has very high power/performance ratio at computing a narrow range of problems?
While Amdahl’s Law theoretically affects GPUs, most GPU software is highly parallelized, else it wouldn’t be running on the GPU in the first place. Hence in most cases adding more GPU execution resources will provide good performance increase. GPU manufacturers have been transistor-limited for quite a while due to 28 nm fabrication, but with the move to 14 nm this year that will change and we should see major improvements in performance and performance per watt.
Most multithreaded CPU algorithms require some synchronization between threads. Imagine an office boss having three secretaries each write one paragraph of a letter, then combining the results. No matter how quickly the fastest two can type, they will be waiting on the slowest one to finish, and only then can the work be combined.
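In code, “waiting on the slowest secretary” shows up as a join or barrier: the combining step cannot start until every worker has finished, however fast the others were. A bare-bones sketch with POSIX threads (the sleep times are invented stand-ins for uneven work):

[code]
/* Sketch: three workers, one combiner. pthread_join makes the main
   thread wait for the slowest worker before the serial combining step.
   Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *write_paragraph(void *arg)
{
    unsigned seconds = *(unsigned *)arg;
    sleep(seconds);                      /* pretend typing takes this long */
    return NULL;
}

int main(void)
{
    pthread_t worker[3];
    unsigned work[3] = { 1, 2, 5 };      /* the third worker is slow */

    for (int i = 0; i < 3; i++)
        pthread_create(&worker[i], NULL, write_paragraph, &work[i]);

    for (int i = 0; i < 3; i++)
        pthread_join(worker[i], NULL);   /* total time ~ the slowest, 5 s */

    puts("combine the paragraphs");      /* the serial step */
    return 0;
}
[/code]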
Even code which seems mostly parallel has some serial elements, and Amdahl’s Law just describes how little serial code it takes to kill your potential parallel speedup. This applies to both CPUs and GPUs, but GPUs are typically only running tasks that are intrinsically highly parallel with very limited synchronization. Provided that stays the same, the more cores you add to a GPU the faster it will run.
The same is true for a CPU but to a lesser degree. CPUs have historical baggage – they must retain excellent single-thread performance, which means each core must be highly complex and use all kinds of superscalar tricks for parallel instruction execution. This in turn burns a lot of power and takes lots of transistors, which limits core count to a fairly low number.
If you are willing to throw out backward compatibility with existing software, there are more architectural CPU options such as VLIW: Very long instruction word - Wikipedia, and ManyCore: ALF – INRIA / IRISA project-team ALF, etc. However this is not a realistic alternative in most cases. You normally cannot tell customers they must replace all their software.
The highest end of the Intel x86 family, the Xeon E7-8890 v3, has 18 cores but can only run at 2.5 GHz due to power dissipation issues. It is a server chip and would normally never be used in a desktop machine. However it does show there is some room left, and it’s possible Intel will eventually have 8 to 16 core mainstream CPUs.
That is a heterogeneous CPU architecture, and Intel is already moving in that direction. Two good examples are AVX vector instructions which provide Cray-like ability to manipulate a string of numbers simultaneously, and Quick Sync video which is essentially an on-chip ASIC for video encoding: Intel Quick Sync Video - Wikipedia
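For instance, with AVX a single 256-bit instruction operates on eight single-precision floats at once. A minimal sketch using compiler intrinsics (assumes an AVX-capable CPU, a build with -mavx, and, to keep it short, an array length that is a multiple of 8):

[code]
/* Sketch: AVX intrinsics adding eight floats per instruction.
   Requires an AVX-capable CPU; build with -mavx (GCC/Clang).
   Assumes n is a multiple of 8 for brevity. */
#include <immintrin.h>
#include <stddef.h>

void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* load 8 floats */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&out[i], _mm256_add_ps(va, vb));  /* 8 adds at once */
    }
}
[/code]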
The upside is if your particular task maps to those features, it runs very fast. The downside is the designer blew a lot of his transistor budget on those features and it only benefits specific tasks that can harness those.
In the previous homogeneous CPU approach, each generation was faster for all software. This also incentivized customers to frequently upgrade their hardware.
With the heterogeneous approach, widely varying performance may exist between software which uses those features vs software which does not. Suddenly the big performance differentiator is not getting a new machine but whether your software taps those specific on-chip features. This in turn can have a negative economic effect on the hardware upgrade cycle, but a positive effect for some software vendors.
A good example is video editing software: Final Cut Pro X uses Quick Sync and Adobe Premiere Pro CC does not. FCPX can encode and decode H.264 video about 4x or 5x faster as a result.
Another variation on this is that the Intel Xeon typically does not have Quick Sync, so even higher-end workstations like the Xeon-powered Mac Pro don’t have it. There are cases where a MacBook laptop (using Quick Sync) can encode video faster than a Mac Pro.
I don’t see any reason you can’t have a couple complex cores for backwards compatibility and a bunch of simpler cores on top of it. It’s essentially putting a GPU and CPU in the same chip, and Intel is already doing this, if I understand correctly.
They are. It might be a challenge to have high-end graphics, though, because the memory has to be optimized differently. The CPU works best with low-latency memory (narrow & fast) while the GPU works best with high-throughput memory (wide & slow). Perhaps two different kinds of memory could feed into a CPU/GPU combo.
It would make CPU-to-GPU communication really fast, though.
How come? A factory-installed water cooler can keep a 980 Ti GPU around 45 degrees Celsius, which is about 40 degrees below air cooling.
Aside from heat, what negative effects result from pumping more voltage into a chip to get more performance?
[QUOTE=joema]
…the heat problem is starting to affect how much of the chip can stay powered on. This can’t be fixed by a better chip cooler.
[/QUOTE]
For the same reason you can’t prevent a fire on your cooking stove with a higher-capacity household air conditioner: there is a finite rate of heat transfer from the die to the heat spreader, and localized hot spots build up on the chip that no cooler can compensate for.
It is well understood this limit was reached for frequency scaling (so-called Dennard scaling) in the mid-2000s. This is why you don’t see production computers with 8Ghz CPUs today, no matter what the cooler. Dennard scaling - Wikipedia
The coming “dark silicon” limit is similar: ever-smaller feature sizes cause increased leakage current and therefore heat, but instead of capping clock frequency it will require ever larger regions of the die to stay powered off. This in turn will limit the available parallel hardware (whether CPU or GPU): http://www.cc.gatech.edu/~hadi/doc/p...rk_silicon.pdf
Ah, ok, it’s about the hotspots. How come smaller nodes result in more hotspots than larger nodes?
I presume the hotspots are located in the particularly complex circuitry sections of each processor? Or is it the interconnects on the die?
Electromigration is bad because it changes the physical structure of the circuit, creating or breaking connections?
Architecture:
GPUs now seem to use unified shader cores. With the problem of dark silicon, wouldn’t it make sense to have highly specialized processors for each step of the graphics pipeline? As in: 1 microarchitecture optimized for each of the steps in this graph: https://i-msdn.sec.s-msft.com/dynimg/IC340510.jpg with only 1 or 2 microarchitectures turned on at the same time?
Could the large expanses of die area which can’t be used at the same time be used to create large caches? I presume that 1 mm² of cache doesn’t consume anywhere near as much power as 1 mm² of processors.
How come there are significant microarchitecture improvements nearly every generation? For a given set of tasks, can’t the designers find the microarchitecture that does it well enough so as to leave little room for microarchitecture-related improvement later?
Memory bandwidth:
What are the limits on memory bus width? Right now, 4096 bits is the widest commonly available bus. Are there reasons why the bus couldn’t go from 4 Kbit to 4 Mbit or more?
Yep. Electromigration is the gradual displacement of metal atoms in the interconnects (the lithographed metal “wires” connecting transistors/circuits) by momentum transfer from the flowing electrons; over time it forms voids and hillocks that eventually break, or short out, the interconnect. ISTR this was part of why some companies moved from aluminum to copper interconnects, since copper resists electromigration better, though copper brought its own challenge: it diffuses into silicon, so it needs barrier layers.