FPGA Questions

I am looking into FPGAs to solve a computing problem (need more computing power, can’t afford to buy 100 Intel procs, Cell and GPU only work somewhat for my problem) and I have some questions:
Which Product?
I’m trying to determine if this will be less costly than other methods (e.g. multi-Intel) and before I can begin those calcs, I need to understand which products are geared towards my problem. I see products from $35 to $9,000, but I really don’t know what I’m looking at. I believe I need lots of gates and some moderate amount of memory (maybe 256 MB).

Memory Access
When an FPGA comes with memory, are there the same latency/bandwidth issues as with traditional procs? Or is some of the memory built right into the FPGA with direct, fast access?

Communicating with PC
I assume I can get these on a board I can just plug in. Can I send data straight to FPGA memory like with my GPU? What are the transfer rates, etc.?
Ultimately I need to figure out how “much” computation I can get done with one of these within a certain amount of time so I can compare to other methods. Any info that would help guide me in that calc would be great, although I realize it’s probably tough without knowing the specifics of the final program/circuit/whatever you call it.

I am in no way an FPGA expert, having only programmed one device for a class I took last year. But it’s my impression that modern FPGAs are quickly replacing traditional microprocessors in most high-performance hardware devices. They’re fast, have very high throughput, are energy-efficient, and are very, very configurable. You’ll find them in just about every device that performs real-time computations. You can also configure an FPGA so that it “pretends” to be a traditional microprocessor, and then you program the “microprocessor” using C or whatever. When the FPGA is used in this way it’s called a “soft microprocessor.”

If you want to determine if an FPGA will do what you want, I have two pieces of advice:

  1. Write some code (using VHDL or Verilog) and test it using the FPGA manufacturer’s simulation software.

  2. Nothing against the SDMB, but you should join a forum that specializes in programming FPGAs.

Yeah, this is a good place to keep up to date on news and cultural trends and such, but… I’m extremely puzzled when people ask questions like this here. How is it possible to succeed in any technical field without being aware of all the industry websites? My coworkers and I spend a great deal of time on work-related sites and reading the newsletters and so on, especially for purchasing decisions and new (to us) technology.

Huh. There are few places that I’d trust to have a higher signal-to-noise ratio for people who aren’t actually in a particular industry. Like me: I’m an AI/distributed processing/mobile robot control guy who, as a hobby, finally bought a soldering iron with the intention of doing some hardware stuff in my spare time. For certain things (e.g., intermediate-level advice, an applicable product recommendation, or a tangentially related field), I’d readily ask and then trust the responses given here.

For instance, I had a question a while back about setting up multiple microphones for speech recognition; a couple of audio people (sound engineers?) responded, bringing me to the realization of what an undertaking it would be, so it remains just the kernel of a project. Of course, it should come as no surprise if no one responds to technical questions and they just drop quickly off the front page.

But it’s always worth asking here, as far as I’m concerned…and it makes the forum broader, and thus better. At any rate, RaftPeople, sorry for the minor hijack; I’ve only read some very basic stuff about FPGAs, and that would be dated anyway.

Specific, technical questions like this have a 50/50 chance of meeting with success. Cultivating a list of topic-specific forums is good advice. But the SDMB has everybody. And just to prove your snide remarks wrong, Ms Patriot Grill, I will answer the OP in detail.

First of all, what exactly are you trying to do? It’s very difficult to beat a GPU. It has unimaginable memory bandwidth and processing capability. Only if your task is very non-parallel or has a lot of conditional code, and is integer-based, should you consider an FPGA. (FPGAs also win out when you need low, very low, or extremely low latency, or when you’re focused on the digital signal interface aspects–think PCB wires–rather than computation.)

Which product? There are two classes of FPGAs: mid-range devices that cost $20-200 (meaning Cyclone and Spartan), and high-end devices that have ridiculous prices and are sold to the military or telecom. Besides having more gates, the high-end devices will have faster I/O pins that can do gigabit speeds. They’ll also have a few more megahertz (alas, no gigahertz), more memory, etc. But mid-range devices have gotten quite powerful, and the high-end parts just cost too much.

Memory access. FPGAs have two kinds of internal memory and can interface to several kinds of external memory. Internally they have flop memory, which is extremely fast. Every “gate” in an FPGA is actually a piece of memory (a look-up table), and if you use them all at once you will have extreme bandwidth (terabytes/s). But you won’t have a lot of storage (a few tens of KB) and you’ll use up all your gates. Think of these as CPU registers. Then there are special-purpose memory cells, which are also fast and have more capacity (a megabyte or so). Think of these as cache. Externally, you can use DDR2 DRAM, but you can also use SRAM and various other exotic technologies. It will be nowhere near as wide or as fast as GPU memory; it’ll be similar to CPU memory, at best. Whichever memory you use, know that it will be a pain to use.
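To put a rough number on that internal bandwidth claim (my own back-of-envelope, not a datasheet figure): touch, say, 100,000 one-bit flops every cycle at 400 MHz and you get 100,000 x 400 MHz = 40 Tbit/s, about 5 TB/s. The real number depends entirely on the device and the design, but that’s the sense in which the internal fabric dwarfs any external memory bus.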

Communication with a PC. PCI-Express is fastest (gigabytes/s, just like GPUs). You’ll need a high-end FPGA for that, and PCIe will be quite difficult to implement. Some FPGAs can also drop into AMD CPU sockets; that, actually, will be the fastest and most convenient option. On the other side of the spectrum, an ancient serial or parallel port link will be very easy. Any sort of no-protocol data pipe is easy.

How much computation? That depends immensely. Complex if() statements that would confuse a CPU can go by many per clock cycle. Yet a floating-point op, of which a GPU can do a hundred per clock cycle, will be much more difficult. In terms of raw number-crunching, take a look at the number of multipliers the FPGAs have (typically a few hundred). These can do one 16-bit integer multiplication per cycle (typically 300-600 MHz). Really, it’s not much. Like I said, it only makes sense for non-parallelizable problems. (Or others where you’re in it because you need a chip.)
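As a rough worked example (the numbers are illustrative, not from any datasheet): 300 multipliers x 400 MHz = 120 billion 16-bit multiplies per second, peak. That sounds like a lot until you put it next to a modern GPU’s shader array, and it assumes you actually keep every multiplier fed on every cycle.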

Oh, and did I mention how difficult it all is? It’s actually easy to make a system-on-a-chip by slapping together a virtual microcontroller, other virtual peripherals, etc. If you have a specific role for the FPGA (bypassing Xbox DRM?), you can code a small custom component. But running a complex custom app at very high throughput means coding that whole app manually, all in a hardware description language. You have to think of dataflows cycle by cycle. And if you factor in external memory and external PC interfaces, the task is immense. Profoundly immense.

Stick to GPUs and x86 clusters.

Another interesting aspect of an FPGA is that the number of bits you use per int really matters. If you use 5-bit integers, you can get many times the performance of 16-bit integers, which will in turn be many times faster than 32-bit integers or, god forbid, 64-bit floats.
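As a rule of thumb (my assumption, not a datasheet figure): multiplier logic grows roughly with the square of the operand width, so a 5-bit multiply costs on the order of (5/16)^2, roughly 1/10 the fabric of a 16-bit one, which means you can pack roughly ten times as many of them running in parallel.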

I haven’t done FPGA work, but on the other hand I do know a great deal about clusters and parallel systems built from conventional processors. There have been a few HPC systems out there that have provided support for FPGA accelerators; both Cray and SGI did. It was pretty much a given that the only customers were the spooks: they had a problem that fit well, plus the desire and the money to get it to go really fast. Other than that, it has been a truism in HPC work that trying any sort of custom design was a loser; the time taken would see it overtaken in speed by a conventional system before you got it to work. There have been a very few successful systems. Typically they have addressed lattice gauge QCD, where there is a very tight matrix multiplication that is both amenable to a custom solution and is the critical speed determinant. Even then it really wasn’t clear that they had won out. Maybe they just broke even versus a conventional solution.

There is a lot of interest in GPUs rather than CPUs, but they are seriously limited for the most part. The memory model is messy to say the least, and the focus on limited-precision floating point for graphics work makes them less than ideal for a great range of problems. But things have improved: CUDA, OpenCL. Both are interesting paradigms.

So, the standard spiel.

You can’t even begin to think about the architecture of a high-performance system until you understand your problem intimately. You need to understand the mathematics, the range of algorithms available, the data access patterns the various algorithms require, the types of computation, and, if appropriate, the sensitivity the algorithms have to numerical instabilities.

Once you have all of this you can start to look at what is a likely good system design. Most problems are actually not about compute speed, but about data movement. The problem is simply feeding the CPU data fast enough. Look at a Core2: 32 KB of L1 cache. That really isn’t much. And an L1 miss costs on the order of 10 cycles. The CPU can do a heck of a lot of work in those cycles.
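(To put a back-of-envelope number on it: ten-odd cycles on a four-wide core like the Core2 is on the order of 40 instruction slots lost to a single L1 miss.)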

If you can get any sort of regularity into your problem it is usually possible to work out how to parallelise it and get some useful speedups. Just useful, however; often nothing fantastic. (I did once get one guy’s code to run over 100 times faster. He was pretty happy :smiley: ) Sometimes you get lucky and find you have a fit that works quite well on a simple cluster. Sometimes you don’t, and need more heroic interconnects. Sometimes you really need a shared-memory system, and you suddenly find the costs start to escalate. Intel will be shipping early Larrabee devices soon. That is going to be really interesting.

But any solution that involves custom work with FPGAs will need you to be brutally honest about the value of your (and your colleagues’) time. You can buy a lot of conventional CPU grunt with a few weeks of wasted labour.

Thanks for the answers so far. Here is a little more detail regarding some of your points or questions:

  1. All integer. I’m using 8 to 24 bits for the fractional portion; I could switch to floating point if I wanted, but it’s integer for now.

  2. Project is an artificial life/neural network simulation. It’s running on a GTX 280 GPU, but the memory access pattern just doesn’t fit the neural network part, and that part is branchy as well.

  3. It’s a personal project, and although I also have it running on multi-core and multiple networked PCs, I just don’t have the money to get 100 x86 cores working on it. And the networked version is too slow due to network comm overhead.

  4. Talked to the guys at Tilera about their 64- and 100-core cards; it would work, but they are out of my price range (as a consumer).

How much memory do you need per node? Could you do this as an array of microcontrollers connected together by a fast bus? You can get 20+ MIPS (8-bit) microcontrollers with a couple of K of RAM for around a buck these days.

I actually worked on the QCDOC machine and got it placed on the Top 500 (requiring a thorough reimplementation of Linpack). The development cost for it I think might have been subsidized, because the successful BlueGene supercomputers were a direct descendant of the design. Funnily enough, so was the Wii.

Yah, neural networks would probably be a pretty good fit for FPGAs, but I don’t think it’d work out as a personal project. If you want to try it out anyway, I strongly recommend Altera; they have much more user-friendly tools. Get a cheap dev board from Terasic (such as the DE2) and run some examples. If you can lower your I/O requirements (low PC and DRAM bandwidth) and boil your algorithm down to something very simple and regular, then you can get it done.

Do you feel like explaining the algorithm further (is it a simple neural net, or something fancy)? And how you mapped it to CUDA. There are many ways to get creative with GPU memory access patterns, and I think you can improve there a lot.

And lol @beowulff

The primary neural net routine is <500 lines of code.

Not a simple net but not sure if it’s fancy or not:

  1. Models neurons and synapses (why? because it sounded interesting, not sure if it adds any mathematical value or if it’s equivalent to having extra layers/neurons).

  2. Not fully connected, but it is recurrent. Data is stored in a compact format: an array of neurons and a separate array of synapses, with all inbound connections to a neuron adjacent to each other in the synapse array (there’s a rough sketch of this layout after the list). This is the part that creates problems on the GPU: I can structure the data to optimize for reads or for writes, but not both at the same time.

  3. Each neuron and each synapse can use a variety of methods (standard and non-standard formulas) to calc whether to fire, each one can use a variety of methods to determine what the output value is when firing, and each can possibly use a wave function as output. Basically I threw in everything including the kitchen sink, with the goal of letting GAs determine what works best and what doesn’t, including a mix of approaches all at once. All training is GA, no backprop.
    When I converted to CUDA, I did the following:

  4. Got it working in the current format so I understood CUDA (all of my other, non-neural-net code works great on the GPU), etc.

  5. Analyzed the neural net problem further and tried some things, like utilizing shared memory, multiple passes over the data, dividing up the problem a different way, or storing data a different way. It seems like the best way to make the neural net map onto the GPU memory model would be if each layer were fully connected to the previous layer with no recurrence, but that significantly increases the time to evolve a good solution versus not fully connected, so it would be a wash or worse.
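Here is a rough C sketch of the layout from item 2, simplified and with made-up field names (not my actual code):

    /* One net: neurons[] plus synapses[], with a neuron's inbound
       synapses stored contiguously. Field names are hypothetical. */
    typedef struct {
        int first_synapse;   /* index of this neuron's first inbound synapse    */
        int synapse_count;   /* its inbound synapses are adjacent in synapses[] */
        int activation;      /* fixed-point state (8-24 fractional bits)        */
        int fire_method;     /* which firing formula this neuron uses           */
    } Neuron;

    typedef struct {
        int source_neuron;   /* index of the presynaptic neuron */
        int weight;          /* fixed-point weight              */
    } Synapse;

    /* Reading a neuron's inputs walks a contiguous run of synapses[],
       but the source activations it gathers are scattered across neurons[];
       flip the layout and the writes scatter instead -- that's the
       reads-or-writes-but-not-both problem. */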

I’m open to anything; will that suggestion not work? Although the memory sounds too low.

Frankly even 500 lines doesn’t sound simple enough to port to an FPGA.
I don’t believe you’ve explored the CUDA solutions sufficiently (and there’s a lot to explore). Let me see if I’m understanding the data flow: you have an array. Each neuron reads data from that array (multiple locations), processes it for a bit, then writes it back into the same array (one location). Sometimes you can organize all the inputs together, sometimes the outputs, but ultimately it’s pretty scatter-gather. Is that it, or did I miss a complicating factor?
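In kernel form, the pattern I’m picturing is roughly this (all names invented, one thread per neuron, separate in/out buffers just to keep it simple, nothing tuned; obviously not your actual code):

    __global__ void step_net(const int *first_syn,   /* per neuron: start index into synapse arrays */
                             const int *syn_count,   /* per neuron: number of inbound synapses      */
                             const int *syn_src,     /* per synapse: source neuron index            */
                             const int *syn_weight,  /* per synapse: fixed-point weight             */
                             const int *act_in,      /* activations from the previous step          */
                             int *act_out,           /* activations written this step               */
                             int n_neurons)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_neurons) return;

        long long sum = 0;
        int base = first_syn[i];
        for (int s = 0; s < syn_count[i]; ++s) {
            /* syn_src/syn_weight reads are contiguous per neuron;
               the act_in[] reads are the scattered gather */
            sum += (long long)syn_weight[base + s] * act_in[syn_src[base + s]];
        }
        act_out[i] = (int)(sum >> 12);   /* arbitrary fixed-point rescale, single write-back */
    }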

First question, how big is this array? Second question, do you run one iteration then analyze, or do you run many iterations of the same neuron arrangement but with different parameters?

20 MIPS is on the order of 10,000x slower than the peak capability of a GPU.

Hmmm. One approach I was considering is making a “program/circuit” that is basically X number of little processors specific to just the ops I need, and running all of the data through that. Not sure if that’s easier or harder, slower or faster, etc.

That is basically it; scatter-gather is the issue. Although the branchiness of the code also hurts: it’s tough to keep each group of threads on the same instruction unless I throw in a bunch of no-op or throwaway operations.

Right now, generally 64 KB for neuron data and 640 KB for synapse data per network, typically running 200 networks per run. (The original plan was to grow much larger, but that requires a lot more processing power.)

Here’s the sequence that typically happens about 1,000 times before re-generating networks and then doing it another 1,000 times for X generations (500 to 30,000):

  1. Move objects based on output neuron activations
  2. Activate input neurons based on external and internal stimulus
  3. Calc new state of network
  4. Loop

Since I’m not too familiar with neuron algorithms, you’ll need to explain this a little more (or perhaps in straight terms of array ops rather than conceptually). By “network” do you mean a completely independent neural net (presumably one per SM)? Or the same net with different coefficients? What does “move objects based on output neuron activations” entail? How is “activate input neurons” distinct from “calc new state of network”?

How many of these iterations do you manage to do per second (on GPU and/or CPU)? 64 KB/640 KB is uncomfortable: too much to fit into registers/shared memory, yet not enough to let the DRAM stretch its legs. How many neurons is that?

You don’t need to do this at all; the hardware takes care of that aspect 100%.

It’s an artificial life/neural network sim, so 200 creatures, each with 1 brain/neural net. So total of 200 different nets each with different structure, weights, etc.

Input neurons (neurons in the first layer) are attached to the creature’s senses and are activated when the creature touches something or sees something; internal state (e.g. energy level) is also fed to input neurons. This step is distinct from “calc new state” because, generally, these neurons are getting set due to external activities (e.g. the creature bumped into a different creature, food, or a wall, or there is something within the creature’s visual field).

“Calc new state of network” is calculating all of the new values for all non-input layer neurons and synapses.

Output neurons (neurons in the final layer) are each connected to “muscles/organs” that perform physical work: rotate the creature, move the creature forward, etc.

CPU: Core2 Quad (Q9450), 200 creatures with brains, 600 total objects of various flavors moving around: 9 seconds for 1,000 cycles (1 generation), which includes all processing; the neural net is 45% of processing time, physics is also 45%.

I need to go back and look at my GPU times. They were faster, but not to the level I was looking for (10x-100x), and then some of the gains were lost during re-generation and re-transferring the new networks to the GPU. If all 30k lines of code were on the GPU then I might see the gains I wanted, because there would be no need to keep transferring back and forth and restarting kernels, etc., but I was hoping to find a way to just offload the most compute-intensive routines, which is what I’ve done at this point.

4,000 neurons
40,000 synapses

But if all of my threads in a warp are on a different instruction, it’s going to be less efficient than if they’re all on the same one, right?

RaftPeople, have you done any digital design? It is not straightforward to convert a microprocessor program into digital logic. In hardware, everything is happening at once, not like software where you move from one instruction to the next. It is a different way of thinking about the problem.

Memory access:
Memory access inside FPGAs is much different from memory access from a processor. All FPGAs I have worked with have had single-cycle access to RAM inside the FPGA. External to the FPGA it is similar to microprocessors, except that you tend to need to design your own memory controller.

Communication with PC:
You will need to design logic to do this. The FPGA manufacturers will have some things to help you. You can have anything from a UART to 10-gigabit Ethernet. All of these will need board support, so you have to check out the manufacturers’ prototype boards.

That is an interesting problem you paint there.

For the last painful question, what is the relationship between neurons and synapses, computationally? (Or rather, why does each have its own state, and how do these states matter.)

If a thread has to take a unique if() branch, all the others will wait for it to come back. All the threads will be resynced and executing identical instructions whenever it is possible.
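A contrived example of what that looks like in practice (my own illustration, not anyone’s real kernel):

    /* Within a 32-thread warp, the two sides of the if/else execute one
       after the other with the threads on the "wrong" side masked off;
       after the branch the whole warp runs in lockstep again. */
    __global__ void branchy(const int *kind, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (kind[i] == 0)
            out[i] = out[i] * 3 + 1;   /* threads with kind[i] != 0 sit idle here */
        else
            out[i] = out[i] >> 1;      /* threads with kind[i] == 0 sit idle here */
        /* reconverged: everyone executes the same instructions again */
    }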

Very little. Only some circuits in college years ago: enough to understand what you are getting at regarding the shift in thinking, but not enough to actually help me make that shift. I know that will be a challenge for me.

How does the gate/unit/whatever indicate which bit of memory it wants to access?

Many gates work together to put a number (an address) on a set of wires, and the built-in RAM responds to that number.