FPGA Questions

Your RTL code specifies this. In general there is a sea of small RAMs that can be configured in a very flexible manner. The FPGA software will let you specify RAMs with a wide variety of widths and depths, and more exotic things like dual-port RAMs. These RAMs get built out of the smaller chunks of RAM in the FPGA.

Why? Because when I started this, my goal was to play with X different variables and use GAs to see which things worked well in which situations, and whether there was anything interesting that could be discovered. I built in the modeling of the synapse and the neuron (and the potential use of a wave function for either) because humans have them, and I figured each feature either hurts, helps, or is neutral; it would be interesting to see under what conditions it might be a benefit.

I can’t give you an explicit mathematical advantage to this setup because the math is way beyond me. Furthermore, I suspect that a mathematician would tell me that modeling neurons and synapses is the same as just using X many more neurons in a more traditional setup.

But my entire goal is to explore and let the GAs tell me what works and what doesn’t, so I threw it all in there; if it turns out something works then it will get used, and if not then it won’t.

What I have found to date is:

  1. Using both neurons and synapses is “better” than just one or the other, but again I assume it is strictly a numbers game.

  2. Using the wave functions (meaning that when fired, a neuron or synapse produces output according to a preset wave function over time, as opposed to a single value only at time X) has not produced any nets that function very well. A sketch of what I mean is below.
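
To be concrete, here is a minimal C sketch of the wave-function idea; the waveform shape, length, and all names are made up for illustration, not taken from my actual code:

/* Hypothetical wave-function output: once fired, the unit emits a preset
   waveform over the next several ticks instead of a single value at the
   tick it fired. Shape and length are invented for illustration. */
#define WAVE_LEN 4
static const int wave[WAVE_LEN] = { 8, 4, 2, 1 };   /* decaying pulse */

/* age = ticks since this neuron/synapse fired */
int wave_output(int age) {
    return (age >= 0 && age < WAVE_LEN) ? wave[age] : 0;
}

/* the single-value case for comparison: output only at the tick it fired */
int single_output(int age) {
    return (age == 0) ? wave[0] : 0;
}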

Apologies if these are stupid questions, but do you have to use threads? If I understand what I’ve read so far, it sounds like you have some serious load-balancing problems. I also believe that thread overhead can be pretty costly, particularly in the lightweight type of application you’re doing.

Secondly, what language was your original (x86) program written in? Did you do any profiling to see where the code is spending most of its time? Any idea how many operations (roughly) you’re doing, and thus what fraction of peak you’re getting out of your cpu?

I’m wondering if this isn’t really a software problem.

Would you agree with Alex that 500 lines of code might be too much for an FPGA?

The reason I asked about the neurons and synapses is that I still can’t say I know what the inner loop looks like.

I didn’t mean too much for the FPGA, more like too much for you. But 40,000 objects surely would be too much for the FPGA. You’d maybe have to reuse circuitry creatively, but then the challenge only escalates.

On the gpu, lots of threads is good so it can pick the ones whose memory has arrived, etc., and do some work. In that case I had about 16,384 threads.

On the cpu I just have 4 threads, one per core.

Originally Java, then critical operations converted to C and to GPU. Of the 30k lines of code, only 2 relatively small routines eat up about 95% of the processing. The C code is only a little faster than the original Java (highly optimized, all integer). The gpu is much faster for 1 routine, but only somewhat faster for the other, so that’s my bottleneck now.

I’m sure I could get more out of the C code, but the routine is pretty straightforward; I don’t think I could get the 10x or 100x that I really want.

Oh, and what is the difference between ‘generations’?



/* The inner loop, roughly as it looks in C; struct and function names are
   illustrative, and synapse_fn/neuron_fn stand for the two hot routines. */
typedef struct { int src, out; } Synapse;              /* connection from 1 axon to 1 dendrite */
typedef struct { int axon_out, num_in, *in; } Neuron;  /* in[] = inbound synapse indices */

/* Pass 1: calculate the new output value of each synapse based on the
   current output value of its source neuron's axon */
for (int s = 0; s < num_synapses; s++)
    syn[s].out = synapse_fn(&syn[s], neurons[syn[s].src].axon_out);

/* Pass 2: add up all of the inputs to each neuron from its inbound synapses,
   then calculate the new output value for the neuron/axon */
for (int n = 0; n < num_neurons; n++) {
    int sum = 0;
    for (int i = 0; i < neurons[n].num_in; i++)
        sum += syn[neurons[n].in[i]].out;
    neurons[n].axon_out = neuron_fn(&neurons[n], sum);
}


Oh.

When you say too much for the fpga (the 40,000), what if I essentially created 500 mini-processors on the fpga and fed the data to them, kind of like a gpu, except in my case I would make special allowances for my specific data model?

At the end of each generation, a new set of brains is created. Previous brains are either slightly modified or scrapped completely, depending on how well the creature was able to survive (i.e. eat other creatures or plants, not get eaten, and not waste too much energy).
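
If it helps to picture it, here is a toy C sketch of that end-of-generation step; the population size, the fitness test, and the mutation amounts are all invented stand-ins, not my real code:

#include <stdlib.h>

#define POP      100   /* illustrative population size   */
#define NWEIGHTS 64    /* illustrative weights per brain */

typedef struct { int weights[NWEIGHTS]; int fitness; } Brain;

/* Survivors get slightly modified; failures get scrapped completely. */
void next_generation(Brain pop[POP]) {
    for (int i = 0; i < POP; i++) {
        if (pop[i].fitness > 0) {                   /* ate, avoided being eaten  */
            for (int m = 0; m < 3; m++)             /* nudge a few weights       */
                pop[i].weights[rand() % NWEIGHTS] += (rand() % 3) - 1;
        } else {                                    /* starved or got eaten      */
            for (int w = 0; w < NWEIGHTS; w++)      /* re-randomize from scratch */
                pop[i].weights[w] = (rand() % 17) - 8;
        }
        pop[i].fitness = 0;                         /* reset for the next round  */
    }
}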

It is difficult to say without knowing the algorithm.

The big thing that will make or break the project is the tools. I really have no idea what kind of tools you get for low cost. I have always had decent simulation tools, both in school and in industry. I have never used the simulation tools that come with FPGA development boards, so I have no idea if they are good or not.

Some links that might be of interest, especially the second link. It might be fun to see what they come up with.
http://www.verilog.net/free.html

The alternate approach would be to do something like what beowulff was mentioning, an array of smaller processors. This is what Tilera has with their cards, but it costs way too much for me.

Is anyone aware of any products (like boards, etc.) that would let me plug in a bunch of cheap small processors (arm?), each with local memory and fast access to global memory?

Ok, here is my idea.

Instead of drastically changing the brains each generation, make big changes every 32 generations. Inside the epoch of generations, only play with weights. Load up all 32 generations into one warp. Run that warp with perfect branching and perfectly coalesced memory accesses. Possibly you can even use this approach with somewhat more drastic intra-epoch changes. If done right, much of the processing coherence will still be preserved.
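
To make the coalescing concrete, here is a rough C sketch of the layout I have in mind; the sizes and names are made up, and the synapse math is reduced to weight times input for illustration:

#define VARIANTS     32     /* one warp's worth of same-structure nets */
#define NUM_NEURONS  1024   /* illustrative sizes */
#define NUM_SYNAPSES 8192

/* Topology is shared by all 32 variants, so every lane takes the same
   branches; only the weights (and state) differ per variant. */
int src[NUM_SYNAPSES];               /* source neuron of each synapse, shared */

/* "Variant-major" storage: weight s of all 32 variants occupies 32
   consecutive words, so lane i reads w[s*VARIANTS + i] and the warp's
   load is one coalesced transaction. */
int w[NUM_SYNAPSES * VARIANTS];      /* w[s*VARIANTS + lane]    */
int axon[NUM_NEURONS * VARIANTS];    /* axon[n*VARIANTS + lane] */

/* What one warp lane computes for synapse s (lane = which of the 32 nets): */
int synapse_out(int s, int lane) {
    return w[s * VARIANTS + lane] * axon[src[s] * VARIANTS + lane];
}

Because the topology is shared, the branching is identical across all 32 lanes, which is where the perfect branching comes from.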

This is what I meant by the challenge escalating. Now these things need to be somewhat general-purpose: they need general-purpose buses going back and forth and a whole, powerful memory hierarchy to back them. That is too much. Something more realistic is to make a static circuit that takes in some inputs and spits out an output, i.e., a circuit representing the whole net. But even then, you have to reconfigure it for each brain, which is likely too much overhead.

Hmm, a C-to-Hardware compiler. I seriously doubt you’ll get a good speedup with this automagic tool, but I suppose you can just see if it will even compile (and fit in the FPGA). If it does, you can build a system-on-a-chip with Altera’s plug-and-play tools (SOPC Builder). Won’t be trivial, but fairly doable. Even if it comes out to be impractical (as I foresee), it’ll still be something very cool to have played with.

Btw, in terms of tools you get for free, it’s actually a lot. Altera gives you ModelSim for free. That’s decent, no? Altera has great stuff.

To add, you can start reading about Altera’s free C-to-Hardware compiler tonight: http://www.altera.com/literature/ug/ug_nios2_c2h_compiler.pdf (A bit of background: when you create this piece of hardware, you use it to offload processing from the rest of your C code running on the virtual microcontroller(s). The microcontroller hooks up to a bus that can have other virtual peripherals, like an Ethernet chip, a PCI bridge, etc. These virtual peripherals connect to FPGA pins that are linked up to physical connectors on your development board. The chip is programmed like any other microcontroller, so hopefully you know a bit about writing OS-less embedded code.)

If you use that website compiler, you’ll still need to use vendor tools to “synthesize” the design (convert it into FPGA code). Then you can see if it fits on an fpga.

Re: 32
Yes, that is an interesting possibility. I considered other ideas about keeping the structure similar between nets by requiring the same structures within different subsets of each net (each net is actually broken up into multiple sub-nets that can all inter-connect, or not). But I also like your idea as a compromise for the gpu.
I probably will not try out the C-to-hardware converter, for exactly the reason you mention. The problem needs to be re-worked to match the underlying platform, and I would rather start simple and get something working than start complex and hope.

Question: If I were an expert in fpga design, knowing what you know about my problem, do you think it would be possible to create something that would give me a 10x increase in performance over the 4 core CPU implementation of my problem? I’m trying to figure out if my limit will be me or the fpga. If it’s me, I can solve that problem; if it’s the fpga, I can’t.

[quote=“RaftPeople, post:36, topic:516003”]
do you think it would be possible to create something that would give me a 10x increase in performance over the 4 core CPU implementation of my problem?
[/quote]

Already said no.