In a machine that follows the VonNeumannArchitecture, the bandwidth between the CPU (where all the work
gets done) and memory is very small in comparison with the amount of memory. On typical modern machines
it's also very small in comparison with the rate at which the CPU itself can work.
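A minimal sketch of that imbalance (unscientific; the array size and the resulting GB/s figure are machine-dependent assumptions): the loop below does one addition per eight bytes fetched, so its speed is set by memory traffic rather than by the ALU.

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 #define N (1L << 26)               /* 64M doubles = 512 MB, far bigger than any cache */

 int main(void) {
     double *a = malloc(N * sizeof *a);
     if (!a) return 1;
     for (long i = 0; i < N; i++) a[i] = 1.0;

     clock_t t0 = clock();
     double sum = 0.0;
     for (long i = 0; i < N; i++)
         sum += a[i];               /* one add per 8 bytes fetched from RAM */
     double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

     /* The GB/s figure, not the adds-per-second figure, limits this loop. */
     printf("sum = %g, %.2f GB/s\n", sum, (double)N * sizeof *a / secs / 1e9);
     return 0;
 }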
The obvious solution is parallel processing. It has many problems of its own.
Why isn't the obvious solution to increase the bandwidth between the CPU and memory?
- Doing so is a non-trivial problem. One can widen the data bus, but that complicates the physical packaging of the processor (and magnifies crosstalk and other signal-integrity problems). One can raise the I/O clock rate, but running fast signals across a circuit board (as opposed to within a blob of silicon) is also difficult. Finally, the memory devices themselves need to be sped up.
- Bandwidth is not the main problem. Latency is. (The first sketch after this list illustrates the difference.)
- Actually, I shouldn't have said "finally", because there's a much bigger issue involved. The real meaning of "VonNeumannBottleneck" is not about memory bandwidth. CPU operations have always been parallel to a greater or lesser extent -- there's always parallelism of some sort on chips, e.g. addition using carry lookahead rather than bit-serial ripple carry, though it's often not externally visible except as higher performance -- yet CPU <-> memory operations are inherently sequential in von Neumann architectures. (The second sketch after this list makes the carry-lookahead example concrete.)
- So whoever said "The obvious solution is parallel processing" -- that's not a random proposal, that is precisely what VonNeumannBottleneck means in the first place. The second most popular architecture, HarvardArchitecture, is non-von to a tiny extent: it separates data from program memory, accesses the two on independent buses, and allows e.g. a data fetch to happen in parallel with an instruction fetch.
- A more extreme kind of non-von is "smart memory", where there may not be any central cpu at all, there's just lots of memory, but each cell supports not just FETCH/STORE, but also a slew of ALU operations, with (ideally) lots and lots of independent buses to let memory talk to itself. This is a very active research area (not a new idea, but getting more mature). -- DougMerritt
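Regarding the bandwidth-versus-latency point above, a minimal, unscientific sketch (array size and timings are assumptions that vary by machine): the first loop chases dependent pointers, so every access pays the full memory latency; the second streams independent loads and is limited only by bandwidth. On a typical machine the chase is many times slower per element.

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 #define N (1 << 24)   /* 16M pointers, well beyond cache */

 int main(void) {
     size_t *next = malloc(N * sizeof *next);
     if (!next) return 1;

     /* Build one random cycle through the array (Sattolo's algorithm),
        so each load depends on the result of the previous one. */
     for (size_t i = 0; i < N; i++) next[i] = i;
     srand(1);
     for (size_t i = N - 1; i > 0; i--) {
         size_t j = (size_t)rand() % i;   /* j in [0, i) */
         size_t t = next[i]; next[i] = next[j]; next[j] = t;
     }

     clock_t t0 = clock();
     size_t p = 0;
     for (size_t i = 0; i < N; i++) p = next[p];   /* dependent loads: latency-bound */
     double chase = (double)(clock() - t0) / CLOCKS_PER_SEC;

     t0 = clock();
     size_t s = 0;
     for (size_t i = 0; i < N; i++) s += next[i];  /* independent loads: bandwidth-bound */
     double stream = (double)(clock() - t0) / CLOCKS_PER_SEC;

     printf("chase %.2fs, stream %.2fs (p=%zu s=%zu)\n", chase, stream, p, s);
     return 0;
 }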
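And to make the carry-lookahead aside concrete, a toy sketch (real adders do this in gates, not loops; the settling loop below just models the logic). The point is the shape of the dependencies: ripple carry is a chain as deep as the word is wide, lookahead is flat.

 #include <stdio.h>
 #include <stdint.h>

 /* Ripple carry: bit i can't be computed until the carry out of
    bit i-1 arrives, so the dependency chain is 8 steps deep. */
 uint16_t ripple_add(uint8_t a, uint8_t b) {
     uint16_t sum = 0;
     unsigned carry = 0;
     for (int i = 0; i < 8; i++) {
         unsigned ai = (a >> i) & 1, bi = (b >> i) & 1;
         sum |= (uint16_t)((ai ^ bi ^ carry) << i);
         carry = (ai & bi) | (carry & (ai ^ bi));  /* sequential dependency */
     }
     return sum | (uint16_t)(carry << 8);
 }

 /* Carry lookahead: generate (g) and propagate (p) for all bits exist at
    once; each carry is then a flat expression in g, p, and the carry-in.
    (Written as a settling loop here; hardware evaluates all the terms
    in a couple of gate levels rather than sequentially.) */
 uint16_t lookahead_add(uint8_t a, uint8_t b) {
     unsigned g = a & b, p = a ^ b, c = 0;
     for (int i = 0; i < 8; i++)
         c |= ((g | (p & c)) << 1) & 0x1FF;  /* carries into bits 1..8 */
     return (uint16_t)(((p ^ c) & 0xFF) | (c & 0x100));
 }

 int main(void) {
     for (unsigned a = 0; a < 256; a++)
         for (unsigned b = 0; b < 256; b++)
             if (ripple_add(a, b) != a + b || lookahead_add(a, b) != a + b)
                 return printf("mismatch at %u+%u\n", a, b), 1;
     puts("both adders agree with native + on all 8-bit pairs");
     return 0;
 }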
The other obvious solution (which doesn't require abandoning a VonNeumannArchitecture), which I'm surprised hasn't been mentioned on this page, is moving memory into the same package as the processor. Either in explicitly-separate address spaces (large memory files, special "processing" memories, etc.), or via caching.
That's not abandoning the VonNeumannArchitecture, that's increasing the bandwidth between the CPU and memory. The architecture still meets von Neumann's requirements even if the CPU and memory are in the same physical device.
- Quite true. It does help, of course, which is why we have caches. The idea of actually integrating all of RAM onto the same chip as the CPU has been kicking around for a long time, since that would speed things up by avoiding off-chip delays (capacitance, if nothing else, causes huge delays), but this has not been particularly feasible for an interesting reason: the process engineering involved in mass producing DRAM is mutually exclusive with that used for mass producing CPUs. Put both together on one die, and yield and speed drop like a rock while defect rates climb. In recent years there's been some limited improvement in this area, but it doesn't look ready for prime time yet. (This is of course purely a digression from your point, but I find the topic an interesting one.) -- Doug
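As a footnote on why caches help so much, a minimal sketch (the matrix size is an assumption; the magnitude of the effect varies by machine): both loops below touch exactly the same 512 MB, but the first walks memory in cache-line order while the second strides across it, defeating the cache.

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 #define N 8192   /* 8192 x 8192 doubles = 512 MB */

 int main(void) {
     double (*m)[N] = malloc(sizeof(double[N][N]));
     if (!m) return 1;
     for (long k = 0; k < (long)N * N; k++) ((double *)m)[k] = 1.0;

     clock_t t0 = clock();
     double s = 0.0;
     for (int i = 0; i < N; i++)      /* row order: consecutive addresses,  */
         for (int j = 0; j < N; j++)  /* every byte of a cache line is used */
             s += m[i][j];
     double rows = (double)(clock() - t0) / CLOCKS_PER_SEC;

     t0 = clock();
     for (int j = 0; j < N; j++)      /* column order: 64 KB stride, one    */
         for (int i = 0; i < N; i++)  /* double used per line fetched       */
             s += m[i][j];
     double cols = (double)(clock() - t0) / CLOCKS_PER_SEC;

     printf("rows %.2fs, columns %.2fs (s = %g)\n", rows, cols, s);
     return 0;
 }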
A partial solution I've seen in actual use (on RAID cards, at a memory layer between the RAID card's CPU and the disk drives) is to add some intelligence to the memory for wide-memory operations such as zeroing, XORs, and copies. The RAID card's CPU can then issue a few higher-level commands to the memory, rather than just get and set, mostly to work around the VonNeumannBottleneck. One could presumably extend this to support small microprograms in the memory, in the same sense as GPU shaders, though one would probably want to guarantee termination. The RAID driver used a locking scheme: if the CPU requested a memory address inside the read or write region of a running memory-program, the request waited until the program completed, and an attempt to change the memory-program likewise waited for completion.
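A rough sketch of the kind of interface described (all names and structures here are hypothetical illustrations, not the actual RAID firmware API):

 #include <stdio.h>
 #include <stdint.h>
 #include <string.h>

 /* Hypothetical command set for a "smart" buffer memory of the kind
    described above. All names are illustrative assumptions. */
 enum smartmem_op { SM_ZERO, SM_COPY, SM_XOR };

 struct smartmem_cmd {
     enum smartmem_op op;
     uint64_t dst, src, len;     /* byte offsets/length within the buffer */
 };

 struct smartmem {
     uint8_t *mem;
     uint64_t busy_lo, busy_hi;  /* region locked by a command in flight */
     int busy;
 };

 /* Reference semantics of each wide-memory command, run "inside" the
    memory so the host CPU never shuttles the data word by word. */
 static void smartmem_exec(struct smartmem *m, const struct smartmem_cmd *c) {
     switch (c->op) {
     case SM_ZERO: memset(m->mem + c->dst, 0, c->len); break;
     case SM_COPY: memmove(m->mem + c->dst, m->mem + c->src, c->len); break;
     case SM_XOR:  /* parity update, the RAID workhorse */
         for (uint64_t i = 0; i < c->len; i++)
             m->mem[c->dst + i] ^= m->mem[c->src + i];
         break;
     }
 }

 /* The locking rule from the description: a plain CPU access into the
    region of a running command stalls until the command completes. */
 static int access_would_block(const struct smartmem *m, uint64_t addr) {
     return m->busy && addr >= m->busy_lo && addr < m->busy_hi;
 }

 int main(void) {
     static uint8_t buf[256];
     struct smartmem m = { buf, 0, 0, 0 };
     buf[0] = 0x0F; buf[128] = 0xF0;
     struct smartmem_cmd c = { SM_XOR, 0, 128, 128 };
     smartmem_exec(&m, &c);
     printf("buf[0] = 0x%02X, blocked = %d\n", buf[0], access_would_block(&m, 0));
     return 0;
 }

A real card would of course run the command asynchronously and enforce the blocking rule in hardware; the check above just models the rule.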
digressing even further
Actually, most CPUs manufactured these days *do* integrate all of RAM onto the same chip as the CPU. (Well, at least they did in 1997: http://en.wikipedia.org/wiki/Microprocessor#Market_statistics -- is there a more up-to-date reference?)
Most microcontrollers include all the program memory and the RAM (usually not DRAM) on the same piece of silicon as the CPU.
As of 2009, there are several microcontrollers that include on the order of 1 MByte of Flash that can be read and re-written by the software.
But they still have a von Neumann bottleneck.
Other research topics include:
Hyperthreading
Pipelining
Multiple ALUs
Prefetch (see the sketch below this thread)
- Matt Pettit (UK) 2004
- You missed some buzzwords, like VLIW. But to my mind, all that stuff is still variations on the theme of von Neumann. -- Doug
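On the Prefetch item above, a minimal sketch using GCC/Clang's __builtin_prefetch (the 16-element lookahead distance is a guess that would need per-machine tuning; and for a simple linear scan like this, the hardware prefetcher usually does the job already -- explicit prefetch earns its keep mostly on irregular access patterns):

 #include <stddef.h>

 /* Ask the memory system for data a few iterations before it is used,
    so the fetch overlaps with computation instead of stalling it. */
 double sum_with_prefetch(const double *a, size_t n) {
     double s = 0.0;
     for (size_t i = 0; i < n; i++) {
         if (i + 16 < n)
             __builtin_prefetch(&a[i + 16], 0, 0);  /* read, low temporal locality */
         s += a[i];
     }
     return s;
 }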
Is it true that the graphics synthesizer for the next-generation PlayStation used "Computational RAM" (http://david.carybros.com/html/vlsi.html#cRAM)?
The other "obvious solution" is not to throw hardware at the VNB, but to rebuild programming from scratch with another paradigm. See OmnigonInternational for less tips here...
See CanProgrammingBeLiberatedFromTheVonNeumannStyle