Grok Caching

From OptimizationPattern. Caching means not doing twice (or more) what you need do only once. The hard part lies in figuring out the crossover point where the extra overhead of implementing caching is balanced by its benefits.

A HighLevelLanguage presents a fairly uniform memory model, but in real life some kinds of memory are faster than others. The computer will use the fast memory as a cache for slow memory, to avoid the cost of accessing slow memory a second time. There can be many levels of memory, for example:

 * CPU registers
 * On-chip (L1) cache
 * Second-level (L2) cache
 * Main memory (RAM)
 * Disk (including RAM swapped out by the virtual memory system)
 * Network, tape, and other offline storage

Memory lower down the list is slower, often by orders of magnitude.
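
You can see those hidden levels even from a HighLevelLanguage if you go looking for them. Here is a minimal C sketch, under assumptions of my own (the array size, the strides, and the crude clock()-based timing are all invented for illustration): it walks the same large array with larger and larger strides, doing the same number of accesses each time, so the growth in run time is mostly cache misses.

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 #define N (32 * 1024 * 1024)   /* 32M ints (128 MB): far bigger than L1/L2 */

 int main(void)
 {
     int *a = malloc((size_t)N * sizeof *a);
     if (!a) return 1;
     for (long i = 0; i < N; i++) a[i] = 1;

     volatile long sink = 0;
     for (int stride = 1; stride <= 1024; stride *= 4) {
         clock_t t0 = clock();
         long sum = 0;
         /* Same number of accesses for every stride; only the access
            pattern changes, so the extra time is mostly cache misses. */
         for (long i = 0, j = 0; i < N; i++) {
             sum += a[j];
             j += stride;
             if (j >= N) j -= N;
         }
         sink += sum;
         printf("stride %4d: %.2f s\n",
                stride, (double)(clock() - t0) / CLOCKS_PER_SEC);
     }
     free(a);
     return (int)(sink & 1);   /* keep the sums from being optimized away */
 }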

These differences in speed can be hard to understand because they are hidden by the high level languages, but they can be very important, and they can surprise you. For example, optimizations like loop unrolling or function inlining can result in slower programs, because they can bloat the code until it no longer fits in some hidden cache.

In the good old days, programmers cared about "working sets" and learned about (eg) disk sorting where the data set was too large to fit into RAM. Today similar techniques are needed to make efficient use of L1/L2 cache. So the wheel turns...


(I'm really hoping this will be filled in by someone who knows modern hardware better than I do. Eg some typical timings would be nice.)

The timings vary a bit, but the following will probably be close enough until someone fixes it. Oh my, what a question to answer. So much to say, so little reason to say it.

Try these (rough orders of magnitude, circa 2005):

 * Register: ~1 CPU cycle
 * L1 cache: a few cycles (a nanosecond or so)
 * L2 cache: ~10-20 cycles
 * Main memory: ~100-300 cycles (tens to hundreds of nanoseconds)
 * Disk seek: ~5-15 milliseconds (millions of cycles)
 * Network round trip: a fraction of a millisecond on a LAN, tens to hundreds of milliseconds across the internet


Executive Summary:

For client server in big business with big iron (MainFrameOsaurs)

Normally in desktop apps, excessive disk traffic happens when (yep, that's what I said) the system swaps RAM contents to the disk so that it can have the space to cache disk contents in the RAM!

Details

Some systems have automatic write-through caches for disks; some cache lots of the disk data. (All systems? They at least cache the sector that you are writing to one char at a time.) If you are writing lots of data, making sure that the disk has a write cache can help a fair amount.
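
The same lesson applies one level up, inside your own process: give the C library a decent buffer so that byte-at-a-time output doesn't turn into a stream of tiny requests to the OS. A small sketch, assuming plain C stdio; the file name and the 1 MB buffer size are made up for illustration, and stdio already buffers a little by default, this just enlarges the buffer.

 #include <stdio.h>

 int main(void)
 {
     FILE *f = fopen("out.dat", "wb");
     if (!f) return 1;

     /* Give stdio a large write buffer, so the putc() calls below become
        one sizable write per megabyte instead of many tiny ones. */
     static char buf[1 << 20];
     setvbuf(f, buf, _IOFBF, sizeof buf);

     for (long i = 0; i < 10 * 1000 * 1000; i++)
         putc((int)(i & 0xff), f);   /* "one char at a time" */

     fclose(f);
     return 0;
 }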

Physical hard disks tend to have RAM inside them, so that whenever you read a sector the hard disk reads (and caches) the whole track. This eliminates all the fiddling that used to be done adjusting the interleave of disks.

When the computer wants the next sector, it is then limited only by the bandwidth of the cable, which is something like 1-6 meg of something per second. This may mean that if you are reading a number of files, it may be better to read sizable chunks from each. But then again, on your system the track cache may be in system RAM.
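
Here is what "read sizable chunks from each" might look like in C; the four-file setup, the 1 MB chunk size, and the empty "consume" step are all assumptions for illustration, and the caller is assumed to pass already-opened streams.

 #include <stdio.h>

 #define NFILES 4
 #define CHUNK  (1 << 20)              /* 1 MB per read, per file */

 void read_round_robin(FILE *in[NFILES])
 {
     static char buf[NFILES][CHUNK];
     size_t got[NFILES];
     int live = NFILES;

     while (live > 0) {
         live = 0;
         for (int i = 0; i < NFILES; i++) {
             /* One sizable read per file per round, instead of tiny reads
                that bounce the disk head back and forth between files. */
             got[i] = fread(buf[i], 1, CHUNK, in[i]);
             if (got[i] > 0)
                 live++;
             /* ...consume buf[i][0 .. got[i]) here, e.g. a merge step... */
         }
     }
 }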

Once the data gets into memory, if your algorithm is O(n) (go through the data once), then there is not much you can do. Going through it once is what happens in a large dot product or vector addition, etc.

If, as happens in image processing, you are sliding a rectangular convolution filter over an image, then as you slide along the second row of pixels you are accessing nearly exactly the same data as on the first pass. This can be leveraged to gain speed by doing the convolution for a vertical strip of 10 to 20 pixels and then sliding that strip across the image. On a BIG image that flushes the L2 cache on each pass across the image, this can make a big difference. It may be less than you imagine, though, because it is only the leading edge of the filter that is not in the L1 cache.
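
A minimal C sketch of the strip idea, under assumptions of my own (row-major float images, borders simply skipped, a 16-row strip): the only thing that matters is the loop order, which sweeps each strip of output rows across the image once so that the source pixels one filter position needs are mostly the ones the previous position just touched.

 #define STRIP 16   /* output rows per strip; tune to your caches */

 /* dst and src are h x w row-major images, kernel is k x k with k odd. */
 void convolve_strips(float *dst, const float *src,
                      const float *kernel, int w, int h, int k)
 {
     int r = k / 2;
     for (int y0 = r; y0 < h - r; y0 += STRIP) {
         int y1 = y0 + STRIP;
         if (y1 > h - r) y1 = h - r;
         /* Slide the whole strip across the image: as x advances, only
            the leading column of source pixels is new to the cache. */
         for (int x = r; x < w - r; x++) {
             for (int y = y0; y < y1; y++) {
                 float acc = 0.0f;
                 for (int ky = -r; ky <= r; ky++)
                     for (int kx = -r; kx <= r; kx++)
                         acc += kernel[(ky + r) * k + (kx + r)]
                              * src[(y + ky) * w + (x + kx)];
                 dst[y * w + x] = acc;
             }
         }
     }
 }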

Trying to work out how to compute a large FFT such that cache hits are maximized is a touch more daunting...

If you get stuck between a rock and a hard place (i.e. it must go faster or not at all, and everything else has been tried): first employ God, and then there are ways to allocate memory, design programs, and structure data such that fragmentation and cache misses in general are reduced. This may involve throwing encapsulation and various other critical design principles away, which will often make the program at least twice as hard to maintain. (See DeathMarch and other horror stories.)
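
One common (and encapsulation-hostile) example of such data restructuring is splitting an array of structs into parallel arrays, so that a pass which only needs a couple of fields doesn't drag every other field through the cache with it. A sketch in C with invented names:

 /* Array-of-structs: the natural, encapsulated layout.  Updating only the
    positions still pulls every particle's unused fields into the cache. */
 struct Particle {
     float x, y, z;
     float vx, vy, vz;
     float mass, charge;
 };

 void step_aos(struct Particle *p, int n, float dt)
 {
     for (int i = 0; i < n; i++) {
         p[i].x += p[i].vx * dt;
         p[i].y += p[i].vy * dt;
         p[i].z += p[i].vz * dt;
     }
 }

 /* Struct-of-arrays: the same data split into parallel arrays, so a pass
    over the positions touches only the cache lines it actually needs. */
 struct Particles {
     float *x, *y, *z;
     float *vx, *vy, *vz;
     float *mass, *charge;
 };

 void step_soa(struct Particles *p, int n, float dt)
 {
     for (int i = 0; i < n; i++) {
         p->x[i] += p->vx[i] * dt;
         p->y[i] += p->vy[i] * dt;
         p->z[i] += p->vz[i] * dt;
     }
 }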

SuperScalar processors can stall if one instruction depends on the previous instruction writing its result to a register. This problem can look like peanuts until you realize that it is in the loop executing >90% of the time and costs a >10% speed penalty.
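
A classic fix is to keep several independent partial results so that consecutive iterations are not all waiting on the same register. A sketch in C; the 4-way split is an assumption, and whether it helps (or whether the compiler already does it for you) depends on the CPU and the compiler flags.

 /* Every addition waits for the previous one: one long dependency chain. */
 float sum_chained(const float *a, int n)
 {
     float s = 0.0f;
     for (int i = 0; i < n; i++)
         s += a[i];
     return s;
 }

 /* Four independent accumulators give the processor independent work to
    overlap, at the cost of slightly less obvious code. */
 float sum_unchained(const float *a, int n)
 {
     float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
     int i;
     for (i = 0; i + 3 < n; i += 4) {
         s0 += a[i];
         s1 += a[i + 1];
         s2 += a[i + 2];
         s3 += a[i + 3];
     }
     for (; i < n; i++)       /* leftover elements */
         s0 += a[i];
     return (s0 + s1) + (s2 + s3);
 }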


I'm so disappointed! This page turns out to be a very clearly-framed discussion which enables one to grok the critical issues around caching. And here I thought it was going to give me a clever way to cache the groks I experience, so I didn't have to go around re-grokking things. <g>

Ho Ho. A clever way to cache the groks I experience, now there's a concept to grok.


See OptimizationPattern, UseStorageSpacesWisely

