Incremental updating for positional evaluation

Re: Incremental updating for positional evaluation

Post by Gerd Isenberg »

wgarvin wrote: I just realized I was thinking of it in the context of an in-order processor (like a PPC one I use at work), where you would have to try and match the L2 cache miss cost (610 cycles on the one I work with) to the amount of speculative work being done.

Since it's an out-of-order processor, I guess the costs are pretty fungible, and it's a lot less clear how to get "optimal" performance. But I guess as long as you give it lots of stuff to chew on between the prefetch and the actual read, it will end up being faster. Some of the latency is hidden because the read doesn't issue until a while after the prefetch. More of it might be hidden by issuing and executing instructions in between that would otherwise have to be issued and executed later, but I'm not as certain about that; shouldn't the chip be able to speculate past the branch anyway? Unless it mispredicts it... maybe that is the source of your speedup?

[Edit: oh, and writes are effectively wicked cheap (unless you try to do large numbers of them, or something) because of store buffers, write combining and store forwarding. The store forwarding in particular is something I miss on this in-order PPC chip. You can't even convert an int to a float on the damn thing without writing to memory and paying a 40-cycle Load-Hit-Store penalty. Super annoying.]
I think in-order versus out-of-order matters for preloading (an ordinary move instruction that loads the content of an address into a register which might not be used immediately) rather than for prefetching, where a special processor instruction brings the cache line in without touching any architectural register.
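
To make the difference concrete, here is a minimal C sketch (hypothetical code; hashptr just stands for the address of the entry to be probed later):

Code: Select all

#include <stdint.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* Preloading: an ordinary MOV into a register. On an in-order core the
   pipeline stalls for the full miss latency as soon as the loaded
   value is used. */
static uint64_t preloadEntry(const uint64_t *hashptr)
{
    return *hashptr;
}

/* Prefetching: a pure hint. No architectural register is written and no
   dependency is created; the cache line is just pulled toward L1 while
   unrelated instructions keep executing. */
static void prefetchEntry(const void *hashptr)
{
    _mm_prefetch((const char*)hashptr, _MM_HINT_T0);
}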

I don't know anything about PPC or G5, but I guess AltiVec has prefetch instructions as well, to tell the memory management unit to bring cache lines closer to the processor and make a later read faster.
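
If I read the AltiVec documentation correctly, the data-stream-touch intrinsic vec_dst would be the counterpart; an untested sketch, with a made-up control word:

Code: Select all

#include <altivec.h>   /* vec_dst */

/* Untested sketch: vec_dst starts a hardware data-stream prefetch.
   The control word packs block size (in 16-byte vectors), block count
   and stride; the values below are purely illustrative. */
static void altivecPrefetch(const unsigned char *p)
{
    vec_dst(p, (2 << 24) | (2 << 16) | 64, 0);   /* stream tag 0 */
}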

I explicitly use x86 SSE prefetcht0 via the _mm_prefetch intrinsic.

Code: Select all

_mm_prefetch((char*) hashptr, _MM_HINT_T0);
which translates to

Code: Select all

...  0f 18 09  prefetcht0 byte ptr [rcx]
The "hiding" code is branchless SIMD-code except repetition detection and a few ifs for handling pinned pieces. But I think branch miss-prediction is no issue while using prefetch-instructions.

Some related quotes from:

Software Optimization Guide for AMD64 Processors

5.6 Prefetch Instructions (pg 104)
...
Prefetching versus Preloading

In code that makes irregular memory accesses rather than sequential accesses, an ordinary MOV instruction is the best way to load data. But in situations where sequential addresses are read, prefetch instructions can improve performance. Prefetch instructions only update the L1 data cache and do not update an architectural register. This uses one less register compared to a load instruction.

Multiple Prefetches

Programmers can initiate multiple outstanding prefetches on the AMD Athlon 64 and AMD Opteron processors. The AMD Athlon 64 and AMD Opteron processors can have a theoretical maximum of eight outstanding prefetches, but in practice the number is usually smaller. When all resources are filled by various memory read requests, the processor waits until resources become free before processing the next request. Multiple prefetch requests are essentially handled in order, prefetching data in the order that it is needed.
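
So a handful of back-to-back prefetches can overlap, e.g. for a structure spanning several cache lines (illustration only):

Code: Select all

#include <xmmintrin.h>

/* Illustration: touch four consecutive 64-byte cache lines at once.
   The requests overlap, up to the hardware limit of eight in flight. */
static void prefetchLines(const char *p)
{
    _mm_prefetch(p,       _MM_HINT_T0);
    _mm_prefetch(p +  64, _MM_HINT_T0);
    _mm_prefetch(p + 128, _MM_HINT_T0);
    _mm_prefetch(p + 192, _MM_HINT_T0);
}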

Memory-Limited versus Processor-Limited Code

Software prefetching can help to hide the memory latency, but it cannot increase the total memory bandwidth. Many loops are limited by memory bandwidth rather than processor speed, as shown in Figure 4. In these cases, the best that software prefetching can do is to ensure that enough memory requests are “in flight” to keep the memory system busy all of the time. The AMD Athlon 64 and AMD Opteron processors support a maximum of eight concurrent memory requests to different cache lines. Multiple requests to the same cache line count as only one towards this limit of eight.
...
9.5 Structuring Code with Prefetch Instructions to Hide Memory Latency (pg 200)
...
Rationale

Prefetch instructions bring the cache line containing a specified memory location into the processor cache. (For more information on prefetch instructions, see “Prefetch Instructions” on page 104.) Prefetching hides the main memory load latency, which is typically many orders of magnitude larger than a processor clock cycle.

There are two types of loops:

Memory-limited: Data can be processed and requested faster than it can be fetched from memory.

Processor-limited: Data can be requested and brought into the processor before it is needed because considerable processing occurs during each unrolled loop iteration.
...
Follow these guidelines when working with processor-limited loops:

Arrange your code with enough instructions between prefetches so that there is adequate time for the data to be retrieved.
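
A minimal sketch of that guideline for a processor-limited loop; the prefetch distance of four cache lines is an arbitrary example value, not a number from the guide:

Code: Select all

#include <stddef.h>
#include <xmmintrin.h>

/* Sketch: fetch data a few cache lines ahead of where the loop is
   working, so the loop body hides the memory latency. Prefetch
   instructions never fault, so running past the end of the array
   is harmless. */
double sumWithPrefetch(const double *a, size_t n)
{
    const size_t distance = 4 * 8;   /* 4 lines ahead, 8 doubles/line */
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if ((i & 7) == 0)            /* once per 64-byte cache line */
            _mm_prefetch((const char*)&a[i + distance], _MM_HINT_T0);
        sum += a[i];                 /* the work that hides the latency */
    }
    return sum;
}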