Cache pollution when reading/writing hash table

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Cache pollution when reading/writing hash table

Post by mcostalba »

When we store/load an entry in the hash table (TT), it is uncommon that we access it again in the near future.

So it could be an idea to avoid polluting the cache by using the MOVNTQ SIMD instruction or something similar.
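For illustration, what I have in mind is something like this minimal sketch (made-up names, assuming a 16-byte, 16-byte-aligned TT entry; MOVNTQ itself is the 8-byte MMX store, the 16-byte SSE2 equivalent is MOVNTDQ):

Code: Select all

#include <emmintrin.h>   // SSE2: _mm_load_si128, _mm_stream_si128 (MOVNTDQ)
#include <cstdint>

// Hypothetical 16-byte TT entry, aligned so it fits a single SSE2 store.
struct alignas(16) TTEntry {
    uint64_t key;
    uint64_t data;   // move, value, depth, bound packed together
};

// Store an entry without pulling the destination cache line into the cache.
inline void store_nontemporal(TTEntry* dst, const TTEntry& src) {
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(&src));
    _mm_stream_si128(reinterpret_cast<__m128i*>(dst), v);
}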

Does anyone have experience with this subject?

I have found nothing browsing around.

Thanks
Marco
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Cache pollution when reading/writing hash table

Post by Gian-Carlo Pascutto »

mcostalba wrote:When we store/load an entry in the hash table (TT), it is uncommon that we access it again in the near future.
I don't think so. At the very least, I wouldn't believe this claim without evidence :)
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Cache pollution when reading/writing hash table

Post by hgm »

I am not sure this would be beneficial. Certainly not in my engines, which rely heavily on IID (and thus repeat the same hash probe quite often) and have a cache footprint so small that all infrequent accesses like hash probes simply map to a cache way that would otherwise be unused.

But even for engines that don't do that, there is the issue of true transpositions. A depth-first tree walk in chess is such that when a position is visited for the first time, it is very quickly revisited through a path that transposes the moves just before it. E.g. the previous 4 plies could potentially be reached in up to 4 different move orders (the two white moves and the two black moves can each be swapped), if the moves are independent. If this occurs in (say) the last 6 plies of the search, the number of nodes visited in between these probes is small enough that they all fit in cache, so these quick revisits would all be cache hits. Only revisits through transpositions much closer to the root involve sub-trees that are too large to fit in cache.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Cache pollution when reading/writing hash table

Post by mcostalba »

Thanks for your answer. Actually I am also not sure that it happens. I had this idea because profiling with VTune shows that accessing the hash table is almost always very costly, which means a RAM access is needed most of the time.

But I have never actually measured the percentage of RAM accesses vs. L1/L2 accesses. The only thing I can say is that this single lookup is the most time-consuming function in Stockfish, due to the fact that the hash probe is also done in qsearch().

Unfortunately, removing hash probing from qsearch seems to lower the strength of SF. This is because evaluation is very costly, and empty TT slots are also reused to cache evaluation values.
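In rough pseudo-form the mechanism is something like this (a simplified sketch with made-up names, not the real Stockfish code):

Code: Select all

#include <cstdint>
#include <vector>

// Sketch: a TT slot that doubles as an evaluation cache for qsearch.
struct TTEntry {
    uint64_t key   = 0;
    int16_t  value = 0;     // search score, if a real search result is stored
    int16_t  eval  = 0;     // cached static evaluation
    int8_t   depth = -128;  // -128 = "eval only", no search result yet
};

struct TranspositionTable {
    std::vector<TTEntry> table;                       // size must be a power of two
    explicit TranspositionTable(size_t n) : table(n) {}
    TTEntry& slot(uint64_t key) { return table[key & (table.size() - 1)]; }
};

int evaluate_position() { return 0; }                 // stands in for the costly eval

// Probing the TT in qsearch lets us skip evaluate_position() when the same
// position (or a transposition of it) was already visited, even if no
// search score was ever stored in that slot.
int qsearch_static_eval(TranspositionTable& tt, uint64_t key) {
    TTEntry& e = tt.slot(key);
    if (e.key == key)
        return e.eval;                                // reuse cached evaluation

    int eval = evaluate_position();                   // costly
    e.key   = key;                                    // store even without a score
    e.eval  = static_cast<int16_t>(eval);
    e.depth = -128;
    return eval;
}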
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Cache pollution when reading/writing hash table

Post by Gerd Isenberg »

mcostalba wrote:When we store/load an entry in the hash table (TT), it is uncommon that we access it again in the near future.

So it could be an idea to avoid polluting the cache by using the MOVNTQ SIMD instruction or something similar.

Does anyone have experience with this subject?

I have found nothing browsing around.

Thanks
Marco
I actually use SSE2 SIMD instructions for loading and storing hash entries. Bypassing the cache was not beneficial. What works with my approach is prefetching to L1 and then hiding the latency with register-intensive, on-the-fly SSE2 Kogge-Stone attack generation before actually probing the TT (at all nodes, even in qsearch).
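The pattern is roughly the following (a simplified scalar sketch with made-up names, not my actual SSE2 code):

Code: Select all

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstdint>

struct TTEntry { uint64_t key, data; };

extern TTEntry* ttTable;
extern uint64_t ttMask;         // table size - 1, power of two

// Plain 64-bit Kogge-Stone occluded fill (south direction) as an example of
// register-only work; the SSE2 version runs several directions in parallel.
inline uint64_t southOccludedFill(uint64_t sliders, uint64_t empty) {
    sliders |= empty & (sliders >>  8);
    empty   &=          empty   >>  8;
    sliders |= empty & (sliders >> 16);
    empty   &=          empty   >> 16;
    sliders |= empty & (sliders >> 32);
    return sliders;
}

int probe_with_hidden_latency(uint64_t hashKey, uint64_t rooks, uint64_t empty) {
    // 1. Start pulling the TT entry toward L1 as early as possible.
    TTEntry* entry = &ttTable[hashKey & ttMask];
    _mm_prefetch(reinterpret_cast<const char*>(entry), _MM_HINT_T0);

    // 2. Do register-intensive work while the cache line is in flight.
    uint64_t southAttacks = southOccludedFill(rooks, empty) >> 8;

    // 3. Only now read the entry; ideally the memory latency is already paid.
    int hit = (entry->key == hashKey);
    (void)southAttacks;          // would be used by move generation later
    return hit;
}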
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Cache pollution when reading/writing hash table

Post by mcostalba »

Gerd Isenberg wrote:
mcostalba wrote:When we store/load an entry in the hash table (TT), it is uncommon that we access it again in the near future.

So it could be an idea to avoid polluting the cache by using the MOVNTQ SIMD instruction or something similar.

Does anyone have experience with this subject?

I have found nothing browsing around.

Thanks
Marco
I actually use SSE2 SIMD instructions for loading and storing hash entries. Bypassing the cache was not beneficial. What works with my approach is prefetching to L1 and then hiding the latency with register-intensive, on-the-fly SSE2 Kogge-Stone attack generation before actually probing the TT (at all nodes, even in qsearch).
Thanks!!! This is really useful.

You mean that while you are waiting for the RAM data to be loaded, you do the attack calculation?

At a first quick glance it is not clear to me what I _could_ do during a TT.retrieve(), but the idea is very good.

Thanks
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Cache pollution when reading/writing hash table

Post by Gerd Isenberg »

mcostalba wrote:
Gerd Isenberg wrote:
mcostalba wrote:When we store/load an entry in the hash table (TT), it is uncommon that we access it again in the near future.

So it could be an idea to avoid polluting the cache by using the MOVNTQ SIMD instruction or something similar.

Does anyone have experience with this subject?

I have found nothing browsing around.

Thanks
Marco
I actually use SSE2 SIMD instructions for loading and storing hash entries. Bypassing the cache was not beneficial. What works with my approach is prefetching to L1 and then hiding the latency with register-intensive, on-the-fly SSE2 Kogge-Stone attack generation before actually probing the TT (at all nodes, even in qsearch).
Thanks!!! This is really useful.

You mean that while you are waiting for the RAM data to be loaded, you do the attack calculation?

At a first quick glance it is not clear to me what I _could_ do during a TT.retrieve(), but the idea is very good.

Thanks
Yes, on my K8. It only works if you don't do too many other memory operations while "waiting" for the cacheline.
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Cache pollution when reading/writing hash table

Post by Allard Siemelink »

Incidentally, I tried this last week in my engine Spark.
movntq resulted in a slight slowdown vs movdqu/a.
Like Gerd, I have been using prefetches (issued right after a move has been made); this resulted in a speedup of a few percent, IIRC.
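Concretely, the prefetch placement looks roughly like this (a sketch with made-up names, not the actual Spark code):

Code: Select all

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstdint>

struct TTEntry { uint64_t key, data; };

extern TTEntry* ttTable;
extern uint64_t ttMask;              // table size - 1, power of two
extern uint64_t zobrist[12][64];     // piece/square hash keys

// As soon as do_move() knows the child's hash key, start loading its TT
// slot; by the time the child node actually probes, the line should be in.
uint64_t do_move_and_prefetch(uint64_t hashKey, int piece, int from, int to) {
    // ... update the board representation (omitted) ...
    hashKey ^= zobrist[piece][from] ^ zobrist[piece][to];

    _mm_prefetch(reinterpret_cast<const char*>(&ttTable[hashKey & ttMask]),
                 _MM_HINT_T0);
    return hashKey;
}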
mcostalba wrote:When we store/load an entry in the hash table (TT), it is uncommon that we access it again in the near future.

So it could be an idea to avoid polluting the cache by using the MOVNTQ SIMD instruction or something similar.

Does anyone have experience with this subject?

I have found nothing browsing around.

Thanks
Marco
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Cache pollution when reading/writing hash table

Post by mcostalba »

Gerd Isenberg wrote: Yes, on my K8. It only works if you don't do too many other memory operations while "waiting" for the cacheline.
Could you please post some portable prefetching code, or links to such code, so that I can see an example of how to integrate prefetching in C++?

Thanks
Marco
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Cache pollution when reading/writing hash table

Post by Gerd Isenberg »

mcostalba wrote:
Gerd Isenberg wrote: Yes, on my K8. It only works if you don't do too many other memory operations while "waiting" for the cacheline.
Could you please post some portable prefetching code, or links to such code, so that I can see an example of how to integrate prefetching in C++?

Thanks
Marco
http://msdn.microsoft.com/en-us/library/84szxsww.aspx
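A portable wrapper could look roughly like this (just a sketch; MSVC has _mm_prefetch as in the link above, and GCC provides __builtin_prefetch):

Code: Select all

#if defined(_MSC_VER)
#  include <xmmintrin.h>
   inline void prefetch(const void* addr) {
       _mm_prefetch(static_cast<const char*>(addr), _MM_HINT_T0);
   }
#elif defined(__GNUC__)
   inline void prefetch(const void* addr) {
       __builtin_prefetch(addr);
   }
#else
   inline void prefetch(const void*) {}   // harmless no-op fallback
#endif

// Usage: right after the hash key of the node is known, e.g.
//   prefetch(&tt[key & mask]);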