SIMD methods in TT probing and replacement

Sesse · Post by **Sesse** » Thu Feb 20, 2020 4:38 pm

SIMD trickery is getting increasingly common in hash tables in general; see e.g. Google's SwissTables. If you byte-align your “control word” (store 8 bits per hash entry), you can use SSE2 parallel comparison (PCMPEQB) and get out one bit per (potentially) matching entry, using PMOVMSKB, which also conveniently puts the result in an integer register where you can easily compare it to zero, run BSF on it, and so on. It's a very quick way to weed out 99%+ of mismatches in just a few cycles.

hgm · Post by **hgm** » Thu Feb 20, 2020 11:16 pm

Yes, that was the initial idea. Originally I proposed the older MMX instructions for this; these also have an 8x8bit compare. Problem is that you have to use compiler intrinsics, and that it often takes extra instructions to shuttle the data into the SIMD registers. This is why the ordinary uint64 arithmetic can still be competitive. I have gotten a bit out of touch with CPU development these past years, but I am not sure if SSE2 instructions can be executed in a super-scalar way. For ordinary uint64 arithmetic you can draw on 3 ALUs to execute the instructions in parallel.

Another advantage is that you are not subject to the pre-defined division of the SIMD operands; you can also do 7x9 or 6x10 patterns.

bob · Post by **bob** » Thu Feb 20, 2020 11:41 pm

hgm wrote: ↑Thu Feb 20, 2020 3:10 pm
bob wrote: ↑Wed Feb 19, 2020 10:30 pmIn a 100M + node search, made after playing the previous move after a 500M node search, Crafty only had to update age because probe age did not match entry age. It had to update age on a probe 170K times. Roughly 0.01% of the probes, so not THAT terrible.
How is this possible? I would think that any time you hit upon an entry that was created in the previous search, you would have to update its age. So this result in fact says that only 170K of the entries created by the previous search was visited by a subsequent search of 100M+ nodes.

I suppose this previous search was 2 ply earlier, without pondering in between. That means the root of the second search was somewhere at ply level 2 in the tree of the first search. If the opponent move was the expected one, it would even be on the PV of the first search. If we assume an EBF of about 3, a 2-ply less deep search from a PV node should take about 9 times fewer nodes. So about 11% of the nodes of the first search (i.e. 55M) would be expected to occur in the tree of the second search.

There is a HUGE discrepancy between that 55M and the reported 170K. How can that be explained? Was the intervening move of the opponent a completely unexpected one? (That would then not be very typical, as ponder-hit rate is typically ~60%.) Was the TT very small, so that it could not hold much more than 170K positions in the first place? Is the hashing algorithm such that hardly any entry stays in the TT very long, so that the 500M search flushed out almost all 55M relevant entries by its own low-depth stuff?

No. I only counted the updates when READING and using the info. Just over-writing I did not count since I am doing a write there anyway. As far as the test, I used the opening position. I played e4, let Crafty think a bit, and it played e5. I played Nf3 and repeated. For the first 8-10 moves (as far as I went) the numbers were similar. Again note my explanation above. I only had to do an unnecessary update if the signature matched AND the result was useful (IE UPPER, LOWER or EXACT.) I did not count signature matches with an unused value since I don't update age there either.

Hope that helps..

Sesse · Post by **Sesse** » Fri Feb 21, 2020 12:21 am

hgm wrote: ↑Thu Feb 20, 2020 11:16 pm Yes, that was the initial idea. Originally I proposed the older MMX instructions for this; these also have an 8x8bit compare. Problem is that you have to use compiler intrinsics, and that it often takes extra instructions to shuttle the data into the SIMD registers.

You just do the load directly into the SSE2 register (with movdqa). No shuffling needed; it just doesn't go through the integer registers.

I have gotten a bit out of touch with CPU development these past years, but I am not sure if SSE2 instructions can be executed in a super-scalar way. For ordinary uint64 arithmetic you can draw on 3 ALUs to execute the instructions in parallel.

Haswell and up can do three vectorized (AVX) fused-multiply-add operations per cycle, _plus_ one integer instruction

It's superscalar alright.

Another advantage is that you are not subject to the pre-defined division of the SIMD operands; you can also do 7x9 or 6x10 patterns.

Sure, but that benefit would have to be fairly big to stay competitive, unless the cost of the memory read is so big that is basically dwarfs everything (which it very well might, but if so, SSE2 is meaningless anyway).

mar · Post by **mar** » Mon Feb 24, 2020 10:13 am

A bit late but I ran a simple test age vs equi-distributed draft (plain, no always-replace entry)
My computer froze after 8.5k bullet games, but I think that's enough games...

The result is 0.6 +- 5.2 elo in favor of equi-distributed draft (=even). I'd have to run more tests (longer TC, ponder on, SMP),
but I think it's safe to say that your idea indeed works and can be used instead of aging, this means
more TT bits to use for something else.

SIMD methods in TT probing and replacement

Re: SIMD methods in TT probing and replacement

Re: SIMD methods in TT probing and replacement

Re: SIMD methods in TT probing and replacement

Re: SIMD methods in TT probing and replacement

Re: SIMD methods in TT probing and replacement