SIMD methods in TT probing and replacement
Moderators: hgm, Rebel, chrisw
-
- Posts: 300
- Joined: Mon Apr 30, 2018 11:51 pm
Re: SIMD methods in TT probing and replacement
SIMD trickery is getting increasingly common in hash tables in general; see e.g. Google's SwissTables. If you byte-align your “control word” (store 8 bits per hash entry), you can use SSE2 parallel comparison (PCMPEQB) and get out one bit per (potentially) matching entry, using PMOVMSKB, which also conveniently puts the result in an integer register where you can easily compare it to zero, run BSF on it, and so on. It's a very quick way to weed out 99%+ of mismatches in just a few cycles.
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: SIMD methods in TT probing and replacement
Yes, that was the initial idea. Originally I proposed the older MMX instructions for this; these also have an 8x8bit compare. Problem is that you have to use compiler intrinsics, and that it often takes extra instructions to shuttle the data into the SIMD registers. This is why the ordinary uint64 arithmetic can still be competitive. I have gotten a bit out of touch with CPU development these past years, but I am not sure if SSE2 instructions can be executed in a super-scalar way. For ordinary uint64 arithmetic you can draw on 3 ALUs to execute the instructions in parallel.
Another advantage is that you are not subject to the pre-defined division of the SIMD operands; you can also do 7x9 or 6x10 patterns.
Another advantage is that you are not subject to the pre-defined division of the SIMD operands; you can also do 7x9 or 6x10 patterns.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: SIMD methods in TT probing and replacement
No. I only counted the updates when READING and using the info. Just over-writing I did not count since I am doing a write there anyway. As far as the test, I used the opening position. I played e4, let Crafty think a bit, and it played e5. I played Nf3 and repeated. For the first 8-10 moves (as far as I went) the numbers were similar. Again note my explanation above. I only had to do an unnecessary update if the signature matched AND the result was useful (IE UPPER, LOWER or EXACT.) I did not count signature matches with an unused value since I don't update age there either.hgm wrote: ↑Thu Feb 20, 2020 3:10 pmHow is this possible? I would think that any time you hit upon an entry that was created in the previous search, you would have to update its age. So this result in fact says that only 170K of the entries created by the previous search was visited by a subsequent search of 100M+ nodes.
I suppose this previous search was 2 ply earlier, without pondering in between. That means the root of the second search was somewhere at ply level 2 in the tree of the first search. If the opponent move was the expected one, it would even be on the PV of the first search. If we assume an EBF of about 3, a 2-ply less deep search from a PV node should take about 9 times fewer nodes. So about 11% of the nodes of the first search (i.e. 55M) would be expected to occur in the tree of the second search.
There is a HUGE discrepancy between that 55M and the reported 170K. How can that be explained? Was the intervening move of the opponent a completely unexpected one? (That would then not be very typical, as ponder-hit rate is typically ~60%.) Was the TT very small, so that it could not hold much more than 170K positions in the first place? Is the hashing algorithm such that hardly any entry stays in the TT very long, so that the 500M search flushed out almost all 55M relevant entries by its own low-depth stuff?
Hope that helps..
-
- Posts: 300
- Joined: Mon Apr 30, 2018 11:51 pm
Re: SIMD methods in TT probing and replacement
You just do the load directly into the SSE2 register (with movdqa). No shuffling needed; it just doesn't go through the integer registers.hgm wrote: ↑Thu Feb 20, 2020 11:16 pm Yes, that was the initial idea. Originally I proposed the older MMX instructions for this; these also have an 8x8bit compare. Problem is that you have to use compiler intrinsics, and that it often takes extra instructions to shuttle the data into the SIMD registers.
Haswell and up can do three vectorized (AVX) fused-multiply-add operations per cycle, _plus_ one integer instruction It's superscalar alright.I have gotten a bit out of touch with CPU development these past years, but I am not sure if SSE2 instructions can be executed in a super-scalar way. For ordinary uint64 arithmetic you can draw on 3 ALUs to execute the instructions in parallel.
Sure, but that benefit would have to be fairly big to stay competitive, unless the cost of the memory read is so big that is basically dwarfs everything (which it very well might, but if so, SSE2 is meaningless anyway).Another advantage is that you are not subject to the pre-defined division of the SIMD operands; you can also do 7x9 or 6x10 patterns.
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: SIMD methods in TT probing and replacement
A bit late but I ran a simple test age vs equi-distributed draft (plain, no always-replace entry)
My computer froze after 8.5k bullet games, but I think that's enough games...
The result is 0.6 +- 5.2 elo in favor of equi-distributed draft (=even). I'd have to run more tests (longer TC, ponder on, SMP),
but I think it's safe to say that your idea indeed works and can be used instead of aging, this means
more TT bits to use for something else.
My computer froze after 8.5k bullet games, but I think that's enough games...
The result is 0.6 +- 5.2 elo in favor of equi-distributed draft (=even). I'd have to run more tests (longer TC, ponder on, SMP),
but I think it's safe to say that your idea indeed works and can be used instead of aging, this means
more TT bits to use for something else.
Martin Sedlak