help with transposition table

Carbec · Post by **Carbec** » Wed Dec 28, 2022 4:46 pm

Hello,
I use the definition of Bruce Moreland.
If I need the index to store/retrieve an antry, I use : index = key % size.
But in another code (Sungorus), I saw that the index is calculated like this :
index = (key & tt_mask);
in that case : tt_mask = tt_size - 4;
And after, there is a loop of 4, as if were 4 indexes.
I looked at the wiki, and it seems that it's related to have several "buckets" ?
But the wiki is not really clear.

Could someone explain the use of the operator &, and also why take size-4 ?
Im am sorry for all my questions, but Im trying to undertand what I am coding.

Thanks a lot

gaard · Post by **gaard** » Wed Dec 28, 2022 5:43 pm

Carbec wrote: ↑Wed Dec 28, 2022 4:46 pm Hello,
I use the definition of Bruce Moreland.
If I need the index to store/retrieve an antry, I use : index = key % size.
But in another code (Sungorus), I saw that the index is calculated like this :
index = (key & tt_mask);
in that case : tt_mask = tt_size - 4;
And after, there is a loop of 4, as if were 4 indexes.
I looked at the wiki, and it seems that it's related to have several "buckets" ?
But the wiki is not really clear.

Could someone explain the use of the operator &, and also why take size-4 ?
Im am sorry for all my questions, but Im trying to undertand what I am coding.

Thanks a lot

A module of four is equivalent to an and of 4 - 1. On some CPUs the and will be faster than the modulo. When size is a power of two the two operations will produce the same result.

hgm · Post by **hgm** » Wed Dec 28, 2022 5:45 pm

The % operator is very slow on most CPUs (if the divisor is not known at compile time); In Crafty using this method for deriving index from key slowed it down by some 33%!

For this reason most engines restrict the hash table size to a power of two. In that case the modulo operation can be achieved by simply masking away the most-significant bits through an AND operation. (Which takes only a single clock cycle, and thus is as fast as they come.) If the table size is a power of two, the mask you need to get key % tableSize is tableSize - 1.

When you use tableSize - 4 instead, the mask does not only clear the high-order bits, but also the two lowest-order bits. This makes the derived index a multiple of 4. The next 3 table entries would never be indexed directly. But you can use the group of 4 consecutive entries starting at the derived index as a 'bucket', i.e. alternative locations where the info on the position with the given could be stored. This allows you more freedom in preserving the valuable entries (e.g. those from deep searches) in the table. E.g. by overwriting the least deep entry in the bucket with the new entry that you want to store. But you would then of course have to look in all 4 slots where the info could have been stored before you can be sure that it is not in the table.

Carbec · Post by **Carbec** » Fri Dec 30, 2022 3:02 pm

Hello,

Many thanks for the explanation.

I still have some questions. If the table size (entries number) is a power of two, is it necessary that
the entry size (in bytes) should be also a power of two ?

So in my exemple , the bucket has 4 entries, but all four entries are for a same position;
By consequence I can store 4 times less positions (hash keys) into the transposition table ?
Do I have to make the transposition table 4 times larger to compensate for ?

I don't understand exactly why having the possibility to store the same position 4 times is beneficial.
Is it something relatif to the replacement strategy ?

Thanks a lot for the help,

Happy new year !!

Philippe

Bo Persson · Post by **Bo Persson** » Fri Dec 30, 2022 3:22 pm

Carbec wrote: ↑Fri Dec 30, 2022 3:02 pm
I don't understand exactly why having the possibility to store the same position 4 times is beneficial.

It is not the same position, but several diferent positions that happens to get the same location.

If you do key % size, you will get the same result for a whole bunch of keys.

lithander · Post by **lithander** » Fri Dec 30, 2022 10:46 pm

hgm wrote: ↑Wed Dec 28, 2022 5:45 pm The % operator is very slow on most CPUs (if the divisor is not known at compile time); In Crafty using this method for deriving index from key slowed it down by some 33%!

I had tried replacing the % before but tried it again today:

The old code did it like this:

Code: Select all

index = (int)(hash % (ulong)_table.Length);

The new code does it like this:

Code: Select all

_mask = (1UL << powerOfTwo) - 1; //Only once at initialization
index = (int)(hash & _mask); //always

When I did 10 runs on the startpos to depth 18 and compared the best result it was 4199130 nps for the MODULO version vs 4225296 nps for the version using AND with a mask. That's 0.6% faster... the variance between runs was much larger.

If the UCI protocol allows a user to set something like 50MB as a Hash-size he does not necessarily expect the engine to silently interpret that as being 64MB instead. I think it's not worth it for 0.6% to ignore the users explicit request here!

I'm really surprised that for Crafty it was such a slow-down because I don't think the compiler is doing anything clever in my example. It's most likely really computing the modulo and wasting 30+ cycles on it but compared to the time it takes to fetch the value from ram it's just a negligible cost. The other explanation may be that in my engine the Hash query is just not enough of a bottleneck for details like this to matter.

hgm · Post by **hgm** » Fri Dec 30, 2022 11:05 pm

No, he should set it as 32MB. It is never allowed to use more than the user grants you.

Was table.Length known at compile time? In that case the optimizer would optimize the division away.

lithander · Post by **lithander** » Sat Dec 31, 2022 1:43 am

hgm wrote: ↑Fri Dec 30, 2022 11:05 pm Was table.Length known at compile time? In that case the optimizer would optimize the division away.

Here's the output of the modulo version where setting the Hash option triggers the dynamic reallocation of a non-power-of-two array. So there would be nothing to optimize for the compiler even if the JITer would consider changing the machine code at runtime based on the array size. (unlikely)

Code: Select all

Leorik 2.3
setoption name Hash value 63
go depth 18
info depth 1 score cp 62 nodes 21 nps 10500 time 2 pv d2d4
info depth 2 score cp 0 nodes 89 nps 5562 time 16 pv d2d4 d7d5
info depth 3 score cp 47 nodes 262 nps 15411 time 17 pv d2d4 d7d5 g1f3
info depth 4 score cp 0 nodes 494 nps 29058 time 17 pv d2d4 d7d5 g1f3 g8f6
info depth 5 score cp 29 nodes 1239 nps 72882 time 17 pv d2d3 d7d5 g1f3 g8f6 c2c3
info depth 6 score cp 0 nodes 3287 nps 182611 time 18 pv d2d4 d7d5 g1f3 g8f6 b1c3 b8c6
info depth 7 score cp 26 nodes 5928 nps 312000 time 19 pv d2d4 d7d5 g1f3 g8f6 b1c3 b8c6 e2e3
info depth 8 score cp 0 nodes 12425 nps 564772 time 22 pv d2d4 d7d5 g1f3 g8f6 b1c3 b8c6 e2e3 e7e6
info depth 9 score cp 42 nodes 82092 nps 1520222 time 54 pv e2e4 d7d5 e4d5 d8d5 b1c3 d5e5 f1e2 e5g5 e2f3
info depth 10 score cp 8 nodes 178352 nps 1877389 time 95 pv e2e4 e7e5 g1f3 g8f6 b1c3 b8c6 f1b5 f8b4 e1g1 e8g8
info depth 11 score cp 38 nodes 397597 nps 2126187 time 187 pv g1f3 d7d5 d2d4 g8f6 e2e3 c8g4 h2h3 g4f3 d1f3 e7e6 b1d2
info depth 12 score cp 20 nodes 1016833 nps 2872409 time 354 pv e2e4 e7e5 g1f3 g8f6 d2d4 e5d4 e4e5 f6e4 d1d4 d7d5 e5d6 e4d6
info depth 13 score cp 42 nodes 2097565 nps 3399619 time 617 pv e2e4 e7e5 g1f3 b8c6 b1c3 g8f6 f1b5 f8d6 e1g1 e8g8 f1e1 c6d4 f3d4
info depth 14 score cp 17 nodes 4058541 nps 3785952 time 1072 pv e2e4 e7e5 g1f3 b8c6 b1c3 g8f6 d2d4 f8b4 d4d5 c6e7 f1d3 b4c3 b2c3 d7d6
info depth 15 score cp 29 nodes 9776391 nps 4053230 time 2412 pv g1f3 g8f6 d2d4 e7e6 e2e3 c7c5 c2c4 c5d4 e3d4 d7d5 b1c3 f8b4 c4d5 f6d5 f1b5
info depth 16 score cp 21 nodes 24511996 nps 4192234 time 5847 pv e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 b5c6 d7c6 d2d3 f8d6 e1g1 g8f6 d3d4 f6e4 f1e1 c8f5
info depth 17 score cp 31 nodes 52465187 nps 4193189 time 12512 pv e2e4 e7e5 g1f3 g8f6 f3e5 d7d6 e5f3 f6e4 d1e2 d8e7 d2d3 e4f6 g2g3 b8c6 b1c3 c8e6 f1g2
info depth 18 score cp 28 nodes 85400468 nps 4186913 time 20397 pv e2e4 e7e5 g1f3 g8f6 f3e5 d7d6 e5f3 f6e4 d2d3 e4f6 d3d4 c8g4 f1d3 b8c6 c2c3 d8e7 c1e3 g4f3
bestmove e2e4

Didn't run it 10x this time but the 4186913nps is in the same ballpark as what I previously quoted.

help with transposition table

help with transposition table

Re: help with transposition table

Re: help with transposition table

Re: help with transposition table

Re: help with transposition table

Re: help with transposition table

Re: help with transposition table

Re: help with transposition table