Hi, I've been testing the effect of increasing the hash size, and it doesn't seem to help my engine Sjakk. I use a double hash table of 1 million entries, and I tried to increase this to 5 million entries. The engine does about 1 million nps.
When I tested this with CCRL adjusted time control to about 40/20 I got the following result:
Sjakk1Mill 54.4%, Sjakk5Mill 45.6%
So the old version played better. I only played about 500 games, which of course is rather few, but testing at these time limits takes time... Still, 500 games with these scores should be enough to tell that the 5Mill version is not significantly better.
Using 5 million entries slows the engine down marginally, but not by enough to reduce the score in any significant manner.
Do these results indicate a bug with the hashing, or shouldn't I expect any gain from increasing the hash size?
Regards
Jacob
The effect of larger hash size
- Posts: 2929
- Joined: Sat Jan 22, 2011 12:42 am
- Location: NL
Re: The effect of larger hash size
If you haven't already I suggest making the number of entries in the table a power of two. That way you can calculate the index from the hash key using a bit-wise and (key & (nelem-1)), which is (much) faster than the modulo (key%nelem) you have to use if it isn't.
Don't know that that would explain your difference though. What replacement strategy are you using? I think I used to see worse performance for larger hash sizes with Jazz, but that effect went away after bug fixing (don't remember what the bug I fixed was though; I think I was trying to be too clever with the replacement scheme).
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: The effect of larger hash size
With a good replacement scheme the search speed is only very weakly dependent on the hash size (and of course the tree must be much larger than the hash size to see any effect at all). With equidistributed-depth replacement I found that making the table 4096 times smaller only slowed down the search (time-to-depth) by a factor of two. That suggests it scales as a twelfth root. (Which again would correspond to 8 Elo per doubling of the hash size.)
- Posts: 6994
- Joined: Thu Aug 18, 2011 12:04 pm
Re: The effect of larger hash size
jacobbl wrote: Using 5 mill entries slows the engine marginally...
Shouldn't be the case indeed.
jacobbl wrote: Does these results indicate a bug with the hashing or shouldn't I suspect any gain from increasing the hash size?
Likely a bug, yes. Another possibility is that you are also using the TT for other purposes.
- Posts: 80
- Joined: Wed Feb 17, 2010 3:57 pm
Re: The effect of larger hash size
Thanks for your help. Changing from modulo to the bitwise AND gave a marginal increase in nps.
I've looked through my replacement strategy, and I might have a bug in my aging variable. I will have a deeper look into it.
I also never replace a PV node with a non-PV node, but in my second hash table I always replace, so I don't think old PV nodes will be a problem.
Jacob
- Posts: 180
- Joined: Mon Sep 03, 2007 9:15 am
Re: The effect of larger hash size
Evert wrote: If you haven't already I suggest making the number of entries in the table a power of two. That way you can calculate the index from the hash key using a bit-wise and (key & (nelem-1)), which is (much) faster than the modulo (key%nelem) you have to use if it isn't.
You can use hash tables of any size without modulo. It is not as fast as bitwise-and, but still much faster than modulo, even in 32-bit code.
Code: Select all
((key & 0xffffffffull) * nelem) >> 32
or
Code: Select all
((key >> 32) * nelem) >> 32
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: The effect of larger hash size
Volker Annuss wrote: You can use hash tables of any size without modulo. It is not as fast as bitwise-and but still much faster than modulo even in 32 bit code.
Code: Select all
((key & 0xffffffffull) * nelem) >> 32
or
Code: Select all
((key >> 32) * nelem) >> 32
Simple and clever, Volker
This access on x86 should boil down to a single mul, using edx as the index, so it should be very fast.
Avoiding divisions is especially advisable on ARM, as it doesn't have integer division (not sure if that still holds for modern ARMs), so div/mod is extremely slow because it has to be emulated in a loop. Of course, division by a constant will be optimized by the compiler.
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: The effect of larger hash size
Volker Annuss wrote: You can use hash tables of any size without modulo. It is not as fast as bitwise-and but still much faster than modulo even in 32 bit code.
Evert wrote: If you haven't already I suggest making the number of entries in the table a power of two. That way you can calculate the index from the hash key using a bit-wise and (key & (nelem-1)), which is (much) faster than the modulo (key%nelem) you have to use if it isn't.
Code: Select all
((key & 0xffffffffull) * nelem) >> 32
or
Code: Select all
((key >> 32) * nelem) >> 32
I don't understand your remark. "key & 0xffffffffull" and "key >> 32" are not the same, so it seems you want to imply that you could use either the lower or the upper 32 bits of the hash key in that way. But can you explain why and how that works, in either of the two cases, to multiply by "nelem" (which I assume to be the total number of hash table entries) and then to use the upper 32 bits of the result? To me it seems this operation will often result in "0", at least for smaller hash tables.
Also, why is a multiplication by "nelem" significantly faster than a "modulo nelem"?
The "traditional" way is to use a "modulo nelems" operation, which can be optimized (manually or by the compiler) with a bitwise operation if "nelems" is a power of 2.
Sven
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: The effect of larger hash size
Sven Schüle wrote: I don't understand your remark. "key & 0xffffffffull" and "key >> 32" are not the same, so it seems you want to imply that you could use either the lower or the upper 32 bits of the hash key in that way. But can you explain why and how that works, in either of the two cases, to multiply by "nelem" (which I assume to be the total number of hash table entries) and then to use the upper 32 bits of the result? To me it seems this operation will often result in "0", at least for smaller hash tables.
Also, why is a multiplication by "nelem" significantly faster than a "modulo nelem"?
The "traditional" way is to use a "modulo nelems" operation, which can be optimized (manually or by the compiler) with a bitwise operation if "nelems" is a power of 2.
Sven
Ok, I will try, hope you don't mind.
I think Volker's point is that it doesn't really matter whether you use the lower or the upper 32 bits to compute the entry index. While the two are not functionally equivalent, the choice shouldn't matter at all, because you should have good pseudorandom bits in either the upper or the lower part of the 64-bit key.
Of course once you have more than 2G entries the distribution will probably not be very good.
How it works: think of the value in 0..0xffffffff as a real number in [0, 1); this trick was called fixed point in the good old times.
So you multiply it by the number of entries and keep only the integer part.
Since Zobrist hashes should have a good distribution of bits, it should work just fine.
As for performance: integer division is always slow. Of course, if you know in advance that the entry count is a power of two, you can simply mask.
But the point is that forcing the TT size to a power of two means you won't be able to utilize memory efficiently. Say you have 16 gigs of RAM; some goes to the system and so on, so you can only utilize up to 8 gigs for the hash table if you use powers of two.
Using Volker's method you should be able to utilize say 12 gigs without problems.
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: The effect of larger hash size
The point is that taking the 32 low-order bits of a longer variable can be done by simply moving the variable (e.g. from memory) to a 32-bit register. And the i386 architecture has an instruction that multiplies two 32-bit registers to produce a 64-bit result, stored in two registers: one holds the low-order 32 bits, the other the high-order 32 bits. You then simply use the latter to get the same effect as doing >>= 32.
Integer division (used to implement the modulo operator, as it produces the quotient in one register and the remainder in another) is an extremely costly instruction. Bob once reported here a decrease of ~30% in Crafty's NPS just from replacing the masking with the & operator by the more general % operator to get the hash index.