The effect of larger hash size

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: The effect of larger hash size

Post by Sven »

mar wrote:
Sven Schüle wrote: I don't understand your remark. "key & 0xffffffffull" and "key >> 32" are not the same, so it seems you want to imply that you could use either the lower or the upper 32 bits of the hash key in that way. But can you explain why and how that works, in either of the two cases, to multiply by "nelem" (which I assume to be the total number of hash table entries) and then to use the upper 32 bits of the result? To me it seems this operation will often result in "0", at least for smaller hash tables.

Also, why is a multiplication by "nelem" significantly faster than a "modulo nelem"?

The "traditional" way is to use a "modulo nelems" operation, which can be optimized (manually or by the compiler) with a bitwise operation if "nelems" (resp. "nIndices") is a power of 2.

Sven
Ok I will try, hope you don't mind :)
I think Volker's point is that it doesn't really matter whether you use the lower 32 bits or the upper 32 bits to compute the entry index. While the two are not functionally equivalent, the choice shouldn't matter at all, because you should have good pseudorandom bits in either the upper or the lower part of the 64-bit key.
Of course once you have more than 2G entries the distribution will probably not be very good.
How it works: think of 0..0xffffffff as a real number in [0, 1); this trick was called fixed point in the good old times.
So you multiply it by the entry range and keep only the integer part.
Since Zobrist hashes should have a good distribution of bits, it should work just fine.
As for performance: integer division is always slow. Of course, if you know in advance that entry_count is a power of two, you can use a mask instead.
But the point is that forcing the TT size to be a power of two means you won't be able to utilize memory efficiently. Say you have 16 gigs of RAM; some goes to the system and so on, so you can only utilize up to 8 gigs for the hash table if you stick to powers of two.
Using Volker's method you should be able to utilize say 12 gigs without problems.
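(For concreteness, a minimal sketch of this multiplication-based indexing, assuming a 64-bit Zobrist key and a table of nelems entries; the function name is just for illustration:)

Code: Select all

#include <stdint.h>

/* Map a 64-bit Zobrist key to an index in [0, nelems): the low 32 bits are
   treated as a fixed-point fraction in [0, 1) and scaled by the table size.
   Using the upper 32 bits (key >> 32) instead works the same way. */
static uint32_t hash_index(uint64_t key, uint32_t nelems)
{
    return (uint32_t)(((key & 0xffffffffULL) * nelems) >> 32);
}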
Now I understand the idea. For nelems > 2G (still quite unusual today) you could use more than 32 bits in the formula above. In my opinion "nelems" should be called "nIndices" as in "number of hash table indices" which is not necessarily the same as "number of hash table elements" if you have more than one slot per index.

If "nIndices" is a power of two, say 2^K, then the "multiplication" approach uses the upper K of the 32 (or more) bits for the index while the "modulo" approach uses the lower K bits. In general this should be almost equivalent, although I am not perfectly sure whether any trouble might arise from using the upper K bits when those 32 bits are the lower 32 of the key, since then long runs of consecutive hash keys are mapped to the same table slot. It is just a doubt in my head; I can't explain why.

If the only advantage of the "multiplication" approach is the ability to use more than 50% of the available RAM for the transposition table then I would still favor the "modulo" approach together with the restriction to a power of 2 as the number of hash table indices, in order to allow for a small speedup by using a bitwise AND instead of the "modulo".

For those programs where the difference between using 8GB and, say, 12GB TT hash is really important (I doubt there are many!), it would be an option to buy 20 or 24GB RAM instead of "only" 16GB, so that a doubling to 16GB TT hash would be possible with the "power of 2" approach.

Sven
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The effect of larger hash size

Post by hgm »

The multiply approach can be used not only to determine the hash bucket, but also to determine the entry within the bucket. E.g. in Joker I have 7 entries per bucket, 1 store-always and 6 depth-preferred. The primary probe goes to entry 0-4, the secondary goes one higher (1-5), and if neither contains it nor is a candidate for replacement (e.g. the depth to store is smaller than in both probed entries), it ends up in entry 6. So I need to generate a random number with 5 possibilities from the hash key, which is not a power of two. By taking 5LL*keyL>>32 it uses the high bits of keyL to select the entry, while the low bits of keyL are used to determine the bucket by the normal masking (as Joker requires a power-of-two hash size). The other 32 key bits (keyH) go into the entry as the signature.
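(A rough sketch of that probing scheme with hypothetical names, assuming a power-of-two number of buckets of 7 entries each; this is not Joker's actual code:)

Code: Select all

#include <stdint.h>

/* Sketch of a Joker-style probe: the low bits of keyL pick the bucket by
   masking, the high bits pick the primary depth-preferred entry 0..4 via
   5*keyL >> 32. The secondary probe is primary+1, entry 6 is the
   store-always slot, and keyH is the signature stored in the entry. */
static void locate_probe(uint64_t key, uint32_t num_buckets,
                         uint32_t *bucket, uint32_t *primary, uint32_t *signature)
{
    uint32_t keyL = (uint32_t)key;
    uint32_t keyH = (uint32_t)(key >> 32);

    *bucket    = keyL & (num_buckets - 1);        /* low bits: which bucket   */
    *primary   = (uint32_t)((5ULL * keyL) >> 32); /* high bits: entry 0..4    */
    *signature = keyH;                            /* stored for verification  */
}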
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The effect of larger hash size

Post by michiguel »

Sven Schüle wrote:
mar wrote:
Sven Schüle wrote: I don't understand your remark. "key & 0xffffffffull" and "key >> 32" are not the same, so it seems you want to imply that you could use either the lower or the upper 32 bits of the hash key in that way. But can you explain why and how that works, in either of the two cases, to multiply by "nelem" (which I assume to be the total number of hash table entries) and then to use the upper 32 bits of the result? To me it seems this operation will often result in "0", at least for smaller hash tables.

Also, why is a multiplication by "nelem" significantly faster than a "modulo nelem"?

The "traditional" way is to use a "modulo nelems" operation, which can be optimized (manually or by the compiler) with a bitwise operation if "nelems" (resp. "nIndices") is a power of 2.

Sven
Ok I will try, hope you don't mind :)
I think Volker's point is that it doesn't really matter whether you use the lower 32 bits or the upper 32 bits to compute the entry index. While the two are not functionally equivalent, the choice shouldn't matter at all, because you should have good pseudorandom bits in either the upper or the lower part of the 64-bit key.
Of course once you have more than 2G entries the distribution will probably not be very good.
How it works: think of 0..0xffffffff as a real number in [0, 1); this trick was called fixed point in the good old times.
So you multiply it by the entry range and keep only the integer part.
Since Zobrist hashes should have a good distribution of bits, it should work just fine.
As for performance: integer division is always slow. Of course, if you know in advance that entry_count is a power of two, you can use a mask instead.
But the point is that forcing the TT size to be a power of two means you won't be able to utilize memory efficiently. Say you have 16 gigs of RAM; some goes to the system and so on, so you can only utilize up to 8 gigs for the hash table if you stick to powers of two.
Using Volker's method you should be able to utilize say 12 gigs without problems.
Now I understand the idea. For nelems > 2G (still quite unusual today) you could use more than 32 bits in the formula above.
But that could overflow (if I am not mistaken).
However, a general approach can use all possible bits if we assume the table is divided into "chunks", each of which has a power-of-two size, such that this holds:

max_indices = max_chunks * chunk_size

Then,

chunk = ((key >> 32) * max_chunks) >> 32;

x = key & (chunk_size - 1); // index within the chunk

index = chunk * chunk_size + x;

The chunk_size determines the granularity.
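(A minimal sketch of this chunked indexing, assuming chunk_size is a power of two and max_chunks * chunk_size covers the usable entries; the names are illustrative:)

Code: Select all

#include <stdint.h>

/* Upper 32 key bits pick the chunk via the fixed-point multiply,
   lower key bits pick the slot within the chunk. */
static uint64_t chunked_index(uint64_t key, uint32_t max_chunks, uint64_t chunk_size)
{
    uint64_t chunk = ((key >> 32) * max_chunks) >> 32;  /* 0 .. max_chunks-1  */
    uint64_t x     = key & (chunk_size - 1);            /* 0 .. chunk_size-1  */
    return chunk * chunk_size + x;                      /* 0 .. max_indices-1 */
}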

Miguel
PS: Volker's trick is so simple and cool! It should go to chessprogramming wiki.

Volker Annuss
Posts: 180
Joined: Mon Sep 03, 2007 9:15 am

Re: The effect of larger hash size

Post by Volker Annuss »

mar wrote:Ok I will try, hope you don't mind :)
Thank you for explaining. I could not have done better.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: The effect of larger hash size

Post by Sven »

michiguel wrote:
Sven Schüle wrote:Now I understand the idea. For nelems > 2G (still quite unusual today) you could use more than 32 bits in the formula above.
But that could overflow (if I am not mistaken).
Yes, you are right. I initially thought of

Code: Select all

((key >> (64 - N_INDEX_BITS)) * nIndices) >> N_INDEX_BITS
when using the upper N_INDEX_BITS of the key, or

Code: Select all

((key & ((1ULL << N_INDEX_BITS) - 1)) * nIndices) >> N_INDEX_BITS
when using the lower N_INDEX_BITS of the key. But in both cases the multiplication can overflow (as soon as N_INDEX_BITS plus the bit width of nIndices exceeds 64), which may become another disadvantage with very large hash tables, since the non-overflowing use of only 32 bits will not help there.

Sven
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The effect of larger hash size

Post by hgm »

To make this work well, the number of bits to multiply with the table size should really be a few more than what you minimally need to address the table. Otherwise you will get a very uneven mapping of hash keys to entries. E.g. if you have 3G entries and use 32 bits, you are basically implementing a multiplication by 0.75. Which means 0 & 1 map to 0, 2 maps to 1, 3 maps to 2, and then the pattern repeats. So entry 0 (and 3, 6, ...) will have double the probe rate of the other entries, which is not optimal.
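(A quick way to see that unevenness; this little sketch just prints where the first few key values land for a hypothetical table of 3G entries:)

Code: Select all

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t size = 3ULL << 30;                 /* 3G entries */
    for (uint64_t key = 0; key < 8; key++) {
        uint64_t index = (key * size) >> 32;    /* 32-bit fixed-point multiply */
        printf("key %llu -> entry %llu\n",
               (unsigned long long)key, (unsigned long long)index);
    }
    /* prints 0 0 1 2 3 3 4 5: entries 0, 3, ... receive two keys each */
    return 0;
}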

Note that the whole scheme is basically using the division key*size/MAX_KEY to calculate the mapping (which is a contiguous mapping), rather than the modulo key%size (which wraps around many times). Because size and MAX_KEY are constants, this can be optimized by pre-computing (size/MAX_KEY) and then multiplying by it, provided you do not round as in integer division, but keep the fraction sufficiently accurate that the rounding error remains much smaller than one after multiplication with the key. The i386 assembly instructions that produce a double-precision integer product in a pair of registers would always allow you to do that, but it might not be possible to force the compiler to generate them (for lack of a uint_128 data type). Otherwise you would simply write

Code: Select all

uint_64 key, size;
uint_64 index = ((uint_128) size * key) >> 64;
in 64-bit mode.
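(On compilers that do offer a 128-bit integer type, e.g. GCC or Clang on 64-bit targets, a sketch of the same idea could look like this; the function name is illustrative:)

Code: Select all

#include <stdint.h>

/* index = key * size / 2^64, taken from the upper half of the 128-bit product */
static uint64_t mul_index(uint64_t key, uint64_t size)
{
    return (uint64_t)(((unsigned __int128)size * key) >> 64);
}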

An alternative would be to use floating point:
double factor = size / (double) MAX_KEY;

int64_t index = factor * key;
This would work for sizes up to 2^50 or so.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The effect of larger hash size

Post by bob »

Rebel wrote:
jacobbl wrote: Using 5 mill entries slows the engine marginally....
Shouldn't be the case indeed.
Do these results indicate a bug with the hashing, or shouldn't I expect any gain from increasing the hash size?
Likely a bug yes. Another possibility is that you are using the TT for other purposes also.
Some simple math. 5 million entries = how much memory? I'd speculate at least 12 bytes per entry, so 60 megs of memory. 60,000,000 / 4k per page = 15,000 pages. Way bigger than any known TLB. That could slow things down a bit until you resort to jumbo pages, which would reduce that to 15 TLB entries. IF your particular CPU type has 15 TLB entries for large pages...

Otherwise, I would agree with you...
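(Regarding the jumbo/huge pages mentioned above, a hedged sketch of how one might request them on Linux; MAP_HUGETLB is Linux-specific and assumes the system has huge pages reserved, with a fallback to normal pages otherwise:)

Code: Select all

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to back the hash table with huge pages; fall back to normal pages. */
static void *alloc_hash(size_t bytes)
{
    void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (mem == MAP_FAILED)
        mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}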
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The effect of larger hash size

Post by hgm »

What point are you trying to make? If 5M entries is far beyond the size of any TLB, 1M entries should also be beyond the size of any TLB. So hash probes cause a TLB miss no matter what. 60MB is a small hash size by modern standards.

In theory hash probes could be 'focused', by not using totally random Zobrist keys. I never succeeded in getting a speedup from this, however. It is very easy to focus too much, so that your tree no longer uses the entire hash table. E.g. you could 'section' your TT into sections that will not cause TLB overflow, such that the Pawn structure determines what section you map to (by zeroing the bits in the Zobrist keys from which the section number will be derived in all non-Pawn pieces). As relatively few moves are Pawn moves, this means you stay probing in the same section quite a lot. But if the number of Pawn structures found in the tree is much smaller than the number of sections, you would not use the entire table.
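(A hedged sketch of what such 'sectioning' might look like; the section width and names are assumptions for illustration, not code from any particular engine:)

Code: Select all

#include <stdint.h>
#include <stdlib.h>

/* Reserve the top SECTION_BITS of every Zobrist key for a 'section' number
   and zero those bits in the keys of all non-Pawn pieces. Only Pawn moves
   can then change the section, so probes cluster in one TLB-friendly region. */
#define SECTION_BITS 6   /* 64 sections, chosen arbitrarily for this example */

static uint64_t random64(void)              /* crude placeholder PRNG */
{
    return ((uint64_t)rand() << 42) ^ ((uint64_t)rand() << 21) ^ (uint64_t)rand();
}

static uint64_t zobrist_key(int piece_is_pawn)
{
    uint64_t key = random64();
    if (!piece_is_pawn)
        key &= ~0ULL >> SECTION_BITS;       /* clear the section-selecting bits */
    return key;
}

static uint64_t section_of(uint64_t hash)   /* section derived from the key */
{
    return hash >> (64 - SECTION_BITS);
}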
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The effect of larger hash size

Post by bob »

hgm wrote:What point are you trying to make? If 5M entries is far beyond the size of any TLB, 1M entries should also be beyond the size of any TLB. So hash probes cause a TLB miss no matter what. 60MB is a small hash size by modern standards.

In theory hash probes could be 'focused', by not using totally random Zobrist keys. I never succeeded in getting a speedup from this, however. It is very easy to focus too much, so that your tree no longer uses the entire hash table. E.g. you could 'section' your TT into sections that will not cause TLB overflow, such that the Pawn structure determines what section you map to (by zeroing the bits in the Zobrist keys from which the section number will be derived in all non-Pawn pieces). As relatively few moves are Pawn moves, this means you stay probing in the same section quite a lot. But if the number of Pawn structures found in the tree is much smaller than the number of sections, you would not use the entire table.
Point was that it IS possible that hash probes slow the search down, even though the effect on the search is beneficial. Somewhat akin to egtb probes, which are certainly beneficial if you don't factor in the cost of doing the I/O.

Big pages mitigate the effect: when you get a TLB miss, you still have to do much less work for the virtual-to-physical translation than with small pages.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: The effect of larger hash size

Post by Don »

Hi Jacob,

You should see something like a 6% speedup for each doubling of hash table size - but there are 2 things that affect that:

1. cache behavior

2. hash table saturation.

If the current hash table size is adequate, increasing it will not provide any improvement - the 7% rule only applies if the hash table fills up quickly and the program fully benefits from increasing it.

The larger the table, the more negative impact it can have on the cache. So it can be counter-productive to increase the hash table size. You might be able to tell by looking at the nodes per second and seeing whether they are suddenly impacted by a hash table increase.

I have seen cases with an overnight search on some position where the ideal hash table size slowed the program down by a factor of 2 or more in nodes per second. And yet the benefit from the increased table size was enough to make this a good trade-off.

So what you should look at is any reduction in nodes, and you should instrument the hash table utilization. One position is not good enough either; if you are looking at the impact, run several positions, as there is a random effect of hash table size changes on the search.

Don

jacobbl wrote:Hi, I've been testing the effect of increasing the hash size, and it doesn't seem to help my engine Sjakk. I use a double hash table of 1 mill entries, and I tried to increase this to 5 mill entries. The engine does about 1 mill nps.

When I tested this with CCRL adjusted time control to about 40/20 I got the following result:

Sjakk1Mill 54,4%, Sjakk5Mill 45,6%

So the old version played better. I only played about 500 games, which of course is a bit few, but testing at these time limits takes time... Still, 500 games with these scores should be enough to tell that the 5Mill version is not significantly better.

Using 5 mill entries slows the engine marginally, but not something that should reduce the score in any significant manner.

Do these results indicate a bug with the hashing, or shouldn't I expect any gain from increasing the hash size?

Regards
Jacob
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.