hgm wrote:It is quite normal that hash probes are slow: they need a DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.
Whether your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.
Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if a single probe touches two cache lines (requiring two DRAM accesses to fetch, possibly triggering prefetch of a third). To check, print the start address of your hash table and see whether it is divisible by 64 (printf("hash start = %p\n", (void *)hashTable);).
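To make that check concrete, here is a minimal stand-alone sketch; the hashTable name is a placeholder, and %p is used instead of the post's %x because %x truncates pointers on 64-bit systems:

```c
#include <stdio.h>
#include <stdint.h>

/* 1 if p starts on a 64-byte cache-line boundary, 0 otherwise. */
static int isCacheAligned(const void *p) {
    return ((uintptr_t)p % 64) == 0;
}

/* Print the table's start address and whether the first bucket
   begins exactly on a cache-line boundary. */
static void reportTableAlignment(const void *hashTable) {
    printf("hash start = %p\n", hashTable);
    printf("64-byte aligned: %s\n",
           isCacheAligned(hashTable) ? "yes" : "no");
}
```

Calling reportTableAlignment() once right after allocating the table answers the question directly.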
bob wrote:
One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
No I didn't do that as I didn't know how to. But now I tried the following according to your explanation:
The situation didn't change, though. Maybe the compiler did align the address properly already.
...
There is a problem in your code. When you add 63 and then mask off the rightmost 6 bits, you are essentially rounding up. You will still use ttable_size bytes, which will now cause you to access memory just beyond the end of the newly-aligned ttable address. Up to 63 bytes beyond, depending on how far off the initial address is (it is supposedly guaranteed to be on an 8-byte boundary according to POSIX).
You need to fudge the size by +63 when you do the malloc().
I'm a bit confused. My code above allocates 64 bytes more than needed. I believe you took ttable->size for the number of bytes, but it is the number of entries, i.e. the power of two that fits the desired TT size in MB.
Given that information, my code should do the same as what was suggested below:
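The quoted code blocks did not survive in this copy of the thread, but the over-allocate-and-align idiom under discussion is presumably along these lines (the TTEntry layout and the TTable field names are hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16-byte entry, just to make the idiom concrete. */
typedef struct { uint64_t key; uint64_t data; } TTEntry;

typedef struct {
    void    *raw;     /* what malloc() returned; needed for free() */
    TTEntry *entries; /* 64-byte-aligned view of the same memory */
    size_t   size;    /* number of entries, a power of two */
} TTable;

int TTableInit(TTable *tt, size_t nEntries) {
    /* Over-allocate by 63 bytes so that rounding the address up
       cannot make the last entry run past the end of the block. */
    tt->raw = malloc(nEntries * sizeof(TTEntry) + 63);
    if (tt->raw == NULL) return -1;
    /* Round the address up to the next multiple of 64. */
    tt->entries = (TTEntry *)(((uintptr_t)tt->raw + 63) & ~(uintptr_t)63);
    tt->size = nEntries;
    return 0;
}
```

On POSIX systems posix_memalign(), or aligned_alloc() in C11, achieves the same thing without the manual pointer arithmetic.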
It would be wise to print out the original address, to see if you should expect an effect. Of course the code to align the address is a no-op if the original allocation was already aligned. Having the code there makes the engine more robust, though: it is very possible that the same executable which gives an aligned allocation on your machine gives an unaligned one when copied to and run on another machine, because a different DLL containing the malloc is installed there.
If your eval is light-weight, it is very likely that nothing is wrong and this is just it. Hash probing is expensive. Micro-Max 1.6 (a simple alpha-beta without hash) does 6Mnps, and its big brother micro-Max 4.8 (which does have a hash table) only does 1Mnps on the same machine. The only way to save on hash probing is then to do it less frequently (e.g. like Crafty, which does not probe in QS).
I'm a bit confused. My code above allocates 64 bytes more than needed. I believe you took ttable->size for the number of bytes, but it is the number of entries, i.e. the power of two that fits the desired TT size in MB.
I misinterpreted the "size" idea, so disregard my comment.
It is late now, but I guess I should go and profile another engine like say Crafty or Stockfish and see if the hash probe takes up that much time on my machine, too.
With regard to my eval (somebody asked about that), it is pretty lightweight and dumb.
Are you sure that sizeof(TTEntry) is a multiple of 64 or divisor of 64? If not, then aligning the base address of the table is not going to help at all.
E.g. if sizeof(TTEntry) were 16 bytes, then exactly 4 of them would fit in each 64-byte cache line. But if sizeof(TTEntry) were 20 bytes, then each cache line would hold 3.2 of them, and every time you probe one you'd have about a 25% chance (4 entries out of every 16) of touching two cache lines instead of one.
On the other hand, if sizeof(TTEntry) were 60 bytes, it would be even worse: 14 out of every 16 entries would cross a cache line boundary. Rounding it up to 64 bytes would fix the problem and only waste 6.25% of the memory. If it were 48 bytes then 2 out of every 4 entries would cross a cache line boundary. Etc.
It looks like this, which should be 4 x 16 = 64 bytes, unless the compiler messes it up.
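The struct itself was not quoted here, but a 16-byte entry of the kind being described might look like the following (the field names and widths are made up for illustration); the _Static_assert lines catch it at compile time if the compiler does add padding:

```c
#include <stdint.h>

/* Hypothetical field layout: 8 + 2 + 2 + 1 + 1 + 2 = 16 bytes,
   so exactly four entries pack into one 64-byte cache line. */
typedef struct {
    uint64_t key;    /* upper hash bits for verification */
    int16_t  score;
    uint16_t move;
    uint8_t  depth;
    uint8_t  flags;  /* bound type: exact / lower / upper */
    uint16_t age;
} TTEntry;

typedef struct { TTEntry slot[4]; } TTBucket; /* one cache line */

/* Fail the build if the compiler "messes it up" with padding. */
_Static_assert(sizeof(TTEntry) == 16, "TTEntry must stay 16 bytes");
_Static_assert(sizeof(TTBucket) == 64, "bucket must fill one cache line");
```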
hgm wrote:
If your eval is light-weight, it is very likely that nothing is wrong and this is just it. Hash probing is expensive. Micro-Max 1.6 (a simple alpha-beta without hash) does 6Mnps, and its big brother micro-Max 4.8 (which does have a hash table) only does 1Mnps on the same machine. The only way to save on hash probing is then to do it less frequently (e.g. like Crafty, which does not probe in QS).
That sounds like a plausible explanation. I might have some room for optimization inside TTableQuery(), but I will assume that I've got no general problem until I have a larger evaluation. That'll be the time to revisit this issue.
You might also consider pre-fetching to L1, and doing some register-intensive computations while "waiting" for the probe.
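On GCC and Clang that idea can be sketched with __builtin_prefetch: issue the prefetch as soon as the child position's hash key is known, then do register-only work (incremental eval update, move generation) while the line travels from DRAM toward L1. The table and mask here are toy placeholders:

```c
#include <stdint.h>

typedef struct { uint64_t key; uint64_t data; } TTEntry;

static TTEntry hashTable[1024];           /* toy table, power-of-two size */
static const uint64_t sizeMask = 1024 - 1;

/* Hint the CPU to start loading the probe's cache line early.
   Arguments: 0 = prefetch for read, 3 = keep in all cache levels. */
static inline void TTPrefetch(uint64_t hashKey) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(&hashTable[hashKey & sizeMask], 0, 3);
#endif
}
```

The prefetch is only a hint, so calling it is always safe; on compilers without the builtin it compiles to nothing.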