Page 2 of 7

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 9:07 pm
by bob
Mincho Georgiev wrote:Plus, you can preserve the old address (with unaligned startup) in the first (unused bytes) if you want to manipulate the whole piece of memory later.
Like:

*aligned_address[-1] = the_unaligned_adress.

and later if you need to use it:

free(aligned_adress[-1]);
Dangerous. What if your original address is already on a 64 byte boundary?

Better is what I do in Crafty in utility.c, AlignedMalloc() and save the original address in a vector along with the aligned address. When I want to free the thing, I match the aligned address in the vector and then free the corresponding unaligned address, which works safely.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 9:33 pm
by Dirt
hgm wrote:To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
I wonder if that's ever likely to change. Should you be using 255, or something, instead of 63?

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 10:28 pm
by bob
Dirt wrote:
hgm wrote:To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
I wonder if that's ever likely to change. Should you be using 255, or something, instead of 63?
Certainly won't hurt. But I doubt we will see 256 byte cache lines. Pre-fetching only takes you so far before you begin to pre-fetch things you won't need, at the same time displacing things that you do need.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 11:50 pm
by wgarvin
Dirt wrote:
hgm wrote:To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
I wonder if that's ever likely to change. Should you be using 255, or something, instead of 63?
Some CPUs currently use 128-byte cache lines (the Xbox360 and the PS3 are in this category). If I were writing something for 64-bit desktop machines, I would align it on 128 just in case, but still pack the tt entries into 64-byte blocks.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Fri Oct 29, 2010 12:03 am
by OliverUwira
bob wrote: One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
No I didn't do that as I didn't know how to. But now I tried the following according to your explanation:

Code: Select all

engine->ttable.memory = (TTEntry*)(malloc((engine->ttable.size + 1) * sizeof(TTEntry)));
		
if(!(engine->ttable.memory))
	return false;

engine->ttable.entries = (TTEntry*) ((((int)(engine->ttable.memory)) + 63) & ~63);
The situation didn't change, though. Maybe the compiler did align the address properly already.

It is late now, but I guess I should go and profile another engine like say Crafty or Stockfish and see if the hash probe takes up that much time on my machine, too.

With regard to my eval (somebody asked about that), it is pretty lightweight and dumb.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Fri Oct 29, 2010 12:33 am
by micron
mathmoi wrote:
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
Use valloc() instead of malloc().

Robert P.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Fri Oct 29, 2010 1:17 am
by rbarreira
hgm wrote:It is quite normal that hash probes are slow: they need DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.

If your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.

Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if in one probe you access two cache lines (requiring two dRAM accesses to fetch, possibly triggering prefetch of a third). To know, print the start adress of your hash table, and check if it is devisable by 64 (printf("hash start = %x\n", hashTable);).

To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
Casting a pointer to an int looks like an overflow waiting to happen. Or am I missing something?

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Fri Oct 29, 2010 1:58 am
by bob
micron wrote:
mathmoi wrote:
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
Use valloc() instead of malloc().

Robert P.
Not portable. I used it for a while but went back to standard malloc() and forced the alignment myself, which is portable across all platforms.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Fri Oct 29, 2010 2:01 am
by bob
OliverUwira wrote:
bob wrote: One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
No I didn't do that as I didn't know how to. But now I tried the following according to your explanation:

Code: Select all

engine->ttable.memory = (TTEntry*)(malloc((engine->ttable.size + 1) * sizeof(TTEntry)));
		
if(!(engine->ttable.memory))
	return false;

engine->ttable.entries = (TTEntry*) ((((int)(engine->ttable.memory)) + 63) & ~63);
The situation didn't change, though. Maybe the compiler did align the address properly already.

It is late now, but I guess I should go and profile another engine like say Crafty or Stockfish and see if the hash probe takes up that much time on my machine, too.

With regard to my eval (somebody asked about that), it is pretty lightweight and dumb.
There is a problem in your code. When you add 63 and then and off the rightmost 6 bits, you are essentially rounding up. You will still use the ttablel_size number of bytes, which will now cause you to access memory just beyond the end of the newly-aligned ttable address. Up to 63 bytes beyond depending on how far off the initial address is (it is supposedly guaranteed to be on an 8 byte boundary according to POSIX).

You need to fudge the size by + 63 when you do the malloc().

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Fri Oct 29, 2010 2:02 am
by bob
rbarreira wrote:
hgm wrote:It is quite normal that hash probes are slow: they need DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.

If your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.

Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if in one probe you access two cache lines (requiring two dRAM accesses to fetch, possibly triggering prefetch of a third). To know, print the start adress of your hash table, and check if it is devisable by 64 (printf("hash start = %x\n", hashTable);).

To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
Casting a pointer to an int looks like an overflow waiting to happen. Or am I missing something?
I'd prefer (long) to be 64 bit safe... that is what I use...