Is a querying the hash tables such a huge bottleneck?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

Mincho Georgiev wrote:Plus, you can preserve the old address (with unaligned startup) in the first (unused bytes) if you want to manipulate the whole piece of memory later.
Like:

*aligned_address[-1] = the_unaligned_adress.

and later if you need to use it:

free(aligned_adress[-1]);
Dangerous. What if your original address is already on a 64 byte boundary?

Better is what I do in Crafty in utility.c, AlignedMalloc() and save the original address in a vector along with the aligned address. When I want to free the thing, I match the aligned address in the vector and then free the corresponding unaligned address, which works safely.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Is a querying the hash tables such a huge bottleneck?

Post by Dirt »

hgm wrote:To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
I wonder if that's ever likely to change. Should you be using 255, or something, instead of 63?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

Dirt wrote:
hgm wrote:To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
I wonder if that's ever likely to change. Should you be using 255, or something, instead of 63?
Certainly won't hurt. But I doubt we will see 256 byte cache lines. Pre-fetching only takes you so far before you begin to pre-fetch things you won't need, at the same time displacing things that you do need.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Is a querying the hash tables such a huge bottleneck?

Post by wgarvin »

Dirt wrote:
hgm wrote:To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
I wonder if that's ever likely to change. Should you be using 255, or something, instead of 63?
Some CPUs currently use 128-byte cache lines (the Xbox360 and the PS3 are in this category). If I were writing something for 64-bit desktop machines, I would align it on 128 just in case, but still pack the tt entries into 64-byte blocks.
User avatar
OliverUwira
Posts: 170
Joined: Mon Sep 13, 2010 9:57 am
Location: Frankfurt am Main

Re: Is a querying the hash tables such a huge bottleneck?

Post by OliverUwira »

bob wrote: One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
No I didn't do that as I didn't know how to. But now I tried the following according to your explanation:

Code: Select all

engine->ttable.memory = (TTEntry*)(malloc((engine->ttable.size + 1) * sizeof(TTEntry)));
		
if(!(engine->ttable.memory))
	return false;

engine->ttable.entries = (TTEntry*) ((((int)(engine->ttable.memory)) + 63) & ~63);
The situation didn't change, though. Maybe the compiler did align the address properly already.

It is late now, but I guess I should go and profile another engine like say Crafty or Stockfish and see if the hash probe takes up that much time on my machine, too.

With regard to my eval (somebody asked about that), it is pretty lightweight and dumb.
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Is a querying the hash tables such a huge bottleneck?

Post by micron »

mathmoi wrote:
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
Use valloc() instead of malloc().

Robert P.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Is a querying the hash tables such a huge bottleneck?

Post by rbarreira »

hgm wrote:It is quite normal that hash probes are slow: they need DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.

If your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.

Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if in one probe you access two cache lines (requiring two dRAM accesses to fetch, possibly triggering prefetch of a third). To know, print the start adress of your hash table, and check if it is devisable by 64 (printf("hash start = %x\n", hashTable);).

To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
Casting a pointer to an int looks like an overflow waiting to happen. Or am I missing something?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

micron wrote:
mathmoi wrote:
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
Use valloc() instead of malloc().

Robert P.
Not portable. I used it for a while but went back to standard malloc() and forced the alignment myself, which is portable across all platforms.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

OliverUwira wrote:
bob wrote: One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
No I didn't do that as I didn't know how to. But now I tried the following according to your explanation:

Code: Select all

engine->ttable.memory = (TTEntry*)(malloc((engine->ttable.size + 1) * sizeof(TTEntry)));
		
if(!(engine->ttable.memory))
	return false;

engine->ttable.entries = (TTEntry*) ((((int)(engine->ttable.memory)) + 63) & ~63);
The situation didn't change, though. Maybe the compiler did align the address properly already.

It is late now, but I guess I should go and profile another engine like say Crafty or Stockfish and see if the hash probe takes up that much time on my machine, too.

With regard to my eval (somebody asked about that), it is pretty lightweight and dumb.
There is a problem in your code. When you add 63 and then and off the rightmost 6 bits, you are essentially rounding up. You will still use the ttablel_size number of bytes, which will now cause you to access memory just beyond the end of the newly-aligned ttable address. Up to 63 bytes beyond depending on how far off the initial address is (it is supposedly guaranteed to be on an 8 byte boundary according to POSIX).

You need to fudge the size by + 63 when you do the malloc().
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

rbarreira wrote:
hgm wrote:It is quite normal that hash probes are slow: they need DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.

If your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.

Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if in one probe you access two cache lines (requiring two dRAM accesses to fetch, possibly triggering prefetch of a third). To know, print the start adress of your hash table, and check if it is devisable by 64 (printf("hash start = %x\n", hashTable);).

To force alignment, use code like:

Code: Select all

hashMem = (Bcket *) malloc(nrOfBuckets + 1, sizeof Bucket);
hashTable = (Bucket *) (((int)hashMem + 63) & ~63);
Casting a pointer to an int looks like an overflow waiting to happen. Or am I missing something?
I'd prefer (long) to be 64 bit safe... that is what I use...