Page 1 of 7

Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 1:17 pm
by OliverUwira
Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1
SetMove                             1901     1353       7,94     5,65
NextMove                            1152     1152       4,81     4,81
StaticExchangeEvaluation            2313     1053       9,66     4,4
SortMoves                           1766     1022       7,37     4,27
GetPinnedPieces                     1013     1013       4,23     4,23
UnsetMove                            925      925       3,86     3,86
GeneratePawnNonCaps                 1149      894       4,8      3,73
QuiescenceSearch                   14637      789      61,11     3,29
ZeroSearch                         23828      788      99,48     3,29
GeneratePawnCaps                     708      708       2,96     2,96
SortQuiescenceMoves                  572      557       2,39     2,33
TTableStore                          467      467       1,95     1,95
GenerateQueenCaps                    455      455       1,9      1,9
SliderAttacksTo                      502      324       2,1      1,35
GetSliderAttacks                     308      308       1,29     1,29
GenerateKingCaps                     314      290       1,31     1,21
GenerateBishopCaps                   268      268       1,12     1,12
GenerateKingNonCaps                  728      262       3,04     1,09
EnPassantPossible                    255      255       1,06     1,06
GenerateRookCaps                     220      220       0,92     0,92
AnalysePawnStructure                 196      196       0,82     0,82
GenerateQueenNonCaps                 190      190       0,79     0,79
GenerateKnightCaps                   186      186       0,78     0,78
GenerateQuiescenceMoves             3412      108      14,24     0,45
GetBlockMoves                         98       98       0,41     0,41
GenerateBishopNonCaps                 80       80       0,33     0,33
GenerateMoves                       2421       75      10,11     0,31
GenerateLegalEscapes                 275       68       1,15     0,28
GenerateKnightNonCaps                 61       61       0,25     0,25
GenerateRookNonCaps                   54       54       0,23     0,23
GetSquaresBetween                     33       33       0,14     0,14
RecordKiller                          24       24       0,1      0,1
Search                             17211       19      71,85     0,08

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 2:33 pm
by bob
OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1
SetMove                             1901     1353       7,94     5,65
NextMove                            1152     1152       4,81     4,81
StaticExchangeEvaluation            2313     1053       9,66     4,4
SortMoves                           1766     1022       7,37     4,27
GetPinnedPieces                     1013     1013       4,23     4,23
UnsetMove                            925      925       3,86     3,86
GeneratePawnNonCaps                 1149      894       4,8      3,73
QuiescenceSearch                   14637      789      61,11     3,29
ZeroSearch                         23828      788      99,48     3,29
GeneratePawnCaps                     708      708       2,96     2,96
SortQuiescenceMoves                  572      557       2,39     2,33
TTableStore                          467      467       1,95     1,95
GenerateQueenCaps                    455      455       1,9      1,9
SliderAttacksTo                      502      324       2,1      1,35
GetSliderAttacks                     308      308       1,29     1,29
GenerateKingCaps                     314      290       1,31     1,21
GenerateBishopCaps                   268      268       1,12     1,12
GenerateKingNonCaps                  728      262       3,04     1,09
EnPassantPossible                    255      255       1,06     1,06
GenerateRookCaps                     220      220       0,92     0,92
AnalysePawnStructure                 196      196       0,82     0,82
GenerateQueenNonCaps                 190      190       0,79     0,79
GenerateKnightCaps                   186      186       0,78     0,78
GenerateQuiescenceMoves             3412      108      14,24     0,45
GetBlockMoves                         98       98       0,41     0,41
GenerateBishopNonCaps                 80       80       0,33     0,33
GenerateMoves                       2421       75      10,11     0,31
GenerateLegalEscapes                 275       68       1,15     0,28
GenerateKnightNonCaps                 61       61       0,25     0,25
GenerateRookNonCaps                   54       54       0,23     0,23
GetSquaresBetween                     33       33       0,14     0,14
RecordKiller                          24       24       0,1      0,1
Search                             17211       19      71,85     0,08
It depends on how you probe. Most try to probe to a "bucket" that fits within one cache line. If you do a hash probe and visit multiple random entries per probe, then the time goes way up because memory is so slow.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 3:16 pm
by OliverUwira
bob wrote:
OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1

...

It depends on how you probe. Most try to probe to a "bucket" that fits within one cache line. If you do a hash probe and visit multiple random entries per probe, then the time goes way up because memory is so slow.
Well, I've found one optimization, in that I was calling TTableQuery two times in a row, i.e. before

Code: Select all

if(dist == 0) QuiescenceSearch();
and then again in QuiescenceSearch. I measured again and the Excl. % went down, but only slightly, from 16,2% to 15,1%.

So I fear your advice about the cache line might cause this problem.

But what puzzles me is that my TTable is very similar to Crafty's.

I have two quad words as a record, with 4 records per bucket:

Code: Select all

struct TTRecord
{
	hashkey key;
	uint64  info;	
};

struct TTEntry
{
	TTRecord bucket[4];
};

int TTableQuery(Engine *engine, int *alpha, int beta, int ply, int depth)
{
	const hashkey key = engine->pos.poskey;
	const TTable *ttable = &(engine->ttable);
	TTEntry  *entry = &(ttable->entries[key & (ttable->mask)]);
	TTRecord *record = entry->bucket;
	int i, f, v, d, ret = TT_FLAG_INVALID;
	uint64 info;
	
	engine->tree.PathNode[ply].MoveGenState.hashmove = 0;
	for&#40;i = 0; i < 4; i++)
	&#123;
		if&#40;key == &#40;record->key&#41;)
			break;
		record++;
	&#125;

	if&#40;i > 3&#41; 
		return ret;

        ...

&#125;
I know that you don't have an extra bucket structure in Crafty but rather treat 4 consecutive records as the bucket.

Could the TTEntry structure cause my problem. I decided to use it for clarity and thought it should result in the same memory usage as if I stored consecutively and taking the address mod 64.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 4:00 pm
by bob
OliverUwira wrote:
bob wrote:
OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1

...

It depends on how you probe. Most try to probe to a "bucket" that fits within one cache line. If you do a hash probe and visit multiple random entries per probe, then the time goes way up because memory is so slow.
Well, I've found one optimization, in that I was calling TTableQuery two times in a row, i.e. before

Code: Select all

if&#40;dist == 0&#41; QuiescenceSearch&#40;);
and then again in QuiescenceSearch. I measured again and the Excl. % went down, but only slightly, from 16,2% to 15,1%.

So I fear your advice about the cache line might cause this problem.

But what puzzles me is that my TTable is very similar to Crafty's.

I have two quad words as a record, with 4 records per bucket:

Code: Select all

struct TTRecord
&#123;
	hashkey key;
	uint64  info;	
&#125;;

struct TTEntry
&#123;
	TTRecord bucket&#91;4&#93;;
&#125;;

int TTableQuery&#40;Engine *engine, int *alpha, int beta, int ply, int depth&#41;
&#123;
	const hashkey key = engine->pos.poskey;
	const TTable *ttable = &&#40;engine->ttable&#41;;
	TTEntry  *entry = &&#40;ttable->entries&#91;key & &#40;ttable->mask&#41;&#93;);
	TTRecord *record = entry->bucket;
	int i, f, v, d, ret = TT_FLAG_INVALID;
	uint64 info;
	
	engine->tree.PathNode&#91;ply&#93;.MoveGenState.hashmove = 0;
	for&#40;i = 0; i < 4; i++)
	&#123;
		if&#40;key == &#40;record->key&#41;)
			break;
		record++;
	&#125;

	if&#40;i > 3&#41; 
		return ret;

        ...

&#125;
I know that you don't have an extra bucket structure in Crafty but rather treat 4 consecutive records as the bucket.

Could the TTEntry structure cause my problem. I decided to use it for clarity and thought it should result in the same memory usage as if I stored consecutively and taking the address mod 64.
One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 4:31 pm
by John Major
OliverUwira wrote: I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.
The hash table fill behaviour looks like this:
Image
So it takes a while to completely fill up.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 4:54 pm
by mathmoi
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 5:18 pm
by hgm
It is quite normal that hash probes are slow: they need DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.

If your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.

Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if in one probe you access two cache lines (requiring two dRAM accesses to fetch, possibly triggering prefetch of a third). To know, print the start adress of your hash table, and check if it is devisable by 64 (printf("hash start = %x\n", hashTable);).

To force alignment, use code like:

Code: Select all

hashMem = &#40;Bcket *) malloc&#40;nrOfBuckets + 1, sizeof Bucket&#41;;
hashTable = &#40;Bucket *) ((&#40;int&#41;hashMem + 63&#41; & ~63&#41;;

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 5:20 pm
by jwes
OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1
SetMove                             1901     1353       7,94     5,65
NextMove                            1152     1152       4,81     4,81
StaticExchangeEvaluation            2313     1053       9,66     4,4
SortMoves                           1766     1022       7,37     4,27
GetPinnedPieces                     1013     1013       4,23     4,23
UnsetMove                            925      925       3,86     3,86
GeneratePawnNonCaps                 1149      894       4,8      3,73
QuiescenceSearch                   14637      789      61,11     3,29
ZeroSearch                         23828      788      99,48     3,29
GeneratePawnCaps                     708      708       2,96     2,96
SortQuiescenceMoves                  572      557       2,39     2,33
TTableStore                          467      467       1,95     1,95
GenerateQueenCaps                    455      455       1,9      1,9
SliderAttacksTo                      502      324       2,1      1,35
GetSliderAttacks                     308      308       1,29     1,29
GenerateKingCaps                     314      290       1,31     1,21
GenerateBishopCaps                   268      268       1,12     1,12
GenerateKingNonCaps                  728      262       3,04     1,09
EnPassantPossible                    255      255       1,06     1,06
GenerateRookCaps                     220      220       0,92     0,92
AnalysePawnStructure                 196      196       0,82     0,82
GenerateQueenNonCaps                 190      190       0,79     0,79
GenerateKnightCaps                   186      186       0,78     0,78
GenerateQuiescenceMoves             3412      108      14,24     0,45
GetBlockMoves                         98       98       0,41     0,41
GenerateBishopNonCaps                 80       80       0,33     0,33
GenerateMoves                       2421       75      10,11     0,31
GenerateLegalEscapes                 275       68       1,15     0,28
GenerateKnightNonCaps                 61       61       0,25     0,25
GenerateRookNonCaps                   54       54       0,23     0,23
GetSquaresBetween                     33       33       0,14     0,14
RecordKiller                          24       24       0,1      0,1
Search                             17211       19      71,85     0,08
A simple test would be to run profiling with a very small hash table. If the percentage remains high, it may well be a bug.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 5:23 pm
by bob
mathmoi wrote:
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
malloc() 63 bytes more than you need. Then add 63 to the address malloc() returns, and AND that with ~63 (0xffffffc0). Now the address of the first hash entry is on a 64-byte boundary, and each group of 4 will go exactly into one cache line with no overlap.

Re: Is a querying the hash tables such a huge bottleneck?

Posted: Thu Oct 28, 2010 6:19 pm
by Mincho Georgiev
Plus, you can preserve the old address (with unaligned startup) in the first (unused bytes) if you want to manipulate the whole piece of memory later.
Like:

*aligned_address[-1] = the_unaligned_adress.

and later if you need to use it:

free(aligned_adress[-1]);