Is a querying the hash tables such a huge bottleneck?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
OliverUwira
Posts: 170
Joined: Mon Sep 13, 2010 9:57 am
Location: Frankfurt am Main

Is a querying the hash tables such a huge bottleneck?

Post by OliverUwira »

Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1
SetMove                             1901     1353       7,94     5,65
NextMove                            1152     1152       4,81     4,81
StaticExchangeEvaluation            2313     1053       9,66     4,4
SortMoves                           1766     1022       7,37     4,27
GetPinnedPieces                     1013     1013       4,23     4,23
UnsetMove                            925      925       3,86     3,86
GeneratePawnNonCaps                 1149      894       4,8      3,73
QuiescenceSearch                   14637      789      61,11     3,29
ZeroSearch                         23828      788      99,48     3,29
GeneratePawnCaps                     708      708       2,96     2,96
SortQuiescenceMoves                  572      557       2,39     2,33
TTableStore                          467      467       1,95     1,95
GenerateQueenCaps                    455      455       1,9      1,9
SliderAttacksTo                      502      324       2,1      1,35
GetSliderAttacks                     308      308       1,29     1,29
GenerateKingCaps                     314      290       1,31     1,21
GenerateBishopCaps                   268      268       1,12     1,12
GenerateKingNonCaps                  728      262       3,04     1,09
EnPassantPossible                    255      255       1,06     1,06
GenerateRookCaps                     220      220       0,92     0,92
AnalysePawnStructure                 196      196       0,82     0,82
GenerateQueenNonCaps                 190      190       0,79     0,79
GenerateKnightCaps                   186      186       0,78     0,78
GenerateQuiescenceMoves             3412      108      14,24     0,45
GetBlockMoves                         98       98       0,41     0,41
GenerateBishopNonCaps                 80       80       0,33     0,33
GenerateMoves                       2421       75      10,11     0,31
GenerateLegalEscapes                 275       68       1,15     0,28
GenerateKnightNonCaps                 61       61       0,25     0,25
GenerateRookNonCaps                   54       54       0,23     0,23
GetSquaresBetween                     33       33       0,14     0,14
RecordKiller                          24       24       0,1      0,1
Search                             17211       19      71,85     0,08
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1
SetMove                             1901     1353       7,94     5,65
NextMove                            1152     1152       4,81     4,81
StaticExchangeEvaluation            2313     1053       9,66     4,4
SortMoves                           1766     1022       7,37     4,27
GetPinnedPieces                     1013     1013       4,23     4,23
UnsetMove                            925      925       3,86     3,86
GeneratePawnNonCaps                 1149      894       4,8      3,73
QuiescenceSearch                   14637      789      61,11     3,29
ZeroSearch                         23828      788      99,48     3,29
GeneratePawnCaps                     708      708       2,96     2,96
SortQuiescenceMoves                  572      557       2,39     2,33
TTableStore                          467      467       1,95     1,95
GenerateQueenCaps                    455      455       1,9      1,9
SliderAttacksTo                      502      324       2,1      1,35
GetSliderAttacks                     308      308       1,29     1,29
GenerateKingCaps                     314      290       1,31     1,21
GenerateBishopCaps                   268      268       1,12     1,12
GenerateKingNonCaps                  728      262       3,04     1,09
EnPassantPossible                    255      255       1,06     1,06
GenerateRookCaps                     220      220       0,92     0,92
AnalysePawnStructure                 196      196       0,82     0,82
GenerateQueenNonCaps                 190      190       0,79     0,79
GenerateKnightCaps                   186      186       0,78     0,78
GenerateQuiescenceMoves             3412      108      14,24     0,45
GetBlockMoves                         98       98       0,41     0,41
GenerateBishopNonCaps                 80       80       0,33     0,33
GenerateMoves                       2421       75      10,11     0,31
GenerateLegalEscapes                 275       68       1,15     0,28
GenerateKnightNonCaps                 61       61       0,25     0,25
GenerateRookNonCaps                   54       54       0,23     0,23
GetSquaresBetween                     33       33       0,14     0,14
RecordKiller                          24       24       0,1      0,1
Search                             17211       19      71,85     0,08
It depends on how you probe. Most try to probe to a "bucket" that fits within one cache line. If you do a hash probe and visit multiple random entries per probe, then the time goes way up because memory is so slow.
User avatar
OliverUwira
Posts: 170
Joined: Mon Sep 13, 2010 9:57 am
Location: Frankfurt am Main

Re: Is a querying the hash tables such a huge bottleneck?

Post by OliverUwira »

bob wrote:
OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1

...

It depends on how you probe. Most try to probe to a "bucket" that fits within one cache line. If you do a hash probe and visit multiple random entries per probe, then the time goes way up because memory is so slow.
Well, I've found one optimization, in that I was calling TTableQuery two times in a row, i.e. before

Code: Select all

if(dist == 0) QuiescenceSearch();
and then again in QuiescenceSearch. I measured again and the Excl. % went down, but only slightly, from 16,2% to 15,1%.

So I fear your advice about the cache line might cause this problem.

But what puzzles me is that my TTable is very similar to Crafty's.

I have two quad words as a record, with 4 records per bucket:

Code: Select all

struct TTRecord
{
	hashkey key;
	uint64  info;	
};

struct TTEntry
{
	TTRecord bucket[4];
};

int TTableQuery(Engine *engine, int *alpha, int beta, int ply, int depth)
{
	const hashkey key = engine->pos.poskey;
	const TTable *ttable = &(engine->ttable);
	TTEntry  *entry = &(ttable->entries[key & (ttable->mask)]);
	TTRecord *record = entry->bucket;
	int i, f, v, d, ret = TT_FLAG_INVALID;
	uint64 info;
	
	engine->tree.PathNode[ply].MoveGenState.hashmove = 0;
	for&#40;i = 0; i < 4; i++)
	&#123;
		if&#40;key == &#40;record->key&#41;)
			break;
		record++;
	&#125;

	if&#40;i > 3&#41; 
		return ret;

        ...

&#125;
I know that you don't have an extra bucket structure in Crafty but rather treat 4 consecutive records as the bucket.

Could the TTEntry structure cause my problem. I decided to use it for clarity and thought it should result in the same memory usage as if I stored consecutively and taking the address mod 64.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

OliverUwira wrote:
bob wrote:
OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1

...

It depends on how you probe. Most try to probe to a "bucket" that fits within one cache line. If you do a hash probe and visit multiple random entries per probe, then the time goes way up because memory is so slow.
Well, I've found one optimization, in that I was calling TTableQuery two times in a row, i.e. before

Code: Select all

if&#40;dist == 0&#41; QuiescenceSearch&#40;);
and then again in QuiescenceSearch. I measured again and the Excl. % went down, but only slightly, from 16,2% to 15,1%.

So I fear your advice about the cache line might cause this problem.

But what puzzles me is that my TTable is very similar to Crafty's.

I have two quad words as a record, with 4 records per bucket:

Code: Select all

struct TTRecord
&#123;
	hashkey key;
	uint64  info;	
&#125;;

struct TTEntry
&#123;
	TTRecord bucket&#91;4&#93;;
&#125;;

int TTableQuery&#40;Engine *engine, int *alpha, int beta, int ply, int depth&#41;
&#123;
	const hashkey key = engine->pos.poskey;
	const TTable *ttable = &&#40;engine->ttable&#41;;
	TTEntry  *entry = &&#40;ttable->entries&#91;key & &#40;ttable->mask&#41;&#93;);
	TTRecord *record = entry->bucket;
	int i, f, v, d, ret = TT_FLAG_INVALID;
	uint64 info;
	
	engine->tree.PathNode&#91;ply&#93;.MoveGenState.hashmove = 0;
	for&#40;i = 0; i < 4; i++)
	&#123;
		if&#40;key == &#40;record->key&#41;)
			break;
		record++;
	&#125;

	if&#40;i > 3&#41; 
		return ret;

        ...

&#125;
I know that you don't have an extra bucket structure in Crafty but rather treat 4 consecutive records as the bucket.

Could the TTEntry structure cause my problem. I decided to use it for clarity and thought it should result in the same memory usage as if I stored consecutively and taking the address mod 64.
One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
John Major
Posts: 27
Joined: Fri Dec 11, 2009 10:23 pm

Re: Is a querying the hash tables such a huge bottleneck?

Post by John Major »

OliverUwira wrote: I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.
The hash table fill behaviour looks like this:
Image
So it takes a while to completely fill up.
mathmoi
Posts: 286
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec

Re: Is a querying the hash tables such a huge bottleneck?

Post by mathmoi »

bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Is a querying the hash tables such a huge bottleneck?

Post by hgm »

It is quite normal that hash probes are slow: they need DRAM access, which takes many hundreds of CPU clock cycles. Probing the same entry twice is not much more expensive, though, as the second time it will be in cache. So this is in accordance with what you see.

If your results are reasonable depends on too many things to say. For instance how elaborate your evaluation is. And how often you probe.

Non-alignment of a hash bucket with the cache line can cause extra DRAM accesses, if in one probe you access two cache lines (requiring two dRAM accesses to fetch, possibly triggering prefetch of a third). To know, print the start adress of your hash table, and check if it is devisable by 64 (printf("hash start = %x\n", hashTable);).

To force alignment, use code like:

Code: Select all

hashMem = &#40;Bcket *) malloc&#40;nrOfBuckets + 1, sizeof Bucket&#41;;
hashTable = &#40;Bucket *) ((&#40;int&#41;hashMem + 63&#41; & ~63&#41;;
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Is a querying the hash tables such a huge bottleneck?

Post by jwes »

OliverUwira wrote:Hello all,

I just did a profiling run for Kurt and noticed that the hash table query came in as a close runner-up.

I find that somewhat strange. Is that normal, or should I go and look for a bug? I also noticed that the UCI hashfull info reports a full hash table only very late in the game. After move 10 or so it mostly hovers around 50%.

Code: Select all

Function Name                       Incl.    Excl.     Incl. %  Excl. %
Evaluate                            4270     4052      17,83    16,92
TTableQuery                         3882     3882      16,21    16,21
AttacksTo                           1701     1701       7,1      7,1
SetMove                             1901     1353       7,94     5,65
NextMove                            1152     1152       4,81     4,81
StaticExchangeEvaluation            2313     1053       9,66     4,4
SortMoves                           1766     1022       7,37     4,27
GetPinnedPieces                     1013     1013       4,23     4,23
UnsetMove                            925      925       3,86     3,86
GeneratePawnNonCaps                 1149      894       4,8      3,73
QuiescenceSearch                   14637      789      61,11     3,29
ZeroSearch                         23828      788      99,48     3,29
GeneratePawnCaps                     708      708       2,96     2,96
SortQuiescenceMoves                  572      557       2,39     2,33
TTableStore                          467      467       1,95     1,95
GenerateQueenCaps                    455      455       1,9      1,9
SliderAttacksTo                      502      324       2,1      1,35
GetSliderAttacks                     308      308       1,29     1,29
GenerateKingCaps                     314      290       1,31     1,21
GenerateBishopCaps                   268      268       1,12     1,12
GenerateKingNonCaps                  728      262       3,04     1,09
EnPassantPossible                    255      255       1,06     1,06
GenerateRookCaps                     220      220       0,92     0,92
AnalysePawnStructure                 196      196       0,82     0,82
GenerateQueenNonCaps                 190      190       0,79     0,79
GenerateKnightCaps                   186      186       0,78     0,78
GenerateQuiescenceMoves             3412      108      14,24     0,45
GetBlockMoves                         98       98       0,41     0,41
GenerateBishopNonCaps                 80       80       0,33     0,33
GenerateMoves                       2421       75      10,11     0,31
GenerateLegalEscapes                 275       68       1,15     0,28
GenerateKnightNonCaps                 61       61       0,25     0,25
GenerateRookNonCaps                   54       54       0,23     0,23
GetSquaresBetween                     33       33       0,14     0,14
RecordKiller                          24       24       0,1      0,1
Search                             17211       19      71,85     0,08
A simple test would be to run profiling with a very small hash table. If the percentage remains high, it may well be a bug.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is a querying the hash tables such a huge bottleneck?

Post by bob »

mathmoi wrote:
bob wrote:One more question: Do you force the "bucket" to lie on a 64 byte address boundary so that you are only incurring one hash line fill, as opposed to two?
How do you do that?
malloc() 63 bytes more than you need. Then add 63 to the address malloc() returns, and AND that with ~63 (0xffffffc0). Now the address of the first hash entry is on a 64-byte boundary, and each group of 4 will go exactly into one cache line with no overlap.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: Is a querying the hash tables such a huge bottleneck?

Post by Mincho Georgiev »

Plus, you can preserve the old address (with unaligned startup) in the first (unused bytes) if you want to manipulate the whole piece of memory later.
Like:

*aligned_address[-1] = the_unaligned_adress.

and later if you need to use it:

free(aligned_adress[-1]);