Transposition Table and nps drop

mathmoi · Post by **mathmoi** » Wed Feb 27, 2008 2:35 pm

Hi,

My engine is actually using a simple minimax (no AlphaBeta) function for the moment. I want to get it completely functional before I write the AB search and all the other improvements.

This week I implemented a transposition table, because I want to be able to retrieve the principal variation from it. So I implemented it into my Minimax function and I get confused by the results.

If I use a small hash table of 4Kb (256 entries), the search for my test positions takes 34.4s at 8.5M nps searching 293M nodes.

If I use a 32Mb hash table the search takes 22.6s at 5M nps searching 112M nodes

and if I use a 256Mb the search takes 25.2s at 4.3M nps searching 107M nodes.

I was expecting a drop in nodes per seconds, but it seems that with a 256Mb hash tables, the nps drop by about 50% making the search even slower than with a 32 Mo hash table.

However, when I profile my engine, the profiling code seems to use only <1% of the total cpu time. What does that means? Do I have a bug or is this due the cache miss the huge table will inevitably cause?

hgm · Post by **hgm** » Wed Feb 27, 2008 3:53 pm

How do you count your nodes? If a move leads to a position that is found in the hash table, do you count that as a node, or do you only count nodes where you do move generation, or only count leave nodes where you do evaluation?

What is your eplacement algorithm? Do you replace everything, or do you make an attempt to spare deep eentries

How many test positions do you average (or sum) to get the quoted times?

mathmoi · Post by **mathmoi** » Wed Feb 27, 2008 4:03 pm

hgm wrote:How do you count your nodes? If a move leads to a position that is found in the hash table, do you count that as a node, or do you only count nodes where you do move generation, or only count leave nodes where you do evaluation?

What is your eplacement algorithm? Do you replace everything, or do you make an attempt to spare deep eentries

How many test positions do you average (or sum) to get the quoted times?

Hi,

I count every calls to Minimax, so it counts internal nodes and leaf nodes. At a position if a move is found in the transposition table it's counted as 1 node.

I actually use a replace if deeper or equal scheme. I'll try to use a replace always one to see what happens.

The data I provided are the sum of 6 positions searched to depth 5 or 6. The positions are the ones used by the benchmark command of Crafty.

Thanks for your help.

mathmoi · Post by **mathmoi** » Wed Feb 27, 2008 4:44 pm

Hi,

I just tried the sames test with a replace always scheme and the results are only slightly better.

Can all the caches miss on the TT probe add ups to 50% of the CPU time?

hgm · Post by **hgm** » Wed Feb 27, 2008 5:14 pm

Yes, easily. To fetch randomly addressed data from memory takes 386 CPU clock cycles on my 1.3GHz Pentium M. On faster machines, it will take even more CPU clocks, as only the CPU will be faster, not the memory. And a CPU can do between 2 and 3 instructions per clock cycle. So doing a single probe costs you as much as ~1000 instructions.

I do not know what exactly you do in your leave nodes, but if you do one hash probe, and less than 1000 instructions of work, the TT probes take 50%.

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Feb 27, 2008 5:18 pm

Without minimal positional evaluation, alpha-beta pruning and quiescence search your search trees are artificial. Optimizing performance in such unnatural conditions can make final code even worse.

Harald Johnsen · Post by **Harald Johnsen** » Wed Feb 27, 2008 6:28 pm

What are you doing with your table ?

Retrieving the pv for the search at the next depth is not enought.

Are you using it to do some cutoff ?
Are you using it to order the moves with IID ?

HJ.

hgm · Post by **hgm** » Wed Feb 27, 2008 6:33 pm

Well, he does do cutoffs, or his node count couldn't go down dependent on hash size. Ordering the moves would have little effect without alpha-beta.

bob · Post by **bob** » Wed Feb 27, 2008 6:54 pm

mathmoi wrote:Hi,

My engine is actually using a simple minimax (no AlphaBeta) function for the moment. I want to get it completely functional before I write the AB search and all the other improvements.

This week I implemented a transposition table, because I want to be able to retrieve the principal variation from it. So I implemented it into my Minimax function and I get confused by the results.

If I use a small hash table of 4Kb (256 entries), the search for my test positions takes 34.4s at 8.5M nps searching 293M nodes.

If I use a 32Mb hash table the search takes 22.6s at 5M nps searching 112M nodes

and if I use a 256Mb the search takes 25.2s at 4.3M nps searching 107M nodes.

I was expecting a drop in nodes per seconds, but it seems that with a 256Mb hash tables, the nps drop by about 50% making the search even slower than with a 32 Mo hash table.

However, when I profile my engine, the profiling code seems to use only <1% of the total cpu time. What does that means? Do I have a bug or is this due the cache miss the huge table will inevitably cause?

I have never seen such a drop in NPS, unless you are either (a) so small that everything fits into cache until you start probing the hash which causes things to get replaced frequently; (b) using a hash implementation that is too slow, perhaps using multiple probes or something...

In crafty, nps is relatively stable regardless of hash size used, but the tree gets smaller (as yours apparently does) as the size is increased...

mathmoi · Post by **mathmoi** » Thu Feb 28, 2008 4:00 am

hgm wrote:Yes, easily. To fetch randomly addressed data from memory takes 386 CPU clock cycles on my 1.3GHz Pentium M. On faster machines, it will take even more CPU clocks, as only the CPU will be faster, not the memory. And a CPU can do between 2 and 3 instructions per clock cycle. So doing a single probe costs you as much as ~1000 instructions.

I do not know what exactly you do in your leave nodes, but if you do one hash probe, and less than 1000 instructions of work, the TT probes take 50%.

Hi H.G.Muller,

You pin pointed the issue. Before I added the TT code in my Minimax function at depth=0, I was returning Eval() wich is really fast since it only return an incrementally updated value. When I added my TT code, I did a probe _before_ I checked for depth=0, So I was doing an extra probe at each leaf nodes.

Thanks for you inputs.

Transposition Table and nps drop

Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop

Re: Transposition Table and nps drop