Last call for perft 13 estimates!

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Last call for perft 13 estimates!

Post by petero2 »

ibid wrote:
Ajedrecista wrote:Thank you very much for your great achievement. I suppose that Perft(14) (which I estimated at around 6.1861e+19 on August 4th, 2011) will wait some years.
Well actually... :) Computing it directly from the unique positions at 10 ply
would take about a year and a half, which is a bit much at this point.
Going to the uniq positions at 11 ply would reduce the perft time by a
factor of 3 or so, but computing those 11 ply positions would currently
require many additional months of hard drive abusing work. The current
rewrite is to add some threading and disk caching to the uniq program.
The computation will still have to be done in about 15 parts (even with
compression, a 1 TB drive only holds so much), but the current time
estimate is about 8 months for perft 14 once the code is done.
I have computed the unique positions at ply N for N=0,1,2,...,11:

Code: Select all

ply nPositions   maxCnt   nMax nSing      FEN
0   1            1        1    1          rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
1   20           1        20   20         rnbqkbnr/pppppppp/8/8/8/N7/PPPPPPPP/R1BQKBNR b KQkq -
2   400          1        400  400        r1bqkbnr/pppppppp/n7/8/8/N7/PPPPPPPP/R1BQKBNR w KQkq -
3   5362         4        20   1862       r1bqkbnr/pppppppp/n7/8/8/8/PPPPPPPP/RNBQKBNR b KQkq -
4   72078        16       1    9825       rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
5   822518       87       2    53516      rnbqkbnr/pppppppp/8/8/8/4P3/PPPP1PPP/RNBQKBNR b KQkq -
6   9417681      479      4    311642     rnbqkbnr/pppp1ppp/4p3/8/8/4P3/PPPP1PPP/RNBQKBNR w KQkq -
7   96400068     2600     2    2018993    rnbqkbnr/pppp1ppp/4p3/8/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
8   988187354    14363    1    12150635   rnbqkbnr/ppp2ppp/8/3pp3/3PP3/8/PPP2PPP/RNBQKBNR w KQkq -
9   9183421888   115066   1    69284509   rnbqkbnr/ppp2ppp/8/3pp3/4P3/8/PPPP1PPP/RNBQKBNR b KQkq -
10  85375278064  1126591  1    382383387  rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq -
11  726155461002 11783579 1    1994236773 rnbqkbnr/pppp1ppp/8/4p3/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
nPositions is the number of unique positions at ply N. This number is also given by integer sequence A083276.

For a position P, let cnt(P) be the number of different sequences of N moves from the starting position that lead to position P.

maxCnt is max(cnt(P)) where the maximum is taken over all positions P at ply N.

nMax is the number of positions at ply N having cnt(P) == maxCnt.

nSing is the number of positions at ply N having cnt(P) == 1.

FEN is a position having cnt(P) == maxCnt.
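In code, these statistics fall out of a map from each unique position to its cnt(P). A minimal Python sketch, with toy position labels standing in for real FENs:

```python
from collections import Counter

def layer_stats(counts):
    """Given cnt(P) for every unique position P at ply N, derive
    nPositions, maxCnt, nMax and nSing as defined above."""
    max_cnt = max(counts.values())
    n_max = sum(1 for c in counts.values() if c == max_cnt)
    n_sing = sum(1 for c in counts.values() if c == 1)
    return len(counts), max_cnt, n_max, n_sing

# Toy data standing in for real ply-N positions:
toy = Counter({"posA": 4, "posB": 1, "posC": 4, "posD": 2})
print(layer_stats(toy))  # (4, 4, 2, 1)
```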

These numbers need independent verification before they can be trusted. As a sanity check I have used the ply 11 data to compute perft 11, 12 and 13 and my results agree with previously published results.

Computing the unique 11 data took about 5 weeks using 2 Intel Core i7 quad core CPUs and two 8TB Buffalo LinkStation NAS units. About 3.7TiB of disk space is needed to store the result. Computing perft 13 from the unique 11 data took about 3 days using 3 Intel Core i7 quad core CPUs.

The perft 14 computation has now been running for about 18 hours and is 1.9% complete. If the current speed is representative for the whole computation, perft 14 will require about 40 days to complete.
User avatar
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Last call for perft 13 estimates!

Post by Ajedrecista »

Hi Peter:
petero2 wrote:I have computed the unique positions at ply N for N=0,1,2,...,11:

Code: Select all

ply nPositions   maxCnt   nMax nSing      FEN
0   1            1        1    1          rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
1   20           1        20   20         rnbqkbnr/pppppppp/8/8/8/N7/PPPPPPPP/R1BQKBNR b KQkq -
2   400          1        400  400        r1bqkbnr/pppppppp/n7/8/8/N7/PPPPPPPP/R1BQKBNR w KQkq -
3   5362         4        20   1862       r1bqkbnr/pppppppp/n7/8/8/8/PPPPPPPP/RNBQKBNR b KQkq -
4   72078        16       1    9825       rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
5   822518       87       2    53516      rnbqkbnr/pppppppp/8/8/8/4P3/PPPP1PPP/RNBQKBNR b KQkq -
6   9417681      479      4    311642     rnbqkbnr/pppp1ppp/4p3/8/8/4P3/PPPP1PPP/RNBQKBNR w KQkq -
7   96400068     2600     2    2018993    rnbqkbnr/pppp1ppp/4p3/8/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
8   988187354    14363    1    12150635   rnbqkbnr/ppp2ppp/8/3pp3/3PP3/8/PPP2PPP/RNBQKBNR w KQkq -
9   9183421888   115066   1    69284509   rnbqkbnr/ppp2ppp/8/3pp3/4P3/8/PPPP1PPP/RNBQKBNR b KQkq -
10  85375278064  1126591  1    382383387  rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq -
11  726155461002 11783579 1    1994236773 rnbqkbnr/pppp1ppp/8/4p3/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
nPositions is the number of unique positions at ply N. This number is also given by integer sequence A083276.

For a position P, let cnt(P) be the number of different sequences of N moves from the starting position that lead to position P.

maxCnt is max(cnt(P)) where the maximum is taken over all positions P at ply N.

nMax is the number of positions at ply N having cnt(P) == maxCnt.

nSing is the number of positions at ply N having cnt(P) == 1.

FEN is a position having cnt(P) == maxCnt.

These numbers need independent verification before they can be trusted. As a sanity check I have used the ply 11 data to compute perft 11, 12 and 13 and my results agree with previously published results.

Computing the unique 11 data took about 5 weeks using 2 Intel Core i7 quad core CPUs and two 8TB Buffalo LinkStation NAS units. About 3.7TiB of disk space is needed to store the result. Computing perft 13 from the unique 11 data took about 3 days using 3 Intel Core i7 quad core CPUs.
This is great news! It is good to see that your results agree with previous results. :)
petero2 wrote:The perft 14 computation has now been running for about 18 hours and is 1.9% complete. If the current speed is representative for the whole computation, perft 14 will require about 40 days to complete.
GREAT! You will overtake this distributed project:

Perft helpers

Can you give more details about your current computation, please? Thanks in advance.

Just to revive the black art of perft estimates, I copy this link:

Perft(14) estimates

Good luck with your Perft(14) computation!

Regards from Spain.

Ajedrecista.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Last call for perft 13 estimates!

Post by hgm »

It seems to me that this method of computing a perft from the unique positions at a certain depth is very similar to a normal hashed perft. My suspicion, however, is that you take a hit by not hashing at all depths, but just at one. And you take a hit by not sharing the hash between the terminal trees. There should be many positions at depth 11 that are so much alike that their perft(3) trees have significant overlap. This is not exploited.

The upside is that the 'hash table' used is lossless. I wonder if one could not get a much faster algorithm by doing a normally hashed perft as a single tree, with lossless hash.
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: Last call for perft 13 estimates!

Post by ibid »

petero2 wrote:I have computed the unique positions at ply N for N=0,1,2,...,11:

Code: Select all

ply nPositions   maxCnt   nMax nSing      FEN
0   1            1        1    1          rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
1   20           1        20   20         rnbqkbnr/pppppppp/8/8/8/N7/PPPPPPPP/R1BQKBNR b KQkq -
2   400          1        400  400        r1bqkbnr/pppppppp/n7/8/8/N7/PPPPPPPP/R1BQKBNR w KQkq -
3   5362         4        20   1862       r1bqkbnr/pppppppp/n7/8/8/8/PPPPPPPP/RNBQKBNR b KQkq -
4   72078        16       1    9825       rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
5   822518       87       2    53516      rnbqkbnr/pppppppp/8/8/8/4P3/PPPP1PPP/RNBQKBNR b KQkq -
6   9417681      479      4    311642     rnbqkbnr/pppp1ppp/4p3/8/8/4P3/PPPP1PPP/RNBQKBNR w KQkq -
7   96400068     2600     2    2018993    rnbqkbnr/pppp1ppp/4p3/8/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
8   988187354    14363    1    12150635   rnbqkbnr/ppp2ppp/8/3pp3/3PP3/8/PPP2PPP/RNBQKBNR w KQkq -
9   9183421888   115066   1    69284509   rnbqkbnr/ppp2ppp/8/3pp3/4P3/8/PPPP1PPP/RNBQKBNR b KQkq -
10  85375278064  1126591  1    382383387  rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq -
11  726155461002 11783579 1    1994236773 rnbqkbnr/pppp1ppp/8/4p3/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
These numbers need independent verification before they can be trusted. As a sanity check I have used the ply 11 data to compute perft 11, 12 and 13 and my results agree with previously published results.
All of those numbers agree with what I got, except, of course, the 11 ply ones which I never computed.
petero2 wrote:Computing the unique 11 data took about 5 weeks using 2 Intel Core i7 quad core CPUs and two 8TB Buffalo LinkStation NAS units. About 3.7TiB of disk space is needed to store the result. Computing perft 13 from the unique 11 data took about 3 days using 3 Intel Core i7 quad core CPUs.

The perft 14 computation has now been running for about 18 hours and is 1.9% complete. If the current speed is representative for the whole computation, perft 14 will require about 40 days to complete.
Very nice! The times are in line with what I was expecting my program to need -- if I'd had more than 1 TB of disk space to work with (I was going to have to do the computation in 15 parts -- 16 TB would have simplified this significantly).

Glad to see someone computing perft(14) and I eagerly await the Answer.

-paul
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: Last call for perft 13 estimates!

Post by ibid »

hgm wrote:It seems to me that this method of computing a perft from the unique positions at a certain depth is very similar to a normal hashed perft. My suspicion, however, is that you take a hit by not hashing at all depths, but just at one. And you take a hit by not sharing the hash between the terminal trees. There should be many positions at depth 11 that are so much alike that their perft(3) trees have significant overlap. This is not exploited.

The upside is that the 'hash table' used is lossless. I wonder if one could not get a much faster algorithm by doing a normally hashed perft as a single tree, with lossless hash.
I did a number of experiments some years ago: the unique approach was significantly faster, and the degree to which it wins surely just gets larger as the ply increases. The ability to get perfect transposition detection through the first 11 ply in this case is huge -- we are talking 726 billion positions here, and any reasonably sized hash table is going to miss an awful lot of transpositions.

Now the transpositions between the remaining perft(3) computations are certainly trickier. Since the 11 ply positions are simply being read from disk one at a time, all of memory can be devoted to the hash table for the subtrees. Hashing with only one ply to go was not helpful -- all you are saving is a single movegen -- and you are filling the table with clutter. So you are really only hashing values with two ply to go. Back then (~2006) I determined the overhead of computing a hash key plus the memory access roughly wiped out any gains from the transpositions seen. So I went for simplicity.

These days, with improved hardware and bigger hash tables, it may well prove to be a win, but I doubt it would be very big.

That said, should perft(15) be computed from the 11 ply positions, I am pretty sure a hash table would be a significant win.

When I computed checkers uniq and perft numbers, perft(28) was done as 22+6 and hash tables were a major win with 6 ply subtrees.

-paul
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Last call for perft 13 estimates!

Post by petero2 »

hgm wrote:It seems to me that this method of computing a perft from the unique positions at a certain depth is very similar to a normal hashed perft. My suspicion, however, is that you take a hit by not hashing at all depths, but just at one. And you take a hit by not sharing the hash between the terminal trees. There should be many positions at depth 11 that are so much alike that their perft(3) trees have significant overlap. This is not exploited.
In fact this is exploited. The uniq 11 positions are in sorted order, a hash table is used for the 726 billion depth 3 perft searches and the hash table is not cleared between searches.

The sort order compares one square at a time, in this order: a1,b1,...,h1,a2,...,h2,...
The piece ordering is: EMPTY < WKING < WQUEEN < WROOK < WBISHOP < WKNIGHT < WPAWN < BKING < BQUEEN < BROOK < BBISHOP < BKNIGHT < BPAWN
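A sketch of that comparison in Python. The board encoding here (a list of 64 piece names, a1 first) is an assumption for illustration, not texel's actual representation:

```python
# Piece codes following the stated ordering: EMPTY sorts first, BPAWN last.
PIECE_ORDER = ["EMPTY", "WKING", "WQUEEN", "WROOK", "WBISHOP", "WKNIGHT", "WPAWN",
               "BKING", "BQUEEN", "BROOK", "BBISHOP", "BKNIGHT", "BPAWN"]
RANK = {p: i for i, p in enumerate(PIECE_ORDER)}

def board_sort_key(board):
    # board: list of 64 piece names, square a1 first, then b1..h1, a2..h8.
    # Tuples compare lexicographically, i.e. one square at a time.
    return tuple(RANK[p] for p in board)

empty = ["EMPTY"] * 64
b1 = empty.copy(); b1[0] = "WKING"    # white king on a1
b2 = empty.copy(); b2[0] = "WQUEEN"   # white queen on a1
assert board_sort_key(b1) < board_sort_key(b2)  # WKING sorts before WQUEEN
```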

I have measured the hash table hit rate when computing smaller perft results and the hit rate is 50%-60% when doing perft 3 from unique positions.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Last call for perft 13 estimates!

Post by petero2 »

ibid wrote:Now the transpositions between the remaining perft(3) computations is certainly trickier. Since the 11 ply positions are simply being read from disk one at a time, all of memory can be devoted to the hash table for the subtrees. Hashing with only one ply to go was not helpful -- all you are saving is a single movegen -- and you are filling the table with clutter. So you are really only hashing values with two ply to go. Back then (~2006) I determined the overhead of computing a hash key plus the memory access roughly wiped out any gains from the transpositions seen. So I went for simplicity.

These days, with improved hardware and bigger hash tables, it may well prove to be a win, but I doubt it would be very big.
I do use a hash table for the perft(3) computations and I think the speedup is close to a factor of 2. Part of the reason for my speedup may well be that your move generator seems to be about twice as fast as mine.

Also I hash at both depth 1 and 2. The cost of a main memory access is smaller than the win from having a 50-60% chance of not having to compute the number of legal moves.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Last call for perft 13 estimates!

Post by petero2 »

ibid wrote:
petero2 wrote:I have computed the unique positions at ply N for N=0,1,2,...,11:

Code: Select all

ply nPositions   maxCnt   nMax nSing      FEN
0   1            1        1    1          rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
1   20           1        20   20         rnbqkbnr/pppppppp/8/8/8/N7/PPPPPPPP/R1BQKBNR b KQkq -
2   400          1        400  400        r1bqkbnr/pppppppp/n7/8/8/N7/PPPPPPPP/R1BQKBNR w KQkq -
3   5362         4        20   1862       r1bqkbnr/pppppppp/n7/8/8/8/PPPPPPPP/RNBQKBNR b KQkq -
4   72078        16       1    9825       rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
5   822518       87       2    53516      rnbqkbnr/pppppppp/8/8/8/4P3/PPPP1PPP/RNBQKBNR b KQkq -
6   9417681      479      4    311642     rnbqkbnr/pppp1ppp/4p3/8/8/4P3/PPPP1PPP/RNBQKBNR w KQkq -
7   96400068     2600     2    2018993    rnbqkbnr/pppp1ppp/4p3/8/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
8   988187354    14363    1    12150635   rnbqkbnr/ppp2ppp/8/3pp3/3PP3/8/PPP2PPP/RNBQKBNR w KQkq -
9   9183421888   115066   1    69284509   rnbqkbnr/ppp2ppp/8/3pp3/4P3/8/PPPP1PPP/RNBQKBNR b KQkq -
10  85375278064  1126591  1    382383387  rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq -
11  726155461002 11783579 1    1994236773 rnbqkbnr/pppp1ppp/8/4p3/3PP3/8/PPP2PPP/RNBQKBNR b KQkq -
These numbers need independent verification before they can be trusted. As a sanity check I have used the ply 11 data to compute perft 11, 12 and 13 and my results agree with previously published results.
All of those numbers agree with what I got, except, of course, the 11 ply ones which I never computed.
Good to know.
ibid wrote:
petero2 wrote:Computing the unique 11 data took about 5 weeks using 2 Intel Core i7 quad core CPUs and two 8TB Buffalo LinkStation NAS units. About 3.7TiB of disk space is needed to store the result. Computing perft 13 from the unique 11 data took about 3 days using 3 Intel Core i7 quad core CPUs.

The perft 14 computation has now been running for about 18 hours and is 1.9% complete. If the current speed is representative for the whole computation, perft 14 will require about 40 days to complete.
Very nice! The times are in line with what I was expecting my program to need -- if I'd had more than 1 TB of disk space to work with (I was going to have to do the computation in 15 parts -- 16 TB would have simplified this significantly).
I think it should be possible to compute this perhaps twice as fast with the same hardware, given that the last time I compared my perft speed to the speed of your gperftd program, your program was about twice as fast. In reality it may not be a factor of two speedup for this computation, because if the move generation is faster, the win from hashing is lower.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Last call for perft 13 estimates!

Post by petero2 »

Ajedrecista wrote:Can you give more details about your current computation, please?
There are two parts of this computation, computing the unique positions and using the unique positions to compute perft. Both parts use a move generator derived from the texel chess engine, modified to do bulk counting at leaf nodes.

This method has the advantage of being fast, but it does not produce a breakdown of the result, so the 20 perft(13) results after the first move are not computed.

My computers run Fedora 17 and I use mhddfs to join the two NAS drives into one large logical drive.

I use xz to compress all data stored on the hard disks. This reduces the required disk space by a factor of 10. Since xz compressed data contains checksums this also helps in detecting hard disk data corruption. I have not seen any corruption, but it is good to know that if it did happen it would be detected.
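Python's standard lzma module writes the same .xz container, so the corruption-detection property is easy to demonstrate (a sketch, not the actual pipeline):

```python
import lzma

# Highly repetitive stand-in for a file of chess positions.
payload = b"rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -\n" * 100
packed = lzma.compress(payload)          # .xz container, CRC64 check by default
assert lzma.decompress(packed) == payload

# Flip one byte in the compressed stream: decompression now fails loudly
# instead of silently returning wrong data.
damaged = bytearray(packed)
damaged[len(damaged) // 2] ^= 0xFF
try:
    lzma.decompress(bytes(damaged))
except lzma.LZMAError:
    print("corruption detected")
```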

Unique

The unique program computes the unique positions at ply N and the number of times they are reached from the starting position. It does this by using ply N-1 unique data as input and extending the result one ply.

This program basically has to do the same thing as the unix pipe "... | sort | uniq -c", except that special algorithms are needed when the data set (about 35TB uncompressed) does not fit in main memory. See external sorting on Wikipedia for details.
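A toy Python sketch of that external "sort | uniq -c", with a tiny chunk size standing in for available RAM and temp files for the spilled runs:

```python
import heapq, itertools, os, tempfile

def external_sort_uniq_count(lines, chunk_size=4):
    """Sort data that does not fit in 'memory' (chunk_size items here),
    spilling sorted runs to disk, then k-way merge and count duplicates."""
    paths = []
    it = iter(lines)
    while True:
        # Sort one RAM-sized chunk and spill it to a temp file.
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "w") as f:
            f.write("".join(line + "\n" for line in chunk))
        paths.append(path)
    # Lazily k-way merge the sorted runs, then count equal neighbors
    # (this is the 'uniq -c' step).
    files = [open(p) for p in paths]
    merged = heapq.merge(*(map(str.rstrip, f) for f in files))
    counts = [(key, sum(1 for _ in grp)) for key, grp in itertools.groupby(merged)]
    for f in files:
        f.close()
    for p in paths:
        os.remove(p)
    return counts

print(external_sort_uniq_count(["b", "a", "b", "c", "a", "b"]))
# [('a', 2), ('b', 3), ('c', 1)]
```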

The program relies heavily on unix pipes. Instead of writing directly to a disk file it writes to a unix pipe that compresses the data and splits it up in multiple files, each file containing 1e10 chess positions. Reading uses a similar pipe that concatenates and decompresses the data.

OpenMP is used to parallelize the move generation and in-memory sorting. The xz program has built-in support for multi-threaded compression of data.

The unique 10 data is split into 9 parts (85375278064 positions with 1e10 positions per file). Each part can be extended independently to ply 11. I manually assigned the 9 parts to different computers. Each part took about 4 days. The final merge of the 9 ply 11 results had to be performed on a single computer and took about 5 days.

The CPU is the bottleneck for this computation. Most of the time is spent on xz compressing the data. Faster disk drives would not help. Larger drives would help though, because then less compression would be needed which would require less CPU time.

The required disk bandwidth was between 18MB/s and 35MB/s. The average number of disk seeks was less than 10/s.

Perft

OpenMP is used to run 8 perft(3) computations simultaneously on each computer.

An implementation of Robert Hyatt's lockless transposition table is used and the hash table is shared between all perft threads. Besides being fast, this also provides protection against memory corruption, see http://talkchess.com/forum/viewtopic.ph ... ht=#502792.
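The core of Hyatt's lockless scheme is to store key XOR data in place of the key, so an entry half-written by another thread fails the verification XOR and reads back as an ordinary miss. A minimal single-threaded Python sketch of the idea (not petero2's actual implementation):

```python
class LocklessTable:
    """Sketch of Robert Hyatt's lockless transposition table: the stored
    'key' field is key ^ data, so an entry whose two fields come from
    different writes fails the verification XOR and counts as a miss."""
    def __init__(self, size):
        self.keys = [0] * size   # holds key ^ data, not the key itself
        self.data = [0] * size
        self.size = size

    def store(self, key, data):
        i = key % self.size
        self.keys[i] = key ^ data
        self.data[i] = data

    def probe(self, key):
        i = key % self.size
        data = self.data[i]
        if self.keys[i] ^ data == key:   # verification survives torn writes
            return data
        return None                      # miss (or corrupted entry)

t = LocklessTable(1024)
t.store(0x123456789ABC, 42)
assert t.probe(0x123456789ABC) == 42
assert t.probe(0xDEADBEEF) is None
```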

The unique 11 data is split into 73 parts and I manually assign parts to different computers. Two of the computers use 8GB for hash tables and one computer uses 4GB.

This computation is dominated by CPU and RAM speed. The required disk bandwidth is about 900KB/s and the number of disk seeks is less than 0.1/s.

The hashing uses one 64-bit Zobrist key for hash table indexing and another independent 64-bit Zobrist key to store in the hash table. Using hashing introduces the risk of a hash collision, which would ruin the result. If we assume that the probability that a hash probe for a position not in the hash table finds a different position with the same hash key is 2^-64, the probability of a hash collision during the perft(14) computation is less than:

Code: Select all

1-(1-2^-64)^(uniq(11)*30^2) = 1/28000
If only about 50% of the hash probes are hash misses, the probability is more like:

Code: Select all

1-(1-2^-64)^(uniq(11)*15^2) = 1/113000
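Both figures can be reproduced with ordinary floating point; since the per-probe collision probability is tiny, expm1/log1p keep the computation numerically stable. A quick check in Python:

```python
import math

def p_any_collision(n_probes, key_bits=64):
    # 1 - (1 - 2**-key_bits) ** n_probes, computed stably for tiny
    # per-probe probabilities via log1p/expm1.
    return -math.expm1(n_probes * math.log1p(-(2.0 ** -key_bits)))

uniq11 = 726_155_461_002
print(round(1 / p_any_collision(uniq11 * 30**2)))  # about 28000, i.e. 1/28000
print(round(1 / p_any_collision(uniq11 * 15**2)))  # about 113000
```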
In reality the Zobrist hash keys are not truly random, so it is up to anyone to judge how relevant the above calculations are. Some time ago I computed perft(11) using a smaller hash key and computed the collision probability to be more than 99%. Nevertheless I got the correct perft(11) result.
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: Last call for perft 13 estimates!

Post by ibid »

petero2 wrote:
ibid wrote:Now the transpositions between the remaining perft(3) computations is certainly trickier. Since the 11 ply positions are simply being read from disk one at a time, all of memory can be devoted to the hash table for the subtrees. Hashing with only one ply to go was not helpful -- all you are saving is a single movegen -- and you are filling the table with clutter. So you are really only hashing values with two ply to go. Back then (~2006) I determined the overhead of computing a hash key plus the memory access roughly wiped out any gains from the transpositions seen. So I went for simplicity.

These days, with improved hardware and bigger hash tables, it may well prove to be a win, but I doubt it would be very big.
I do use a hash table for the perft(3) computations and I think the speedup is close to a factor of 2. Part of the reason for my speedup may well be that your move generator seems to be about twice as fast as mine.

Also I hash at both depth 1 and 2. The cost of a main memory access is smaller than the win from having a 50-60% chance of not having to compute the number of legal moves.
Note to self: don't trust old tests. :)

The significant discrepancy between our programs inspired me to redo some tests on hashing. I computed perft(11) as uniq 8 plus 3 ply searches several ways. (All this was done on a 6-core Phenom 1090T.)

A recent version of my normal program runs this in 13144 seconds.

Grafting on the hash tables from gperftd, hashing only with 2 ply remaining and using 4 GB of memory drops this to 7230 seconds, a 45% decrease in time. So yes, a very significant win here.

I then went to hashing with both 1 and 2 ply remaining -- I don't have a final number; it was clear pretty quickly that it was running a little slower than the version with no hashing at all. So, for whatever reason, my code still doesn't like this approach at all.

As to why the results are different now than last time I did this, I can think of a number of factors: a new magic bitboard implementation versus an old array-based non-bitboard setup, very different hardware (multi-core Phenom vs. multi-CPU Itanium system), bigger hash tables now, etc. But given the numbers above, I have to wonder if the tests done back then simply hashed at all ply and I never tried eliminating the 1-ply lookups.

Assuming the 45% win roughly holds at higher depths, hash tables plus other changes from the last year or two would reduce the time to compute perft 13 as 10+3 to about 8 days and perft 14 as 11+3 to perhaps 75 days (assuming I could compute unique 11).

Some very preliminary trials with part of 8+4 indicate the savings from the hash table are about 75%, perhaps as high as 80%, with 4 ply sub-searches, which negates some of the advantage of doing perft(14) as 11+3 over 10+4, but definitely not all of it. I wouldn't care to put numbers on that at this point. :)

-paul