A perft faster than qperft?!

Allard Siemelink · Post by **Allard Siemelink** » Fri Apr 25, 2008 1:07 am

I finally got around to adding 'bulk counting' to my old (non hashing) perft code.
Amazingly it turns out to be ~2x faster than qperft on my 32-bit opteron@2500:

Code: Select all

>qperft 7
Quick Perft by H.G. Muller
Perft mode&#58; No hashing, bulk counting in horizon nodes
perft&#40;1&#41;=20 ( 0.000 sec&#41;
perft&#40;2&#41;=400 ( 0.000 sec&#41;
perft&#40;3&#41;=8902 ( 0.000 sec&#41;
perft&#40;4&#41;=197281 ( 0.015 sec&#41;
perft&#40;5&#41;=4865609 ( 0.156 sec&#41;
perft&#40;6&#41;=119060324 ( 3.922 sec&#41;
perft&#40;7&#41;=3195901860 &#40;102.016 sec&#41;

>i-perft 7
perft 1           20     0.00s    1.$ mnps  851.4 ticks/op
perft 2          400     0.00s    1.$ mnps   48.8 ticks/op
perft 3         8902     0.00s    1.$ mnps   33.5 ticks/op
perft 4       197281     0.02s   12.3 mnps   33.3 ticks/op
perft 5      4865609     0.06s   78.5 mnps   32.5 ticks/op
perft 6    119060324     1.56s   76.2 mnps   33.3 ticks/op
perft 7   3195901860    41.36s   77.3 mnps   32.7 ticks/op

>qperft 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"
perft&#40;1&#41;=56 ( 0.000 sec&#41;
perft&#40;2&#41;=2594 ( 0.000 sec&#41;
perft&#40;3&#41;=137198 ( 0.000 sec&#41;
perft&#40;4&#41;=6391735 ( 0.171 sec&#41;
perft&#40;5&#41;=323787902 ( 7.797 sec&#41;
perft&#40;6&#41;=15097513050 &#40;389.578 sec&#41;

>i-perft 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"
perft 1           56     0.00s    1.$ mnps  296.5 ticks/op
perft 2         2594     0.00s    1.$ mnps   30.9 ticks/op
perft 3       137198     0.00s    1.$ mnps   18.7 ticks/op
perft 4      6391735     0.06s  103.1 mnps   20.4 ticks/op
perft 5    323787902     2.41s  134.5 mnps   18.8 ticks/op
perft 6  15097513050   124.05s  121.7 mnps   20.8 ticks/op

Source and exe are available from http://brightchess.googlepages.com (it is the last item in the download section)

Dann Corbit · Post by **Dann Corbit** » Fri Apr 25, 2008 1:37 am

Allard Siemelink wrote:I finally got around to adding 'bulk counting' to my old (non hashing) perft code.
Amazingly it turns out to be ~2x faster than qperft on my 32-bit opteron@2500:

Code: Select all

>qperft 7
Quick Perft by H.G. Muller
Perft mode&#58; No hashing, bulk counting in horizon nodes
perft&#40;1&#41;=20 ( 0.000 sec&#41;
perft&#40;2&#41;=400 ( 0.000 sec&#41;
perft&#40;3&#41;=8902 ( 0.000 sec&#41;
perft&#40;4&#41;=197281 ( 0.015 sec&#41;
perft&#40;5&#41;=4865609 ( 0.156 sec&#41;
perft&#40;6&#41;=119060324 ( 3.922 sec&#41;
perft&#40;7&#41;=3195901860 &#40;102.016 sec&#41;

>i-perft 7
perft 1           20     0.00s    1.$ mnps  851.4 ticks/op
perft 2          400     0.00s    1.$ mnps   48.8 ticks/op
perft 3         8902     0.00s    1.$ mnps   33.5 ticks/op
perft 4       197281     0.02s   12.3 mnps   33.3 ticks/op
perft 5      4865609     0.06s   78.5 mnps   32.5 ticks/op
perft 6    119060324     1.56s   76.2 mnps   33.3 ticks/op
perft 7   3195901860    41.36s   77.3 mnps   32.7 ticks/op

>qperft 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"
perft&#40;1&#41;=56 ( 0.000 sec&#41;
perft&#40;2&#41;=2594 ( 0.000 sec&#41;
perft&#40;3&#41;=137198 ( 0.000 sec&#41;
perft&#40;4&#41;=6391735 ( 0.171 sec&#41;
perft&#40;5&#41;=323787902 ( 7.797 sec&#41;
perft&#40;6&#41;=15097513050 &#40;389.578 sec&#41;

>i-perft 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"
perft 1           56     0.00s    1.$ mnps  296.5 ticks/op
perft 2         2594     0.00s    1.$ mnps   30.9 ticks/op
perft 3       137198     0.00s    1.$ mnps   18.7 ticks/op
perft 4      6391735     0.06s  103.1 mnps   20.4 ticks/op
perft 5    323787902     2.41s  134.5 mnps   18.8 ticks/op
perft 6  15097513050   124.05s  121.7 mnps   20.8 ticks/op

Source and exe are available from http://brightchess.googlepages.com (it is the last item in the download section)

I wonder if it would get faster with hashing.
It is not necessarily the case, if the calculation is so fast that a hash lookup is not faster.

sje · Post by **sje** » Fri Apr 25, 2008 3:51 am

Symbolic on a 3 GHz P4, 32 bit, no transposition table:

Code: Select all

&#91;&#93; emp 1
Pathway count to depth one&#58; 20
&#91;CT&#93; &#91;C&#58; 00.00  U&#58; 00.00&#93;
&#91;&#93; emp 2
Pathway count to depth two&#58; 400
&#91;CT&#93; &#91;C&#58; 00.00  U&#58; 00.00&#93;
&#91;&#93; emp 3
Pathway count to depth three&#58; 8,902
&#91;CT&#93; &#91;C&#58; 00.03  U&#58; 00.03&#93;
&#91;&#93; emp 4
Pathway count to depth four&#58; 197,281
&#91;CT&#93; &#91;C&#58; 00.03  U&#58; 00.03&#93;
&#91;&#93; emp 5
Pathway count to depth five&#58; 4,865,609
&#91;CT&#93; &#91;C&#58; 00.33  U&#58; 00.33&#93;
&#91;&#93; emp 6
Pathway count to depth six&#58; 119,060,324
&#91;CT&#93; &#91;C&#58; 07.92  U&#58; 07.92&#93;
&#91;&#93; emp 7
Pathway count to depth seven&#58; 3,195,901,860
&#91;CT&#93; &#91;C&#58; 03&#58;22.79  U&#58; 03&#58;22.79&#93;

sje · Post by **sje** » Fri Apr 25, 2008 3:56 am

Symbolic on a 3 GHz P4, 32 bit, small transposition table:

Code: Select all

&#91;&#93; emp 1
Pathway count to depth one&#58; 20
&#91;CT&#93; &#91;C&#58; 00.00  U&#58; 00.00&#93;
&#91;&#93; emp 2
Pathway count to depth two&#58; 400
&#91;CT&#93; &#91;C&#58; 00.00  U&#58; 00.00&#93;
&#91;&#93; emp 3
Pathway count to depth three&#58; 8,902
&#91;CT&#93; &#91;C&#58; 00.03  U&#58; 00.03&#93;
&#91;&#93; emp 4
Pathway count to depth four&#58; 197,281
&#91;CT&#93; &#91;C&#58; 00.04  U&#58; 00.04&#93;
&#91;&#93; emp 5
Pathway count to depth five&#58; 4,865,609
&#91;CT&#93; &#91;C&#58; 00.21  U&#58; 00.21&#93;
&#91;&#93; emp 6
Pathway count to depth six&#58; 119,060,324
&#91;CT&#93; &#91;C&#58; 03.16  U&#58; 03.16&#93;
&#91;&#93; emp 7
Pathway count to depth seven&#58; 3,195,901,860
&#91;CT&#93; &#91;C&#58; 01&#58;03.76  U&#58; 01&#58;03.76&#93;

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Apr 25, 2008 11:02 am

Dann Corbit wrote: I wonder if it would get faster with hashing.
It is not necessarily the case, if the calculation is so fast that a hash lookup is not faster.

Kind of exponential smaller tree size due to hashing transpositions versus linear slowdown due to latency has likely some break even point, where both approaches take about equal solution time. The more transpositions are possible with depth > 2 (e.g. e4 c6 d4 versus d4 c6 e4), the more the hash with reasonable replacement scheme pays off.

hgm · Post by **hgm** » Fri Apr 25, 2008 1:11 pm

Did you use the qperft.exe on my website, or did you compile the source in the same way as your own? The reason I ask is that I noticed that gcc / cygwin, (I used to build qperft.exe) apparently does a quite poor job on my engines. Denis Mendoze made a compile for me of uMax that was 80% faster than my own gcc compile. Joker, which is more like qperft, did gain less, but still some 20%.

As to the hashing:

The issue is really if you should hash the frontier nodes (which only do the counting). Hashing interior nodes is always competative. Hashing the frontier nodes in the main hash table actually made qperft slower. The final solution I took was hashing the frontier nodes in a very small dedicated hash table, comparable to an eval cache, fitting entirely in L2. (2048 cache lines of 64 bytes, i.e. 128KB or 14K entries.)

Although the hit rate dropped from 60% to 30%, this gave a significant speedup, as the hash probes now go at the L2 access time, and thus are nearly free.

Allard Siemelink · Post by **Allard Siemelink** » Fri Apr 25, 2008 6:45 pm

hgm wrote:Did you use the qperft.exe on my website, or did you compile the source in the same way as your own? The reason I ask is that I noticed that gcc / cygwin, (I used to build qperft.exe) apparently does a quite poor job on my engines. Denis Mendoze made a compile for me of uMax that was 80% faster than my own gcc compile. Joker, which is more like qperft, did gain less, but still some 20%.

Hi HG, I used both; the exe from your website (quoted above) being slightly faster then my own qperft compile (gcc/mingw).
Admittedly I used gcc's pgo for i-perft, but I gained only 1%.
80% gain sounds incredible, perhaps Dennis should compile both perfts and compare?

Also, according to an old thread, qperft seems to do much better on core 2 duo processors.
Since I do not own one, I wonder if i-perft would still be faster on a core 2 duo.
Perhaps someone (you?) could post some c2d results?

hgm · Post by **hgm** » Fri Apr 25, 2008 8:48 pm

Code: Select all

>qperft 7
perft&#40;1&#41;=                  20 ( 0.000 sec&#41;
perft&#40;2&#41;=                 400 ( 0.000 sec&#41;
perft&#40;3&#41;=                8902 ( 0.000 sec&#41;
perft&#40;4&#41;=              197281 ( 0.000 sec&#41;
perft&#40;5&#41;=             4865609 ( 0.078 sec&#41;
perft&#40;6&#41;=           119060324 ( 1.435 sec&#41;
perft&#40;7&#41;=          3195901860 &#40;37.721 sec&#41;

>i-perft 7

perft 1           20     0.00s    1.$ mnps 2145.6 ticks/op
perft 2          400     0.00s    1.$ mnps   51.3 ticks/op
perft 3         8902     0.00s    1.$ mnps   35.2 ticks/op
perft 4       197281     0.00s    1.$ mnps   35.4 ticks/op
perft 5      4865609     0.08s   62.4 mnps   34.5 ticks/op
perft 6    119060324     1.17s  101.8 mnps   23.6 ticks/op
perft 7   3195901860    30.39s  105.2 mnps   22.8 ticks/op

>qperft 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"
perft&#40;1&#41;=                  56 ( 0.000 sec&#41;
perft&#40;2&#41;=                2594 ( 0.000 sec&#41;
perft&#40;3&#41;=              137198 ( 0.000 sec&#41;
perft&#40;4&#41;=             6391735 ( 0.078 sec&#41;
perft&#40;5&#41;=           323787902 ( 2.371 sec&#41;
perft&#40;6&#41;=         15097513050 &#40;123.833 sec&#41;

>i-perft 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"

perft 1           56     0.00s    1.$ mnps  867.4 ticks/op
perft 2         2594     0.00s    1.$ mnps   33.0 ticks/op
perft 3       137198     0.00s    1.$ mnps   19.7 ticks/op
perft 4      6391735     0.05s  136.0 mnps   22.4 ticks/op
perft 5    323787902     1.89s  171.5 mnps   13.9 ticks/op
perft 6  15097513050    96.20s  156.9 mnps   15.3 ticks/op

This is for Core 2 Duo 2.4GHz. Yours is still faster, but the gap is not nearly as big.

tvrzsky · Post by **tvrzsky** » Fri Apr 25, 2008 11:55 pm

This is on Pentium Dual Core E2140 overclocked to 2.4 GHz, qperft compiled in Microsoft Visual C++ 2005 Express Edition:

Code: Select all

Quick Perft by H.G. Muller
Perft mode&#58; No hashing, bulk counting in horizon nodes

perft&#40;1&#41;=                  20 ( 0.000 sec&#41;
perft&#40;2&#41;=                 400 ( 0.000 sec&#41;
perft&#40;3&#41;=                8902 ( 0.000 sec&#41;
perft&#40;4&#41;=              197281 ( 0.000 sec&#41;
perft&#40;5&#41;=             4865609 ( 0.062 sec&#41;
perft&#40;6&#41;=           119060324 ( 1.172 sec&#41;
perft&#40;7&#41;=          3195901860 &#40;31.344 sec&#41;

i-perft 1.0 &#40;c&#41; 2006-2008 AJ Siemelink

perft 1           20     0.00s    1.$ mnps 6414.4 ticks/op
perft 2          400     0.00s    1.$ mnps   91.9 ticks/op
perft 3         8902     0.00s    1.$ mnps   29.2 ticks/op
perft 4       197281     0.01s   13.2 mnps   24.3 ticks/op
perft 5      4865609     0.05s  103.5 mnps   22.9 ticks/op
perft 6    119060324     1.14s  104.3 mnps   23.1 ticks/op
perft 7   3195901860    30.16s  106.0 mnps   22.7 ticks/op

qperft.exe 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"
Quick Perft by H.G. Muller
Perft mode&#58; No hashing, bulk counting in horizon nodes

perft&#40;1&#41;=                  56 ( 0.000 sec&#41;
perft&#40;2&#41;=                2594 ( 0.000 sec&#41;
perft&#40;3&#41;=              137198 ( 0.000 sec&#41;
perft&#40;4&#41;=             6391735 ( 0.047 sec&#41;
perft&#40;5&#41;=           323787902 ( 2.000 sec&#41;
perft&#40;6&#41;=         15097513050 &#40;102.515 sec&#41;

i-perft.exe 6 "r3r1k1/1pq2pp1/2p2n2/1PNn4/2QN2b1/6P1/3RPP2/2R3KB b - -"

i-perft 1.0 &#40;c&#41; 2006-2008 AJ Siemelink

perft 1           56     0.00s    1.$ mnps 2786.0 ticks/op
perft 2         2594     0.00s    1.$ mnps  121.3 ticks/op
perft 3       137198     0.00s    1.$ mnps   13.4 ticks/op
perft 4      6391735     0.03s  206.2 mnps   14.9 ticks/op
perft 5    323787902     1.86s  174.1 mnps   13.7 ticks/op
perft 6  15097513050    95.16s  158.7 mnps   15.2 ticks/op

hgm · Post by **hgm** » Sat Apr 26, 2008 12:33 pm

So they get very close indeed. I seem to remember Allard told me that his move generator was very close to the mailbox (16x12) I use in qperft. Is that tstill the same move generator as is used in i-perft?

I have no idea why it performs so relatively poorly on Opteron. Might have to do with an unfortunate layout of the global data, and the small number of cache ways on AMD machines.

A perft faster than qperft?!

A perft faster than qperft?!

Re: A perft faster than qperft?!

Symbolic movepath enumeration

Re: Symbolic movepath enumeration

Re: A perft faster than qperft?!

Re: A perft faster than qperft?!

Re: A perft faster than qperft?!

Re: A perft faster than qperft?!

Re: A perft faster than qperft?!

Re: A perft faster than qperft?!