I finally got around to adding 'bulk counting' to my old (non-hashing) perft code.
Amazingly, it turns out to be ~2x faster than qperft on my 32-bit Opteron@2500:
Dann Corbit wrote:
I wonder if it would get faster with hashing.
That is not necessarily the case: if the calculation is already very fast, a hash lookup may not be any faster.
It is a trade-off between the roughly exponential reduction in tree size from hashing transpositions and the linear slowdown from memory latency. There is likely a break-even point where both approaches take about the same time. The more transpositions are possible at depth > 2 (e.g. e4 c6 d4 versus d4 c6 e4), the more a hash table with a reasonable replacement scheme pays off.
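To make the transposition argument concrete, here is a minimal sketch of perft with and without a table. It is not taken from qperft or i-perft: it uses tic-tac-toe as a stand-in game so the whole thing fits in a few lines, and the names `perft`, `perft_hashed`, `tt` and `visits` are made up for illustration. The table is a perfect index on (position, depth), so there are no collisions and no replacement scheme, which a real chess perft would need. Both versions use bulk counting at depth 1.

```c
#include <stdint.h>

/* Toy game (assumption): tic-tac-toe, each side is a 9-bit occupancy mask. */
/* Winning lines as octal masks: 3 rows, 3 columns, 2 diagonals.            */
static const int WIN_LINES[8] = {0007, 0070, 0700, 0111, 0222, 0444, 0421, 0124};

static int has_line(int m) {
    for (int i = 0; i < 8; i++)
        if ((m & WIN_LINES[i]) == WIN_LINES[i]) return 1;
    return 0;
}

static int popcount9(int m) {
    int n = 0;
    while (m) { m &= m - 1; n++; }
    return n;
}

static uint64_t visits;  /* number of perft calls, to compare tree sizes */

/* Plain perft: 'own' is the side to move, 'opp' the side that just moved. */
static uint64_t perft(int own, int opp, int depth) {
    visits++;
    if (depth == 0) return 1;
    if (has_line(opp)) return 0;               /* game over: no moves     */
    int empty = 0777 & ~(own | opp);
    if (depth == 1) return popcount9(empty);   /* bulk counting           */
    uint64_t n = 0;
    for (int sq = 0; sq < 9; sq++)
        if (empty & (1 << sq))
            n += perft(opp, own | (1 << sq), depth - 1);
    return n;
}

/* Hypothetical table: direct index on both masks and depth, 0 = empty. */
static uint32_t tt[10][1 << 18];

/* Same search, but interior nodes are looked up before being expanded. */
static uint64_t perft_hashed(int own, int opp, int depth) {
    visits++;
    if (depth == 0) return 1;
    if (has_line(opp)) return 0;
    int empty = 0777 & ~(own | opp);
    if (depth == 1) return popcount9(empty);   /* frontier: not hashed    */
    uint32_t *slot = &tt[depth][own | (opp << 9)];
    if (*slot) return *slot - 1;               /* transposition hit       */
    uint64_t n = 0;
    for (int sq = 0; sq < 9; sq++)
        if (empty & (1 << sq))
            n += perft_hashed(opp, own | (1 << sq), depth - 1);
    *slot = (uint32_t)n + 1;                   /* stored as count + 1     */
    return n;
}
```

Calling `perft(0, 0, d)` and `perft_hashed(0, 0, d)` gives the same count, but the hashed version makes far fewer calls as d grows. In this toy the lookup is a single array access; in a real engine each probe can be a cache miss, which is the latency cost discussed above.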
Did you use the qperft.exe on my website, or did you compile the source in the same way as your own? The reason I ask is that I noticed that gcc / cygwin (which I used to build qperft.exe) apparently does quite a poor job on my engines. Denis Mendoze made a compile of uMax for me that was 80% faster than my own gcc compile. Joker, which is more like qperft, gained less, but still some 20%.
As to the hashing:
The real issue is whether you should hash the frontier nodes (which only do the counting). Hashing interior nodes is always competitive. Hashing the frontier nodes in the main hash table actually made qperft slower. The solution I finally settled on was hashing the frontier nodes in a very small dedicated hash table, comparable to an eval cache, that fits entirely in L2 (2048 cache lines of 64 bytes, i.e. 128KB or 14K entries).
Although the hit rate dropped from 60% to 30%, this gave a significant speedup, as the hash probes now complete in L2 access time and are thus nearly free.
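For illustration, a small frontier cache along these lines could be laid out as below. This is only a sketch and not qperft's actual code. The numbers 2048 lines of 64 bytes come from the post; the split into 7 entries per line (8-byte key plus 1-byte move count, with the last byte as a round-robin cursor) is an assumption that happens to match the "14K entries" figure. All names (`CacheLine`, `frontier_probe`, `frontier_store`) are hypothetical, and it needs a C11 compiler for `_Static_assert` and `_Alignas`.

```c
#include <stdint.h>

#define CACHE_LINES 2048   /* from the post: 2048 lines * 64 bytes = 128KB */
#define WAYS        7      /* assumption: 7 entries share one cache line   */

typedef struct {
    uint64_t key[WAYS];    /* 56 bytes: full hash keys, 0 = empty slot     */
    uint8_t  count[WAYS];  /*  7 bytes: move count of the frontier node    */
    uint8_t  next;         /*  1 byte : round-robin replacement cursor     */
} CacheLine;

_Static_assert(sizeof(CacheLine) == 64, "one entry group per 64-byte line");

/* Aligned so that a probe touches exactly one hardware cache line. */
static _Alignas(64) CacheLine frontier_cache[CACHE_LINES];

/* Returns the cached move count, or -1 on a miss. */
static int frontier_probe(uint64_t key) {
    if (key == 0) return -1;                   /* key 0 is never cached    */
    const CacheLine *l = &frontier_cache[key & (CACHE_LINES - 1)];
    for (int i = 0; i < WAYS; i++)
        if (l->key[i] == key) return l->count[i];
    return -1;
}

/* Stores a move count (0..255), overwriting the oldest slot in the line. */
static void frontier_store(uint64_t key, int count) {
    if (key == 0) return;                      /* 0 is the empty marker    */
    CacheLine *l = &frontier_cache[key & (CACHE_LINES - 1)];
    for (int i = 0; i < WAYS; i++) {
        if (l->key[i] == key) {                /* already present: update  */
            l->count[i] = (uint8_t)count;
            return;
        }
    }
    l->key[l->next] = key;
    l->count[l->next] = (uint8_t)count;
    l->next = (uint8_t)((l->next + 1) % WAYS);
}
```

In a perft loop, a frontier node would call `frontier_probe(key)` first and only generate and count moves on a miss, followed by `frontier_store(key, count)`. Because the whole table is 128KB, it can stay resident in L2 while the main hash table is used for interior nodes only.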
Hi HG, I used both; the exe from your website (quoted above) is slightly faster than my own qperft compile (gcc/mingw).
Admittedly I used gcc's PGO for i-perft, but I gained only 1%.
An 80% gain sounds incredible. Perhaps Denis should compile both perfts and compare?
Also, according to an old thread, qperft seems to do much better on Core 2 Duo processors.
Since I do not own one, I wonder if i-perft would still be faster on a Core 2 Duo.
Perhaps someone (you?) could post some C2D results?
So they get very close indeed. I seem to remember Allard told me that his move generator was very close to the mailbox (16x12) I use in qperft. Is that still the same move generator as is used in i-perft?
I have no idea why it performs so relatively poorly on Opteron. It might have to do with an unfortunate layout of the global data, combined with the small number of cache ways on AMD machines.