OpenCL perft() Technical Issues

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Scary signatures

Post by sje »

syzygy wrote:More importantly, a chess engine can afford some false hits that go undetected. A perft() program cannot, unless you don't care about the correctness of the calculation. 64-bit keys are certainly insufficient. With 128-bit keys it might be that the probability of a hashing error becomes smaller than the probability of a hardware error. (But one should make sure to use a new set of 128-bit Zobrist keys in a verification run if using the same program.)
I agree with this; a well-written chess program can survive an invalid score retrieval, and also an invalid move retrieval given proper legality testing.

But a verification run needs more than just a different set of keys; it should be done on different hardware with a different program written by a different author.

Many years ago, I had written a chess program which had passed all kinds of tests for a couple of years. But one day, the tricky legal-move-only generation failed -- on a tree grown from a middlegame position out of Win At Chess. There was one unusual case, with a bishop pinning a pawn from behind after an en passant capture, which triggered a bug. A humbling incident for sure, and it taught me to be extra cautious.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Billions and billions of positions

Post by sje »

mvk wrote:If you have memory for 5B positions. Otherwise it is as if the 23 people don't fit in the room. The birthday paradox is often misapplied in the context of computer chess hash tables.
Right now, Symbolic is running on a machine with 32 GB RAM and is using a 24 GB transposition table with 2^30 = 1,073,741,824 elements, just over a billion.

In a decade or so, affordable desktop machines with terabyte RAM might become available. It's good to plan ahead.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Scary signatures

Post by syzygy »

sje wrote:But a verification run needs more than just a different set of keys; it should be done on different hardware with a different program written by a different author.
I agree that a verification run ideally should be performed as independently as possible, but my point was that using imperfect hashing inherently makes the program unreliable. So a verification run should at the very least address that source of potential errors.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: The Great Battle of the Hash Codes

Post by mar »

sje wrote:There is a cool trick for determining the period of any PRNG which returns its seed. All that is needed is an iteration loop which calls two different instances of a PRNG, the first once per iteration, the second twice per iteration. When the two outputs match, the iteration counter gives the period.

By the way, most C compilers will accept unsigned long long int literals like:

Code: Select all

static const ULong C0 = 0xcf14f4ebc5462216ull;
static const ULong C1 = 0x9576080c75ecfc58ull;
Thanks, I expect the period to be 2^60 or maybe even larger.
So there must be a better way to measure/compute the period.
Even counting iterations at 1 GHz, I'd have to wait ~1 billion seconds if it's really 2^60...

As for the constant, yes, I'm aware of that, but IIRC I got warnings from gcc (or maybe it was clang) that the ull suffix is non-standard pre-C++11. I'm not sure whether this applies to C, however.
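
For the record, the trick sje describes is essentially Floyd's cycle detection. A minimal sketch with a toy full-period generator standing in for the PRNG (purely illustrative -- with a real period of 2^60 or more the loop below would of course never finish in any reasonable time):

Code: Select all

#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for a PRNG whose output is also its next seed:
   a full-period LCG modulo 2^20, so the loop finishes quickly. */
static uint32_t step(uint32_t x)
{
  return (x * 5u + 3u) & 0xFFFFFu;
}

int main(void)
{
  uint32_t once  = 1;   /* first instance: stepped once per iteration   */
  uint32_t twice = 1;   /* second instance: stepped twice per iteration */
  uint64_t count = 0;

  do
  {
    once  = step(once);
    twice = step(step(twice));
    count++;
  } while (once != twice);

  /* If the seed lies on the cycle (no pre-period), count equals the period. */
  printf("period: %llu\n", (unsigned long long) count);  /* prints 1048576 = 2^20 */
  return 0;
}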
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

A surprise from an inadvertent test

Post by sje »

A surprise from an inadvertent test:

I tried a test run of Oscar calculating perft(9) on a 32-bit Linux notebook with the objective of verifying cross-platform compatibility. To my surprise, the result was off by a few parts per million.

At first I thought that there was a coding error in my recently integrated target bit vector code (more on target bit vectors later). After too much time flailing away, I tried turning off the transposition table to help isolate the fault. Well, that made the fault go away. Examination of the simple transposition code revealed nothing of interest. It was only then that I looked at the hash generation and found that I'd left in a line of debugging code which limited the PRNG to using only seven primes instead of sixteen.

So, seven primes aren't good enough, something I'd suspected but had not verified until a few hours ago. Just by accident.
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: Scary PRNGs

Post by ibid »

sje wrote:
mar wrote:Hmm, when I think about it, why not nibble-pack the position into 32 bytes, plus use a separate table for, say, a 1- or 2-byte flags field (castling rights etc.)?
Of course it would be (much?) slower; hashing would still be necessary to get the TT index (but it could be 64-bit). Also, some of the most significant bits of the hash key could be used as a simple checksum in unused flag bits.
That would of course be slower (and not cache friendly) and would require more memory, but it would also eliminate false positives completely.
16 or 32 bytes is not that much of a difference on today's hardware.
Btw, how many bits do you use to store subtotals in the TT?
Symbolic does have a routine to make a nybble-packed, 32-byte CompactBoard object from a 64-integer Board object. But storing an extra 32 bytes in each transposition table entry eats a lot of memory, and there's still a need for a hash signature to scatter the entries throughout the address space.

Then again, storing a CompactBoard object with each entry can be useful for testing, as it's a good way to detect false positives.
The size can be reduced to 24 bytes pretty easily: an 8-byte occupied bitboard plus 4 bits for each of up to 32 pieces. You can also define the fake pieces rook-which-can-castle, pawn-which-can-be-captured-en-passant, and king-whose-side-has-the-move.
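
A rough sketch of that packing (the input format, names, and 4-bit piece codes here are illustrative; any fixed encoding with at most 16 values, including the fake pieces, works):

Code: Select all

#include <stdint.h>
#include <string.h>

/* 24-byte packed position: an occupancy bitboard plus a 4-bit code for
   each occupied square, in square order. */
typedef struct
{
  uint64_t occupied;     /* one bit per occupied square           */
  uint8_t  nibbles[16];  /* up to 32 four-bit piece codes, packed */
} PackedBoard;

/* board[64] holds a 4-bit piece code per square, 0 = empty (a hypothetical
   input format; adapt to whatever the engine's board object provides). */
static void PackBoard(const uint8_t board[64], PackedBoard *pb)
{
  unsigned n = 0;

  memset(pb, 0, sizeof *pb);
  for (unsigned sq = 0; sq < 64; sq++)
    if (board[sq] != 0)
    {
      pb->occupied |= (uint64_t) 1 << sq;
      pb->nibbles[n >> 1] |= (uint8_t) (board[sq] << ((n & 1) ? 4 : 0));
      n++;
    }
}
Two positions are then identical exactly when their PackedBoard objects are byte-for-byte equal (a single memcmp), with no false positives, assuming the encoding captures castling, en passant and side to move via the fake pieces.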

My generator of unique positions has always used this form of CompactBoard instead of hash keys for verifying two positions are the same. And yes, it is rather slow, but I wanted to be absolutely sure long-running perft computations were not starting with faulty uniq data.

That said, the unique sets are getting bigger and slower to generate, so I am contemplating switching to a 128-bit verification key. This actually is a birthday problem, and I estimate the chance of error at ~1.4x10^-27 for the 11-ply unique positions in chess.

And I agree 100%, it has always been handy to have CompactBoard code available for debugging purposes.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A surprise from an inadvertent test

Post by mar »

sje wrote:So, seven primes aren't good enough, something I'd suspected but had not verified until a few hours ago. Just by accident.
That's alarming.
So the product (and thus the period) of the first sixteen primes should be 32,589,158,477,190,044,730, right? So slightly over 2^64.
I tried running your 16-prime generator through a test suite and I'm afraid it's not a good PRNG; it failed 5 of the first 7 tests before I stopped it.
I did the same with 32 primes and got a similar result, except an order of magnitude slower -- so I guess adding more primes doesn't really improve the quality, or only marginally.
It's also possible that I messed up, but I doubt it; I used the code you posted (including the fixed CalcHashQword).
I understand that your PRNG has a neat property of being able to generate any n-th number in constant time, but...
Why don't you just grab those 824 good randoms from /dev/urandom? It's only 13 KB of raw data.
Pride aside -- it would be a real pity if such a huge effort were wasted because of a random collision somewhere in the middle of the computation...
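
Something along these lines would do it (a minimal sketch; KEY_COUNT is just the 824 figure above, and I'm assuming 128-bit keys, i.e. two 64-bit halves per key, which is where the ~13 KB comes from):

Code: Select all

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define KEY_COUNT 824   /* the figure quoted above */

int main(void)
{
  /* Two 64-bit halves per 128-bit Zobrist key: 824 * 16 bytes = ~13 KB. */
  static uint64_t keys[KEY_COUNT][2];
  FILE *f = fopen("/dev/urandom", "rb");

  if (f == NULL || fread(keys, sizeof keys[0], KEY_COUNT, f) != KEY_COUNT)
  {
    fprintf(stderr, "could not read /dev/urandom\n");
    return EXIT_FAILURE;
  }
  fclose(f);

  /* Emit as C source, ready to paste into the engine as fixed constants. */
  for (int i = 0; i < KEY_COUNT; i++)
    printf("{ 0x%016llxull, 0x%016llxull },\n",
           (unsigned long long) keys[i][0], (unsigned long long) keys[i][1]);
  return 0;
}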
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Scary PRNGs

Post by sje »

ibid wrote:And I agree 100%, it has always been handy to have CompactBoard code available for debugging purposes.
Interestingly, the idea for CompactBoard came from a chess program I wrote in the early 1980s when the Intel 8086 CPU first appeared. With no web and not much literature available, I didn't know about Zobrist hash codes. Yet I needed a way to detect position repetition using only a rather small amount of memory. So I stored the equivalent of a CompactBoard object on a stack. A bit slow, so the program tested for repetition only at ply zero and ply one.

The other idea I had for repetition detection was to store a canonical version of piece moves, captures, and promotions -- a list which would be identical for identical positions coming from the same root by different paths. But the list would be of variable length, and that was not an attractive proposition.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Transposition table instrumentation (part 1)

Post by sje »

Transposition table instrumentation (part 1)

New code:

Code: Select all

// Perft transposition table

typedef struct
{
  ui64     entrycount;       // Number of entries
  ui64     fetchcount;       // Number of fetch calls
  ui64     matchcount;       // Number of matches
  ui64     stashcount;       // Number of stash calls
  ui64     trashcount;       // Number of trashes (overwrites)
  ui64     usagecount;       // Number of entries in use
  PerftRec storage[PTTLen];  // Storage
} PerftTable;
From the console:

Code: Select all

gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 5
TT: entry count: 16777216
TT: fetch count: 9323
TT: match count: 3540
TT: stash count: 5783
TT: trash count: 0
TT: usage count: 5783
Utilization: 0.000345
4865609   7.11 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 6
TT: entry count: 16777216
TT: fetch count: 127845
TT: match count: 49983
TT: stash count: 77862
TT: trash count: 5
TT: usage count: 77857
Utilization: 0.004641
119060324   58.37 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 7
TT: entry count: 16777216
TT: fetch count: 1925263
TT: match count: 1024100
TT: stash count: 901163
TT: trash count: 2511
TT: usage count: 898652
Utilization: 0.053564
3195901860   169.15 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 8
TT: entry count: 16777216
TT: fetch count: 22253113
TT: match count: 11838002
TT: stash count: 10415111
TT: trash count: 1569436
TT: usage count: 8845675
Utilization: 0.527243
84998978956   364.39 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 9
TT: entry count: 16777216
TT: fetch count: 297017519
TT: match count: 135172584
TT: stash count: 161844935
TT: trash count: 145068108
TT: usage count: 16776827
Utilization: 0.999977
2439530234167   619.74 MHz
Transposition table instrumentation (part 2), coming soon but not right away, will show why I believe that Oscar's PRNG is fine for its intended purpose.
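
Roughly speaking, counters like these get bumped around the probe and store routines along the following lines (a simplified sketch only; it assumes the PerftTable, PTTLen and ui64 definitions above, and the PerftRec layout and function names shown are illustrative, not the actual entry format):

Code: Select all

#include <stdbool.h>

// Illustrative guess at the record layout; the real PerftRec is not shown above.
typedef struct
{
  ui64 signature;  // full hash signature of the stored position/draft
  ui64 subtotal;   // perft subtotal for that position and draft
} PerftRec;

static bool PerftTableFetch(PerftTable *ptt, ui64 signature, ui64 *subtotal)
{
  PerftRec *rec = &ptt->storage[signature % PTTLen];

  ptt->fetchcount++;                 // one fetch per probe
  if (rec->signature == signature)
  {
    ptt->matchcount++;               // signature match: subtotal reused
    *subtotal = rec->subtotal;
    return true;
  }
  return false;
}

static void PerftTableStash(PerftTable *ptt, ui64 signature, ui64 subtotal)
{
  PerftRec *rec = &ptt->storage[signature % PTTLen];

  ptt->stashcount++;                 // one stash per store
  if (rec->signature == 0)
    ptt->usagecount++;               // filling a previously empty slot
  else
    ptt->trashcount++;               // overwriting (trashing) a live entry
  rec->signature = signature;
  rec->subtotal  = subtotal;
}
This kind of bookkeeping is consistent with the console output above, where the stash count equals the usage count plus the trash count in every run, and where utilization climbs toward 1.0 once the table saturates at perft(9).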
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: The Great Battle of the Hash Codes

Post by jdart »

I have used ULL for a long time, and had no issues with gcc or clang. Even old gcc on an ancient SPARC system handled it.

--Jon