OpenCL perft() Technical Issues

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Compression factors

Post by sje »

Some hash code text file compression factors, courtesy of gzip:

Code: Select all

gail:tmp sje$ gzip -v Hash*
HashCodesJesus:	 41.2% -- replaced with HashCodesJesus.gz
HashCodesMartin:	 41.1% -- replaced with HashCodesMartin.gz
HashCodesOscar:	 41.7% -- replaced with HashCodesOscar.gz
HashCodesUrandom:	 41.2% -- replaced with HashCodesUrandom.gz
No significant differences.

I suppose I could add a ninth or tenth prime to Oscar's PRNG, which would reduce the resulting compression factor slightly. But this would only slow the initialization and would not improve the table utilization coefficient.

So far, the moral of the story is:

For Zobrist key generation, don't use a PRNG which is fat, slow, complex, non-portable, or has external dependencies. Instead, use a slim, fast, simple, portable, and wholly internal PRNG which has been thoroughly tested and has proven empirical performance.
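As a concrete illustration of that kind of generator, here is a sketch using splitmix64, a well-known slim PRNG (an example only, not Oscar's actual generator):

Code: Select all

#include <stdint.h>

/* splitmix64: slim, fast, simple, portable; example only, not Oscar's PRNG */
static uint64_t prngState = 1;

static uint64_t Next64(void)
{
    uint64_t z = (prngState += 0x9e3779b97f4a7c15ull);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ull;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebull;
    return z ^ (z >> 31);
}

/* one 64-bit Zobrist key per (piece kind, square) */
static uint64_t Zobrist[12][64];

static void InitZobrist(void)
{
    for (int p = 0; p < 12; p++)
        for (int s = 0; s < 64; s++)
            Zobrist[p][s] = Next64();
}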
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: With hash codes from /dev/urandom

Post by bob »

sje wrote:With hash codes from /dev/urandom:

Code: Select all

hexdump -v -e '"%08x%08x%08x%08x\n"' /dev/urandom | head -824 >HashCodesUrandom
https://dl.dropboxusercontent.com/u/316 ... desUrandom

Code: Select all

gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 1
Utilization: 0.000000
46   0.00 MHz
gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 2
Utilization: 0.000000
2079   0.00 MHz
gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 3
Utilization: 0.000003
89890   0.15 MHz
gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 4
Utilization: 0.000127
3894594   5.39 MHz
gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 5
Utilization: 0.003070
164075551   56.34 MHz
gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 6
Utilization: 0.072268
6923051137   128.73 MHz
gail:tmp sje$ ./operft "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10" 7
Utilization: 0.804249
287188994746   278.74 MHz
No significant differences in utilization coefficients.
This is sort of old news, actually. There are no "good" random number collections when used for hashing, but there are definitely "bad" collections. Several of us tried all sorts of approaches a few years ago, with zero improvement. You can even try some crypto-type stuff, such as enforcing a minimum Hamming distance between any two numbers, minimizing the sums of squares of those Hamming distances, or making sure each number has an equal count of 0's and 1's. Or you can try the same for meaningful pairs of numbers (e.g. for knights, making sure that the from/to randoms differ enough to change the signature significantly). And on and on. There are lots of choices, and none of 'em helped a bit. On the other hand, if you get sloppy and use numbers with too small a Hamming distance, or triplets at bad locations where a from/to pair of numbers "undo" each other, you will get burned. Beyond avoiding those defects, we could not find any particular "advice" for producing good numbers: just make sure they are all reasonably random, reasonably uniform in distribution, and free of obvious issues like having only 0 or 1 bits set.
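A minimal audit for those sloppy cases might look like this sketch (__builtin_popcountll is GCC/Clang specific, and the key array and count are placeholders):

Code: Select all

#include <stdint.h>

/* Sketch: smallest pairwise Hamming distance over a set of Zobrist keys.
   A very small minimum flags the "sloppy" key sets described above. */
static unsigned MinHamming(const uint64_t *keys, int n)
{
    unsigned best = 64;

    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            unsigned d = (unsigned)__builtin_popcountll(keys[i] ^ keys[j]);
            if (d < best)
                best = d;
        }
    return best;
}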

I don't think you will find anything that is better than what you are using, but you certainly might find something worse.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Scary PRNGs

Post by sje »

bob wrote: This is sort of old news, actually. There are no "good" random number collections when used for hashing, but there are definitely "bad" collections. Several of us tried all sorts of approaches a few years ago, with zero improvement. You can even try some crypto-type stuff, such as enforcing a minimum Hamming distance between any two numbers, minimizing the sums of squares of those Hamming distances, or making sure each number has an equal count of 0's and 1's. Or you can try the same for meaningful pairs of numbers (e.g. for knights, making sure that the from/to randoms differ enough to change the signature significantly). And on and on. There are lots of choices, and none of 'em helped a bit. On the other hand, if you get sloppy and use numbers with too small a Hamming distance, or triplets at bad locations where a from/to pair of numbers "undo" each other, you will get burned. Beyond avoiding those defects, we could not find any particular "advice" for producing good numbers: just make sure they are all reasonably random, reasonably uniform in distribution, and free of obvious issues like having only 0 or 1 bits set.

I don't think you will find anything that is better than what you are using, but you certainly might find something worse.
I pretty much agree with all of the above. There are times when many pages of mathematical analysis are less revealing than a few all-night perft() runs.

What is "something worse"? I tried running Oscar with his PRNG running with fewer than its usual eight primes. With as many as six primes, I was able to get false positives, which is VERY SCARY for a perft() program. (Six prime period is 30,030; Oscar needs 105,472 bits.) Adding the seventh prime (period now 510,510) removed the false positives. But too close for comfort. So Oscar will now use the first sixteen primes so that I may sleep better.

Symbolic uses 31 primes, which is certainly sufficient for a number of purposes including generating many billions of unique random games.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Scary PRNGs

Post by mar »

Hmm, when I think about it, why not nibble-pack the position into 32 bytes, plus a separate field of say 1 or 2 bytes for flags (castling rights etc.)?
Hashing would still be necessary to get the TT index (but the hash could be 64 bit), and some of the most significant bits of the hash key could serve as a simple checksum in otherwise unused flag bits.
It would be slower of course (and not cache friendly) and would require more memory, but it would also eliminate false positives completely.
And 16 versus 32 bytes is not that much of a difference on today's hardware.
Btw. how many bits do you use to store subtotals in TT?
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Scary PRNGs

Post by sje »

mar wrote:Hmm, when I think about it, why not nibble-pack the position into 32 bytes, plus a separate field of say 1 or 2 bytes for flags (castling rights etc.)?
Hashing would still be necessary to get the TT index (but the hash could be 64 bit), and some of the most significant bits of the hash key could serve as a simple checksum in otherwise unused flag bits.
It would be slower of course (and not cache friendly) and would require more memory, but it would also eliminate false positives completely.
And 16 versus 32 bytes is not that much of a difference on today's hardware.
Btw. how many bits do you use to store subtotals in TT?
Symbolic does have a routine to make a nybble-packed, 32 byte CompactBoard object from a 64 element integer Board object. But storing an extra 32 bytes in each transposition table entry eats a lot of memory, and there's still a need for a hash signature to scatter the entries throughout the address space.
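A minimal sketch of that kind of packing, two squares per byte (hypothetical names; Symbolic's actual CompactBoard layout surely differs):

Code: Select all

#include <stdint.h>
#include <string.h>

/* Sketch only: pack 64 piece codes (0 = empty, 1..12 = piece kinds)
   into 32 bytes, two squares per byte */
static void PackBoard(const uint8_t board[64], uint8_t packed[32])
{
    memset(packed, 0, 32);
    for (int sq = 0; sq < 64; sq++)
        packed[sq >> 1] |= (uint8_t)(board[sq] << ((sq & 1) * 4));
}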

Then again, storing a CompactBoard object with each entry can be useful for testing, as it's a good way to detect false positives.

Both Symbolic and Oscar use an eight byte counter for perft() calculations. This is sufficient for perft(13), but not perft(14) or higher; the latter calculations require post processing to add the 64 bit subtotals.
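That post processing is simple; a sketch using GCC/Clang's unsigned __int128 (the routine name is made up):

Code: Select all

#include <stdint.h>
#include <stdio.h>

/* Sketch: add 64-bit perft() subtotals into a 128-bit grand total,
   needed once the full count exceeds 2^64 (perft(14) and beyond) */
static void PrintGrandTotal(const uint64_t *subtotals, int n)
{
    unsigned __int128 total = 0;

    for (int i = 0; i < n; i++)
        total += subtotals[i];

    /* print as high/low 64-bit halves */
    printf("total = %llu * 2^64 + %llu\n",
           (unsigned long long)(total >> 64),
           (unsigned long long)(total & 0xffffffffffffffffull));
}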

--------

I've updated Oscar's publicly posted hash code file, this version generated with a 16 prime (vs the earlier 8 prime) PRNG:

https://dl.dropboxusercontent.com/u/316 ... /HashCodes
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Scary signatures

Post by sje »

How many bits should a hash signature have?

With a signature N bits long, two randomly selected signatures will match, on average, in N/2 bit positions, and will match completely with probability 2^-N. What matters in practice, though, is the birthday effect: once roughly 2^(N/2) signatures have passed through the table, a full collision somewhere among them becomes likely, so a false positive turns up about once every 2^(N/2) comparisons. Every two bits added to the signature length cut that false positive rate in half.

Let's assume that a given chess program hits the transposition table at a rate of 10 MHz. With a 32 bit signature, the program will encounter a false positive about once every 2^16 comparisons, or about 153 false matches every second. That's very bad.

With a 64 bit signature, the program will encounter a false positive about once every 2^32 comparisons, or about once every seven minutes. That's bad.

With a 128 bit signature, the program will encounter a false positive about once every 2^64 comparisons, or about once every fifty-eight thousand years. That I can live with.
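The arithmetic above is easy to reproduce (a sketch following the same one-false-positive-per-2^(N/2)-comparisons model):

Code: Select all

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double probesPerSec = 10.0e6; /* 10 MHz table hit rate */

    for (int bits = 32; bits <= 128; bits += 32) {
        double seconds = pow(2.0, bits / 2.0) / probesPerSec;
        printf("%3d bit signature: one false positive per %.3g seconds\n",
               bits, seconds);
    }
    return 0;
}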
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Dynamic storage allocation

Post by sje »

Dynamic storage allocation

In ANSI C, the malloc() and free() routines provide a portable means of dynamic storage allocation. In C++, the new and delete operators do the same and with compile time type checking.

But in the OpenCL environment, there are no dynamic allocation routines, in part because there is no standard C run time library support. In fact, an OpenCL environment may not even have a stack, let alone a storage heap.

So a problem arises when a chess program needs dynamic allocation. Without it, various upper limits have to be set at compile time: limits like the size of any transposition tables, the maximum ply depth, or the maximum game length.

But I really don't like fixed limits when a no-limit approach is available. So I'm going to install a dynamic allocator into Oscar, code taken from one I wrote as an undergraduate long, long ago. Because of various OpenCL constraints, the size of the heap used by the allocator will be fixed at compile time, but that will be the only fixed limit in the program. The trick will be to make this limit high, but not so high that a given GPU rejects the resulting kernel. It may be that more than one Oscar OpenCL binary will be needed to keep the various GPUs happy while running with the largest possible transposition table.
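As a sketch of the general idea, a simple bump allocator over a compile-time array (the allocator described above is presumably more elaborate):

Code: Select all

/* Sketch: allocation from a compile-time fixed heap; no per-block
   free(), HeapReset() reclaims everything at once */
#define HEAP_BYTES (1u << 20)   /* the one compile-time limit */

static unsigned char Heap[HEAP_BYTES];
static unsigned HeapUsed = 0;

static void *HeapAlloc(unsigned bytes)
{
    bytes = (bytes + 15u) & ~15u;        /* keep 16 byte alignment */
    if (bytes > HEAP_BYTES - HeapUsed)
        return 0;                        /* fixed heap exhausted */
    void *block = &Heap[HeapUsed];
    HeapUsed += bytes;
    return block;
}

static void HeapReset(void) { HeapUsed = 0; }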
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: The Great Battle of the Hash Codes

Post by mar »

Hmm, the above code has a nasty pipeline dependency. So I removed it, and now the generator runs 60% faster, still passing all dieharder tests with the default (0) seed:

Code: Select all

typedef unsigned long long ULong;

struct PRNG { ULong keys[2]; };

/* assumed helper: 64-bit rotate left */
static ULong Rotate( ULong v, int s ) { return ( v << s ) | ( v >> ( 64 - s ) ); }

static const ULong C0 = 0xc5462216u | ((ULong)0xcf14f4ebu<<32);
static const ULong C1 = 0x75ecfc58u | ((ULong)0x9576080cu<<32);

ULong Next( struct PRNG *p )
{
	ULong tmp = p->keys[0];
	p->keys[0] += Rotate( p->keys[1] ^ C0, 1 );
	return p->keys[1] += Rotate( tmp ^ C1, 9 );
}
It now outputs 1 billion 64-bit random numbers per second on my i7, ~3.5 cycles per output.
The period is well above 2^32, so I guess that's not bad for three lines of code (the Next() body): portable, super fast, reliable, slim,
and immune to an all-zero internal state (some PRNGs produce runs of zeros in that case).
Can any PRNG beat it? Let me know. Perhaps an LCG, but that's very low quality; I wouldn't even call it a PRNG.
As for seeding with a fixed value, simply storing the internal state as static const will do (two 64-bit constants), reducing seed time to nothing.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Scary signatures

Post by mvk »

sje wrote:How many bits should a hash signature have?

With a 64 bit signature, the program will encounter a false positive about once every 2^32 comparisons, or about once every seven minutes. That's bad.
This is only the case if you compare each new hash with every previously generated hash. But once the table is saturated, that is not what happens, because you lose old hashes, and with them the opportunity to find collisions. From that moment onwards, the rate of false hits scales only with the number of non-addressing bits minus the binary logarithm of the bucket size. This is easy to see intuitively if you imagine a table of size 1.
[Account deleted]
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: The Great Battle of the Hash Codes

Post by sje »

mar wrote:

Code: Select all

static const ULong C0 = 0xc5462216u | ((ULong)0xcf14f4ebu<<32);
static const ULong C1 = 0x75ecfc58u | ((ULong)0x9576080cu<<32);

ULong Next( struct PRNG *p )
{
	ULong tmp = p->keys[0];
	p->keys[0] += Rotate( p->keys[1] ^ C0, 1 );
	return p->keys[1] += Rotate( tmp ^ C1, 9 );
}
I think it looks pretty good. Also, fast. At one time I used a PRNG like the above, but for reasons not necessarily relevant to chess programming, I decided to use a PRNG which could generate the Nth random bit without having to generate the previous N-1 bits.

There is a cool trick for determining the period of any PRNG whose state sequence eventually returns to its seed. All that is needed is an iteration loop which steps two instances of the PRNG, the first once per iteration and the second twice per iteration. When the two outputs first match, the iteration counter gives the period.
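A sketch of the trick, reusing the Next() routine from the code above (both instances must start from the same seed; comparing outputs assumes an output match implies a state match):

Code: Select all

/* Sketch: Floyd-style period finder for a PRNG whose state sequence
   cycles back to its seed; slow steps once and fast steps twice per
   iteration, and the first match gives the period */
static ULong FindPeriod(struct PRNG *slow, struct PRNG *fast)
{
    ULong count = 0;
    ULong a, b;

    do {
        a = Next(slow);   /* one step per iteration  */
        Next(fast);       /* two steps per iteration */
        b = Next(fast);
        count++;
    } while (a != b);

    return count;
}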

By the way, most C compilers will accept unsigned long long int literals like:

Code: Select all

static const ULong C0 = 0xcf14f4ebc5462216ull;
static const ULong C1 = 0x9576080c75ecfc58ull;
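(The ull suffix has been standard since C99; the OR-of-two-halves form above is the portable fallback for older compilers without 64 bit literals.)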