OpenCL perft() Technical Issues

mar · Post by **mar** » Mon Sep 08, 2014 8:04 am

jdart wrote:I have used ULL for a long time, and had no issues with gcc or clang. Even old gcc on an ancient SPARC system handled it.

--Jon

Try to compile this

Code: Select all

int main&#40;) &#123;
    unsigned long long k = 1ull;
&#125;

with gcc or clang using -Wpedantic (without specifying --std). Of course it compiles but you get those silly warnings.

sje · Post by **sje** » Mon Sep 08, 2014 9:17 am

Transposition table instrumentation (part 2)

Oscar uses the first log2 (Table element count) bits of each signature for addressing. For the current 16,777,216 element table, this is 24 bits. For signatures which match the addressing bits but do not match all of the other bits, how many bits (starting from bit position zero), are needed in a hash code to avoid false positives?

Oscar now has a first mismatched bit tallyman, and here is what he says:

Summary:

Code: Select all

perft&#40;5&#41; =         4,865,609 needs 27 bits &#40;covers 2 tests&#41;
perft&#40;6&#41; =       119,060,324 needs 32 bits &#40;covers 556 tests&#41;
perft&#40;7&#41; =     3,195,901,860 needs 39 bits &#40;covers 93,861 tests&#41;
perft&#40;8&#41; =    84,998,978,956 needs 49 bits &#40;covers 8,762,560 tests&#41;
perft&#40;9&#41; = 2,439,530,234,167 needs 52 bits &#40;covers 362,474,852 tests&#41;

The tallyman code:

Code: Select all

static void PerftTableBitMismatching&#40;PerftTable * const perfttableptr,
                                     const Hash * const hashptr0, const Hash * const hashptr1&#41;
&#123;
  si bitpos = -1;
  
  // Determine and record the first mismatched bit of two different hash signatures.
  
  Hash hash0 = *hashptr0, hash1 = *hashptr1;
  
  hash0.qwrd0 &= ~PTTAddrMask;
  hash1.qwrd0 &= ~PTTAddrMask;
  
  if (!HashTestEqual&#40;&hash0, &hash1&#41;)
  &#123;
    if &#40;hash0.qwrd0 != hash1.qwrd0&#41;
    &#123;
      ui index = PTTAddrBitLen;

      while (&#40;bitpos < 0&#41; && &#40;index < QwrdBitsLen&#41;)
      &#123;
        if (&#40;hash0.qwrd0 & BXL&#40;index&#41;) != &#40;hash1.qwrd0 & BXL&#40;index&#41;))
          bitpos = &#40;si&#41; index;
        else
          index++;
      &#125;;
    &#125;
    else
    &#123;
      ui index = 0;

      while (&#40;bitpos < 0&#41; && &#40;index < QwrdBitsLen&#41;)
      &#123;
        if (&#40;hash0.qwrd1 & BXL&#40;index&#41;) != &#40;hash1.qwrd1 & BXL&#40;index&#41;))
          bitpos = &#40;si&#41; index + QwrdBitsLen;
        else
          index++;
      &#125;;
    &#125;;
    perfttableptr->fbmm&#91;bitpos&#93;++;
    perfttableptr->fbmmcount++;
  &#125;;
&#125;

The output for perft(9):

Code: Select all

amanda&#58;tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 9
TT&#58; entry count&#58; 16777216
TT&#58; fetch count&#58; 297017519
TT&#58; match count&#58; 135172584
TT&#58; stash count&#58; 161844935
TT&#58; trash count&#58; 145068108
TT&#58; usage count&#58; 16776827

TT&#58; bits 0 through 23 used for addressing.

Bit   As first bit mismatched
---   -----------------------
024   181137181
025   90642323
026   45196629
027   22813998
028   11486107
029   5604694
030   2785669
031   1402306
032   712016
033   345769
034   171531
035   83728
036   47902
037   22373
038   10783
039   6801
040   2637
041   1295
042   561
043   214
044   134
045   153
046   20
047   21
048   2
049   2
050   3
Total mismatch count&#58; 362474852
At least 52 bits are needed to avoid false positive matches.

Utilization&#58; 0.999977
2439530234167   1564.56 MHz

Note how the mismatch count roughly follows a 2^-n distribution, as expected.

sje · Post by **sje** » Mon Sep 08, 2014 4:42 pm

Oscar's 16 primes vs /dev/urandom

Here's the summary for using /dev/urandom hash codes, calculating perft(5) through perft(9) as done earlier for Oscar's codes:

Code: Select all

perft&#40;5&#41; =         4,865,609 needs 24 bits &#40;handled with address bits only&#41;
perft&#40;6&#41; =       119,060,324 needs 33 bits &#40;covers 510 tests&#41;
perft&#40;7&#41; =     3,195,901,860 needs 38 bits &#40;covers 88,352 tests&#41;
perft&#40;8&#41; =    84,998,978,956 needs 47 bits &#40;covers 8,685,705 tests&#41;
perft&#40;9&#41; = 2,439,530,234,167 needs 52 bits &#40;covers 361,712,273 tests&#41;

Little difference, and for the biggest calculation, there's no difference: both hash code sets need 52 bits to avoid false positives when calculating perft(9).

If one were to allow a reasonable 6.4 bits of hash per perft() ply, then a 64 bit hash would cover perft(10) and a 128 bit hash would be good through perft(20).

And that's about all I have to say for now about hashes and transposition tables. I need to move forward on the OpenCL aspects of the project. Further, all of the I/O needed or dealing with hash initialization variants and statistics reporting won't work under OpenCL and actually won't even compile because there's no stdio.h header file in OpenCL. So I'm going to rip out any transposition table code which needs preprocessor directives as these directives muck up the the source too much for my taste.

However, since Oscar's source will eventually be released, others may experiment with the program as much as they want.

sje · Post by **sje** » Mon Sep 08, 2014 7:45 pm

One last thing.

Oscar just finished running perft(10) using its own 16 prime PRNG. The run has shown that only the first 56 bits of hash are necessary and sufficient to avoid false positives. That leaves 72 bits to spare, a comfortable safety margin.

How many bits are required with hash codes from /dev/urandom, I don't know. These runs take many hours and I've tossed the instrumentation code anyway.

From the console:

Code: Select all

amanda&#58;tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 10
TT&#58; entry count&#58; 16777216
TT&#58; fetch count&#58; 5368480244
TT&#58; match count&#58; 2489371364
TT&#58; stash count&#58; 2879108880
TT&#58; trash count&#58; 2862331664
TT&#58; usage count&#58; 16777216

TT&#58; bits 0 through 23 used for addressing.

Bit   As first bit mismatched
---   -----------------------
024   3488592295
025   1741179199
026   871629938
027   436823373
028   218508997
029   108743646
030   54054616
031   27046564
032   13650433
033   6721597
034   3409262
035   1647190
036   976874
037   393663
038   197728
039   116124
040   54577
041   24133
042   11371
043   8587
044   4335
045   1375
046   899
047   576
048   115
049   43
050   23
051   173
054   8
Total mismatch count&#58; 6973797714
At least 56 bits are needed to avoid false positive matches.

Utilization&#58; 1.000000
69352859712417   2368.09 MHz

mvk · Post by **mvk** » Mon Sep 08, 2014 11:28 pm

sje wrote:One last thing.

Oscar just finished running perft(10) using its own 16 prime PRNG. The run has shown that only the first 56 bits of hash are necessary and sufficient to avoid false positives. That leaves 72 bits to spare, a comfortable safety margin.

How many bits are required with hash codes from /dev/urandom, I don't know. These runs take many hours and I've tossed the instrumentation code anyway.

From the console:
Code: Select all
amanda&#58;tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 10
TT&#58; entry count&#58; 16777216
TT&#58; fetch count&#58; 5368480244
TT&#58; match count&#58; 2489371364
TT&#58; stash count&#58; 2879108880
TT&#58; trash count&#58; 2862331664
TT&#58; usage count&#58; 16777216

TT&#58; bits 0 through 23 used for addressing.

Bit   As first bit mismatched
---   -----------------------
024   3488592295
025   1741179199
026   871629938
027   436823373
028   218508997
029   108743646
030   54054616
031   27046564
032   13650433
033   6721597
034   3409262
035   1647190
036   976874
037   393663
038   197728
039   116124
040   54577
041   24133
042   11371
043   8587
044   4335
045   1375
046   899
047   576
048   115
049   43
050   23
051   173
054   8
Total mismatch count&#58; 6973797714
At least 56 bits are needed to avoid false positive matches.

Utilization&#58; 1.000000
69352859712417   2368.09 MHz

I did a similar experiment twelve years ago on the 24 positions of the BK test.

This plot is based on 900M nodes total, of which 650M were hash misses, so a factor 10,000 (approx 2**13) less computation at the time. And the horizontal axis should be read to exclude the indexing bits (so your '24' should compare to my '0').

I had a bucket size of 8. You don't give your bucket size, so I can't compare the results, but it appears that my results were still more pessimistic, given the tree size difference. Extrapolating, I would have needed 30 + 24 - 3 + 13 = 64 bits to have less than 0.5 probability of error under your load and single entry buckets.

At least the halving of errors for each additional bit is confirmed.

jdart · Post by **jdart** » Tue Sep 09, 2014 4:06 am

This page has some interesting ones and some measurements:

http://xorshift.di.unimi.it/

syzygy · Post by **syzygy** » Tue Sep 09, 2014 9:15 am

mvk wrote:I had a bucket size of 8. You don't give your bucket size, so I can't compare the results, but it appears that my results were still more pessimistic, given the tree size difference. Extrapolating, I would have needed 30 + 24 - 3 + 13 = 64 bits to have less than 0.5 probability of error under your load and single entry buckets.

At least the halving of errors for each additional bit is confirmed.

Steven's bucket size is 1.

He uses bits 0-23 for addressing. He probes N times.
The number of times bit 24 is the first mismatch is almost exactly N/2, as expected.
The number of times bit 24 matches and bit 25 mismatches is almost exactly N/2^2, as expected.
etc.

So the numbers are as good as random. (But I don't know if these numbers all came directly from a pseudorandom generator or were actual hash keys from a perft() run?)

Absolutely no need to repeat this experiment with better random numbers, as the results will be just the same.

sje · Post by **sje** » Tue Sep 09, 2014 10:12 am

In my test runs, the transposition table entry count is 2^24 (16,776,216). However, the bottom bit of the 24 bits is wildcarded; the address index also selects the element at (index xor 1).

Oscar's Fetch() checks both elements for a signature match. The Stash() routine overwrites the element with the least perft() subtotal.

Years ago, I tried a transposition scheme with each value of index mapping to the eight elements at index, (index xor 1), ... (index xor 7). Because of the slowness caused by so many cache misses, the performance was awful.

--------

The next development step is to write and integrate a dynamic memory allocator which will work under OpenCL. Then it will be time to attempt compilation and execution in an OpenCL environment.

sje · Post by **sje** » Wed Sep 10, 2014 9:48 am

The Joy of Lists

As mentioned earlier, Oscar at first used a fixed length array to store position preservation data; position state data which is saved early in a call to the Execute() routine and later restored in the corresponding Retract() routine call. This works okay for perft() as long as the perft() depth is no greater than PlyLen, the fixed ply depth limit.

But I don't like fixed limits when an alternative can be had which will enhance the program. Where possible, the only fixed limit should be the amount of available RAM, or maybe the amount of available virtual memory.

Some fixed limits aren't too bad, though. Oscar (like Symbolic), has a fixed limit of 256 moves generated for any position. That's okay because no one has yet come up with a position which has more than 212 moves. Likewise, in my upcoming Mark II tablebases, there will be a 16 bit limit on storing score values, thus restricting the longest mate/lose distance to about 32,767; I can live with that.

Oscar, like Symbolic, stores position state data snapshots as part of a Position object. There must be space for a potentially large number of these; not just one per ply of perft() or search(), put one for each move in a game. And a game can get very long; some in my random game experiments were well over a thousand ply.

So now, Oscar, again like Symbolic, stores these snapshots on a two-way linked list made from dynamically allocated nodes, one node per snapshot.

Now you might say "Gosh, that would be a lot of overhead to call the library allocator malloc() or its equivalent for each Execute() call and free() in each Retract() call". And you would be right, although my tests show it's not as bad as some might think. It's still bad, though.

So Oscar doesn't call any library routines at all on nearly every Execute()/Retract() call. Instead, the program keeps TWO linked lists of position state snapshots: one for the snapshots for each non-retracted move, and a second list containing unused nodes waiting to be recycled.

Code: Select all

// Preserved position data

typedef struct
&#123;
  Fss  fss;   // Saved FEN scalar set record
  Apd  apd;   // Saved auxiliary position data record
  Hash msig;  // Saved main position signature
  Move move;  // Saved played move
&#125; Ppd;

// Preserved position data storage node

typedef struct PpdNode
&#123;
  struct PpdNode *prev;  // Pointer to prev Ppd storage node
  struct PpdNode *next;  // Pointer to next Ppd storage node
  Ppd             ppd;   // Stored Ppd record
&#125; PpdNode;

// Preserved position data storage list

typedef struct
&#123;
  PpdNode *head;   // Pointer to head Ppd storage node
  PpdNode *tail;   // Pointer to tail Ppd storage node
  ui       count;  // Number of elements
&#125; PpdList;

// Chess position

typedef struct
&#123;
  Fss     fss;          // FEN scalar set
  Apd     apd;          // Auxiliary position data
  Board   board;        // Chessboard
  Tracker tracker;      // Target tracker
  Hash    msig;         // Main position signature
  PpdList usedppdlist;  // List of saved Ppd records, one entry per played move
  PpdList freeppdlist;  // List of free Ppd records
&#125; Position;

Inside of Execute(), Oscar checks the freeppdlist to see if there are any free nodes. In the very rare case that the free list is empty, the node allocator is called. This will happen only once per ply at very first time that ply is seen. The freshly allocated node is then inserted into the free list.

Continuing inside of Execute(), Oscar pops the tail node of the free lest, loads it with position data to be saved, and then appends the node to the end of the used list.

Inside of Retract(), the program pops the tail node of the used list, copies out the values, and then appends the node to the tail of the free list. No costly library routine calls are needed.

There is a lot gained with this approach:

1. No fixed limit on perft() depth or search() depth.

2. No fixed limit on game length.

3. No need for state data preservation structures maintained separately form a position object; moves, signatures, FEN scalars, etc.

4. Routines which need to access position history can do so with only a pointer to a Position object: repetition checking, PGN move text generation, tracing the current variation in perft() or search(), etc.

For public review, the current version of Oscar's header file is at: https://dl.dropboxusercontent.com/u/316 ... ar/Oscar.h

sje · Post by **sje** » Wed Sep 10, 2014 10:04 am

Inside of Execute():

Code: Select all

  if &#40;positionptr->freeppdlist.count == 0&#41;
    PpdListAppendNode&#40;&positionptr->freeppdlist, PpdNodeNew&#40;));

  PpdNode * const ppdnodeptr = PpdListDetachTail&#40;&positionptr->freeppdlist&#41;;

  ppdnodeptr->ppd.fss = positionptr->fss;
  ppdnodeptr->ppd.apd = positionptr->apd;
  ppdnodeptr->ppd.msig = positionptr->msig;
  ppdnodeptr->ppd.move = move;
  PpdListAppendNode&#40;&positionptr->usedppdlist, ppdnodeptr&#41;;

Inside of Retract():

Code: Select all

  PpdNode * const ppdnodeptr = PpdListDetachTail&#40;&positionptr->usedppdlist&#41;;

  positionptr->fss = ppdnodeptr->ppd.fss;
  positionptr->apd = ppdnodeptr->ppd.apd;
  positionptr->msig = ppdnodeptr->ppd.msig;
  const Move move = ppdnodeptr->ppd.move;
  PpdListAppendNode&#40;&positionptr->freeppdlist, ppdnodeptr&#41;;

An upcoming revision will have played moves be stored on their own list. This makes for a more uniform approach as there will be other routines which operate on move lists and don't need to know about the internals of a position preservation data record.

OpenCL perft() Technical Issues

Re: The Great Battle of the Hash Codes

Transposition table instrumentation (part 2)

Oscar's 16 primes vs /dev/urandom

One last thing

Re: One last thing

Re: Scary PRNGs

Re: One last thing

Re: One last thing

The Joy of Lists

Push and Pop