OpenCL perft() Technical Issues
Moderators: hgm, Rebel, chrisw
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: Oscar the Cat
Oscar is a fairly laid back sort of cat, but he loves to ankle surf to distract me whenever I'm in the kitchen. It takes the delivery of a freshly opened can of cat food to make him desist.
Oscar is the sixth shelter cat I've adopted (all male, three have since passed), and is my fourth black (or mostly black) cat. Apparently due to superstition, shelters have a hard time adopting out black cats, particularly older males. So that's who I get! Black cats have all have an impressively evil appearance; it's unfortunate that they're so difficult to photograph well.
He likes to keep an eye on me when I work, so I'll have to build him a perch to keep him off of the keyboard.
Oscar is the sixth shelter cat I've adopted (all male, three have since passed), and is my fourth black (or mostly black) cat. Apparently due to superstition, shelters have a hard time adopting out black cats, particularly older males. So that's who I get! Black cats have all have an impressively evil appearance; it's unfortunate that they're so difficult to photograph well.
He likes to keep an eye on me when I work, so I'll have to build him a perch to keep him off of the keyboard.
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Oscar and transposition tables
Currently, Oscar does not use transposition tables. These can be added later after various OpenCL issues are resolved. There are at present many unknowns regarding efficiency and portability topics with respect to differing OpenCL hosts, and so more research is needed before deploying techniques which could work well on one GPU model but refuse to even load on another.
--------
I've added SAN and FEN encoders to Oscar. Although not necessary for OpenCL operation, they are helpful with testing. Also, these and any other routines not strictly required for OpenCL operation can be quite useful for those using Oscar's source for development of their own chess applications.
The program now includes code for checking FEN validity, although this is slightly incomplete.
--------
It looks like there will be only two perft() applications using the Oscar framework, not three as I had reported earlier. Both will be the functionally the same except that one will be compiled for OpenCL while the other won't. Both will use command line options, and each capable of both one shot operation and work unit processing.
So, at least these two invocation styles:and maybe more.
--------
I've added SAN and FEN encoders to Oscar. Although not necessary for OpenCL operation, they are helpful with testing. Also, these and any other routines not strictly required for OpenCL operation can be quite useful for those using Oscar's source for development of their own chess applications.
The program now includes code for checking FEN validity, although this is slightly incomplete.
--------
It looks like there will be only two perft() applications using the Oscar framework, not three as I had reported earlier. Both will be the functionally the same except that one will be compiled for OpenCL while the other won't. Both will use command line options, and each capable of both one shot operation and work unit processing.
So, at least these two invocation styles:
Code: Select all
operft -w <workunit-file> -o <output-file> -d <depth>
operft -p <FEN-string> -d <depth>
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: Running on a 3.4 GHz Core i5 (i5-4670).
Symbolic, dual Xeon 5150 CPUs (also from 2006):Ajedrecista wrote:I have just checked this position with 1 GB of hash in an Intel Pentium D930 (of year 2006) and JetChess 1.0.0.0 (single core, 32-bit) computed perft(7) in 475.013 seconds (0:06:55.013), that is, around 6.0459e+8 (leaf nodes)/second.
Code: Select all
[] sf r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10
[] pctran 7
Count: 287,188,994,746 Pt: 10:13.081 Wt: 2:41.902 U: 0.946681 1.87374 GHz 533.691 ps
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Generating move notation
For a SAN move encoder to work, the raw moves must have various flag bits to indicate check, checkmate, and file/rank disambiguation. Such is not strictly neede for perft(), but having real SAN output makes debugging easier.
Here are Oscar's notation move flag generation routines:
Here are Oscar's notation move flag generation routines:
Code: Select all
static void PositionNotateChecks(Position * const positionptr, MoveSeg * const movesegptr)
{
// This routine notates checking moves in the given move segment for the given position.
ui index;
for (index = 0; index < movesegptr->count; index++)
{
PositionExecute(positionptr, movesegptr->baseptr[index]);
if (positionptr->apd.incheck)
SetMf(movesegptr->baseptr[index], MfChck);
PositionRetract(positionptr);
};
}
static void PositionNotateCheckmates(Position * const positionptr, MoveSeg * const movesegptr)
{
// This routine notates checkmating moves in the given move segment for the given position.
ui index;
for (index = 0; index < movesegptr->count; index++)
{
const Move move = movesegptr->baseptr[index];
if (TestMf(move, MfChck))
{
PositionExecute(positionptr, move);
if (PositionIsCheckmate(positionptr))
SetMf(movesegptr->baseptr[index], MfMate);
PositionRetract(positionptr);
};
};
}
static void PositionNotateDisambiguation(const Position * const positionptr, MoveSeg * const movesegptr)
{
// This routine notates disambiguation of the moves in the given move segment for the given position.
const Move * const baseptr = movesegptr->baseptr;
const ui limit = movesegptr->count;
ui index0;
Sq tosq;
Census census;
ui landings[SqLen];
CensusLoadFromBoard(&census, &positionptr->board);
for (tosq = SqA1; tosq <= SqH8; tosq++)
landings[tosq] = 0;
for (index0 = 0; index0 < limit; index0++)
landings[GetToSq(baseptr[index0])]++;
for (index0 = 0; index0 < limit; index0++)
{
const Move move0 = baseptr[index0];
const Man frman0 = GetFrMan(move0);
const Sq tosq0 = GetToSq(move0);
if (IsManQRBN(frman0) && (census.mancount[frman0] > 1) && (landings[tosq0] > 1))
{
const Sq frsq0 = GetFrSq(move0);
const File frfile0 = MapSqToFile(frsq0);
const Rank frrank0 = MapSqToRank(frsq0);
ui index1;
ui msc = 0, mfc = 0, mrc = 0;
for (index1 = 0; index1 < limit; index1++)
{
const Move move1 = baseptr[index1];
const Sq frsq1 = GetFrSq(move1);
if ((frman0 == GetFrMan(move1)) && (tosq0 == GetToSq(move1)) && (frsq0 != frsq1))
{
msc++;
if (MapSqToFile(frsq1) == frfile0)
mfc++;
if (MapSqToRank(frsq1) == frrank0)
mrc++;
};
};
if (msc > 0)
{
if ((mfc == 0) || (mrc > 0))
SetMf(movesegptr->baseptr[index0], MfAndf);
if (mfc > 0)
SetMf(movesegptr->baseptr[index0], MfAndr);
};
};
};
}
static void PositionNotate(Position * const positionptr, MoveSeg * const movesegptr)
{
// This routine fully notates the moves in the given move segment for the given position.
PositionNotateChecks(positionptr, movesegptr);
PositionNotateCheckmates(positionptr, movesegptr);
PositionNotateDisambiguation(positionptr, movesegptr);
}
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Sorting moves by SAN
A move sort routine is handy for a number of reasons. For perft(), a sorted move vector makes it easier to compare output from different programs, or different versions of the same program.
For Oscar, running in OpenCL mode adds a few challenges in regard to sorting. Usually, it's best to just pick one of the sort routines already present in the C/C++ standard library. But with OpenCL, there's no guarantee that the library will have the routine, or even if there is any library at all. Further, even if a library routine is available, it may be too large or it could drag in too much other code to operate under OpenCL memory constraints.
So, I wrote a SAN sort for Oscar with no library dependencies. It's just a simple bubble sort; there's not much need for speed for something which only runs at ply zero. Further, since recursion is not allowed, the more exotic sorting algorithms would need a non recursive adaptation.
For Oscar, running in OpenCL mode adds a few challenges in regard to sorting. Usually, it's best to just pick one of the sort routines already present in the C/C++ standard library. But with OpenCL, there's no guarantee that the library will have the routine, or even if there is any library at all. Further, even if a library routine is available, it may be too large or it could drag in too much other code to operate under OpenCL memory constraints.
So, I wrote a SAN sort for Oscar with no library dependencies. It's just a simple bubble sort; there's not much need for speed for something which only runs at ply zero. Further, since recursion is not allowed, the more exotic sorting algorithms would need a non recursive adaptation.
Code: Select all
static void MoveSegSortSan(MoveSeg * const movesegptr)
{
// This routine sorts by SAN the moves in the given move storage segment.
const ui count = movesegptr->count;
if (count > 1)
{
Move * const baseptr = movesegptr->baseptr;
ui index;
bool changed = true;
const ui passcount = count - 1;
ui pass = 0;
SanRec sanvec[MoveGenLen];
for (index = 0; index < count; index++)
SanRecLoadFromMove(&sanvec[index], baseptr[index]);
while (changed && (pass < passcount))
{
const ui paircount = count - pass - 1;
ui pair;
changed = false;
for (pair = 0; pair < paircount; pair++)
{
const ui eindex0 = pair + 0, eindex1 = pair + 1;
if (CompareStrings(sanvec[eindex0].chvec, sanvec[eindex1].chvec) > 0)
{
const Move move0 = baseptr[eindex0], move1 = baseptr[eindex1];
const SanRec sanrec0 = sanvec[eindex0], sanrec1 = sanvec[eindex1];
baseptr[eindex0] = move1;
baseptr[eindex1] = move0;
sanvec[eindex0] = sanrec1;
sanvec[eindex1] = sanrec0;
changed = true;
};
};
pass++;
};
};
}
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
FEN verification
Oscar now has full support for verification of FEN strings. While not strictly necessary, a little extra checking never hurts and it can be done quickly without using too much space.
The FEN validation code includes full testing of the en passant target square. If the square is not nil, then there must be at least one en passant capture available in the position. All possible constraints are checked, even making sure that the half move counter is zero.
--------
Inside the move Execute() routine, Oscar will set the en passant target square to a non-nil value if and only if at least one en passant capture is available. This helps ensure correct hash signature generation and relieves the move generation and move counting routines from doing checking themselves. Further, this supports correct FEN string generation at any time during calculations; a help with debugging.
--------
An upcoming task is the development of the parameter structure definitions needed to communicate input and output values between the driver an an OpenCL kernel. There is a need for caution: the OpenCL header file has a number of scalar type definitions which are different for different platforms. Alignment may be an issue, and the endian sense may be an issue.
All OpenCL kernel/driver parameters must be passed by value in both directions; pointers are disallowed because the kernel and the driver have different memory spaces.
The FEN validation code includes full testing of the en passant target square. If the square is not nil, then there must be at least one en passant capture available in the position. All possible constraints are checked, even making sure that the half move counter is zero.
--------
Inside the move Execute() routine, Oscar will set the en passant target square to a non-nil value if and only if at least one en passant capture is available. This helps ensure correct hash signature generation and relieves the move generation and move counting routines from doing checking themselves. Further, this supports correct FEN string generation at any time during calculations; a help with debugging.
--------
An upcoming task is the development of the parameter structure definitions needed to communicate input and output values between the driver an an OpenCL kernel. There is a need for caution: the OpenCL header file has a number of scalar type definitions which are different for different platforms. Alignment may be an issue, and the endian sense may be an issue.
All OpenCL kernel/driver parameters must be passed by value in both directions; pointers are disallowed because the kernel and the driver have different memory spaces.
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Transposition table assistance
Oscar now has transposition table assistance along with incremental hash signature updating. Everything seems to work, including hashing of castling availability status and en passant targets.
When performing perft(), there's no need for a hash contribution based on moving color when a hash contribution based on ply is folded into the main hash prior to hitting the table.
Oscar uses a 128 bit hash although this could be changed to a 64 bit hash for those who are willing to take a larger risk of false positives. Timing shows no measurable difference between the non hash version and the hash version with 128 bit codes when fetch/store operations are deactivated. So, while there is an extra cost in space for 128 bit codes vs 64 bit codes, there's not one in time.
At present, the element count of Oscar's transposition table is determined at compile time. The reason for this is that there's no guarantee that malloc() or its pals will work well, or even exist in a truly portable OpenCL program. More research is needed on this topic.
From the header file:
Constant value key storage:
The table (currently, Oscar's only global variable):
Fetch and store:
The construction of the constant hashes is detailed in the next post.
When performing perft(), there's no need for a hash contribution based on moving color when a hash contribution based on ply is folded into the main hash prior to hitting the table.
Oscar uses a 128 bit hash although this could be changed to a 64 bit hash for those who are willing to take a larger risk of false positives. Timing shows no measurable difference between the non hash version and the hash version with 128 bit codes when fetch/store operations are deactivated. So, while there is an extra cost in space for 128 bit codes vs 64 bit codes, there's not one in time.
At present, the element count of Oscar's transposition table is determined at compile time. The reason for this is that there's no guarantee that malloc() or its pals will work well, or even exist in a truly portable OpenCL program. More research is needed on this topic.
From the header file:
Code: Select all
// Perft transposition table items
#define PTTAddrBitLen 24 // Log2(Transposition table length)
#define PTTAddrMask MX(PTTAddrBitLen) // Transposition table index mask
#define PTTLen BX(PTTAddrBitLen) // Transposition table length
// Hashes
typedef struct
{
ui64 qwrd0; // 1st quad word
ui64 qwrd1; // 2nd quad word
} Hash;
// Perft transposition table element
typedef struct
{
Hash msig; // Main hash signature; adjusted for ply
NodeCount nodecount; // Movepath count
} PerftTTRec;
Code: Select all
static Hash ManSqHash[ManRCLen][SqLen]; // Hash codes indexed by man and square
static Hash CabsHash[CabsLen]; // Hash codes indexed by cabs
static Hash EpFileHash[FileLen]; // Hash codes indexed by en passant target file
static Hash PlyHash[PlyLen]; // Hash codes indexed by ply
Code: Select all
// Transposition table storage
static PerftTTRec PTT[PTTLen]; // Perft transposition table
Code: Select all
static bool PositionPTTFetch(const Position * const positionptr, const ui ply, NodeCount * const nodecountptr)
{
bool found;
Hash addrhash = positionptr->msig;
HashFoldPly(&addrhash, ply);
const ui index0 = (ui) (addrhash.qwrd0 & PTTAddrMask);
PerftTTRec * const ptr0 = &PTT[index0];
if (HashTestEqual(&addrhash, &ptr0->msig))
{
*nodecountptr = ptr0->nodecount;
found = true;
}
else
{
const ui index1 = index0 ^ 1;
PerftTTRec * const ptr1 = &PTT[index1];
if (HashTestEqual(&addrhash, &ptr1->msig))
{
*nodecountptr = ptr1->nodecount;
found = true;
}
else
{
*nodecountptr = 0;
found = false;
};
};
return found;
}
static void PositionPTTStash(const Position * const positionptr, const ui ply, const NodeCount nodecount)
{
Hash addrhash = positionptr->msig;
HashFoldPly(&addrhash, ply);
const ui index0 = (ui) (addrhash.qwrd0 & PTTAddrMask);
const ui index1 = index0 ^ 1;
PerftTTRec * const ptr0 = &PTT[index0];
PerftTTRec * const ptr1 = &PTT[index1];
PerftTTRec * ptr;
if (ptr0->nodecount < ptr1->nodecount)
ptr = ptr0;
else
ptr = ptr1;
ptr->msig = addrhash;
ptr->nodecount = nodecount;
}
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Cooking hash
Oscar's constant value hash codes start with the first eight primes:
The product of these primes is 9,699,690 which is also the period of the following pseudorandom bit generator which uses those primes:
We want to build hash code constants from pairs of 64 bit integers, so to get a single 64 bit pseudorandom value:
Finally, to make all the constant hashes which Oscar needs:
The above uses the first 105,472 bit values of the bit generator.
Code: Select all
static const char CvOrdinalToPrime[PrimeLen] =
{
2, 3, 5, 7, 11, 13, 17, 19
};
Code: Select all
static ui CalcHashBit(const ui bitordinal)
{
ui result = 0;
ui index;
for (index = 0; index < PrimeLen; index++)
result ^= (bitordinal % CvOrdinalToPrime[index]) & 0x01;
return result;
}
Code: Select all
static ui64 CalcHashQwrd(const ui qwrdordinal)
{
ui64 result = 0;
ui index;
for (index = 0; index < QwrdBitsLen; index++)
{
result |= CalcHashBit((qwrdordinal * QwrdBitsLen) + index);
result <<= 1;
};
return result;
}
Code: Select all
static void InitHashes(void)
{
// This routine initializes the constant value for the base hash codes.
ui ordinal = 0;
Man man;
Cabs cabs;
File file;
ui ply;
for (man = ManWhitePawn; man <= ManBlackKing; man++)
{
Sq sq;
for (sq = SqA1; sq <= SqH8; sq++)
{
ManSqHash[man][sq].qwrd0 = CalcHashQwrd(ordinal++);
ManSqHash[man][sq].qwrd1 = CalcHashQwrd(ordinal++);
};
};
for (cabs = CabsX0; cabs <= CabsXf; cabs++)
{
CabsHash[cabs].qwrd0 = CalcHashQwrd(ordinal++);
CabsHash[cabs].qwrd1 = CalcHashQwrd(ordinal++);
};
for (file = FileA; file <= FileH; file++)
{
EpFileHash[file].qwrd0 = CalcHashQwrd(ordinal++);
EpFileHash[file].qwrd1 = CalcHashQwrd(ordinal++);
};
for (ply = 0; ply < PlyLen; ply++)
{
PlyHash[ply].qwrd0 = CalcHashQwrd(ordinal++);
PlyHash[ply].qwrd1 = CalcHashQwrd(ordinal++);
};
}
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Sample runs
Oscar with transposition table assistance running on a single thread on a year 2006 2.66 GHz Xeon:
Code: Select all
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 6
Na3 4856835
Nc3 5708064
Nf3 5723523
Nh3 4877234
a3 4463267
a4 5363555
b3 5310358
b4 5293555
c3 5417640
c4 5866666
d3 8073082
d4 8879566
e3 9726018
e4 9771632
f3 4404141
f4 4890429
g3 5346260
g4 5239875
h3 4463070
h4 5385554
119060324 55.03 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 7
Na3 120142144
Nc3 148527161
Nf3 147678554
Nh3 120669525
a3 106743106
a4 137077337
b3 133233975
b4 134087476
c3 144074944
c4 157756443
d3 227598692
d4 269605599
e3 306138410
e4 309478263
f3 102021008
f4 119614841
g3 135987651
g4 130293018
h3 106678423
h4 138495290
3195901860 147.68 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 8
Na3 3193522577
Nc3 3926684340
Nf3 3937354096
Nh3 3221278282
a3 2863411653
a4 3676309619
b3 3579299617
b4 3569067629
c3 3806229124
c4 4199667616
d3 6093248619
d4 7184581950
e3 8039390919
e4 8102108221
f3 2728615868
f4 3199039406
g3 3641432923
g4 3466204702
h3 2860408680
h4 3711123115
84998978956 318.24 MHz
gail:tmp sje$ ./operft "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 9
Na3 85849641909
Nc3 109418317145
Nf3 108393009416
Nh3 86659653631
a3 74950758099
a4 101265301849
b3 96577095997
b4 97442160946
c3 108697368719
c4 120549219832
d3 176976245463
d4 227220482342
e3 259522947791
e4 263561543780
f3 68094899093
f4 84792070664
g3 99646370024
g4 92281289941
h3 74778417365
h4 102853440161
2439530234167 542.32 MHz
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Then again
Then again, there's nothing like making a public code posting to trigger detection of faulty code.
The quadword pseudorandom value generator was leaving bit zero reset. The corrected version is:
Better initialization:
The above called by:
The quadword pseudorandom value generator was leaving bit zero reset. The corrected version is:
Code: Select all
static ui64 CalcHashQwrd(const ui qwrdordinal)
{
ui64 result = 0;
ui index;
for (index = 0; index < QwrdBitsLen; index++)
{
result <<= 1;
result |= CalcHashBit((qwrdordinal * QwrdBitsLen) + index);
};
return result;
}
Code: Select all
static void HashAssign(Hash * const hashptr, ui * const ordinalptr)
{
// This routine assigns a pseudorandom value to the given hash object.
hashptr->qwrd0 = CalcHashQwrd((*ordinalptr)++);
hashptr->qwrd1 = CalcHashQwrd((*ordinalptr)++);
}
Code: Select all
static void InitHashes(void)
{
// This routine initializes the constant value for the base hash codes.
ui ordinal = 0;
Man man;
Sq sq;
Cabs cabs;
File file;
ui ply;
for (man = ManWhitePawn; man <= ManBlackKing; man++)
for (sq = SqA1; sq <= SqH8; sq++)
HashAssign(&ManSqHash[man][sq], &ordinal);
for (cabs = CabsX0; cabs <= CabsXf; cabs++)
HashAssign(&CabsHash[cabs], &ordinal);
for (file = FileA; file <= FileH; file++)
HashAssign(&EpFileHash[file], &ordinal);
for (ply = 0; ply < PlyLen; ply++)
HashAssign(&PlyHash[ply], &ordinal);
}