Deep split perft()

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Deep split perft()

Post by sje »

Deep split perft()

My program Symbolic has two multithreaded perft() classes.

The first, older class is RsPath (Root split) which splits the perft(n) tree at the root and allocates the perft(n - 1) tasks to worker threads via a semaphore controlled job queue. This is a simple approach and works well except when the number of worker threads is less than the number of cores/hyperthreads available. This condition occurs towards the end of every calculation, and can occur earlier.

The second, newer class in DsPath (Deep split) which splits the perft() tree at every node where the calculation draft is above a certain threshold (currently set at four). The splitting is done regardless of the number of worker threads in use. DsPath in guaranteed to keep all the work threads busy throughout the entire calculation.

Each deep split involves creating a job queue entry (class DsPathJob) and a "smart" accumulator (class DsAcc). The path job contents include the position to search, the draft of the search, and a pointer to the accumulator to be used to collect the results. The accumulator contents include the running total of the counts, a reference counter containing the number of pending subtotals (one per move), and a pointer to the parent accumulator.

To avoid a humungous job queue, new jobs are prepended to the queue so that smaller calculations have priority; the big guys have to wait in the back.

When a perft job is processed, each subtotal generated is sent to the appropriate accumulator. The reference counter in the accumulator is decremented once for each incoming subtotal. When the counter reaches zero, the accumulator sends its total to its parent accumulator and then deletes itself. This is done recursively as needed up the accumulator tree. When the topmost accumulator reference count goes to zero, it means that the whole task is done and the master thread is woken with a Report record sent to it input queue.

I got the new DsPath class working this morning and it performs better than the old RsPath, although not by much in most situations. Then again, I haven't done any optimization work on it -- yet.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

From the header file

Post by sje »

From the header file DsPath.h

Code: Select all

// Accumulators

class DsAcc
{
public:
  DsAcc(void)  {sum = 0;}
  ~DsAcc(void) {}

private:
  friend class DsPathMaster;
  friend class DsPathWorker;

  DsAccNodePtr parentptr;  // Parent accumulator; 0:report to master
  Hash         pathhash;   // Path transposition table hash
  Move         move;       // Prior move, used for ply one trace reporting
  ui           ply;        // Ply of result, used to trigger ply one trace reporting
  ui           pending;    // Count of pending reports, set to move count at start
  NodeCount    sum;        // Running total of reported counts
};

class DsAccNode&#58; public DsAcc, public TWNode<DsAccNode> &#123;&#125;;
class DsAccList&#58; public TWList<DsAccNode> &#123;&#125;;
class DsAccPool&#58; public TWPool<DsAccNode, DsAccList> &#123;&#125;;

// Deep split job commands

typedef enum
&#123;
  DsPathCmdNil = -1,
  DsPathCmdExit,    // Exit thread
  DsPathCmdPerft,   // Calculate perft&#40;)
  DsPathCmdReport   // Report final result
&#125; DsPathCmd;

class DsPathJob
&#123;
public&#58;
  DsPathJob&#40;void&#41;;
  ~DsPathJob&#40;void&#41;;

private&#58;
  friend class DsPathMaster;
  friend class DsPathWorker;

  DsPathCmd       command;      // Job command
  Pcm             pcm;          // Path counting mode &#40;bulk, full, tran&#41;
  PositionPtr     positionptr;  // The allocated position to search
  ui              ply;          // Ply count from root
  ui              draft;        // Draft to calculate
  PathTranBasePtr baseptr;      // Transposition table
  DsAccNodePtr    resultptr;    // Receiving accumulator
&#125;;

class DsPathJobNode&#58; public DsPathJob, public TWNode<DsPathJobNode> &#123;&#125;;
class DsPathJobList&#58; public TWList<DsPathJobNode> &#123;&#125;;
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

A new log file

Post by sje »

Here's a colorful new log file with perft(n) n=7..10 using the new deep split code on the octo core i7-5680X machine.

See: https://dl.dropboxusercontent.com/u/316 ... 8p9p10.log

Perft(7) was done in 1 second.
Perft(8) was done in 12 seconds.
Perft(9) was done in 1 minute 55 seconds.
Perft(10) was done in 22 minutes 39 seconds.

Perft(11) is underway.

----

The BusyFEN throughput frequency is up to 40 MHz with the new code, a slight increase.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: A new log file

Post by Joost Buijs »

sje wrote:Here's a colorful new log file with perft(n) n=7..10 using the new deep split code on the octo core i7-5680X machine.

See: https://dl.dropboxusercontent.com/u/316 ... 8p9p10.log

Perft(7) was done in 1 second.
Perft(8) was done in 12 seconds.
Perft(9) was done in 1 minute 55 seconds.
Perft(10) was done in 22 minutes 39 seconds.

Perft(11) is underway.

----

The BusyFEN throughput frequency is up to 40 MHz with the new code, a slight increase.
Wasn't it a core I7-5960X machine???

Only 1 second for Perft(7) is very fast indeed.

Perft(7) takes here about 5 seconds on the same 8 cores.
But this is just the move logic of my engine with full up/down-date, without hashing.
I never tried to speed it up, It's mainly use is for checking my move generator and the up/down-date routine.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: A new log file

Post by sje »

Joost Buijs wrote:Wasn't it a core I7-5960X machine???
I've always had trouble with numbers.
:D
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: A new log file

Post by sje »

Joost Buijs wrote:Only 1 second for Perft(7) is very fast indeed.

Perft(7) takes here about 5 seconds on the same 8 cores.
But this is just the move logic of my engine with full up/down-date, without hashing.
I never tried to speed it up, It's mainly use is for checking my move generator and the up/down-date routine.
In the above runs, Symbolic is using a transposition table and bulk counting at terminal positions. Except for BusyFEN, which visits every node with full updating.

There was a hardware glitch in the first run where the machine booted with only six of the eight DIMMs recognized. I re-ran the job and revised the posted log file. Perft(11) was restarted and is about half way done.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: A new log file

Post by Joost Buijs »

sje wrote: There was a hardware glitch in the first run where the machine booted with only six of the eight DIMMs recognized.
I know these modern mainboards are very picky with respect to memory timings, especially when all slots are filled. Most manufacturers have a memory compatibility sheet with information about brand, type and the number of memory slots you can reliable use with it.

The timings change a little bit when the temperature of the memory chips rises, maybe the timings are too critical when the chips are still cold.
This should never happen, it certainly is something to look at.

Maybe it can help to change the settings for the memory timings in the BIOS a little bit, there are tons of options.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Perft(11) again

Post by sje »

I re-ran the perft(11) calculation with the new deep split code and got a speed-up of about 3.4%.

The colorful log file: https://dl.dropboxusercontent.com/u/316 ... es/p11.log
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Perft(11) again

Post by Joost Buijs »

How much difference in time does it make when you compare Perft(7) with and without transposition table?
It seems like a nice experiment to build a transposition table in my Perft function.
I think it is possible to pack a whole position in something like 22 bytes.
Zobrist hashing will be faster, but there is always the chance of a collision.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Perft(11) again

Post by sje »

There is no constant time ratio between running with and without transposition table assistance.

With 128 bit hash keys, memory corruption due to cosmic rays is more likely than a false positive match.