The cost of transposition table instrumentation

sje · Post by **sje** » Sun Aug 23, 2015 9:40 am

Some comments on the cost of transposition table instrumentation to enable statistics generation:

Symbolic has several transposition table classes, each of them having the same base class which contains member variables common to all derived classes.

Code:

class TranBase
{
public:
  TranBase(const TtId table, const ui log2count);
  TranBase(const TtId table, const double frac);

#if (EnableTranStats)
  void IncrMatch(void) {ctspinlock.Lock(); ctmatch++; ctspinlock.Unlock();}
  void IncrProbe(void) {ctspinlock.Lock(); ctprobe++; ctspinlock.Unlock();}
  void IncrStash(void) {ctspinlock.Lock(); ctstash++; ctspinlock.Unlock();}
  void IncrUsage(void) {ctspinlock.Lock(); ctusage++; ctspinlock.Unlock();}
#else
  void IncrMatch(void) {}
  void IncrProbe(void) {}
  void IncrStash(void) {}
  void IncrUsage(void) {}
#endif

  void LsbReporter(const Hash& hash0, const Hash& hash1);

  std::string EncodeStats(void) const;
  std::string Encode(void) const;

protected:
  static const std::string basenames[TtIdLen];
  static const size_t      itemsizes[TtIdLen];

  void SetUpFromLog2(const TtId table, const ui log2count);
  void SetUpFromFrac(const TtId table, const double frac);

  ui CalcSlice(const usize offset) const {return ((ui) (offset >> sliceshift)) & slicemask;}

  void FastReset(void);

  TtId        ttid;
  std::string name;

  ui        log2entrycount;  // Log2 of entry count
  size_t    entrysize;       // Size of a single entry
  size_t    tablesize;       // Size of the entire storage region
  usize     entrycount;      // Number of entries
  usize     addrmask;        // Address generation mask
  ui        sliceshift;      // Slice shift bit count
  usize     slicemask;       // Slice index generation mask

  ui64      ctmatch;         // Count of probe matches
  ui64      ctprobe;         // Count of probe calls
  ui64      ctstash;         // Count of stash calls
  ui64      ctusage;         // Count of used entries
  SpinLock  ctspinlock;      // Count access spinlock

  ui        lsbhwm;          // LSB match high water mark
  SpinLock  lsbspinlock;     // Spinlock for high watermark access

  void      *storeptr;       // Pointer to allocated storage region

  SpinLock SLSliceVec[SLSliceLen];  // Regional access spinlocks
};

Note the four 64 bit counters (ctmatch, ctprobe, ctstash, and ctusage) along with their spinlock ctspinlock, which is needed because of multithreaded access.

When a table is created, its constructor resets these members. They stay that way unless the compile time symbol EnableTranStats is set, in which case the counter increment routines are called when appropriate. The member method EncodeStats() presents these in text format when called.

These are handy numbers to have. But their generation does eat some time. How much time? An empirical answer follows.

The task used for the experiment is the calculation of the value of perft(9) of the initial array, which is 2,439,530,234,167. This was done by running four hyperthreads on a dual core 64 bit CPU with a transposition table containing 268,435,456 (2^28) elements with 128 bit signatures.

First, without instrumentation:

Code:

[] pctran 9
TT: PETBase: ItemCount: 268,435,456 (2^28)   ItemSize: 24 B   TableSize: 6 GiB
Count: 2,439,530,234,167   Pt: 32:20.992   Wt: 8:27.289   U: 3.8262   4.80895 GHz   207.946 ps
Total: two trillion four hundred thirty-nine billion five hundred thirty million two hundred thirty-four thousand one hundred sixty-seven

Second, with instrumentation:

Code:

[] pctran 9
TT: PETBase: ItemCount: 268,435,456 (2^28)   ItemSize: 24 B   TableSize: 6 GiB
TT: PETBase: items: 268,435,456   probe: 280,394,218   match: 169,624,952   stash: 111,793,594   usage: 99,677,863   load: 0.371329
Count: 2,439,530,234,167   Pt: 32:46.044   Wt: 8:33.844   U: 3.82615   4.74761 GHz   210.632 ps
Total: two trillion four hundred thirty-nine billion five hundred thirty million two hundred thirty-four thousand one hundred sixty-seven

The result is that the cost of instrumentation in the above is about an additional 1.29% of total processing time.
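The 1.29% figure can be reproduced from the two Pt readings in the reports. The following is only a sketch of that arithmetic: it assumes Pt is a minutes:seconds processing time, and the function names are made up for illustration and are not part of Symbolic.

```cpp
#include <cstdio>

// Convert a "minutes:seconds" reading such as 32:20.992 to seconds.
// (Assumes the Pt field in the report is minutes:seconds.)
constexpr double ToSeconds(const int minutes, const double seconds)
{
  return minutes * 60.0 + seconds;
}

// Relative overhead, in percent, of an instrumented run over a baseline run.
constexpr double OverheadPercent(const double base, const double instrumented)
{
  return (instrumented - base) / base * 100.0;
}

// Print the overhead for the two perft(9) runs reported above.
void ReportOverhead()
{
  const double plain = ToSeconds(32, 20.992);   // 1940.992 s
  const double stats = ToSeconds(32, 46.044);   // 1966.044 s
  std::printf("overhead: %.2f%%\n", OverheadPercent(plain, stats));
}
```

The same calculation applied to the Wt readings (8:27.289 and 8:33.844) gives essentially the same percentage.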

So now you know.

bob · Post by **bob** » Mon Aug 24, 2015 2:32 am

sje wrote:The result is that the cost of instrumentation in the above is about an additional 1.29% of total processing time.

I don't spinlock on statistical counters, nor on history counters, nor on countermoves, etc. The probability of an interleaved update is very low, and the consequences are nil (who cares if such a counter is off by one or not). I'd bet most of that 1.29% is in the read-for-ownership cache traffic for acquiring the spin lock. Without the lock I bet you won't see any difference whatsoever.

sje · Post by **sje** » Mon Aug 24, 2015 4:44 am

bob wrote:
sje wrote:The result is that the cost of instrumentation in the above is about an additional 1.29% of total processing time.
I don't spinlock on statistical counters, nor on history counters, nor on countermoves, etc. The probability of an interleaved update is very low, and the consequences are nil (who cares if such a counter is off by one or not). I'd bet most of that 1.29% is in the read-for-ownership cache traffic for acquiring the spin lock. Without the lock I bet you won't see any difference whatsoever.

The probability of race is low, but increases with the thread count and frequency of access. I like to see accurate counters anyway. Also, how much is a 1.29% speed cost? Maybe two or three Elo points.

On some architectures, spinlock access time for an uncontested lock is very, very small. Note that spinlocks would not be necessary in the above if only 64 bit increments were guaranteed to be atomic. But they aren't, at least on the three different 32 bit CPU kinds I use as targets.
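Where a C++11 compiler is available, one possible alternative to wrapping each increment in a spinlock is std::atomic. This is a sketch under that assumption, not Symbolic's code: StatCounters and CountFromThreads are hypothetical names, and on 32 bit targets std::atomic<std::uint64_t> may not be lock-free (is_lock_free() reports this), in which case the library may use an internal lock.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for two of the statistics counters in TranBase.
struct StatCounters
{
  std::atomic<std::uint64_t> ctprobe{0};
  std::atomic<std::uint64_t> ctmatch{0};

  // Relaxed ordering suffices: the counters do not order any other data.
  void IncrProbe() {ctprobe.fetch_add(1, std::memory_order_relaxed);}
  void IncrMatch() {ctmatch.fetch_add(1, std::memory_order_relaxed);}
};

// Increment one counter from several threads and return the final value.
std::uint64_t CountFromThreads(const unsigned threadcount, const std::uint64_t perthread)
{
  StatCounters counters;
  std::vector<std::thread> pool;

  for (unsigned t = 0; t < threadcount; t++)
    pool.emplace_back([&counters, perthread]
    {
      for (std::uint64_t i = 0; i < perthread; i++)
        counters.IncrProbe();
    });
  for (auto& worker : pool)
    worker.join();
  return counters.ctprobe.load();
}
```

An atomic increment keeps the count exact, but each update still needs exclusive ownership of the cache line holding the counter, so it does not by itself remove cache traffic between cores.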

In an earlier version of Symbolic, there were no table counters. Instead, I had the table base class treat every entry with a zero signature as being unused. At the end of a search, a table's usage would be calculated by simply scanning all its entries and returning the ratio of the used count to the total count. That worked, but was rather slow for large tables.

bob · Post by **bob** » Mon Aug 24, 2015 4:50 am

sje wrote:
bob wrote:
sje wrote:The result is that the cost of instrumentation in the above is about an additional 1.29% of total processing time.
I don't spinlock on statistical counters, nor on history counters, nor on countermoves, etc. The probability of an interleaved update is very low, and the consequences are nil (who cares if such a counter is off by one or not). I'd bet most of that 1.29% is in the read-for-ownership cache traffic for acquiring the spin lock. Without the lock I bet you won't see any difference whatsoever.
The probability of race is low, but increases with the thread count and frequency of access. I like to see accurate counters anyway. Also, how much is a 1.29% speed cost? Maybe two or three Elo points.

On some architectures, spinlock access time for an uncontested lock is very, very small. Note that spinlocks would not be necessary in the above if only 64 bit increments were guaranteed to be atomic. But they aren't, at least on the three different 32 bit CPU kinds I use as targets.

In an earlier version of Symbolic, there were no table counters. Instead, I had the table base class treat every entry with a zero signature as being unused. At the end of a search, a table's usage would be calculated by simply scanning all its entries and returning the ratio of the used count to the total count. That worked, but was rather slow for large tables.

Problem is, it is 1.29% today. What about on 32 or 64 cores? This is more exponential than linear. As far as atomicity goes, 64 bit machines provide it, but that still doesn't prevent an interleaved update; it only means the entire 8 bytes are stored in one operation. Those kinds of counters are not going to be memory accesses anyway; they are going to be handled by inter-cache forwarding. And read-for-ownership is very expensive compared to the usual load/store...

sje · Post by **sje** » Mon Aug 24, 2015 7:12 am

bob wrote:Problem is, it is 1.29% today. What about on 32 or 64 cores? This is more exponential than linear. As far as atomicity goes, 64 bit machines provide it, but that still doesn't prevent an interleaved update; it only means the entire 8 bytes are stored in one operation. Those kinds of counters are not going to be memory accesses anyway; they are going to be handled by inter-cache forwarding. And read-for-ownership is very expensive compared to the usual load/store...

The additional cost of spinlock access for the counters can be reduced to zero by having a set of 2^N counter groups, one counter group for each of the 2^N (N=8 here) table slices, which already have their own spinlocks. All else that's needed is to sum these groups when printing the statistics report.
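A minimal sketch of the counter-group idea, under stated assumptions: SlicedStats, CounterGroup, and SliceLen are hypothetical names, std::mutex stands in for Symbolic's SpinLock, and only two of the four counters are shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical names throughout; std::mutex stands in for SpinLock.
constexpr std::size_t SliceLen = 256;  // 2^N slices with N = 8

struct CounterGroup
{
  std::uint64_t ctprobe = 0;
  std::uint64_t ctmatch = 0;
};

class SlicedStats
{
public:
  // Called for a probe that maps to the given slice. The counters are
  // updated while the slice lock is held for the table access anyway,
  // so no separate counter lock is acquired.
  void RecordProbe(const std::size_t slice, const bool found)
  {
    std::lock_guard<std::mutex> guard(locks[slice]);
    // ... the table entry comparison would happen here ...
    groups[slice].ctprobe++;
    if (found)
      groups[slice].ctmatch++;
  }

  // Sum all groups for the statistics report.
  CounterGroup Total()
  {
    CounterGroup sum;
    for (std::size_t slice = 0; slice < SliceLen; slice++)
    {
      std::lock_guard<std::mutex> guard(locks[slice]);
      sum.ctprobe += groups[slice].ctprobe;
      sum.ctmatch += groups[slice].ctmatch;
    }
    return sum;
  }

private:
  std::array<std::mutex, SliceLen> locks;
  std::array<CounterGroup, SliceLen> groups;
};
```

The trade-off in this sketch is that the report has to visit all 2^N groups, which is cheap compared with scanning the whole table.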

Some routines for the perft() transposition table:

Code:

PETBase::PETBase(const double frac): TranBase(TtIdPE, frac)
{
  DIPtr->LogCtor(name);
  storage = new PETItem[entrycount];
  storeptr = (void *) storage;
  Reset();
  DIPtr->LogMsg("  " + Encode());
}

PETBase::~PETBase(void)
{
  delete [] storage;
  DIPtr->LogDtor(name);
}

void PETBase::Reset(void)
{
#if (UseFastReset)
  FastReset();
#else
  ZOL64(index, entrycount)
    storage[index].Reset();
#endif
}

bool PETBase::Probe(const Hash& pathhash, NodeCount& count)
{
  const usize offset = (pathhash.GetQwrd0() & addrmask);
  const ui slice = CalcSlice(offset);
  const PETItemPtr probeptr0 = storage + offset;
  const PETItemPtr probeptr1 = storage + (offset ^ 1);
  bool found;

  IncrProbe();
  SLSliceVec[slice].Lock();
  if (probeptr0->hash == pathhash)
  {
    count = probeptr0->count;
    found = true;
  }
  else
  {
    if (EnableLsbReporter)
      LsbReporter(probeptr0->hash, pathhash);
    if (probeptr1->hash == pathhash)
    {
      count = probeptr1->count;
      found = true;
    }
    else
    {
      if (EnableLsbReporter)
        LsbReporter(probeptr1->hash, pathhash);
      count = 0;
      found = false;
    };
  };
  SLSliceVec[slice].Unlock();
  if (found)
    IncrMatch();
  return found;
}

void PETBase::Stash(const Hash& addrhash, const NodeCount count)
{
  const usize offset = (addrhash.GetQwrd0() & addrmask);
  const ui slice = CalcSlice(offset);
  const PETItemPtr probeptr0 = storage + offset;
  const PETItemPtr probeptr1 = storage + (offset ^ 1);
  PETItemPtr entryptr;

  IncrStash();
  SLSliceVec[slice].Lock();
  if (probeptr0->count < probeptr1->count)
    entryptr = probeptr0;
  else
    entryptr = probeptr1;
  if (EnableTranStats && entryptr->hash.IsReset())
    IncrUsage();
  entryptr->hash = addrhash;
  entryptr->count = count;
  SLSliceVec[slice].Unlock();
}

AlvaroBegue · Post by **AlvaroBegue** » Mon Aug 24, 2015 1:05 pm

Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)

sje · Post by **sje** » Mon Aug 24, 2015 1:33 pm

AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)

Search for "C++ thread local storage" and you'll get an answer. And the answer depends on what version of C++ and what additional support is available (like Boost).

The underlying problem is that there is no direct linear mapping between a thread ID and an integer that's usable as an index to select among an array of objects. A mapping will always involve some set-up and an indirection for each access.
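For reference, with a C++11 compiler the set-up and indirection can be kept fairly small. The sketch below is an assumption-laden illustration rather than anything from Symbolic: ThreadIndex is a hypothetical helper that hands out a dense index on each thread's first call and reads it back from thread-local storage afterwards.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Hypothetical helper: returns 0, 1, 2, ... in order of each thread's
// first call. The set-up is one atomic fetch per thread; every later
// call is a read of a thread-local variable.
inline std::size_t ThreadIndex()
{
  static std::atomic<std::size_t> next{0};
  thread_local const std::size_t index = next.fetch_add(1);
  return index;
}
```

The returned value can then select an element in an array of per-thread objects, provided the array is sized for the maximum number of threads the program will ever start.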

AlvaroBegue · Post by **AlvaroBegue** » Mon Aug 24, 2015 2:25 pm

sje wrote:
AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)
Search for "C++ thread local storage" and you'll get an answer. And the answer depends on what version of C++ and what additional support is available (like Boost).

The underlying problem is that there is no direct linear mapping between a thread ID and an integer that's usable as an index to select among an array of objects. A mapping will always involve some set-up and an indirection for each access.

I don't see why that's the case. You can easily access the transposition table through an object that is thread specific and collect your statistics there.

Joost Buijs · Post by **Joost Buijs** » Mon Aug 24, 2015 2:44 pm

AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)

This is exactly what I'm doing in my engine: for each thread I have a struct which contains everything that belongs to that thread, including node counters etc.
When I make sure with an alignment pragma that these structs are at least one cache line apart from each other, I don't see any performance degradation at all.
When I want to show the counters to the user, I just add the separate thread counters together; this works beautifully.
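A rough sketch of that layout in C++11, not taken from any engine in this thread: ThreadStats, RunWorkers, ThreadCount, and the 64 byte cache line size are all assumptions made for illustration, and alignas replaces the compiler-specific alignment pragma.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// 64 bytes is a common cache line size; this is an assumption, not a
// portable guarantee.
constexpr std::size_t CacheLineSize = 64;
constexpr unsigned ThreadCount = 4;

// Counters owned by one search thread; names are hypothetical.
struct alignas(CacheLineSize) ThreadStats
{
  std::uint64_t ctprobe = 0;
  std::uint64_t ctmatch = 0;
};

// Run ThreadCount workers, each bumping only its own counters, then sum.
std::uint64_t RunWorkers(const std::uint64_t perthread)
{
  ThreadStats stats[ThreadCount];   // automatic storage honors alignas
  std::vector<std::thread> pool;

  for (unsigned t = 0; t < ThreadCount; t++)
    pool.emplace_back([&stats, t, perthread]
    {
      for (std::uint64_t i = 0; i < perthread; i++)
        stats[t].ctprobe++;         // no lock: only thread t writes stats[t]
    });
  for (auto& worker : pool)
    worker.join();

  std::uint64_t total = 0;
  for (const ThreadStats& s : stats)
    total += s.ctprobe;             // safe to read after join()
  return total;
}
```

Because alignas pads each ThreadStats to a full line, no two threads write to the same cache line, which is the condition for avoiding the slowdown described above.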

mar · Post by **mar** » Mon Aug 24, 2015 5:21 pm

AlvaroBegue wrote:I don't see why that's the case. You can easily access the transposition table through an object that is thread specific and collect your statistics there.

Yes absolutely. TLS is a scarce resource and gets allocated for EACH thread within the process.
What you described is a common way to do it (per thread stats and then collect them once work is done - you don't need TLS for this).
EDIT: quoting MS docs:
"The constant TLS_MINIMUM_AVAILABLE defines the minimum number of TLS indexes available in each process. This minimum is guaranteed to be at least 64 for all systems. The maximum number of indexes per process is 1,088."
