The cost of transposition table instrumentation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The cost of transposition table instrumentation

Post by AlvaroBegue »

bob wrote:
AlvaroBegue wrote:
bob wrote:[...] As opposed to a pointer, or a bit map, or a move (which has several related pieces of information like from, to, piece, etc) or whatever, where interleaved update might produce a more serious problem than just losing one counter update.
If you are allowing writes to a variable from several threads without proper guards, you are invoking undefined behavior. You are probably right that the only likely consequence is missing the occasional update, but this could change in the future. I am afraid a few years down the line you are going to be screaming about how some compiler "broke" your code.
Sorry, but the only problem is a lost update. If the value is a counter, losing an addition or a subtraction is not going to break anything, and is often acceptable over the cost of a lock acquisition which is extremely expensive. No compiler or hardware is going to "break my code" in this instance. Interleaved update is only caused when two different threads read, update and write the same location. Since the orders of those operations can be interleaved. The only effect is that ONE of the updates is lost. For a statistical counter that has no effect on program behavior, missing an update is inconsequential. And is more than acceptable when compared to the cost of the lock. Since most of these are once-per-node-searched at best (and usually less often) the probability is minimal when spread over N counters.

If you were building a move, that might be different if you somehow forced everything to go to memory, and you first add the from, then the to, etc. But not with a counter. And no, I won't be screaming, unless apple decides to intentionally break my code as they did with strcpy(), something that got them raked over the coals by tens of thousands of complaints. If they do what I write, the results will be usable forever. If they take it upon themselves to make the program abort on a perfectly safe operation, yes I will complaint.
Here's an example of a program that increments a counter from multiple threads without any synchronization, and it looks like it can already be a problem:

Code: Select all

#include <iostream>
#include <thread>
#include <atomic>

int counter&#40;0&#41;;
//If you use std&#58;&#58;atomic<int> it works fine

bool is_prime&#40;int n&#41; &#123;
  if &#40;n < 2&#41;
    return false;
  if &#40;n % 2 == 0&#41;
    return n == 2;
  for &#40;int k = 3; k * k <= n; k += 2&#41; &#123;
    if &#40;n % k == 0&#41;
      return false;
  &#125;
  return true;
&#125;

void call_from_thread&#40;) &#123;
  for &#40;int i = 0; i < 1000000; ++i&#41;
    if &#40;is_prime&#40;i&#41;)
      ++counter;
&#125;

int main&#40;) &#123;
  std&#58;&#58;thread t&#91;10&#93;;
  for &#40;int i = 0; i < 10; ++i&#41;
    t&#91;i&#93; = std&#58;&#58;thread&#40;call_from_thread&#41;;
  for &#40;int i = 0; i < 10; ++i&#41;
    t&#91;i&#93;.join&#40;);
  std&#58;&#58;cout << counter << '\n';
&#125;
Using g++-4.8.1 on my computer, I get 78498 as the output, instead of the expected 784980.

EDIT: The exact compilation command is this:
g++ my_prog.cpp -O -std=c++11 -lpthread

What do you think is the key difference between the way you are using counters and this program that makes it impossible for a compiler to mess up your program but not this one?
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: The cost of transposition table instrumentation

Post by Joost Buijs »

What Bob means is that for a node-counter or something similar it does not matter if it is a few counts off from what it should be because it won't break his engine.

On the other hand it is very simple to have per-thread counters and add them together when you need them (which is most likely at the root).
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: The cost of transposition table instrumentation

Post by mvk »

Ok, I don't do that in my history table, where I increase by d*d, and decrease bad moves or on overflow. That is not counting. At least, I wouldn't know what quantity it represents.

Well, some call a general purpose register an 'accumulator'. Probably the same thing.
[Account deleted]
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The cost of transposition table instrumentation

Post by AlvaroBegue »

Joost Buijs wrote:What Bob means is that for a node-counter or something similar it does not matter if it is a few counts off from what it should be because it won't break his engine.

On the other hand it is very simple to have per-thread counters and add them together when you need them (which is most likely at the root).
I understand what Bob means. The problem is that there's no guarantee that it will only be a few counts off from what it should be. The compiler could do things like putting the variable in a register and writing it to memory at the very end, which is what g++ is doing in my example.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: The cost of transposition table instrumentation

Post by Joost Buijs »

mvk wrote:Ok, I don't do that in my history table, where I increase by d*d, and decrease bad moves or on overflow. That is not counting. At least, I wouldn't know what quantity it represents.

Well, some call a general purpose register an 'accumulator'. Probably the same thing.
The history table has also a certain size, the chance that you are poking from different threads in the same cache-line is a lot smaller than with a single counter at a fixed address.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: The cost of transposition table instrumentation

Post by Joost Buijs »

AlvaroBegue wrote:
Joost Buijs wrote:What Bob means is that for a node-counter or something similar it does not matter if it is a few counts off from what it should be because it won't break his engine.

On the other hand it is very simple to have per-thread counters and add them together when you need them (which is most likely at the root).
I understand what Bob means. The problem is that there's no guarantee that it will only be a few counts off from what it should be. The compiler could do things like putting the variable in a register and writing it to memory at the very end, which is what g++ is doing in my example.
The are a two problems, first it is possible that the compiler optimizes access to a variable from a different thread away because it does not know that it may be altered or used by the other thread.
And secondly you can have a problem with cache coherence.

You can circumvent this by using volatile variables and a lock or memory fence/barrier, unfortunately every solution will cost you some performance.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

AlvaroBegue wrote:
bob wrote:
AlvaroBegue wrote:
bob wrote:[...] As opposed to a pointer, or a bit map, or a move (which has several related pieces of information like from, to, piece, etc) or whatever, where interleaved update might produce a more serious problem than just losing one counter update.
If you are allowing writes to a variable from several threads without proper guards, you are invoking undefined behavior. You are probably right that the only likely consequence is missing the occasional update, but this could change in the future. I am afraid a few years down the line you are going to be screaming about how some compiler "broke" your code.
Sorry, but the only problem is a lost update. If the value is a counter, losing an addition or a subtraction is not going to break anything, and is often acceptable over the cost of a lock acquisition which is extremely expensive. No compiler or hardware is going to "break my code" in this instance. Interleaved update is only caused when two different threads read, update and write the same location. Since the orders of those operations can be interleaved. The only effect is that ONE of the updates is lost. For a statistical counter that has no effect on program behavior, missing an update is inconsequential. And is more than acceptable when compared to the cost of the lock. Since most of these are once-per-node-searched at best (and usually less often) the probability is minimal when spread over N counters.

If you were building a move, that might be different if you somehow forced everything to go to memory, and you first add the from, then the to, etc. But not with a counter. And no, I won't be screaming, unless apple decides to intentionally break my code as they did with strcpy(), something that got them raked over the coals by tens of thousands of complaints. If they do what I write, the results will be usable forever. If they take it upon themselves to make the program abort on a perfectly safe operation, yes I will complaint.
Here's an example of a program that increments a counter from multiple threads without any synchronization, and it looks like it can already be a problem:

Code: Select all

#include <iostream>
#include <thread>
#include <atomic>

int counter&#40;0&#41;;
//If you use std&#58;&#58;atomic<int> it works fine

bool is_prime&#40;int n&#41; &#123;
  if &#40;n < 2&#41;
    return false;
  if &#40;n % 2 == 0&#41;
    return n == 2;
  for &#40;int k = 3; k * k <= n; k += 2&#41; &#123;
    if &#40;n % k == 0&#41;
      return false;
  &#125;
  return true;
&#125;

void call_from_thread&#40;) &#123;
  for &#40;int i = 0; i < 1000000; ++i&#41;
    if &#40;is_prime&#40;i&#41;)
      ++counter;
&#125;

int main&#40;) &#123;
  std&#58;&#58;thread t&#91;10&#93;;
  for &#40;int i = 0; i < 10; ++i&#41;
    t&#91;i&#93; = std&#58;&#58;thread&#40;call_from_thread&#41;;
  for &#40;int i = 0; i < 10; ++i&#41;
    t&#91;i&#93;.join&#40;);
  std&#58;&#58;cout << counter << '\n';
&#125;
Using g++-4.8.1 on my computer, I get 78498 as the output, instead of the expected 784980.

EDIT: The exact compilation command is this:
g++ my_prog.cpp -O -std=c++11 -lpthread

What do you think is the key difference between the way you are using counters and this program that makes it impossible for a compiler to mess up your program but not this one?
Multiple issues here. (1) your interleaved update window is a significant part of the computation. In chess, where you do something once per node max, the window for interleaving the updates is very small; (2) we don't need to count some things exactly. For example does it matter if a particular move failed high 1000 times or 1001 times? Not really. (3) you MUST declare such a variable as volatile, otherwise the optimizer will totally wreck the computation by keeping the value in a register where it is not shared at all.

Yes, the counter will "lose" ticks. But the more complex the computation between ticks, the less it will lose. The alternative is a high-cost lock acquisition. sometimes "close" is much better than "correct" when speed is factored in.
Last edited by bob on Wed Aug 26, 2015 6:00 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

AlvaroBegue wrote:
Joost Buijs wrote:What Bob means is that for a node-counter or something similar it does not matter if it is a few counts off from what it should be because it won't break his engine.

On the other hand it is very simple to have per-thread counters and add them together when you need them (which is most likely at the root).
I understand what Bob means. The problem is that there's no guarantee that it will only be a few counts off from what it should be. The compiler could do things like putting the variable in a register and writing it to memory at the very end, which is what g++ is doing in my example.
Counters MUST be declared volatile. Then the register problem vanishes.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The cost of transposition table instrumentation

Post by AlvaroBegue »

bob wrote:
AlvaroBegue wrote:
Joost Buijs wrote:What Bob means is that for a node-counter or something similar it does not matter if it is a few counts off from what it should be because it won't break his engine.

On the other hand it is very simple to have per-thread counters and add them together when you need them (which is most likely at the root).
I understand what Bob means. The problem is that there's no guarantee that it will only be a few counts off from what it should be. The compiler could do things like putting the variable in a register and writing it to memory at the very end, which is what g++ is doing in my example.
Counters MUST be declared volatile. Then the register problem vanishes.
Interesting website: http://cxx.isvolatileusefulwiththreads.com/

:)

But I actually see your point in this discussion. Don't get used to it. :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

AlvaroBegue wrote:
bob wrote:
AlvaroBegue wrote:
Joost Buijs wrote:What Bob means is that for a node-counter or something similar it does not matter if it is a few counts off from what it should be because it won't break his engine.

On the other hand it is very simple to have per-thread counters and add them together when you need them (which is most likely at the root).
I understand what Bob means. The problem is that there's no guarantee that it will only be a few counts off from what it should be. The compiler could do things like putting the variable in a register and writing it to memory at the very end, which is what g++ is doing in my example.
Counters MUST be declared volatile. Then the register problem vanishes.
Interesting website: http://cxx.isvolatileusefulwiththreads.com/

:)

But I actually see your point in this discussion. Don't get used to it. :)
My reply would be "nothing is useful with threads if you don't know what you are doing..." :)

Works flawlessly in C. Don't care about C++ at all...