The cost of transposition table instrumentation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)
That's what I do (and what you have to do) for commonly counted things like nodes searched, time spent waiting, extensions done, reductions done) unless you choose to ignore the minimal race-condition issues that can make the counters slightly off.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

Joost Buijs wrote:
AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)
This is exactly what I'm doing in my engine, for each thread I have a struct which contains everything that belongs to that thread, including node counters etc.
When I make sure with an alignment pragma that these structs are at least separated one cache-line apart from each other, I don't see any performance degradation at all.
When I want to show the counters to the user I just add the separate thread counters together, this works beautifully.
Shared counters already have a major issue with scaling. When a thread modifies the counter, the local cache has to broadcast a "invalidate" to all other caches since their values are now wrong. When they need the counter, they have to wait for another cache to forward the block with that counter to them. Too much of this and you smoke everything. And that ignores the spin lock which has to do exactly the same thing, almost certainly in a different cache block.

Sharing trivial bits of data doesn't scale well. Sometimes the cost is worth it (history counters or counter-moves) but most of the time, no.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: The cost of transposition table instrumentation

Post by Joost Buijs »

bob wrote:
Joost Buijs wrote:
AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)
This is exactly what I'm doing in my engine, for each thread I have a struct which contains everything that belongs to that thread, including node counters etc.
When I make sure with an alignment pragma that these structs are at least separated one cache-line apart from each other, I don't see any performance degradation at all.
When I want to show the counters to the user I just add the separate thread counters together, this works beautifully.
Shared counters already have a major issue with scaling. When a thread modifies the counter, the local cache has to broadcast a "invalidate" to all other caches since their values are now wrong. When they need the counter, they have to wait for another cache to forward the block with that counter to them. Too much of this and you smoke everything. And that ignores the spin lock which has to do exactly the same thing, almost certainly in a different cache block.

Sharing trivial bits of data doesn't scale well. Sometimes the cost is worth it (history counters or counter-moves) but most of the time, no.
Well, the only thing I share between threads is the transposition table and the history table, nothing else.
I don't use shared counters at all.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

Joost Buijs wrote:
bob wrote:
Joost Buijs wrote:
AlvaroBegue wrote:Not an expert in multi-threaded code, but... Can you store per-thread counters and add them just before you show them to the user? (Make sure the counters are separated enough that they live in separate cache lines, or what I am suggesting might not help at all.)
This is exactly what I'm doing in my engine, for each thread I have a struct which contains everything that belongs to that thread, including node counters etc.
When I make sure with an alignment pragma that these structs are at least separated one cache-line apart from each other, I don't see any performance degradation at all.
When I want to show the counters to the user I just add the separate thread counters together, this works beautifully.
Shared counters already have a major issue with scaling. When a thread modifies the counter, the local cache has to broadcast a "invalidate" to all other caches since their values are now wrong. When they need the counter, they have to wait for another cache to forward the block with that counter to them. Too much of this and you smoke everything. And that ignores the spin lock which has to do exactly the same thing, almost certainly in a different cache block.

Sharing trivial bits of data doesn't scale well. Sometimes the cost is worth it (history counters or counter-moves) but most of the time, no.
Well, the only thing I share between threads is the transposition table and the history table, nothing else.
I don't use shared counters at all.

Isn't your history table just an array of counters???
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: The cost of transposition table instrumentation

Post by Joost Buijs »

bob wrote:
Isn't your history table just an array of counters???
I would not call them counters.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

Joost Buijs wrote:
bob wrote:
Isn't your history table just an array of counters???
I would not call them counters.
They certainly behave like one. Whether you do nodes_searched++ or history_counter[index]+=n*n you are doing exactly the same thing, and the occasional interleaved update is not exactly going to wreck things.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: The cost of transposition table instrumentation

Post by mvk »

bob wrote:
Joost Buijs wrote:
bob wrote:
Isn't your history table just an array of counters???
I would not call them counters.
They certainly behave like one. Whether you do nodes_searched++ or history_counter[index]+=n*n you are doing exactly the same thing, and the occasional interleaved update is not exactly going to wreck things.
What you call a 'counter' here is commonly known as a 'variable' in computer science. There are programs that adjust the value of the history table elements in either direction, for example.
[Account deleted]
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

mvk wrote:
bob wrote:
Joost Buijs wrote:
bob wrote:
Isn't your history table just an array of counters???
I would not call them counters.
They certainly behave like one. Whether you do nodes_searched++ or history_counter[index]+=n*n you are doing exactly the same thing, and the occasional interleaved update is not exactly going to wreck things.
What you call a 'counter' here is commonly known as a 'variable' in computer science. There are programs that adjust the value of the history table elements in either direction, for example.
Sure. And counters go in either direction. IE "saturating counters" in branch prediction. I am using a saturating counter for my history data for the moment. I used the term "counter" for a variable that has something related to counting. And certainly history counters are just that. As opposed to a pointer, or a bit map, or a move (which has several related pieces of information like from, to, piece, etc) or whatever, where interleaved update might produce a more serious problem than just losing one counter update.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The cost of transposition table instrumentation

Post by AlvaroBegue »

bob wrote:[...] As opposed to a pointer, or a bit map, or a move (which has several related pieces of information like from, to, piece, etc) or whatever, where interleaved update might produce a more serious problem than just losing one counter update.
If you are allowing writes to a variable from several threads without proper guards, you are invoking undefined behavior. You are probably right that the only likely consequence is missing the occasional update, but this could change in the future. I am afraid a few years down the line you are going to be screaming about how some compiler "broke" your code.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The cost of transposition table instrumentation

Post by bob »

AlvaroBegue wrote:
bob wrote:[...] As opposed to a pointer, or a bit map, or a move (which has several related pieces of information like from, to, piece, etc) or whatever, where interleaved update might produce a more serious problem than just losing one counter update.
If you are allowing writes to a variable from several threads without proper guards, you are invoking undefined behavior. You are probably right that the only likely consequence is missing the occasional update, but this could change in the future. I am afraid a few years down the line you are going to be screaming about how some compiler "broke" your code.
Sorry, but the only problem is a lost update. If the value is a counter, losing an addition or a subtraction is not going to break anything, and is often acceptable over the cost of a lock acquisition which is extremely expensive. No compiler or hardware is going to "break my code" in this instance. Interleaved update is only caused when two different threads read, update and write the same location. Since the orders of those operations can be interleaved. The only effect is that ONE of the updates is lost. For a statistical counter that has no effect on program behavior, missing an update is inconsequential. And is more than acceptable when compared to the cost of the lock. Since most of these are once-per-node-searched at best (and usually less often) the probability is minimal when spread over N counters.

If you were building a move, that might be different if you somehow forced everything to go to memory, and you first add the from, then the to, etc. But not with a counter. And no, I won't be screaming, unless apple decides to intentionally break my code as they did with strcpy(), something that got them raked over the coals by tens of thousands of complaints. If they do what I write, the results will be usable forever. If they take it upon themselves to make the program abort on a perfectly safe operation, yes I will complaint.