multithreading questions

Discussion of chess software programming and technical issues.


Martin Fierz

Re: multithreading questions

Post by Martin Fierz »

Hi Tord!

thanks for the answer. I was under the impression that Crafty was using DTS, because Bob wrote his paper on that. Well, I guess I will just have to start looking at the source code of Viper and Crafty :-(
I find it much easier to read sensible explanations, like for instance your short description of late move reductions - and just wanted to say thank you for that!

cheers
Martin
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: multithreading questions

Post by Zach Wegner »

hgm wrote:Which sections of an SMP chess engine are critical? Are there any in which the threads really spend a significant amount of time, so that waiting for access can become a serious bottleneck?
I haven't experienced any serious bottlenecks, but then again I only have 2 processors at my disposal. Some places with high contention are the locks for the global SMP data and for individual split points. The first lock is used each time a split point is created or destroyed; that isn't so bad, because it doesn't happen very often. Individual split points are worse: their lock is taken every time a processor attaches to or detaches from the split point, and every time a processor grabs a move. The saving grace is that each action's cost in CPU cycles is roughly inversely proportional to how often it occurs.
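As a concrete illustration, here is a minimal sketch of that arrangement (the names are hypothetical, not taken from any particular engine, and lock_acquire/lock_release stand in for whatever spinlock the engine uses):

/* One lock per split point guards the shared move list. It is taken
   once per attach, once per detach, and once per move grabbed, which
   is why it is the hottest lock in the scheme described above. */
typedef struct {
    volatile int lock;         /* 0 = free, 1 = held */
    volatile int active_cpus;  /* processors attached to this split point */
    int next_move;             /* index of the next move to hand out */
    int move_count;
    int moves[256];
} split_point_t;

extern void lock_acquire(volatile int *l);   /* the engine's spinlock */
extern void lock_release(volatile int *l);

/* Each attached processor calls this to grab its next move. */
int grab_move(split_point_t *sp)
{
    int move = -1;
    lock_acquire(&sp->lock);
    if (sp->next_move < sp->move_count)
        move = sp->moves[sp->next_move++];
    lock_release(&sp->lock);
    return move;    /* -1 means no moves left at this split point */
}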
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Martin Fierz wrote:1) my checkers program uses 64-bit hash records. if i understand bob's paper right, then on a 64-bit system, a hashtable read or write would be "atomic", i.e. on such systems i wouldn't have to care about hashing and locks at all?
Yes, I think this is true, as long as you never modify only part of the global hash record at a time. So when you want to write, build a local hash record and modify that; then write the whole local record to the global record at once. Likewise, when you read the global hash record, read it all at once into a local record before you use it.
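In code, that whole-record discipline is just the following (a sketch that assumes an aligned 64-bit entry on a 64-bit target, where the compiler emits a single load or store; as noted later in the thread, no standard actually guarantees this):

#include <stdint.h>

extern uint64_t hash_table[];   /* the shared, global table */

/* Write: build the record locally, then publish it with one
   aligned 64-bit store. */
void hash_store(unsigned index, uint64_t local_record)
{
    hash_table[index] = local_record;
}

/* Read: take one snapshot of the whole record into a local copy,
   then decode only the copy; never re-read the global entry. */
uint64_t hash_probe(unsigned index)
{
    uint64_t local_record = hash_table[index];  /* single 64-bit load */
    return local_record;
}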
jswaff

Re: multithreading questions

Post by jswaff »

Martin Fierz wrote:aloha!

1) my checkers program uses 64-bit hash records. if i understand bob's paper right, then on a 64-bit system, a hashtable read or write would be "atomic", i.e. on such systems i wouldn't have to care about hashing and locks at all?
Lockless hashing (as described by Hyatt) is just a clever trick to make sure the "value" corresponds to the "key". In his implementation a HashEntry is composed of a 64-bit key and a 64-bit value, so there is a risk that some other thread will change, say, the "value" right in the middle of your read (after you've read the key but before you've read the value).

If you are using just a single 64-bit entry, and it's read in a single atomic operation, then no - you wouldn't need to deal with that issue. (Side question: are you storing any additional bits of your key in that 64-bit word?)
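For the two-word case, the XOR trick from Hyatt's paper can be sketched as follows (field and function names are mine, not his):

#include <stdint.h>

typedef struct {
    uint64_t key;    /* actually stores key ^ data */
    uint64_t data;
} hash_entry_t;

/* Store: fold the data into the key so that a torn update of either
   word is detectable later. */
void store_entry(hash_entry_t *e, uint64_t key, uint64_t data)
{
    e->key  = key ^ data;
    e->data = data;
}

/* Probe: if another thread changed one word but not the other, the
   XOR check fails and the entry is treated as a miss. */
int probe_entry(hash_entry_t *e, uint64_t key, uint64_t *data)
{
    uint64_t k = e->key;
    uint64_t d = e->data;
    if ((k ^ d) != key)
        return 0;    /* torn write or different position: a miss */
    *data = d;
    return 1;
}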
Martin Fierz wrote: 2) there is another paper on bob's site on DTS, while tord seems to be using YBW in viper. is there any reason for choosing one or the other in your experience? in particular, which is easiest to implement?
YBWC is much, much easier to implement. I was surprised to learn that not even Bob is really doing DTS. YBW isn't quite as flexible, and I believe it's a different "philosophy" altogether. In the YBW implementations I have seen (Crafty and Viper), once a node has searched the "eldest brother", i.e. the first move, it attempts a split. The split succeeds if some processor is idle and perhaps some other criteria are met. But basically, the processors just sit and wait until they are given work to do. Once they are given work, they help their "master" search the remaining moves of the node it is currently working on.
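The overall shape of such a YBW search, as I understand it, is roughly this (a compilable sketch only; every helper type and function is a hypothetical stand-in, and the alpha-beta bookkeeping is simplified):

#define MIN_SPLIT_DEPTH 4

typedef struct { int count; int move[64]; } MoveList;
typedef struct Position Position;

extern MoveList generate_moves(Position *pos);
extern int search_move(Position *pos, int move, int alpha, int beta, int depth);
extern int idle_processor_exists(void);
extern int split(Position *pos, MoveList *ml, int next, int alpha, int beta, int depth);

int search(Position *pos, int alpha, int beta, int depth)
{
    MoveList ml = generate_moves(pos);

    /* YBW: the first move (the "eldest brother") is always searched serially. */
    int best = search_move(pos, ml.move[0], alpha, beta, depth);
    if (best >= beta)
        return best;
    if (best > alpha)
        alpha = best;

    /* Only after the first move may we split; idle processors attach
       and help with the remaining moves of this node. */
    if (idle_processor_exists() && depth >= MIN_SPLIT_DEPTH)
        return split(pos, &ml, 1, alpha, beta, depth);

    /* No split: search the remaining moves ourselves. */
    for (int i = 1; i < ml.count; i++) {
        int score = search_move(pos, ml.move[i], alpha, beta, depth);
        if (score >= beta)
            return score;
        if (score > best) {
            best = score;
            if (score > alpha)
                alpha = score;
        }
    }
    return best;
}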

Conversely, in DTS the processors actively look for work when they become idle. All of the busy processors will report some state information, and the idle processor has to decide where to jump in and help. The really interesting thing is that it might decide to help some processor that is currently at ply 10 by splitting at ply 5. So, there is some real "decision making" in DTS.

Most chess programs use a recursive search, and with YBW that is fine, as you only split "deeper". With DTS, on the other hand, you really need an iterative search, since you can split way up the tree.

I think in the future you will see more chess programs using true DTS. At least, that's my gut feeling. I think YBW is sufficient for a few processors, but little inefficiencies add up in exponential trees. When 8- or 16-processor desktops become commonplace, I think the DTS algorithm will become more common as well.

I am still working on implementing an iterative search. My goal is to implement DTS, then to employ a neural net to do the decision making. But I have a long way to go. I know of at least two others who have done DTS or are working on it: the Loop Team and Pradu (Buzz).

Martin Fierz wrote: 3) in checkers, we have huge endgame databases, which are very important for the playing strength (very different than in chess). the endgame databases are too big to fit in the computer's RAM, and therefore we load and unload blocks of the database into the database cache during the search. i am using 1K-sized blocks.

obviously, i need to keep track of this cache somehow: every block has a unique number assigned to it, and i have an array which tells me which slot of the database cache is holding which block. whenever a block is loaded, i need to update that.

i also have a doubly-linked list to implement a LRU (least recently used) caching scheme, that is, when a block has to be loaded, i can look up which one of my N blocks was not used for the longest time, then i throw it out. to keep track of this list, i also need to update it every time a block which is in memory is used. so i have:

dblookup with block in memory:
- find out in which block the current position is
- find out where the block is in memory
- look up the current position in the block
- move this block to the front of the LRU-list

dblookup with block on disk
- find out in which block the current position is
- find out that the block is not in memory
- find out which block has not been used for the longest time
- replace that block with the one i want to use
- update the index array which tells me which block is where
- move the block to the front of the LRU-list

compared to hashing, this seems like a much bigger problem to make thread-safe.... my experience with the locks (EnterCriticalSection, LeaveCriticalSection) has been very bad, i.e. they slow down the search a lot. is there anything else that i can do about this database lookup? in particular, how are you handling your database lookups in chess?

cheers
martin
Alessandro Scotti

Re: multithreading questions

Post by Alessandro Scotti »

jswaff wrote:I know of at least two others that have done or are working on DTS: the Loop Team and Pradu (Buzz).
IIRC Zappa's multiprocessing is based on DTS.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: multithreading questions

Post by Zach Wegner »

jswaff wrote:I know of at least two others that have done or are working on DTS: the Loop Team and Pradu (Buzz).
I am getting closer and closer to completing my implementation. All I have now is YBW with iterative search, but it shouldn't take long.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multithreading questions

Post by bob »

Pradu wrote:
Gerd Isenberg wrote:I had a look in the K10 optimization manual and 32-bit [LOCK] xchg is direct-path, 15 cycles - of course relatively slow, but it ensures no other thread/process may modify the content between the read and write cycles of the instruction, due to the implicit lock. I didn't inspect the assembly of the lock intrinsics yet.

While "usual" applications may better sleep while ressources are blocked, we like our threads to be immediatly ready to search - and to keep the associated processor with 100% even if idle.
Yeah, I'm not really sure what that "pause" instruction does ... whether it makes the CPU not use up 100% or not ... but from what I've read it's highly recommended to use it in a spin-wait loop:
Intel wrote:The PAUSE instruction is first introduced for the Pentium 4 processor. Technically, it is a hint to the hardware that says that the code being executed is a spin-wait loop. What the hardware does with this hint is specific to a particular processor. On the Pentium 4 processor, the instruction behaves as described above, resulting in a net performance improvement. However, for all known existing IA-32 processors the instruction is interpreted as a NOP (no operation) and does not affect the program in any way. It has been verified that PAUSE is a NOP for all known Intel® architectures prior to the Pentium 4 processor. It is even known to behave as a NOP on the non-Intel x86 family processors that were available at the time of testing.

For this reason, it is highly recommended that you insert the PAUSE instruction into all spin-wait code immediately. Using the PAUSE instruction does not affect the correctness of programs on existing platforms, and it improves performance on Pentium 4 processor platforms.
My guess is that "pause" is possibly related only to hyper-threading and, like you said, possibly doesn't concern us.
Pause is a hyper-threading facility only, and is no longer needed. It would prevent one of the two logical processors sharing a physical core from spinning on the lock while the other logical processor was temporarily idle (the two ran in a time-sliced manner). Using pause would instantly switch to the other active thread (the other logical CPU) and avoid wasting the spinning cycles on the current thread...

Not needed on real dual-core boxes.
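For reference, the spin-wait loop the Intel text is talking about has roughly this shape (a sketch using GCC builtins and inline assembly, not code from any particular engine):

/* Try to grab the lock with an atomic exchange; while it is held,
   spin on plain reads and issue PAUSE as the spin-wait hint. */
static void spin_lock(volatile int *l)
{
    while (__sync_lock_test_and_set(l, 1)) {    /* returns the old value */
        while (*l)
            __asm__ __volatile__("pause");      /* a NOP on older CPUs */
    }
}

static void spin_unlock(volatile int *l)
{
    __sync_lock_release(l);    /* store 0 with release semantics */
}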
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multithreading questions

Post by bob »

Steve Maughan wrote:Aleks,
Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.
I don't know that much about multi-core programming, so I'm sure you're right. Nevertheless, could you expand upon this? How is it not safe? What would happen if you implemented a raw integer decrement?

Steve
To do a raw (unlocked) decrement, the CPU has to fetch the word from memory, subtract one, then write it back. If two CPUs do this at the same time, you have what is known as a "write after write" hazard, which produces undefined behavior: both processors fetch the value, both decrement it, and both write the same value back. It was decremented twice, but it is only one smaller. If you shift the timing slightly, so that the first loads, decrements, and writes, and then the second does the same, you get the expected result.
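To make the difference concrete, here is a minimal sketch (using a GCC builtin; the shared counter is hypothetical):

volatile int counter;   /* shared: visible to more than one processor */

/* Racy: compiles to a separate load, subtract, and store, so two CPUs
   can interleave and one decrement gets lost. */
void unsafe_decrement(void)
{
    counter = counter - 1;
}

/* Safe: one atomic read-modify-write; on x86 this emits a
   LOCK-prefixed instruction. */
void safe_decrement(void)
{
    __sync_fetch_and_sub(&counter, 1);
}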

Any time a shared value (a value visible to more than one processor) is updated (and can potentially be updated by more than one thread simultaneously) some form of atomic locking is required. The classic "shadow lock" used in Crafty (borrowed from the Linux kernel source) is the best known way to do this when raw efficiency is important.

Synchronizing multiple updates (also known as potential interleaved updates) is a tricky thing to manage, and the failure to do it properly leads to more parallel-programming bugs than all other issues put together. The second most common problem is not understanding the difference between:

volatile int x;              /* the int itself may change asynchronously */
volatile int *x;             /* pointer to a volatile int: the pointee may change */
int * volatile x;            /* volatile pointer to an int: the pointer itself may change */
volatile int * volatile x;   /* both the pointer and the pointee may change */

Until those are completely understood to the point of becoming trivial, bugs will abound.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multithreading questions

Post by bob »

Pradu wrote:
Martin Fierz wrote:1) my checkers program uses 64-bit hash records. if i understand bob's paper right, then on a 64-bit system, a hashtable read or write would be "atomic", i.e. on such systems i wouldn't have to care about hashing and locks at all?
Yes, I think this is true, as long as you never modify only part of the global hash record at a time. So when you want to write, build a local hash record and modify that; then write the whole local record to the global record at once. Likewise, when you read the global hash record, read it all at once into a local record before you use it.
Problem is, I don't think this is "guaranteed". Nothing says that when processor A writes word X and processor B writes word X+1, the two can't be combined in the memory system. Today memory is read/written in 8-byte chunks, starting at addresses evenly divisible by 8, but nothing demands that this be the case in future generations.

I got burned by this way too many times in the past - on Crays, on Univacs, on Alphas, on Intels, and so forth. Far better to protect yourself carefully than to debug something that is _very_ difficult to detect.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multithreading questions

Post by bob »

Martin Fierz wrote:aloha!

thanks for all your answers. i looked up bob's lockless hashing paper, and also the joint paper with anthony cozzie where the effect of hash errors on the search is described. from this i gather that lockless hashing is cool, but not really necessary, as the hashtable can be used without locking. however, it would appear to me that the lockless hashing costs nearly nothing and would therefore be the method of choice.

of course, i have some more questions :-)

1) my checkers program uses 64-bit hash records. if i understand bob's paper right, then on a 64-bit system, a hashtable read or write would be "atomic", i.e. on such systems i wouldn't have to care about hashing and locks at all?

2) there is another paper on bob's site on DTS, while tord seems to be using YBW in viper. is there any reason for choosing one or the other in your experience? in particular, which is easiest to implement?

3) in checkers, we have huge endgame databases, which are very important for the playing strength (very different than in chess). the endgame databases are too big to fit in the computer's RAM, and therefore we load and unload blocks of the database into the database cache during the search. i am using 1K-sized blocks.

obviously, i need to keep track of this cache somehow: every block has a unique number assigned to it, and i have an array which tells me which slot of the database cache is holding which block. whenever a block is loaded, i need to update that.

i also have a doubly-linked list to implement a LRU (least recently used) caching scheme, that is, when a block has to be loaded, i can look up which one of my N blocks was not used for the longest time, then i throw it out. to keep track of this list, i also need to update it every time a block which is in memory is used. so i have:

dblookup with block in memory:
- find out in which block the current position is
- find out where the block is in memory
- look up the current position in the block
- move this block to the front of the LRU-list

dblookup with block on disk
- find out in which block the current position is
- find out that the block is not in memory
- find out which block has not been used for the longest time
- replace that block with the one i want to use
- update the index array which tells me which block is where
- move the block to the front of the LRU-list

compared to hashing, this seems like a much bigger problem to make thread-safe.... my experience with the locks (EnterCriticalSection, LeaveCriticalSection) has been very bad, i.e. they slow down the search a lot. is there anything else that i can do about this database lookup? in particular, how are you handling your database lookups in chess?

cheers
martin
When doing I/O in threads, the best approach is generally to have yet another thread that does nothing but I/O. This eliminates the problems you describe, and the more hits you get in your LRU chain, the less contention you see at the DB read thread. Each individual thread can probe the LRU buffers itself, if you are careful to handle the obvious case of getting a hit but having the block overwritten while you are still using it; when an LRU buffer miss happens, let the single I/O thread do the read and bring the data into the buffers.

Of course, moving something to the front of the LRU chain is a critical section, since you can't allow more than one thread to update a linked list at any instant in time.
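A sketch of that arrangement with POSIX threads (all names are hypothetical, the index lookup and LRU list operations are left abstract, and the hit-but-overwritten case mentioned above still needs handling on top of this):

#include <pthread.h>

typedef struct block {
    int id;                        /* database block held in this slot, -1 = none */
    struct block *prev, *next;     /* doubly linked LRU list */
    unsigned char data[1024];
} block_t;

extern block_t *find_block(int id);      /* index-array lookup */
extern block_t *lru_tail(void);          /* least recently used block */
extern void lru_to_front(block_t *b);    /* unlink and relink at the head */
extern void read_block_from_disk(block_t *b, int id);

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  io_request = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  io_done    = PTHREAD_COND_INITIALIZER;
static int wanted = -1;                  /* block the I/O thread should load next */

/* Search threads call this. Only the index probe and the LRU update
   happen under the lock; all disk reads go through the I/O thread. */
block_t *db_lookup(int id)
{
    block_t *b;

    pthread_mutex_lock(&cache_lock);
    while ((b = find_block(id)) == NULL) {
        wanted = id;                     /* hand the miss to the I/O thread */
        pthread_cond_signal(&io_request);
        pthread_cond_wait(&io_done, &cache_lock);
    }
    lru_to_front(b);                     /* the critical LRU update */
    pthread_mutex_unlock(&cache_lock);
    return b;
}

/* The single I/O thread: evict the LRU victim, read the block with the
   lock dropped so searchers can keep probing, then wake the waiters. */
void *io_thread(void *arg)
{
    pthread_mutex_lock(&cache_lock);
    for (;;) {
        while (wanted < 0)
            pthread_cond_wait(&io_request, &cache_lock);
        int req = wanted;
        block_t *victim = lru_tail();
        victim->id = -1;                 /* take the victim out of the index */
        pthread_mutex_unlock(&cache_lock);
        read_block_from_disk(victim, req);
        pthread_mutex_lock(&cache_lock);
        victim->id = req;                /* publish the freshly loaded block */
        if (wanted == req)
            wanted = -1;
        pthread_cond_broadcast(&io_done);
    }
    return arg;                          /* never reached */
}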