Parallel processing ?
Moderator: Ras
-
- Posts: 313
- Joined: Tue Aug 03, 2021 2:41 pm
- Full name: Bill Beame
Parallel processing ?
Back to chess. I have an engine which is debugged and running fine, although slow compared to most on this board, about 650,000 positions/sec. I now have access to an Intel i7 with 2 cores. I'm new to parallel programming in C#, so I am getting ideas on how to proceed. The question I have concerns synchronization. Do I need a separate engine for each core? Otherwise, the search seems like it would be full of collisions. It would be nice if those who successfully implemented parallel search engines could shed some light on this. Keep in mind I have no experience with parallel processing but have a basic understanding of the topic from examples and reading the documentation.
-
- Posts: 253
- Joined: Mon Aug 26, 2019 4:34 pm
- Location: Clearwater, Florida USA
- Full name: JoAnn Peeler
Re: Parallel processing ?
One place I've found interesting is Tom Kerrigan's website.
And then of course there is the always trusty Chess Programming Wiki.
My approach is to spin off multiple search threads from my Engine. Also, since to reap the benefits you need to share a transposition table, you need to work out an approach to sharing it so that competing threads do not corrupt the data. I use a lockless method, but you could always use an array of spinlocks to control access to table items. Lockless will be faster, but it may not fit with your table scheme. It seems like the easiest method to implement (but maybe not perfect) is the Lazy SMP approach.
Good luck.
-
- Posts: 313
- Joined: Tue Aug 03, 2021 2:41 pm
- Full name: Bill Beame
Re: Parallel processing ?
JoAnnP38 wrote: ↑Tue Mar 07, 2023 2:10 am One place I've found interesting is Tom Kerrigan's website.
Thanks, exactly what I was looking for. This seems like a lot of very complex programming that's way over my head right now. Unfortunately, this chess engine jones keeps drawing me back. Best of luck on your engine.
-
- Posts: 5695
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Parallel processing ?
Chessnut1071 wrote: ↑Mon Mar 06, 2023 3:02 am I'm new to parallel programming in C#, so I am getting ideas on how to proceed. The question I have concerns synchronization. Do I need a separate engine for each core?
It turns out that the best way to implement parallel search for chess is also the simplest. Let each thread do its own search while sharing the transposition table. The threads will quickly get out of sync (or you could force that by starting them at different depths) and search their own part of the tree, communicating their intermediate results to other threads via the transposition table. No need to do any locking as long as you check moves from the hash table for legality. Some TT entries may get corrupted by multiple threads writing to them at the same time, but it won't affect search performance (locking would greatly affect performance).
Kerrigan's page is outdated. Nobody uses YBWC or DTS anymore. These techniques are unsuitable for today's super-narrow search tree algorithms and high core counts.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Parallel processing ?
This!
Also think about interlocked compare-exchange; std::atomic in general is quite performant, too. No locks needed.
But yeah, multithreaded = talk via the shared TT.
Works the same way in MCTS, where the tree itself is the shared state.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 253
- Joined: Mon Aug 26, 2019 4:34 pm
- Location: Clearwater, Florida USA
- Full name: JoAnn Peeler
Re: Parallel processing ?
dangi12012 wrote: ↑Thu Mar 09, 2023 12:45 am Also think about interlocked compare exchange or generally std::atomic is also quite performant. No locks needed.
If I were going to use a spinlock, I would use std::atomic to implement it. Given a large enough array of spinlocks indexed by the lower bits of the hash, I suspect the contention would be small and you would be guaranteed uncorrupted cache entries. I don't use this method; instead I store the hash XOR'd with the data and then reverse the process when probing, as a check to see whether the data has been corrupted by competing writes. Using a corrupted cache entry would definitely have negative consequences.
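The XOR trick described above, in sketch form (field names invented; a real entry would pack move/score/depth into the data word):

```cpp
#include <cstdint>

// Two-word TT entry: the key field holds (hash ^ data), so a torn write by
// a competing thread makes the consistency check below fail and the probe
// is simply treated as a miss.
struct Entry {
    uint64_t keyXorData;
    uint64_t data;
};

void store(Entry& e, uint64_t hash, uint64_t data) {
    e.data = data;
    e.keyXorData = hash ^ data;  // the two writes need not be atomic together
}

// Returns true and fills `data` only if the entry is consistent for this hash.
bool probe(const Entry& e, uint64_t hash, uint64_t& data) {
    if ((e.keyXorData ^ e.data) != hash) return false;  // corrupted or wrong key
    data = e.data;
    return true;
}
```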
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Parallel processing ?
Locking is slow even if there is no contention. Just testing the lock is already slow. Much better to use lockless hashing, which uses the signature as checksum, and would treat corrupted entries as misses. Or ignore the problem completely.
Or you could design your hash bucket such that the entries are stored mostly atomically. E.g. if you make an entry that is a 3-byte signature, 2-byte move + flags, 2-byte score and 1-byte depth, that totals 8 bytes, and it can be stored atomically if you properly align the entries in memory. If a 24-bit signature is smaller than you would like, you could store key extensions in the first word of the hash bucket. E.g. the first 64-bit word could hold seven 1-byte key extensions, followed by seven 8-byte entries. This gives a speedup during probing, as in 97% of the cases you would already know you have a hash miss by just examining the key extensions in the first word of the bucket. So you would not have to wait for all the words being read from memory.
If you put the always-replace entries of the bucket in the 2nd and 3rd word, you would be able to process a hit on those before even half the bucket has been read from memory. (And nearly all hits are on the always-replace entry!)
Splitting the signature over two non-atomically written words achieves the same as lockless hashing through XOR-ing the words: When the two words are the remnants from colliding store operations, which tried to store something different in the same entry, the resulting signature matches neither, and the probability that it matches anything else is the same as for a normal key collision.
If you want slightly larger entries (and thus fewer per bucket) you can put a larger part of the signature in the key-extension word(s). E.g. with 6 entries per bucket you could have two key-extension words, each containing three extensions of 21 bits. So that the 64-bit words containing the remainder of the entry only need 11 bits to get the usual 32-bit signature. (And the remaining bits in the 2 bytes that hold this could be used for flags and aging.) It might be best to put 2 buckets, each containing 3 entries, in a 64-byte cache line in this case: then you would always retrieve first the word containing the key extension for the three entries where the sought data could be stored, and the next word would contain the remainder of the always-replace entry in that bucket.
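The 8-byte entry and 64-byte bucket described above might be laid out like this (field names invented; the static_asserts verify the packing assumptions the scheme relies on):

```cpp
#include <cstdint>
#include <cstring>

// One 8-byte TT entry: 2-byte move + flags, 2-byte score, 3-byte signature,
// 1-byte depth. Field order avoids padding so the whole entry fits one
// 64-bit word, which is stored atomically when 8-byte aligned.
struct Entry {
    uint16_t move;    // move + flags
    int16_t  score;
    uint8_t  sig[3];  // 24-bit signature
    uint8_t  depth;
};
static_assert(sizeof(Entry) == 8, "entry must pack into one 64-bit word");

// A 64-byte bucket matching a cache line: one word of seven 1-byte key
// extensions (plus a spare byte, e.g. for aging), then 7 entries.
struct Bucket {
    uint8_t keyExt[8];
    Entry   entry[7];
};
static_assert(sizeof(Bucket) == 64, "bucket must fill one cache line");

// Copy the whole entry into a single word so it can be written in one store.
uint64_t packEntry(const Entry& e) {
    uint64_t w;
    std::memcpy(&w, &e, sizeof(w));
    return w;
}
```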
-
- Posts: 253
- Joined: Mon Aug 26, 2019 4:34 pm
- Location: Clearwater, Florida USA
- Full name: JoAnn Peeler
Re: Parallel processing ?
You are obviously a very smart man (no patronizing intended), but I just wanted to ask: have you really looked into spinlocks? Spinlocks don't require a context switch, yielding to the OS, saving context, or putting the thread to sleep to be reawakened when the lock is acquired. Instead, the lock is "polled" inside the thread and acquired immediately upon availability. I would think this is best used when the locks are super short-lived, for instance reading/writing from/to a cache entry, so they may only cause a 50-100ns delay (super wild guess) or so. While this may still be slower than using other lockless methods, its advantage is that you are guaranteed that the data you are reading is coherent. Perhaps your proposal still guarantees coherency (my lockless method does not), but I'm still wrapping my head around it. While I'm doing my homework, here is the spinlock implementation I'm thinking about using. It is described in much more detail here.
#include <atomic>
#include <immintrin.h>  // _mm_pause

struct spin_lock
{
private:
    std::atomic<bool> _lock = { false };
public:
    void lock() noexcept
    {
        for (;;)
        {
            // Optimistically assume the lock is free on the first try
            if (!_lock.exchange(true, std::memory_order_acquire))
            {
                return;
            }
            // Wait for the lock to be released without generating cache misses
            while (_lock.load(std::memory_order_relaxed))
            {
                // Issue X86 PAUSE or ARM YIELD instruction to reduce
                // contention between hyper-threads
                _mm_pause();
            }
        }
    }

    bool try_lock() noexcept
    {
        // First do a relaxed load to check if the lock is free in order to
        // prevent unnecessary cache misses if someone does while(!try_lock())
        return !_lock.load(std::memory_order_relaxed) &&
               !_lock.exchange(true, std::memory_order_acquire);
    }

    void unlock() noexcept
    {
        _lock.store(false, std::memory_order_release);
    }
};
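The "array of spinlocks indexed by the lower bits of the hash" idea mentioned earlier could then look like this. The sketch uses std::atomic<bool> directly so it is self-contained; NUM_LOCKS, Stripe and withLock are invented names.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t NUM_LOCKS = 1024;  // power of two; far fewer locks than TT entries

// One small spinlock per "stripe" of the table.
struct Stripe { std::atomic<bool> busy{false}; };
Stripe locks[NUM_LOCKS];

// Map a Zobrist key to a lock using its lower bits.
size_t lockIndex(uint64_t hash) { return hash & (NUM_LOCKS - 1); }

// Run accessEntry while holding the stripe lock; two positions contend only
// if their keys share the same lower bits.
template <typename F>
void withLock(uint64_t hash, F&& accessEntry) {
    auto& l = locks[lockIndex(hash)].busy;
    while (l.exchange(true, std::memory_order_acquire)) { /* spin */ }
    accessEntry();
    l.store(false, std::memory_order_release);
}
```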
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Parallel processing ?
Well, 100ns is an eternity, these days. You could execute about 400 normal instructions in that time. The point is that the exchange instruction used to implement atomic test-and-modify is necessarily slow. It is executed in hardware, but the hardware has to communicate with other cores or CPUs, to ensure cache coherency.
The method I proposed does not guarantee anything. The point is that in the standard implementation of a TT through (Zobrist) hashing you cannot guarantee anything, even with single-threaded execution. You will have key collisions, and when these occur your hash probe will essentially return garbage. In a favorable case this results only in assigning a wrong score to the node, but when you get a hash move that is not valid for the current position (e.g. moving an empty square, promoting a Knight, capturing a King, failing to resolve a check...), it might even crash the search.
Using an entry that was corrupted by simultaneous writing isn't any more harmful than using the wrong entry from a key collision. You just get (completely or partially) wrong data. So it is just a matter of the frequency with which they occur. If the corruption due to multi-threading is negligible compared to that from key collisions, there is no significant benefit in eliminating it (and so the slightest cost tips the balance against doing it).
Now the frequency of key collisions can be arbitrarily reduced, by using longer keys. But this is not free either. Longer keys give larger entries, so fewer entries in a TT of given size. This reduces hash hits and slows down the search. At some point this does more damage than the collisions you eliminate by it. And Stockfish has shown that this occurs pretty quickly: I believe that it only uses 16 bits for the signature. So one in 64K TT probes will return garbage.
I wonder whether the fraction of corrupted entries would be much larger than that. Have you measured how many probes return a corrupted entry if you ignore the problem completely? The lockless hashing tricks would reduce the damage corrupted entries would do, by detecting those with a high probability. Splitting the signature as 1 and 3 bytes over the two words that contain data on the entry, reduces it 256-fold, and splitting it as 2 plus 2 bytes even 64K-fold.
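The split-signature check can be sketched like this: a 32-bit signature stored as two 16-bit halves in two separately written words, so remnants of two colliding stores fail the probe like an ordinary miss. The layout and names are invented for illustration.

```cpp
#include <cstdint>

// Entry spread over two 64-bit words written non-atomically. Each word
// carries 16 bits of the 32-bit signature; a torn write mixing two
// competing stores leaves mismatched halves, so the probe rejects it.
struct SplitEntry {
    uint64_t word0;  // low 16 bits = signature half 0, rest = data
    uint64_t word1;  // low 16 bits = signature half 1, rest = data
};

void store(SplitEntry& e, uint32_t sig, uint64_t d0, uint64_t d1) {
    e.word0 = (d0 << 16) | (sig & 0xFFFF);
    e.word1 = (d1 << 16) | (sig >> 16);
}

bool probe(const SplitEntry& e, uint32_t sig) {
    uint32_t stored = uint32_t(e.word0 & 0xFFFF) |
                      (uint32_t(e.word1 & 0xFFFF) << 16);
    return stored == sig;  // halves from two different stores match neither
}
```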