"SuperLazy SMP" by using move list chunks: would this work?

mvanthoor · Post by **mvanthoor** » Wed Sep 23, 2020 3:06 pm

Hi

Most people here know how the perft function works: you create a move list, iterate over it, make a move, and then call perft on the new board with depth - 1. As soon as depth == 0, you "return 1" to increase the leaf count. Because threading is so easy in Rust, I decided to try and add it to perft without looking how other engines did it.

I divided the initial move list into chunks, and then attached a thread to each chunk.

When I have (let's say) four threads, I first call a function create_chunks(). It generates a move list, and then creates 4 chunks, one for each thread, and distributes the moves from that list over the chunks. Thus I end up with 4 chunks, each having a partial move lists. Instead of just calling perft (which would need a complete move list to work), I call perft_on_chunk(), which launches perft for each chunk (in its own thread, attaching the handle to the chunk). After each of the chunk threads ends, I can add the resulting leaf counts from all threads together, and perft is completed.

It works: on four real cores, the speedup is about 3.5x.

The code is here:
https://github.com/mvanthoor/rustic/blo ... c/perft.rs

With regard to LazySMP, I've been reading that the different threads must differentiate from one another so they do not search the same moves. At some point they won't, because of the hash table and timing differences (and thus speed differences) between the threads, but it seems this can take some time.

Would it work to do exactly the same thing as I did in perft? Create chunks, and then call "search_on_chunk()" before actually calling the "real" search function? Then each thread would start out at a completely different point, and they will each find their own best_move. After the search time is up, the controlling thread can pick the best best_move. Obviously adding a TT will be necessary to avoid searching/evaluating a position multiple times.

(The other way around... if this sort of SMP works, then adding a TT to the currently working multi-threaded perft would also work.)

I don't immediately see a reason why this wouldn't work, besides the fact that I never implemented a multi-threaded alpha/beta search, and that this way would just search more useless stuff faster (i.e.: it could "speed up" the search with regard to node counts, but the searched moves are useless and the result is the same as it would have been with one thread).

Anybody has any thoughts about this?

(At this point I'm writing a normal a/b search. Even though it already has a controller thread, it'll only have one worker for now, actually searching single-threaded until all this stuff works and actually plays chess.)

mvanthoor · Post by **mvanthoor** » Wed Sep 23, 2020 5:29 pm

Addition:

Hm... one problem would obviously be that, if you have more threads than there are initial moves, you'd end up with chunks that have no moves. For example, in my perft implementation, it would be useless to launch it with more than 20 threads in the starting position, because there are only 20 possible moves. If you launch 32 threads, 12 of them won't have any moves, return immediately, and the program will finish with 20 threads.

It would be possible to keep a running thread count (which is known by all the threads), and if there are less than the amount requested, and if a thread that hasn't yet finished detects this, it could actually chunk its own move list and launch threads of its own. It would probably work, but i assume the overhead is going to be huge.

mar · Post by **mar** » Wed Sep 23, 2020 6:22 pm

this is splitting at root, it may work well for perft but how is it supposed to work well for alphabeta/pvs. where do you get the bounds?

also your "move list chunks" splitting - why not simply use a parallel for? with step 1, you will avoid starvation if the amount of time spent on each move differs + ideally doing the most expensive task first (assuming step 1) is typically a good idea

mar · Post by **mar** » Wed Sep 23, 2020 6:28 pm

also, for perft, you may expand one ply deeper, i.e. generating all moves for all root moves, so assuming you have 30 moves at root and each expands to 30 moves, this would give you 900 moves to crunch on, plenty even if you have 64 cores.

Ronald · Post by **Ronald** » Wed Sep 23, 2020 6:35 pm

When I started with rofChade, I also did experiments with perft and was trying to get optimal speed for too long...

Some SPM tips: You could start a thread at the second (or even further) ply, this gives you 20 * 20 startpositions, and also gives less overhead with unbalanced threads (if you run the starpos with 20 threads, some threads are finished earlier but the whole perft is finished after the last one finishes. With 400 positions the difference between threads become less).

You can also differ the way you loop between the moves per thread, for instance even threads first move to last, odd threads last move to first, this creates a much higher hitratio in the hashtable. If remember correctly , with 2 threads the speedup was already 3.5.

You can also start experimenting with different replacement schemes in the hashtable.

Maybe more important tip: If I remember correctly, you are already working a long time on perft to get optimal speed, like I did. I needed a push to get to developing a real chess engine (by visiting a CSVN tournament). When your engine is finally playing real chess games that's when the fun really starts!

You will also start to notice more and more that speed of your movegen is only a minor part of developing a strong chess engine.

mvanthoor · Post by **mvanthoor** » Wed Sep 23, 2020 7:24 pm

@mar: Thanks for reminding me of the parallel for. Rust has the "rayon" crate which can do that; but AFAIK, it tries to use all physical cores. I'll see what rayon can do. It has been some time since Ive looked into it. Executing the threading from ply two is a good idea, because it gives you a lot more moves. (And as Ronal says: it'll probably negate most of the differences between threads.)

@Ronald: I'm no trying to get the best speeds anymore; adding threading actually did cost a bit of speed. I'm now just using perft to experiment with some Rust-specific stuff I wanted to play with (such as threading), and indeed, finding the optimal strategy on how to procrastinate writing the search as long as possible

Writing this chess engine has become somewhat more than just the engine; I'm now also using it as a testbed to experiment with Rust, some chess engine concepts, and some software engineering concepts as well.

fabianVDW · Post by **fabianVDW** » Wed Sep 23, 2020 8:07 pm

mvanthoor wrote: ↑Wed Sep 23, 2020 7:24 pm @mar: Thanks for reminding me of the parallel for. Rust has the "rayon" crate which can do that; but AFAIK, it tries to use all physical cores. I'll see what rayon can do. It has been some time since Ive looked into it. Executing the threading from ply two is a good idea, because it gives you a lot more moves. (And as Ronal says: it'll probably negate most of the differences between threads.)

No need for rayon if you want to keep it simple: You can do one Thread Safe queue containing all of the moves(or say two moves if you do it for the first two depths) and the associated depth the threads have to search, and each thread will pop of the move and the depth do it's work, and pop off again. Do it for the first two depths(or until you reach enough moves, say like 400) and it will be a pretty good parallelization. Of course this does not scale super well, but it should work for most scenarios and is simple.

mvanthoor · Post by **mvanthoor** » Wed Sep 23, 2020 8:52 pm

fabianVDW wrote: ↑Wed Sep 23, 2020 8:07 pm No need for rayon if you want to keep it simple: You can do one Thread Safe queue containing all of the moves(or say two moves if you do it for the first two depths) and the associated depth the threads have to search, and each thread will pop of the move and the depth do it's work, and pop off again. Do it for the first two depths(or until you reach enough moves, say like 400) and it will be a pretty good parallelization. Of course this does not scale super well, but it should work for most scenarios and is simple.

I don't think I understand this completely.

Obviously I can generate the moves at depth 1 (getting 20 moves) and put them into a queue; then I can pop off a move and start a thread for it. Pop off another move, start another thread. As soon as a thread ends, it pops off the next free move, until all the moves in the list are done. It's a scheme where threads "eat through" the list instead of dividing up the list into 2 or more pieces and assigning a thread to each piece.

But how do I do this at depth 2?

- First generate moves at depth 1 and store them in a queue with d=1, giving 20 moves...
- Then generate all moves at depth 2 and store them in the same queue with d=2... giving 400 moves and the original 20.

Then I have 420 moves sitting in a queue... and what after that? In this case, you don't have the position that belongs to the move.

Or, do you mean that instead of generating a move list and iterating over it, calling perft on each move, just add the entire list at the end of the queue, then start the threads that go through the queue, and have each thread extend the queue until it has a certain maximum length and then stop extending it?

I'd have to implement something like this to actually understand exactly how it's supposed to work.

@ronald:

When your engine is finally playing real chess games that's when the fun really starts!

You're right; I probably needed a kick in the nuts. It's about time I wrote the search and finish the UCI-module for this engine, so the first version can be released.

The move generator and make/unmake haven't changed for l long time now; since about May, when I stopped development. Since I picked up development of the engine again (after the cataract surgery) I did some experiments:

- Add a communication module that can be swapped out (uci, xboard, console/text UI), so I don't have to interweave the communication protocol with the rest of the engine or the search. It taught me how trait objects work in Rust.
- Add multi-threading to perft (even if it's not perfect); it taught me about how things like threads, mutexes, channels, etc... work in Rust.
- Added a search controller with worker threads in preparation of this thing having multi-threaded search (using what I learned during the experiment on perft.)

Everything works as intended, and I learned what I wanted to know. So, I put the multi-threaded perft and into its own branch for later reference (just like I did with the hash table experiment a lot earlier), and reduced the search module to only a single-threaded search for now. Time to actually write a first simple search...

Ronald · Post by **Ronald** » Wed Sep 23, 2020 9:29 pm

But how do I do this at depth 2?

- First generate moves at depth 1 and store them in a queue with d=1, giving 20 moves...
- Then generate all moves at depth 2 and store them in the same queue with d=2... giving 400 moves and the original 20.

You don't store the moves in a queue but the resulting positions after you made the moves. So what you sort of doing is a perft2 where you use the leaf positions as the queue for you threads. From the start position you end up with 400 leaf positions from where you do a perft depth - 2.

You're right; I probably needed a kick in the nuts. It's about time I wrote the search and finish the UCI-module for this engine, so the first version can be released

Great! A very special moment was when I finally got rofChade to work on my DGT pi with DGT board and got beaten very bad by my own engine. I hope you will get to enjoy that moment soon

fabianVDW · Post by **fabianVDW** » Wed Sep 23, 2020 9:35 pm

Ronald wrote: ↑Wed Sep 23, 2020 9:29 pm
But how do I do this at depth 2?

- First generate moves at depth 1 and store them in a queue with d=1, giving 20 moves...
- Then generate all moves at depth 2 and store them in the same queue with d=2... giving 400 moves and the original 20.
You don't store the moves in a queue but the resulting positions after you made the moves. So what you sort of doing is a perft2 where you use the leaf positions as the queue for you threads. From the start position you end up with 400 leaf positions from where you do a perft depth - 2.

Correct, some psuedo code might be able to describe it better

Code: Select all

type Depth = usize;
let TARGET_SIZE = 400;
let mut queue: Vec<(GameState, Depth)> = Vec::new();
queue.push((root_node, perft_depth));
while queue.len() < TARGET_SIZE:
	if queue.len() == 0 {break;}
	let (node, depth) = queue.pop();
	assert(depth > 0) //Don't want to think about this case now
	let legal_moves = gen_moves(node);
	for mv in legal_moves{
		queue.push_last((make_move(node, mv), depth -1));
	}

This will fill the queue up until it has reached a certain size, and will leave the "Tasks" with higher depths at the front of the queue, so that they will be dealt with first, which is good since they are the most probable to take the most of time. So IIRC this is just a typical breadth-first schema.

"SuperLazy SMP" by using move list chunks: would this work?

"SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?

Re: "SuperLazy SMP" by using move list chunks: would this work?