The algorithm I use works roughly as follows:
1. Main process forks "n" child processes at startup
2. Command communication between main process and children is through pipes
3. Child sends candidate split moves to the main process, and continues searching as normal
4. Child applies YBW and minsplitdepth = 6 for this
5. Main process collects these split moves, and when a child is idle, dispatches a search there
6. The idea is that the main process should always have a stash of moves to send to idle cpus (in reality the stash often runs empty, due to YBW)
7. A child doesn't wait for a result of a split move. If it is out of work, it will pick a move which it has sent to the main process, and searches it himself.
8. The child sends a cancel request for that move to the parent when that happens
9. If the parent hasn't dispatched the move yet, it just removes it from its stash
10. Otherwise it aborts the search at the child to which it has dispatched the move before
11. Abort messages from main process to childs are SIGINT signals
12. All communication on the pipes is in human readable command line text
13. Search results of split moves are stored in the global shared memory hash table. No additional communication other than that.
So the main process doesn't search itself, it just acts as a conductor of the orchestra.
This is a horrible algorithm, but note that it was optimised for something. Not for speed or elo. It was optimised for implementation time first, and second to keep the door open for a distributed search (which I never made, but still). If I look back in my log book, I started the implementation on Sep/18 (2010), and it was running reliably two weeks later on Oct/02. Well in time to trust it for the Leiden tournament of November that year, where Rookie v3.0 debuted and won the Programmer's Prize.
I haven't touched the algorithm ever since. Before I would do that, I need to know how well it performs. Recently I started to measure that.
Test conditions:
- AMD 8350 (8 cores) @ 4GHz, 16 GB ram, 128 GB SSD.
- cutechess-cli
- Round-robin tournament between 1cpu, 2cpu, 4cpu and 8cpu version.
- When 8cpu was playing, cutechess-cli concurrency was set to 1. When 4cpu was playing, concurrency was 2. When 2cpu was playing, concurrency was 4.
- Hash table 256MB per player
- 6pc Syzygy on SSD
- No pondering
- Timecontrol: 120s base with 5s increment
- 1000 openings, played from both sides.
- Resign when score below -8 pawns for 3 moves
- The 1cpu version doesn't fork processes. The search in that case is done by the main process. (This shouldn't matter, but still).
The tournament runs started Oct/19 and finished Feb/04, or just over 100 days. You can see why I haven't done this before.
Results are bad or not so bad, depending on your point of view:
Code: Select all
Rank Name Elo + - games score oppo. draws
1 rookie3.8b2-cpu8 84 7 7 6000 58% 39 53%
2 rookie3.8b2-cpu4 73 7 7 6000 55% 42 55%
3 rookie3.8b2-cpu2 43 7 7 6000 48% 52 55%
4 rookie3.8b2 0 7 7 6000 39% 67 51%
Conclusion: not bad for 2 weeks of programming effort. Not good for what should be possible with 8 cores.
I have somewhat lost my interest in chess programming, so I will probably not touch it in the coming years anyway. I just wanted to share a result.