Devlog of Leorik
Moderator: Ras
- 
				Mike Sherwin
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: Devlog of Leorik
Since Leorik has static piece square tables Leorik's evaluation may benefit by receiving feedback from the search. I was first to implement this idea in RomiChess. Later it was added to Rebel. And Tord had a feedback method in Glaurung. I have no idea if SF kept it. Basically the history tables can be used to modify by a small amount the piece square tables before the search.
			
			
									
						
										
						- 
				Mike Sherwin
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: Devlog of Leorik
Does Leorik use Zero Window searches? If so do you check for (beta - alpha > 1)?
EXAMPLE CODE
			
			
									
						
										
						EXAMPLE CODE
Code: Select all
 
      if (beta - alpha > 1) {
        score = -Search(-alpha - 1, -alpha, depth - 1);
        if (score > alpha)
          score = -Search(-beta, -alpha, depth - 1);
      }
      else {
        score = -Search(-beta, -alpha, depth - 1);
      }- 
				Mike Sherwin
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: Devlog of Leorik
Just tossing ideas into the air. Is the move number known at the previous plys of search? If so it is possible to LMR (or LMP) more aggressively. If on the previous ply it is on move 20 then there is a good chance that side can't hurt the current side.
			
			
									
						
										
						- 
				lithander  
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Devlog of Leorik
Thanks for the ideas, Mike!
Here's the relevant source code: https://github.com/lithander/Leorik/blo ... eSearch.cs
			
			
									
						
										
						I'm now using a king-relative piece-square formula with 8 parameters that (thanks to SIMD) are quick enough to compute on the fly. Your idea might be applicable (even though I can't imagine the details involved) but I feel like the existing move-history techniques try to achieve the same goal, penalizing certain lines based on previous search results.Mike Sherwin wrote: ↑Tue Aug 29, 2023 8:58 pm Since Leorik has static piece square tables Leorik's evaluation may benefit by receiving feedback from the search.
Yes, of course. In practice 99% of my lines are on zero window and it's been a pillar of Leorik's search since a long time.Mike Sherwin wrote: ↑Tue Aug 29, 2023 9:39 pm Does Leorik use Zero Window searches? If so do you check for (beta - alpha > 1)?
Here's the relevant source code: https://github.com/lithander/Leorik/blo ... eSearch.cs
I tried something like that. Kept track of how many of the previous moves where in the "late quiet" category to prune these lines more aggressively but coulnd't get it to work in a way that would increase strength! I think that's because LMR is already cumulative e.g. the reductions add up already.Mike Sherwin wrote: ↑Tue Aug 29, 2023 10:04 pm Is the move number known at the previous plys of search? If so it is possible to LMR (or LMP) more aggressively. If on the previous ply it is on move 20 then there is a good chance that side can't hurt the current side.
- 
				spirch
- Posts: 95
- Joined: Fri Nov 09, 2012 12:36 am
Re: Devlog of Leorik
with the release of .net 8 (RC1), can you measure the performance increase from it?
			
			
									
						
										
						- 
				lithander  
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Devlog of Leorik
I switched to a preview of .Net 8 already half a year ago. According to my commit message it gave a 5% performance increase:
https://github.com/lithander/Leorik/com ... 4b8516bcf1
- 
				spirch
- Posts: 95
- Joined: Fri Nov 09, 2012 12:36 am
Re: Devlog of Leorik
nice, do you mind retesting it to see if any more performance increase since the last time?lithander wrote: ↑Tue Sep 19, 2023 11:29 amI switched to a preview of .Net 8 already half a year ago. According to my commit message it gave a 5% performance increase:
https://github.com/lithander/Leorik/com ... 4b8516bcf1
- 
				lithander  
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Devlog of Leorik
When I noted 5% in the commit message I compared .Net6 to .Net8 (preview) but now comparing .Net7 with .Net8 (RC) the difference is less than 1%.
Leorik compiled with .Net7 takes 28236ms to search startpos to depth 22.
Leorik compiled with .Net8 takes 28052ms to search startpos to depth 22.
(I took the best result out of 5 runs each)
- 
				spirch
- Posts: 95
- Joined: Fri Nov 09, 2012 12:36 am
Re: Devlog of Leorik
i am wondering if the "new" int128 could be useful or not in some scenario
			
			
									
						
										
						- 
				lithander  
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Devlog of Leorik
Seeing that a lot of chess engines in the CCRL list above Leorik are tested with 4 CPUs I wanted to add support for the 'Threads' uci option as well for the upcoming version.
Ideally I want to maintain only *one* search implementation no matter the amount of cores and so the idea of implementing parallelism the "lazy" way with just a shared TT appealed to me. That way I would just instantiate the already existing IterativeSearch.cs multiple times and run them in parallel.
So I did that and proceeded to run my WAC test set to depth 15 but didn't get very far due to the TT becoming corrupted. Makes sense because reading and writing the 16 byte sized entries isn't atomic. And the higher the Thread count the faster it happens that two threads happen to access exactly the same slot at the same time, I just didn't think that it would happen quite as frequently as I observed it.
The simple fix of using a lock and allowing only one thread exclusive access at a time worked but now I'd run into scaling issues:
The reported nodes and nps values are always from the first search instance only. Running on two threads and sharing information via the TT allows a search to conclude with 30% less nodes searched than usual. The speed penalty from locking is only about 10% and so we have a net win.
But I was surprised by how poorly this scales with more threads. While the nodes searched per instance continue to drop the search speed tanks way quicker. If a search would keep it's normal 5M nodes per second speed than with 32 threads we'd have to search only 10% of the nodes per instance to reach the target depth. But in practice adding more than 3 or 4 threads is hurting more than it helps because the threads start to wait for the TT more than that they are searching.
Luckily there's an ingenious idea described in the chess programming wiki https://www.chessprogramming.org/Shared ... #Lock-less The idea is that before you write the key into the TT you XOR it with the data you're also going to write. When you want to know the key you can just do the same operation again provided that you have still the same data available. But if the table get's corrupted and key and data don't match anymore this won't produce the key that you expect. It's like a checksum but without dedicating any space to it.
In C++ you can use 'union' to map different members of a struct to the same memory. In C# that's rarely done but possible with using explicit struct layouts as seen above. And I was happy to find that the performance overhead is so negligible that I can just use the new lockless TT in single-threaded search as well!
Now I observed much better scaling. The speed of individual search instances still drops when you add more threads total but not so much that it would outweigh the gains. Why does it still drop? I think there are many reasons... one is that you just get more TT hits and better PV moves and that gives less opportunities to brute-force-search. The other is that my 12 core CPU boosts a single core much higher than the clock rate I see on a multi-core load because I have constrained the max power consumption to 100W. So these results felt about right.
A short match confirmed that the engine was gaining strength from of using multiple (e.g. 4) search instances.
I don't know what other authors do to prevent the search instances to be too much in synchrony but I just implemented a Shuffle() method that randomizes the order of the root moves completely randomly at the beginning of the search. And with several individual search instances you also get to make the choice of which instance has the honor to provide the final result and PV. My first implementation just always took the the result from the first instance that was started. I experimented a bit with trying to use the one whose search concludes first. But afterwards it felt pointless to wait for the other's that were still running, so I tried aborting them and that saved me a bit of time but for some reason I lost strength that way. So I allowed each search instance to conclude and then took the result of the one which finished last. This performed 20 Elo better than just taking the first one in the list or the first one that concludes. Why? No idea... the longer the search the better the result, right? Makes enough sense to not doubt my results and move on!
So I did a final short test with 8 threads vs 1 at 5s + 200ms and the results indicated a 165 Elo advantage!
For me that was better than expected and considering that concurrent programming can be difficult and painful sometimes I found the experience of adding this feature surprisingly smooth and gratifying! That you can have a lock-less shared TT is pretty awesome and I didn't expect it to be that effective in terms of strength.
If you've implemented parallel search in your own engine I'd be curious to know if you think my results make sense and how they compare to yours.
			
			
									
						
										
						Ideally I want to maintain only *one* search implementation no matter the amount of cores and so the idea of implementing parallelism the "lazy" way with just a shared TT appealed to me. That way I would just instantiate the already existing IterativeSearch.cs multiple times and run them in parallel.
So I did that and proceeded to run my WAC test set to depth 15 but didn't get very far due to the TT becoming corrupted. Makes sense because reading and writing the 16 byte sized entries isn't atomic. And the higher the Thread count the faster it happens that two threads happen to access exactly the same slot at the same time, I just didn't think that it would happen quite as frequently as I observed it.
The simple fix of using a lock and allowing only one thread exclusive access at a time worked but now I'd run into scaling issues:
Code: Select all
1 Threads (15): 188014105 nodes, 36,351s, 6% CPU, 286/300 (5M nps)
2 Threads (15): 127214891 nodes, 28,264s, 11% CPU, 286/300 (4.5M nps)
3 Threads (15): 106139316 nodes, 24,963s, 16% CPU, 285/3000 (4.2M nps)
4 Threads (15):  89509121 nodes, 25,483s, 21% CPU, 284/300 (3.5M nps)
8 Threads (15):  50923786 nodes, 66,646s, 38% CPU, 285/300 (0.76M nps)
8 Threads (15):  62190943 nodes, 50,294s, 38% CPU, 285/300 (1.2M nps)
12Threads (15):  39271485 nodes, 54,197s, 39% CPU, 285/300 (0.7M nps)
32Threads (15):  18303405 nodes, 66,025s, 38% CPU, 287/300 (0.3M nps)
But I was surprised by how poorly this scales with more threads. While the nodes searched per instance continue to drop the search speed tanks way quicker. If a search would keep it's normal 5M nodes per second speed than with 32 threads we'd have to search only 10% of the nodes per instance to reach the target depth. But in practice adding more than 3 or 4 threads is hurting more than it helps because the threads start to wait for the TT more than that they are searching.
Luckily there's an ingenious idea described in the chess programming wiki https://www.chessprogramming.org/Shared ... #Lock-less The idea is that before you write the key into the TT you XOR it with the data you're also going to write. When you want to know the key you can just do the same operation again provided that you have still the same data available. But if the table get's corrupted and key and data don't match anymore this won't produce the key that you expect. It's like a checksum but without dedicating any space to it.
Code: Select all
[StructLayout(LayoutKind.Explicit)]
public struct HashEntry
{
	[FieldOffset(0)]
	public ulong Key;        //8 Bytes
	[FieldOffset(8)]
	public short Score;      //2 Bytes
	[FieldOffset(10)]
	public byte Depth;       //1 Byte
	[FieldOffset(11)]
	public byte Age;         //1 Byte
	[FieldOffset(12)]
	public int MoveAndType;  //4 Byte
	//=================================
	//                        16 Bytes
	[FieldOffset(8)]
	public ulong Data;
	public ScoreType GetScoreType() => (ScoreType)(MoveAndType >> 29);
	public Move GetMove() => (Move)(MoveAndType & 0x1FFFFFFF);
	public ulong GetHash() => Key ^ Data;
	public void SetHash(ulong hash) => Key = hash ^ Data;
}Code: Select all
1 Threads (18): 158949431 nodes, 31,305s, 5077K NPS, 98 / 100
2 Threads (18): 112319050 nodes, 23,062s, 4870K NPS, 98 / 100
4 Threads (18):  75451726 nodes, 17,06s,  4422K NPS, 98 / 100
8 Threads (18):  54538679 nodes, 15,665,  3481K NPS, 98 / 100
12Threads (18):  39171151 nodes, 13,778,  2842K NPS, 98 / 100Code: Select all
Match Settings: book=varied.bin tc=5+0.2 hash=32
Leorik-2.4.7b @4Threads vs Leorik-2.4.7:
491 - 179 - 606  [0.622] 1276
Elo difference: 86.7 +/- 13.8, LOS: 100.0 %, DrawRatio: 47.5 %I don't know what other authors do to prevent the search instances to be too much in synchrony but I just implemented a Shuffle() method that randomizes the order of the root moves completely randomly at the beginning of the search. And with several individual search instances you also get to make the choice of which instance has the honor to provide the final result and PV. My first implementation just always took the the result from the first instance that was started. I experimented a bit with trying to use the one whose search concludes first. But afterwards it felt pointless to wait for the other's that were still running, so I tried aborting them and that saved me a bit of time but for some reason I lost strength that way. So I allowed each search instance to conclude and then took the result of the one which finished last. This performed 20 Elo better than just taking the first one in the list or the first one that concludes. Why? No idea... the longer the search the better the result, right? Makes enough sense to not doubt my results and move on!

So I did a final short test with 8 threads vs 1 at 5s + 200ms and the results indicated a 165 Elo advantage!
Code: Select all
conf="Leorik-2.4.7c" option.Threads=8 -each tc=5+0.2 book=varied.bin option.Hash=64
Score of Leorik-2.4.7c vs Leorik-2.4.7: 165 - 24 - 129  [0.722] 318
Elo difference: 165.5 +/- 30.1, LOS: 100.0 %, DrawRatio: 40.6 %If you've implemented parallel search in your own engine I'd be curious to know if you think my results make sense and how they compare to yours.