Testing strategies for my engine's playing strength


unserializable
Posts: 64
Joined: Sat Oct 24, 2020 6:39 pm
Full name: Taimo Peelo

Re: Testing strategies for my engine's playing strength

Post by unserializable »

lithander wrote: Wed Jan 27, 2021 11:38 am Are there well-established testing procedures like that? Data sets? Did you do something like that with your own engine? (Apart from running tournaments between different versions, of course, which has already been discussed.)
Hi Thomas. With regard to data sets, there are EPD sets (Extended Position Description, https://www.chessprogramming.org/Extend ... escription) with positions annotated with best moves, moves to avoid, and other metadata. You should be able to find some with a web search, and some chess engine source distributions also include them. I do not have any particular set to recommend from experience (I am planning to use them, but have not yet) -- have a look at https://github.com/ChrisWhittington/Chess-EPDs and https://www.chessprogramming.org/Test-Positions
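To make the format concrete, here is a minimal C# sketch (not from Monchester or any other engine in this thread) that pulls the position fields and the 'bm' (best move) opcode out of one EPD line:

Code:

// Minimal sketch: parse one EPD line into the board fields and the "bm" opcode.
using System;
using System.Linq;

static class EpdExample
{
    public static void Main()
    {
        string line = "8/p2QPrk1/6p1/2q5/8/6P1/PPP2R1P/6K1 w - - bm e8=N; id \"underpromotion\";";
        string[] tokens = line.Split(' ');
        // The first four fields are piece placement, side to move, castling rights and en passant.
        string fen = string.Join(" ", tokens.Take(4));
        // Everything after that is a list of ';'-terminated opcodes such as "bm" or "id".
        string opcodes = string.Join(" ", tokens.Skip(4));
        string bestMove = opcodes.Split(';')
                                 .Select(op => op.Trim())
                                 .Where(op => op.StartsWith("bm "))
                                 .Select(op => op.Substring(3))
                                 .FirstOrDefault();
        Console.WriteLine($"FEN: {fen}");
        Console.WriteLine($"bm:  {bestMove ?? "(none)"}");
    }
}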

Also, as an engine is developed, one is bound to stumble on positions which the engine cannot yet solve or where it does not act optimally. It is good to keep these positions and use them to verify new features and to check that fixes do not cause regressions. One simple position below that I have used to verify that underpromotion works:

[d] 8/p2QPrk1/6p1/2q5/8/6P1/PPP2R1P/6K1 w - - 0 34

e8=N is the best move for White (to move), and depth 4 is enough to find it as such (queening leads to a loss in 4 plies: 1. e8=Q Qxf2+ 2. Kh1 Qf1#).

EDIT: ah, realized that your perft suite already is EPD, so you know about them -- nevertheless, maybe some of the links contain new material :)
Monchester 1.0, chess engine playing at scholastic level: https://github.com/unserializable/monchester ("Daddy, it is gonna take your horsie!")
Tickle Monchester at: https://lichess.org/@/monchester
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Testing strategies for my engine's playing strength

Post by lithander »

unserializable wrote: Wed Jan 27, 2021 3:59 pm EDIT: ah, realized that your perft suite already is EPD, so you know about them -- nevertheless, maybe some of the links contain new material :)
The links were super helpful!

And I think I've arrived at a pretty solid testing strategy to guide my next steps. For that I've added two new commands to my engine.

'make_bench' reads FENs out of *.epd files I found in the repo Taimo linked. It then performs a search to a given depth and writes output like the following to a new file.

Code:

r2q1r2/1b2bpkp/n3p1p1/2ppP1P1/p6R/1PN1BQR1/NPP2P1P/4K3 w - -;ce 300;bm g3h3;acd 7;acn 88020825;acs 34.792
r1b1k2r/ppppqppp/8/2bP4/3p4/6P1/PPQPPPBP/R1B2RK1 b kq -;ce 200;bm e7e2;acd 7;acn 111587140;acs 56.816
rnbqkb1r/p3pppp/1p6/2ppP3/3N4/2P5/PPP1QPPP/R1B1KB1R w KQkq -;ce 100;bm e2f3 e2b5 d4b3 d4f3 d4b5 e5e6;acd 7;acn 122484466;acs 36.726
6R1/4qp1p/ppr1n1pk/8/1P2P1QP/6N1/P4PP1/6K1 w - -;ce 33300;bm g4h5;acd 7;acn 58336141;acs 20.965
The data generated and attached to each position is:
  • ce = evaluation of best moves
  • bm = best moves
  • acd = searched depth
  • acn = number of nodes evaluated
  • acs = search duration in seconds
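As an illustration of how such a line can be assembled, here is a small C# sketch; the SearchResult record and its fields are made up for the example, only the ce/bm/acd/acn/acs opcodes come from the legend above:

Code:

using System;
using System.Globalization;

// Made-up container for what one fixed-depth search returns.
record SearchResult(string Fen, int Score, string[] BestMoves, int Depth, long Nodes, double Seconds);

static class MakeBenchExample
{
    // Formats one search result as an EPD line with ce/bm/acd/acn/acs opcodes,
    // matching the layout of the output shown above.
    static string ToEpdLine(SearchResult r) =>
        $"{r.Fen};ce {r.Score};bm {string.Join(" ", r.BestMoves)};acd {r.Depth};" +
        $"acn {r.Nodes};acs {r.Seconds.ToString("0.###", CultureInfo.InvariantCulture)}";

    public static void Main()
    {
        var r = new SearchResult("6R1/4qp1p/ppr1n1pk/8/1P2P1QP/6N1/P4PP1/6K1 w - -",
                                 33300, new[] { "g4h5" }, 7, 58336141, 20.965);
        Console.WriteLine(ToEpdLine(r));
    }
}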
I've evaluated a thousand positions like that using my min-max implementation, which takes very long but has to be done only once. (Unless you change the eval, which I don't plan to do for now; I just count material and nothing else.)

After generating this data I started to improve the min-max algorithm by pruning. I assume that for each position searched at the same depth (1) the same eval and best moves should be found as in the min-max implementation and (2) the number of nodes evaluated and the search duration get reduced by the pruning.

To confirm that, I've implemented a 'bench' command which reads a file produced with 'make_bench' containing dozens or even hundreds of test positions. It evaluates each position at the same depth, compares the results, and generates output such as this:

Code:

1 OK! acn 13815676 / 4192086432 = 0.003, acs 4.797 / 836.49 = 0.006.
2 OK! acn 1917864 / 486503398 = 0.004, acs 1.867 / 116.351 = 0.016.
3 OK! acn 4960790 / 1260872611 = 0.004, acs 1.649 / 243.975 = 0.007.
4 OK! acn 5389962 / 988286346 = 0.005, acs 4.349 / 243.634 = 0.018.
...
Test finished with 0 wrong results! Avg time ratio: 0.016, Avg node ratio: 0.007, Performance: 1831kN/sec
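The comparison step itself can be as simple as the sketch below (illustrative C#, not MinimalChess's actual code): the score and best move must match the reference exactly, while the node and time ratios are merely reported:

Code:

using System;

static class BenchExample
{
    // Reference values come from the make_bench file, the new values from the
    // current search. Same depth, so score and best move must match exactly;
    // only node count and time are expected to shrink.
    static bool Compare(int index,
                        int refScore, string refBestMove, long refNodes, double refSecs,
                        int newScore, string newBestMove, long newNodes, double newSecs)
    {
        bool ok = newScore == refScore && newBestMove == refBestMove;
        double nodeRatio = (double)newNodes / refNodes;
        double timeRatio = newSecs / refSecs;
        Console.WriteLine($"{index} {(ok ? "OK!" : "WRONG!")} " +
                          $"acn {newNodes} / {refNodes} = {nodeRatio:0.000}, " +
                          $"acs {newSecs} / {refSecs} = {timeRatio:0.000}.");
        return ok;
    }

    public static void Main()
    {
        // Node and time figures taken from the first output line above;
        // the score and best move are placeholders.
        Compare(1, 300, "g3h3", 4192086432, 836.49, 300, "g3h3", 13815676, 4.797);
    }
}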
So that means my implementation of alpha-beta did indeed improve the search by providing (1) the same eval and best moves on all test positions while being (2) on average about 60x faster (Avg time ratio: 0.016), because it evaluated less than 1% of the leaf nodes that it would evaluate without pruning (Avg node ratio: 0.007). I can also see that the performance in nodes per second is sadly reduced by a significant factor, which gives me new ideas for improving the search further.

But the important benefit of this benchmark is that I can now have the same trust in the alpha-beta implementation that I had in the min-max one, and use it to generate another dataset at even greater depth. That can be used to verify the correctness of the next improvements to my search.

So that's almost like 'perft' but not to test the move generator but the search! :)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Testing strategies for my engine's playing strength

Post by mvanthoor »

unserializable wrote: Wed Jan 27, 2021 3:59 pm Also, as an engine is developed, one is bound to stumble on positions which the engine cannot yet solve or where it does not act optimally. It is good to keep these positions and use them to verify new features and to check that fixes do not cause regressions. One simple position below that I have used to verify that underpromotion works:

[d] 8/p2QPrk1/6p1/2q5/8/6P1/PPP2R1P/6K1 w - - 0 34

e8=N is the best move for White (to move), and depth 4 is enough to find it as such (queening leads to a loss in 4 plies: 1. e8=Q Qxf2+ 2. Kh1 Qf1#).

EDIT: ah, realized that your perft suite already is EPD, so you know about them -- nevertheless, maybe some of the links contain new material :)
If your engine has both "in-check extension" and quiescence search, it can find the correct promotion at depth 1 already.

My engine never considers anything else but e7e8n, because NOT giving check basically loses the game.

Code:

go
info score cp 1100 depth 1 seldepth 3 time 0 nodes 103 nps 0 pv e7e8n g7g8 d7f7
info score cp 1120 depth 2 seldepth 3 time 0 nodes 267 nps 0 pv e7e8n g7h8 d7f7
info score cp 1120 depth 3 seldepth 7 time 0 nodes 2098 nps 0 pv e7e8n g7h8 d7f7 g6g5
info score cp 1150 depth 4 seldepth 9 time 5 nodes 17331 nps 3466200 pv e7e8n g7h6 d7h3 c5h5 h3h5 g6h5 f2f7
info score cp 1150 depth 5 seldepth 9 time 33 nodes 105166 nps 3186848 pv e7e8n g7h6 d7h3 c5h5 h3h5 g6h5 f2f7 a7a5
info score cp 1175 depth 6 seldepth 11 time 232 nodes 819600 nps 3532759 pv e7e8n g7h6 d7h3 c5h5 h3h5 g6h5 f2f7 a7a5 e8d6
info score cp 1240 depth 7 seldepth 11 time 1496 nodes 5343413 nps 3571800 pv e7e8n g7h6 d7h3 c5h5 h3h5 g6h5 f2f7 a7a5 f7a7 h6g5 a7a5
info score cp 1725 depth 8 seldepth 15 time 10187 nodes 37896833 nps 3720117 pv e7e8n g7h6 d7f7 c5d4 f7f8 h6g5 h2h4 g5g4 f8f3 g4h3 f3g2 h3g4 e8f6 d4f6 f2f6
info score cp 1725 depth 9 seldepth 11 time 68812 nodes 258439645 nps 3755735 pv e7e8n g7h6 d7f7 c5d4 f7f8 h6g5 h2h4 g5g4 f8f3 g4h3 f3g2 h3g4 e8f6 d4f6 f2f6
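For readers wondering what an 'in-check extension' inside quiescence looks like, here is a bare-bones C# sketch. The IPosition interface is a placeholder for an engine's own board API; none of this is taken from Rustic:

Code:

using System.Collections.Generic;

// Sketch of a quiescence search with an in-check extension.
interface IPosition
{
    bool InCheck();
    int Evaluate();                       // static eval from the side to move's point of view
    IEnumerable<IPosition> Captures();    // positions after each capture/promotion
    IEnumerable<IPosition> AllMoves();    // positions after every legal move
}

static class Quiescence
{
    const int MateScore = -100000;

    public static int Search(IPosition pos, int alpha, int beta)
    {
        bool inCheck = pos.InCheck();

        if (!inCheck)
        {
            // Stand pat: the side to move may decline all captures.
            int standPat = pos.Evaluate();
            if (standPat >= beta) return beta;
            if (standPat > alpha) alpha = standPat;
        }

        // In check, *all* evasions are searched (the extension); otherwise only
        // captures and promotions, so the search settles on a quiet position.
        var children = inCheck ? pos.AllMoves() : pos.Captures();
        bool anyMove = false;

        foreach (var child in children)
        {
            anyMove = true;
            int score = -Search(child, -beta, -alpha);
            if (score >= beta) return beta;
            if (score > alpha) alpha = score;
        }

        // No legal move while in check means the side to move is mated.
        if (inCheck && !anyMove) return MateScore;
        return alpha;
    }
}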
PS: I've watched a part of your first video. Good start :) I'm going to follow your series. Good luck with the engine. I look forward to playing Rustic against it some day.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Testing strategies for my engine's playing strength

Post by mvanthoor »

lithander wrote: Fri Jan 29, 2021 12:57 am But the important benefit of this benchmark is that I can now have the same trust in the alpha-beta implementation that I had in the min-max one, and use it to generate another dataset at even greater depth. That can be used to verify the correctness of the next improvements to my search.

So that's almost like 'perft' but not to test the move generator but the search! :)
There is one thing you are overlooking.

Min-Max finds the best move it can, using the evaluation function you provide: true.
Alpha-Beta should find exactly the same move as Min-Max, given the same evaluation function: true.

So yes, the testing method you are using is sound in this case, because alpha-beta is an optimization of Min-Max. (Alpha-beta is a faster Min-Max.) However, there are many pruning methods that work by throwing moves away because they are "probably not really good." This means that those pruning methods may NOT always give you the same results as Alpha-Beta.

The reason is that these pruning methods involve risk: they throw away moves they consider "probably not so good", so they can search the moves they consider "probably very good" more deeply, and pick the best candidate from the latter. So you can search deeper, but you run the risk of throwing good moves away and thus missing something. Most of the time this works out fine: searching the "probably good" moves deeper yields a stronger engine than searching more moves less deeply for fear of missing something.

This does mean that you can't use this testing method for most future search enhancements, as you'll need to walk a fine line between what to prune and what to keep. This completely depends on your own engine; the pruning parameters set for one engine may not work at all for yours, because your engine runs at a different speed (and can thus search more or fewer moves) and has a different evaluation (so it judges differently what is good and bad).

In the end, the only real way of testing is to set up a gauntlet against a list of engines, play 2000 games at a time control such as CCRL Blitz's 2m+1s, and see where the engine ends up. You can estimate the rating by running BayesElo on the games and then calibrate the list against the top-rated engine in your tournament. Then you have your base list. After you make changes, you have the new engine run the exact same gauntlet, and you make a new list. If, in this new list, the engine is rated higher than in the previous list, you know you have improved by somewhere around that many rating points.
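For intuition about the rating math (this is just the plain logistic Elo model; BayesElo additionally models draws and uses priors), a small sketch:

Code:

using System;

static class EloExample
{
    // Rating difference implied by a match score, using the standard logistic model.
    // score = (wins + 0.5 * draws) / games, from the new engine's point of view.
    static double EloDifference(double score) =>
        -400.0 * Math.Log10(1.0 / score - 1.0);

    public static void Main()
    {
        int wins = 1100, draws = 300, losses = 600;   // made-up gauntlet result
        int games = wins + draws + losses;
        double score = (wins + 0.5 * draws) / games;
        Console.WriteLine($"Score {score:P1} over {games} games is roughly {EloDifference(score):+0;-0} Elo");
    }
}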

When you have a few versions of your engine, you can also arrange a self-play tournament (every version against every other version) and see how they end up. Your most recent version should be the strongest.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
ydebilloez
Posts: 163
Joined: Tue Jun 27, 2017 11:01 pm
Location: Lubumbashi
Full name: Yves De Billoëz

Re: Testing strategies for my engine's playing strength

Post by ydebilloez »

Welcome to the club. In belofte, I did the following for testing:
  • I have some scripts, together with engines.json, to run cutechess tournaments; I also use ordo to re-evaluate the scores;
  • I have some scripts that can be thrown at the engine and that will run an individual test;
  • I have support for EPD, together with some EPD files;
  • Evaluation functions, search algorithms and quiescence are runtime-configurable, so the engine can undo the move and search again with another algorithm.
Feel free to grab anything from the trunk branch or the 0.9.20 branch. It works fine on Windows with Git Bash, and on Linux. It should work on macOS as well.
Yves De Billoëz @ macchess belofte chess
Once owner of a Mephisto I, II, challenger, ... chess computer.
unserializable
Posts: 64
Joined: Sat Oct 24, 2020 6:39 pm
Full name: Taimo Peelo

Re: Testing strategies for my engine's playing strength

Post by unserializable »

lithander wrote: Fri Jan 29, 2021 12:57 am But the important benefit of this benchmark is that I can now have the same trust in the alpha-beta implementation that I had in the min-max one, and use it to generate another dataset at even greater depth. That can be used to verify the correctness of the next improvements to my search.
Very nice, clean and methodical; your threats about the upcoming v0.2 are to be taken most seriously :)! I would not be too concerned about later pruning methods affecting the move selection: the test positions can then be restricted to ones where the algorithms agree on the move selection, or the test conditions set up such that the move selections after additional pruning are indeed expected to deviate.
Monchester 1.0, chess engine playing at scholastic level: https://github.com/unserializable/monchester ("Daddy, it is gonna take your horsie!")
Tickle Monchester at: https://lichess.org/@/monchester
derjack
Posts: 16
Joined: Fri Dec 27, 2019 8:47 pm
Full name: Jacek Dermont

Re: Testing strategies for my engine's playing strength

Post by derjack »

lithander wrote: Fri Jan 29, 2021 12:57 am So that means my implementation of alpha-beta did indeed improve the search by providing (1) the same eval and best moves on all test positions while being (2) on average about 60x faster (Avg time ratio: 0.016), because it evaluated less than 1% of the leaf nodes that it would evaluate without pruning (Avg node ratio: 0.007). I can also see that the performance in nodes per second is sadly reduced by a significant factor, which gives me new ideas for improving the search further.

But the important benefit of this benchmark is that I can now have the same trust in the alpha-beta implementation that I had in the min-max one, and use it to generate another dataset at even greater depth. That can be used to verify the correctness of the next improvements to my search.

So that's almost like 'perft' but not to test the move generator but the search! :)
Good, this way you can test safe things like move ordering, PVS/NegaScout or aspiration windows. Is this with TT, killer moves and/or other move-ordering enhancements?
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Testing strategies for my engine's playing strength

Post by lithander »

mvanthoor wrote: Fri Jan 29, 2021 1:33 am I look forward to playing Rustic against it some day.
I guess with the self-imposed constraint of intentionally making my first engine a minimal one, that will never be a fair fight! ;)
mvanthoor wrote: Fri Jan 29, 2021 2:35 am There is one thing you are overlooking. [...] There are many pruning methods that will NOT always give you the same results as Alpha-Beta.
That's a problem future-me will have to solve in whatever more ambitious engine I'll attempt after minimal chess. :roll:

It might sound crazy to some of you that I intentionally limit myself like that, but I've experienced burning out on overly ambitious projects time and time again and learned my lesson. So I'm not trying to maximize the Elo of my first engine but rather its price-performance ratio, where price may be defined as the confusion and despair a chess programming noob experiences while trying to understand, debug and maintain its source code. ;)
ydebilloez wrote: Fri Jan 29, 2021 10:04 am Welcome to the club. In belofte, I did following for testing: [...]
Thanks! I like the idea of using shell scripts that drive the built engine for testing. Currently I have another executable besides the UCI engine that can render the board as ASCII and accept various commands (manually typed in) for debugging and testing, but that's not as easy to automate as your approach!

And I also added belofte 2.1.0 to my list of sparring partners. It seems like it isn't totally out of reach regarding its playing strength on default settings! What settings would you recommend? E.g. what are its strongest and weakest settings configurations?
derjack wrote: Fri Jan 29, 2021 5:10 pm Is this with TT, killer moves and/or other order moves enhancements?
Nope. So far I've only used the testing approach to compare min-max vs alpha-beta. I'm going to add move ordering and some kind of iterative deepening next. I can post the results here if anyone else is curious!
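Since MVV-LVA is the usual first step in move ordering, here is a tiny self-contained sketch of the idea (the piece values and scoring scheme are illustrative, not MinimalChess's):

Code:

using System;

static class MvvLvaExample
{
    enum Piece { Pawn = 1, Knight = 2, Bishop = 3, Rook = 4, Queen = 5, King = 6 }

    // Most Valuable Victim / Least Valuable Attacker: sort captures so that
    // e.g. pawn takes queen is tried before queen takes pawn.
    // Higher score = searched earlier.
    static int MvvLva(Piece attacker, Piece victim) =>
        10 * (int)victim - (int)attacker;

    public static void Main()
    {
        Console.WriteLine(MvvLva(Piece.Pawn,  Piece.Queen)); // 49 -> try first
        Console.WriteLine(MvvLva(Piece.Queen, Piece.Pawn));  //  5 -> try late
    }
}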
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Testing strategies for my engine's playing strength

Post by mvanthoor »

lithander wrote: Sat Jan 30, 2021 12:34 am
mvanthoor wrote: Fri Jan 29, 2021 1:33 am I look forward to playing Rustic against it some day.
I guess with the self-imposed constraint of intentionally making my first engine a minimal one, that will never be a fair fight! ;)
It depends... my engine is also minimal, but in a different sort of way. I want to try to create an engine that is as strong as possible with as few features as possible. That means I'll add new features only when I think the current ones are maxed out.

At this point in the development, there are only two things I can do:
- Add new features
- Start looking into Texel tuning for the material+PST evaluation.

I'll probably add a transposition table first, because it greatly boosts endgame performance (and increases search depth in the middle game) without actually adding any new knowledge or capabilities to the engine. Then I think I'll start looking into tuning the PSTs before I do other things.
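For context, the core of a transposition table is small: a fixed-size array of entries indexed by the position's Zobrist key. A sketch follows; the field layout and always-replace scheme are illustrative, not Rustic's:

Code:

using System;

static class TtExample
{
    enum Bound : byte { Exact, Lower, Upper }

    // One entry: the full hash to detect index collisions, plus the data the search reuses.
    struct TtEntry
    {
        public ulong Key;       // Zobrist hash of the position
        public short Score;
        public byte Depth;
        public Bound Bound;
        public ushort BestMove; // packed from/to squares
    }

    static readonly TtEntry[] Table = new TtEntry[1 << 20]; // ~1M entries, power of two for masking

    static void Store(ulong key, short score, byte depth, Bound bound, ushort bestMove)
    {
        // Simple always-replace scheme; real engines use smarter replacement.
        ref TtEntry e = ref Table[key & (ulong)(Table.Length - 1)];
        e = new TtEntry { Key = key, Score = score, Depth = depth, Bound = bound, BestMove = bestMove };
    }

    static bool Probe(ulong key, out TtEntry entry)
    {
        entry = Table[key & (ulong)(Table.Length - 1)];
        return entry.Key == key;
    }

    public static void Main()
    {
        Store(0xDEADBEEFUL, 35, 8, Bound.Exact, 0);
        Console.WriteLine(Probe(0xDEADBEEFUL, out var e) ? $"hit: score {e.Score} at depth {e.Depth}" : "miss");
    }
}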

I've run a test against your engine (the BigExecutables version). At this point it is indeed not a fair fight. Some observations:

- Rustic is 4-7x faster, depending on the position. You'd need to look into the code with a profiler to improve this.
- Your engine doesn't seem to have quiescence search (or it doesn't work correctly), as it sometimes just gives away a piece for no reason.
- The UCI commands on the command line have to come in a very particular order. The engine works in CuteChess, but I have been able to crash the engine or make it quit in numerous ways when typing valid UCI commands on the command line.

That's a problem future-me will have to solve in whatever more ambitious engine I'll attempt after minimal chess. :roll:

It might sound crazy to some of you that I intentionally limit myself like that, but I've experienced burning out on overly ambitious projects time and time again and learned my lesson. So I'm not trying to maximize the Elo of my first engine but rather its price-performance ratio, where price may be defined as the confusion and despair a chess programming noob experiences while trying to understand, debug and maintain its source code. ;)
Doesn't sound strange at all. For me, Rustic is a very long-term project. I started it in August 2019, when I finally (FINALLY, after 20 years) bit the bullet and decided to write an engine. I set the following goals:

- Clear, readable code that I can understand myself in another 10 years
- It has to be well commented and documented
- Every function I add has to be extensively tested. I DON'T want to go back and fix bugs; only go back and improve or optimize. (In the entire development of the engine up until now, I only had to fix ONE bug: at some point I forgot to remove a castling permission on rook capture.)
- Speed. I'm running this engine through a profiler more often than is healthy for a sane person, and I try to improve/replace any line of code that is slower than other lines.
- Strength per feature. There are many engines that have lots and lots of features, but still are weak due to bugs. I don't want that, because these kinds of bugs are VERY hard to fix long after the fact.
- It can't be written in C or C++.

The one thing you're not seeing is: "I want to achieve X ELO within Y months, or be on spot A of CCRL before..." Not in your life. I'll write code for the engine when I have the time and motivation... I had two huge development sprints to get to the point where the engine is now (because I had the time and motivation). Now I have less time due to several factors, so development is slower. That isn't a problem, because the engine is basically done: I "only" "need" to add features to make it stronger.

Nope. So far I've only used the testing approach to compare min-max vs alpha-beta. I'm going to add move ordering and some kind of iterative deepening next. I can post the results here if anyone else is curious!
I'm certainly curious. I'm always curious to see how far new engines can go, especially in the beginning: how strong can they become with the features they have?
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Testing strategies for my engine's playing strength

Post by lithander »

mvanthoor wrote: Sat Jan 30, 2021 12:59 am - Start looking into Texel tuning for the material+PST evaluation.
I have replaced my piece-counting evaluation with the tuned PSTs from Sunfish because I was curious, and it made the engine play much stronger. But it's a bit like using neural networks: sure, they work, but you won't have an intuition for why. I'll stick with counting pieces for now; that I understand just fine! ;)
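For readers who haven't met piece-square tables: they simply add a per-square bonus to the material value. A toy sketch with made-up numbers (not Sunfish's tables):

Code:

using System;

static class PstExample
{
    // Made-up illustration: a knight piece-square table that rewards central
    // squares and penalizes the rim (indexed a8 = 0 .. h1 = 63).
    static readonly int[] KnightPst = BuildToyKnightPst();

    static int[] BuildToyKnightPst()
    {
        var pst = new int[64];
        for (int sq = 0; sq < 64; sq++)
        {
            int file = sq % 8, rank = sq / 8;
            // Distance from the board's center: 0 (center) .. 3 (edge).
            int dist = Math.Max(Math.Abs(2 * file - 7), Math.Abs(2 * rank - 7)) / 2;
            pst[sq] = 20 - 15 * dist; // +20 in the center, -25 on the edge
        }
        return pst;
    }

    public static void Main()
    {
        const int knightValue = 320;
        int c3 = 5 * 8 + 2; // rank index 5, file c
        int a1 = 7 * 8 + 0; // rank index 7, file a
        Console.WriteLine($"Knight on c3: {knightValue + KnightPst[c3]}"); // 325
        Console.WriteLine($"Knight on a1: {knightValue + KnightPst[a1]}"); // 295
    }
}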
mvanthoor wrote: Sat Jan 30, 2021 12:59 am Some observations:
- Rustic is 4-7x faster, depending on the position. You'd need to look into the code with a profiler to improve this.
- Your engine doesn't seem to have quiescence search (or it doesn't work correctly), as it sometimes just gives away a piece for no reason.
- The UCI commands on the command line have to come in a very particular order. The engine works in CuteChess, but I have been able to crash the engine or make it quit in numerous ways when typing valid UCI commands on the command line.
Many thanks for taking the time! Really appreciate it.

The performance doesn't surprise me. First of all, I wrote my move generation code from scratch, so it doesn't use bitboards. My board representation is just an 8x8 array of pieces; you loop over that, and whenever you find a piece there's a switch statement that calls piece-type-specific routines to add the moves of this piece. It's very straightforward. Here's the YouTube video with timecode where I explain that part to my (imagined :lol:) audience. But of course conjuring black magic on 64-bit integers would be much faster. I just didn't feel like I could do that without looking at existing implementations, which is against my rules. My move generator's 'perft 6' takes 15 seconds on the start position. That's about 8 million nodes generated per second. For a short while I was very proud of myself; then I discovered that 'qperft' finds the same result in 0.47 seconds, and I cried a little! ;)
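In rough outline, that loop-and-switch scheme looks like the following (schematic C#, not MinimalChess's actual code; only the knight case is filled in):

Code:

using System;
using System.Collections.Generic;

static class MailboxExample
{
    // Schematic mailbox move generation: the board is a flat 64-char array
    // ('.' = empty, upper case = white), we loop over all squares, and a switch
    // dispatches to a per-piece-type routine.
    static readonly (int df, int dr)[] KnightSteps =
        { (1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2) };

    static IEnumerable<string> WhiteMoves(char[] board)
    {
        for (int sq = 0; sq < 64; sq++)
        {
            switch (board[sq])
            {
                case 'N':
                    foreach (var m in KnightMoves(board, sq)) yield return m;
                    break;
                // case 'P': ... case 'B': ... case 'R': ... case 'Q': ... case 'K': ...
                // each piece type gets its own routine in the same style.
            }
        }
    }

    static IEnumerable<string> KnightMoves(char[] board, int from)
    {
        int file = from % 8, rank = from / 8;
        foreach (var (df, dr) in KnightSteps)
        {
            int f = file + df, r = rank + dr;
            if (f < 0 || f > 7 || r < 0 || r > 7) continue; // off the board
            int to = r * 8 + f;
            if (char.IsUpper(board[to])) continue;          // own piece on target square
            yield return $"{Square(from)}{Square(to)}";
        }
    }

    static string Square(int sq) => $"{(char)('a' + sq % 8)}{8 - sq / 8}";
    static int Index(string sq) => (8 - (sq[1] - '0')) * 8 + (sq[0] - 'a');

    public static void Main()
    {
        var board = new char[64];
        Array.Fill(board, '.');
        board[Index("g1")] = 'N';
        board[Index("e2")] = 'P';
        foreach (var move in WhiteMoves(board)) Console.WriteLine(move); // g1h3, g1f3
    }
}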

The search in the version you tried was just min-max, so no pruning at all. It reaches a fixed depth of 4 plies and stops. But quiescence is on my list for version 0.2, because the kind of blunders the lack of it causes is really hard to endure. Even for a bad chess player like myself! :)

I only implement a tiny subset of UCI at the moment. In version 0.2 I plan to at least support time controls, and I also want to send an 'info' string back to the GUI. But I don't think I will add all UCI commands. Is there anything you found particularly irritating, or consider important for even the most bare-bones engine?
mvanthoor wrote: Sat Jan 30, 2021 12:59 am For me, Rustic is a very long term project.
I noticed your engine before and liked the patience and care that oozes from it. Rust is also an interesting choice. It's been on my radar for a while, but I haven't given it a try yet. It does sound perfect for chess engines, though.

When I feel like I have a good understanding of what components go into a strong engine (and why), I can totally see myself trying something more ambitious. I even have a name already for my potential long-term project! But one step at a time... it was only a few months ago that I watched The Queen's Gambit and got interested in chess, after all, haha. :mrgreen:
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess