How do you know you improved?

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 10794
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: How do you know you improved?

Post by Uri Blass »

lithander wrote: Thu Feb 03, 2022 4:47 pm
Carbec wrote: Thu Feb 03, 2022 4:39 pm In fact, I will spend more time testing than developing.
> write 10 lines
> launch test
> come back next day....
Exactly how Stockfish is developed, but at a far bigger scale.

Dozens of often trivial changes are submitted to a testing queue and validated or falsified by a distributed testing system called Fishtest, where supporters can contribute their own CPU time. Currently the Stockfish project has access to ~3000 cores and can play over 3000 games per minute.
I do not understand why so many people give their computer time only to improve Stockfish's Elo.
What is so important about Elo?

I do not think that Elo is more important than being able to mate faster when you have a big advantage, or being able to beat weak engines with a big material handicap.

I believe Stockfish is not the strongest in these respects, and I think it would be better to use part of those cores to test derivatives of Stockfish that are the best at mating quickly, or the best at beating weak engines with a big material handicap.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: How do you know you improved?

Post by algerbrex »

Uri Blass wrote: Thu Feb 03, 2022 9:13 pm
lithander wrote: Thu Feb 03, 2022 4:47 pm
Carbec wrote: Thu Feb 03, 2022 4:39 pm In fact, I will spend more time testing than developing.
> write 10 lines
> launch test
> come back next day....
Exactly how Stockfish is developed, but at a far bigger scale.

Dozens of often trivial changes are submitted to a testing queue and validated or falsified by a distributed testing system called Fishtest, where supporters can contribute their own CPU time. Currently the Stockfish project has access to ~3000 cores and can play over 3000 games per minute.
I do not understand why so many people give their computer time only to improve Stockfish's Elo.
What is so important about Elo?

I do not think that Elo is more important than being able to mate faster when you have a big advantage, or being able to beat weak engines with a big material handicap.

I believe Stockfish is not the strongest in these respects, and I think it would be better to use part of those cores to test derivatives of Stockfish that are the best at mating quickly, or the best at beating weak engines with a big material handicap.
I suppose that's really only a question Stockfish contributors can answer, although I see the point you're making.

Personally, my primary goal is to increase the strength of Blunder, because I think it's fun and exciting to see my engine play stronger and stronger against different opponents, and it's rewarding and interesting to see how I can increase its strength by adding, removing, and tweaking different features.

Elo isn't my only concern, though. I remember an evaluation patch a while ago that statistically made Blunder play stronger, but the style of play and the roughness of the evaluation were so unappealing to me that I reverted the patch and settled for a slightly weaker but more attractive playing style. As a chess player, I also love aggressive attacking games, where one side immediately goes for the throat, so I've tried to tweak Blunder to play in this style. Maybe Blunder would be a bit stronger if I toned that aggression down. But I'm not going to.

That's all to say that on top of Elo, I also want Blunder to have a certain personality, even at the expense of a couple of Elo points here and there. I won't pretend to speak for every developer (I obviously can't), but I imagine some of their motivations are similar to mine.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: How do you know you improved?

Post by algerbrex »

Carbec wrote: Thu Feb 03, 2022 3:09 pm Hi,

I ran two matches and got rather different results:

Code: Select all

Score of Zangdar 0.38.00 vs Blunder 6.1.0: 61 - 22 - 17  [0.695] 100
...      Zangdar 0.38.00 playing White: 33 - 11 - 6  [0.720] 50
...      Zangdar 0.38.00 playing Black: 28 - 11 - 11  [0.670] 50
...      White vs Black: 44 - 39 - 17  [0.525] 100
Elo difference: 143.1 +/- 67.2, LOS: 100.0 %, DrawRatio: 17.0 %

Score of Zangdar 0.38.00 vs Blunder 6.1.0: 49 - 33 - 18  [0.580] 100
...      Zangdar 0.38.00 playing White: 25 - 15 - 10  [0.600] 50
...      Zangdar 0.38.00 playing Black: 24 - 18 - 8  [0.560] 50
...      White vs Black: 43 - 39 - 18  [0.520] 100
Elo difference: 56.1 +/- 62.9, LOS: 96.1 %, DrawRatio: 18.0 %
How do you proceed to validate a modification?
For info, games were at tc=10+0.3

Thanks

Philippe
Happy to see Blunder's been useful in your testing.

Anyway, others have posted very good answers already. More games will produce better results, and a field of different opponents will produce even better results.

My testing style has been a bit sloppy because of my busyness (I recently had to revert a whole host of strength "gains"), but generally, if I use self-play, I run at least a couple of thousand games at a time, normally 2000 or more, at a time control of 10+0.1s.

As I've continued developing, though, I've noticed a problem with the above scheme: self-play can be pretty biased. I remember once I made a search tweak which gained 50 Elo in self-play, but only something like 10 Elo in gauntlet testing. So now I'm leaning more towards gauntlet testing. Self-play testing can still be useful; just keep in mind that it can sometimes exaggerate gains.

A useful way to use self-play testing and gauntlet testing together is to first run a couple of hundred games of self-play at a decently fast time control (e.g. tc=8+0.8s). You won't get a perfectly accurate Elo estimate, but it should be pretty clear if one version is stronger than the other. You'll have to be a bit careful with this too, though, and use a bit of common sense. If in a 500-game match version 2 loses against version 1 120-340-40, then it's pretty safe to say you messed something up, and you shouldn't waste time on longer gauntlet testing until you've gone back and fixed the bug. But if the score is something like 220-225-55, it's not so obvious which version is stronger, so you should still opt for more precise testing methods.
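
To put rough numbers on that last point, here's a minimal sketch in Python of how an Elo difference and its 95% error bar fall out of a match score (the function name and cutoffs are my own; real tools like cutechess-cli and BayesElo are more careful about edge cases):

Code: Select all

from math import log10, sqrt

def elo_and_margin(wins, losses, draws):
    """Estimate the Elo difference and ~95% error margin from a match score.

    Sketch only: assumes independent games and a score away from 0% or 100%.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n                # mean score per game
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / n
    stderr = sqrt(var / n)                          # std. error of the mean
    to_elo = lambda s: -400 * log10(1 / s - 1)      # logistic score -> Elo
    lo, hi = to_elo(score - 1.96 * stderr), to_elo(score + 1.96 * stderr)
    return to_elo(score), (hi - lo) / 2

# The two hypothetical 500-game scores above (wins-losses-draws of version 2):
print(elo_and_margin(120, 340, 40))   # about -164 +/- 32: clearly a regression
print(elo_and_margin(220, 225, 55))   # about -3 +/- 29: could go either way

That's all the "+/-" numbers in cutechess output are, and since the error bar only shrinks with the square root of the game count, 100 games leave a very wide range, which is why Philippe's two matches could disagree by almost 90 Elo without anything being wrong.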
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: How do you know you improved?

Post by mvanthoor »

algerbrex wrote: Thu Feb 03, 2022 10:18 pm
Personally, my primary goal is to increase the strength of Blunder, because I think it's fun and exciting to see my engine play stronger and stronger against different opponents, and it's rewarding and interesting to see how I can increase its strength by adding, removing, and tweaking different features.
Same for me, but an additional goal is to get my engine to 3000 Elo (single-threaded) on the CCRL list, so I can say for sure that it is stronger than Fritz 11 and can replace it as my primary analysis engine. I've always liked Fritz, though the 11-13 versions had a playing style a bit more like the Rybka versions of those days. Deep Fritz 10.1 has an absolutely legendary playing style. (I don't own Fritz 10.1; I was almost too late even buying Fritz 11, as 12 had already been out for some time, and Deep Fritz 10.1 was already out of "print".)
Elo isn't my only concern, though. I remember an evaluation patch a while ago that statistically made Blunder play stronger, but the style of play and the roughness of the evaluation were so unappealing to me that I reverted the patch and settled for a slightly weaker but more attractive playing style. As a chess player, I also love aggressive attacking games, where one side immediately goes for the throat, so I've tried to tweak Blunder to play in this style. Maybe Blunder would be a bit stronger if I toned that aggression down. But I'm not going to.
Seems we're of the same mind here as well. Rustic Alpha 1 and 2 play relatively aggressively because I wrote the PSTs myself. (Actually, Rustic 1 and 2 calculate to roughly the same depth as I do, and setting aside the positional blunders it sometimes makes through lack of knowledge, it plays eerily like myself...) I'd rather have an engine that goes for an all-out fight in each game with attractive chess than an engine that plays boring shuffling games but ends up +20 Elo.
That's all to say that on top of Elo, I also want Blunder to have a certain personality, even at the expense of a couple of Elo points here and there. I won't pretend to speak for every developer (I obviously can't), but I imagine some of their motivations are similar to mine.
Mine are. I know my engine will probably never be in the top 10. I have neither the development time nor the testing power, and I refuse to use something like OpenBench, because I don't want to rely on other people's hardware. If it can hit 3000 Elo single-threaded and then 3100+ on four cores, I'll be very happy to either call it quits for the engine and look into other chess-related developments, or go on and add things like MCTS and NNUE as alternatives to normal a/b-search and HCE.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: How do you know you improved?

Post by Carbec »

Hello,

Thanks all for your comments. I now have a better picture of how to test my engine.

It's true that I have to test against multiple engines. I've already noticed that I can perform well against one and not so well against another, although they have the same Elo. I am now collecting engines, but I've run into problems: either they don't have a binary available, or they don't compile, or they only speak the xboard protocol, and I can't get a match running between cutechess and an xboard engine (even if I write it on the command line). I wonder how the guys at CCRL do all their matches.
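
For reference, the kind of command I have been writing looks like this (engine names and paths are placeholders; cutechess-cli is documented to accept proto=xboard for such engines, but I can't get it to work with them):

Code: Select all

cutechess-cli \
  -engine cmd=./Zangdar proto=uci \
  -engine cmd=./some_xboard_engine proto=xboard \
  -each tc=10+0.3 \
  -rounds 100 -repeat \
  -pgnout games.pgn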

Zangdar is pretty good now, and I am not sure how to continue. I read a lot, in forums and in papers. There are several ways ahead of me: improve the positional evaluation? Get faster? Prune more? Obviously, all of it is interesting.

Philippe
Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: How do you know you improved?

Post by Carbec »

Hi,

How accurate is CCRL?

Looking at this,
https://ccrl.chessdom.com/ccrl/404/cgi/ ... _28_64-bit
There were only 476 games...
I'm playing more games than that myself almost every day. In the past, I thought the CCRL rating was a fairly reliable value. Now I have big doubts, given that you all ask for thousands of games.

What do you think ?

Philippe
emadsen
Posts: 440
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: How do you know you improved?

Post by emadsen »

Carbec wrote: Fri Feb 04, 2022 12:14 pm Hi,

How accurate is CCRL?

Looking at this,
https://ccrl.chessdom.com/ccrl/404/cgi/ ... _28_64-bit
There were only 476 games.
You may conclude the CCRL testers are 95% confident the rating of your engine at their blitz time controls is 1984 +/- 28 Elo. In other words, there's a 95% likelihood the true rating of your engine lies between 1956 and 2012 Elo. This assumes an equal playing field for all engines. CCRL testers are very diligent about ensuring conditions are fair.

I believe the confidence interval is 95%, though I'm not entirely sure. A CCRL tester can clarify.
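
As a rough sanity check on where a margin like +/- 28 comes from, here's a back-of-envelope sketch (my assumptions, not CCRL's published method: independent games, an average score near 50%, and a guessed draw ratio):

Code: Select all

from math import log, sqrt

def approx_margin(games, draw_ratio=0.35):
    """Approximate 95% Elo error margin after `games` games at a ~50% score.

    Sketch only: the draw ratio is a guess, not a CCRL figure.
    """
    var = (1 - draw_ratio) / 4            # per-game score variance at 50%
    stderr = sqrt(var / games)            # std. error of the mean score
    slope = 1600 / log(10)                # d(Elo)/d(score) at a 50% score
    return 1.96 * stderr * slope

print(round(approx_margin(476)))          # ~25, in the ballpark of the +/- 28

So 476 games are genuinely enough for a statement at the +/- 25-30 Elo level; they're just nowhere near enough to resolve the 5-10 Elo differences engine authors chase between their own versions.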
Erik Madsen | My C# chess engine: https://www.madchess.net
jdart
Posts: 4398
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: How do you know you improved?

Post by jdart »

algerbrex wrote:
A useful way to use self-play testing and gauntlet testing together is to first run a couple of hundred games of self-play at a decently fast time control (e.g. tc=8+0.8s). You won't get a perfectly accurate Elo estimate, but it should be pretty clear if one version is stronger than the other. You'll have to be a bit careful with this too, though, and use a bit of common sense. If in a 500-game match version 2 loses against version 1 120-340-40, then it's pretty safe to say you messed something up, and you shouldn't waste time on longer gauntlet testing until you've gone back and fixed the bug. But if the score is something like 220-225-55, it's not so obvious which version is stronger, so you should still opt for more precise testing methods.
Sometimes you make a change that is a big improvement or regression, and that is easy to see with a few hundred games. Even then, you should be using SPRT to assess significance.

Note that Stockfish regularly accepts changes that take tens of thousands of games to show a significant improvement. They have built up the strength of the engine largely through incremental changes, each of which was small in Elo gain, although there were some big ones like adding NNUE.
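
For anyone who wants to try it, here is a bare-bones sketch of the SPRT bookkeeping, modeled on the well-known Fishtest-style approximation (the variable names and the example tally are mine, not anyone's production code):

Code: Select all

from math import log

def sprt_llr(wins, draws, losses, elo0=0.0, elo1=5.0):
    """Approximate log-likelihood ratio of H1 (elo >= elo1) vs H0 (elo <= elo0).

    Sketch only; it needs a decent number of both wins and losses.
    """
    n = wins + draws + losses
    if wins == 0 or losses == 0:
        return 0.0
    score = (wins + 0.5 * draws) / n                   # mean score per game
    variance = (wins + 0.25 * draws) / n - score ** 2  # per-game score variance
    var_of_mean = variance / n
    s0 = 1 / (1 + 10 ** (-elo0 / 400))                 # expected score under H0
    s1 = 1 / (1 + 10 ** (-elo1 / 400))                 # expected score under H1
    return (s1 - s0) * (2 * score - s0 - s1) / (2 * var_of_mean)

# With alpha = beta = 0.05, accept the change once the LLR reaches
# log(0.95/0.05) = +2.94 and reject it at log(0.05/0.95) = -2.94.
print(sprt_llr(2600, 5000, 2400))   # hypothetical tally; ~3.7, so this passes

The attraction is that the test stops itself as soon as the evidence is strong either way, so a clear improvement or a clear regression burns far fewer games than a fixed-length match would.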
Uri Blass
Posts: 10794
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: How do you know you improved?

Post by Uri Blass »

jdart wrote: Fri Feb 04, 2022 5:00 pm
algerbrex wrote:
A useful way to use self-play testing and gauntlet testing together is to first run a couple of hundred games of self-play at a decently fast time control (e.g. tc=8+0.8s). You won't get a perfectly accurate Elo estimate, but it should be pretty clear if one version is stronger than the other. You'll have to be a bit careful with this too, though, and use a bit of common sense. If in a 500-game match version 2 loses against version 1 120-340-40, then it's pretty safe to say you messed something up, and you shouldn't waste time on longer gauntlet testing until you've gone back and fixed the bug. But if the score is something like 220-225-55, it's not so obvious which version is stronger, so you should still opt for more precise testing methods.
Sometimes you make a change that is a big improvement or regression, and that is easy to see with a few hundred games. Even then, you should be using SPRT to assess significance.

Note that Stockfish regularly accepts changes that take tens of thousands of games to show a significant improvement. They have built up the strength of the engine largely through incremental changes, each of which was small in Elo gain, although there were some big ones like adding NNUE.
Passing an SPRT tells you nothing about the size of the Elo improvement: you may have no idea whether a change that passed relatively quickly is a 1 Elo improvement or a 10 Elo improvement.

I think it may be more interesting to later retest every change that you decide to accept with a fixed number of 100,000 games, to get a more accurate evaluation of how many Elo the change is worth.

Other interesting questions are whether the change helps to beat weaker opponents at knight odds or is counterproductive for that purpose, and whether or not the change helps to win faster from winning positions.
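
To put the scale in perspective, a quick sketch (my assumptions: independent games at a ~50% score and a guessed 60% draw ratio, roughly what fast self-play produces):

Code: Select all

from math import log, sqrt

def margin_95(games, draw_ratio=0.60):
    """Approximate 95% Elo error margin for `games` games at a ~50% score."""
    var = (1 - draw_ratio) / 4              # per-game score variance at 50%
    return 1.96 * sqrt(var / games) * 1600 / log(10)

for n in (1_000, 10_000, 100_000):
    print(n, round(margin_95(n), 1))        # ~13.6, ~4.3, ~1.4 Elo

The margin only shrinks with the square root of the number of games, so distinguishing a 1 Elo gain from a 10 Elo gain really does take on the order of 100,000 games, while an SPRT that only has to answer "better or worse?" can often stop after a few thousand.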
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: How do you know you improved?

Post by algerbrex »

mvanthoor wrote: Fri Feb 04, 2022 12:05 am Same for me, but an additional goal is to get my engine to 3000 Elo (single-threaded) on the CCRL list, so I can say for sure that it is stronger than Fritz 11 and can replace it as my primary analysis engine. I've always liked Fritz, though the 11-13 versions had a playing style a bit more like the Rybka versions of those days. Deep Fritz 10.1 has an absolutely legendary playing style. (I don't own Fritz 10.1; I was almost too late even buying Fritz 11, as 12 had already been out for some time, and Deep Fritz 10.1 was already out of "print".)
Right. I don't necessarily have a set Elo goal in mind, but if Blunder were to reach anywhere between 2850 and 3000, I would be pretty happy with how this "weekend" project turned out. When I originally started writing Blunder a year ago, in Python, it wasn't supposed to be much more than a fun little weekend project that I'd use to learn more about the minimax algorithm. It just slowly evolved into what it is now.
mvanthoor wrote: Fri Feb 04, 2022 12:05 am Seems we're of the same mind here as well. Rustic Alpha 1 and 2 play relatively aggressively because I wrote the PSTs myself. (Actually, Rustic 1 and 2 calculate to roughly the same depth as I do, and setting aside the positional blunders it sometimes makes through lack of knowledge, it plays eerily like myself...) I'd rather have an engine that goes for an all-out fight in each game with attractive chess than an engine that plays boring shuffling games but ends up +20 Elo.
Yup, that's my mindset. Of course there's a balance, but I much prefer seeing Blunder take the game to its opponent rather than vice versa. That's still very much a work in progress, since I'm still very much a lower-level chess player, but I believe I've been able to meet my goal in some sense, particularly with tuning the king safety parameters. Once you add king safety to Rustic, I think you'll start to see a very attractive personality emerge.
mvanthoor wrote: Fri Feb 04, 2022 12:05 am Mine are. I know my engine will probably never be in the top 10. I have neither the development time nor the testing power, and I refuse to use something like OpenBench, because I don't want to rely on other people's hardware. If it can hit 3000 Elo single-threaded and then 3100+ on four cores, I'll be very happy to either call it quits for the engine and look into other chess-related developments, or go on and add things like MCTS and NNUE as alternatives to normal a/b-search and HCE.
Right, same here. Blunder's already far exceeded my expectations as is, since I was originally going to call it quits back at 2000 Elo. Realistically, I don't have the time, money, or skill to have Blunder crack the top 10, or probably even the top 20, on the CCRL. I'm now a full-time college freshman with a whole host of responsibilities and jobs, so development of Blunder has had to go on the back burner.

But I've started to have a bit more time as this second semester has mellowed out, and I'll hopefully have more time over the summer, even if I do apply for an internship somewhere. I've enjoyed working on Blunder, and from a purely practical standpoint, I think it'll be a nice project to have on my resume.

But Blunder always has been, and still is, first and foremost a labor of love. If it weren't, I honestly would've stopped development a long time ago. Just tonight I started refactoring Blunder, which is more or less me rewriting everything in the codebase from scratch (around 5k lines of code), as I feel there's still room to clean up and reduce my code, on top of speeding things up.

I'm also starting from scratch because, now that I've learned more about how to test properly, I feel there are many features in Blunder that either aren't gaining as much Elo as they should or are actually costing Elo, due to poor testing on my part. So for this rewrite of Blunder, every search or evaluation patch I add on top of the bare basics (pure negamax + material-only evaluation) will be rigorously tested against a gauntlet of opponents, and every release going forward will be tested consistently at a shorter and a longer time control, probably something like 8+0.8s and 60+0.6s, over a couple of thousand games. Hopefully, this more rigorous testing process will result in the next release of Blunder (currently slated as 8.0.0) having feature parity with, or even fewer features than, Blunder 7.6.0, and yet being stronger.

We'll see how that goes in a month or two, once I've finished the rewrite :)