How do you know you improved ?

Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

How do you know you improved ?

Post by Carbec »

Hi,

I played 2 matches and got rather different results:

Score of Zangdar 0.38.00 vs Blunder 6.1.0: 61 - 22 - 17  [0.695] 100
...      Zangdar 0.38.00 playing White: 33 - 11 - 6  [0.720] 50
...      Zangdar 0.38.00 playing Black: 28 - 11 - 11  [0.670] 50
...      White vs Black: 44 - 39 - 17  [0.525] 100
Elo difference: 143.1 +/- 67.2, LOS: 100.0 %, DrawRatio: 17.0 %

Score of Zangdar 0.38.00 vs Blunder 6.1.0: 49 - 33 - 18  [0.580] 100
...      Zangdar 0.38.00 playing White: 25 - 15 - 10  [0.600] 50
...      Zangdar 0.38.00 playing Black: 24 - 18 - 8  [0.560] 50
...      White vs Black: 43 - 39 - 18  [0.520] 100
Elo difference: 56.1 +/- 62.9, LOS: 96.1 %, DrawRatio: 18.0 %

How do you go about validating a modification?
For reference, the games were played at tc=10+0.3.

Thanks

Philippe
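
For illustration, the Elo figures in the two match reports above can be roughly reproduced from the win/loss/draw counts alone. The sketch below assumes the usual logistic Elo model and a normal approximation for the 95% interval; the function names are made up for this example and are not taken from any tool. It also assumes the counts are listed as wins - losses - draws, which is consistent with the scores shown in brackets.

```python
import math
from statistics import NormalDist


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, losses, draws, confidence=0.95):
    """Return (elo, margin, low, high) for one match result.

    Assumption: games are independent, and the interval stays inside (0, 1).
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + losses * (0.0 - score) ** 2
                + draws * (0.5 - score) ** 2) / n
    std_error = math.sqrt(variance / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low = elo_from_score(score - z * std_error)
    high = elo_from_score(score + z * std_error)
    return elo_from_score(score), (high - low) / 2.0, low, high


# The two 100-game matches from the post above.
for w, l, d in [(61, 22, 17), (49, 33, 18)]:
    elo, margin, low, high = elo_with_margin(w, l, d)
    print(f"{w}-{l}-{d}: {elo:+.1f} +/- {margin:.1f} ({low:+.0f} to {high:+.0f})")
```

Under these assumptions the two intervals should overlap, which suggests that the two matches do not actually contradict each other; 100 games are simply too few to separate 56 Elo from 143 Elo.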
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: How do you know you improved ?

Post by lithander »

...isn't it obvious? Play more games!
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: How do you know you improved ?

Post by mvanthoor »

Carbec wrote: Thu Feb 03, 2022 3:09 pm I played 2 matches and got rather different results...
Two ways:
1. Between two engines, play at least 1000 games to make the result meaningful. (I have noticed that 1000 games give a very good indication; playing more games makes the indication more precise, but doesn't change it much anymore.) Then you will know your rating relative to that engine.
2. Play a gauntlet with your version X against 5 or 6 engines (1000 games per match, so 5000-6000 games in total). Note down the result X obtains, such as +35. This means that version X scored +35 Elo against the average of the field. Then play against the exact same engines under the same conditions, but with version X+1. If you now score +100, you have improved by 65 Elo. This is the best method to estimate the Elo range where your engine would land if it were tested by a rating list such as CCRL.
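
To give a feel for why 1000 games is a reasonable floor, here is a small sketch of how the 95% error margin shrinks with the number of games. It assumes two engines of roughly equal strength, independent games, and a draw ratio of 18% (borrowed from the matches posted above); `margin_for_games` is a hypothetical helper written for this example.

```python
import math
from statistics import NormalDist


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def margin_for_games(n_games, draw_ratio=0.18, confidence=0.95):
    """Approximate +/- Elo margin for a match between equal engines."""
    # At a 50% score, decisive games deviate by 0.5 and draws by 0.
    variance = 0.25 * (1.0 - draw_ratio)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * math.sqrt(variance / n_games)
    return (elo_from_score(0.5 + half_width)
            - elo_from_score(0.5 - half_width)) / 2.0


for n in (100, 1000, 5000, 20000):
    print(f"{n:>6} games: about +/- {margin_for_games(n):.1f} Elo")
```

The margin scales with one over the square root of the game count, so ten times as many games only narrows it by a factor of about 3.2. That is why small improvements need very long matches.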

Personally I always do 1. first, and pick a target engine that is around the rating I expect my new version to have. If the other engine turns out to be too strong, I pick a weaker one, and the other way around. Once I have found an engine against which the new version scores roughly 50%, I run a gauntlet with 5-6 engines around the rating of that engine. The rating span is about +/-50 Elo, so I would expect my engine to end up somewhere in the middle of the gauntlet.

Third way, after you have progressed some way: SPRT testing. Test your new version X+1 against the old version X and see if it comes out better in CuteChess. Search Rustic's site (listed in my sig); I have a page online on how to do this.
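
As a rough illustration of what an SPRT does with the running win/loss/draw counts, here is a minimal sketch. It uses a common normal approximation of the log-likelihood ratio and plain logistic Elo, which is an assumption of this example and not necessarily the exact model CuteChess implements. The parameter names elo0, elo1, alpha and beta mirror the usual SPRT settings; the function names are made up.

```python
import math


def expected_score(elo):
    """Expected score for a given logistic Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins, losses, draws, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0)."""
    n = wins + losses + draws
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    variance = (wins * (1.0 - mean) ** 2
                + losses * mean ** 2
                + draws * (0.5 - mean) ** 2) / n
    if variance == 0.0:
        return 0.0  # Not enough variety in the results yet.
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)


def sprt_decision(wins, losses, draws, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue'."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


# Hypothetical running totals for new version vs old version.
print(sprt_decision(400, 300, 300, elo0=0, elo1=10))
```

The appeal of this approach is that the test stops as soon as the evidence is strong enough either way, so clearly good or clearly bad changes need far fewer games than a fixed-length match.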
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Carbec

Re: How do you know you improved ?

Post by Carbec »

Hi,

Thanks.
lithander wrote: ...isn't it obvious? Play more games!
Yep, but how many?

I didn't think I would have to play so many games. In fact, I will spend more time testing than developing.
> write 10 lines
> launch test
> come back next day....

Thanks

Philippe
lithander

Re: How do you know you improved ?

Post by lithander »

Carbec wrote: Thu Feb 03, 2022 4:39 pm In fact, I will spend more time testing than developing.
> write 10 lines
> launch test
> come back next day....
That is exactly how Stockfish is developed, but at a far bigger scale.

Dozens of often trivial changes are submitted to a testing queue and validated or falsified by a distributed testing system called Fishtest, where supporters can contribute their own CPU time. Currently the Stockfish project has access to ~3000 cores and can play over 3000 games per minute.
Carbec

Re: How do you know you improved ?

Post by Carbec »

Woot, that's really impressive!
Henk
Posts: 7251
Joined: Mon May 27, 2013 10:31 am

Re: How do you know you improved ?

Post by Henk »

Maybe first play some matches of the engine against itself and try to find out the maximum difference in score you can get from noise alone.
Then make a change and play matches against the previous version. If the result beats that maximum, it may be a gain,
but it could still be inbreeding.

So it may be better to first collect many engines with similar Elo. But that is too much work.

My engine never made any progress for many years. So I should be an expert in fooling oneself into thinking one has found an improvement.
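
The noise baseline described above can also be estimated without playing any games, by simulating matches between two identical engines. This is only a sketch under simplifying assumptions: games are independent, the draw ratio is fixed at 18% (taken from the matches earlier in the thread), and decisive games are split evenly. All names are hypothetical.

```python
import math
import random


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)  # Avoid division by zero.
    return -400.0 * math.log10(1.0 / score - 1.0)


def simulate_equal_match(n_games, draw_ratio, rng):
    """Score fraction of one match between two equally strong engines."""
    points = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < draw_ratio:
            points += 0.5
        elif r < draw_ratio + (1.0 - draw_ratio) / 2.0:
            points += 1.0
    return points / n_games


def noise_baseline(n_matches=200, n_games=100, draw_ratio=0.18, seed=1):
    """Largest absolute Elo seen over many simulated equal matches."""
    rng = random.Random(seed)
    return max(abs(elo_from_score(simulate_equal_match(n_games, draw_ratio, rng)))
               for _ in range(n_matches))


for games in (100, 1000):
    print(f"{games} games per match: worst-case noise about "
          f"{noise_baseline(n_games=games):.0f} Elo")
```

A change only looks interesting if it beats the baseline for the match length actually used, and as noted above, even then a self-play gain may not carry over to other opponents.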
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: How do you know you improved ?

Post by pedrojdm2021 »

Carbec wrote: Thu Feb 03, 2022 3:09 pm I played 2 matches and got rather different results...
Start a tournament in CuteChess or Arena between your new engine version and your older engine version, and see how many wins and losses you get now.
JVMerlino
Posts: 1397
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: How do you know you improved ?

Post by JVMerlino »

pedrojdm2021 wrote: Thu Feb 03, 2022 6:33 pm Start a tournament in CuteChess or Arena between your new engine version and your older engine version, and see how many wins and losses you get now.
I've never found self-play to be reliable. You might have just found a way to exploit a problem in your old engine, but actually gained no Elo (or perhaps even made it weaker).

I play a gauntlet against 12 other engines, mostly within +/- 50 Elo of my engine. I find that gives me very good results.
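
For illustration, comparing two versions through a gauntlet boils down to the arithmetic below. The opponent names and results are made up for this example, and pooling all games into one score is a crude stand-in for what a proper rating tool would compute.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def gauntlet_elo(results):
    """Elo versus the field average from {opponent: (wins, losses, draws)}."""
    points = sum(w + 0.5 * d for w, l, d in results.values())
    games = sum(w + l + d for w, l, d in results.values())
    return elo_from_score(points / games)


# Hypothetical results: same opponents, same conditions, two versions.
old_version = {"EngineA": (40, 45, 15), "EngineB": (50, 35, 15), "EngineC": (38, 44, 18)}
new_version = {"EngineA": (48, 38, 14), "EngineB": (57, 30, 13), "EngineC": (46, 37, 17)}

old_elo = gauntlet_elo(old_version)
new_elo = gauntlet_elo(new_version)
print(f"old: {old_elo:+.1f}, new: {new_elo:+.1f}, gain: {new_elo - old_elo:+.1f}")
```

Because both versions face the same field, the difference between the two numbers is the estimated gain, and it is less likely to come from exploiting a quirk of one particular opponent.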
pedrojdm2021

Re: How do you know you improved ?

Post by pedrojdm2021 »

JVMerlino wrote: Thu Feb 03, 2022 7:56 pm
pedrojdm2021 wrote: Thu Feb 03, 2022 6:33 pm Start a tournament in CuteChess or Arena between your new engine version and your older engine version, and see how many wins and losses you get now.
I've never found self-play to be reliable. You might have just found a way to exploit a problem in your old engine, but actually gained no Elo (or perhaps even made it weaker).

I play a gauntlet against 12 other engines, mostly within +/- 50 Elo of my engine. I find that gives me very good results.
Playing against other engines is a good idea too. I currently measure strength by playing against the VICE engine v1.1 (http://bluefeversoft.com/).