How do you know you improved ?

Carbec · Post by **Carbec** » Thu Feb 03, 2022 3:09 pm

Hi,

I ran 2 matches and got rather different results:

score of Zangdar 0.38.00 vs Blunder 6.1.0: 61 - 22 - 17  [0.695] 100
...      Zangdar 0.38.00 playing White: 33 - 11 - 6  [0.720] 50
...      Zangdar 0.38.00 playing Black: 28 - 11 - 11  [0.670] 50
...      White vs Black: 44 - 39 - 17  [0.525] 100
Elo difference: 143.1 +/- 67.2, LOS: 100.0 %, DrawRatio: 17.0 %

Score of Zangdar 0.38.00 vs Blunder 6.1.0: 49 - 33 - 18  [0.580] 100
...      Zangdar 0.38.00 playing White: 25 - 15 - 10  [0.600] 50
...      Zangdar 0.38.00 playing Black: 24 - 18 - 8  [0.560] 50
...      White vs Black: 43 - 39 - 18  [0.520] 100
Elo difference: 56.1 +/- 62.9, LOS: 96.1 %, DrawRatio: 18.0 %

How do you proceed to validate a modification ?
For info, games were at tc=10+0.3

Thanks

Philippe

lithander · Post by **lithander** » Thu Feb 03, 2022 3:39 pm

...isn't it obvious? Play more games!
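To see why more games help, here is a minimal sketch that recomputes the Elo difference and its 95% error margin from a win/loss/draw count. It assumes the logistic Elo formula and a normal approximation of the score, which appears to reproduce the figures quoted above but is not guaranteed to be exactly what the match tool does internally. The function names are illustrative only.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an average score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, losses: int, draws: int, z: float = 1.959964):
    """Return (elo, margin) for a match result, with z = 1.96 for ~95% confidence.

    Assumption: the score bounds stay strictly inside (0, 1); a lopsided
    or very short match would need special handling.
    """
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = w * (1.0 - score) ** 2 + l * (0.0 - score) ** 2 + d * (0.5 - score) ** 2
    se = math.sqrt(var / n)
    low, high = score - z * se, score + z * se
    margin = (elo_from_score(high) - elo_from_score(low)) / 2.0
    return elo_from_score(score), margin


if __name__ == "__main__":
    for result in [(61, 22, 17), (49, 33, 18), (490, 330, 180)]:
        elo, margin = elo_with_margin(*result)
        print(f"{result}: {elo:+.1f} +/- {margin:.1f}")
```

The two 100-game results above are roughly 76 to 210 Elo and -7 to 119 Elo once the margins are applied, so the intervals overlap and the matches do not actually contradict each other. With the same win/loss/draw ratios over ten times as many games, the margin shrinks to roughly a third.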

mvanthoor · Post by **mvanthoor** » Thu Feb 03, 2022 4:06 pm

Carbec wrote: ↑Thu Feb 03, 2022 3:09 pm I ran 2 matches and got rather different results...

Two ways:
1. Between two engines, play at least 1000 games to make the result meaningful. (I have noticed that 1000 games give a very good indication; playing more games makes the indication more precise, but doesn't greatly change it anymore.) Then you will know your relative rating against this engine.
2. Play a gauntlet against 5 or 6 engines (with 1000 games per match, which would thus mean 5000-6000 games) with your version X. Note down the result X obtains, such as +35. This means that your engine version X obtained +35 Elo against the average of the field. Then play against the exact same engines under the same conditions, but use your engine version X+1. If you now score +100, you have improved by +65 Elo. This is the best method to guess the Elo range where your engine would fall if it were tested by a rating list such as CCRL.

Personally, I always do 1. first, and pick a target engine which is around the rating I expect my new version to have. If the other engine turns out to be too strong, I pick a weaker one, and the other way around. Once I have found an engine against which the new version scores roughly 50%, I run a gauntlet with 5-6 engines around the rating of that other engine. The rating span is about +/-50 up and down, so I would expect my engine to end up somewhere in the middle of the gauntlet.
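The gauntlet comparison can be sketched as follows. All win/loss/draw numbers here are made up for illustration, and pooling the score over the whole field is only a rough stand-in for "Elo against the average of the field"; a rating tool would fit the ratings properly.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an average score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def gauntlet_elo(results):
    """Elo versus the field from a list of (wins, losses, draws), one per opponent.

    Assumption: all opponents are pooled into one overall score.
    """
    points = sum(w + 0.5 * d for w, _, d in results)
    games = sum(w + l + d for w, l, d in results)
    return elo_from_score(points / games)


# Hypothetical gauntlets: 5 opponents, 1000 games each, same conditions.
version_x = [(470, 330, 200), (430, 370, 200), (500, 400, 100),
             (460, 360, 180), (440, 340, 220)]
version_x1 = [(560, 260, 180), (530, 290, 180), (580, 320, 100),
              (550, 250, 200), (540, 240, 220)]

if __name__ == "__main__":
    old = gauntlet_elo(version_x)
    new = gauntlet_elo(version_x1)
    print(f"X:   {old:+.1f} vs field")
    print(f"X+1: {new:+.1f} vs field")
    print(f"gain: {new - old:+.1f}")
```

With these invented numbers, version X scores 55% (about +35) and version X+1 scores 64% (about +100), so the gain is about +65, matching the example in the list above.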

Third way, after you have progressed some way: SPRT testing. Test your new version X+1 against the old version X in CuteChess and see if it comes out better. Search Rustic's site (listed in my sig); I have a page online on how to do this.
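For readers who want to see the idea behind SPRT, here is a rough sketch of the stopping rule. It uses a normal approximation of the log-likelihood ratio on the observed score, which is an assumption made for this example; CuteChess's own calculation may differ in detail. The defaults (elo0=0, elo1=10, alpha=beta=0.05) are illustrative choices, not recommendations from this thread.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins: int, losses: int, draws: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0)."""
    n = wins + losses + draws
    if n == 0:
        return 0.0
    w, l, d = wins / n, losses / n, draws / n
    mean = w + 0.5 * d
    var = w * (1.0 - mean) ** 2 + l * mean ** 2 + d * (0.5 - mean) ** 2
    if var == 0.0:
        return 0.0  # no information yet (e.g. only draws so far)
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins, losses, draws, elo0=0.0, elo1=10.0, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0', or 'continue' (keep playing games)."""
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    # Hypothetical running totals of new version vs old version.
    print(sprt_decision(61, 22, 17))
    print(sprt_decision(610, 220, 170))
```

The point of the test is that it stops as soon as the evidence is strong enough in either direction, so clear improvements or regressions need far fewer games than a fixed 1000-game match.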

Carbec · Post by **Carbec** » Thu Feb 03, 2022 4:39 pm

Hi,

Thanks.

lithander wrote: ↑Thu Feb 03, 2022 3:39 pm ...isn't it obvious? Play more games!

Yep, but how many?

I didn't think I would have to play so many games. In fact, I will spend more time testing than developing.
> write 10 lines
> launch test
> come back next day....

Thanks

Philippe

lithander · Post by **lithander** » Thu Feb 03, 2022 4:47 pm

Carbec wrote: ↑Thu Feb 03, 2022 4:39 pm In fact, I will spend more time testing than developing.
> write 10 lines
> launch test
> come back next day....

That is exactly how Stockfish is developed, but at a far bigger scale.

Dozens of often trivial changes are submitted to a testing queue and validated or falsified by a distributed testing system called Fishtest, where supporters can contribute their own CPU time. Currently the Stockfish project has access to ~3000 cores and can play over 3000 games per minute.

Carbec · Post by **Carbec** » Thu Feb 03, 2022 4:57 pm

woot, that's really impressive !

Henk · Post by **Henk** » Thu Feb 03, 2022 5:42 pm

Maybe first play matches against itself and try to find out what the maximum difference in score might be.
Then make a change and play matches against the previous version. If it can beat that maximum then it may be a gain,
but it can still be inbreeding.
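The first step, measuring how far a match between two identical engines can stray from 50%, can be estimated by simulation before spending any CPU time on real games. This is only a sketch: it assumes independent games with fixed win/draw/loss probabilities and an 18% draw ratio (similar to the matches at the top of the thread), which real engine games only approximate.

```python
import math
import random


def simulate_match(rng: random.Random, games: int, draw_ratio: float) -> float:
    """Average score of one match between two equally strong engines."""
    win_prob = (1.0 - draw_ratio) / 2.0
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < draw_ratio:
            points += 0.5
        elif r < draw_ratio + win_prob:
            points += 1.0
    return points / games


def max_self_play_swing(matches=200, games=100, draw_ratio=0.18, seed=1) -> float:
    """Largest absolute Elo difference seen over many simulated equal matches."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(matches):
        score = simulate_match(rng, games, draw_ratio)
        # Clamp so that a 0% or 100% score does not break the logarithm.
        score = min(max(score, 0.5 / games), 1.0 - 0.5 / games)
        elo = -400.0 * math.log10(1.0 / score - 1.0)
        worst = max(worst, abs(elo))
    return worst


if __name__ == "__main__":
    for games in (100, 1000):
        swing = max_self_play_swing(games=games)
        print(f"{games} games per match: max swing {swing:.0f} Elo")
```

A measured gain that is smaller than this swing for the chosen match length cannot be told apart from noise.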

So maybe better to first collect many engines with similar elo. But too much work.

My engine never made any progress over many years. So I should be an expert at fooling yourself into thinking you've found an improvement.

pedrojdm2021 · Post by **pedrojdm2021** » Thu Feb 03, 2022 6:33 pm

Carbec wrote: ↑Thu Feb 03, 2022 3:09 pm Hi,

I ran 2 matches and got rather different results:

score of Zangdar 0.38.00 vs Blunder 6.1.0: 61 - 22 - 17  [0.695] 100
...      Zangdar 0.38.00 playing White: 33 - 11 - 6  [0.720] 50
...      Zangdar 0.38.00 playing Black: 28 - 11 - 11  [0.670] 50
...      White vs Black: 44 - 39 - 17  [0.525] 100
Elo difference: 143.1 +/- 67.2, LOS: 100.0 %, DrawRatio: 17.0 %

Score of Zangdar 0.38.00 vs Blunder 6.1.0: 49 - 33 - 18  [0.580] 100
...      Zangdar 0.38.00 playing White: 25 - 15 - 10  [0.600] 50
...      Zangdar 0.38.00 playing Black: 24 - 18 - 8  [0.560] 50
...      White vs Black: 43 - 39 - 18  [0.520] 100
Elo difference: 56.1 +/- 62.9, LOS: 96.1 %, DrawRatio: 18.0 %

How do you proceed to validate a modification ?
For info, games were at tc=10+0.3

Thanks

Philippe

Start a tournament in cutechess or Arena with your new engine version against your older engine version, and see how many wins and losses you get now.

JVMerlino · Post by **JVMerlino** » Thu Feb 03, 2022 7:56 pm

pedrojdm2021 wrote: ↑Thu Feb 03, 2022 6:33 pm Start a tournament in cutechess or Arena with your new engine version against your older engine version, and see how many wins and losses you get now.

I've never found self-play to be reliable. You might have just found a way to exploit a problem in your old engine, but actually gained no elo (or perhaps even made it weaker).

I play a gauntlet against 12 other engines mostly within +/- 50 elo of my engine. I find that gives me very good results.

pedrojdm2021 · Post by **pedrojdm2021** » Thu Feb 03, 2022 8:02 pm

JVMerlino wrote: ↑Thu Feb 03, 2022 7:56 pm
pedrojdm2021 wrote: ↑Thu Feb 03, 2022 6:33 pm Start a tournament in cutechess or Arena with your new engine version against your older engine version, and see how many wins and losses you get now.
I've never found self-play to be reliable. You might have just found a way to exploit a problem in your old engine, but actually gained no elo (or perhaps even made it weaker).

I play a gauntlet against 12 other engines mostly within +/- 50 elo of my engine. I find that gives me very good results.

Playing against other engines is a good idea too. I currently measure strength by playing against the VICE engine v1.1 (http://bluefeversoft.com/).
