test method; how to compare versions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

flok

test method; how to compare versions

Post by flok »

Hi,

Every time I implement a change, I run a test in which fairymax, the version before the change and the version after the change each play 30k games. Then I compare the Elo ratings to see whether the change helped.

Now the odd thing is: when I compare multiple versions of my program, I don't see an Elo increase for each newer version; instead, newer versions can play far worse:

Code: Select all

Rank Name                         Elo    +    - games score oppo. draws
   1 fairymax_tl3                 160    4    4 28808   75%   -40   15%
   2 Embla2634-v0.9.2              13    3    4 28796   54%    -3   29%
   3 Embla2777_v0.9.3              -6    3    4 28802   50%     2   33%
   4 Embla2598_v0.9.1_fixesonly   -83    4    3 28791   35%    21   51%
   5 Emblatrunk                   -84    4    4 28789   37%    21   36%
Then, if I remove features from e.g. 0.9.3 and trunk to see which one is the culprit, it gets even more confusing:

Code: Select all

Rank Name                         Elo    +    - games score oppo. draws
   1 fairymax_tl3                 189    5    4 36160   75%   -21   12%
   2 Embla2777_v0.9.3_no_tage      21    4    4 36156   55%    -2   37%
   3 Emblatrunk_no_sibl            18    4    4 36158   54%    -2   38%
   4 Emblatrunk_no_tage_sibl       14    4    4 36157   53%    -2   38%
   5 Embla2634-v0.9.2              11    4    3 36158   52%    -1   42%
   6 Embla2777_v0.9.3              11    3    4 36156   52%    -1   38%
   7 Embla2777_v0.9.3_no_sib      -42    3    4 36158   43%     5   47%
   8 Emblatrunk_no_tage           -73    3    4 36157   39%     8   27%
   9 Emblatrunk                   -73    3    4 36158   39%     8   26%
  10 Embla2598_v0.9.1_fixesonly   -76    4    4 36170   36%     9   58%
no_sibl: use only one sibling (killer move) instead of two
no_tage: do not use the TT aging mechanism

This test is still ongoing (trunk compared to trunk minus a change), but the ratings have not changed between ~1000 games and now:

Code: Select all

Rank Name            Elo    +    - games score oppo. draws 
   1 fairymax_tl3    177   13   13  2234   74%   -25   11% 
   2 Emblamin_2886   103   12   12  2225   66%   -14   22% 
   3 Emblamin_2850    63   12   12  2232   59%    -8   14% 
   4 Emblamin_2861   -26   11   11  2230   47%     2   33% 
   5 Emblamin_2824   -77   12   12  2239   39%    12   13% 
   6 Emblamin_2828   -78   12   12  2230   39%    11   13% 
   7 Emblatrunk      -81   12   12  2229   38%    10   13% 
   8 Emblamin_2825   -82   12   12  2231   38%    11   12% 
What is wrong with my methodology?
ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: test method; how to compare versions

Post by ZirconiumX »

Looking at that Bayeselo output, you appear to be running a round-robin with all your engines rather than a gauntlet.

If you want to compare strength against fairy-max, your engines should play fairy-max only, and not each other.

Also remember that what may be optimal for fairy-max may not be optimal overall.
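As a rough sketch of the difference (my own illustration, with engine names taken from the tables above, not code from this thread): a round-robin schedules every possible pairing, while a gauntlet pairs each test version only with the fixed reference.

Code: Select all

from itertools import combinations

reference = "fairymax_tl3"
versions = ["Embla2598_v0.9.1_fixesonly", "Embla2634-v0.9.2",
            "Embla2777_v0.9.3", "Emblatrunk"]

# Round-robin: every engine meets every other engine, so most games are
# version-vs-version and the versions' ratings get entangled with each other.
round_robin = list(combinations([reference] + versions, 2))

# Gauntlet: each test version plays only the fixed reference engine, so every
# version's score measures the same thing and the versions stay comparable.
gauntlet = [(reference, v) for v in versions]

print(len(round_robin), "pairings in the round-robin")  # 10
print(len(gauntlet), "pairings in the gauntlet")        # 4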
Some believe in the almighty dollar.

I believe in the almighty printf statement.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: test method; how to compare versions

Post by Ferdy »

flok wrote:What is wrong with my methodology?
In your three rating lists, make sure fairymax has the same rating in all three, since it is the reference engine.

Or just combine the PGNs of the three rating lists and run the ratings again. Make fairymax the reference engine and set its rating to 0; you can then judge all your tested versions against that reference. Looking at it that way, fairymax is too strong and the rating gap is too large. It is difficult to measure the differences between your versions because fairymax dominates by a wide margin.

It is better to find an opponent that is within +/- 20 Elo of your best engine or current stable version.
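As a minimal sketch of why such a gap hurts (my own illustration using the standard logistic Elo model, not numbers from this thread): the further the opponent is from your own strength, the less a fixed Elo improvement moves the expected score, so the harder it is to see in the results.

Code: Select all

import math

def expected_score(elo_diff):
    """Expected score of a player who is elo_diff Elo above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for gap in (0, 20, 100, 200):          # opponent is `gap` Elo stronger
    before = expected_score(-gap)       # current version
    after = expected_score(-gap + 10)   # hypothetical version that is 10 Elo better
    print(f"opponent +{gap:3d} Elo: a 10 Elo gain is worth "
          f"{100 * (after - before):.2f} percentage points of score per game")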
Daniel Anulliero
Posts: 759
Joined: Fri Jan 04, 2013 4:55 pm
Location: Nice

Re: test method; how to compare versions

Post by Daniel Anulliero »

I think it's better to have more engines to play against.
And there are other things that can be quite disappointing:
- sometimes self-tests (versions playing against each other) give different results than tests against other engines
- sometimes a modification is good at a short time control and worse at a long time control... or the reverse :)
- sometimes you get different results against different engines...
And so on...
For testing my engine Isa, I have 10 engines (from 1950 to 2250 CCRL Elo), and I add the last best version of Isa too.
I run 660 games (60 games against each opponent) at 1 minute + 250 ms,
same openings with reversed colors.
If I think I have an improvement, I retest at 5 minutes + 250 ms.
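For a feel of what such game counts can resolve, here is a back-of-the-envelope sketch (my own, assuming a ~30% draw rate and the usual logistic Elo model) of the approximate 95% error margin in Elo for a given number of games; at 30000 games it lands near the +/- 3 to 4 shown in the tables above.

Code: Select all

import math

def elo_error_95(games, score=0.5, draw_ratio=0.3):
    """Approximate 95% confidence half-width in Elo for a match result."""
    # Per-game variance of the score; draws reduce the variance.
    var = score * (1 - score) - draw_ratio / 4.0
    se = math.sqrt(var / games)              # standard error of the score
    # Convert the score interval into an Elo interval via the logistic model.
    lo = -400 * math.log10(1 / max(score - 1.96 * se, 1e-9) - 1)
    hi = -400 * math.log10(1 / min(score + 1.96 * se, 1 - 1e-9) - 1)
    return (hi - lo) / 2

for n in (60, 660, 30000):
    print(f"{n:6d} games: roughly +/- {elo_error_95(n):.0f} Elo")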
Well, yes, testing is THE BOTTLENECK :lol:
Have courage and be patient.
Chess programming is all about patience and methodology.