Every time I implement a change, I run a test in which fairymax, the version before the change, and the version after the change play 30k games each. Then I compare the Elo ratings to see whether the change helped.
The odd thing is that when I compare several versions of my program in one run, the Elo does not increase with each newer version; instead, newer versions can play far worse:
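For reference, the score and Elo columns in the tables are roughly linked by the usual logistic Elo formula. Here is a minimal sketch, assuming the standard logistic model; the function name `elo_diff_from_score` is hypothetical, and the rating tool's own model (which also accounts for draws) will give somewhat different numbers:

```python
import math

def elo_diff_from_score(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model.

    Assumption: plain logistic model, no draw or colour correction.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

# A 75% score against the rest of the field implies roughly +191 Elo
# over the average opponent; a 50% score implies no difference.
print(round(elo_diff_from_score(0.75), 1))
print(round(elo_diff_from_score(0.50), 1))
```

This is only useful as a sanity check on the reported ratings, not as a replacement for them.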
Rank Name                        Elo    +    -  games  score  oppo.  draws
   1 fairymax_tl3                160    4    4  28808    75%    -40    15%
   2 Embla2634-v0.9.2             13    3    4  28796    54%     -3    29%
   3 Embla2777_v0.9.3             -6    3    4  28802    50%      2    33%
   4 Embla2598_v0.9.1_fixesonly  -83    4    3  28791    35%     21    51%
   5 Emblatrunk                  -84    4    4  28789    37%     21    36%
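To read a table like this, one crude first check is whether the intervals given by the + and - columns of two versions overlap. The sketch below uses the numbers from the table above; the helper names are hypothetical, and this is not a proper significance test, since error bars from a rating tool are not necessarily meant for direct pairwise comparison:

```python
def interval(elo, plus, minus):
    """Return the (low, high) rating interval from the Elo, + and - columns."""
    return (elo - minus, elo + plus)

def intervals_overlap(a, b):
    """True if the rating intervals of two (elo, plus, minus) entries overlap."""
    lo_a, hi_a = interval(*a)
    lo_b, hi_b = interval(*b)
    return lo_a <= hi_b and lo_b <= hi_a

# (Elo, +, -) taken from the table above
v092 = (13, 3, 4)        # Embla2634-v0.9.2
v093 = (-6, 3, 4)        # Embla2777_v0.9.3
fixesonly = (-83, 4, 3)  # Embla2598_v0.9.1_fixesonly
trunk = (-84, 4, 4)      # Emblatrunk

print(intervals_overlap(v092, v093))        # False: v0.9.2 is clearly ahead
print(intervals_overlap(fixesonly, trunk))  # True: not distinguishable this way
```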
Rank Name                        Elo    +    -  games  score  oppo.  draws
   1 fairymax_tl3                189    5    4  36160    75%    -21    12%
   2 Embla2777_v0.9.3_no_tage     21    4    4  36156    55%     -2    37%
   3 Emblatrunk_no_sibl           18    4    4  36158    54%     -2    38%
   4 Emblatrunk_no_tage_sibl      14    4    4  36157    53%     -2    38%
   5 Embla2634-v0.9.2             11    4    3  36158    52%     -1    42%
   6 Embla2777_v0.9.3             11    3    4  36156    52%     -1    38%
   7 Embla2777_v0.9.3_no_sib     -42    3    4  36158    43%      5    47%
   8 Emblatrunk_no_tage          -73    3    4  36157    39%      8    27%
   9 Emblatrunk                  -73    3    4  36158    39%      8    26%
  10 Embla2598_v0.9.1_fixesonly  -76    4    4  36170    36%      9    58%
no_tage: the transposition table (TT) age mechanism is disabled
A test that is still running (trunk compared with trunk minus a single change); the ratings have not changed between ~1000 games and the current numbers:
Rank Name           Elo    +    -  games  score  oppo.  draws
   1 fairymax_tl3   177   13   13   2234    74%    -25    11%
   2 Emblamin_2886  103   12   12   2225    66%    -14    22%
   3 Emblamin_2850   63   12   12   2232    59%     -8    14%
   4 Emblamin_2861  -26   11   11   2230    47%      2    33%
   5 Emblamin_2824  -77   12   12   2239    39%     12    13%
   6 Emblamin_2828  -78   12   12   2230    39%     11    13%
   7 Emblatrunk     -81   12   12   2229    38%     10    13%
   8 Emblamin_2825  -82   12   12   2231    38%     11    12%