Page 1 of 3

A word for casual testers

Posted: Tue Dec 25, 2012 4:37 pm
by Don
Image

From time to time people will post results on talkchess from a match
they have run, or they will send me results from matches they have
played with Komodo. Typically these are 100-game matches, and the
results come with a conclusion such as, "what is wrong here?"

If the match was good the comment may be more enthusiastic: that we
have achieved some wonderful breakthrough. If it disagrees with the
rating lists, they are convinced of testing bias. Sometimes people
will have a favorite among minor revisions of a program and
will swear by this version even though there is no solid evidence that
it is any different from another. You may notice that in some cases
someone will make a minor modification to some open source program and,
based on some 100-game match, declare a breakthrough.

There are a few who cannot be convinced that error margins are more
than just hypothetical nonsense. They will patiently listen to what I
am saying but they don't believe it applies in "real life" or in any
practical sense and it's just some theoretical thing and I need to
loosen up and look at the clear evidence staring me in the face. To
them "common sense" and the "gut" is more reliable than fuzzy
numbers.

I hope that actions speak louder than words - we are currently testing
a change that has given us a minor improvement of perhaps 2 or 3 ELO
with a statistically significant number of games at various time
controls. So I ran 120 separate 300-game matches under equal
conditions. The idea is to see what conclusions people might come to
based on any particular match.
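This kind of experiment is easy to reproduce with a quick simulation. The sketch below (plain Python; the +3 Elo edge and the 49% draw ratio are illustrative assumptions, not Komodo's actual numbers) plays 120 simulated 300-game matches between two engines whose true gap is +3 Elo, converts each match score to an Elo estimate, and counts how many matches come out negative:

```python
import math
import random

def simulate_match(n_games, true_elo_diff, draw_ratio, rng):
    """Play n_games between two engines whose true strength gap is
    true_elo_diff, returning the stronger side's score fraction."""
    # Expected score from the standard Elo model.
    expected = 1.0 / (1.0 + 10.0 ** (-true_elo_diff / 400.0))
    p_draw = draw_ratio
    p_win = expected - p_draw / 2.0   # so that p_win + p_draw/2 = expected
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score / n_games

def score_to_elo(score):
    """Convert a match score fraction to an Elo difference."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)  # guard against 0% / 100%
    return 400.0 * math.log10(score / (1.0 - score))

rng = random.Random(42)
results = sorted(score_to_elo(simulate_match(300, 3.0, 0.49, rng))
                 for _ in range(120))
negative = sum(1 for elo in results if elo < 0)
print(f"matches that came out negative: {negative} of 120")
print(f"worst: {results[0]:+.1f} Elo, best: {results[-1]:+.1f} Elo")
```

Run it with a few different seeds: typically 40 to 60 of the 120 per-match estimates come out negative even though the change is a genuine improvement, much like the sorted results described below.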

If you look at the first match result, the improved version scores 6
ELO down. This by itself would convince some people that the change
was tried but just didn't pay off. The second match does better,
scoring only a little more than 2 ELO down. Surely, with 2 matches
that BOTH scored negatively, the change should be discarded, right?
What more overwhelming evidence do you need? Together, in fact,
they form a 600-game match that was 4 ELO down!

So I sorted the various match results by ELO score. Out of the 120
different matches, the FIRST 49 (when sorted this way) came out
negative. This means that if you are running 300-game matches you
would have come to the wrong conclusion almost half the time - you are
basically wasting your time on these matches unless, of course, you are
having fun running them and enjoy watching the games. Please note
that these are 300-game matches, not the usual 100 or 200 game
matches that are often reported with conclusions being drawn.

It gets worse. There were several matches that showed a ridiculously
bad result, the worst showing a "regression" of about 32 ELO. Had
that been the first match run, even the more professional testers would
start to suspect that something is wrong. With a score like that
after 300 games, the tentative conclusion that "something is probably
wrong with this change" is a valid one - as long as the word
"probably" is in there. The error margin is about 28 ELO after these
30 games, so if you were pre-testing with 300-game samples you might
legitimately reject such a change to save time. There is no way of
getting around the fact that you might reject good changes no matter
how many games you play.

Of course, you also find many examples of matches that finished with
exceptional results among these 120 matches. The best one showed a
whopping 37 ELO advantage. Again, there are some who would
report this as a major breakthrough.

So here is a word of advice for casual testers. There is a right
way to report results and a wrong way. The right way is to
ONLY report them; do not interpret them. You can never
go wrong if you give the match conditions and then just say, "here is
what happened." It is very useful to use bayeselo, ordo or elostat to
get a proper report which displays the appropriate error margins that
go with the results - those numbers tell us a lot about how much we
can trust them. Error margins are a good way to get a sense
of how much "sample error" we can expect to see. So if you play a
200-game match and the score is close, you know that there is a great
deal of sample error standing in the way of drawing any firm
conclusions about the value of a change.
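For readers who want to compute such an error margin themselves, here is a minimal sketch. It is not the actual bayeselo/ordo/elostat code, just the usual normal-approximation arithmetic: estimate the score from W/D/L, estimate its per-game variance, and map score ± margin through the Elo formula:

```python
import math

def elo_with_error_bar(wins, draws, losses, z=1.96):
    """Estimate the Elo difference from a match result, with a
    normal-approximation error bar (z = 1.96 gives ~95% confidence)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)

    def to_elo(s):
        s = min(max(s, 1e-6), 1.0 - 1e-6)   # guard against 0% / 100%
        return 400.0 * math.log10(s / (1.0 - s))

    elo = to_elo(score)
    return elo, to_elo(score + margin) - elo, elo - to_elo(score - margin)

# Example: a hypothetical 200-game match with 60 wins, 98 draws, 42 losses.
elo, up, down = elo_with_error_bar(60, 98, 42)
print(f"{elo:+.1f} Elo, +{up:.1f}/-{down:.1f} at ~95% confidence")
```

Note that even a +60-98-42 result, which looks like a clear win, carries an error bar of roughly ±35 Elo: exactly the "sample error" the post is warning about.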

After sorting the individual 120 matches by score, I built a graph in
order to help you visualize how this works. The graph appears at
the top of this post. The y-axis is the ELO rating; the x-axis is the
match number (after sorting). You will notice
that a bit more than half the results are positive because the change
is a good one, but a very significant number of results are negative
simply due to sample error. Also, please note that even if you
combine the 36,000 games you are left with sample error. In other
words, we cannot say with certainty that the change is really an
improvement! All we can say is that the change is very likely to be
an improvement. In this case we have a lot of other evidence from
several other tests, but that is not the point. If we cannot say for
sure with 36,000 games, then 100 games is surely not enough.


Don

Re: A word for casual testers.

Posted: Tue Dec 25, 2012 5:12 pm
by Ajedrecista
Hello Don:

Your post is very interesting.
Don wrote:The error margin is about 28 ELO after these
30 games, so if you were pre-testing with 300 game samples you might
legitimately reject such a change to save time.
I am sure that you wanted to write ± 28 Elo for 300 games... for 30 games, error bars should be ± 28*sqrt(300/30) ~ ± 88.5 Elo.
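The scaling used here is just the usual square-root law: error bars shrink (or grow) with the square root of the number of games. As a quick check:

```python
import math

# The +/- 28 Elo margin quoted for 300 games, rescaled to a 30-game
# sample: margins scale with sqrt(n_old / n_new).
margin_300 = 28.0
margin_30 = margin_300 * math.sqrt(300 / 30)
print(f"+/- {margin_30:.1f} Elo")  # prints +/- 88.5 Elo
```

This is also why quadrupling the number of games only halves the error bar: precision is expensive.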

------------------------

I guess that your graph indicates a positive gain (I suppose that 3000 Elo means no gain). Averaging the 'more linear' zone by eye (so, a very poor estimate), it should give a +3 or +4 Elo gain (forgetting about error bars).

Supposing 95% confidence for ± 28 Elo after 300 games and doing a little math, I guess that the draw ratio is around 49%, more or less (please write what the true draw ratio was, just for comparison). I wrote a Fortran 95 programme called Minimum_score_for_no_regression some months ago:

Code:

Minimum_score_for_no_regression, ® 2012.

 Calculation of the minimum score for no regression (i.e. negative Elo gain) in a match between two engines:

 Write down the number of games of the match (it must be a positive integer, up to 1073741823):

36000

Write down the draw ratio (in percentage):

49

 Write down the likelihood of superiority (in percentage) between 75% and 99.9% (LOS will be rounded up to 0.01%):

97.5

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3
_______________________________________________________________________________

Theoretical minimum score for no regression: 50.3688 %
Theoretical standard deviation in this case:  0.3688 %

Minimum number of won points for the engine in this match:     18133.0 points.

Minimum Elo advantage, which is also the negative part of the error bar:
  2.5672 Elo (for a LOS value of 97.50 %).

A LOS value of 97.50 % is equivalent to 95.00 % confidence in a two-sided test.
_______________________________________________________________________________

End of the calculations. Approximated elapsed time:  14 ms.

Thanks for using Minimum_score_for_no_regression. Press Enter to exit.
So, if you get an improvement of +2.6 Elo (or more) after 36000 games, you can be sure with 95% confidence that there is no regression. Of course I used my own model of mean and standard deviation in a normal distribution approach, but I think it works fine with such an amount of games. Good luck with your tests!
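For those without a Fortran compiler, the same calculation can be sketched in a few lines of Python. This is my own reconstruction of the model from the output above (even match assumed, per-game score variance (1 - draw_ratio)/4, normal approximation), so the last decimals may differ slightly from Ajedrecista's programme:

```python
import math

def min_score_for_no_regression(n_games, draw_ratio, z=1.96):
    """Minimum match score needed to claim 'no regression' at the given
    one-sided confidence (z = 1.96 ~ LOS 97.5%, i.e. 95% two-sided)."""
    # Per-game standard deviation of the score when the match is even.
    sigma_game = math.sqrt(1.0 - draw_ratio) / 2.0
    sigma_match = sigma_game / math.sqrt(n_games)
    min_score = 0.5 + z * sigma_match
    # Convert the minimum score to a minimum Elo advantage.
    min_elo = 400.0 * math.log10(min_score / (1.0 - min_score))
    return min_score, min_elo

# Same inputs as the Fortran session: 36000 games, 49% draw ratio.
score, elo = min_score_for_no_regression(36000, 0.49)
print(f"minimum score: {100 * score:.4f} %  ->  +{elo:.4f} Elo")
```

With 36000 games and a 49% draw ratio this reproduces the ~50.37% minimum score and the ~+2.6 Elo threshold reported above.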

Regards from Spain.

Ajedrecista.

Re: A word for casual testers

Posted: Tue Dec 25, 2012 7:59 pm
by Adam Hair
Don wrote:Image


Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.

As Don points out, the results of any short test will fall on that sort of graph. While 2/3rds of the results are within ~14 Elo of the average, the result from one short match may be quite a bit higher or lower than the average result. In other words, this is one more example of how the results of one short match can be misleading.

If it is easy for Don to do, it would be enlightening if the games were placed in smaller bins (say 50 games each) and larger bins (800 to 1000 games), and then sorted and graphed in this manner. This would demonstrate how the results from longer matches are more reliable.

If not, I have 22,000 games from testing a beta version of Gaviota that I can use. Perhaps I can work on this and present the results this evening (7 to 9 hours from now) or tomorrow evening.

Re: A word for casual testers.

Posted: Tue Dec 25, 2012 8:35 pm
by Ajedrecista
Hello Adam:
Adam Hair wrote:That looks quite close to what it should look like, which is the quantile function of the logistic distribution.

As Don points out, the results of any short test will fall on that sort of graph. While 2/3rds of the results are within ~14 Elo of the average, the result from one short match may be quite a bit higher or lower than the average result. In other words, this is one more example of how the results of one short match can be misleading.
That's it! Indeed, we agree on the '2/3 issue':

My numeric method for determining the draw trends of each engine.
Ajedrecista wrote:But what must one understand by a 'not unbalanced match'? Maybe all the engines of a Round Robin must be in the range [0.15, 0.85] for µ? This interval could be different, of course! By the way, I did not write [0.15, 0.85] by chance:

Code:

w = wins; d = draws; l = losses; n = games = w + d + l.

(Rating difference) = 400*log{[1 + (w - l)/n]/[1 - (w - l)/n]} by definition (just toy with variables).

(Rating performance): win ---> rating + 400; draw ---> rating; lose ---> rating - 400.
(Average rating performance) = <rating> + 400*(w - l)/n; (average opponent's rating: <rating>).
(Rating difference) = (average rating performance) - <rating> = 400*(w - l)/n.
If you call x = (w - l)/n and then plot 400*log[(1 + x)/(1 - x)] and 400x (you can even get rid of the constant 400, which is the same in both cases), you will see that they are similar in the range -2/3 < x < 2/3, more or less. It means the following:

Code:

µ_min. ===> w = 0, l = 2n/3, d = n/3: µ = d/(2n) = 1/6.
µ_max. ===> w = 2n/3, l = 0, d = n/3: µ = (w + d/2)/n = 5/6.

[µ_min., µ_max.] = [1/6, 5/6]; for not being so strict: [0.15, 0.85].
It is only an example, so it is not a rigorous definition of a 'not unbalanced match'.
What I called the 'linear zone' in my previous post is exactly what you said: the central 2/3 of the graph. In the example from my quote: [2/3 - (-2/3)]/[1 - (-1)] = (4/3)/2 = 2/3 (the central 2/3 of the graph). Of course it is only an approximation, but a good one IMHO. In Don's graph, it corresponds to the range [20, 100] on the x-axis, which in fact is more or less linear:

Image

As I said before, the central point of this line (x = 60) is a little over 3000 Elo (something like +3 or +4 Elo over 3000), so it seems that there is a small gain on average inside the linear zone. I do not know how well the non-linear zones compensate each other.
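The comparison between the exact formula and its linear approximation is easy to check numerically. The short sketch below tabulates 400*log10[(1+x)/(1-x)] against 400x at a few points inside and just outside the 'linear zone' (the specific sample points are my own choice, just for illustration):

```python
import math

def exact_elo(x):
    """Exact rating difference for x = (w - l)/n."""
    return 400.0 * math.log10((1.0 + x) / (1.0 - x))

# Tabulate exact vs. linear approximation inside and beyond |x| < 2/3.
for x in (0.1, 0.3, 0.5, 2 / 3, 0.8):
    print(f"x = {x:.3f}:  exact = {exact_elo(x):6.1f}   400x = {400 * x:6.1f}")
```

At x = 2/3 the exact value is 400*log10(5) ≈ 279.6 against 400x ≈ 266.7; beyond that point the two curves separate quickly, which is why results in the tails are best avoided.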

@Don, Adam et al: thank you very much for your work!

Regards from Spain.

Ajedrecista.

Re: A word for casual testers

Posted: Tue Dec 25, 2012 10:09 pm
by Rebel
Another example to make Don's point:

Image

A match of 2500 games, graphed in steps of 100 games. It starts not so well: after 200 games the score is still below 50%, then it starts to rise, peaking at 300 games (51.8%, indicating +12-13 Elo), and after 900 games the score is still above 50%. Then the random nature strikes back and eventually the score ends at a poor 48.8% after 2500 games.

Here I stopped the match. It's not unlikely that after 5000 games the score would rise to 50% again; it's just very unlikely that the change was a measurable improvement given limited hardware.
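A quick simulation shows how easily a dead-even match wanders like this. The sketch below (the win/draw probabilities are assumptions for illustration, not Rebel's actual data) tracks the running score of 2500 games between two exactly equal engines:

```python
import random

# Running score of a 2500-game match between two exactly equal engines.
rng = random.Random(7)
p_win, p_draw = 0.3, 0.4          # assumed per-game probabilities
score = 0.0
for game in range(1, 2501):
    r = rng.random()
    score += 1.0 if r < p_win else 0.5 if r < p_win + p_draw else 0.0
    if game % 500 == 0:
        print(f"after {game:4d} games: {100.0 * score / game:.1f} %")
final_pct = 100.0 * score / 2500
```

With different seeds the running score routinely drifts a percent or two above or below 50% for hundreds of games at a time, just as in the graph above.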

Re: A word for casual testers.

Posted: Wed Dec 26, 2012 4:41 am
by Adam Hair
Ajedrecista wrote:Hello Adam:
Adam Hair wrote:That looks quite close to what it should look like, which is the quantile function of the logistic distribution.

As Don points out, the results of any short test will fall on that sort of graph. While 2/3rds of the results are within ~14 Elo of the average, the result from one short match may be quite a bit higher or lower than the average result. In other words, this is one more example of how the results of one short match can be misleading.
That's it! Indeed, we agree on the '2/3 issue':

My numeric method for determining the draw trends of each engine.
Ajedrecista wrote:But what must one understand by a 'not unbalanced match'? Maybe all the engines of a Round Robin must be in the range [0.15, 0.85] for µ? This interval could be different, of course! By the way, I did not write [0.15, 0.85] by chance:

Code:

w = wins; d = draws; l = losses; n = games = w + d + l.

(Rating difference) = 400*log{[1 + (w - l)/n]/[1 - (w - l)/n]} by definition (just toy with variables).

(Rating performance): win ---> rating + 400; draw ---> rating; lose ---> rating - 400.
(Average rating performance) = <rating> + 400*(w - l)/n; (average opponent's rating: <rating>).
(Rating difference) = (average rating performance) - <rating> = 400*(w - l)/n.
If you call x = (w - l)/n and then plot 400*log[(1 + x)/(1 - x)] and 400x (you can even get rid of the constant 400, which is the same in both cases), you will see that they are similar in the range -2/3 < x < 2/3, more or less. It means the following:

Code:

µ_min. ===> w = 0, l = 2n/3, d = n/3: µ = d/(2n) = 1/6.
µ_max. ===> w = 2n/3, l = 0, d = n/3: µ = (w + d/2)/n = 5/6.

[µ_min., µ_max.] = [1/6, 5/6]; for not being so strict: [0.15, 0.85].
It is only an example, so it is not a rigorous definition of a 'not unbalanced match'.
What I called the 'linear zone' in my previous post is exactly what you said: the central 2/3 of the graph. In the example from my quote: [2/3 - (-2/3)]/[1 - (-1)] = (4/3)/2 = 2/3 (the central 2/3 of the graph). Of course it is only an approximation, but a good one IMHO. In Don's graph, it corresponds to the range [20, 100] on the x-axis, which in fact is more or less linear:

Image
And I definitely agree with you, Jesús. Any inferences we make about computer chess matches, whether we are discussing draw rates or rates of success with various openings or anything else, should involve matches that are not too "unbalanced". Avoid results that are in the tails of the logistic distribution. (0.15, 0.85) is approximately the maximum "good" score interval.

Of course, I believe HGM would say that we need to study the tails in order to determine if we are using the correct distribution for the ratings. But that is a study for another day.

Ajedrecista wrote: As I said before, the central point of this line (x = 60) is a little over 3000 Elo (something like +3 or +4 Elo over 3000), so it seems that there is a small gain in average inside the lineal zone. I do not know how non-lineal zones are well or bad compensated.

@Don, Adam et al: thank you very much for your work!

Regards from Spain.

Ajedrecista.
Thank you, Jesús, for your work. I think it is great that there is a handful of us who are interested in these sorts of studies. The other 900+ members may not care very much :lol: , but as long as there is someone interested I will keep posting this stuff :D .

Re: A word for casual testers

Posted: Wed Dec 26, 2012 6:02 am
by Adam Hair
I had 22,000 games played by a Gaviota beta against a set of opponents. I randomized the games and then split them up in two different ways. The first way was into 20 sets of ~1100 games. The second way was into 220 sets of ~100 games each. In each set, I gave Gaviota beta a unique name. Then I combined the sets back together. I also added a set of games played by the opponents against each other and ~22,000 games played by Gaviota 0.85.1 against the same opponents. I then calculated the ratings with Ordo 0.6, with the Elo of Gaviota 0.85.1 set to zero. The following graphs show the Elo rating of Gaviota beta based on the games from each set. This is essentially the same as playing 20 1100-game gauntlets (or 220 100-game gauntlets) and calculating Gaviota beta's Elo each time.

1100 game sets:

Image

While only 20 points makes the pattern harder to see, it is not unlike Don's graph. Note that the median point is at approximately 41 Elo, the difference between the smallest estimate and the largest is ~35 Elo, and the spread of the middle 2/3 of the estimates is ~18 Elo.

100 game sets:

Image

This looks like Don's graph. The median point is ~42 Elo, the difference between the smallest and largest estimate is ~175 Elo, and the spread of the middle 2/3 of the estimates is ~61 Elo.

Here is the point of all this, the reason for Don's post. From all of the games (22,000 for Gaviota beta and 22,000 for Gaviota 0.85.1, plus the games between the opponents), the calculated Elo difference between Gaviota beta and v0.85.1 is 41.1 Elo. An Elo calculation from a random 1100-game gauntlet is much more likely to be close to the true Elo difference (where the Elo estimate from the total group of games is used as the "true" Elo difference) than one from a random 100-game gauntlet. This does not mean that the results from a 100-game gauntlet can't be closer to the "truth" than the results from an 1100-game gauntlet. But it does mean that it is a somewhat rare event.

Now, the improvement in Gaviota beta over v0.85.1 is substantial enough that most of the 100-game gauntlets indicate that Gaviota beta is better than v0.85.1. But even some of those gauntlets "show" the beta to be worse. And some of those gauntlets "show" the beta to be a lot stronger than v0.85.1. Imagine if the actual increase in strength were less, as in Don's case. Small matches/gauntlets just do not prove a lot by themselves.
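The effect of bin size on the spread can be reproduced with simulated data. The sketch below uses an assumed +41 Elo edge and a 40% draw ratio (not Adam's actual Gaviota games) and compares the spread of per-bin Elo estimates for 1100-game and 100-game bins:

```python
import math
import random

def binned_elo_estimates(n_games, bin_size, true_elo, draw_ratio, rng):
    """Split a long run of games with a fixed true Elo edge into bins
    of bin_size games and return the per-bin Elo estimates, sorted."""
    expected = 1.0 / (1.0 + 10.0 ** (-true_elo / 400.0))
    p_win = expected - draw_ratio / 2.0
    estimates = []
    for _ in range(n_games // bin_size):
        score = 0.0
        for _ in range(bin_size):
            r = rng.random()
            score += 1.0 if r < p_win else 0.5 if r < p_win + draw_ratio else 0.0
        s = min(max(score / bin_size, 1e-6), 1.0 - 1e-6)
        estimates.append(400.0 * math.log10(s / (1.0 - s)))
    return sorted(estimates)

rng = random.Random(1)
spreads = {}
for size in (1100, 100):
    est = binned_elo_estimates(22000, size, 41.0, 0.40, rng)
    spreads[size] = est[-1] - est[0]
    print(f"{size:4d}-game bins: spread {spreads[size]:6.1f} Elo "
          f"(min {est[0]:+.1f}, max {est[-1]:+.1f})")
```

The 100-game bins show a spread several times larger than the 1100-game bins, matching the qualitative difference between the two graphs above.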

Re: A word for casual testers.

Posted: Wed Dec 26, 2012 6:04 am
by JuLieN
Adam Hair wrote:Thank you, Jesús, for your work. I think it is great that there is a handful of us who are interested in these sorts of studies. The other 900+ members may not care very much :lol: , but as long as there is someone interested I will keep posting this stuff :D .
Image

But I read the letters that seem to compose words, between the strange graphics. ;)

Re: A word for casual testers.

Posted: Wed Dec 26, 2012 6:31 am
by Adam Hair
JuLieN wrote:
Adam Hair wrote:Thank you, Jesús, for your work. I think it is great that there is a handful of us who are interested in these sorts of studies. The other 900+ members may not care very much :lol: , but as long as there is someone interested I will keep posting this stuff :D .
Image

But I read the letters that seem to compose words, between the strange graphics. ;)


Image


:lol: :lol: :lol:

Re: A word for casual testers

Posted: Wed Dec 26, 2012 9:49 am
by Werner
Hi Don,
I have 2 examples:

Code:

Nemo 1.0.1 x64 - Toga II 1.4 Beta5c 1CPU  11.0 - 9.0  55.00%   
Nemo 1.0.1 x64 - Toga II 1.4.2 JD 1CPU  12.5 - 7.5  62.50%   
Nemo 1.0.1 x64 - Toga II 1.4.3JDbeta19a  10.5 - 9.5  52.50%   
Nemo 1.0.1 x64 - Toga II 2.02 JA  13.0 - 7.0  65.00%   
Nemo 1.0.1 x64 - Toga Returns 1.0  11.0 - 9.0  55.00% 
Toga II 2.02 is not 30 points ahead of Toga II 1.4.3JDbeta19a: wrong?

or

924 Delphil 2.9g w32 1CPU 2321

Code:

Delphil 2.9g x64 1CPU    2217 - Djinn 0.969 x64          2356   15.5 - 34.5    +7/-26/=17    31.00%
Delphil 2.9g x64 1CPU    2267 - Rodin 4.0                2330   20.5 - 29.5    +13/-22/=15    41.00%
It seems Delphil 2.9g x64 1CPU cannot reach the rating of the 32-bit version: wrong?