## A word for casual testers

**Moderators:** hgm, Harvey Williamson, bob


### A word for casual testers

From time to time people will post results on talkchess from a match they have run, or they will send me results from matches they have played with Komodo. Typically these will be 100-game matches and the results will come with a conclusion such as, "what is wrong here?" If the match was good the comment may be more enthusiastic: that we have achieved some wonderful breakthrough. If it disagrees with the rating lists, they are convinced of testing bias. Sometimes people will have a favorite version among minor revisions of a program and will swear by this version even though there is no solid evidence that it is any different from another. You may notice that in some cases someone will make a minor modification to some open source program and, based on some 100-game match, declare a breakthrough.

There are a few who cannot be convinced that error margins are more than just hypothetical nonsense. They will patiently listen to what I am saying, but they don't believe it applies in "real life" or in any practical sense; it's just some theoretical thing, and I need to loosen up and look at the clear evidence staring me in the face. To them "common sense" and the "gut" are more reliable than fuzzy numbers.

I hope that actions speak louder than words. We are currently testing a change that has given us a minor improvement of perhaps 2 or 3 ELO with a statistically significant number of games at various time controls. So I ran 120 separate 300-game matches under equal conditions. The idea is to see what conclusions people might come to based on any particular match.
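Don's experiment is small enough to replicate in a few lines. The sketch below is a minimal Monte Carlo version of it; the +3 Elo true gain and the 49% draw ratio are my assumptions (the draw ratio is borrowed from Jesús's estimate later in the thread), not Don's actual test conditions:

```python
import random

def play_match(n_games, elo_gain, draw_ratio, rng):
    """Simulate one match; return the improved engine's score fraction."""
    # Expected score from the Elo difference (logistic model).
    expected = 1.0 / (1.0 + 10.0 ** (-elo_gain / 400.0))
    # Split the non-draw probability so the mean score equals `expected`.
    p_win = expected - draw_ratio / 2.0
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + draw_ratio:
            score += 0.5
    return score / n_games

rng = random.Random(42)
results = sorted(play_match(300, 3.0, 0.49, rng) for _ in range(120))
negative = sum(1 for s in results if s < 0.5)
print(f"{negative} of 120 matches scored below 50% despite a real +3 Elo gain")
```

Under these assumptions a genuinely better engine loses or draws a 300-game match roughly 40% of the time, which is consistent with the 49 negative results out of 120 reported here.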

If you look at the first match result, the improved version scores 6 ELO down. This by itself would convince some people that the change was tried, but just didn't pay off. The second match does better, scoring only a little more than 2 ELO down. Surely, since we have 2 matches that BOTH scored negatively, the change should be discarded, right? What more overwhelming evidence do you need? Together, in fact, this forms a 600-game match that was 4 ELO down!

So I sorted the various match results by ELO score. Out of the 120 different matches, the FIRST 49 (when sorted this way) came out negative. This means that if you are running 300-game matches you would have come to the wrong conclusion almost half the time - you are basically wasting your time on these matches, unless of course you are having fun running them and enjoy watching the games. Please note that these are 300-game matches, not the usual 100- or 200-game matches that are often reported with conclusions being drawn.

It gets worse. There were several matches that show a ridiculously bad result, the worst showing a "regression" of about 32 ELO. Had that been the first match run, even the more professional testers would start to suspect that something is wrong. With a score like that after 300 games, the tentative conclusion that "something is probably wrong with this change" is a valid one - as long as the word "probably" is in there. The error margin is about 28 ELO after these 30 games, so if you were pre-testing with 300 game samples you might legitimately reject such a change to save time. There is no way of getting around the fact that you might reject good changes no matter how many games you play.
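A normal approximation gives a sense of how rare such an outlier really is. The sketch below again assumes a +3 Elo true gain and a 49% draw ratio (my assumptions, not Don's stated figures) and asks how often a single 300-game match would score as badly as a 32 Elo regression:

```python
import math

def expected_score(elo_diff):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, draw_ratio = 300, 0.49              # draw ratio is an assumption
true_mean = expected_score(3.0)        # true gain assumed to be +3 Elo
# Per-game score variance given the win/draw/loss split.
p_win = true_mean - draw_ratio / 2.0
var = p_win + 0.25 * draw_ratio - true_mean ** 2
sigma = math.sqrt(var / n)             # std. dev. of a 300-game match score

# Score that would *look like* a 32 Elo regression.
threshold = expected_score(-32.0)
p = norm_cdf((threshold - true_mean) / sigma)
print(f"chance a single 300-game match shows <= -32 Elo: {p:.2%}")
```

The per-match probability comes out under 1%, yet across 120 matches at least one such outlier is more likely than not.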

Of course you also find many examples of matches that finished with exceptional results in these 120 matches. The best one showed a whopping 37 ELO advantage. Again, there are some who would report this as a major breakthrough.

So here is the word of advice for casual testers. There is a right way to report results and a wrong way. The right way to report results is to ONLY report them; do not interpret them. You can never go wrong if you give the match conditions and then just say, "here is what happened." It is very useful to use bayeselo, ordo or elostat to get a proper report which displays the appropriate error margins that go with these results - those numbers tell us a lot about how much we can trust the results. The error margins are a good way to get a sense of how much "sample error" we can expect to see. So if you play a 200-game match and the score is close, you know that there is a great deal of sample error which stands in the way of drawing any firm conclusions about the value of a change.
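To illustrate the kind of report those tools give, here is a rough normal-approximation sketch of an Elo estimate with its 95% interval from a single W/D/L result; the 60-100-40 match is a made-up example, and Ordo and BayesElo use more careful models, so their numbers will differ somewhat:

```python
import math

def elo_and_margin(wins, draws, losses, z=1.96):
    """Elo estimate plus a ~95% interval from one match result.

    Rough normal-approximation sketch of what Ordo/BayesElo report.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Sample variance of the per-game score (win=1, draw=0.5, loss=0).
    var = (wins + 0.25 * draws) / n - score ** 2
    sigma = math.sqrt(var / n)

    def to_elo(s):
        return 400.0 * math.log10(s / (1.0 - s))

    return to_elo(score), to_elo(score - z * sigma), to_elo(score + z * sigma)

# Hypothetical 200-game match: 60 wins, 100 draws, 40 losses (55%).
elo, lo, hi = elo_and_margin(60, 100, 40)
print(f"Elo {elo:+.1f}, 95% interval [{lo:+.1f}, {hi:+.1f}]")
```

Even a 200-game match won with a clear 55% score carries an interval spanning dozens of Elo, which is exactly the sample error Don describes.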

After sorting the individual 120 matches by score, I built a graph in order to help you visualize how this works. The graph appears at the top of this post. The y-axis is the ELO rating; the x-axis is the match number (after sorting). You will notice that a bit more than half the results are positive because the change is a good one, but a very significant number of results are negative simply due to sample error. Also, please note that even if you combine the 36,000 games you are left with sample error. In other words, we cannot say with certainty that the change is really an improvement! All we can say is that the change is very likely to be an improvement. In this case we have a lot of other evidence from several other tests, but that is not the point. If we cannot say for sure with 36,000 games, then 100 games is surely not enough.

Don

Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.

- Ajedrecista
**Posts:** 1395 **Joined:** Wed Jul 13, 2011 7:04 pm **Location:** Madrid, Spain

### Re: A word for casual testers.

Hello Don:

Your post is very interesting.


Don wrote:

The error margin is about 28 ELO after these 30 games, so if you were pre-testing with 300 game samples you might legitimately reject such a change to save time.

I am sure that you wanted to write ± 28 Elo for 300 games... for 30 games, error bars should be ± 28*sqrt(300/30) ~ ± 88.5 Elo.

------------------------

I guess that your graph indicates a positive gain (I suppose that 3000 Elo is no gain). Averaging the 'more linear' zone by eye (so, a very poor estimate), it should give a +3 or +4 Elo gain (forgetting about error bars).

Supposing 95% confidence for ± 28 Elo after 300 games and doing a little math, I guess that the draw ratio is around 49%, more or less (please write what the true draw ratio was, just for comparison). I wrote a Fortran 95 programme called Minimum_score_for_no_regression some months ago:

Code: Select all

```
Minimum_score_for_no_regression, ® 2012.
Calculation of the minimum score for no regression (i.e. negative Elo gain) in a match between two engines:
Write down the number of games of the match (it must be a positive integer, up to 1073741823):
36000
Write down the draw ratio (in percentage):
49
Write down the likelihood of superiority (in percentage) between 75% and 99.9% (LOS will be rounded up to 0.01%):
97.5
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
_______________________________________________________________________________
Theoretical minimum score for no regression: 50.3688 %
Theoretical standard deviation in this case: 0.3688 %
Minimum number of won points for the engine in this match: 18133.0 points.
Minimum Elo advantage, which is also the negative part of the error bar:
2.5672 Elo (for a LOS value of 97.50 %).
A LOS value of 97.50 % is equivalent to 95.00 % confidence in a two-sided test.
_______________________________________________________________________________
End of the calculations. Approximated elapsed time: 14 ms.
Thanks for using Minimum_score_for_no_regression. Press Enter to exit.
```

So, if you get an improvement of +2.6 Elo (or more) after 36000 games, you can be sure that there is not a regression **with 95% confidence.** Of course I used my own model of mean and standard deviation in a normal distribution approach, but I think it works fine with such an amount of games. Good luck with your tests!

Regards from Spain.

Ajedrecista.
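The figures in the program output above can be reproduced fairly closely with a plain normal-model calculation. The sketch below is my own approximation, not Jesús's Fortran code; it lands at a minimum gain of about 2.56 Elo rather than his 2.5672, the small gap coming from model and rounding differences:

```python
import math

def min_score_no_regression(n_games, draw_ratio, z=1.959964):
    """Minimum score needed to rule out a regression at ~95% confidence.

    Normal-approximation sketch of the quantities printed by
    Minimum_score_for_no_regression; not the Fortran original.
    """
    # Per-game score variance at a 50% mean, given the draw ratio:
    # wins and losses each occur with probability (1 - draw_ratio)/2.
    var = (1.0 - draw_ratio) / 4.0
    sigma = math.sqrt(var / n_games)   # std. dev. of the match score
    margin = z * sigma                 # LOS 97.5% <-> two-sided 95%
    min_score = 0.5 + margin
    min_elo = 400.0 * math.log10(min_score / (1.0 - min_score))
    return min_score, margin, min_elo

score, margin, elo = min_score_no_regression(36000, 0.49)
print(f"minimum score {score:.4%}, margin {margin:.4%}, minimum gain {elo:.4f} Elo")
```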

### Re: A word for casual testers

That looks quite close to what it should look like, which is the quantile function of the logistic distribution.

As Don points out, the results of any short test will fall on that sort of graph. While 2/3rds of the results are within ~14 Elo of the average, the result from one short match may be quite a bit higher or lower than the average result. In other words, this is one more example of how the results of one short match can be misleading.

If it is easy for Don to do, it would be enlightening if the games were placed in smaller bins (say 50 games each) and larger bins (800 to 1000 games), and then sorted and graphed in this manner. This would demonstrate how the results from longer matches are more reliable.

If not, I have 22,000 games from testing a beta version of Gaviota that I can use. Perhaps I can work on this and present the results this evening (7 to 9 hours from now) or tomorrow evening.

- Ajedrecista

### Re: A word for casual testers.

Hello Adam:


Adam Hair wrote:

That looks quite close to what it should look like, which is the quantile function of the logistic distribution.

As Don points out, the results of any short test will fall on that sort of graph. While 2/3rds of the results are within ~14 Elo of the average, the result from one short match may be quite a bit higher or lower than the average result. In other words, this is one more example of how the results of one short match can be misleading.

That's it! Indeed, we agree on the '2/3 issue'. My numeric method for determining the draw trends of each engine.

What I called the 'linear zone' in my previous post is exactly what you said: the central 2/3 of the graph. In the example from my quote: [2/3 - (-2/3)]/[1 - (-1)] = (4/3)/2 = 2/3 (the central 2/3 of the graph). Of course it is only an approximation, but a good one IMHO. In Don's graph, it corresponds to the range [20, 100] on the x-axis, which in fact is more or less linear:

Ajedrecista wrote:

But what must one understand by a 'not unbalanced match'? Maybe all the engines of a Round Robin must be in the range [0.15, 0.85] for µ? This interval could be different, of course! By the way, I did not write [0.15, 0.85] by chance:

If you call

Code: Select all

```
w = wins; d = draws; l = loses; n = games = w + d + l.
(Rating difference) = 400*log{[1 + (w - l)/n]/[1 - (w - l)/n]} by definition (just toy with variables).
(Rating performance): win ---> rating + 400; draw ---> rating; lose ---> rating - 400.
(Average rating performance) = <rating> + 400*(w - l)/n; (average opponent's rating: <rating>).
(Rating difference) = (average rating performance) - <rating> = 400*(w - l)/n.
```

x = (w - l)/n and then you plot 400*log[(1 + x)/(1 - x)] and 400x (you can even get rid of the constant 400, which is the same in both cases), you will see that they are similar in the range -2/3 < x < 2/3, more or less. It means the following:

Code: Select all

```
µ_min. ===> w = 0, l = 2n/3, d = n/3: µ = d/(2n) = 1/6.
µ_max. ===> w = 2n/3, l = 0, d = n/3: µ = (w + d/2)/n = 5/6.
[µ_min., µ_max.] = [1/6, 5/6]; for not being so strict: [0.15, 0.85].
```

It is only an example, so it is not a rigorous definition of a 'not unbalanced match'.

As I said before, the central point of this line (x = 60) is a little over 3000 Elo (something like +3 or +4 Elo over 3000), so it seems that there is a small gain *on average* inside the linear zone. I do not know how well or badly the non-linear zones are compensated.

@Don, Adam et al: thank you very much for your work!

Regards from Spain.

Ajedrecista.

### Re: A word for casual testers

Another example to make Don's point:

A match of 2500 games, graphed in steps of 100 games. It starts not so well: after 200 games the score is still below 50%, then it starts to rise, peaking at 300 games (51.8%, indicating +12-13 Elo), and after 900 games the score is still above 50%. Then the random nature strikes back and eventually the score ends with a poor 48.8% after 2500 games.

Here I stopped the match. It's not unlikely that after 5000 games the score would rise to 50% again; it's just very unlikely that the change was a measurable improvement with limited hardware.
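A wandering score like this is easy to reproduce with a tiny simulation. The sketch below follows one 2500-game match between two exactly equal engines (the 45% draw ratio is an arbitrary assumption) and prints the running score every 500 games:

```python
import random

# One 2500-game match between two exactly equal engines, sampling the
# running score periodically to show how far a "trend" can wander.
rng = random.Random(3)
score = 0.0
for game in range(1, 2501):
    r = rng.random()
    # P(win) = P(loss) = 0.275, P(draw) = 0.45 -> expected score 50%.
    score += 1.0 if r < 0.275 else (0.5 if r < 0.725 else 0.0)
    if game % 500 == 0:
        print(f"after {game:4d} games: {100.0 * score / game:.1f}%")
```

Even with zero real difference, the running score can sit well above or below 50% for hundreds of games before drifting back.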


### Re: A word for casual testers.

And I definitely agree with you, Jesús. Any inferences we make about computer chess matches, whether we are discussing draw rates or rates of success with various openings or anything else, should involve matches that are not too "unbalanced". Avoid results that are in the tails of the logistic distribution. (0.15, 0.85) is approximately the maximum "good" score interval.


Of course, I believe HGM would say that we need to study the tails in order to determine if we are using the correct distribution for the ratings. But that is a study for another day.

Ajedrecista wrote:

As I said before, the central point of this line (x = 60) is a little over 3000 Elo (something like +3 or +4 Elo over 3000), so it seems that there is a small gain *on average* inside the linear zone. I do not know how well or badly the non-linear zones are compensated.

@Don, Adam et al: thank you very much for your work!

Regards from Spain.

Ajedrecista.

Thank you, Jesús, for your work. I think it is great that there is a handful of us who are interested in these sorts of studies. The other 900+ members may not care very much, but as long as there is someone interested I will keep posting this stuff.

### Re: A word for casual testers

I had 22,000 games played by Gaviota beta against a set of opponents. I randomized the games and then split them up in two different ways. The first way was into 20 sets of ~1100 games. The second way was into 220 sets of ~100 games each. In each set, I gave Gaviota beta a unique name. Then I combined the sets back together. I also added a set of games played by the opponents against each other and ~22,000 games played by Gaviota 0.85.1 against the same opponents. I then calculated the ratings with Ordo 0.6, with the Elo of Gaviota 0.85.1 set to zero. The following graphs show the Elo rating of Gaviota beta based on the games from each set. This is essentially the same as playing 20 1100-game gauntlets (or 220 100-game gauntlets) and calculating Gaviota beta's Elo each time.

1100-game sets:

While only 20 points makes the pattern harder to see, it is not unlike Don's graph. Note that the median point is at approximately 41 Elo, the difference between the smallest estimate and the largest is ~35 Elo, and the spread of the middle 2/3 of the estimates is ~18 Elo.

100-game sets:

This looks like Don's graph. The median point is ~42 Elo, the difference between the smallest and largest estimate is ~175 Elo, and the spread of the middle 2/3 of the estimates is ~61 Elo.

Here is the point in all of this, the reason for Don's post. From all of the games (22,000 for Gaviota beta and 22,000 for Gaviota 0.85.1, plus the games between the opponents), the calculated Elo difference between Gaviota beta and v0.85.1 is 41.1 Elo. An Elo calculation from a random 1100-game gauntlet is much more likely to be close to the true Elo difference (where the Elo estimate from the total group of games is used as the "true" Elo difference) than one from a random 100-game gauntlet. This does not mean that the results from a 100-game gauntlet can't be closer to the "truth" than the results from an 1100-game gauntlet. But it does mean that that is a somewhat rare event.

Now, the improvement in Gaviota beta over v0.85.1 is substantial enough that most of the 100-game gauntlets give estimates that Gaviota beta is better than v0.85.1. But even some of those gauntlets "show" the beta to be worse. And some of those gauntlets "show" the beta to be a lot stronger than v0.85.1. Imagine if the actual increase in strength was less, such as in Don's case. Small matches/gauntlets just do not prove a lot by themselves.
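Adam's binning procedure can be sketched as follows. The data is synthetic - a fixed win/draw/loss mix giving roughly a +42 Elo edge with 35% draws, standing in for the real 22,000 Gaviota games - so the exact spreads will not match his, but the pattern (small bins spreading far wider than large bins) does:

```python
import math
import random

def bin_elo_estimates(results, bin_size, rng):
    """Shuffle per-game results (1, 0.5, 0) and estimate Elo per bin."""
    results = results[:]
    rng.shuffle(results)
    estimates = []
    for i in range(0, len(results) - bin_size + 1, bin_size):
        s = sum(results[i:i + bin_size]) / bin_size
        estimates.append(400.0 * math.log10(s / (1.0 - s)))
    return sorted(estimates)

# Synthetic stand-in for 22,000 games: mean score 56% (~+42 Elo), 35% draws.
games = [1.0] * 8470 + [0.5] * 7700 + [0.0] * 5830
rng = random.Random(7)
big = bin_elo_estimates(games, 1100, rng)      # 20 bins of 1100 games
small = bin_elo_estimates(games, 100, rng)     # 220 bins of 100 games
print(f"1100-game bins: spread {big[-1] - big[0]:.0f} Elo")
print(f" 100-game bins: spread {small[-1] - small[0]:.0f} Elo")
```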


### Re: A word for casual testers.

Adam Hair wrote:

Thank you, Jesús, for your work. I think it is great that there is a handful of us who are interested in these sorts of studies. The other 900+ members may not care very much, but as long as there is someone interested I will keep posting this stuff.

But I read the letters that seem to compose words, between the strange graphics.

"The only good bug is a dead bug." (Don Dailey)

[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]



### Re: A word for casual testers

Hi Don,

I have 2 examples:

Code: Select all

```
Nemo 1.0.1 x64 - Toga II 1.4 Beta5c 1CPU    11.0 - 9.0   55.00%
Nemo 1.0.1 x64 - Toga II 1.4.2 JD 1CPU      12.5 - 7.5   62.50%
Nemo 1.0.1 x64 - Toga II 1.4.3JDbeta19a     10.5 - 9.5   52.50%
Nemo 1.0.1 x64 - Toga II 2.02 JA            13.0 - 7.0   65.00%
Nemo 1.0.1 x64 - Toga Returns 1.0           11.0 - 9.0   55.00%
```

Toga II 2.02 is not 30 points ahead of Toga II 1.4.3JDbeta19a: wrong?

or

924 Delphil 2.9g w32 1CPU 2321

Code: Select all

```
Delphil 2.9g x64 1CPU 2217 - Djinn 0.969 x64 2356   15.5 - 34.5   +7/-26/=17    31.00%
Delphil 2.9g x64 1CPU 2267 - Rodin 4.0 2330         20.5 - 29.5   +13/-22/=15   41.00%
```

It seems Delphil 2.9g x64 1CPU cannot reach the rating of the 32-bit version: wrong?

Werner