A word for casual testers
Posted: Tue Dec 25, 2012 4:37 pm
From time to time people will post results on talkchess from a match
they have run, or they will send me results from matches they have
played with Komodo. Typically these will be 100-game matches, and the
results will come with a conclusion or a question such as, "What is
wrong here?" If the match was good the comment may be more
enthusiastic: we have achieved some wonderful breakthrough. If it
disagrees with the rating lists, they are convinced of testing bias.
Sometimes people will have a favorite among the minor revisions of a
program and will swear by this version even though there is no solid
evidence that it is any different from another. You may notice that
in some cases someone will make a minor modification to some open
source program and, based on some 100-game match, declare a
breakthrough.
There are a few who cannot be convinced that error margins are more
than just hypothetical nonsense. They will patiently listen to what I
am saying, but they don't believe it applies in "real life" or in any
practical sense; it's just some theoretical thing, and I need to
loosen up and look at the clear evidence staring me in the face. To
them "common sense" and the "gut" are more reliable than fuzzy
numbers.
I hope that actions speak louder than words. We are currently testing
a change that has given us a minor improvement of perhaps 2 or 3 ELO
over a statistically significant number of games at various time
controls. So I ran 120 separate 300-game matches under equal
conditions. The idea is to see what conclusions people might come to
based on any particular match.
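For anyone who wants to see this effect without burning engine time,
here is a minimal sketch that simulates the same experiment. The
2.5 ELO edge and 50% draw rate are assumptions for illustration; the
exact figures from our tests are not given above.

```python
import math
import random

def play_match(games, true_elo, draw_rate):
    """Simulate one match and return the measured ELO difference."""
    # Expected score for the stronger side under the logistic ELO model.
    expected = 1.0 / (1.0 + 10 ** (-true_elo / 400.0))
    win_prob = expected - draw_rate / 2.0  # wins count 1, draws 0.5
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < win_prob:
            score += 1.0
        elif r < win_prob + draw_rate:
            score += 0.5
    s = score / games
    return -400.0 * math.log10(1.0 / s - 1.0)  # score fraction -> ELO

# 120 matches of 300 games each, with an assumed true edge of 2.5 ELO
# and an assumed 50% draw rate.
results = sorted(play_match(300, 2.5, 0.5) for _ in range(120))
negative = sum(1 for r in results if r < 0)
print(f"worst {results[0]:+.1f} ELO, best {results[-1]:+.1f} ELO, "
      f"{negative} of 120 matches negative")
```

Run it a few times and you will see the same spread of "results" the
real matches produced, even though every simulated match uses the
identical underlying strengths.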
If you look at the first match result, the improved version scores 6
ELO down. This by itself would convince some people that the change
was tried but just didn't pay off. The second match does better,
scoring only a little more than 2 ELO down. Surely, with 2 matches
that BOTH scored negatively, the change should be discarded, right?
What more overwhelming evidence do you need? In fact, together they
form a 600-game match that came out 4 ELO down!
So I sorted the various match results by ELO score. Out of the 120
different matches, the FIRST 49 (when sorted this way) came out
negative. This means that if you are running 300-game matches you
would have come to the wrong conclusion almost half the time; you are
basically wasting your time on these matches, unless of course you
are having fun running them and enjoy watching the games. Please note
that these are 300-game matches, not the usual 100- or 200-game
matches that are often reported with conclusions being drawn.
It gets worse. There were several matches that showed a ridiculously
bad result, the worst showing a "regression" of about 32 ELO. Had
that been the first match run, even the more professional testers
would start to suspect that something was wrong. With a score like
that after 300 games, the tentative conclusion that "something is
probably wrong with this change" is a valid one, as long as the word
"probably" is in there. The error margin is about 28 ELO after these
300 games, so if you were pre-testing with 300-game samples you might
legitimately reject such a change to save time. There is no way of
getting around the fact that you might reject good changes no matter
how many games you play.
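As a rough sanity check on that 28 ELO figure, here is the standard
back-of-the-envelope calculation, assuming a 50% draw rate (an
assumption; the actual draw rate is not reported above):

```python
import math

games = 300
draw_rate = 0.5          # assumed
s = 0.5                  # score fraction, near equality

# Per-game variance of the score (wins count 1, draws count 0.5).
win = s - draw_rate / 2.0
var = win + draw_rate / 4.0 - s ** 2
sigma = math.sqrt(var / games)        # standard error of the score

# Slope of the logistic ELO curve at s = 0.5: about 7 ELO per 1%.
elo_per_unit = 400.0 / (math.log(10) * s * (1.0 - s))
margin = 1.96 * sigma * elo_per_unit  # two-sided 95% interval
print(f"95% error margin: +/- {margin:.0f} ELO")  # about +/- 28
```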
Of course, among these 120 matches you also find many that finished
with exceptional results. The best one showed a whopping 37 ELO
advantage. Again, there are some who would report this as a major
breakthrough.
So here is a word of advice for casual testers. There is a right way
to report results and a wrong way. The right way is to ONLY report
them; do not interpret them. You can never go wrong if you give the
match conditions and then just say, "here is what happened." It is
very useful to use bayeselo, ordo or elostat to get a proper report
which displays the appropriate error margins that go with these
results; those numbers tell us a lot about how much we can trust the
results. The error margin is a good way to get a sense of how much
"sample error" we can expect to see. So if you play a 200-game match
and the score is close, you know that there is a great deal of sample
error standing in the way of drawing any firm conclusions about the
value of a change.
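To make that concrete, here is a simplified sketch of the kind of
summary those tools print. It uses a plain normal approximation
rather than bayeselo's or ordo's actual models, and the win/draw/loss
counts are made up:

```python
import math

def report(wins, draws, losses):
    """Print score, ELO estimate, and a 95% error margin."""
    games = wins + draws + losses
    s = (wins + 0.5 * draws) / games               # score fraction
    elo = -400.0 * math.log10(1.0 / s - 1.0)       # score -> ELO
    var = (wins + 0.25 * draws) / games - s ** 2   # per-game variance
    sigma = math.sqrt(var / games)
    margin = 1.96 * sigma * 400.0 / (math.log(10) * s * (1.0 - s))
    print(f"{games} games, score {100 * s:.1f}%: "
          f"{elo:+.1f} ELO, error margin +/- {margin:.1f}")

report(52, 100, 48)   # a close 200-game match (hypothetical numbers)
```

With those made-up numbers the report comes out to roughly +7 ELO
with an error margin of about +/- 34, which is exactly the situation
where no firm conclusion is possible.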
After sorting the individual 120 matches by score, I built a graph to
help you visualize how this works. The graph appears at the top of
this post. The y-axis is the ELO rating; the x-axis is the match
number (after sorting). You will notice that a bit more than half the
results are positive, because the change is a good one, but a very
significant number of results are negative simply due to sample
error. Also, please note that even if you combine all 36,000 games
you are left with sample error. In other words, we cannot say with
certainty that the change is really an improvement! All we can say is
that the change is very likely to be an improvement. In this case we
have a lot of other evidence from several other tests, but that is
not the point. If we cannot say for sure with 36,000 games, then 100
games is surely not enough.
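For the curious, that "very likely" can be put into numbers with a
likelihood-of-superiority calculation. The +2.5 ELO combined result
and 50% draw rate below are assumptions for illustration; the exact
totals are not given above.

```python
import math

games = 36_000
elo = 2.5                # assumed combined result
draw_rate = 0.5          # assumed

s = 1.0 / (1.0 + 10 ** (-elo / 400.0))   # score fraction for +2.5 ELO
var = (s - draw_rate / 2.0) + draw_rate / 4.0 - s ** 2
sigma = math.sqrt(var / games)

# Likelihood of superiority: probability that the true score is above
# 50%, under a normal approximation.
z = (s - 0.5) / sigma
los = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
print(f"LOS: {100 * los:.1f}%")   # very likely, but not certain
```

Under those assumptions the answer is around 97%: strong evidence,
but still not certainty, even after 36,000 games.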
Don