more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Dirt wrote:Even as the worst consecutive runs he's ever had, those last two runs would be surprising.
I agree. A 6-sigma deviation should occur only once every 500 million tries. Like I said, there clearly is something broken.

But the most-deviating result is actually a quite unreliable measure of the standard deviation of a Gaussian distribution. It would be much better if Bob presented the entire list of results, so that it can be seen what the whole distribution looks like. Does it just have a much larger standard deviation than it should (and by how much), or is it non-Gaussian, with long tails?

Otherwise it is more sensationalist reporting than science.
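As a quick sanity check on the "once every 500 million" figure, here is a sketch of the two-sided Gaussian tail probabilities (plain normal-distribution arithmetic; nothing in it is specific to BayesElo or to these particular runs):

Code: Select all

# Two-sided Gaussian tail: P(|Z| >= k sigma) for k = 2..6.
# At k = 6 this is about 2e-9, i.e. roughly once per 500 million tries.
from math import erfc, sqrt

for k in range(2, 7):
    p = erfc(k / sqrt(2))          # two-sided tail probability
    print(f"{k} sigma: p = {p:.3g}  (about 1 in {1/p:,.0f})")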
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:Here's the point that continually gets overlooked. If 800 games gives two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800-game matches to determine whether A' is better than A, when that change will, by definition, be small.
This is not so much overlooked as rejected, because it is wrong. If two identical programs habitually produce rating ranges that do not overlap in identical tests (i.e. if they are outside the 95% confidence interval in more than 5% of the cases), it follows that the 800-game test match was not done properly (as the games were apparently not independent). It does not follow that 800-game matches cannot be used to measure a small program difference when conducted properly (i.e. with independent games).

You seem to be attempting to burn down your own straw man, though, since for the 800-game matches you show, the rating ranges did overlap. There was nothing suspicious about these results. They were distributed as one would expect from standard statistics.
I hate to keep asking, but _did_ you actually read my post? If so, I did not say that the results did not match what statistical analysis says should happen. What I said was that there was _far_ too much variability to use any of them to draw a conclusion. Why is that so hard to grasp? It has zero to do with statistics. It has zero to do with opinion. If an 800-game test were useful, I should be able to run it twice and get results that are consistent. Those results are not. If I can't draw any conclusions because the results are inconsistent, then 800 games is not enough. I then moved on to 25,000 games, and there the rating ranges did not even overlap, showing that even that is not enough to detect small improvements in a program. What more should I say? You keep changing the topic and trying to beat down arguments no one is making. I simply said "800 games is nowhere near enough" and gave a pair of consecutive 25,000-game matches to show that _that_ is nowhere near enough either. I didn't say this result was abnormal. In fact, I said it was common. So can we get back to what I said, and not something off-the-wall that isn't part of the discussion?
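To put a scale on that non-overlap, here is a back-of-the-envelope sketch of how much run-to-run noise identical programs should show over 25,600 independent games, assuming a score near 50% and roughly 19% draws (the figures in the 25,000-game tables quoted further down):

Code: Select all

# Expected run-to-run noise for a 25,600-game gauntlet with independent games.
# Assumptions: mean score ~50%, ~19% draws (taken from the tables below).
from math import sqrt, log

N     = 25600                                    # games per run
draws = 0.19                                     # assumed draw fraction
mu    = 0.50                                     # assumed mean score

wins      = mu - draws / 2
var_game  = wins + draws / 4 - mu * mu           # per-game score variance
sd_score  = sqrt(var_game / N)                   # sd of the match score
elo_slope = 400 / (log(10) * mu * (1 - mu))      # Elo per unit of score near mu
sd_elo    = elo_slope * sd_score

print(f"1-sigma noise on one run          : about {sd_elo:.1f} Elo")
print(f"typical gap between two such runs : about {sqrt(2) * sd_elo:.1f} Elo")

On those assumptions a single run is good to about 2 Elo (one sigma), so two runs of an identical program should normally land within a few Elo of each other; a 21 Elo gap is far outside that, which is exactly what the argument here is about.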


The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.
I think they can, as they undoubtedly conduct the tests properly, keeping an eye on the statistics, and confirming that the games were indeed independent.
The BayesElo confidence interval is 95%, which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.
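For the "+/- 1 Elo" point, the same back-of-the-envelope arithmetic (independent games, ~19% draws, score near 50%, with the 95% bound taken as about two standard deviations) gives the number of games needed for a given error bar:

Code: Select all

# Games needed for a ~95% interval of +/- h Elo, under the same assumptions
# as the sketch above (independent games, ~19% draws, score near 50%).
from math import log

draws, mu = 0.19, 0.50
var_game  = (mu - draws / 2) + draws / 4 - mu * mu
elo_slope = 400 / (log(10) * mu * (1 - mu))

for h in (10, 5, 2, 1):
    n = var_game * (2 * elo_slope / h) ** 2
    print(f"+/- {h:2d} Elo at ~95%: about {n:,.0f} games")

On these assumptions +/- 1 Elo needs roughly 400,000 games, while 80,000 games resolves only about +/- 2 Elo, which is the crux of the disagreement over the Rybka figure.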
I love that "superior attitude" you have. Nobody but you knows how to run a "proper test". I have already spent months analyzing what is going on, and I tracked it down to "timing jitter" and absolutely nothing else. Several people have looked at the data. We've checked NPS. We restart engines after each game. The list goes on and on. The games are as independent as games can be when the same group of opponents and the same starting positions are used. There is no leftover hash. No learning of any kind. No random loads on the nodes. Nothing that changes from one game to another. Carefully confirmed by hundreds of NPS checks. I even run with code in Crafty to watch this from time to time, on occasions when I run on the cluster in "open mode" (where other users can use nodes I am not using, but not the nodes I am using). But these results were on a "closed system" where I was "it".

You keep wanting to say the experimental setup is wrong. I say you are full of it and need to change the channel to something that is worthwhile. Perhaps you could explain to me how even random performance noise would affect the match anyway. Isn't random timing taken care of in statistical sampling? In my book it is.

I am sorry I have to say this for the umpteenth time, but this is pure nonsense. Non-determinism cannot explain these results. Only dependence between the game results can.
And I will say it for the umpteenth-plus-one time: there is _zero_ dependence. Why don't you tell me a methodology that would make the games somehow dependent on each other in the first place, using my program plus 5 completely unmodified open-source programs (one of which is an old version of my program, of course). Just tell me how to make the games "dependent". After each game is played, two new instances of the two programs are started, sides are switched, and the game is played again.

I'm waiting on that intellectual jewel so that I might understand how it is even possible. My testing scheme (lower level this time around) goes like this: I create a pot load of simple shell scripts to run my referee program, which fires up one instance of each opponent, and connects them together much as xboard/winboard does. Once that game ends, everything terminates, and once the script ends, another one is sent to that node where it is started and plays another game. No shared data. No shared files. No interaction between nodes. No endgame tables. No nothing.

So I am waiting for you to tell me how, testing in that methodology, I might contrive an experimental setup such that the games are dependent on each other.
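One way to move this from argument to data: with the per-game results of a run in hand, an overdispersion check shows directly whether the games behave as independent samples. This is only a sketch, not anyone's actual harness; the scores list and the parse_pgn_results helper are hypothetical placeholders:

Code: Select all

# Overdispersion check: with independent games, the variance of block averages
# should be close to the per-game variance divided by the block size.
# A ratio well above 1 suggests the game results are not independent.
from statistics import mean, pvariance

def overdispersion(scores, block=100):
    """scores: per-game results (1, 0.5, 0) in the order they were played."""
    blocks = [scores[i:i + block]
              for i in range(0, len(scores) - block + 1, block)]
    block_means = [mean(b) for b in blocks]
    expected = pvariance(scores) / block     # what independence predicts
    observed = pvariance(block_means)        # what the run actually shows
    return observed / expected

# usage (hypothetical helper and file name):
# scores = parse_pgn_results("run1.pgn")
# print(f"variance ratio: {overdispersion(scores):.2f}  (~1.0 if independent)")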

Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 out of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
As I said before, there was nothing suspicious about these 4 results. You have no case.
What on earth are you talking about? I didn't say there was anything "suspicious". That is your nonsense. I simply said that the four runs show that one cannot use 800 games against a total of 5 opponents and 40 starting positions to decide whether a new version is better or worse than the previous version. Somehow you want to make that statement into something else. But that's all I have said. Feel free to point out anywhere I used the word "suspicious" or anything remotely related...
BTW the stuff about the 40 positions or the 5 opponents is way beside the point. No matter what 40 positions I choose (and it would seem to me that the smaller the number, the more stable the results), I ought to be able to produce a stable answer about those 40 positions. Whether or not that carries over to other positions depends on how "general" those positions are (and the Silver positions are pretty generic/representative of opening positions). My point is this: if I can't play a long match against _one_ opponent, using 5 starting positions, and get similar results for each run (in terms of Elo), then adding more programs and more positions only increases the noise; it certainly doesn't reduce it.
Indeed. Like I said, it proves your testing method is flawed, because the games are apparently dependent. Not because there aren't enough games. The point about the number of positions and opponents only shows that the effort to obtain more accurate results by playing more games is doomed from the outset, even if it were not executed in a flawed way. Even if it had converged, it would not have converged to the correct number, and conclusions based on it can be wrong.
"methodology is flawed" is a nice attempt to weasel out of the original discussion. But it is wrong. So I am waiting for you to tell me how, using a single shell script for each game, running two at a time on dual-cpu nodes, I can somehow make the games dependent on each other. Your chance to shine... or not...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Michael Sherwin wrote:
bob wrote:These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
Okay, I have one computer to test on, what do I do? Just give up and quit, I guess.
I actually am not sure, to be honest. It is a real problem, however...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:You'd have to explain how the test-method is "completely broken".
No, I don't. If I see a pile of glass fragments lying on the floor, I can conclude that the bottle is broken without having to explain how it got there. Seeing the fragments justifies the conclusion that it is a broken bottle no matter what. And you show us the fragments. (For the 25,000-game runs, that is. The 800-game runs look like a perfectly OK bottle to me. :lol: )

It is for you to figure out why you did not succeed in playing independent games. It is your cluster, after all, so you fix it! If you were to offer me equal time on it, I would of course be willing to help you out... :roll:
That's what I thought. I see a pile of glass fragments and wonder "who dumped those glass fragments there" or "who broke something there". Nothing says that someone broke something there. That is an assumption, something you are good at making, whether there is evidence or not to support it.

So again, tell me how to "break" my cluster to make games played somehow be dependent on each other...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
bob wrote:A while back I mentioned how difficult it is to draw conclusions about relatively modest changes in a chess program, requiring a ton of games to get usable comparisons. Here is a sample to show that in a way that is pretty easy to understand.

First, I played Crafty against 5 opponents, including an older 21.7 version. The version I am testing here is not particularly good yet, representing some significant "removals" from the evaluation, so the results are not particularly interesting from that perspective. Each of the 5 opponents was played on 40 starting positions, 4 rounds per position, alternating colors. So a total of 800 games per match, and I am giving 4 consecutive match results, all with the same opponents, all played at a time control of 5+5 (5 minutes on the clock, 5 seconds increment added per move). I lost a game here and there due to data corruption on our big storage system, so some of the matches show 799 rather than 800 games, because once in a while the PGN for the last game would be corrupted (a different issue).

I ran these 800-game matches through Remi's BayesElo. You can look at the four sets of results, but imagine that in each of those tests, crafty-22.2 was a slightly different version with a tweak or two added. Which of the four looks the best? And then realize that all programs are identical for the 4 matches. How would one reliably draw any conclusion from a match containing only 800 games, when the error bar is significant and the variability is even more significant? First the data:

Code: Select all

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   121   42   41   160   68%   -18   17%
   2 Glaurung 1.1 SMP        61   42   41   160   60%   -18   13%
   3 Fruit 2.1               49   41   40   160   59%   -18   15%
   4 opponent-21.7           13   38   38   159   55%   -18   33%
   5 Crafty-22.2            -18   18   18   799   47%     4   19%
   6 Arasan 10.0           -226   42   45   160   23%   -18   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    81   42   41   160   63%   -17   16%
   2 opponent-21.7           61   38   38   159   62%   -17   33%
   3 Glaurung 1.1 SMP        46   42   41   160   58%   -17   13%
   4 Fruit 2.1               35   40   40   160   57%   -17   19%
   5 Crafty-22.2            -17   18   18   799   47%     3   19%
   6 Arasan 10.0           -205   42   45   160   26%   -17   16%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   113   43   41   160   66%   -12   12%
   2 opponent-21.7           73   39   38   159   63%   -12   32%
   3 Fruit 2.1               21   41   40   160   54%   -12   15%
   4 Crafty-22.2            -12   18   18   799   48%     2   18%
   5 Glaurung 1.1 SMP       -35   41   41   160   47%   -12   11%
   6 Arasan 10.0           -161   41   43   160   30%   -12   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   131   45   42   160   70%   -33   10%
   2 Fruit 2.1               64   41   40   160   63%   -33   19%
   3 Glaurung 1.1 SMP        25   41   40   160   58%   -33   15%
   4 opponent-21.7           13   37   37   160   57%   -33   36%
   5 Crafty-22.2            -33   18   18   800   45%     7   19%
   6 Arasan 10.0           -199   42   44   160   29%   -33   15%
Notice first that _everybody_ in the test is getting significantly different results each match. The overall order (with the exception of Glaurung 2 which stays at the top) flips around significantly.

Now does anyone _really_ believe that 800 games are enough? Later I will show some _much_ bigger matches as well, showing the same kind of variability. Here are two quickies that represent 25,000 games per match for two matches, just for starters (same time control):

Code: Select all

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123    8    8  5120   66%     2   15%
   2 Fruit 2.1               38    8    7  5119   55%     2   19%
   3 opponent-21.7           28    7    7  5119   54%     2   34%
   4 Crafty-22.2              2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP         2    8    8  5120   50%     2   14%
   6 Arasan 10.0           -193    8    9  5119   26%     2   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   118    8    8  5120   67%   -19   13%
   2 Fruit 2.1               42    8    8  5120   58%   -19   17%
   3 opponent-21.7           32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP        20    8    8  5120   55%   -19   12%
   5 Crafty-22.2            -19    4    4 25595   47%     4   19%
   6 Arasan 10.0           -193    8    8  5120   28%   -19   16%
The question you want to answer from the above is this: crafty-22.2 in the first run was slightly modified for the second run. Was the change good or bad? How sure are you? Then I will add that crafty-22.2 for _both_ runs was identical. Now which one is better? :) There is a 21 Elo difference between the two. The first result says 2 +/- 8, while the second says -19 +/- 4. The ranges don't even overlap, which points out that this kind of statistic is good for the sample under observation, but not necessarily representative of the total population of potential games without playing a _lot_ more games. Some would say that the second match says Crafty is somewhere between -15 and -23, which is OK. But then what does the first of the two big matches say? :)

"things that make you go hmmm......."

Looking at your data (and I am not talking about the 800-game runs but about the last table, of more than 20,000 games), it seems that Crafty is not the same.

The difference in rating for Crafty is bigger than the difference in rating for every one of the opponents, in spite of the fact that Crafty played more games.

All opponents scored better against Crafty in the second run:

Glaurung 2-epsilon/5 66%->67%
Fruit 2.1 55%->58%
opponent-21.7 54%->58%
Glaurung 1.1 SMP 50%->55%
Arasan 10.0 26%->28%

Uri
Sorry, but Crafty is _exactly_ the same. The main thing here is that Crafty played an equal number of games against each opponent, but the opponents did not play each other, so the numbers will differ because of that. If you notice, Crafty vs Crafty-21.7 produces twice as many draws as Crafty vs any other opponent. Just something that happens, since those two versions are reasonably similar (although not in evaluation, where they differ by a significant amount).

I simply run the test multiple times, in an automated shell script. Nothing changes. All logs and such are removed after each run, so that there is no file overhead for deleting an old log during a game, etc...

The only thing that changes is the "BayesElo" data file I copied here, which changes after each run as I save the results at the end... Every other file in the cluster home directory is completely identical from run to run.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:And I don't believe for a minute that these are just the first two 25,000-game runs he ever tried, and that Bob would also have made this post if they had been dead-on the same score. A 5+5 game lasts about 20 min, so 25,000 of them is 347 days on a single core. Seems to me that

I personally don't give a rat's a** about what you "believe". I just made the switch to the BayesElo output to see what it would say. I ran 4 quick tests just to see how the numbers looked from an Elo perspective as opposed to the raw match totals we have been using. I posted the results. If you want to believe this is made up, I suggest you just buzz off and stay out of the thread.
1) Bob is in this business for more than one year
2) he uses significantly more than a single core

So he must have made many such runs before?

And that of course would raise the question: why did he post this time, and why didn't he post any of these earlier runs? Not enough difference in those cases to be noteworthy? This has the words 'selected data' stamped all over it!
Jesus. Yes, I do use more than a single core. I either run on cluster A, which has 256 cores, or on cluster B, which has 540 cores. Do you _ever_ read anything? A 5+5 game lasts about 15 minutes on average, four per hour, so roughly 2,500 games per hour on one cluster and 1,000 per hour on the other. The first four runs were only 160 games per opponent (800 games each), which doesn't take very long. The last two runs were longer and took something under 24 hours total; I did not look to see exactly how long. I started them one day, checked later the next day, and they were done.
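The throughput arithmetic is easy to check; here is a rough sketch that assumes one game per core and the 15-minute average quoted above (it ignores scheduling overhead, so the numbers are only approximate):

Code: Select all

# Back-of-the-envelope cluster throughput: ~15 minutes per 5+5 game,
# one game per core (two per dual-CPU node), 256 and 540 cores.
games_per_core_hour = 60 / 15                    # about 4 games/core/hour

for name, cores in (("cluster A", 256), ("cluster B", 540)):
    per_hour = cores * games_per_core_hour
    hours    = 25600 / per_hour                  # one 25,600-game run
    print(f"{name}: ~{per_hour:,.0f} games/hour, "
          f"25,600 games in ~{hours:.0f} hours")

That comes out to roughly 1,000-2,000 games per hour and half a day to a day of wall-clock time for a 25,600-game run, broadly in line with the figures quoted above.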

So exactly _how_ did you miss the fact that I was running on a cluster, when _every_ post I have made here about this testing over the past two years has mentioned using a cluster? And you wonder why you don't get what is going on? Shoot, you don't even read what is going on...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:And I don't believe for a minute that these are just the first two 25,000-game runs he ever tried, and that Bob would also have made this post if they had been dead-on the same score. A 5+5 game lasts about 20 min, so 25,000 of them is 347 days on a single core. Seems to me that

1) Bob is in this business for more than one year
2) he uses significantly more than a single core

So he must have made many such runs before?

And that of course would raise the question: why did he post this time, and why didn't he post any of these earlier runs? Not enough difference in those cases to be noteworthy? This has the words 'selected data' stamped all over it!

Boy oh boy oh boy. :) Hate to say it, but what an idiot.

So I have _never_ posted such results previously? :)

As to "why this time?" I answered that already. We had been using raw match results, weighted according to how we rate each opponent, but I decided to try some tests using Elo calculations instead. I then automated the data collection using BayesElo, and ran some quick 800 game runs to get a feel for how the Elo numbers might look. And when I saw 'em, I simply posted the results. Guys like you make me seriously consider just not posting any data at all, because if you don't like the results, you want to make idiotic accusations. And draw conclusions that make you look like a real idiot (How can he do this running on a single core? [ in another post] and such statements are simply funny to read and should be embarassing to write.)

You remind me of Vincent. "Your speedup is _never_ 3.1 on 4 cores. Why don't you post the data and I'll prove it?" I posted the data and Martin quickly discovered that Vincent was correct, my speedup was not 3.1... it was actually 3.3... So no matter what I post, you are not going to accept the data. So why don't you move on, butt out, and let the ones that are interested continue to discuss this?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
Dirt wrote:Even as the worst consecutive runs he's ever had those last two runs would be surprising.
I agree. A 6-sigma deviation should occur only once every 500 million tries. Like I said, there clearly is something broken.

But the most-deviating result is actually a quite unreliable measure of the standard deviation of a Gaussian distribution. It would be much better if Bob presented the entire list of results, so that it can be seen what the whole distribution looks like. Does it just have a much larger standard deviation than it should (and by how much), or is it non-Gaussian, with long tails?

Otherwise it is more sensationalist reporting than science.
Again, what are you talking about? I presented _all_ the data that I had. I made 4 800-game runs. The Elo numbers were somewhat surprising, although I knew 800 games would not cut it anyway. But I wanted to quickly see how the Elo numbers would look. I then ran two "normal" runs, which finished right as I was making the original post. I had planned on reporting those later, but when I noticed they had finished, I just added 'em to the bottom rather than doing as I had said earlier in the post and reporting them in a later follow-on.

To date, I had started several more 800-game runs, but our cluster is temporarily dead, as it had to be shut down due to an A/C problem. When it comes back up, I can run as many 800-game runs as you want and post 'em all if you like. My only interest was in finding a small number of games that would reliably tell me whether a change was good or bad. That appears to be impossible, using either raw match results or Elo statistical analysis.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
hgm wrote:And I don't believe for a minute that these are just the first two 25,000-game runs he ever tried, and that Bob would also have made this post if they had been dead-on the same score. A 5+5 game lasts about 20 min, so 25,000 of them is 347 days on a single core. Seems to me that

1) Bob is in this business for more than one year
2) he uses significantly more than a single core

So he must have made many such runs before?

And that of course would raise the question: why did he post this time, and why didn't he post any of these earlier runs? Not enough difference in those cases to be noteworthy? This has the words 'selected data' stamped all over it!
The difference here is so large that it does not change much if this is selected data, say one run picked out of 10.

The probability that a difference this large happens is something like 1/1,000,000 (I did not calculate it; it is a quick estimate), and changing it to 1/100,000 is not going to cause me to believe that it is because of luck.

Uri
Let's not confuse this discussion with any rational thinking. I must have either used every computer on the planet for the past 10 years to produce enough runs to be able to select two with such a wide variance, or I must have made up the data. Or this is all real and is exactly as I have reported. I know the truth. I'm beginning to not care about what others think...
krazyken

Re: more on engine testing

Post by krazyken »

bob wrote: The only thing that changes is that the "BayesElo" data file I copied here changes after each run as I save the results on the end... Every other file in the cluster home directory is completely identical from run to run.
Sounds to me like there probably isn't going to be enough information from the 2 runs to prove or disprove whether the two truly were the same. Do you have any more matching data sets?
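For what it's worth, the comparison asked about here can be done directly from the win/draw/loss counts of the two runs. A sketch (the counts are left as placeholders, since the thread only quotes percentages):

Code: Select all

# Are two W/D/L results consistent with one and the same program strength,
# assuming independent games? Returns the gap between the two scores in
# standard deviations; under independence, much beyond ~3 is hard to explain.
from math import sqrt

def run_stats(wins, draws, losses):
    n   = wins + draws + losses
    mu  = (wins + 0.5 * draws) / n               # score fraction
    var = (wins + 0.25 * draws) / n - mu * mu    # per-game score variance
    return mu, var, n

def score_gap_sigmas(run1, run2):
    mu1, v1, n1 = run_stats(*run1)
    mu2, v2, n2 = run_stats(*run2)
    return abs(mu1 - mu2) / sqrt(v1 / n1 + v2 / n2)

# usage with the real counts from the two 25,600-game runs:
# print(score_gap_sigmas((w1, d1, l1), (w2, d2, l2)))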