more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:They teach me to not worry about some odd remote possibility if it is an odd remote possibility that _everyone_ has to deal with in their testing.
Flawed thinking. No possibility is remote compared to a 6-sigma deviation. Even an army of extra-terrestrial invaders that camped in one of your computers for the duration of the first match, before deciding that Earth's defenses were too strong to take on, is an extremely plausible hypothesis compared to the one that it is due to statistical fluctuation between two identical experiments.

If you don't want to play scientist, and divulge your data, you should not expect me to explain why it is rotten. Muddle on, if you like!
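(For concreteness, a minimal sketch of what a 6-sigma deviation means numerically, assuming a normal approximation to the aggregate match score; the figure is generic, not tied to any particular run:)

    import math

    # Two-sided probability of a deviation of at least 6 standard deviations
    # under a normal approximation to the aggregate match score.
    sigma = 6.0
    p = math.erfc(sigma / math.sqrt(2.0))
    print(p)   # ~2.0e-9, i.e. roughly 1 chance in 500 million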
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Dirt wrote:
hgm wrote:
mathmoi wrote:If Dr. Hyatt is right, the Crafty Elo of each sample should still vary "a lot".
I don't understand that conclusion. I would say that if two randomly picked 25,000-game samples from the 50,000 games differed too much, it would show you did not pick randomly.

That and only that.
That and nothing more. Quoth the raven, "nevermore"...

Or that there is something wrong with your, and BayesElo's, statistics.

Would such a test which failed to show the large swings he's been seeing convince Bob that something is wrong with his cluster testing, even if he can't find the cause?
No. First a side-track here.

I am certain that my test platform is far more consistent than anything a normal user can test on. The cluster is "behind" a front-end box that users have to log in to in order to run things. Users then submit shell scripts to the scheduling manager, which kicks the scripts off on nodes that are unused. The cluster nodes do not run any of the typical daemon processes that user systems would: no http daemon, no email, no crontab, etc. All they do is run whatever is sent to them. They all have identical processors, identical memory sizes, identical disk drives, and identical configurations (they are all cloned from a standard kernel setup whenever the cluster is cold-started; the only thing that is different for each node is its name and IP address). We run QAC monthly to pick up on network cards that are going bad, or whatever (we currently have 130 compute nodes, two of which are down waiting on parts from the last such test). Etc. The platform is as controlled as a machine can be, short of running each node in single-user mode with no network connection to the head, which would be next-to-useless.

Now, if any of those things are an issue, then they are even more of an issue for normal users with machines that are running other things. Windows is a good example. Boot your box, let it get all the way up, then run a network sniffer to watch the network traffic it produces, even when not running any user programs at all and with no one logged in.

So if I have a problem as described above, then, by George, everyone else has an even bigger one, and I don't see why it would be an issue for me in particular, since it can't be eliminated anywhere.

Whether or not BayesElo is right is something I can't/won't address, since I didn't write it and am not going to go through it with a fine-tooth comb. But since I have seen this variability for a couple of years, I suspect it has been there all along; it's just that nobody has ever had the resources, and taken the time, to make two such runs with the same version. What would be the point, and how long would it take the average person to play that many games?

I'm simply reporting what I found: 4 short tests, and two runs that I would consider "long" by any measure. We normally run 1+1 time controls to make the tests run faster, but for the first test of the Elo output I stretched the time out a bit to reduce the effect of timing randomness, even though it can't be eliminated completely.

What would be more interesting would be for someone else to play a bunch of very fast games (1+1 is OK) and see what their 800 games against 5 opponents look like. Like mine? Or something much more consistent? But nobody is willing to do that, and a couple are only willing to say "no way" while offering nothing concrete in rebuttal.

But running 160 games against each of 5 opponents at 1+1 ought to be doable. A single game would probably take around 3 minutes, so 20 per hour, or under two days for the full 800. Then repeat for another 2 days to have two runs to compare. I could run the same test several times at 1+1 here, and we could compare results. That would answer a lot of questions, because I don't believe I will be the _only_ one with this variability. I could even provide the source for the 5 versions plus the current Crafty so that the programs would be the same on both platforms...
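(A quick back-of-the-envelope check of that estimate; the 3 minutes per game is the figure quoted above, the rest is just arithmetic:)

    games = 160 * 5                  # 160 games against each of 5 opponents
    minutes_per_game = 3             # rough figure for 1+1 games
    games_per_hour = 60 / minutes_per_game
    hours = games / games_per_hour
    print(games, games_per_hour, hours, hours / 24)   # 800 games, 20/hour, 40 hours, ~1.7 days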
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

krazyken wrote:
bob wrote: The only thing that changes is that the "BayesElo" data file I copied here changes after each run, as I append the results at the end... Every other file in the cluster home directory is completely identical from run to run.
Sounds to me like there probably isn't going to be enough information from the 2 runs to prove/disprove whether the two truly were the same. Do you have any more matching data sets?
All I think I have is the complete PGN from the last long run. I usually submit a job that runs a test, computes the Elo stuff and appends it to a file, and saves the PGN in a temp directory. But the next run first erases the temp directory. I don't save the PGN across runs unless I am trying to understand why I lose every game on a single position to one opponent, playing either black or white, or some other sort of unusual behavior. At 25,000 games per run, the PGN becomes simply unmanageable.

But if you tell me what you want, I might be able to provide it, even if it means running the test again, once our A/C problem is fixed.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:Well, BayesElo won't tell you anything that you would not immediately see from the scores. The Elo differences are all very small, and you are entirely in the linear range, so that 1% corresponds to 7 Elo.
bob wrote:Depends on your definition of "nothing remarkable". If you mean "worthless for evaluating any sort of program change" then I will agree...
Worthless for evaluating changes smaller than 18*sqrt(2) Elo. That is a bit different from "any sort of".
Not in my context. I have always referred to A and A', where A' is a minor modification of A, which is the way we normally test and develop. Very few truly revolutionary ideas crop up nowadays; most changes are eval changes or minor search changes that are not going to be huge Elo boosters. In the above, 25 Elo is a _significant_ change, one that will rarely happen from a small modification.

I don't see what you are so hung up about. A sampling process will have a statistical error associated with it, which can be accurately calculated using no other information than that the result of a single game is limited to the interval [0,1] and that the games are independent. This requires nothing more fancy than taking a square root. And if the difference you want to measure is smaller than that error, then yes, the result is of course worthless.

One doesn't have to play 2,400 games to come to that conclusion. The 2,400 games more or less confirm that the error bars given by BayesElo are correct, though.
My point was that such a number of games cannot be used for what many are using it for. Still don't get that point? For example, adding endgame tables produces an Elo boost that is under that noise level, so does it help or hurt? It takes a huge number of games to answer this, and nobody (except me) has run them. I see Elo changes based on SMP results published regularly, after 20 games, etc. And while 20 games might be enough to suggest that 8 processors is stronger than 1, it isn't enough to measure how much stronger with any real accuracy. And when I see BayesElo numbers of X +/- N where N is very small, and I compare that to the results I just posted where the two ranges are disjoint, I can only conclude that N has to be taken with a huge grain of salt when dealing with computers, which are subject to outside and random influences all the time...
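(A minimal sketch of the arithmetic behind the "1% corresponds to 7 Elo" and square-root remarks above, assuming independent games and the standard logistic Elo model; the 800-game, 30%-draw example is illustrative, not taken from the actual runs:)

    import math

    def score_to_elo(score):
        """Elo difference implied by an expected score (logistic model)."""
        return -400.0 * math.log10(1.0 / score - 1.0)

    def elo_error_bar(n_games, score=0.5, draw_rate=0.0, sigmas=2.0):
        """Rough +/- error bar in Elo for n_games independent games."""
        p_win = score - draw_rate / 2.0
        var_per_game = p_win + draw_rate / 4.0 - score * score   # E[x^2] - E[x]^2
        se_score = math.sqrt(var_per_game / n_games)             # standard error of the score
        return score_to_elo(score + sigmas * se_score) - score_to_elo(score)

    # Near 50%, 1% of score is worth about 7 Elo:
    print(score_to_elo(0.51))                  # ~6.95
    # 800 games with a 30% draw rate, 2-sigma bar:
    print(elo_error_bar(800, draw_rate=0.3))   # ~20 Elo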
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:They teach me to not worry about some odd remote possibility if it is an odd remote possibility that _everyone_ has to deal with in their testing.
Flawed thinking. No possibility is remote compared to a 6-sigma deviation. Even an army of extra-terrestrial invaders that camped in one of your computers for the duration of the first match, before deciding that Earth's defenses were too strong to take on, is an extremely plausible hypothesis compared to the one that it is due to statistical fluctuation between two identical experiments.

If you don't want to play scientist, and divulge your data, you should not expect me to explain why it is rotten. Muddle on, if you like!
What do you mean "and divulge your data". I have always done that. If you want the PGN, I can easily re-run the tests and put the PGN files on my ftp box. If you want to wade thru 50,000 games. Doesn't bother me a bit. But you imply that I don't divulge data when in reality, any time anyone has asked, I've always been happy to make the data available...
krazyken

Re: more on engine testing

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
Michael Sherwin wrote:
bob wrote:These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
Okay, I have one computer to test on, what do I do? Just give up and quit, I guess.
I actually am not sure, to be honest. It is a real problem, however...
I think the answer is to become less dependent on statistics, and less obsessed with ELO points. I would think the right way to go would be to identify problems in play and understand what causes that problem. Playing many many games is more helpful if you can analyze those games and find out why you lost the ones you did.
How do you decide if something you add is good or bad? I can't count the number of times I have added something I was sure made the program better, only to find out later that it was worse... that's what I (and many others) are trying to measure with these test games...
If you are taking the approach of fixing a specific problem, the change either fixes it or it doesn't. Yes, the process of fixing the problem may make the engine play worse; perhaps the engine was tuned to compensate for the problem you fixed? You can roll back the change, going for the lesser of two evils, or you can go forward with a weaker engine for a time in order to identify and fix the bigger problems. The idea I am trying to communicate is that it is far more important to analyze lost games and look for improvements than it is to accumulate a collection of thousands of lost games with no idea of why they are lost. This may require more thought and time from the programmer than just trying out a few changes and running thousands of games to try to find evidence to support a change. But then again, it may be possible to write a program to go through those thousands of lost games and subject them to analysis and classification, to provide clues and patterns as to what needs fixing.
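(A minimal sketch of the kind of automated triage described above, assuming the games are available as PGN and that the python-chess package is installed; the file name, engine name, and classification keys are only placeholders:)

    import collections
    import chess.pgn   # python-chess

    def classify_losses(pgn_path, engine_name="Crafty"):
        """Tally lost games by opponent, color, and ECO code."""
        losses = collections.Counter()
        with open(pgn_path) as handle:
            while True:
                game = chess.pgn.read_game(handle)
                if game is None:
                    break
                white = game.headers.get("White", "?")
                black = game.headers.get("Black", "?")
                result = game.headers.get("Result", "*")
                eco = game.headers.get("ECO", "?")
                # A loss is 0-1 with the engine as White, or 1-0 with it as Black.
                if engine_name in white and result == "0-1":
                    losses[(black, "as white", eco)] += 1
                elif engine_name in black and result == "1-0":
                    losses[(white, "as black", eco)] += 1
        return losses

    if __name__ == "__main__":
        for key, count in classify_losses("match.pgn").most_common(10):
            print(count, key)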
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
Michael Sherwin wrote:
bob wrote:These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
Okay, I have one computer to test on, what do I do? Just give up and quit, I guess.
I actually am not sure, to be honest. It is a real problem, however...
I think the answer is to become less dependent on statistics, and less obsessed with ELO points. I would think the right way to go would be to identify problems in play and understand what causes that problem. Playing many many games is more helpful if you can analyze those games and find out why you lost the ones you did.
How do you decide if something you add is good or bad? I can't count the number of times I have added something I was sure made the program better, only to find out later that it was worse... that's what I (and many others) are trying to measure with these test games...
If you are taking the approach of fixing a specific problem, the change either fixes it or it doesn't. Yes, the process of fixing the problem may make the engine play worse; perhaps the engine was tuned to compensate for the problem you fixed? You can roll back the change, going for the lesser of two evils, or you can go forward with a weaker engine for a time in order to identify and fix the bigger problems. The idea I am trying to communicate is that it is far more important to analyze lost games and look for improvements than it is to accumulate a collection of thousands of lost games with no idea of why they are lost. This may require more thought and time from the programmer than just trying out a few changes and running thousands of games to try to find evidence to support a change. But then again, it may be possible to write a program to go through those thousands of lost games and subject them to analysis and classification, to provide clues and patterns as to what needs fixing.
There are two different issues here. Often it requires careful examination of one or several games to determine the root cause of excessive losses or draws. We do that regularly. And it is not hard (usually) to design and code a fix, and then tests on the positions examined can show "OK, it doesn't make the mistake that it was making."

But that is not enough. Playing better in the positions you specifically checked does not guarantee that you play as well or better in positions you did not think about or consider. And that is a big problem. We (our group) have made hundreds of changes over the past couple of years, and most of them show up worse under testing in real games than in the specific cases where the problem was isolated. Quite often there is a trade-off as well, in that you do a more accurate evaluation of something, but it slows you down some due to the extra computation. Did the improved accuracy more than compensate for the increased cost? Only a lot of games will tell.

So some sort of game testing is absolutely essential. But it is beginning to appear that the number of games required is nearly intractable unless the games are _very_ fast games. And quite often an eval change has a different effect in short games than in long games. I burned myself on that issue in 1986, as I have mentioned in the past... easy to do, hard to recognize, and it absolutely wastes a ton of time when something appears to be good but turns out not to be under better testing.
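(To put a rough number on "nearly intractable": a sketch of how many independent games are needed before a small Elo change stands out from the noise, using the same linear-range Elo arithmetic as in the earlier sketch; the 5 Elo target and 30% draw rate are only illustrative:)

    import math

    # Near a 50% score the Elo curve is roughly linear: ~695 Elo per unit of score.
    ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)

    def games_needed(target_elo, draw_rate=0.0, sigmas=2.0):
        """Games needed so that `sigmas` standard errors shrink to `target_elo`."""
        var_per_game = 0.25 * (1.0 - draw_rate)        # per-game variance at a 50% score, reduced by draws
        se_score = target_elo / (sigmas * ELO_PER_SCORE)
        return math.ceil(var_per_game / se_score ** 2)

    # Resolving a ~5 Elo tweak at ~95% confidence with a 30% draw rate:
    print(games_needed(5, draw_rate=0.3))   # on the order of 13,000-14,000 games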

As to automatically determining where a problem is, that would be wonderful. But it is not easy for humans to do, much less to write a program to do.
krazyken

Re: more on engine testing

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote: The only thing that changes is that the "BayesElo" data file I copied here changes after each run, as I append the results at the end... Every other file in the cluster home directory is completely identical from run to run.
Sounds to me like there probably isn't going to be enough information from the 2 runs to prove/disprove whether the two truly were the same. Do you have any more matching data sets?
All I think I have is the complete PGN from the last long run. I usually submit a job that runs a test, computes the Elo stuff and appends it to a file, and saves the PGN in a temp directory. But the next run first erases the temp directory. I don't save the PGN across runs unless I am trying to understand why I lose every game on a single position to one opponent, playing either black or white, or some other sort of unusual behavior. At 25,000 games per run, the PGN becomes simply unmanageable.

But if you tell me what you want, I might be able to provide it, even if it means running the test again, once our A/C problem is fixed.
Honestly I don't know what I want. ;) I am interested only because I like statistical problems. I guess if I had the PGNs from both sets, I might be able to do some looking for anomalies, such as the % of duplicate games, average game lengths, and such. I probably wouldn't mind having the referee program you use to run the games, as it may be of use to me. Now that I think about it, I probably want a cluster of my own as well.

I had an anomaly in some games with Crafty 22.1 just a couple of days ago running in an xboard tournament, where everything was fine for several matches, then for the next 18 launches it stopped reading the .craftyrc file, then it finished the remaining matches normally. I have no idea why that happened, as all the matches were started by the same Bash script looping through an array of opponents for the -scp.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote: The only thing that changes is that the "BayesElo" data file I copied here changes after each run, as I append the results at the end... Every other file in the cluster home directory is completely identical from run to run.
Sounds to me like there probably isn't going to be enough information from the 2 runs to prove/disprove whether the two truly were the same. Do you have any more matching data sets?
All I think I have is the complete PGN from the last long run. I usually submit a job that runs a test, computes the Elo stuff and appends it to a file, and saves the PGN in a temp directory. But the next run first erases the temp directory. I don't save the PGN across runs unless I am trying to understand why I lose every game on a single position to one opponent, playing either black or white, or some other sort of unusual behavior. At 25,000 games per run, the PGN becomes simply unmanageable.

But if you tell me what you want, I might be able to provide it, even if it means running the test again, once our A/C problem is fixed.
Honestly I don't know what I want. ;) I am interested only because I like statistical problems. I guess if I had the PGNs from both sets, I might be able to do some looking for anomalies, such as the % of duplicate games, average game lengths, and such. I probably wouldn't mind having the referee program you use to run the games, as it may be of use to me. Now that I think about it, I probably want a cluster of my own as well.

I had an anomaly in some games with Crafty 22.1 just a couple of days ago running in an xboard tournament, where everything was fine for several matches, then for the next 18 launches it stopped reading the .craftyrc file, then it finished the remaining matches normally. I have no idea why that happened, as all the matches were started by the same Bash script looping through an array of opponents for the -scp.
I had too many quirks myself with xboard; that is one reason I don't use it on the cluster (the other is that I don't need all the graphical data, and the network traffic is unbearable with 256 simultaneous games going on). :)

The only issue with my referee is that one of the opponents has to be Crafty, as the referee depends on Crafty to accurately tell it when a game is technically over, which is simpler than having the referee keep up with the board position and such here...