more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: more on engine testing

Post by Carey »

Bob;

First and foremost, I need to say that I am not a statistician, a mathematician, a professional chess player, a good chess player or a good chess programmer.

I'm just a hobbyist chess programmer who has followed this thread and previous similar threads with interest.

I also believe that you are indeed getting the results you report. You aren't lying to us, distorting things, etc. You are simply reporting results, and those results are likely to be valid and not due to some flawed system.

Now, having said that...


I wonder if the problem isn't so much the statistics as the fact that this is computer vs. computer.

Bob, you said that as a test, you ripped out large parts of Crafty's evaluator and still had trouble detecting the change. Wouldn't that suggest that what matters in these computer vs. computer matches is the tactical search rather than the evaluator?

I've always heard that in computer vs. computer, the one with the better search is likely to win, regardless of the evaluator.

Perhaps you could get statistically significant results if you experimented with the search itself instead? That might give some insight into the validity of your testing method.

Chess is tactics, above all else, and you've even done tests with hashing errors returning totally bogus results to the search. You've said that it just flat out doesn't matter, because the search itself is so inherently robust that it'll fix just about any problem with bad scores being returned.

And that the deeper the errors are, the less important they are.

(Not your exact words, of course.)

Then that would seem to suggest a similar situation for the evaluator. As long as it's not flatly wrong and doesn't encourage bad behavior, wouldn't the tactical search fix just about anything?

And the deeper you search, the more easily it would fix small problems with the evaluator and hide small improvements in it. It'd kind of blur the evaluator results into a bit of a smear.

I'm assuming that Crafty's search is probably better than most of the programs you are testing.

Too bad we don't have a "TECH"-style program that could be used as a benchmark. (I still think Gillogly had the right idea: a benchmark program using good algorithms but a poor evaluator, which could be used as a 'fixed point' to judge everything else against.)



I also wonder whether these kinds of tests you are doing would apply to human vs. human and human vs. computer matches.

I have no way to test or judge, but I would suspect that human vs. human matches wouldn't see the kind of issues you are seeing. The history of chess play, with the rating systems consistently giving stronger players higher ratings, suggests the system works about right.

As for the human vs. computer matches, I don't know... I suspect they wouldn't have your problems either.


I think the problem is likely to be inherent in computer vs. computer testing.

It's so much about the search itself that the best you can do for the evaluator is get it in the right area.

After that, for high-quality tuning and testing of the evaluator... I don't know.
krazyken

Re: more on engine testing

Post by krazyken »

bob wrote:
krazyken wrote:If you do have the PGN from the last game run, it might be useful to see how many duplicate games are in there. Perhaps the sample size is significantly smaller than 25000.
I think they were removed when I tried to re-run the test. Then the A/C went south. I will queue up a 25,000 game run and post the PGN as soon as this gets fixed. No idea about duplicates, although there must be quite a few since there are 40 starting positions played from black and white side for 80 different starting points for each pair of opponents.

BTW wouldn't "duplicates" be a good thing here? More consistency? If every position always produced 2 wins and 2 losses, or 2 wins and 2 draws, then the variability would be gone and the issue would not be problematic.
Well, if you have a large number of duplicates, your error margin will be larger. To illustrate, take an extreme case: if your first sample had 25,000 unique games and the second sample had 800 unique games, how useful would comparing the two samples be? The error margin on the second sample would go from +/-4 to +/-18. One of the assumptions of good statistics is a good random sampling technique. Here the population is all possible games starting from the given positions with the given algorithms. You are collecting a random sample from those games by running matches. If there are duplicates in the sample, it is not representative of the population.
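For what it's worth, here is roughly where margins like that come from, as a back-of-the-envelope sketch (the draw rate and the near-50% score assumption are just illustrative guesses, not figures from Bob's data):

Code: Select all

import math

def elo_error_margin(games, draw_rate=0.35, z=1.96):
    """Approximate 95% error margin in Elo for a match of `games` games.

    Assumes an evenly matched pairing (score near 50%) and a guessed draw
    rate; both are illustrative assumptions, not measured values.
    """
    # Per-game standard deviation of the score (win=1, draw=0.5, loss=0):
    # variance = 0.25 - draw_rate / 4 when the expected score is 0.5.
    sigma = math.sqrt(0.25 - draw_rate / 4.0)
    score_margin = z * sigma / math.sqrt(games)
    # Slope of the Elo curve at a 50% score: about 7 Elo per percentage point.
    elo_per_score = 400.0 / (math.log(10) * 0.25)
    return score_margin * elo_per_score

for n in (25000, 800):
    print(f"{n:6d} games: about +/- {elo_error_margin(n):.0f} Elo")

The exact numbers depend on the draw rate and the score, but the key point is that the margin scales with 1/sqrt(games), so shrinking the effective sample from 25,000 to 800 inflates it by a factor of about sqrt(25000/800), roughly 5.6.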
Richard Allbert
Posts: 794
Joined: Wed Jul 19, 2006 9:58 am

Re: more on engine testing

Post by Richard Allbert »

Hi,

My simple opinion is

a. the results are real, and disturbing for the normal "one computer" programmer

but !

b. to calculate a "realistic" rating for the programs, the engines used against Crafty need to play each other in a round robin, 800 games each, and these results need to be included in the rating calculation.

Then it might be a bit more accurate.


Richard
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:If you do have the PGN from the last game run, it might be useful to see how many duplicate games are in there. Perhaps the sample size is significantly smaller than 25000.
I think they were removed when I tried to re-run the test. Then the A/C went south. I will queue up a 25,000 game run and post the PGN as soon as this gets fixed. No idea about duplicates, although there must be quite a few since there are 40 starting positions played from black and white side for 80 different starting points for each pair of opponents.

BTW wouldn't "duplicates" be a good thing here? More consistency? If every position always produced 2 wins and 2 losses, or 2 wins and 2 draws, then the variability would be gone and the issue would not be problematic.
Well, if you have a large number of duplicates, your error margin will be larger. To illustrate, take an extreme case: if your first sample had 25,000 unique games and the second sample had 800 unique games, how useful would comparing the two samples be? The error margin on the second sample would go from +/-4 to +/-18. One of the assumptions of good statistics is a good random sampling technique. Here the population is all possible games starting from the given positions with the given algorithms. You are collecting a random sample from those games by running matches. If there are duplicates in the sample, it is not representative of the population.
Maybe you are thinking about this in the wrong way. If there are a lot of duplicates (and we should know in a couple of days, as I can now use 28 nodes on the cluster and have already started a 25K game run), wouldn't the set of non-duplicate games represent the actual expected outcomes? And the more duplicates you get, the more convinced you should be that those duplicated games represent what happens most of the time. It is not as if there are billions of possible games and we somehow randomly chose 25,000 that are mostly duplicates while the total population actually contains very few duplicates.
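Counting them is easy enough once the PGN is posted; a minimal sketch (it calls two games duplicates when their movetext is identical, and the throwaway PGN parsing here is mine, not part of the actual test harness):

Code: Select all

from collections import Counter

def game_movetexts(pgn_path):
    """Yield the movetext of each game in a PGN file (very minimal parser)."""
    moves = []
    with open(pgn_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line.startswith("["):      # a tag pair: start of a game's header block
                if moves:                 # the previous game's movetext is complete
                    yield " ".join(moves)
                    moves = []
            elif line:                    # movetext line
                moves.append(line)
    if moves:
        yield " ".join(moves)

def duplicate_report(pgn_path):
    counts = Counter(game_movetexts(pgn_path))
    total = sum(counts.values())
    print(f"{total} games, {len(counts)} unique, {total - len(counts)} duplicates")
    for movetext, n in counts.most_common(5):
        if n > 1:
            print(f"  {n}x  {movetext[:60]}...")

# duplicate_report("crafty-25000.pgn")   # hypothetical file name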

Programs are completely deterministic, except for time usage, which can vary by a second or so per move depending on how time is measured within each engine. And that second per move turns into a couple of million extra (or fewer) positions searched per move (at 2M nodes per second or so), which can certainly introduce some randomness at points in the tree. Most moves won't change, to be sure, but once any move changes, we are into a different game whose result can be influenced by other timing changes later on.

I might try to experiment with more positions, but my initial interest was in finding representative positions, playing 2 games per opponent per position, and using that as a measuring tool for progress. It isn't enough games. So then you can try either more opponents, or more positions, or both, but either greatly adds to the computational requirements. More positions lead to the potential for other kinds of errors if there are several duplicated "themes", so that becomes yet another issue. Then the opponents become an issue, because some might be similar to each other, and that introduces yet another bias...
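Just to put rough numbers on how quickly the requirements grow when you add positions or opponents, a toy calculation (the repeats, minutes per game, and games-in-parallel figures are placeholders, not the real cluster settings):

Code: Select all

def games_needed(positions, opponents, repeats_per_color=1):
    """Total games in one run: every position, both colors, every opponent."""
    return positions * 2 * opponents * repeats_per_color

def wall_time_hours(games, minutes_per_game=10, games_in_parallel=28):
    """Very rough wall-clock estimate; both parameters are made-up placeholders."""
    return games * minutes_per_game / 60.0 / games_in_parallel

for positions, opponents in ((40, 5), (80, 5), (80, 10)):
    g = games_needed(positions, opponents, repeats_per_color=10)
    print(f"{positions} positions x {opponents} opponents x 10 repeats -> "
          f"{g:6d} games, ~{wall_time_hours(g):.0f} h")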
Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:
henkf wrote:so basically what you are saying is that statistics doesn't apply to computer chess, since the results of computer chess games are too random?
No, what I am saying is that perhaps what Elo derived as a good measure for _human_ chess might not apply with equal quality to computer chess games.

I'm not a statistics expert, but it seems to me that if you typically get 6-sigma differences over consecutive samples of 25,000 games, no number of games will fix this. In that case the clusters, no matter how many cores they have, are lost on you, and the only benefit of your testing is its contribution to global warming.
Not to mention headaches caused by examining huge quantities of data. But I tend to agree at present.

As said before, I'm not a statistics expert and I'm not stating whether you are wrong or right; however, if you are right, my conclusion seems legitimate to me.
The discussion is not about human games but about statistics.
Statistics assume nothing about human games.

If you get something that does not seem logical based on statistics, even if every result is random, then the conclusion is that something is probably broken.

If I get the same results as you then I could only say that I believe that it is broken and I do not know why.

Claiming that statistics is broken is simply not convincing, and claiming that it was luck is not convincing either.

If I get a 30-0 result between 2 programs from fixed positions, programs that I believed to be the same, then my conclusion is simply that they are not the same, or that something else was broken in the data (for example, the same program always got white and I did not play with reversed colors because of a bug).

If I lost the PGN of the games and do not know what the problem was, that does not change the fact that I believe there was some problem.

I am not going to argue more about this point because it seems that I am not going to convince people except people who agree with me.

Uri
Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

"I've always heard that in computer vs. computer, the one with the better search is likely to win, regardless of the evaluator."

This claim is completely wrong.
Change your evaluation to material only and you are going to lose every game.

"I'm assuming that Crafty's search is probably better than most of the programs you are testing."

No reason to think that it is better.
Crafty is losing in the tests against Fruit and Glaurung.

The only correct claim is

"Perhaps you could get statistically significant results if you experimented with the search itself instead?"

Maybe, but if that is the case, the only reason is that for Crafty improvements in the search can be more significant than improvements in the evaluation, not that evaluation is unimportant.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
bob wrote:
henkf wrote:so basically what you are saying is that statistics doesn't apply to computer chess, since the results of computer chess games are too random?
No, what I am saying is that perhaps what Elo derived as a good measure for _human_ chess might not apply with equal quality to computer chess games.

I'm not a statistics expert, but it seems to me that if you typically get 6-sigma differences over consecutive samples of 25,000 games, no number of games will fix this. In that case the clusters, no matter how many cores they have, are lost on you, and the only benefit of your testing is its contribution to global warming.
Not to mention headaches caused by examining huge quantities of data. But I tend to agree at present.

As said before, I'm not a statistics expert and I'm not stating whether you are wrong or right; however, if you are right, my conclusion seems legitimate to me.
The discussion is not about human games but about statistics.
Statistics assume nothing about human games.

If you get something that does not seem logical based on statistics, even if every result is random, then the conclusion is that something is probably broken.
Or that one or more basic assumptions about what is being tested are invalid...

If I get the same results as you then I could only say that I believe that it is broken and I do not know why.

Claiming that statistics is broken is simply not convincing, and claiming that it was luck is not convincing either.

If I get a 30-0 result between 2 programs from fixed positions, programs that I believed to be the same, then my conclusion is simply that they are not the same, or that something else was broken in the data (for example, the same program always got white and I did not play with reversed colors because of a bug).
And you wouldn't check this by looking at the PGN and log files? I did, to make certain that for several 800-game matches, every last PGN game matched Crafty's log file exactly.

If I lost the PGN of the games and do not know what the problem was, that does not change the fact that I believe there was some problem.

I am not going to argue more about this point because it seems that I am not going to convince people except people who agree with me.

Uri
No, in my case it is because the data I produced suggests that something is different from what we expect. Clearly games are not dependent on each other in computer play, where no learning is possible, which is different from human play. Clearly the programs do not change in any way, because for each game they are restarted from a fresh copy obtained from the master copy in a central location. Clearly there are no books in use, because, simply, there are no books in use and no book files (or anything else other than a .ini file for each).

There are lots of things that could explain this odd behavior; one is that perhaps the assumptions about the basic "population" are not quite correct. Normally distributed? Perhaps not. Excessively long streaks that suggest non-randomness? Perhaps. And the list goes on. So it is not just a suggestion that the test methodology is flawed, as I simply cannot conceive of any way in which any two games are somehow dependent on each other, except through the programs playing the games and the starting position. But the result is dependent on how each program plays, and nothing else. I even have long-term performance graphs for each and every node showing the load, which varies from 0 to 2.0 depending on whether nothing is running or two simultaneous games are in progress... and nothing unusual pops in at non-random times to influence games. So rather than suspecting the testing methodology here, I suspect the underlying assumptions are the problem.
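On the streaks question, a simple runs test over the win/loss sequence of one pairing would at least put a number on it; a sketch (the sample sequence below is invented, not real match output):

Code: Select all

import math

def runs_test_z(results):
    """Wald-Wolfowitz runs test on a binary sequence of results.

    `results` is a list of 1 (win) and 0 (loss); draws would have to be
    dropped or folded in some other way.  Returns the z-score of the observed
    number of runs; |z| well above ~2-3 hints that games are not independent.
    """
    n1 = sum(results)
    n2 = len(results) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("need both wins and losses")
    runs = 1 + sum(1 for a, b in zip(results, results[1:]) if a != b)
    n = n1 + n2
    expected = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1))
    return (runs - expected) / math.sqrt(variance)

# Invented example: a long winning streak followed by a long losing streak.
sample = [1] * 30 + [0] * 30
print(f"z = {runs_test_z(sample):+.2f}")   # strongly negative: far too few runs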

But many don't want to hear that, because it would mean that testing is a _serious_ issue if you want useful results.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: more on engine testing

Post by Dirt »

bob wrote:BTW wouldn't "duplicates" be a good thing here? More consistency? If every position always produced 2 wins and 2 losses, or 2 wins and 2 draws, then the variability would be gone and the issue would not be problematic.
Let's say one of the starting positions always results in one of two games, one a win for White and the other a loss. If there is some small change, maybe the barometric pressure or something, that can flip the outcome from one to the other, then we would see inconsistent results. By examining the PGN this sort of problem should be easy to spot, because there would be an extremely large change in one or at most a few starting positions.
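Something like this would do the check, assuming you have already tallied per-position scores for the engine under test in each run (the dict layout and the example numbers are purely illustrative):

Code: Select all

def position_swings(run_a, run_b, top=5):
    """Print the starting positions whose average score changed most between runs.

    run_a / run_b map a position id to a list of game scores (1.0 win,
    0.5 draw, 0.0 loss) from the tested engine's point of view.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    swings = sorted(
        ((abs(mean(run_a[p]) - mean(run_b[p])), p) for p in run_a.keys() & run_b.keys()),
        reverse=True,
    )
    for delta, pos in swings[:top]:
        print(f"position {pos}: average score changed by {delta:.2f}")

# Invented example: position 17 flips from two wins to two losses between runs,
# the "big change in one or a few positions" signature described above.
run_a = {17: [1.0, 1.0], 18: [0.5, 0.5]}
run_b = {17: [0.0, 0.0], 18: [0.5, 0.5]}
position_swings(run_a, run_b)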

Someone who had a less stable environment, with many random factors affecting the outcome, wouldn't see the same problem.
User avatar
Bill Rogers
Posts: 3562
Joined: Thu Mar 09, 2006 3:54 am
Location: San Jose, California

Re: more on engine testing

Post by Bill Rogers »

Carey
A chess program without an evaluation routine cannot really play chess in the normal sense. It could only make random moves, no matter how fast it searched. A search routine all by itself only generates moves and does not know which move is better. In fact, it does not even know when it makes a capture or what the captured piece is worth.
Even the simplest evaluation subroutine, one which only knows good captures from bad ones, won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any kind of real chess playing skill, computer or not.
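To make "simplest evaluation" concrete, this is roughly all a bare material counter amounts to (the board representation here is invented for the example, not any real engine's data structure):

Code: Select all

PIECE_VALUE = {"P": 100, "N": 300, "B": 300, "R": 500, "Q": 900, "K": 0}

def material_eval(board, side_to_move="white"):
    """Material balance in centipawns from side_to_move's point of view.

    `board` maps square names to piece letters, uppercase for White and
    lowercase for Black -- an illustrative representation only.
    """
    score = 0
    for piece in board.values():
        value = PIECE_VALUE[piece.upper()]
        score += value if piece.isupper() else -value
    return score if side_to_move == "white" else -score

# White is a knight up in this fragment of a position: prints 300
print(material_eval({"e4": "P", "f3": "N", "e5": "p", "e8": "k", "e1": "K"}))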
Bill
User avatar
Graham Banks
Posts: 44245
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: more on engine testing

Post by Graham Banks »

Michael Sherwin wrote:All the chess engine raters out there do a fair to good job of rating the engines, and their work, taken as a whole and averaged, is rather accurate.
Correct.
gbanksnz at gmail.com