more test data

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more test data

Post by bob »

Tony wrote:
bob wrote:
Tony wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
hgm wrote:Grand average 0.7, observed SD of result of mini-matches 8.1. Prediction according to the 0.8*sqrt(M) rule-of-thumb 7.2. Note, however, that the draw percentage is low (18%), so that the pre-factor really should be sqrt(1-18%) = sqrt(0.82) = 0.9. So the observed SD is smack on what we would expect for independent games. Nothing special in the 13-match result.

None of the individual deviations is exceptional: the largest, -17.7, corresponds to 2.18*sigma, which (two-sided) should have a frequency of 2.8%. In 13 samples it should thus occur on average 13*2.8% = 0.364 times. The probability that it occurs exactly once is thus 0.364/1! * exp(-0.364) = 25.3%. This is about as high as it gets. (One seldom sees things that have 100% probability, as that would mean that there could be no variation in what you see at all...) So not very remarkable.
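As a quick sanity check, a small Python sketch along these lines reproduces the numbers above (M = 80 games per mini-match and the 13-match count are taken from this discussion; the script is only an illustration, not part of the test setup):

    import math

    M = 80             # games per mini-match
    draw_rate = 0.18   # observed draw fraction
    n_matches = 13     # number of mini-matches
    largest = 17.7     # largest observed deviation from the grand average

    sd_rule_of_thumb = 0.8 * math.sqrt(M)                    # ~7.2
    sd_with_draws = math.sqrt(1 - draw_rate) * math.sqrt(M)  # ~8.1, matching the observed SD

    z = largest / sd_with_draws                    # ~2.2 sigma
    p_two_sided = math.erfc(z / math.sqrt(2))      # ~2.8% two-sided tail probability

    expected_outliers = n_matches * p_two_sided    # ~0.36 expected in 13 matches
    p_exactly_one = expected_outliers * math.exp(-expected_outliers)  # ~25% (Poisson)

    print(sd_rule_of_thumb, sd_with_draws, z, p_two_sided, expected_outliers, p_exactly_one)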

If one wanted to see anything unlikely in this data, it would be that the 4 largest deviations, although not very exceptional in themselves, all occur in the first 4 samples. One observation of this is not significant enough to draw any conclusions from, though.

Now if you had several such data sets, and they all had significantly larger deviations in the first ~4 mini-matches, or at least passed some objective test showing that these early variances are significantly larger on average than the others, it would be very interesting. Because the first mini-matches can only be different from the later ones if they somehow _know_ that they belong to the first four. And there should be no way they could know that, as that would violate the independence of the games.

So if such above-average variances were to occur systematically in the first few mini-matches, it would point to some defect in the setup subverting the independence. My guess would still be that this is merely a coincidence, and that other such data sets would show that the largest deviations are randomly scattered over the data set, and do not prefer particular mini-matches.

Or, in summary, nothing remarkable in this data; variances exactly as expected. But watch those early variances.
The wild cases don't always come first. They don't usually come last. They just "come", which makes using a small sample unreliable. Nothing more, nothing less.

The fact that the 4 largest deviations all occur in samples X, X+1, X+2, X+3
causes me to suspect that something is wrong in your testing, even apart from the fact that X=1 in your case.

I do not know what is wrong.
Maybe there is something that you did not think about that causes correlation between the results of the matches.

One possibility I thought about, besides the case that one program is slowed down by a significant factor, is that for some reason you do not use the same exe in different samples, or you use the same exe but with different parameters, so that the exe shows a different behaviour in different matches.

Uri
First, the highest deviation so far was not in the first 4. The next two were worse (14 and 15).

Second, you can think something is wrong all you want. The worst runs are not always first. The big set I published last week had the biggest variances farther down.

There is absolutely no difference in how things are executed. Each program is started exactly the same way each time, exactly the same options, exactly the same hash, nobody else running on that processor. And I monitor the NPS for each and every game inside Crafty.

Unfortunately the "other programs" can't produce logs because they just use one log file as opposed to crafty that can log for each separate game that is played.

But forget about the idea of things changing from game to game. I have one giant command file, produced automatically, one line per game, and I fire these off to individual computers one at a time until all machines are busy, and then as one completes I send that node the next game after it re-initializes...
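In outline, that kind of dispatch is just a shared work queue feeding a fixed pool of nodes. A minimal sketch of the idea (hypothetical file and node names and an ssh call standing in for whatever the real cluster scripts actually do):

    import queue
    import subprocess
    import threading

    NODES = ["node01", "node02", "node03", "node04"]   # hypothetical node names

    # one command line per game, produced automatically ahead of time
    work = queue.Queue()
    with open("games.cmd") as f:
        for line in f:
            if line.strip():
                work.put(line.strip())

    def worker(node):
        # keep this node busy: as soon as one game finishes, pull the next line
        while True:
            try:
                cmd = work.get_nowait()
            except queue.Empty:
                return
            subprocess.run(["ssh", node, cmd], check=False)
            work.task_done()

    threads = [threading.Thread(target=worker, args=(n,)) for n in NODES]
    for t in threads:
        t.start()
    for t in threads:
        t.join()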

You have to take this data for what it is, significant random variance in some matches, apparent stability in others. For the matches that produce more stable results, that is only because I am using the total score for the 80 games. If you were to look at the individual game strings, you would find lots of variance even though the final scores are nearly the same. So the variance is still there, it is just not expressed in the final result...
Would that mean that an engine that has learning enabled, and (accidentally) gets a fast win, will start a new game faster than the other machines, and that the extra knowledge (learn file) will give it a bigger chance of another fast win, which will go into the learning file, etc.?

Whereas if there is no accidental fast win, this won't happen?

Tony
I don't know if that could happen when using learning or not. But in this case, it can't possibly happen for several reasons:

(1) learning is explicitly disabled (position learning is all that could be used since there are no opening books used at all).

(2) no files are carried over between individual games. Each game is played in a completely "sterile" environment.

(3) to further rule that out, I have played these same tests but using a fixed number of nodes for both opponents, and there the games are identical run after run. Even with fixed node counts, position learning would still change scores, and moves, particularly for the side that lost. But with this kind of test, 64 matches produce 64 sets of 80 games that match perfectly, move for move. The only variance is very slight time differences (one move takes 5.21 seconds, in the next match it might take 5.23 or 5.19), but since time is not used to end searches in that test, it made no difference at all.
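Checking that kind of move-for-move reproducibility is mechanical; a small sketch (assuming a hypothetical dump format of one game per line, moves separated by spaces, with clock times already stripped) could be:

    def load_games(path):
        with open(path) as f:
            return [line.split() for line in f if line.strip()]

    run1 = load_games("match_run1.txt")
    run2 = load_games("match_run2.txt")

    # report any game whose move sequence differs between the two runs
    mismatches = [i for i, (a, b) in enumerate(zip(run1, run2), start=1) if a != b]
    if not mismatches and len(run1) == len(run2):
        print("all %d games identical move for move" % len(run1))
    else:
        print("games differing:", mismatches or "different game counts")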
OK, for me the big difference between fixed nodes and normal mode is strange. So either the result is wrong, or my expectation is.

OTOH

Flipping a coin should give 50%, with expected deviations etc.

Assuming the result of the coin flip depends on the force and angle with which I flip it, fixing this force and angle would result in a repeatable experiment.

But would that improve the measurement ? There would be a big difference with the nonfixed flips. But how do I know my fixed variables aren't biased ?

Did you find a noticeable difference between fixed-node matches and non-fixed ones? (results or std)

If the results are "equal" but the std is lower, fixed nodes could be an alternative (to the lengthy matches).


Tony
OK... for that I ran the following tests:

1. crafty version A, to fixed number of nodes, against a set of opponents to the same number of nodes. Results were 100% reproducible, as expected, since there are no random evaluation terms or random search extensions in any of the programs used. This gave me a result B.

2. Crafty version A', to fixed number of nodes also, against the same opponents. A' had a couple of minor eval changes that over a very large sample showed no significant difference from version A. But for the 80 game match, the results were _way_ different. I looked at the games and in each case, Crafty varied in several places, primarily due to the different shape of the tree with the slightly different evaluation.

3. Crafty version A vs the opponents to a fixed number of nodes, then Crafty version A vs the same opponents to the same fixed number of nodes, except Crafty searched 10K more nodes than in the previous test (I was doing 10M node tests, so 10K is a tiny change). The results were way different.

What that taught me was that there is so much variability, it is necessary to (a) run a large number of games, and (b) if I want to test with fixed nodes, use a bunch of different node limits to make sure my sample cross-section is broad enough to get a representative collection of games.
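Point (b) amounts to spreading the searches over several nearby node limits instead of a single one. A minimal sketch of generating such a spread (the 10M-node base comes from the test above; the step size and count are just assumptions):

    BASE_NODES = 10_000_000
    STEP = 200_000          # ~2% increments (assumed)
    N_LIMITS = 8

    node_limits = [BASE_NODES + i * STEP for i in range(N_LIMITS)]
    print(node_limits)
    # each limit would then be used for a full 80-game match, and the results pooled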

After a lot of fiddling around, it became pretty apparent that the best solution was to just use time, let the results vary, and run enough samples to smooth the variance out... which is what I am doing.
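A rough back-of-the-envelope calculation (illustrative only, using the 18% draw rate quoted earlier in the thread, not anything measured here) shows why "enough samples" turns out to be a lot of games if you want roughly 2-sigma confidence:

    import math

    def games_needed(elo_diff, draw_rate=0.18, sigmas=2.0):
        # per-game score variance near equal strength, scoring 1 / 0.5 / 0
        var = 0.25 * (1.0 - draw_rate)
        # slope of expected score vs Elo near 50%: d(score)/d(Elo) = ln(10)/1600
        score_margin = elo_diff * math.log(10) / 1600.0
        # games needed so that `sigmas` standard errors fit inside that margin
        return (sigmas * math.sqrt(var) / score_margin) ** 2

    for elo in (20, 10, 5):
        print("%2d Elo -> about %d games" % (elo, games_needed(elo)))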
bob

Re: more test data

Post by bob »

pedrox wrote:I imagine that you are running these tests on the university cluster. Have you tried running your tests on the laptop to see if the results are similar?

Yes. The only difference is the length of time it takes. :)

The test platform software was originally developed and tested on my laptop before I started doing cluster runs, in fact...

I have not found an opponent for my engine where, in a test of 80 games, I can win by 10 and in the next test lose by 17. When I repeat a test I normally find a 0-3 point difference.

Perhaps Crafty is an engine that is more irregular in its results than most.
I don't think so. I have used Crafty, Fruit 2, glaurung 1/2, arasan 9/10, gnuchessx 4 and 5, and a couple of others. Pick any two from that set and run this test and the results are equally random. I had originally thought I must have a bug, until I started trying this kind of test to see, and I discovered everyone in the above pool plays very non-deterministically.

No, I have not tried to analyze the data statistically to see if one varies more than another. I really don't care, since I have to deal with the variability no matter where it comes from.


I have only once tested my engine against Crafty, with opening books. After 40 games the result was even, which surprised me very much, although some people commented that the version of Crafty tested was inferior to previous ones (I believe it was version 20.1). In the following games Crafty began to win clearly; at the time I thought it was due to learning, and because my engine tended to repeat the same openings a lot. Perhaps it was not the learning, and Crafty really is more irregular in its results than other engines.
bob

Re: more test data

Post by bob »

OK, this thread is dead. Our cluster shared filesystem died, which terminated that test run. So there is no way to continue reporting on the results...

I have started a new run and will report the results if it does not terminate badly as well. This IBRIX filesystem hardware has been utterly unreliable and has almost ruined more tests than it has allowed to complete.