Why testing on only 80 positions is no good

Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Why testing on only 80 positions is no good

Post by Allard Siemelink »

hgm wrote:I guess the best method would be to generate a large number of positions by letting two engines with extensive opening books play against each other, and then randomly select positions from those games.
Even better might be to create the set of positions directly from the book!

Assuming that the book stores the number of games for each position, you can select the n positions that cover the most frequent opening lines.
Also, if the book stores win/draw/loss percentages, you can make sure the selected positions are not biased toward one color and that they have more than one playable move.
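
A minimal sketch of that selection in Python, assuming a hypothetical book dump where each entry carries a FEN, a game count, and win/draw/loss counts (the field names and thresholds here are illustrative, not from any particular book format):

Code:

def select_test_positions(book, n, max_bias=0.10, min_draw_rate=0.15):
    """book: iterable of dicts with 'fen', 'games', 'wins', 'draws',
    'losses' (counts from White's point of view). Returns the FENs of
    the n most frequently reached positions whose results look balanced."""
    balanced = []
    for pos in book:
        total = pos['wins'] + pos['draws'] + pos['losses']
        if total == 0:
            continue
        white_score = (pos['wins'] + 0.5 * pos['draws']) / total
        # Reject positions where one color scores far from 50%, and very
        # sharp positions where draws are rare (often a sign that only
        # one move is playable).
        if abs(white_score - 0.5) <= max_bias and pos['draws'] / total >= min_draw_rate:
            balanced.append(pos)
    # Of the surviving positions, keep the n most frequently reached.
    balanced.sort(key=lambda p: p['games'], reverse=True)
    return [p['fen'] for p in balanced[:n]]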
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

hgm wrote:
bob wrote:Test however you want. Draw whatever conclusions you want. I started this discussion as a way of pointing out just how dangerous and inaccurate it is to depend on a few games or a few hundred games, to predict whether a change is good or bad.
Indeed. Just as dangerous and inaccurate to depend on just a few, or just a few hundred, initial positions. So in cases where more accuracy is required than can be obtained in 80 games, that accuracy can only be increased by increasing the number of initial positions as well as the number of games.
I'm not sure B follows from A above. First, the positions are very broad. They are not all attacking positions, or blockaded positions, or the like. They really do come from a variety of common openings, and are different enough that nobody looking at them would think "OK, 23 is very much like 40" or something similar. Secondly, I get enough variation when I play from the same position that the play falls into different game paths quite naturally. 40 positions (actually 80 starts, since I play black and white, and the same position is quite different depending on who moves first) turn into hundreds of different games anyway, so there is significant variety. If every position came from (say) a QGA opening, then I would agree that they might leave a little to be desired. But they come from many openings that are considered reasonably equal, rather than being wildly imbalanced...

I guess the best method would be to generate a large number of positions by letting two engines with extensive opening books play against each other, and then randomly select positions from those games.
It keeps getting away from that.
Perhaps people get bored quickly with reiterating the obvious time after time...
So feel free to create whatever test methodology you want to use and use it. But a thread topic such as this is really meant more as a confrontational thing than an informative thing. I choose to not get involved beyond this point...
Too bad. It would be interesting to know how the conclusions you draw from the accumulated results of the 80-game minimatches would differ if you would just base them on the first 40 games, rather than all 80.

I noted from the data shown earlier:

Code:

01:  -+-=-++=-+=++-=-+-+=+-+--=-+=--++=-++--=-+=-===+-+===+=++-=++--+-+--+--+=--++--- 
02:  -+=+=++-=+--+-=-+-+==-+----+=-=++--+=--+--+---==-+-==+=+---++--+--+-=--+-=--+--- 
03:  -+--=++==+-++-+-+-+-+-==-+-++--+=---+--+-++--+-+===+=+-=+-==+==+---=---++--++=+- 
04:  -+=-=+=--=-++-+-+-+=+-+--+-+-==++=-+==-+--+=-+-+-+=+=--+---==--+-++----++-=++--- 
05:  -+=--+=--+=-+-+---=-+-+----++=-++=-+--=+=-+--+-+-+=+=+-+===++--+-+-==--+----+-+- 
06:  -+---=+=-+-+====+-+-+-+-=+-+==-=+--++--==++-=+---+=+=+-+--=++--+=-+-=--+=-----=- 
07:  -+-+--+--+=-+==-=-+-+-+--+=-=-=++-=++=-+-=+--+-+-+=+-+-++-=++-=+--=-=--++---+=+- 
08:  -+-+-=+--+-==-=-+=+=+=+--+-==-=+=--++----++--+=--+=--+-=+-----=+-++==--++=--+--= 
09:  -+-+-=+--+-+==--+-+-+-+--==----++==++=-+=++--+=+-=-=-+-+=--++-=+--=-+=---=-+--+- 
10:  -+=+-++--+-++-==+-+-+-==-+-+=-=++-=++-=+==+--+-=-+-==+==+--++--+=++==--+----+-+= 
11:  -+---====+-++=--==+---+--+-+----+--+=-=+-=+--+=+-+-+-+-==--=+--+=++-+--++=-++-== 
12:  -+=+=++--+-++-+-=-+-+-+=-+-+-=--+--+=--+=-+==+-=-+=--+-=+-==+-=+-++-=--++---+--- 
13:  -+-+--+==+--==+-+-+-=-+=-+-+=--+-=-++-=--++--+===--+=+=+---++-=+-=+-----+---+--- 
14:  -+-=-++--+-+==+-=-+=+-+=-+=+=--++=-++--+==+--==-===+-+=-+-=-+-=+--+-=-=++--+=-+- 
15:  -+-+-++--+-=+==-+-+-+-+--+-+=---+--++-=+-++---=+-+==-+-+---==-=+--=-+--+=--++-+= 
16:  -+=+=++=-+=++=+=--+-+=+--+=+=---+--++--+=-+==+==-+-+=+-++---+-=+-=+-==-++===+-+- 
17:  -=-+-=+--+-++==-+-+-+-+--+==+-=++--++--+==+-=+-==+=--+-++--++-=+--+----++---+-+- 
18:  -+==-++--+=====-+=+-+-+--+-++--++=-+=--+-++--+=+-+=+-+=++=-++-=+==+==--+--=++--- 
19:  =+-+=+-=-=-=+-+---+-+-+--+-++--++=-++-==-++-=+-+++=-=+-++--=+--+-++-==-++--++-=- 
20:  -+---++-=+=-+-+=+-=-+-+--+-+=--++=-+=-=+-++--+=+=+--=+-++--++--+-=+-+--++--++-+- 
21:  -+=-=++--+=+=-+=+-+-+-+----=--=++=-=---+-++---=+-+=+-+=+--=-+--+--+=+--=+---=--- 
22:  -+-=-=+-==--+=+-=-+-+=+----+=-=++-=++====++--+---+===+=++--++--+--+---=++---+-=- 
23:  -+-=-++=-+-+=---=-=-+--=-+=+-===-=-+=--==++--+=+=+=+=+=++--++-=+--+-----+-----+- 
24:  -+-+-+-=-+-++-+-+---+-+--+-=----+--++--+-=+--=-====+==-++--++-=--=+--=-++--++--- 
25:  =+---+==-+-+==+-=-+-+-+=++-+=--++--=+--+==+--+==-+-===-++--++=-+-==--==++=--+-+- 
26:  -+=+-++--+-=+=+=+-=-+=+--+-+-=-=+-=++--=-++--+-+-+=-=+-++---+-=+-=+-=--=+=-++==- 
27:  -+-+-++--=-+====+-=-+-==-+=-+---+--+=--=-=+===-+-=-+-+-++--=+--+-++-=--++=-++-+= 
28:  =+---+---+-++-+-+-+-+----=-+===++--++-=+==+--+---+-+-+-+=-==+-=+--+==-=+=---+-+- 
29:  =+-+-++-===+--+=+-+-+-=--+=+---++--+=-==-++-=+---+=+=+--+-=++--+-++==--++-=++--= 
30:  -+=+-++-=+-=+-+=+-=-+-+--+-+---++--++-=+=++-==-+=--+=+=++--++----+=-+--+---++--= 
31:  -+=+==+--+-++=+-+-+=+-+=-+-++=-++--+=--+=-+--+-+=--+=+-+--=++--+--+-=--++=--+--- 
32:  -+-=--+-=+-+-=-=+-+-+-+--+-+=--++=-+=-=+-++=-==--+-+=--++--++--+-++-+-=++=-++-+- 
32 distinct runs (2560 games) found 
that the variance among the 64-game sets starting from the same start position (32 with white, 32 with black) is enormous. The results vary from +14 to -29 across the set of 40 positions. That means that some of the positions are quite biased despite being played with both colors. And the statistical error in some of those is already quite low, as the white-vs-black bias is also quite extreme. (One position has 31 wins and one draw with black, while being overwhelmingly lost with the reversed color: no wins, but many draws. Such a match should have a very low variance, as the result is almost fixed by the color, which is not randomly chosen.)

So it seems that the variance contributed by position selection is enormous. The individual positions often do not give an indication of the relative engine strength. It would be interesting to see how "improved" versions of the engine would distribute that improvement over the positions.
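
For what it is worth, that per-position variance can be estimated directly from result strings like the ones above. A minimal sketch in Python, under one assumption the table does not state: that columns 2i and 2i+1 of each string hold the white and black games of position i (the real column order may differ):

Code:

from statistics import pvariance

# '+' win, '=' draw, '-' loss, as in the result strings above.
SCORE = {'+': 1.0, '=': 0.5, '-': 0.0}

def position_scores(runs):
    """runs: list of equal-length result strings, one per run.
    Returns the mean score per start position, averaged over all
    runs and both colors."""
    n_cols = len(runs[0])
    col_totals = [sum(SCORE[r[c]] for r in runs) for c in range(n_cols)]
    return [(col_totals[2 * i] + col_totals[2 * i + 1]) / (2 * len(runs))
            for i in range(n_cols // 2)]

def position_variance(runs):
    # Large values mean the choice of start position itself, not just
    # game-to-game noise, drives the spread in match results.
    return pvariance(position_scores(runs))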
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

Allard Siemelink wrote:
hgm wrote:I guess the best method would be to generate a large number of positions by letting two engines with extensive opening books play against each other, and then randomly select positions from those games.
Even better might be to create the set of positions directly from the book!

Assuming that the book stores the number of games for each position, you can select the n positions that cover the most frequent opening lines.
Also, if the book stores win/draw/loss percentages, you can make sure the selected positions are not biased toward one color and that they have more than one playable move.
How do you think AS chose these positions in the first place?

:)

It's not like he just grabbed 40 positions out of a random box...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

That is exactly the point for some positions. It is not that white always wins, or that one program always wins with white; it might well be that black just has no idea about the plan necessary to equalize or even win. And there are a couple of positions where Crafty (at least) has no clue what to do from either side. Granted, I simply will not allow it to play those openings at present, but I eventually want it to be well-rounded, so choosing a random set of positions is a workable solution.

I can also point out positions where Glaurung, or Fruit, or other programs just can't play well against Crafty, so it isn't a one-way street at all, either...

And such positions really do highlight a hole in the program that needs to be filled in...
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Why testing on only 80 positions is no good

Post by hgm »

I don't believe that a mere 40 positions can give an exhaustive summary of how engines can emerge from the opening book. They are just a sample. You argue that it is a very representative sample. Good. But from the variance of the results within the sample, one can estimate how large the variance over the entire population of out-of-book positions is. (This is especially true if it is a representative sample.) This observed variance is large. Based on it, one can calculate how large the variance in the position total would be over other samples of 40 (slightly different) positions. That puts a floor under the error of your strength measurement, no matter how many games you play.
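
Put as a formula, under the usual independence assumptions: if the true per-position expected scores have variance s^2, then sampling P positions leaves a standard error of at least sqrt(s^2/P) in the measured score, even with infinitely many games per position. A small illustration in Python (the numbers are made up):

Code:

import math

def accuracy_floor(s2_positions, n_positions):
    """Lower bound on the standard error of the measured match score,
    reached in the limit of infinitely many games per position."""
    return math.sqrt(s2_positions / n_positions)

# Illustrative numbers: if the per-position expected scores have a
# standard deviation of 0.2 (variance 0.04), then 40 sampled positions
# leave a floor of about 0.032, i.e. roughly 3 percentage points of
# match score, no matter how many games are played from each position.
print(accuracy_floor(0.04, 40))   # ~0.0316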

Have you tested whether altering the positions slightly (e.g. replacing the last two moves leading to them with two other commonly played moves) would alter the result? E.g. in a game that Crafty predominantly loses with both black and white, would it still do that from the modified position? Or could it suddenly predominantly win with black and white? In other words, is the total (b+w) result for a position a very wild function of the position, or a very smooth one that hardly changes when the locations of several pieces change?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

hgm wrote:I don't believe that a mere 40 positions can give an exhaustive summary of how engines can emerge from the opening book. They are just a sample. You argue that it is a very representative sample. Good. But from the variance of the results within the sample, one can estimate how large the variance over the entire population of out-of-book positions is. (This is especially true if it is a representative sample.) This observed variance is large. Based on it, one can calculate how large the variance in the position total would be over other samples of 40 (slightly different) positions. That puts a floor under the error of your strength measurement, no matter how many games you play.

Have you tested whether altering the positions slightly (e.g. replacing the last two moves leading to them with two other commonly played moves) would alter the result? E.g. in a game that Crafty predominantly loses with both black and white, would it still do that from the modified position? Or could it suddenly predominantly win with black and white? In other words, is the total (b+w) result for a position a very wild function of the position, or a very smooth one that hardly changes when the locations of several pieces change?
I have tested more than that. I have tested the original Nunn positions. I have tested the new Silver positions. I have tested Silver + Nunn positions. I have tested Silver + Nunn + some positions I have collected over the years.

The new Silver set seems to be better than the old Nunn set (more positions). Silver + Nunn did not change anything significantly other than the time required to run the test.

I have also done things like removing 3-4 positions whose results are either lopsided or symmetric with respect to color. All that does is change the final scores; it doesn't change the outcome when comparing two versions... Is 40 positions optimal? Don't know. Don't have enough time to try to figure that out. Is this the best set of 40 positions? Don't know. And again, don't have enough time to try to find better ones. These positions are giving us what we need to know, in a time-frame we can live with. If someone finds a better set of positions, I'll certainly use them. But I am currently spending my time testing and modifying Crafty, not working to develop yet another testing environment.

As for your question about slightly altering the positions: no, I have not done that. These positions are FEN positions, not PGN game fragments.

Any good program should be able to play either side of these positions reasonably, or else it has some serious holes that can be exploited...
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Why testing on only 80 positions is no good

Post by hgm »

Well, that more positions are better is to be expected. I don't know how many Nunn positions you used (was this the Nunn-10 or Nunn-25 set?), but the SD in going from 40 to 50 samples is of course only expected to drop by about 10%, and it is questionable whether that would be noticeable. Between 10 and 40 it would be a factor of 2.
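
The 1/sqrt(n) scaling behind those numbers, as a quick check in Python:

Code:

import math

# SD of the sample mean scales as 1/sqrt(n).
print(1 - math.sqrt(40 / 50))   # ~0.106: about a 10% drop in SD going from 40 to 50
print(math.sqrt(40 / 10))       # 2.0: a factor of 2 between 10 and 40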

It is indeed annoying that the positions are given as FEN, but as these are common openings it should of course be possible to find the move sequences that lead to them.

If any program is expected to play both sides reasonably, how do you explain the extreme deviations that occur in your data? Some individual positions don't seem to reflect the relative capabilities at all.

I am still curious if small improvements that you observe are spread evenly over the positions, or concentrated in just a few.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

hgm wrote:Well, that more positions are better is to be expected. I don't know how many Nunn positions you used (was this the Nunn-10 or Nunn-25 set?), but the SD in going from 40 to 50 samples is of course only expected to drop by about 10%, and it is questionable whether that would be noticeable. Between 10 and 40 it would be a factor of 2.
I used a test with 20 positions. Whether it started with 25 and I culled them, or what, I don't know. I think Albert published a revised set a year or so ago with 50 positions rather than 40, and I will probably switch to them when a good point comes up. I haven't yet, so that all of my comparisons and test results remain directly comparable...

It is indeed annoying that the positions are given as FEN, but as these are common openings it should of course be possible to find the move sequences that lead to them.

If any program is expected to play both sides reasonably, how do you explain the extreme deviations that occur in your data? Some individual positions don't seem to reflect the relative capabilities at all.
If you look at the positions, and look at the suggested book lines (Crafty can give the set of book moves played in each position), you will notice that none are simply "won by the side on move". But many are equal positions where things are happening on opposite sides: one side is supposed to attack on the queenside, the other is supposed to attack on the kingside. Some programs, mine included, don't get this right every time. And some are very passive (gnuchess comes to mind); it will lose from both sides, because it never attacks on the kingside and doesn't understand the need for counterplay on the queenside.

So there are plenty of positions about which two good humans would remark "the position is pretty equal, but it is unbalanced." Then it is up to the program to do something with it. I see two distinct cases regularly. (1) Crafty knows more about something than its opponent, which gives it an edge in those kinds of positions; pawn majorities come to mind, or the classic trapped-bishop problem at a2/h2/a7/h7. So it outplays its opponent based on that knowledge. (2) Crafty doesn't know a lot about a position, but the opponent has no clue at all and struggles to shuffle around, and Crafty slowly builds up enough pressure somewhere to break through.

Of course both of those work against Crafty just as easily when it plays good opponents. But winning all or losing all doesn't necessarily mean the position is bad or one-sided; it could just mean the programs are one-sided in that particular position, and one of them has plenty of room for improvement.

I am still curious if small improvements that you observe are spread evenly over the positions, or concentrated in just a few.
I don't have the data at present, but in my next test I will save the 80-character result strings to see what happens. However, I suspect that we are going to see just the normal variability in the matches, with an occasional extra win here or there to account for the improved result margin. It will be easier to show once I can run again. We are still dead here, waiting on the maint. bozos to get the new A/C compressor installed, and for our sysadmin to get some new network cards for the ibrix filesystem that has been pretty unreliable.
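
A minimal sketch in Python of the comparison proposed here, assuming the saved strings use the same '+'/'='/'-' encoding and column order as the table earlier in the thread (old_strings and new_strings are hypothetical variables holding the saved result strings for the two versions):

Code:

SCORE = {'+': 1.0, '=': 0.5, '-': 0.0}

def per_column_gain(old_runs, new_runs):
    """Each argument: list of 80-character result strings, one per run.
    Returns the mean score gain of the new version for each column
    (i.e. each start-position/color slot), to show whether an
    improvement is spread evenly or concentrated in a few positions."""
    def col_means(runs):
        return [sum(SCORE[r[c]] for r in runs) / len(runs)
                for c in range(len(runs[0]))]
    return [n - o for o, n in zip(col_means(old_runs), col_means(new_runs))]

# Usage, with hypothetical saved strings:
#   gains = per_column_gain(old_strings, new_strings)
#   print(sorted(enumerate(gains), key=lambda t: t[1], reverse=True)[:10])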

I'll post the next batch, but it might be a few days based on what I have seen so far...