Why testing on only 80 positions is no good


hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Why testing on only 80 positions is no good

Post by hgm »

Chess allows an astronomically large number of positions, even for engines that play 100% deterministically (as the opponents might vary their moves). The strength of an engine is determined by how well it does on average over all these positions. To determine that, we cannot try them all, so we have to sample.

The sampling leads to a statistical error, which has various sources. One source is the intrinsic variability of the engine or its opponent: playing against the same opponent from the same position will not always (and sometimes will hardly ever) give the same game or the same result. If the games are independent, however, one can determine the scoring fraction for this position/opponent combination as precisely as one likes. The statistical error in the observed overall scoring fraction will tend to zero as 1/sqrt(NrOfGames).
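
To see the 1/sqrt(NrOfGames) behaviour concretely, here is a minimal Monte-Carlo sketch in Python (the 0.55 scoring probability is an invented figure, purely for illustration):

Code: Select all

import random

# Invented for illustration: the engine's true scoring probability
# from one position against one opponent.
TRUE_P = 0.55

def observed_fraction(n_games, rng):
    """Scoring fraction over n_games independent games."""
    return sum(rng.random() < TRUE_P for _ in range(n_games)) / n_games

rng = random.Random(1)
for n in (100, 400, 1600, 6400):
    # Spread of the observed fraction over 200 repeated matches.
    samples = [observed_fraction(n, rng) for _ in range(200)]
    mean = sum(samples) / len(samples)
    sd = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
    print("%5d games: statistical error ~ %.4f" % (n, sd))
# Quadrupling the number of games roughly halves the error: ~ 1/sqrt(n).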

This does not mean that after playing an infinite number of games, one knows exactly how strong the engine is. Even if it wins 90% (+/- 0.0001%) of the games, in other positions against the same opponent it might win only 20%. The position itself might be biased, but usually we eliminate that by playing it again with the sides reversed. If in that case the engine scores 30%, it has on average scored 60% against this opponent from this position. That might lead you to think it is better than the opponent. But in practice there will be other positions where (averaged over both colors) it scores only 55%, or even 40%. The win probability against this particular opponent is the average of this over all conceivable positions.

When we use only a limited number of positions, the sampling of these positions from the larger set of possible positions also causes a statistical error. If we are unlucky, we select an above-average number of positions where our engine happens to be poor (e.g. it has a tendency to make a certain mistake when playing one of the colors from that position). That would mean we observe an apparently poorer performance of our engine against that opponent than it would show in real games. That error would not decrease by increasing the number of games from the given positions. It can only decrease by increasing the number of different starting positions.

If we would just measure the performance of our engine (and its various versions) by repeating the same 80-game gauntlet many times against the same opponent, we would in particular run the risk of adopting changes that happen to change the preference for a certain bad move in one of the initial positions, while averaged over all possible positions (including those we don't test from) they would actually make the engine weaker. E.g. if one of the positions offered the possibility of a Knight move that is bad for deep tactical reasons, a positionally driven tendency to play that move (which does not even involve the opponent!) might cause this game to be mostly lost, instead of giving the average 50-50 score that would be more representative against this opponent. So you would lose 0.5/80 = 0.625% on the gauntlet score. "Freezing" that knight, by upping the piece-square value of the square it is on to the point where another move is better, will earn you that 0.625% in an easy way. Even if it loses 0.3% in average play on the remaining positions, and in the grand total of all possible positions, the change would still evaluate as good.

Conclusion: very precise measurement of the gauntlet result by repeating the gauntlet, to an accuracy better than or similar to the inverse of the number of positions in your gauntlet, will start to train your engine to score better in the gauntlet, even if this goes at the expense of playing good chess. One should be careful not to accept small "improvements" that are not sufficiently far above the statistical noise caused by sampling the positions.
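
As a back-of-the-envelope check (a sketch; the 0.10 spread of per-position expected scores is an assumed figure, not a measurement), the position-sampling noise floor for a 40-position gauntlet can be estimated like this:

Code: Select all

import math

n_positions = 40    # distinct start positions in the gauntlet
sigma_pos = 0.10    # assumed spread of per-position expected scores

# Error contributed by position sampling alone. It does not shrink
# with more games per position, only with more positions.
noise_floor = sigma_pos / math.sqrt(n_positions)
print("position-sampling noise ~ %.3f" % noise_floor)   # ~0.016

observed_gain = 0.00625   # the 0.625% gained by "freezing" the knight
print("clearly above the noise?", observed_gain > 2 * noise_floor)  # False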
nczempin

Re: Why testing on only 80 positions is no good

Post by nczempin »

hgm wrote: E.g. if one of the positions offered the possibility of a Knight move that is bad for deep tactical reasons, a positionally driven tendency to play that move (which does not even involve the opponent!) might cause this game to be mostly lost, instead of giving the average 50-50 score that would be more representative against this opponent. So you would lose 0.5/80 = 0.625% on the gauntlet score. "Freezing" that knight, by upping the piece-square value of the square it is on to the point where another move is better, will earn you that 0.625% in an easy way. Even if it loses 0.3% in average play on the remaining positions, and in the grand total of all possible positions, the change would still evaluate as good.
Wouldn't this effect tend to average out over the 80 positions?
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Why testing on only 80 positions is no good

Post by hgm »

Yes, it would. But never perfectly, as 80 is only a finite number. The statistical error in your observed average score due to the choice of initial positions would decrease as the inverse square root of the number of positions, just as the statistical error due to move-choice variability approaches zero when the number of games played from each position tends to infinity.

But having the number of games tend to infinity while keeping the number of positions fixed would never reduce the magnitude of the statistical error due to position selection. By always measuring from the same positions with more and more games your observed scores would converge to some limit, but in general they will converge to the wrong value. One can only get the correct limit value by increasing both the number of games and the number of positions. Driving up the number of games per position while the error is dominated by the choice of positions does not improve the accuracy. It just means you are measuring sampled noise very precisely.
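
A small simulation of this effect (a sketch; all the per-position scoring probabilities are invented): with a fixed set of 40 positions, more games per position converge the estimate ever more precisely to the subset mean, not to the true strength:

Code: Select all

import random

rng = random.Random(7)

def clamp(x):
    return min(max(x, 0.0), 1.0)

# Invented 'true' expected scores; overall strength is the mean over
# the whole space of positions, but the gauntlet samples a fixed subset.
all_positions = [clamp(rng.gauss(0.55, 0.10)) for _ in range(10000)]
true_strength = sum(all_positions) / len(all_positions)

fixed_subset = all_positions[:40]   # the 40 gauntlet positions, never varied

def gauntlet(positions, games_per_pos):
    score = 0
    for p in positions:
        score += sum(rng.random() < p for _ in range(games_per_pos))
    return score / (len(positions) * games_per_pos)

for g in (10, 100, 1000):
    print("%4d games/pos: estimate %.4f (truth %.4f)"
          % (g, gauntlet(fixed_subset, g), true_strength))
# The estimate converges, but to the mean of the 40 positions,
# not to true_strength.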
nczempin

Re: Why testing on only 80 positions is no good

Post by nczempin »

hgm wrote:Yes, it would. But never perfectly, as 80 is only a finite number. The statistical error in your observed average score due to the choice of initial positions would decrease as the inverse square root of the number of positions, just as the statistical error due to move-choice variability approaches zero when the number of games played from each position tends to infinity.

But having the number of games tend to infinity while keeping the number of positions fixed would never reduce the magnitude of the statistical error due to position selection. By always measuring from the same positions with more and more games your observed scores would converge to some limit, but in general they will converge to the wrong value. One can only get the correct limit value by increasing both the number of games and the number of positions. Driving up the number of games per position while the error is dominated by the choice of positions does not improve the accuracy. It just means you are measuring sampled noise very precisely.
Okay, so what would be the most practical way to test assuming the above?
a) for mature engines
b) for immature engines
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Why testing on only 80 positions is no good

Post by hgm »

One way would be to evaluate a change based on a gauntlet starting from a limited number of positions, and then repeat the test with a gauntlet of different initial positions, to see if the conclusion is the same. Of course it only makes sense to do this for a small improvement; for an overwhelming improvement you would do better from pretty much any position. So something like having or not having a penalty for doubled pawns.

If you would get the opposite conclusion from one set of positions than from the other, you could test whether this difference disappears when the number of games per position (against the same opponents) is increased.
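
In Python, the decision procedure would look something like this (a sketch; run_gauntlet stands in for whatever harness actually plays the games):

Code: Select all

def evaluate_change(run_gauntlet, positions_a, positions_b):
    """Accept a change only if two independent position sets agree.

    run_gauntlet(positions, version) -> average score; a placeholder
    for the real test harness.
    """
    gain_a = run_gauntlet(positions_a, "new") - run_gauntlet(positions_a, "old")
    gain_b = run_gauntlet(positions_b, "new") - run_gauntlet(positions_b, "old")
    if gain_a > 0 and gain_b > 0:
        return "accept"
    if gain_a < 0 and gain_b < 0:
        return "reject"
    # The sets disagree: increase the number of games per position
    # (against the same opponents) and re-test.
    return "inconclusive"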
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

hgm wrote:Chess allows an astronomically large number of positions, even for engines that play 100% deterministically (as the opponents might vary their moves). The strength of an engine is determined by how well it does on average over all these positions. To determine that, we cannot try them all, so we have to sample.

The sampling leads to a statistical error, which has various sources. One source is the intrinsic variability of the engine or its opponent: playing against the same opponent from the same position will not always (and sometimes will hardly ever) give the same game or the same result. If the games are independent, however, one can determine the scoring fraction for this position/opponent combination as precisely as one likes. The statistical error in the observed overall scoring fraction will tend to zero as 1/sqrt(NrOfGames).

This does not mean that after playing an infinite number of games, one knows exactly how strong the engine is. Even if it wins 90% (+/- 0.0001%) of the games, in other positions against the same opponent it might win only 20%. The position itself might be biased, but usually we eliminate that by playing it again with the sides reversed. If in that case the engine scores 30%, it has on average scored 60% against this opponent from this position. That might lead you to think it is better than the opponent. But in practice there will be other positions where (averaged over both colors) it scores only 55%, or even 40%. The win probability against this particular opponent is the average of this over all conceivable positions.

When we use only a limited number of positions, the sampling of these positions from the larger set of possible positions also causes a statistical error. If we are unlucky, we select an above-average number of positions where our engine happens to be poor (e.g. it has a tendency to make a certain mistake when playing one of the colors from that position). That would mean we observe an apparently poorer performance of our engine against that opponent than it would show in real games. That error would not decrease by increasing the number of games from the given positions. It can only decrease by increasing the number of different starting positions.

If we would just measure the performance of our engine (and its various versions) by repeating the same 80-game gauntlet many times against the same opponent, we would in particular run the risk of adopting changes that happen to change the preference for a certain bad move in one of the initial positions, while averaged over all possible positions (including those we don't test from) they would actually make the engine weaker. E.g. if one of the positions offered the possibility of a Knight move that is bad for deep tactical reasons, a positionally driven tendency to play that move (which does not even involve the opponent!) might cause this game to be mostly lost, instead of giving the average 50-50 score that would be more representative against this opponent. So you would lose 0.5/80 = 0.625% on the gauntlet score. "Freezing" that knight, by upping the piece-square value of the square it is on to the point where another move is better, will earn you that 0.625% in an easy way. Even if it loses 0.3% in average play on the remaining positions, and in the grand total of all possible positions, the change would still evaluate as good.

Conclusion: very precise measurement of the gauntlet result by repeating the gauntlet, to an accuracy better than or similar to the inverse of the number of positions in your gauntlet, will start to train your engine to score better in the gauntlet, even if this goes at the expense of playing good chess. One should be careful not to accept small "improvements" that are not sufficiently far above the statistical noise caused by sampling the positions.
You can say whatever you want. But an occasional form of consistency would be nice.

Yes, using just 40 positions leads to sampling error.

So does searching to a fixed number of nodes.

There is _absolutely_ no difference in the two approaches, because both are attempts to limit the variability, and they both do it in the same way.

Choosing representative starting positions is far easier than choosing representative node counts...

So either you want to play _every_ possible game (I consider that impossible) or you want to restrict the openings to a selected group, and then you have to make choices. I have explained that I did test Albert's positions pretty carefully. First by eye. Second by playing real games between several programs using a common opening book. No, that doesn't test "favorite openings", but it does test general chess skill. And I didn't find any significant difference between using a common book and using Albert's positions. Except that Albert's positions require a whole lot fewer games to get a stable result.

Test however you want. Draw whatever conclusions you want. I started this discussion as a way of pointing out just how dangerous and inaccurate it is to depend on a few games or a few hundred games, to predict whether a change is good or bad.

It keeps getting away from that. So feel free to create whatever test methodology you want to use and use it. But a thread topic such as this is really meant more as a confrontational thing than an informative thing. I choose to not get involved beyond this point...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Why testing on only 80 positions is no good

Post by hgm »

bob wrote:Test however you want. Draw whatever conclusions you want. I started this discussion as a way of pointing out just how dangerous and inaccurate it is to depend on a few games or a few hundred games, to predict whether a change is good or bad.
Indeed. Just as dangerous and inaccurate as depending on just a few, or just a few hundred, initial positions. So in cases where more accuracy is required than can be obtained in 80 games, that accuracy can only be increased by increasing the number of initial positions as well as the number of games.

I guess the best method would be to generate a large number of positions by letting two engines with extensive opening books play against each other, and then randomly select positions from those games.
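
That selection could look like this (a sketch using the python-chess library, which is an assumed dependency; the ply window 16-24 is an arbitrary choice):

Code: Select all

import random
import chess.pgn

def sample_positions(pgn_path, n_wanted, min_ply=16, max_ply=24, seed=1):
    """Pick one random early-middlegame position from each game in a PGN."""
    rng = random.Random(seed)
    fens = []
    with open(pgn_path) as f:
        while len(fens) < n_wanted:
            game = chess.pgn.read_game(f)
            if game is None:
                break                          # no more games in the file
            target_ply = rng.randint(min_ply, max_ply)
            board = game.board()
            for ply, move in enumerate(game.mainline_moves(), start=1):
                board.push(move)
                if ply == target_ply:
                    fens.append(board.fen())   # record the start position
                    break
    return fens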
It keeps getting away from that.
Perhaps people get bored quickly with reiterating the obvious time after time...
So feel free to create whatever test methodology you want to use and use it. But a thread topic such as this is really meant more as a confrontational thing than an informative thing. I choose to not get involved beyond this point...
Too bad. It would be interesting to know how the conclusions you draw from the accumulated result of the 80-game mini-matches would differ if you would just base them on the first 40 games, rather than all 80.

I noted from the data shown earlier:

Code: Select all

01:  -+-=-++=-+=++-=-+-+=+-+--=-+=--++=-++--=-+=-===+-+===+=++-=++--+-+--+--+=--++--- 
02:  -+=+=++-=+--+-=-+-+==-+----+=-=++--+=--+--+---==-+-==+=+---++--+--+-=--+-=--+--- 
03:  -+--=++==+-++-+-+-+-+-==-+-++--+=---+--+-++--+-+===+=+-=+-==+==+---=---++--++=+- 
04:  -+=-=+=--=-++-+-+-+=+-+--+-+-==++=-+==-+--+=-+-+-+=+=--+---==--+-++----++-=++--- 
05:  -+=--+=--+=-+-+---=-+-+----++=-++=-+--=+=-+--+-+-+=+=+-+===++--+-+-==--+----+-+- 
06:  -+---=+=-+-+====+-+-+-+-=+-+==-=+--++--==++-=+---+=+=+-+--=++--+=-+-=--+=-----=- 
07:  -+-+--+--+=-+==-=-+-+-+--+=-=-=++-=++=-+-=+--+-+-+=+-+-++-=++-=+--=-=--++---+=+- 
08:  -+-+-=+--+-==-=-+=+=+=+--+-==-=+=--++----++--+=--+=--+-=+-----=+-++==--++=--+--= 
09:  -+-+-=+--+-+==--+-+-+-+--==----++==++=-+=++--+=+-=-=-+-+=--++-=+--=-+=---=-+--+- 
10:  -+=+-++--+-++-==+-+-+-==-+-+=-=++-=++-=+==+--+-=-+-==+==+--++--+=++==--+----+-+= 
11:  -+---====+-++=--==+---+--+-+----+--+=-=+-=+--+=+-+-+-+-==--=+--+=++-+--++=-++-== 
12:  -+=+=++--+-++-+-=-+-+-+=-+-+-=--+--+=--+=-+==+-=-+=--+-=+-==+-=+-++-=--++---+--- 
13:  -+-+--+==+--==+-+-+-=-+=-+-+=--+-=-++-=--++--+===--+=+=+---++-=+-=+-----+---+--- 
14:  -+-=-++--+-+==+-=-+=+-+=-+=+=--++=-++--+==+--==-===+-+=-+-=-+-=+--+-=-=++--+=-+- 
15:  -+-+-++--+-=+==-+-+-+-+--+-+=---+--++-=+-++---=+-+==-+-+---==-=+--=-+--+=--++-+= 
16:  -+=+=++=-+=++=+=--+-+=+--+=+=---+--++--+=-+==+==-+-+=+-++---+-=+-=+-==-++===+-+- 
17:  -=-+-=+--+-++==-+-+-+-+--+==+-=++--++--+==+-=+-==+=--+-++--++-=+--+----++---+-+- 
18:  -+==-++--+=====-+=+-+-+--+-++--++=-+=--+-++--+=+-+=+-+=++=-++-=+==+==--+--=++--- 
19:  =+-+=+-=-=-=+-+---+-+-+--+-++--++=-++-==-++-=+-+++=-=+-++--=+--+-++-==-++--++-=- 
20:  -+---++-=+=-+-+=+-=-+-+--+-+=--++=-+=-=+-++--+=+=+--=+-++--++--+-=+-+--++--++-+- 
21:  -+=-=++--+=+=-+=+-+-+-+----=--=++=-=---+-++---=+-+=+-+=+--=-+--+--+=+--=+---=--- 
22:  -+-=-=+-==--+=+-=-+-+=+----+=-=++-=++====++--+---+===+=++--++--+--+---=++---+-=- 
23:  -+-=-++=-+-+=---=-=-+--=-+=+-===-=-+=--==++--+=+=+=+=+=++--++-=+--+-----+-----+- 
24:  -+-+-+-=-+-++-+-+---+-+--+-=----+--++--+-=+--=-====+==-++--++-=--=+--=-++--++--- 
25:  =+---+==-+-+==+-=-+-+-+=++-+=--++--=+--+==+--+==-+-===-++--++=-+-==--==++=--+-+- 
26:  -+=+-++--+-=+=+=+-=-+=+--+-+-=-=+-=++--=-++--+-+-+=-=+-++---+-=+-=+-=--=+=-++==- 
27:  -+-+-++--=-+====+-=-+-==-+=-+---+--+=--=-=+===-+-=-+-+-++--=+--+-++-=--++=-++-+= 
28:  =+---+---+-++-+-+-+-+----=-+===++--++-=+==+--+---+-+-+-+=-==+-=+--+==-=+=---+-+- 
29:  =+-+-++-===+--+=+-+-+-=--+=+---++--+=-==-++-=+---+=+=+--+-=++--+-++==--++-=++--= 
30:  -+=+-++-=+-=+-+=+-=-+-+--+-+---++--++-=+=++-==-+=--+=+=++--++----+=-+--+---++--= 
31:  -+=+==+--+-++=+-+-+=+-+=-+-++=-++--+=--+=-+--+-+=--+=+-+--=++--+--+-=--++=--+--- 
32:  -+-=--+-=+-+-=-=+-+-+-+--+-+=--++=-+=-=+-++=-==--+-+=--++--++--+-++-+-=++=-++-+- 
32 distinct runs (2560 games) found 
that the variance between the 64-game sets starting from the same start position (32 with white, 32 with black) is enormous. The results vary from +14 to -29 over the set of 40 positions. That means that some of the positions are quite biased despite being played with both colors. And the statistical error in some of those is already quite low, as the white-vs-black bias is also quite extreme. (One position has 31 wins and one draw with black, while being overwhelmingly lost (no wins, but many draws) with the reversed color. Such a match should have a very low variance, as the result is almost fixed by the color (which is not randomly chosen).)

So it seems that the variance contributed by position selection is enormous. The individual positions often do not give an idea of the relative engine strength. It would be interesting to see how "improved" versions of the engine would distribute that improvement over the positions.
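
The per-position bookkeeping behind those numbers can be reproduced directly from the result strings above (a sketch; it assumes games 2k and 2k+1 in each row are the two colors of position k, with '+' a win, '=' a draw and '-' a loss):

Code: Select all

SCORE = {'+': 1.0, '=': 0.5, '-': 0.0}

def per_position_scores(runs):
    """Average score per start position over all runs.

    runs: the 32 result strings of 80 characters each.
    """
    rows = [r.strip() for r in runs]
    n_pos = len(rows[0]) // 2
    totals = [0.0] * n_pos
    for row in rows:
        for k in range(n_pos):
            totals[k] += SCORE[row[2 * k]] + SCORE[row[2 * k + 1]]
    games = 2 * len(rows)       # 64 games per position
    return [t / games for t in totals]

# Positions scoring near 0.0 or 1.0 are the 'quite biased' ones;
# the spread of these 40 numbers is the position-selection variance.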
Uri Blass
Posts: 10297
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Why testing on only 80 positions is no good

Post by Uri Blass »

hgm wrote:
bob wrote:Test however you want. Draw whatever conclusions you want. I started this discussion as a way of pointing out just how dangerous and inaccurate it is to depend on a few games or a few hundred games, to predict whether a change is good or bad.
Indeed. Just as dangerous and inaccurate as depending on just a few, or just a few hundred, initial positions. So in cases where more accuracy is required than can be obtained in 80 games, that accuracy can only be increased by increasing the number of initial positions as well as the number of games.

I guess the best method would be to generate a large number of positions by letting two engines with extensive opening books play against each other, and then randomly select positions from those games.
It keeps getting away from that.
Perhaps people get bored quickly with reiterating the obvious time after time...
So feel free to create whatever test methodology you want to use and use it. But a thread topic such as this is really meant more as a confrontational thing than an informative thing. I choose to not get involved beyond this point...
Too bad. It would be interesting to know how the conclusions you draw from the accumulated result of the 80-game mini-matches would differ if you would just base them on the first 40 games, rather than all 80.

I noted from the data shown earlier:

[results table snipped; see above]
The variance between the 64-game sets starting from the same start position (32 with white, 32 with black) is enormous. The results vary from +14 to -29 over the set of 40 positions. That means that some of the positions are quite biased despite being played with both colors. And the statistical error in some of those is already quite low, as the white-vs-black bias is also quite extreme. (One position has 31 wins and one draw with black, while being overwhelmingly lost (no wins, but many draws) with the reversed color. Such a match should have a very low variance, as the result is almost fixed by the color (which is not randomly chosen).)

So it seems that the variance contributed by position selection is enormous. The individual positions often do not give an idea of the relative engine strength. It would be interesting to see how "improved" versions of the engine would distribute that improvement over the positions.

Note that 31 wins and one draw with black does not prove that the position is a win for black, or even that the position is better for black.

It is clearly possible that the engines simply do not know how to play the right moves for white.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why testing on only 80 positions is no good

Post by bob »

hgm wrote:
bob wrote:Test however you want. Draw whatever conclusions you want. I started this discussion as a way of pointing out just how dangerous and inaccurate it is to depend on a few games or a few hundred games, to predict whether a change is good or bad.
Indeed. Just as dangerous and inaccurate as depending on just a few, or just a few hundred, initial positions. So in cases where more accuracy is required than can be obtained in 80 games, that accuracy can only be increased by increasing the number of initial positions as well as the number of games.

I guess the best method would be to generate a large number of positions by letting two engines with extensive opening books play against each other, and then randomly select positions from those games.
It keeps getting away from that.
Perhaps people get bored quickly with reiterating the obvious time after time...
So feel free to create whatever test methodology you want to use and use it. But a thread topic such as this is really meant more as a confrontational thing than an informative thing. I choose to not get involved beyond this point...
Too bad. It would be interesting to know how the conclusions you draw from the accumulated result of the 80-game mini-matches would differ if you would just base them on the first 40 games, rather than all 80.

I noted from the data shown earlier:

[results table snipped; see above]
The variance between the 64-game sets starting from the same start position (32 with white, 32 with black) is enormous. The results vary from +14 to -29 over the set of 40 positions. That means that some of the positions are quite biased despite being played with both colors. And the statistical error in some of those is already quite low, as the white-vs-black bias is also quite extreme. (One position has 31 wins and one draw with black, while being overwhelmingly lost (no wins, but many draws) with the reversed color. Such a match should have a very low variance, as the result is almost fixed by the color (which is not randomly chosen).)

So it seems that the variance contributed by position selection is enormous. The individual positions often do not give an idea of the relative engine strength. It would be interesting to see how "improved" versions of the engine would distribute that improvement over the positions.
Again, I will say "so what if they are 'quite biased'???"

I am not trying to determine whether my program is better than program X by playing these positions. I am trying to compare two versions of my program and determine which is better by playing these positions against a pool of opponents. I claim that if I do something that improves my results, then that version is better. If I have positions that I lose (or win) from both sides, again, "so what"??? If I lose them from both sides, I have something to improve. If I win them from both sides, then I have something that needs to be protected in future changes so that, if possible, I don't undo it.

All I care about is winning more games with A' than with A, regardless of the starting positions. And these positions (Silver's) are more than representative enough to give me a large cross-section of potential real games I will play. In tournament games, if my results show me that I should avoid Sicilian lines as black, then I will do so. But one of my long-term goals will be to eliminate whatever weaknesses are causing that problem, I don't want to ignore them.

So I am not sure where this discussion is going. Since I don't tune for just one position, there is very little danger that a change that helps just one position will somehow make me play worse overall, without changing the results against other positions at the same time. If I feel I need more accuracy, I can certainly include more positions, and play more games. But at present we are doing pretty well in detecting whether or not reasonable changes produce better or worse play... And that is our goal...

As far as the first 40 games vs the entire 80, I could certainly have my analysis program compare the first 40 to the last 40 to the entire set of 80 to see what happens. However, the 40 positions are chosen to give a broad cross-section of reachable opening positions, so reducing the set is necessarily going to exclude some openings that are reasonably common. I'm not sure whether that would hurt or not, and have not tried to answer that. There's lots of questions that can be asked and answered. But right now we only care about "Is A' better than A?"

There will always be positions Crafty wins with black and white, and positions where it loses with both. So yes, the variability of the outcome overall is high. And sometimes those results stand across many runs, sometimes they do not. I can play enough games to eliminate the variability, but reducing the number of positions obviously begs for more of that.

As far as the "color dictates win" case, that is very common in real chess as well, for some openings. However, it doesn't have to remain that way as new changes are added. Sometimes lots of those wins slip away, sometimes lots of those black losses become draws or wins which shows an improvement. Not all 50-50 positions are unwinable by black or by white, the games just go that way. The result can get better with eval/search changes...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Why testing on only 80 positions is no good

Post by hgm »

bob wrote:Again, I will say "so what if they are 'quite biased'???"
I think I explained that quite well above. Strongly biased positions point out that there is something one of the engines does wrong very early. There would be a very strong drive to repair that very specific problem, as it is weighted so heavily in the total result.
I am not trying to determine whether my program is better than program X by playing these positions. I am trying to compare two versions of my program and determine which is better by playing these positions against a pool of opponents. I claim that if I do something that improves my results, then that version is better. If I have positions that I lose (or win) from both sides, again, "so what"??? If I lose them from both sides, I have something to improve. If I win them from both sides, then I have something that needs to be protected in future changes so that, if possible, I don't undo it.
It would still be very relevant to know if such an improvement in average score came mainly from a single position, or if it was evenly distributed over all positions. In other words, what is the variance of the _improvement_ with respect to positions.

Improvements that mainly come from a single position would have a very high variance wrt position, and the measurement on them can be considered less reliable. You would not even have detected that improvement if you had not included that position. Improvements that work 'across the board' have a low variance (almost all positions improve their scores by the same amount), and can thus be considered a very reliable indication that the improvement is real.

This is especially important as there is no reason at all to assume that the position bias is normally distributed. For non-normal distributions, averaging (or adding) is a dangerous game. The occasional flukes can spoil more accuracy than the averaging gains. The recommended statistical procedure is to ignore the positions with the most extreme bias.
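
That procedure could be sketched as follows (the trimming of two positions per tail is an arbitrary choice for illustration):

Code: Select all

def improvement_stats(scores_old, scores_new, trim=2):
    """Mean and variance of the per-position improvement A -> A',
    after discarding the `trim` most extreme positions in each tail.

    scores_old, scores_new: average score per position for the two versions.
    """
    deltas = sorted(new - old for old, new in zip(scores_old, scores_new))
    kept = deltas[trim:len(deltas) - trim] if trim else deltas
    mean = sum(kept) / len(kept)
    var = sum((d - mean) ** 2 for d in kept) / len(kept)
    return mean, var
# Low variance: the improvement works 'across the board' and is reliable.
# High variance: the gain comes mainly from a few positions; distrust it.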
All I care about is winning more games with A' than with A, regardless of the starting positions. And these positions (Silver's) are more than representative enough to give me a large cross-section of potential real games I will play. In tournament games, if my results show me that I should avoid Sicilian lines as black, then I will do so. But one of my long-term goals will be to eliminate whatever weaknesses are causing that problem, I don't want to ignore them.

So I am not sure where this discussion is going. Since I don't tune for just one position, there is very little danger that a change that helps just one position will somehow make me play worse overall, without changing the results against other positions at the same time. If I feel I need more accuracy, I can certainly include more positions, and play more games. But at present we are doing pretty well in detecting whether or not reasonable changes produce better or worse play... And that is our goal...

As far as the first 40 games vs the entire 80, I could certainly have my analysis program compare the first 40 to the last 40 to the entire set of 80 to see what happens. However, the 40 positions are chosen to give a broad cross-section of reachable opening positions, so reducing the set is necessarily going to exclude some openings that are reasonably common. I'm not sure whether that would hurt or not, and have not tried to answer that. There's lots of questions that can be asked and answered. But right now we only care about "Is A' better than A?"

There will always be positions Crafty wins with black and white, and positions where it loses with both. So yes, the variability of the outcome overall is high. And sometimes those results stand across many runs, sometimes they do not. I can play enough games to eliminate the variability, but reducing the number of positions obviously begs for more of that.

As far as the "color dictates win" case, that is very common in real chess as well, for some openings. However, it doesn't have to remain that way as new changes are added. Sometimes lots of those wins slip away, sometimes lots of those black losses become draws or wins which shows an improvement. Not all 50-50 positions are unwinable by black or by white, the games just go that way. The result can get better with eval/search changes...