New testing thread



Re: Correlated data discussion

Post by bob »

xsadar wrote:
bob wrote:
xsadar wrote: This point makes sense, but I don't like it. That means we would need 10 times as many positions as I originally thought for an ideal test. Also, if we don't play the positions as both white and black, it seems (to me at least) to make it even more important that the positions be about equal for white and black. I hope you're kind enough, Bob, to make the positions you finally settle on (however many that may be) available to the rest of us.
Of course. However, once I run the first run (cluster is a bit busy but I just started a test so results will start to trickle in) I would like to start a discussion about how to select the positions.

To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic, because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K or more positions.
That's the main thing I was worried about.
What I have done is to take the PGN collection we use for our normal wide book (good-quality games) and modify the book-creation procedure so that at wtm move 11 (both sides have played 10 moves) I write out the FEN. The good side is that most of these positions are probably decent. The downside is that this does not cover unusual openings very well, and might not cover some at all. So you might find that your program does well in normal positions, yet never learn that it handles off-the-wall positions poorly...
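Roughly, that extraction step looks like this (a sketch only, assuming the python-chess library; file names are illustrative and this is not the actual book-creation code):

```python
import chess
import chess.pgn

# Collect the position at wtm move 11 (both sides have played 10 moves) from
# every game in the book PGN, de-duplicated; castling/en-passant fields come
# straight from the replayed game, so they cannot be bogus.
fens = set()
with open("book_games.pgn") as pgn:                 # illustrative file name
    while (game := chess.pgn.read_game(pgn)) is not None:
        board = game.board()
        for move in game.mainline_moves():
            if board.fullmove_number == 11 and board.turn == chess.WHITE:
                fens.add(board.fen())
                break
            board.push(move)

with open("positions.fen", "w") as out:
    out.write("\n".join(sorted(fens)) + "\n")
```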

So there is a ton of room for further discussion. But let me get some hard Elo data from these positions. I want to run them 4 times so that I get 4 sets of Elo data, which will hopefully be very close to the same each time...
I hope it gives some good results. This is with Crafty playing both black and white against each engine for each position, right?
OK, let me recap (and recant) just a bit. In selecting the positions, I had a couple of screw-ups that I have now fixed. The main problem was that the way I was capturing the FEN could attach a bogus castling status, which made the match worthless, of course, since some programs revert to the normal starting position on a bad FEN, some use the FEN as is but ignore the bogus castling status, etc. That has been fixed. I also took the positions after move 9, where I intended to use positions after 10 moves. I have fixed all this, and as a result I now have 3891 positions for a first cut. Each will be played twice per opponent, which with 5 opponents is 38,910 games total. I may well try the alternate idea of playing 1/5 of the positions against each opponent, so that no two opponents play the same position against Crafty... That is a minor change to the shell script that creates the run scripts, and I can compare the two approaches. The second approach would play 1/5 the total games, which will be quicker to run, although either will really need a cluster since there are so many games...
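A quick way to screen a FEN file for that kind of castling problem (a sketch; it assumes the python-chess library and its position-status flags):

```python
import chess

def castling_ok(fen: str) -> bool:
    """True if the FEN's castling field is consistent with the piece placement."""
    board = chess.Board(fen)
    return not (board.status() & chess.STATUS_BAD_CASTLING_RIGHTS)

with open("positions.fen") as f:                    # illustrative file name
    fens = [line.strip() for line in f if line.strip()]

bad = [fen for fen in fens if not castling_ok(fen)]
print(f"{len(bad)} of {len(fens)} positions have bogus castling rights")
```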

But first let me get a complete run done. I'll post the results, but I will need to look at the overall results in a couple of different ways to make sure nothing unexpected is happening in the positions I choose. The positions are sorted into lexical order, and I probably don't want to play the first 1/5 against X, the second 1/5 against Y, and so on; better to take them one at a time and parcel them out to each opponent in sequence so that positions that are close together lexically won't all be fed to the same opponent...
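The parceling itself is trivial; something like this interleaved deal (a sketch; the opponent names are placeholders):

```python
# Deal the lexically sorted positions out to the opponents in sequence, so that
# neighbouring positions end up with different opponents.
def parcel(positions, opponents):
    schedule = {name: [] for name in opponents}
    for i, fen in enumerate(positions):
        schedule[opponents[i % len(opponents)]].append(fen)
    return schedule

opponents = ["opp1", "opp2", "opp3", "opp4", "opp5"]    # placeholder names
# schedule = parcel(sorted(fens), opponents)
```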

I like the simple steps first. So, the first question to be answered is a two-parter: (1) how many positions are needed, and (2) how many opponents are needed? If the numbers are as big as some are suggesting, then no testing being done today is worth anything. But at least I can answer the questions assuming the answers are anywhere near reasonable. Rather than 5 opponents, 50 makes the test run 10x longer: a day or less vs. a week. If we conclude 10,000 positions and 100 opponents, that is going to be beyond doable in any reasonable time. I have played several million games, but at maybe 50,000 per day that is almost 3 weeks, which is not so useful. Using both clusters at once could cut that to one week, but that is a _ton_ of computing to do, and I really want good/bad feedback quicker. I'd prefer minutes or hours, but not days and certainly not weeks.
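The arithmetic behind those estimates, as a quick sketch (two games per pairing for the current 3891-position run, one per pairing assumed for the large case):

```python
def run_size(positions, opponents, games_per_pairing, games_per_day=50_000):
    """Total games and approximate days at ~50,000 games/day on one cluster."""
    games = positions * opponents * games_per_pairing
    return games, games / games_per_day

print(run_size(3_891, 5, 2))        # (38910, ~0.8)  -> under a day
print(run_size(10_000, 100, 1))     # (1000000, 20)  -> roughly 3 weeks
```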

BTW, don't expect results too soon

Post by bob »

Something is not right. Somehow Arasan is not happy with the starting FEN positions, so I am looking to see what is up there... There are other quirks too, so I might have done something that Crafty is happy with (I fed all the FEN strings through it to look for any complaints) but that others do not like... Thought I had it done, but based on preliminary output, it is a _long_ way from there...

Re: Correlated data discussion

Post by xsadar »

bob wrote:
Zach Wegner wrote:
Karl wrote:If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison between engines).
Isn't this pretty much exactly what I've been saying?
Yes, but there is a fatal flaw. You already saw where 10,001,000 nodes produces different results from 10,000,000 nodes, correct? Now make one tiny change to your evaluation or search and guess what happens to the tree: it will change by _more_ than 1,000 nodes, and you get a different random result.

There is no way to search the same exact tree with a different evaluation or search, so you are right back to where we started. There will be a random offset between the two sample sets just as there is with random times at present...
The changes in the search tree are part of what we're trying to test, whereas fluctuations in the clock are not.

However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms. Also, in my experience, NPS varies drastically from one stage of the game to another, so I don't think it would even work to use an average NPS in combination with a node count.

Re: Correlated data discussion

Post by Dirt »

Sven Schüle wrote:I think that including too many unbalanced positions without playing them with both colors may suffer from one problem. You want to compare estimates of the playing strength of version A and A'. But actually you would deal too often with comparing only the abilities to handle a won position, or a lost position, which is only one aspect of playing strength. So while the ability of A' to win better positions may have increased compared to A, its ability to handle balanced positions may have decreased.

Therefore I propose to include both balanced and unbalanced positions in the test set (perhaps more of the balanced kind since this might be closer to "reality"), but to play all positions with both colors. The latter seems important for me since it may be quite difficult to decide whether a given position belongs to one or the other group (you may ask Rybka but you still have to define some artificial margin to say where "unbalanced" begins).

Sven
I don't think you need Rybka to tell whether a position is balanced. Any position where the results are near 100% for one side isn't going to be a good use of the hardware, even if it's actually theoretically drawn.

Testing the engine in positions that are somewhat better or worse sounds reasonable, but I see no good reason why, when testing for the better version of Crafty, those should be the same positions with colors reversed. If you are interested in testing something else, such as how strong Crafty is relative to the test programs, or which positions are truly unbalanced, that would change things.

If you really want to optimize testing you should probably vary which side gets the better position depending on the strength, so that Crafty gets the better side against Glaurung and the worse side against Arasan. I'm not so sure this is really practical, though.

Re: Correlated data discussion

Post by Zach Wegner »

xsadar wrote:However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms. Also, in my experience, NPS varies drastically from one stage of the game to another, so I don't think it would even work to use an average NPS in combination with a node count.
First, I think using a node count proportional to NPS would be fine. The NPS doesn't really change all that much from version to version, nor does the version-to-version difference change much with game stage. If we had a way to measure time exactly, and completely get rid of clock jitter, then I would suggest using a random time. But we don't, so I think nodes is the next best option.
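For example, a hypothetical helper that turns a nominal search time into a node budget using the version's measured benchmark speed:

```python
def node_budget(nominal_seconds: float, benchmark_nps: float) -> int:
    """Scale the node limit to the engine's measured speed, so a faster version
    gets a proportionally larger budget for the same nominal 'time'."""
    return int(nominal_seconds * benchmark_nps)

print(node_budget(3.0, 1_200_000))   # a version benchmarked at 1.2M NPS gets 3,600,000 nodes
```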

But anyway, I think the tradeoff is not between evaluation size and speed, as it was thought to be for ages, but rather the robustness/generality of the evaluation and how it affects every aspect of the game, not just the part it is evaluating. The NPS difference that a (relatively small) evaluation term causes is very minor WRT strength, IMO.

Re: Correlated data discussion

Post by Fritzlein »

xsadar wrote:However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms.
Ah, good point. I didn't think of that. One should give engine A and engine A' the same time rather than the same node count. However it still makes sense to give the opponent of A and the opponent of A' the same node count, right? That way they are playing an opponent of fixed strength rather than a slightly varying opponent.
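As a sketch of that scheme (the numbers are made up; the point is that the opponents of A and A' draw their node counts from the same recorded schedule, while A and A' themselves get wall-clock time):

```python
import random

def opponent_node_schedule(num_games, base_nodes=3_000_000, jitter=100_000, seed=2008):
    """One randomized node count per game, reproducible from the seed, so the
    opponents of A and of A' are limited to exactly the same node counts in
    corresponding games."""
    rng = random.Random(seed)
    return [base_nodes + rng.randint(-jitter, jitter) for _ in range(num_games)]

schedule = opponent_node_schedule(10)
# game i of the A match and game i of the A' match both cap the opponent at schedule[i]
```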

Re: Correlated data discussion

Post by xsadar »

bob wrote:
I like the simple steps first. So, the first question to be answered is a two-parter: (1) how many positions are needed, and (2) how many opponents are needed? If the numbers are as big as some are suggesting, then no testing being done today is worth anything. But at least I can answer the questions assuming the answers are anywhere near reasonable. Rather than 5 opponents, 50 makes the test run 10x longer: a day or less vs. a week. If we conclude 10,000 positions and 100 opponents, that is going to be beyond doable in any reasonable time. I have played several million games, but at maybe 50,000 per day that is almost 3 weeks, which is not so useful. Using both clusters at once could cut that to one week, but that is a _ton_ of computing to do, and I really want good/bad feedback quicker. I'd prefer minutes or hours, but not days and certainly not weeks.
I don't think you need to determine the number of games based on the number of engines and positions, but rather the other way around; except, of course, you have to take into consideration whether or not you can come up with enough good positions and (especially) engines.
I think it would make sense to pick a reasonable number of games (if it's reasonable, around 25,000 would be nice, of course, so the numbers could be compared to your previous results) and then try to determine how many engines and positions, within that predetermined number of games, would give you the most reliable results.

Re: Correlated data discussion

Post by krazyken »

bob wrote:there is simply something going _badly_ wrong with communication here.

(1) if we play with X nodes per search we get one result. If we play with X+N nodes per search, we get another result.

(2) if we play with time limit for search, each run gets a different result.

I ran several tests using (2) and got different results. I then showed that the _only_ thing that could possibly produce these changes is the different node counts caused by using time.

Why is that so hard to explain? Every search of exactly 1 second is not exactly one second. It is 1 second +/- N where N is not precisely known because of hardware differences. Before you can fix something, you have to understand what is happening. I wanted to prove, beyond any doubt, that this variability was purely a result of timing fluctuation, which I quite clearly did. Otherwise I didn't care about the varying node count searches at all with respect to testing. It was just a method of validating a hypothesis.
Honestly, I'm not sure that this has been proven beyond any doubt. My guess here is that if time fluctuation is a factor, there is no reason to believe that one run has more fluctuation than the other. The original runs of 25k games are gone, so there is no data to support either side of any argument about why the results were different. That horse is dead. When new data comes in, we'll be able to analyze rather than speculate.

Here is an experiment people should be able to reproduce on their home machines that should be analogous to the situation produced in those 2 runs.
Pick one of the 40 Silver positions and run 5 series of 128 games of their engine against 5 opponents from that one position, for a total of 640 games. Make sure to disable learning, opening books and such. Then set up the same run again and see how different the results are. This should be the simplest case for determining the value of the number of starting positions.
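To judge whether the two 640-game runs differ by more than ordinary sampling noise, a comparison along these lines would do (a sketch; the results plugged in are hypothetical):

```python
import math

def score_and_stderr(wins, draws, losses):
    """Match score (draws count 0.5) and its standard error."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    variance = (wins * (1 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    return score, math.sqrt(variance / n)

s1, se1 = score_and_stderr(300, 140, 200)   # hypothetical run 1
s2, se2 = score_and_stderr(280, 150, 210)   # hypothetical run 2
sigmas = abs(s1 - s2) / math.hypot(se1, se2)
print(f"run 1: {s1:.3f}  run 2: {s2:.3f}  difference: {sigmas:.1f} sigma")
```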

Re: Correlated data discussion

Post by xsadar »

Fritzlein wrote:
xsadar wrote:However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms.
Ah, good point. I didn't think of that. One should give engine A and engine A' the same time rather than the same node count. However it still makes sense to give the opponent of A and the opponent of A' the same node count, right? That way they are playing an opponent of fixed strength rather than a slightly varying opponent.
I think I can agree with that.

Re: Correlated data discussion

Post by Fritzlein »

hgm wrote:
Fritzlein wrote:So what is left to calculate?
This is in relation to Bob's complaint that 10,000 positions might be too much, and 10,000 opponents will certainly be too much.

What I meant is this: suppose you have 10 positions, 10 opponents and play each opponent from each position 100 times at slightly different numbers of nodes. So you play 10,000 games. Each position is played 1000 times. Suppose from one position the engine under test scores 60%, from 2 others 55%, from 4 others 50%, from 2 others 45%, and from the last one 40%. So the total test score is 50%.

The individual positions deviate from this average score by 10% (2x), 5% (4x) or 0% (4x). This makes the standard deviation of the individual position scores sqrt(30)% ~ 5.5%. Very little of this is due to sampling noise in the percentages themselves, as the variance in 1000 games should be at most 2500/N (square percents) = 2.5 << 30. So the observed variance in the individual position scores is almost exclusively due to variation of the position. And the standard deviation the score suffers by selecting a single position, and using it for all games, is thus ~5%.

Now if you average over N independently chosen positions, the SD of the average will be 5%/sqrt(N). So if you want to achieve an accuracy of 1% (at 95% confidence, so 2-sigma), and you want no more than one quarter of the corresponding variance to be contributed by the position sampling, you need an SD of 0.25% (1-sigma). Thus N has to be at least 400.

There is not much benefit in doing the experiment with more than 400 positions: with 10,000 games the SD of the average score of the entire sample will be 50%/sqrt(10,000) = 0.5%, even with 10,000 different positions. Adding an extra 0.25% SD to that because of position-sampling noise averaged over 400 positions will only drive the overall SD up to 0.55%.
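The same calculation in code, using the example numbers above (a sketch):

```python
import math

scores = [60, 55, 55, 50, 50, 50, 50, 45, 45, 40]              # per-position scores in %
mean = sum(scores) / len(scores)                                # 50%
variance = sum((s - mean) ** 2 for s in scores) / len(scores)   # 30 "square percent"
sd_single = math.sqrt(variance)                                 # sqrt(30) ~ 5.5%, rounded to ~5% above

n_positions = (5.0 / 0.25) ** 2          # SD of 0.25% needs (5/0.25)^2 = 400 positions
overall_sd = math.hypot(0.5, 0.25)       # ~0.56%, the "0.55%" quoted above
print(sd_single, n_positions, overall_sd)
```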
Got it. You are talking about how having too few positions can add noise to the estimate of the playing strength even if the positions can be re-used in a way that is completely independent. You estimate that using more than 400 positions to eliminate this noise will add little extra benefit.

What we apparently disagree about is the potential that we will be able to re-use positions in a way that is independent. I highly doubt it can be done. Therefore, I say we need 10,000 positions, not to remove the source of noise you are talking about, but to remove the source of noise that comes from correlations when a position is re-used. I contend that if Crafty playing white from a given position beats Fruit limited to exactly 3,000,000 nodes, then I can make a lot of money betting on Crafty to win playing white from the same position against Fruit limited to 3,000,000 nodes +/- 100,000 nodes. Yes, at some of the other node counts Fruit will win, but I still expect to consistently make money, because I just don't believe the results will be independent.
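A rough simulation of that intuition (a sketch; the per-position win probabilities are invented, with the ~5.5% spread borrowed from the example earlier in the thread):

```python
import random

def match_score_sd(num_positions, games_per_position, trials=500, seed=1):
    """Monte-Carlo sketch: each position gets its own 'true' expected score, so
    games that reuse a position share that bias and are correlated. Returns the
    standard deviation of the overall 10,000-game match score across trials."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        score, games = 0.0, 0
        for _ in range(num_positions):
            p = min(max(rng.gauss(0.5, 0.055), 0.0), 1.0)   # position-level bias
            score += sum(rng.random() < p for _ in range(games_per_position))
            games += games_per_position
        results.append(score / games)
    mean = sum(results) / len(results)
    return (sum((r - mean) ** 2 for r in results) / len(results)) ** 0.5

print(match_score_sd(10, 1000))    # few positions reused heavily: SD well above 0.5%
print(match_score_sd(10000, 1))    # all-distinct positions: close to 0.5% game noise
```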