OK, let me recap (and recant) just a bit. In selecting the positions, I had a couple of screw-ups that I have now fixed. The main problem was that the way I was capturing FEN could attach a bogus castling status. That made the match worthless, of course, since some programs revert to the normal starting position when given bad FEN, some use the FEN as is but ignore the bogus castling status, etc. That has been fixed. I also took the positions after move 9, where I intended to use positions after 10 moves. I have fixed all this and, as a result, I now have 3891 positions for a first cut. Each will be played twice per opponent, which is 38,910 games total. I may well try the alternate idea of playing 1/5 of the positions against each opponent so that no two opponents play the same position against Crafty... That is a minor change to the shell script that creates the run scripts. And I can compare the two. The second approach would play 1/5 of the total games, which will be quicker to run, although either will really need a cluster since there are so many games...
xsadar wrote: That's the main thing I was worried about.
bob wrote: Of course. However, once I finish the first run (the cluster is a bit busy, but I just started a test so results will start to trickle in) I would like to start a discussion about how to select the positions.
xsadar wrote: This point makes sense, but I don't like it. That means we would need 10 times as many positions as I originally thought for an ideal test. Also, if we don't play the positions as both white and black, it seems (to me at least) to make it even more important that the positions be about equal for white and black. I hope you're kind enough, Bob, to make the positions you finally settle on (however many that may be) available to the rest of us.
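As an aside on the bogus castling status mentioned in the recap: a cheap sanity check on each captured FEN can catch that kind of error before a match is started. The sketch below is only an illustration, not the fix actually made to the book-create code. It assumes standard chess (not Chess960), and the function name `castling_is_consistent` is made up for this example.

```python
def castling_is_consistent(fen: str) -> bool:
    """Return True if every castling flag in the FEN is backed by a king
    and rook on their home squares (standard chess only, an assumption)."""
    fields = fen.split()
    if len(fields) < 3:
        return False
    placement, castling = fields[0], fields[2]
    ranks = placement.split("/")
    if len(ranks) != 8:
        return False

    def expand(rank: str) -> str:
        # Turn each digit into a run of '.' so a rank becomes 8 characters.
        return "".join("." * int(c) if c.isdigit() else c for c in rank)

    board = [expand(r) for r in ranks]  # board[0] is rank 8, board[7] is rank 1
    if any(len(r) != 8 for r in board):
        return False
    if castling == "-":
        return True

    # flag -> (row index, king letter, rook letter, rook file index)
    need = {
        "K": (7, "K", "R", 7),
        "Q": (7, "K", "R", 0),
        "k": (0, "k", "r", 7),
        "q": (0, "k", "r", 0),
    }
    for flag in castling:
        if flag not in need:
            return False
        row, king, rook, rook_file = need[flag]
        # The king must be on the e-file and the rook on its corner square.
        if board[row][4] != king or board[row][rook_file] != rook:
            return False
    return True


if __name__ == "__main__":
    start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
    moved_king = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQ1KNR w KQkq - 0 1"
    print(castling_is_consistent(start))       # True
    print(castling_is_consistent(moved_king))  # False
```

A check like this only proves the flags are possible, not that they are correct: a king that moved away and came back would still pass.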
To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic, because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K or more positions.
I hope it gives some good results. This is with Crafty playing both black and white against each engine for each position, right?
What I have done is take the PGN collection we use for our normal wide book (good quality games) and modify the book create procedure so that on wtm move 11 (both sides have played 10 moves) I write out the FEN. The good side is that most of these positions are probably decent. The down side is that this does not cover unusual openings very well, and might not cover some at all. So you might find out your program does well in normal positions, but you might not know that it handles off-the-wall positions poorly...
So there is a ton of room for further discussion. But let me get some hard Elo data from these positions. I want to run them 4 times so that I get 4 sets of Elo data, which will hopefully be very close to the same each time...
But first let me get a complete run done. I'll post the results, but I will need to look at the overall results in a couple of different ways to make sure nothing unexpected is happening in the positions I choose. The positions are sorted into lexical order, and I probably don't want to play the first 1/5 against X, the second 1/5 against Y, and so on. Instead I will probably take them one at a time and parcel them out to each opponent in sequence, so that positions that are close together lexically won't all be fed to the same opponent...
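Dealing the sorted positions out one at a time, as described above, could look roughly like this. It is a minimal sketch, not the actual shell script that creates the run scripts; the function name `deal_positions` and the opponent names are placeholders invented for the example.

```python
def deal_positions(positions, opponents):
    """Deal positions to opponents one at a time, like dealing cards,
    so lexically adjacent positions go to different opponents."""
    assignment = {name: [] for name in opponents}
    for i, fen in enumerate(positions):
        # Position i goes to opponent i mod (number of opponents).
        assignment[opponents[i % len(opponents)]].append(fen)
    return assignment


if __name__ == "__main__":
    # Placeholder data: 3891 dummy position labels, already in sorted order.
    positions = sorted(f"pos{n:04d}" for n in range(3891))
    opponents = ["opp1", "opp2", "opp3", "opp4", "opp5"]  # hypothetical names
    dealt = deal_positions(positions, opponents)
    for name in opponents:
        print(name, len(dealt[name]), dealt[name][:2])
```

With 3891 positions and 5 opponents the split cannot be perfectly even: the first opponent gets 779 positions and the other four get 778 each.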
I like the simple steps first. So the first question to be answered is a two-parter: (1) how many positions are needed, and (2) how many opponents are needed? If the numbers are as big as some are suggesting, then no testing being done today is worth anything. But at least I can answer the questions, assuming the answers are anywhere near reasonable. Going from 5 opponents to 50 makes the test run 10x longer: a day or less vs a week. If we conclude that we need 10,000 positions and 100 opponents, that is going to be beyond doable in any reasonable time. I have played several million games, but at maybe 50,000 per day, that is almost 3 weeks, which is not so useful. Using both clusters at once could cut that to one week, but that is a _ton_ of computing to do, and I really want good/bad feedback quicker. I'd prefer minutes or hours, but not days and certainly not weeks.
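For anyone who wants to play with the numbers, here is the arithmetic behind those estimates as a small sketch. The 50,000 games per day figure is the rough rate quoted above; the function names are made up for the example. The "almost 3 weeks" estimate appears to correspond to one game per position per opponent, which is an inference from the numbers rather than something stated outright.

```python
def total_games(positions, opponents, games_per_position=2):
    """Games in one full run: every position against every opponent,
    played games_per_position times (2 = once with each color)."""
    return positions * opponents * games_per_position


def days_needed(games, games_per_day=50_000):
    """Rough wall-clock days at an assumed cluster throughput."""
    return games / games_per_day


if __name__ == "__main__":
    current = total_games(3891, 5)
    print(current)                            # 38910
    print(round(days_needed(current), 2))     # under a day

    fifty = total_games(3891, 50)
    print(fifty, round(days_needed(fifty), 2))  # roughly a week

    # 10,000 positions x 100 opponents, one game each (assumed reading)
    big = total_games(10_000, 100, games_per_position=1)
    print(big, days_needed(big))              # 1000000 20.0
```

Playing the big set with both colors would double that last figure to about 40 days on one cluster.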