New testing thread
Posted: Thu Aug 07, 2008 8:16 am
I thought I would move this to a new thread, since the other one has gotten pretty long and has drifted off-topic a bit. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (code borrowed from Crafty, but I rewrote the move generator to avoid the magic stuff and keep the code short). Since speed is not an issue, it just generates moves directly from the bitboards rather than doing rotated lookups or magic stuff. I had to do this because not all programs provide accurate "Result" commands, so they can't be trusted. The referee can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling them correctly, so I decided to just disable the "offer draw" stuff and let the game continue. A game is a forced draw if it hits 3-fold repetition, the 50-move rule, or insufficient material, regardless of whether either side claims the draw, which keeps the test games to a reasonable length.
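As a rough illustration of the forced-draw rules described above, here is a minimal Python sketch. It is not the referee's actual code: the `DrawAdjudicator` class, the `position_key` argument (any hashable value identifying the full position, including side to move and castling rights), and the simplified `insufficient_material` test (bare kings, or one lone knight or bishop) are all assumptions made for the example.

```python
from collections import Counter


def insufficient_material(non_king_pieces):
    """Simplified test (an assumption): K vs K, or a single lone minor piece.

    non_king_pieces: iterable of piece letters for both sides, kings excluded.
    """
    pieces = [p.upper() for p in non_king_pieces]
    return len(pieces) == 0 or (len(pieces) == 1 and pieces[0] in ("N", "B"))


class DrawAdjudicator:
    """Forces a draw without relying on either engine to claim it."""

    def __init__(self, start_key):
        self.seen = Counter({start_key: 1})  # position key -> times reached
        self.halfmove_clock = 0              # plies since last capture/pawn move

    def record(self, position_key, is_capture_or_pawn_move, non_king_pieces):
        """Call after every move; returns a draw reason or None."""
        if is_capture_or_pawn_move:
            # Irreversible move: earlier positions can never recur.
            self.halfmove_clock = 0
            self.seen.clear()
        else:
            self.halfmove_clock += 1
        self.seen[position_key] += 1

        if self.seen[position_key] >= 3:
            return "3-fold repetition"
        if self.halfmove_clock >= 100:       # 50 moves by each side
            return "50-move rule"
        if insufficient_material(non_king_pieces):
            return "insufficient material"
        return None
```

The caller would feed `record` after each move the referee applies to its own board, and stop the game as soon as it returns a reason.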
I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight. In each run, every program plays 160 games against every other program, and I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs everybody else PGN files, to see what happens. Remember that there are a couple of questions here: one is what the Elo spread from best to worst looks like, and the other is how stable the results are from run to run. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.
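To make the size of one run concrete, here is a small Python sketch that counts the games, assuming a run is a plain round robin over the 6 programs with 160 games per pairing. The program names are the ones that appear in the BayesElo output; the variable names are just for illustration.

```python
from itertools import combinations

PROGRAMS = [
    "Glaurung 2-epsilon/5",
    "opponent-21.7",
    "Fruit 2.1",
    "Glaurung 1.1 SMP",
    "Crafty-22.2",
    "Arasan 10.0",
]
GAMES_PER_PAIRING = 160

# Every unordered pair of programs meets once per run.
pairings = list(combinations(PROGRAMS, 2))
total_games = len(pairings) * GAMES_PER_PAIRING

# The "Crafty vs everybody" subset only keeps pairings that include Crafty-22.2.
crafty_pairings = [p for p in pairings if "Crafty-22.2" in p]
crafty_games = len(crafty_pairings) * GAMES_PER_PAIRING

print(len(pairings), total_games)         # 15 pairings, 2400 games per run
print(len(crafty_pairings), crafty_games)  # 5 pairings, 800 games per run
```

Under that assumption a complete run is 2400 games, of which 800 involve Crafty-22.2.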
More to follow...
Here are some partial results, just for information. The first batch is everybody vs everybody; the second batch is just Crafty vs each opponent...
The two samples were made at the same time, for reference. The run still has a ways to go until it is done. I will then run this again, but with the "big" match. Eventually I will stop running the all vs all: once each non-Crafty program has played all the others, those versions do not change, and there is little use in re-running them over and over. I will just take those PGN files and add them to the Crafty vs everyone files, so that after the first run I only have to run Crafty vs everyone to get all the PGN.
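The PGN reuse described above could be sketched in Python roughly as follows. This is an illustration only: the function names are hypothetical, it assumes every game in the files starts with an `[Event ` tag, and it assumes the program under test appears in the `White`/`Black` tags under the exact name passed as `tested`.

```python
import re


def split_games(pgn_text):
    """Split PGN text into games, assuming each game starts with an [Event tag."""
    games, current = [], []
    for line in pgn_text.splitlines():
        if line.startswith("[Event ") and current:
            games.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        games.append("\n".join(current).strip())
    return [g for g in games if g]


def players(game):
    """Return the (White, Black) names from a game's tag section."""
    white = re.search(r'\[White "([^"]*)"\]', game)
    black = re.search(r'\[Black "([^"]*)"\]', game)
    return (white.group(1) if white else None,
            black.group(1) if black else None)


def fixed_pool(all_vs_all_pgn, tested="Crafty-22.2"):
    """Games from the all-vs-all run that do not involve the program under test."""
    return [g for g in split_games(all_vs_all_pgn) if tested not in players(g)]


def merged(all_vs_all_pgn, crafty_pgn, tested="Crafty-22.2"):
    """Fixed non-Crafty games plus the fresh Crafty-vs-everyone games."""
    games = fixed_pool(all_vs_all_pgn, tested) + split_games(crafty_pgn)
    return "\n\n".join(games) + "\n"
```

The merged text can then be handed to BayesElo as a single PGN file, so only the Crafty games need to be replayed for each new version.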
opponent-21.7 is Crafty 21.7, for reference; we wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete, with pieces of the eval chopped out...
2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18%
   2 opponent-21.7           26   18   18   759   54%    -5   23%
   3 Fruit 2.1                5   19   19   754   51%    -2   15%
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16%
   5 Crafty-22.2            -33   18   18   757   45%     6   22%
   6 Arasan 10.0            -56   19   19   785   42%    11   10%

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17%
   2 Fruit 2.1               64   42   41   149   63%   -31   18%
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16%
   4 opponent-21.7           20   38   37   152   58%   -31   41%
   5 Crafty-22.2            -31   19   19   760   45%     6   22%
   6 Arasan 10.0           -216   43   45   154   26%   -31   16%
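As a quick sanity check on how the Elo column relates to the score column, the standard logistic Elo formula can be sketched in Python as below. This is only an approximation: BayesElo fits its own model, so its numbers need not match this formula exactly, and the function name is just for illustration.

```python
def expected_score(elo_a, elo_b):
    """Expected score of A against B under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


# Ratings taken from the Crafty-vs-everybody table above.
crafty_vs_arasan = expected_score(-31, -216)
glaurung2_vs_crafty = expected_score(115, -31)
print(round(crafty_vs_arasan, 3), round(glaurung2_vs_crafty, 3))
```

In the Crafty-only batch, the 185-point gap between Crafty-22.2 and Arasan 10.0 corresponds to an expected score of about 74% for Crafty, which lines up with Arasan's 26%; the 146-point gap to Glaurung 2-epsilon/5 gives about 70% for Glaurung, close to its listed 69%.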