Uri Blass wrote: bob wrote:
Thought I would move this since the other thread has gotten pretty long, as well as getting a bit off-topic. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (I borrowed code from Crafty but rewrote the move generator to avoid the magic stuff and keep the code short). Since speed is not an issue, it just generates moves directly with bitboards rather than doing rotated lookups or magic stuff. I had to do this because not all programs provide accurate "Result" commands, so they can't be trusted. The referee can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling them correctly, so I decided to just disable the "offer draw" stuff and let the game continue. It is a forced draw if 3-fold repetition is hit, or the 50-move rule, or insufficient material, regardless of whether one side claims a draw or not, which keeps the test games to a reasonable length.
I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight. In each run, every program plays 160 games against every other program, and I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs everybody else PGN files, to see what happens. Remember that there are a couple of issues. One is the Elo spread from best to worst, and the other is how stable the results are. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.
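As a quick sanity check on the totals to expect from one run, here is a small Python sketch. It assumes "160 games against every other" means 160 games per pairing; the constant `GAMES_PER_PAIRING` and the variable names are illustrative only, and the program names are copied from the BayesElo output below.

```python
from itertools import combinations

# Program names as they appear in the BayesElo output.
PROGRAMS = [
    "Glaurung 2-epsilon/5",
    "opponent-21.7",
    "Fruit 2.1",
    "Glaurung 1.1 SMP",
    "Crafty-22.2",
    "Arasan 10.0",
]

# Assumption: 160 games are played for each pair of programs.
GAMES_PER_PAIRING = 160

# Every unordered pair of distinct programs (round robin).
pairings = list(combinations(PROGRAMS, 2))

# Games in one complete all-vs-all run.
total_games = len(pairings) * GAMES_PER_PAIRING

# Games in the "Crafty vs everybody else" subset.
crafty_games = sum(
    GAMES_PER_PAIRING for a, b in pairings if "Crafty-22.2" in (a, b)
)

print(len(pairings), total_games, crafty_games)
```

Under that assumption a complete run is 2400 games, 800 of which involve Crafty-22.2, which is in line with the partial counts of 2291 and 760 games loaded below.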
More to follow...
Here are some partial results just for information: First batch is everybody vs everybody, second batch is just crafty vs each opponent...
2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18%
   2 opponent-21.7           26   18   18   759   54%    -5   23%
   3 Fruit 2.1                5   19   19   754   51%    -2   15%
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16%
   5 Crafty-22.2            -33   18   18   757   45%     6   22%
   6 Arasan 10.0            -56   19   19   785   42%    11   10%

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17%
   2 Fruit 2.1               64   42   41   149   63%   -31   18%
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16%
   4 opponent-21.7           20   38   37   152   58%   -31   41%
   5 Crafty-22.2            -31   19   19   760   45%     6   22%
   6 Arasan 10.0           -216   43   45   154   26%   -31   16%
The two samples were made at the same time, for reference. The run has a ways to go until it is done. I will then run this again, but with the "big" match. I will eventually stop running the all vs all, since once each non-Crafty program has played all the others, those versions do not change and there is little use in re-running them over and over. I will just take those PGN files and add them to the Crafty vs everyone files, so that after the first run I only have to run Crafty vs everyone to get all the PGN.
For reference, opponent 21.7 is Crafty 21.7. We wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete, with pieces of the eval chopped out...
Based on these results, it may be interesting to look at the games of Arasan to see if there is something wrong with them.
Arasan had 26% after 154 games (probably 40/154) when it played only against Crafty in the first 154 games, and it has 42% after 785 games.
I do not believe that Arasan is strong enough to score more than 40% against programs like Fruit and Glaurung, and it seems that something is wrong in your test (maybe you are adjudicating some of Arasan's games against the other opponents incorrectly).
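To put the two percentages on a common scale, here is a minimal Python sketch that converts a score fraction into a rating difference using the plain logistic Elo formula. This is not what BayesElo computes (BayesElo fits its own model that also accounts for draws and for each opponent's rating), so treat the numbers only as a rough guide; the function name `elo_diff_from_score` is made up for this example.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """Rating difference implied by a score fraction under the
    standard logistic Elo curve (a simplification, not BayesElo)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


# The two Arasan percentages quoted above.
for label, score in [("vs Crafty only", 0.26), ("all games", 0.42)]:
    print(f"{label}: {score:.0%} -> {elo_diff_from_score(score):+.0f} Elo")
```

With this formula, 26% corresponds to roughly -182 Elo relative to the average opposition and 42% to roughly -56.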
The adjudication is not wrong. I have played several thousand games and compared results to the output of Crafty's search. The rules are simple:
(1) if one side has no legal moves and is in check, he is checkmated and loses (loss).
(2) if one side has no legal moves and is not in check, he is stalemated (draw).
(3) if the position is repeated for the third time with the same side on move, it is a draw by repetition.
(4) if the position satisfies the 50-move rule, it is a draw.
(5) if the position has no pawns and only one minor piece per side, it is a draw, and yes I know there is an exception. I've never seen it happen in a real game and find it reasonable to say draw.
(6) if one side resigns, it is a loss.
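The six rules above could be sketched roughly as follows in Python. This is not the referee's actual code (that is built on Crafty's board code); the `PositionSummary` fields and the `adjudicate` function are hypothetical, and the sketch assumes the move generator and game history have already produced the counts it needs. Rule (5) is read here as "no pawns, no rooks or queens, and at most one minor piece per side", which is an interpretation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PositionSummary:
    """Hypothetical summary of the current position and game state."""
    side_to_move: str                 # "white" or "black"
    legal_moves: int                  # number of legal moves available
    in_check: bool = False
    repetitions: int = 1              # occurrences with the same side on move
    halfmove_clock: int = 0           # plies since last capture or pawn move
    pawns: int = 0                    # pawns on the board, both sides
    majors: int = 0                   # rooks and queens, both sides
    white_minors: int = 0             # white bishops and knights
    black_minors: int = 0             # black bishops and knights
    resigned: Optional[str] = None    # side that resigned, if any


def loss_for(side: str) -> str:
    """PGN result string when the given side loses."""
    return "0-1" if side == "white" else "1-0"


def adjudicate(p: PositionSummary) -> Optional[str]:
    """Return a PGN result string, or None if the game continues."""
    if p.legal_moves == 0:
        # Rule (1) checkmate, rule (2) stalemate.
        return loss_for(p.side_to_move) if p.in_check else "1/2-1/2"
    if p.repetitions >= 3:
        return "1/2-1/2"              # rule (3): third repetition
    if p.halfmove_clock >= 100:
        return "1/2-1/2"              # rule (4): 50 moves = 100 plies
    if (p.pawns == 0 and p.majors == 0
            and p.white_minors <= 1 and p.black_minors <= 1):
        return "1/2-1/2"              # rule (5): insufficient material
    if p.resigned is not None:
        return loss_for(p.resigned)   # rule (6): resignation
    return None
```

Note that draws are forced as soon as the condition holds, without waiting for either program to claim them.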
I played several thousand test games between Crafty and the 5 opponents. I then grabbed the Result tag and the last value from Crafty's search, and correlated them by hand. Since Crafty's score is always + = good for white, it was easy to verify that the side with the evaluation edge always got the win, except for a few draws in cases where Crafty thought it was worse but the opponent repeated or allowed a 50-move draw, for example.
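The check was done by hand, but the same idea can be sketched in a few lines of Python. It assumes each game has already been reduced to a Result tag and Crafty's final score, with positive meaning good for white; the tuple layout, the sample data, and the names `is_consistent` and `suspicious_games` are invented for illustration.

```python
from typing import List, Tuple

# (game id, PGN Result tag, Crafty's last score with + = good for white)
Game = Tuple[str, str, float]


def is_consistent(result: str, final_eval: float) -> bool:
    """True if the adjudicated result agrees with the sign of the score."""
    if result == "1/2-1/2":
        # Draws are accepted whatever the score: the better side may have
        # repeated the position or run into the 50-move rule.
        return True
    if result == "1-0":
        return final_eval > 0
    if result == "0-1":
        return final_eval < 0
    raise ValueError(f"unexpected result tag: {result!r}")


def suspicious_games(games: List[Game]) -> List[Game]:
    """Games whose result disagrees with the final score."""
    return [g for g in games if not is_consistent(g[1], g[2])]


# Made-up sample data, only to show the call.
sample = [
    ("g1", "1-0", 5.31),
    ("g2", "0-1", -7.80),
    ("g3", "1/2-1/2", -1.25),
    ("g4", "1-0", -3.10),  # would be flagged for a closer look
]
print([g[0] for g in suspicious_games(sample)])
```

Anything the filter flags would still need a manual look at the PGN before concluding that the adjudication was wrong.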
Once I get a complete set of data, I'll put the whole thing on my ftp box if you want to look at the results for yourself. No program can end the game in any way except by resigning, which is an instant loss. Otherwise the referee detects the mates and such itself.