New testing thread

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

New testing thread

Post by bob »

Thought I would move this since the other thread has gotten pretty long, as well as getting off-topic a bit. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (I borrowed code from Crafty but rewrote the move generator to avoid the magic stuff, to keep the code short; since speed is not an issue, it just directly generates moves in bitboards rather than doing rotated lookups or magic lookups). I had to do this because not all programs provide accurate "Result" commands, so they can't be trusted. The referee can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling those correctly and decided to just disable the "offer draw" stuff and let the game continue. It is a forced draw if a 3-fold repetition is hit, or the 50-move rule, or insufficient material, regardless of whether one side claims a draw or not, which keeps the test games to a reasonable length.
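
For anyone curious what the non-magic generation looks like, here is a minimal sketch of the walk-the-ray approach described above, assuming a1=0..h8=63 square numbering; the function names are mine, not Crafty's:

```c
#include <stdint.h>

/* Walk one ray a square at a time, stopping at (and including) the
   first blocker.  Slow compared to magic or rotated lookups, but for
   a referee speed does not matter. */
static uint64_t slide(int sq, uint64_t occ, int df, int dr) {
    uint64_t attacks = 0;
    int f = sq % 8 + df, r = sq / 8 + dr;
    while (f >= 0 && f < 8 && r >= 0 && r < 8) {
        uint64_t b = 1ULL << (r * 8 + f);
        attacks |= b;
        if (occ & b) break;            /* ray blocked here */
        f += df;
        r += dr;
    }
    return attacks;
}

/* Rook attacks = the four orthogonal rays; bishops and queens work
   the same way with diagonal direction pairs. */
uint64_t rook_attacks(int sq, uint64_t occ) {
    return slide(sq, occ, 1, 0) | slide(sq, occ, -1, 0) |
           slide(sq, occ, 0, 1) | slide(sq, occ, 0, -1);
}
```

A rook on an otherwise empty board always attacks 14 squares, which makes a handy sanity check for this kind of generator.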

I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight, in which each opponent plays 160 games against every other opponent, and then I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs. everybody-else PGN files, to see what happens. Remember that there are a couple of issues: one is the Elo spread from best to worst, and the other is the stability. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.

More to follow...

Here are some partial results, just for information. The first batch is everybody vs. everybody; the second batch is just Crafty vs. each opponent...

Code:

2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18% 
   2 opponent-21.7           26   18   18   759   54%    -5   23% 
   3 Fruit 2.1                5   19   19   754   51%    -2   15% 
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16% 
   5 Crafty-22.2            -33   18   18   757   45%     6   22% 
   6 Arasan 10.0            -56   19   19   785   42%    11   10% 

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17% 
   2 Fruit 2.1               64   42   41   149   63%   -31   18% 
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16% 
   4 opponent-21.7           20   38   37   152   58%   -31   41% 
   5 Crafty-22.2            -31   19   19   760   45%     6   22% 
   6 Arasan 10.0           -216   43   45   154   26%   -31   16% 
The two samples were made at the same time, for reference. It has a ways to go until it is done. I will then run this again, but with the "big" match, and will eventually stop running the all-vs-all, since once each non-Crafty program has played all the others, those versions do not change and there is little use in re-running them over and over. I will just take those PGN files and add them to the Crafty-vs-everyone set, so that after the first run I only have to run Crafty vs. everyone to get all the PGN.

Opponent-21.7 is Crafty 21.7, for reference; we wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete, with pieces of the eval chopped out...
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New testing thread

Post by hgm »

The most important thing is that you make the referee save the core or machine the game was running on and the time it was played, next to the starting position, players and result. You could do this in the PGN tags, so that it can be extracted later, or write it in a more compact form, so that aberrant results can be scrutinized for 'cause of death', rather than having to rely on an "it did not work, tell me what's wrong :cry:" approach.
Uri Blass
Posts: 10281
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: New testing thread

Post by Uri Blass »

bob wrote:Thought I would move this since the other thread has gotten pretty long, as well as getting off-topic a bit. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (I borrowed code from Crafty but rewrote the move generator to avoid the magic stuff, to keep the code short; since speed is not an issue, it just directly generates moves in bitboards rather than doing rotated lookups or magic lookups). I had to do this because not all programs provide accurate "Result" commands, so they can't be trusted. The referee can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling those correctly and decided to just disable the "offer draw" stuff and let the game continue. It is a forced draw if a 3-fold repetition is hit, or the 50-move rule, or insufficient material, regardless of whether one side claims a draw or not, which keeps the test games to a reasonable length.

I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight, in which each opponent plays 160 games against every other opponent, and then I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs. everybody-else PGN files, to see what happens. Remember that there are a couple of issues: one is the Elo spread from best to worst, and the other is the stability. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.

More to follow...

Here are some partial results, just for information. The first batch is everybody vs. everybody; the second batch is just Crafty vs. each opponent...

Code:

2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18% 
   2 opponent-21.7           26   18   18   759   54%    -5   23% 
   3 Fruit 2.1                5   19   19   754   51%    -2   15% 
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16% 
   5 Crafty-22.2            -33   18   18   757   45%     6   22% 
   6 Arasan 10.0            -56   19   19   785   42%    11   10% 

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17% 
   2 Fruit 2.1               64   42   41   149   63%   -31   18% 
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16% 
   4 opponent-21.7           20   38   37   152   58%   -31   41% 
   5 Crafty-22.2            -31   19   19   760   45%     6   22% 
   6 Arasan 10.0           -216   43   45   154   26%   -31   16% 
The two samples were made at the same time, for reference. It has a ways to go until it is done. I will then run this again, but with the "big" match, and will eventually stop running the all-vs-all, since once each non-Crafty program has played all the others, those versions do not change and there is little use in re-running them over and over. I will just take those PGN files and add them to the Crafty-vs-everyone set, so that after the first run I only have to run Crafty vs. everyone to get all the PGN.

Opponent-21.7 is Crafty 21.7, for reference; we wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete, with pieces of the eval chopped out...
Based on these results it may be interesting to look at Arasan's games to see if there is something wrong with them.

Arasan had 26% after 154 games (probably 40/154) when it had played only against Crafty, but it has 42% after 785 games.

I do not believe that Arasan is strong enough to score more than 40% against programs like Fruit and Glaurung, and it seems that something is wrong in your test (maybe you are wrong in adjudicating part of Arasan's games against other opponents).

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New testing thread

Post by bob »

Uri Blass wrote:
bob wrote:Thought I would move this since the other thread has gotten pretty long, as well as getting off-topic a bit. I have now fixed my referee program so that it can properly adjudicate games as won/lost/drawn. It maintains a board state (I borrowed code from Crafty but rewrote the move generator to avoid the magic stuff, to keep the code short; since speed is not an issue, it just directly generates moves in bitboards rather than doing rotated lookups or magic lookups). I had to do this because not all programs provide accurate "Result" commands, so they can't be trusted. The referee can now ignore them and play anybody vs anybody. The only thing that is not allowed is draw offers. I had too many problems with programs not handling those correctly and decided to just disable the "offer draw" stuff and let the game continue. It is a forced draw if a 3-fold repetition is hit, or the 50-move rule, or insufficient material, regardless of whether one side claims a draw or not, which keeps the test games to a reasonable length.

I have run a partial test with the 6 programs so far. I am going to make 4 runs overnight, in which each opponent plays 160 games against every other opponent, and then I will do that 4 times to see how things look. I'm going to show the data two different ways, as the preliminary results are interesting. I will produce 4 sets of output from BayesElo where everybody plays everybody, and then 4 sets of output using only the Crafty vs. everybody-else PGN files, to see what happens. Remember that there are a couple of issues: one is the Elo spread from best to worst, and the other is the stability. I should be able to post all of this in the morning if my shell script doesn't have a glaring error I missed.

More to follow...

Here are some partial results, just for information. The first batch is everybody vs. everybody; the second batch is just Crafty vs. each opponent...

Code:

2291 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   19   763   61%   -15   18% 
   2 opponent-21.7           26   18   18   759   54%    -5   23% 
   3 Fruit 2.1                5   19   19   754   51%    -2   15% 
   4 Glaurung 1.1 SMP       -15   19   19   764   47%     2   16% 
   5 Crafty-22.2            -33   18   18   757   45%     6   22% 
   6 Arasan 10.0            -56   19   19   785   42%    11   10% 

760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17% 
   2 Fruit 2.1               64   42   41   149   63%   -31   18% 
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16% 
   4 opponent-21.7           20   38   37   152   58%   -31   41% 
   5 Crafty-22.2            -31   19   19   760   45%     6   22% 
   6 Arasan 10.0           -216   43   45   154   26%   -31   16% 
The two samples were made at the same time, for reference. It has a ways to go until it is done. I will then run this again, but with the "big" match, and will eventually stop running the all-vs-all, since once each non-Crafty program has played all the others, those versions do not change and there is little use in re-running them over and over. I will just take those PGN files and add them to the Crafty-vs-everyone set, so that after the first run I only have to run Crafty vs. everyone to get all the PGN.

Opponent-21.7 is Crafty 21.7, for reference; we wanted to keep a 21.7 version in to see how the new version is doing. 22.2 is pretty incomplete, with pieces of the eval chopped out...
Based on these results it may be interesting to look at Arasan's games to see if there is something wrong with them.

Arasan had 26% after 154 games (probably 40/154) when it had played only against Crafty, but it has 42% after 785 games.

I do not believe that Arasan is strong enough to score more than 40% against programs like Fruit and Glaurung, and it seems that something is wrong in your test (maybe you are wrong in adjudicating part of Arasan's games against other opponents).

Uri
The adjudication is not wrong. I have played several thousand games and compared results to the output of Crafty's search. The rules are simple:

(1) if one side has no legal moves and is in check, he is checkmated and loses (loss).

(2) if one side has no legal moves and is not in check, he is stalemated (draw).

(3) if the position is repeated for the third time with the same side on move, it is a draw by repetition.

(4) if the position satisfies the 50-move rule, it is a draw.

(5) if the position has no pawns and only one minor piece per side, it is a draw. Yes, I know there is an exception, but I've never seen it happen in a real game and find it reasonable to call it a draw.

(6) if one side resigns, it is a loss.
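
Taken together, the six rules above amount to a small decision function. A hedged sketch of how they compose (the struct fields are hypothetical names for facts the real referee derives from its board state):

```c
#include <stdbool.h>

/* Possible adjudication outcomes, stated from the side to move. */
enum result { ONGOING, SIDE_TO_MOVE_LOSES, DRAW };

struct pos_facts {
    int  legal_moves;      /* legal moves for the side to move        */
    bool in_check;
    int  repetitions;      /* occurrences of the current position     */
    int  halfmove_clock;   /* reversible half-moves (50-move counter) */
    bool bare_minors_only; /* no pawns, at most one minor per side    */
    bool resigned;         /* side to move has resigned               */
};

enum result adjudicate(const struct pos_facts *p) {
    if (p->resigned)              return SIDE_TO_MOVE_LOSES;      /* rule 6 */
    if (p->legal_moves == 0)                                /* rules 1 & 2 */
        return p->in_check ? SIDE_TO_MOVE_LOSES : DRAW;
    if (p->repetitions >= 3)      return DRAW;                    /* rule 3 */
    if (p->halfmove_clock >= 100) return DRAW;                    /* rule 4 */
    if (p->bare_minors_only)      return DRAW;                    /* rule 5 */
    return ONGOING;
}
```

The ordering matters only for resignation, which ends the game regardless of what the board says; the remaining rules are mutually exclusive in practice.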

I played several thousand test games between Crafty and the 5 opponents; I then grabbed the Result tag and the last value from Crafty's search, and correlated them by hand. Since Crafty's scores are always from White's point of view (+ = good for White), it was easy to verify that the side with the evaluation edge always got the point, except for a few draws in cases where Crafty thought it was worse but the opponent repeated or allowed a 50-move draw, for example.

Once I get a complete set of data, I'll put the whole thing on my ftp box if you want to look at the results for yourself. No program can end the game in any way except by resigning, which is an instant loss. Otherwise the referee detects the mates and such itself.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New testing thread

Post by hgm »

Wow! Your referee program is nearly as smart as WinBoard! :lol:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New testing thread

Post by bob »

hgm wrote:Wow! Your referee program is nearly as smart as WinBoard! :lol:
It's smarter. It can run on a cluster without adding the overhead of remote board displays (xboard for unix), whereas WinBoard can't even run on a cluster like this because there's no place to display a graphical board. :)

Running 540 games at a time on the big cluster produces a _ton_ of network traffic to display 540 chess boards, animation on or not. That introduces random delays due to network congestion at fast time controls. That's why I wrote the thing in the first place.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New testing thread

Post by bob »

OK, naturally not everything went quite as expected. Here are the first four runs using the round-robin approach, which does stabilize the ratings somewhat better than previous tests:

Code:

Thu Aug 7 01:29:23 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    75   19   18   800   62%   -15   18%
   2 opponent-21.7           21   18   18   800   54%    -4   22%
   3 Fruit 2.1                7   18   18   800   51%    -1   16%
   4 Glaurung 1.1 SMP       -13   18   18   800   48%     3   16%
   5 Crafty-22.2            -31   18   18   800   45%     6   21%
   6 Arasan 10.0            -59   19   19   800   41%    12   10%
Thu Aug 7 02:22:36 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    65   18   18   800   60%   -13   17%
   2 opponent-21.7           29   18   18   800   55%    -6   23%
   3 Fruit 2.1               15   18   18   800   52%    -3   19%
   4 Glaurung 1.1 SMP       -18   18   18   800   47%     4   17%
   5 Crafty-22.2            -34   18   18   800   44%     7   20%
   6 Arasan 10.0            -57   19   19   800   42%    11    9%
Thu Aug 7 03:12:33 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    73   19   18   800   61%   -15   15%
   2 opponent-21.7           23   18   18   800   54%    -5   21%
   3 Fruit 2.1                8   18   18   800   51%    -2   18%
   4 Glaurung 1.1 SMP       -24   18   18   800   46%     5   15%
   5 Crafty-22.2            -38   18   18   800   44%     8   18%
   6 Arasan 10.0            -43   19   19   800   44%     9   10%
Thu Aug 7 04:03:47 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    68   19   18   800   61%   -14   17%
   2 opponent-21.7           27   18   18   800   55%    -5   21%
   3 Fruit 2.1               26   18   18   800   54%    -5   19%
   4 Glaurung 1.1 SMP       -38   18   18   800   44%     8   18%
   5 Crafty-22.2            -41   18   18   800   43%     8   20%
   6 Arasan 10.0            -42   19   19   800   44%     8   11%
The part that failed was the attempt to gather the above data for just Crafty vs. the world, leaving out the world-vs-world round-robin data, so that we could compare old and new approaches across 4 runs. I will re-run the test with the fix included. But I am saving the PGN from the last run each time, and that lets me provide this output for the last run above, which is just Crafty vs. the 5 opponents, without the opponents playing each other...

Code:

800 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   124   44   42   160   71%   -42   15% 
   2 Fruit 2.1               55   41   40   160   63%   -42   16% 
   3 opponent-21.7           26   38   38   160   60%   -42   31% 
   4 Glaurung 1.1 SMP        15   40   40   160   58%   -42   18% 
   5 Crafty-22.2            -42   18   18   800   43%     8   20% 
   6 Arasan 10.0           -178   40   42   160   32%   -42   19% 
The first thing that obviously stands out is that the effective Elos are significantly different when using only the Crafty-vs-world games rather than a full round-robin. However, I want to get all the data for 4 runs, so I am starting it again. These seem to take about an hour to run based on the timestamps given above, so I should have good data later this afternoon.


I received an email overnight from a mathematician who had some interesting insights into our discussions. Since they could be taken as a bit insulting to some posters in the discussion, I have invited him to come and participate directly. He has some insights into why "some" of the posters are a bit off-target in the discussions about dependency and correlation, but I will save that for him to present if he takes me up on my invitation. The suggestions he proposes will certainly generate some discussion, since they were not things that immediately jumped out at me as potential improvements...
Uri Blass
Posts: 10281
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: New testing thread

Post by Uri Blass »

bob wrote:
hgm wrote:Wow! Your referee program is nearly as smart as WinBoard! :lol:
It's smarter. It can run on a cluster without adding the overhead of remote board displays (xboard for unix), whereas WinBoard can't even run on a cluster like this because there's no place to display a graphical board. :)

Running 540 games at a time on the big cluster produces a _ton_ of network traffic to display 540 chess boards, animation on or not. That introduces random delays due to network congestion at fast time controls. That's why I wrote the thing in the first place.
I do not understand your confidence that you have no bugs in your referee program.

The results clearly suggest that you have some bug, and it will be possible to find it after we look at the PGN.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New testing thread

Post by bob »

Uri Blass wrote:
bob wrote:
hgm wrote:Wow! Your referee program is nearly as smart as WinBoard! :lol:
It's smarter. It can run on a cluster without adding the overhead of remote board displays (xboard for unix), whereas WinBoard can't even run on a cluster like this because there's no place to display a graphical board. :)

Running 540 games at a time on the big cluster produces a _ton_ of network traffic to display 540 chess boards, animation on or not. That introduces random delays due to network congestion at fast time controls. That's why I wrote the thing in the first place.
I do not understand your confidence that you have no bugs in your referee program.

The results clearly suggest that you have some bug, and it will be possible to find it after we look at the PGN.

Uri
Did you read what I wrote? I played a couple of thousand games between Crafty and each possible opponent. I then correlated the Result tag from the PGN with the last search value Crafty produced. If the result was 1-0 and the score was > +8, that passed. If the result was 0-1 and the score was < -9, that passed. If the result was 1/2-1/2 and the score satisfied -.01 <= v <= .01, that passed. A few failed, but were validated by hand. For example, Crafty was playing white against Glaurung 2 and had a score of -.43, and we got a 3-fold-repetition draw result. Crafty agreed that it was a repetition even though it was not expecting it. Apparently Glaurung 2 thought it was worse and went for the repeat, while Crafty thought it was worse and thought the opponent would not repeat. The result was acceptable.
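
That first-pass check can be written down directly. A sketch using the thresholds quoted above (scores in pawns, from White's point of view); games that fail the check were then inspected by hand, so a `false` here means "look closer", not "the referee is wrong":

```c
#include <stdbool.h>
#include <string.h>

/* Compare a PGN Result tag against the engine's final search score.
   Thresholds mirror the ones stated in the post. */
bool result_matches_score(const char *result, double score) {
    if (strcmp(result, "1-0") == 0)
        return score > 8.0;                      /* white won big   */
    if (strcmp(result, "0-1") == 0)
        return score < -9.0;                     /* black won big   */
    if (strcmp(result, "1/2-1/2") == 0)
        return score >= -0.01 && score <= 0.01;  /* dead-even score */
    return false;                                /* unknown result  */
}
```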

That is why I am convinced the referee is not producing bogus results. I am in the process of making sure there are no issues in the non-Crafty games right now, as I did not check those. But then, I thought that Crafty can handle most any kind of move input, coordinate (e2e4) or SAN (e4, Nf3). I am not sure about the other programs, and am going to make sure that there are no time forfeits or complaints about illegal moves. Flags falling are not all that uncommon in 1+1 games (Crafty hardly ever loses one, but some older programs that I am not using have done this quite frequently), and as long as the flag falls because the program is thinking, it is OK. If it falls because the program thought the last move was illegal and ignored it, that needs attention.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New testing thread

Post by bob »

Here's a sample PGN for one game:

Code:

[Event "Computer chess game"]
[Site "compute-1-3"]
[Date "Thu Aug  7 13:22:07 2008"]
[Round "1"]
[White "Glaurung 2-epsilon/5"]
[Black "Crafty-22.2"]
[Result "1/2-1/2"]
[TimeControl "1+1"]
[FEN "1rbq1rk1/1pp2pbp/p1np1np1/4p3/2PPP3/2N1BP2/PP1Q2PP/R1N1KB1R w KQ e6 fmvn 10; id "Silver Suite - King's Indian, Saemisch : ECO E84";
"]

1. d4d5 Nd4 2. f1d3 c5 3. e1g1 Bd7 4. c1e2 h5 5. a1c1 h4 6. e3g5
h3 7. g2g3 Qc7 8. c3d1 Rfc8 9. f1f2 Nh7 10. g5h6 Bxh6 11. d2h6
Qd8 12. f3f4 Nf6 13. d1e3 b5 14. f4e5 dxe5 15. e2c3 Ng4 16. e3g4
Bxg4 17. c1f1 Qf8 18. h6g5 Bd7 19. g5f6 bxc4 20. d3c4 Bb5 21.
c4b5 axb5 22. d5d6 Rb7 23. c3d5 Re8 24. f6h4 Re6 25. d5f6 Rxf6
26. f2f6 Qe8 27. h4h3 Qd7 28. h3h4 Qc6 29. h2h3 Rd7 30. a2a3
b4 31. f1f2 bxa3 32. b2a3 c4 33. f2b2 Nb3 34. h4g4 Kg7 35. b2f2
Nd2 36. g4g5 Nxe4 37. g5e5 Nxf6 38. f2f6 Qb6+ 39. f6f2 Kh7 40.
g1g2 Qb7+ 41. g2h2 c3 42. a3a4 Qb6 43. f2f4 Qb2+ 44. h2g1 Qc1+
45. g1g2 Qc2+ 46. f4f2 Qd3 47. e5e8 Qd5+ 48. g2h2 Qxd6 49. f2f7
Rxf7 50. e8f7 Kh6 51. f7e8 Qd2+ 52. h2g1 c2 53. e8h8 Kg5 54.
h8e5 Kh6 55. e5h8 Kg5 56. h8e5 Kh6 57. e5h8 1/2-1/2 {game drawn
by 3-fold Repetition}
The "Site" tag identifies the node. The generic host name is in the form compute-rack-node, where there are 4 racks (1-4) and 32 nodes per rack. We do have a couple of oddball nodes that are named compute-0-0 and compute-0-1; same exact configuration, but in the central rack where the head and IBRIX stuff is. The Date tag should be self-explanatory. I do think I am going to run each move thru Crafty's SAN output routine (which is in the referee) to get rid of that ugly d4d5 sort of move notation, which will make the PGN easier for me personally to read.
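
As an aside, those ugly d4d5-style moves are plain coordinate notation, which is trivial to parse before handing off to a SAN routine. A tiny sketch (a1=0..h8=63 indexing is an assumption, and promotion suffixes such as "e7e8q" are simply ignored here):

```c
#include <stdbool.h>

/* Parse a coordinate move like "e2e4" into from/to square indices.
   Returns false on anything that is not four valid square characters;
   the short-circuiting range checks also stop safely at a NUL byte. */
bool parse_coord_move(const char *m, int *from, int *to) {
    if (m[0] < 'a' || m[0] > 'h' || m[1] < '1' || m[1] > '8' ||
        m[2] < 'a' || m[2] > 'h' || m[3] < '1' || m[3] > '8')
        return false;
    *from = (m[1] - '1') * 8 + (m[0] - 'a');
    *to   = (m[3] - '1') * 8 + (m[2] - 'a');
    return true;
}
```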