New testing thread

bob · Post by **bob** » Thu Aug 07, 2008 8:57 pm

Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoid user complaints about being unable to read some PGN files. Of course, there is also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into Crafty's internal format for the referee, then use Crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.
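A minimal sketch of the idea, in Python rather than Crafty's C, and not Crafty's actual code: castling tokens in any of the spellings above are mapped to the SAN form. The name `normalize_castling` and the `king_on_home` flag are made up for illustration; a real referee would parse the move against the full board state, and this sketch ignores check or annotation suffixes.

```python
# Hypothetical helper, not Crafty's code: map the castling spellings seen
# in the wild to the SAN forms O-O and O-O-O.

# Coordinate-style king moves that mean castling *only* when the king is
# still on its home square (e1 for White, e8 for Black).
_COORD_CASTLES = {
    "e1g1": "O-O", "e1c1": "O-O-O",
    "e8g8": "O-O", "e8c8": "O-O-O",
}

def normalize_castling(move: str, king_on_home: bool = True) -> str:
    """Return O-O / O-O-O for a recognised castling spelling, else the move unchanged."""
    # Treat lowercase oh, capital Oh and zero as the same character.
    token = move.strip().replace("0", "O").replace("o", "O")
    if token in ("O-O", "O-O-O"):
        return token
    # Ke1-g1, e1-g1 and e1g1 all reduce to the bare coordinate form.
    coord = move.strip().lower().lstrip("k").replace("-", "")
    if king_on_home and coord in _COORD_CASTLES:
        return _COORD_CASTLES[coord]
    return move

for raw in ["o-o", "O-O", "0-0", "0-0-0", "e1g1", "Ke1-g1", "Nf3"]:
    print(raw, "->", normalize_castling(raw))
```

The `king_on_home` flag stands in for the board lookup: e1g1 is only castling if the piece on e1 is actually the king, which is why the coordinate forms need the position and the O-O forms do not.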

Didn't show up in any of my crafty vs theworld games because crafty can handle anything, and always sends SAN moves only to the opponents. Others seem to send most any sort of approximation to algebraic moves and let it go at that...

bob · Post by **bob** » Thu Aug 07, 2008 9:23 pm

Ok, fix is in and I have played a bunch of games between the two offending engines, arasan 10 and glaurung2. Now, after a couple of hundred games, arasan has lost two on time and Glaurung2 has lost none (glaurung2 was losing previously because it was rejecting o-o/o-o-o moves). I'm going to check to make sure that other cases won't cause damage either, so I have to let the entire test run, then go through and look for any game that ends with a flag falling to see if it was legitimate or not.

More later. This will change the arasan results that Uri had complained about, because it had found a unique way to win games by sending opponents invalid moves that were really OK technically.

At least technically enough that my referee was happy with them... PGN now also looks much nicer.

xsadar · Post by **xsadar** » Thu Aug 07, 2008 10:08 pm

bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoid user complaints about being unable to read some PGN files. Of course, there is also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into Crafty's internal format for the referee, then use Crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.

Didn't show up in any of my crafty vs theworld games because crafty can handle anything, and always sends SAN moves only to the opponents. Others seem to send most any sort of approximation to algebraic moves and let it go at that...

Why should those of us who output e1g1 for castling be flogged? From the xboard/winboard protocol description, that's the only thing that's guaranteed to work with version 1 of the protocol. Not only that, but protocol v2 doesn't even require all v2 interfaces to implement SAN (although if they don't they probably should). For that reason, my engine outputs e1g1. And since I see no reason why my xboard communication needs to look pretty, I doubt I'll ever have it request the SAN feature when coordinate notation works just fine.

Uri Blass · Post by **Uri Blass** » Thu Aug 07, 2008 10:20 pm

I see that you already found a bug.
Note that I did not claim that something is wrong with the games of Crafty but that the result of arasan seems too good.

so my theory that something is wrong
You also need to check whether there is a problem with the results of fruit2.1 against arasan10, because I see a big difference in the results of fruit2.1 as well.

The difference is not enough to be sure that there is a bug but is enough to suspect that there is a bug.

Uri

bob · Post by **bob** » Thu Aug 07, 2008 10:42 pm

xsadar wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoid user complaints about being unable to read some PGN files. Of course, there is also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into Crafty's internal format for the referee, then use Crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.

Didn't show up in any of my crafty vs theworld games because crafty can handle anything, and always sends SAN moves only to the opponents. Others seem to send most any sort of approximation to algebraic moves and let it go at that...

Why should those of us who output e1g1 for castling be flogged? From the xboard/winboard protocol description, that's the only thing that's guaranteed to work with version 1 of the protocol. Not only that, but protocol v2 doesn't even require all v2 interfaces to implement SAN (although if they don't they probably should). For that reason, my engine outputs e1g1. And since I see no reason why my xboard communication needs to look pretty, I doubt I'll ever have it request the SAN feature when coordinate notation works just fine.

The reason is that O-O has been _the_ standard form for indicating castling for so long it is not funny. More importantly, it is _the_ form defined for inclusion in a PGN game (SAN is precisely defined, o-o and Kg1 are _not_ included).

Why is outputting O-O or O-O-O such a challenge? My first program used that syntax in 1968 in the ancient days of English Descriptive notation, and I've never had a problem doing it since. But the PGN standard is as good a reason as any to at least use proper notation in algebraic notation, even if you eschew SAN output.

bob · Post by **bob** » Thu Aug 07, 2008 10:47 pm

Uri Blass wrote:I see that you already found a bug.

Yes, although not a bug in the referee which is what I had tested thoroughly.

Uri Blass wrote:Note that I did not claim that something is wrong with the games of Crafty but that the result of arasan seems too good.

so my theory that something is wrong
You also need to check whether there is a problem with the results of fruit2.1 against arasan10, because I see a big difference in the results of fruit2.1 as well.

I have a test running now, but the results look reasonable. If there was a problem there, it was also probably the o-o issue, even though I have not looked at fruit source to verify it.

Uri Blass wrote:The difference is not enough to be sure that there is a bug but is enough to suspect that there is a bug.

Uri

Here is the first run. The first set of data is everyone vs everyone; the second set is just crafty vs everyone:

Thu Aug 7 15:29:01 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   116   19   19   800   68%   -23   17%
   2 Fruit 2.1               63   19   18   800   60%   -13   20%
   3 opponent-21.7           27   18   18   800   54%    -5   22%
   4 Glaurung 1.1 SMP         6   18   19   800   50%    -1   17%
   5 Crafty-22.2            -19   18   18   800   47%     4   22%
   6 Arasan 10.0           -194   20   21   800   21%    39   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    90   42   41   160   64%   -19   14%
   2 Fruit 2.1               71   40   39   160   63%   -19   26%
   3 opponent-21.7           17   37   37   160   55%   -19   37%
   4 Glaurung 1.1 SMP        13   41   40   160   54%   -19   14%
   5 Crafty-22.2            -19   18   18   800   47%     4   22%
   6 Arasan 10.0           -173   41   43   160   30%   -19   18%

bob · Post by **bob** » Thu Aug 07, 2008 11:01 pm

Since I am not sure whether or not the person that contacted me via email will join in here, I thought I would provide excerpts that we could perhaps use for a sane discourse on the issues. Excerpt 1:

===================================================
The central point of miscommunication seems to have been confusion between the everyday meaning of dependent (causally connected) and the
mathematical meaning of dependent (correlated). I am astonished that self-styled mathematical experts at talkchess.com who were
criticizing you didn't make this distinction. The difference in the two meanings is stark if one considers two engines playing each other
twice from a given position with fixed node counts, because the results of the two playouts will surely be the same. Neither playout
affects the other causally, so they are not dependent at all in the everyday sense, but the winner is always the same, which is to say
the outputs are perfectly correlated, and therefore as mathematically dependent as it gets.
====================================================

This is a topic I had already mentioned. In a perfect world, if A plays B, and A is better than B, then A will always win. And there will be perfect correlation between games, since any game can be used to predict the outcome of any other, even though there is none of the "causal dependency" present, since the games do not affect each other in any fashion.

Here's the next segment which again is a rehash of what has been explained previously:

=====================================================
Let's consider a series of hypothetical trial runs. I assume that you are as capable as anyone in the industry of preventing any causal
dependence between various games in the trials, so causal dependence will not factor in my calculations at all. I believe you when you
say that you have solved that problem.

Trial A: Crafty plays forty positions against each of five opponents with colors each way for a total of 400 games. The engines are each
limited to a node count of 10,000,000. Crafty wins 198 games.

Trial B: Same as Trial A, except the node count limit is changed to 10,010,000. Crafty wins 190 games.

Now we compare these two results to see if anything extraordinary has happened. In 400 games, the standard deviation is 10, and the
difference in results was only 8, so we are well within expected bounds. There's nothing to get excited about, and we move on to the
next experiment.

Trial C: Same as Trial A, except that each position-opponent-color combination is played out 64 times. Yes, this is a silly experiment,
because we know that repeated playouts with a fixed node count give identical results, but bear with me. Crafty wins (as expected)
exactly 12672 games.

Trial D: Same as Trial B, except that each position-opponent-color combination is played out 64 times. Crafty wins 12160, as we knew it
would.

Now we compare the latter two trials. In 25,600 games the standard deviation is 80, and our difference in result was 512, so we are more
than six sigmas out. Holy cow! Run out and buy lottery tickets!

In this deterministic case it is easy to see what happened. The perfect correlation of the sixty-four repeats of each combination meant
that we were gaining no new information by expanding the trial. The calculation of standard deviation, however, assumes no correlation
whatsoever, i.e. perfect mathematical independence. Since the statistical assumption was not met, the statistical result is absurd.
=====================================================
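The arithmetic in this excerpt can be checked with a few lines of Python. This is only a sketch under the excerpt's own simplifying assumption (each game is a win or a loss with probability 0.5, so the standard deviation of the win count is sqrt(n)/2); the function name `sigma_wins` is mine, not anything from the email.

```python
import math

def sigma_wins(games: int, p: float = 0.5) -> float:
    """Standard deviation of the win count for `games` independent win/loss games."""
    return math.sqrt(games * p * (1.0 - p))

REPEATS = 64
wins_a, wins_b = 198, 190            # Trials A and B, 400 games each

# Trials A vs B: 400 games, treated as independent.
sigma_400 = sigma_wins(400)
print(sigma_400, (wins_a - wins_b) / sigma_400)

# Trials C vs D: every game replayed 64 times with identical results,
# but the formula still (wrongly) treats all 25,600 games as independent.
wins_c, wins_d = wins_a * REPEATS, wins_b * REPEATS
sigma_25600 = sigma_wins(400 * REPEATS)
print(wins_c, wins_d, sigma_25600, (wins_c - wins_d) / sigma_25600)
```

The difference in wins grows by a factor of 64 while the computed sigma grows only by sqrt(64) = 8, so the same underlying result goes from 0.8 sigma to 6.4 sigma without any new information being added.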

But at this point, we return to "stampee foot, impossible, stampee foot, test is flawed, stampee foot, etc..."

Now before I go farther, I will stop here and see if anyone wants to contest, add to, contradict, etc the above.

So just perhaps, this explains the so-called "six-sigma" event of my first post. And it explains why so many runs have been producing odd results. And it does once again explain exactly how there is correlation, simply because of the opponents and positions and semi-deterministic behavior of programs... Of course, it also suggests quite a bit more than that, since so many are doing this same exact test...

His next suggestion to help is one that will take me a bit of time to think about, as it is _completely_ counter-intuitive to me on first analysis. I'll save that until after the discussion on the above.

your turn...

bob · Post by **bob** » Thu Aug 07, 2008 11:06 pm

Let me add, this was the reason I originally started posting this data here. Some seem to think it is just a way to start an argument. But it was intended to try to get others to think critically and figure out exactly what might be causing results that looked extremely unusual to me.

Perhaps my original impression was wrong and this is going to turn out to be completely normal, perhaps not. But at least _some_ seem interested in discussing and analyzing what is going on to understand it even if it is not broken. Others just want to post one-liners and argue...

Maybe this will lead to something eventually, that will help everyone, even if everyone is not willing to participate in a helpful way...

Tony · Post by **Tony** » Thu Aug 07, 2008 11:11 pm

This seems nonsense.

The whole point was to NOT repeat the same games. 100 games have a certain uncertainty, repeating exactly the same 100 games, a 100 times, does not decrease the uncertainty while playing 10000 different games should.

After playing the same 100 games over and over you can not claim to have played 10,000 games, you still only played 100 games.
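A rough simulation of that claim, as a sketch only: it assumes every distinct game is a fair coin flip with no draws and that a replayed game always gives the same result. The function name `score_spread`, the seed and the trial count are arbitrary choices for illustration.

```python
import random
import statistics

def score_spread(distinct_games: int, repeats: int, trials: int, rng: random.Random) -> float:
    """Std. dev. of the overall win fraction when each distinct game is replayed `repeats` times."""
    fractions = []
    for _ in range(trials):
        # Each distinct game is a coin flip; replays just copy its result,
        # so they multiply wins and total games by the same factor.
        wins = sum(rng.random() < 0.5 for _ in range(distinct_games))
        fractions.append(wins * repeats / (distinct_games * repeats))
    return statistics.pstdev(fractions)

rng = random.Random(2008)  # fixed seed so the run is repeatable
repeated = score_spread(100, 100, 300, rng)       # 100 games replayed 100 times
independent = score_spread(10_000, 1, 300, rng)   # 10,000 different games
print(repeated, independent, repeated / independent)
```

Under these assumptions the spread of the replayed run should stay near 0.05, the value for 100 games, while the 10,000 different games come in near 0.005, roughly ten times tighter.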

Or maybe I misunderstood.

Tony

bob · Post by **bob** » Thu Aug 07, 2008 11:29 pm

This data gets more interesting every time I run it. Here are two runs now: the first Elo table is overall, the second is crafty vs everyone only, and the third and fourth are the same two tables for the second run. Look at the crafty vs everyone data, final standings, for some interesting comparisons...

Thu Aug 7 15:29:01 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   116   19   19   800   68%   -23   17%
   2 Fruit 2.1               63   19   18   800   60%   -13   20%
   3 opponent-21.7           27   18   18   800   54%    -5   22%
   4 Glaurung 1.1 SMP         6   18   19   800   50%    -1   17%
   5 Crafty-22.2            -19   18   18   800   47%     4   22%
   6 Arasan 10.0           -194   20   21   800   21%    39   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    90   42   41   160   64%   -19   14%
   2 Fruit 2.1               71   40   39   160   63%   -19   26%
   3 opponent-21.7           17   37   37   160   55%   -19   37%
   4 Glaurung 1.1 SMP        13   41   40   160   54%   -19   14%
   5 Crafty-22.2            -19   18   18   800   47%     4   22%
   6 Arasan 10.0           -173   41   43   160   30%   -19   18%
Thu Aug 7 16:25:56 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123   19   19   800   69%   -25   19%
   2 Fruit 2.1               61   19   18   800   59%   -12   21%
   3 opponent-21.7           27   18   18   800   54%    -5   22%
   4 Glaurung 1.1 SMP        25   19   19   800   53%    -5   18%
   5 Crafty-22.2            -31   18   18   800   45%     6   22%
   6 Arasan 10.0           -205   21   21   800   20%    41   16%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   104   42   40   160   68%   -30   18%
   2 Glaurung 1.1 SMP        46   41   40   160   60%   -30   16%
   3 opponent-21.7           42   38   37   160   61%   -30   36%
   4 Fruit 2.1               41   39   39   160   60%   -30   24%
   5 Crafty-22.2            -30   18   18   800   45%     6   22%
   6 Arasan 10.0           -202   41   44   160   28%   -30   18%
