New testing thread

bob · Post by **bob** » Thu Aug 07, 2008 11:34 pm

Tony wrote:This seems nonsense.

The whole point was to NOT repeat the same games. 100 games have a certain uncertainty, repeating exactly the same 100 games, a 100 times, does not decrease the uncertainty while playing 10000 different games should.

After playing the same 100 games over and over you can not claim to have played 10,000 games, you still only played 100 games.

Or maybe I misunderstood.

Tony

It was a first cut. Normally, using the same time per move, you'd expect the games to repeat. Even if they are slightly different in terms of the moves played, you'd think they would have similar (correlated) results. That's his point. He discusses the time-based limit next, but I wanted to wait on that, because this clearly shows that there is correlation in the games, which we knew there was. It is just not as pronounced in the time-based games. But it is there, which makes the statistical test suspect, since the SD is computed assuming no correlation at all, when there is obviously some present...
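To make Tony's objection concrete, here is a minimal Python sketch (not anything from the actual test harness) of what happens to the naive standard error when the same 100 games are repeated 100 times. The game results are made-up random values and the helper names are hypothetical; the only point is that the formula, which assumes independent games, reports an error about ten times smaller even though no new information was added.

```python
import math
import random

def standard_error(results):
    """Standard error of the mean score, assuming independent games."""
    n = len(results)
    mean = sum(results) / n
    var = sum((r - mean) ** 2 for r in results) / (n - 1)
    return math.sqrt(var / n)

# Hypothetical results: 1.0 = win, 0.5 = draw, 0.0 = loss.
rng = random.Random(1)
base_games = [rng.choice([0.0, 0.5, 1.0]) for _ in range(100)]

# Replaying the identical 100 games 100 times gives 10,000 "results".
repeated_games = base_games * 100

honest_se = standard_error(base_games)     # 100 real games
naive_se = standard_error(repeated_games)  # pretends 10,000 independent games

print(f"SE from 100 games:        {honest_se:.4f}")
print(f"SE claimed for 10,000:    {naive_se:.4f}")
```

The mean score is identical in both cases; only the claimed uncertainty shrinks, which is the sense in which correlated games make the SD suspect.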

I'll go on to the second part of his email when I get home tonight.

Zach Wegner · Post by **Zach Wegner** » Thu Aug 07, 2008 11:49 pm

This all makes sense, but it just points to the testing setup being flawed. Testing by time in only 40 starting positions practically guarantees non-random results, if not plenty of identical games.

I have to reiterate my proposed testing method of fixed number of nodes, but with the GUI/referee using a random value per game. This avoids any time based fluctuations--though Bob avoids most of this effect--but more importantly avoids any statistical patterns in the results beyond pure randomness. I think if the random value is equally distributed between x and 2x it should give a pretty fair indication of strength. I also think that using just 40 hand-picked positions gives too dependent results. This point is a bit harder to get around, as it's not clear exactly what you are trying to measure. Do you use opening books? I think if I wanted to make a super-solid testing scheme, I'd randomly select ~1000 positions from grandmaster games after a certain number of moves, of course excluding duplicates. Not quite perfect, but IMO better than every other scheme out there.

bob · Post by **bob** » Thu Aug 07, 2008 11:57 pm

Zach Wegner wrote:This all makes sense, but it just points to the testing setup being flawed. Testing by time in only 40 starting positions practically guarantees non-random results, if not plenty of identical games.

I have to reiterate my proposed testing method of fixed number of nodes, but with the GUI/referee using a random value per game. This avoids any time based fluctuations--though Bob avoids most of this effect--but more importantly avoids any statistical patterns in the results beyond pure randomness. I think if the random value is equally distributed between x and 2x it should give a pretty fair indication of strength. I also think that using just 40 hand-picked positions gives too dependent results. This point is a bit harder to get around, as it's not clear exactly what you are trying to measure. Do you use opening books? I think if I wanted to make a super-solid testing scheme, I'd randomly select ~1000 positions from grandmaster games after a certain number of moves, of course excluding duplicates. Not quite perfect, but IMO better than every other scheme out there.

All you have to convince me of is that the two following tests are significantly different:

(1) using time to limit the search, which generates a tremendous variance in the results as I have seen and posted here.

(2) using random numbers of nodes to limit the search, which generates a tremendous variance in the results as I have also seen and posted here.

But I don't see how they are different. I randomly choose a set number of nodes, or use a set amount of time which randomly sets a number of nodes. I don't see how one is better than the other other than if you re-run with the same number of nodes you will get the same exact set of game results. But if you modify one thing in the program, that will change the shape of the tree and also change the results in a similar way.

So how does one favor one of those over the other???

The basic issue is that using time uses a random number of nodes, which introduces significant variability. So how does stipulating a random number of nodes for a limit get away from that??

Zach Wegner · Post by **Zach Wegner** » Fri Aug 08, 2008 12:12 am

bob wrote:All you have to convince me of is that the two following tests are significantly different:

(1) using time to limit the search, which generates a tremendous variance in the results as I have seen and posted here.

(2) using random numbers of nodes to limit the search, which generates a tremendous variance in the results as I have also seen and posted here.

But I don't see how they are different. I randomly choose a set number of nodes, or use a set amount of time which randomly sets a number of nodes. I don't see how one is better than the other other than if you re-run with the same number of nodes you will get the same exact set of game results. But if you modify one thing in the program, that will change the shape of the tree and also change the results in a similar way.

So how does one favor one of those over the other???

The basic issue is that using time uses a random number of nodes, which introduces significant variability. So how does stipulating a random number of nodes for a limit get away from that??

It does get away from that. Using time produces an _unpredictable_ number of nodes, but not a _random_ number. The distribution of number of nodes against time is very skewed, and only a very small number of samples are taken against this distribution. So the results may look random, but they are heavily dependent, as your mathematician friend says. The difference is that using a random value, you can draw statistical conclusions from the results, such as Elo, error margins, etc.

Another advantage that I can think of now, is that using a deterministic PRNG, you can _exactly_ reproduce an entire match.
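A rough sketch of what I mean, in Python. The function name, the seed, and the value of x are all placeholders, and a real referee would pass each limit to the engines in whatever way it already limits nodes. The limits are drawn uniformly between x and 2x, and because the generator is seeded, the whole sequence can be regenerated exactly.

```python
import random

def node_limits(seed, games, base_nodes):
    """Return one node limit per game, uniform on [base_nodes, 2 * base_nodes].

    A private, seeded generator is used so the same seed always
    yields the same sequence of limits (and so the same match).
    """
    rng = random.Random(seed)
    return [rng.randint(base_nodes, 2 * base_nodes) for _ in range(games)]

# Hypothetical values: x = 1,000,000 nodes, an 800-game match, seed 2008.
limits = node_limits(seed=2008, games=800, base_nodes=1_000_000)
replay = node_limits(seed=2008, games=800, base_nodes=1_000_000)
print(limits[:3], limits == replay)
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps the sequence independent of any other random calls the referee might make.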

xsadar · Post by **xsadar** » Fri Aug 08, 2008 12:38 am

bob wrote:
xsadar wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoiding user complaints about being unable to read some PGN files. Of course, there is also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into crafty's internal format for the referee, then use crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.

Didn't show up in any of my crafty vs theworld games because crafty can handle anything, and always sends SAN moves only to the opponents. Others seem to send most any sort of approximation to algebraic moves and let it go at that...
Why should those of us who output e1g1 for castling be flogged? From the xboard/winboard protocol description, that's the only thing that's guaranteed to work with version 1 of the protocol. Not only that, but protocol v2 doesn't even require all v2 interfaces to implement SAN (although if they don't they probably should). For that reason, my engine outputs e1g1. And since I see no reason why my xboard communication needs to look pretty, I doubt I'll ever have it request the SAN feature when coordinate notation works just fine.
The reason is that O-O has been _the_ standard form for indicating castling for so long it is not funny. More importantly, it is _the_ form defined for inclusion in a PGN game (SAN is precisely defined, o-o and Kg1 are _not_ included).

Why is outputting O-O or O-O-O such a challenge? My first program used that syntax in 1968 in the ancient days of English Descriptive notation, and I've never had a problem doing it since. But the PGN standard is as good a reason as any to at least use proper notation in algebraic notation, even if you eschew SAN output.

Aren't we talking about xboard output here?

For user output, I'll follow common convention: SAN is very standard. O-O and O-O-O are the standard for castling.

For PGN output, I'll follow the PGN specification: SAN using O-O and O-O-O is the only valid form.

For xboard output, I'll follow the xboard protocol: coordinate notation using e1g1, e1c1, e8g8, and e8c8 are required to work on all interfaces. SAN, O-O and O-O-O are not.
It's not that outputting O-O is difficult. It's that for xboard, I see more reason not to do it than reason to do it. If there's one method that always works, it makes more sense to always use it, rather than asking the interface if it supports another method so I can sometimes use one method and sometimes use another method.

krazyken · Post by **krazyken** » Fri Aug 08, 2008 2:16 am

That is a much clearer way of stating what I was trying to get at in counting duplicate games.

bob · Post by **bob** » Fri Aug 08, 2008 2:16 am

xsadar wrote:
bob wrote:
xsadar wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoiding user complaints about being unable to read some PGN files. Of course, there is also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into crafty's internal format for the referee, then use crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.

Didn't show up in any of my crafty vs theworld games because crafty can handle anything, and always sends SAN moves only to the opponents. Others seem to send most any sort of approximation to algebraic moves and let it go at that...
Why should those of us who output e1g1 for castling be flogged? From the xboard/winboard protocol description, that's the only thing that's guaranteed to work with version 1 of the protocol. Not only that, but protocol v2 doesn't even require all v2 interfaces to implement SAN (although if they don't they probably should). For that reason, my engine outputs e1g1. And since I see no reason why my xboard communication needs to look pretty, I doubt I'll ever have it request the SAN feature when coordinate notation works just fine.
The reason is that O-O has been _the_ standard form for indicating castling for so long it is not funny. More importantly, it is _the_ form defined for inclusion in a PGN game (SAN is precisely defined, o-o and Kg1 are _not_ included).

Why is outputting O-O or O-O-O such a challenge? My first program used that syntax in 1968 in the ancient days of English Descriptive notation, and I've never had a problem doing it since. But the PGN standard is as good a reason as any to at least use proper notation in algebraic notation, even if you eschew SAN output.
Aren't we talking about xboard output here?

For user output, I'll follow common convention: SAN is very standard. O-O and O-O-O are the standard for castling.

For PGN output, I'll follow the PGN specification: SAN using O-O and O-O-O is the only valid form.

For xboard output, I'll follow the xboard protocol: coordinate notation using e1g1, e1c1, e8g8, and e8c8 are required to work on all interfaces. SAN, O-O and O-O-O are not.
It's not that outputting O-O is difficult. It's that for xboard, I see more reason not to do it than reason to do it. If there's one method that always works, it makes more sense to always use it, rather than asking the interface if it supports another method so I can sometimes use one method and sometimes use another method.

So far as I know, and I can only state back to 1995, coordinate notation with respect to xboard has not insisted on Kg1 to indicate O-O. Crafty played its first games on ICC in December of that year and worked flawlessly with xboard, using the normal O-O and O-O-O. And O-O will _always_ work with xboard or winboard. That's all I have ever used and nobody has ever told me it was failing... So I am not quite sure where you are coming from with that.
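For what it is worth, the "accept everything, emit one form" idea can be sketched in a few lines of Python. This is only an illustration and not Crafty's code (Crafty is written in C and goes through its own move parser and OutputMove()). The function name and the `king_moves` flag are assumptions: a real referee would know from the board whether the piece on e1/e8 is a king, which is needed before treating e1g1 as castling.

```python
# Coordinate forms of castling in standard chess (not Chess960).
COORD_CASTLES = {"e1g1": "O-O", "e8g8": "O-O", "e1c1": "O-O-O", "e8c8": "O-O-O"}

def normalize_castling(text, king_moves=False):
    """Map o-o, 0-0, O-O (and long forms) to SAN castling.

    If king_moves is True, e1g1 / Ke1-g1 style input is also mapped.
    Any other move text is returned unchanged.
    """
    t = text.strip()
    core = t.rstrip("+#")               # keep a check/mate suffix if present
    suffix = t[len(core):]
    key = core.lower().replace("0", "o")
    if key == "o-o":
        return "O-O" + suffix
    if key == "o-o-o":
        return "O-O-O" + suffix
    if king_moves:
        coord = key.replace("-", "")
        if coord.startswith("k"):
            coord = coord[1:]
        if coord in COORD_CASTLES:
            return COORD_CASTLES[coord] + suffix
    return text

for move in ["o-o", "0-0-0", "O-O+", "Nf3"]:
    print(move, "->", normalize_castling(move))
print("Ke1-g1 ->", normalize_castling("Ke1-g1", king_moves=True))
```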

bob · Post by **bob** » Fri Aug 08, 2008 2:23 am

Zach Wegner wrote:
bob wrote:All you have to convince me of is that the two following tests are significantly different:

(1) using time to limit the search, which generates a tremendous variance in the results as I have seen and posted here.

(2) using random numbers of nodes to limit the search, which generates a tremendous variance in the results as I have also seen and posted here.

But I don't see how they are different. I randomly choose a set number of nodes, or use a set amount of time which randomly sets a number of nodes. I don't see how one is better than the other other than if you re-run with the same number of nodes you will get the same exact set of game results. But if you modify one thing in the program, that will change the shape of the tree and also change the results in a similar way.

So how does one favor one of those over the other???

The basic issue is that using time uses a random number of nodes, which introduces significant variability. So how does stipulating a random number of nodes for a limit get away from that??
It does get away from that. Using time produces an _unpredictable_ number of nodes, but not a _random_ number. The distribution of number of nodes against time is very skewed, and only a very small number of samples are taken against this distribution. So the results may look random, but they are heavily dependent, as your mathematician friend says. The difference is that using a random value, you can draw statistical conclusions from the results, such as Elo, error margins, etc.

Another advantage that I can think of now, is that using a deterministic PRNG, you can _exactly_ reproduce an entire match.

The only difference I can see between the two approaches is the distribution of node counts. If you use 3M +/- 500K, you would probably get a uniform distribution evenly scattered over the interval unless you modified the PRNG to produce a normal rather than uniform distribution (easy enough to do to be sure). If we set the time so that the average search covers about 3M nodes, and assuming some sort of upper/lower bound of say 500K nodes, we get a normal distribution centered on 3M. Now are you going to tell me that the uniform distribution somehow is better? Or that it somehow more accurately simulates the real world?
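To put numbers on the comparison, here is a small Python sketch that samples node counts both ways. It is only illustrative: the normal shape for the time-limited case is my assumption above, not a measurement, and treating the 500K bound as three standard deviations is a further assumption made just for this sketch.

```python
import random
import statistics

CENTER = 3_000_000
HALF_WIDTH = 500_000

def uniform_counts(rng, n):
    """Node counts spread evenly over [2.5M, 3.5M]."""
    return [rng.uniform(CENTER - HALF_WIDTH, CENTER + HALF_WIDTH) for _ in range(n)]

def normal_counts(rng, n, sigmas=3.0):
    """Node counts from a normal centered on 3M, rejected outside the bounds.

    Assumes HALF_WIDTH corresponds to `sigmas` standard deviations.
    """
    sd = HALF_WIDTH / sigmas
    counts = []
    while len(counts) < n:
        x = rng.gauss(CENTER, sd)
        if abs(x - CENTER) <= HALF_WIDTH:
            counts.append(x)
    return counts

rng = random.Random(7)
uni = uniform_counts(rng, 10_000)
nor = normal_counts(rng, 10_000)
print("uniform SD:", round(statistics.stdev(uni)))
print("normal  SD:", round(statistics.stdev(nor)))
```

Under these assumptions the uniform sample has the larger spread, since a uniform distribution of total width w has standard deviation w divided by the square root of 12.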

So, again, I don't see any possible advantage other than repeatability, which is completely worthless in this context since we already know how to get perfect repeatability, but it gives too many duplicate games to provide useful information...

If you run the same test twice, and use random numbers to produce a uniform distribution from 2.5M to 3.5M, that would seem to be _more_ random than the current normal distribution centered on 3M. Why would we want to go _more_ random when we know the results change with each different node count?

bob · Post by **bob** » Fri Aug 08, 2008 2:30 am

Here's the result of 4 distinct runs:

Thu Aug 7 15:29:01 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   116   19   19   800   68%   -23   17%
   2 Fruit 2.1               63   19   18   800   60%   -13   20%
   3 opponent-21.7           27   18   18   800   54%    -5   22%
   4 Glaurung 1.1 SMP         6   18   19   800   50%    -1   17%
   5 Crafty-22.2            -19   18   18   800   47%     4   22%
   6 Arasan 10.0           -194   20   21   800   21%    39   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    90   42   41   160   64%   -19   14%
   2 Fruit 2.1               71   40   39   160   63%   -19   26%
   3 opponent-21.7           17   37   37   160   55%   -19   37%
   4 Glaurung 1.1 SMP        13   41   40   160   54%   -19   14%
   5 Crafty-22.2            -19   18   18   800   47%     4   22%
   6 Arasan 10.0           -173   41   43   160   30%   -19   18%
Thu Aug 7 16:25:56 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123   19   19   800   69%   -25   19%
   2 Fruit 2.1               61   19   18   800   59%   -12   21%
   3 opponent-21.7           27   18   18   800   54%    -5   22%
   4 Glaurung 1.1 SMP        25   19   19   800   53%    -5   18%
   5 Crafty-22.2            -31   18   18   800   45%     6   22%
   6 Arasan 10.0           -205   21   21   800   20%    41   16%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   104   42   40   160   68%   -30   18%
   2 Glaurung 1.1 SMP        46   41   40   160   60%   -30   16%
   3 opponent-21.7           42   38   37   160   61%   -30   36%
   4 Fruit 2.1               41   39   39   160   60%   -30   24%
   5 Crafty-22.2            -30   18   18   800   45%     6   22%
   6 Arasan 10.0           -202   41   44   160   28%   -30   18%
Thu Aug 7 17:20:58 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   135   20   19   800   70%   -27   17%
   2 Fruit 2.1               61   19   18   800   59%   -12   20%
   3 opponent-21.7           24   18   18   800   54%    -5   22%
   4 Glaurung 1.1 SMP        21   18   18   800   53%    -4   18%
   5 Crafty-22.2            -35   18   18   800   44%     7   23%
   6 Arasan 10.0           -206   21   21   800   20%    41   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   147   44   42   160   74%   -34   18%
   2 Glaurung 1.1 SMP        43   40   39   160   61%   -34   23%
   3 Fruit 2.1               35   40   39   160   60%   -34   23%
   4 opponent-21.7           20   38   37   160   58%   -34   33%
   5 Crafty-22.2            -34   18   18   800   44%     7   23%
   6 Arasan 10.0           -211   42   44   160   27%   -34   18%
Thu Aug 7 18:18:54 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   131   19   19   800   70%   -26   18%
   2 Fruit 2.1               67   19   19   800   60%   -13   19%
   3 opponent-21.7           16   18   18   800   52%    -3   21%
   4 Glaurung 1.1 SMP         6   19   19   800   51%    -1   15%
   5 Crafty-22.2            -25   18   18   800   46%     5   20%
   6 Arasan 10.0           -196   20   21   800   22%    39   14%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   121   43   41   160   69%   -25   17%
   2 Glaurung 1.1 SMP        41   41   40   160   59%   -25   16%
   3 opponent-21.7           30   38   38   160   58%   -25   31%
   4 Fruit 2.1               19   39   39   160   56%   -25   23%
   5 Crafty-22.2            -25   18   18   800   46%     5   20%
   6 Arasan 10.0           -187   42   44   160   29%   -25   14%

The data with round-robin matches is generally ordered the same, which is good and unlike the Crafty vs world games. But the Elo is still bouncing around enough that it would be very difficult to make a modest change and then successfully measure the change. So while there are possible improvements in deciding which of the programs I am using is the best, the ability to measure the difference between two Crafty versions seems harder. I am going to make a run with the check extension set to zero to see how that goes, another 4 runs, and I will post the results along with these again...
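As a rough cross-check on the size of the error bars, here is a Python sketch using the plain logistic Elo formula. This is not necessarily the model the rating tool behind the tables uses, so the numbers will not match exactly, and it assumes independent games, which is the very thing in question in this thread. The function names are hypothetical; the inputs are Crafty-22.2's 47% score and 22% draws over 800 games from the first run.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_interval(score, draw_rate, games, z=1.96):
    """Approximate 95% interval for the Elo difference.

    Assumes independent games with results 1, 0.5, 0 and converts the
    score interval to Elo through the logistic formula.
    """
    wins = score - draw_rate / 2.0
    variance = wins + 0.25 * draw_rate - score ** 2   # per-game variance
    se = math.sqrt(variance / games)
    return elo_from_score(score - z * se), elo_from_score(score + z * se)

low, high = elo_interval(score=0.47, draw_rate=0.22, games=800)
print(f"Elo {elo_from_score(0.47):.0f}, interval {low:.0f} to {high:.0f}")

# The 160-game sub-matches have much wider bars for the same score.
low160, high160 = elo_interval(score=0.47, draw_rate=0.22, games=160)
print(f"160 games: interval {low160:.0f} to {high160:.0f}")
```

Even under the independence assumption the interval for 800 games is roughly as wide as the run-to-run swings in the tables, which is why a modest change is hard to detect at this sample size.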

TonyJH · Post by **TonyJH** » Fri Aug 08, 2008 2:48 am

bob wrote:
xsadar wrote:
bob wrote:
xsadar wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

To solve this in Crafty, I accept o-o (small oh's), O-O (capital Oh's) and even 0-0 (zero-zero). This appears to be the only sane way to deal with castling and avoiding user complaints about being unable to read some PGN files. Of course, there is also the brain-dead e1g1, Ke1-g1 and similar approaches, where the author ought to be flogged, but Crafty will handle them all. I'm going to simply take a move from an opponent, parse it into crafty's internal format for the referee, then use crafty's OutputMove() function to convert it back to a _normal_ SAN move to get rid of this crap.

Didn't show up in any of my crafty vs theworld games because crafty can handle anything, and always sends SAN moves only to the opponents. Others seem to send most any sort of approximation to algebraic moves and let it go at that...
Why should those of us who output e1g1 for castling be flogged? From the xboard/winboard protocol description, that's the only thing that's guaranteed to work with version 1 of the protocol. Not only that, but protocol v2 doesn't even require all v2 interfaces to implement SAN (although if they don't they probably should). For that reason, my engine outputs e1g1. And since I see no reason why my xboard communication needs to look pretty, I doubt I'll ever have it request the SAN feature when coordinate notation works just fine.
The reason is that O-O has been _the_ standard form for indicating castling for so long it is not funny. More importantly, it is _the_ form defined for inclusion in a PGN game (SAN is precisely defined, o-o and Kg1 are _not_ included).

Why is outputting O-O or O-O-O such a challenge? My first program used that syntax in 1968 in the ancient days of English Descriptive notation, and I've never had a problem doing it since. But the PGN standard is as good a reason as any to at least use proper notation in algebraic notation, even if you eschew SAN output.
Aren't we talking about xboard output here?

For user output, I'll follow common convention: SAN is very standard. O-O and O-O-O are the standard for castling.

For PGN output, I'll follow the PGN specification: SAN using O-O and O-O-O is the only valid form.

For xboard output, I'll follow the xboard protocol: coordinate notation using e1g1, e1c1, e8g8, and e8c8 are required to work on all interfaces. SAN, O-O and O-O-O are not.
It's not that outputting O-O is difficult. It's that for xboard, I see more reason not to do it than reason to do it. If there's one method that always works, it makes more sense to always use it, rather than asking the interface if it supports another method so I can sometimes use one method and sometimes use another method.
So far as I know, and I can only state back to 1995, coordinate notation with respect to xboard has not insisted on Kg1 to indicate O-O. Crafty played its first games on ICC in December of that year and worked flawlessly with xboard, using the normal O-O and O-O-O. And O-O will _always_ work with xboard or winboard. That's all I have ever used and nobody has ever told me it was failing... So I am not quite sure where you are coming from with that.

For engines that use coordinate notation (feature san=0), "e1g1" is what the engine should send to WinBoard/XBoard, and is also what WinBoard/XBoard will send to the engine. For SAN (feature san=1) engines, "O-O" is correct. I wouldn't be surprised if WinBoard tolerates either text from either type of engine, though.
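In code form, the rule is just a branch on the negotiated san feature. A minimal Python sketch follows; the function and parameter names are made up for illustration, and it covers standard chess only, where the king starts on the e-file.

```python
def castling_text(side, kingside, san_enabled):
    """Text an engine would send for castling.

    side: "white" or "black"
    kingside: True for short castling, False for long
    san_enabled: True if the engine requested feature san=1
    """
    if san_enabled:
        return "O-O" if kingside else "O-O-O"
    rank = "1" if side == "white" else "8"
    target_file = "g" if kingside else "c"
    return f"e{rank}{target_file}{rank}"

print(castling_text("white", True, san_enabled=False))   # coordinate form
print(castling_text("black", False, san_enabled=True))   # SAN form
```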
