New testing thread

hgm · Post by **hgm** » Fri Aug 08, 2008 9:15 am

TonyJH wrote:For engines that use coordinate notation (feature san=0), "e1g1" is what the engine should send to WinBoard/XBoard, and is also what WinBoard/XBoard will send to the engine. For SAN (feature san=1) engines, "O-O" is correct. I wouldn't be surprised if WinBoard tolerates either text from either type of engine, though.

The WinBoard parser is extremely forgiving, and even understands things like o-o, oo, 00 (instead of O-O).

But, as the protocol document specifies, there is no reason at all why other GUIs would tolerate such breaches of protocol, so you can only say that your engine complies with WB protocol if you send e1g1.
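A minimal sketch of how an engine-side helper could map the tolerated castling spellings to the coordinate form before sending. This is not WinBoard's actual parser; the function name, the table, the "0-0" spelling, and the queenside variants are assumptions added for illustration (the post only lists o-o, oo and 00).

```python
# Hypothetical helper, not taken from WinBoard or any engine.
# Keys are lower-cased spellings; "o-o", "oo" and "00" come from the post,
# the rest are assumed analogues.
CASTLING_SPELLINGS = {
    "o-o": "short", "oo": "short", "0-0": "short", "00": "short",
    "o-o-o": "long", "ooo": "long", "0-0-0": "long", "000": "long",
}

# Coordinate notation for castling in normal chess (feature san=0).
CASTLING_COORDS = {
    ("white", "short"): "e1g1",
    ("white", "long"): "e1c1",
    ("black", "short"): "e8g8",
    ("black", "long"): "e8c8",
}


def normalize_castling(move, side):
    """Return the coordinate form of a castling move, or the move unchanged."""
    kind = CASTLING_SPELLINGS.get(move.strip().lower())
    if kind is None:
        return move  # not a castling spelling we know; pass it through
    return CASTLING_COORDS[(side, kind)]


print(normalize_castling("O-O", "white"))
print(normalize_castling("00", "black"))
print(normalize_castling("e2e4", "white"))
```

Normalizing on output keeps the engine within the protocol regardless of how forgiving a particular GUI happens to be.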

hgm · Post by **hgm** » Fri Aug 08, 2008 9:21 am

bob wrote:the data with round-robin matches is generally ordered the same, which is good and unlike the Crafty vs world games. But the Elo is still bouncing around enough that it would be very difficult to make a modest change and then successfully measure the change.

Yes, of course. BayesElo gives an error margin of 18 Elo. You know what '18' means, don't you? If the error margin is 18, you cannot use it to reliably measure differences of less than 18...
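To get a feeling for how such a margin shrinks with the number of games, here is a rough sketch. It is not BayesElo's method: it assumes independent games, near-equal opponents, the worst-case per-game standard deviation of 0.5 (no draws), and the usual logistic Elo formula. The function names and the choice of two standard errors are mine.

```python
import math


def elo_from_score(p):
    """Elo difference implied by an expected score p (0 < p < 1)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_margin(n_games, z=2.0):
    """Rough +/- Elo margin after n_games between near-equal opponents.

    Assumes independent games and a per-game standard deviation of 0.5,
    so it overestimates the margin when draws are common.
    """
    se = 0.5 / math.sqrt(n_games)  # standard error of the score fraction
    upper = 0.5 + z * se
    if upper >= 1.0:
        raise ValueError("too few games for this approximation")
    return elo_from_score(upper)


for n in (400, 1600, 6400):
    print(n, round(elo_margin(n), 1))
```

Under these assumptions the margin halves, approximately, each time the number of games is quadrupled, which is why small changes need very long runs.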

So while there are possible improvements in deciding which of the programs I am using is the best, the ability to measure the difference in two crafty versions seems harder. I am going to make a run with the check extension set to zero to see how that goes, another 4 runs and I will post the results along with these again...

Tord Romstad · Post by **Tord Romstad** » Fri Aug 08, 2008 9:57 am

bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.

Glaurung does not yet support the XBoard protocol (the next version probably will), and only accepts castling moves the way they are specified by the UCI protocol ("e1g1" and "e1c1" in normal chess mode, "e1h1" and "e1a1" in Chess960 mode). If "o-o" and "o-o-o" don't work, it has nothing to do with Glaurung: It's because PolyGlot doesn't understand "o-o" and "o-o-o", and therefore cannot transmit the move to the engine.

By the way, I strongly recommend upgrading from Glaurung 2-ε/5 to Glaurung 2.1. As the name indicates, 2-ε/5 was just a beta version. The current version is less buggy, and much stronger.

Tord

Uri Blass · Post by **Uri Blass** » Fri Aug 08, 2008 10:15 am

Tord Romstad wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.
Glaurung does not yet support the XBoard protocol (the next version probably will), and only accepts castling moves the way they are specified by the UCI protocol ("e1g1" and "e1c1" in normal chess mode, "e1h1" and "e1a1" in Chess960 mode). If "o-o" and "o-o-o" don't work, it has nothing to do with Glaurung: It's because PolyGlot doesn't understand "o-o" and "o-o-o", and therefore cannot transmit the move to the engine.

By the way, I strongly recommend upgrading from Glaurung 2-ε/5 to Glaurung 2.1. As the name indicates, 2-ε/5 was just a beta version. The current version is less buggy, and much stronger.

Tord

Bob does not use the latest versions of the other programs.
He could use one of the free Toga versions instead of Fruit 2.1, and he could use a better Arasan than Arasan 10.

Uri

Uri Blass · Post by **Uri Blass** » Fri Aug 08, 2008 10:52 am

hgm wrote: Note that statistically speaking, such games are not dependent or correlated at all. A sampling that returns always exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem. Having engines that play very reproducible, and only can play 2 or 3 different games from a position that they have to play a hundred times, still would produce perfectly independent sampling, with a statistical error < 0.5/sqrt(N), provided the choice of which of the few possible games they played was not dependent on the choice they made in the previous game.

I think that it depends on how you define correlation.
You and I considered X and Y to be uncorrelated if and only if
cov(X,Y)=0

It seems that the mathematician Hyatt talked with
considered X and Y to be uncorrelated only if the correlation coefficient is 0, and in the case X=Y=constant the correlation coefficient is undefined.
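The difference between the two definitions shows up in a few lines. This is only a sketch using population formulas (divide by n); the function names are mine, and returning None for the undefined case is just one way to represent it.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient, or None when a standard deviation is zero."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    if sd_x == 0 or sd_y == 0:
        return None  # 0/0: the coefficient is undefined
    return covariance(xs, ys) / (sd_x * sd_y)


# Two engines that always produce the same result (1 = win).
constant = [1, 1, 1, 1]
print(covariance(constant, constant))   # covariance is zero
print(correlation(constant, constant))  # coefficient is undefined

# Results that vary but always agree.
varying = [0, 1, 0, 1]
print(correlation(varying, varying))
```

So with the cov=0 definition the constant case counts as uncorrelated, while with the coefficient definition the question has no answer.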

Uri

Uri Blass · Post by **Uri Blass** » Fri Aug 08, 2008 11:38 am

I think that I was confused by wikipedia.

Note that my first response, after reading that some mathematician claimed that H.G. Muller and I are wrong, was to look at Wikipedia, because
I thought maybe I was wrong. But after thinking about it again, I think that Wikipedia is wrong.

http://en.wikipedia.org/wiki/Correlation

Here are corrections that you need to make

mistake 1:
wikipedia:
The correlation is defined only if both of the standard deviations are finite
and both of them are nonzero

correction:
The correlation coefficient is defined only if both of the standard deviations are finite and both of them are nonzero

mistake 2:
wikipedia:
If the variables are independent then the correlation is 0.

Correction:
If the variables are independent then the covariance is 0 or in other words
if x,y are independent then E(XY)=E(X)E(Y)
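A small exact check of the corrected statement, as a sketch. The two distributions are made-up numbers for illustration, and the helper names are mine. The joint distribution is built as the product of the marginals, which is what independence means here.

```python
from fractions import Fraction
from itertools import product


def expectation(joint, f):
    """Expected value of f(x, y) under a joint distribution {(x, y): prob}."""
    return sum(p * f(x, y) for (x, y), p in joint.items())


def independent_joint(px, py):
    """Joint distribution of independent X and Y: product of the marginals."""
    return {(x, y): px[x] * py[y] for x, y in product(px, py)}


# Made-up marginals: X wins with probability 3/4, Y with probability 1/3.
px = {0: Fraction(1, 4), 1: Fraction(3, 4)}
py = {0: Fraction(2, 3), 1: Fraction(1, 3)}

joint = independent_joint(px, py)
e_x = expectation(joint, lambda x, y: x)
e_y = expectation(joint, lambda x, y: y)
e_xy = expectation(joint, lambda x, y: x * y)
cov_independent = e_xy - e_x * e_y
print(e_xy, e_x * e_y, cov_independent)

# For contrast: Y is always equal to X, so they are fully dependent.
joint_same = {(0, 0): Fraction(1, 4), (1, 1): Fraction(3, 4)}
cov_dependent = (
    expectation(joint_same, lambda x, y: x * y)
    - expectation(joint_same, lambda x, y: x)
    * expectation(joint_same, lambda x, y: y)
)
print(cov_dependent)
```

Using Fraction keeps the arithmetic exact, so the zero covariance is not a rounding accident.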

Uri

Zach Wegner · Post by **Zach Wegner** » Fri Aug 08, 2008 8:22 pm

Tord Romstad wrote:Glaurung does not yet support the XBoard protocol (the next version probably will)...

Awesome!

You might like to know that work is being done to back-port the Winboard_X/F enhancements to xboard for us UNIX-based folk. There are also some interesting discussions about extending the protocol over at the Winboard Forum.

bob · Post by **bob** » Fri Aug 08, 2008 8:27 pm

Tord Romstad wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.
Glaurung does not yet support the XBoard protocol (the next version probably will), and only accepts castling moves the way they are specified by the UCI protocol ("e1g1" and "e1c1" in normal chess mode, "e1h1" and "e1a1" in Chess960 mode). If "o-o" and "o-o-o" don't work, it has nothing to do with Glaurung: It's because PolyGlot doesn't understand "o-o" and "o-o-o", and therefore cannot transmit the move to the engine.

By the way, I strongly recommend upgrading from Glaurung 2-ε/5 to Glaurung 2.1. As the name indicates, 2-ε/5 was just a beta version. The current version is less buggy, and much stronger.

Tord

I will do that once I get the kinks worked out on the Elo testing. I don't particularly care whether I use the strongest or not, just that all the opponents are perfectly consistent for the various test runs so the results are comparable.

bob · Post by **bob** » Fri Aug 08, 2008 8:29 pm

Uri Blass wrote:
Tord Romstad wrote:
bob wrote:Already found one bug which I will fix. Arasan 10 produces "o-o" for castling, which Glaurung2 does not appreciate. It is supposed to be O-O (letter oh capitalized) so Arasan 10 has a bug. But by the same token, Glaurung2 ought to accept o-o since it is in a wide range of PGN games scattered everywhere.
Glaurung does not yet support the XBoard protocol (the next version probably will), and only accepts castling moves the way they are specified by the UCI protocol ("e1g1" and "e1c1" in normal chess mode, "e1h1" and "e1a1" in Chess960 mode). If "o-o" and "o-o-o" don't work, it has nothing to do with Glaurung: It's because PolyGlot doesn't understand "o-o" and "o-o-o", and therefore cannot transmit the move to the engine.

By the way, I strongly recommend upgrading from Glaurung 2-ε/5 to Glaurung 2.1. As the name indicates, 2-ε/5 was just a beta version. The current version is less buggy, and much stronger.

Tord
Bob does not use latest version of other programs
He could use some free toga instead of fruit2.1 and he could use some better arasan than arasan10.

Uri

At one point, what I am using _was_ the latest versions. But I want to keep the opponents identical for a series of tests otherwise the tests don't help me. I'm not trying to find out which program is better, I am trying to measure small changes in mine...

bob · Post by **bob** » Fri Aug 08, 2008 8:35 pm

hgm wrote:
bob wrote:Since I am not sure whether or not the person that contacted me via email will join in here, I thought I would provide excerpts that we could perhaps use for a sane discourse on the issues. excerpt 1:

===================================================
The central point of miscommunication seems to have been confusion between the everyday meaning of dependent (causally connected) and the
mathematical meaning of dependent (correlated). I am astonished that self-styled mathematical experts at talkchess.com who were
criticizing you didn't make this distinction. The difference in the two meanings is stark if one considers two engines playing each other
twice from a given position with fixed node counts, because the results of the two playouts will surely be the same. Neither playout
affects the other causally, so they are not dependent at all in the everyday sense, but the winner is always the same, which is to say
the outputs are perfectly correlated, and therefore as mathematically dependent as it gets.
====================================================
Note that statistically speaking, such games are not dependent or correlated at all. A sampling that returns always exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem. Having engines that play very reproducible, and only can play 2 or 3 different games from a position that they have to play a hundred times, still would produce perfectly independent sampling, with a statistical error < 0.5/sqrt(N), provided the choice of which of the few possible games they played was not dependent on the choice they made in the previous game.
This is a topic I had already mentioned. In a perfect world, if A plays B, and A is better than B, then A will always win. And there will be perfect correlation between games since any game can be used to predict the outcome of any other, even though there is none of the "causal dependency" present, since the games do not affect each other in any fashion.

Here's the next segment which again is a rehash of what has been explained previously:

=====================================================
Let's consider a series of hypothetical trial runs. I assume that you are as capable as anyone in the industry of preventing any causal
dependence between various games in the trials, so causal dependence will not factor in my calculations at all. I believe you when you
say that you have solved that problem.

Trial A: Crafty plays forty positions against each of five opponents with colors each way for a total of 400 games. The engines are each
limited to a node count of 10,000,000. Crafty wins 198 games.

Trial B: Same as Trial A, except the node count limit is changed to 10,010,000. Crafty wins 190 games.

Now we compare these two results to see if anything extraordinary has happened. In 400 games, the standard deviation is 10, and the
difference in results was only 8, so we are well within expected bounds. There's nothing to get excited about, and we move on to the
next experiment.

Trial C: Same as Trial A, except that each position-opponent-color combination is played out 64 times. Yes, this is a silly experiment,
because we know that repeated playouts with a fixed node count give identical results, but bear with me. Crafty wins (as expected)
exactly 12672 games.

Trial D: Same as Trial B, except that each position-opponent-color combination is played out 64 times. Crafty wins 12160, as we knew it
would.

Now we compare the latter two trials. In 25,600 games the standard deviation is 80, and our difference in result was 512, so we are more
than six sigmas out. Holy cow! Run out and buy lottery tickets!

In this deterministic case it is easy to see what happened. The perfect correlation of the sixty-four repeats of each combination meant
that we were gaining no new information by expanding the trial. The calculation of standard deviation, however, assumes no correlation
whatsoever, i.e. perfect mathematical independence. Since the statistical assumption was not met, the statistical result is absurd.
=====================================================
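The arithmetic in the excerpt can be reproduced as follows. This is a sketch that follows the excerpt's own convention of comparing the difference in wins against the standard deviation of a single trial with win probability 0.5; the variable names are mine. The last step is my own addition: if each of the 400 outcomes is simply repeated 64 times, the total is 64 times the 400-game total, so its standard deviation is 64 times larger, not 8 times.

```python
import math


def binomial_sd(n_games, p=0.5):
    """Standard deviation of the number of wins in n independent games."""
    return math.sqrt(n_games * p * (1 - p))


# Trials A and B: 400 games each.
sd_400 = binomial_sd(400)
sigmas_small = (198 - 190) / sd_400

# Trials C and D: every game repeated 64 times, 25,600 games each.
repeats = 64
wins_c = 198 * repeats
wins_d = 190 * repeats
sd_25600 = binomial_sd(400 * repeats)          # assumes all games independent
sigmas_naive = (wins_c - wins_d) / sd_25600

# With perfectly correlated repeats the total is 64 x (400-game total).
sd_correlated = repeats * sd_400
sigmas_corrected = (wins_c - wins_d) / sd_correlated

print(sd_400, sigmas_small)
print(sd_25600, sigmas_naive)
print(sd_correlated, sigmas_corrected)
```

Once the repeats are treated as the same information counted 64 times, the large trial is exactly as unremarkable as the small one.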
Seems to me the point made above is not relevant. Playing at 10,000,000 nodes is not the same experiment as playing at 10,010,000 nodes.

However, it _does_ represent what happens when you use a fixed time, since the nodes vary because of time jitter. That was his point.

Each match is a sample of a different process, each process having its own standard deviation. Which is actually zero for totally deterministic node-count-based TC games. So the results actually lie an infinite number of SDs apart, making it 100% certain that Crafty has a better performance against these opponents, with these starting positions, at the slightly different node count. Great! But of course meaningless, as the sample of opponents and positions was too small for this result to have any correlation with the performance on all positions against all possible opponents.

But none of it has any bearing on the reported 6-sigma deviation between the 2x25,000 games: there the conditions were supposed to be identical.

All conditions except for time measurements which are _never_ identical.

But at this point, we return to "stampee foot, impossible, stampee foot, test is flawed, stampee foot, etc..."

Now before I go farther, I will stop here and see if anyone wants to contest, add to, contradict, etc the above.

So just perhaps, this explains the so-called "six-sigma" event of my first post. And it explains why so many runs have been producing odd results. And it does once again explain exactly how there is correlation, simply because of the opponents and positions and semi-deterministic behavior of programs... Of course, it also suggests quite a bit more than that, since so many are doing this same exact test...
Bullshit. That two different experiments give a different answer can never be used to explain why repeating the same experiment gives a different answer.

So let me get this right. In your statistical definition, if I run three tests with three different node counts, the test is of no use. But if I run three tests with different time measurements, which also leads to different node counts, that is useful. And the two tests have nothing in common whatsoever??? The number of different node counts is not that great. (I could run a hundred 3 second searches to see how many different node counts I get if you want, and then factor that in to N moves in the game...) So his suggested experiment is just a small subset of what might happen. But it _is_ a subset.

His next suggestion to help is one that will take me a bit of time to think about, as it is _completely_ counter-intuitive to me on first analysis. I'll save that until after the discussion on the above.

your turn...
