OK Karl,
here are the results of the correlation experiment you proposed...
Spike v Twisted Logic
Two runs, 201 games/run
201 random openings, repeated in the 2nd run
Spike white in odd games, black in even
Run #1 - 1,000,000 positions/move
Run #2 - 1,010,000 positions/move
Raw stats...
Run #1
Twisted Logic 103.0 (51.2%) - Spike 98.0 (48.8%)
1-0 34.8%
1/2-1/2 33.8%
0-1 31.3%
Run #2
Twisted Logic 101.0 (50.2%) - Spike 100.0 (49.8%)
1-0 37.3%
1/2-1/2 28.9%
0-1 33.8%
Of the 201 openings, 111 changed result between the runs and 90 kept the same result.
Only one game was duplicated move-for-move across the runs (a short 15-move draw by repetition).
Here are the result pairs (201 rows) for your analysis...
0-1,1-0
1/2-1/2,0-1
1/2-1/2,1-0
1/2-1/2,1-0
1/2-1/2,1/2-1/2
0-1,1/2-1/2
1-0,1/2-1/2
1/2-1/2,1/2-1/2
0-1,0-1
1/2-1/2,1/2-1/2
0-1,0-1
0-1,0-1
1-0,1-0
0-1,0-1
1/2-1/2,1-0
1-0,1/2-1/2
1-0,0-1
1/2-1/2,0-1
1-0,1-0
1-0,1-0
0-1,0-1
1/2-1/2,0-1
1/2-1/2,1-0
1-0,1-0
0-1,0-1
1-0,1-0
1-0,0-1
0-1,1-0
0-1,1/2-1/2
1/2-1/2,1/2-1/2
0-1,1-0
0-1,0-1
1/2-1/2,1/2-1/2
1-0,1/2-1/2
1-0,1-0
0-1,1/2-1/2
1-0,0-1
1-0,1-0
1-0,1/2-1/2
0-1,1-0
0-1,0-1
0-1,1/2-1/2
1/2-1/2,1-0
1/2-1/2,1/2-1/2
0-1,1-0
1-0,1-0
1/2-1/2,1-0
0-1,0-1
0-1,0-1
1/2-1/2,0-1
1-0,1-0
0-1,0-1
1-0,1-0
1-0,1-0
0-1,0-1
1-0,1-0
1/2-1/2,1-0
0-1,1/2-1/2
1/2-1/2,1-0
0-1,1-0
1-0,1/2-1/2
1/2-1/2,1/2-1/2
0-1,1-0
0-1,1/2-1/2
1-0,1-0
1-0,1-0
1/2-1/2,0-1
1-0,1-0
1-0,0-1
1/2-1/2,0-1
1-0,0-1
1/2-1/2,1/2-1/2
1-0,0-1
0-1,0-1
1/2-1/2,1/2-1/2
1-0,1/2-1/2
1-0,1-0
0-1,0-1
1/2-1/2,0-1
1/2-1/2,1/2-1/2
1-0,0-1
0-1,0-1
1/2-1/2,0-1
1/2-1/2,1-0
1/2-1/2,1-0
1-0,1-0
0-1,1/2-1/2
0-1,1-0
1/2-1/2,1/2-1/2
1-0,0-1
1-0,1-0
1/2-1/2,1/2-1/2
0-1,1/2-1/2
1-0,1/2-1/2
1/2-1/2,1/2-1/2
1-0,1-0
0-1,1/2-1/2
1/2-1/2,1/2-1/2
0-1,0-1
1/2-1/2,0-1
0-1,0-1
1-0,0-1
1-0,1-0
0-1,1/2-1/2
1/2-1/2,1-0
1/2-1/2,1-0
1-0,0-1
1/2-1/2,1/2-1/2
0-1,1-0
0-1,1-0
0-1,1/2-1/2
1/2-1/2,1-0
0-1,0-1
1-0,1-0
0-1,0-1
0-1,0-1
1/2-1/2,1-0
1-0,1-0
1/2-1/2,0-1
1-0,0-1
0-1,0-1
1-0,1-0
1/2-1/2,0-1
0-1,0-1
1-0,0-1
0-1,1/2-1/2
1/2-1/2,1/2-1/2
1-0,1/2-1/2
1-0,1/2-1/2
1-0,1-0
1/2-1/2,0-1
0-1,1/2-1/2
0-1,0-1
1-0,1/2-1/2
0-1,1-0
0-1,0-1
0-1,1/2-1/2
1-0,1-0
0-1,0-1
1-0,1/2-1/2
1-0,0-1
1/2-1/2,1/2-1/2
1/2-1/2,1/2-1/2
1/2-1/2,0-1
1/2-1/2,1/2-1/2
0-1,1/2-1/2
0-1,1-0
1/2-1/2,1/2-1/2
1-0,0-1
1-0,1-0
1-0,1-0
0-1,0-1
0-1,1/2-1/2
0-1,0-1
1/2-1/2,0-1
1/2-1/2,1-0
1-0,0-1
1-0,1-0
0-1,1-0
1-0,0-1
1-0,1-0
1/2-1/2,0-1
1-0,1-0
0-1,0-1
1/2-1/2,1-0
1-0,1/2-1/2
1-0,1/2-1/2
1/2-1/2,1-0
1-0,1/2-1/2
1/2-1/2,1/2-1/2
1-0,1-0
0-1,1-0
1/2-1/2,1/2-1/2
1/2-1/2,0-1
1/2-1/2,1/2-1/2
1/2-1/2,0-1
1-0,1-0
1/2-1/2,1-0
0-1,1/2-1/2
1-0,0-1
1-0,1-0
0-1,1/2-1/2
1/2-1/2,0-1
1/2-1/2,1/2-1/2
1/2-1/2,1-0
1-0,1-0
0-1,1-0
1-0,1-0
1/2-1/2,1-0
1-0,1/2-1/2
0-1,0-1
1/2-1/2,1-0
1/2-1/2,0-1
1/2-1/2,0-1
1/2-1/2,0-1
1-0,1/2-1/2
1-0,1-0
1/2-1/2,0-1
1/2-1/2,1-0
1-0,1-0
0-1,1-0
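For reference, here is a minimal Python sketch for tallying these pairs, assuming the 201 rows above are saved one per line in a text file (pairs.txt is just an illustrative name). It reproduces the per-run result counts, the same/changed split, and Spike's run scores from the color scheme above.
Code: Select all
# Tally the 201 result pairs; each line is "run1,run2" from white's point of view.
from collections import Counter

POINTS = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}   # white's score for each result

with open("pairs.txt") as f:                         # illustrative filename
    pairs = [line.strip().split(",") for line in f if line.strip()]

run1 = Counter(first for first, second in pairs)
run2 = Counter(second for first, second in pairs)
same = sum(1 for first, second in pairs if first == second)

def spike_score(result, game_no):
    """Spike is white in odd-numbered games, black in even-numbered games."""
    white = POINTS[result]
    return white if game_no % 2 == 1 else 1.0 - white

spike1 = sum(spike_score(first, i) for i, (first, second) in enumerate(pairs, 1))
spike2 = sum(spike_score(second, i) for i, (first, second) in enumerate(pairs, 1))

print("run 1:", dict(run1))
print("run 2:", dict(run2))
print("same result:", same, " changed:", len(pairs) - same)
print("Spike: %.1f then %.1f out of %d" % (spike1, spike2, len(pairs)))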
Look forward to hearing your conclusions.
Re: Correlation Experiment Results
The results seem to be correlated, in the sense that the opening position clearly matters for the result, and I do not think that is surprising.
If I know that the first result is a loss for white, I can expect 25/63 in the second game (less than 40%).
If I know that the first result is a draw, I can expect 34/68 in the second game (50%).
If I know that the first result is a win for white, I can expect 45/70 in the second game (more than 60%).
I do not know if this type of correlation is enough for Crafty to beat Fruit in the second game when it wins the first game, but I am sure that Crafty is expected to score better against Fruit if you play the same position twice and keep only the positions that Crafty won.
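For reference, these conditional figures fall straight out of the pair list in the first post; a minimal sketch of the tally (same illustrative pairs.txt file as in the sketch above):
Code: Select all
# White's expected score in the replay, conditioned on the first game's result.
from collections import defaultdict

POINTS = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}      # white's score

by_first = defaultdict(lambda: [0.0, 0])                # first result -> [points, games]
with open("pairs.txt") as f:                            # illustrative filename
    for line in f:
        if not line.strip():
            continue
        first, second = line.strip().split(",")
        by_first[first][0] += POINTS[second]
        by_first[first][1] += 1

for first in ("0-1", "1/2-1/2", "1-0"):
    points, games = by_first[first]
    print("after %-7s white scores %.1f/%d = %.0f%%" % (first, points, games, 100 * points / games))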
Uri
Re: Correlation Experiment Results
Martin,
Thanks a bunch for running this test. As Uri points out, this might be right on the borderline of enough correlation to overcome Crafty's 60-40 disadvantage against Fruit. I'm leaning toward still taking the bet, though. Let's take a closer look at how much correlation there is.
What I actually wanted to measure was the correlation between Spike's scores in the first and second runs, as opposed to white's score in the first and second runs, but since Spike switched colors every game it is simple enough to convert the data. Intuitively, there should be even more white/white correlation than Spike/Spike correlation, because the tendency of random positions to favor white is added on top of the other sources of correlation.
First, the hand-waving look at correlation. Redoing Uri's totals, but for Spike/Spike correlation rather than white/white:
If Spike wins the first, it scores 38/64 = 59% on the second
If Spike draws the first, it scores 38/68 = 56% on the second
If Spike loses the first, it scores 24/69 = 35% on the second
That's quite a swing for engines that are supposed to be equal, on results that are supposed to be independent. But that way of looking at it privileges the first run over the second; correlation is supposed to be a symmetrical concept. Taking order out of it, we see that Spike has 135 wins, 126 draws, and 141 losses across the two runs. We can compare the unordered pairs that actually occurred to what we would expect if the 402 results were paired up at random.
Code: Select all
Unordered     Expected
Pairs         Random     Actual
-----------   --------   ------
two wins        22.56       30
two draws       19.64       24
two losses      24.61       36
win & draw      42.42       42
loss & draw     44.30       36
win & loss      47.47       33

Not only did all three repeated results (two wins, two draws, two losses) occur more often than you would expect, but the category furthest from its expectation is one result being a win and the other a loss. A win for each engine should have been the most common pairing, and instead it falls short of its expected count by more than any other category. This is the most intuitively telling result so far. The fact that flipping a win into a loss is harder than flipping a win into a draw, or a draw into a loss, shows that divergent playouts do not equal divergent results.
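The Expected Random column above is what you get if the 402 individual results are paired up at random without replacement; a small Python sketch of that calculation, with the win/draw/loss totals from the previous paragraph:
Code: Select all
# Expected counts of unordered result pairs if the 402 results were paired at random.
totals = {"win": 135, "draw": 126, "loss": 141}    # Spike's results over both runs
n_pairs = 201
n_results = sum(totals.values())                   # 402

def expected(a, b):
    """Expected count for an unordered {a, b} pair, drawn without replacement."""
    if a == b:
        return n_pairs * totals[a] * (totals[a] - 1) / (n_results * (n_results - 1))
    return n_pairs * 2 * totals[a] * totals[b] / (n_results * (n_results - 1))

for a, b in [("win", "win"), ("draw", "draw"), ("loss", "loss"),
             ("win", "draw"), ("loss", "draw"), ("win", "loss")]:
    print("%-12s %6.2f" % (a + " & " + b, expected(a, b)))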
Just in case eyeballing these results doesn't convince everyone of the existence of correlation, let's look at the F-test for significance. With 199 degrees of freedom and alpha = 0.01, the critical F-value is 6.8. Our F-value is a bit over 12, so there is well under a 1% chance that this correlation is accidental.
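The quoted F-value is consistent with the standard F-test on a correlation coefficient, F = r^2(n-2)/(1-r^2) with (1, n-2) degrees of freedom, taking r = covariance/variance from the figures in the next paragraph (this assumes both runs have the same variance). A minimal sketch:
Code: Select all
# F-test for the significance of the correlation between the two runs.
# Assumes equal variance in both runs, so r = cov / var.
from scipy.stats import f

n = 201
var = 0.172                        # sample variance of Spike's per-game score
cov = 0.041                        # sample covariance between run 1 and run 2

r = cov / var
F = r * r * (n - 2) / (1 - r * r)
F_crit = f.ppf(0.99, 1, n - 2)     # critical value at alpha = 0.01

print("r = %.3f  F = %.1f  critical F(1,%d) = %.1f" % (r, F, n - 2, F_crit))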
Having (I hope) settled that correlation exists, the question is how much correlation there is, and how much it hurts us. The sample variance of Spike's per-game scores is 0.172. The sample covariance between the runs is 0.041. We are trying to measure the expected score of Spike versus Twisted Logic on a random balanced position with random color assignment. When we design the experiment, we choose the number of positions and the number of replays of each position. We can calculate in advance one standard deviation of our measured result around the true value, as presented in the table below:
Code: Select all
Positions\Plays       1        2        4       16       64
---------------  ------   ------   ------   ------   ------
              1  41.47%   32.64%   27.16%   22.18%   20.75%
              4  20.74%   16.32%   13.58%   11.09%   10.38%
             16  10.37%    8.16%    6.79%    5.55%    5.19%
             64   5.18%    4.08%    3.39%    2.77%    2.59%
            256   2.59%    2.04%    1.70%    1.39%    1.30%
           1024   1.30%    1.02%    0.85%    0.69%    0.65%
           4096   0.65%    0.51%    0.42%    0.35%    0.32%
          16384   0.32%    0.25%    0.21%    0.17%    0.16%
          65536   0.16%    0.13%    0.11%    0.09%    0.08%
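This table is consistent with the usual variance formula for the mean of correlated samples: with N positions, k plays per position, per-game variance v and same-position covariance c, one standard deviation of the measured score is sqrt((v + (k-1)*c) / (N*k)). A sketch that regenerates it, with v and c taken from the paragraph above:
Code: Select all
# One standard deviation of the measured score for N positions x k replays,
# when replays of the same position have covariance c and every game has variance v.
from math import sqrt

v, c = 0.172, 0.041
plays = [1, 2, 4, 16, 64]
positions = [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536]

header = "Positions\\Plays" + "".join("%9d" % k for k in plays)
print(header)
for N in positions:
    row = "".join("%8.2f%%" % (100 * sqrt((v + (k - 1) * c) / (N * k))) for k in plays)
    print("%-15d%s" % (N, row))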
So for this particular test, with the observed amount of covariance, we can cut our measurement error in half in either of two ways: use four times as many positions, or play out each position 64 times. To squeeze the most significant information out of a fixed amount of CPU time, we are best off not replaying positions at all.
Now come the caveats. Bob may have had less correlation than this or more, depending on the size of his clock jitter and other factors; he should measure his own correlation directly rather than drawing conclusions from your experiments. Also, the source of the correlation has not yet been established. Is it because you chose only unbalanced positions? Is there another source of correlation we haven't thought of yet? We also chose a fixed amount of "jitter": will increasing the node-count difference between runs decrease the correlation, as we expect? And what if we ran the same experiment with ordinary time controls instead of fixed node counts?
Thanks again, Martin, for running this test. I think this moves me from having not been proven wrong a bit closer to having been proven right.