Which version is better A or B?
Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo
Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo
Vote in the poll and post your reasons.
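The arithmetic in this post can be checked with a short script. This is only a sketch under stated assumptions: it uses the standard logistic Elo formula and a normal approximation for the 95% margins, and the function names `score_to_elo` and `match_summary` are illustrative, not taken from any tool mentioned in the thread.

```python
import math

def score_to_elo(score: float) -> float:
    """Elo difference implied by an expected score in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_summary(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (score, elo, elo_low, elo_high) for one match.

    The interval is score +/- z standard errors, mapped to Elo
    (normal approximation, assumed here).
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Sample variance of the per-game result (1, 0.5 or 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / (n - 1)
    se = math.sqrt(var / n)  # standard error of the mean score
    return (score, score_to_elo(score),
            score_to_elo(score - z * se), score_to_elo(score + z * se))

matches = {"A": (364, 144, 262), "B": (271, 133, 195)}  # wins, losses, draws
for name, (wins, losses, draws) in matches.items():
    s, elo, low, high = match_summary(wins, losses, draws)
    print(f"{name}: score={s:.3%} elo={elo:+.1f} 95% interval=[{low:+.1f}, {high:+.1f}]")
```

With the logistic formula the two estimates come out near +102 and +82 Elo, slightly below the rounded +104 and +84 quoted above.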
Math Test 4 All
-
- Posts: 2055
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
-
- Posts: 6073
- Joined: Sat Apr 01, 2006 9:34 pm
- Location: Scotland
Re: Math Test 4 All
CRoberson wrote: Which version is better A or B? [...]
84 + 25 = 109, so B's upper bound is above A's estimate of 104; or, the other way round, 104 - 21 = 83, so A's lower bound is below B's estimate of 84. Each estimate is within the other's margin of error.
To reduce the margin of error and get a definitive result, I believe you should play more games: a larger sample gives more precise results.
Regards
Chris
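As a rough illustration of "play more games": if the score and draw ratio stay about the same, the error margin shrinks roughly as 1/sqrt(games). The sketch below relies on that assumption only; `games_needed` is a hypothetical helper, and the ±10 Elo target is just an example value.

```python
import math

def games_needed(current_margin: float, current_games: int, target_margin: float) -> int:
    """Games needed to reach target_margin, assuming margin ~ 1/sqrt(games)."""
    return math.ceil(current_games * (current_margin / target_margin) ** 2)

# Margins and game counts quoted in the original post.
print(games_needed(21, 770, 10))  # version A
print(games_needed(25, 599, 10))  # version B
```

Under that scaling, halving a margin takes about four times as many games.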
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Math Test 4 All
CRoberson wrote: Which version is better A or B? [...]
Forget about these Elo margins: they are not usable for deciding the test because they are two-sided (bilateral). You need a one-sided (unilateral) test. What you want to know is P(X > 0.5), also known as LOS (likelihood of superiority). In this case the LOS simply cannot be calculated, because your test is flawed by early stopping. This is the number one mistake in testing!
But assuming you had decided in advance to play 770 and 599 games and did not stop the experiment early (which I really doubt), you could compute the LOS by a trivial calculation: take the empirical mean and standard deviation, and use the Gaussian distribution function to get the asymptotic LOS. With these numbers, I get a LOS of 100.0%, meaning that the test is conclusive.
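A minimal sketch of that calculation, assuming each game is scored 1, 0.5 or 0 and the mean score is approximately normal. The function name `los` is illustrative, and the code does not handle the degenerate case where every game has the same result (zero variance).

```python
import math

def los(wins: int, losses: int, draws: int) -> float:
    """Asymptotic likelihood of superiority: P(true mean score > 0.5)."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Empirical (sample) variance of the per-game result.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / (n - 1)
    z = (mean - 0.5) / math.sqrt(var / n)
    # Gaussian cumulative distribution function evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"LOS of A over Telepath6.030: {los(364, 144, 262):.4%}")
print(f"LOS of B over Telepath6.030: {los(271, 133, 195):.4%}")
```

Note that this is the LOS of each version against Telepath6.030, not of A against B.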
The only way early stopping is valid is with a stochastic stopping rule that is mathematically sound, such as Wald's sequential test. But that information is not available, so I can only assume that you didn't use one. An early-stopped result cannot support any conclusion without knowledge of the rule that was used to stop the experiment!
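To make the idea of a sound stopping rule concrete, here is a simplified sketch of Wald's sequential probability ratio test. It is an illustration only: it ignores draws and tests the win probability among decisive games, and the hypotheses `p0`, `p1` and the error rates `alpha`, `beta` are assumed example values, not anything used in this thread.

```python
import math
from typing import Iterable, Tuple

def sprt(results: Iterable[str], p0: float = 0.5, p1: float = 0.55,
         alpha: float = 0.05, beta: float = 0.05) -> Tuple[str, int]:
    """Wald SPRT over a sequence of 'W', 'L', 'D' results.

    H0: win probability among decisive games is p0; H1: it is p1.
    Returns the decision and the number of games consumed.
    """
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    win_step = math.log(p1 / p0)
    loss_step = math.log((1.0 - p1) / (1.0 - p0))
    llr = 0.0
    games = 0
    for result in results:
        games += 1
        if result == "W":
            llr += win_step
        elif result == "L":
            llr += loss_step
        # Draws leave the log-likelihood ratio unchanged (simplification).
        if llr >= upper:
            return "accept H1", games
        if llr <= lower:
            return "accept H0", games
    return "continue", games

print(sprt("WWLDWWDWLW"))
```

The key point is that the bounds are fixed before the match starts and the test is checked after every game, so stopping early is part of the procedure rather than a flaw in it.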
PS: Do not use a poll for that: vox populi is the worst possible choice. Most people are absolutely clueless, and some (even worse) think they know but don't. Better to understand what you are doing.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Math Test 4 All
CRoberson wrote: Which version is better A or B? [...]
The answer depends on the level of confidence. I put those results in an empty PGN file and ran ordo with:
ordo -p all.pgn -s10000 -eerr.txt -a0 -AA -F74.2
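That is 10,000 simulations, errors written to err.txt, the ranking centered at 0 on engine A, and a confidence level of 74.2% (I forced it). Since A is fixed at 0, all the errors refer to the difference from A. At 74.2% confidence, B's error bar (20.8) equals its rating deficit relative to A (-20.8), so that is the confidence level at which A comes out ahead of B.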
Code: Select all
#  ENGINE :  RATING  ERROR  POINTS  PLAYED    (%)
1  A      :     0.0   ----   495.0     770  64.3%
2  B      :   -20.8   20.8   368.5     599  61.5%
3  T      :  -103.0   13.7   505.5    1369  36.9%
Miguel
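For readers without ordo, a crude resampling check of "is A ahead of B?" can be sketched as follows. This is not ordo's algorithm: it is a plain bootstrap that redraws each match from its observed win/loss/draw frequencies, and it reports a one-sided probability, so its output is not directly comparable to the 74.2% confidence level above. The function name, the number of simulations and the seed are all arbitrary choices.

```python
import random
from typing import Tuple

Record = Tuple[int, int, int]  # wins, losses, draws against the same opponent

def bootstrap_p_a_beats_b(a: Record, b: Record, sims: int = 2000, seed: int = 1) -> float:
    """Fraction of resampled match pairs in which A outscores B."""
    rng = random.Random(seed)

    def resampled_score(wins: int, losses: int, draws: int) -> float:
        n = wins + losses + draws
        # Redraw n game results with the observed frequencies as weights.
        games = rng.choices((1.0, 0.0, 0.5), weights=(wins, losses, draws), k=n)
        return sum(games) / n

    better = sum(resampled_score(*a) > resampled_score(*b) for _ in range(sims))
    return better / sims

print(bootstrap_p_a_beats_b((364, 144, 262), (271, 133, 195)))
```

Because both versions played the same opponent, comparing resampled scores is equivalent to comparing the Elo values derived from them.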
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Math Test 4 All
lucasart wrote: [...] In this case the LOS simply cannot be calculated because your test is flawed by early stopping! This is the number one mistake in testing! [...]
There is no problem if you stop the test for reasons that are unrelated to the results: for instance, a blackout, or needing the computer for something else.
Miguel
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Math Test 4 All
michiguel wrote: There is no problem if you stop the test for reasons that are unrelated to the results. For instance, a blackout, or you needed to use the computer for something else etc.
In theory yes, it wouldn't introduce a bias. But it would still increase the variance of the estimator (in ways that cannot be quantified, as it would depend on the law of the independent stopping process). So the calculations above are not correct in this case.
In practice, you always look before you stop the experiment, and the very fact of looking means you are making a biased decision. If, however, you switch the screen off, stop by pressing Ctrl+C, and only then switch it back on, that's OK.
-
- Posts: 2055
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: Math Test 4 All
lucasart wrote: In practice, you always look before you stop the experiment, and the very fact of looking means you are making a biased decision. [...]
Nah, I didn't look. The program running the matches crashed in both cases. I had set both to run 2400 games.
-
- Posts: 1968
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Math test for all.
Hello Charles:
CRoberson wrote: Which version is better A or B? [...]
I agree with Lucas regarding LOS. Here are the results of my own programme LOS_and_Elo_uncertainties_calculator.
For A:
Code: Select all
LOS_and_Elo_uncertainties_calculator, ® 2012-2013.
----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------
(The input and output data is referred to the first engine).
Please write down non-negative integers.
Maximum number of games supported: 2147483647.
Write down the number of wins (up to 1825361100):
364
Write down the number of loses (up to 1825361100):
144
Write down the number of draws (up to 2147483139):
262
Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):
95
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
---------------------------------------
Elo interval for 95.00 % confidence:
Elo rating difference: 102.11 Elo
Lower rating difference: 82.09 Elo
Upper rating difference: 122.81 Elo
Lower bound uncertainty: -20.02 Elo
Upper bound uncertainty: 20.70 Elo
Average error: +/- 20.36 Elo
K = (average error)*[sqrt(n)] = 564.95
Elo interval: ] 82.09, 122.81[
---------------------------------------
Number of games of the match: 770
Score: 64.29 %
Elo rating difference: 102.11 Elo
Draw ratio: 34.03 %
************************************************************************
Sample standard deviation: 1.3709 % of the points of the match.
1.9600 sample standard deviations: 2.6869 % of the points of the match.
(Corresponding to 95.00 % confidence).
************************************************************************
Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.
-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------
LOS (taking into account draws) is always calculated, if possible.
LOS (not taking into account draws) is only calculated if wins + loses < 16001.
LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________
LOS: 100.00 % (taking into account draws).
LOS: 100.00 % (not taking into account draws).
LOS: 100.00 % (average value).
______________________________________________
These values of LOS are rounded up to 0.01%
End of the calculations. Approximated elapsed time: 75 ms.
Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
For B:
Code: Select all
LOS_and_Elo_uncertainties_calculator, ® 2012-2013.
----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------
(The input and output data is referred to the first engine).
Please write down non-negative integers.
Maximum number of games supported: 2147483647.
Write down the number of wins (up to 1825361100):
271
Write down the number of loses (up to 1825361100):
133
Write down the number of draws (up to 2147483243):
195
Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):
95
Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:
3
---------------------------------------
Elo interval for 95.00 % confidence:
Elo rating difference: 81.51 Elo
Lower rating difference: 58.64 Elo
Upper rating difference: 105.09 Elo
Lower bound uncertainty: -22.86 Elo
Upper bound uncertainty: 23.58 Elo
Average error: +/- 23.22 Elo
K = (average error)*[sqrt(n)] = 568.33
Elo interval: ] 58.64, 105.09[
---------------------------------------
Number of games of the match: 599
Score: 61.52 %
Elo rating difference: 81.51 Elo
Draw ratio: 32.55 %
************************************************************************
Sample standard deviation: 1.6118 % of the points of the match.
1.9600 sample standard deviations: 3.1590 % of the points of the match.
(Corresponding to 95.00 % confidence).
************************************************************************
Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.
-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------
LOS (taking into account draws) is always calculated, if possible.
LOS (not taking into account draws) is only calculated if wins + loses < 16001.
LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________
LOS: 100.00 % (taking into account draws).
LOS: 100.00 % (not taking into account draws).
LOS: 100.00 % (average value).
______________________________________________
These values of LOS are rounded up to 0.01%
End of the calculations. Approximated elapsed time: 79 ms.
Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
LOS (likelihood of superiority) ~ 100% for both A and B. If you want to say that one is slightly better than the other, then:
Code: Select all
s: sample standard deviation (in % of the points of the match).
k_i = (score_i - 50%)/s
k_A ~ (64.29 - 50)/1.3709 ~ 10.42
k_B ~ (61.52 - 50)/1.6118 ~ 7.15
k_A > k_B; A is better than B.
It would be good to increase the number of games in each match, and especially to have A play the same number of games as B.
A seems no worse than B to my patzer eyes. Maybe a self-play test can confirm it (IIRC self-play tests are usually good for comparing two versions but tend to exaggerate the true improvement; for this reason, a pool of different engines is better). Anyway, both A and B seem far better than Telepath6.030.
Miguel says that A is better than B with 74.2% confidence. All I can say is the following:
Code: Select all
A - B ~ 102.11 - 81.51 ± sqrt[(20.36)² + (23.22)²] ~ 20.6 ± 30.88 Elo (with 95% confidence).
After some trials, if I use 80.93% confidence:
A - B ~ 102.11 - 81.51 ± sqrt[(13.58)² + (15.49)²] ~ 20.6 ± 20.6 Elo.
Miguel and I agree that A could be better than B but disagree slightly on the confidence level. We surely use different models, and the only thing I can say is that you should use my results with a lot of care. I will not vote in the poll because I am not so sure. I hope there are no typos in this post.
Regards from Spain.
Ajedrecista.
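The A - B comparison in this post can be redone without trial and error. The sketch below assumes the two matches are independent and the Elo errors are approximately normal, so the errors add in quadrature. It uses the averaged 95% half-widths quoted above (20.36 and 23.22 Elo); the helper names are illustrative.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def compare(elo_a: float, err_a: float, elo_b: float, err_b: float,
            z_level: float = 1.96):
    """Return (difference, z, one-sided P(A > B), two-sided confidence).

    err_a and err_b are half-widths at z_level standard deviations.
    """
    diff = elo_a - elo_b
    # Convert half-widths back to standard deviations and add in quadrature.
    sd = math.sqrt((err_a / z_level) ** 2 + (err_b / z_level) ** 2)
    z = diff / sd
    return diff, z, normal_cdf(z), 2.0 * normal_cdf(abs(z)) - 1.0

diff, z, one_sided, two_sided = compare(102.11, 20.36, 81.51, 23.22)
print(f"A - B = {diff:.1f} Elo, z = {z:.2f}")
print(f"one-sided P(A > B) = {one_sided:.1%}, two-sided confidence = {two_sided:.1%}")
```

The two-sided figure is the confidence level at which the error bar on A - B equals the difference itself, which is what the 80.93% above was found by trial; the one-sided figure is the LOS of A over B under the same assumptions.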
-
- Posts: 2055
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: Math Test 4 All
So far, interesting answers. For clarification: I am not asking if A or B is better than Telepath6.030. I am only asking which of A or B is better than the other. The data given is sufficient to show that both A and B are better than Telepath 6.030.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Math Test 4 All
CRoberson wrote: So far, interesting answers. [...] I am only asking which of A or B is better than the other. [...]
If early stoppage was not an issue (as Lucas pointed out), then you are inside 2 SD, probably about 1.2 SD, of confidence that A is stronger than B, just by using Jesus' formula. If you want more confidence, say 2 SD (95%), then play more games and use well-defined stopping rules.