Math Test 4 All

Which Version is better?

Poll ended at Tue Apr 09, 2013 1:05 am

A is better: 3 votes (17%)
B is better: 1 vote (6%)
Can't tell - they may be the same: 14 votes (78%)

Total votes: 18

CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Math Test 4 All

Post by CRoberson »

Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
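
For reference, a minimal Python sketch of the arithmetic above, assuming the usual logistic Elo model (the function names are illustrative and not taken from any tool used in this thread). Under that assumption the two scores map to roughly +102 and +82 Elo, so the +104 and +84 quoted above presumably come from a slightly different model or rounding.

```python
import math

def score_fraction(wins, losses, draws):
    """Fraction of points scored, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + losses + draws)

def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model, an assumption)."""
    return 400.0 * math.log10(score / (1.0 - score))

# Results versus Telepath6.030 as wins, losses, draws.
results = {"A": (364, 144, 262), "B": (271, 133, 195)}

for name, (w, l, d) in results.items():
    s = score_fraction(w, l, d)
    print(f"{name}: score = {s:.3%}, Elo difference = {elo_from_score(s):+.2f}")
```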
Christopher Conkie
Posts: 6073
Joined: Sat Apr 01, 2006 9:34 pm
Location: Scotland

Re: Math Test 4 All

Post by Christopher Conkie »

CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
84 + 25 = 109, so A's +104 is within B's margin of error, or.....
104 - 21 = 83, so B's +84 is also within A's margin of error.

To reduce the margin of error and get a definitive result I believe you should play more games to have a greater sample and thereby achieve more precise results.
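
A small Python sketch of this reasoning, under the assumption that the margin of error shrinks roughly as 1/sqrt(number of games) while the score and draw ratio stay about the same. The helper names and the ±10 Elo target are illustrative only.

```python
import math

def intervals_overlap(center_a, margin_a, center_b, margin_b):
    """True if [center - margin, center + margin] of A and B share any point."""
    low_a, high_a = center_a - margin_a, center_a + margin_a
    low_b, high_b = center_b - margin_b, center_b + margin_b
    return low_a <= high_b and low_b <= high_a

def games_needed(games_played, margin_now, margin_target):
    """Rough game count for a target margin, assuming margin ~ 1/sqrt(games)."""
    return math.ceil(games_played * (margin_now / margin_target) ** 2)

print(intervals_overlap(104, 21, 84, 25))  # the two Elo intervals overlap
print(games_needed(770, 21, 10))           # A: games for about +/- 10 Elo
print(games_needed(599, 25, 10))           # B: games for about +/- 10 Elo
```

Overlapping intervals only say that the data cannot separate A from B at that confidence level; they do not show that the two versions are equal.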

Regards

Chris
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Math Test 4 All

Post by lucasart »

CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
Forget about these Elo margins: they are not usable for deciding the test, as they are two-sided! You need a one-sided test. What you want to know is P(X>0.5), also known as LOS (likelihood of superiority). In this case the LOS simply cannot be calculated, because your test is flawed by early stopping! This is the number one mistake in testing!

But assuming you had decided to play 770 games and 599 games initially and didn't stop the experiment early (which I really doubt), then you could compute the LOS by a trivial calculation (empirical mean and standard deviation, then the Gaussian distribution function to get the asymptotic LOS). With these numbers, I get a LOS of 100.0%, meaning that the test is conclusive.
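
The calculation described here can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (fixed number of games, independent games, normal approximation); the function name is made up, and it does not handle the degenerate case where every game has the same result.

```python
import math

def los_normal(wins, losses, draws):
    """Asymptotic LOS: P(true score > 0.5) from the empirical mean and stdev."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the empirical mean.
    variance = (wins * (1.0 - mean) ** 2
                + draws * (0.5 - mean) ** 2
                + losses * (0.0 - mean) ** 2) / n
    std_err = math.sqrt(variance / n)   # standard error of the mean score
    z = (mean - 0.5) / std_err
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"LOS of A over Telepath6.030: {los_normal(364, 144, 262):.2%}")
print(f"LOS of B over Telepath6.030: {los_normal(271, 133, 195):.2%}")
```

Both values measure A or B against Telepath6.030, not A against B.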

The only way early stopping is possible is when you are using a stochastic stopping rule that is mathematically sound, like the sequential Wald test, for instance. But that information is not available, so I can only assume that you didn't. An early-stopped result cannot give any conclusion without knowledge of the rule that was used to stop the experiment!

PS: Do not use a poll for that: vox populi is the worst possible choice. Most people are absolutely clueless, and some (even worse) think they know but they don't. Better to understand what you are doing.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Math Test 4 All

Post by michiguel »

CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
The answer depends on the level of confidence. I put those results in an empty PGN file and ran ordo with

ordo -p all.pgn -s10000 -eerr.txt -a0 -AA -F74.2

That is 10,000 simulations, errors written to err.txt, the ranking centered at 0 on engine A, and a confidence level of 74.2% (I forced it). Since A is forced to 0, all the errors refer to the difference from A.

Code:

   # ENGINE    : RATING  ERROR   POINTS  PLAYED    (%)
   1 A         :    0.0   ----    495.0     770   64.3%
   2 B         :  -20.8   20.8    368.5     599   61.5%
   3 T         : -103.0   13.7    505.5    1369   36.9%
Then, A is stronger than B with a level of confidence of 74.2%. If you use 95%, then the error is bigger.
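
A rough way to cross-check this kind of figure without ordo is a plain bootstrap, sketched below in Python. This is not ordo's algorithm, only an assumption-laden illustration: each simulated match redraws every game from the observed win/draw/loss frequencies, and the names, seed and simulation count are arbitrary. Because Elo is a monotonic function of the score, comparing simulated scores is equivalent to comparing simulated Elo values.

```python
import random

def simulated_score(rng, wins, losses, draws):
    """Score fraction of one resampled match with the same number of games."""
    n = wins + losses + draws
    results = rng.choices([1.0, 0.5, 0.0], weights=[wins, draws, losses], k=n)
    return sum(results) / n

def prob_a_ahead(result_a, result_b, simulations=1000, seed=1):
    """Fraction of resampled match pairs in which A outscores B."""
    rng = random.Random(seed)
    ahead = 0
    for _ in range(simulations):
        if simulated_score(rng, *result_a) > simulated_score(rng, *result_b):
            ahead += 1
    return ahead / simulations

# Each tuple is (wins, losses, draws) against Telepath6.030.
p_ahead = prob_a_ahead((364, 144, 262), (271, 133, 195))
print(f"A outscores B in {p_ahead:.1%} of the simulated match pairs")
```

The result is a one-sided probability, so it is not directly comparable to a two-sided confidence level; a normal approximation with these results suggests a value near 90%.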

Miguel
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Math Test 4 All

Post by michiguel »

lucasart wrote:
CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
Forget about these Elo margins: they are not usable for deciding the test, as they are two-sided! You need a one-sided test. What you want to know is P(X>0.5), also known as LOS (likelihood of superiority). In this case the LOS simply cannot be calculated, because your test is flawed by early stopping! This is the number one mistake in testing!

But assuming you had decided to play 770 games and 599 games initially and didn't stop the experiment early (which I really doubt), then you could compute the LOS by a trivial calculation (empirical mean and standard deviation, then the Gaussian distribution function to get the asymptotic LOS). With these numbers, I get a LOS of 100.0%, meaning that the test is conclusive.

The only way early stopping is possible is when you are using a stochastic stopping rule that is mathematically sound, like the sequential Wald test, for instance. But that information is not available, so I can only assume that you didn't. An early-stopped result cannot give any conclusion without knowledge of the rule that was used to stop the experiment!

PS: Do not use a poll for that: vox populi is the worst possible choice. Most people are absolutely clueless, and some (even worse) think they know but they don't. Better to understand what you are doing.
There is no problem if you stop the test for reasons that are unrelated to the results. For instance, a blackout, or you needed to use the computer for something else etc.

Miguel
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Math Test 4 All

Post by lucasart »

michiguel wrote: There is no problem if you stop the test for reasons that are unrelated to the results. For instance, a blackout, or you needed to use the computer for something else etc.
In theory yes, it wouldn't introduce a bias. But it would still increase the variance of the estimator (in ways that cannot be quantified, as it would depend on the law of the independent stopping process). So all of the calculations above are not correct in this case.

In practice, you always look before you stop the experiment, and the very fact of looking means you are making a biased decision. If, however, you switch the screen off, stop by pressing Ctrl+C, and then switch it back on, then that's ok :lol:
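
The bias from looking can be illustrated with a small Python simulation. It is a sketch under simplifying assumptions that are not from this thread: two equally strong engines, no draws, a look every 50 games, and stopping as soon as the leader appears ahead at a nominal one-sided 95% level. All names and parameters are illustrative.

```python
import math
import random

def false_positive_rate(peek, trials=1000, max_games=1000, step=50,
                        z_stop=1.645, seed=1):
    """Share of matches between EQUAL engines that get declared a win."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        wins = 0
        for game in range(1, max_games + 1):
            wins += rng.random() < 0.5          # fair coin: no real difference
            # Peeking checks every `step` games; otherwise only at the end.
            at_checkpoint = (game % step == 0) if peek else (game == max_games)
            if at_checkpoint:
                losses = game - wins
                z = (wins - losses) / math.sqrt(game)
                if z > z_stop:                  # "conclusive" by the nominal rule
                    false_positives += 1
                    break
    return false_positives / trials

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed length: {fixed_rate:.1%}, stop when it looks good: {peeking_rate:.1%}")
```

With a fixed length the false positive rate should stay near the nominal 5%, while stopping on a favorable look should push it several times higher, which is why the stopping rule has to be known before any conclusion is drawn.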
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Math Test 4 All

Post by CRoberson »

lucasart wrote:
michiguel wrote: There is no problem if you stop the test for reasons that are unrelated to the results. For instance, a blackout, or you needed to use the computer for something else etc.
In theory yes, it wouldn't introduce a bias. But it would still increase the variance of the estimator (in ways that cannot be quantified, as it would depend on the law of the independent stopping process). So all of the calculations above are not correct in this case.

In practice, you always look before you stop the experiment, and the very fact of looking means you are making a biased decision. If, however, you switch the screen off, stop by pressing Ctrl+C, and then switch it back on, then that's ok :lol:
Nah, I didn't look. The program running the matches crashed in both cases. I had set both to run 2400 games.
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Math test for all.

Post by Ajedrecista »

Hello Charles:
CRoberson wrote:Which version is better A or B?

Score of A vs Telepath6.030: 364 - 144 - 262
A scores (364+262/2)/770 = 495.0/770 = 64.286%
~= +104 Elo
Margins are +/- 21 Elo

Score of B vs Telepath6.030: 271 - 133 - 195
B scores (271+195/2)/599 = 368.5/599 = 61.519%
~= +84 Elo
Margins are +/- 25 Elo

vote in the poll and post your reasons.
I agree with Lucas regarding LOS. Here are the results of my own programme LOS_and_Elo_uncertainties_calculator.

For A:

Code:

LOS_and_Elo_uncertainties_calculator, ® 2012-2013.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Maximum number of games supported: 2147483647.

Write down the number of wins (up to 1825361100):

364

Write down the number of loses (up to 1825361100):

144

Write down the number of draws (up to 2147483139):

262

 Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):

95

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

---------------------------------------
Elo interval for 95.00 % confidence:

Elo rating difference:    102.11 Elo

Lower rating difference:   82.09 Elo
Upper rating difference:  122.81 Elo

Lower bound uncertainty:  -20.02 Elo
Upper bound uncertainty:   20.70 Elo
Average error:        +/-  20.36 Elo

K = (average error)*[sqrt(n)] =  564.95

Elo interval: ]  82.09,  122.81[
---------------------------------------

Number of games of the match:       770
Score: 64.29 %
Elo rating difference:  102.11 Elo
Draw ratio: 34.03 %

************************************************************************
        Sample standard deviation:  1.3709 % of the points of the match.
1.9600 sample standard deviations:  2.6869 % of the points of the match.

                 (Corresponding to 95.00 % confidence).
************************************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS (taking into account draws) is always calculated, if possible.

LOS (not taking into account draws) is only calculated if wins + loses < 16001.

LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________

LOS: 100.00 % (taking into account draws).
LOS: 100.00 % (not taking into account draws).
LOS: 100.00 % (average value).
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time:   75 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
------------------------

For B:

Code:

LOS_and_Elo_uncertainties_calculator, ® 2012-2013.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Maximum number of games supported: 2147483647.

Write down the number of wins (up to 1825361100):

271

Write down the number of loses (up to 1825361100):

133

Write down the number of draws (up to 2147483243):

195

 Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):

95

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

---------------------------------------
Elo interval for 95.00 % confidence:

Elo rating difference:     81.51 Elo

Lower rating difference:   58.64 Elo
Upper rating difference:  105.09 Elo

Lower bound uncertainty:  -22.86 Elo
Upper bound uncertainty:   23.58 Elo
Average error:        +/-  23.22 Elo

K = (average error)*[sqrt(n)] =  568.33

Elo interval: ]  58.64,  105.09[
---------------------------------------

Number of games of the match:       599
Score: 61.52 %
Elo rating difference:   81.51 Elo
Draw ratio: 32.55 %

************************************************************************
        Sample standard deviation:  1.6118 % of the points of the match.
1.9600 sample standard deviations:  3.1590 % of the points of the match.

                 (Corresponding to 95.00 % confidence).
************************************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS (taking into account draws) is always calculated, if possible.

LOS (not taking into account draws) is only calculated if wins + loses < 16001.

LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________

LOS: 100.00 % (taking into account draws).
LOS: 100.00 % (not taking into account draws).
LOS: 100.00 % (average value).
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time:   79 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
(Likelihood of superiority) = LOS ~ 100% for both A and B. If you want to say that one is slightly better than the other, then:

Code:

s: sample standard deviation.
k_i = (score_i - 0.5)/s

k_A ~ (64.29 - 50)/1.3709 ~ 10.42
k_B ~ (61.52 - 50)/1.6118 ~  7.15

k_A > k_B; A is better than B.
It would be good if you managed to increase the number of games in each match, and especially if A played the same number of games as B.
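
The sample standard deviations, the k values and the 95% Elo intervals shown above can be approximately reproduced with the Python sketch below. It is a guess at the underlying arithmetic, not the source of LOS_and_Elo_uncertainties_calculator: it assumes a sample variance with n - 1 in the denominator, a normal interval on the score, and the logistic Elo formula. The function names are made up.

```python
import math

def elo(score):
    """Elo difference implied by a score fraction (logistic model, an assumption)."""
    return 400.0 * math.log10(score / (1.0 - score))

def match_summary(wins, losses, draws, z=1.96):
    """Score, standard deviation of the mean score, k, and Elo interval bounds."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Sum of squared deviations of the per-game results from the mean score.
    sum_sq = (wins * (1.0 - score) ** 2
              + draws * (0.5 - score) ** 2
              + losses * score ** 2)
    s = math.sqrt(sum_sq / (n - 1) / n)      # n - 1: sample variance
    return {
        "score": score,
        "s": s,
        "k": (score - 0.5) / s,
        "elo_low": elo(score - z * s),
        "elo_high": elo(score + z * s),
    }

for name, wld in {"A": (364, 144, 262), "B": (271, 133, 195)}.items():
    m = match_summary(*wld)
    print(f"{name}: s = {m['s']:.4%}, k = {m['k']:.2f}, "
          f"Elo interval = ]{m['elo_low']:.2f}, {m['elo_high']:.2f}[")
```

Note that k grows with the number of games as well as with the score, so comparing k_A with k_B is only a rough indication when the two matches have different lengths.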

A seems no worse than B to my patzer eyes. Maybe a self-test can confirm it (IIRC self-tests are usually good for comparing two versions but tend to exaggerate the true improvement; for this reason, a bunch of different engines is better). Anyway, both A and B seem far better than Telepath6.030.

Miguel says that A is better than B with 74.2%. All I can say is the following:

Code:

A - B ~ 102.11 - 81.51 ± sqrt[(20.36)² + (23.22)²] ~ 20.6 ± 30.88 Elo (with 95% confidence).

After some trials, if I use 80.93% confidence:

A - B ~ 102.11 - 81.51 ± sqrt[(13.58)² + (15.49)²] ~ 20.6 ± 20.6 Elo.
Both Miguel and I agree that A could be better than B, but we slightly disagree on the confidence levels. We surely use different models, and the only thing I can say is that you should use my results with a lot of care. I will not vote in the poll because I am not so sure. I hope there are no typos in this post.
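
The combination of the two error bars can be checked with a short Python sketch. It assumes the two matches are independent and that the Elo errors are roughly normal and symmetric, so the 95% margins add in quadrature; the function name is illustrative.

```python
import math

def difference_summary(elo_a, err_a, elo_b, err_b, z_level=1.96):
    """Difference A - B, its combined margin, and the implied confidence."""
    diff = elo_a - elo_b
    err_diff = math.hypot(err_a, err_b)      # sqrt(err_a^2 + err_b^2)
    z = diff / (err_diff / z_level)          # difference in standard deviations
    two_sided = math.erf(z / math.sqrt(2.0)) # level at which margin equals diff
    one_sided = 0.5 * (1.0 + two_sided)      # P(A - B > 0), normal approximation
    return diff, err_diff, z, two_sided, one_sided

diff, err_diff, z, two_sided, one_sided = difference_summary(
    102.11, 20.36, 81.51, 23.22)
print(f"A - B = {diff:.2f} +/- {err_diff:.2f} Elo (95%)")
print(f"z = {z:.2f}, two-sided level = {two_sided:.1%}, one-sided = {one_sided:.1%}")
```

Under these assumptions the difference is about 1.3 standard deviations from zero, which corresponds to a two-sided level of about 81% and a one-sided probability of about 90% that A is the stronger version.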

Regards from Spain.

Ajedrecista.
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Math Test 4 All

Post by CRoberson »

So far, interesting answers. For clarification: I am not asking if A or B is better than Telepath6.030. I am only asking which of A or B is better than the other. The data given is sufficient to show that both A and B are better than Telepath 6.030.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Math Test 4 All

Post by Laskos »

CRoberson wrote:So far, interesting answers. For clarification: I am not asking if A or B is better than Telepath6.030. I am only asking which of A or B is better than the other. The data given is sufficient to show that both A and B are better than Telepath 6.030.
If early stopping was not an issue (as Lucas pointed out), then you are inside 2 SD, probably at about 1.2 SD of confidence that A is stronger than B, just by using Jesus' formula. If you want more confidence, say 2 SD (95%), then play more games and use well-defined stopping rules.