Critter 1.4a x64 SSE4 vs Houdini 2.0c Pro x64 match results

ivoryknight · Post by **ivoryknight** » Fri Mar 16, 2012 3:50 pm

http://atomicc-testing.blogspot.com/201 ... c-x64.html

Houdini · Post by **Houdini** » Fri Mar 16, 2012 4:15 pm

ivoryknight wrote:http://atomicc-testing.blogspot.com/201 ... c-x64.html

Hello Brent,

I read in your blog post: "I am really quite shocked by this result."

If you're "really quite shocked", you're probably very much unaware of the variability and error margins in chess engine testing.
Do you have any idea about the 95% confidence interval on a 100-game match?

Robert

ATOMICC · Post by **ATOMICC** » Fri Mar 16, 2012 4:23 pm

Houdini wrote:
ivoryknight wrote:http://atomicc-testing.blogspot.com/201 ... c-x64.html
Hello Brent,

I read in your blog post: "I am really quite shocked by this result."

If you're "really quite shocked", you're probably very much unaware of the variability and error margins in chess engine testing.
Do you have any idea about the 95% confidence interval on a 100-game match?

Robert

Yes, Robert, I am aware of those things. Thank you for responding, by the way. I just thought the match would be closer, given the previous results of my tests. Please do not take my post in any bad way; I did not mean it to look poorly on your engine, which I really like. I knew Houdini would do well; it always does. I just expected Critter to perform more strongly. I do not mean any offense to Critter, either. Just a bit surprised by the result.

Houdini · Post by **Houdini** » Fri Mar 16, 2012 5:08 pm

Hello Brent,

You're missing my point, my reaction was not emotional ("bad way", "offended", "look poorly on") but purely statistical.

If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47.
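For readers who want to reproduce the rough 53-47 to 67-33 range, here is a minimal sketch using a normal approximation. The win/draw/loss split below is hypothetical (the thread only gives the 60-40 total), and `score_interval` is an illustrative name, not part of any testing tool.

```python
import math

def score_interval(wins, draws, losses, z=1.96):
    """Return (score, margin) as fractions, using a normal approximation."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    # z = 1.96 corresponds to a two-sided 95% interval.
    margin = z * math.sqrt(variance / games)
    return score, margin

# Hypothetical split that adds up to 60-40 over 100 games.
score, margin = score_interval(wins=38, draws=44, losses=18)
print(f"{score:.1%} +/- {margin:.1%}")

# The same 60-40 score with no draws at all gives a wider margin.
score_nd, margin_nd = score_interval(wins=60, draws=0, losses=40)
print(f"{score_nd:.1%} +/- {margin_nd:.1%}")
```

With this assumed split the margin comes out at roughly 7 points; with fewer draws it grows toward roughly 10 points, so the exact range depends on the draw rate of the actual match.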

From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?

Robert

Adam Hair · Post by **Adam Hair** » Sat Mar 17, 2012 2:02 pm

Houdini wrote:Hello Brent,

You're missing my point, my reaction was not emotional ("bad way", "offended", "look poorly on") but purely statistical.

If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47.

From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?

Robert

Perhaps because of Ingo's results. Since Brent's results for Komodo vs Critter and Komodo vs Houdini resembled Ingo's results, it is not unreasonable to expect a result similar to Ingo's for Critter vs Houdini.

So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).

One thing to note. Yes, it appears that the rating difference between Critter and Houdini is ~40 Elo. From that, we could expect the score in a match between the two engines to be approximately 55% to 45%. But, we also know that there is a large amount of variance from the expected score for a match that is due to the observation that some engines play better against some opponents than against other opponents, regardless of Elo rating. In my own, non-expert opinion, it is better to try to base an expectation of a single match result from a previous result such as Ingo's rather than from the Elo rating. So, I believe that Brent had some basis to be surprised by the result.
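As a quick check on the link between a ~40 Elo gap and a roughly 55-45 score, here is a small sketch that assumes the standard logistic Elo curve. Rating lists may use a slightly different model, and the function names are illustrative.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse of expected_score: Elo difference implied by a score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

print(f"40 Elo -> expected score {expected_score(40):.3f}")
print(f"55% score -> {elo_from_score(0.55):.1f} Elo")
print(f"60% score -> {elo_from_score(0.60):.1f} Elo")
```

Under this model a 40 Elo difference maps to a score a little under 56%, and a 60% score maps to a difference of about 70 Elo.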

However, I am not that surprised. I have seen individual match results, using the same positions and the same engines each time on the same computer, that have shown more variance than I would have expected. I cannot expect that Brent's results and Ingo's results will always closely match each other.

Adam

Addendum:

I try not to nit-pick, but I have to say one more thing (it is the reason why I responded in the first place).

This statement,
"If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47."
is mostly correct (we expect that, 95% of the time, the true score will fall within a confidence interval computed in this manner).

This statement,
"From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?"
is nonsense, in the sense that 60-40 lies in the interval only because it is the value used to compute the interval. Rather, the expected result, 55-45 as determined by Elo difference, lies within the confidence interval computed from Brent's results.

I knew what you meant, but others might get confused.

Houdini · Post by **Houdini** » Sat Mar 17, 2012 3:27 pm

Adam Hair wrote:So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).

Adam, that's incorrect. The confidence interval (+- 7%) is only valid with respect to the expected "true" value, not with respect to another random sample.

If you want to use Ingo's random sample to predict Brent's random sample, you have to multiply the uncertainty margin by 1.4: the confidence interval becomes 53% +- 10%. In other words, from Ingo's result you can deduce that another similar match will produce with 95% certainty a result between 43% and 63%. Brent's result lies comfortably within these boundaries.
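One way to read the factor of 1.4 is as sqrt(2): the observed match and the future match each contribute the same sampling variance. The sketch below assumes both matches are independent, have the same length and per-game variance, and uses a normal approximation rather than Student's t. The function names and the placeholder match length are illustrative.

```python
import math

def prediction_margin(margin_true, games_observed, games_future):
    """Widen a margin for the true score into a margin for a future match score."""
    # Var(future mean - observed mean) = sigma^2/n + sigma^2/m
    #                                  = (sigma^2/n) * (1 + n/m)
    return margin_true * math.sqrt(1.0 + games_observed / games_future)

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

centre, margin_true = 52.67, 7.04  # figures quoted in the thread, in percent
# Equal match lengths are an assumption; only the ratio of the two matters.
margin_pred = prediction_margin(margin_true, games_observed=100, games_future=100)
low, high = centre - margin_pred, centre + margin_pred
print(f"{centre:.2f}% +/- {margin_pred:.2f}% -> ({low:.1f}%, {high:.1f}%)")

# Chance that the future match scores below 60% under these assumptions.
sigma_pred = margin_pred / 1.96
print(f"P(score < 60%) ~ {normal_cdf((60.0 - centre) / sigma_pred):.3f}")
```

Under these assumptions the interval is roughly 43% to 63%, and the chance of a score below 60% is around 92 to 93% rather than 98%.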

A more straightforward way of looking at the results is as follows.
Ingo obtains 53 %, Brent obtains 60 %. Both results are perfectly compatible with the 55% +- 7% result predicted for a 100-game match and about 40 Elo established rating difference.

Robert

marijan · Post by **marijan** » Sat Mar 17, 2012 4:37 pm

Hi, Robert! I see you are using the term "variability and error margins in chess engine testing". A question for you: if I run a test with reversible starting positions (and I always do), are the variability and error margins the same as when I don't use reversible starting positions?

Regards!

Adam Hair · Post by **Adam Hair** » Sat Mar 17, 2012 5:25 pm

Houdini wrote:
Adam Hair wrote:So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
Adam, that's incorrect. The confidence interval (+- 7%) is only valid with respect to the expected "true" value, not with respect to another random sample.

If you want to use Ingo's random sample to predict Brent's random sample, you have to multiply the uncertainty margin by 1.4: the confidence interval becomes 53% +- 10%. In other words, from Ingo's result you can deduce that another similar match will produce with 95% certainty a result between 43% and 63%. Brent's result lies comfortably within these boundaries.

A more straightforward way of looking at the results is as follows.
Ingo obtains 53 %, Brent obtains 60 %. Both results are perfectly compatible with the 55% +- 7% result predicted for a 100-game match and about 40 Elo established rating difference.

Robert

I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).

If I were to compare Ingo's result to Brent's result, then Sqrt(2) would come into play.

As you noted, the two results are not at odds with each other. However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.

Adam

Houdini · Post by **Houdini** » Sat Mar 17, 2012 8:32 pm

Adam Hair wrote:I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).

My number is an approximation but should be reasonably accurate. Can you provide a better value?

Adam Hair wrote:However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.

The whole point is that I am not in the least surprised by finding a 53% and a 60% outcome in similar 100-game matches.
There is no statistical surprise nor any practical surprise - I see this kind of result every day, don't you?

Robert

Adam Hair · Post by **Adam Hair** » Sat Mar 17, 2012 10:43 pm

Houdini wrote:
Adam Hair wrote:I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).
My number is an approximation but should be reasonably accurate. Can you provide a better value?

My actual source is a statistical textbook, but the Wikipedia entry for "prediction interval" also has the formula for computing the endpoints of the interval. For computing the variance, I used the approximation Error = 100%* Sqrt((score*(1-score)-0.25*draw%)/games).
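The form of that approximation can be checked with a short derivation. The symbols are introduced here for illustration: a single game result X is 1, 1/2 or 0 with probabilities w, d and l, s is the expected score, N is the number of games, and d is the draw ratio written as a fraction (the "draw%" term above).

```latex
\begin{aligned}
s &= \mathbb{E}[X] = w + \tfrac{1}{2}d, \\
\mathbb{E}[X^2] &= w + \tfrac{1}{4}d = s - \tfrac{1}{4}d, \\
\operatorname{Var}(X) &= \mathbb{E}[X^2] - s^2 = s(1-s) - \tfrac{1}{4}d, \\
\text{Error} &= 100\% \cdot \sqrt{\frac{s(1-s) - \tfrac{1}{4}d}{N}} .
\end{aligned}
```

Read this way, the expression is one standard error of the match score; a 95% margin under a normal approximation would multiply it by about 1.96.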

If we were comparing the two samples as point estimates for the true score, then I understand how Sqrt(2) comes into play.

Houdini wrote:
Adam Hair wrote:However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.
The whole point is that I am not in the least surprised by finding a 53% and a 60% outcome in similar 100-game matches.
There is no statistical surprise nor any practical surprise - I see this kind of result every day, don't you?

Robert

I agree with you, especially since no consideration of any measurement error has been made. My only point is that, basing a prediction on Ingo's results, a 60% score for Houdini vs Critter in Brent's test may be unexpected. Of course, when taking everything into account, the surprise evaporates.
