Critter 1.4a x64 SSE4 vs Houdini 2.0c Pro x64 match results

ivoryknight
Posts: 117
Joined: Fri Mar 25, 2011 10:40 pm
Location: USA

Post by ivoryknight »

Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Post by Houdini »

Hello Brent,

I read in your blog post: "I am really quite shocked by this result."

If you're "really quite shocked", you're probably very much unaware of the variability and error margins in chess engine testing.
Do you have any idea about the 95% confidence interval on a 100-game match?

Robert
ATOMICC
Posts: 150
Joined: Sat Mar 10, 2012 11:50 pm
Location: USA

Post by ATOMICC »

Houdini wrote:
Hello Brent,

I read in your blog post: "I am really quite shocked by this result."

If you're "really quite shocked", you're probably very much unaware of the variability and error margins in chess engine testing.
Do you have any idea about the 95% confidence interval on a 100-game match?

Robert

Yes, Robert, I am aware of those things. Thank you for responding, by the way. I just thought the match would be closer, given the previous results of my tests. Please do not take my post in any bad way; I did not mean it to look poorly on your engine, which I really like. I knew Houdini would do well; it always does. I just expected Critter to perform more strongly. I do not mean any offense to Critter, either. Just a bit surprised by the result.
Happy chessing!
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Post by Houdini »

Hello Brent,

You're missing my point: my reaction was not emotional ("bad way", "offended", "look poorly on") but purely statistical.

If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47.

From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?

Robert
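
As a rough check on the interval quoted above, the following sketch computes a normal-approximation 95% interval for a match score. The thread does not state the draw ratio of the match, so the 45% used here is purely an assumed value, and the function name `score_interval` is illustrative only.

```python
import math

def score_interval(score, draw_ratio, games, z=1.96):
    """Normal-approximation confidence interval for a match score.

    score and draw_ratio are fractions in [0, 1]; z=1.96 gives roughly 95%.
    """
    # Per-game variance when a win scores 1, a draw 0.5 and a loss 0.
    per_game_var = score * (1.0 - score) - 0.25 * draw_ratio
    half_width = z * math.sqrt(per_game_var / games)
    return score - half_width, score + half_width

# Assumed draw ratio of 45%; the actual ratio is not given in the thread.
low, high = score_interval(score=0.60, draw_ratio=0.45, games=100)
print(f"{low:.3f} .. {high:.3f}")
```

With these assumed inputs the interval comes out at roughly 53% to 67%. A lower draw ratio widens it: with no draws at all, the same formula gives about +/- 9.6% for 100 games.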
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Post by Adam Hair »

Houdini wrote:
Hello Brent,

You're missing my point: my reaction was not emotional ("bad way", "offended", "look poorly on") but purely statistical.

If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47.

From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?

Robert

Perhaps because of Ingo's results. Since Brent's results for Komodo vs Critter and Komodo vs Houdini resembled Ingo's results, it is not unreasonable to expect a result similar to Ingo's for Critter vs Houdini.

So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
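
The 98% figure can be reproduced approximately from the two numbers quoted above. This sketch treats the predicted score as normally distributed with mean 52.67% and a 95% half-width of 7.04%. Using a normal distribution in place of Student's t is an assumption made here, because the number of games in Ingo's match is not stated in the thread.

```python
from statistics import NormalDist

mean = 0.5267          # centre of the interval quoted in the post
half_width = 0.0704    # 95% half-width quoted in the post

# Back out the standard deviation from the 95% half-width.
sigma = half_width / NormalDist().inv_cdf(0.975)

# Probability that the predicted score falls below 60%.
p_below_60 = NormalDist(mu=mean, sigma=sigma).cdf(0.60)
print(f"{p_below_60:.3f}")
```

Under that assumption the probability comes out close to 98%, consistent with the estimate in the post.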

One thing to note. Yes, it appears that the rating difference between Critter and Houdini is ~40 Elo. From that, we could expect the score in a match between the two engines to be approximately 55% to 45%. But we also know that match scores vary considerably around the expected score, because some engines play better against some opponents than against others, regardless of Elo rating. In my own, non-expert opinion, it is better to base an expectation of a single match result on a previous result such as Ingo's rather than on the Elo rating. So, I believe that Brent had some basis to be surprised by the result.
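
For reference, the step from a ~40 Elo difference to a roughly 55% expected score can be sketched as follows. This assumes the standard logistic Elo formula; individual rating lists may use a different model, and `expected_score` is an illustrative name.

```python
def expected_score(elo_diff):
    """Expected score of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(f"{expected_score(40):.3f}")
```

For a 40 Elo difference this gives a little under 56%, in line with the "approximately 55% to 45%" figure.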

However, I am not that surprised. I have seen individual match results, using the same positions and the same engines each time on the same computer, that have shown more variance than I would have expected. I cannot expect that Brent's results and Ingo's results will always closely match each other.

Adam

Addendum:

I try not to nit-pick, but I have to say one more thing (it is the reason why I responded in the first place).

This statement,
"If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47."
is mostly correct (we expect that, 95% of the time, the true score would fall within a confidence interval computed in this manner).

This statement,
"From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?"

is nonsense, in the sense that 60-40 lies in the interval only because it was used to compute the interval. What can be said is that the expected result, 55-45 as determined by the Elo difference, lies within the confidence interval computed from Brent's results.

I knew what you meant, but others might get confused.
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Post by Houdini »

Adam Hair wrote:So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
Adam, that's incorrect. The confidence interval (+- 7%) is only valid with respect to the expected "true" value, not with respect to another random sample.

If you want to use Ingo's random sample to predict Brent's random sample, you have to multiply the uncertainty margin by 1.4: the confidence interval becomes 53% +- 10%. In other words, from Ingo's result you can deduce that another similar match will produce with 95% certainty a result between 43% and 63%. Brent's result lies comfortably within these boundaries.
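
One way to see where a factor of about 1.4 can come from: if two matches are treated as independent samples with the same per-game variance, the variances of the two match scores add. The sketch below rests on exactly those assumptions; the 7% input is the half-width quoted above, and `prediction_half_width` is an illustrative name. The factor equals sqrt(2) only when both matches have the same number of games.

```python
import math

def prediction_half_width(ci_half_width, games_observed, games_future):
    """Half-width for the score of a future match, given the confidence
    half-width of an observed match.

    Assumes independent matches with equal per-game variance v, so that
    Var(future - observed) = v/games_future + v/games_observed.
    """
    scale = math.sqrt(1.0 + games_observed / games_future)
    return ci_half_width * scale

# Equal-length matches: the factor is sqrt(2), about 1.41.
print(f"{prediction_half_width(0.07, 100, 100):.3f}")
```

For two 100-game matches this turns +/- 7% into roughly +/- 10%. If the observed match were longer than the future one, the factor relative to the observed interval would be larger than sqrt(2).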

A more straightforward way of looking at the results is as follows.
Ingo obtains 53 %, Brent obtains 60 %. Both results are perfectly compatible with the 55% +- 7% result predicted for a 100-game match and about 40 Elo established rating difference.

Robert
marijan
Posts: 56
Joined: Mon Jan 16, 2012 1:16 am

Post by marijan »

Hi, Robert! I see you are using the term "variability and error margins in chess engine testing". A question for you: if I run a test with reversible starting positions (and I always do that), are those "variability and error margins in chess engine testing" the same as when I don't use reversible starting positions?


Regards!
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Post by Adam Hair »

Houdini wrote:
Adam Hair wrote:So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
Adam, that's incorrect. The confidence interval (+- 7%) is only valid with respect to the expected "true" value, not with respect to another random sample.

If you want to use Ingo's random sample to predict Brent's random sample, you have to multiply the uncertainty margin by 1.4: the confidence interval becomes 53% +- 10%. In other words, from Ingo's result you can deduce that another similar match will produce with 95% certainty a result between 43% and 63%. Brent's result lies comfortably within these boundaries.

A more straightforward way of looking at the results is as follows.
Ingo obtains 53 %, Brent obtains 60 %. Both results are perfectly compatible with the 55% +- 7% result predicted for a 100-game match and about 40 Elo established rating difference.

Robert

I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).

If I were to compare Ingo's result to Brent's result, then Sqrt(2) would come into play.

As you noted, the two results are not at odds with each other. However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.

Adam
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Post by Houdini »

Adam Hair wrote:I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).
My number is an approximation but should be reasonably accurate. Can you provide a better value?
Adam Hair wrote:However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.
The whole point is that I am not in the least surprised by finding a 53% and a 60% outcome in similar 100-game matches.
There is no statistical surprise nor any practical surprise - I see these kinds of results every day, don't you?

Robert
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Post by Adam Hair »

Houdini wrote:
Adam Hair wrote:I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).
My number is an approximation but should be reasonably accurate. Can you provide a better value?
My actual source is a statistics textbook, but the Wikipedia entry for "prediction interval" also has the formula for computing the endpoints of the interval. For computing the variance, I used the approximation Error = 100% * Sqrt((score*(1-score) - 0.25*draw%)/games), with score and draw% entered as fractions.
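
The approximation can be checked against a direct computation from win, draw, and loss counts. The counts below are made up for illustration and are not taken from either match; both function names are likewise hypothetical. The functions return a fraction, so multiply by 100 for a percentage.

```python
import math

def std_error_direct(wins, draws, losses):
    """Standard error of the match score computed from per-game results."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Mean of squared per-game results (1, 0.5 or 0) minus the squared mean.
    mean_sq = (wins * 1.0 + draws * 0.25) / games
    return math.sqrt((mean_sq - score ** 2) / games)

def std_error_formula(score, draw_ratio, games):
    """Sqrt((score*(1-score) - 0.25*draw_ratio) / games), inputs as fractions."""
    return math.sqrt((score * (1.0 - score) - 0.25 * draw_ratio) / games)

# Hypothetical counts, chosen only to exercise both functions.
wins, draws, losses = 40, 40, 20
games = wins + draws + losses
score = (wins + 0.5 * draws) / games
print(std_error_direct(wins, draws, losses))
print(std_error_formula(score, draws / games, games))
```

Both lines print the same value up to floating-point rounding, since score*(1-score) - 0.25*draw_ratio is the per-game variance of a result that takes the values 1, 0.5, and 0.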

If we were comparing the two samples as point estimates for the true score, then I understand how Sqrt(2) comes into play.
Houdini wrote:
Adam Hair wrote:However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.
The whole point is that I am not in the least surprised by finding a 53% and a 60% outcome in similar 100-game matches.
There is no statistical surprise nor any practical surprise - I see these kinds of results every day, don't you?

Robert

I agree with you, especially since no consideration of any measurement error has been made. My only point is that, if one bases a prediction on Ingo's results alone, a 60% score for Houdini vs Critter in Brent's test may be unexpected. Of course, when taking everything into account, the surprise evaporates.