Critter 1.4a x64 SSE4 vs Houdini 2.0c Pro x64 match results
-
- Posts: 117
- Joined: Fri Mar 25, 2011 10:40 pm
- Location: USA
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
ivoryknight wrote: http://atomicc-testing.blogspot.com/201 ... c-x64.html
Hello Brent,
I read in your blog post: "I am really quite shocked by this result."
If you're "really quite shocked", you're probably very much unaware of the variability and error margins in chess engine testing.
Do you have any idea about the 95% confidence interval on a 100-game match?
Robert
-
- Posts: 150
- Joined: Sat Mar 10, 2012 11:50 pm
- Location: USA
ivoryknight wrote: http://atomicc-testing.blogspot.com/201 ... c-x64.html
Houdini wrote:
Hello Brent,
I read in your blog post: "I am really quite shocked by this result."
If you're "really quite shocked", you're probably very much unaware of the variability and error margins in chess engine testing.
Do you have any idea about the 95% confidence interval on a 100-game match?
Robert
Yes, Robert, I am aware of those things. Thank you for responding, by the way. I just thought the match would be closer, given the previous results of my tests. Please do not take my post in any bad way; I did not mean it to look poorly on your engine, which I really like. I knew Houdini would do well; it always does. I just expected Critter to perform more strongly. I do not mean any offense to Critter, either. Just a bit surprised by the result.
Happy chessing!
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Hello Brent,
You're missing my point, my reaction was not emotional ("bad way", "offended", "look poorly on") but purely statistical.
If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47.
From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?
Robert
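The interval arithmetic in this post can be sketched as follows. This is a minimal illustration, not anyone's actual test tooling: it uses a normal approximation with the per-game variance score*(1-score) - 0.25*draw_rate (the approximation quoted later in the thread), and the 45% draw rate is a hypothetical value chosen for illustration, since the thread does not state the draw rate of the match. The function name `score_margin` is likewise made up for this example.

```python
import math

def score_margin(score, draw_rate, games, z=1.96):
    """Half-width of an approximate 95% interval for a match score.

    Per-game variance for results in {0, 0.5, 1} is
    score * (1 - score) - draw_rate / 4 (normal approximation).
    """
    variance = score * (1.0 - score) - 0.25 * draw_rate
    return z * math.sqrt(variance / games)

observed = 0.60    # the 60-40 result discussed above
expected = 0.55    # expected outcome from the established ratings
games = 100
draw_rate = 0.45   # ASSUMPTION: hypothetical draw rate, not given in the thread

margin = score_margin(observed, draw_rate, games)
low, high = observed - margin, observed + margin

print(f"observed {observed:.0%} +/- {margin:.1%} -> ({low:.1%}, {high:.1%})")
print("expected score inside interval:", low < expected < high)
```

With these assumed inputs the margin comes out at roughly 7%, which is consistent with the 67-33 to 53-47 range quoted above. With no draws at all the same formula gives a wider margin of about 9.6%, so the draw rate matters.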
-
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
Houdini wrote:
Hello Brent,
You're missing my point, my reaction was not emotional ("bad way", "offended", "look poorly on") but purely statistical.
If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47.
From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?
Robert
Perhaps because of Ingo's results. Since Brent's results for Komodo vs Critter and Komodo vs Houdini resembled Ingo's results, it is not unreasonable to expect a result similar to Ingo's for Critter vs Houdini.
So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
One thing to note. Yes, it appears that the rating difference between Critter and Houdini is ~40 Elo. From that, we could expect the score in a match between the two engines to be approximately 55% to 45%. But we also know that a match score can deviate substantially from the expected score because some engines play better against some opponents than against others, regardless of Elo rating. In my own non-expert opinion, it is better to base an expectation for a single match result on a previous result such as Ingo's than on the Elo rating. So, I believe that Brent had some basis to be surprised by the result.
However, I am not that surprised. I have seen individual match results, using the same positions and the same engines each time on the same computer, that have shown more variance than I would have expected. I cannot expect that Brent's results and Ingo's results will always closely match each other.
Adam
Addendum:
I try not to nit-pick, but I have to say one more thing (it is the reason why I responded in the first place).
This statement,
"If you see a 60-40 result in a 100 game match, you should understand that the most likely true result lies approximately between 67-33 and 53-47."
is mostly correct (we expect that, 95% of the time, the true score falls within a confidence interval computed in this manner).
This statement,
"From Houdini and Critter's more or less established ratings the expected outcome is about 55-45.
Why are you surprised or "quite shocked" to find a 60-40 that lies well within the 95% confidence interval of the expected outcome?"
is nonsense, in the sense that 60-40 is in the interval due to the fact it is being used to compute the interval. The expected result, 55-45 as determined by Elo difference, lies within the confidence interval computed from Brent's results.
I knew what you meant, but others might get confused.
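The link between a ~40 Elo gap and a roughly 55-45 expected score can be checked with the standard logistic Elo formula. The sketch below assumes that formula; individual rating lists may use a slightly different model, and the function names are made up for this example.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse of expected_score: Elo difference implied by a match score."""
    return -400.0 * math.log10(1.0 / score - 1.0)

print(f"40 Elo   -> expected score {expected_score(40):.1%}")
print(f"60% score -> about {elo_from_score(0.60):.0f} Elo")
print(f"52.67% score -> about {elo_from_score(0.5267):.0f} Elo")
```

Under this model 40 Elo corresponds to a score a little under 56%, a 60% score to about 70 Elo, and Ingo's 52.67% to under 20 Elo, so the two matches bracket the rating-based expectation.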
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Adam Hair wrote: So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
Adam, that's incorrect. The confidence interval (+- 7%) is only valid with respect to the expected "true" value, not with respect to another random sample.
If you want to use Ingo's random sample to predict Brent's random sample, you have to multiply the uncertainty margin by 1.4: the confidence interval becomes 53% +- 10%. In other words, from Ingo's result you can deduce that another similar match will produce with 95% certainty a result between 43% and 63%. Brent's result lies comfortably within these boundaries.
A more straightforward way of looking at the results is as follows.
Ingo obtains 53 %, Brent obtains 60 %. Both results are perfectly compatible with the 55% +- 7% result predicted for a 100-game match and about 40 Elo established rating difference.
Robert
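The factor of 1.4 used in this post is sqrt(2), and it can be sketched as below. The assumptions are that both matches share the same per-game variance, that a normal approximation holds, and that the two matches have the same number of games; the thread does not state the size of Ingo's match, so the 100 used for it here is a placeholder. The function name `prediction_margin` is made up for this example.

```python
import math

def prediction_margin(first_margin, first_games, next_games):
    """Margin for predicting the score of a second match from a first one.

    The difference of two independent match scores has variance
    sigma^2/first_games + sigma^2/next_games, so the first match's
    margin is scaled by sqrt(1 + first_games/next_games).
    """
    return first_margin * math.sqrt(1.0 + first_games / next_games)

ingo_score = 52.67    # percent, from the quoted post
ingo_margin = 7.04    # percent, 95% margin from the quoted post
ingo_games = 100      # ASSUMPTION: match size not stated in the thread
brent_games = 100

margin = prediction_margin(ingo_margin, ingo_games, brent_games)
low, high = ingo_score - margin, ingo_score + margin

print(f"prediction interval: {ingo_score:.2f}% +/- {margin:.2f}% -> ({low:.2f}%, {high:.2f}%)")
print("Brent's 60% inside:", low < 60.0 < high)
```

For equal match sizes the scale factor is exactly sqrt(2), which turns +/- 7.04% into roughly +/- 10% and puts a 60% score inside the interval.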
-
- Posts: 56
- Joined: Mon Jan 16, 2012 1:16 am
Hi, Robert! I see you are using the term "variability and error margins in chess engine testing". A question for you: if I run a test with reversible starting positions (and I always do that), are the "variability and error margins in chess engine testing" the same as when I don't use reversible starting positions?
Regards!
-
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
Adam Hair wrote: So, using the results from Ingo's match between Critter 1.4a and Houdini 2.0c, we could compute a prediction interval. From Ingo's data, 52.67% +/- 7.04% or (45.63, 59.71) is a 95% prediction interval for the score of Brent's match. In fact, one might predict that there was approximately 98% chance that the score for Houdini would be less than 60% (based on Ingo's results).
Houdini wrote:
Adam, that's incorrect. The confidence interval (+- 7%) is only valid with respect to the expected "true" value, not with respect to another random sample.
If you want to use Ingo's random sample to predict Brent's random sample, you have to multiply the uncertainty margin by 1.4: the confidence interval becomes 53% +- 10%. In other words, from Ingo's result you can deduce that another similar match will produce with 95% certainty a result between 43% and 63%. Brent's result lies comfortably within these boundaries.
A more straightforward way of looking at the results is as follows.
Ingo obtains 53 %, Brent obtains 60 %. Both results are perfectly compatible with the 55% +- 7% result predicted for a 100-game match and about 40 Elo established rating difference.
Robert
I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).
If I were to compare Ingo's result to Brent's result, then Sqrt(2) would come into play.
As you noted, the two results are not at odds with each other. However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.
Adam
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Adam Hair wrote: I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).
My number is an approximation but should be reasonably accurate. Can you provide a better value?
Adam Hair wrote: However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.
The whole point is that I am not in the least surprised by finding a 53% and a 60% outcome in similar 100-game matches.
There is no statistical surprise nor any practical surprise - I see this kind of results every day, don't you?
Robert
-
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
Adam Hair wrote: I am using the results of one random sample to predict (although after the fact) the results of another random sample. The computation of a prediction interval makes use of the sample mean, sample variance, and Student's T distribution, but not Sqrt(2).
Houdini wrote: My number is an approximation but should be reasonably accurate. Can you provide a better value?
My actual source is a statistical textbook, but the Wikipedia entry for "prediction interval" also has the formula for computing the endpoints of the interval. For computing the variance, I used the approximation Error = 100%* Sqrt((score*(1-score)-0.25*draw%)/games).
If we were comparing the two samples as point estimates for the true score, then I understand how Sqrt(2) comes into play.
Adam Hair wrote: However, we probably would not have predicted Brent's results from looking solely at Ingo's results, which by the way are the most similar to Brent's conditions.
Houdini wrote:
The whole point is that I am not in the least surprised by finding a 53% and a 60% outcome in similar 100-game matches.
There is no statistical surprise nor any practical surprise - I see this kind of results every day, don't you?
Robert
I agree with you, especially since no consideration of any measurement error has been made. My only point is that, basing a prediction on Ingo's results, a 60% score for Houdini vs Critter in Brent's test may be unexpected. Of course, when taking everything into account, the surprise evaporates.
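The error approximation quoted in this post can be derived in a few lines. The sketch assumes independent games with fixed win, draw and loss probabilities; the symbols w, d, l, s, X and N are introduced here for illustration and do not appear in the thread.

```latex
% w, d, l : per-game probabilities of a win, draw, loss (w + d + l = 1)
% X       : result of one game, taking values 1, 1/2, 0
% s       : expected score per game, N : number of games
\begin{align*}
s &= w + \tfrac{1}{2} d \\
\operatorname{Var}(X) &= E[X^2] - s^2 = w + \tfrac{1}{4} d - s^2 \\
  &= \left(s - \tfrac{1}{2} d\right) + \tfrac{1}{4} d - s^2
   = s(1 - s) - \tfrac{1}{4} d \\
\text{Error} &= \sqrt{\frac{s(1 - s) - \tfrac{1}{4} d}{N}}
\end{align*}
```

Read this way, the Error term is one standard error of the match score; under a normal approximation a 95% margin is about 1.96 times that value.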