I think the ELO error bar is a misleading metric in chess engine development testing.
When testing a patch we want to know if the patch works or not.
In the simplest (and most time-efficient) approach we play, let's say, 500 games and then look at the results. We see that engine A has an advantage over engine B of, say, 10 ELO points. This is below the error bar threshold, so we are supposed to discard the result or consider it not trustworthy. This happens because we try to answer the question:
"Is engine A stronger than engine B?"
We are forced to answer "We don't know" due to the high error bar, so we throw away the test result. In my opinion we are asking the wrong question. The correct question to ask is the following:
"Given that engine A _seems_ stronger than engine B by 10 ELO points after 500 games, what is the probability that engine A is _actually_ stronger than B by _any_ amount?"
I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with 90% probability. It means that if we commit 10 changes/patches, each one tested with only 500 games, then at the end we have committed, on average, 9 patches that work and 1 that doesn't, but the net effect is clearly positive. And we have used only a fraction of the testing resources of a fully verified ELO test.
Unfortunately I don't know the statistical formula that links ELO difference, number of played games and likelihood of superiority. To be more precise, I would like to ask for the mathematical formulas for the two questions below:
1) What is the probability that engine A is stronger than engine B, given an ELO difference of V points calculated after N matches?
2) If we apply N changes to an engine, and each change has a probability P of being a win (i.e. of making the engine stronger), what is the probability that the resulting engine is stronger than the original?
I would like the quantitative formulas if possible.
Thanks a lot
Marco
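Question 2 has a simple closed form only under a simplifying assumption the post does not state: if every patch shifts strength by roughly the same absolute amount, the resulting engine is stronger exactly when more than half of the N patches are genuine improvements, which is a binomial tail probability. A minimal sketch of that calculation, using the 10-patch / 90% numbers from above:
[code]
#include <cmath>
#include <cstdio>

// Probability of exactly k successes in n independent trials with success rate p.
double binom_pmf(int n, int k, double p) {
    double logc = std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
    return std::exp(logc + k * std::log(p) + (n - k) * std::log(1.0 - p));
}

int main() {
    const int    N = 10;    // patches applied
    const double P = 0.90;  // hypothetical chance that any single patch is an improvement
    double p_stronger = 0.0;
    for (int k = N / 2 + 1; k <= N; ++k)    // strictly more good patches than bad ones
        p_stronger += binom_pmf(N, k, P);
    std::printf("P(resulting engine stronger) ~ %.3f\n", p_stronger);  // ~0.998 for N=10, P=0.9
    return 0;
}
[/code]
With N = 10 and P = 0.9 this comes out around 0.998, which is the quantitative version of the "9 patches that work and 1 that doesn't" argument; the equal-effect-size assumption is doing a lot of the work here, which is exactly what the later replies push back on.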
Likelihood of superiority
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: Likelihood of superiority
BayesElo computes the LOS matrix. I use it all the time.
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Likelihood of superiority
Michel wrote:BayesElo computes the LOS matrix. I use it all the time.
Thanks. If possible I would like to know the formula so that I can use it myself, also because it seems that BayesElo wants a PGN file as input, but mostly I have Fritz GUI tournaments.
- Posts: 10770
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Likelihood of superiority
mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing. [...]
The probability is dependent on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).
Here are 2 examples.
1) Suppose that you make some change in the move ordering and find, based on a test suite of 1000 positions, that the program is 5% faster in reaching depth 12.
I think that you can be almost sure that the change is an improvement even without games.
If, given this knowledge, you then play 500 games and find a 5 elo improvement, you can decide to stop the test and accept that there is probably an improvement.
2) Suppose that you make some change and increase the value of the pawn by 5%.
Suppose that you find after 500 games that the change made the program 5 elo better.
I think that in this case you can be less sure that the change made the program better.
Note that even in case 1 it is better if you play games (otherwise you can miss a bug that happens in games and not in test suites), but you need fewer games if you see an improvement.
I think that even in the case of an evaluation change there are cases when you can strongly believe that the change helps: if you find some mistakes that the engine makes in games, and you know that the change is not altering the value of some parameter but is adding some endgame knowledge to prevent those mistakes (when the speed is not changed significantly and the evaluation is the same in most positions), you can strongly believe that the change is productive before playing the games.
Uri
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Likelihood of superiority
Uri Blass wrote:The probability is dependent on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).
Uri, I understand your point, but I am _not_ interested in semantic discussions of why a change can be good or bad (discussions that certainly have their reasons and are worth having).
Let me try to explain it another way. When your favorite ELO calculator, call it BayesElo or the Fritz GUI or pen and paper or anything else, gives you the ELO difference between two engines from a tournament, this tool DOES NOT care why the two engines are different or why one is stronger than the other.
It simply applies a mathematical formula to the tournament result given in the form of a triplet: number of wins / number of draws / number of losses.
That's it, nothing more. I would like to know the _corresponding_ statistical formula for the question I posted above, nothing more.
The problem of investigating why a tournament between two engines gives one result instead of another is an interesting discussion in itself, but it is not the subject of this topic.
Mine is just a very technical question.
- Posts: 1922
- Joined: Thu Mar 09, 2006 12:51 am
- Location: Earth
Re: Likelihood of superiority
mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing. [...]
The real answer is: it's very complicated. Check out BayesElo, specifically the CBradleyTerry.cpp file. Be prepared for some heavy math.
Of course, for a simple calculation, an engine's Elo is going to be a Gaussian distribution. The LOS of engine X over engine Y is then a double integral: for each point on engine Y's Gaussian (the probability of engine Y having Elo E), we multiply that probability by the integral of engine X's Gaussian from E to infinity. Luckily, the inner integral is already a well-known function, erf(). So, very simply, the LOS is the integral of Gaussian(Y)*erf(X), which is still complicated enough that you should use a computer.
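For the head-to-head case there is also a widely quoted closed-form approximation of the LOS that needs only the win and loss counts (in this approximation draws carry no information about which engine is stronger, so they drop out). This is the normal-approximation formula, not necessarily what BayesElo's Bradley-Terry code computes; the 150/135/215 split below is just an illustrative triple corresponding to roughly +10 Elo over 500 games:
[code]
#include <cmath>
#include <cstdio>

// Likelihood of superiority from wins and losses only: in this normal
// approximation, draws carry no information about which engine is stronger.
double los(double wins, double losses) {
    if (wins + losses <= 0.0) return 0.5;   // no decisive games: no information
    return 0.5 * (1.0 + std::erf((wins - losses) / std::sqrt(2.0 * (wins + losses))));
}

int main() {
    // Illustrative 500-game sample worth roughly +10 Elo: 150 wins, 135 losses, 215 draws.
    std::printf("LOS ~ %.3f\n", los(150, 135));   // ~0.81
    return 0;
}
[/code]
For that sample the LOS comes out around 0.81, i.e. a +10 Elo result over 500 games with this many decisive games leaves roughly a four-in-five chance that the engine is genuinely stronger.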

Re: Likelihood of superiority
The CCRL rating lists use BayesElo and display the LOS; it is the far right-hand column on the lists.
- Posts: 10770
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Likelihood of superiority
mcostalba wrote:Uri, I understand your point, but I am _not_ interested in semantic discussions of why a change can be good or bad. [...]
I understand, but the "why" is clearly relevant to the probability question. As an extreme example:
If you found a better compiler that causes Stockfish to become 5% faster in nodes per second, without any obvious change in the analysis, then you are going to continue to believe that the new Stockfish is better even if tests show after 500 games that it is 1 elo weaker.
You may still say after the games that you have more than 95% confidence that the new Stockfish is stronger.
I am careful to say more than 95% confidence and not 100%, because it is possible that the new compiler makes it 5% faster with no change in analysis in most games, while in a minority of games the new compiler causes it to blunder, and that minority of games is enough to make the new Stockfish slightly weaker.
My point is that it is logical to have a different opinion about the probability for different changes even when you get the same result, and in some cases it even makes sense to accept a change when it scores slightly less than 50% in games.
Uri
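What Uri describes is essentially a Bayesian update: a strong prior from the speedup measurement, combined with the likelihood of the 500-game result under each hypothesis. The numbers below are entirely hypothetical (the 95% prior, the ±5 Elo effect sizes and the roughly 12 Elo standard error for a 500-game match are my assumptions, not taken from the post), but they show why a mildly negative sample barely moves a strong prior:
[code]
#include <cmath>
#include <cstdio>

// Normal density up to a constant factor; the constant cancels in the posterior ratio.
double likelihood(double observed, double mean, double sigma) {
    const double z = (observed - mean) / sigma;
    return std::exp(-0.5 * z * z) / sigma;
}

int main() {
    const double prior_good  = 0.95;  // hypothetical prior belief from the 5% speedup
    const double elo_if_good = +5.0;  // assumed true gain if the change really helps
    const double elo_if_bad  = -5.0;  // assumed true loss if it actually hurts
    const double observed    = -1.0;  // Elo difference measured over the 500 games
    const double se          = 12.0;  // rough standard error of a 500-game Elo estimate

    const double lg = likelihood(observed, elo_if_good, se);
    const double lb = likelihood(observed, elo_if_bad,  se);
    const double posterior = prior_good * lg / (prior_good * lg + (1.0 - prior_good) * lb);
    std::printf("posterior P(change is good) ~ %.3f\n", posterior);   // ~0.95
    return 0;
}
[/code]
With these made-up numbers the posterior is still about 0.95: a -1 Elo reading over 500 games is far too noisy to overturn a strong prior that the faster build is stronger, which is Uri's point.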
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Likelihood of superiority
mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing. [...]
There is a problem with the question. If you play A against a set of opponents, and then A' against the same set, the LOS for A vs A' is not so clear since they never play each other. If they do play each other, the LOS between the two versions is (in my opinion) much less informative than how the two versions do against the same set of opponents. I'm not convinced there is any way to cheat the sample-size issue.
Other terms you use are ambiguous, making questions difficult to answer. What is a "match"? A set of games starting from a chosen set of positions, against an opponent (or group of opponents)? What is "matches"? Repeating the same set of positions/opponents again? Different positions?
The error bar in BayesElo is (I believe) based on a 95% confidence interval, which means that if the error bars overlap, your confidence is well below that level.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Likelihood of superiority
Classic mistake. Just change null-move R to 5 and you will get that result (faster time to depth 12). Or change your reduction factor to 3 or 4 or 5. Ditto. Is the program actually stronger? Almost certainly not. Is it faster to depth 12? Significantly faster.
Uri Blass wrote:The probability is dependent on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).
Here are 2 examples.
1) Suppose that you make some change in the move ordering and find, based on a test suite of 1000 positions, that the program is 5% faster in reaching depth 12.
I think that you can be almost sure that the change is an improvement even without games.
Again, how? Speed might make you better. It might make you worse. Fooling around with the search is a good way to dramatically change the speed to some fixed depth. And that speed change can easily be inversely proportional to the change in playing skill.
Uri Blass wrote:If, given this knowledge, you then play 500 games and find a 5 elo improvement, you can decide to stop the test and accept that there is probably an improvement.
I believe the two examples are _identical_.
Uri Blass wrote:2) Suppose that you make some change and increase the value of the pawn by 5%.
Suppose that you find after 500 games that the change made the program 5 elo better.
I think that in this case you can be less sure that the change made the program better. [...]