Likelihood of superiority

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Likelihood of superiority

Post by mcostalba » Sun Nov 15, 2009 12:14 pm

I think the ELO error bar is a misleading metric in chess engine development testing.


When testing a patch we want to know if the patch works or not.

In the simplest (and most time-efficient) way, we play, say, 500 games and then look at the results. We see that engine A is ahead of engine B by, say, 10 ELO points. This is below the error-bar threshold, so we should discard the result or consider it untrustworthy. This happens because we are trying to answer the question:

"Is engine A stronger then engine B ? "

We are forced to answer "We don't know" because of the large error bar, so we throw the test result away. In my opinion we are asking the wrong question. The correct question to ask is the following:

"Given that engine A it _seems_ stonger of 10 ELO points to engine B after 500 games, what is the probability that engine A is _actually_ stronger of B by _any_ ELO point ?"

I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with 90% probability. It means that if we commit 10 changes/patches, each one tested with only 500 games, then at the end we have committed 9 patches that work and 1 that doesn't, but the net effect is clearly positive. And we have used only a fraction of the testing resources of a fully verified ELO test.
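To make that bookkeeping concrete, here is a toy expected-value sketch of the argument; every number in it (the 90% confidence, the ±5 ELO per patch) is an assumption for illustration only:

```python
# Toy expected-value sketch. All numbers are illustrative assumptions:
# patches accepted at 90% likelihood of superiority, with an assumed
# +5 ELO for a genuine improvement and -5 ELO for a false positive.
p_good = 0.9       # probability an accepted patch is a real improvement
n_patches = 10
gain_good = 5.0    # assumed ELO gain of a genuine improvement
loss_bad = -5.0    # assumed ELO cost of a false positive

per_patch = p_good * gain_good + (1 - p_good) * loss_bad
expected_net = n_patches * per_patch
print(expected_net)  # 10 * (4.5 - 0.5) = 40.0 ELO expected net gain
```

Under these assumptions each accepted patch is worth +4 ELO in expectation, so the policy pays off even though roughly one patch in ten is a regression.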

Unfortunately I don't know the statistical formula that relates ELO difference, number of played games, and likelihood of superiority. To be more precise, I would like to ask for the mathematical formulas for the two questions below:

1) What is the probability that engine A is stronger than engine B, given an ELO difference of V points calculated after N matches?


2) If we apply N changes to an engine, and each change has a probability P of being a win (i.e. of making the engine stronger), what is the probability that the resulting engine is stronger than the original?


I would like quantitative formulas if possible.
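For question 1, a widely used closed-form approximation (not stated in this thread; it treats the win/loss difference over the decisive games as approximately normal and lets draws cancel out of the score difference) can be sketched as:

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only.

    Normal approximation: draws cancel out of the score difference,
    and (wins - losses) is treated as approximately normal with
    variance (wins + losses). Reasonable for large samples.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no information either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# e.g. a match in which engine A scores 260 wins to 240 losses:
print(round(los(260, 240), 3))  # ≈ 0.81
```

For small samples or heavily drawn matches a Bayesian tool such as BayesElo is more careful, but this is the usual quick estimate.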

Thanks a lot
Marco

Michel
Posts: 2040
Joined: Sun Sep 28, 2008 11:50 pm

Re: Likelihood of superiority

Post by Michel » Sun Nov 15, 2009 1:40 pm

BayesElo computes the LOS matrix. I use it all the time.

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Likelihood of superiority

Post by mcostalba » Sun Nov 15, 2009 3:35 pm

Michel wrote:BayesElo computes the LOS matrix. I use it all the time.
Thanks. If possible I would like to know the formula so that I can use it myself, also because it seems that BayesElo wants a PGN file as input, but mostly I have Fritz GUI tournaments.

Uri Blass
Posts: 8558
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

Re: Likelihood of superiority

Post by Uri Blass » Sun Nov 15, 2009 3:54 pm

mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing.

When testing a patch we want to know if the patch works or not.

In the simplest (and most time-efficient) way, we play, say, 500 games and then look at the results. We see that engine A is ahead of engine B by, say, 10 ELO points. This is below the error-bar threshold, so we should discard the result or consider it untrustworthy. This happens because we are trying to answer the question:

"Is engine A stronger than engine B?"

We are forced to answer "We don't know" because of the large error bar, so we throw the test result away. In my opinion we are asking the wrong question. The correct question to ask is the following:

"Given that engine A _seems_ stronger than engine B by 10 ELO points after 500 games, what is the probability that engine A is _actually_ stronger than B by _any_ amount?"

I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with 90% probability. It means that if we commit 10 changes/patches, each one tested with only 500 games, then at the end we have committed 9 patches that work and 1 that doesn't, but the net effect is clearly positive. And we have used only a fraction of the testing resources of a fully verified ELO test.

Unfortunately I don't know the statistical formula that relates ELO difference, number of played games, and likelihood of superiority. To be more precise, I would like to ask for the mathematical formulas for the two questions below:

1) What is the probability that engine A is stronger than engine B, given an ELO difference of V points calculated after N matches?

2) If we apply N changes to an engine, and each change has a probability P of being a win (i.e. of making the engine stronger), what is the probability that the resulting engine is stronger than the original?

I would like quantitative formulas if possible.

Thanks a lot
Marco
The probability depends on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).

Here are 2 examples.

1) Suppose that you make some change to the move ordering and find, based on a test suite of 1000 positions, that the program is 5% faster in reaching depth 12.

I think that you can be almost sure that the change is an improvement even without games.

If, after this knowledge, you play 500 games and find a 5 ELO improvement, then you can decide to stop the test and accept that there is probably an improvement.

2) Suppose that you make some change and increase the value of the pawn by 5%.

Suppose that you find after 500 games that the change made the program 5 ELO better.

I think that in this case you can be less sure that the change made the program better.


Note that even in case 1 it is better to play games (otherwise you can miss a bug that happens in games but not in test suites), but you need fewer games if you see an improvement.

I think that even in the case of an evaluation change there are cases when you can strongly believe that the change helps. If you find mistakes that the engine makes in games, and you know that the change does not alter the value of some parameter but adds endgame knowledge to prevent those mistakes (when the speed is not changed significantly and the evaluation is the same in most positions), you can strongly believe that the change is productive before playing the games.

Uri

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Likelihood of superiority

Post by mcostalba » Sun Nov 15, 2009 4:19 pm

Uri Blass wrote: The probability depends on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).
Uri, I understand your point, but I am _not_ interested in semantic discussions of why a change can be good or bad (discussions that certainly have their place and are worth having).

Let me try to explain it another way. When your favorite ELO calculator, be it BayesElo or the Fritz GUI or pen and paper or anything else, gives you the ELO difference of two engines from a tournament, the tool does not care at all why the two engines are different or why one is stronger than the other.

It simply applies a mathematical formula to the tournament results, given as a triplet: number of wins / number of draws / number of losses.

That's it, nothing more. I would like to know the _corresponding_ statistical formula for the question I posted above, nothing more.

The problem of investigating why a tournament between two engines gives one result instead of another is an interesting discussion in itself, but it is not the subject of this topic.

Mine is just a very technical question.

Zach Wegner
Posts: 1922
Joined: Wed Mar 08, 2006 11:51 pm
Location: Earth

Re: Likelihood of superiority

Post by Zach Wegner » Sun Nov 15, 2009 5:06 pm

mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing.

When testing a patch we want to know if the patch works or not.

In the simplest (and most time-efficient) way, we play, say, 500 games and then look at the results. We see that engine A is ahead of engine B by, say, 10 ELO points. This is below the error-bar threshold, so we should discard the result or consider it untrustworthy. This happens because we are trying to answer the question:

"Is engine A stronger than engine B?"

We are forced to answer "We don't know" because of the large error bar, so we throw the test result away. In my opinion we are asking the wrong question. The correct question to ask is the following:

"Given that engine A _seems_ stronger than engine B by 10 ELO points after 500 games, what is the probability that engine A is _actually_ stronger than B by _any_ amount?"

I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with 90% probability. It means that if we commit 10 changes/patches, each one tested with only 500 games, then at the end we have committed 9 patches that work and 1 that doesn't, but the net effect is clearly positive. And we have used only a fraction of the testing resources of a fully verified ELO test.

Unfortunately I don't know the statistical formula that relates ELO difference, number of played games, and likelihood of superiority. To be more precise, I would like to ask for the mathematical formulas for the two questions below:

1) What is the probability that engine A is stronger than engine B, given an ELO difference of V points calculated after N matches?

2) If we apply N changes to an engine, and each change has a probability P of being a win (i.e. of making the engine stronger), what is the probability that the resulting engine is stronger than the original?

I would like quantitative formulas if possible.

Thanks a lot
Marco
The real answer is: it's very complicated. Check out BayesElo, specifically the CBradleyTerry.cpp file. Be prepared for some heavy math.

Of course, for a simple calculation, an engine's ELO estimate is going to be a Gaussian distribution. The LOS of engine X over engine Y is then a double integral: for each point on engine Y's Gaussian (the probability of engine Y having ELO E), we multiply that probability by the integral of engine X's Gaussian from E to infinity. Luckily, the inner integral is already a well-known function, erf(). So, very simply, the LOS is the integral of Gaussian(Y)*erf(X). Which is still going to be pretty complicated, so use a computer :)
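The double integral above can be evaluated numerically with nothing heavier than erf(). A minimal sketch, where the means and standard deviations are made-up inputs (in practice they would come from a rating tool):

```python
import math

def normal_pdf(e, mu, sigma):
    """Density of a Gaussian rating estimate at Elo e."""
    return math.exp(-0.5 * ((e - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def prob_above(e, mu, sigma):
    """P(rating > e) for a Gaussian estimate, via erf."""
    return 0.5 * (1.0 - math.erf((e - mu) / (sigma * math.sqrt(2.0))))

def los_numeric(mu_x, sigma_x, mu_y, sigma_y, steps=4000):
    """LOS of X over Y: for each possible Elo of Y, weight the chance
    that X's Elo exceeds it (midpoint-rule integration)."""
    lo, hi = mu_y - 8.0 * sigma_y, mu_y + 8.0 * sigma_y
    h = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * h, mu_y, sigma_y)
               * prob_above(lo + (i + 0.5) * h, mu_x, sigma_x)
               for i in range(steps)) * h

# Made-up example: X measured 10 Elo above Y, both with sigma = 20.
print(round(los_numeric(10, 20, 0, 20), 3))  # ≈ 0.638
```

For two Gaussians the whole thing also collapses to a single normal CDF of the rating difference, which makes a handy sanity check on the numeric answer.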

Spock

Re: Likelihood of superiority

Post by Spock » Sun Nov 15, 2009 5:17 pm

The CCRL rating lists use BayesElo and display the LOS; it is the far right-hand column on the lists.

Uri Blass
Posts: 8558
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

Re: Likelihood of superiority

Post by Uri Blass » Sun Nov 15, 2009 5:26 pm

mcostalba wrote:
Uri Blass wrote: The probability depends on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).
Uri, I understand your point, but I am _not_ interested in semantic discussions of why a change can be good or bad (discussions that certainly have their place and are worth having).

Let me try to explain it another way. When your favorite ELO calculator, be it BayesElo or the Fritz GUI or pen and paper or anything else, gives you the ELO difference of two engines from a tournament, the tool does not care at all why the two engines are different or why one is stronger than the other.

It simply applies a mathematical formula to the tournament results, given as a triplet: number of wins / number of draws / number of losses.

That's it, nothing more. I would like to know the _corresponding_ statistical formula for the question I posted above, nothing more.

The problem of investigating why a tournament between two engines gives one result instead of another is an interesting discussion in itself, but it is not the subject of this topic.

Mine is just a very technical question.
I understand, but the why is clearly relevant to the probability question.

As an extreme example: if you found a better compiler that made Stockfish 5% faster in nodes per second, with no obvious change in the analysis, then you are going to continue to believe that the new Stockfish is better even if tests show after 500 games that it is 1 ELO weaker.

You may still say after the games that you have more than 95% confidence that the new Stockfish is stronger.

I am careful to say more than 95% confidence and not 100%, because it is possible that the new compiler makes it 5% faster with no change in analysis in most games, while in a minority of games the new compiler causes it to blunder, and that minority of games is enough to make the new Stockfish slightly weaker.


My point is that it is logical to have a different opinion about the probability for different changes even when you get the same result, and in some cases it even makes sense to accept a change when it scores slightly less than 50% in games.
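This argument is essentially a Bayesian update, which can be sketched with entirely made-up numbers; none of the probabilities below are measurements, they only show how the same game evidence moves a strong prior and a weak prior to very different posteriors:

```python
def posterior_good(prior, lik_good, lik_bad):
    """Bayes' rule: P(change is good | observed games)."""
    num = prior * lik_good
    return num / (num + (1.0 - prior) * lik_bad)

# Same observed result, assumed twice as likely if the change is good:
lik_good, lik_bad = 0.6, 0.3

print(round(posterior_good(0.9, lik_good, lik_bad), 3))  # strong prior (compiler speedup) -> ~0.947
print(round(posterior_good(0.3, lik_good, lik_bad), 3))  # weak prior (speculative tweak)  -> ~0.462
```

With a strong prior the same evidence leaves you very confident; with a weak prior it leaves you below 50%, which is exactly why the "why" of a change matters.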

Uri

bob
Posts: 20478
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Likelihood of superiority

Post by bob » Sun Nov 15, 2009 5:52 pm

mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing.

When testing a patch we want to know if the patch works or not.

In the simplest (and most time-efficient) way, we play, say, 500 games and then look at the results. We see that engine A is ahead of engine B by, say, 10 ELO points. This is below the error-bar threshold, so we should discard the result or consider it untrustworthy. This happens because we are trying to answer the question:

"Is engine A stronger than engine B?"

We are forced to answer "We don't know" because of the large error bar, so we throw the test result away. In my opinion we are asking the wrong question. The correct question to ask is the following:

"Given that engine A _seems_ stronger than engine B by 10 ELO points after 500 games, what is the probability that engine A is _actually_ stronger than B by _any_ amount?"

I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with 90% probability. It means that if we commit 10 changes/patches, each one tested with only 500 games, then at the end we have committed 9 patches that work and 1 that doesn't, but the net effect is clearly positive. And we have used only a fraction of the testing resources of a fully verified ELO test.

Unfortunately I don't know the statistical formula that relates ELO difference, number of played games, and likelihood of superiority. To be more precise, I would like to ask for the mathematical formulas for the two questions below:

1) What is the probability that engine A is stronger than engine B, given an ELO difference of V points calculated after N matches?

2) If we apply N changes to an engine, and each change has a probability P of being a win (i.e. of making the engine stronger), what is the probability that the resulting engine is stronger than the original?

I would like quantitative formulas if possible.

Thanks a lot
Marco
There is a problem with the question. If you play A against a set of opponents, and then A' against the same set, the LOS for A vs A' is not so clear since they never play each other. If they do play each other, the LOS between the two versions is (in my opinion) much less informative than how the two versions do against the same set of opponents. I'm not convinced there is any way to cheat the sample-size issue.

Other terms you use are ambiguous, making questions difficult to answer. What is a "match"? A set of games starting from a chosen set of positions, against an opponent (or group of opponents)? What is "matches"? Repeating the same set of positions/opponents again? Different positions?

The error bar in BayesElo is (I believe) based on a 95% confidence interval. Which means if the error bars overlap, your confidence is well below that level.

bob
Posts: 20478
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Likelihood of superiority

Post by bob » Sun Nov 15, 2009 5:57 pm

Uri Blass wrote:
mcostalba wrote:I think the ELO error bar is a misleading metric in chess engine development testing.

When testing a patch we want to know if the patch works or not.

In the simplest (and most time-efficient) way, we play, say, 500 games and then look at the results. We see that engine A is ahead of engine B by, say, 10 ELO points. This is below the error-bar threshold, so we should discard the result or consider it untrustworthy. This happens because we are trying to answer the question:

"Is engine A stronger than engine B?"

We are forced to answer "We don't know" because of the large error bar, so we throw the test result away. In my opinion we are asking the wrong question. The correct question to ask is the following:

"Given that engine A _seems_ stronger than engine B by 10 ELO points after 500 games, what is the probability that engine A is _actually_ stronger than B by _any_ amount?"

I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with 90% probability. It means that if we commit 10 changes/patches, each one tested with only 500 games, then at the end we have committed 9 patches that work and 1 that doesn't, but the net effect is clearly positive. And we have used only a fraction of the testing resources of a fully verified ELO test.

Unfortunately I don't know the statistical formula that relates ELO difference, number of played games, and likelihood of superiority. To be more precise, I would like to ask for the mathematical formulas for the two questions below:

1) What is the probability that engine A is stronger than engine B, given an ELO difference of V points calculated after N matches?

2) If we apply N changes to an engine, and each change has a probability P of being a win (i.e. of making the engine stronger), what is the probability that the resulting engine is stronger than the original?

I would like quantitative formulas if possible.

Thanks a lot
Marco
The probability depends on the change that you make, so you cannot give a general answer to it (different changes mean different a priori knowledge).

Here are 2 examples.

1) Suppose that you make some change to the move ordering and find, based on a test suite of 1000 positions, that the program is 5% faster in reaching depth 12.

I think that you can be almost sure that the change is an improvement even without games.
Classic mistake. Just change null-move R to 5 and you will get that result (faster time to depth 12). Or change your reduction factor to 3 or 4 or 5. Ditto. Is the program actually stronger? Almost certainly not. Is it faster to depth 12? Significantly faster.


If, after this knowledge, you play 500 games and find a 5 ELO improvement, then you can decide to stop the test and accept that there is probably an improvement.
Again, how? The speed might make you better. It might make you worse. Fooling around with the search is a good way to dramatically change the speed to some fixed depth, and that speed change can easily be inversely proportional to the change in playing skill.

2) Suppose that you make some change and increase the value of the pawn by 5%.

Suppose that you find after 500 games that the change made the program 5 ELO better.

I think that in this case you can be less sure that the change made the program better.
I believe the two examples are _identical_.

Note that even in case 1 it is better to play games (otherwise you can miss a bug that happens in games but not in test suites), but you need fewer games if you see an improvement.

I think that even in the case of an evaluation change there are cases when you can strongly believe that the change helps. If you find mistakes that the engine makes in games, and you know that the change does not alter the value of some parameter but adds endgame knowledge to prevent those mistakes (when the speed is not changed significantly and the evaluation is the same in most positions), you can strongly believe that the change is productive before playing the games.

Uri
