Likelihood of superiority

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Likelihood of superiority

Post by Uri Blass »

bob wrote:
Uri Blass wrote: The probability depends on the change that you make,
so you cannot give a general answer to it (different changes mean different a priori knowledge).

Here are 2 examples.

1) Suppose that you make some change to the move ordering and find, based on a test suite of 1000 positions, that the program is 5% faster in reaching depth 12.

I think that you can be almost sure that the change is an improvement even without games.
Classic mistake. Just change null-move to R=5 and you will get that result (faster time to depth 12). Or change your reduction factor to 3 or 4 or 5. Ditto. Is the program actually stronger? Almost certainly not. Is it faster to depth 12? Significantly faster.
I talked about a change that affects only the order of moves.
R=5 in null move is a different type of change.

I did not claim that reaching depth 12 faster is always better, but if you reach depth 12 faster only by changing the order of moves (no change in extensions or reductions), then you probably have an improvement.

Uri
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Likelihood of superiority

Post by mcostalba »

bob wrote: Other terms you use are ambiguous, making questions difficult to answer. What is a "match"? A set of games starting from a chosen set of positions, against an opponent (or group of opponents)? What is "matches"? Repeating the same set of positions/opponents again? Different positions?
In the simplest case it is a match between A and A' only, a direct match of, say, 500 games where the two engines play against each other.

I would not consider other engines in this context. I am interested in knowing with what probability engine A can be said to be stronger than A' _under those testing conditions_ (note that I didn't say "if A is stronger than A'"). I just need to know how to calculate, starting from the 500-game result in terms of wins, draws and losses, with what probability, say, A is stronger than A'.

This is a theoretical question; I am not arguing about _whether_ one-vs-one matches are reliable or effective for chess engine testing.

Regarding testing conditions, I don't think they are important because, as I said, I want to know with what probability A is stronger than A' under THOSE CONDITIONS, not in general terms, and not how well those conditions could be translated to the general case.

So I have removed the variable of different opponents, because it does not interest me for this experiment, and the variable of conditions, because I accept that the difference in strength applies only under the same conditions in which the engines were tested.

How this LOS, calculated in this way, can be extrapolated to the general case is a _completely_ different problem. I would call it the 'problem of testing effectiveness', but it is not discussed here.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Likelihood of superiority

Post by mcostalba »

Zach Wegner wrote: Of course, for a simple calculation, an engine's Elo is going to be a Gaussian distribution. The LOS of engine X over engine Y is going to be a double integral. For each point on engine Y's Gaussian (the probability of engine Y having Elo E), we multiply that probability by the integral of engine X's Elo from E to infinity. Luckily, the inner integral is already a well known function, erf(). So very simply, the LOS is the integral of Gaussian(Y)*ERF(X). Which is going to be pretty complicated in itself, so use a computer :)
Thanks. I hoped for something easier actually :-)

But it is ok. In this case I would say I need the help of some tool. I would use BayesElo, but I understand that this tool wants the whole pgn file as input. I don't always have this info, and actually I don't understand why it needs it. I think the tournament results in terms of wins/draws/losses should be enough to calculate the LOS or the Elo.
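
For illustration, here is a brute-force numerical sketch of the double integral Zach describes, assuming both engines' Elo estimates are Gaussian; the means and standard deviations in main() are invented example values, not results from this thread:

Code:

#include <cmath>
#include <cstdio>

// Numerical version of the double integral described above:
// LOS = Integral over y of  Gaussian(y; muY, sigmaY) * P(X > y)  dy,
// where P(X > y) is the upper tail of X's Gaussian (an erfc).
double LOSNumeric(double muX, double sigmaX, double muY, double sigmaY)
{
 const double kPi = 3.14159265358979323846;
 const int    steps = 20000;
 const double lo = muY - 8.0 * sigmaY;   // integrate over +/- 8 sigma of Y
 const double hi = muY + 8.0 * sigmaY;
 const double dy = (hi - lo) / steps;

 double los = 0.0;
 for (int i = 0; i < steps; i++)
 {
  double y = lo + (i + 0.5) * dy;
  double densY = std::exp(-0.5 * (y - muY) * (y - muY) / (sigmaY * sigmaY))
               / (sigmaY * std::sqrt(2.0 * kPi));
  double tailX = 0.5 * std::erfc((y - muX) / (sigmaX * std::sqrt(2.0)));
  los += densY * tailX * dy;
 }
 return los;
}

int main()
{
 // Invented example: X estimated at 2710 +/- 12 Elo, Y at 2700 +/- 12 Elo.
 std::printf("LOS(X>Y) = %f\n", LOSNumeric(2710.0, 12.0, 2700.0, 12.0));
 return 0;
}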
Spock

Re: Likelihood of superiority

Post by Spock »

mcostalba wrote: But it is ok. In this case I would say I need the help of some tool. I would use BayesElo, but I understand that this tool wants the whole pgn file as input. I don't always have this info, and actually I don't understand why it needs it. I think the tournament results in terms of wins/draws/losses should be enough to calculate the LOS or the Elo.
I think a pgn with just the header part containing the game result would be OK for BayesElo to work with, but again this would take some work to create - either manually, or by writing a tool that, for example, takes a tournament crosstable as an input file and generates a pgn file with just the headers as output. There must be an easier way.
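
A minimal sketch of such a tool, assuming BayesElo really does accept header-only games as suggested above; the engine names and win/draw/loss counts are invented examples:

Code:

#include <stdio.h>

// Emit "count" header-only PGN games between two engines with a fixed result.
// Hypothetical helper for the approach suggested above.
static void EmitGames(FILE *f, const char *white, const char *black,
                      const char *result, int count)
{
 for (int i = 0; i < count; i++)
  fprintf(f, "[White \"%s\"]\n[Black \"%s\"]\n[Result \"%s\"]\n\n%s\n\n",
          white, black, result, result);
}

int main()
{
 FILE *f = fopen("results.pgn", "w");
 if (!f)
  return 1;

 // Invented example: A beat A-dev 30 times, lost 20 times, drew 50 times.
 EmitGames(f, "EngineA", "EngineA-dev", "1-0", 30);
 EmitGames(f, "EngineA", "EngineA-dev", "0-1", 20);
 EmitGames(f, "EngineA", "EngineA-dev", "1/2-1/2", 50);

 fclose(f);
 return 0;
}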
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Likelihood of superiority

Post by Rémi Coulom »

Zach Wegner wrote:Of course, for a simple calculation, an engine's Elo is going to be a Gaussian distribution. The LOS of engine X over engine Y is going to be a double integral. For each point on engine Y's Gaussian (the probability of engine Y having Elo E), we multiply that probability by the integral of engine X's Elo from E to infinity. Luckily, the inner integral is already a well known function, erf(). So very simply, the LOS is the integral of Gaussian(Y)*ERF(X). Which is going to be pretty complicated in itself, so use a computer :)
It is even simpler if you know the Gaussian joint distribution of the two ratings: any linear combination of ratings is also Gaussian, so the difference of ratings is Gaussian, and its variance can be computed from the joint variance. So estimating the LOS just amounts to ERF((Y-X)/sigma), or something like that. This is how it is computed in bayeselo.
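
A minimal sketch of that closed form, assuming the rating difference is Gaussian with a known mean and standard deviation (the example numbers are invented, and this is not bayeselo's actual code):

Code:

#include <cmath>
#include <cstdio>

// LOS = P(eloA - eloB > 0) when the rating difference is modelled as a
// Gaussian with the given mean and standard deviation.
double LOSFromGaussian(double meanDiff, double sigmaDiff)
{
 return 0.5 * (1.0 + std::erf(meanDiff / (sigmaDiff * std::sqrt(2.0))));
}

int main()
{
 // Invented example: A is estimated 10 Elo above B, with sigma = 8.
 std::printf("LOS = %f\n", LOSFromGaussian(10.0, 8.0));
 return 0;
}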

When only two programs are playing against each other, it is possible to estimate LOS simply, without using any Elo model, thanks to the binomial distribution.

This short piece of code can compute the LOS between two programs, assuming a uniform prior over the winning rate:

Code:

/////////////////////////////////////////////////////////////////////////////
//
// WhoIsBest.cpp
//
// Rémi Coulom
//
// November, 2002
//
/////////////////////////////////////////////////////////////////////////////
#include <stdio.h>

/////////////////////////////////////////////////////////////////////////////
// Calculate P(A>B)
/////////////////////////////////////////////////////////////////////////////
float ProbaBinom(int n1, int n0)
{
 float *pi = new float[n1 + 2];

 pi[0] = 0;
 for (int i = n1 + 1; --i >= 0;)
  pi[i + 1] = 1;

 for (int j = n0 + 1; --j >= 0;)
  for (int i = 0; i <= n1; i++)
   pi[i + 1] = (pi[i + 1] + pi[i]) * 0.5;

 float Result = pi[n1 + 1];

 delete[] pi;

 return Result;
}

/////////////////////////////////////////////////////////////////////////////
// main
/////////////////////////////////////////////////////////////////////////////
int main(int argc, const char **argv)
{
 while (1)
 {
  int n0 = -1;
  int n1 = -1;
  printf("number of wins   n1 = ");
  scanf("%d", &n1);
  printf("number of losses n0 = ");
  scanf("%d", &n0);
  if (n0 < 0 || n1 < 0)
   break;
  printf("P(p1>p0) = %f\n\n", ProbaBinom(n1, n0));
 }

 return 0;
}
Note 1: draws do not count

Note 2: in practice, it may not be very reasonable to assume a uniform prior over the winning rate. It may be better to add a few virtual wins to every player, depending on how close in strength you believe the programs are.
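
As a hypothetical usage sketch of Note 2, reusing ProbaBinom() from the program above (compile the two together); the match counts and the choice of two virtual wins per side are invented examples:

Code:

#include <stdio.h>

// ProbaBinom() as defined in WhoIsBest.cpp above.
float ProbaBinom(int n1, int n0);

int main()
{
 // Hypothetical match result: 60 wins, 50 losses (draws ignored, as in Note 1).
 int n1 = 60;
 int n0 = 50;

 // Uniform prior: LOS straight from the observed decisive games.
 printf("uniform prior    : P(p1>p0) = %f\n", ProbaBinom(n1, n0));

 // Informative prior: pretend each side already won 2 extra "virtual" games.
 // This pulls the estimate toward 0.5 when the programs are believed close.
 int virtualWins = 2;
 printf("with virtual wins: P(p1>p0) = %f\n",
        ProbaBinom(n1 + virtualWins, n0 + virtualWins));

 return 0;
}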

Rémi
Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Likelihood of superiority

Post by Uri Blass »

mcostalba wrote:
bob wrote: Other terms you use are ambiguous, making questions difficult to answer. What is a "match"? A set of games starting from a chosen set of positions, against an opponent (or group of opponents)? What is "matches"? Repeating the same set of positions/opponents again? Different positions?
In the simplest case it is a match between A and A' only, a direct match of, say, 500 games where the two engines play against each other.

I would not consider other engines in this context. I am interested in knowing with what probability engine A can be said to be stronger than A' _under those testing conditions_ (note that I didn't say "if A is stronger than A'"). I just need to know how to calculate, starting from the 500-game result in terms of wins, draws and losses, with what probability, say, A is stronger than A'.

This is a theoretical question; I am not arguing about _whether_ one-vs-one matches are reliable or effective for chess engine testing.

Regarding testing conditions, I don't think they are important because, as I said, I want to know with what probability A is stronger than A' under THOSE CONDITIONS, not in general terms, and not how well those conditions could be translated to the general case.

So I have removed the variable of different opponents, because it does not interest me for this experiment, and the variable of conditions, because I accept that the difference in strength applies only under the same conditions in which the engines were tested.

How this LOS, calculated in this way, can be extrapolated to the general case is a _completely_ different problem. I would call it the 'problem of testing effectiveness', but it is not discussed here.
1) You ask the wrong question.
The probability that A is stronger than A' is always 1 or 0,
and the result of the match does not change this probability.

I guess that what you want to know is the probability that A' is better than A after you know the result of a 500-game match between them (where "A' is better than A" means that you can expect A' to beat A in a match).

Edit: The part that begins with "I guess" is also not a correct description; what you actually want to know is the probability that you make the right decision if you have some rule of choice
for deciding which program is better based on the result.

2) There is no easy answer to what you want to know,
because the probability that you want to know depends on
a prior probability distribution that may be different in different cases.

An extreme example is when
you know at the beginning of the match that
A' and A are exactly the same program; in this case the probability that the winner is better is 0 even after you know the result.

You usually have some knowledge, and I think that it is a mistake to ignore that knowledge.

You usually know, for example, that a change that you make in Stockfish is not going to make the program 200 Elo better, but it may make the program 200 Elo weaker because of some bug, so you have an a priori distribution for the rating difference between A and A'.

If you give some a priori distribution for the rating difference between A and A', then it is possible to calculate the probability that you make the right decision based on the result.

Without knowing a prior probability distribution it is impossible to calculate this probability.

You can read about a prior probability distribution in

http://en.wikipedia.org/wiki/Prior_probability
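
A minimal sketch of such a calculation, under simplifying assumptions (draws ignored, the Elo expected score 1/(1+10^(-d/400)) used as the probability of winning a decisive game, and an invented discrete prior over the Elo difference):

Code:

#include <cmath>
#include <cstdio>

// Posterior probability that A' is stronger than A (Elo difference > 0),
// given wins/losses for A' and a discrete prior over the Elo difference.
double ProbBetter(int wins, int losses,
                  const double *diffs, const double *prior, int n)
{
 double num = 0.0, den = 0.0;
 for (int k = 0; k < n; k++)
 {
  // Probability that A' wins a decisive game at this Elo difference.
  double p = 1.0 / (1.0 + std::pow(10.0, -diffs[k] / 400.0));
  double likelihood = std::pow(p, wins) * std::pow(1.0 - p, losses);
  double weight = prior[k] * likelihood;
  den += weight;
  if (diffs[k] > 0.0)
   num += weight;
 }
 return num / den;
}

int main()
{
 // Invented prior: the change is believed to lie between -20 and +10 Elo.
 const double diffs[] = { -20.0, -10.0, -5.0, 0.0, 5.0, 10.0 };
 const double prior[] = {  0.05,  0.15, 0.20, 0.30, 0.20, 0.10 };
 // Hypothetical result: 60 wins, 50 losses for A'.
 std::printf("P(A' better) = %f\n", ProbBetter(60, 50, diffs, prior, 6));
 return 0;
}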

Note that I did not ask Tord's opinion about it, but I come from a mathematical background and I
expect every mathematician to agree with me, so I expect Tord to agree with me.

Uri
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Likelihood of superiority

Post by Rémi Coulom »

bob wrote:The error bar in BayesElo is (I believe) based on a 95% confidence interval. Which means if the error bars overlap, your confidence is well below that level.
No: you can have overlapping 95% confidence intervals, and a likelihood of superiority that is more than 95%.

The LOS matrix of bayeselo was explained in this post on the winboard forum:
http://www.open-aurec.com/wbforum/viewt ... t=60#p7614
In the example given in that post, the LOS of A over B is 99.6%, and their 95% confidence intervals overlap a lot.

In fact, even in the case where there are only two players and no correlation between their ratings, the LOS can be more than 95% even if their 95% confidence intervals overlap.

In general, it is impossible to deduce LOS from confidence intervals.

Rémi
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Likelihood of superiority

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote: The probability depends on the change that you make,
so you cannot give a general answer to it (different changes mean different a priori knowledge).

Here are 2 examples.

1) Suppose that you make some change to the move ordering and find, based on a test suite of 1000 positions, that the program is 5% faster in reaching depth 12.

I think that you can be almost sure that the change is an improvement even without games.
Classic mistake. Just change null-move to R=5 and you will get that result (faster time to depth 12). Or change your reduction factor to 3 or 4 or 5. Ditto. Is the program actually stronger? Almost certainly not. Is it faster to depth 12? Significantly faster.
I talked about a change that affects only the order of moves.
R=5 in null move is a different type of change.

I did not claim that reaching depth 12 faster is always better, but if you reach depth 12 faster only by changing the order of moves (no change in extensions or reductions), then you probably have an improvement.

Uri
Doesn't matter. Change ordering so that you search checks first. That will help most tactical positions go by faster. But the program will play worse overall. The rule is that changing ordering can change the time, but not the score, with normal alpha/beta. Of course with today's programs the scores can change a bit too. But faster needs to be faster in game positions. Using non-game test positions will most likely distort the result, and in the wrong direction.
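
A toy illustration of that alpha/beta property, using an invented fixed game tree: any move ordering returns the same root score, while the number of visited nodes changes.

Code:

#include <algorithm>
#include <cstdio>
#include <vector>

// Tiny fixed game tree: depth 3, branching factor 3, 27 invented leaf scores.
static const int kLeaves[27] = {
  3, 12, 8,  2, 4, 6,  14, 5, 2,
  7,  1, 9, 11, 3, 8,   6, 9, 4,
  5, 13, 2,  8, 7, 1,  10, 2, 6 };

static long g_nodes = 0;

int AlphaBeta(int index, int depth, int alpha, int beta, bool maximizing,
              const std::vector<int> &order)
{
 g_nodes++;
 if (depth == 0)
  return kLeaves[index];

 int best = maximizing ? -1000 : 1000;
 for (int k = 0; k < 3; k++)
 {
  int child = index * 3 + order[k];   // "order" permutes the three children
  int score = AlphaBeta(child, depth - 1, alpha, beta, !maximizing, order);
  if (maximizing)
  {
   best = std::max(best, score);
   alpha = std::max(alpha, best);
  }
  else
  {
   best = std::min(best, score);
   beta = std::min(beta, best);
  }
  if (alpha >= beta)
   break;                             // cutoff: remaining children skipped
 }
 return best;
}

int main()
{
 std::vector<int> order1 = { 0, 1, 2 };
 std::vector<int> order2 = { 2, 1, 0 };

 g_nodes = 0;
 int s1 = AlphaBeta(0, 3, -1000, 1000, true, order1);
 long c1 = g_nodes;

 g_nodes = 0;
 int s2 = AlphaBeta(0, 3, -1000, 1000, true, order2);
 long c2 = g_nodes;

 // Same root score, different node counts.
 std::printf("ordering 1: score %d, %ld nodes\n", s1, c1);
 std::printf("ordering 2: score %d, %ld nodes\n", s2, c2);
 return 0;
}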
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Likelihood of superiority

Post by bob »

mcostalba wrote:
bob wrote: Other terms you use are ambiguous, making questions difficult to answer. What is a "match"? A set of games starting from a chosen set of positions, against an opponent (or group of opponents)? What is "matches"? Repeating the same set of positions/opponents again? Different positions?
In the simplest case it is a match between A and A' only, a direct match of, say, 500 games where the two engines play against each other.

I would not consider other engines in this context. I am interested in knowing with what probability engine A can be said to be stronger than A' _under those testing conditions_ (note that I didn't say "if A is stronger than A'"). I just need to know how to calculate, starting from the 500-game result in terms of wins, draws and losses, with what probability, say, A is stronger than A'.

This is a theoretical question; I am not arguing about _whether_ one-vs-one matches are reliable or effective for chess engine testing.

Regarding testing conditions, I don't think they are important because, as I said, I want to know with what probability A is stronger than A' under THOSE CONDITIONS, not in general terms, and not how well those conditions could be translated to the general case.

So I have removed the variable of different opponents, because it does not interest me for this experiment, and the variable of conditions, because I accept that the difference in strength applies only under the same conditions in which the engines were tested.

How this LOS, calculated in this way, can be extrapolated to the general case is a _completely_ different problem. I would call it the 'problem of testing effectiveness', but it is not discussed here.
The "LOS" is a very vague term if you are talking A vs A'. Suppose you add a new term that is very poorly tuned. It still may help against A, but can easily be worse against others. I found much better information by playing A vs a set of opponents, then repeating for A'. Incestuous testing has inherent dangers, since everything is the same except for one change.
Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Likelihood of superiority

Post by Uri Blass »

<snipped>
mcostalba wrote: I think only the latter question makes sense. Suppose the answer is that engine A is stronger than engine B with a 90% chance. It means that if we throw in 10 changes/patches, each one tested with only 500 games, then at the end we have thrown in 9 patches that work and 1 that doesn't, but the net effect is clearly positive.
This is another mistake that you make.

It is possible to have 9 changes that each add 1 Elo and 1 change that loses 10 Elo, giving a net change of -1 Elo (9*1-10).

The fact that 90% of your changes are productive does not mean that you get an improvement (even if in practice you do get one).

I cannot tell you the optimal way to test, especially when the a priori distribution that I mentioned in a previous post is only a guess, but
I can easily tell you that you are making wrong assumptions.

You may get an improvement in practice by testing, but finding out whether it is better
to test changes with more games or with fewer games is a very hard question, and the only way that I can suggest is simply to try and compare.

You can have a list of suggested changes, with tester 1 testing with method A and tester 2 testing with method B; if tester 1 gets a better result with the same number of games after many games, then you can guess that tester 1's method is better.

Uri