Testing A against B by playing a pool of others

AndrewGrant · Post by **AndrewGrant** » Sat Jun 24, 2017 9:02 am

So I've gone through the trouble of writing a nice web-based testing framework. The only part I am missing, or at least am not sure of, is my process for terminating tests.

Originally, this was roughly my process:

I have two engines, one called test and one called base. Both play a number of games against engines A, B, C, ....

I decide that I want 95% confidence that there will not be a false positive or false negative. Based on this I get some bounds, [-X, X]. If the Z value I calculate falls outside those bounds, I terminate the test. Otherwise I play more games.
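For a single look at the data, the bounds [-X, X] for a chosen two-sided confidence level can be read off the standard normal distribution. Below is a minimal sketch, assuming Z is approximately standard normal when test and base are equally strong; the function name `two_sided_bound` is made up for illustration. Checking Z repeatedly as games accumulate pushes the real false-positive rate above the nominal 5%, so this gives only the fixed-sample bound.

```python
from statistics import NormalDist


def two_sided_bound(confidence: float) -> float:
    """Return X such that a standard normal Z lies in [-X, X] with this probability."""
    alpha = 1.0 - confidence
    # Split alpha evenly between the two tails (assumption: symmetric test).
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


X = two_sided_bound(0.95)
print(f"terminate when |Z| > {X:.3f}")  # prints 1.960 for 95%
```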

To calculate Z, I do the following:

Code:

testmean = 0; teststd = 0
for matchup in test matchups:
  w, d, l = matchup.results
  n = w + d + l
  s = w + d/2
  p = s / n

  diff = -400 * log10(1/p - 1)
  testmean += matchup.opponentsELO + diff

  std = sqrt(p * (1-p) / n)
  upperstd = -400 * log10(1/(p+std) - 1)
  lowerstd = -400 * log10(1/(p-std) - 1)
  teststd += (upperstd + lowerstd) / 2

Repeat again for basemean and basestd

Z = ((testmean - basemean) / numOpponents) / sqrt((testvar + basevar) / numOpponents)

I don't have anywhere near enough stats knowledge to say whether or not this is right. I question whether I should replace all the stds with variances. Should there be a division by two in (upperstd + lowerstd) / 2?
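One way to look at these two questions is the first-order delta method: multiply the standard error of the score by the slope of the score-to-Elo curve. Since both upperstd and lowerstd in the listing are themselves Elo values, their average lands close to diff rather than measuring spread; half the distance between them is what approximates a standard error. When independent estimates are summed it is the variances that add, not the standard deviations. The sketch below is an illustration under those assumptions, with hypothetical results and hypothetical function names, not a verified fix for the framework.

```python
import math


def elo_from_score(p: float) -> float:
    """Logistic Elo difference implied by a score p, with 0 < p < 1."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_se_delta(p: float, score_se: float) -> float:
    """Approximate Elo standard error via the first-order delta method."""
    slope = 400.0 / (math.log(10.0) * p * (1.0 - p))  # d(elo)/dp
    return slope * score_se


w, d, l = 420, 300, 280              # hypothetical results, not from the thread
n = w + d + l
p = (w + d / 2) / n
se = math.sqrt(p * (1 - p) / n)      # binomial form, as used in the post

# Half the distance between the Elo values at p + se and p - se.
half_width = (elo_from_score(p + se) - elo_from_score(p - se)) / 2

print("elo diff        :", round(elo_from_score(p), 2))
print("delta-method se :", round(elo_se_delta(p, se), 2))
print("half width      :", round(half_width, 2))
```

For scores away from 0 and 1 the two spread estimates agree closely, so either can be squared and accumulated as a variance.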

Any help, or a pointer towards some helpful reading material, would be greatly appreciated.

Thanks,
Andrew Grant

AndrewGrant · Post by **AndrewGrant** » Sat Jun 24, 2017 9:10 am

This was my other thought, which gave me slightly better results, but it still seems wrong, even for only 1 opponent.

Code:

   
for matchup in testmatchups:
    wins = matchup.wins
    draws = matchup.draws
    losses = matchup.losses
    games = wins + draws + losses
    points = wins + draws / 2
    score = points / games

    diff = -400 * log10(1/score - 1)
    elo = matchup.opponent.elo + diff
    testmean += elo

    std = sqrt(score * (1 - score) / games)
    upperstd = (-400 * log10(1/(score+std) - 1)) - diff
    lowerstd = diff - (-400 * log10(1/(score-std) - 1))
    elostd = upperstd + lowerstd
    testvar += elostd ** 2
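A runnable sketch of the whole calculation might look like the following. It rests on assumptions that are not established in this thread: every matchup is independent, opponents are weighted equally, the opponent ratings are exact, each score lies strictly between 0 and 1, and the Elo variance comes from the delta method with the binomial score variance used above. The `Matchup` type, the function names and the sample numbers are all hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class Matchup(NamedTuple):
    opponent_elo: float
    wins: int
    draws: int
    losses: int


def elo_estimate(m: Matchup) -> Tuple[float, float]:
    """Return (estimated Elo, variance of that estimate) for one matchup."""
    games = m.wins + m.draws + m.losses
    score = (m.wins + m.draws / 2) / games      # must be strictly inside (0, 1)
    diff = -400.0 * math.log10(1.0 / score - 1.0)
    score_var = score * (1.0 - score) / games   # binomial form, as in the post
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return m.opponent_elo + diff, slope ** 2 * score_var


def pool_mean(matchups: List[Matchup]) -> Tuple[float, float]:
    """Equal-weight mean of the per-opponent estimates and the variance of that mean."""
    k = len(matchups)
    estimates = [elo_estimate(m) for m in matchups]
    mean = sum(e for e, _ in estimates) / k
    var = sum(v for _, v in estimates) / k ** 2  # Var(mean) = sum(var_i) / k^2
    return mean, var


def z_value(test: List[Matchup], base: List[Matchup]) -> float:
    """Z for the difference of the two pool means, assuming independence."""
    t_mean, t_var = pool_mean(test)
    b_mean, b_var = pool_mean(base)
    return (t_mean - b_mean) / math.sqrt(t_var + b_var)


# Hypothetical results against two opponents.
test = [Matchup(2800, 260, 500, 240), Matchup(2750, 300, 480, 220)]
base = [Matchup(2800, 240, 500, 260), Matchup(2750, 280, 480, 240)]
print("Z =", round(z_value(test, base), 3))
```

In this sketch the variance of the averaged rating is the sum of the per-opponent variances divided by the square of the number of opponents, which is how the variance of an equal-weight mean of independent estimates behaves.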

Ajedrecista · Post by **Ajedrecista** » Sat Jun 24, 2017 7:52 pm

Hello Andrew:

I am not an expert in Statistics, but I will give you my opinion anyway.

I have a question: since you are using a trinomial model (wins, draws, losses), why do you use the binomial form of the standard deviation instead of the trinomial form?

http://centaur.reading.ac.uk/4549/1/200 ... icance.pdf

Please read chapter 3.2 of that link. Following your notation:

Code:

std = sqrt((score * (1 - score) - 0.25 * d / n) / n)

Of course, score * (1 - score) - 0.25 * d / n < score * (1 - score) whenever there are draws, so abs(Z) will rise most of the time and the number of games can be reduced.
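The difference between the two forms can be checked numerically. The sketch below compares the binomial form, the trinomial form in the notation above, and a direct calculation from per-game outcomes worth 1, 0.5 and 0; the function names and the sample results are made up for illustration.

```python
import math


def binomial_std(w: int, d: int, l: int) -> float:
    """Standard error of the score when draws are ignored in the variance."""
    n = w + d + l
    score = (w + d / 2) / n
    return math.sqrt(score * (1 - score) / n)


def trinomial_std(w: int, d: int, l: int) -> float:
    """Standard error of the score using the trinomial form quoted above."""
    n = w + d + l
    score = (w + d / 2) / n
    return math.sqrt((score * (1 - score) - 0.25 * d / n) / n)


def direct_std(w: int, d: int, l: int) -> float:
    """Same quantity computed from per-game outcomes 1, 0.5 and 0."""
    n = w + d + l
    score = (w + d / 2) / n
    per_game_var = (w * (1 - score) ** 2
                    + d * (0.5 - score) ** 2
                    + l * (0 - score) ** 2) / n
    return math.sqrt(per_game_var / n)


w, d, l = 420, 300, 280  # hypothetical results
print("binomial :", binomial_std(w, d, l))
print("trinomial:", trinomial_std(w, d, l))
print("direct   :", direct_std(w, d, l))
```

The trinomial and direct values coincide, and both are smaller than the binomial value as soon as any game is drawn.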

This message does not answer your original question, but I think you should be aware of it. The choice is yours.

Other questions might arise, such as the stopping rule and its risk of biased results. I hope you will get more help from experts, because the whole idea of comparing two versions with the aid of a pool of engines is really useful.

Regards from Spain.

Ajedrecista.
