Testing A against B by playing a pool of others

AndrewGrant
Posts: 439
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant

Post by AndrewGrant » Sat Jun 24, 2017 7:02 am

So I've gone through the trouble of writing a nice web-based testing framework. The only part I am missing, or at least am not sure of, is my process for terminating tests.

Originally, this was roughly my process:

I will have two engines, one called test and one called base. They will both play a number of games against engines A, B, C, ...

I decide that I want 95% confidence that there will not be a false positive / false negative. Based on this I get some bounds, [-X, X]. If the Z value I calculate falls outside those bounds, I will terminate the test. Otherwise I will play more games.
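
For reference, a bound like X can be read off the standard normal distribution. The sketch below assumes a two-sided test at a 5% error rate and a single look at the data after a fixed number of games; the helper name `z_bound` is hypothetical and not part of the framework described here.

```python
from statistics import NormalDist

def z_bound(confidence: float = 0.95) -> float:
    """Two-sided critical value X such that P(-X < Z < X) = confidence
    for a standard normal Z (assumes one fixed-size test, no peeking)."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

X = z_bound(0.95)
print(f"terminate when Z < {-X:.3f} or Z > {X:.3f}")
```

For 95% this gives a bound of roughly 1.96. Checking Z against the same fixed bound again and again as games accumulate pushes the real error rate above the nominal 5%, so this value is best treated as a starting point.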

To calculate Z, I do the following:

Code:

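from math import log10, sqrt

testmean = 0; teststd = 0
for matchup in test_matchups:
    w, d, l = matchup.results
    n = w + d + l              # games played
    s = w + d / 2              # points scored
    p = s / n                  # score fraction

    # Elo difference implied by the score, added to the opponent's rating
    diff = -400 * log10(1 / p - 1)
    testmean += matchup.opponentsELO + diff

    # binomial standard error of p, mapped to Elo at p + std and p - std
    std = sqrt(p * (1 - p) / n)
    upperstd = -400 * log10(1 / (p + std) - 1)
    lowerstd = -400 * log10(1 / (p - std) - 1)
    teststd += (upperstd + lowerstd) / 2

# Repeat again for basemean and basestd

Z = ((testmean - basemean) / numOpponents) / sqrt((testvar + basevar) / numOpponents)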
  
I don't have anywhere near enough stats knowledge to say whether or not this is right. I question whether I should replace all the stds with variances. Should I be dividing by two in (upperstd + lowerstd) / 2?
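
One way to look at the (upperstd + lowerstd) / 2 question, offered as a sketch rather than a settled answer: if `upperstd` and `lowerstd` are the Elo values at p + std and p - std, their average lands close to the Elo estimate itself, while half their difference is an approximate one-sigma width in Elo. The function names below are hypothetical, the counts are made up, and the binomial standard error is the same one used in the code above.

```python
from math import log10, sqrt

def elo_from_score(p: float) -> float:
    """Logistic Elo difference implied by a score fraction 0 < p < 1."""
    return -400 * log10(1 / p - 1)

def matchup_elo_and_se(w: int, d: int, l: int):
    """Return (elo_diff, elo_se) for one match-up.

    elo_se is half the distance between the Elo values at p + std and
    p - std, i.e. an approximate one-sigma error expressed in Elo.
    Assumes 0 < p - std and p + std < 1.
    """
    n = w + d + l
    p = (w + d / 2) / n
    std = sqrt(p * (1 - p) / n)          # binomial form, as in the post
    upper = elo_from_score(p + std)
    lower = elo_from_score(p - std)
    return elo_from_score(p), (upper - lower) / 2

# illustrative counts, not real data
diff, se = matchup_elo_and_se(w=40, d=30, l=30)
print(f"Elo diff {diff:+.1f} with standard error {se:.1f}")
```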

Any help, or a pointer towards some helpful reading material, would be greatly appreciated.

Thanks,
Andrew Grant

Re: Testing A against B by playing a pool of others

Post by AndrewGrant » Sat Jun 24, 2017 7:10 am

This was my other thought, which gave me slightly better results, but it still seems wrong, even for only 1 opponent.

Code:

   
testmean = 0; testvar = 0
for matchup in testmatchups:
    wins = matchup.wins
    draws = matchup.draws
    losses = matchup.losses
    games = wins + draws + losses
    points = wins + draws / 2
    score = points / games

    # Elo estimate against this opponent
    diff = -400 * log10(1/score - 1)
    elo = matchup.opponent.elo + diff
    testmean += elo

    # distance in Elo from the estimate to the values at score +/- std
    std = sqrt(score * (1 - score) / games)
    upperstd = (-400 * log10(1/(score+std) - 1)) - diff
    lowerstd = diff - (-400 * log10(1/(score-std) - 1))
    elostd = upperstd + lowerstd
    testvar += elostd ** 2
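
A possible way to combine the per-opponent numbers into Z, assuming the match-ups are independent and the normal approximation holds: the variance of an average of k independent estimates is the sum of their variances divided by k squared, which is not the same as dividing the summed variance by k. The sketch also takes half of the upper-to-lower span as the one-sigma width. The tuple layout `(opponent_elo, wins, draws, losses)`, the function names, and the counts are all assumptions made for illustration.

```python
from math import log10, sqrt

def elo_from_score(p):
    """Logistic Elo difference implied by a score fraction 0 < p < 1."""
    return -400 * log10(1 / p - 1)

def pool_rating(matchups):
    """Average Elo estimate over a pool and the variance of that average.

    matchups: iterable of (opponent_elo, wins, draws, losses) tuples
    (a hypothetical layout). Assumes independent match-ups.
    """
    total_elo = 0.0
    total_var = 0.0
    k = 0
    for opp_elo, w, d, l in matchups:
        n = w + d + l
        p = (w + d / 2) / n
        std = sqrt(p * (1 - p) / n)
        # one-sigma width in Elo: half the span between p + std and p - std
        elo_se = (elo_from_score(p + std) - elo_from_score(p - std)) / 2
        total_elo += opp_elo + elo_from_score(p)
        total_var += elo_se ** 2
        k += 1
    return total_elo / k, total_var / k ** 2

def z_score(test_matchups, base_matchups):
    """Difference of the two pool averages in units of its standard error."""
    test_mean, test_var = pool_rating(test_matchups)
    base_mean, base_var = pool_rating(base_matchups)
    return (test_mean - base_mean) / sqrt(test_var + base_var)

# illustrative counts, not real data
test = [(2500, 40, 30, 30), (2600, 30, 30, 40)]
base = [(2500, 35, 30, 35), (2600, 25, 30, 45)]
print(f"Z = {z_score(test, base):.2f}")
```

When test and base face exactly the same opponents, the opponents' ratings cancel in the difference of the two averages, so only the score-derived Elo differences affect Z.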

Ajedrecista
Posts: 1376
Joined: Wed Jul 13, 2011 7:04 pm
Location: Madrid, Spain.

Re: Testing A against B by playing a pool of others.

Post by Ajedrecista » Sat Jun 24, 2017 5:52 pm

Hello Andrew:

I am not an expert in statistics, but I will give you my opinion anyway.

I have a question: since you are using a trinomial model (wins, draws, losses), why do you use the binomial form of the standard deviation instead of the trinomial form?

http://centaur.reading.ac.uk/4549/1/200 ... icance.pdf

Please read section 3.2 of that document. Following your notation:

Code:

std = sqrt( (score * (1 - score) - 0.25 * d / n) / n)
Whenever there are draws, score * (1 - score) - 0.25 * d / n < score * (1 - score), so abs(Z) will rise most of the time and the number of games can be reduced.
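
To get a feel for the size of the effect, the sketch below evaluates the binomial and trinomial forms on a made-up result. The counts are illustrative only and the function names are hypothetical.

```python
from math import sqrt

def binomial_std(w, d, l):
    """Standard error of the score, treating each game as win-or-loss."""
    n = w + d + l
    score = (w + d / 2) / n
    return sqrt(score * (1 - score) / n)

def trinomial_std(w, d, l):
    """Standard error of the score with the draw correction applied."""
    n = w + d + l
    score = (w + d / 2) / n
    return sqrt((score * (1 - score) - 0.25 * d / n) / n)

w, d, l = 40, 30, 30   # illustrative counts, not real data
print(f"binomial:  {binomial_std(w, d, l):.4f}")
print(f"trinomial: {trinomial_std(w, d, l):.4f}")
```

With no draws at all the two functions return the same value, and the gap widens as the draw rate grows.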

This message does not answer your original question, but I think you should be aware of it. The choice is yours.

Other questions might arise, such as the stopping rule and its risk of biased results. I hope you will get more help from experts, because the whole idea of comparing two versions with the aid of a pool of engines is really useful.

Regards from Spain.

Ajedrecista.
