Edsel Apostol wrote:It would be interesting to see this data being shown in a graph including the error bars. I'm interested to see how consistent (straight) the line of the elo and how the error bar diverge nearer the elo line as the number of games increases.
I'm actually working on a program that does this very thing. I have a small Windows app for my testing and am trying to incorporate regular calls to bayeselo for a graphical display. I will release it when it is ready.
Edsel Apostol wrote:It would be interesting to see this data being shown in a graph including the error bars. I'm interested to see how consistent (straight) the line of the elo and how the error bar diverge nearer the elo line as the number of games increases.
I show such graphs in my tournament site. Example (Sjeng 12.13):
green line - rating
red lines - rating error interval
yellow line - average opponent
blue line - number of games (scale not shown)
And a smoother version where days without games are skipped:
Similar graphs are also featured in CCRL sites.
Of course, the number of games is much less than 40k, but it does give an idea of how much the ratings fluctuate with few games.
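To see why the error bars diverge with few games and tighten as games accumulate, here is a small self-contained sketch. It uses the standard logistic Elo conversion elo = -400*log10(1/p - 1) with a plain normal approximation on the score fraction; this is only a rough illustration, not the likelihood method bayeselo actually uses, and the match numbers are made up:

```python
import math

def elo_interval(wins, losses, draws, z=1.96):
    """Elo estimate with a rough ~95% interval from one match result.
    Normal-approximation sketch, not the exact BayesElo computation."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Sample variance of the per-game score (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)   # standard error of the mean score

    def to_elo(p):
        p = min(max(p, 1e-6), 1 - 1e-6)  # clamp away from 0 and 1
        return -400.0 * math.log10(1.0 / p - 1.0)

    return to_elo(score), (to_elo(score - z * se), to_elo(score + z * se))

elo, (low, high) = elo_interval(wins=55, losses=45, draws=0)
# The same 55% score over 100x more games gives the same Elo estimate
# but an interval roughly 10x narrower.
```

Running it with 10,000 games instead of 100 at the same score keeps the rating line in place while the red error lines close in, which is exactly the shape the graphs above show.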
bob wrote:
This does show the danger of relying on small numbers of games to predict whether a change is good or bad...
I think that the conclusion is simply not to optimize engines for games but to optimize them for solving swami's test suite.
In this case it is easy to predict whether a change is good or bad, and hopefully most of the changes that are productive for swami's test suite are also going to be productive for games.
I can add that you cannot do it if you test changes in the time management or changes of learning during the game, but I hope that swami develops a good test suite, and it may be possible to validate it by comparing ratings of programs based on CCRL or CEGT with ratings of programs based on some formula derived from the test results.
Uri Blass wrote:I can add that you cannot do it if you test changes in the time management or changes of learning during the game, but I hope that swami develops a good test suite, and it may be possible to validate it by comparing ratings of programs based on CCRL or CEGT with ratings of programs based on some formula derived from the test results.
Uri
What I found missing in swami's tests (and I have already said this some weeks ago) is that the final score uses equal weights for all the tests.
I am not sure that all strategic aspects have the same impact; I would think that, to go in the direction of inferring the ELO rating of an engine from the Swami test, the individual test scores should be weighted before being summed together.
The correct weights can only be deduced from the results of various engines whose real ELO is known, by finding the weight vector that best approximates the engines' ELO from their swami test results.
Uri Blass wrote:I can add that you cannot do it if you test changes in the time management or changes of learning during the game, but I hope that swami develops a good test suite, and it may be possible to validate it by comparing ratings of programs based on CCRL or CEGT with ratings of programs based on some formula derived from the test results.
Uri
What I found missing in swami's tests (and I have already said this some weeks ago) is that the final score uses equal weights for all the tests.
I am not sure that all strategic aspects have the same impact; I would think that, to go in the direction of inferring the ELO rating of an engine from the Swami test, the individual test scores should be weighted before being summed together.
The correct weights can only be deduced from the results of various engines whose real ELO is known, by finding the weight vector that best approximates the engines' ELO from their swami test results.
I guess that you are right that some features have more value than others.
We might be able to estimate the proper weights by doing a linear least-squares fit or some other simple math.
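A minimal sketch of that least-squares idea, with made-up numbers: given each engine's per-category test scores and its known rating, solve the normal equations for the weight vector that best reproduces the ratings. The categories, scores, and ratings below are all hypothetical, and real data would of course be noisy rather than exact:

```python
# Hypothetical example: fit per-category weights so that a weighted
# sum of test-suite scores approximates known engine Elo ratings.

def fit_weights(scores, elos):
    """Least-squares weights w minimizing ||S w - elo||^2, solved via
    the normal equations S^T S w = S^T elo (here for 2 categories)."""
    n00 = sum(s[0] * s[0] for s in scores)
    n01 = sum(s[0] * s[1] for s in scores)
    n11 = sum(s[1] * s[1] for s in scores)
    b0 = sum(s[0] * e for s, e in zip(scores, elos))
    b1 = sum(s[1] * e for s, e in zip(scores, elos))
    det = n00 * n11 - n01 * n01          # 2x2 determinant
    w0 = (n11 * b0 - n01 * b1) / det
    w1 = (n00 * b1 - n01 * b0) / det
    return w0, w1

# Three engines, two test categories (say tactics and endgames), with
# ratings constructed as 4*tactics + 2*endgames for the demonstration,
# so the fit recovers the weights (4, 2) up to rounding.
scores = [(500, 300), (600, 450), (700, 200)]
elos = [4 * a + 2 * b for a, b in scores]
weights = fit_weights(scores, elos)
```

With more categories than this, one would solve the general system (e.g. with a library routine) instead of inverting a 2x2 matrix by hand, but the principle is the same.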
liuzy wrote:Bob, why don't you improve ipplite using your cluster?
Because we are busy improving Crafty, which is my own code. What would be the benefit of improving IP* if we one day discover it is derived from Rybka??? Also, what would be the motivation? I want to win with my code. Not much point in copying or using what someone else has done, IMHO. This is about enjoying a hobby, not copying what others have done. (Of course, not _everyone_ believes in that philosophy, but that's a separate issue.)
I agree with Bob in a big way and with all due respect, I think the question was a stupid one.
Can you guide a newbie on how to run multiple games on a cluster?
Are you using cutechess-cli for that purpose, or do you use your own script?
For other software I use that is capable of running on a cluster, I just invoke
"mpirun -np 12 `which xxxx`" to run the command xxxx on 12 processors. I guess I need an MPI script to replace xxxx that assigns the games to different IP addresses.
Right now when I do "mpirun -np 4 scorpio", it starts 4 instances of scorpio on 4 nodes.
bob wrote:
This does show the danger of relying on small numbers of games to predict whether a change is good or bad...
I think that the conclusion is simply not to optimize engines for games but to optimize them for solving swami's test suite.
In this case it is easy to predict whether a change is good or bad, and hopefully most of the changes that are productive for swami's test suite are also going to be productive for games.