Dirt wrote: This still looks linear to me, just with a slightly steeper slope. Where does ph come from? Just incrementing it by one each ply would be simplest, but I don't think that could be right.
Greg's point is valid: fitting a linear regression to the data you've given still yields a straight line, just a steeper one. That means either Glaurung is already non-linear, or the new code is not really non-linear. Even so, the Elo increase can be real.
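The linearity question can be checked directly: fit a least-squares line to the (depth, Elo) data and inspect the residuals. This is only a sketch; the data points below are made-up placeholders, not Glaurung's actual numbers.

```python
# Sketch: test whether Elo-vs-depth data is linear by fitting a
# least-squares line and examining the residuals.

def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

# Hypothetical measurements (illustrative only).
depth = [6, 7, 8, 9, 10, 11]
elo   = [2200, 2270, 2340, 2400, 2455, 2505]

a, b = fit_line(depth, elo)
residuals = [y - (a * x + b) for x, y in zip(depth, elo)]
print(f"slope = {a:.1f} Elo/ply, intercept = {b:.1f}")
print("residuals:", [round(r, 1) for r in residuals])
# Systematically curved residuals (e.g. positive at both ends and
# negative in the middle) would indicate a genuinely non-linear
# relationship; residuals that look like noise support linearity.
```

A straight-line fit always produces a slope, steeper or shallower; it is the residual pattern, not the slope, that says whether the underlying relationship is linear.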
bob wrote:
Take any program, make a change to it, and play the old version against the new. If the change is good, the results will be far better than expected; if the change is bad, far worse. Since the only difference between the two programs is the change you made, it tends to influence games more than expected.
I've run millions of games testing A vs A', and the results are unreliable. It is far better to run A and A' against a common set of opponents and see which turns out better.
The question you need to answer in order to decide whether to accept a change is whether A is better than A', not by how much.
If testing A against A' amplifies the effect, then it is a good test, because you need fewer games to know which version is stronger.
The only possible problem would be if you often got cases where A' beats A but A does better than A' against other opponents, and I see no data suggesting that this happens often.
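Uri's "fewer games" argument can be made concrete with a rough back-of-the-envelope calculation. This is only a sketch under a simplified draw-free, two-outcome model (draws would only reduce the variance further); the numbers are illustrative.

```python
import math

def games_needed(elo_diff, z=1.96):
    """Rough number of games needed to detect a positive Elo difference
    of elo_diff (must be > 0) at roughly 95% confidence, ignoring draws.
    Per-game score variance is at most 0.25, so the standard error of the
    mean score after n games is at most 0.5 / sqrt(n)."""
    p = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))  # expected score of the stronger side
    # Require |p - 0.5| > z * 0.5 / sqrt(n), i.e. n > (0.5 * z / (p - 0.5))^2
    return math.ceil((0.5 * z / (p - 0.5)) ** 2)

for d in (5, 10, 20, 50):
    print(f"{d:3d} Elo difference: ~{games_needed(d)} games")
```

The practical attraction Uri describes falls out of the formula: if A-vs-A' testing roughly doubles the apparent Elo gap of a change, the number of games needed drops by roughly a factor of four.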
I have seen many cases where A' beats A but then does worse against other programs. I have had three or four of those just this week while making the new changes to Crafty's eval.
Uri Blass wrote: If testing A against A' amplifies the effect, then it is a good test, because you need fewer games to know which version is stronger.
The only possible problem would be if you often got cases where A' beats A but A does better than A' against other opponents, and I see no data suggesting that this happens often.
Are you sure? It seems entirely plausible to me that A' might be weaker in a way that A (being almost the same program) cannot exploit, but that other programs can.
Testing A' only against A is a way of optimizing your engine to play well against itself, which sounds like the wrong local maximum to optimize for. Since what you actually want is for it to play well against a variety of other opponents, it will probably be more reliable to test and measure changes against a variety of other opponents.
Even then, the many past threads on this topic suggest that proving a change conclusively better through testing and measurement is difficult in any case.
The question is practical, not theoretical.
Testing A' against A can be a practical way to get a faster answer to the question of whether A' is better than A.
The results may in theory be wrong, but if that does not happen often, then testing only A' against A may yield a bigger improvement than testing both against B, C, and D, because there is limited time to test changes.
bob wrote:
I have seen many cases where A' beats A but then does worse against other programs. I have had three or four of those just this week while making the new changes to Crafty's eval.
Note that "does worse" is not enough; we need statistically significant results to be sure it is not just statistical noise, especially when the difference against the other opponents is very small but in the same direction.
If you have specific data, it would be interesting to see it.
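Uri's noise concern can be quantified with the usual likelihood-of-superiority (LOS) calculation on match results. This is a sketch using the standard normal approximation on decisive games (draws cancel out of the numerator and only affect the error bar); the example counts are hypothetical.

```python
import math

def los(wins, losses):
    """Likelihood of superiority: approximate probability that the first
    engine is genuinely stronger, given decisive-game counts, under a
    normal approximation to the win/loss difference."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Hypothetical match result: 180 wins, 150 losses (plus any number of draws).
print(f"LOS = {los(180, 150):.3f}")
```

A small score difference over few decisive games gives an LOS near 0.5, i.e. no real evidence either way, which is exactly the "same direction but very small" situation Uri warns about.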