Superlinear interpolator: a nice novelty?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Superlinear interpolator: a nice novelty?

Post by BubbaTough »

I can confirm that my results echo Bob's. I still sometimes try it, though, and it sounds like Bob does too. So it sounds like we both consider A vs. A' matches useful, just not to be trusted on their own.

-Sam
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Superlinear interpolator: a nice novelty?

Post by mcostalba »

Uri Blass wrote:
bob wrote:
I have seen many cases where A' beats A, but when played against other programs, it does worse. I have had 3-4 of those this week in making the new changes to Crafty's eval.
Note that "does worse" is not enough; we need significant results to be sure it is not just statistical noise when the difference against the other opponents is very small, even if it is in the same direction.

If you have specific data then it may be interesting to see it.

Uri
My naive 5 cents here:

We have two different situations:

- A' beats A, but when played against other programs, it does worse.

- A' beats A *by a confident margin* but when played against other programs, it does worse.

I would think, though I ask the experts for confirmation, that the two statements are completely different, at least as long as "confident margin" is defined in a sound way.

Now of course the problem is to define the "confident margin".

It could be a minimum Elo difference, found *by experiment*, that makes the second statement always false, or it could be a statistical Elo difference.

Let me explain myself better. When I run an engine match in the ChessBase GUI I see a nice Elo range with a statistical confidence, something like

TP=+20 ELO, 95% [-20, +63], 99.5% [-80, +160]

Now the normal way of testing is to play, say, 1000 games between the two engines and look at the result.

I am wondering whether a rule like this would be sound instead:

- Run the test until 1000 games are reached OR until we reach 95% [+10, +xxx]

What we are interested in, and it is very important to state this clearly, is NOT the Elo difference itself but whether a new patch applied to your program makes the program better or worse.

If the second rule is sound (but I ask the experts here), I think it could be possible to greatly reduce the testing needed for improvements.

As soon as your patch seems to be better by *at least* 10 Elo points with 95% confidence, the patch is good; otherwise keep testing...

Marco
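
A minimal sketch of the stopping rule Marco proposes, assuming a simple normal approximation for the 95% interval rather than the exact computation of the ChessBase GUI or BayesElo; the play_game callback and the 100-game warm-up threshold are placeholders for illustration only.

Code:

import math

def elo_from_score(p):
    """Convert an expected score p (0 < p < 1) into an Elo difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Rough 95% Elo interval from match results, using a normal
    approximation of the per-game score (not BayesElo's model)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2 +
           draws * (0.5 - score) ** 2 +
           losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    def clamp(x):
        return min(1.0 - 1e-6, max(1e-6, x))
    return (elo_from_score(clamp(score - z * se)),
            elo_from_score(clamp(score)),
            elo_from_score(clamp(score + z * se)))

def sequential_test(play_game, max_games=1000, min_gain=10.0):
    """Marco's proposed rule: stop as soon as the lower bound of the
    95% interval is at least +min_gain Elo, otherwise keep playing
    until max_games is reached. play_game() must return 1, 0.5 or 0
    from the new version's point of view (hypothetical callback)."""
    wins = draws = losses = 0
    for games in range(1, max_games + 1):
        result = play_game()
        if result == 1:
            wins += 1
        elif result == 0.5:
            draws += 1
        else:
            losses += 1
        if games >= 100:  # don't trust the interval on tiny samples
            low, est, high = elo_interval(wins, draws, losses)
            if low >= min_gain:
                return "accept patch", games, (low, est, high)
    return "inconclusive, keep testing", max_games, elo_interval(wins, draws, losses)

# usage (hypothetical): verdict, games_played, interval = sequential_test(play_one_game)

One caveat: checking the interval after every game and stopping on the first favourable reading makes early "accept" decisions somewhat more likely than the nominal 95% suggests, which is part of why the expert confirmation Marco asks for matters.
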
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Superlinear interpolator: a nice novelty?

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Take any program. Make a change to it. Play the old vs. the new. If the change is good, the results will be far better than expected. If the change is bad, the results will be far worse than expected. Because the only difference between the two programs is the change you made, it tends to influence games more than expected.

I've run millions of games testing A vs A' and the results are unreliable. Far better to run A and A' against a common set of opponents and see which turns out to be better.
The question you are interested in, in order to decide whether to accept a change, is whether A is better than A', not how much better.

If testing A against A' increases the effect, then it is a good test, because you need fewer games to know which version is stronger.

The only possible problem is if you often get cases where A' beats A but A is better than A' against other opponents, and I see no data suggesting that this happens often.

Uri
I have seen many cases where A' beats A, but when played against other programs, it does worse. I have had 3-4 of those this week in making the new changes to Crafty's eval.
Note that "does worse" is not enough; we need significant results to be sure it is not just statistical noise when the difference against the other opponents is very small, even if it is in the same direction.
What do you consider "significant"? I am using 40,000 games in a trial. A plays 40K games against a set of opponents, and A' plays the same number of games against the same opponents. A' turned out to be 12 Elo weaker. Yet with 8,000 games between A and A', A' looked better. A' was actually significantly slower than A, but it played better head-to-head. Against the normal pool of opponents it did worse.

I don't quote results for 20 game matches...

If you have specific data then it may be interesting to see it.

Uri
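
For a sense of scale on the game counts Bob quotes, here is a rough back-of-envelope sketch. The near-even score and 40% draw ratio are my own assumptions, and the normal approximation is cruder than BayesElo, but it shows how the 95% error bar shrinks with the number of games.

Code:

import math

def elo_error_bar(n_games, score=0.5, draw_ratio=0.4, z=1.96):
    """Approximate half-width (in Elo) of a 95% interval after n_games,
    assuming a near-even score and the given draw ratio. Normal
    approximation only, not BayesElo's model."""
    win = score - 0.5 * draw_ratio
    loss = 1.0 - win - draw_ratio
    var = (win * (1.0 - score) ** 2 +
           draw_ratio * (0.5 - score) ** 2 +
           loss * (0.0 - score) ** 2)
    half_width = z * math.sqrt(var / n_games)                    # error on the score
    return half_width * 400.0 / math.log(10) / (score * (1.0 - score))  # score -> Elo

print(elo_error_bar(40000))  # about +/- 2.6 Elo
print(elo_error_bar(8000))   # about +/- 5.9 Elo
print(elo_error_bar(20))     # well over +/- 100 Elo

Under these assumed numbers, a 12 Elo deficit over 40K games and a 19 Elo head-to-head lead over 8K games are each well outside their own error bars, which is why the contradiction Bob describes is more than noise.
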
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Superlinear interpolator: a nice novelty?

Post by bob »

BubbaTough wrote: I can confirm that my results echo Bob's. I still sometimes try it, though, and it sounds like Bob does too. So it sounds like we both consider A vs. A' matches useful, just not to be trusted on their own.

-Sam
I don't do it very often, but I was puzzling over a drop in Elo from what was apparently a pretty simple change that did not change evaluation numbers at all, except for the small endgame case I was working on. It was playing about 12 Elo worse than the previous version and I could not figure out why. On a whim, I played it against A as well and it won, and now I was really going "hmmm..." It turned out an unrelated change had broken lazy eval and slowed things down quite a bit in some positions. That apparently didn't hurt in A vs. A', but it was murder in A' vs. the world.

I played the games so that I could see where they "disagreed" to get some idea about what might be broken. I just happened to notice the NPS difference in certain positions before I figured out what I had done.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Superlinear interpolator: a nice novelty?

Post by bob »

mcostalba wrote:
Uri Blass wrote:
bob wrote:
I have seen many cases where A' beats A, but when played against other programs, it does worse. I have had 3-4 of those this week in making the new changes to Crafty's eval.
Note that "does worse" is not enough; we need significant results to be sure it is not just statistical noise when the difference against the other opponents is very small, even if it is in the same direction.

If you have specific data then it may be interesting to see it.

Uri
My naive 5 cents here:

We have two different situations:

- A' beats A, but when played against other programs, it does worse.

- A' beats A *by a confident margin* but when played against other programs, it does worse.

I would think, though I ask the experts for confirmation, that the two statements are completely different, at least as long as "confident margin" is defined in a sound way.
I agree. My test was this: A was better than A' over 40K games against a set of common opponents (40K games for A, 40K for A'). A' was 12 Elo weaker, as computed by BayesElo, with an Elo range (factoring in the error) of -15 to -8 and the actual Elo given as -12. I played A vs. A' for 8,000 games, and A' was better by 19 Elo, which, with the error given, would not let the A and A' ranges overlap at all.

All my results now are compared using BayesElo...


Now of course the problem is to define the "confident margin".

It could be a minimum Elo difference, found *by experiment*, that makes the second statement always false, or it could be a statistical Elo difference.

Let me explain myself better. When I run an engine match in the ChessBase GUI I see a nice Elo range with a statistical confidence, something like

TP=+20 ELO, 95% [-20, +63], 99.5% [-80, +160]

Now the normal way of testing is to play, say, 1000 games between the two engines and look at the result.

I am wondering whether a rule like this would be sound instead:

- Run the test until 1000 games are reached OR until we reach 95% [+10, +xxx]

What we are interested in, and it is very important to state this clearly, is NOT the Elo difference itself but whether a new patch applied to your program makes the program better or worse.

If the second rule is sound (but I ask the experts here), I think it could be possible to greatly reduce the testing needed for improvements.

As soon as your patch seems to be better by *at least* 10 Elo points with 95% confidence, the patch is good; otherwise keep testing...

Marco
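
For readers less used to Elo figures, a small sketch of what differences of this size mean as expected scores, plus the overlap check Bob describes above. The +13 to +25 head-to-head interval is an illustrative assumption of mine; only the +19 point estimate is given in the post.

Code:

def expected_score(elo_diff):
    """Standard logistic Elo model: expected score of the stronger side."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def intervals_overlap(a, b):
    """True if two (low, high) Elo intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

print(expected_score(12.0))   # ~0.517, i.e. 12 Elo is only about a 51.7% score
print(expected_score(19.0))   # ~0.527, i.e. 19 Elo is only about a 52.7% score

pool_result = (-15.0, -8.0)    # A' vs. the opponent pool, from the post
head_to_head = (13.0, 25.0)    # illustrative interval around the +19 estimate
print(intervals_overlap(pool_result, head_to_head))   # False: the two tests disagree
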
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Superlinear interpolator: a nice novelty?

Post by BubbaTough »

I know this is slightly off topic, but my recommendation to an engine writer at the early stages is not to worry about this stuff too much. There are so many things to work on in your first few years of engine development that provide huge ELO gains that worrying about detecting small improvements is just going to slow your progress in my opinion. After your engine stops improving at a fast rate you can start worrying about measuring the effect of minor changes.

-Sam
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Superlinear interpolator: a nice novelty?

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Take any program. Make a change to it. Play the old vs. the new. If the change is good, the results will be far better than expected. If the change is bad, the results will be far worse than expected. Because the only difference between the two programs is the change you made, it tends to influence games more than expected.

I've run millions of games testing A vs A' and the results are unreliable. Far better to run A and A' against a common set of opponents and see which turns out to be better.
The question you are interested in, in order to decide whether to accept a change, is whether A is better than A', not how much better.

If testing A against A' increases the effect, then it is a good test, because you need fewer games to know which version is stronger.

The only possible problem is if you often get cases where A' beats A but A is better than A' against other opponents, and I see no data suggesting that this happens often.

Uri
I have seen many cases where A' beats A, but when played against other programs, it does worse. I have had 3-4 of those this week in making the new changes to Crafty's eval.
Note that "does worse" is not enough; we need significant results to be sure it is not just statistical noise when the difference against the other opponents is very small, even if it is in the same direction.
What do you consider "significant"? I am using 40,000 games in a trial. A plays 40K games against a set of opponents, and A' plays the same number of games against the same opponents. A' turned out to be 12 Elo weaker. Yet with 8,000 games between A and A', A' looked better. A' was actually significantly slower than A, but it played better head-to-head. Against the normal pool of opponents it did worse.

I don't quote results for 20 game matches...

Based on the second post, I think the results are significant, and it may be interesting to know what changes cause this effect (A' better than A in a head-to-head match but weaker than A against other opponents).

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Superlinear interpolator: a nice novelty?

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Take any program. Make a change to it. Play the old vs. the new. If the change is good, the results will be far better than expected. If the change is bad, the results will be far worse than expected. Because the only difference between the two programs is the change you made, it tends to influence games more than expected.

I've run millions of games testing A vs A' and the results are unreliable. Far better to run A and A' against a common set of opponents and see which turns out to be better.
The question you are interested in, in order to decide whether to accept a change, is whether A is better than A', not how much better.

If testing A against A' increases the effect, then it is a good test, because you need fewer games to know which version is stronger.

The only possible problem is if you often get cases where A' beats A but A is better than A' against other opponents, and I see no data suggesting that this happens often.

Uri
I have seen many cases where A' beats A, but when played against other programs, it does worse. I have had 3-4 of those this week in making the new changes to Crafty's eval.
Note that "does worse" is not enough; we need significant results to be sure it is not just statistical noise when the difference against the other opponents is very small, even if it is in the same direction.
What do you consider "significant"? I am using 40,000 games in a trial. A plays 40K games against a set of opponents, and A' plays the same number of games against the same opponents. A' turned out to be 12 Elo weaker. Yet with 8,000 games between A and A', A' looked better. A' was actually significantly slower than A, but it played better head-to-head. Against the normal pool of opponents it did worse.

I don't quote results for 20 game matches...

Based on the second post, I think the results are significant, and it may be interesting to know what changes cause this effect (A' better than A in a head-to-head match but weaker than A against other opponents).

Uri
The first one was a simple reduction in the scoring for passed pawns. All scores were reduced by 30%, and when that looked "better" I ran a normal cluster test and found it was significantly worse overall.
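
Purely as an illustration of what "reduced by 30%" means in practice, a tiny hypothetical sketch; the bonus values below are invented and are not Crafty's actual passed pawn scores.

Code:

# Hypothetical passed-pawn bonus table indexed by rank (invented values,
# not Crafty's); "reduced by 30%" means scaling every entry by 0.7.
passed_pawn_bonus = [0, 10, 20, 35, 60, 100, 160, 0]
reduced_bonus = [round(b * 0.7) for b in passed_pawn_bonus]
print(reduced_bonus)  # [0, 7, 14, 24, 42, 70, 112, 0]
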