Stockfish - single evasion extensions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Stockfish - single evasion extensions

Post by mcostalba »

Uri Blass wrote: I think that a possible problem may be that the playing strength is not transitive with small differences.
For sure it is not cumulative!

If change A gives you 10 Elo and change B gives you 8 Elo, then _almost for sure_ the sum of the changes A+B will not give you 18 Elo :-)

Regarding the regression, it is actually more like the case above than a real regression: we committed a series of patches that individually add something, but when we tested the result the gain was about half of what we expected from summing the gains of the single patches.

So there is no version weaker than a previous one; simply, the latest one is not as strong as we expected. This made us think of a regression somewhere, but now, after all this testing, we start to believe that perhaps what happened is the above: the total gain is not equal to the sum of the single terms!
Uri Blass
Posts: 10309
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish - single evasion extensions

Post by Uri Blass »

Thanks.

I hope that the new Stockfish can top the rating lists (at least at the long time controls that you could not test), but I am going to be satisfied even if it is only above Rybka 3 in most rating lists, including 40/40 CCRL and 40/20 CEGT (today I believe it is above Rybka 3 only in the FRC 40/4 rating list).

Uri
Ralph Stoesser
Posts: 408
Joined: Sat Mar 06, 2010 9:28 am

Re: Stockfish - single evasion extensions

Post by Ralph Stoesser »

mcostalba wrote:
Uri Blass wrote: I think that a possible problem may be that the playing strength is not transitive with small differences.
For sure it is not cumulative!

If change A gives you 10 Elo and change B gives you 8 Elo, then _almost for sure_ the sum of the changes A+B will not give you 18 Elo :-)

Regarding the regression, it is actually more like the case above than a real regression: we committed a series of patches that individually add something, but when we tested the result the gain was about half of what we expected from summing the gains of the single patches.

So there is no version weaker than a previous one; simply, the latest one is not as strong as we expected. This made us think of a regression somewhere, but now, after all this testing, we start to believe that perhaps what happened is the above: the total gain is not equal to the sum of the single terms!
What are your real numbers for patches A, B, C ... ?
And what is the sum for all patches so far?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish - single evasion extensions

Post by bob »

mcostalba wrote:
Uri Blass wrote: I think that a possible problem may be that the playing strength is not transitive with small differences.
For sure it is not cumulative!

If change A gives you 10 Elo and change B gives you 8 Elo, then _almost for sure_ the sum of the changes A+B will not give you 18 Elo :-)

Regarding the regression, it is actually more like the case above than a real regression: we committed a series of patches that individually add something, but when we tested the result the gain was about half of what we expected from summing the gains of the single patches.

So there is no version weaker than a previous one; simply, the latest one is not as strong as we expected. This made us think of a regression somewhere, but now, after all this testing, we start to believe that perhaps what happened is the above: the total gain is not equal to the sum of the single terms!
We are not seeing this behaviour. I get +10 with a change, Tracy gets +10 with a different change, and we generally get +20 when we combine them. If we are both changing related parts of the eval, they might interact, but I have been working on search things, Tracy on eval things, and the changes have added up as expected...
Uri Blass
Posts: 10309
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish - single evasion extensions

Post by Uri Blass »

bob wrote:
mcostalba wrote:
Uri Blass wrote: I think that a possible problem may be that the playing strength is not transitive with small differences.
For sure it is not cumulative!

If change A gives you 10 Elo and change B gives you 8 Elo, then _almost for sure_ the sum of the changes A+B will not give you 18 Elo :-)

Regarding the regression, it is actually more like the case above than a real regression: we committed a series of patches that individually add something, but when we tested the result the gain was about half of what we expected from summing the gains of the single patches.

So there is no version weaker than a previous one; simply, the latest one is not as strong as we expected. This made us think of a regression somewhere, but now, after all this testing, we start to believe that perhaps what happened is the above: the total gain is not equal to the sum of the single terms!
We are not seeing this behaviour. I get +10 with a change, Tracy gets +10 with a different change, and we generally get +20 when we combine them. If we are both changing related parts of the eval, they might interact, but I have been working on search things, Tracy on eval things, and the changes have added up as expected...
I think that the reason is that you test in a different way.

You test Crafty against many different opponents that are not Crafty, and I suspect that the Stockfish team only tests with games between different versions of Stockfish (they may correct me if I am wrong).

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish - single evasion extensions

Post by bob »

Uri Blass wrote:
bob wrote:
mcostalba wrote:
Uri Blass wrote: I think that a possible problem may be that the playing strength is not transitive with small differences.
For sure it is not cumulative!

If change A gives you 10 Elo and change B gives you 8 Elo, then _almost for sure_ the sum of the changes A+B will not give you 18 Elo :-)

Regarding the regression, it is actually more like the case above than a real regression: we committed a series of patches that individually add something, but when we tested the result the gain was about half of what we expected from summing the gains of the single patches.

So there is no version weaker than a previous one; simply, the latest one is not as strong as we expected. This made us think of a regression somewhere, but now, after all this testing, we start to believe that perhaps what happened is the above: the total gain is not equal to the sum of the single terms!
We are not seeing this behaviour. I get +10 with a change, Tracy gets +10 with a different change, and we generally get +20 when we combine them. If we are both changing related parts of the eval, they might interact, but I have been working on search things, Tracy on eval things, and the changes have added up as expected...
I think that the reason is that you test in a different way.

You test Crafty against many different opponents that are not Crafty, and I suspect that the Stockfish team only tests with games between different versions of Stockfish (they may correct me if I am wrong).

Uri
OK, I certainly don't like that kind of testing. I've tried it in the past and found lots of unexpected "issues".
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Stockfish - single evasion extensions

Post by mcostalba »

bob wrote: OK, I certainly don't like that kind of testing. I've tried it in the past and found lots of unexpected "issues".
In the end testing is a compromise between speed and accuracy, and it is the available hardware that sets the bar.

If resources are very limited you even end up testing on a set of tactical positions (as you stated in another post you were doing with Cray Blitz) ;-)

Regarding our development methodology, we cannot afford to spend more than 2 days validating a patch, except for very important and difficult ones, like tweaks to the LMR or futility pruning parameters, which we take extra care with and normally change only once per release.

And testing against a pool of engines doubles the error for a given number of games. For instance, suppose you play a gauntlet of 1000 games with SF_A against 4 opponents, 250 games per opponent. At the end you have a score of, say, 51% +-2.

Now you want to test whether SF_B is better than SF_A, so you repeat the test with SF_B and you get 52% +-2.

What can you conclude from the result? You have to consider _both_ error ranges: the one from the first match _plus_ the one from the second.

Instead, in self-play, after 1000 games you get that, for instance, SF_B scores 51% +-2 against SF_A.

In the second case the probability that SF_B is better than SF_A is higher than in the first case, and you have played half the games!
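
To make the error arithmetic concrete, here is a small Python sketch under simplifying assumptions (every game treated as an independent trial, a plain binomial error estimate that ignores draws; the score_stderr helper and the exact numbers are only illustrative, not part of our framework):

[code]
import math

def score_stderr(score, games):
    """Standard error of a match score (fraction of points), assuming
    independent games and a plain binomial model (draws ignored)."""
    return math.sqrt(score * (1.0 - score) / games)

GAMES = 1000

# Gauntlet method: SF_A and SF_B each play 1000 games against the same pool,
# so the comparison carries the error of *both* measurements.
se_a = score_stderr(0.51, GAMES)            # error on SF_A's 51%
se_b = score_stderr(0.52, GAMES)            # error on SF_B's 52%
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # error on the difference of the two scores

# Self-play method: SF_B plays SF_A directly for 1000 games; the match score
# itself is already the comparison, so only one error range applies.
se_self = score_stderr(0.51, GAMES)

print(f"gauntlet : difference measured with +- {1.96 * 100 * se_diff:.1f}% (95% interval)")
print(f"self-play: difference measured with +- {1.96 * 100 * se_self:.1f}% (95% interval)")
[/code]

For independent measurements the two error ranges combine (in quadrature), so the uncertainty on the gauntlet comparison is clearly larger than the single self-play error range, even though twice as many games were played.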
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Stockfish - single evasion extensions

Post by Sven »

mcostalba wrote:
bob wrote: OK, I certainly don't like that kind of testing. I've tried it in the past and found lots of unexpected "issues".
In the end testing is a compromise between speed and accuracy, and it is the available hardware that sets the bar.

If resources are very limited you even end up testing on a set of tactical positions (as you stated in another post you were doing with Cray Blitz) ;-)

Regarding our development methodology, we cannot afford to spend more than 2 days validating a patch, except for very important and difficult ones, like tweaks to the LMR or futility pruning parameters, which we take extra care with and normally change only once per release.

And testing against a pool of engines doubles the error for a given number of games. For instance, suppose you play a gauntlet of 1000 games with SF_A against 4 opponents, 250 games per opponent. At the end you have a score of, say, 51% +-2.

Now you want to test whether SF_B is better than SF_A, so you repeat the test with SF_B and you get 52% +-2.

What can you conclude from the result? You have to consider _both_ error ranges: the one from the first match _plus_ the one from the second.

Instead, in self-play, after 1000 games you get that, for instance, SF_B scores 51% +-2 against SF_A.

In the second case the probability that SF_B is better than SF_A is higher than in the first case, and you have played half the games!
You have to combine all games of both gauntlets of SF_A and SF_B into one pool and then look at the overall rating results. This is recommended by Rémi Coulom for using BayesElo, and is also used by Bob AFAIK.

Sven
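
If it helps to see the pooling concretely, here is a minimal Python sketch of it, assuming the usual BayesElo interactive commands (readpgn, elo, mm, ratings); the PGN file names, the bayeselo binary name, and the wrapper itself are only placeholders:

[code]
import subprocess

# Pool both gauntlets (SF_A's games and SF_B's games) into one PGN file
# so every engine is rated from the same set of games.
with open("pool.pgn", "wb") as pool:
    for pgn in ("sf_a_gauntlet.pgn", "sf_b_gauntlet.pgn"):  # placeholder names
        with open(pgn, "rb") as f:
            pool.write(f.read())
            pool.write(b"\n")

# Drive a single BayesElo session over the pooled file: read the games,
# enter the elo menu, run the maximum-likelihood fit, print the ratings,
# then exit both menus.
session = "readpgn pool.pgn\nelo\nmm\nratings\nx\nx\n"
result = subprocess.run(["bayeselo"], input=session,
                        capture_output=True, text=True)
print(result.stdout)
[/code]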
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Stockfish - single evasion extensions

Post by Michel »

Sven Schüle wrote: You have to combine all games of both gauntlets of SF_A and SF_B into one pool and then look at the overall rating results. This is recommended by Rémi Coulom for using BayesElo, and is also used by Bob AFAIK.
I think you should simply pool all games you have (including those with prior versions of the engine).
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Stockfish - single evasion extensions

Post by mcostalba »

Sven Schüle wrote: You have to combine all games of both gauntlets of SF_A and SF_B into one pool and then look at the overall rating results. This is recommended by Rémi Coulom for using BayesElo, and is also used by Bob AFAIK.
Apart from the fact that it is not fair to compare a testing session of 1000 games against one of 2000, does it change anything in the overall results?

Let me put it down more clearly:

Hypothesis: Given two versions of an engine, say SF_A and SF_B, and given a fixed number of games to play (say 1000), what is the testing method that maximizes the reliability of the judgment about which of the two engines is stronger?

1) Self-play of 1000 games
2) Gauntlet play against other engines for a total number of 1000 games

I am quite confident (but have no proof, because that requires knowledge of statistics) that it is (1).

Note that in (1) you could also add the (supposedly low) probability that SF_A is stronger than SF_B even though SF_A is weaker against other engines than SF_B. But even with this correction I guess (1) is still the best way, i.e. the way that minimizes the probability of making a bad decision about which engine is stronger.

It would be interesting if Rémi could give some insight into this problem (I think he is the only one among us who could give a contribution beyond "guessing").
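
Not a proof, but here is a toy Monte Carlo sketch of exactly that question in Python. Everything in it is an assumption: games are treated as independent win/loss trials (no draws), SF_B is given the same small true edge (~1% of score) both in self-play and against the pool, and the 1000-game budget is split 500/500 in the gauntlet case. It only illustrates the shape of the comparison, not real statistics:

[code]
import random

def match_score(expected, games):
    """Fraction of points from one match, treating every game as an
    independent win-or-loss trial with the given expected score
    (draws are ignored, which slightly overstates the noise)."""
    return sum(random.random() < expected for _ in range(games)) / games

TRUE_EDGE    = 0.01   # assume SF_B is truly worth ~1% of score over SF_A
BASE_VS_POOL = 0.51   # assume SF_A's true score against the opponent pool
BUDGET       = 1000   # total games available for the whole experiment
TRIALS       = 10000  # Monte Carlo repetitions

self_play_correct = gauntlet_correct = 0
for _ in range(TRIALS):
    # (1) self-play: spend the whole budget on SF_B vs SF_A directly
    if match_score(0.50 + TRUE_EDGE, BUDGET) > 0.50:
        self_play_correct += 1
    # (2) gauntlet: split the budget, 500 games for each version vs the pool,
    #     and pick SF_B only if it scored higher than SF_A did
    if match_score(BASE_VS_POOL + TRUE_EDGE, BUDGET // 2) > match_score(BASE_VS_POOL, BUDGET // 2):
        gauntlet_correct += 1

print("P(picking the truly stronger SF_B) with self-play:", self_play_correct / TRIALS)
print("P(picking the truly stronger SF_B) with gauntlet :", gauntlet_correct / TRIALS)
[/code]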