Reducing testing time by patch combining leverage

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Reducing testing time by patch combining leverage

Post by mcostalba »

It is known that there is a non-linear relation between the ELO difference of two engines and the number of games needed to prove that one engine is stronger than the other.

If engine A is 100 ELO points stronger than engine B then, let's say, 100 games are enough to spot that A is stronger than B. If engine A is 50 ELO points stronger than engine B we could need 300 games to reliably pick the stronger of the two, and if the difference is 20 ELO we probably need thousands of games.
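As a rough back-of-the-envelope sketch of where such numbers come from (only a sketch: I assume the usual logistic relation between ELO and expected score, a per-game standard deviation of about 0.5 points and a ~2 sigma detection threshold, so the exact counts depend on draw rate and on the confidence you want):

Code: Select all

#include <cmath>
#include <cstdio>

// Expected score of A vs B for an ELO difference d (logistic model).
double expected_score(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

// Rough number of games before the expected surplus over 50% exceeds
// about 2 standard errors, assuming ~0.5 points of per-game standard deviation.
double games_needed(double d) {
    double margin = expected_score(d) - 0.5;
    return std::pow(2.0 * 0.5 / margin, 2.0);
}

int main() {
    const double diffs[] = { 100.0, 50.0, 20.0, 10.0 };
    for (double d : diffs)
        std::printf("%5.0f ELO -> expected score %.3f, roughly %6.0f games\n",
                    d, expected_score(d), games_needed(d));
    // Prints roughly: 100 ELO ~ 50 games, 50 ELO ~ 200 games,
    // 20 ELO ~ 1200 games, 10 ELO ~ 4800 games.
}

The required games grow roughly with the inverse square of the ELO difference, which is the non-linearity I mean.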

Patch combining is a proposed technique to reduce the number of games needed to accept or reject a patch.

It works as follows:

Given two engines, A with the patch applied and B without the patch, remove from B a known good feature, i.e. a feature that we know increases the ELO by a known amount, let's say 30 ELO points. The feature must be independent of the patch we are testing; for example, we could simply add a delay in a critical path in engine B.

Now the ELO difference between A and B is no longer X but X + 30.

So if, let's say, X is 20 ELO we will prove that A is stronger than B after, let's say, 300 games, instead of thousands.
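Using the same rough model as above, this is the arithmetic the leverage relies on (again only a sketch, and it assumes the 30 ELO value of the removed feature is exact):

Code: Select all

#include <cmath>
#include <cstdio>

double expected_score(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

double games_needed(double d) {                  // same rough 2 sigma rule as above
    return std::pow(2.0 * 0.5 / (expected_score(d) - 0.5), 2.0);
}

int main() {
    const double X = 20.0, handicap = 30.0;      // the numbers used in this post
    std::printf("direct test of the patch: %2.0f ELO, roughly %5.0f games\n",
                X, games_needed(X));
    std::printf("against handicapped B   : %2.0f ELO, roughly %5.0f games\n",
                X + handicap, games_needed(X + handicap));
    // Roughly 1200 vs 200 games: this gap is where the claimed saving comes from.
}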

It is important to understand that we are not interested in the real ELO difference, but only in knowing whether the patch is good.

I would like to ask the experts here if this method is sound or could lead to false positives.

Thanks
Marco
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reducing testing time by patch combining leverage

Post by hgm »

This is pure nonsense.

The error margins of the rating determinations specify how large the probability is that you get a 'false positive'. These error margins depend on the number of games, and do not change if you add 30 Elo. In fact they are likely to get bigger, as you can never be certain that the removed feature was worth exactly 30 Elo, since that figure was also determined with only a finite number of games. So in fact this trick will drive up the statistical error in the Elo difference, and will therefore make it more difficult to get the chance component of your result below any genuine difference (by playing more games).
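A quick simulation sketch makes this concrete (my simplification: games are independent wins/losses drawn from the logistic expected score, and the handicap is treated as known exactly, which already flatters the method):

Code: Select all

#include <cmath>
#include <cstdio>
#include <random>

// Play N independent win/loss games at expected score p; return the measured score.
double play_match(double p, int N, std::mt19937& rng) {
    std::bernoulli_distribution win(p);
    int wins = 0;
    for (int i = 0; i < N; ++i)
        wins += win(rng);
    return double(wins) / N;
}

double score_of(double elo) { return 1.0 / (1.0 + std::pow(10.0, -elo / 400.0)); }
double elo_of(double score) { return -400.0 * std::log10(1.0 / score - 1.0); }

int main() {
    std::mt19937 rng(12345);
    const int N = 300, trials = 20000;
    const double X = 20.0, handicap = 30.0;

    // Standard deviation of the measured Elo difference over many repeated matches.
    auto spread = [&](double true_elo, bool subtract_handicap) {
        double sum = 0.0, sum2 = 0.0;
        for (int t = 0; t < trials; ++t) {
            double est = elo_of(play_match(score_of(true_elo), N, rng));
            if (subtract_handicap)
                est -= handicap;                 // pretend the 30 Elo is known exactly
            sum += est;
            sum2 += est * est;
        }
        double mean = sum / trials;
        return std::sqrt(sum2 / trials - mean * mean);
    };

    std::printf("plain match, %d games      : sigma ~ %4.1f Elo\n", N, spread(X, false));
    std::printf("handicapped match, %d games: sigma ~ %4.1f Elo\n", N, spread(X + handicap, true));
    // Both come out around 20 Elo: the handicap shifts the mean but not the noise,
    // so the uncertainty on the patch's own worth is unchanged.  If the 30 Elo figure
    // is itself measured from a finite match, its error adds on top of this.
}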
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Hi Marco,

Here is an idea:

Since it takes too many games to measure small changes, why not increase the size of the changes?

E.g., if we modified ALL our test engines to use 300% passed pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have a greater impact on the tournament result].

Thoughts? Nonsense? [I can take it ;) ]

-David
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Reducing testing time by patch combining leverage

Post by bob »

opraus wrote:Hi Marco,

Here is an idea:

Since it takes too many games to measure small changes, why not increase the size of the changes?

E.g., if we modified ALL our test engines to use 300% passed pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have a greater impact on the tournament result].

Thoughts? Nonsense? [I can take it ;) ]

-David
All it will show you is the result of making the scores big. But you are trying to tune small changes (at least I am, as I don't know how to make _big_ rating jumps in my program; a good week is a 10-20 Elo improvement, which takes multiple versions and changes to pull off).
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Hi Robert,

I am not sure my point was clear.

We would _NOT_ be comparing the normal [100%] passer score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g., to the _individual_ elements of the passer score [i.e., blocked, protected, connected, etc.].

In this way, mere 'tweaks', e.g., to the passed_pawn[] array, would result in 3X their normal effect.

-David
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Reducing testing time by patch combining leverage

Post by bob »

opraus wrote:Hi Robert,

I am not sure my point was clear.

We would _NOT_ be comparing the normal [100%] passer score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g., to the _individual_ elements of the passer score [i.e., blocked, protected, connected, etc.].

In this way, mere 'tweaks', e.g., to the passed_pawn[] array, would result in 3X their normal effect.

-David
I understood what you said, but I do not see how it addresses any of the statistical questions. Standard deviation goes down as the number of games goes up. A small number of games provides no real useful information unless the two programs are grossly different in rating. And trying to cripple a feature of one to measure the improvement of another is just adding even more noise...
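To put rough numbers on that (my figures, assuming about 0.5 points of per-game standard deviation and that near 50% each 1% of score is worth about 7 Elo):

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    // Rough 95% error bar on a measured Elo difference near equality:
    // score error ~ 2 * 0.5 / sqrt(N), and near 50% each 1% of score ~ 7 Elo.
    const int counts[] = { 100, 1000, 10000, 100000 };
    for (int N : counts)
        std::printf("%6d games -> about +/- %4.1f Elo\n",
                    N, 2.0 * 0.5 / std::sqrt(double(N)) * 700.0);
    // Roughly +/-70 Elo at 100 games, +/-22 at 1000, +/-7 at 10000, +/-2 at 100000.
}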
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Reducing testing time by patch combining leverage

Post by Dirt »

opraus wrote:Hi Robert,

I am not sure my point was clear.

We would _NOT_ be comparing the normal [100%] passer score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g., to the _individual_ elements of the passer score [i.e., blocked, protected, connected, etc.].

In this way, mere 'tweaks', e.g., to the passed_pawn[] array, would result in 3X their normal effect.

-David
After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Dirt wrote: After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
Again, not sure it is clear [and maybe not even to me].

Bad pseudo-code example:

Code: Select all

passer_score  = passed_pawn[rank];                                 // base bonus by rank
passer_score += PASSER_IS_CONNECTED(sq) * passer_connected[rank];  // connected bonus
passer_score -= PASSER_IS_BLOCKED(sq)   * passer_blocked[rank];    // blocked penalty
// ... other passer terms ...

passer_score *= 3;   // exaggerate the whole passer category in every test engine
Now, the overall passer_score is certainly 'off', being 3X its optimal value, but this means tweaks will/might have 3X the effect also, so that changes to the individual metrics which constitute the passer score can be more easily measured [i.e., requiring fewer games] because they have a larger effect on the score. This effectively amplifies what might ordinarily be a mere 3 Elo change so that it is 9 Elo, and therefore distinguishable in fewer games.
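If the amplification really did carry straight through to the game results (a sketch of the arithmetic only, using the same rough 2 sigma rule as earlier in the thread):

Code: Select all

#include <cmath>
#include <cstdio>

double expected_score(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

double games_needed(double d) {                  // rough 2 sigma detection rule
    return std::pow(2.0 * 0.5 / (expected_score(d) - 0.5), 2.0);
}

int main() {
    std::printf("3 Elo tweak, unamplified: roughly %6.0f games\n", games_needed(3.0));
    std::printf("9 Elo tweak, amplified  : roughly %6.0f games\n", games_needed(9.0));
    // Roughly 54000 vs 6000 games: for small differences the required games scale
    // like 1/Elo^2, so a genuine 3x amplification would mean about 9x fewer games.
}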

@Robert,

Statistical rules stay the same. This is just an attempt to make the 'tweaks' which would otherwise require 10000 games to see only require 3000 games to see. BTW, all the engines could be 'hobbled' with total-passer-score *= 3.
Harald Johnsen

Re: Reducing testing time by patch combining leverage

Post by Harald Johnsen »

You won't win any game with your new eval coefficient.
You can verify that by using a value of 3*325 for a knight (or 0.3*325).


HJ.
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Harald Johnsen wrote:You won't win any game with your new eval coefficient.
You can verify that by using a value of 3*325 for a knight (or 0.3*325).


HJ.
Hi Harald,

Unless maybe you set material = 300% in _all_ engines.

Trying this with positional scores might be better, too.

-David