Reducing testing time by patch combining leverage

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Reducing testing time by patch combining leverage

Post by mcostalba »

It is known that there is a non-linear relation between the ELO difference of two engines and the number of games needed to prove that one engine is stronger than the other.

If engine A is 100 ELO points stronger than engine B then, let's say, 100 games are enough to spot that A is stronger than B. If engine A is 50 ELO points stronger than engine B we could need 300 games to reliably pick the stronger of the two, and if the difference is 20 ELO we probably need thousands of games.
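As a rough back-of-the-envelope sketch of where such numbers come from (only a sketch: I assume the usual logistic relation between ELO and expected score, a per-game standard deviation of about 0.5 points and a ~2 sigma detection threshold, so the exact counts depend on draw rate and on the confidence you want):

Code: Select all

#include <cmath>
#include <cstdio>

// Expected score of A vs B for an ELO difference d (logistic model).
double expected_score(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

// Rough number of games before the expected surplus over 50% exceeds
// about 2 standard errors, assuming ~0.5 points of per-game standard deviation.
double games_needed(double d) {
    double margin = expected_score(d) - 0.5;
    return std::pow(2.0 * 0.5 / margin, 2.0);
}

int main() {
    const double diffs[] = { 100.0, 50.0, 20.0, 10.0 };
    for (double d : diffs)
        std::printf("%5.0f ELO -> expected score %.3f, roughly %6.0f games\n",
                    d, expected_score(d), games_needed(d));
    // Prints roughly: 100 ELO ~ 50 games, 50 ELO ~ 200 games,
    // 20 ELO ~ 1200 games, 10 ELO ~ 4800 games.
}

The required games grow roughly with the inverse square of the ELO difference, which is the non-linearity I mean.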

Patch combining is a proposed technique to reduce the number of games needed to accept or reject a patch.

It works as follows:

Given two engines, A with the patch applied and B without the patch, remove from B a known good feature, i.e. a feature that we know increases the ELO by a known amount, let's say 30 ELO points. The feature must be independent of the patch we are testing; for example, we could simply add a delay in a critical path in engine B.

Now the ELO difference between A and B is no longer X but X + 30.

So if, let's say, X is 20 ELO we will prove that A is stronger than B after, let's say, 300 games, instead of thousands.
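Using the same rough model as above, this is the arithmetic the leverage relies on (again only a sketch, and it assumes the 30 ELO value of the removed feature is exact):

Code: Select all

#include <cmath>
#include <cstdio>

double expected_score(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

double games_needed(double d) {                  // same rough 2 sigma rule as above
    return std::pow(2.0 * 0.5 / (expected_score(d) - 0.5), 2.0);
}

int main() {
    const double X = 20.0, handicap = 30.0;      // the numbers used in this post
    std::printf("direct test of the patch: %2.0f ELO, roughly %5.0f games\n",
                X, games_needed(X));
    std::printf("against handicapped B   : %2.0f ELO, roughly %5.0f games\n",
                X + handicap, games_needed(X + handicap));
    // Roughly 1200 vs 200 games: this gap is where the claimed saving comes from.
}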

It is important to understand that we are not interested in the real ELO difference, but only in knowing whether the patch is good.

I would like to ask the experts here if this method is sound or could lead to false positives.

Thanks
Marco
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reducing testing time by patch combining leverage

Post by hgm »

This is pure nonsense.

The error margins of the rating determinations specify how large the probability is that you get a 'false positive'. These error margins depend on the number of games, and do not change if you add 30 Elo. In fact they are likely to get bigger, as you can never be certain that the removed feature was worth exactly 30 Elo, since that figure was also determined with only a finite number of games. So in fact this trick will drive up the statistical error in the Elo difference, and will therefore make it more difficult to get the chance component of your result below any genuine difference (by playing more games).
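A quick simulation sketch makes this concrete (my simplification: games are independent wins/losses drawn from the logistic expected score, and the handicap is treated as known exactly, which already flatters the method):

Code: Select all

#include <cmath>
#include <cstdio>
#include <random>

// Play N independent win/loss games at expected score p; return the measured score.
double play_match(double p, int N, std::mt19937& rng) {
    std::bernoulli_distribution win(p);
    int wins = 0;
    for (int i = 0; i < N; ++i)
        wins += win(rng);
    return double(wins) / N;
}

double score_of(double elo) { return 1.0 / (1.0 + std::pow(10.0, -elo / 400.0)); }
double elo_of(double score) { return -400.0 * std::log10(1.0 / score - 1.0); }

int main() {
    std::mt19937 rng(12345);
    const int N = 300, trials = 20000;
    const double X = 20.0, handicap = 30.0;

    // Standard deviation of the measured Elo difference over many repeated matches.
    auto spread = [&](double true_elo, bool subtract_handicap) {
        double sum = 0.0, sum2 = 0.0;
        for (int t = 0; t < trials; ++t) {
            double est = elo_of(play_match(score_of(true_elo), N, rng));
            if (subtract_handicap)
                est -= handicap;                 // pretend the 30 Elo is known exactly
            sum += est;
            sum2 += est * est;
        }
        double mean = sum / trials;
        return std::sqrt(sum2 / trials - mean * mean);
    };

    std::printf("plain match, %d games      : sigma ~ %4.1f Elo\n", N, spread(X, false));
    std::printf("handicapped match, %d games: sigma ~ %4.1f Elo\n", N, spread(X + handicap, true));
    // Both come out around 20 Elo: the handicap shifts the mean but not the noise,
    // so the uncertainty on the patch's own worth is unchanged.  If the 30 Elo figure
    // is itself measured from a finite match, its error adds on top of this.
}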
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Hi Marco,

Here is an idea:

Since it takes too many games to measure small changes, why not increase the size of the changes?

E.g., if we modified ALL our test engines to use 300% passed pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have a greater impact on the tournament result].

Thoughts? Nonsense? [I can take it ;) ]

-David
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Reducing testing time by patch combining leverage

Post by bob »

opraus wrote:Hi Marco,

Here is an idea:

Since it takes too many games to measure small changes, why not increase the size of the changes?

E.g., if we modified ALL our test engines to use 300% passed pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have a greater impact on the tournament result].

Thoughts? Nonsense? [I can take it ;) ]

-David
All it will show you is the result of making the scores big. But you are trying to tune small changes (at least I am, as I don't know how to make _big_ rating jumps in my program; a good week is a 10-20 Elo improvement, which takes multiple versions and changes to pull off).
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Hi Robert,

I am not sure my point was clear.

We would _NOT_ be comparing the normal [100%] passer score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g., to the _individual_ elements of the passer score [i.e., blocked, protected, connected, etc.].

In this way, mere 'tweaks', e.g., to the passed_pawn[] array, would result in 3X their normal effect.

-David
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Reducing testing time by patch combining leverage

Post by bob »

opraus wrote:Hi Robert,

I am not sure my point was clear.

We would _NOT_ be comparing the normal [100%] passer score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g., to the _individual_ elements of the passer score [i.e., blocked, protected, connected, etc.].

In this way, mere 'tweaks', e.g., to the passed_pawn[] array, would result in 3X their normal effect.

-David
I understood what you said, but I do not see how it addresses any of the statistical questions. Standard deviation goes down as the number of games goes up. A small number of games provides no real useful information unless the two programs are grossly different in rating. And trying to cripple a feature of one to measure the improvement of another is just adding even more noise...
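To put rough numbers on that (my figures, assuming about 0.5 points of per-game standard deviation and that near 50% each 1% of score is worth about 7 Elo):

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    // Rough 95% error bar on a measured Elo difference near equality:
    // score error ~ 2 * 0.5 / sqrt(N), and near 50% each 1% of score ~ 7 Elo.
    const int counts[] = { 100, 1000, 10000, 100000 };
    for (int N : counts)
        std::printf("%6d games -> about +/- %4.1f Elo\n",
                    N, 2.0 * 0.5 / std::sqrt(double(N)) * 700.0);
    // Roughly +/-70 Elo at 100 games, +/-22 at 1000, +/-7 at 10000, +/-2 at 100000.
}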
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Reducing testing time by patch combining leverage

Post by Dirt »

opraus wrote:Hi Robert,

I am not sure my point was clear.

We would _NOT_ be comparing the normal [100%] passer score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g., to the _individual_ elements of the passer score [i.e., blocked, protected, connected, etc.].

In this way, mere 'tweaks', e.g., to the passed_pawn[] array, would result in 3X their normal effect.

-David
After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Dirt wrote: After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
Again, not sure it is clear [and maybe not even to me].

Bad pseudo-code example:

Code: Select all

passer_score  = passed_pawn[rank];                                 // base bonus by rank
passer_score += PASSER_IS_CONNECTED(sq) * passer_connected[rank];  // connected bonus
passer_score -= PASSER_IS_BLOCKED(sq)   * passer_blocked[rank];    // blocked penalty
// ... other passer terms ...

passer_score *= 3;   // exaggerate the whole passer category in every test engine
Now, the overall passer_score is certainly 'off', being 3X its optimal value, but this means tweaks will/might have 3X the effect also, so that changes to the individual metrics which constitute the passer score can be more easily measured [i.e., requiring fewer games] because they have a larger effect on the score. This effectively amplifies what might ordinarily be a mere 3 Elo change so that it is 9 Elo, and therefore distinguishable in fewer games.
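If the amplification really did carry straight through to the game results (a sketch of the arithmetic only, using the same rough 2 sigma rule as earlier in the thread):

Code: Select all

#include <cmath>
#include <cstdio>

double expected_score(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

double games_needed(double d) {                  // rough 2 sigma detection rule
    return std::pow(2.0 * 0.5 / (expected_score(d) - 0.5), 2.0);
}

int main() {
    std::printf("3 Elo tweak, unamplified: roughly %6.0f games\n", games_needed(3.0));
    std::printf("9 Elo tweak, amplified  : roughly %6.0f games\n", games_needed(9.0));
    // Roughly 54000 vs 6000 games: for small differences the required games scale
    // like 1/Elo^2, so a genuine 3x amplification would mean about 9x fewer games.
}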

@Robert,

Statistical rules stay the same. This is just an attempt to make the 'tweaks' which would otherwise require 10000 games to see only require 3000 games to see. BTW, all the engines could be 'hobbled' with total-passer-score *= 3.
Harald Johnsen

Re: Reducing testing time by patch combining leverage

Post by Harald Johnsen »

You won't win any game with your new eval coefficient.
You can verify that by using a value of 3*325 for a knight (or 0.3*325).


HJ.
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

Harald Johnsen wrote:You won't win any game with your new eval coefficient.
You can verify that by using a value of 3*325 for a knight (or 0.3*325).


HJ.
Hi Harald,

Unless maybe you set material = 300% in _all_ engines.

Trying this with positional scores might be better, too.

-David