It is known that there is a non-linear relation between the Elo difference of two engines and the number of games needed to prove that one engine is stronger than the other.
If engine A is 100 Elo points stronger than engine B then, let’s say, 100 games are enough to spot that A is stronger than B. If engine A is 50 Elo points stronger than engine B, we could need 300 games to reliably pick the stronger of the two; if the difference is 20 Elo, we probably need thousands of games.
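A rough sketch of that relation, assuming the usual logistic Elo curve and a normal approximation of the match score. The function names, the threshold `z = 1.96` and the `draw_ratio` parameter are illustrative assumptions, not something stated in this thread:

```python
import math

def expected_score(elo_diff):
    """Expected score per game for the stronger engine (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, z=1.96, draw_ratio=0.0):
    """Rough number of games before the mean score sits z standard errors above 50%.

    elo_diff must be positive; z and draw_ratio are assumed, illustrative values.
    draw_ratio must leave room for losses, i.e. draw_ratio <= 2 * (1 - mean score).
    """
    mu = expected_score(elo_diff)
    # Each game scores 1, 0.5 or 0; choose the win share so that the mean is mu.
    win = mu - 0.5 * draw_ratio
    variance = win + 0.25 * draw_ratio - mu * mu
    return math.ceil(z * z * variance / (mu - 0.5) ** 2)

for diff in (100, 50, 20):
    print(diff, games_needed(diff))
```

Under these assumptions the count grows roughly with the inverse square of the Elo difference: a few dozen games at 100 Elo, a couple of hundred at 50 Elo, and over a thousand at 20 Elo. A stricter threshold raises all three figures.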
Patch combining is a proposed technique to reduce the number of games needed to accept or reject a patch.
It works as follows:
Given two engines, A with the patch applied and B without it, remove from B a known good feature, i.e. a feature that we know increases strength by a known amount, let’s say 30 Elo points. The feature must be independent of the patch we are testing; for example, we could simply add a delay in a critical path in engine B.
Now the Elo difference between A and B is no longer X but X + 30.
So if, let’s say, X is 20 Elo, we will prove that A is stronger than B after, let’s say, 300 games instead of thousands.
It is important to understand that we are not interested in the real Elo difference, only in knowing whether the patch is good.
I would like to ask the experts here if this method is sound or could lead to false positives.
Thanks
Marco
Reducing testing time by patch combining leverage
Moderators: hgm, Rebel, chrisw
-
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
-
- Posts: 27796
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reducing testing time by patch combining leverage
This is pure nonsense.
The error margins of the rating determinations specify how large the probability is that you get a 'false positive'. These error margins depend on the number of games, and do not change if you add 30 Elo. In fact they are likely to get bigger, as you can never be certain that the known feature was worth exactly 30 Elo, since that was also determined with only a finite number of games. So in fact this trick will drive up the statistical error in the Elo difference, and will therefore make it more difficult to get the chance component of your result below any genuine difference (by playing more games).
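The argument above can be put into numbers with a minimal sketch. It assumes near-equal engines, a normal approximation, and independent errors in the two measurements; the function names and the `draw_ratio` parameter are hypothetical:

```python
import math

def elo_standard_error(n_games, draw_ratio=0.0):
    """Approximate standard error, in Elo, of a match between near-equal engines."""
    score_variance = 0.25 * (1.0 - draw_ratio)  # per-game variance at a 50% mean score
    elo_per_score = 1600.0 / math.log(10.0)     # slope of the Elo curve at 50%
    return elo_per_score * math.sqrt(score_variance / n_games)

def patch_standard_error(n_games_test, n_games_handicap, draw_ratio=0.0):
    """Standard error of X = (measured A vs handicapped B) - (estimated handicap)."""
    se_test = elo_standard_error(n_games_test, draw_ratio)
    se_handicap = elo_standard_error(n_games_handicap, draw_ratio)
    # Independent errors add in quadrature, so the result is never below se_test.
    return math.hypot(se_test, se_handicap)

print(elo_standard_error(1000))          # error of a direct A vs B match
print(patch_standard_error(1000, 1000))  # error after subtracting the handicap
```

The first function has no term for the handicap at all, which is the point made above: shifting the difference by 30 Elo does not shrink the error bar. The second is always at least as large as the first, because the handicap's own uncertainty is added on top.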
-
- Posts: 166
- Joined: Wed Mar 08, 2006 9:49 pm
- Location: S. New Jersey, USA
Re: Reducing testing time by patch combining leverage
Hi Marco,
Here is an idea:
Since it takes too many games to measure small changes, why not increase the size of the changes?
E.g., if we modified ALL our test engines to use 300% passed pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have greater impact on the tournament result].
Thoughts? Nonsense? [I can take it]
-David
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Reducing testing time by patch combining leverage
opraus wrote:
Hi Marco,
Here is an idea:
Since it takes too many games to measure small changes, why not increase the size of the changes?
E.g., if we modified ALL our test engines to use 300% passed pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have greater impact on the tournament result].
Thoughts? Nonsense? [I can take it]
-David
All it will show you is the result of making the scores big. But you are trying to tune small changes (at least I am, as I don't know how to make _big_ rating jumps in my program). A good week is a 10-20 Elo improvement, which takes multiple versions and changes to pull off.
-
- Posts: 166
- Joined: Wed Mar 08, 2006 9:49 pm
- Location: S. New Jersey, USA
Re: Reducing testing time by patch combining leverage
Hi Robert,
I am not sure my point was clear.
We would _NOT_ be comparing the normal [100%] passer-score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g. to the _individual_ elements of the passer score [i.e. blocked, protected, connected, etc.].
In this way, mere 'tweaks', e.g. to the passed_pawn[] array, would result in 3X their normal effect.
-David
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Reducing testing time by patch combining leverage
opraus wrote:
Hi Robert,
I am not sure my point was clear.
We would _NOT_ be comparing the normal [100%] passer-score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g. to the _individual_ elements of the passer score [i.e. blocked, protected, connected, etc.].
In this way, mere 'tweaks', e.g. to the passed_pawn[] array, would result in 3X their normal effect.
-David
I understood what you said, but I do not see how it addresses any of the statistical questions. Standard deviation goes down as the number of games goes up. A small number of games provides no really useful information unless the two programs are grossly different in rating. And trying to cripple a feature of one to measure the improvement of another is just adding even more noise...
-
- Posts: 2851
- Joined: Wed Mar 08, 2006 10:01 pm
- Location: Irvine, CA, USA
Re: Reducing testing time by patch combining leverage
opraus wrote:
Hi Robert,
I am not sure my point was clear.
We would _NOT_ be comparing the normal [100%] passer-score to [300%], but rather using 300% throughout the tests, so as to exaggerate changes, e.g. to the _individual_ elements of the passer score [i.e. blocked, protected, connected, etc.].
In this way, mere 'tweaks', e.g. to the passed_pawn[] array, would result in 3X their normal effect.
-David
After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
-
- Posts: 166
- Joined: Wed Mar 08, 2006 9:49 pm
- Location: S. New Jersey, USA
Re: Reducing testing time by patch combining leverage
Dirt wrote: After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
Again, not sure it is clear [and maybe not to me either].
Bad pseudo-code example:
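Code:
/* Base bonus for a passed pawn on this rank. */
passer_score = passed_pawn[rank];
/* Add a bonus if the passer is connected, subtract a penalty if it is blocked. */
passer_score += PASSER_IS_CONNECTED(sq) * passer_connected[rank];
passer_score -= PASSER_IS_BLOCKED(sq) * passer_blocked[rank];
....
/* Exaggerate the whole passer term so that tweaks to its parts weigh 3X as much. */
passer_score *= 3;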
@Robert,
Statistical rules stay the same. This is just an attempt to make the 'tweaks' which otherwise require 10000 games to see, only require 3000 games to see. BTW, all the engines could be 'hobbled' with total-passer-score *= 3.
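As a back-of-the-envelope check on those game counts: under the standard-error reasoning used earlier in the thread, the number of games needed scales with the inverse square of the effect being measured. The sketch below assumes, purely for illustration, that tripling the weights really triples the Elo effect of a tweak and leaves per-game variance unchanged, which other replies here dispute. The function names are hypothetical:

```python
import math

def games_after_scaling(baseline_games, effect_scale):
    """Games needed at the same confidence if the measured effect is multiplied."""
    # The standard error shrinks as 1/sqrt(N), so N shrinks as 1/effect_scale**2.
    return baseline_games / effect_scale ** 2

def scale_for_target(baseline_games, target_games):
    """Effect multiplier needed to cut baseline_games down to target_games."""
    return math.sqrt(baseline_games / target_games)

print(round(games_after_scaling(10000, 3)))     # 1111
print(round(scale_for_target(10000, 3000), 2))  # 1.83
```

So if the 3X assumption held, 10000 games would drop to roughly 1100 rather than 3000. Whether the effect actually scales that way is the open question in this thread.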
Re: Reducing testing time by patch combining leverage
You won't win any game with your new eval coefficient.
You can verify that by using a value of 3*325 for a knight (or 0.3*325).
HJ.
-
- Posts: 166
- Joined: Wed Mar 08, 2006 9:49 pm
- Location: S. New Jersey, USA
Re: Reducing testing time by patch combining leverage
Harald Johnsen wrote:
You won't win any game with your new eval coefficient.
You can verify that by using a value of 3*325 for a knight (or 0.3*325).
HJ.
Hi Harald,
Unless maybe you set material = 300% in _all_ engines.
Trying this with Positional scores might be better, too.
-David