Reducing testing time by patch combining leverage

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reducing testing time by patch combining leverage

Post by hgm »

opraus wrote:Hi Marco,

Here is an idea:

Since it takes too many games to measure small changes, why not increase the size of the changes?

E.g., if we modified ALL our test engines to use 300% passed-pawn scores, wouldn't the adjustments to individual parts produce 3X results? In effect, this exaggerates the impact of a category of the evaluation, so otherwise small changes become more distinguishable [have greater impact on the tournament result].

Thoughts? Nonsense? [I can take it ;) ]

-David
I think the basic idea is sound: when looking for an optimum in noisy data, it is often much more accurate to determine points where you are the same amount below the optimum on either side, and take the center of those, than to experiment close to the optimum. This is because most optima are parabolic, and thus their dependence on the parameter steepens the further you get from the maximum. And a steeper slope means that a given uncertainty on the vertical (result) scale translates into a smaller error in the parameter value.

Perhaps 300% is overdoing it, but making the weighting factor 10% of where you think the optimum might lie, then increasing it through the 200% range until you get an equally sub-optimal result (say at 210%), and taking the average of the two as the estimate for the best value (110% in this case), would need several orders of magnitude fewer games than figuring out which of the values 100%, 110% and 120% is best by testing those values themselves.
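
In case the procedure is not clear, here is a minimal numerical sketch of it in C. The quadratic Elo model and its constants are invented for illustration; the point is only that far from the optimum the slope is steep, so a given Elo measurement error translates into a small parameter error, and the midpoint of two equal-drop points recovers the optimum.

Code:

#include <math.h>
#include <stdio.h>

/* Invented quadratic strength model: Elo drop below the optimum.
   OPT and K are made-up numbers, for illustration only. */
#define OPT 100.0  /* true optimal weighting factor, in percent */
#define K     0.01 /* curvature of the Elo parabola             */

static double elo(double w) { return -K * (w - OPT) * (w - OPT); }

int main(void) {
    double low    = 10.0;     /* deliberately far below the optimum */
    double target = elo(low); /* Elo drop we must match from above  */

    /* Scan upward for the high-side point with the same Elo drop. */
    double high = OPT;
    while (elo(high) > target) high += 1.0;

    printf("equal-drop points: %.0f%% and %.0f%%\n", low, high);
    printf("midpoint estimate of the optimum: %.0f%%\n", (low + high) / 2.0);

    /* Error leverage: an Elo measurement error e maps to a parameter
       error of e / |slope|, and the slope 2*K*(w - OPT) grows with the
       distance from the optimum. */
    double e = 1.0; /* 1 Elo of measurement noise */
    printf("parameter error near the optimum (w = %.0f): %.1f%%\n",
           OPT + 5.0, e / (2.0 * K * 5.0));
    printf("parameter error far away         (w = %.0f): %.1f%%\n",
           high, e / (2.0 * K * (high - OPT)));
    return 0;
}

With these numbers the equal-drop points come out at 10% and 190%, the midpoint at the true 100%, and the same 1 Elo of noise costs 10 percentage points of parameter accuracy near the optimum but well under one point at the far bracketing point.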
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

hgm wrote: I think the basic idea is sound: when looking for an optimum in noisy data, it is often much more accurate to determine points where you are the same amount below the optimum on either side, and take the center of those, than to experiment close to the optimum. This is because most optima are parabolic, and thus their dependence on the parameter steepens the further you get from the maximum. And a steeper slope means that a given uncertainty on the vertical (result) scale translates into a smaller error in the parameter value.

Perhaps 300% is overdoing it, but making the weighting factor 10% of where you think the optimum might lie, then increasing it through the 200% range until you get an equally sub-optimal result (say at 210%), and taking the average of the two as the estimate for the best value (110% in this case), would need several orders of magnitude fewer games than figuring out which of the values 100%, 110% and 120% is best by testing those values themselves.
Hi H.G.,

I still remember that bit of bit-twiddling code you helped me with. Thanks again [though you may not remember it].

I understand your point. It is also interesting.

Especially liked the part about my idea being sound. :)

But I am not sure it is clear. [And I am beginning to doubt my powers of communication.]

I simplify for my own benefit.

Let's take pawn_structure as an example.

Suppose we wanted to know the best weight for pawn_structure:doubled.

Trying, e.g., 8, 10, 12, 14, etc. would be too hard. It is likely that 2 cP changes would hardly produce _ANY_ measurable Elo difference at all. But let's imagine it might be +-3 Elo.

3 Elo takes too many games to verify.

But if, at the end of eval(), we magnified the pawn-structure category by, say, 3X, then it might be possible to distinguish our tweaks to doubled, because they would seem to have a +-9 Elo effect.

My assumption/proposal is:

That tweaks within an AMPLIFIED category of the eval [category = KS, pawn structure, activity, e.g.] might have amplified effects on the results [enabling us to better distinguish otherwise small changes with fewer games, because THEY are bigger and require less resolution].
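
To make the arithmetic concrete, here is a toy C fragment with made-up centipawn numbers: a 2 cP tweak to the doubled-pawn penalty inside a 3X-amplified pawn-structure category shows up as a 6 cP swing in the final eval. Whether that mechanical 3X on the score also becomes 3X on the Elo is exactly the open question.

Code:

#include <stdio.h>

int main(void) {
    /* Made-up numbers: a 2 cP tweak to the doubled-pawn penalty. */
    int doubled_before = 10, doubled_after = 12;
    int amp = 3; /* category multiplier applied at the end of eval() */

    int swing_plain = doubled_after - doubled_before;         /* 2 cP */
    int swing_amped = amp * (doubled_after - doubled_before); /* 6 cP */

    printf("score swing without amplification: %d cP\n", swing_plain);
    printf("score swing with 3X amplification: %d cP\n", swing_amped);
    return 0;
}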

-David
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: Reducing testing time by patch combining leverage

Post by mathmoi »

opraus wrote:But if, at the end of eval(), we magnified the pawn-structure category by, say, 3X, then it might be possible to distinguish our tweaks to doubled, because they would seem to have a +-9 Elo effect
Why do you think that magnifying a category of your evaluation will make the rating difference bigger?
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reducing testing time by patch combining leverage

Post by hgm »

What is missing in your example is that you are looking for an optimum, and that usually the strength is a smooth function of the parameters. So if the optimal value of the doubled-pawn penalty were 16 cP, it is not the case that setting it to 14 cP or 18 cP will give you -3 Elo, 12 or 20 cP will give you -6 Elo, and 10 or 22 cP will give you -9 Elo. Instead you will typically see -1 Elo for 14 and 18 cP, -4 Elo for 12 and 20 cP, -9 Elo for 10 or 22 cP, etc., up to -49 for 2 or 30 cP. There will be a quadratic dependence near the optimum.

So if you directly tested 14, 16 and 18 cP, you would have to resolve a difference of 1 Elo. OTOH, if you tried 2 cP, knowing it is too low, and tested it against 26, 30 and 34 cP, which experience Elo drops of -25, -49 and -81, you would have to resolve differences of 24 or 32 Elo in order to figure out which high setting is equally strong as 2 cP. And to resolve 32 Elo requires 32*32 = 1024 times fewer games than to resolve 1 Elo.

So it is comparatively easy to conclude that 34 cP is weaker than 2 cP, but 30 cP is about equally strong. So you would be pretty sure that the best value is about (2 + 30)/2 = 16 cP, and not (2 + 34)/2 = 18 cP, while concluding directly that 18 cP is worse than 16 cP would be a hopeless task.
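
In formula form, the drop table above is just Elo(x) = -((x - 16)/2)^2 with x the penalty in cP. A small C check of those numbers, and of the games ratio, assuming the usual rule that the number of games needed to resolve a difference scales as one over its square:

Code:

#include <stdio.h>

/* Quadratic model behind the numbers above: drop = -((x - 16)/2)^2 Elo,
   with x the doubled-pawn penalty in cP and 16 cP the optimum. */
static double drop(int x) { double d = (x - 16) / 2.0; return -d * d; }

int main(void) {
    int xs[] = {2, 10, 12, 14, 16, 18, 20, 22, 26, 30, 34};
    for (int i = 0; i < 11; i++)
        printf("%2d cP -> %3.0f Elo\n", xs[i], drop(xs[i]));

    /* Games needed scale as 1/diff^2, so resolving 32 Elo instead of
       1 Elo takes 32*32 = 1024 times fewer games. */
    printf("games ratio: %.0f\n", (32.0 * 32.0) / (1.0 * 1.0));
    return 0;
}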
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

mathmoi wrote:
opraus wrote:But if, at the end of eval(), we magnified the pawn-structure category by, say, 3X, then it might be possible to distinguish our tweaks to doubled, because they would seem to have a +-9 Elo effect
Why do you think that magnifying a category of your evaluation will make the rating difference bigger?
Hi Mathieu,

Good question! I don't know. Will it? That is the very question being asked.

I 'think' it might, because it shifts the 'significance' of that portion of the scoring toward those metrics. The games will 'tend' to be 'more about' the pawn structure, and the tweaks will be more significant.

I very well could be out in left field on this one ....

-David
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

hgm wrote:What is missing in your example is that you are looking for an optimum ...
It is not missing. We would try various values 'looking for the optimum' with this 'Magnifying Glass' mechanism in place.

You would still need to run trials on a range of values, of course.

-David
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Reducing testing time by patch combining leverage

Post by bob »

opraus wrote:
Dirt wrote: After you triple the bonus for a passed pawn, I see no reason to believe the optimal additional score for a protected passer would stay the same, or would even change proportionately.
Again, I am not sure it is clear. [Maybe not even to me.]

Bad pseudo-code example:

Code:

passer_score  = passed_pawn[rank];                                /* base bonus by rank */
passer_score += PASSER_IS_CONNECTED(sq) * passer_connected[rank]; /* connected bonus    */
passer_score -= PASSER_IS_BLOCKED(sq)   * passer_blocked[rank];   /* blocked penalty    */
/* ... more passer terms ... */

passer_score *= 3;                       /* amplify the whole passer category */
Now, the over-all passer_score is certainly 'off', being 3X its optimal value, but this means tweaks will/might have 3X the effect also, so that changes to the individual metrics which constitute passer_score can be more easily measured [i.e., requiring fewer games] because they have a larger effect on the score. Effectively, this amplifies what might ordinarily be a mere 3 Elo change so that it is 9 Elo, and therefore distinguishable in fewer games.

@Robert,

Statistical rules stay the same. This is just an attempt to make the 'tweaks' which otherwise require 10000 games to see only require 3000 games to see. BTW, all the engines could be 'hobbled' with total-passer-score *= 3.
The problem is that if you multiply one (close to correct) score by 3, then any tuning to other parameters is going to be biased, and they will probably score better as _they_ are increased as well. And when you restore the original parameter, now the newly adjusted ones are wrong...
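
A toy illustration of that bias, assuming an invented quadratic strength model with an interaction term between the passer weight p and some other weight m (not any real engine's eval, just the simplest model in which the effect shows up):

Code:

#include <stdio.h>

/* Invented model: strength = -(p-P0)^2 - (m-M0)^2 - C*(p-P0)*(m-M0),
   so the passer weight p and the other weight m interact via C. */
#define P0 16.0  /* true optimal passer weight, cP */
#define M0 10.0  /* true optimal other weight, cP  */
#define C   0.25 /* interaction strength           */

/* For a fixed p, setting the derivative w.r.t. m to zero gives the
   best m: m*(p) = M0 - (C/2) * (p - P0). */
static double best_m(double p) { return M0 - (C / 2.0) * (p - P0); }

int main(void) {
    printf("best m with p at its optimum (%2.0f): %.1f cP\n", P0, best_m(P0));
    printf("best m with p tripled        (%2.0f): %.1f cP\n",
           3.0 * P0, best_m(3.0 * P0));
    printf("after restoring p, the tuned m is off by %.1f cP\n",
           best_m(3.0 * P0) - best_m(P0));
    return 0;
}

Tuning m under the tripled passer weight moves its apparent optimum away from M0, and that shift survives when the passer weight is put back.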
opraus
Posts: 166
Joined: Wed Mar 08, 2006 9:49 pm
Location: S. New Jersey, USA

Re: Reducing testing time by patch combining leverage

Post by opraus »

bob wrote: The problem is that if you multiply one (close to correct) score by 3, then any tuning to other parameters is going to be biased, and they will probably score better as _they_ are increased as well. And when you restore the original parameter, now the newly adjusted ones are wrong...
1. The 3X would be applied to a whole category of metrics, all under the same head: e.g., King Safety [with all your individual metrics, e.g., shield, tropism, open lines].

2. No tuning is attempted on the other categories, only on the metrics falling under KS in this scenario.

The question is: will tweaks have an amplified effect when applied to a 'magnified' category of metrics, such that the tweaks might be more easily distinguished?

If King Safety [the whole lot of metrics combined] is *= 3, will tweaks to, e.g., pawn-storm or Q-tropism be more prominent, and therefore easier to measure?
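
For concreteness, a minimal C sketch of that setup, with hypothetical king-safety metrics and weights (none of these names or numbers come from a real engine):

Code:

#include <stdio.h>

/* Hypothetical king-safety metrics and weights, for illustration only. */
static int shield_score(void)  { return -12; } /* broken pawn shield  */
static int tropism_score(void) { return  -8; } /* enemy queen tropism */
static int storm_score(void)   { return  -5; } /* pawn-storm pressure */

#define KS_AMP 3 /* amplification applied to the whole category */

static int king_safety(void) {
    int ks = 0;
    ks += shield_score();  /* each individual metric is summed as usual */
    ks += tropism_score();
    ks += storm_score();
    return KS_AMP * ks;    /* the *= 3 from the proposal: a 1 cP tweak to
                              any metric moves the final score by 3 cP */
}

int main(void) {
    printf("amplified king safety: %d cP\n", king_safety());
    return 0;
}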

-David