CLOP slides

Discussion of chess software programming and technical issues.


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: CLOP slides

Post by Laskos »

Laskos wrote:
Rémi Coulom wrote:
Laskos wrote:To exemplify quickly, we have a theoretical curve to play with for finding a global maximum:
Yes, CLOP would probably miss the global optimum with such a function.

But I don't worry about it, because I don't expect such function shapes will occur when tuning parameters of a game-playing program.

An algorithm such as UCT would find the global optimum. But such global optimization methods don't scale well to more than a few dimensions.

Rémi
In several dimensions, are you doing a multi-dimensional quadratic regression? I haven't read the CLOP documentation, but if it is doing that, the result will almost certainly settle on the broadest local extremum, not on the global one. In many dimensions there will almost always be a hell of a lot of local extrema, even if each 1-dimensional function is very smooth. UCT is not exactly what I meant, as it scales badly with the number of dimensions; I was thinking rather of something like VEGAS by Lepage, an adaptive recursive method where the data points are concentrated in the regions close to the extrema. It scales extremely well with the number of dimensions, linearly or with a power close to 1, if I remember correctly.

Kai
To show quickly how ubiquitous several extrema become in more dimensions, I am using a simple example in 2 dimensions. The theoretical function is separable, C(X,Y)=A(X)*B(Y), with A and B each having only one extremum and being pretty smooth. This, I think, is a realistic sort of interplay for global tuning of chess parameters. The resulting C(X,Y) is shown here:

[Image: plot of the theoretical function C(X,Y)]

It has one sharp global peak and a broad local peak. Adding white Gaussian noise, this gives

[Image: C(X,Y) with white Gaussian noise added]

and performing a two-dimensional quadratic regression on this noisy data, the fitted function is

[Image: two-dimensional quadratic regression fit to the noisy data]

The peak reproduced is the broad local one, not the global one. This is just one example of how, starting from an apparently innocuous function separable into smooth one-dimensional factors, one gets into this situation. In many dimensions these problems are almost unavoidable.

I don't know whether that is what CLOP is doing, but in any case this is a brief example of what could happen.
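For anyone who wants to reproduce the flavour of this experiment, here is a minimal sketch in Python/NumPy. The exact A and B from the plots are not given in the post, so the surface below is only a stand-in with the same qualitative features (one sharp global peak, one broad lower hill); the point is that an ordinary least-squares quadratic fit puts its maximum on the broad hill.

Code: Select all

import numpy as np

rng = np.random.default_rng(0)

# Stand-in surface: sharp global peak at (0.7, 0.7), broad lower hill at (0.2, 0.3)
def C(x, y):
    sharp = 1.0 * np.exp(-((x - 0.7) ** 2 + (y - 0.7) ** 2) / 0.005)
    broad = 0.6 * np.exp(-((x - 0.2) ** 2 + (y - 0.3) ** 2) / 0.15)
    return sharp + broad

# Noisy samples on the unit square
n = 5000
x, y = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
z = C(x, y) + rng.normal(0, 0.3, n)

# Ordinary least-squares fit of a full 2-D quadratic: 1, x, y, x^2, x*y, y^2
X = np.column_stack([np.ones(n), x, y, x * x, x * y, y * y])
c, *_ = np.linalg.lstsq(X, z, rcond=None)

# Stationary point of the fitted quadratic (solve grad = 0)
b = c[1:3]
H = np.array([[2 * c[3], c[4]],
              [c[4], 2 * c[5]]])
print("fitted maximum:", np.linalg.solve(H, -b))  # lands near the broad hill, not (0.7, 0.7)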

Kai
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: CLOP slides

Post by Michel »

To Kai:

Note that CLOP does quadratic _logistic_ regression. Not quadratic regression. So the peaks can be much sharper than in your plots.
I don't know if that is what CLOP is doing...
The simple answer to that is to read the paper....

EDIT: Also: CLOP uses maximum likelihood to compute the coefficients of the
regression. Not least squares.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: CLOP slides

Post by Michel »

Of course if your point is that CLOP might settle for a non-optimal local maximum then that is correct. Rémi Coulom has already pointed out that CLOP is simply not designed for finding global optima.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: CLOP slides

Post by Laskos »

Michel wrote:To Kai:

Note that CLOP does quadratic _logistic_ regression. Not quadratic regression. So the peaks can be much sharper than in your plots.
I am skeptical that the peaks would be more detectable. Isn't it something like 1/(1+a*exp(....)), which cares more about the tails than about the peaks (as the number of data points grows)? I can certainly do that, but I don't expect much change.

I don't know if that is what CLOP is doing...
The simple answer to that is to read the paper....

EDIT: Also: CLOP uses maximum likelihood to compute the coefficients of the
regression. Not least squares.
Yes, I am doing least squares, but the method used here for solving the minimization was gradient descent. I tried several others (Gauss-Newton, Levenberg-Marquardt), but they generally gave less confident parameters on this noisy data. I calculated the parameter confidence intervals, and can plot confidence plots for this stuff; the tightest (the smallest parameter variation) came from the gradient method. And _all_ of them missed the sharp peak. I could possibly do it with maximum likelihood, but that would take some time; I just wanted to exemplify quickly.

If CLOP wasn't designed to find a global optimum, then... sorry, but if one has a feeling for how even a relatively smooth function looks in 6-12 dimensions, then doing a quadratic regression on it (and not even on it, but on noisy data from it), be it logistic and maximum likelihood, will probably give you something quite unrelated to the global optimum. If it does give the correct result, then the total function is probably close to a linear combination of the individual functions (weakly correlated parameters), and in that case there is no big problem tuning it 1-dimensionally. Still, it's probably useful for weakly correlated parameters.

Kai
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: CLOP slides

Post by Michel »

The problem with quadratic regression is that it cannot adapt to a sharp peak without giving enormous errors in the rest of the domain. So it will ignore the sharp peak.

However a function of the form 1/(1+e^{q(x)}) with q(x) a quadratic function of x can adapt to a sharp peak.

I think in your example (a sharp peak and a relatively flat much lower hill) CLOP will probably find the global maximum (but if it doesn't it is not a problem with CLOP).
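To make the contrast with the plain quadratic fit concrete, here is a rough 1-D sketch (my own toy setup, not CLOP's actual estimator): win/loss outcomes are drawn from a sharply peaked win probability that itself has the 1/(1+e^{q(x)}) form, and the three coefficients of q are recovered by maximum likelihood. Because the quadratic sits inside the exponent, the model can make the peak as narrow as needed without misbehaving elsewhere.

Code: Select all

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# True win probability: a sharp logistic-quadratic peak at x = 0.7
def true_p(x):
    return 1.0 / (1.0 + np.exp(100.0 * (x - 0.7) ** 2 - 2.0))

x = rng.uniform(0, 1, 20000)          # sampled parameter values
w = rng.random(20000) < true_p(x)     # win (True) / loss (False) outcomes

# Negative log-likelihood of p(x) = 1/(1 + exp(c0 + c1*x + c2*x^2))
def nll(c):
    q = np.clip(c[0] + c[1] * x + c[2] * x * x, -500, 500)
    p = np.clip(1.0 / (1.0 + np.exp(q)), 1e-12, 1 - 1e-12)
    return -np.sum(np.where(w, np.log(p), np.log(1 - p)))

c = minimize(nll, x0=np.zeros(3), method="BFGS").x
print("fitted peak location:", -c[1] / (2 * c[2]))   # close to 0.7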
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: CLOP slides

Post by Laskos »

Michel wrote:The problem with quadratic regression is that it cannot adapt to a sharp peak without giving enormous errors in the rest of the domain. So it will ignore the sharp peak.

However a function of the form 1/(1+e^{q(x)}) with q(x) a quadratic function of x can adapt to a sharp peak.

I think in your example (a sharp peak and a relatively flat much lower hill) CLOP will probably find the global maximum (but if it doesn't it is not a problem with CLOP).
I found some time today to do the fit; we were both only partially right. For fewer than 50,000 data points, the result is similar to the quadratic regression, with the same flat, wrong peak but different tails. For >300,000 data points, the fit seems to stabilize to the following:

[Image: logistic quadratic fit obtained with >300,000 data points]

Even here the global optimum is ambiguous, but at least the sharp peak is not overlooked. One has to be aware that this is a 2-dimensional example; in 6-12 dimensions the number of necessary data points explodes (even in 2 dimensions it is too large). If I understood something of CLOP, I still maintain that it is useful only for weakly correlated parameters.
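As a rough back-of-the-envelope on why the data requirement grows with dimension: the number of coefficients in a full quadratic model alone grows like 1 + d + d(d+1)/2, and that says nothing yet about resolving narrow features or beating the game-result noise.

Code: Select all

# coefficients of a full d-dimensional quadratic: constant + linear + quadratic terms
for d in (1, 2, 6, 12):
    print(d, "->", 1 + d + d * (d + 1) // 2)
# 1 -> 3, 2 -> 6, 6 -> 28, 12 -> 91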

Kai
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: CLOP slides

Post by Michel »

If I understood something of CLOP, I still maintain that it is useful only for weakly correlated parameters.
I don't really see what not finding a global maximum has to do with the correlatedness of the parameters...

Note also that your statement cannot be correct for purely formal reasons. The behaviour of CLOP is invariant under linear transformations of the parameter space. Your statement is not (uncorrelated parameters usually become correlated after linear transformation, and the opposite may also happen).
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: CLOP slides

Post by Laskos »

Michel wrote:
If I understood something of CLOP, I still maintain that it is useful only for weakly correlated parameters.
I don't really see what not finding a global maximum has to do with the correlatedness of the parameters...

Note also that your statement cannot be correct for purely formal reasons. The behaviour of CLOP is invariant under linear transformations of the parameter space. Your statement is not (uncorrelated parameters usually become correlated after linear transformation, and the opposite may also happen).
I mean that if you vary one parameter by a little, the other parameters don't jump wildly while settling at the optimum constrained by the first parameter. It also means that the fit can be done perturbatively, by sequenced 1-dimensional fitting, one parameter at a time, finding an approximate optimum and iterating in a convergent series. For this to happen, one has to have a broad global optimum with irrelevant local peaks.

In this sense CLOP seems to me (if I understood something) to not be radically different from tuning each parameter iteratively (by hand or not), one after another, and repeating. Even with CLOP one has to repeat the regression on a decreasing range for the variables, as the regression would give large errors for the global peak with the initial full range of variables. Still, CLOP is probably much faster at this limited task.
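For concreteness, here is a bare-bones sketch of the "one parameter after another, then repeat" scheme being compared to; `measure` is a placeholder for any noisy strength estimate (for example the score of a match), not a real API.

Code: Select all

import numpy as np

def tune_coordinatewise(measure, params, grids, sweeps=3):
    """Greedy coordinate-wise tuning: sweep the parameters one at a time and repeat."""
    params = list(params)
    for _ in range(sweeps):                      # repeat the whole cycle
        for i, grid in enumerate(grids):         # one parameter at a time
            scores = [measure(params[:i] + [v] + params[i + 1:]) for v in grid]
            params[i] = grid[int(np.argmax(scores))]
    return params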

Kai
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: CLOP slides

Post by Michel »

I seem to be observing a possible weak point of CLOP.

I'll explain it in 1 dimension in a very much simplified example. I haven't simulated this example so I don't know if my reasoning is correct.

Assume the function we want to optimize is

f(x)=1/(1+e^(-x^2))

So it has a minimum instead of a maximum.

Since it is still quadratic logistic, CLOP's regression will quickly find it and then generate samples in the area where f(x) is close to 1.

The estimate for the location of the "maximum" of f(x) returned by CLOP is the average of the generated samples.

Hence it will be zero: the local minimum of f(x)!

Of course CLOP is not designed to work under these conditions. So it is not a flaw. Still, converging to the worst possible parameter value seems to be somewhat disconcerting....
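A quick numerical illustration of the effect (with a very crude stand-in for the sampling, not CLOP's actual rule): if samples are concentrated where f(x) = 1/(1+e^(-x^2)) is large, they pile up symmetrically on both flanks, and their plain average lands at x = 0, the minimum.

Code: Select all

import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return 1.0 / (1.0 + np.exp(-x * x))   # minimum 0.5 at x = 0, tends to 1 far away

x = rng.uniform(-3, 3, 200_000)
keep = rng.random(x.size) < 2.0 * (f(x) - 0.5)   # accept roughly in proportion to "goodness"
print("average of accepted samples:", x[keep].mean())   # ~ 0.0, the worst point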
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: CLOP slides

Post by Rémi Coulom »

Michel wrote:I seem to be observing a possible weak point of CLOP.

I'll explain it in 1 dimension in a very much simplified example. I haven't simulated this example so I don't know if my reasoning is correct.

Assume the function we want to optimize is

f(x)=1/(1+e^(-x^2))

So it has a minimum instead of a maximum.

Since it is still quadratic logistic, CLOP's regression will quickly find it and then generate samples in the area where f(x) is close to 1.

The estimate for the location of the "maximum" of f(x) returned by CLOP is the average of the generated samples.

Hence it will be zero: the local minimum of f(x)!

Of course CLOP is not designed to work under these conditions. So it is not a flaw. Still, converging to the worst possible parameter value seems to be somewhat disconcerting....
I am not worried by this case. I never saw it happen in practice, and, if it does happen, it is not very difficult to figure it out by looking at the data.

In the next version, I will also print the weight of the estimated maximum. If it is < 1, it indicates such a problem with the maximum estimation.

In practice, the biggest problem with estimating the maximum as the weighted average of samples is when the parameter range is not wide enough to have bad values on both sides of the maximum. For instance, if the optimal value of a parameter is very close to the lower bound of its range, then CLOP will have a bias towards values that are too big. So it is very important to use CLOP with wide ranges, and when it is impossible to use a wider range, it is good to keep in mind that the estimate of the maximum may have such a bias.
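A tiny illustration of that boundary bias (again with a crude stand-in for the sampling, not CLOP's actual scheme): if the true optimum sits exactly at the lower bound of the allowed range, every accepted sample lies above it, so their average necessarily overshoots.

Code: Select all

import numpy as np

rng = np.random.default_rng(3)

def win_prob(x):
    return 0.6 - 0.3 * x     # best value is x = 0, the lower end of the range [0, 1]

x = rng.uniform(0, 1, 200_000)
keep = rng.random(x.size) < win_prob(x)      # accept in proportion to the win probability
print("average of accepted samples:", x[keep].mean())   # ~ 0.44, biased well above 0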

Rémi