Search tuned depending on TC

xr_a_y · Post by **xr_a_y** » Sun May 16, 2021 7:29 pm

When I switch to NNUE in Minic, I followed Dietrich idea of using a "scaling factor" for the NNUE eval.
The idea behind this is to try to automatically scale the NNUE eval so that it "fits" the usual HCE evaluation in terms of search/eval dependency.
In search, there are indeed a lots of thresholds that depends on evaluation, and this tuned search parameters would be far from optimized if NNUE eval returns something that is, in mean and distribution, far from HCE eval. Scaling NNUE evaluation tries to help with this.
In Minic I was using, since the first NNUE tries in august 2020, a random pool of 50000 positions and evaluate them with both HCE and NNUE (with the current loaded net) during initilisation phase of the engine. I sum the absolute values of HCE evals and NNUE evals (if they have same sign) and deduce a "factor" to scale NNUE eval so that the ratio between theses two sums is one. It worked quite well, at least i've always checked that Minic using a net and a scaling factor computed this way is always better than Minic using net without scaling NNUE evals. So far so good.

But for some reason last week I started to shut down this automatic scaling process and tries some other values manually. I discovered that this automatic scaling is not at the optimum point, at least for my new net "naive nostalgia" that will be released soon with Minic3.07.

So I use a python script to explore the possible scaling and deduce the best one. For this I used very small TC (3s+0.03) and the found optimal seems quite different from what I found "manually" with some 10s+0.1 testing. So I started to wonder, how this optimal scaling of NNUE eval (but in fact the question also holds for HCE) is affected by TC.

The point is, scaling score to be smaller or bigger is of course affecting how prunings are triggered, and clearly the optimum for different TC is not the same.

So, how about prunings thresholds depending on available time ?
I don't have the hardware power to answer of course ! but fishtest probably has !

Has it been tried already ?

Ferdy · Post by **Ferdy** » Sun May 16, 2021 8:03 pm

Good point even HCE is affected by time. Your idea of applying time in pruning/reduction is interesting. With more time one may scale down the eval i.e relying more on the search.

connor_mcmonigle · Post by **connor_mcmonigle** » Sun May 16, 2021 10:05 pm

If we expect the network's evaluation to be more accurate, in general, relative to your old evaluation function, then wouldn't it be natural to expect that it would be possible to get away with more aggressive static null move pruning constants for example? Or, equivalently, one would expect to be able to get away with scaling the network more aggressively than the scaling that would minimize the MSE between the network's evaluations and your old evaluation function as the network's evaluation is going to be a better predictor of the true value of nearly any given position.

To simplify analysis, let's consider a game with only two possible results (win, loss). Let the set of all possible game states be G and consider a function on G: v(S)->(win | loss) taking any S in G to the game theoretic value of S. Let f_0(S) -> R be the old evaluation function, mapping any state S of G to some real number and let the evaluation function be f_1(S) -> R. My hunch is that the optimal scaling can be obtained by first finding the coefficients of two logistic models, c_0 and c_1 respectively with f_0 and f_1 as predictors, minimizing the cross entropy between the predicted game theoretic value and the true game theoretic value provided by v on G. My guess is that (c_1/c_0) gives the optimal scaling for f_1 for a search function tuned for f_0.

Perhaps as a substitute to v (as chess isn't solved yet), the results of high quality self play games could be used (v') and some subset of the positions from those self play games could serve as a reasonable approximation to G (G'). Therefore, rather than scaling f_1 by m so as to minimize (m*f_1(S) - f_0(S))^2 for all S in G', you would scale f_0 and f_1 by c_0 and c_1 respectively to minimize both -v'(S) * logsigmoid(c_0 * f_0) + -(1-v'(S)) * logsigmoid(-c_0 * f_0) and -v'(S) * logsigmoid(c_1 * f_1) + -(1-v'(S)) * logsigmoid(-c_1 * f_1). Then m, the scaling coefficient is just c_1/c_0.

xr_a_y · Post by **xr_a_y** » Sun May 16, 2021 10:33 pm

Very interesting point.

One empirical thing I saw today is that optimal scaling factor at 10s+0.1 is bigger than for 3s+0.03.

So probably more aggressive pruning at longer TC.
I ll try 1min+1s tomorrow to verify if this tendency holds.

Search tuned depending on TC

Search tuned depending on TC

Re: Search tuned depending on TC

Re: Search tuned depending on TC

Re: Search tuned depending on TC