On LMR, and statically predicting moves

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

On LMR, and statically predicting moves

Post by matthewlai »

In my last thread about probability-based search (http://www.talkchess.com/forum/viewtopic.php?t=57092), there was some discussion on LMR. I am starting a new thread since it's fundamentally a separate issue.

How does LMR work? Fundamentally, the idea is that when we do move ordering, we sort moves that are more likely to be the best towards the front, and we want to spend proportionally more time on moves that are more likely the best than moves that are less likely to be the best, given the parent position.

In traditional chess engines, the probability distribution is not explicit. When the move ordering returns "A, B, C, D, E", there's no way to tell whether it's "50%, 49%, 0.5%, 0.3%, 0.2%" or "22%, 21%, 20%, 19%, 18%". Obviously the correct reduction schedule is very different for these 2 cases.

Most programs use a fixed reduction schedule (except for Stockfish, which has a few tables and adjustments). That works on average because we know that the probability distribution is monotonically decreasing, so we are right more often than we are wrong, when we just apply a simple constant reduction schedule.
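To make the discussion concrete, a fixed schedule of this kind might look like the sketch below. The constants and the log-based formula are illustrative only, not taken from any particular engine:

```python
import math

def fixed_reduction(move_index, depth):
    """Illustrative fixed LMR schedule: never reduce the first few
    moves, then reduce later moves by an amount that grows with both
    the move's rank in the ordering and the remaining depth."""
    if move_index < 3 or depth < 3:
        return 0
    # A logarithmic shape is a common choice for reduction tables.
    return int(0.5 + math.log(move_index) * math.log(depth) / 2.0)

# The reduction depends only on the move's rank, not on how confident
# the move ordering actually was -- which is exactly the problem.
print([fixed_reduction(i, 10) for i in range(8)])  # [0, 0, 0, 1, 2, 2, 2, 2]
```

The same reduction is applied whether the move ordering was nearly certain or nearly clueless about that rank.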

But can we do better?

When we do the move sorting, we already have more information. We know which moves came from the hash table, good captures, killers, or history (and what the history scores are). Why throw all that information away and reduce everything to a simple ranking before deciding on the reductions?

Why not estimate a probability distribution instead, and use that for both sorting and reducing?
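As a sketch of what that could look like (my own toy mapping, not Giraffe's actual scheme): reduce each move by roughly one ply for every halving of its estimated probability relative to the front move. The two example distributions above then get very different schedules:

```python
import math

def prob_reductions(probs):
    """Toy mapping from an explicit move probability distribution to
    per-move reductions: one extra ply of reduction for each halving
    of a move's probability relative to the front move."""
    best = max(probs)
    return [int(math.log2(best / p)) if p > 0 else 9 for p in probs]

# The sharp distribution: reduce heavily after the first two moves.
print(prob_reductions([0.50, 0.49, 0.005, 0.003, 0.002]))  # [0, 0, 6, 7, 7]
# The flat distribution: same ranking, but no move deserves reduction.
print(prob_reductions([0.22, 0.21, 0.20, 0.19, 0.18]))     # [0, 0, 0, 0, 0]
```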

How do we estimate the probability distribution from all that information? I have no idea. Fortunately for me, I don't have to worry about that too much, because Giraffe is learning it by herself.

I haven't quite figured out how to incorporate path-dependent information yet, but here are some results with no path-dependent info (the model is a neural network, of course):


0: 15.67% (15.67)
1: 10.64% (26.31)
2: 7.28% (33.59)
3: 4.79% (38.38)
4: 3.72% (42.1)
5: 2.98% (45.08)
6: 2.18% (47.26)
7: 2.07% (49.33)
8: 1.41% (50.74)
9: 1.57% (52.31)
10: 1.32% (53.63)
11: 1.17% (54.8)
12: 0.98% (55.78)
13: 0.85% (56.63)
14: 0.77% (57.4)
15: 0.62% (58.02)
16: 0.46% (58.48)
17: 0.4% (58.88)
18: 0.32% (59.2)
19: 0.26% (59.46)
Average Confidence: 0.152621
From the predicted rankings, the first move happens to be the actual best move 15.67% of the time. Second move 10.64%, etc.

Each ranking also comes with a probability distribution. The actual best move, on average, has a predicted probability of 15.26%.

This only includes positions where the best move is not a SEE-winning capture (most of them are regular quiet moves, and a few are sacrifices).

Some programs do things like sorting moves toward the centre first, or by eval change, etc. It should be possible to get better ordering by exploiting statistical regularities, like the fact that in the opening pawn moves are more likely to be best than king moves. It may actually be beneficial to spend more time statically evaluating moves in addition to positions.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: On LMR, and statically predicting moves

Post by mvk »

I just have, since Sunday, a prototype of my new tree shaping method working. Recall that it calculates for every node in the known move graph the odds that each move is, eventually, the best in its own node.

Some interesting effects are clear, and make a lot of sense:
1. When a node has two good (uncorrelated, which I always assume) moves, the score for that node should be a bit higher than just the max of the means. The analogy is that the winner's time of a race between two equal runners is better than the individual expectations.
2. At the same time, the variation in that case is smaller than the individual variations. That makes sense, because if one move fails eventually, there is a backup before the problem propagates towards the root.
3. When a node has one good move, the mean and sigma of that node remain effectively unchanged, except for the sign of the mean of course.

LMR-like effects are dramatic once the sigmas get smaller.

The implementation difficulty was in getting the PDF back-propagation stable by deciding where to cut cycles.

I have a C implementation for the pdfMax integrals. In the end, I decided to go back to the numerical approach. It is here on github
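The C code is mvk's; for readers who want to experiment, a rough Python equivalent of the numerical pdfMax idea under the same independence assumption might look like this (my sketch, not the github code):

```python
import math

def pdf_max_two_gaussians(mu1, s1, mu2, s2, lo=-12.0, hi=12.0, n=4001):
    """Numerically estimate the mean and sigma of max(X1, X2) for two
    independent Gaussians, using
        pdf_max(x) = f1(x) * F2(x) + f2(x) * F1(x).
    """
    def pdf(x, mu, s):
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

    def cdf(x, mu, s):
        return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

    dx = (hi - lo) / (n - 1)
    xs = [lo + i * dx for i in range(n)]
    ps = [pdf(x, mu1, s1) * cdf(x, mu2, s2) +
          pdf(x, mu2, s2) * cdf(x, mu1, s1) for x in xs]
    mean = sum(x * p for x, p in zip(xs, ps)) * dx
    var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps)) * dx
    return mean, math.sqrt(var)

# Effects 1 and 2 above: two equal moves, N(0, 1) each. The max has a
# mean above either individual mean and a sigma below either sigma.
mean, sigma = pdf_max_two_gaussians(0.0, 1.0, 0.0, 1.0)
print(round(mean, 3), round(sigma, 3))  # about 0.564 0.826
```

With one clearly bad move (effect 3), the result collapses back to the good move's own mean and sigma.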

Now that I start to look at the graphs from some test positions, it occurs to me that the method is more suitable for move selection than for book making. For example, it gives 95% for 1.e4, 5% for 1.d4, and 0% for all others, while for book making you also want to spend a significant amount of effort on other moves. I'm still not sure why there should be a difference. Maybe for book making you have to use bigger sigmas, because you try not only to predict the best move for yourself, but also the most likely moves your opponents will play. That is different.

The next version of my program will be the first to rely heavily on floating point operations it seems.
[Account deleted]
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: On LMR, and statically predicting moves

Post by mvk »

I'm trying to understand your table. A question that would help me understand your result is: why doesn't it add up to 100% (but 59.46%)?

Another question: Is the NN running in every node, or do you use the NN to generate the parameters of some state machine or bayes chain or something like that, which in turn generates the reduction for you?
[Account deleted]
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: On LMR, and statically predicting moves

Post by bob »

matthewlai wrote:In my last thread about probability-based search (http://www.talkchess.com/forum/viewtopic.php?t=57092), there was some discussion on LMR. I am starting a new thread since it's fundamentally a separate issue.

How does LMR work? Fundamentally, the idea is that when we do move ordering, we sort moves that are more likely to be the best towards the front, and we want to spend proportionally more time on moves that are more likely the best than moves that are less likely to be the best, given the parent position.

In traditional chess engines, the probability distribution is not explicit. When the move ordering returns "A, B, C, D, E", there's no way to tell whether it's "50%, 49%, 0.5%, 0.3%, 0.2%" or "22%, 21%, 20%, 19%, 18%". Obviously the correct reduction schedule is very different for these 2 cases.

Most programs use a fixed reduction schedule (except for Stockfish, which has a few tables and adjustments). That works on average because we know that the probability distribution is monotonically decreasing, so we are right more often than we are wrong, when we just apply a simple constant reduction schedule.

But can we do better?

When we do the move sorting, we already have more information. We know which moves came from the hash table, good captures, killers, or history (and what the history scores are). Why throw all that information away and reduce everything to a simple ranking before deciding on the reductions?

Why not estimate a probability distribution instead, and use that for both sorting and reducing?

How do we estimate the probability distribution from all that information? I have no idea. Fortunately for me, I don't have to worry about that too much, because Giraffe is learning it by herself.

I haven't quite figured out how to incorporate path-dependent information yet, but here are some results with no path-dependent info (the model is a neural network, of course):


0: 15.67% (15.67)
1: 10.64% (26.31)
2: 7.28% (33.59)
3: 4.79% (38.38)
4: 3.72% (42.1)
5: 2.98% (45.08)
6: 2.18% (47.26)
7: 2.07% (49.33)
8: 1.41% (50.74)
9: 1.57% (52.31)
10: 1.32% (53.63)
11: 1.17% (54.8)
12: 0.98% (55.78)
13: 0.85% (56.63)
14: 0.77% (57.4)
15: 0.62% (58.02)
16: 0.46% (58.48)
17: 0.4% (58.88)
18: 0.32% (59.2)
19: 0.26% (59.46)
Average Confidence: 0.152621
From the predicted rankings, the first move happens to be the actual best move 15.67% of the time. Second move 10.64%, etc.

Each ranking also comes with a probability distribution. The actual best move, on average, has a predicted probability of 15.26%.

This only includes positions where the best move is not a SEE-winning capture (most of them are regular quiet moves, and a few are sacrifices).

Some programs do things like sorting moves toward the centre first, or by eval change, etc. It should be possible to get better ordering by exploiting statistical regularities, like the fact that in the opening pawn moves are more likely to be best than king moves. It may actually be beneficial to spend more time statically evaluating moves in addition to positions.
Your numbers look a bit strange. Most of us keep a simple fail-high counter for the first move, which means we count the number of fail highs that occur, and the number that fail high on the first move. This is typically > 90%, at least for my program, not the 15% or so you mentioned. This is with all the usual ordering: hash, non-losing captures, killers, counter moves, then history heuristic and the rest of the moves...
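For anyone who wants to reproduce this measurement, the bookkeeping is a few lines; a sketch (the search itself is not shown, and the names are made up):

```python
class OrderingStats:
    """Fail-high bookkeeping of the kind described above; only the
    counters are shown, the alpha-beta loop itself is stubbed out."""

    def __init__(self):
        self.fail_highs = 0
        self.first_move_fail_highs = 0

    def record_fail_high(self, move_index):
        """Call whenever a move fails high; move_index is its rank
        in the move ordering (0 = first move searched)."""
        self.fail_highs += 1
        if move_index == 0:
            self.first_move_fail_highs += 1

    def first_move_rate(self):
        return self.first_move_fail_highs / max(1, self.fail_highs)

# Toy data: indices of the cutoff move at ten fail-high nodes.
stats = OrderingStats()
for idx in [0, 0, 0, 1, 0, 2, 0, 0, 0, 0]:
    stats.record_fail_high(idx)
print(f"{stats.first_move_rate():.0%}")  # 80% in this toy sample
```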
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: On LMR, and statically predicting moves

Post by Michel »

Marcel wrote:
I just have, since Sunday, a prototype of my new tree shaping method working. Recall that it calculates for every node in the known move graph the odds that each move is, eventually, the best in its own node.

Some interesting effects are clear, and make a lot of sense:
1. When a node has two good (uncorrelated, which I always assume) moves, the score for that node should be a bit higher than just the max of the means. The analogy is that the winner's time of a race between two equal runners is better than the individual expectations.
2. At the same time, the variation in that case is smaller than the individual variations. That makes sense, because if one move fails eventually, there is a backup before the problem propagates towards the root.
3. When a node has one good move, the mean and sigma of that node remain effectively unchanged, except for the sign of the mean of course.

LMR-like effects are dramatic once the sigmas get smaller.

The implementation difficulty was in getting the PDF back-propagation stable by deciding where to cut cycles.

I have a C implementation for the pdfMax integrals. In the end, I decided to go back to the numerical approach. It is here on github

Now that I start to look at the graphs from some test positions, it occurs to me that the method is more suitable for move selection than for book making. For example, it gives 95% for 1.e4, 5% on d4 and 0% on all others. While for book making you also want to spend a significant amount on other moves. I'm still not sure why there should be difference. Maybe for bookmaking you have to use bigger sigmas, because you try not only to predict the best move for yourself, but also the most likely moves your opponents will play. That is different.

The next version of my program will be the first to rely heavily on floating point operations it seems.
This is an interesting project. I very much like probabilistic approaches to tree searching. After all the static evaluation function is ultimately some kind of probabilistic assessment of a position.

However I have the feeling that the assumption that sibling nodes are uncorrelated may be a fatal flaw. If the evaluation function assesses a feature of the position incorrectly (overestimating king safety for example, or underestimating a passed pawn) then it is likely to do so also in sibling nodes.

Unfortunately I do not know a good way to incorporate correlation. The first issue is that there will also be correlation between nodes that share the same grand parent and so on. One idea is to replace sibling correlation with child-parent correlation. But even then I do not know how to use this information.

I'd be glad to be proven wrong though.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: On LMR, and statically predicting moves

Post by mvk »

Michel wrote:
Marcel wrote:I just have, since Sunday, a prototype of my new tree shaping method working. Recall that it calculates for every node in the known move graph the odds that each move is, eventually, the best in its own node.
This is an interesting project. I very much like probabilistic approaches to tree searching. After all the static evaluation function is ultimately some kind of probabilistic assessment of a position.

However I have the feeling that the assumption that sibling nodes are uncorrelated may be a fatal flaw. If the evaluation function assesses a feature of the position incorrectly (overestimating king safety for example, or underestimating a passed pawn) then it is likely to do so also in sibling nodes.

Unfortunately I do not know a good way to incorporate correlation. The first issue is that there will also be correlation between nodes that share the same grand parent and so on. One idea is to replace sibling correlation with child-parent correlation. But even then I do not know how to use this information.

I'd be glad to be proven wrong though.
I think I see your point, but I don't see how this problem wouldn't also affect any other type of evaluation-based tree search. Isn't the issue the same with Shannon strategies as well? The only solution I can think of is to fix the evaluation. After all, the evaluation provides the map through the search space. If the map is wrong, no path-finding algorithm will be any good.

There are two other potential errors in the model that I can identify:

1. The premise that a longer search converges to a heuristic limit which represents the ultimate truth of the position. We know that in the ultimate case it converges to the game-theoretical value instead, because the search space is bounded.

2. The assumption that siblings are uncorrelated. We know this is not true due to transpositions. I ignore this until I can think of a way to know the covariances as well; I have no idea how to do that in a practical way yet. One idea could be to estimate the covariances as a static average. But until I have more experience with the raw search, I don't want to add more heuristics.

Of these, only the second can I imagine yielding unwanted artefacts. But as with any heuristic approach, the proof is in the eating.
[Account deleted]
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: On LMR, and statically predicting moves

Post by matthewlai »

mvk wrote:I'm trying to understand your table. A question that would help me understand your result is: why doesn't it add up to 100% (but 59.46%)?

Another question: Is the NN running in every node, or do you use the NN to generate the parameters of some state machine or bayes chain or something like that, which in turn generates the reduction for you?
Sorry I was sleepy when I made the post last night. Should have clarified more.

The NN is run for every node, to assign a probability distribution to all the moves.

The reason it doesn't add up to 100% is because I cut it off at move 20, that's all.

So the first line is the frequency the best move predicted by the NN happens to be the correct best move from search. Then the frequency the second best move predicted by the NN happens to be the correct best move, etc.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: On LMR, and statically predicting moves

Post by matthewlai »

bob wrote: Your numbers look a bit strange. Most of us keep a simple fail high counter for the first move, which means we count the number of fail highs that occur, and the number that fail high on the first move. This is typically > 90%, at least for my program. Not the 15% or so you mentioned. This with all the usual ordering such as hash, non-losing captures, killers, counter moves, then history heuristic and the rest of the moves...
It is entirely possible that path-dependent heuristics are the way to go. I haven't established that they aren't, and I am not really convinced either way.

However, this is with no path-dependent stuff at all. No hash move, no killer, no history, no counter moves, etc. And no winning captures.

The moves are predicted simply by looking at the position, with no other information.

Do you happen to have data for history heuristics? I would be very interested in that. Of the positions where either there's no hash move, killers, etc, or where those moves failed to produce a cutoff, how many cutoffs happen on the first move sorted by history? And second move, third move, etc?

Also, the original spirit of this post would still hold with history heuristics.

For example, if hash, killer, good captures, etc don't produce a cutoff, and the history counters are "100, 99, 98, 97", the reduction schedule should probably be different from if it was "100, 0, 0, 0".
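One illustrative way to act on that difference (my own sketch, not a tested scheme) is to normalize the counters into a distribution first and reduce by relative likelihood:

```python
import math

def reductions_from_history(counters):
    """Toy scheme: turn raw history counters into a probability
    distribution, then reduce each move by one ply per halving of
    its share relative to the front move (capped at 9 for zeros)."""
    total = sum(counters)
    probs = [c / total for c in counters]
    best = max(probs)
    return [int(math.log2(best / p)) if p > 0 else 9 for p in probs]

# Near-uniform counters: no extra reductions are justified.
print(reductions_from_history([100, 99, 98, 97]))  # [0, 0, 0, 0]
# One dominant counter: reduce everything else heavily.
print(reductions_from_history([100, 0, 0, 0]))     # [0, 9, 9, 9]
```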
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: On LMR, and statically predicting moves

Post by Michel »

mvk wrote:
Michel wrote:
Marcel wrote:I just have, since Sunday, a prototype of my new tree shaping method working. Recall that it calculates for every node in the known move graph the odds that each move is, eventually, the best in its own node.
This is an interesting project. I very much like probabilistic approaches to tree searching. After all the static evaluation function is ultimately some kind of probabilistic assessment of a position.

However I have the feeling that the assumption that sibling nodes are uncorrelated may be a fatal flaw. If the evaluation function assesses a feature of the position incorrectly (overestimating king safety for example, or underestimating a passed pawn) then it is likely to do so also in sibling nodes.

Unfortunately I do not know a good way to incorporate correlation. The first issue is that there will also be correlation between nodes that share the same grand parent and so on. One idea is to replace sibling correlation with child-parent correlation. But even then I do not know how to use this information.

I'd be glad to be proven wrong though.
I think I see your point, but I don't see how this problem wouldn't also affect any other type of evaluation-based tree search. Isn't the issue the same with Shannon strategies as well? The only solution I can think of is to fix the evaluation. After all, the evaluation provides the map through the search space. If the map is wrong, no path-finding algorithm will be any good.

There are two other potential errors in the model that I can identify:

1. The premise that a longer search converges to a heuristic limit which represents the ultimate truth of the position. We know that in the ultimate case it converges to the game-theoretical value instead, because the search space is bounded.

2. The assumption that siblings are uncorrelated. We know this is not true due to transpositions. I ignore this until I can think of a way to know the covariances as well; I have no idea how to do that in a practical way yet. One idea could be to estimate the covariances as a static average. But until I have more experience with the raw search, I don't want to add more heuristics.

Of these, only the second can I imagine yielding unwanted artefacts. But as with any heuristic approach, the proof is in the eating.
Of course.

What I worry about is that ignoring correlation will make you overestimate max nodes and underestimate min nodes. It is easy to understand this in extreme cases (e.g. several perfectly correlated equal moves). But as you say, there is no way to know whether this effect is important without trying first.
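The effect is easy to demonstrate with a small Monte Carlo experiment (a sketch with made-up numbers, modelling two moves as standard normal scores): the expected max shrinks toward a single move's mean as the correlation rises.

```python
import math
import random

def mean_max(rho, trials=200_000, seed=1):
    """Monte Carlo estimate of E[max(X, Y)] for two standard normal
    scores with correlation rho, built from a shared error component."""
    rng = random.Random(seed)
    a = math.sqrt(max(0.0, 1.0 - rho))  # weight of the private noise
    b = math.sqrt(rho)                  # weight of the shared error
    total = 0.0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)
        x = b * shared + a * rng.gauss(0.0, 1.0)
        y = b * shared + a * rng.gauss(0.0, 1.0)
        total += max(x, y)
    return total / trials

print(mean_max(0.0))  # about 0.56: independent moves lift the node score
print(mean_max(0.9))  # much smaller lift when the moves share their error
print(mean_max(1.0))  # about 0.00: perfect correlation, no lift at all
```

Assuming independence thus credits a max node with the full lift even when the real lift is close to zero.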
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: On LMR, and statically predicting moves

Post by mvk »

So the first line is the frequency the best move predicted by the NN happens to be the correct best move from search. Then the frequency the second best move predicted by the NN happens to be the correct best move, etc.
But then I find it strange that you only get to ~60% by move 20, with 0.3% there. Is the remaining 40% in the tail, or is it in other node types? If it is in the tail, your tail seems too long: 40/0.3 >> 100 moves.
[Account deleted]