Evaluation discontinuity

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Evaluation discontinuity

Post by Kempelen »

Hello,

I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?

thanks.
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Evaluation discontinuity

Post by zamar »

Kempelen wrote: I really don't fully understand why evaluation discontinuity is a problem.
For me it's quite intuitive that discontinuity of any kind usually results in larger search trees. Of course, this is impossible to prove.
Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?
Well, I think there are many positions which can't be classified as purely endgame or purely middlegame positions. In such cases one can say that a position is, for example, 80% endgame and 20% middlegame, and blend the two scores accordingly.
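A rough sketch of what I mean (the function and the phase scale here are invented just for illustration, not taken from any particular engine): every term gets a middlegame value and an endgame value, and the two totals are blended by how much material is left.

Code: Select all

    #define MAX_PHASE 24   /* 4 minors*1 + 2 rooks*2 + 1 queen*4 per side, both sides */

    /* phase == MAX_PHASE with all pieces on the board, 0 in a pure pawn ending */
    int tapered_eval(int mg_score, int eg_score, int phase)
    {
        if (phase > MAX_PHASE) phase = MAX_PHASE;
        if (phase < 0)         phase = 0;
        /* linear blend: all mg_score at full material, all eg_score with none left */
        return (mg_score * phase + eg_score * (MAX_PHASE - phase)) / MAX_PHASE;
    }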
Joona Kiiski
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Evaluation discontinuity

Post by bob »

Kempelen wrote:Hello,

I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?

thanks.
The search tree is huge. If your search goes right up to the discontinuity, it can then push just over if it wants to, or stay just on the other side if that is better. That big score jump or drop is a problem. Berliner wrote about this issue in a couple of papers he published years ago.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Evaluation discontinuity

Post by Tord Romstad »

Kempelen wrote:Hello,

I really don't fully understand why evaluation discontinuity is a problem.
Before we can do this, we need to define discontinuity. In mathematics, continuity is defined for mappings between topological spaces. A mapping f: X -> Y is continuous if for any open subset U of Y, the inverse image f^-1(U) is an open subset of X. The mathematical definition isn't very useful for evaluation functions, because it is difficult to define a useful topology on the set of all legal chess positions.

I therefore propose the following rather loose definition of continuity for an evaluation function:

Two quiescent chess positions P1 and P2 are considered to be "close to each other" if it is possible to get from one of the positions to the other by a short sequence of strong moves. An evaluation function has good continuity if positions which are close to each other usually have similar evaluations.

That continuity in this sense is a desirable feature should be evident: the static evaluation is supposed to be a rough estimate of the result of a deep search. If P1 and P2 are quiescent, if it is possible to get from P1 to P2 by a short sequence of perfect moves, and if the evaluations are nevertheless very different, then the evaluation of P1 must have been wrong.
Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated).
True, if you have a perfect evaluation function, you don't have to worry about continuity, because a perfect evaluation function is automatically continuous. The point is that a perfect evaluation function is very hard to write, and that writing an evaluation function that isn't perfect, but at least reasonably continuous, is much easier.
Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?
Sure. Consider the position of the king. In the opening and early middle game, we want the king to be close to a corner, and protected by pawns. We want to keep it away from the centre of the board at all costs, because it will easily get mated there. In the endgame, we want to activate and centralize the king. As I am sure you agree, there is no exact moment in the game where keeping the king safe and sheltered ceases being important and activating the king becomes important. It happens gradually as pieces disappear from the board. Therefore, it makes sense to use a bonus for having the king close to the corner in the opening, a bonus for having the king close to the center in the endgame, and interpolate between the two scores when you are somewhere between the opening and the endgame.
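A rough sketch of such an interpolated king term (the numbers are invented for illustration only):

Code: Select all

    /* 0 <= file, rank <= 7; phase runs from 24 (all pieces on) down to 0 (pawn ending) */
    int king_score(int file, int rank, int phase)
    {
        /* distance of the square from the centre: 0 in the middle, 6 in a corner */
        int df = file < 4 ? 3 - file : file - 4;
        int dr = rank < 4 ? 3 - rank : rank - 4;
        int center_dist = df + dr;

        int mg = center_dist * 8;         /* opening/middlegame: reward hiding near the corner */
        int eg = (6 - center_dist) * 8;   /* endgame: reward centralisation */

        return (mg * phase + eg * (24 - phase)) / 24;
    }

With all the pieces on the board the first term dominates, in a pawn ending the second one does, and in between the king is encouraged to creep toward the centre gradually rather than in one jump.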
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Evaluation discontinuity

Post by Don »

bob wrote:
Kempelen wrote:Hello,

I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?

thanks.
The search tree is huge. If your search goes right up to the discontinuity, it can then push just over if it wants to, or stay just on the other side if that is better. That big score jump or drop is a problem. Berliner wrote about this issue in a couple of papers he published years ago.
I agree completely in principle, but in practice I'm not so sure that what any of us are doing is correct. Especially with the opening/ending interpolation. Unfortunately I don't have a better suggestion.

We do it because it's easy to think about this way, but in many ways it's broken. The king tables are especially awkward, and I don't believe my own king table transitions well from opening to endgame.

In a lot of cases there is not much, if any, grey area, and interpolation pretends it's all perfectly smooth when it isn't. A lot of times a square is either good or horrible, and this DOES change dramatically with a single move or exchange.

This may be the best we have so far, but I think it's probably a fad and will not be the way future programs are written. I'm talking about the interpolation of openings and endings the way most programs seem to be doing it.

In fact, I think most of the things we do are fads and superstitions. Of course there are good ideas we all use, but I'm pretty sure a lot of ideas are done simply because "everybody else does it that way."
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Evaluation discontinuity

Post by Daniel Shawul »


Tord Romstad wrote: I therefore propose the following rather loose definition of continuity for an evaluation function:

Two quiescent chess positions P1 and P2 are considered to be "close to each other" if it is possible to get from one of the positions to the other by a short sequence of strong moves. An evaluation function has good continuity if positions which are close to each other usually have similar evaluations.
Maybe it is possible to improve upon this definition, because I think that even a single move can bring big changes in eval. IMO most of the eval terms we have now are rather discontinuous, except for a few typical cases like king placement that we have been treating as dependent on phase (material). This is not necessarily a bad thing. In one move a bishop can be trapped on a7, a queen can attack target squares around the king, a pawn can become passed, and there are many more such on/off features (patterns).
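For example (bonus values invented here), a term like this is on/off by its very nature, no matter how smoothly everything else is scaled:

Code: Select all

    /* A pawn either is passed or it is not; the bonus appears or disappears
       with a single capture or push. */
    int passed_pawn_bonus(int is_passed, int rank)
    {
        return is_passed ? 20 + 10 * rank : 0;
    }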
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Evaluation discontinuity

Post by bob »

Don wrote:
bob wrote:
Kempelen wrote:Hello,

I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?

thanks.
The search tree is huge. If your search goes right up to the discontinuity, it can then push just over if it wants to, or stay just on the other side if that is better. That big score jump or drop is a problem. Berliner wrote about this issue in a couple of papers he published years ago.
I agree completely in principle, but in practice I'm not so sure that what any of us are doing is correct. Especially with the opening/ending interpolation. Unfortunately I don't have a better suggestion.

We do it because it's easy to think about this way, but in many ways it's broken. The king tables are especially awkward, and I don't believe my own king table transitions well from opening to endgame.

In a lot of cases there is not much, if any, grey area, and interpolation pretends it's all perfectly smooth when it isn't. A lot of times a square is either good or horrible, and this DOES change dramatically with a single move or exchange.

This may be the best we have so far, but I think it's probably a fad and will not be the way future programs are written. I'm talking about the interpolation of openings and endings the way most programs seem to be doing it.

In fact, I think most of the things we do are fads and superstitions. Of course there are good ideas we all use, but I'm pretty sure a lot of ideas are done simply because "everybody else does it that way."
One can always make the case that since we are using integers, there is a discontinuity between any two possible scores.

However, I am thinking more along the lines of what in mathematics is called a "unit step function": a function f(x) that is zero until x reaches some threshold, then jumps to 1. An example is suddenly shutting king safety off when material drops below some value x. Right around that point, bad things can happen. Say you are a pawn up in a won position, but your king position is weak according to your evaluation. You might give up that one-pawn advantage just to cross the threshold and turn off king safety, and if your king-safety scores are large, this might actually be seen as a plus. Or, in a more subtle form, you intentionally wreck your king safety to win a pawn, and then trade pieces to disable the king-safety term. Now you are a pawn up in a lost ending because you shattered your kingside voluntarily.

The interpolation approach does mitigate this a bit, but it is still coarse. I would argue, though, that coarse in terms of (say) 40 steps is much better than coarse with only one big step. Smaller steps would certainly be better.

Unfortunately for me, in Cray Blitz we just shut king safety off at some point, or suddenly ramped up endgame scores at that same point, and it caused some ugly search results here and there. Scaling purely on material has got to be wrong, but it is far better than just turning something on or off at some fixed point. It is wrong because the total material (as an integer count) is very small overall, which makes this "coarseness" issue more apparent. What else should be used is unknown (to me) at present. But there must be something better. I know I don't evaluate like this as a human, for example.
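To make the contrast concrete, a rough sketch (the threshold and the 40-step scale are invented here for illustration):

Code: Select all

    /* All-or-nothing: one trade near the threshold swings the score by the
       full king-safety amount. */
    int king_safety_step(int raw_safety, int enemy_material)
    {
        return enemy_material >= 1800 ? raw_safety : 0;
    }

    /* Faded out over roughly 40 small steps: each trade changes the score
       only a little, so the search has no cliff to jump over. */
    int king_safety_scaled(int raw_safety, int enemy_material)
    {
        int w = enemy_material / 100;
        if (w > 40) w = 40;
        return raw_safety * w / 40;
    }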
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Evaluation discontinuity

Post by Don »

bob wrote:
Don wrote:
bob wrote:
Kempelen wrote:Hello,

I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?

thanks.
The search tree is huge. If your search goes right up to the discontinuity, it can then push just over if it wants to, or stay just on the other side if that is better. That big score jump or drop is a problem. Berliner wrote about this issue in a couple of papers he published years ago.
I agree completely in principle, but in practice I'm not so sure that what any of us are doing is correct. Especially with the opening/ending interpolation. Unfortunately I don't have a better suggestion.

We do it because it's easy to think about this way, but in many ways it's broken. The king tables are especially awkward, and I don't believe my own king table transitions well from opening to endgame.

In a lot of cases there is not much, if any, grey area, and interpolation pretends it's all perfectly smooth when it isn't. A lot of times a square is either good or horrible, and this DOES change dramatically with a single move or exchange.

This may be the best we have so far, but I think it's probably a fad and will not be the way future programs are written. I'm talking about the interpolation of openings and endings the way most programs seem to be doing it.

In fact, I think most of the things we do are fads and superstitions. Of course there are good ideas we all use, but I'm pretty sure a lot of ideas are done simply because "everybody else does it that way."
One can always make the case that since we are using integers, there is a discontinuity between any two possible scores.

However, I am thinking more along the lines of what in mathematics is called a "unit step function": a function f(x) that is zero until x reaches some threshold, then jumps to 1. An example is suddenly shutting king safety off when material drops below some value x. Right around that point, bad things can happen. Say you are a pawn up in a won position, but your king position is weak according to your evaluation. You might give up that one-pawn advantage just to cross the threshold and turn off king safety, and if your king-safety scores are large, this might actually be seen as a plus. Or, in a more subtle form, you intentionally wreck your king safety to win a pawn, and then trade pieces to disable the king-safety term. Now you are a pawn up in a lost ending because you shattered your kingside voluntarily.

The interpolation approach does mitigate this a bit, but it is still coarse. I would argue, though, that coarse in terms of (say) 40 steps is much better than coarse with only one big step. Smaller steps would certainly be better.

Unfortunately for me, in Cray Blitz we just shut king safety off at some point, or suddenly ramped up endgame scores at that same point, and it caused some ugly search results here and there. Scaling purely on material has got to be wrong, but it is far better than just turning something on or off at some fixed point. It is wrong because the total material (as an integer count) is very small overall, which makes this "coarseness" issue more apparent. What else should be used is unknown (to me) at present. But there must be something better. I know I don't evaluate like this as a human, for example.
Yes, I agree. This is clearly better than what most of us did before, suddenly going from middle to ending on one move.

I think that having exactly two evaluation functions, however, is just a hack - a simple way that is easier for us to think about when developing an evaluation function.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Evaluation discontinuity

Post by bob »

Don wrote:
bob wrote:
Don wrote:
bob wrote:
Kempelen wrote:Hello,

I really don't fully understand why evaluation discontinuity is a problem. Given two positions, an engine will always choose the better one, be it endgame or middlegame (supposing both are correctly calculated). Can someone give me an example of why two scores (endgame and middlegame), scaled, are better than only one?

thanks.
The search tree is huge. If your search goes right up to the discontinuity, it can then push just over if it wants to, or stay just on the other side if that is better. That big score jump or drop is a problem. Berliner wrote about this issue in a couple of papers he published years ago.
I agree completely in principle, but in practice I'm not so sure that what any of us are doing is correct. Especially with the opening/ending interpolation. Unfortunately I don't have a better suggestion.

We do it because it's easy to think about this way, but in many ways it's broken. The king tables are especially awkward, and I don't believe my own king table transitions well from opening to endgame.

In a lot of cases there is not much, if any, grey area, and interpolation pretends it's all perfectly smooth when it isn't. A lot of times a square is either good or horrible, and this DOES change dramatically with a single move or exchange.

This may be the best we have so far, but I think it's probably a fad and will not be the way future programs are written. I'm talking about the interpolation of openings and endings the way most programs seem to be doing it.

In fact, I think most of the things we do are fads and superstitions. Of course there are good ideas we all use, but I'm pretty sure a lot of ideas are done simply because "everybody else does it that way."
One can always make the case that since we are using integers, there is a discontinuity between any two possible scores.

However, I am thinking more along the lines of what in mathematics is called a "unit step function": a function f(x) that is zero until x reaches some threshold, then jumps to 1. An example is suddenly shutting king safety off when material drops below some value x. Right around that point, bad things can happen. Say you are a pawn up in a won position, but your king position is weak according to your evaluation. You might give up that one-pawn advantage just to cross the threshold and turn off king safety, and if your king-safety scores are large, this might actually be seen as a plus. Or, in a more subtle form, you intentionally wreck your king safety to win a pawn, and then trade pieces to disable the king-safety term. Now you are a pawn up in a lost ending because you shattered your kingside voluntarily.

The interpolation approach does mitigate this a bit, but it is still coarse. I would argue, though, that coarse in terms of (say) 40 steps is much better than coarse with only one big step. Smaller steps would certainly be better.

Unfortunately for me, in Cray Blitz we just shut king safety off at some point, or suddenly ramped up endgame scores at that same point, and it caused some ugly search results here and there. Scaling purely on material has got to be wrong, but it is far better than just turning something on or off at some fixed point. It is wrong because the total material (as an integer count) is very small overall, which makes this "coarseness" issue more apparent. What else should be used is unknown (to me) at present. But there must be something better. I know I don't evaluate like this as a human, for example.
Yes, I agree. This is clearly better than what most of us did before, suddenly going from middle to ending on one move.

I think that having exactly two evaluation functions, however, is just a hack - a simple way that is easier for us to think about when developing an evaluation function.
It is certainly simpler. Older versions of Crafty had one big eval, where scores were folded in based on material remaining, etc. Interpolating between two was a jump in simplicity, if not accuracy. With systems as big as these, simplicity does have its advantages. :)
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Evaluation discontinuity

Post by Michael Sherwin »

Don wrote: The king tables are especially awkward
This is one of Romi's strong points, I think.

Code: Select all

    // King Tables

    // The kings are worth 2,000 each

    // sum grows as Black's remaining material (bMat, which includes the
    // 2,000-point king) shrinks, so the king table shifts gradually toward
    // the endgame instead of switching over at some fixed point.
    sum = (4200 - bMat) >> 6;
    wKingTbl[sq] = ((cenRow + cenCol) * sum);

    // Same idea for the black king, scaled by White's remaining material.
    sum = (4200 - wMat) >> 6;
    bKingTbl[sq] = ((cenRow + cenCol) * sum);