Value of a Feature or Heuristic

jorose · Post by **jorose** » Sun Feb 15, 2015 6:14 pm

Hi, everyone!

I've been thinking about what feature I want to implement next in The Machine (my little digital woodpushing child) and it got me into thinking about the value of different features in search and evaluation. In doing so I realized there are some problems with trying to put a value on different features. Sometimes I see people write that something is "worth" 5 elo or something similar. While this may be true for their own engine I feel the value of a feature is tied extremely close to a specific engine. Depending on what other features are implemented, what time control is being used (from what I've seen search and speed improvements tend to fair better at lower time controls) and even how strong an engine is (an idea worth 5 elo for Stockfish will most likely be worth much more for an engine like Micromax). This may not seem like a problem at first as you could do your own tests to see how much a feature is worth for your own engine, but I feel it gets problematic as the impact of a feature changes over the life cycle of an engine.

In order to circumvent this problem for evaluation heuristics I decided to start keeping track of the maximum impact of a variable where I define the maximum impact as the difference between minimum and maximum effect of an evaluation term assuming it is "active". What I mean by this is that for a PST I take the difference between the minimal and maximal value (not the value of a piece being there and not being there but of a piece being on the best or worst square of the PST). In the future I also intend to pick a set of random positions and calculating the average impact on score in those positions (score difference using the heuristic or not). Like this I hope to be able to better evaluate the effect of a heuristic over the lifetime of my engine as I add more and more terms and retune my evaluation function again and again.

I was wondering if anyone else spent time thinking about this problem and if anyone knew of a good way to evaluate the impact of a search feature other than running a gauntlet with or without the feature (eg something that works efficiently without having Stockfish-like resources to back you up).

hgm · Post by **hgm** » Sun Feb 15, 2015 7:03 pm

I think it is not so much a matter of the engine getting stronger per se, but of non-orthogonal features. And of course as you develop the engine and add more and more features, it does get more likely that some of the features are doing partly the same, and will be able to fill in part of the gap created by switching one feature off.

E.g. centralization by PST in a sense is just poor-man's mobility. A truly good mobility term, which also scores X-ray moves that you could activate through discovering them, would almost completely make it redundant. But as long as you don;t have mobility at all, the PST will be crucial for developing the pieces. (Of course some aspects of the position can be implemented through PST, and not by mobility, such as a drive to advance Pawns.)

So a large part of making an engine stronger consists of replacing easy-to-implement very roughly averaging terms by more precise and accurate terms that on average do the same. This usually gains you much more than very precisely tuning the weight of the rough approximation. I always compare this to approximating a circular area with a square. No matter how precisely you determine the best size of the square to minimize the area of the parts not covering each other, the error will always remain large. Using an octagon instead of a square is immediately much better, even when the size is so off that it is entirely inside or outside the circle, touching it.

matthewlai · Post by **matthewlai** » Sun Feb 15, 2015 7:16 pm

jorose wrote:Hi, everyone!

I've been thinking about what feature I want to implement next in The Machine (my little digital woodpushing child) and it got me into thinking about the value of different features in search and evaluation. In doing so I realized there are some problems with trying to put a value on different features. Sometimes I see people write that something is "worth" 5 elo or something similar. While this may be true for their own engine I feel the value of a feature is tied extremely close to a specific engine. Depending on what other features are implemented, what time control is being used (from what I've seen search and speed improvements tend to fair better at lower time controls) and even how strong an engine is (an idea worth 5 elo for Stockfish will most likely be worth much more for an engine like Micromax). This may not seem like a problem at first as you could do your own tests to see how much a feature is worth for your own engine, but I feel it gets problematic as the impact of a feature changes over the life cycle of an engine.

In order to circumvent this problem for evaluation heuristics I decided to start keeping track of the maximum impact of a variable where I define the maximum impact as the difference between minimum and maximum effect of an evaluation term assuming it is "active". What I mean by this is that for a PST I take the difference between the minimal and maximal value (not the value of a piece being there and not being there but of a piece being on the best or worst square of the PST). In the future I also intend to pick a set of random positions and calculating the average impact on score in those positions (score difference using the heuristic or not). Like this I hope to be able to better evaluate the effect of a heuristic over the lifetime of my engine as I add more and more terms and retune my evaluation function again and again.

I was wondering if anyone else spent time thinking about this problem and if anyone knew of a good way to evaluate the impact of a search feature other than running a gauntlet with or without the feature (eg something that works efficiently without having Stockfish-like resources to back you up).

Unfortunately I don't think there is a known way to do that.

The problem is, like you mentioned, the effect of an evaluation term is dependent on the rest of the evaluation and even the search, and that's because features are not dependent.

In an extreme case, if you have 2 features doing the exact same thing, you can say either one is completely useless, given you have the other one.

In less extreme cases, the correlation is not 100%, but may still be strong.

For example, if you have a knight centrality feature and a knight mobility feature, while they don't correlate 100%, there is very significant correlation, and introducing either one will give you most of the ELO increase of introducing both (assuming perfectly tuned weights in both). If using only knight centrality gives you +80, using only knight mobility gives you +80, and using both gives you +100, what would your system assign as effects of those terms? Do you assume no other features exist (which wouldn't really make sense), or all other features exist (in which case you'll say both those features only give you +20)?

For eval-search interaction, zugzwang detection probably has higher value in engines that do null move prunning in end game, and king safety probably has higher value in engines that don't do check extension, for example.

Engine strength is a highly non-linear combination all all its features. It may be possible to model this using something like an artificial neural network, but you'll need way more than Stockfish-like resources to train that network, because of the roughness of the output space (meaning similar inputs can produce very different outputs).

Ferdy · Post by **Ferdy** » Mon Feb 16, 2015 10:23 am

jorose wrote:Hi, everyone!

I've been thinking about what feature I want to implement next in The Machine (my little digital woodpushing child) and it got me into thinking about the value of different features in search and evaluation. In doing so I realized there are some problems with trying to put a value on different features.

One way of solving this is by creating some anchor feature and value in the evaluation. Example from start position you play e2e4. What would be its value if 1 pawn is worth 100 points? Say 4. So from that anchor you will now have an idea of what is the worth of h3?, the worth of d3? when you design your pst. Next if you implement the advantage of having 2 bishops say 50 cp, then you have to choose for the penalty of doubled pawn because 2 bishops advantage is closely related to doubled pawn. So now you have the penalty of doubled pawn. Next what about the backward pawn? Since this is a pawn weakness you have to choose whether backward pawn is worst than doubled pawn. Generally the secret here is your weighing of value relative to other values in another feature. Most good chess players knew this. That is just the first try, then you always need the test games to establish the base line, then you may go for auto-tuning if you explore more. The major eval worth exploring and best way to waste time are mobility, king safety, king attack, passer, and pawn structure.

Sometimes I see people write that something is "worth" 5 elo or something similar. While this may be true for their own engine I feel the value of a feature is tied extremely close to a specific engine. Depending on what other features are implemented, what time control is being used (from what I've seen search and speed improvements tend to fair better at lower time controls) and even how strong an engine is (an idea worth 5 elo for Stockfish will most likely be worth much more for an engine like Micromax). This may not seem like a problem at first as you could do your own tests to see how much a feature is worth for your own engine, but I feel it gets problematic as the impact of a feature changes over the life cycle of an engine.

In order to circumvent this problem for evaluation heuristics I decided to start keeping track of the maximum impact of a variable where I define the maximum impact as the difference between minimum and maximum effect of an evaluation term assuming it is "active". What I mean by this is that for a PST I take the difference between the minimal and maximal value (not the value of a piece being there and not being there but of a piece being on the best or worst square of the PST). In the future I also intend to pick a set of random positions and calculating the average impact on score in those positions (score difference using the heuristic or not). Like this I hope to be able to better evaluate the effect of a heuristic over the lifetime of my engine as I add more and more terms and retune my evaluation function again and again.

I was wondering if anyone else spent time thinking about this problem and if anyone knew of a good way to evaluate the impact of a search feature other than running a gauntlet with or without the feature (eg something that works efficiently without having Stockfish-like resources to back you up).

AlvaroBegue · Post by **AlvaroBegue** » Mon Feb 16, 2015 5:39 pm

Ferdy wrote:
jorose wrote:Hi, everyone!

I've been thinking about what feature I want to implement next in The Machine (my little digital woodpushing child) and it got me into thinking about the value of different features in search and evaluation. In doing so I realized there are some problems with trying to put a value on different features.
One way of solving this is by creating some anchor feature and value in the evaluation. Example from start position you play e2e4. What would be its value if 1 pawn is worth 100 points? Say 4. So from that anchor you will now have an idea of what is the worth of h3?, the worth of d3? when you design your pst. Next if you implement the advantage of having 2 bishops say 50 cp, then you have to choose for the penalty of doubled pawn because 2 bishops advantage is closely related to doubled pawn. So now you have the penalty of doubled pawn. Next what about the backward pawn? Since this is a pawn weakness you have to choose whether backward pawn is worst than doubled pawn. Generally the secret here is your weighing of value relative to other values in another feature. Most good chess players knew this. That is just the first try, then you always need the test games to establish the base line, then you may go for auto-tuning if you explore more. The major eval worth exploring and best way to waste time are mobility, king safety, king attack, passer, and pawn structure.

I think you misread the question. This thread is about evaluating how much of an improvement in playing comes from adding a feature, which is a different question than what the weight of a feature should be in the evaluation function. The title is ambiguous, but if you read the entire first post, you'll see what I am saying.

The other question is more interesting to me, actually.

Ferdy · Post by **Ferdy** » Mon Feb 16, 2015 6:09 pm

AlvaroBegue wrote:
Ferdy wrote:
jorose wrote:Hi, everyone!

I've been thinking about what feature I want to implement next in The Machine (my little digital woodpushing child) and it got me into thinking about the value of different features in search and evaluation. In doing so I realized there are some problems with trying to put a value on different features.
One way of solving this is by creating some anchor feature and value in the evaluation. Example from start position you play e2e4. What would be its value if 1 pawn is worth 100 points? Say 4. So from that anchor you will now have an idea of what is the worth of h3?, the worth of d3? when you design your pst. Next if you implement the advantage of having 2 bishops say 50 cp, then you have to choose for the penalty of doubled pawn because 2 bishops advantage is closely related to doubled pawn. So now you have the penalty of doubled pawn. Next what about the backward pawn? Since this is a pawn weakness you have to choose whether backward pawn is worst than doubled pawn. Generally the secret here is your weighing of value relative to other values in another feature. Most good chess players knew this. That is just the first try, then you always need the test games to establish the base line, then you may go for auto-tuning if you explore more. The major eval worth exploring and best way to waste time are mobility, king safety, king attack, passer, and pawn structure.
I think you misread the question. This thread is about evaluating how much of an improvement in playing comes from adding a feature, which is a different question than what the weight of a feature should be in the evaluation function.

I don't think so, my reply is only for the one I quoted, am not even touching the search part

.

AlvaroBegue wrote:The title is ambiguous, but if you read the entire first post, you'll see what I am saying.

The other question is more interesting to me, actually.

jorose · Post by **jorose** » Mon Feb 16, 2015 6:20 pm

Actually Alvaro is right, but it seems I indeed worded my question fairly badly.

That being said, I'm grateful for the responses I got anyways, they were a good read for me =)

Ferdy · Post by **Ferdy** » Mon Feb 16, 2015 6:46 pm

jorose wrote:Actually Alvaro is right, but it seems I indeed worded my question fairly badly.

I understand your post, but choose to reply on that part, that is why I include a quote

.

jorose wrote: That being said, I'm grateful for the responses I got anyways, they were a good read for me =)

bob · Post by **bob** » Mon Feb 16, 2015 8:53 pm

hgm wrote:I think it is not so much a matter of the engine getting stronger per se, but of non-orthogonal features. And of course as you develop the engine and add more and more features, it does get more likely that some of the features are doing partly the same, and will be able to fill in part of the gap created by switching one feature off.

E.g. centralization by PST in a sense is just poor-man's mobility. A truly good mobility term, which also scores X-ray moves that you could activate through discovering them, would almost completely make it redundant. But as long as you don;t have mobility at all, the PST will be crucial for developing the pieces. (Of course some aspects of the position can be implemented through PST, and not by mobility, such as a drive to advance Pawns.)

So a large part of making an engine stronger consists of replacing easy-to-implement very roughly averaging terms by more precise and accurate terms that on average do the same. This usually gains you much more than very precisely tuning the weight of the rough approximation. I always compare this to approximating a circular area with a square. No matter how precisely you determine the best size of the square to minimize the area of the parts not covering each other, the error will always remain large. Using an octagon instead of a square is immediately much better, even when the size is so off that it is entirely inside or outside the circle, touching it.

I will go with HGM here. In fact, I have noticed that as an evaluation grows, it gets more "robust" in general. You can actually have a badly broken eval term that won't hurt a bit, because the other terms offset the problem. As the program grows, each part adds less and less to the overall picture because there are more and more new parts added, which causes the changes to represent just a tiny part of the total thing.

jorose · Post by **jorose** » Tue Feb 17, 2015 12:55 am

This reminds me of what I found while working with some supervised learning algorithms. I recall adding noise to a data-set of handwritten digits (mnist, for those familiar with the data-set) in essence mislabeling 5% of the training data. Interestingly enough most algorithms barely performed worse on the test data even compared to when they were fed the non faulty data.

Value of a Feature or Heuristic

Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic

Re: Value of a Feature or Heuristic