Revisiting GA's for tuning evaluation weights

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

This really works only for the specific set of things where you can create the exact imbalance and understand reasonable "appearances" of it. I don't think it's useful for the general problem of tuning all parameters.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

Don wrote: I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average hand out much stronger players with much less effort doing K/O style tournaments than trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance at making any of these population based learning algorithms work in computer chess, you are going to have to get as efficient as possible because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Two problems:

a) White player advantage. I think you want to compensate for this, otherwise convergence is going to be hurting as you get close to the optimums. (Play 2 games with colors reversed?)

b) What do you do with draws? 1/3 of your matches will end with a draw, so you then have to pick randomly, or so?
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Revisiting GA's for tuning evaluation weights

Post by hgm »

Gian-Carlo Pascutto wrote:This really works only for the specific set of things where you can create the exact imbalance and understand reasonable "appearances" of it. I don't think it's useful for the general problem of tuning all parameters.
I don't see the problem. You know what terms are in your eval, so for any set of positions you can determine where the term contributes and where it doesn't. You can then divide the set of positions into two groups, depending on whether they include the term whose empirical value you want to determine. From each group you then pick a set of representatives, taking care that every other eval term is equally often present and absent in each set.

Of course you can combine this with orthogonal multi-testing, to use the same set of games you play from these starting positions to determine the value of multiple terms.
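A minimal sketch of the split described above; the Position flags and term names are hypothetical stand-ins, since real code would extract them from the engine's own eval:

Code:

    #include <vector>

    // Hypothetical per-position flags recording which eval terms apply.
    struct Position {
        bool tested_term;    // the term whose empirical value we want
        bool bishop_pair;    // stand-ins for "every other eval term"
        bool open_file;
    };

    // Divide the position set into a group that includes the tested term
    // and a group that does not; games played from the two groups are
    // then compared to estimate the term's value.
    void split_by_term(const std::vector<Position>& all,
                       std::vector<Position>& with_term,
                       std::vector<Position>& without_term) {
        for (const Position& p : all)
            (p.tested_term ? with_term : without_term).push_back(p);
        // Representative selection (not shown): pick from each group so
        // that bishop_pair, open_file, etc. are present and absent
        // equally often in both sets, as described above.
    }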
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Revisiting GA's for tuning evaluation weights

Post by Michael Sherwin »

ilari wrote:For a couple of years now I've been interested in tuning evaluation weights of a chess engine with a genetic algorithm. What has discouraged me are the disappointing results that others have gotten, and the huge amount of time it would take. But even though people smarter than me have tried it before and failed, there are a few reasons why I'm keen on discarding their results and doing it again:

- The number of games that it takes to determine superiority between two engines of similar strength was horribly underestimated in the past experiments. Usually the whole learning process consisted of only a few thousand games, and in some cases just a few hundred.
- The hardware they used wasn't very impressive by today's standards. Nowadays we have the tools to run a ton of games quickly, easily and concurrently.
- In some experiments they wrote their own simple and very slow chess engine to tune. In this Master's thesis the author had an engine that could search only 3 plies deep under his test conditions. The population size, the number of games and the number of generations were also ridiculously small.
- The number of evaluation parameters (weights) was usually small. Today's top programs have a large number of parameters, and because they depend on each other, it's tedious and difficult to hand-tune them. GA's on the other hand thrive when the number of parameters (genes) is large.


I've already written a GA library in C++, and I planned on plugging that into a somewhat modified version of cutechess-cli. I'd use the UCI protocol for the learning process, because UCI "spin" options could easily be used as learning parameters (genes) with a minimum and maximum value. The idea is that it would be possible to tune any UCI engine without modifying its source code, apart from converting eval weights to UCI options. So if the engine has options "PawnValue" and "KnightValue" the user could just tell cutechess-cli to use them as learning parameters.

I was thinking of the following initial test conditions:
- Population size of about 100 individuals
- For each generation, every individual plays one game against everyone else. So 4950 games per generation. With 99 games per individual the fitness function will of course be far from perfect, but then again, there's already some randomness in Selection.
- Time control of a few seconds per game. For example, a tc of 4 seconds would mean 5.5 hours per generation, or 1.4 hours on a quad-core.
- As many generations as it takes to reach a global maximum

If the initial test shows any promise, I would consider increasing the population count and number of games per generation.

So what do you people think? Would this be a colossal waste of time? Any advice, corrections, encouraging/discouraging anecdotes?
For a GA to work well, it needs to maintain a different set of parameters for every strategically differentiated pawn structure. If you do not do this, you will never overcome the noise caused by an improved parameter set for one pawn structure being worse for another. I would not start with the whole eval anyway; I would just start using a GA for pawn play. Pawn play, more than anything else, is what leads to decisive play.
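Returning to the quoted proposal, a sketch of how UCI spin options could serve as genes with crossover and mutation; the option names come from the quote, but the ranges are made up and the cutechess-cli wiring is omitted:

Code:

    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    std::mt19937 rng(7);

    // One learning parameter, mirroring a UCI "spin" option.
    struct Gene {
        std::string name;
        int min, max, value;
    };
    using Genome = std::vector<Gene>;

    // Uniform crossover: each gene comes from a random parent, so good
    // genes can spread between otherwise sub-optimal individuals.
    Genome crossover(const Genome& a, const Genome& b) {
        std::bernoulli_distribution pick(0.5);
        Genome child = a;
        for (std::size_t i = 0; i < child.size(); ++i)
            if (pick(rng)) child[i].value = b[i].value;
        return child;
    }

    // Mutation: occasionally re-draw a gene within its UCI min/max range.
    void mutate(Genome& g, double rate) {
        std::bernoulli_distribution hit(rate);
        for (Gene& gene : g)
            if (hit(rng))
                gene.value = std::uniform_int_distribution<int>(
                                 gene.min, gene.max)(rng);
    }

    int main() {
        Genome a = {{"PawnValue", 50, 200, 100}, {"KnightValue", 200, 500, 300}};
        Genome b = {{"PawnValue", 50, 200, 120}, {"KnightValue", 200, 500, 350}};
        Genome child = crossover(a, b);
        mutate(child, 0.05);   // 5% per-gene mutation rate (placeholder)
    }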
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

Gian-Carlo Pascutto wrote:
Don wrote: I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average hand out much stronger players with much less effort doing K/O style tournaments than trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance at making any of these population based learning algorithms work in computer chess, you are going to have to get as efficient as possible because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Two problems:

a) White player advantage. I think you want to compensate for this, otherwise convergence is going to be hurting as you get close to the optimums. (Play 2 games with colors reversed?)

b) What do you do with draws? 1/3 of your matches will end with a draw, so you then have to pick randomly, or so?
When I did this I played games in pairs with colors reversed until one side had an advantage after the two games. In later rounds I extended the length of the matches and required at least N two-game rounds.

There are variations of this. In early rounds it does not seem to pay to be too anal, so in some variations of this experiment I just flipped a coin after 2 games or I limited it to just 4 games and flipped a coin if there was still not a decision. The principle is that neither player distinguished himself so just move on and stop wasting time!
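A minimal sketch of that pairing rule; Player, play_game and the random stubs are placeholders for the real engine-match machinery:

Code:

    #include <cstdlib>

    struct Player { /* engine handle, parameter set, ... */ };

    // Stub: result of one game from White's point of view (+1/0/-1).
    // A real version would run the engines, e.g. via cutechess-cli.
    int play_game(Player&, Player&) { return std::rand() % 3 - 1; }

    // Play color-reversed game pairs until one player leads the
    // mini-match, capped at max_pairs; on a dead tie, flip a coin
    // and move on. Returns +1 if a advances, -1 if b advances.
    int decide_match(Player& a, Player& b, int max_pairs) {
        int score = 0;                    // + favours a, - favours b
        for (int i = 0; i < max_pairs && score == 0; ++i) {
            score += play_game(a, b);     // a plays White
            score -= play_game(b, a);     // colors reversed
        }
        if (score == 0)                   // nobody distinguished himself
            return std::rand() % 2 ? +1 : -1;
        return score > 0 ? +1 : -1;
    }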

Since each round cuts the number of matches in half, it is very cheap to extend the length of the matches in later rounds, and in the simulations I did, this paid off handsomely as measured by the amount of work to get a given quality of player.

When the players are very close together in strength, even a 2 round match is close to random. So build yourself a little simulator where you can define the range of ELO of random players. Assign them ELO ratings and when they play matches let them win in the same proportion that their ELO rating predicts. Then as output see how good the winners are. The goal isn't to pick the best player, it's to generate players way above the median with as little effort as possible. Then compare this to swiss or round robin and you will be astounded. In my simulator I assigned ELO ratings on a curve to be more realistic so that players would follow the "bell shaped" distribution curve. A simple way to do this is to assign 2 or 3 ELO ratings within some range randomly and average them. There you will have good variation but still have most of them cluster towards the center - like in real life.

I don't fully understand the variables to get the very best results but in my experiments I tended to get the best results when I didn't obsess much on early rounds but did on later rounds. One can think of the early round games as a way of quickly filtering out the obviously weak. If you lose in the first round you are not likely a contender. Of course that depends on how many rounds you play.
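That simulator is easy to reproduce. A minimal sketch, with the rating range, field size and seed chosen arbitrarily, and the Elo expected score treated as a pure win probability (draws ignored):

Code:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    std::mt19937 rng(42);

    // Bell-shaped ratings, as described above: average a few uniform draws.
    double random_elo() {
        std::uniform_real_distribution<double> u(1800.0, 2600.0);
        return (u(rng) + u(rng) + u(rng)) / 3.0;
    }

    // Expected score of a against b under the Elo model.
    double win_prob(double a, double b) {
        return 1.0 / (1.0 + std::pow(10.0, (b - a) / 400.0));
    }

    // Single-elimination (K/O) bracket over a power-of-two field; each
    // pairing is decided randomly in the proportion the ratings predict.
    double knockout(std::vector<double> field) {
        while (field.size() > 1) {
            std::vector<double> next;
            for (std::size_t i = 0; i + 1 < field.size(); i += 2) {
                double p = win_prob(field[i], field[i + 1]);
                next.push_back(std::bernoulli_distribution(p)(rng)
                                   ? field[i] : field[i + 1]);
            }
            field.swap(next);
        }
        return field[0];
    }

    int main() {
        std::vector<double> field(128);
        for (double& r : field) r = random_elo();

        std::vector<double> sorted = field;
        std::sort(sorted.begin(), sorted.end());

        // How far above the median does the K/O winner land?
        std::printf("median %.0f  best %.0f  K/O winner %.0f\n",
                    sorted[sorted.size() / 2], sorted.back(), knockout(field));
    }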
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

Don wrote:
Gian-Carlo Pascutto wrote:
Don wrote: I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average hand out much stronger players with much less effort doing K/O style tournaments than trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance at making any of these population based learning algorithms work in computer chess, you are going to have to get as efficient as possible because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Two problems:

a) White player advantage. I think you want to compensate for this, otherwise convergence is going to be hurting as you get close to the optimums. (Play 2 games with colors reversed?)

b) What do you do with draws? 1/3 of your matches will end with a draw, so you then have to pick randomly, or so?
When I did this I played games in pairs with colors reversed until one side had an advantage after the two games. In later rounds I extended the length of the matches and required at least N two-game rounds.

There are variations of this. In early rounds it does not seem to pay to be too anal, so in some variations of this experiment I just flipped a coin after 2 games or I limited it to just 4 games and flipped a coin if there was still not a decision. The principle is that neither player distinguished himself so just move on and stop wasting time!
Another way to look at this is that if a player does not distinguish himself in the first round and there are many rounds to go, either one is likely to be eliminated early anyway, so don't obsess.

It's not clear to me if you should spend a little more time on early rounds when you know your population is not very diverse (their strengths are very similar), because even though you don't want to spend an obsessive amount of time on them, you still want the best player to emerge with a reasonable probability, over 50%. Perhaps it comes down to some kind of cost analysis of how much time you should continue to invest. I think the answer depends on the expected composition of the population you are sampling. If the ratings are really diverse, I think it's a waste of time to spend much effort on this.

Since each round cuts the number of matches in half, it is very cheap to extend the length of the matches in later rounds, and in the simulations I did, this paid off handsomely as measured by the amount of work to get a given quality of player.

When the players are very close together in strength, even a 2 round match is close to random. So build yourself a little simulator where you can define the range of ELO of random players. Assign them ELO ratings and when they play matches let them win in the same proportion that their ELO rating predicts. Then as output see how good the winners are. The goal isn't to pick the best player, it's to generate players way above the median with as little effort as possible. Then compare this to swiss or round robin and you will be astounded. In my simulator I assigned ELO ratings on a curve to be more realistic so that players would follow the "bell shaped" distribution curve. A simple way to do this is to assign 2 or 3 ELO ratings within some range randomly and average them. There you will have good variation but still have most of them cluster towards the center - like in real life.

I don't fully understand the variables to get the very best results but in my experiments I tended to get the best results when I didn't obsess much on early rounds but did on later rounds. One can think of the early round games as a way of quickly filtering out the obviously weak. If you lose in the first round you are not likely a contender. Of course that depends on how many rounds you play.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

Don wrote:
michiguel wrote:
CRoberson wrote:
michiguel wrote:
Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings persisted between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average hand out much stronger players with much less effort doing K/O style tournaments than trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance at making any of these population based learning algorithms work in computer chess, you are going to have to get as efficient as possible because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Of course KO is better. That is how evolution works.

Miguel
I beg to differ. It seems to me that evolution is more like "king of the hill" than KO. "King of the hill" would suggest a simulated annealer approach.

Code:

    king = generate_random_player();
    while (!finished) {
        challenger = generate_random_player();
        play_match(king, challenger);
        if (challenger_won())
            king = challenger;   // the winner becomes the new king
    }
That is a KO where the winner later faces the champion :-) What I mean is that evolutionary fate is decided by single important events, not a championship. Adding too much accuracy is overkill IMO.

The key to evolutionary mechanisms is sexual reproduction and the exchange of good and bad genes. Good genes can survive for a while in individuals that are not the best fit. Two “sub-optimal” individuals can reproduce and have better-fit children if the combination of genes is the right one. If you do not have a strong component of "gene exchange" between individuals, a GA turns into a lottery. That is why keeping diversity is important.


Bacteria do not have sexual reproduction and they do not exchange genes (well... not exactly, but...). We do. How do bacteria evolve, then? They have a new generation every 20-30 min, while we have one every 20-30 years. So if our calculations do not take care of that, we may have to face many more generations to find good results.

I think that for eval terms SA is more suitable, because you have an idea of where to start. Steepest descent and its derivatives will get fast... into a local minimum, a trap. When you are close to a solution that is OK; otherwise it is a disaster. GAs explore a humongous space and do not fall into this problem, but that space is too big. SA explores a bigger space than steepest descent, but a more reasonable one. If SA works for protein dynamics, it should work for this.

Miguel
It doesn't matter whether you are talking about GA, PBIL or SA, the real bottleneck is going to be in the fitness function. You cannot just blow it off by saying too much accuracy is bad because that is never going to be the problem in this case. The problem is going to be how to resolve such a stupendously fuzzy evaluation function and it's going to take a lot of CPU power no matter how you slice it.
I did not say it is bad, I said it is overkill, which seems to agree with what you say in other posts.

Miguel
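A sketch of the SA approach Miguel describes, over a vector of eval weights. The toy fitness stands in for the expensive real one (a match score over many games), and the starting vector, step size and cooling schedule are placeholders:

Code:

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    std::mt19937 rng(1);

    // Toy stand-in so the sketch runs: distance to a hidden optimum.
    // The real fitness would be a match score over many games.
    double fitness(const std::vector<double>& w) {
        static const std::vector<double> opt = {100, 320, 330, 500, 900};
        double s = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i)
            s -= std::fabs(w[i] - opt[i]);
        return s;
    }

    // Simulated annealing: start near a hand-tuned vector and accept
    // downhill moves with probability exp(delta / T), so the early
    // search can escape the local traps steepest descent falls into.
    std::vector<double> anneal(std::vector<double> w, int steps) {
        std::normal_distribution<double> noise(0.0, 5.0);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::uniform_int_distribution<std::size_t> pick(0, w.size() - 1);
        double cur = fitness(w);
        for (int i = 0; i < steps; ++i) {
            double T = 50.0 * (1.0 - double(i) / steps) + 1e-9; // cooling
            std::vector<double> cand = w;
            cand[pick(rng)] += noise(rng);       // perturb one weight
            double f = fitness(cand);
            if (f > cur || u(rng) < std::exp((f - cur) / T)) {
                w = cand;                        // accept the move
                cur = f;
            }
        }
        return w;
    }

    int main() {
        std::vector<double> w = {90, 300, 300, 450, 950}; // rough start
        w = anneal(w, 10000);
    }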
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

Don wrote:
Don wrote:
Gian-Carlo Pascutto wrote:
Don wrote: I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average hand out much stronger players with much less effort doing K/O style tournaments than trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance at making any of these population based learning algorithms work in computer chess, you are going to have to get as efficient as possible because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Two problems:

a) White player advantage. I think you want to compensate for this, otherwise convergence is going to be hurting as you get close to the optimums. (Play 2 games with colors reversed?)

b) What do you do with draws? 1/3 of your matches will end with a draw, so you then have to pick randomly, or so?
When I did this I played games in pairs with colors reversed until one side had an advantage after the two games. In later rounds I extended the length of the matches and required at least N two-game rounds.

There are variations of this. In early rounds it does not seem to pay to be too anal, so in some variations of this experiment I just flipped a coin after 2 games or I limited it to just 4 games and flipped a coin if there was still not a decision. The principle is that neither player distinguished himself so just move on and stop wasting time!
Another way to look at this is that if a player does not distinguish himself in the first round and there are many rounds to go, either one is likely to be eliminated early anyway, so don't obsess.

It's not clear to me if you should spend a little more time on early rounds when you know your population is not very diverse (their strengths are very similar), because even though you don't want to spend an obsessive amount of time on them, you still want the best player to emerge with a reasonable probability, over 50%. Perhaps it comes down to some kind of cost analysis of how much time you should continue to invest. I think the answer depends on the expected composition of the population you are sampling. If the ratings are really diverse, I think it's a waste of time to spend much effort on this.
I agree with what you say, except that I do not understand why you want to end up with "one" best player. It should be better to end up with a set of good players, not one. Otherwise, all the evolutionary mechanisms are not working; it would be more like a breeding process. What do you do after you end up with the best? I may not be understanding your setup.

Miguel

Since each round cuts the number of matches in half, it is very cheap to extend the length of the matches in later rounds, and in the simulations I did, this paid off handsomely as measured by the amount of work to get a given quality of player.

When the players are very close together in strength, even a 2 round match is close to random. So build yourself a little simulator where you can define the range of ELO of random players. Assign them ELO ratings and when they play matches let them win in the same proportion that their ELO rating predicts. Then as output see how good the winners are. The goal isn't to pick the best player, it's to generate players way above the median with as little effort as possible. Then compare this to swiss or round robin and you will be astounded. In my simulator I assigned ELO ratings on a curve to be more realistic so that players would follow the "bell shaped" distribution curve. A simple way to do this is to assign 2 or 3 ELO ratings within some range randomly and average them. There you will have good variation but still have most of them cluster towards the center - like in real life.

I don't fully understand the variables to get the very best results but in my experiments I tended to get the best results when I didn't obsess much on early rounds but did on later rounds. One can think of the early round games as a way of quickly filtering out the obviously weak. If you lose in the first round you are not likely a contender. Of course that depends on how many rounds you play.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

michiguel wrote: I agree with what you say, except that I do not understand why you want to end up with "one" best player. It should be better to end up with a set of good players, not one. Otherwise, all the evolutionary mechanisms are not working; it would be more like a breeding process. What do you do after you end up with the best? I may not be understanding your setup.
Have it mate with everybody else.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

Michael Sherwin wrote: For a GA to work well, it needs to maintain a different set of parameters for every strategically differentiated pawn structure. If you do not do this, you will never overcome the noise caused by an improved parameter set for one pawn structure being worse for another. I would not start with the whole eval anyway; I would just start using a GA for pawn play. Pawn play, more than anything else, is what leads to decisive play.
This makes no sense to me. Chess programs without different sets of parameters for different pawn structures are able to play the game. There is no guarantee that the weights in them are optimal (even if there is only one set for all pawn structures), so a GA can of course work to improve them.

What you are claiming is that you cannot build better evaluation functions if you do not use different sets for different pawn structures. That's just not true. Chess programs are full of heuristics which produce bad results in some situations. What matters is that the overall result is positive, and this is perfectly possible to achieve with one set of parameters.

In fact, doing what you describe would make matters much worse, due to the curse of dimensionality: every pawn-structure class you distinguish multiplies the number of parameters to tune, and hence the number of games needed to tune them.