Revisiting GA's for tuning evaluation weights

Discussion of chess software programming and technical issues.

Moderator: Ras

Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Revisiting GA's for tuning evaluation weights

Post by Karlo Bala »

ilari wrote:For a couple of years now I've been interested in tuning evaluation weights of a chess engine with a genetic algorithm. What has discouraged me are the disappointing results that others have gotten, and the huge amount of time it would take. But even though people smarter than me have tried it before and failed, there are a few reasons why I'm keen on discarding their results and doing it again:

- The number of games that it takes to determine superiority between two engines of similar strength was horribly underestimated in the past experiments. Usually the whole learning process consisted of only a few thousand games, and in some cases just a few hundred.
- The hardware they used wasn't very impressive by today's standards. Nowadays we have the tools to run a ton of games quickly, easily and concurrently.
- In some experiments they wrote their own simple and very slow chess engine to tune. In this Master's thesis the author had an engine that could search only 3 plies deep under his test conditions. The population size, the number of games and the number of generations were also ridiculously small.
- The number of evaluation parameters (weights) was usually small. Today's top programs have a large number of parameters, and because they depend on each other, it's tedious and difficult to hand-tune them. GA's on the other hand thrive when the number of parameters (genes) is large.


I've already written a GA library in C++, and I planned on plugging it into a somewhat modified version of cutechess-cli. I'd use the UCI protocol for the learning process, because UCI "spin" options could be easily used as learning parameters (genes) with a minimum and maximum value. The idea is that it would be possible to tune any UCI engine without modifying its source code, apart from converting eval weights to UCI options. So if the engine has options "PawnValue" and "KnightValue", the user could just tell cutechess-cli to use them as learning parameters.
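As a sketch of how such genes could map onto UCI spin options (the SpinGene type and the mutateGene/sendGenome helpers are illustrative names of mine, not actual cutechess-cli API; the "setoption" command itself is standard UCI):

Code: Select all

#include <iostream>
#include <random>
#include <string>
#include <vector>

// One learning parameter (gene) backed by a UCI "spin" option.
struct SpinGene {
    std::string name;   // e.g. "PawnValue"
    int min, max;       // bounds advertised by the engine
    int value;          // current allele
};

std::mt19937 rng(1);

// Random-reset mutation that respects the option's bounds.
void mutateGene(SpinGene& g) {
    std::uniform_int_distribution<int> d(g.min, g.max);
    g.value = d(rng);
}

// Configure an engine by writing standard UCI "setoption" commands.
void sendGenome(std::ostream& engineIn, const std::vector<SpinGene>& genome) {
    for (const SpinGene& g : genome)
        engineIn << "setoption name " << g.name
                 << " value " << g.value << "\n";
}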

I was thinking of the following initial test conditions:
- Population size of about 100 individuals
- For each generation, every individual plays one game against everyone else. So 4950 games per generation. With 99 games per individual the fitness function will of course be far from perfect, but then again, there's already some randomness in Selection.
- Time control of a few seconds per game. For example, a tc of 4 seconds would mean 5.5 hours per generation, or 1.4 hours on a quad-core.
- As many generations as it takes to reach a global maximum

If the initial test shows any promise, I would consider increasing the population count and number of games per generation.

So what do you people think? Would this be a colossal waste of time? Any advice, corrections, encouraging/discouraging anecdotes?
Why not fit the evaluation parameters to a finite set of positions?
To get an accurate evaluation for every position, you can play a lot of games with different engines. Maybe you can use game databases from CEGT, CCRL, or from ChessBase, like Kaufman did. Or perhaps you can use probabilities from opening books.
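A minimal sketch of what this could look like: give each position an observed score (e.g. the win rate from many engine games, or book statistics) and minimize the squared error of the static eval pushed through a sigmoid. Position, Weights, and the toy material-only evaluate() below are hypothetical stand-ins:

Code: Select all

#include <cmath>
#include <vector>

struct Weights { int pawn, knight, bishop, rook, queen; };

struct Position {
    int pawns, knights, bishops, rooks, queens;  // net counts, White - Black
    double observedScore;                        // e.g. win rate in [0, 1]
};

// Toy material-only evaluation in centipawns, from White's point of view.
double evaluate(const Position& p, const Weights& w) {
    return p.pawns * w.pawn + p.knights * w.knight + p.bishops * w.bishop
         + p.rooks * w.rook + p.queens * w.queen;
}

// Map centipawns to an expected score in [0, 1].
double sigmoid(double cp) { return 1.0 / (1.0 + std::pow(10.0, -cp / 400.0)); }

// Mean squared error over the position set; tuning minimizes this over w.
double fitError(const std::vector<Position>& set, const Weights& w) {
    double e = 0.0;
    for (const Position& p : set) {
        double diff = sigmoid(evaluate(p, w)) - p.observedScore;
        e += diff * diff;
    }
    return e / set.size();
}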
Best Regards,
Karlo Balla Jr.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
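Here is a minimal Monte Carlo sketch (my reconstruction, not Don's actual simulation) of that comparison: player strengths are random Elo numbers, game results follow the logistic Elo model, and we compare the average strength of the player each scheme selects, keeping in mind that the knockout spends only n-1 games against n(n-1)/2 for the round robin:

Code: Select all

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

std::mt19937 rng(42);

// P(A beats B) under the logistic Elo model.
double winProb(double eloA, double eloB) {
    return 1.0 / (1.0 + std::pow(10.0, (eloB - eloA) / 400.0));
}

bool playGame(double eloA, double eloB) {   // true if A wins
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < winProb(eloA, eloB);
}

// Single-elimination knockout; assumes a power-of-two pool. n - 1 games.
double knockout(std::vector<double> pool) {
    while (pool.size() > 1) {
        std::vector<double> next;
        for (size_t i = 0; i + 1 < pool.size(); i += 2)
            next.push_back(playGame(pool[i], pool[i + 1]) ? pool[i]
                                                          : pool[i + 1]);
        pool = next;
    }
    return pool[0];
}

// Full round robin; n(n-1)/2 games, highest score is selected.
double roundRobin(const std::vector<double>& pool) {
    std::vector<int> score(pool.size(), 0);
    for (size_t i = 0; i < pool.size(); ++i)
        for (size_t j = i + 1; j < pool.size(); ++j) {
            if (playGame(pool[i], pool[j])) ++score[i]; else ++score[j];
        }
    size_t best = std::max_element(score.begin(), score.end()) - score.begin();
    return pool[best];
}

int main() {
    std::normal_distribution<double> elo(2400.0, 120.0);
    const int trials = 1000, n = 128;
    double koSum = 0.0, rrSum = 0.0;
    for (int t = 0; t < trials; ++t) {
        std::vector<double> pool(n);
        for (double& e : pool) e = elo(rng);
        koSum += knockout(pool);     // 127 games per trial
        rrSum += roundRobin(pool);   // 8128 games per trial
    }
    std::printf("KO winner avg Elo: %.1f   RR winner avg Elo: %.1f\n",
                koSum / trials, rrSum / trials);
}

The round robin will usually pick a slightly stronger winner, but at 64 times the cost in games; per game played, the knockout's selection pressure is far cheaper.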
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Of course KO is better. That is how evolution works.

Miguel
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

ilari wrote: I was thinking of the following initial test conditions:
- Population size of about 100 individuals
- For each generation, every individual plays one game against everyone else. So 4950 games per generation. With 99 games per individual the fitness function will of course be far from perfect, but then again, there's already some randomness in Selection.
I would suggest that instead you choose a power-of-two population size such as 128 instead of 100. Maybe even 256 is better. Don't waste time playing the same number of games for each player; instead, play a big knockout tournament among all these players. In the early rounds play just a few games per match, and in later rounds, when 2 players meet, play longer matches between them.

I guarantee that with the same number of games, you will on average have selected MUCH stronger players.

Here is some pseudocode elegantly expressing a K/O tournament, which of course is recursive:

Code: Select all

player_t  ko(int rounds)
{
  player_t  a;
  player_t  b;

  if (rounds == 1) {
    /* Leaf match: draw two fresh individuals from the population. */
    a = getNextPlayer();
    b = getNextPlayer();
  } else {
    /* Each sub-bracket recursively yields one survivor. */
    a = ko(rounds - 1);
    b = ko(rounds - 1);
  }

  player_t  winner = playMatch(a, b);

  return winner;
}
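Note that with this scheme the population size is implicitly a power of two: each leaf of the recursion draws two fresh individuals via getNextPlayer(), so ko(7) consumes exactly 2^7 = 128 players and crowns one champion after 127 matches.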
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Revisiting GA's for tuning evaluation weights

Post by BubbaTough »

Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
We are in complete agreement that round robin gives bad bang for the buck. But being better than that is not very close to being "close to the best"! As you say, round robin is just awful, and not hard to beat.

But don't you think that keeping an Elo rating for your versions, one that persists between generations, will give you more "bang" than starting from scratch each generation, without more work (and therefore more bang for the buck)? How to decide which games to run is still an interesting question. One way would be to do a knockout tournament, in which case I think you would have a better version of what you proposed. But there are many others (for example, one could create matchups using statistical analysis, taking the ratings and their error bars into account to identify the engines most likely to be the best, and give the games to them).
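For illustration, the persistent-rating bookkeeping can be as simple as standard Elo updates carried across generations (a sketch with ad-hoc names, not Sam's actual code); offspring could inherit, say, the average of their parents' ratings as a starting prior:

Code: Select all

#include <cmath>

// Standard logistic expectation for player A against player B.
double expectedScore(double ra, double rb) {
    return 1.0 / (1.0 + std::pow(10.0, (rb - ra) / 400.0));
}

// scoreA is 1.0 for a win, 0.5 for a draw, 0.0 for a loss; k trades
// rating stability against responsiveness.
void updateElo(double& ra, double& rb, double scoreA, double k = 24.0) {
    double ea = expectedScore(ra, rb);
    ra += k * (scoreA - ea);
    rb += k * ((1.0 - scoreA) - (1.0 - ea));
}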

-Sam
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.
Now that I read this again: why do you want to end up with only one or two players? I think it is better to keep diversity in the gene pool by selecting the best 1/4 to 1/8 of the players, mating them, reproducing them, mutating them, and starting again.
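A minimal sketch of that generation step, keeping the top 1/8 and refilling by crossover and mutation (Genome and the helpers are hypothetical names, not anyone's actual code):

Code: Select all

#include <random>
#include <vector>

struct Genome {
    std::vector<int> weights;   // the evaluation parameters
    double fitness = 0.0;       // e.g. tournament or rating result
};

std::mt19937 rng(123);

// Uniform crossover: each gene comes from a random parent.
Genome crossover(const Genome& a, const Genome& b) {
    Genome child;
    std::uniform_int_distribution<int> coin(0, 1);
    for (size_t i = 0; i < a.weights.size(); ++i)
        child.weights.push_back(coin(rng) ? a.weights[i] : b.weights[i]);
    return child;
}

// Small random perturbations keep diversity in the pool.
void mutate(Genome& g, double rate = 0.02) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::uniform_int_distribution<int> delta(-10, 10);
    for (int& w : g.weights)
        if (u(rng) < rate) w += delta(rng);
}

// One generation; assumes pop is sorted by fitness, best first.
void nextGeneration(std::vector<Genome>& pop) {
    size_t keep = pop.size() / 8;          // truncation selection: top 1/8
    size_t target = pop.size();
    pop.resize(keep);
    std::uniform_int_distribution<size_t> pick(0, keep - 1);
    while (pop.size() < target) {          // refill with offspring
        Genome child = crossover(pop[pick(rng)], pop[pick(rng)]);
        mutate(child);
        pop.push_back(child);
    }
}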

Miguel

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Revisiting GA's for tuning evaluation weights

Post by CRoberson »

michiguel wrote:
Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Of course KO is better. That is how evolution works.

Miguel
I beg to differ. It seems to me that evolution is more like "king of the hill" than KO. "King of the hill" would suggest a simulated-annealing approach.

Code: Select all

    /* King of the hill: the reigning champion plays each new challenger. */
    player_t king = generatePlayer();            /* opponent #1 becomes the king */

    while (!finished()) {
        player_t challenger = generatePlayer();  /* opponent #2 */
        king = playMatch(king, challenger);      /* winner is the new king */
    }
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

CRoberson wrote:
michiguel wrote:
Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Of course KO is better. That is how evolution works.

Miguel
I beg to differ. It seems to me that evolution is more like "king of the hill" than KO. "King of the hill" would suggest a simulated-annealing approach.

Code: Select all

    /* King of the hill: the reigning champion plays each new challenger. */
    player_t king = generatePlayer();            /* opponent #1 becomes the king */

    while (!finished()) {
        player_t challenger = generatePlayer();  /* opponent #2 */
        king = playMatch(king, challenger);      /* winner is the new king */
    }
That is KO where the winner later faces the champion :-) What I mean is that evolutionary fate is decided by single important events, not a championship. Adding too much accuracy is overkill, IMO.

The key to evolutionary mechanisms is sexual reproduction and the exchange of good and bad genes. Good genes can survive for a while in individuals that are not the fittest. Two "sub-optimal" individuals can reproduce and have fitter children if the combination of genes is the right one. If you do not have a strong component of "gene exchange" between individuals, a GA turns into a lottery. That is why keeping diversity is important.

Bacteria do not have sexual reproduction and do not exchange genes (well... not exactly, but...). We do. How do bacteria evolve, then? They have a new generation every 20-30 minutes, while we have one every 20-30 years. So if our calculations do not take care of that, we may have to run many more generations to get good results.

I think that for eval terms SA is more suitable, because you have an idea of where to start. Steepest descent and its derivatives will get you fast... into a local minimum, a trap. When you are close to a solution that is OK; otherwise it is a disaster. GAs explore a humongous space and do not fall into this trap, but the space is too big. SA explores a bigger space than steepest descent, yet a more reasonable one than a GA. If SA works for protein dynamics, it should work for this.
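For concreteness, the heart of an SA run is just the Metropolis acceptance rule; a minimal sketch (names are mine, and "cost" would be something like the score deficit over a batch of games):

Code: Select all

#include <cmath>
#include <random>

std::mt19937 rng(7);

// Accept any improvement; accept a worsening of `delta` with probability
// exp(-delta / T). Hot early phases roam widely past local minima, cold
// late phases settle into one.
bool accept(double delta, double temperature) {
    if (delta <= 0.0) return true;
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-delta / temperature);
}

// A typical cooling schedule multiplies T by some alpha in (0, 1)
// after each step, e.g. T *= 0.99.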

Miguel
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

michiguel wrote:
CRoberson wrote:
michiguel wrote:
Don wrote:
BubbaTough wrote:I tried this in 1992 I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings were persistent between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.

It seems like a faster fitness function than your round robin (thereby allowing more generations) and perhaps more accurate than Don's...but perhaps not. Anyway, what do I know...it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam
It's not about accuracy.

The one thing you are missing here is that this process should give you the most "bang for the buck." You want to find a strong player (or two players) without doing much work.

KO (knockout) is not very good in the "accuracy" department, but it's great in the selection department. With K/O there is a very high probability that you have selected a very strong player out of a huge sample of players with a minimum of effort. I argue that that is much better (by far) than taking a tiny population and spending a lot of time trying to precisely measure which of those is best.

Its efficiency comes from the fact that it does not waste much time on the weak. In the early rounds of a knockout tournament the stronger player may not always win, but there is a good chance that the player that beat it is himself a pretty strong player. And even when that is not true, the next round will punish him.

I did the simulations and there is no comparison to round robin. Out of an infinite population of random players, I can on average pick out much stronger players with much less effort using K/O-style tournaments than by trying to measure with the horribly inefficient round robin. And for this problem that is all that matters. And there is no question that to have any chance of making any of these population-based learning algorithms work in computer chess, you are going to have to get as efficient as possible, because the "fitness" function is going to be incredibly expensive to compute. I think K/O is probably close to the best fitness function you can get. It's like the difference between bubble sort and quicksort for algorithmic efficiency.
Of course KO is better. That is how evolution works.

Miguel
I beg to differ. It seems to me that evolution is more like "king of the hill" than KO. "King of the hill" would suggest a simulated-annealing approach.

Code: Select all

    /* King of the hill: the reigning champion plays each new challenger. */
    player_t king = generatePlayer();            /* opponent #1 becomes the king */

    while (!finished()) {
        player_t challenger = generatePlayer();  /* opponent #2 */
        king = playMatch(king, challenger);      /* winner is the new king */
    }
That is KO where the winner later faces the champion :-) What I mean is that evolutionary fate is decided by single important events, not a championship. Adding too much accuracy is overkill, IMO.

The key to evolutionary mechanisms is sexual reproduction and the exchange of good and bad genes. Good genes can survive for a while in individuals that are not the fittest. Two "sub-optimal" individuals can reproduce and have fitter children if the combination of genes is the right one. If you do not have a strong component of "gene exchange" between individuals, a GA turns into a lottery. That is why keeping diversity is important.

Bacteria do not have sexual reproduction and do not exchange genes (well... not exactly, but...). We do. How do bacteria evolve, then? They have a new generation every 20-30 minutes, while we have one every 20-30 years. So if our calculations do not take care of that, we may have to run many more generations to get good results.

I think that for eval terms SA is more suitable, because you have an idea of where to start. Steepest descent and its derivatives will get you fast... into a local minimum, a trap. When you are close to a solution that is OK; otherwise it is a disaster. GAs explore a humongous space and do not fall into this trap, but the space is too big. SA explores a bigger space than steepest descent, yet a more reasonable one than a GA. If SA works for protein dynamics, it should work for this.

Miguel
It doesn't matter whether you are talking about a GA, PBIL, or SA: the real bottleneck is going to be the fitness function. You cannot just blow it off by saying too much accuracy is bad, because that is never going to be the problem in this case. The problem is going to be how to resolve such a stupendously fuzzy fitness function, and it's going to take a lot of CPU power no matter how you slice it.
hgm
Posts: 28378
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Revisiting GA's for tuning evaluation weights

Post by hgm »

In my experience, testing the effect of tuning an evaluation term by playing different program versions against each other from a balanced position is not a competitive way to determine the optimal value of that term. It is really orders of magnitude faster to play identical programs from an imbalanced position.

E.g., if I want to know a good piece value for Q relative to an earlier determined B = N = 3, I play games from opening-like positions that are symmetric except that one side lacks Q or Q+P, and the other BBN or BNN.

If you tried to obtain the same data by playing from a balanced position, you would depend on the engines first creating that material imbalance, which might happen in only 10% of the games. The other 90% of the games would not be affected by the programmed value difference, as Q was traded for Q, which is a neutral trade for both programs no matter what Q value they employ. But those 90% do contribute to the statistical noise. To get the 10%-diluted effect above the noise level you then need 100 times as many games, as the noise only decreases as the square root of the number of games.
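To spell out the arithmetic of that last step: if only a fraction $d$ of the games carries the effect $\Delta$, while the per-game noise $\sigma$ is unchanged, the signal-to-noise ratio over $N$ games is

$$\mathrm{SNR} \propto \frac{d\,\Delta}{\sigma/\sqrt{N}} = \frac{d\,\Delta\sqrt{N}}{\sigma} \quad\Longrightarrow\quad N \propto \frac{1}{d^{2}},$$

so with $d = 0.1$ you need $1/0.1^2 = 100$ times as many games, exactly as estimated above.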