Revisiting GA's for tuning evaluation weights

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
ilari
Posts: 750
Joined: Mon Mar 27, 2006 7:45 pm
Location: Finland

Revisiting GA's for tuning evaluation weights

Post by ilari »

For a couple of years now I've been interested in tuning evaluation weights of a chess engine with a genetic algorithm. What has discouraged me are the disappointing results that others have gotten, and the huge amount of time it would take. But even though people smarter than me have tried it before and failed, there are a few reasons why I'm keen on discarding their results and doing it again:

- The number of games that it takes to determine superiority between two engines of similar strength was horribly underestimated in the past experiments. Usually the whole learning process consisted of only a few thousand games, and in some cases just a few hundred.
- The hardware they used wasn't very impressive by today's standards. Nowadays we have the tools to run a ton of games quickly, easily and concurrently.
- In some experiments they wrote their own simple and very slow chess engine to tune. In this Master's thesis the author had an engine that could search only 3 plies deep under his test conditions. The population size, the number of games and the number of generations were also ridiculously small.
- The number of evaluation parameters (weights) was usually small. Today's top programs have a large number of parameters, and because they depend on each other, it's tedious and difficult to hand-tune them. GA's on the other hand thrive when the number of parameters (genes) is large.


I've already written a GA library in C++, and I planned on plugging it into a somewhat modified version of cutechess-cli. I'd use the UCI protocol for the learning process, because UCI "spin" options could easily be used as learning parameters (genes) with a minimum and maximum value. The idea is that it would be possible to tune any UCI engine without modifying its source code, apart from converting eval weights to UCI options. So if the engine has options "PawnValue" and "KnightValue", the user could just tell cutechess-cli to use them as learning parameters.
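To illustrate, each gene could map one-to-one to a spin option. A minimal sketch of the idea, with hypothetical option names and not actual cutechess-cli code:

Code:

#include <random>
#include <string>
#include <vector>

// Hypothetical mapping of a UCI "spin" option to a gene with a value range.
struct Gene {
    std::string uciOption; // e.g. "PawnValue" (an example, not a real option)
    int minValue;
    int maxValue;
    int value; // current allele
};

// Build the "setoption" commands that configure one individual.
std::vector<std::string> setOptionCommands(const std::vector<Gene>& genome)
{
    std::vector<std::string> cmds;
    for (const Gene& g : genome)
        cmds.push_back("setoption name " + g.uciOption +
                       " value " + std::to_string(g.value));
    return cmds;
}

// Draw a random individual within the declared min/max bounds.
std::vector<Gene> randomIndividual(std::vector<Gene> genome, std::mt19937& rng)
{
    for (Gene& g : genome) {
        std::uniform_int_distribution<int> dist(g.minValue, g.maxValue);
        g.value = dist(rng);
    }
    return genome;
}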

I was thinking of the following initial test conditions:
- Population size of about 100 individuals
- For each generation, every individual plays one game against everyone else. So 4950 games per generation. With 99 games per individual the fitness function will of course be far from perfect, but then again, there's already some randomness in Selection. (See the pairing sketch after this list.)
- Time control of a few seconds per game. For example, a tc of 4 seconds would mean 5.5 hours per generation, or 1.4 hours on a quad-core.
- As many generations as it takes to reach a global maximum
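In code, one generation's round robin would amount to something like this sketch, where playGame is a placeholder for running a game through cutechess-cli:

Code:

#include <cstddef>
#include <vector>

// Placeholder: play one game and return 1, 0.5 or 0 from the first
// player's point of view (in practice this runs through cutechess-cli).
double playGame(std::size_t a, std::size_t b);

// Single round robin: n*(n-1)/2 pairings, so 4950 games for n = 100.
// Each individual's fitness is its total score from its n-1 games.
std::vector<double> roundRobinFitness(std::size_t n)
{
    std::vector<double> score(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double result = playGame(i, j); // colors could be alternated too
            score[i] += result;
            score[j] += 1.0 - result;
        }
    }
    return score;
}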

If the initial test shows any promise, I would consider increasing the population count and number of games per generation.

So what do you people think? Would this be a colossal waste of time? Any advice, corrections, encouraging/discouraging anecdotes?
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

Have you looked at PBIL, population based incremental learning? It's in the same family as genetic algorithms, and for many problems it's probably a smarter algorithm. It's based on the concept of not having a small population of players, but instead having an INFINITE population of players represented by a probability vector. Each floating point value in the "probability vector" corresponds to 1 bit in the "genetic material" of the population, and its value between 0 and 1 represents the fraction of the population that has that bit set to 1. For instance, if bit 17 has a value of 0.75 then 3/4 of the "infinite" population has this bit set to 1 and 25% has it set to zero. So it's trivial to generate a random member of this infinite population.

It is very similar to GA in that you locate "fit" individuals and "breed" them, but in this case you simply modify the probability vector slightly in the direction of the fit individuals you find.
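A minimal sketch of those two steps, sampling and updating, assuming a bit-vector genome and an illustrative learning rate of 0.1:

Code:

#include <cstddef>
#include <random>
#include <vector>

// Sample one member of the "infinite" population: each bit is 1 with
// the probability stored in the vector.
std::vector<bool> sampleIndividual(const std::vector<double>& p,
                                   std::mt19937& rng)
{
    std::vector<bool> bits(p.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        bits[i] = std::bernoulli_distribution(p[i])(rng);
    return bits;
}

// PBIL update: nudge the vector toward the bits of the fittest sample.
// A learning rate around 0.1 is a typical choice, not a tuned value.
void updateToward(std::vector<double>& p, const std::vector<bool>& fittest,
                  double learningRate = 0.1)
{
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i] += learningRate * ((fittest[i] ? 1.0 : 0.0) - p[i]);
}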

I have discovered what I think is a fairly clever way of finding a "fit" player out of an infinite population of players. I don't think there is anything much more efficient than the classic knockout tournament for doing this. With PBIL you basically are supposed to generate N individuals and pick the fittest of the N for the purpose of further modifying the vector. This is not directly possible with chess, but it's clear the problem really is how to "discover" a player in the infinite population who is likely to be superior to most of the other players. The winner of a knockout (elimination) tournament is rarely one of the weaker players in a population. And a knockout tournament is clearly a divide and conquer algorithm, very efficient at finding a "good" player. It's not necessary to find the "best" player; we are only concerned with finding a strong player as quickly as possible.

There are probably statistical methods that might improve on the knockout tournament for getting the most "bang for the buck", or finding the best possible player with the least amount of effort. But probably nothing quite so simple, and I'm sure the knockout comes close to the efficiency of any statistical method.

In a study I did, playing more games in the later rounds turned out to be an enhancement. For instance, if you are playing 6 rounds with 64 random players from this infinite population, then playing 2 games instead of 1 in the final round adds only a tiny fraction of work to the process and yet gives a better result. So the scheme can be enhanced by using small matches in the early rounds and longer matches in the later rounds.

When I tried this I played 2 games per round in the early rounds and if the 2 game match was tied I chose the winner randomly. After the first 3 or 4 rounds I added a couple of games per round.
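Putting the two ideas together, a sketch of a knockout with a per-round match-length schedule; winsMatch is a placeholder for actually playing the games, and the schedule numbers are only illustrative:

Code:

#include <cstddef>
#include <random>
#include <vector>

// Placeholder: play a match of `games` games between a and b and return
// true if a wins the match (a tied match decided by coin flip).
bool winsMatch(std::size_t a, std::size_t b, int games, std::mt19937& rng);

// Single-elimination tournament with a per-round match-length schedule,
// e.g. {2, 2, 2, 4, 4, 6} for 64 players and 6 rounds.
std::size_t knockoutWinner(std::vector<std::size_t> players,
                           const std::vector<int>& gamesPerRound,
                           std::mt19937& rng)
{
    for (std::size_t round = 0; players.size() > 1; ++round) {
        std::vector<std::size_t> next;
        for (std::size_t i = 0; i + 1 < players.size(); i += 2)
            next.push_back(winsMatch(players[i], players[i + 1],
                                     gamesPerRound[round], rng)
                           ? players[i] : players[i + 1]);
        players = std::move(next);
    }
    return players.front();
}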

There is also CGA, the compact genetic algorithm, which is very much related to this family of ideas and a little closer to the classic genetic algorithm, but still uses a vector to represent an infinite population. I think the idea of a vector representing an infinite population is brilliant.

With PBIL the algorithm is very straightforward and hardly any tuning is required. With classic GA there are zillions of parameters to tune to try to get this right.





ilari wrote:For a couple of years now I've been interested in tuning evaluation weights of a chess engine with a genetic algorithm. What has discouraged me are the disappointing results that others have gotten, and the huge amount of time it would take. But even though people smarter than me have tried it before and failed, there are a few reasons why I'm keen on discarding their results and doing it again:

- The number of games that it takes to determine superiority between two engines of similar strength was horribly underestimated in the past experiments. Usually the whole learning process consisted of only a few thousand games, and in some cases just a few hundred.
- The hardware they used wasn't very impressive by today's standards. Nowadays we have the tools to run a ton of games quickly, easily and concurrently.
- In some experiments they wrote their own simple and very slow chess engine to tune. In this Master's thesis the author had an engine that could search only 3 plies deep under his test conditions. The population size, the number of games and the number of generations were also ridiculously small.
- The number of evaluation parameters (weights) was usually small. Today's top programs have a large number of parameters, and because they depend on each other, it's tedious and difficult to hand-tune them. GA's on the other hand thrive when the number of parameters (genes) is large.


I've already written a GA library in C++, and I planned on plugging it into a somewhat modified version of cutechess-cli. I'd use the UCI protocol for the learning process, because UCI "spin" options could easily be used as learning parameters (genes) with a minimum and maximum value. The idea is that it would be possible to tune any UCI engine without modifying its source code, apart from converting eval weights to UCI options. So if the engine has options "PawnValue" and "KnightValue", the user could just tell cutechess-cli to use them as learning parameters.

I was thinking of the following initial test conditions:
- Population size of about 100 individuals
- For each generation, every individual plays one game against everyone else. So 4950 games per generation. With 99 games per individual the fitness function will of course be far from perfect, but then again, there's already some randomness in Selection.
- Time control of a few seconds per game. For example, a tc of 4 seconds would mean 5.5 hours per generation, or 1.4 hours on a quad-core.
- As many generations as it takes to reach a global maximum

If the initial test shows any promise, I would consider increasing the population count and number of games per generation.

So what do you people think? Would this be a colossal waste of time? Any advice, corrections, encouraging/discouraging anecdotes?
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Revisiting GA's for tuning evaluation weights

Post by hgm »

Genetic algorithms are particularly good at finding optimal sets of non-linearly interacting parameters. They are particularly poor at extracting signal from noisy data. So the main question is not what GA you should use, but whether you will be able to see the effects of tuning your parameters above the noise. In this respect I am not very optimistic.

Take a simple major evaluation term, like material. How large an Elo difference would you expect between a program with a piece value of Q=9 and one with Q=7? How many games would you need to see such a difference? How about the Elo difference between Q=9 and Q=8.5?
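For a rough idea of the numbers, assuming the logistic Elo model and a per-game standard deviation of about 0.4 (typical when draws are common):

Code:

% Logistic Elo model: expected score at an Elo difference d.
\[
p(d) = \frac{1}{1 + 10^{-d/400}} \approx \frac{1}{2} + \frac{d \ln 10}{1600}
\qquad \text{for small } d.
\]
% Resolving the difference at the two-sigma level then takes roughly
\[
n \gtrsim \left(\frac{2\sigma}{p(d) - \tfrac12}\right)^{2}
= \left(\frac{0.8 \cdot 1600}{d \ln 10}\right)^{2}
\approx \frac{3 \times 10^{5}}{d^{2}},
\]
% i.e. on the order of 12,000 games for a 5 Elo difference.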
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Revisiting GA's for tuning evaluation weights

Post by CRoberson »

What you suggest sounds fine. I would suggest that you cut down on the number of participants so as to increase the number of games per participant. The main reason for using a large population is to keep the GA from converging too early. So, a population of 40 with 100 games per individual may produce better results.

It is common to use randomness to inject mutation into GA solutions. That may not be as big a need here (though still needed): the chess tournaments will supply sufficient randomness on their own, on the order of +/- 60 Elo.
Last edited by CRoberson on Sun Jan 03, 2010 8:35 pm, edited 1 time in total.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

hgm wrote:Genetic algorithms are particularly good at finding optimal sets of non-linearly interacting parameters. They are particularly poor at extracting signal from noisy data. So the main question is not what GA you should use, but whether you will be able to see the effects of tuning your parameters above the noise. In this respect I am not very optimistic.

Take a simple major evaluation term, like material. How large an Elo difference would you expect between a program with a piece value of Q=9 and one with Q=7? How many games would you need to see such a difference? How about the Elo difference between Q=9 and Q=8.5?
I think you hit the nail on the head. This is about noise.

Still, I think it can be made to work if one is willing to be patient. The way I see it, the standard PBIL algorithm suggests taking the best individual out of a small sample, such as 20. In other words, to make this work well you need to do whatever amount of work is required to find an individual who is SIGNIFICANTLY above the median level, such as at or above the 95th percentile.

It's not so easy to do this with auto-test games unless the players vary significantly in strength; it requires thousands of games per generation. But it's probably doable. So one suggestion is to do the math when trying to locate the strong individual who will contribute his genetic material to the population vector, and make sure your confidence is very high that this player is exceptional.
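For that confidence check, the usual likelihood-of-superiority estimate is enough; a sketch using the common normal approximation, with draws left out for simplicity:

Code:

#include <cmath>

// Likelihood of superiority from a match result, using the common
// normal approximation (draws ignored for simplicity).
double likelihoodOfSuperiority(int wins, int losses)
{
    if (wins + losses == 0)
        return 0.5;
    return 0.5 * (1.0 + std::erf((wins - losses) /
                                 std::sqrt(2.0 * (wins + losses))));
}
// e.g. 60 wins to 40 losses gives ~0.977, comfortably above 95%.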

I would expect that this could be made to work at fast levels if you are willing to wait several weeks or months to beat down the noise.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

You should just go ahead and do it. I agree with you that nobody has really tried this, and it's not obvious how well it is going to work, if at all.

An interesting problem is where to put the threshold on the number of games used to find the fittest individual (or to estimate the fitness of each individual), compared to just letting it run faster (and more randomly) and having evolution filter out the weaklings. I wonder whether there are any theoretical insights to be had there.

I wouldn't expect any reasonable weights in the first week, but hey, it's unsupervised, so you can just let it go on and on, on as many cores as you have...

So, I'd just go ahead and try, and report back what does and doesn't work for you.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

Gian-Carlo Pascutto wrote:You should just go ahead and do it. I agree with you that nobody has really tried this, and it's not obvious how well it is going to work, if at all.
Yes, I agree that he should go and try, but...

Evolutionary mechanisms are poor at finding optima. In biology, they are good at finding good-enough, robust solutions. I cannot see why they would be different in computer science. Finding "good enough" is useful when you have very little clue about a good solution. For that reason, I do not think that evaluation terms per se are the best candidates to try: good enough does not cut it, and there are other, simpler ways to find good-enough solutions. Search parameters could possibly be better suited for this. They are fewer (which will reduce computational time) and they interact in weird ways that are difficult to predict (good chances to beat human estimation). I think this is interesting.

Miguel
PS: Then again, I could be completely wrong :-)


An interesting problem is where to put the threshold on the number of games used to find the fittest individual (or to estimate the fitness of each individual), compared to just letting it run faster (and more randomly) and having evolution filter out the weaklings. I wonder whether there are any theoretical insights to be had there.

I wouldn't expect any reasonable weights in the first week, but hey, it's unsupervised, so you can just let it go on and on, on as many cores as you have...

So, I'd just go ahead and try, and report back what does and doesn't work for you.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Revisiting GA's for tuning evaluation weights

Post by zamar »

I've many times thought about using GAs for weights tuning, but there is one problem I haven't been able to overcome:

In all GAs you need at some point to sort the population (to decide who shall live, who shall die, who shall produce offspring). Now, because different weights cause changes of only around +/- 5 Elo, the sorting phase is either extremely slow (I don't have a cluster!) or extremely inaccurate (the population doesn't converge, and bad mutations just spread everywhere).

I'd be very happy if someone here could propose a solution to this problem :)
Joona Kiiski
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Revisiting GA's for tuning evaluation weights

Post by CRoberson »

michiguel wrote:
Gian-Carlo Pascutto wrote:You should just go ahead and do it. I agree with you that nobody has really tried this, and it's not obvious how well it is going to work, if at all.
Yes, I agree that he should go and try, but...

Evolutionary mechanisms are poor at finding optima. In biology, they are good at finding good-enough, robust solutions. I cannot see why they would be different in computer science. Finding "good enough" is useful when you have very little clue about a good solution. For that reason, I do not think that evaluation terms per se are the best candidates to try: good enough does not cut it, and there are other, simpler ways to find good-enough solutions. Search parameters could possibly be better suited for this. They are fewer (which will reduce computational time) and they interact in weird ways that are difficult to predict (good chances to beat human estimation). I think this is interesting.

Miguel
PS: Then again, I could be completely wrong :-)


An interesting problem is where to put the threshold on the number of games used to find the fittest individual (or to estimate the fitness of each individual), compared to just letting it run faster (and more randomly) and having evolution filter out the weaklings. I wonder whether there are any theoretical insights to be had there.

I wouldn't expect any reasonable weights in the first week, but hey, it's unsupervised, so you can just let it go on and on, on as many cores as you have...

So, I'd just go ahead and try, and report back what does and doesn't work for you.
You are right. My experience has been this: if the solution space is rife with near-optimal solutions, the GA will find one fast. Otherwise, it will take a very large number of generations to find one.

A close cousin of the GA is the simulated annealer; SAs require less coding.
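For comparison, a bare-bones SA loop really is short; a sketch, with fitness and neighbor as placeholders for match-based evaluation and a small random tweak of one weight:

Code:

#include <cmath>
#include <random>
#include <vector>

// Placeholders: fitness could be a match score against a reference
// version; neighbor tweaks one weight by a small random step.
double fitness(const std::vector<int>& weights);
std::vector<int> neighbor(std::vector<int> w, std::mt19937& rng);

// Bare-bones simulated annealing over integer eval weights.
std::vector<int> anneal(std::vector<int> w, double temp, double cooling,
                        int steps, std::mt19937& rng)
{
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    double f = fitness(w);
    for (int i = 0; i < steps; ++i, temp *= cooling) {
        std::vector<int> cand = neighbor(w, rng);
        double fc = fitness(cand);
        // Always accept improvements; accept regressions with
        // Boltzmann probability exp((fc - f) / temp).
        if (fc >= f || unit(rng) < std::exp((fc - f) / temp)) {
            w = std::move(cand);
            f = fc;
        }
    }
    return w;
}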
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Revisiting GA's for tuning evaluation weights

Post by BubbaTough »

I tried this in 1992, I think, and it didn't work that great for me, but as you said, times are different.

Instead of doing a large round robin as you suggest, or a knockout tournament as Don suggests, I think I played a limited-round Swiss, and I kept track of each population member's rating. Ratings persisted between generations, so a single unlucky tournament did not sink a population member with a great track record in previous generations.
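A plain Elo update, carried with each individual from one generation to the next, would do; a sketch, with K = 16 as an arbitrary illustrative choice:

Code:

#include <cmath>

// Expected score of A against B under the Elo model.
double expectedScore(double ratingA, double ratingB)
{
    return 1.0 / (1.0 + std::pow(10.0, (ratingB - ratingA) / 400.0));
}

// Incremental rating update after one game (scoreA is 1, 0.5 or 0).
// Ratings live with the individual across generations, so one unlucky
// tournament doesn't erase a good track record.
void updateRatings(double& ratingA, double& ratingB, double scoreA,
                   double k = 16.0)
{
    double eA = expectedScore(ratingA, ratingB);
    ratingA += k * (scoreA - eA);
    ratingB -= k * (scoreA - eA);
}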

It seems like a faster fitness function than your round robin (thereby allowing more generations), and perhaps more accurate than Don's... but perhaps not. Anyway, what do I know... it didn't work that well for me and I haven't tried it in the last 18 years :).

-Sam