Revisiting GA's for tuning evaluation weights

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

michiguel wrote: Who's everybody else in that case if he ends up with one player? That is why I asked.
The population minus the winner.

As for why ending up with one: I don't do threesomes.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Revisiting GA's for tuning evaluation weights

Post by michiguel »

Gian-Carlo Pascutto wrote:
michiguel wrote: Who's everybody else in that case if he ends up with one player? That is why I asked.
The population minus the winner.

As for why ending up with one: I don't do threesomes.
That is evolution like Genghis Khan = become a King and impregnate everybody :-)
Has nobody tried eliminating only half or 3/4 of the population, letting the other half mate with each other, mutate, and start over? That would be more similar to nature and to the biochemical procedures of directed evolution.

Miguel
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Don »

michiguel wrote:
Don wrote:
michiguel wrote:
I agree with what you say, except that I do not understand why you want to end up with "one" best player. It should be better to end up with a set of good players, not one. Otherwise, all the evolutionary mechanisms are not working. It would be more like a breeding process. What do you do after you end up with the best? I may not be understanding your set up.
Why is this so difficult to understand?
I am asking a question that you are not answering. I understand what you are saying. I am just trying to understand your rationale for deciding to end up with a few players rather than more.
It's a matter of which learning algorithm you use. I prefer PBIL, which is population based incremental learning, as it's more modern than the old-fashioned and less efficient genetic algorithm. The compact genetic algorithm is a modern improvement, if you like GA, which works on the same principles. The principle I'm talking about is that you simulate an infinite population (that is what the "compact" is all about) and thus get much more diversity without having to actually store an infinite population. You get the best of both worlds.

PBIL is sort of like GA with a hint of simulated annealing. In each generation you select a single representative from the population by fitness, and then you let his genetic makeup have a tiny influence on the entire population.

I don't recommend classic GA, there have been a number of enhancements over the past few years that will serve you better. Different problems respond to different algorithms of course but PBIL is one of the very best and has the virtue of being very simple to implement and requires very little tuning.


It's pretty obvious that you can end up with as many players as you want. If you want 2 players, don't play the final round match. If you want 4, don't play the last 2 rounds.
Of course, I know that you can end up with n players.

And you do not have to fear that this is "too precise" because all you are doing is plucking N players from a population (from an infinite population if you use the more modern and efficient CGA or PBIL.) These players will on average be very good players (which is what you want) but there is always a chance that some weak player will get "lucky" and win the tournament. You have perfect control over this by deciding how many rounds and how many games per match to play.
Again, I am curious to know why you decided to end up with a few players, and I would like to know what the next step is that you do. Do you mate them? With whom? Or do you mutate them and start again without mating them?
Let's assume for simplicity that we have a 2-bit gene or chromosome or whatever you call it. An infinite population is represented by 2 floating-point values, one corresponding to each bit in the gene. If the floating-point value is 0 it means the entire population has a zero bit in that place. If the value is 1 it means the entire population has that bit set to 1. Anything in between corresponds to the ratio of ones to zeros in the infinite population. So there are two floating-point numbers, one for each of the 2 bits in the chromosome. Of course for a practical application it will have many bits.

You evaluate the fitness of N individuals generated at random from the population. For each bit in the gene you generate a 1 bit in proportion to the "probability vector" of floating-point values.

When the fittest of these individuals is decided, you use him to modify the probability vector, which will affect all new players generated from now on. I can find the exact formula if you want, but it's basically controlled by the "learning rate", which you can set. If the "fit" individual has a 1 in some bit position, the vector value that represents that bit is slightly increased towards 1, and conversely if the value is zero. Then you start all over, but the probability vector now reflects some slight influence from the strong player of the previous generation.

If you run the process long enough, the probability vector values will approach either 1 or 0. But they will NEVER quite reach 1 or 0 so there is no need for a mutation operator. It will always be possible for a bit position to change, although the odds of that happening asymptotically approach zero.

One thing I like about this approach is that you can generate an "ideal" individual at any point in the process by just using a 0 if the vector value that corresponds to a bit position is less than 0.50 and a 1 otherwise. The individual won't really be ideal, but he will be the most common individual in the "infinite" population at the moment.
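To make that concrete, here is a bare-bones sketch of the loop (not code from any engine; the fitness function is just a toy "count the one bits" stand-in for the tournament results you would really use, and all the constants are placeholders):

```cpp
#include <random>
#include <vector>

constexpr int NBITS = 64;           // bits per chromosome (toy value)
constexpr int SAMPLES = 20;         // individuals sampled per generation
constexpr double LEARN_RATE = 0.05;

using Genome = std::vector<int>;    // each entry is 0 or 1

// Toy fitness: count the 1 bits (stand-in for match/tournament results).
double fitness(const Genome& g) {
    double f = 0;
    for (int b : g) f += b;
    return f;
}

int main() {
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    std::vector<double> prob(NBITS, 0.5);   // the "infinite population": P(bit == 1)

    for (int gen = 0; gen < 1000; ++gen) {
        Genome best;
        double bestFit = -1e9;

        // Sample a handful of individuals from the probability vector, keep the fittest.
        for (int s = 0; s < SAMPLES; ++s) {
            Genome g(NBITS);
            for (int b = 0; b < NBITS; ++b)
                g[b] = uni(rng) < prob[b] ? 1 : 0;
            double f = fitness(g);
            if (f > bestFit) { bestFit = f; best = g; }
        }

        // Nudge every probability slightly toward the winner's bit values.
        for (int b = 0; b < NBITS; ++b)
            prob[b] = prob[b] * (1.0 - LEARN_RATE) + best[b] * LEARN_RATE;
    }
    // An "ideal" individual at any point: bit = (prob[b] >= 0.5) ? 1 : 0.
    return 0;
}
```

The learning rate and the number of samples per generation are basically the only knobs.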

Miguel
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Revisiting GA's for tuning evaluation weights

Post by diep »

ilari wrote:For a couple of years now I've been interested in tuning evaluation weights of a chess engine with a genetic algorithm. What has discouraged me are the disappointing results that others have gotten, and the huge amount of time it would take. But even though people smarter than me have tried it before and failed, there are a few reasons why I'm keen on discarding their results and doing it again:
Keep trying i'd say.

The really big problem in computer chess is always that everyone keeps his mouth shut when it involves new phenomena. As someone who chats with really a lot of programmers, even then it is hard to get useful information. Most have the habit of never saying a word on anything.

So really 99.9% of all attempts with GA's in computerchess have simply not been published publicly, or not published at all.

Of course 99% of those attempts we are not interested in, because of the reasons you quote below.
- The number of games that it takes to determine superiority between two engines of similar strength was horribly underestimated in the past experiments. Usually the whole learning process consisted of only a few thousand games, and in some cases just a few hundred.
Getting system time is difficult, even today.
- The hardware they used wasn't very impressive by today's standards. Nowadays we have the tools to run a ton of games quickly, easily and concurrently.
I feel the 1997-1999 world champs are an important milestone in computer chess.
Not so much because of Deep Blue, which played in 1997 (buying a chess player is just a matter of money), but because of what happened in the world champs.

1999 really was a milestone. In 1997 the programs were, to a chess player, so clumsy everywhere that improving that was priority 1. I was getting 8 ply with Diep at 2 minutes a move? Well, bad luck then if I played schach, which got 9 ply, well 10 in fact, and therefore won.

What we saw in that world champs clearly is that 1999 definitely was the turning point. An important part of it was the opening book. In 1997 it was still very common to use PGN books. By 1999 everyone had started to understand that this was total suicide back then. Some manual novelty always killed you.

The first sign was Kure's book in 1997 there with Nimzo.

It really kicked butt.

Secondly, the search depths became more mature. In 1999 the hardware really was better than the software, and by a lot.

The 'top' played quad Xeon 400-500 MHz there at 3 minutes a move.

That means 4 cores x 500 MHz = 2 GHz, and 2 GHz * 3 minutes = 6 GHz-minutes a move.

If we compare with most games played online or in tests now, that usually is not much above this. Most 'rating lists' you see published (I recently saw one that is single core, 6 minutes a game plus a few seconds increment) are factors worse hardware-wise than what we had back then, thanks to the time control.

That of course benefits fast beancounters.

But it also means that it's tempting to focus upon that.
- In some experiments they wrote their own simple and very slow chess engine to tune. In this Master's thesis the author had an engine that could search only 3 plies deep under his test conditions. The population size, the number of games and the number of generations were also ridiculously small.
Well, there is some truth in this. In some total beginner's engine, if you randomly change something and implement some random 'enhancement' somewhere, based on some notes you wrote down while listening to HGM for 2 minutes, then odds are already big that it improves such an engine. Additionally, those written-down thoughts make more sense when the engine is weaker.

To improve today's top engines is not so simple without being some big expert. To be a big expert takes a really long time. There is usually no 'course' that drills you in the latest techniques, as none of the existing experts gets paid to write such a course.

CCC is a bad forum for that also, though of course much better than just googling for published algorithms, as the GCP types drop a line sometimes.
- The number of evaluation parameters (weights) was usually small. Today's top programs have a large number of parameters, and because they depend on each other, it's tedious and difficult to hand-tune them. GA's on the other hand thrive when the number of parameters (genes) is large.
so true.

I wonder, however, why you make the next proposal, considering the above thought:
I've already written a GA library in C++, and I planned on plugging that to a somewhat modified version of cutechess-cli. I'd use the UCI protocol for the learning process, because UCI "spin" options could be easily used as learning parameters (genes) with a minimum and maximum value. The idea is that it would be possible to tune any UCI engine without modifying its source code, apart from converting eval weights to UCI options. So if the engine has options "PawnValue" and "KnightValue" the user could just tell cutechess-cli to use them as learning parameters.
Well basically all this is a very bad idea because it has a few assumption mistakes. Mistake 1 is to assume we just want to tune the material values. Mistake 2 is to assume the material values are just 6 values.

In the case of Diep, already since summer 2000 or so the material evaluation is quite a lot of parameter values. Like 200+ or so, depending upon how you count.

In fact I intend to put one thing and another in a table to speed up access to it. That increases the number of tunable parameters there big time.

The second problem is that maybe we want to tune more than just material parameters. It's more interesting if you design something that can also tune other parameters. Not hardcoded parameters such as "MySuperStrongKnight", but simply a variable number of parameters.

In short just invent some generic command to get parameters.

Then very lazy people like me can start your thing, give Diep's 20k parameters to it, and see whether after a month or so a quad-core machine shows up with something that didn't deteriorate it too much.

I assume you already see one of the problems if you think about it. It's possible that the thing totally mistunes parameters that have absolutely no influence on the test games played.
I was thinking of the following initial test conditions:
- Population size of about 100 individuals
- For each generation, every individual plays one game against everyone else. So 4950 games per generation. With 99 games per individual the fitness function will of course be far from perfect, but then again, there's already some randomness in Selection.
- Time control of a few seconds per game. For example, a tc of 4 seconds would mean 5.5 hours per generation, or 1.4 hours on a quad-core.
- As many generations as it takes to reach a global maximum

If the initial test shows any promise, I would consider increasing the population count and number of games per generation.

So what do you people think? Would this be a colossal waste of time? Any advice, corrections, encouraging/discouraging anecdotes?
Make something generic that benefits us all and which we can use as push button technology to test GA's.

Vincent
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

michiguel wrote: That is evolution like Genghis Khan = become a King and impregnate everybody :-)
It leads to faster convergence at the cost of some diversity.

I remember reading an argument not so long ago that humans are more prone to take risk than could be explained by natural evolution, because the reward for being very successful is disproportionately big. Unfortunately I don't remember the source.

You can also do a smaller RR of a (randomly chosen) part of the population, and mate the top 2 finishers, then replace the oldest entity in the pool with the bred result. IIRC this is one of the best methods in classical GA.
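In skeleton form it is roughly this (everything here is a placeholder: tournament size, crossover and mutation rates are arbitrary, and the round robin is faked with a toy fitness instead of real games):

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct Individual {
    std::vector<double> genes;
    long birth = 0;                       // creation time, to find the oldest entity
};

// Stand-in for real match play: rank the entrants best-first by a toy
// fitness (sum of genes).  In a chess tuner this would be a round robin.
std::vector<int> playRoundRobin(const std::vector<Individual>& pool,
                                std::vector<int> entrants) {
    auto fit = [&](int i) {
        double f = 0;
        for (double g : pool[i].genes) f += g;
        return f;
    };
    std::sort(entrants.begin(), entrants.end(),
              [&](int x, int y) { return fit(x) > fit(y); });
    return entrants;
}

// One step: small random tournament, mate the top two finishers,
// replace the oldest member of the pool with the child.
void step(std::vector<Individual>& pool, long& clock, std::mt19937& rng) {
    // Pick a random subset of the population (assumes pool has at least 8 members).
    std::vector<int> idx(pool.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    std::shuffle(idx.begin(), idx.end(), rng);
    idx.resize(8);                                    // tournament size: arbitrary

    std::vector<int> ranked = playRoundRobin(pool, idx);
    const Individual& a = pool[ranked[0]];
    const Individual& b = pool[ranked[1]];

    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, 0.1);

    Individual child;
    child.genes.resize(a.genes.size());
    for (size_t g = 0; g < child.genes.size(); ++g) {
        child.genes[g] = (uni(rng) < 0.5) ? a.genes[g] : b.genes[g];  // uniform crossover
        if (uni(rng) < 0.02) child.genes[g] += noise(rng);            // rare mutation
    }
    child.birth = ++clock;

    auto oldest = std::min_element(pool.begin(), pool.end(),
        [](const Individual& x, const Individual& y) { return x.birth < y.birth; });
    *oldest = child;
}
```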
User avatar
ilari
Posts: 750
Joined: Mon Mar 27, 2006 7:45 pm
Location: Finland

Re: Revisiting GA's for tuning evaluation weights

Post by ilari »

Wow, didn't expect to get this many insightful replies. I'm glad I asked for advice before plunging into the project. I'm now convinced that tournament selection is better than roulette wheel selection (round robin). And I'll look into PBIL, which was completely unknown to me before.

Well basically all this is a very bad idea because it has a few assumption mistakes. Mistake 1 is to assume we just want to tune the material values. Mistake 2 is to assume the material values are just 6 values.

In the case of Diep, already since summer 2000 or so the material evaluation is quite a lot of parameter values. Like 200+ or so, depending upon how you count.
No worries, I was just using the two material weights as examples of how the tuning software would work. The goal is to tune all available evaluation weights of a top engine.

Make something generic that benefits us all and which we can use as push button technology to test GA's.
That's my ultimate goal. Any code that I write will be available at a Git repository near you.
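Roughly, each learning parameter would just be a UCI spin option plus a current value, something like this (only a sketch, the option names and ranges are examples):

```cpp
// Sketch of how a tunable UCI spin option could be carried around as a gene.
#include <sstream>
#include <string>
#include <vector>

struct SpinGene {
    std::string name;   // UCI option name, e.g. "PawnValue"
    int minValue;
    int maxValue;
    int value;          // current allele, always kept within [minValue, maxValue]
};

// Build the "setoption" commands to send to the engine before a game.
std::string applyGenes(const std::vector<SpinGene>& genes) {
    std::ostringstream out;
    for (const auto& g : genes)
        out << "setoption name " << g.name << " value " << g.value << "\n";
    return out.str();
}

// Example: two material weights as learning parameters.
// std::vector<SpinGene> genes = { {"PawnValue",   50, 200, 100},
//                                 {"KnightValue", 200, 500, 300} };
```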
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Revisiting GA's for tuning evaluation weights

Post by Gian-Carlo Pascutto »

diep wrote: Make something generic that benefits us all and which we can use as push button technology to test GA's.
Coding a GA optimization program is completely trivial. (Which is why there's so much academic material about it :)) There will probably be much more work involved in making it generic or adapting any program to it than in the actual algorithm in the first place.

What matters is the exact method used. How many games, which selection method to use, replacement strategy, crossover-mutation parameters, gene pool size, choice between GA/CGA/PBIL.

Knowing a working combination of the *latter* is what would be a breakthrough.
User avatar
Steve Maughan
Posts: 1221
Joined: Wed Mar 08, 2006 8:28 pm
Location: Florida, USA

SWARM Algorithm...

Post by Steve Maughan »

Don wrote:Have you looked at PBIL, population based incremental learning?...
All interesting stuff. You may also look at "Swarm Learning". I'm not sure how applicable it is to chess parameter learning but I'll throw it into the mix in case someone hasn't heard about it. It's a nice cross between probabilistic and hill-climb optimization. It's based on the natural model of a swarm of birds / ants / buffalo. Each member of the swarm has a set of genes representing the elements to be optimized. The genes are adjusted based on the host's fitness and the fitness of its healthiest neighbor. There are variations that also look at the single fittest citizen.

I haven't tested it yet but it may be an interesting algorithm for chess parameter optimization.
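For reference, the core of it is something like the textbook particle swarm update below; the constants and the neighbourhood definition vary from paper to paper, and nothing here is chess-specific:

```cpp
// Textbook-style particle swarm update for one member of the swarm.
// The constants 0.72 and 1.49 are common choices from the PSO literature.
#include <random>
#include <vector>

struct Particle {
    std::vector<double> pos;        // current genes / parameter values
    std::vector<double> vel;        // current velocity per gene
    std::vector<double> bestPos;    // best genes this particle has seen so far
};

void updateParticle(Particle& p,
                    const std::vector<double>& neighborBest,  // healthiest neighbor's best genes
                    std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double inertia = 0.72, cSelf = 1.49, cNeighbor = 1.49;

    for (size_t i = 0; i < p.pos.size(); ++i) {
        p.vel[i] = inertia * p.vel[i]
                 + cSelf     * uni(rng) * (p.bestPos[i]    - p.pos[i])
                 + cNeighbor * uni(rng) * (neighborBest[i] - p.pos[i]);
        p.pos[i] += p.vel[i];
    }
    // After moving, re-evaluate fitness (play games) and update bestPos
    // if the new position scored better.  That part is not shown here.
}
```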

Cheers,

Steve
SuneF
Posts: 127
Joined: Thu Sep 17, 2009 11:19 am

Re: Revisiting GA's for tuning evaluation weights

Post by SuneF »

ilari wrote:For a couple of years now I've been interested in tuning evaluation weights of a chess engine with a genetic algorithm. What has discouraged me are the disappointing results that others have gotten, and the huge amount of time it would take. But even though people smarter than me have tried it before and failed, there are a few reasons why I'm keen on discarding their results and doing it again:

Any advice, corrections, encouraging/discouraging anecdotes?
I'm mostly familiar with SA but it's a very similar idea of random mutation.
I see a basic problem with these tuning methods though.

The methods require that you be able to decide if a mutation is an improvement or not.

Basically you have two possible scenarios. Either there is an improvement or there isn't.

First consider the probability of actually producing an improvement by random (per)mutation.
As the program gets stronger, making any improvement gets harder or less likely.
However we can hand-tune the program pretty well, so we can assume that it will be very hard to find an improvement by a random mutation.
This is the interesting domain to consider I think.

If the program is already fairly well tuned we can pretend there is a 5% chance we have an improvement and thus a 95% chance we don't.

Case 1: We beat the odds and made an improvement (5%):

90% chance we detect this correctly as being an improvement (depends on number of games).
10% chance we detect this falsely as not being an improvement (depends on number of games).

Case 2: We did not manage to produce an improvement (95%):

90% chance we detect this correctly as not being an improvement (depends on number of games).
10% chance we detect this as being an improvement (depends on number of games).


Now in SA, if it is detected as an improvement the new state is accepted, otherwise it is only accepted by some Boltzmann criteria.
However you already see the problem: there is a higher chance of it going uphill than downhill (95% x 10% > 5% x 90%).
In other words, the method seems to reach its limit when the probabilities even out.
Some math has been left out, but the Boltzmann criterion is not going to save it.

So yes, I'm just throwing out numbers here, and one could improve on this by detecting with better accuracy whether the change is good or bad, i.e. by playing more games.
But realistically, in SA you might need thousands of iterations and probably an ensemble of a few dozen.
So millions of games would be needed for a serious test.
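To put actual numbers on it, with the figures above (5% real improvements, 90% detection accuracy):

```cpp
// Quick back-of-the-envelope check of the numbers above.
#include <cstdio>

int main() {
    const double pImprove = 0.05;   // chance a random mutation really helps
    const double pDetect  = 0.90;   // chance the match result classifies it correctly

    double trueAccepts  = pImprove * pDetect;                 // 0.045
    double falseAccepts = (1.0 - pImprove) * (1.0 - pDetect); // 0.095

    std::printf("accepted changes that are real improvements: %.0f%%\n",
                100.0 * trueAccepts / (trueAccepts + falseAccepts)); // ~32%
    return 0;
}
```

So under these assumptions only about a third of the accepted mutations are real improvements; the rest are regressions that got lucky.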
User avatar
hgm
Posts: 27799
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Revisiting GA's for tuning evaluation weights

Post by hgm »

In fact the randomness in the testing removes the need to use an explicit Boltzmann factor. You just accept the mutation when it tested better. Mutations that are worse will still have a non-zero acceptance probability that way, but the probability gets smaller as they are worse (because that reduces their chances of beating the testing odds). The statistical noise (i.e. sample size) determines the lowest temperature you can achieve.

And you are right: to cool the sample enough that its distribution peaks at a better setting than what you can achieve by hand tuning will require millions of games.

Note that tree-matches would be a good way out of this, as (unlike normal gauntlets) they will depend only on the cases where the mutated terms actually influenced the move choice.
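A minimal sketch of that scheme (the match is only a toy simulation here; in reality playMatch would run engine games, and the number of games per match is what sets the "temperature"):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Toy stand-in for a real engine match: the "true" strength is closeness of
// the parameters to a hidden target, and each game is a biased coin flip.
double playMatch(const std::vector<double>& mutant,
                 const std::vector<double>& current, int games) {
    static std::mt19937 rng(2);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    auto strength = [](const std::vector<double>& v) {
        double s = 0;
        for (double x : v) s -= (x - 1.0) * (x - 1.0);   // hidden target: all ones
        return s;
    };
    double pWin = 1.0 / (1.0 + std::exp(strength(current) - strength(mutant)));
    double score = 0;
    for (int g = 0; g < games; ++g)
        if (uni(rng) < pWin) score += 1.0;
    return score;
}

// Accept a mutation only when it tested better; no explicit Boltzmann factor.
// A worse mutant can still win a short match by luck, and the fewer games per
// match, the higher its acceptance probability.
void tune(std::vector<double>& params, int iterations, int gamesPerMatch) {
    std::mt19937 rng(1);
    std::normal_distribution<double> noise(0.0, 0.2);
    std::uniform_int_distribution<size_t> pick(0, params.size() - 1);

    for (int it = 0; it < iterations; ++it) {
        std::vector<double> mutant = params;
        mutant[pick(rng)] += noise(rng);                 // mutate one term

        double score = playMatch(mutant, params, gamesPerMatch);
        if (score > gamesPerMatch / 2.0)
            params = mutant;
    }
}
```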