Particle Swarm Optimization Code

emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Particle Swarm Optimization Code

Post by emadsen »

For anyone interested, I posted on my website the C# code I wrote that implements a multi-threaded particle swarm optimizer. See MadChess 3.0 Beta Build 75 (Eval Param Tuning). I find the idea rather straightforward and easy to understand. Each iteration, a particle's velocity is influenced by inertia and is drawn toward a) its individual best location, b) its swarm's best location, and c) the global best location. The magnitude of each vector (a, b, and c) is randomized so the particles jitter in parameter space.
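
Here is a rough sketch of the per-particle update, with simplified, illustrative names rather than the actual MadChess code:

Code: Select all

using System;

public static class PsoSketch
{
    private static readonly Random _random = new Random();

    // Velocity: inertia plus three randomized attraction terms
    // (particle best, swarm best, global best). Then move the particle.
    public static void UpdateParticle(double[] position, double[] velocity,
        double[] particleBest, double[] swarmBest, double[] globalBest,
        double inertia, double c1, double c2, double c3)
    {
        for (var i = 0; i < position.Length; i++)
        {
            velocity[i] = inertia * velocity[i]
                + c1 * _random.NextDouble() * (particleBest[i] - position[i])
                + c2 * _random.NextDouble() * (swarmBest[i] - position[i])
                + c3 * _random.NextDouble() * (globalBest[i] - position[i]);
            position[i] += velocity[i];
        }
    }
}

The inertia and the three attraction coefficients (c1, c2, c3) are tuning knobs; the per-term randomization is what produces the jitter described above.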

Has anyone else used PSO to minimize evaluation error? If not, what algorithm(s) have you used? Did you write the code yourself or did you write an evaluation error cost function and plug it into a third-party optimization library? Just curious what others have done.
My C# chess engine: https://www.madchess.net
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Particle Swarm Optimization Code

Post by jdart »

For machine learning, the standard algorithm is stochastic gradient descent (SGD).

I use a so-called "batch" gradient descent algorithm, ADAM (https://arxiv.org/pdf/1412.6980.pdf).

These methods can tune parameters to find the minimum error over very large data sets of training examples. For this use case, the function that you are minimizing is a sum of errors over the training set: an error measure, based on the evaluation function for chess (or the result of a search), summed over possibly millions of training examples. For this you need a method that converges with relatively few evaluations of the overall error measure. (SGD "cheats" by evaluating only a sample of the training examples in each iteration.)
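
For illustration (a sketch, not Arasan's code), that overall error can look like this: map each position's evaluation through a sigmoid to a predicted game result and average the squared differences against the actual results.

Code: Select all

using System;
using System.Collections.Generic;

public static class ErrorSketch
{
    // Map an evaluation score to a predicted game result in [0, 1].
    // k is a scaling constant fit to the training data.
    public static double Sigmoid(double eval, double k) =>
        1.0 / (1.0 + Math.Exp(-k * eval));

    // Mean squared error over the training set: Result is 1 (win), 0.5 (draw), or 0 (loss);
    // Eval is the engine's evaluation of that position under the current parameters.
    public static double MeanSquaredError(IReadOnlyList<(double Eval, double Result)> positions, double k)
    {
        var error = 0.0;
        foreach (var (eval, result) in positions)
        {
            var predicted = Sigmoid(eval, k);
            error += (result - predicted) * (result - predicted);
        }
        return error / positions.Count;
    }
}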

The problem with global optimization methods, including PSO, is that they can be slow to converge, especially when there are large numbers of variables (a chess evaluation function can have hundreds). They also generally assume that it is virtually free to execute the function being optimized, so the "budget" of evaluations can be a very large number. But for the eval-tuning use case this is not true.
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Particle Swarm Optimization Code

Post by emadsen »

jdart wrote:
The function that you are minimizing is a sum of errors over the training set.
Yes, I know that; that's what my code does.

You lay out the challenges quite clearly. Many of these global optimization methods are slow to converge, and even slower because a chess engine's evaluation cost function is so time-consuming to compute.
My C# chess engine: https://www.madchess.net
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Particle Swarm Optimization Code

Post by jdart »

The other point is that generally evals are a linear function of eval weights and you can evaluate the gradient of the evaluation function easily. So with a linear function and the gradient there are gradient descent methods that converge quite rapidly.
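
For example, if eval is a dot product of weights and feature counts, the gradient of one position's squared error can be written out directly. A sketch (illustrative names only), mapping eval to an expected result with a sigmoid as in Texel-style tuning:

Code: Select all

using System;

public static class GradientSketch
{
    // Accumulate the gradient of one position's squared error into gradient[].
    // eval = dot(weights, features); predicted = sigmoid(k * eval); error = (result - predicted)^2.
    public static void AccumulateGradient(double[] weights, double[] features,
        double result, double k, double[] gradient)
    {
        var eval = 0.0;
        for (var j = 0; j < weights.Length; j++)
            eval += weights[j] * features[j];

        var predicted = 1.0 / (1.0 + Math.Exp(-k * eval));

        // d(error)/d(w_j) = -2 * (result - predicted) * k * predicted * (1 - predicted) * features[j]
        var common = -2.0 * (result - predicted) * k * predicted * (1.0 - predicted);
        for (var j = 0; j < weights.Length; j++)
            gradient[j] += common * features[j];
    }
}

A plain gradient descent step is then weights[j] -= learningRate * gradient[j] after accumulating over the training set.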

If, though, you want a good global method that works with nonlinear functions and does not require gradients, there are a bunch of those. CMA-ES is one of the better ones. See also https://github.com/avaneev/biteopt and https://ccse.lbl.gov/people/julianem/index.html. But I don't think the chess use case is a good fit for these. They might have some use for optimizing non-eval parameters.

--Jon
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Particle Swarm Optimization Code

Post by emadsen »

jdart wrote:
The other point is that generally evals are a linear function of eval weights and you can evaluate the gradient of the evaluation function easily.
Most everything in my evaluation function is linear, though passed pawns, mobility, and king safety are not.

I wrote the PSO because it's so simple, and frankly, I was too impatient to get my head around the math of other techniques. Not that I can't do it; I majored in physics in college and did well in calculus, diffy Q, linear algebra, etc. I'm just out of practice when it comes to reading math in scientific papers, comprehending it, and reformulating the ideas in my own code. It's funny because, between my sophomore and junior years in college, I stayed on campus during the summer to write a global optimizer for my physics professor. I remember poring over the Numerical Recipes in C book trying to decipher the Levenberg-Marquardt optimization algorithm. I managed to get it working by the end of the summer, optimizing 18 variables of his theoretical equation against the data he had collected in his biophysics research. The program drew a graph on screen, updated after each iteration was calculated. So it was rewarding to see the theoretical graph lines approach the empirical data more closely each iteration. That was 23 years ago.

But I digress... My intent is to run the PSO as I add features to my engine's evaluation function. Perhaps not after every feature, maybe after every two or three. If the PSO fails to find improvements, I'll consider implementing SGD or one of the gradient-free, non-linear global optimizers you mention. Thanks for your suggestions. Do you know if any of these algorithms are better suited for multi-threading than others?
My C# chess engine: https://www.madchess.net
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Particle Swarm Optimization Code

Post by jdart »

Batch gradient descent parallelizes nicely: each thread just computes a part of the training examples, and the parameters are constant for each iteration, so you can use an evaluation cache if you have one.
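
A sketch of that in C# with thread-local partial sums (illustrative only; the evaluate and gameResult delegates stand in for whatever computes a position's evaluation under the current, fixed parameters and looks up its game result):

Code: Select all

using System;
using System.Threading.Tasks;

public static class ParallelErrorSketch
{
    // Sum the squared errors across the training set in parallel.
    // The parameters are fixed for the whole iteration, so threads only need to
    // combine their partial sums at the end.
    public static double TotalError(int positionCount, Func<int, double> evaluate,
        Func<int, double> gameResult, double k)
    {
        var totalError = 0.0;
        var syncLock = new object();
        Parallel.For(0, positionCount,
            () => 0.0,                                   // per-thread partial error
            (i, loopState, partial) =>
            {
                var predicted = 1.0 / (1.0 + Math.Exp(-k * evaluate(i)));
                var diff = gameResult(i) - predicted;
                return partial + diff * diff;
            },
            partial => { lock (syncLock) { totalError += partial; } });
        return totalError;
    }
}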

SGD is harder to parallelize because the parameters aren't constant, but you can use "mini-batches" where they are. There is also a parallel SGD variant called HogWild (https://people.eecs.berkeley.edu/~brech ... wildTR.pdf), but it assumes sparsity in the parameters, which is generally not true for chess.
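
A mini-batch loop can be sketched like this (batchGradient is an assumed helper that computes the gradient over one slice of the training set with the weights held fixed):

Code: Select all

using System;

public static class MiniBatchSketch
{
    // One epoch of mini-batch gradient descent: weights are constant while a batch's
    // gradient is computed (so that computation can be parallelized), then updated
    // once per batch.
    public static void RunEpoch(double[] weights, int positionCount, int batchSize,
        Func<int, int, double[], double[]> batchGradient,   // (start, count, weights) -> gradient
        double learningRate)
    {
        for (var start = 0; start < positionCount; start += batchSize)
        {
            var count = Math.Min(batchSize, positionCount - start);
            var gradient = batchGradient(start, count, weights);
            for (var j = 0; j < weights.Length; j++)
                weights[j] -= learningRate * gradient[j];
        }
    }
}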

CMA-ES can support parallelized evaluation I believe.

--Jon
tomitank
Posts: 276
Joined: Sat Mar 04, 2017 12:24 pm
Location: Hungary

Re: Particle Swarm Optimization Code

Post by tomitank »

jdart wrote: Sun Nov 25, 2018 2:51 am
For machine learning, the standard algorithm is stochastic gradient descent (SGD).
I think SGD is the wrong name; simply GD (gradient descent) is the right name.

Different types of GD:
- Full-batch (vanilla/basic): loop over every training example
- Mini-batch: process n training examples at once
- Stochastic GD: use just one training example

I use full-batch GD.
I am experimenting with three different methods:
- ADAM (sketched below)
- Nesterov momentum
- A different learning rate for each weight
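
For reference, a sketch of the ADAM update rule from the paper linked above (illustrative C#, not my engine's actual code):

Code: Select all

using System;

public static class AdamSketch
{
    // One ADAM step. gradient[j] is d(error)/d(weight j) for the current batch;
    // m and v are running first and second moment estimates; t is the step number (1-based).
    public static void Step(double[] weights, double[] gradient, double[] m, double[] v, int t,
        double learningRate = 0.001, double beta1 = 0.9, double beta2 = 0.999, double epsilon = 1e-8)
    {
        for (var j = 0; j < weights.Length; j++)
        {
            m[j] = beta1 * m[j] + (1 - beta1) * gradient[j];
            v[j] = beta2 * v[j] + (1 - beta2) * gradient[j] * gradient[j];
            var mHat = m[j] / (1 - Math.Pow(beta1, t));    // bias-corrected first moment
            var vHat = v[j] / (1 - Math.Pow(beta2, t));    // bias-corrected second moment
            weights[j] -= learningRate * mHat / (Math.Sqrt(vHat) + epsilon);
        }
    }
}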

Each gives similar results.

I'm trying to use the L2 penalty as well.

I feel I have not yet found the perfect hyper-parameters. I'm still experimenting.

But the results so far are already very good: ~100 Elo better than the hand-tuned version.

But there are weights for which there are few training examples. I think those work better if they are not linear (e.g. king safety, passed pawns).
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Particle Swarm Optimization Code

Post by xr_a_y »

Just discovered this: https://esa.github.io/pagmo2/install.html

Great stuff!
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Particle Swarm Optimization Code

Post by Karlo Bala »

emadsen wrote: Sat Nov 24, 2018 11:38 pm
For anyone interested, I posted on my website the C# code I wrote that implements a multi-threaded particle swarm optimizer. See MadChess 3.0 Beta Build 75 (Eval Param Tuning). I find the idea rather straightforward and easy to understand. Each iteration, a particle's velocity is influenced by inertia and is drawn toward a) its individual best location, b) its swarm's best location, and c) the global best location. The magnitude of each vector (a, b, and c) is randomized so the particles jitter in parameter space.

Has anyone else used PSO to minimize evaluation error? If not, what algorithm(s) have you used? Did you write the code yourself or did you write an evaluation error cost function and plug it into a third-party optimization library? Just curious what others have done.
A few years ago, when CLOP was popular, I used PSO (and also GA and PBIL) as an experiment to tune Toga's parameters from scratch. After a few days on a very slow computer, the tuned Toga was marginally better than the original. PSO worked better than GA or PBIL.
It is not at all important for PSO to be multi-threaded, because 99.99% of the time was spent on engine-vs-engine matches.
Best Regards,
Karlo Balla Jr.
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Particle Swarm Optimization Code

Post by xr_a_y »

I plan to "texel tune" with PSO, not play games.