A.k.a. how to tune high-quality PSTs from scratch (starting with just material values) in 20 seconds.
All previous versions of Leorik have just taken their PST values from MinimalChess. I was pretty happy with the quality, but tuning these tables took a lot of time: literally days of trial and error, with hours of waiting in between.
Now I wanted to revisit the tuning process to get shorter iteration times for my upcoming experiments with a better evaluation.
I'm not sure how familiar everybody here is with tuning, so maybe a short recap/introduction can't hurt. Everyone knows PSTs, I guess. But where do the values come from? It turns out you can write a program to "extract" these values mathematically from a set of annotated positions. That's typically just a text file where each line contains the FEN of a position and the game's result: a win, a loss or a draw. And you have a lot of such lines. Maybe a million.
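Such a line could look like this. The exact annotation format varies from data set to data set; here, purely for illustration, the result is appended in brackets (1.0 = white wins, 0.5 = draw, 0.0 = black wins):
Code:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 [0.5]
4k3/8/8/8/8/8/8/3QK3 w - - 0 1 [1.0]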
It's easy to imagine that you can use such a data set to assess the predictive quality of your evaluation. You typically use a sigmoid function to map the unbounded evaluation score, given in centipawns, into the range [-1..1], the spectrum between a loss and a win. The difference between the stored outcome and your prediction is the error. Iterate over the complete set of positions and compute the mean squared error of your eval. The goal of tuning is now to find a configuration of PST values that minimizes this error function!
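In code the per-position error boils down to a few lines. Here's a minimal sketch of the SquareError helper that the tuning code further down refers to, using the same sigmoid as the gradient step shown later and assuming the outcomes are stored as -1, 0 or +1:
Code:
//minimal sketch: map the eval into [-1..1] with a sigmoid and square the
//difference to the stored game result (assumed to be -1, 0 or +1)
private static double SquareError(float result, float eval, double scalingCoefficient)
{
    double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
    double error = result - sigmoid;
    return error * error;
}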
Maybe you've heard of Texel's tuning method or used it yourself. That's what I used for MinimalChess, too, because it's really... really... simple. You literally just add +1 to one of your PST values and compare the result of the error function before and after the change. Did it help? If not, you try decreasing the value instead. You do that for all PST values, and you keep doing it until neither increasing nor decreasing any of them minimizes the error function further.
This is pretty slow, because for each change of *one* value you have to run the error function over all positions again. In my case that's 725,000 positions. You can do that only a few times per second, and you need a lot of +1s and -1s until you arrive at good PSTs if you start from just material values.
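In rough code the whole method looks something like this sketch (just an illustration, not Leorik's actual code; it uses the MeanSquareError function shown further down):
Code:
//Texel's method, sketched: nudge one value at a time and keep the change
//only if the mean squared error over all positions improves
bool improved = true;
while (improved)
{
    improved = false;
    double best = MeanSquareError(data, coefficients, scalingCoefficient);
    for (int i = 0; i < N; i++)
    {
        coefficients[i] += 1; //try increasing...
        double error = MeanSquareError(data, coefficients, scalingCoefficient);
        if (error >= best)
        {
            coefficients[i] -= 2; //...then try decreasing
            error = MeanSquareError(data, coefficients, scalingCoefficient);
        }
        if (error < best)
        {
            best = error;
            improved = true;
        }
        else
            coefficients[i] += 1; //neither helped: restore the old value
    }
}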
Over the last year I remember a few posts where people mentioned they were using gradient descent for their tuning. I also remembered a good online course on machine learning that I did a few years ago (and subsequently forgot all details about), so I watched the relevant videos again. Andrew Ng is really a good teacher, but the material is very math-heavy and doesn't seem to fit what I'm doing with my chess engine at all. Lots of Greek symbols and partial derivatives. How does all that relate to my incremental evaluation function? With tapering based on game phase? The vector of coefficients is easy: that's just all the PST values together, a vector with 768 elements. But what's my feature vector?
So I turned the question around: what feature vector, multiplied with a vector of coefficients, would produce the same result as my engine's current evaluation? It turns out you can calculate such a vector for each FEN, and at that point the tuning no longer has much to do with the details of your engine. Now it fits the information you find on Wikipedia or elsewhere.
Code:
public static float[] GetFeatures(BoardState pos, double phase)
{
    float[] result = new float[N];
    //phase is used to interpolate between midgame and endgame score, but we want to incorporate it into the feature vector:
    //score = midgameScore + phase * (endgameScore - midgameScore)
    //score = midgameScore + phase * endgameScore - phase * midgameScore
    //score = phase * endgameScore + (1 - phase) * midgameScore
    float phaseEg = (float)phase;
    float phaseMg = (float)(1 - phase);
    ulong occupied = pos.Black | pos.White;
    for (ulong bits = occupied; bits != 0; bits = Bitboard.ClearLSB(bits))
    {
        int square = Bitboard.LSB(bits);
        Piece piece = pos.GetPiece(square);
        int pieceOffset = ((int)piece >> 2) - 1;
        //mirror the square vertically for white so both colors index the tables from their own perspective
        int squareIndex = (piece & Piece.ColorMask) == Piece.White ? square ^ 56 : square;
        int sign = (piece & Piece.ColorMask) == Piece.White ? 1 : -1;
        //each (piece, square) pair occupies two adjacent slots: midgame and endgame
        int iMg = pieceOffset * 128 + 2 * squareIndex;
        int iEg = iMg + 1;
        result[iMg] += sign * phaseMg;
        result[iEg] += sign * phaseEg;
    }
    return result;
}
Code:
public static double MeanSquareError(List<Data2> data, float[] coefficients, double scalingCoefficient)
{
    double squaredErrorSum = 0;
    foreach (Data2 entry in data)
    {
        float eval = Evaluate(entry.Features, coefficients);
        squaredErrorSum += SquareError(entry.Result, eval, scalingCoefficient);
    }
    double result = squaredErrorSum / data.Count;
    return result;
}

public static float Evaluate(float[] features, float[] coefficients)
{
    //dot product of features vector with coefficients vector
    float result = 0;
    for (int i = 0; i < N; i++)
        result += features[i] * coefficients[i];
    return result;
}
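Put together, reproducing the engine's tapered evaluation during tuning is just one dot product. A usage sketch (pos and phase come from parsing the FEN):
Code:
//extract the features once, then the dot product with the current
//coefficients reproduces the engine's tapered evaluation
float[] features = GetFeatures(pos, phase);
float eval = Evaluate(features, coefficients);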
First of all, gradient descent is already much faster than Texel's tuning method, because a single pass over the training set accumulates information for every coefficient at the same time: you take note of whether each one was, on average, contributing too much or too little to the evaluations. (That information is the gradient.) Since the evaluation is just a dot product, the partial derivative of the error with respect to a coefficient is proportional to the error times that coefficient's feature value; in practice the sigmoid's derivative and all constant factors can be folded into the learning rate. Based on the gradient you can then adjust all 768 PST values at once, each in the right direction. When you start from just material values (e.g. all pawn-table entries set to 100), about 2000 such iterations later you already have pretty decent PSTs. One iteration took less than a second to compute, so it takes maybe half an hour to compute a good set of PSTs from scratch.
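The complete tuning run is then just an outer loop over such iterations. A sketch, where Minimize is the gradient-descent step shown further down and alpha is the learning rate:
Code:
//one epoch = one gradient-descent step computed over the full training set
for (int epoch = 0; epoch < 2000; epoch++)
    Minimize(data, coefficients, scalingCoefficient, alpha);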
But can we make it faster?
My first attempt was to use SIMD instructions.
Code:
public static float EvaluateSIMD(float[] features, float[] coefficients)
{
    //dot product of features vector with coefficients vector,
    //processing Vector<float>.Count elements per step (System.Numerics);
    //assumes N is a multiple of the vector width
    float result = 0;
    int slots = Vector<float>.Count;
    for (int i = 0; i < N; i += slots)
    {
        Vector<float> vF = new Vector<float>(features, i);
        Vector<float> vC = new Vector<float>(coefficients, i);
        result += Vector.Dot(vF, vC);
    }
    return result;
}
If you look at a feature vector, however, you realize that only a few percent of its elements are actually non-zero. The vast majority of the multiplications and additions don't affect the result at all. So I introduced an index buffer for each feature vector, storing the indices of the non-zero values.
Code:
private static float Evaluate(float[] features, short[] indices, float[] coefficients)
{
    //dot product of a selection (indices) of elements from the features vector with the coefficients vector
    float result = 0;
    foreach (short i in indices)
        result += features[i] * coefficients[i];
    return result;
}
But even though the index buffer helps to save pointless computations, all these zeros still take up a lot of space in memory. After loading the 725,000 positions, each encoded as 768 floats, the process uses 2,500 MB of memory. To store mostly zeroes! So I dropped the dense float arrays entirely and kept only the non-zero features, each stored as an index/value pair.
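A sketch of that sparse representation: the Feature struct matches the Index and Value fields used by the tuner below, while the Condense helper is just for illustration.
Code:
//only the non-zero entries of a dense feature vector survive,
//each stored as an index/value pair
public struct Feature
{
    public short Index;
    public float Value;
}

public static Feature[] Condense(float[] dense)
{
    List<Feature> sparse = new List<Feature>();
    for (short i = 0; i < N; i++)
        if (dense[i] != 0)
            sparse.Add(new Feature { Index = i, Value = dense[i] });
    return sparse.ToArray();
}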

The process memory shrank to just 300 MB that way. And now a full iteration of gradient descent takes only 65 ms! This is a great example of how important cache-friendly programming is these days! And it was my first big WHOAAT? moment of the day.
This is already quite fast. Fast enough for all practical purposes, probably, but it's still just running on one thread. In theory this workload should be well suited for parallelization, and I had been waiting for an excuse to try .NET's Task Parallel Library.
It's one of those cases where the documentation leaves you scratching your head for a while. Seriously... just look at the method signatures of the different Parallel.ForEach overloads.
But in actual code it looks quite elegant.
Code:
public static void Minimize(List<Data2> data, float[] coefficients, double scalingCoefficient, float alpha)
{
    float[] accu = new float[N];
    foreach (Data2 entry in data)
    {
        float eval = Evaluate(entry.Features, coefficients);
        //map the eval into [-1..1] and compare it to the stored result
        double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
        float error = (float)(sigmoid - entry.Result);
        //accumulate the gradient: each non-zero feature gets its share of the error
        foreach (Feature f in entry.Features)
            accu[f.Index] += error * f.Value;
    }
    //take a small step (alpha) against the accumulated gradient
    for (int i = 0; i < N; i++)
        coefficients[i] -= alpha * accu[i] / data.Count;
}
Code:
public static void MinimizeParallel(List<Data2> data, float[] coefficients, double scalingCoefficient, float alpha)
{
    //each thread maintains a local accu. After the loop is complete the accus are combined
    Parallel.ForEach(data,
        //initialize the thread-local accu
        () => new float[N],
        //invoked by the loop on each iteration in parallel
        (entry, loop, accu) =>
        {
            float eval = Evaluate(entry.Features, coefficients);
            double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
            float error = (float)(sigmoid - entry.Result);
            foreach (Feature f in entry.Features)
                accu[f.Index] += error * f.Value;
            return accu;
        },
        //executed when each partition has completed
        (accu) =>
        {
            lock (coefficients)
            {
                for (int i = 0; i < N; i++)
                    coefficients[i] -= alpha * accu[i] / data.Count;
            }
        }
    );
}
And how fast is it? If you run the program after this change, all 12 logical cores of my Ryzen 3600 are reported to be equally utilized at 75% load, and the net result is another roughly 7x speedup: now it's 9 ms per iteration of gradient descent. That was the second time my jaw dropped today.
That's 2000 epochs in less than 20 seconds. And that allows you to tune tapered PSTs from scratch, starting with just material values, in the time it takes to... well... I don't know much that could be accomplished in 20 seconds, actually.
