Hello,
I did some tests and experiments. You might want to jump to the results first and read about the resources afterwards.
Engine description
The strength is somewhere in the range of 2200-2300 Elo. There is currently no need to measure that precisely, because I am only interested in the relative strength between the original and the tuned versions. The reason I mention this is that the effect of an Elo gain is certainly different in a more advanced engine.
• pvs
• transposition table
• mvv-lva
• killer
• best move to front at the root
• staged move generation
• qs (captures only)
• eval = material + pst
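Since the evaluation really is nothing more than material plus PST, here is a minimal sketch of that idea. The material values, the mirroring for black and the simple mg/eg phase blend are illustrative assumptions, not the engine's exact code.
Code:
struct Score { int mg; int eg; };

enum Piece { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING, PIECE_NB };
const int SQUARE_NB = 64;

const int material[PIECE_NB] = { 100, 320, 330, 500, 950, 0 }; // illustrative values
Score pst[PIECE_NB][SQUARE_NB];                                // filled by setup routines

int evaluate(const int pieceOn[SQUARE_NB], const bool isWhite[SQUARE_NB],
             bool whiteToMove, int phase)
{
    int mg = 0, eg = 0;
    for (int sq = 0; sq < SQUARE_NB; ++sq) {
        if (pieceOn[sq] < 0) continue;             // empty square
        int p    = pieceOn[sq];
        int sign = isWhite[sq] ? 1 : -1;
        int idx  = isWhite[sq] ? sq : sq ^ 56;     // mirror the table vertically for black
        mg += sign * (material[p] + pst[p][idx].mg);
        eg += sign * (material[p] + pst[p][idx].eg);
    }
    int score = (mg * phase + eg * (256 - phase)) / 256; // blend mg/eg by game phase (0..256)
    return whiteToMove ? score : -score;
}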
Data source
1. quiet-labeled.epd
The publicly known data set is a good starting point and makes it easy to compare results. There have been successful tuning sessions with this data; a well-known example is Rofchade. The set includes 750,000 training positions.
2. self-play
The engine played 150K games at a level of 64K nodes per move. Five positions were then picked randomly out of each game, so this training set also contains 750K positions. The positions were selected by the following criteria:
• position not in check
• abs(staticEval - qsEval) < 40
• fabs(sigmoid(staticEval) - result) <= 0.8 // fishy
Although these criteria influence the basic setup, I doubt that the specific choice has a high impact on the general process. More important than the selection criteria is the source itself, imo.
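A minimal sketch of such a selection filter is given below. Only the three criteria are fixed; the logistic sigmoid with a scaling constant K (the same kind of constant as used for the MSE later on) and the function names are illustrative assumptions.
Code:
#include <cmath>
#include <cstdlib>

double sigmoid(double eval, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * eval / 400.0));
}

// result: 1.0 = white win, 0.5 = draw, 0.0 = black win (from white's point of view)
bool keepPosition(bool inCheck, int staticEval, int qsEval, double result, double K) {
    if (inCheck)                                          return false; // position not in check
    if (std::abs(staticEval - qsEval) >= 40)              return false; // static eval close to qs eval
    if (std::fabs(sigmoid(staticEval, K) - result) > 0.8) return false; // drop "fishy" outliers
    return true;
}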
Tuning Setup
• Algorithm : SGD
• Learning Rate: const 0.1
• Epochs : 500
• PST : asymmetric
• Parameters : 768
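To make the setup concrete, here is a minimal sketch of what such an SGD run over the 768 PST parameters could look like: a Texel-style squared error between the game result and a sigmoid of the (linear) evaluation, scaled by a constant K, with a constant learning rate of 0.1 over 500 epochs. The data layout and gradient details are simplified and illustrative, not the exact implementation.
Code:
#include <cmath>
#include <vector>

// One training position, reduced to what the tuner needs: for every parameter
// (one of the 768 PST entries) the coefficient with which it enters the linear
// evaluation (e.g. +1 for a white knight on that square, -1 for a black one),
// plus the game result.
struct TrainingPos {
    std::vector<std::pair<int, double>> coeffs; // (parameter index, coefficient)
    double result;                              // 1.0 / 0.5 / 0.0 from white's view
};

double sigmoid(double eval, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * eval / 400.0));
}

// Plain SGD: one pass over the data per epoch, constant learning rate.
void tune(std::vector<double>& params, const std::vector<TrainingPos>& data,
          double K, double lr = 0.1, int epochs = 500)
{
    const double dSig = std::log(10.0) * K / 400.0;        // factor from d/dx of the sigmoid
    for (int epoch = 0; epoch < epochs; ++epoch) {
        for (const TrainingPos& pos : data) {
            double eval = 0.0;
            for (auto [idx, c] : pos.coeffs) eval += c * params[idx];
            double s    = sigmoid(eval, K);
            double err  = s - pos.result;
            double grad = 2.0 * err * s * (1.0 - s) * dSig; // chain rule through the sigmoid
            for (auto [idx, c] : pos.coeffs)
                params[idx] -= lr * grad * c;               // SGD update per parameter
        }
    }
}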
Experiment 1 – Test against Original PSTs
Result vs original PSTs (match 1)
• opening : book UHO_XXL_100_129
• trained with: quiet-labeled.epd
• Score WLD : 315 - 97 - 56 [0.733] 468
• Elo : +175.35 +/- 32.86
• Stopped : SPRT llr 2.97, lbound -2.94, ubound 2.94 - H1 was accepted
Result vs original PSTs (match 2)
• opening : book UHO_XXL_100_129
• trained with: 64k_5.epd
• Score WLD : 389 - 168 - 76 [0.675] 633
• Elo : +126.62 +/- 26.87
• Stopped : llr 2.95, lbound -2.94, ubound 2.94 - H1 was accepted
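As a sanity check, the reported Elo differences follow from the score fraction via the usual logistic relation. This is just a back-of-the-envelope check, not part of the test output:
Code:
#include <cmath>
#include <cstdio>

// Elo difference implied by a score fraction s under the logistic model.
double eloFromScore(double s) { return 400.0 * std::log10(s / (1.0 - s)); }

int main() {
    std::printf("%.1f\n", eloFromScore(0.733)); // ~ +175, first result above
    std::printf("%.1f\n", eloFromScore(0.675)); // ~ +127, close to the second result
    return 0;                                   // (small difference from rounding the fraction)
}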
Experiment 2 – Test against Rofchade PSTs
Result vs Rofchade PSTs (match 1)
• opening : book UHO_XXL_100_129
• trained with : quiet-labeled.epd
• Score WLD : 7505 - 7311 - 4813 [0.505] 19629
• Elo : +3.43 +/- 4.22
• Stopped : SPRT llr 2.96, lbound -2.94, ubound 2.94 - H1 was accepted
Result vs Rofchade PSTs (match 2)
• opening : 8moves_gm_lb.epd
• trained with : quiet-labeled.epd
• Score WLD : 5674 - 5784 - 5257 [0.497] 16715
• Elo : -2.29 +/- 4.35
• Stopped : by hand
Conclusion
The most important point is that the tuning process works in two respects. First, the optimizer minimizes the error of the evaluation function. Second, it improves the Elo strength. The impressive Elo improvements show that different kinds of data sources have a very different impact. Although the result with the well-known quiet-labeled.epd is better than with the self-play database, it is important to know that, based on the current engine level, you can produce new data with the potential to make further progress. That capability might be more important than the better result of a single tuning session.
The second experiment is very exciting too. The Rofchade PSTs are based on the same data source, quiet-labeled.epd. For this test the engine was prepared to plug in these Rofchade values. As you can see from the match result, it is a head-to-head match.
While the match statistics are close, the MSE scores for the quiet-labeled data set are:
K : 1.663756460
MSE Rofchade: 0.0615515096270968
MSE Engine : 0.0614646864265578
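For reference, the MSE here is the mean squared error between the game result and the sigmoid-scaled static evaluation, computed with the same K for both tables so the two numbers are directly comparable. Roughly (the exact sigmoid form is an assumption of the standard setup):
Code:
#include <cmath>
#include <vector>

double mse(const std::vector<double>& evals,    // static evals with the tables under test
           const std::vector<double>& results,  // game results: 1.0 / 0.5 / 0.0
           double K)
{
    double sum = 0.0;
    for (size_t i = 0; i < evals.size(); ++i) {
        double s = 1.0 / (1.0 + std::pow(10.0, -K * evals[i] / 400.0));
        double d = results[i] - s;
        sum += d * d;
    }
    return sum / evals.size();
}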
It may be worth mentioning that the original values (before tuning) were not useless tables.
Here are the initial values (before tuning) for the knight and the king, which were then improved.
So the reported Elo gains are relative to tables that already played reasonable chess.
Code:
void Eval::setup_knights()
{
    // A1 ... H8
    const int mg[ID64] = {
        -28, -22, -18, -12, -12, -18, -22, -28,
        -10,  -4,   0,   6,   6,   0,  -4, -10,
         -6,   0,   4,  10,  10,   4,   0,  -6,
          0,   6,  10,  16,  16,  10,   6,   0,
          0,   6,  10,  16,  16,  10,   6,   0,
         -6,   0,   4,  10,  10,   4,   0,  -6,
        -10,  -4,   0,   6,   6,   0,  -4, -10,
        -16, -10,  -6,   0,   0,  -6, -10, -16
    };
    const int eg[ID64] = {
        -24, -16, -10,  -4,  -4, -10, -16, -24,
        -16,  -8,  -2,   4,   4,  -2,  -8, -16,
        -10,  -2,   4,  10,  10,   4,  -2, -10,
         -4,   4,  10,  16,  16,  10,   4,  -4,
         -4,   4,  10,  16,  16,  10,   4,  -4,
        -10,  -2,   4,  10,  10,   4,  -2, -10,
        -16,  -8,  -2,   4,   4,  -2,  -8, -16,
        -24, -16, -10,  -4,  -4, -10, -16, -24
    };

    for (int i = A1; i <= H8; i++) {
        pst[KNIGHT][i] = Score(mg[i], eg[i]);
    }
}
Code:
// King tables (middlegame / endgame). The enclosing function and the copy loop
// are assumed here by analogy with setup_knights() above; only the values are original.
void Eval::setup_kings()
{
    // A1 ... H8
    const int mg[ID64] = {
          8,  16,   8,  -4,   0,  -4,  16,   8,
          0,   0,  -4,  -4,  -4,  -4,   0,   0,
         -8,  -8,  -8,  -8,  -8,  -8,  -8,  -8,
        -12, -12, -12, -12, -12, -12, -12, -12,
        -16, -16, -16, -16, -16, -16, -16, -16,
        -20, -20, -20, -20, -20, -20, -20, -20,
        -24, -24, -24, -24, -24, -24, -24, -24,
        -28, -28, -28, -28, -28, -28, -28, -28
    };
    const int eg[ID64] = {
        -48, -20, -16,  -8,  -8, -16, -20, -48,
        -20,   8,  12,  20,  20,  12,   8, -20,
        -16,  12,  16,  24,  24,  16,  12, -16,
         -4,  24,  28,  36,  36,  28,  24,  -4,
          0,  28,  32,  40,  40,  32,  28,   0,
        -12,  16,  20,  28,  28,  20,  16, -12,
        -16,  12,  16,  24,  24,  16,  12, -16,
        -44, -16, -12,  -4,  -4, -12, -16, -44
    };

    for (int i = A1; i <= H8; i++) {
        pst[KING][i] = Score(mg[i], eg[i]);
    }
}
Finally, let's have a look at the optimization process.
Time duration
Most impressive for me is that a tuning session took between 15 and 20 minutes for 500 epochs. So far I have not logged the exact duration. To be honest, I don't care (at the moment) whether that can be done in five or even two minutes, because the validation (test match) takes much longer anyway. It is not an overnight process or a matter of days; we are talking about minutes. This is basically independent of the number of parameters, whether eight values, 800 or even 8000.
Hill climbing or oscillation
Another interesting point is that I introduced two indicators in my output that show me the ratio of successful epochs and the latest successful update. When the tuning session starts it looks like 1/1, 2/2, ..., 35/35. Closer to epoch 100 it begins to look like 76/96 (current epoch 100). That means I had improvements in about 75% of the epochs over the run, and the latest update was in epoch 96.
So while it is getting closer to the (an) optimum, it looks like it is oscillating, but it isn't.
It just climbs small hills, because at some point the optimum is improved again. A good starting point to learn something about the learning rate and to spend some time on it.
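A minimal sketch of these two indicators (the variable names and the error-based acceptance check are illustrative, not the actual tuner code):
Code:
#include <cstdio>

// Tracks how many epochs improved the best error so far, and in which epoch
// the last improvement happened, producing output like "76/96 (current 100)".
struct ProgressIndicator {
    double bestError        = 1e9;
    int    successfulEpochs = 0;
    int    lastImprovement  = 0;

    void update(int epoch, double error) {
        if (error < bestError) {          // this epoch improved the best error so far
            bestError = error;
            ++successfulEpochs;
            lastImprovement = epoch;
        }
        std::printf("%d/%d (current %d)\n", successfulEpochs, lastImprovement, epoch);
    }
};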
Epochs and parameters
Although the patterns and the produced numbers between 200 and 500 epochs look very similar (to the human eye), the test match afterwards sometimes reported negative Elo differences between 10 and 50 Elo. This might depend on how I implemented the optimizer.
Sample size
Well, I have EPD files that include more than 100 million positions. But it looks like, even with the generated 64k_5.epd based on a low-level engine, 750K positions can lead to massive improvements with a very basic selection process.
In the end the basics are working, so I will spend some time on more advanced gradient-descent techniques. Exploring the world of optimization on my own took a lot of time, and of course I am very happy that things are becoming successful now. In particular, I would like to elaborate more on the choice of data.
On the fly, I did some other tests too. One was to check symmetric tables against asymmetric tables. The result was very clear, so I did not mention it above, but compared to the original scores it was also an improvement. Another test was to tune the tables one after the other, like K, P, N, B, R, Q. That worked too, but it was not as efficient as tuning all parameters in one run. And I did a lot of useless stuff that I won't talk about.
If someone is interested in the resulting tables, just send me a PM.
P.S.: I will think about separate White/Black tables. Because the asymmetric tables are 40 to 60 Elo better than the symmetric tables, that might give another boost. It won't take long to create such tables.
Michael