Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderator: Ras

algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Devlog of Leorik

Post by algerbrex »

Glad to hear you've found something to keep you motivated, Thomas!

Like you, I haven't made much progress with Blunder. There were some interesting ideas I had with neural networks that I shared here, but nothing significant in terms of strength gain. To be honest, I haven't touched the code base for Blunder in a couple of months now, and I'm sort of where you are right now, trying to decide if I'm going to conclude my work on Blunder and move on to other projects. I'm just not as interested in chess engine programming, or chess for that matter, as I had been for the past year and a half.

But I think you've had a very good idea. If there's anything in Blunder I'm confident I can still improve, it's the code originality, especially with tuning data.

Anyway, just a couple of random thoughts. Good luck!
emadsen
Posts: 440
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Devlog of Leorik

Post by emadsen »

lithander wrote: Sun Dec 04, 2022 1:08 am Recently I was trying to climb the Elo ladder, to do better in tournament matches. But my regrets in hindsight wouldn't be about reaching a certain Elo milestone. Instead, the biggest flaw is a lack of "purity". Since I wrote my first tuner for the PSQTs in MinimalChess I have been using the same set of 725k annotated positions from Zurichess. And looking at the Readme.txt that comes with these positions, their labels were derived by playing each position to a conclusion with Stockfish.
I have always avoided looking at other engines' source code when implementing new ideas in Leorik (ideas which I got from reading the forum or the wiki), but the tuner just transferred chess knowledge from Zurichess and Stockfish and encoded it into the weights of Leorik's evaluation. Nothing unethical about that. But I got interested in chess programming after hearing how AlphaZero learned chess from scratch through pure self-play.

Imagining myself looking back at Leorik as an abandoned project, I would really regret it if I hadn't made a serious attempt at doing something like that. All the weights and coefficients of the HCE are owed to the dataset my tuner is using. I need to create my own dataset! And I would have to start with a version of Leorik where all the borrowed knowledge is purged from the evaluation. Which means going back to material values!

Let me know if you are interested in any details. The post is already long enough but I would love to elaborate in the next one. ;)
I'm glad to hear you're back at it, Thomas. Nothing wrong with taking a break. I've taken an entire year away from MadChess at least twice in the last ten years.

I'm interested to hear more about your progress as you re-build Leorik's evaluation from zero knowledge.

I agree with your sentiments about engine purity. So much so that I found it strange how much discussion there's been over the last two years on this forum about PESTO piece square tables and the Zurichess data set. Newer chess engine programmers are fascinated (obsessed?) with those topics. I don't get it. By expressing PST values via piece advancement, centrality, distance to nearest corner, etc., and tuning against your own engine's games, you'll arrive at strong PST values. PESTO is unremarkable in this regard. I even experimented with PESTO values to verify I wasn't missing something due to my obstinacy regarding engine purity. I wasn't.

I think you'll find the "pure" approach of tuning your engine's evaluation from positions found in its own games to be rewarding and intellectually satisfying. I've had success with this technique. I run gauntlet tournaments where I pit MadChess against ten opponents near in rating to it (+/- 100 Elo). Then I feed positions from those games to my Texel tuner to find improved evaluation parameter values via a particle swarm algorithm.

This makes sense to me. I measure MadChess' strength by how it performs against engines of similar strength. If I wish to increase MadChess' playing strength, I need to "teach" it to defeat the opponents it encounters in gauntlet tournaments. Texel tuning against MadChess gauntlet game results does exactly that. Using Stockfish game results or CCRL games seems irrelevant to me.

One important caveat is to tune against enough positions to avoid over-fitting evaluation parameters to too small a data set. I somewhat arbitrarily decided 5 million positions was enough (about 40,000 games).
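For anyone new to the technique: the error function a Texel tuner minimizes fits in a few lines. What follows is only a sketch - the Position type, the evaluate delegate, and the scaling constant K are illustrative stand-ins, not MadChess's actual code.

Code: Select all

using System;
using System.Collections.Generic;

static class TexelTuning
{
    // Map a centipawn score to an expected game result in [0, 1].
    // K is a scaling constant fitted once to the data set.
    static double Sigmoid(int cp, double k = 1.13) =>
        1.0 / (1.0 + Math.Pow(10.0, -k * cp / 400.0));

    // Mean squared error between prediction and game result
    // (1.0 = white win, 0.5 = draw, 0.0 = black win). The tuner
    // (particle swarm, gradient descent, local search...) adjusts the
    // eval parameters to minimize this value over all positions.
    public static double Error(
        IReadOnlyList<(Position Pos, double Result)> data,
        Func<Position, int> evaluate)
    {
        double sum = 0.0;
        foreach (var (pos, result) in data)
        {
            double delta = Sigmoid(evaluate(pos)) - result;
            sum += delta * delta;
        }
        return sum / data.Count;
    }
}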

Good luck with your renewed efforts!
Erik Madsen | My C# chess engine: https://www.madchess.net
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

JoAnnP38 wrote: Sun Dec 04, 2022 3:53 pm BTW, I was really inspired by your implementation of MinimalChess. I thought your code was so clean it was beautiful.
emadsen wrote: Sun Dec 04, 2022 9:15 pm I'm glad to hear you're back at it, Thomas. Nothing wrong with taking a break.
algerbrex wrote: Sun Dec 04, 2022 6:58 pm Glad to hear you've found something to keep you motivated, Thomas!
Thanks, guys, for the friendly comments! Your messages of encouragement mean a lot to me. Without the members of this community taking an interest in each other's work, interesting engineering problems alone probably wouldn't be enough to keep me going! It's hard to care about something that nobody else notices.
emadsen wrote: Sun Dec 04, 2022 9:15 pm I'm interested to hear more about your progress as you re-build Leorik's evaluation from zero knowledge.
Yesterday I made another small code change: now the randomness can be configured separately for middlegame and endgame, with an interpolation in between. My thinking is that I want to try some really random-looking moves in the beginning so my training data includes a wider variety of board states. But I don't want to mislabel the entire game due to blunders in the late game that were caused by randomness.
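In sketch form the change is tiny; something like this (the names and the phase convention are simplified, not Leorik's actual code):

Code: Select all

// Blend the noise amplitude between a middlegame and an endgame setting.
// 'phase' runs from 1.0 (starting position) down to 0.0 (bare kings).
static int RandomEvalNoise(Random rng, double phase, int rndMidgame, int rndEndgame)
{
    int amplitude = (int)Math.Round(phase * rndMidgame + (1.0 - phase) * rndEndgame);
    // Uniform noise in [-amplitude, +amplitude], added to the static eval.
    return rng.Next(-amplitude, amplitude + 1);
}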

But mostly all I can do at the moment is wait for the games to trickle in! The tuning from zero knowledge definitely got off the ground, but the next big question is where it will hit a plateau. Nobody will care about the details of how I created a set of weights that underperforms! ;)
emadsen wrote: Sun Dec 04, 2022 9:15 pm I found it strange how much discussion there's been over the last two years on this forum about PESTO piece square tables and the Zurichess data set. Newer chess engine programmers are fascinated (obsessed?) with those topics. I don't get it. By expressing PST values via piece advancement, centrality, distance to nearest corner, etc and tuning against your own engine's games you'll arrive at strong PST values. PESTO is unremarkable in this regard. I even experimented with PESTO values to verify I wasn't missing something due to my obstinacy regarding engine purity. I wasn't.
I think I can still speak for the new engine programmers. ;) After you have a working move generator, some simple alpha-beta search is quickly implemented too, but then you need an eval. A material-only eval doesn't get you very far. I learned quickly that even the most simple chess engines use at least PSTs. And now - as a programmer - I'm stuck for the first time. Writing the code is easy enough, but where do you get the values from? At that point you just want to verify your implementation and try values you know have worked well for many others: the ones from PeSTO. I think it's a matter of visibility - each time they get mentioned (like now) their status gets reinforced.

A bit later I wrote my own tuner and managed to create PSTs that looked way different but were even a bit stronger in my engine. So I agree: they are not that special, apart from being well known. But now, instead of weights, I needed training data for the tuner. I believe there was a link on the Texel tuning wiki page to the Zurichess repo. And I had heard the dataset mentioned positively on the forums. Popularity is often self-reinforcing.
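To be clear about why the values, not the code, are the hard part: a tapered PST lookup is just a few lines (a sketch; the table layout is simplified, not Leorik's actual code):

Code: Select all

// Tapered eval: one midgame and one endgame table per piece type,
// blended by the game phase in [0, 1]. The lookup is trivial;
// finding good table values is the actual problem.
static int PstScore(int piece, int square, double phase, short[,] mg, short[,] eg)
{
    return (int)Math.Round(phase * mg[piece, square] + (1.0 - phase) * eg[piece, square]);
}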
JoAnnP38 wrote: Sun Dec 04, 2022 3:53 pm Now, I am excited about the prospect of a self-teaching chess engine. I have already built a component based on a genetic algorithm that will let my evaluation evolve over time. Essentially it will be a collection of "features" (as many as can reasonably be implemented), and then I will encode the weights of these features into a "chromosome" so it can be scored and generate progeny with other promising chromosomes in my gene pool. My initial population of chromosomes can just be generated randomly, and I'll enjoy watching evolution in progress. I imagine there will be a threshold such that if a feature's weight is too low it will drop out of the evaluation altogether, which could allow different genes to inspire different play. I'm quite excited about this part, as I don't think I will need a large database of games to "tune" my engine; rather, it will learn over time, and each chromosomal generation will start to converge on better and better chromosomes through survival of the fittest. I like this method because, if mutations are part of the process, there will be more of a chance that the engine's learning won't get trapped in a local maximum.
That sounds super interesting! There's something about genetic programming that makes it super appealing. When I mentioned to colleagues that I'm going to see if I can make my engine learn chess from scratch, they excitedly asked whether I use "genetics", and when I said nah, just linear regression, they were disappointed that it's "just" statistics. The idea of chess engines playing for survival is lovely. I can't help but visualize it in biological terms. Make sure to keep us updated!

I've considered trying something like you describe but worried that it wouldn't converge fast enough. Survival of the fittest in that framework basically means the individual organisms are each a unique set of weights (the state of their chromosomes); you'd play a lot of games between individuals but discard all the information other than who won or lost. And you use that history to influence who gets to procreate, right?
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
emadsen
Posts: 440
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Devlog of Leorik

Post by emadsen »

lithander wrote: Tue Dec 06, 2022 8:43 am Yesterday I made another small code change: now the randomness can be configured separately for middlegame and endgame, with an interpolation in between. My thinking is that I want to try some really random-looking moves in the beginning so my training data includes a wider variety of board states. But I don't want to mislabel the entire game due to blunders in the late game that were caused by randomness.
I'm curious about the technique of adding a random value to the evaluation score combined with self-play. How reliable is the game result as part of the fitness function? Self-play makes sense to me if you're at the top of the ratings mountain (Stockfish or Komodo). Is it reliable for a weaker engine? Or is it better to play a gauntlet against other engines near in strength, with no randomness? I chose the latter. I'm interested to see your results if you chose the former.
lithander wrote: Tue Dec 06, 2022 8:43 am I learned quickly that even the most simple chess engines use at least PSTs. And now - as a programmer - I'm stuck for the first time. Writing the code is easy enough, but where do you get the values from? At that point you just want to verify your implementation and try values you know have worked well for many others: the ones from PeSTO. I think it's a matter of visibility - each time they get mentioned (like now) their status gets reinforced... I believe there was a link on the Texel tuning wiki page to the Zurichess repo. And I had heard the dataset mentioned positively on the forums. Popularity is often self-reinforcing.
I think you're correct. They're mentioned often on this forum. If a new chess engine developer applies them to their untuned engine, they see an improvement. They think PESTO is amazing. Confirmation bias: they've heard PESTO is amazing and now they've seen it with their own eyes. But Texel tuning would have gotten them there anyhow, probably in greater harmony with the other eval params in their engine.
Erik Madsen | My C# chess engine: https://www.madchess.net
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

emadsen wrote: Wed Dec 07, 2022 7:34 pm I'm curious about the technique of adding a random value to the evaluation score combined with self-play. How reliable is the game result as part of the fitness function? Self-play makes sense to me if you're at the top of the ratings mountain (Stockfish or Komodo). Is it reliable for a weaker engine? Or is it better to play a gauntlet against other engines near in strength, with no randomness? I chose the latter. I'm interested to see your results if you chose the former.
I added randomness because otherwise a material-only evaluation considers all moves that don't lead to a winning capture equal. So without randomness there was no variety in the played games.

After playing 30k games with a randomized material-only engine, I extracted 15 positions from each game and used the game's outcome as the label. This gave me the first training data to tune weights for a new version. And this version was beating the material-only one with a close to 100% win rate!

So of course I just continued on that path, just to see where the limit of this super-simple approach would be. An important change I implemented midway was to make these 15 extracted positions quiet, using Leorik's quiescence search. So now all the positions in the training set are free of any winning captures - just like the positions Leorik would evaluate in practice. Other than that I did no filtering whatsoever, and the labels are still just the outcome of the game they were taken from.
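Roughly, the extraction looks like this (a sketch; Position and Quiesce are stand-ins for Leorik's actual types, and I sample with replacement for simplicity):

Code: Select all

// Sample 15 positions per game, make each one quiet by playing out the
// quiescence-search PV, and label every sample with the game's outcome
// (1.0 = white win, 0.5 = draw, 0.0 = black win).
static IEnumerable<(Position Quiet, double Result)> ExtractTrainingData(
    IReadOnlyList<Position> gamePositions, double result, Random rng, int samples = 15)
{
    for (int i = 0; i < samples; i++)
    {
        Position pick = gamePositions[rng.Next(gamePositions.Count)];
        // After Quiesce no winning captures remain in the position.
        yield return (Quiesce(pick), result);
    }
}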

But nonetheless I got pretty far with this. The most recent version approaches the performance of Leorik 2.2:

Code: Select all

Score of Leorik-2.2.8zeta vs Leorik 2.2: 3090 - 3565 - 2793  [0.475] 9448
Elo difference: -17.5 +/- 5.9, LOS: 0.0 %, DrawRatio: 29.6 %
To have a reference, I also tuned a set of weights based on the Zurichess set:

Code: Select all

Score of Leorik-2.2.8zurichess vs Leorik 2.2: 2768 - 2528 - 3593  [0.513] 8889
Elo difference: 9.4 +/- 5.6, LOS: 100.0 %, DrawRatio: 40.4 %
Considering that the quality of the Zurichess set is much higher, 30 Elo is a surprisingly small gap. Of course it's a gap I would like to close as much as possible. But sadly, at this point just playing more games with Zeta to train Eta doesn't seem to help much anymore. I've hit the anticipated plateau.

I haven't spent much time on comparing my dataset with the one from Zurichess. But it is interesting to note that when tuning on my own set of labeled positions the evaluation only achieves a mean squared error of 0.347434, while tuning on the Zurichess data brings the MSE down to 0.23451.

One obvious explanation is that some labels are probably just plain wrong. After all, Zurichess involved Stockfish in the labeling, where I followed the naive logic of "this was a position in a game that white won, so it must be winning for white", which ignores all the blunders my crippled engine makes, especially with randomization being part of the mix. But before I consider involving other engines like Stockfish, or the gauntlet you suggested, I will first focus on filtering. For example, if a game took 200 moves for a side to finally win, labeling the positions right after the opening as winning for either side is probably not helpful. I have plenty of ideas... but testing them will take a bit of time. There are no shortcuts, sadly: the MSE going down doesn't mean anything if the filtered set of positions is no longer a good sample of everything Leorik will have to evaluate during a real game.

But I have just a 30 Elo gap to close... how hard can it be, right? ;)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

Just a couple thoughts.
While getting training data, do you search each root move with an open window (-INF, INF)?
Since the object of the game is to checkmate, why not count checkmates for each root move and then randomize a bit?
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Mike Sherwin wrote: Thu Dec 08, 2022 4:15 pm While getting training data, do you search each root move with an open window (-INF, INF)?
Since the object of the game is to checkmate, why not count checkmates for each root move and then randomize a bit?
My approach is based on the assumption that an optimal set of training positions would contain positions that are all equally likely to be encountered during search, whose evaluations are all equally important for the outcome of the game, and that are all correctly labeled. This would create the best possible training data. I hope this assumption is correct - if you disagree, please let me know why!

For the above reasons I'm currently focusing on ideas that either help me a) align the sampling of positions in the training data more closely with what is going to be evaluated under real conditions or b) reduce the number of wrongly labeled positions.

I'm not sure how your suggestions would help with these goals. Shouldn't a proper implementation of alpha-beta search produce the same result whether you retain the narrow alpha-beta window or use an open window on each root move? The former just searches fewer nodes to achieve the same result, making it more efficient.
And why are moves that lead to a lot of checkmates within the search horizon likely to be good moves? Is there some theoretical reason for that? I mean, top engines mostly play moves that lead to a drawn position, so this seems counter-intuitive to me.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
chrisw
Posts: 4624
Joined: Tue Apr 03, 2012 4:28 pm
Location: Midi-Pyrénées
Full name: Christopher Whittington

Re: Devlog of Leorik

Post by chrisw »

lithander wrote: Fri Dec 09, 2022 3:36 pm
Mike Sherwin wrote: Thu Dec 08, 2022 4:15 pm While getting training data, do you search each root move with an open window (-INF, INF)?
Since the object of the game is to checkmate, why not count checkmates for each root move and then randomize a bit?
My approach is based on the assumption that an optimal set of training positions would contain positions that are all equally likely to be encountered during search, whose evaluations are all equally important for the outcome of the game, and that are all correctly labeled. This would create the best possible training data. I hope this assumption is correct - if you disagree, please let me know why!

For the above reasons I'm currently focusing on ideas that either help me a) align the sampling of positions in the training data more closely with what is going to be evaluated under real conditions or b) reduce the number of wrongly labeled positions.

I'm not sure how your suggestions would help with these goals. Shouldn't a proper implementation of alpha-beta search produce the same result whether you retain the narrow alpha-beta window or use an open window on each root move? The former just searches fewer nodes to achieve the same result, making it more efficient.
And why are moves that lead to a lot of checkmates within the search horizon likely to be good moves? Is there some theoretical reason for that?
It’s pretty much the logic of MCTS - play towards the region with winning lines for you.


I mean, top engines mostly play moves that lead to a drawn position, so this seems counter-intuitive to me.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

lithander wrote: Fri Dec 09, 2022 3:36 pm
Mike Sherwin wrote: Thu Dec 08, 2022 4:15 pm While getting training data, do you search each root move with an open window (-INF, INF)?
Since the object of the game is to checkmate, why not count checkmates for each root move and then randomize a bit?
My approach is based on the assumption that an optimal set of training positions would contain positions that are all equally likely to be encountered during search, whose evaluations are all equally important for the outcome of the game, and that are all correctly labeled. This would create the best possible training data. I hope this assumption is correct - if you disagree, please let me know why!

For the above reasons I'm currently focusing on ideas that either help me a) align the sampling of positions in the training data more closely with what is going to be evaluated under real conditions or b) reduce the number of wrongly labeled positions.

I'm not sure how your suggestions would help with these goals. Shouldn't a proper implementation of alpha-beta search produce the same result whether you retain the narrow alpha-beta window or use an open window on each root move? The former just searches fewer nodes to achieve the same result, making it more efficient.
And why are moves that lead to a lot of checkmates within the search horizon likely to be good moves? Is there some theoretical reason for that? I mean, top engines mostly play moves that lead to a drawn position, so this seems counter-intuitive to me.
Root move scores obtained with a narrowed window are not accurate. A move that fails low is simply "not better"; it may be only slightly worse or much worse, but alpha-beta does not care which.

If you use an open window for the root moves, you can count things accurately relative to each other for each root move. If checkmates are counted, normalized to a range like 0 to 30, and then randomized a bit, you still get a fair amount of randomization, but more time will be spent where more time should be spent. And checkmates are not the only thing that can be counted. Null-move cut failures can be counted; that would give an indication of good squares where pieces make the most threats. Also, more checkmates usually means more mobility. On the first move, counting checkmates for a3 and e4 will clearly show e4 to be the better candidate. Therefore, after being randomized, e4 will be played more often than a3 in the training games.
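A sketch of the root loop I have in mind (CountMates stands for a full-width search that tallies the checkmates found in the subtree; all names are illustrative):

Code: Select all

// Score each root move by the checkmates found in its subtree, searched
// with a full (-INF, INF) window so the counts are comparable. Normalize
// to 0..30, add a little noise, and play the best scoring move.
static Move PickTrainingMove(Position root, int depth, Random rng)
{
    var counts = new List<(Move Move, int Mates)>();
    int maxMates = 1; // avoids division by zero
    foreach (Move move in root.GetLegalMoves())
    {
        int mates = CountMates(root.Play(move), depth - 1, -Infinity, +Infinity);
        counts.Add((move, mates));
        maxMates = Math.Max(maxMates, mates);
    }
    Move best = default;
    double bestScore = double.MinValue;
    foreach (var (move, mates) in counts)
    {
        double score = 30.0 * mates / maxMates + rng.Next(0, 10); // normalize, then randomize
        if (score > bestScore) { bestScore = score; best = move; }
    }
    return best;
}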

When I experimented with this idea in Bricabrac I did not do it correctly, because I did not search each root move with an open window, so the counts were not accurate. Now that I have finally realized my mistake I am going to do some more experimenting. Despite my error, the material-only searches with counting produced intelligent-looking moves relative to pure material-only searches. Here is a self-play game between material-only searches (with counting applied after the search) where I did it wrong:
[pgn]1. e3 Nf6 2. Bc4 e6 3. Qf3 d6 4. g4 a5 5. Nh3 h5 6. g5 d5 7. Na3 dxc4 8. gxf6 gxf6 9. Nf4 h4 10. Nxc4 h3 11. a3 f5 12. Ne5 Qd6 13. d4 Bg7 14. Nfd3 Rh4 15. Nc4 Qa6 16. Rg1 Bf8 17. b3 Qa7 18. Nf4 b5 19. Ne5 a4 20. Rg8 Qa5 21. Bd2 Rxf4 22. exf4 Qa6 23. Qh5 Ke7 24. Qxf7 Kd6 25. Qxf8 Kd5 26. Qc5 Ke4 27. f3#[/pgn]

So the question is how you want the training games to spend their time. With pure randomness, 30k games is probably not enough; with intelligently guided (swayed) games, 30k games would be more than enough. Also, you could use reinforcement learning for those 30k games, like in RomiChess. In a test of ten positions, playing both sides of each position, RomiChess played ten matches against Glaurung 2. RomiChess scored 5% in match 1 and 95% in match 10. That big a swing with only 180 games of learning.
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Mike Sherwin wrote: Fri Dec 09, 2022 6:29 pm So the question is how you want the training games to spend their time. With pure randomness, 30k games is probably not enough; with intelligently guided (swayed) games, 30k games would be more than enough.
Thanks for the detailed explanation. I understand what you mean now. You were suggesting an improvement on the material-only + randomness evaluation so that the very first version of Leorik, with untrained weights, would already generate much better games.
It's something I would like to try, but at this point in time it would mean I'd have to start from scratch and lose all the progress.

This is what my input to the trainer looks like:

Code: Select all

string[] PGN_FILES = {
    //"leorik2X3_selfplay_startpos_5s_200ms_50mb_12112020.pgn",
    //"leorik2X3_selfplay_startpos_5s_200ms_50mb_16112020.pgn",
    //"leorik228a_startpos_RND25_100Hash_5s_200ms_selfplay.pgn",
    //"leorik228a_startpos_RND25_100Hash_5s_200ms_selfplay_2.pgn",
    //"leorik228a_startpos_RND25_100Hash_5s_200ms_selfplay_3.pgn",
    //"leorik228alpha_selfplay_startpos_RND25_100Hash_5s_200ms.pgn",
    //"leorik228alpha_selfplay_startpos_RND25_100Hash_5s_200ms_2.pgn",
    //"leorik228beta_vs_leorik228alpha_varied_RND30_100Hash_5s_200ms.pgn",
    //"leorik228beta_selfplay_startpos_RND30_100Hash_5s_200ms.pgn",
    "leorik228gamma_vs_leorik228beta_startpos_RND30_100Hash_5s_200ms.pgn",
    "leorik228gamma_selfplay_startpos_RND30_100Hash_5s_200ms.pgn",
    "leorik228gamma_selfplay_varied_RND30_100Hash_5s_200ms.pgn",
    "leorik228delta_vs_leorik228gamma_startpos_RND30_100Hash_5s_200ms.pgn",
    "leorik228delta_selfplay_startpos_RND30_100Hash_5s_200ms.pgn",
    "leorik228delta_selfplay_varied_RND30_100Hash_5s_200ms.pgn",
    "leorik228epsilon_vs_leorik228delta_startpos_RND30_100Hash_5s_200ms.pgn",
    "leorik228epsilon_vs_leorik228delta_startpos_RND35_100Hash_5s_200ms.pgn",
    "leorik228epsilon_selfplay_startpos_RND50-10_100Hash_5s_200ms.pgn",
    "leorik228epsilon_selfplay_one_with_book_startpos_RND50-10_100Hash_5s_200ms.pgn",
    "leorik228epsilon_selfplay_startpos_RND40-0_100Hash_5s_200ms.pgn",
    "leorik228epsilon_selfplay_varied_RND40-0_100Hash_5s_200ms.pgn",
    "leorik228zeta_selfplay_startpos_RND50-0_100Hash_5s_200ms.pgn",
    "leorik228zeta_vs_leorik228epsilon2_startpos_RND40-0_100Hash_5s_200ms.pgn",
    "leorik228zeta_vs_leorik228epsilon2_varied_RND40-0_100Hash_5s_200ms.pgn"
}; 
My trainer starts with completely reset weights and tunes them from scratch using the games in the list above. But note how the first 9 PGN files are commented out: games played with the material-only evaluation were only used up until the Delta version. Epsilon didn't use them anymore, and at the point where I am now, even Alpha and Beta are not included anymore. So the quality of the games my trainer is currently using is already pretty high.

The same quality could have been achieved with fewer games if I had started with the kind of guided material-only eval you suggest, but at this point it would still mean having to start from scratch and losing all my progress.

Speaking of progress: I have experimented with filtering the positions I pick from the games. Surprisingly, the straightforward filtering ideas did not help. Take excluding games of 200 moves or longer - you would assume that even if one side wins after 200+ moves, labeling *all* of that game's positions as winning would be misleading. But no... excluding these games was a regression. Or doing what Zurichess did: "From the set were removed all positions on which quiescence search found a winning capture." That was a huuuge regression.

Then I got inspired by what humans do. You know... how people settle on an ideology and then tend to filter out information that doesn't confirm their prior conclusions, so that their worldview remains consistent. So what I did is train a set of weights without any filtering. Then I trained again, but this time I filtered out all positions where the evaluation (with the previous weights) predicted a different result than the label. E.g. the label said winning for white but the position was evaluated at -150cp? Well... let's just ignore it. This lowered the MSE a good deal (as expected) but also improved the strength of the weights against Leorik 2.2 (which I didn't dare to hope for). What a strange discovery that this works!
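In code the filter itself is almost trivial (a sketch - the tuple layout and the eval delegate are simplified, not my actual trainer code):

Code: Select all

using System;
using System.Collections.Generic;
using System.Linq;

// Pass 1: tune weights on the unfiltered set. Pass 2: drop every position
// where the pass-1 evaluation contradicts the label, then tune again.
static List<(Position Pos, double Result)> FilterConsistent(
    List<(Position Pos, double Result)> data, Func<Position, int> evalPass1)
{
    return data.Where(entry =>
    {
        int cp = evalPass1(entry.Pos); // centipawns from white's point of view
        bool contradicts = (entry.Result == 1.0 && cp < 0)   // label: white won, eval: black better
                        || (entry.Result == 0.0 && cp > 0);  // label: black won, eval: white better
        return !contradicts; // keep draws and agreeing positions
    }).ToList();
}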
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess