Reinforcement learning project

Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Reinforcement learning project

Post by Pio »

hgm wrote: Fri Mar 19, 2021 9:31 am Indeed, this is what I see. The fully tuned engine is not just weaker than the untuned one: it is enormously weaker. Basically it loses each and every game against the untuned engine.

It is also obvious why this is: the PST get such high values that it very quickly sacrifices two or three Pawns to put its other pieces on the 'optimal' squares. Problem is that these squares are only optimal for attacking the Palace under conditions where sufficiently many Pawns participate in the attack: in Janggi you need those Pawns to compensate for the fact that the opponent's Advisors defend the Palace, and your own Advisors cannot attack it. Without enough Pawns to trade away the Advisors, an attack has zero chance to succeed, no matter how well executed. So the PST bonuses you 'gained' at the expense of a few Pawns are in fact worthless. They are even worse than worthless, because with all hope for a victory gone, your only resort is to salvage the draw by defending your own Palace. And the locations good for attacking the enemy Palace in most cases are very bad for defending your own. So you get the worst of both worlds: you sacrificed Pawns to get your pieces in the worst possible locations positionally.

This is of course all caused by the fact that the evaluation function that was tuned was fundamentally flawed. It assumed the evaluation was purely additive, and did not account for the fact that you need completely different PST when ahead or equal in material (or at least in Pawns) than when behind.
If you just include games where your engine plays itself, and minimise the absolute error instead of the squared error, I believe the tuned engine would outperform the untuned one. Weighting the positions closer to the end of the game more heavily will probably help too.
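
To make the suggestion concrete, here is a minimal sketch of such a tuning objective, using an L1 (absolute) error between predicted and actual game result, with positions weighted by how late in the game they occur. The sigmoid scale of 400 and the field names are assumptions, not taken from any particular engine:

Code:

import math

def tuning_loss(positions):
    # Each position is assumed to carry: 'eval' (score in centipawns),
    # 'result' (1 / 0.5 / 0 from White's view), 'ply' and 'game_length'.
    total, norm = 0.0, 0.0
    for pos in positions:
        expected = 1.0 / (1.0 + math.exp(-pos['eval'] / 400.0))  # assumed scale
        weight = pos['ply'] / pos['game_length']      # later positions count more
        total += weight * abs(expected - pos['result'])  # L1 instead of squared error
        norm += weight
    return total / norm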

It is always interesting to read your posts.

It would be very interesting to see what results you would get if you orthogonalised your features, for example with PCA over your tuning set, and then tuned the weights in the new orthogonal basis. The goodness of the position could then be defined as the difference between the Euclidean distances for White and Black, where all the negative terms for Black would be transferred as positive values to White, and the same for White, before taking the square root.
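
A rough sketch of what the orthogonalisation step could look like, assuming the tuning set is available as a feature matrix (one row per position, one column per evaluation feature); the data here is only a placeholder:

Code:

import numpy as np

X = np.random.randn(10000, 32)   # placeholder: rows = positions, columns = eval features
y = np.random.rand(10000)        # placeholder: game results

Xc = X - X.mean(axis=0)                            # center the features
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs                                   # PCA: orthogonal (decorrelated) basis

weights, *_ = np.linalg.lstsq(Z, y, rcond=None)    # tune weights in the new basis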

/Pio
hgm
Posts: 27791
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

You want the absolute error because then the optimum is at the median rather than the average? So that if the data contains two clusters that should have been treated as different cases, but which the evaluation cannot distinguish, it would converge to somewhere inside the larger cluster, rather than somewhere in between the two?
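
A tiny numeric illustration of that median-versus-mean effect, with made-up cluster sizes and values:

Code:

import numpy as np

# 90 positions where the attack bonus is worth +150, and 10 where it backfires (-150),
# while the evaluation cannot tell the two cases apart.
scores = np.array([150] * 90 + [-150] * 10)

print(np.mean(scores))    # 120.0 -> squared-error optimum, diluted by the small cluster
print(np.median(scores))  # 150.0 -> absolute-error optimum, the undiluted bonus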

I don't think that it would solve the particular problem I am seeing, which is that the training set is dominated by Palace attacks where both players have several Pawns (the large cluster), while there are hardly any cases where the Pawns are distributed 5-0, 5-1 or 4-0, and occupying the squares favorable for attack by the pawn-poor player backfires (a small cluster with nearly opposite PST scoring). The median there would most certainly assign the undiluted attack bonus to the PST, without any reduction by the pawnless cases that needed opposite scoring. So the engine would think it even more worthwhile to sacrifice the Pawns to get pieces on those squares.

You will always get problems of this kind when the functional form of the evaluation makes it impossible to distinguish cases that need very different, very large scores. In such cases it would be best not to give a high score at all, as the search will be drawn to nodes with extreme scores, which in the cases that needed the opposite score is immediately suicidal. If the tuning averages things that should not be averaged (but distinguished), the result becomes dependent on the composition of the training set. And you could tailor that to make the average close to zero. Then it would still be sizable in either of the cases. But at least it would not draw the search to it.
hgm
Posts: 27791
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

I think I conceived an evaluation that better suits the peculiarities of this game than just adding independent piece values and positional bonuses (i.e. PST). Instead I can do

Code:

eval = (SUM_w PST[piece][sqr] - SUM_b PST[piece][sqr])
       + max(0, SUM_w APST[piece][sqr] - SUM_b DPST[piece][sqr])^2
       - max(0, SUM_b APST[piece][sqr] - SUM_w DPST[piece][sqr])^2

where SUM_w/SUM_b is the sum over all white/black pieces, the evaluation is taken from White's point of view, and APST/DPST are piece-square tables with attack and defense 'points', respectively. The idea is that the normal PST draw the pieces towards the enemy Palace, as attacking in general is more profitable than defending. But only weakly so, as single pieces never are very dangerous. (One needs at least 2, and often 3 pieces to checkmate a bare King.)

The APST values increase towards the enemy Palace; the DPST increase similarly towards your own Palace. The sum doesn't have to be strictly conserved, though: there can be locations where a piece is not very useful for either attack or defense. It is also possible that certain piece types are intrinsically more useful for attack than defense, or the other way around.

But the idea is that even when a piece is equally useful for attack or defense in its optimum location (which no doubt will be a different location for attacking and defending), the desirability of adding it to your attack or to your defense will be determined by which player already has the strongest attack. The non-linear behavior (square, in this case) of the APST/DPST bonuses will see to that. E.g. if the net white attack has 5 attack points, and the net black attack 9 points, adding two more points to the white attack will up the white score from 25 to 49 (a gain of 24), but adding those points to the white defense would up white's score by decreasing the net black attack from 81 to 49 (a gain of 32); beefing up a counter-attack that will nevertheless remain inferior to the opponent's attack on your Palace will still get you mated first, and just as quickly. So you'd better focus on lasting longer, by abandoning the counter-attack and defending your own Palace instead. For the side with the stronger net attack it pays more to strengthen that attack, so he can decide the game before any counter-attack can be successful.
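
For concreteness, a minimal sketch of such an evaluation; the table names and the (piece type, square) list format are hypothetical, and the squared terms are taken from White's point of view, so Black's net attack is subtracted:

Code:

def evaluate(white_pieces, black_pieces, PST, APST, DPST):
    # white_pieces / black_pieces: lists of (piece type, square) pairs
    pst   = sum(PST[p][s]  for p, s in white_pieces) - sum(PST[p][s]  for p, s in black_pieces)
    w_att = sum(APST[p][s] for p, s in white_pieces) - sum(DPST[p][s] for p, s in black_pieces)
    b_att = sum(APST[p][s] for p, s in black_pieces) - sum(DPST[p][s] for p, s in white_pieces)
    return pst + max(0, w_att) ** 2 - max(0, b_att) ** 2   # non-linear attack/defense terms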
smcracraft
Posts: 737
Joined: Wed Mar 08, 2006 8:08 pm
Location: Orange County California
Full name: Stuart Cracraft

Re: Reinforcement learning project

Post by smcracraft »

Some years ago I read about Andrew Tridgell's work on KnightCap and was intrigued. Recall it was a learning program doing coefficient manipulation of the evaluation function in play against similarly-rated others, from tabula rasa, on the internet chess server. It learned quickly. So I studied Sutton's Temporal Difference learning. Rich was in my class at Stanford and based his interests not on neural networks (too slow for the day due to slow hardware) but on Art Samuel's checkers-playing program. Rich co-wrote the standard text Reinforcement Learning, which I recommend.

Anyway, I added TD-Leaf, which Tridge used, to a version of my chess program, and on VERY slow hardware it learned the canonical piece weights in about 2000-2500 games of self-play from tabula rasa. Of course now that would take microseconds on the fastest possible hardware. I ran that by Don Beal at King's College in London and he verified I had replicated his work (rofl), so I felt happy as a clam. Nowadays, things are much more advanced. You might want to grab yourself a copy of Leela at lczero.org and dig in there instead. Wish that program had more of a killer instinct instead of just being a positional boa constrictor. They're working on it, I'm sure.
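
For readers who haven't seen it, a rough sketch of a TD-Leaf-style weight update for a linear evaluation (the gradient with respect to each weight is simply the corresponding feature value at the principal-variation leaf); the parameter values and the omission of any score squashing are assumptions:

Code:

def td_leaf_update(weights, leaf_features, leaf_scores, alpha=0.01, lam=0.7):
    # One pass over a finished game: leaf_features[t] is the feature vector of the
    # PV leaf after move t, leaf_scores[t] the corresponding search score.
    n = len(leaf_scores)
    for t in range(n - 1):
        # lambda-discounted sum of the remaining temporal differences
        td = sum(lam ** (j - t) * (leaf_scores[j + 1] - leaf_scores[j])
                 for j in range(t, n - 1))
        for i, f in enumerate(leaf_features[t]):
            weights[i] += alpha * f * td   # linear eval: gradient = feature value
    return weights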

Cheerio,
--Stuart
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Re: Reinforcement learning project

Post by maksimKorzh »

Mr. Muller

What kind of rule set for opposed kings do you use in your engine?
Btw is it microMax type or something different?
Could you give a logical clue (not necessarily implementation ideas) on how to deal with opposed kings?
I'm confused by the fact that Fairy-Stockfish gives different perft results for the same position when different Janggi variants are used.
I used this position for tests:
9/4k4/9/9/9/9/9/9/4K4/9 w - - 0 1
hgm
Posts: 27791
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

My engine plays according to the rule that Kings can completely ignore each other. I think Fairy-Stockfish calls this variant janggicasual.

If the BikJang rule applies, you can check whether the Kings face each other after every move. And if it is the second half-move in a row where this is the case, that move draws. It is a bit similar to repetition detection: you pass to the child whether the condition occurred (like you would pass an e.p. square), provided you didn't get it passed from the parent that the condition already occurred there (in which case it is an immediate draw, and there is no child to pass anything to).
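
A sketch of how that flag passing could look in a plain negamax search; the helper names (generate_moves, make_move, kings_face_each_other, evaluate) are hypothetical, not taken from any actual engine:

Code:

INF, DRAW_SCORE = 1000000, 0

def search(pos, depth, alpha, beta, bikjang_pending):
    # bikjang_pending: the Kings already faced each other after the previous half-move
    if depth == 0:
        return evaluate(pos)
    best = -INF
    for move in generate_moves(pos):
        child = make_move(pos, move)
        if kings_face_each_other(child):
            if bikjang_pending:
                score = DRAW_SCORE   # second half-move in a row: this move draws
            else:
                score = -search(child, depth - 1, -beta, -alpha, True)
        else:
            score = -search(child, depth - 1, -beta, -alpha, False)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best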

Of course a draw can be a partial loss because of the counting rule, when that rule applies. And I think Stockfish doesn't distinguish partial losses from full losses. And I am not sure it distinguishes either of those from illegal moves. I am not sure whether it counts such losing moves in perft.

BTW, the engine is not micro-Max-like, but more HaQiKi-D-like: it maintains a full piece list, and it lists possible move steps per piece & square, rather than just per piece (to account for the board irregularity and zonal confinement).
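
A possible (hypothetical) shape for such per-piece, per-square step tables for stepping pieces, with the board geometry baked into a precomputed lookup rather than tested during move generation; sliders would store step directions instead of target squares:

Code:

# STEPS[(piece, square)] -> list of target squares reachable by that piece type
# from that square, so Palace diagonals and zone confinement live in the table.
STEPS = {}

def init_steps(piece_types, squares, reachable):
    for p in piece_types:
        for s in squares:
            STEPS[(p, s)] = list(reachable(p, s))   # reachable() encodes the geometry

def piece_moves(piece, square, own_occupied):
    # move generation reduces to a table lookup plus an own-piece filter
    return [t for t in STEPS[(piece, square)] if t not in own_occupied]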
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Re: Reinforcement learning project

Post by maksimKorzh »

[Moderation] I accidentally erased this message by hitting the Edit button where I intended to quote it. I am very sorry about that. From the quotes in the posting below, part of the message can still be seen, but about half of it was destroyed.
H.G.
hgm
Posts: 27791
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

maksimKorzh wrote: Sun Mar 28, 2021 10:37 pm Did you do perft testing?
No
Did you count pass move (e.g. e1e1) as a node?
Currently the engine doesn't pass at all. I figured this was something that would only affect the game in the late end-game (e.g. in KCP-K), and I currently adjudicate such games before they get in that stage.
Could you please provide some positions for perft test and the exact logic of counting nodes if possible?
No, because I cannot do perft.
And one more question:
What does it mean in particular: "no bikjang in modern and casual variants"?
I understand bikjang as "draw if kings face each other", but it's unclear whether such a position should be treated as legal or illegal.
I mean, let's say we have the "no bikjang" variant - in this case is it legal to make a move leading to opposed generals?
If it is legal, then is it legal to pass a move while the generals are opposed?
Without BikJang the Kings can face each other as long as they want, just like Advisors can face each other. They are confined to their own Palace, and exert zero influence outside it.

With BikJang there still is an issue whether it counts as checking, and would fall under the ban on perpetual checking. E.g. in the KPP-K end-game, which in principle is a forced mate, the bare King can perpetually chase the other King with BikJang. In janggicasual this seems to be allowed. So KPP-K is only winnable there if you have at least one Pawn in the central 5 files.
These Korean rules are so tough and tricky...
How did you manage to bring them all together?
I'm debugging these two opposed kings for the whole day and feel totally lost and stuck...
Any ideas on how to "divide and conquer" these rules?
Well, I just stuck to the simplest set of rules (no BikJang, no partial victories by counting), and forbade bringing about a repetition when not in check. The latter is not even correct, but I assumed that it would not affect the general middle-game strategy very much, as long as you somehow had a rule that outlawed perpetual checking. The true rules are rather complex, and I still don't fully understand them.
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Re: Reinforcement learning project

Post by maksimKorzh »

hgm wrote: Mon Mar 29, 2021 9:20 am
maksimKorzh wrote: Sun Mar 28, 2021 10:37 pm Did you do perft testing?
No
Did you count pass move (e.g. e1e1) as a node?
Currently the engine doesn't pass at all. I figured this was something that would only affect the game in the late end-game (e.g. in KCP-K), and I currently adjudicate such games before they get in that stage.
Could you please provide some positions for perft test and the exact logic of counting nodes if possible?
No, because I cannot do perft.
And one more question:
What does it mean in particular: "no bikjang in modern and casual variants"?
I understand bikjang as "draw if kings face each other", but it's unclear whether such a position should be treated as legal or illegal.
I mean, let's say we have the "no bikjang" variant - in this case is it legal to make a move leading to opposed generals?
If it is legal, then is it legal to pass a move while the generals are opposed?
Without BikJang the Kings can face each other as long as they want, just like Advisors can face each other. They are confined to their own Palace, and exert zero influence outside it.

With BikJang there still is an issue whether it counts as checking, and would fall under the ban on perpetual checking. E.g. in the KPP-K end-game, which in principle is a forced mate, the bare King can perpetually chase the other King with BikJang. In janggicasual this seems to be allowed. So KPP-K is only winnable there if you have at least one Pawn in the central 5 files.
These Korean rules are so tough and tricky...
How did you manage to bring them all together?
I'm debugging these two opposed kings for the whole day and feel totally lost and stuck...
Any ideas on how to "divide and conquer" these rules?
Well, I just stuck to the simplest set of rules (no BikJang, no partial victories by counting), and forbade bringing about a repetition when not in check. The latter is not even correct, but I assumed that it would not affect the general middle-game strategy very much, as long as you somehow had a rule that outlawed perpetual checking. The true rules are rather complex, and I still don't fully understand them.
Thank you so much!
Now it's clear regarding bikjang and passes)

re: do you have perft? - no
- that's so cool))) I always get excited by the fact you almost never do perft and still have playable engines)

Thank you. I mean really. This was really helpful.
Hope to see our engines playing each other in Winboard soon)

P.S. Baek Seung Heong (you must already know him) is very pedantic regarding the rules) And it feels like a pain to implement even a single set of rules properly) But it's ok)