Some more experiments with neural nets

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Some more experiments with neural nets

Post by jonkr »

I recently made an update to the neural network architecture in Slow Chess.
After a few days of testing, the current dev version is scoring +52 elo against Slow Chess 2.5 and +15 elo against Stockfish 9. *

The main change to the net structure is the addition of a Pawn-King-Material network, which is stored in a hash table like a typical pawn hash. Despite writing last year that something like this looked promising, I only now got around to fully implementing it and cleaning up my implementation enough to fit in with my default training process. The idea is just to do what has been done for HCE in chess, but instead of specific outputs, let the neural net output whatever it decides.

The PKM hash network size I'm currently using is 208 inputs -> 256 -> 32 outputs. For material there are 2 inputs for each non-pawn piece type, with the bishop inputs always in the same order by dark then light square. There are also 4 castling-flag inputs.

The 32 outputs from the hash are concatenated onto the end of the first-layer values of the general network, which uses all pieces/pawns as inputs. I didn't try removing pawn inputs from the general network since I figure they are important for mobility/tactics.
My current general network is 705 inputs -> 352 + 32 from PKM hash -> 32 -> 16 -> 1.
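To make the structure concrete, here is a minimal numpy sketch of the forward pass as described above. The ReLU-style activation and plain float math are assumptions (the thread doesn't say what Slow Chess actually uses), and quantization/incremental updates are omitted.

```python
import numpy as np

def relu(x):
    # Activation is an assumption; the actual choice in Slow Chess isn't stated.
    return np.maximum(x, 0.0)

def pkm_forward(pkm_inputs, W1, b1, W2, b2):
    """PKM sub-net: 208 binary inputs (later corrected to 212 in this thread) -> 256 -> 32.
    The 32-wide output is what gets stored in the pawn/king/material hash."""
    h = relu(W1 @ pkm_inputs + b1)            # (256,)
    return relu(W2 @ h + b2)                  # (32,)

def general_forward(gen_inputs, pkm_out, G1, g1, G2, g2, G3, g3, G4, g4):
    """General net: 705 inputs -> 352 first-layer values, concatenated with the
    32 cached PKM outputs to form a 384-wide vector, then -> 32 -> 16 -> 1."""
    first = relu(G1 @ gen_inputs + g1)        # (352,)
    first = np.concatenate([first, pkm_out])  # (384,) = 352 general + 32 from PKM hash
    h2 = relu(G2 @ first + g2)                # (32,)
    h3 = relu(G3 @ h2 + g3)                   # (16,)
    return G4 @ h3 + g4                       # scalar evaluation
```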

One issue with this structure is that the PKM-hash neural network probably won't capture the values it should during training; a lot will end up in the bigger general network part. My solution was to use a dropout layer between the general inputs and the general 1st layer for the first part of training, so values get captured by the PKM hash, then for the second part the dropout layer is de-activated and training resumes for more epochs. The first time I tried with a dropout layer I got a much lower error and a 15-elo increase, which, while not exact, implies that it was quite helpful. This is the first time in training neural nets for anything that I found a dropout layer to be clearly effective, so I'm pretty happy to see that. I think training the PKM part first might also help get better values for imbalanced positions.
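A rough PyTorch-style sketch of the two-phase idea (my trainer is custom, so the module layout, the 0.5 dropout rate, and the training-loop details here are illustrative assumptions only):

```python
import torch
import torch.nn as nn

class EvalNet(nn.Module):
    """Illustrative only: PKM sub-net plus general net with concatenated first layer.
    Sizes follow the posts above; the dropout rate is a guess."""
    def __init__(self, p_drop=0.5):
        super().__init__()
        self.pkm = nn.Sequential(nn.Linear(212, 256), nn.ReLU(),
                                 nn.Linear(256, 32), nn.ReLU())
        self.drop = nn.Dropout(p_drop)                  # between general inputs and 1st layer
        self.gen_first = nn.Sequential(nn.Linear(705, 352), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(352 + 32, 32), nn.ReLU(),
                                  nn.Linear(32, 16), nn.ReLU(),
                                  nn.Linear(16, 1))

    def forward(self, gen_x, pkm_x):
        first = self.gen_first(self.drop(gen_x))        # dropping general inputs pushes
        pkm = self.pkm(pkm_x)                           # signal into the PKM branch
        return self.head(torch.cat([first, pkm], dim=1))

net = EvalNet(p_drop=0.5)
# Phase 1: train some epochs with dropout active so the PKM branch captures
#          the pawn/king/material part of the eval.
# Phase 2: deactivate the dropout and keep training for more epochs.
net.drop.p = 0.0
```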

I was a bit disappointed that the speed-up from the hashing was less than I expected. I think it is partially from having so much stuff in the hash beyond just pawn structure (e.g. material, meaning a capture changes the hash) and partially because I didn't spend long enough to find the most optimized implementation.

I made some other smaller changes. I trained a Pawn-King endgame network, and started deleting some random parts of my regular eval (like the pawn race code, for example). I'm worried about losing some Elo just from deleting stuff though, so I've put it off, but I'm thinking I'll delete my entire hard-coded eval after I release the next version to concentrate on neural nets for everything. It's a bit sad to delete the evaluation I worked so hard on, especially the more original terms, but it's pretty clear the computer is just better at it.

I did some more endgame training and was finally able to reach the goal of eking out a win vs SF11 in an endgame book match of 16000 games.
I noticed some of the sure win endgame values in my nets are getting quite high, which leads to more simplifying to sure wins. This does feel more human-like to me, but now that so many engines are using neural nets I want to try out adjusting values / including mate distance in my target value to try to mix it up for longer when ahead, and maybe get some more aggressive finishes.

(* elo difference depends on openings; I've been testing mostly the 2-move book and recently Herz openings. The 8-move book has better variety but seems to increase random variation since only a random sampling of the large book is played. Slow does a bit worse in 8-move than 2-move. FRC is way worse; I hope to eventually train for FRC.)
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Some more experiments with neural nets

Post by jonkr »

Further testing is holding steady at +60 elo over Slow Chess 2.5 (in blitz, with any opening set I tried). It's time to start putting together a 2.6 release so I can risk breaking everything again.

Another interesting thing about my training I forgot to mention is that I'm not using nearly as many positions as NNUE nets (which makes sense to some extent given the more limited number of weights I'm training). At this point I have 15 million positions being used to train my general net. However, I also have about 15 million positions between the various endgame nets, so that's 30 million. Then I account for horizontal and vertical/side-to-move symmetry, so in a sense that's 120 million, perhaps not as small as it sounds.
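As a sketch of what that symmetry accounting can look like, assuming a hypothetical (12, 8, 8) piece-plane board tensor (my actual input format differs, and details like castling flags are glossed over):

```python
import numpy as np

# Hypothetical layout: axis 0 = 12 piece planes (white P,N,B,R,Q,K then black),
# axis 1 = ranks, axis 2 = files.

def mirror_files(board):
    """Horizontal mirror (a-file <-> h-file); the training target is unchanged."""
    return board[:, :, ::-1].copy()

def flip_ranks_and_colors(board):
    """Vertical flip plus color swap, i.e. the side-to-move symmetry;
    the side to move and the sign of the target flip accordingly."""
    flipped = board[:, ::-1, :].copy()
    return np.concatenate([flipped[6:], flipped[:6]], axis=0)

# Each stored position thus yields 4 samples: original, file-mirrored,
# rank/color-flipped, and both combined; this is the "4x" factor mentioned above.
```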

Training a new general net takes 10 minutes; training general + endgame takes 16 minutes. The general net I train for 26 epochs total (12, then 14). I've tried more and fewer, and whether the number of epochs matters seems unclear; I assume too few would be bad based on the higher error (as long as you have enough data not to horribly overfit). There is a balance between net size and number of positions where overfitting isn't too big a concern.

I have some more ideas that seem interesting, so hopefully I'll have something to update this thread with every few weeks.
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Some more experiments with neural nets

Post by jonkr »

I have been running some more tests and re-training nets, so it's time to give an update before I take a break.

I was curious about how bad 2.6 was at FRC compared to standard and how quickly it could get better, so I switched my openings to the FRC position book and have been running some training. The current net scored +80 elo at FRC against 2.6 @ 4000 games, which is quite impressive. There was more variation between FRC nets (I suppose the lack of FRC training leads to less consistency), so just choosing a net that performed well at FRC was part of the gain. 4000 standard games against 2.6 was +6 elo, so while the difference didn't extend to standard, it didn't hurt it either and was likely a small improvement.

Another experiment I ran was using King and Pawns endgame start positions and playing matches against Stockfish 13 to test/train that endgame net, and in the latest match Slow scored +0.3 elo @ 4000 games against SF13. That is the first time it has beaten SF12+ in any test, and it probably speaks more to the lack of importance of performance that deep into the endgame, but it was still a first milestone showing that maybe not all aspects of Stockfish are invincible. (Well, at mate-finding Slow has done better too in my tests, but given the selectivity of modern search it was still hit or miss, taking noticeably longer in some positions, and I don't know of any large neutral test suites for mate-finding.) Also I just played SlowDev @ 15s + .15s 2000 games against SF14 and was -335 elo, so Slow might even be just falling further behind.

Overall though, it looks likely that I'll be able to create a decent Slow 2.7 at some point when I return to working on chess even if I don't make any major changes.
Madeleine Birchfield
Posts: 512
Joined: Tue Sep 29, 2020 4:29 pm
Location: Dublin, Ireland
Full name: Madeleine Birchfield

Re: Some more experiments with neural nets

Post by Madeleine Birchfield »

jonkr wrote: Sat Jul 03, 2021 6:51 am Also I just played SlowDev @ 15s + .15s 2000 games against SF14 and was -335 elo, so Slow might even be just falling further behind.
It is difficult to compete against a dedicated group of programmers and network trainers that have huge amounts of hardware and time collectively.
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Some more experiments with neural nets

Post by jonkr »

I've still been doing some work on chess. Well, more so my computer than me, but I have run more experiments. After the quick improvement with the King-Pawn-Material neural hash mentioned in the first post, though, nothing has seemed as interesting, so I haven't felt compelled to update regularly.

Improvements
  • More training attempts with the KPM structure got a stronger net within a few days, about +8 elo. It probably became even stronger as I generated more data.
  • Fix for the KPM hash scheme. I use a 32-bit perfect material hash: 16 bits white, 16 bits black. I was then using the 15 least significant bits to find the material hash entry, ignoring white material entirely, oops! So the hashing speed-up is closer to what I hoped now, though the KPM hash still covers a lot that can change (see the sketch after this list).
  • I changed the net size of the first and second layer. It's now Inputs x 512 (480 + 32 KPM hash) x 16 x 16 x 1. This makes for slightly more weights, but also pushes more weights into the incremental first layer, making it a bit faster. This was about +5 elo. I had avoided trying this out of stubbornness, since the tweak felt like moving in an NNUE-influenced direction. But eventually I decided it was nice that I could just change a constant, click generate net, and be testing something +5 elo 15 minutes later.
  • I added a Q vs. 1 or 2 lesser pieces (R/B/N) endgame net. This was +1 or +2 elo, and another +1 or +2 elo from the general net not needing to handle this endgame. I feel like the number of nets is getting pretty messy and not optimally arranged, but it is something that my tuner/tester makes easy, so it's hard not to do it.
  • Overall nps is > 10% faster in early game, and also somewhat faster late game.
  • I played with the search constants, mostly making the search slightly more selective now that the eval is better. I can't measure it accurately, but I think it was around +4 elo.
  • I generated a new opening book. The 2.6 book has a larger number of lines, including anything playable at the 2700-3000 elo level. For the new book I tried playing stronger opponents, putting it all in one pgn, and choosing book positions based on winning percentage (minimum 48% for white, 36% for black). For the specific conditions the book was generated in, biasing move choice towards lines that won was a huge plus (maybe +30 elo; I ran the book generation for 5 nights, I think). Switching up the conditions (opponent book, opponent program, time control) made the gain less clear, and I decided I don't care about book generation enough to try to figure out how to make a competitive one.
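Regarding the hash-indexing fix above, an illustrative sketch of the bug and one possible repair (the key layout and table size here are assumptions, not the actual Slow Chess code):

```python
TABLE_BITS = 15
TABLE_MASK = (1 << TABLE_BITS) - 1          # 15-bit index into the material hash table

def material_key(white16: int, black16: int) -> int:
    # 32-bit perfect material hash: high 16 bits white, low 16 bits black (assumed layout)
    return (white16 << 16) | black16

def buggy_index(key: int) -> int:
    # Original scheme: only the 15 least significant bits, so white material is ignored
    return key & TABLE_MASK

def fixed_index(key: int) -> int:
    # One simple fix: fold the high (white) half into the low half before masking
    return (key ^ (key >> 16)) & TABLE_MASK
```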
Unfinished Experiments:
  • I tried removing the hard-coded eval entirely. This was very nice from a coder's perspective and made the code look much cleaner. Unfortunately there were small issues: 1. it tested -5 elo compared to having the HCE; 2. it negatively affected mate-finding and quickness in finishing games.
    The net was poor at scoring lopsided positions, preferring ones that occurred more often in the data to ones that were more lopsided. So I ended up adding back the HCE just for the general net. (Endgame HCE didn't test as useful at all, probably because I have endgame nets.)
  • To encourage ending games sooner, I tried scoring wins not as 1, but as 0.95 to 1 depending on the distance to the finish (see the sketch after this list). I can't say if this was successful. My data wasn't standard, with most of it having early adjudication, so it's sort of a mess. I think the idea is sound but it might not make a big difference. Testing showed -2 elo from this change, but I left it in, in case I ever want to try training up from new data later.
  • I tried running training games for 6 hours to train a mating net (including the distance-to-finish change), and it seemed as good as my hard-coded mating eval, but without TBs it couldn't mate with Bishop-Knight, so I left it to try again later.
  • Testing in general can get frustrating since it takes me so long to run games. I considered, then decided against, getting a dedicated computer for chess (less to make testing faster than because I might be able to start tests and ignore them better if they didn't use my main computer). There is definitely a difference in nets where some do better against some opponents or with certain openings, but the variation in testing is also large when I'm only running 4000 games at a time and re-testing if needed.
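A minimal sketch of the win-target scaling idea from the list above; the 0.95-1.0 range is from the post, but the linear mapping and the horizon constant are made up for illustration:

```python
def win_target(plies_to_game_end: int, horizon: int = 100) -> float:
    """Won positions train toward a value in [0.95, 1.0]: the closer a position
    is to the actual end of the game, the closer the target is to 1.0."""
    closeness = max(0.0, 1.0 - plies_to_game_end / horizon)
    return 0.95 + 0.05 * closeness

# Example: 10 plies from the finish -> ~0.995, 100+ plies away -> 0.95,
# nudging the net to prefer lines that end the game sooner.
```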
Overall, for standard chess I would guess the gain is about +35 elo. With all the changes listed in the improvements, I was able to play against Stockfish 10 on my old computer using Winboard, and Slow 2.7 won the 2000-game match. (Including book; this is the only time I ever test with the Slow book instead of both sides of starting positions.) This is significant to me because when I started getting into computer chess again I looked for current programs and found SF10, and when I tried it out it just felt like unbeatable alien technology compared to 2007 programs. So it's nice to finally get "revenge" for the years of trouncing older Slow versions. (I actually rarely played SF10, it was just too strong; I preferred, and still prefer, choosing a few programs slightly stronger than SlowDev, then improving to beat them.)

FRC improvement was way beyond my expectations. Apparently dealing with the severely lacking FRC strength was a matter of a better net structure and, most of all, actually doing some FRC training games. I never noticeably lost any standard Elo from adding FRC games, so I probably should have started training on FRC sooner. The gain in FRC strength didn't obviously translate to a gain in standard strength either, though.
My latest test was +118 elo stronger than Slow 2.6 in FRC! Probably partially luck, but this significant improvement is the main reason I am planning on releasing a new version as soon as I can do a few more tests.
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Some more experiments with neural nets

Post by Sopel »

Could you describe the KPM feature set in more detail? What are the inputs exactly, what tuples?

Also, you state your KPM FT output is only 32-wide, have you tried wider ones, or ones that are summed with the usual features completely?
jonkr wrote: I tried removing the hard-coded eval entirely. This was very nice from a coder's perspective and made the code look much cleaner. Unfortunately there were small issues: 1. it tested -5 elo compared to having the HCE; 2. it negatively affected mate-finding and quickness in finishing games.
Indeed, the NNUE eval seems to have a very conservative evaluation in near-mate positions (especially in the middle game). This might be partially caused by the lack of such positions in the training? At least in our case such games are adjudicated quite early.
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Some more experiments with neural nets

Post by jonkr »

Sopel wrote: Mon Aug 23, 2021 12:32 am Could you describe the KPM feature set in more detail? What are the inputs exactly, what tuples?

Also, you state your KPM FT output is only 32-wide, have you tried wider ones, or ones that are summed with the usual features completely?
For the KPM net I'm still using inputs of 0 or 1 only, for no good reason other than that I never restored the ability to use floating-point inputs in my trainer. I have been considering trying some incremental updates on the KPM net to see if it helps speed a bit, but I haven't tried it.

The specific inputs are: 48 x 2 colors of pawn-square inputs, 64 + 32 king-square inputs (my inputs always use horizontal symmetry by the white king's side, so only 32 inputs for the white king), 2 x 2 castling-flag inputs, and 8 x 2 material inputs. The material inputs are 2 for knights, 2 for rooks, 2 for queens (by count of 1 or 2), plus 2 for bishops, with the first input for a light-square bishop and the second for a dark-square bishop.
So the KPM net is 212 x 256 x 32. (Originally the inputs didn't include the 4 castle-flag inputs, so the 208 from my first post should be 212; I forgot to update it.)
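For illustration, a sketch of how those 212 binary inputs could be laid out. The ordering, the whole-board mirroring, and how mirroring interacts with castling flags and bishop square colors are all simplifications here, not the actual Slow Chess encoding:

```python
import numpy as np

def encode_kpm(wp, bp, wk, bk, castle, count):
    """wp/bp: white/black pawn squares (0=a1 .. 63=h8), wk/bk: king squares,
    castle: 4 flags, count: piece counts keyed like 'wN', 'wR', 'wQ',
    'wB_light', 'wB_dark' (and the same with 'b')."""
    mirror = (wk % 8) >= 4                       # keep the white king on files a-d
    m = (lambda s: s ^ 7) if mirror else (lambda s: s)
    v = np.zeros(212, dtype=np.float32)
    for sq in wp: v[m(sq) - 8] = 1.0             # 48 white pawn inputs (ranks 2-7)
    for sq in bp: v[48 + m(sq) - 8] = 1.0        # 48 black pawn inputs
    k = m(wk)
    v[96 + (k // 8) * 4 + (k % 8)] = 1.0         # 32 white king inputs (half board)
    v[128 + m(bk)] = 1.0                         # 64 black king inputs
    for i, f in enumerate(castle):
        v[192 + i] = float(f)                    # 4 castling-flag inputs
    for c, col in enumerate('wb'):               # 8 material inputs per color
        base = 196 + 8 * c
        v[base + 0] = float(count[col + 'N'] >= 1); v[base + 1] = float(count[col + 'N'] >= 2)
        v[base + 2] = float(count[col + 'B_light'] >= 1)
        v[base + 3] = float(count[col + 'B_dark'] >= 1)
        v[base + 4] = float(count[col + 'R'] >= 1); v[base + 5] = float(count[col + 'R'] >= 2)
        v[base + 6] = float(count[col + 'Q'] >= 1); v[base + 7] = float(count[col + 'Q'] >= 2)
    return v
```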

I tried 16 outputs for the KPM net a couple of times (with 496 first-layer values from the general net), but they tested worse. I should try 64 outputs sometime to see if that helps. I also didn't experiment much with the inputs; hopefully if I do I can find something slightly better.

I'm not sure what you mean by ones that are summed with the usual features completely. Like using the KPM output in the input layer instead of the first hidden layer? I figured that would be tougher to set up without being slow.
Or do you mean trying out King-square x Piece-square inputs? (I never tried the usual NNUE, just pure piece-on-square inputs, with multiple nets where the active net is chosen based on material, then eventually the addition of the hashed KPM net part to the general net.)
Sopel wrote: Indeed, the NNUE eval seems to have a very conservative evaluation in near-mate positions (especially in the middle game). This might be partially caused by the lack of such positions in the training? At least in our case such games are adjudicated quite early.
Recently I have been running games to completion (and I take values from anywhere up to and including the mate position), but I still have more adjudicated games than games played to mate, so it's not a good test.
I know my training set is too small to have decent coverage of all the various lopsided positions, so I was mainly hoping it might generalize a bit better than it did. I suppose a few elo worse and converting to slower endgame wins more often isn't that bad; I just don't want to make a change that makes things worse.
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Some more experiments with neural nets

Post by Sopel »

jonkr wrote: Mon Aug 23, 2021 4:20 am
Sopel wrote: Mon Aug 23, 2021 12:32 am Could you describe the KPM feature set in more detail? What are the inputs exactly, what tuples?

Also, you state your KPM FT output is only 32-wide, have you tried wider ones, or ones that are summed with the usual features completely?
For the KPM net I'm still using inputs of 0 or 1 only, for no good reason other than that I never restored the ability to use floating-point inputs in my trainer. I have been considering trying some incremental updates on the KPM net to see if it helps speed a bit, but I haven't tried it.

The specific inputs are: 48 x 2 colors of pawn-square inputs, 64 + 32 king-square inputs (my inputs always use horizontal symmetry by the white king's side, so only 32 inputs for the white king), 2 x 2 castling-flag inputs, and 8 x 2 material inputs. The material inputs are 2 for knights, 2 for rooks, 2 for queens (by count of 1 or 2), plus 2 for bishops, with the first input for a light-square bishop and the second for a dark-square bishop.
So the KPM net is 212 x 256 x 32. (Originally the inputs didn't include the 4 castle-flag inputs, so the 208 from my first post should be 212; I forgot to update it.)
Thanks, it's clear to me now. I also see why the concatenated bit is smaller. It's interesting to me that this provides positive elo, as the feature set is very simple. Is your base feature set HalfKP or just P or something else?

By "that are summed with the usual features completely" I meant an output that is summed with the normal FT output, instead of being concatenated with it - which would also mean they are both the same size.

I want to try something like this in the trainer and look for an improvement in loss; now I know what to implement at least :P
jonkr
Posts: 178
Joined: Wed Nov 13, 2019 1:36 am
Full name: Jonathan Kreuzer

Re: Some more experiments with neural nets

Post by jonkr »

Sopel wrote: Mon Aug 23, 2021 11:41 am Is your base feature set HalfKP or just P or something else?
My base feature set I suppose would be just P, since it's 1 input per piece type on a square. The lack of "KP" is probably one of the big reasons that the King-Pawn-Material neural hash was so clearly effective for me.

When I started building my training/inference code I wanted to start with something simple, which I then tested out on various games. When I started trying more chess-specific stuff, standard NNUE seemed already well explored, so experimenting with different ideas was more fun to me. (Also, it takes me a while to generate positions, so I wanted to make sure I didn't need that many training positions.)
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Some more experiments with neural nets

Post by connor_mcmonigle »

jonkr wrote: Sun Aug 22, 2021 8:46 am ...
[*] I changed the net size of the first and second layer. It's now Inputs x 512 (480 + 32 KPM hash) x 16 x 16 x 1. This makes for slightly more weights, but
...
I've recently been investigating using a pawn cache plus a pawn-specific sub-network in Seer. I plan to take a largely similar approach, but I don't see why it's necessary to concatenate the 32-wide KPM hash vector with the 480-wide base encoding vector. Wouldn't it be equivalent and faster to just store the 16D vector resulting from multiplying the KPM-hash-specific weights from what is now your primary network's L2 layer (a 16x32 weight matrix) with the vector you're currently treating as the 32-wide KPM hash vector? To use the new 16D KPM hash vector, you would just add it to the L2 output (before applying the activation function).

A x a + B x b = C x concatenate(a, b)
where A = C[:, :size(a)] and B = C[:, size(a):] (i.e. C split column-wise, row-major storage)

Hopefully my description is clear. This should be equivalent, enable for more KPM cache entries (improve the hit rate) and also be somewhat faster given a KPM cache hit as you only have to perform a 480x16 matrix vector product.