Tactics in training data

niel5946 · Post by **niel5946** » Sat Jun 12, 2021 12:10 pm

Hi everyone.

I am currently trying to train a network for Loki that is actually able to play chess. One problem I have, though, is that now that I use search results instead of static evaluations for the training points, the networks play completely idiotic moves (walking into mate, giving the queen away for free, etc.). They are extremely bad compared to the ones trained on static evaluations, and I don't really know why.
Right now, I am trying to train a net with quiescence search scores, but this doesn't look too good either.

One reason could be that the search results are heavily influenced by tactics which the nets aren't supposed to know anyway. Therefore, my first question is: Should I do anything to resolve the captures before getting a search score?

My second question concerns how to actually resolve the captures. Is that just done by using quiescence search to get a quiet position and then searching that to get a data point?

Thanks in advance

PS. My network's architecture is 768 input neurons (12 piece types * 64 squares), then three hidden layers with 256, 32 and 32 neurons in that order, and a single output neuron.

jonkr · Post by **jonkr** » Sun Jun 13, 2021 12:01 am

It was helpful for me to remove positions with immediate tactics, I did this by just running the Qsearch and saving the position at the end of the PV. (I think actually I'm using depth 1 search now which includes Qsearch.) Then this was the position that would be scored. I used a combination of game result and search score. Using quiet positions measured a clear gain over not doing it, but wasn't a huge difference.
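
A minimal sketch of "run the Qsearch and save the position at the end of the PV", on a toy capture tree rather than a real board. This is not jonkr's code: `Node`, `QResult` and `qsearch` are hypothetical names, and the tree stands in for an engine's move generator and evaluation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy position: a static evaluation from the side to move's point of view
// and the positions reachable by captures (indices into the tree).
struct Node {
  int static_eval;
  std::vector<std::size_t> captures;
};

struct QResult {
  int score;         // relative to the side to move at the node searched
  std::size_t leaf;  // index of the quiet position at the end of the PV
};

// Fail-soft negamax quiescence search that also tracks the PV leaf.
QResult qsearch(const std::vector<Node>& tree, std::size_t node, int alpha, int beta) {
  assert(node < tree.size());
  const int stand_pat = tree[node].static_eval;
  QResult best{stand_pat, node};  // standing pat: the node itself is the leaf
  if (stand_pat >= beta) { return best; }
  if (stand_pat > alpha) { alpha = stand_pat; }

  for (std::size_t child : tree[node].captures) {
    const QResult reply = qsearch(tree, child, -beta, -alpha);
    const int score = -reply.score;  // negamax: flip the child's score
    if (score > best.score) { best = QResult{score, reply.leaf}; }
    if (score > alpha) { alpha = score; }
    if (alpha >= beta) { break; }
  }
  return best;
}
```

The returned score is relative to the side to move at the root, while the saved leaf may have the other side to move, so the label has to be converted before it is written out.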

Is your network giving away a queen even after searching? If so that sounds like there's a bug somewhere in your code. Either in the training process, the training data itself, or the network value calculation. (If it prefers a move leaving a queen hanging without searching I wouldn't worry, your network isn't big enough to detect tactics well.)

I started with a similar network and it can be pretty strong. Then added some more tweaks (splitting into multiple sub-nets, accounting for horizontal & vertical/stm symmetries, network value calculation optimizations, etc.) Also as I generated more training positions, I was able to find extra strength with increases in network size.

niel5946 · Post by **niel5946** » Sun Jun 13, 2021 7:01 am

jonkr wrote: ↑Sun Jun 13, 2021 12:01 am It was helpful for me to remove positions with immediate tactics, I did this by just running the Qsearch and saving the position at the end of the PV. (I think actually I'm using depth 1 search now which includes Qsearch.) Then this was the position that would be scored. I used a combination of game result and search score. Using quiet positions measured a clear gain over not doing it, but wasn't a huge difference.

I tried to do the same yesterday. I made the PV-stack extend all the way down into qsearch and then after searching, I would save the position's score, go down the PV and save the resulting position. I believe that is what you mean too? Or do you also search the quiet position at the end of qsearch again?

jonkr wrote: ↑Sun Jun 13, 2021 12:01 am Is your network giving away a queen even after searching? If so that sounds like there's a bug somewhere in your code. Either in the training process, the training data itself, or the network value calculation. (If it prefers a move leaving a queen hanging without searching I wouldn't worry, your network isn't big enough to detect tactics well.)

Yes, it usually just hangs a piece for one of the sides (sometimes it doesn't even see the recapture) in what seems like a rather random way. I don't think it is a problem with either 1) my network forward propagation or 2) my training process. The reasons are that 1) when I train a network with pure static evaluations, it does not hang pieces in the same manner (it still does, but that is because it doesn't search as deep as the HCE version), and 2) if I use a really small dataset (10-100 positions), it quite easily overfits, so the training algorithm seems to do what it's meant to.
For the input to the first layer, I use incremental updates in make/unmake move, which is another possible source of the bug. However, I think it's unlikely, since I tested the incremental update for do and undo move while doing perft and didn't catch any bugs.
Therefore, I think the problem is in the training data itself... I am using ~72M positions from the lichess database that I search to depth 2 (very low depth, but it should give at least decent strength compared to HCE), and as I said, I (now) use the position at the end of the PV.

BTW. The way I represent the board for the input is, as mentioned, 12 pieces * 64 squares. It goes: WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK, with a 1 for a piece present and 0 for no piece. I don't swap these depending on the side to move because I thought it would be good enough to get a white-relative score for each evaluation and then just inverting it in case it was black's move. Does this sound okay?
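
The input layout described above can be sketched as follows. This is an illustration, not Loki's code: the square numbering (a1 = 0, h8 = 63) and the names `Piece`, `feature_index`, `encode` and `stm_relative` are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Piece planes in the order given in the post: WP, WN, ..., BK.
enum Piece : std::size_t { WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK, PIECE_NB };

constexpr std::size_t kSquares = 64;
constexpr std::size_t kInputs = PIECE_NB * kSquares;  // 12 * 64 = 768

// Index of the input neuron that is 1 when `piece` stands on `square`.
// Assumption: squares are numbered 0..63 with a1 = 0 and h8 = 63.
constexpr std::size_t feature_index(Piece piece, std::size_t square) {
  return static_cast<std::size_t>(piece) * kSquares + square;
}

struct PlacedPiece {
  Piece piece;
  std::size_t square;
};

// Build the 768-entry 0/1 input vector from a list of placed pieces.
template <std::size_t N>
std::array<float, kInputs> encode(const std::array<PlacedPiece, N>& pieces) {
  std::array<float, kInputs> input{};  // all zeros: no piece anywhere
  for (const auto& p : pieces) {
    assert(p.square < kSquares);
    input[feature_index(p.piece, p.square)] = 1.0f;
  }
  return input;
}

// The network score is white-relative; negate it when black is to move so
// that a negamax search receives a side-to-move-relative value.
constexpr int stm_relative(int white_relative_score, bool white_to_move) {
  return white_to_move ? white_relative_score : -white_relative_score;
}
```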

connor_mcmonigle · Post by **connor_mcmonigle** » Sun Jun 13, 2021 7:37 am

niel5946 wrote: ↑Sun Jun 13, 2021 7:01 am
jonkr wrote: ↑Sun Jun 13, 2021 12:01 am It was helpful for me to remove positions with immediate tactics, I did this by just running the Qsearch and saving the position at the end of the PV. (I think actually I'm using depth 1 search now which includes Qsearch.) Then this was the position that would be scored. I used a combination of game result and search score. Using quiet positions measured a clear gain over not doing it, but wasn't a huge difference.
I tried to do the same yesterday. I made the PV-stack extend all the way down into qsearch and then after searching, I would save the position's score, go down the PV and save the resulting position. I believe that is what you mean too? Or do you also search the quiet position at the end of qsearch again?
jonkr wrote: ↑Sun Jun 13, 2021 12:01 am Is your network giving away a queen even after searching? If so that sounds like there's a bug somewhere in your code. Either in the training process, the training data itself, or the network value calculation. (If it prefers a move leaving a queen hanging without searching I wouldn't worry, your network isn't big enough to detect tactics well.)
Yes, it usually just hangs a piece for one of the sides (sometimes it doesn't even see the recapture) in what seems like a rather random way. I don't think it is a problem with either 1) my network forward propagation or 2) my training process. The reasons are that 1) when I train a network with pure static evaluations, it does not hang pieces in the same manner (it still does, but that is because it doesn't search as deep as the HCE version), and 2) if I use a really small dataset (10-100 positions), it quite easily overfits, so the training algorithm seems to do what it's meant to.
For the input to the first layer, I use incremental updates in make/unmake move, which is another possible source of the bug. However, I think it's unlikely, since I tested the incremental update for do and undo move while doing perft and didn't catch any bugs.
Therefore, I think the problem is in the training data itself... I am using ~72M positions from the lichess database that I search to depth 2 (very low depth, but it should give at least decent strength compared to HCE), and as I said, I (now) use the position at the end of the PV.

BTW. The way I represent the board for the input is, as mentioned, 12 pieces * 64 squares. It goes: WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK, with a 1 for a piece present and 0 for no piece. I don't swap these depending on the side to move because I thought it would be good enough to get a white-relative score for each evaluation and then just inverting it in case it was black's move. Does this sound okay?

A simple solution to your original problem which works reasonably well is just to filter out self-play positions where a tactical move was played. This is far simpler than playing out qsearch PVs on self-play positions and was actually superior in my limited testing.
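
A minimal sketch of such a filter. The post does not define "tactical", so treating captures, promotions and checks as tactical is an assumption here, and the `Sample` record and its field names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// One self-play sample. All field names are hypothetical.
struct Sample {
  std::string fen;     // position before the move was played
  int score;           // search score for that position
  bool was_capture;    // the move played from this position captured something
  bool was_promotion;  // ... or promoted a pawn
  bool gave_check;     // ... or gave check
};

// Assumption: "tactical" means capture, promotion or check.
bool played_tactical_move(const Sample& s) {
  return s.was_capture || s.was_promotion || s.gave_check;
}

// Keep only the samples where a quiet move was played.
std::vector<Sample> keep_quiet(const std::vector<Sample>& samples) {
  std::vector<Sample> quiet;
  std::copy_if(samples.begin(), samples.end(), std::back_inserter(quiet),
               [](const Sample& s) { return !played_tactical_move(s); });
  return quiet;
}
```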

For incremental undo to work, you must have quantized weights. Otherwise, due to floating point rounding, the values will diverge which could be the source of the issues you're experiencing. To account for this, in Seer, I make a copy of the incrementally updated encoding vector effectively on every make move (though, after pruning). It's not elegant, but it works reasonably.
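
For readers unfamiliar with quantization, a generic fixed-point scheme is sketched below. It is not taken from Seer: the scale factor `kScale` and the function names are assumptions. The point is that once weights are integers, adding and later subtracting the same weight restores the accumulator exactly.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Assumed fixed-point scale: one float unit corresponds to 64 integer steps.
constexpr float kScale = 64.0f;

// Scale, round to nearest and clamp to the int16 range (a plain cast would
// truncate small weights to 0 and overflow on large ones).
inline std::int16_t quantize(float w) {
  const float scaled = std::round(w * kScale);
  const float lo = std::numeric_limits<std::int16_t>::min();
  const float hi = std::numeric_limits<std::int16_t>::max();
  return static_cast<std::int16_t>(std::fmin(std::fmax(scaled, lo), hi));
}

// Convert an integer accumulator value back to the float scale.
inline float dequantize(std::int32_t q) {
  return static_cast<float>(q) / kScale;
}
```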

I would suggest board mirroring such that the position the network "sees" is always white to move and training with relative scores as targets. This will enable the network to develop a more positionally dependent understanding of tempo. You might notice that this will amount to maintaining two copies of the incrementally updated encoding vectors which might seem wasteful. In fact, this wastefulness is in large part the motivation for HalfKP and other feature sets incorporating side (half) specific information such as king location. With such feature sets, it is beneficial to feed both encodings (mirrored and unmirrored) to the network.
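
The mirroring idea can be sketched on top of a 12 x 64 input layout. This is an illustration under assumed conventions (planes 0..5 are the white pieces, planes 6..11 the black ones, squares 0..63 with a1 = 0), not Seer's feature code.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSquares = 64;
constexpr std::size_t kPlanesPerSide = 6;

// Mirror a square vertically: a1 <-> a8, e1 <-> e8, and so on.
constexpr std::size_t flip_rank(std::size_t square) { return square ^ 56; }

// Swap piece colour: white planes 0..5 <-> black planes 6..11.
constexpr std::size_t swap_colour(std::size_t plane) {
  return (plane + kPlanesPerSide) % (2 * kPlanesPerSide);
}

// Input index as seen by the network, which always "sees" white to move.
// With black to move, the board is flipped and the colours are swapped.
inline std::size_t feature_index(std::size_t plane, std::size_t square, bool white_to_move) {
  assert(plane < 2 * kPlanesPerSide && square < kSquares);
  if (!white_to_move) {
    plane = swap_colour(plane);
    square = flip_rank(square);
  }
  return plane * kSquares + square;
}
```

With this mapping, the training target must be relative to the side to move as well, since the network can no longer tell which colour it is evaluating for.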

jonkr · Post by **jonkr** » Sun Jun 13, 2021 8:51 pm

niel5946 wrote: ↑Sun Jun 13, 2021 7:01 am I tried to do the same yesterday. I made the PV-stack extend all the way down into qsearch and then after searching, I would save the position's score, go down the PV and save the resulting position. I believe that is what you mean too? Or do you also search the quiet position at the end of qsearch again?

I search the quiet position again. Then blend search score with the game result. (I started with just using game results but the blend definitely helped train faster.)
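
One common way to blend the two signals is sketched below. The post does not give jonkr's formula, so the sigmoid mapping, the `scale` constant and the mixing weight `lambda` are all assumptions.

```cpp
#include <cassert>
#include <cmath>

// Map a centipawn score to an expected game result in (0, 1).
// `scale` is an assumed constant that controls how fast the curve saturates.
inline double expected_result(double centipawns, double scale = 400.0) {
  return 1.0 / (1.0 + std::exp(-centipawns / scale));
}

// Blend search score and game result into one training target.
// game_result: 1.0 win, 0.5 draw, 0.0 loss, from the same point of view as
// the score. lambda = 1 uses only the search score, lambda = 0 only the result.
inline double blended_target(double centipawns, double game_result,
                             double lambda = 0.5, double scale = 400.0) {
  assert(lambda >= 0.0 && lambda <= 1.0);
  return lambda * expected_result(centipawns, scale) + (1.0 - lambda) * game_result;
}
```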

I wouldn't worry about minor details right now since it sounds like something is not working. There is a lot of stuff that can be messed up; here's a partial list of some of the problems I had:

- On one of my early tests I was accidentally only writing White piece inputs for training. (I started generating positions from scratch, so in the beginning the error was almost 0 despite this, but as I built more and more positions, it played horribly and the error increased quickly, so it became obvious something was wrong.)
- Once I wasn't writing proper target scores, with similar play issues. (One of the issues I remember is that the value was flipped wrong since STM changed after PV playout.)
- When I first changed to int16 values, I had some overflow issues that messed up play completely.
- I had occasional incremental update issues after I added that.
- Once, when I was trying training from zero, it kind of broke my qsearch, since it didn't know piece values and would keep reinforcing some bad values. But I'm assuming you're starting scoring with some HCE.

Also, your inputs sound fine to me.
For STM I have a single 1/0 input, and am double exporting positions with flipped board / colors / targetValue, but I don't think the specifics of that matter much. In my data I'm only taking 5 positions from each game for better variety, but that isn't a big deal and the best choice depends on many other factors.

You can probably start with fewer positions too while working out the process, but it does depend on your training data quality.
I only do Depth 6 search so I wouldn't worry about searching too deep. Depth 4 was about as good.

niel5946 · Post by **niel5946** » Wed Jun 16, 2021 11:01 pm

connor_mcmonigle wrote: ↑Sun Jun 13, 2021 7:37 am For incremental undo to work, you must have quantized weights. Otherwise, due to floating point rounding, the values will diverge which could be the source of the issues you're experiencing. To account for this, in Seer, I make a copy of the incrementally updated encoding vector effectively on every make move (though, after pruning). It's not elegant, but it works reasonably.

I have just tried looking through Seer's NNUE code, but I am afraid it is above my level of programming.. I don't really understand why I can't use incremental updates without quantization though. I have checked with the debugger while searching that the values look correct. By look I mean that none of the neurons seem to saturate at extremely high or extremely low values.
Additionally, if the incremental update were the cause of the error, this would also be present when playing with the HCE trained network.

connor_mcmonigle wrote: ↑Sun Jun 13, 2021 7:37 am I would suggest board mirroring such that the position the network "sees" is always white to move and training with relative scores as targets. This will enable the network to develop a more positionally dependent understanding of tempo. You might notice that this will amount to maintaining two copies of the incrementally updated encoding vectors which might seem wasteful. In fact, this wastefulness is in large part the motivation for HalfKP and other feature sets incorporating side (half) specific information such as king location. With such feature sets, it is beneficial to feed both encodings (mirrored and unmirrored) to the network.

That seems like an improvement to my current input board representation. Despite this, I think I'll stick to the current architecture until I have trained a decent network. That should certainly be possible.

jonkr wrote: ↑Sun Jun 13, 2021 8:51 pm I search the quiet position again. Then blend search score with the game result. (I started with just using game results but the blend definitely helped train faster.)

I have also thought about using game results, but since I am only using EPD files for now it isn't really possible. When I get around to generating data with self-play, I will look into it though.

jonkr wrote: ↑Sun Jun 13, 2021 8:51 pm I wouldn't worry about minor details right now since it sounds like something is not working. There is a lot of stuff that can be messed up; here's a partial list of some of the problems I had:

- On one of my early tests I was accidentally only writing White piece inputs for training. (I started generating positions from scratch, so in the beginning the error was almost 0 despite this, but as I built more and more positions, it played horribly and the error increased quickly, so it became obvious something was wrong.)
- Once I wasn't writing proper target scores, with similar play issues. (One of the issues I remember is that the value was flipped wrong since STM changed after PV playout.)
- When I first changed to int16 values, I had some overflow issues that messed up play completely.
- I had occasional incremental update issues after I added that.
- Once, when I was trying training from zero, it kind of broke my qsearch, since it didn't know piece values and would keep reinforcing some bad values. But I'm assuming you're starting scoring with some HCE.

I am pretty sure my input setup works, but I will write a test method for sanity's sake now.
I just noticed that I had the same problem with the score's sign. Before, I flipped the score depending on the STM in the root and not in the leaf. I tried to train the net again with this fixed, but it still doesn't seem to work...
How did you change to int16? Did you just cast the floating point values to integers? I am asking because I have no clue about how quantization works yet. I have tried reading some articles about the subject, but it doesn't really make sense to me yet.
Yes, I use the classical evaluation for search scoring.
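
The sign issue mentioned above (flipping by the side to move in the root instead of in the leaf) can be sketched like this. The function names are hypothetical, and the sketch assumes a negamax search whose score is relative to the side to move at the root.

```cpp
#include <cassert>
#include <cstddef>

enum class Colour { White, Black };

constexpr Colour opposite(Colour c) {
  return c == Colour::White ? Colour::Black : Colour::White;
}

// Side to move in the saved position after playing `pv_length` plies of the PV.
constexpr Colour leaf_side_to_move(Colour root_stm, std::size_t pv_length) {
  return (pv_length % 2 == 0) ? root_stm : opposite(root_stm);
}

// Root score (relative to the root's side to move) as a white-relative label.
constexpr int white_relative(int root_score, Colour root_stm) {
  return root_stm == Colour::White ? root_score : -root_score;
}

// The same score relative to the side to move in the saved leaf position.
constexpr int leaf_relative(int root_score, Colour root_stm, std::size_t pv_length) {
  return leaf_side_to_move(root_stm, pv_length) == Colour::White
             ? white_relative(root_score, root_stm)
             : -white_relative(root_score, root_stm);
}
```

For a white-relative network only `white_relative` is needed, and it depends on the root's side to move because that is whose score the search returned. The leaf's side to move matters only if the label is stored relative to the side to move in the saved position.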

jonkr wrote: ↑Sun Jun 13, 2021 8:51 pm Also, your inputs sound fine to me.
For STM I have a single 1/0 input, and am double exporting positions with flipped board / colors / targetValue, but I don't think the specifics of that matter much. In my data I'm only taking 5 positions from each game for better variety, but that isn't a big deal and the best choice depends on many other factors.

You can probably start with fewer positions too while working out the process, but it does depend on your training data quality.
I only do Depth 6 search so I wouldn't worry about searching too deep. Depth 4 was about as good.

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic, I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.
Right now, I have switched back to using the 3.2M dataset that I used with the HCE one. This is in the hopes of removing training data quality as a source of error. Regarding depth, I am only using 2 plies ATM since the main goal is to just replicate the HCE's quality for now.

I am pretty confident that the problem lies in the training data or scoring of said data. Otherwise, the plain HCE score training would be equally bad.
One more problem: hyperparameters. I am afraid that I'm also at a loss here. Until now, I have used a batch size of 50k, a learning rate of 0.001 and relatively few epochs (< 10). What values do you use? I have implemented learning rate decay, but I don't use it at the moment.

connor_mcmonigle · Post by **connor_mcmonigle** » Thu Jun 17, 2021 1:51 am

niel5946 wrote: ↑Wed Jun 16, 2021 11:01 pm
connor_mcmonigle wrote: ↑Sun Jun 13, 2021 7:37 am For incremental undo to work, you must have quantized weights. Otherwise, due to floating point rounding, the values will diverge which could be the source of the issues you're experiencing. To account for this, in Seer, I make a copy of the incrementally updated encoding vector effectively on every make move (though, after pruning). It's not elegant, but it works reasonably.
I have just tried looking through Seer's NNUE code, but I am afraid it is above my level of programming.. I don't really understand why I can't use incremental updates without quantization though. I have checked with the debugger while searching that the values look correct. By look I mean that none of the neurons seem to saturate at extremely high or extremely low values.
Additionally, if the incremental update were the cause of the error, this would also be present when playing with the HCE trained network.

...

I was referring to the errors associated with floating point arithmetic. In general, a + b - b = a does not hold for a floating point representation of real numbers. Therefore, error can grow with the millions of incremental updates performed in a search. Here's a quick demo:

```cpp
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <random>

constexpr std::size_t branching_factor = 2;

// Walks a tree of the given depth. Each "make move" adds a random value to x
// and the matching "unmake move" subtracts the same value again.
template<typename Distribution, typename Generator>
void demo(float& x, Distribution& dist, Generator& gen, std::size_t depth=25) {
  if (depth == 0) { return; }
  for (std::size_t i(0); i < branching_factor; ++i){
    const float val = dist(gen);
    x += val;                     // incremental update
    demo(x, dist, gen, depth-1);
    x -= val;                     // incremental undo
  }
}

int main(){
  float x{2.0f};
  auto gen = std::mt19937(std::random_device()());
  auto dist = std::normal_distribution<float>(0.0, 1.0);
  // With exact arithmetic both lines would print 2.
  std::cout << std::setprecision(10) << x << std::endl;
  demo(x, dist, gen);
  std::cout << x << std::endl;
}
```

As the errors aren't correlated in this example, the error isn't too extreme, but it is still undesirable in the context of a chess engine. However, as your HCE training experiment performed reasonably, this is unlikely to be the source of your issues.

I'd recommend checking your labels to ensure they're reasonable. Even if training on tactical positions, the evaluations should be mostly reasonable.
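
For contrast, here is an integer counterpart of the demo, sketching why quantized weights make incremental undo exact. The value range, the fixed seed and the helper name `run_int_demo` are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Same add/recurse/subtract walk as the float demo, on an int32 accumulator.
template <typename Generator>
void demo_int(std::int32_t& x, std::uniform_int_distribution<std::int32_t>& dist,
              Generator& gen, std::size_t depth) {
  if (depth == 0) { return; }
  for (std::size_t i = 0; i < 2; ++i) {
    const std::int32_t val = dist(gen);
    x += val;  // incremental update
    demo_int(x, dist, gen, depth - 1);
    x -= val;  // incremental undo: exact for integers (no overflow here)
  }
}

// Runs the walk and returns the accumulator value afterwards.
inline std::int32_t run_int_demo(std::int32_t start, std::size_t depth, unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::int32_t> dist(-1000, 1000);
  std::int32_t x = start;
  demo_int(x, dist, gen, depth);
  return x;
}
```

Whatever the seed, the accumulator returns to its starting value, because integer addition and subtraction are exact as long as nothing overflows.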

xr_a_y · Post by **xr_a_y** » Thu Jun 17, 2021 8:02 am

niel5946 wrote: ↑Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic, I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]

For NNUE training data generation I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random position (self play random mover)
- from search tree (not taking all position of search tree of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense to be defined ...).

What works best for me currently is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, going to the quiet leaf using a qsearch and then evaluating this leaf position at a reasonable depth (currently 8 to 12 for Minic, depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...
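
The "random factor added to score for the first 10 moves" could look roughly like the sketch below. This is not Minic's code: the uniform noise, its size `max_noise_cp` and the function name `pick_move` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Pick the index of the best root move from per-move search scores.
// During the first `noisy_moves` moves of the game, uniform noise of up to
// `max_noise_cp` centipawns is added to each score to diversify the openings.
template <typename Generator>
std::size_t pick_move(const std::vector<int>& scores, int move_number, Generator& gen,
                      int noisy_moves = 10, int max_noise_cp = 50) {
  assert(!scores.empty());
  std::uniform_int_distribution<int> noise(-max_noise_cp, max_noise_cp);
  std::size_t best = 0;
  int best_score = 0;
  for (std::size_t i = 0; i < scores.size(); ++i) {
    int s = scores[i];
    if (move_number <= noisy_moves) { s += noise(gen); }
    if (i == 0 || s > best_score) {
      best = i;
      best_score = s;
    }
  }
  return best;
}
```

Because the noise is bounded, a move that is clearly best is still chosen, while near-equal moves get shuffled early in the game.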

chrisw · Post by **chrisw** » Thu Jun 17, 2021 10:58 am

xr_a_y wrote: ↑Thu Jun 17, 2021 8:02 am
niel5946 wrote: ↑Wed Jun 16, 2021 11:01 pm
[...]

For the real dataset with 50M-70M positions, I am using lichess positions for exactly that: variety. I think non-master human play is good at early training phases since such games are usually a lot more unbalanced than engine-v-engine. As a side note, I have actually thought about using positions from random self-play (like Minic, I think?) since they will also be very unbalanced. My thought process is that if the net doesn't know the difference between clearly losing positions and winning ones, balanced positions won't even matter.

[...]

For NNUE training data generation I tried many things ... :
- from pv (real game at fixed depth, or even just using short TC)
- from random position (self play random mover)
- from search tree (not taking all position of search tree of course, but sampling)

For each of these solutions, one must probably filter out non-quiet positions (in a sense to be defined ...).

What works best for me currently is using fixed small-depth self-play games with a random factor added to the score for the first 10 moves, and for each position reached, going to the quiet leaf using a qsearch and then evaluating this leaf position at a reasonable depth (currently 8 to 12 for Minic, depending on game phase).

On 16 threads, I'm able to generate 3M such positions per hour. So 2 weeks for 1B positions ...

How much depth are you using for the game play?

chrisw · Post by **chrisw** » Thu Jun 17, 2021 11:05 am

connor_mcmonigle wrote: ↑Thu Jun 17, 2021 1:51 am
niel5946 wrote: ↑Wed Jun 16, 2021 11:01 pm
connor_mcmonigle wrote: ↑Sun Jun 13, 2021 7:37 am For incremental undo to work, you must have quantized weights. Otherwise, due to floating point rounding, the values will diverge which could be the source of the issues you're experiencing. To account for this, in Seer, I make a copy of the incrementally updated encoding vector effectively on every make move (though, after pruning). It's not elegant, but it works reasonably.
I have just tried looking through Seer's NNUE code, but I am afraid it is above my level of programming.. I don't really understand why I can't use incremental updates without quantization though. I have checked with the debugger while searching that the values look correct. By look I mean that none of the neurons seem to saturate at extremely high or extremely low values.
Additionally, if the incremental update were the cause of the error, this would also be present when playing with the HCE trained network.

...

I was referring to the errors associated with floating point arithmetic. In general, a + b - b = a does not hold for a floating point representation of real numbers. Therefore, error can grow with the millions of incremental updates performed in a search. Here's a quick demo:
```cpp
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <random>

constexpr std::size_t branching_factor = 2;

// Walks a tree of the given depth. Each "make move" adds a random value to x
// and the matching "unmake move" subtracts the same value again.
template<typename Distribution, typename Generator>
void demo(float& x, Distribution& dist, Generator& gen, std::size_t depth=25) {
  if (depth == 0) { return; }
  for (std::size_t i(0); i < branching_factor; ++i){
    const float val = dist(gen);
    x += val;                     // incremental update
    demo(x, dist, gen, depth-1);
    x -= val;                     // incremental undo
  }
}

int main(){
  float x{2.0f};
  auto gen = std::mt19937(std::random_device()());
  auto dist = std::normal_distribution<float>(0.0, 1.0);
  // With exact arithmetic both lines would print 2.
  std::cout << std::setprecision(10) << x << std::endl;
  demo(x, dist, gen);
  std::cout << x << std::endl;
}
```
As the errors aren't correlated in this example, the error isn't too extreme, but it is still undesirable in the context of a chess engine. However, as your HCE training experiment performed reasonably, this is unlikely to be the source of your issues.

I'd recommend checking your labels to ensure they're reasonable. Even if training on tactical positions, the evaluations should be mostly reasonable.

Mine incrementally updates when moving forward in the tree, but I pass the accumulator up from the prior ply each time (so no need to down date). My guess is that only updating in one direction, coupled with the “forced” accumulator rebuilds on complex moves, won’t generate much in the way of errors. But mine integerifies the base weights anyway.
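
A rough sketch of that copy-forward scheme with integer weights follows. It is not chrisw's code: the tiny layer size, the names `Accumulator` and `AccumulatorStack`, and the single added/removed feature per move are simplifying assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kHidden = 4;   // tiny for illustration (256 in the thread's network)
constexpr std::size_t kMaxPly = 128;

using Accumulator = std::array<std::int16_t, kHidden>;

// One accumulator per ply. Making a move copies the parent's accumulator to
// the next ply and updates the copy, so unmaking a move is just a step back.
struct AccumulatorStack {
  std::array<Accumulator, kMaxPly> stack{};
  std::size_t ply = 0;

  Accumulator& current() { return stack[ply]; }

  // `added` and `removed` are the first-layer weight columns of the feature
  // that appears and the feature that disappears with a quiet move.
  void push(const Accumulator& added, const Accumulator& removed) {
    assert(ply + 1 < kMaxPly);
    stack[ply + 1] = stack[ply];  // pass the accumulator up from the prior ply
    ++ply;
    for (std::size_t i = 0; i < kHidden; ++i) {
      stack[ply][i] = static_cast<std::int16_t>(stack[ply][i] + added[i] - removed[i]);
    }
  }

  // No "down date" needed: the parent's accumulator was never modified.
  void pop() {
    assert(ply > 0);
    --ply;
  }
};
```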
