Booot progress

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

Latest news and plans.

While checking and tuning the Delphi generator code I have generated 50M FENs with evaluations. Now everything is ready for the next big stage, "The Choice": I have to try, check and choose the best of several feature-layer models. As I said before, I will not try HalfKP ('everyone has it') and will implement my own. Honestly speaking, I do not feel HalfKP is natural for chess. Of course it works, and works fine (as we see) - the neural net does miracles, but I am sure it will do even more miracles if we help it a little :-).
So now I have 3 candidate models to test. I call them Minimal (12320 features), Medium (24640 features) and Maximum (160160 features). The algorithm for the new stage will be written in Python + Keras + TensorFlow, and it is (a rough sketch follows the list):

1. Make a FEN -> feature-layer array converter (for all 3 models).
2. Split the existing 50M training set into batches (let's say 32k FENs each).
3. Convert each batch to an input-layer array.
4. Train the full NN (with input layer) on every batch.
5. Quantize the final network.
6. Measure the accuracy of the quantized network.
7. Compare and choose 1 model of the 3.
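
A minimal Keras sketch of steps 2-4 above, under stated assumptions: fen_to_features() is a hypothetical step-1 converter (not shown), the feature count is taken from the "Minimal" model, and the layer sizes are placeholders rather than Booot's actual architecture:

```python
# Minimal sketch of steps 2-4, assuming a hypothetical fen_to_features()
# converter (step 1) that maps a FEN string to a 0/1 feature vector.
import numpy as np
from tensorflow import keras

NUM_FEATURES = 12320   # "Minimal" model size from the post
BATCH_SIZE = 32768     # roughly 32k FENs per batch, as in step 2


def fen_to_features(fen: str) -> np.ndarray:
    """Hypothetical step-1 converter: FEN -> feature vector of length NUM_FEATURES."""
    raise NotImplementedError


def batches(fens, evals):
    """Steps 2-3: split the training set into batches and convert each one."""
    for i in range(0, len(fens), BATCH_SIZE):
        x = np.stack([fen_to_features(f) for f in fens[i:i + BATCH_SIZE]])
        y = np.asarray(evals[i:i + BATCH_SIZE], dtype=np.float32)
        yield x, y


# Step 4: a toy NNUE-like stack - wide first layer, small hidden layers, CP output.
model = keras.Sequential([
    keras.Input(shape=(NUM_FEATURES,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# for x, y in batches(all_fens, all_evals):
#     model.train_on_batch(x, y)
```

Steps 5-6 (quantization and measuring the quantized network's accuracy) are left out of the sketch.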
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Booot progress

Post by cdani »

I wish you the best of luck with your nice work!
The comparison between the models will be very interesting.
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Booot progress

Post by connor_mcmonigle »

booot wrote: Tue Jun 01, 2021 10:11 am Latest news and plans.

While checking and tuning the Delphi generator code I have generated 50M FENs with evaluations. Now everything is ready for the next big stage, "The Choice": I have to try, check and choose the best of several feature-layer models. As I said before, I will not try HalfKP ('everyone has it') and will implement my own. Honestly speaking, I do not feel HalfKP is natural for chess. Of course it works, and works fine (as we see) - the neural net does miracles, but I am sure it will do even more miracles if we help it a little :-).
So now I have 3 candidate models to test. I call them Minimal (12320 features), Medium (24640 features) and Maximum (160160 features). The algorithm for the new stage will be written in Python + Keras + TensorFlow, and it is:

1. Make a FEN -> feature-layer array converter (for all 3 models).
2. Split the existing 50M training set into batches (let's say 32k FENs each).
3. Convert each batch to an input-layer array.
4. Train the full NN (with input layer) on every batch.
5. Quantize the final network.
6. Measure the accuracy of the quantized network.
7. Compare and choose 1 model of the 3.
Very interesting. I'd recommend using the standard 768 features as a first step, as a sort of ablation study to ensure the extra features are actually having a positive impact. Additionally, a dataset of "just" 50 million positions may be inadequate for the number of features you've proposed. Dropout might be a partial solution: training with dropout enabled seems to have helped a great deal when training on smaller datasets in my experience.

Could you elaborate on the input features you plan to try? It is important that, for large feature sets, the features for any given position are sparse and that the average number of updated features between adjacent positions is also low. For HalfKA/HalfKP the average number of updated features between adjacent positions is roughly 6. Much higher than that and incremental updates will likely prove too slow.
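
For reference, a hedged sketch of what such a 768-feature baseline with dropout might look like in Keras; the hidden-layer sizes and dropout rate are illustrative assumptions, not any specific engine's architecture:

```python
# Sketch of a 768-feature baseline (12 piece types x 64 squares) with dropout.
# Layer sizes and the dropout rate are illustrative, not tuned values.
from tensorflow import keras

baseline = keras.Sequential([
    keras.Input(shape=(768,)),                  # piece-square one-hot features
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.3),                  # only active during training
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),                      # evaluation output
])
baseline.compile(optimizer="adam", loss="mse", metrics=["mae"])
```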
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

cdani wrote: Sat Jun 05, 2021 4:09 pm I wish you the best of luck with your nice work!
The comparison between the models will be very interesting.
Thank you! Luck is the only thing I need now :-)
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

connor_mcmonigle wrote: Sat Jun 05, 2021 9:50 pm
Very interesting. I'd recommend using the standard 768 features as a first step, as a sort of ablation study to ensure the extra features are actually having a positive impact. Additionally, a dataset of "just" 50 million positions may be inadequate for the number of features you've proposed. Dropout might be a partial solution: training with dropout enabled seems to have helped a great deal when training on smaller datasets in my experience.

Could you elaborate on the input features you plan to try? It is important that, for large feature sets, the features for any given position are sparse and that the average number of updated features between adjacent positions is also low. For HalfKA/HalfKP the average number of updated features between adjacent positions is roughly 6. Much higher than that and incremental updates will likely prove too slow.
Yes - you are right. I added a "tiny" model (2*770 features - I also use castling-rights features) to compare with.

Did you try to train NNUE with HCE labels? Could you give me the 'mae' metric (mean absolute error) for those NNs? I am training my model with my learner, and the process looks normal (the loss-function value decreases, and the training and validation losses and 'mae' metrics stay close for the first several epochs before overfitting on this small dataset). But I do not know what 'mae' value is OK - I do not have anything to compare with. I got an 'mae' of about 32-35 centipawns on the validation dataset. Is that large or small? :-)
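
One hedged way to put a validation MAE of 32-35 cp into context is to compare it against a trivial baseline that always predicts the mean training label; the model, val_x, val_evals and train_evals names below are assumptions standing in for your own training script's variables:

```python
# Compare the model's validation MAE (centipawns) with a constant-prediction
# baseline. All arguments are assumed to exist in the caller's training code.
import numpy as np


def report_mae(model, val_x, val_evals, train_evals):
    baseline_mae = float(np.mean(np.abs(train_evals.mean() - val_evals)))
    model_mae = float(np.mean(np.abs(model.predict(val_x).squeeze() - val_evals)))
    print(f"baseline MAE: {baseline_mae:.1f} cp, model MAE: {model_mae:.1f} cp")
```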
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Booot progress

Post by Pio »

booot wrote: Sun Jun 06, 2021 1:31 pm
connor_mcmonigle wrote: Sat Jun 05, 2021 9:50 pm
Very interesting. I'd recommend using the standard 768 features as a first step, as a sort of ablation study to ensure the extra features are actually having a positive impact. Additionally, a dataset of "just" 50 million positions may be inadequate for the number of features you've proposed. Dropout might be a partial solution: training with dropout enabled seems to have helped a great deal when training on smaller datasets in my experience.

Could you elaborate on the input features you plan to try? It is important that, for large feature sets, the features for any given position are sparse and that the average number of updated features between adjacent positions is also low. For HalfKA/HalfKP the average number of updated features between adjacent positions is roughly 6. Much higher than that and incremental updates will likely prove too slow.
Yes - you are right. I added a "tiny" model (2*770 features - I also use castling-rights features) to compare with.

Did you try to train NNUE with HCE labels? Could you give me the 'mae' metric (mean absolute error) for those NNs? I am training my model with my learner, and the process looks normal (the loss-function value decreases, and the training and validation losses and 'mae' metrics stay close for the first several epochs before overfitting on this small dataset). But I do not know what 'mae' value is OK - I do not have anything to compare with. I got an 'mae' of about 32-35 centipawns on the validation dataset. Is that large or small? :-)
You can increase the batch size; it usually helps against overfitting. One idea I have had for a couple of years is to make a NN for the pawn hash table, so you can reuse it 99% of the time. Then it does not matter how big you make the pawn NN, since the computational effort won't be big. The only thing that matters is that the result should not take too much space. The result of the pawn hash could then be fed to the king-position NN (which will not change so often), and that result could then form the input for the rest of the pieces.
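
A very rough functional-API sketch of that idea, with made-up feature sizes and layer widths: a small pawn subnetwork whose output could be cached in a pawn hash, feeding a king subnetwork, which in turn joins the remaining piece features:

```python
# Rough sketch of a pawn-hashable subnetwork feeding a king subnetwork and
# then the rest of the pieces. All sizes are placeholder assumptions.
from tensorflow import keras

pawn_in = keras.Input(shape=(96,), name="pawn_features")     # 2 x 48 pawn squares
king_in = keras.Input(shape=(128,), name="king_features")    # 2 x 64 king squares
piece_in = keras.Input(shape=(640,), name="piece_features")  # remaining pieces

pawn_vec = keras.layers.Dense(32, activation="relu")(pawn_in)   # cacheable in pawn hash
king_vec = keras.layers.Dense(32, activation="relu")(
    keras.layers.Concatenate()([pawn_vec, king_in]))
merged = keras.layers.Concatenate()([king_vec, piece_in])
hidden = keras.layers.Dense(32, activation="relu")(merged)
output = keras.layers.Dense(1)(hidden)

pawn_king_model = keras.Model([pawn_in, king_in, piece_in], output)
```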

Don’t measure the error in centipawns. Measure the error as a probability difference, since it does not matter so much whether you lead by 10 or 12 pawns, but it matters a lot whether you lead by 2 pawns instead of 0.

I really like Delphi, so I don’t agree with the rest 😀

Good luck!
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

"Don’t measure the error in centipawns. Measure the error in probability difference since it does not matter so much if you lead by 10 or 12 pawns but it matters a lot if you lead with 2 pawns instead of 0 pawns."

You mean: the NN will still give its output in centipawns, but I should change the metric from 'mae' to something like 'mae of sigmoid outputs'?

"I really like Delphi, so I don’t agree with the rest 😀"

I also like it :-). The only problem is: in the chess programming world you have to write everything yourself :-)
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Booot progress

Post by Pio »

booot wrote: Sun Jun 06, 2021 4:35 pm "Don’t measure the error in centipawns. Measure the error as a probability difference, since it does not matter so much whether you lead by 10 or 12 pawns, but it matters a lot whether you lead by 2 pawns instead of 0."

You mean: the NN will still give its output in centipawns, but I should change the metric from 'mae' to something like 'mae of sigmoid outputs'?

"I really like Delphi, so I don’t agree with the rest 😀"

I also like it :-). The only problem is: in the chess programming world you have to write everything yourself :-)
Yes, exactly. It does not matter what your output is, only that the error function should be based on the probability difference. I think most people use minimum squared error, although it makes much more sense to use minimum absolute error, exactly like you do. Of course it might also depend on the type of search you do. In an alpha-beta framework it might be more important to minimise the big errors, thus use squared error, but with a probabilistic search like MCTS the absolute error function might be better since it is much more stable by design. I don’t know; best is to test. Another reason I like to minimise the absolute error is that it makes labelling errors influence the end result a lot less. I guess that minimising the absolute error will give okay results even when training on non-quiet positions (because the errors should cancel each other out a lot), while it could be disastrous if you minimised the squared error.
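
A hedged sketch of what such a probability-space error could look like in TensorFlow, assuming the network outputs centipawns and using a made-up SCALE constant for the centipawn-to-probability sigmoid:

```python
# Measure the error between win probabilities instead of raw centipawns.
# SCALE is an assumed constant; real engines tune this mapping themselves.
import tensorflow as tf

SCALE = 400.0  # assumed centipawn-to-probability scale


def cp_to_prob(cp):
    return tf.sigmoid(cp / SCALE)


def prob_mae(y_true_cp, y_pred_cp):
    """Mean absolute error in probability space (more tolerant of label noise)."""
    return tf.reduce_mean(tf.abs(cp_to_prob(y_true_cp) - cp_to_prob(y_pred_cp)))


def prob_mse(y_true_cp, y_pred_cp):
    """Squared-error variant, punishing large mistakes more heavily."""
    return tf.reduce_mean(tf.square(cp_to_prob(y_true_cp) - cp_to_prob(y_pred_cp)))

# model.compile(optimizer="adam", loss=prob_mae, metrics=[prob_mse])
```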
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Booot progress

Post by connor_mcmonigle »

booot wrote: Sun Jun 06, 2021 1:31 pm
connor_mcmonigle wrote: Sat Jun 05, 2021 9:50 pm
Very interesting. I'd recommend using the standard 768 features as a first step, as a sort of ablation study to ensure the extra features are actually having a positive impact. Additionally, a dataset of "just" 50 million positions may be inadequate for the number of features you've proposed. Dropout might be a partial solution: training with dropout enabled seems to have helped a great deal when training on smaller datasets in my experience.

Could you elaborate on the input features you plan to try? It is important that, for large feature sets, the features for any given position are sparse and that the average number of updated features between adjacent positions is also low. For HalfKA/HalfKP the average number of updated features between adjacent positions is roughly 6. Much higher than that and incremental updates will likely prove too slow.
Yes - you are right. I added a "tiny" model (2*770 features - I also use castling-rights features) to compare with.

Did you try to train NNUE with HCE labels? Could you give me the 'mae' metric (mean absolute error) for those NNs? I am training my model with my learner, and the process looks normal (the loss-function value decreases, and the training and validation losses and 'mae' metrics stay close for the first several epochs before overfitting on this small dataset). But I do not know what 'mae' value is OK - I do not have anything to compare with. I got an 'mae' of about 32-35 centipawns on the validation dataset. Is that large or small? :-)
Great. Starting with "tiny" seems a wise decision. My networks for Seer weren't trained on CP evals and instead predict WDL probabilities starting initially from EGTB. I used cross entropy loss (effectively, MLE).


The latest SF networks were trained with MSE loss, I believe. Cross entropy should also be worthwhile to experiment with imho. More information can be found in this document written by Sopel:
https://github.com/glinscott/nnue-pytor ... apply-them
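
For illustration, a minimal sketch of a WDL output head trained with cross entropy, assuming a 256-wide hidden representation from earlier layers; this is not Seer's or Stockfish's actual code, just the general shape of the idea:

```python
# Minimal WDL (win/draw/loss) output head trained with cross entropy.
# The 256-wide input is an assumed hidden representation, not a real net's size.
from tensorflow import keras

wdl_head = keras.Sequential([
    keras.Input(shape=(256,)),
    keras.layers.Dense(3, activation="softmax"),    # P(win), P(draw), P(loss)
])
wdl_head.compile(optimizer="adam",
                 loss="categorical_crossentropy",   # MLE over WDL targets
                 metrics=["accuracy"])
```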
booot
Posts: 82
Joined: Sun Jul 03, 2016 10:29 pm

Re: Booot progress

Post by booot »

"My networks for Seer weren't trained on CP evals and instead predict WDL probabilities starting initially from EGTB. I used cross entropy loss (effectively, MLE)."

Then you have to 'decode' the probability from the NNUE back to a CP eval every time you call NNUE_eval()?
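
If the net does output a win probability, one common way to map it back to centipawns is the inverse of the sigmoid (the logit); the SCALE constant below is an assumption, and the clamping just avoids infinities at exactly 0 and 1:

```python
# Convert a win probability back to an approximate centipawn score via the
# inverse sigmoid (logit). SCALE is an assumed constant, not a standard value.
import math

SCALE = 400.0


def prob_to_cp(p: float) -> int:
    p = min(max(p, 1e-6), 1.0 - 1e-6)   # clamp away from exactly 0 and 1
    return int(round(SCALE * math.log(p / (1.0 - p))))

# prob_to_cp(0.5) == 0, prob_to_cp(0.75) ~= 439
```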