Neural network training, positions shuffling between epochs

Fabio Gobbato
Posts: 217
Joined: Fri Apr 11, 2014 10:45 am
Full name: Fabio Gobbato

Neural network training, positions shuffling between epochs

Post by Fabio Gobbato »

I know it's good practice to shuffle positions between epochs, but has anyone measured the difference in training with and without shuffling?
With a dataset of 1 billion positions I have created a good net without shuffling, and the error gets lower at every epoch.
I would like to know whether shuffling positions is a big advantage, or whether one can omit shuffling support from the training process.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Neural network training, positions shuffling between epochs

Post by Pio »

Fabio Gobbato wrote: Sun Aug 29, 2021 5:49 pm I know it's good practice to shuffle positions between epochs, but has anyone measured the difference in training with and without shuffling?
With a dataset of 1 billion positions I have created a good net without shuffling, and the error gets lower at every epoch.
I would like to know whether shuffling positions is a big advantage, or whether one can omit shuffling support from the training process.
I guess the shuffling makes sense depending on how fast the learning rate decreases. If you do not shuffle, the first positions in the dataset will contribute more to the learnt weights than those at the end. I really doubt that shuffling will have any measurable effect compared to a single random order on your huge dataset; on a small dataset it would have an effect.

I guess there might be an even better method than shuffling, however: feed simpler positions from closer to the end of the game at the beginning of training, and more complicated positions from closer to the start of the game at the end of the dataset.
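
Purely as an illustration of what I mean, something like this (Python; the plies_to_end field is hypothetical and would have to be recorded when the data is generated):

Code:

import random

def curriculum_order(positions):
    # positions: list of dicts, each with a hypothetical "plies_to_end" field
    # giving the distance from the position to the final result of its game.
    random.shuffle(positions)                        # break up the per-game order first
    positions.sort(key=lambda p: p["plies_to_end"])  # stable sort: endgame positions first
    return positions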
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Neural network training, positions shuffling between epochs

Post by brianr »

With Leela-type nets, there typically are no epochs.

With reinforcement learning (RL), as in the Lc0 project runs, there are sliding windows of self-play games, and a new net is created after a few thousand training steps. I think it is something like 32,000 new games per net, with a window of maybe 1 million games; one would have to ask on the Lc0 Discord to check.

With supervised learning (SL), except in cases with a very small number of games, there are no epochs either as there is nearly unlimited data. Each game is roughly 150 positions (the positions are the actual samples used for training).

I have trained Lc0 nets using groups of games in increasing strength tiers. It does not seem to make much difference either way, FWIW.

Nearly all of the NN ML papers I have seen do employ multiple epochs with relatively small training sample sets, which makes me question how applicable they are to Leela training. I know SF nets are typically trained on billions of positions, and perhaps epochs are used there.
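
Roughly, the sliding-window setup looks like this (Python sketch only; the window and per-net game counts are just my guesses from above, not confirmed Lc0 numbers):

Code:

import random
from collections import deque

WINDOW_GAMES = 1_000_000   # guessed window size, not a confirmed Lc0 constant
GAMES_PER_NET = 32_000     # guessed number of fresh self-play games per new net

window = deque(maxlen=WINDOW_GAMES)   # oldest games drop off the back automatically

def add_selfplay_games(games):
    # Freshly generated self-play games push the window forward; after roughly
    # GAMES_PER_NET new games, a new net is trained from the current window.
    window.extend(games)

def sample_training_positions(n):
    # Each training step samples positions from games anywhere in the window,
    # so there is no fixed dataset and therefore no epoch boundary.
    games = random.choices(list(window), k=n)       # sample games with replacement
    return [random.choice(game) for game in games]  # one position per sampled game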
Joost Buijs
Posts: 1564
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Neural network training, positions shuffling between epochs

Post by Joost Buijs »

Fabio Gobbato wrote: Sun Aug 29, 2021 5:49 pm I know it's good practice to shuffle positions between epochs, but has anyone measured the difference in training with and without shuffling?
With a dataset of 1 billion positions I have created a good net without shuffling, and the error gets lower at every epoch.
I would like to know whether shuffling positions is a big advantage, or whether one can omit shuffling support from the training process.
Currently I use a data-set of 2 billion positions and shuffle the positions between epochs, because intuition tells me this should be better. However, it is very difficult to determine whether it has a positive effect on the quality of the training, because several other parameters have an influence as well, such as batch size, initial learning rate and weight decay to name a few, and they all seem to influence each other.

On my RTX 3090 an epoch with batches of 128K takes between 5 and 10 minutes; the extra time due to shuffling the indices is noticeable, but it really doesn't matter whether an epoch takes 9m30 or 9m40. I've written my trainer in C++ using libTorch, so I don't know how much time shuffling would take with PyTorch and Python; I assume PyTorch has built-in provisions for this.
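
For what it's worth, PyTorch does have this built in: a DataLoader with shuffle=True re-shuffles at the start of every epoch, and a manual index permutation per epoch works just as well. A toy sketch with placeholder data and the 128K batch size from above:

Code:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data just to make the sketch self-contained.
features = torch.randn(500_000, 32)
targets = torch.randn(500_000, 1)
dataset = TensorDataset(features, targets)

# Option 1: let the DataLoader re-shuffle at the start of every epoch.
loader = DataLoader(dataset, batch_size=131_072, shuffle=True)

# Option 2: regenerate the index permutation yourself between epochs.
for epoch in range(3):
    perm = torch.randperm(len(dataset))            # new permutation each epoch
    for start in range(0, len(dataset), 131_072):
        idx = perm[start:start + 131_072]
        batch_x, batch_y = features[idx], targets[idx]
        # ... forward pass, loss, backward pass, optimizer step ...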

I found that a low MSE and a low validation error most of the time don't mean that the network is the best one; testing it with real games after each epoch seems necessary, and for a single person that is hardly doable.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Neural network training, positions shuffling between epochs

Post by Daniel Shawul »

Fabio Gobbato wrote: Sun Aug 29, 2021 5:49 pm I know it's good practice to shuffle positions between epochs, but has anyone measured the difference in training with and without shuffling?
With a dataset of 1 billion positions I have created a good net without shuffling, and the error gets lower at every epoch.
I would like to know whether shuffling positions is a big advantage, or whether one can omit shuffling support from the training process.
If the positions are originally shuffled and you have a big dataset, it probably doesn't matter much.
The point of shuffling is to prevent the NN from learning the order of the positions and doing something with it.
I have 40b FENs for training a 20b NN, so by the time I come back to them I am sure it won't remember the order of the positions, so I don't shuffle on the 2nd or later epochs.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Neural network training, positions shuffling between epochs

Post by Pio »

Daniel Shawul wrote: Tue Aug 31, 2021 4:32 pm
Fabio Gobbato wrote: Sun Aug 29, 2021 5:49 pm I know it's good practice to shuffle positions between epochs, but has anyone measured the difference in training with and without shuffling?
With a dataset of 1 billion positions I have created a good net without shuffling, and the error gets lower at every epoch.
I would like to know whether shuffling positions is a big advantage, or whether one can omit shuffling support from the training process.
If the positions are originally shuffled and you have a big dataset, it probably doesn't matter much.
The point of shuffling is to prevent the NN from learning the order of the positions and doing something with it.
I have 40b FENs for training a 20b NN, so by the time I come back to them I am sure it won't remember the order of the positions, so I don't shuffle on the 2nd or later epochs.
How can the network learn the order of the data???

Without shuffling you might more easily get stuck in a local optimum, though, since the gradients will not be as diverse.

My intuition says it should be better to put more positions close to the end result at the beginning of the training. There is no need to learn openings if you cannot convert them to wins.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Neural network training, positions shuffling between epochs

Post by Daniel Shawul »

Pio wrote: Tue Aug 31, 2021 5:17 pm
Daniel Shawul wrote: Tue Aug 31, 2021 4:32 pm
Fabio Gobbato wrote: Sun Aug 29, 2021 5:49 pm I know it's good practice to shuffle positions between epochs, but has anyone measured the difference in training with and without shuffling?
With a dataset of 1 billion positions I have created a good net without shuffling, and the error gets lower at every epoch.
I would like to know whether shuffling positions is a big advantage, or whether one can omit shuffling support from the training process.
If the positions are originally shuffled and you have a big dataset, it probably doesn't matter much.
The point of shuffling is to prevent the NN from learning the order of the positions and doing something with it.
I have 40b FENs for training a 20b NN, so by the time I come back to them I am sure it won't remember the order of the positions, so I don't shuffle on the 2nd or later epochs.
How can the network learn the order of the data???

Without shuffling you might more easily get stuck in a local optimum, though, since the gradients will not be as diverse.

My intuition says it should be better to put more positions close to the end result at the beginning of the training. There is no need to learn openings if you cannot convert them to wins.
Note that I am talking about shuffling done for the 2nd epoch and above.
What you say about shuffling only applies to the first one, i.e. to have mini-batches that are representative of the whole dataset.
If you use batch gradient descent, you don't need to shuffle the data at all.
For mini-batch gradient descent the first shuffle is important for sure, but the second and later shuffles are not that necessary for a big dataset IMO.

A NN can certainly learn to remember the order of samples (or mini-batches), and that IIRC is one of the reasons why you need to shuffle, besides the need to have mini-batches that are representative of the data, which leads to faster convergence.
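
To make the batch vs. mini-batch point concrete, a toy sketch (plain Python/NumPy linear regression, nothing to do with actual chess NN training code):

Code:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))          # toy features
y = rng.normal(size=10_000)               # toy targets
w = np.zeros(8)
lr, batch = 0.01, 256

def grad(w, Xb, yb):
    # Gradient of mean squared error for a linear model (illustration only).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Batch gradient descent: every step uses all the data, so order is irrelevant.
for step in range(100):
    w -= lr * grad(w, X, y)

# Mini-batch gradient descent: one initial permutation makes the mini-batches
# representative of the whole dataset; re-permuting at every epoch (the
# commented line) is the part that matters less and less as the dataset grows.
perm = rng.permutation(len(X))
for epoch in range(5):
    # perm = rng.permutation(len(X))      # optional re-shuffle between epochs
    for start in range(0, len(X), batch):
        idx = perm[start:start + batch]
        w -= lr * grad(w, X[idx], y[idx])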