Using Mini-Batch for tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Using Mini-Batch for tuning

Post by Desperado »

I have read in several threads that some people tune their data with so-called "Mini-Batches", a subset of the total dataset. What is the idea, and how can it be used?

A link to an easy introduction on that topic would already be interesting.

Thanks in advance.
User avatar
xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Using Mini-Batch for tuning

Post by xr_a_y »

derjack
Posts: 16
Joined: Fri Dec 27, 2019 8:47 pm
Full name: Jacek Dermont

Re: Using Mini-Batch for tuning

Post by derjack »

Stochastic gradient descent can be seen as mini-batch training with a batch size of 1: you iterate over your data position by position and update the weights after each one. With mini-batches you divide your training data into parts of size N (typically in the range 32..1024 or more) and update the weights once per mini-batch. It is usually more effective than plain SGD and more parallelizable, be it on GPU or CPU, so in practice much faster.
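To make the "one update per mini-batch" part concrete, here is a minimal sketch assuming a linear evaluation, a sigmoid mapping to an expected score, and a mean-squared-error loss; the names (Sample, miniBatchStep, k) are made up for illustration and are not anyone's actual tuner.

Code: Select all

// Minimal sketch of a single mini-batch update (illustration only).
// Assumptions: eval = dot(weights, features), sigmoid(eval) approximates the
// expected game score, loss = mean squared error against the game result.
#include <cmath>
#include <cstddef>
#include <vector>

struct Sample {
    std::vector<double> features;  // feature counts for one position
    double result;                 // game result: 1.0, 0.5 or 0.0
};

double sigmoid(double eval, double k) {
    return 1.0 / (1.0 + std::exp(-k * eval));
}

// Perform exactly one weight update for a whole mini-batch.
void miniBatchStep(std::vector<double>& weights,
                   const std::vector<Sample>& batch,
                   double k, double learningRate) {
    std::vector<double> grad(weights.size(), 0.0);

    for (const Sample& s : batch) {
        double eval = 0.0;
        for (std::size_t i = 0; i < weights.size(); ++i)
            eval += weights[i] * s.features[i];

        double p = sigmoid(eval, k);
        // d/dw_i of (p - result)^2 = 2*(p - result) * k*p*(1-p) * feature_i
        double common = 2.0 * (p - s.result) * k * p * (1.0 - p);
        for (std::size_t i = 0; i < weights.size(); ++i)
            grad[i] += common * s.features[i];
    }

    // One update per mini-batch, using the averaged gradient.
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] -= learningRate * grad[i] / static_cast<double>(batch.size());
}

With batch size 1 this degenerates to plain SGD; with batch size equal to the whole dataset it becomes ordinary batch gradient descent.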
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Using Mini-Batch for tuning

Post by Desperado »

derjack wrote: Tue Jan 12, 2021 9:11 pm Stochastic gradient descent can be seen as mini-batch training with a batch size of 1: you iterate over your data position by position and update the weights after each one. With mini-batches you divide your training data into parts of size N (typically in the range 32..1024 or more) and update the weights once per mini-batch. It is usually more effective than plain SGD and more parallelizable, be it on GPU or CPU, so in practice much faster.
Splitting the training data into parts doesn't sound like it depends on the tuning algorithm itself, so what is the general idea?

I can imagine that this results in more frequent updates of the parameter vector, so you quickly reach a semi-good solution.
Further, I would guess the challenge is to calm things down to avoid fluctuations or divergence.

In basic algorithms where I don't have something like a learning rate, I would control the update of a parameter with stepcount * stepsize.
Additionally, one could operate on every parameter in the beginning and later pick only a subset of the parameter vector.
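One way to picture the stepcount * stepsize idea is a simple local search that nudges one parameter by fixed-size steps and keeps the value with the lowest error; the following is only a hypothetical sketch (nudgeParameter and the error callback are made-up names, not an existing tuner).

Code: Select all

// Hypothetical illustration only: control the size of a parameter change with
// stepCount * stepSize instead of a learning rate. The error callback is
// whatever the tuner minimizes (e.g. mean squared error over a batch).
#include <cstddef>
#include <functional>
#include <vector>

void nudgeParameter(std::vector<int>& params, std::size_t index,
                    int stepCount, int stepSize,
                    const std::function<double(const std::vector<int>&)>& error) {
    const int original = params[index];
    int bestValue = original;
    double bestErr = error(params);

    // Try original + step * stepSize for step in [-stepCount, +stepCount]
    // and keep the value with the lowest error (a simple local search).
    for (int step = -stepCount; step <= stepCount; ++step) {
        if (step == 0) continue;  // already measured
        params[index] = original + step * stepSize;
        double err = error(params);
        if (err < bestErr) { bestErr = err; bestValue = params[index]; }
    }
    params[index] = bestValue;
}

Shrinking stepCount or stepSize over time would then play the same role as lowering a learning rate.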

What do you think? Sorry, this is a completely new field for me.
derjack
Posts: 16
Joined: Fri Dec 27, 2019 8:47 pm
Full name: Jacek Dermont

Re: Using Mini-Batch for tuning

Post by derjack »

Desperado wrote: Tue Jan 12, 2021 9:30 pm
derjack wrote: Tue Jan 12, 2021 9:11 pm Stochastic gradient descent can be seen as mini-batch training with a batch size of 1: you iterate over your data position by position and update the weights after each one. With mini-batches you divide your training data into parts of size N (typically in the range 32..1024 or more) and update the weights once per mini-batch. It is usually more effective than plain SGD and more parallelizable, be it on GPU or CPU, so in practice much faster.
Splitting the training data into parts doesn't sound like it depends on the tuning algorithm itself, so what is the general idea?
Actually it is algorithm specific, or rather gradient-descent specific. With mini-batches you still operate on the entire dataset in each epoch.

Maybe you misunderstood it as something like using only a subset of the positions and training on it, then another subset, and so on, training separately on each subset; since the subsets are smaller, the training would be faster.
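To illustrate the difference: in each epoch the data is (typically) shuffled and every mini-batch is visited once, so all positions are still used; only the update frequency changes. A hedged sketch of the outer loop, reusing the hypothetical Sample/miniBatchStep from the earlier sketch (batchSize, epochs and the RNG seed are arbitrary choices).

Code: Select all

// Outer training loop for the earlier sketch: every epoch still covers the
// whole dataset, just split into mini-batches. Illustration only.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>
// Sample and miniBatchStep as defined in the earlier sketch.

void train(std::vector<double>& weights, std::vector<Sample>& data,
           double k, double learningRate, std::size_t batchSize, int epochs) {
    std::mt19937 rng(12345);
    for (int e = 0; e < epochs; ++e) {
        // Shuffle so the mini-batches differ from epoch to epoch.
        std::shuffle(data.begin(), data.end(), rng);

        // Every position is used exactly once per epoch, just in chunks.
        for (std::size_t start = 0; start < data.size(); start += batchSize) {
            std::size_t end = std::min(start + batchSize, data.size());
            std::vector<Sample> batch(data.begin() + start, data.begin() + end);
            miniBatchStep(weights, batch, k, learningRate);
        }
    }
}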
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Using Mini-Batch for tuning

Post by Desperado »

derjack wrote: Tue Jan 12, 2021 10:07 pm
Desperado wrote: Tue Jan 12, 2021 9:30 pm
derjack wrote: Tue Jan 12, 2021 9:11 pm Stochastic gradient descent can be seen as mini-batch training with a batch size of 1: you iterate over your data position by position and update the weights after each one. With mini-batches you divide your training data into parts of size N (typically in the range 32..1024 or more) and update the weights once per mini-batch. It is usually more effective than plain SGD and more parallelizable, be it on GPU or CPU, so in practice much faster.
Splitting the training data into parts doesn't sound like it depends on the tuning algorithm itself, so what is the general idea?
Actually it is algorithm specific, or rather gradient-descent specific. With mini-batches you still operate on the entire dataset in each epoch.

Maybe you misunderstood it as something like using only a subset of the positions and training on it, then another subset, and so on, training separately on each subset; since the subsets are smaller, the training would be faster.
What I mean is that you accept changes during an epoch, so you also update the reference fitness (in the hope that you are on the right track).
That would speed things up at the beginning, but you would need to "cool down" the learning rate in later epochs so you don't start to fluctuate or diverge because of too many updates of the reference fitness.
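One common way to do that kind of cooling down is a simple decay schedule for the learning rate (or, in a stepping scheme, for the step size). A hypothetical sketch; the function name and the halving factor are arbitrary choices, not anyone's actual scheme.

Code: Select all

// Hypothetical step-decay schedule: halve the learning rate every
// 'decayEvery' epochs, starting from 'initial'. Illustration only.
#include <cmath>

double learningRateForEpoch(int epoch, double initial, int decayEvery) {
    return initial * std::pow(0.5, epoch / decayEvery);  // integer division on purpose
}

Early epochs then make large, frequent moves, while later epochs take ever smaller steps so the parameters can settle instead of fluctuating.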