How to work with batch size in neural network

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

GBrouwer
Posts: 3
Joined: Sat Mar 21, 2020 9:38 pm
Full name: Gertjan Brouwer

How to work with batch size in neural network

Post by GBrouwer »

Hello, together with some other students I am creating a chess engine. We have a basis for the engine running, the perft is correct, and we use a basic evaluation. However, we just finished training a supervised neural net with TensorFlow on games annotated by Stockfish 11. The ANN has been trained using batch sizes of 100, 1,000 and 10,000, so inference on these models requires 100, 1,000 or 10,000 features respectively. At the moment we are using basic alpha-beta pruning, which usually works by evaluating a leaf node and then deciding whether to prune. I can't seem to figure out how this would work with, for example, 100 leaf nodes being evaluated at once. Wouldn't this completely defeat the purpose of alpha-beta?
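To make the question concrete, this is roughly the kind of search we have in mind; a minimal negamax-style alpha-beta sketch where generate_moves, make_move, undo_move and evaluate are placeholders for our own engine code, and evaluate is where the net would have to be called on a single position:

Code:

def alpha_beta(position, depth, alpha, beta):
    # Minimal fail-hard negamax sketch. The evaluation is called on ONE leaf
    # position at a time, which is why a model that insists on batches of 100
    # does not fit in naturally.
    if depth == 0:
        return evaluate(position)              # single-position inference here
    for move in generate_moves(position):
        make_move(position, move)
        score = -alpha_beta(position, depth - 1, -beta, -alpha)
        undo_move(position, move)
        if score >= beta:
            return beta                        # beta cutoff: prune remaining moves
        alpha = max(alpha, score)
    return alpha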

I hope this is clear; if not, please point it out so I can change the question.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: How to work with batch size in neural network

Post by brianr »

Perhaps I am not understanding the issue, but generally the batch size used for training a net has nothing to do with the batch size used when playing a game (inference only). Moreover, the batch size has little to do with the number of features, assuming "features" means the things addressed by a traditional evaluation function, like a passed pawn, a rook on an open file, etc. Far more important are the number of samples used for training and the number of games (as not every position in a game is typically used), along with the learning rate schedule.
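For example, with Keras/TensorFlow something along these lines is perfectly normal (the shapes and layer sizes below are made up purely for illustration): the batch_size given to fit() only controls how many samples go into each gradient update, and predict() can afterwards be called on a single position.

Code:

import numpy as np
import tensorflow as tf

# Toy data: 10,000 positions, each encoded as a 384-value feature vector.
x_train = np.random.rand(10_000, 384).astype("float32")
y_train = np.random.rand(10_000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(384,)),
    tf.keras.layers.Dense(1, activation="tanh"),
])
model.compile(optimizer="adam", loss="mse")

# Training batch size: how many samples go into each gradient update.
model.fit(x_train, y_train, epochs=10, batch_size=1000)

# Inference batch size: whatever you pass in, including a single position.
score = model.predict(x_train[:1])    # input shape (1, 384)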

Net inference is generally considered far too slow to work within an A/B search, aside from the difficulty of assembling batches to keep the GPU busy. Variations of MCTS are used instead. See:
http://lczero.org/dev/wiki/technical-ex ... hess-zero/
GBrouwer
Posts: 3
Joined: Sat Mar 21, 2020 9:38 pm
Full name: Gertjan Brouwer

Re: How to work with batch size in neural network

Post by GBrouwer »

The feature is indeed the representation of the chessboard, e.g. castling rights. One of the models has been trained with 100 games per batch, thus 100 features from those games have been used. The batch size is, however, baked into the model; the code to fit the model is as follows:

Code:

network.fit(training_feature_array, training_score_array, epochs=10, batch_size=100)
If I try to predict an evaluation with this model by inputting just one feature, it gives me an error saying I have to input 100, because the batch size does not work otherwise.

Also, the DeepMind team used MCTS, which might be faster this way, but we are loosely basing our implementation on the Giraffe engine by Matthew Lai. He said in his paper that it was slower, but not too significantly (at least not for us in this case). He, however, uses a custom-built net in C++ which runs inference on a single feature.

Perhaps a better question is: if running a single feature through a neural net is too slow, then what is an alternative alpha-beta implementation that can handle running a batch, which would obviously speed things up?
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: How to work with batch size in neural network

Post by Pio »

GBrouwer wrote: Tue Jun 02, 2020 3:43 pm The feature is indeed the representation of the chessboard, e.g. castling rights. [...]
Hi!

I think you might have misunderstood some things. What batches do is that, instead of updating the neural network weights after just one training sample, you update them after many training samples. This can speed up learning because the gradient over the batch can be computed efficiently and the weights only have to be updated once per batch. Training in batches has another good property: it helps quite a lot against overfitting. You can also use dropout and/or minimise the weights in the network (regularisation) to prevent overfitting, as well as PCA or other dimension-reducing algorithms, and identity mappings (skip connections) between the layers.
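A toy illustration of the batched update (plain NumPy, a linear model, and all the numbers invented just to show the pattern): the gradient is averaged over the whole batch, and the weights are touched once per batch instead of once per sample.

Code:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))            # 1000 samples with 8 features each
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(8)
learning_rate, batch_size = 0.1, 100

for epoch in range(10):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)   # gradient averaged over the batch
        w -= learning_rate * grad                     # one weight update per batch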

I think the problem you encountered while training is that you provided fewer than 100 training positions and the library might not be able to handle that. If that is the case, you can just make the number of training examples a multiple of 100 (the batch size).

You really do not want to train the features one after another, since that will lead to a suboptimal solution. You want to train all the features at the same time.

/Pio
jorose
Posts: 358
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: How to work with batch size in neural network

Post by jorose »

I also think there might be some confusion with regard to technical terms.

A data sample is an instance of your data set. In your case a sample is any given chess position to be evaluated in some way.
A feature, as you mentioned, is an attribute of your data and has some value for each data sample. For example, whether or not the first player can castle queenside. Features are a part of the data.

Batch size is a part of the model and not the data. The batch size determines how many data samples should get processed at once by the model. This has very little to do with features. Batch size may be changed at any time between batches, so you can have a different batch size for training and inference.

Batch size is relevant in training for the reasons Pio mentioned. Furthermore, batch size is relevant in both training and inference on the GPU, because modern GPUs have so much parallel computational power that it is wasteful not to feed multiple samples at once.

Both Winter and Giraffe are Alpha-Beta NN engines designed to run on CPU. CPUs have far less computational power, and thus batching during inference is not as beneficial. They both rely on a batch size of 1 without issues.

Allie explicitly has (or had?) minimax with the NN on the GPU. Allie, however, uses vanilla minimax for exactly this batching reason, and the core of the search is still MCTS.

Batching is actually also a bit of an issue for MCTS, and I am not sure how great LC0's solution really is. That being said, I think nobody has (at least publicly) come up with a nice way to solve batching with AB search, though to my knowledge not many people have actually tried to solve this issue for AB search. If you do find a good way, please share it! Perhaps I'd even use it myself :wink:

In summary, your options are: 1) come up with a good mechanism for batching with AB search (and publish it!); 2) accept a performance hit on GPU (but not CPU) and rely on a batch size of 1 for inference; 3) switch to MCTS and check out how LC0 and Allie do it.
-Jonathan
GBrouwer
Posts: 3
Joined: Sat Mar 21, 2020 9:38 pm
Full name: Gertjan Brouwer

Re: How to work with batch size in neural network

Post by GBrouwer »

Apparently I have been using feature and feature vector interchangeably; my apologies.
As Pio suggested, batch size is normally used in training to train faster and prevent overfitting (I did not know about preventing overfitting by having a batch size greater than 1, good to know), and after that you use batch size 1 during inference. This was indeed the plan, but TensorFlow 2 actually baked the batch size into the model when I fitted it, so it is not possible to do inference with only one data sample. However, this might still work with a dynamic/variable batch size, which I will look into.
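For the record, this is what I mean by a variable batch size, assuming the batch dimension got fixed because the first layer was given a batch-level input shape (I still need to verify that this is actually what happened in our training script; the sizes below are made up):

Code:

import tensorflow as tf

n_features = 384   # made-up length of our feature vector, for illustration

# A first layer declared like this pins the batch dimension to 100 and is one
# way to end up with the "you must input 100 samples" error at predict time:
#   tf.keras.layers.Dense(256, batch_input_shape=(100, n_features))

# Declaring only the per-sample shape leaves the batch dimension as None,
# so the model can be trained with batch_size=100 but queried with 1 position:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dense(1, activation="tanh"),
])
model.compile(optimizer="adam", loss="mse")
print(model.input_shape)   # (None, 384): any batch size is accepted at inference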

I did not know about the Winter engine and will be looking into that.
As far as Giraffe goes, I was aware it only runs inference on the CPU. Matthew stated that transferring the feature vector to the GPU, running inference and returning the result was too slow, of course assuming a batch size of 1. I thought maybe someone had already improved on it somehow, but unfortunately not. I am, however, looking to implement the neural net with alpha-beta, or maybe vanilla minimax if I can't figure out a better system for using alpha-beta in combination with larger batches.

If I manage to come up with something I will definitely post it here, thanks a lot! :D
jorose
Posts: 358
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: How to work with batch size in neural network

Post by jorose »

A bit of a hack you can do, which is for sure not optimal but much better than pure minimax, is to do depth-1 minimax at the leaf nodes. I.e. if you have 30 legal moves in a position, that gives you a batch size of 30. This still allows you to take full advantage of the benefits of AB everywhere else.
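Roughly like this, as a sketch only: generate_moves, make_move, undo_move, encode and evaluate_terminal are stand-ins for your own engine code, and model.predict is assumed to accept an arbitrary batch size and to score positions from the side to move's point of view.

Code:

import numpy as np

def evaluate_leaf_batched(position, model):
    # Depth-1 minimax at the leaf: evaluate all children in one NN call,
    # so 30 legal moves give a batch of 30.
    moves = generate_moves(position)
    if not moves:
        return evaluate_terminal(position)     # mate/stalemate handling
    batch = []
    for move in moves:
        make_move(position, move)
        batch.append(encode(position))         # feature vector of the child position
        undo_move(position, move)
    scores = model.predict(np.array(batch)).ravel()   # one batched forward pass
    # Negamax convention: each score is from the child's side to move,
    # so the value of this node is the maximum of the negated child scores.
    return max(-float(s) for s in scores)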

Also note, depending on your NN architecture you might not need a quiescence search.

I will probably post a proper description of Winter's NN architecture this weekend. The source code might be useful for some inspiration. I need to update the training script, but an old version can be found here and relies on TensorFlow 2.0. Feel free to PM me if you have any specific questions!
-Jonathan