fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?
That is pretty much the standard approach for a single output value. For training, consider using a sigmoid activation instead of tanh, then switching to tanh (or just using the raw logit without activation) for gameplay. In that case you would use a cross-entropy loss and assign targets of 1/0.5/0 for win/draw/loss.
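A minimal sketch of this idea in plain Python (the function names here are mine, not from any engine): train against sigmoid outputs on 1/0.5/0 targets, then use tanh on the same logit for a (-1, 1) evaluation at play time. The two views agree up to an affine rescale, since tanh(x/2) = 2·sigmoid(x) − 1.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_output(logit):
    # Sigmoid squashes the logit into (0, 1): targets are 1 / 0.5 / 0 for W/D/L.
    return sigmoid(logit)

def play_output(logit):
    # For gameplay, rescale to (-1, 1). Because tanh(x/2) == 2*sigmoid(x) - 1,
    # tanh on half the logit preserves the ordering of evaluations exactly.
    return math.tanh(logit / 2.0)

# The two views agree after the affine rescale 2*p - 1:
logit = 1.3
assert abs(2.0 * train_output(logit) - 1.0 - play_output(logit)) < 1e-12
```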
For multiple output values (e.g. W/D/L) the standard approach is a softmax activation paired with a cross-entropy loss. In Winter I have two outputs (W and W+D, which trivially imply W/D/L), each activated by a sigmoid with a cross-entropy loss. Whether this non-standard approach is still worth it at the moment, I cannot say.
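Both heads can be sketched in a few lines of plain Python (the example logits here are made up; a real network would produce them from its last hidden layer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, target):
    # target: the label distribution, e.g. one-hot [1, 0, 0] for a won game
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# Standard head: three logits -> softmax -> a W/D/L distribution.
wdl = softmax([1.5, 0.2, -0.9])
loss = cross_entropy(wdl, [1.0, 0.0, 0.0])

# Winter-style head: two sigmoid outputs, P(win) and P(win or draw),
# from which the full W/D/L distribution follows.
p_w = sigmoid(1.5)
p_wd = sigmoid(2.0)  # must be >= p_w for a consistent distribution
w, d, l = p_w, p_wd - p_w, 1.0 - p_wd
```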
fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
2) Do I code my NN myself in C or do I use libraries for this? If libraries, which ones (BLAS?)? It would seem to me that it is very easy to implement the NN code "by hand" (just a few matrix multiplications) but perhaps it will be inefficient if not using the right libraries?
For training I would recommend a high-level library such as TensorFlow 2.0 (which I currently use) or PyTorch. This has several benefits. You can easily experiment with different architectures containing any number of state-of-the-art features you might not want to implement yourself unless the architecture looks like a clear improvement. It is a natural sanity check, as you can compare the output of the network in your engine against the library's output for some fixed input. Finally, you don't need to implement any training algorithms yourself. To me at least, the training algorithms are significantly more complicated than implementing support for inference, so they have a bigger potential for bugs, as well as for making your codebase larger and more convoluted.
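The sanity check mentioned above can be illustrated with a toy dense layer in plain Python (a real setup would compare your engine's C code against TensorFlow or PyTorch; here both sides are Python for the sake of a self-contained sketch): run the same fixed input through a "reference" forward pass and your hand-rolled "engine" version and check they agree.

```python
import random

random.seed(0)

IN, OUT = 4, 3
W = [[random.uniform(-1, 1) for _ in range(IN)] for _ in range(OUT)]
b = [random.uniform(-1, 1) for _ in range(OUT)]
x = [random.uniform(-1, 1) for _ in range(IN)]

def dense_reference(x):
    # "Library-style" forward pass: y = W x + b, then ReLU.
    return [max(0.0, sum(W[o][i] * x[i] for i in range(IN)) + b[o])
            for o in range(OUT)]

def dense_engine(x):
    # Hand-rolled "engine" version: accumulate column by column, as an
    # engine might when updating activations incrementally.
    y = list(b)
    for i in range(IN):
        xi = x[i]
        for o in range(OUT):
            y[o] += W[o][i] * xi
    return [max(0.0, v) for v in y]

# The fixed-input comparison: both paths must produce identical activations.
assert all(abs(a - c) < 1e-12 for a, c in zip(dense_reference(x), dense_engine(x)))
```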
As for inference, different projects rely on libraries at different levels of abstraction.
I don't know of any projects that rely directly on high-level libraries for neural network inference. I think that is a fine route if you don't need the absolute maximum of performance and customization. The main downside is that something like NNUE, where you are doing some unusual tricks, might be hard or even impossible to do. Furthermore, you might not actually save a lot of work if you are working in a low-level language: I find the documentation for using high-level libraries from C/C++ a bit lacking, and most of the functions you have to implement for inference are not that complicated anyway.
On GPU, LC0 relies on CUDA and cuDNN, or alternatively on OpenCL. If you want to support GPUs, I would suggest you do the same. GPU support is likely not worth it for you if you implement an NNUE-based system or just a small fully connected network.
On CPU, LC0 relies on BLAS. As far as I know, BLAS backends can be used interchangeably and share the same basic interface.
Allie relies on the LC0 code for its neural network implementation. This has been a minor point of discussion in debates about cloning, but that is outside the scope of this thread. I mainly mention it because all the engines I name in this post are open source, which might give you another reference point.
I don't think SF-NNUE relies on BLAS; instead it works directly with hand-optimized code and SIMD CPU instructions.
Giraffe relies on Eigen. I am not really familiar with Eigen aside from knowing that it is a general purpose linear algebra library. If we are lucky, Matthew Lai could pop in and explain its benefits and whether he would recommend using that.
Winter doesn't rely on an external backend for inference and in fact doesn't even explicitly use SIMD instructions. The primary reason is that, the last time I checked the generated assembly, the compiler seemed to be vectorizing reasonably well. Somewhere in the middle of my TODO list is adding explicit SIMD support so I don't have to trust the compiler anymore.
fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (
http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100). When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?
Based on your description I think you are close. Perhaps you should read up some more on backpropagation, as you should be calculating the derivatives analytically at every layer rather than nudging each weight to estimate a gradient. Backpropagation is more or less just a clever, optimized use of the chain rule for derivatives.
It might help to write the neural network in the form L(A(B(C(x))), y), where L is the loss function, A, B and C are the layers with their respective weight parameters, x is the input and y is the target label. Then think about how you would calculate the derivatives needed to update the parameters in A, B and C respectively.
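Here is a scalar toy version of L(A(B(C(x))), y) in plain Python (every "layer" has a single weight; the specific functions and constants are made up for illustration). The backward pass applies the chain rule from the loss inwards, and a finite-difference check at the end confirms it matches the kind of numerical nudging a Texel-style tuner would do.

```python
import math

# Toy network: C(x) = wc*x ; B(u) = tanh(wb*u) ; A(v) = wa*v ; L(z, y) = (z - y)^2
wa, wb, wc = 0.7, -0.3, 1.1
x, y = 0.5, 0.2

def forward(wa, wb, wc):
    c = wc * x
    bpre = wb * c
    bout = math.tanh(bpre)
    a = wa * bout
    loss = (a - y) ** 2
    return c, bpre, bout, a, loss

c, bpre, bout, a, loss = forward(wa, wb, wc)

# Backpropagation: chain rule, outermost function first.
dL_da = 2.0 * (a - y)                    # d/da of (a - y)^2
dL_dwa = dL_da * bout                    # a = wa * bout
dL_dbout = dL_da * wa
dL_dbpre = dL_dbout * (1.0 - bout ** 2)  # d tanh(u)/du = 1 - tanh(u)^2
dL_dwb = dL_dbpre * c                    # bpre = wb * c
dL_dc = dL_dbpre * wb
dL_dwc = dL_dc * x                       # c = wc * x

# Numerical check against a central finite difference on the innermost weight.
eps = 1e-6
num = (forward(wa, wb, wc + eps)[4] - forward(wa, wb, wc - eps)[4]) / (2 * eps)
assert abs(num - dL_dwc) < 1e-6
```

The point of the exercise: the analytic gradients cost one backward sweep for all weights, whereas finite differences need a full re-evaluation per parameter, which is what makes backpropagation practical for networks with many weights.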
fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
4) From reading about Stockfish NNUE I get the impression that they are not doing a regression vs game results, but rather vs the evaluation of a Stockfish search to X ply and try to learn that rather than the game result, which is different than the Texel tuning method. Is this distinction of trying to learn search eval vs trying to learn game results actually relevant or are the two +/- equal?
The distinction is somewhat relevant. For more information, read up on temporal difference learning, which is, broadly speaking, what SF-NNUE is doing. Giraffe did another form of this, and I have previously used temporal difference learning in Winter as well.
The main advantages are that such an approach needs less data (you can train on positions for which you have no game outcome) and that your labels can end up more accurate. For example, the game result alone cannot tell you that a position is materially unbalanced but dynamically equal.
Temporal difference learning and other forms of reinforcement learning have their own issues, though, and I would recommend at least starting out with pure supervised learning of game outcomes. Once you know that is working, you can try temporal difference learning or self-play for comparison.
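One of the simplest temporal-difference variants can be sketched in a couple of lines (this is a TD(0)-style illustration of the idea only; SF-NNUE and Giraffe use more elaborate schemes, and the function name is mine): each position in a game is labelled with the evaluation of the position that followed it, and the final position gets the actual game result.

```python
def td_labels(evals, final_result):
    # evals: squashed search evaluations (expected scores in [0, 1]) for the
    # successive positions of one game, all from the same side's view.
    # Each position's label is the next position's evaluation; the last
    # position is labelled with the game result itself.
    return evals[1:] + [final_result]

labels = td_labels([0.5, 0.55, 0.7, 0.9], 1.0)
# -> [0.55, 0.7, 0.9, 1.0]
```

Note how information from the outcome only propagates one step backwards per pass; repeated iterations (or more sophisticated TD(λ) blending) are what spread it through the whole game.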
At some point the best approach in Winter was to label positions with a weighted average of a low-depth search score and the actual game outcome.
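A sketch of such a blended label in plain Python (the function names, the logistic squashing of centipawns, and the scaling constant are my own illustrative choices, not Winter's actual values; a real tuner would fit the constant to its own engine's data):

```python
def expected_score(cp, k=1.13):
    # Map a centipawn evaluation to an expected score in (0, 1) via a
    # logistic curve; k is a scaling constant you would fit to your own
    # data (the value here is made up for illustration).
    return 1.0 / (1.0 + 10.0 ** (-k * cp / 400.0))

def blended_label(cp, game_result, weight=0.5):
    # Weighted average of the squashed low-depth search score and the
    # actual game outcome (1 / 0.5 / 0), as described above.
    return weight * expected_score(cp) + (1.0 - weight) * game_result

# A drawn-looking eval (0 cp) in a game that was eventually won:
label = blended_label(0, 1.0, weight=0.5)
# -> 0.75
```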
fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
Martin-who-didn't-realize-you-were-in-Indiana! Hope you are doing well there!
Indeed I am in Indiana and doing ok. I just moved out of my previous apartment and will be moving into my new one after a brief stay with friends. Hopefully I will be able to visit everyone back home for a month or so around Christmas!