With no non-linearity, each layer is just a matrix multiplication, so you can collapse all the layers into one and get an equivalent linear function. Like you said, there are many things that cannot be modeled with a linear function.

brtzsnr wrote:
Geneva was the first version to use TensorFlow. Glarus and the current development branch improved evaluation quite a lot using this (e.g. backward pawns, king safety, knight & bishop psqt).

The NN I use is simply:

```python
import tensorflow as tf  # TensorFlow 1.x API

# x_data (features), y_data (targets) and p_data (phase per position)
# are assumed to be loaded beforehand.
WM = tf.Variable(tf.random_uniform([len(x_data[0]), 1]))  # weights used when P = 0
WE = tf.Variable(tf.random_uniform([len(x_data[0]), 1]))  # weights used when P = 1

xm = tf.matmul(x_data, WM)
xe = tf.matmul(x_data, WE)

P = tf.constant(p_data)
y = xm*(1-P)+xe*P        # interpolate the two linear scores by phase
y = tf.sigmoid(y/2)      # squash the score into (0, 1)

# Mean squared error plus a small L1 penalty on both weight vectors.
loss = tf.reduce_mean(tf.square(y - y_data)) + 1e-4*tf.reduce_mean(tf.abs(WM) + tf.abs(WE))
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train = optimizer.minimize(loss)
```

It takes 10 min for the weights to converge, and I need to train it twice: once with all search disabled, and once with quiescence search enabled. Before this, I implemented a general hill-climbing algorithm, but it converged very slowly (about 1 day) and the results were not always very good.
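To make the model above easier to follow, here is a minimal plain-Python sketch of the same forward pass and loss for a handful of samples. The function names and the sample layout `(features, phase, target)` are made up for illustration, and reading `WM`/`WE` as midgame/endgame weights is my assumption; this is not the author's training code, only the arithmetic it appears to perform.

```python
import math

def tapered_score(features, wm, we, phase):
    """Blend two linear scores: phase = 0 uses only wm, phase = 1 uses only we."""
    xm = sum(f * w for f, w in zip(features, wm))
    xe = sum(f * w for f, w in zip(features, we))
    return xm * (1 - phase) + xe * phase

def predict(features, wm, we, phase):
    """Squash the blended score into (0, 1), mirroring sigmoid(y / 2)."""
    return 1.0 / (1.0 + math.exp(-tapered_score(features, wm, we, phase) / 2))

def loss(samples, wm, we, l1=1e-4):
    """Mean squared error plus an L1 penalty averaged over the weights."""
    mse = sum((predict(f, wm, we, p) - y) ** 2 for f, p, y in samples) / len(samples)
    penalty = l1 * sum(abs(m) + abs(e) for m, e in zip(wm, we)) / len(wm)
    return mse + penalty

# Hypothetical toy data: two features, one position halfway between the phases.
samples = [([1, 0], 0.5, 0.5)]
print(tapered_score([1, 0], [2.0, 0.0], [4.0, 0.0], 0.5))
print(loss(samples, [0.0, 0.0], [0.0, 0.0]))
```

Without the final sigmoid, `tapered_score` is the value the engine's evaluation function would return for the position.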

Training this way is much faster than playing 100k games for SPSA, but it has the disadvantage that it limits the set of usable features. The NN should compute the same value as your evaluation function, without the sigmoid. For example, a linear NN won't be able to compare values as in the following code from Stockfish. Here you probably need a deeper NN.

```cpp
else if (    abs(eg) <= BishopValueEg
         &&  ei.pi->pawn_span(strongSide) <= 1
         && !pos.pawn_passed(~strongSide, pos.square<KING>(~strongSide)))
    sf = ei.pi->pawn_span(strongSide) ? ScaleFactor(51) : ScaleFactor(37);
```
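The Stockfish snippet only fires when several conditions hold at the same time. A toy version of that problem, with two hypothetical binary features `a` and `b` standing in for the real conditions, shows why a linear model struggles: fitting the three cases where the conjunction is false forces all the weights to zero, so the fourth case comes out wrong. A single hidden ReLU unit is enough to get all four cases right. This is only a sketch of the idea, not a model of the actual Stockfish terms.

```python
def relu(x):
    return max(0.0, x)

def conjunction_target(a, b):
    """1.0 only when both conditions hold, like a chain of && tests."""
    return 1.0 if (a and b) else 0.0

def linear_model(a, b, w1, w2, c):
    return w1 * a + w2 * b + c

def fit_linear_on_three_cases():
    """Solve w1*a + w2*b + c exactly on (0,0), (1,0) and (0,1)."""
    c = conjunction_target(0, 0)
    w1 = conjunction_target(1, 0) - c
    w2 = conjunction_target(0, 1) - c
    return w1, w2, c

def hidden_unit_model(a, b):
    """One hidden ReLU unit with weights (1, 1) and bias -1."""
    return relu(a + b - 1.0)

w1, w2, c = fit_linear_on_three_cases()
for a in (0, 1):
    for b in (0, 1):
        print(a, b, conjunction_target(a, b),
              linear_model(a, b, w1, w2, c), hidden_unit_model(a, b))
```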

With ReLU activation I found the sweet spot for Giraffe to be about 3 hidden layers. It took roughly 72 hours to converge, but 24-48 hours to get to a pretty good level.
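The point about purely linear layers collapsing into one can be checked with a few lines of plain Python. The matrices below are arbitrary small integers chosen for illustration (so the comparison is exact); they are not taken from Giraffe or any real network.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def relu_vec(v):
    return [max(0, x) for x in v]

W1 = [[1, -2], [3, 4]]   # first layer, 2 inputs -> 2 hidden values
W2 = [[2, 1]]            # second layer, 2 hidden values -> 1 output
x = [1, 1]

two_layer = matvec(W2, matvec(W1, x))             # apply the layers one by one
collapsed = matvec(matmul(W2, W1), x)             # apply the single merged matrix
with_relu = matvec(W2, relu_vec(matvec(W1, x)))   # ReLU between the layers

print(two_layer, collapsed, with_relu)
```

The first two results agree for every input, because `W2 * (W1 * x)` equals `(W2 * W1) * x`. With the ReLU in between, the negative hidden value is clipped and the output differs, so the two layers can no longer be merged into one matrix.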