
Re: A Crossroad in Computer Chess; Or Desperate Flailing for Relevance

Posted: Wed Oct 07, 2020 11:11 pm
by Tony P.
Actually, one of the advantages of the ReLU activation is that it maps all the negative inputs to the same zero output, so if an input was negative and remains negative after an incremental update, then the output remains zero, and fewer recalculations are needed in further layers.
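A quick sketch of what that buys you (hypothetical names, not actual engine code): a neuron whose pre-activation stays negative keeps a zero output, so its contribution to the next layer does not need to be touched; only neurons whose ReLU output actually changed have to push an update downstream.

#include <vector>
#include <algorithm>

// Sketch: propagate only the neurons whose ReLU output changed after an
// incremental update. weights[i][j] = weight from neuron i to next-layer j.
void propagate_changed(const std::vector<float>& pre_old,
                       const std::vector<float>& pre_new,
                       const std::vector<std::vector<float>>& weights,
                       std::vector<float>& next_accum)
{
    for (size_t i = 0; i < pre_new.size(); ++i) {
        float out_old = std::max(0.0f, pre_old[i]);
        float out_new = std::max(0.0f, pre_new[i]);
        if (out_old == out_new)            // e.g. negative before and after -> output stays 0
            continue;                      // nothing to recompute downstream
        float delta = out_new - out_old;
        for (size_t j = 0; j < next_accum.size(); ++j)
            next_accum[j] += delta * weights[i][j];
    }
}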

Re: A Crossroad in Computer Chess; Or Desperate Flailing for Relevance

Posted: Wed Oct 07, 2020 11:29 pm
by mar
this is true (unless you use something like leaky ReLU), but the vast majority of the work is done in the 1st layer
the point is that incremental updates save orders of magnitude of work.
according to CPW, NNUE has 41k inputs and the 1st layer has 512 neurons, so instead of doing 512 dot products over 41k elements in the 1st layer (vectorized or not), you only do a couple of incremental updates
so NNUE has a topology of 41k inputs, hidden layers of 512, 32, 32, and 1 output
that's 41k*512 weights for the 1st layer, then 512*32 + 32*32 + 32*1 for the rest
so the work for the later layers is on the order of 17.5k multiply-adds, while the 1st layer is ~21m (3 orders of magnitude more)
also, biases are per neuron (the input layer is passed as is), so the cost of adding a bias + applying the activation fn is also negligible in this context

EDIT: forgot that the incremental update can of course be generalized as cached_dot_product + (new - old) * weight
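To make that concrete, here is a minimal sketch of such a cached first-layer accumulator (names, types and layout are illustrative, not the actual NNUE code):

#include <array>
#include <cstdint>
#include <vector>

constexpr int N_INPUTS = 41024;  // ~41k input features (per CPW)
constexpr int N_HIDDEN = 512;    // 1st-layer neurons

struct Accumulator {
    std::array<int32_t, N_HIDDEN> v{};  // cached 1st-layer dot products

    // weights: N_INPUTS * N_HIDDEN, one contiguous row per input feature.
    // A single input flipped from old_val to new_val (0/1 for NNUE features).
    void update(const std::vector<int16_t>& weights,
                int input, int old_val, int new_val) {
        int delta = new_val - old_val;
        if (delta == 0) return;                        // nothing changed
        const int16_t* row = &weights[size_t(input) * N_HIDDEN];
        for (int n = 0; n < N_HIDDEN; ++n)
            v[n] += delta * row[n];                    // cached + (new - old) * weight
    }
};
// cost per changed input: 512 multiply-adds, vs ~41k*512 ≈ 21m for a full
// recompute of the 1st layer; the later layers (512*32 + 32*32 + 32*1 ≈ 17.5k)
// are cheap either way.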

Re: A Crossroad in Computer Chess; Or Desperate Flailing for Relevance

Posted: Thu Oct 08, 2020 12:33 am
by mmt
Tony P. wrote: Wed Oct 07, 2020 11:11 pm
Actually, one of the advantages of the ReLU activation is that it maps all the negative inputs to the same zero output, so if an input was negative and remains negative after an incremental update, then the output remains zero, and fewer recalculations are needed in further layers.
Zeros would help if the matrix is sparse and the program takes advantage of this, not so much when it's treated as a dense matrix.
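A rough illustration of that point (hypothetical code, not from any engine): a dense matvec does the same number of multiply-adds no matter how many activations are zero, while a zero-skipping variant only wins if the activations are actually sparse and the branch doesn't defeat vectorization.

#include <cstdint>

// Dense: cost is rows*cols regardless of how many act[] entries are 0.
void matvec_dense(const int8_t* act, const int8_t* w, int32_t* out,
                  int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        int32_t sum = 0;
        for (int c = 0; c < cols; ++c)
            sum += act[c] * w[r * cols + c];   // multiplying by 0 still costs a MAC
        out[r] = sum;
    }
}

// Sparse-aware: skips zero activations (the ReLU zeros), but the branch
// tends to prevent SIMD, so it only pays off when the vector is very sparse.
void matvec_skip_zeros(const int8_t* act, const int8_t* w, int32_t* out,
                       int rows, int cols) {
    for (int r = 0; r < rows; ++r) out[r] = 0;
    for (int c = 0; c < cols; ++c) {
        if (act[c] == 0) continue;             // zero output skipped entirely
        for (int r = 0; r < rows; ++r)
            out[r] += act[c] * w[r * cols + c];
    }
}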