A Crossroad in Computer Chess; Or Desperate Flailing for Relevance


Tony P.
Posts: 216
Joined: Sun Jan 22, 2017 8:30 pm
Location: Russia

Re: A Crossroad in Computer Chess; Or Desperate Flailing for Relevance

Post by Tony P. »

Actually, one of the advantages of the ReLU activation is that it maps all the negative inputs to the same zero output, so if an input was negative and remains negative after an incremental update, then the output remains zero, and fewer recalculations are needed in further layers.
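A rough sketch of what I mean (made-up names, nothing from any actual engine): one neuron's pre-activation sum is cached and updated incrementally, and since ReLU maps every negative value to zero, an update that leaves the sum negative changes nothing downstream.

```cpp
#include <algorithm>
#include <cstdint>

// Cached weighted sum (pre-activation) for one neuron; made-up example value.
static int32_t pre_activation = -120;

// ReLU: every negative input maps to the same zero output.
static int32_t relu(int32_t x) { return std::max<int32_t>(x, 0); }

static void apply_update(int32_t delta) {
    int32_t old_out = relu(pre_activation);
    pre_activation += delta;             // incremental update of the cached sum
    int32_t new_out = relu(pre_activation);
    if (new_out == old_out)
        return;                          // was negative, still negative: output stays 0,
                                         // so no recalculation is needed in further layers
    // ... otherwise propagate (new_out - old_out) into the next layer ...
}
```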
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A Crossroad in Computer Chess; Or Desperate Flailing for Relevance

Post by mar »

this is true (unless you use something like leaky ReLU), but the vast majority of work is done on the 1st layer
the point is that incremental updates will save orders of magnitude of work.
according to CPW, NNUE has 41k inputs and 1st layer has 512 neurons, so instead of doing 512x dot prod on 41k elements (doesn't matter if vectorized) in 1st layer, you only do a couple of incremental updates
so NNUE has a topology of 41k inputs, hidden layers of 512, 32 and 32 neurons, and 1 output
so that's 41k*512 weights in the 1st layer, then 512*32 + 32*32 + 32*1 for the rest
so the work for the later layers is on the order of 17.5k multiplies while the 1st layer is 21M (3 orders of magnitude more)
also, biases are per neuron (excluding the input layer which is passed as is), so the cost of doing a bias + activation fn is also negligible in this context

EDIT: forgot that the incremental update can of course be generalized as cached_dot_product + (new - old) * weight
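something like this (a generic sketch using the CPW sizes above, not the actual NNUE code):

```cpp
#include <cstdint>

// Sizes taken from the CPW numbers quoted above (made-up array names).
constexpr int INPUTS  = 41024;
constexpr int NEURONS = 512;

int16_t weights[INPUTS][NEURONS];   // 1st-layer weights
int32_t accumulator[NEURONS];       // cached dot products, one per 1st-layer neuron

// A move changes only a handful of inputs, so instead of recomputing
// 512 dot products over 41k elements, apply cached + (new - old) * weight
// for each changed feature: ~512 multiply-adds per change instead of ~21M.
void update_input(int feature, int old_value, int new_value) {
    int delta = new_value - old_value;  // for 0/1 inputs this is just +1 or -1
    for (int j = 0; j < NEURONS; ++j)
        accumulator[j] += delta * weights[feature][j];
}
```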
Martin Sedlak
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: A Crossroad in Computer Chess; Or Desperate Flailing for Relevance

Post by mmt »

Tony P. wrote: Wed Oct 07, 2020 11:11 pm Actually, one of the advantages of the ReLU activation is that it maps all the negative inputs to the same zero output, so if an input was negative and remains negative after an incremental update, then the output remains zero, and fewer recalculations are needed in further layers.
Zeros would help if the matrix is sparse and the program takes advantage of that, but not so much when it treats the matrix as dense.
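For example, a plain dense loop multiplies the zeros like everything else; you only win if you explicitly skip them, and that only pays off if most inputs really are zero (rough sketch with made-up sizes):

```cpp
#include <cstdint>

constexpr int IN  = 512;   // made-up layer sizes for illustration
constexpr int OUT = 32;

int8_t  input[IN];         // ReLU outputs, many of them zero
int8_t  weights[OUT][IN];
int32_t output[OUT];

// Dense version: the zero inputs are multiplied like everything else,
// so ReLU's zeros buy nothing here.
void forward_dense() {
    for (int o = 0; o < OUT; ++o) {
        int32_t sum = 0;
        for (int i = 0; i < IN; ++i)
            sum += input[i] * weights[o][i];
        output[o] = sum;
    }
}

// Sparse-aware version: skip the zero inputs explicitly. This only helps
// if the activations really are mostly zero, and the branch tends to get
// in the way of vectorization.
void forward_skip_zeros() {
    for (int o = 0; o < OUT; ++o)
        output[o] = 0;
    for (int i = 0; i < IN; ++i) {
        if (input[i] == 0)
            continue;
        for (int o = 0; o < OUT; ++o)
            output[o] += input[i] * weights[o][i];
    }
}
```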