It could be beneficial to transpose the input layer, something I want to try when I start working on optimization. I'm not interested in the network topology they use in Stockfish either. I'm experimenting with a fully connected net with 736 inputs, which seems to perform a lot better than my old HCE.

mar wrote: ↑Sun Jan 31, 2021 12:13 pm
My point was simply that the inference can be efficient without non-portable intrinsics. Also, transposed weights have another advantage (point 1).
I never liked the idea of having an accumulator for "efficient updates": extra boilerplate and extra state you drag around with your board representation.
That being said, I'm not interested in the "NNUE" topology; for starters I think it makes sense to experiment with much smaller nets with a "natural" topology, in the vein of what Halogen does. YMMV, of course.
Also, if you don't transpose, then (if I'm not mistaken) you eat lots of cache misses in the "efficient update" part, because the output weights are far away from each other in memory.
Besides, if I'm not mistaken, this thread is all about inference.
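To make the transposed-weights point concrete, here is a minimal sketch of the "efficient update" path, assuming a net like the 736-input one mentioned above (the names HIDDEN, weights, and update_accumulator are made up for illustration). With the matrix stored as weights[feature][neuron], all the weights touched when a single input feature flips form one contiguous run of memory; in the untransposed layout the same update strides across the entire matrix, roughly one cache line per neuron.

```cpp
#include <cstdint>

constexpr int INPUTS = 736;   // input features, as in the net described above
constexpr int HIDDEN = 256;   // hypothetical first-layer width

// Stored transposed: weights[feature][neuron]. The HIDDEN weights for one
// feature are contiguous, so an incremental update is a single streaming pass.
extern int16_t weights[INPUTS][HIDDEN];

// Incrementally update the first-layer accumulator when one input feature
// turns on (sign = +1) or off (sign = -1), e.g. after a piece moves.
void update_accumulator(int16_t* acc, int feature, int sign)
{
    const int16_t* row = weights[feature];
    for (int i = 0; i < HIDDEN; ++i)
        acc[i] += static_cast<int16_t>(sign * row[i]);
}
```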
Since I don't like Python stuff and writing a trainer from scratch would be very time-consuming, I've written the trainer in C++ using Facebook's libTorch library (which is mostly Caffe2). Last week I got my RTX 3090 Turbo; this brought the training time down by almost a factor of 3, which is very convenient: it enables me to do several tests per day.
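In case anyone is curious, the core of such a libTorch trainer is quite compact. This is a minimal sketch, not my actual trainer: the hidden layer sizes, the random data, and the hyperparameters are placeholders; only the 736-input shape matches what I described above.

```cpp
#include <torch/torch.h>

// Hypothetical fully connected eval net: 736 inputs -> 1 output.
// The hidden sizes (256, 32) are placeholders, not the real topology.
struct EvalNet : torch::nn::Module {
    torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr};
    EvalNet() {
        fc1 = register_module("fc1", torch::nn::Linear(736, 256));
        fc2 = register_module("fc2", torch::nn::Linear(256, 32));
        fc3 = register_module("fc3", torch::nn::Linear(32, 1));
    }
    torch::Tensor forward(torch::Tensor x) {
        x = torch::relu(fc1->forward(x));
        x = torch::relu(fc2->forward(x));
        return fc3->forward(x);
    }
};

int main() {
    // Train on the GPU (e.g. the RTX 3090) when available.
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);
    auto net = std::make_shared<EvalNet>();
    net->to(device);
    torch::optim::Adam opt(net->parameters(), torch::optim::AdamOptions(1e-3));

    // Placeholder batch; a real trainer streams positions and target evals.
    auto inputs  = torch::rand({4096, 736}, device);
    auto targets = torch::rand({4096, 1}, device);

    for (int epoch = 0; epoch < 100; ++epoch) {
        opt.zero_grad();
        auto loss = torch::mse_loss(net->forward(inputs), targets);
        loss.backward();
        opt.step();
    }
}
```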
For the inference code I stick with AVX2, because AMD doesn't support AVX-512. Out of curiosity I've been experimenting with AVX-512 on my i9-10980XE; it's somewhat faster, but certainly not by a factor of 2. Maybe the VNNI dot product itself runs twice as fast, but all the surrounding code (like the horizontal add) still adds a lot of overhead.
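For reference, the kind of AVX2 dot-product kernel I mean looks roughly like this. A simplified sketch, not my actual inference code: it assumes activations quantized to unsigned 8-bit (e.g. clipped ReLU output in [0, 127]), signed 8-bit weights, and a vector length that is a multiple of 32.

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of n u8 activations with n s8 weights (n a multiple of 32).
int32_t dot_u8s8_avx2(const uint8_t* x, const int8_t* w, int n)
{
    __m256i acc = _mm256_setzero_si256();
    const __m256i ones = _mm256_set1_epi16(1);
    for (int i = 0; i < n; i += 32) {
        __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x + i));
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(w + i));
        // u8 * s8 -> pairwise-summed s16 (no overflow with activations <= 127)
        __m256i prod = _mm256_maddubs_epi16(a, b);
        // widen s16 pairs to s32 and accumulate
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(prod, ones));
    }
    // horizontal add of the 8 lanes: this tail is part of the overhead
    // that doesn't shrink when the multiply itself gets faster
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);
}
```

With AVX-512 VNNI the maddubs/madd pair collapses into a single _mm512_dpbusd_epi32 on twice-as-wide registers, which is where the theoretical factor of 2 comes from; the loads, the loop, and the final horizontal reduction remain, so the kernel as a whole doesn't speed up by that factor.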