Follow-up note: the constant cache may be serialized on Nvidia architectures when different threads of a warp access different addresses; AMD differs here. In real-world practice, though, chess lookup tables will also see differing addresses in scratch-pad memory, which likewise results in serialized access because the accesses are not spread evenly across the memory banks, i.e. bank conflicts (64 squares, 8-byte aligned, vs. 32 threads, 4-byte aligned)... It depends on the real-world use case. Anyway, in most cases constant cache or scratch-pad memory should be faster than global memory (VRAM) access.

smatovic wrote: ↑Tue Apr 19, 2022 5:18 pm
2 cents: constant memory is about 64 KB, but only 8 to 16 KB are cached. Latency is higher than registers but should be in the range of shared/scratch-pad memory... depends on the concrete architecture ofc... but yes, on GPU, computation over lookup, or alike.

dangi12012 wrote: ↑Tue Apr 19, 2022 2:53 pm
[...]
What made it much faster:
Constant memory is really slow, don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory. This is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6B to 52.79B just from this change alone. Shared memory is limited to around 48 KB, so Black Magic - Fixed shift and Pext Lookup really do not belong on GPUs.
[...]
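To put rough numbers on the serialization point (64 squares with 8-byte entries vs. 32 threads and 4-byte banks), here is a small host-side C++ sketch. It is only a simplified model and not vendor-exact: it assumes 32 banks of 4 bytes each, that reads of the same word are broadcast, and that a broadcast-style constant cache needs one pass per distinct address in the warp. The names `sharedPasses` and `constantPasses` are made up for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Assumed model parameters (not taken from any specific GPU datasheet).
constexpr int kBanks = 32;      // number of scratch-pad memory banks
constexpr int kBankWidth = 4;   // bytes served by one bank per pass

// Passes needed when every thread of a warp reads table[square] from banked
// scratch-pad memory. The bank with the most distinct words decides; threads
// hitting the same word are assumed to be served by one broadcast.
int sharedPasses(const std::vector<int>& squares, int entryBytes) {
    assert(entryBytes % kBankWidth == 0);  // model only covers 4, 8, ... bytes
    std::array<std::set<int>, kBanks> wordsPerBank;
    const int wordsPerEntry = entryBytes / kBankWidth;
    for (int sq : squares) {
        for (int w = 0; w < wordsPerEntry; ++w) {
            const int word = sq * wordsPerEntry + w;
            wordsPerBank[word % kBanks].insert(word);
        }
    }
    std::size_t worst = 0;
    for (const auto& words : wordsPerBank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}

// Passes needed for a broadcast-style constant cache under the assumption
// that each distinct address in the warp costs one serialized pass.
int constantPasses(const std::vector<int>& squares) {
    const std::set<int> distinct(squares.begin(), squares.end());
    return static_cast<int>(distinct.size());
}
```

Under these assumptions, a warp whose 32 threads read squares 0 to 31 from a table of 8-byte entries needs 2 passes in the scratch-pad model and 32 in the constant-cache model, while a warp in which all threads read the same square needs 1 pass in both. Real hardware will differ in the details, so this only shows the direction of the effect for a given access pattern.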
--
Srdja