Chess Programming Wiki accounts: how to get one?


dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Chess Programming Wiki accounts: how to get one?

Post by dragontamer5788 »

I don't know exactly where or who to ask for a ChessProgramming Wiki account, but it seems like their account-creation pages are blocked off. I'm thinking of maybe writing a thing or two in the GPU page (probably the only page I can contribute to on that entire Wiki). https://www.chessprogramming.org/GPU

In particular, C++ AMP is dead. Microsoft stopped updating it around 2014, and C++ AMP + HCC have been deprecated from ROCm. It's a technological dead end and probably should come off of the wiki. ROCm (AMD), CUDA (NVidia), and OpenCL 1.2 (portable, but old) are the technologies worth discussing, and others like OpenACC, SPIR, and OpenMP device offload deserve mentions too.

Some discussion of low-level operations would also be nice. Shfl is universal between AMD and NVidia (though not in standard OpenCL: you can access it, but you won't find much discussion of the instruction). Bit-reverse, LS1B, ffs (find first set), and other low-level instructions are all single-tick instructions in both NVidia PTX and AMD GCN assembly (with intrinsics to boot!), so a lot of bitwise chess algorithms can probably be ported easily.
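
To make the porting claim concrete, here is a minimal CUDA sketch (the HIP spelling is nearly identical) exercising those intrinsics on a 64-bit bitboard. The kernel and buffer names are hypothetical; the intrinsics themselves (__ffsll, __popcll, __brevll, __shfl_sync) are real CUDA device functions that each compile down to roughly one instruction.

#include <cstdint>

// Illustrative kernel: one thread per position, each holding that
// position's occupancy bitboard. Compile with: nvcc -arch=sm_30 ...
__global__ void scanBitboards(const uint64_t *boards, int *firstSq, int *count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    uint64_t bb = boards[idx];

    // find-first-set: 1-based index of the LS1B, 0 if the board is empty,
    // so this stores -1 for an empty board.
    firstSq[idx] = __ffsll((long long)bb) - 1;

    // population count: number of pieces on the board.
    count[idx] = __popcll(bb);

    // warp shuffle: broadcast lane 0's bitboard to all 32 lanes
    // without a round-trip through shared memory.
    uint64_t occ = __shfl_sync(0xffffffffu, bb, 0);

    // bit-reverse: handy for mirrored sliding-attack tricks.
    uint64_t rev = __brevll(bb);

    (void)occ; (void)rev;  // results unused in this sketch
}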
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Chess Programming Wiki accounts: how to get one?

Post by smatovic »

I am no admin, so I cannot add you; Gerd Isenberg is probably the one to ask...

I wrote some parts of the GPU article and admit that some of it is outdated. I have a more OpenCL-centric point of view and have not followed the recent developments of CUDA and ROCm, but the last time I checked, ROCm was in beta status, only available for Linux, and with hardware limitations like PCIe gen >= 3.0 and only newer AMD GPUs supported...

I guess the neural-network aspect of current and upcoming GPUs is more interesting: which platforms (Intel OneAPI, Apple Metal) will support which NN frameworks. Just my pov.

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Chess Programming Wiki accounts: how to get one?

Post by dragontamer5788 »

smatovic wrote: Wed Aug 07, 2019 1:16 pm I wrote some parts of the GPU article and admit that some of it is outdated. I have a more OpenCL-centric point of view and have not followed the recent developments of CUDA and ROCm, but the last time I checked, ROCm was in beta status, only available for Linux, and with hardware limitations like PCIe gen >= 3.0 and only newer AMD GPUs supported...
ROCm definitely still has its rough edges, but it does seem to be where AMD is putting most of their effort these days. I think ROCm represents the future of AMD's software support: HIP (a CUDA-like layer) and statically compiled OpenCL 2.0. Overall, AMD's software stack is just more limited than NVidia's.

AMD has two OpenCL implementations: the AMDGPU-Pro drivers and ROCm. The AMDGPU-Pro drivers haven't had any feature updates in years, however; the codebase seems to be in maintenance mode. In contrast, ROCm has an active GitHub page with somewhat responsive developers. There's a lot more motion on ROCm, so I'm comfortable declaring it the future of AMD's driver stack.

PCIe gen 3.0 is required for PCIe 3.0 atomic transactions, which let the CPU and GPU atomically operate on low-level primitives and share memory spaces. It's a good feature to standardize on. Older machines (PCIe 2.0) won't support it, but tight integration of CPU + GPU code will make for easier programming in the future as PCIe 3.0 (and PCIe 4.0) becomes widely deployed. (See the HIP_HOST_COHERENT environment variable and hipMalloc flags; a sketch of the pattern follows.)
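
Here is a minimal sketch of that CPU/GPU shared-counter pattern, written in CUDA terms since that's the spelling I know; under HIP the analogous allocation is hipHostMalloc with a coherent flag. The names below are made up for illustration, but atomicAdd_system and cudaMallocManaged are real CUDA APIs (system-scope atomics need compute capability 6.0+).

#include <cstdio>
#include <cuda_runtime.h>

// GPU threads atomically bump a counter in an allocation the CPU can
// also touch; "system" scope orders the update against host access.
__global__ void bumpCounter(int *counter)
{
    atomicAdd_system(counter, 1);
}

int main()
{
    int *counter = nullptr;
    cudaMallocManaged(&counter, sizeof(int));  // memory visible to CPU and GPU
    *counter = 0;

    bumpCounter<<<4, 64>>>(counter);           // 256 increments on the GPU
    cudaDeviceSynchronize();

    printf("counter = %d\n", *counter);        // expect 256
    cudaFree(counter);
    return 0;
}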

It sounds like AMD's plan is to eventually bring the ROCm OpenCL 2.0 compiler to Windows. All of the low-level details of ROCm have been upstreamed to clang as well as committed to AMD's ROCm GitHub, so progress on this front is open-source and public.
smatovic wrote: Wed Aug 07, 2019 1:16 pm I guess the neural-network aspect of current and upcoming GPUs is more interesting: which platforms (Intel OneAPI, Apple Metal) will support which NN frameworks. Just my pov.
Neural nets are certainly popular these days, but I don't have any experience with those frameworks, so I wouldn't be comfortable writing about them. For this generation of hardware, only NVidia Volta and Turing GPUs have hardware FP16 matrix-multiplication support (tensor cores). So you're limited to RTX 20xx series cards or the super-expensive V100 if you really want to go down that path.

AMD Vega and RDNA support FP16 dot-product and fused multiply-accumulate (FMA) instructions, but they're not as fast as the single-instruction 4x4 FP16 matrix-multiplication primitive in NVidia's Volta and Turing processors. In effect, NVidia can process a neural-net half-float matrix multiplication 16-at-a-time, while AMD Vega / RDNA only processes them 8-at-a-time.

But the 16-bit FMA instruction is extremely specialized. Outside of neural nets, 16-bit half floats are just too small and inaccurate to be effective for general problems. Neural nets, on the other hand, can remain effective at 8 bits or even 4 bits, so 16 bits is more than enough precision for those calculations.

I guess AMD Vega / RDNA does have 16-bit FMA instructions to accelerate deep learning; it's just not competitive in that workload against a dedicated 4x4 block like NVidia's implementation.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Chess Programming Wiki accounts: how to get one?

Post by dragontamer5788 »

dragontamer5788 wrote: Wed Aug 07, 2019 4:27 pm AMD Vega and RDNA support FP16 dot-product and fused multiply-accumulate (FMA) instructions, but they're not as fast as the single-instruction 4x4 FP16 matrix-multiplication primitive in NVidia's Volta and Turing processors. In effect, NVidia can process a neural-net half-float matrix multiplication 16-at-a-time, while AMD Vega / RDNA only processes them 8-at-a-time.
I'm bad at counting.

NVidia's 4x4 FP16 matrix instruction performs 112 operations per 4x4 operation (64 multiplies plus 48 adds for a 4x4 by 4x4 matrix multiply), or 128 if you count the fused 4x4 matrix-add. It's really a "fused multiply/add 4x4 matrix" instruction: D = A*B + C.

Probably best if I just linked the PTX instruction: https://docs.nvidia.com/cuda/parallel-t ... structions
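
For anyone who prefers the CUDA-level spelling over raw PTX, here is a minimal sketch of the warp-level wmma API that sits on top of those mma instructions (sm_70+). Note that the CUDA API exposes 16x16x16 warp tiles rather than the individual 4x4 hardware blocks; the kernel name and leading dimensions below are illustrative.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile pair and accumulates into FP32,
// the basic tensor-core building block. Compile with: nvcc -arch=sm_70 ...
__global__ void wmmaTile(const __half *a, const __half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // C = 0
    wmma::load_matrix_sync(aFrag, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // D = A*B + C on tensor cores
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g. wmmaTile<<<1, 32>>>(dA, dB, dC);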
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Chess Programming Wiki accounts: how to get one?

Post by smatovic »

https://images.nvidia.com/content/volta ... epaper.pdf

Pages 18 and 19 also give a good overview of Nvidia's Tensor Cores.

--
Srdja