You store the tree in the GPU? I am very surprised. My intuition is that managing the search tree with the CPU should be much more efficient, but I may be wrong.
trying to understand mcts
Moderators: hgm, Rebel, chrisw
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: trying to understand mcts
-
- Posts: 2645
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: trying to understand mcts
Rémi Coulom wrote: ↑Thu Oct 03, 2019 11:56 am
You store the tree in the GPU? I am very surprised. My intuition is that managing the search tree with the CPU should be much more efficient, but I may be wrong.

Yes, in GPU.
Zeta v097 and v098 were an attempt to make use of thousands of parallel GPU threads via a parallel best-first minimax search with a classic evaluation.
The aim was to avoid host-device latencies, so the whole move generation, search, and evaluation were done on the GPU; only the board, history, and parameters were transmitted in, and the best move back.
It was experimental and buggy, but it played decent chess...
https://github.com/smatovic/Zeta/tree/v098
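The best-first minimax search described above (cf. Korf and Chickering's algorithm) can be sketched sequentially like this. This is only an illustrative sketch: the `Node` layout and the `gen_moves`/`evaluate` callbacks are hypothetical placeholders, and the real point of Zeta is that thousands of such traversals run in parallel on the GPU.

```python
# Minimal sequential sketch of best-first minimax search.
# Node layout and callbacks are hypothetical, not Zeta's actual code.
class Node:
    def __init__(self, move=None, parent=None):
        self.move, self.parent = move, parent
        self.children = []
        self.value = 0.0  # minimax value from this node's side to move

def select_leaf(root):
    """Follow the best-looking child at every level to an unexpanded node."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: -c.value)  # negamax view
    return node

def expand_and_backup(leaf, gen_moves, evaluate):
    """Expand a leaf, then back fresh minimax values up to the root."""
    for move in gen_moves(leaf):
        child = Node(move, leaf)
        child.value = evaluate(child)
        leaf.children.append(child)
    node = leaf
    while node is not None:
        if node.children:
            node.value = max(-c.value for c in node.children)
        node = node.parent
```

Repeating select/expand/backup grows the tree toward the principal variation, which is why the approach maps reasonably onto many independent GPU threads.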
--
Srdja
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: trying to understand mcts
Very interesting, thanks.
When using large neural networks, evaluation is so slow that data transfers between CPU and GPU have very little cost.
To go back to the main topic: I have never tried MCTS with random playouts in chess. It might be interesting and fun to try, but it probably won't be very strong. But MCTS works well with the Alpha Zero approach.
I recommend not using the UCB formula for MCTS. The formula used by Alpha Zero is different, and better.
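For reference, the two selection formulas differ roughly like this (a sketch: `c_uct`/`c_puct` are tunable constants, and `prior` comes from the policy network in the AlphaZero case):

```python
import math

def ucb1(q, n_child, n_parent, c=1.414):
    # Classic UCB1: exploration term shrinks with ln(parent)/child visits.
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def puct(q, prior, n_child, n_parent, c=1.5):
    # AlphaZero-style PUCT: exploration is scaled by the network's
    # policy prior, and remains defined even when n_child == 0.
    return q + c * prior * math.sqrt(n_parent) / (1 + n_child)
```

One practical difference: PUCT can rank unvisited moves by their priors, while UCB1 requires every child to be visited at least once before the formula applies.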
-
- Posts: 201
- Joined: Thu Jun 06, 2019 8:05 pm
- Full name: Percival Tiglao
Re: trying to understand mcts
Rémi Coulom wrote: ↑Thu Oct 03, 2019 2:43 pm
Very interesting, thanks.
When using large neural networks, evaluation is so slow that data transfers between CPU and GPU have very little cost.

GPUs have far higher-bandwidth RAM than CPUs. Case in point: a $400-class GPU like the Vega 64 has 480 GB/s of bandwidth, while the NVidia 2060 SUPER (also $400) has 448 GB/s. In contrast, typical desktop CPUs have only 50 GB/s or less (dual-channel DDR4-2666 works out to about 42 GB/s theoretical). A quad-channel HEDT platform (Threadripper, Cascade Lake-X) may reach 100 GB/s, but that is still far less bandwidth than a GPU.
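The dual-channel figure follows from the memory arithmetic (a quick sanity check):

```python
# Theoretical dual-channel DDR4-2666 bandwidth:
# 2666 MT/s transfer rate x 8 bytes per transfer per channel x 2 channels.
mt_per_s = 2666e6
bytes_per_transfer = 8  # 64-bit channel
channels = 2
bw = mt_per_s * bytes_per_transfer * channels  # bytes/s
print(bw / 1e9)  # ~42.7 GB/s theoretical
```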
Note: the latency of GPUs is much worse than that of CPUs: hundreds of cycles (and slower cycles at that) to access L2 cache, let alone main memory. The only way you actually benefit from the increased bandwidth is if you can parallelize the traversal. I have seen some papers proposing various parallel MCTS traversals, but I think it's probably a research problem worth exploring.
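One common trick from the parallel-MCTS literature is "virtual loss": each traversal temporarily pessimizes the nodes on its path so that concurrent traversals spread across the tree instead of piling onto the same leaf. A minimal sketch, with a hypothetical node layout and a plain win-rate selection rule for brevity:

```python
# Virtual-loss sketch for parallel MCTS (hypothetical node layout).
class Node:
    def __init__(self):
        self.children = []
        self.visits = 0
        self.virtual = 0    # pending virtual-loss visits
        self.wins = 0.0

VIRTUAL_LOSS = 3

def ratio(n):
    # Win rate with virtual losses counted as real losses.
    total = n.visits + n.virtual
    return n.wins / total if total else float("inf")  # prefer unvisited

def select(root):
    """Descend greedily, marking the path so concurrent selections diverge."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=ratio)
        node.virtual += VIRTUAL_LOSS  # looks worse to other traversals
        path.append(node)
    return path

def backup(path, result):
    """Record the real result and retract the virtual loss."""
    for node in path:
        node.visits += 1
        node.wins += result
        node.virtual = max(0, node.virtual - VIRTUAL_LOSS)
```

Two selections issued back-to-back (before any backup) will pick different children of the root, which is exactly the divergence a parallel traversal needs.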
-------
I agree with you, however, that large neural nets are too slow for MCTS to gain much benefit from GPU bandwidth. But perhaps a "lighter" MCTS (a fast-and-light approach in the spirit of Stockfish, targeting mega-nodes/second or giga-nodes/second) would become possible if MCTS traversal scaled up in speed a bit better.
At the hundreds of kilo-nodes/second of typical neural nets, the search is slow enough for PCIe transfers to work out, so you can do MCTS on the CPU just fine.
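The PCIe point checks out with rough numbers. The node rate and per-position size below are illustrative assumptions, not measurements:

```python
# Rough feasibility check: NN-driven MCTS over PCIe (assumed figures).
nodes_per_s = 200_000        # hundreds of kilo-nodes/second (assumption)
bytes_per_position = 1_000   # encoded input planes per position (assumption)
traffic = nodes_per_s * bytes_per_position  # bytes/s each way
pcie3_x16 = 16e9             # ~16 GB/s usable on a PCIe 3.0 x16 link
print(traffic / pcie3_x16)   # ~1% of the link -> transfers are cheap
```

Even with generous per-position sizes, the transfer traffic is a small fraction of the link, so the CPU-side tree with GPU-side evaluation is not bandwidth-limited at these node rates.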