A more interesting question about GPU versus CPU

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A more interesting question about GPU versus CPU

Post by hgm »

Daniel Shawul wrote: Mon Aug 13, 2018 3:15 pm
Well, the LC0 eval is a gigantic matrix multiplication and clearly needs hardware optimized for that to perform well. As far as I am concerned, a GPU is specialty hardware; one could also design an FPGA, like Deep Blue did, to accelerate Stockfish's eval by the same amount. Sure, the FPGA would be much more expensive than the GPU, but I am not concerned about cost. Cost is driven by need anyway, and your assessment of algorithms based on that is going to vary with it.

LC0 probably does the same amount of branching in the search part as Stockfish. The only difference is in the massively vectorized evaluation function that is suitable for the GPU. LC0 resorted to an inefficient vectorized eval (in terms of number of operations) to exploit GPUs; Stockfish could benefit from different hardware while keeping the branching in its eval.
AlphaZero, with its 4 TPUs, needed three orders of magnitude fewer nodes per move than Stockfish. For LC0 on a GPU this factor is probably even larger. So if the search part needs the same number of branches per node (which is likely what you mean), the number of branches per move would still be orders of magnitude smaller.

You are probably right that Stockfish's eval could be accelerated by specially programmed FPGA hardware. But by how much? What fraction of Stockfish's CPU utilization is spent on evaluation? Even if you reduced evaluation time to zero, Stockfish would still be slow, because it must search so many more nodes than LC0.
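[Editor's note] Daniel's description of the LC0 eval as "a gigantic matrix multiplication" can be illustrated with a toy contrast between a branchy handcrafted eval and a dense NN-style eval. All sizes and features here are made up for illustration; LC0's real network is a deep residual CNN, not two dense layers:

```python
# Toy contrast: branchy handcrafted eval vs. dense NN-style eval.
# Sizes and features are hypothetical; LC0's real network is a residual CNN.
import random

def branchy_eval(material, passed_pawn, king_open):
    # Handcrafted eval: very cheap, but full of data-dependent branches.
    score = material
    if passed_pawn:
        score += 50
    if king_open:
        score -= 30
    return score

def dense_eval(x, w1, w2):
    # NN-style eval: no branches, just matrix-vector products plus ReLU.
    # Cost: n_in * n_hidden + n_hidden multiplications, but fully vectorizable.
    h = [max(0.0, sum(wij * xj for wij, xj in zip(row, x))) for row in w1]
    return sum(w2j * hj for w2j, hj in zip(w2, h))

random.seed(0)
n_in, n_hidden = 8, 16
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [random.uniform(-1, 1) for _ in range(n_hidden)]
x = [random.uniform(-1, 1) for _ in range(n_in)]

print(branchy_eval(100, True, False))  # 150
print(dense_eval(x, w1, w2))
```

The dense version does far more arithmetic per call (here 144 multiplications versus a handful of adds), which is exactly the "inefficient in number of operations" trade-off Daniel describes: the arithmetic maps onto GPU matrix units, while the branchy version maps onto a CPU.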
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A more interesting question about GPU versus CPU

Post by Sesse »

What fraction of Stockfish's CPU utilization is spent on evaluation?
A profiler will tell you this very quickly:

Code: Select all

  29,53%  stockfish  stockfish            [.] Eval::evaluate                                                                                                     
  27,86%  stockfish  stockfish            [.] (anonymous namespace)::search<((anonymous namespace)::NodeType)0>                                                  
  19,63%  stockfish  stockfish            [.] MovePicker::next_move                                                                                              
   9,97%  stockfish  stockfish            [.] (anonymous namespace)::qsearch<((anonymous namespace)::NodeType)0>                                                 
   3,51%  stockfish  stockfish            [.] Position::see_ge                                                                                                   
   1,52%  stockfish  stockfish            [.] generate<(GenType)5>                                                                                               
   1,50%  stockfish  [unknown]            [k] 0xffffffff81600a17                                                                                                 
   1,23%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)0>                                                                             
   1,10%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)1>                                                                             
   0,71%  stockfish  stockfish            [.] Position::pseudo_legal                                                                                             
   0,60%  stockfish  stockfish            [.] Material::probe                  
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A more interesting question about GPU versus CPU

Post by bob »

Daniel Shawul wrote: Mon Aug 13, 2018 3:15 pm
hgm wrote: Sun Aug 12, 2018 4:42 pm To 'compare' algorithms you should first define a metric for algorithmic complexity. You seem to focus (completely arbitrarily) on the number of multiplications. One might just as well only consider the number of branches. With the same number of branches per second, Stockfish would not be a match for LC0 at all.

So without an objective criterion for selecting the metric, you can make the comparison come out any way you want.
Well, the LC0 eval is a gigantic matrix multiplication and clearly needs hardware optimized for that to perform well. As far as I am concerned, a GPU is specialty hardware; one could also design an FPGA, like Deep Blue did, to accelerate Stockfish's eval by the same amount. Sure, the FPGA would be much more expensive than the GPU, but I am not concerned about cost. Cost is driven by need anyway, and your assessment of algorithms based on that is going to vary with it.

LC0 probably does the same amount of branching in the search part as Stockfish. The only difference is in the massively vectorized evaluation function that is suitable for the GPU. LC0 resorted to an inefficient vectorized eval (in terms of number of operations) to exploit GPUs; Stockfish could benefit from different hardware while keeping the branching in its eval.
Actually, DB used ASICs; Belle used FPGAs in 1980. Both are similar at some level (ignoring cost, of course). The main goal was to produce a finite-state machine that could execute a tree search + evaluation, which is a bit tricky when cycles per operation can vary dramatically depending on what is going on.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A more interesting question about GPU versus CPU

Post by hgm »

Code: Select all

  29,53%  stockfish  stockfish            [.] Eval::evaluate                                                                                                     
  27,86%  stockfish  stockfish            [.] (anonymous namespace)::search<((anonymous namespace)::NodeType)0>                                                  
  19,63%  stockfish  stockfish            [.] MovePicker::next_move                                                                                              
   9,97%  stockfish  stockfish            [.] (anonymous namespace)::qsearch<((anonymous namespace)::NodeType)0>                                                 
   3,51%  stockfish  stockfish            [.] Position::see_ge                                                                                                   
   1,52%  stockfish  stockfish            [.] generate<(GenType)5>                                                                                               
   1,50%  stockfish  [unknown]            [k] 0xffffffff81600a17                                                                                                 
   1,23%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)0>                                                                             
   1,10%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)1>                                                                             
   0,71%  stockfish  stockfish            [.] Position::pseudo_legal                                                                                             
   0,60%  stockfish  stockfish            [.] Material::probe                  
So even if you reduced the Eval and MovePicker time to zero by implementing them in hardware, you would only gain a factor of 2 in speed. And you would lose much more by going from 48 CPUs to the single CPU that suffices for running LC0.
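[Editor's note] The factor-of-2 claim is just Amdahl's law applied to the perf profile above. A quick check, using the two largest eval-related entries from that output (the fractions are specific to that particular build and workload):

```python
# Amdahl's law: speedup when a fraction p of the runtime is accelerated
# by a factor s, while the remaining (1 - p) is unchanged.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Fractions from the perf profile above.
p_eval_and_movepicker = 0.2953 + 0.1963  # Eval::evaluate + MovePicker::next_move

# Limit as s -> infinity: even "free" hardware eval cannot beat this.
max_speedup = 1.0 / (1.0 - p_eval_and_movepicker)
print(round(max_speedup, 2))  # 1.97
```

So zeroing out both evaluation and move picking yields a speedup just under 2x, which is hgm's point.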
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A more interesting question about GPU versus CPU

Post by Sesse »

I doubt anyone would want to farm out evaluation only to an FPGA. (For one, the delay getting data back and forth would be counterproductive.) You would certainly want to move at least parts of the search out as well.
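[Editor's note] A back-of-the-envelope calculation shows why round-trip latency makes eval-only offloading counterproductive. The per-node and round-trip times below are assumptions for illustration, not measurements:

```python
# Effective nodes/sec when every eval incurs a host<->accelerator round trip.
def effective_nps(search_time_s, extra_per_node_s):
    return 1.0 / (search_time_s + extra_per_node_s)

search_time = 1e-6   # assume ~1 us of search bookkeeping per node on the CPU
cpu_eval = 1e-6      # assume ~1 us for the eval computed on-CPU
roundtrip = 2e-6     # assume ~2 us bus round trip per offloaded eval

cpu_only = effective_nps(search_time, cpu_eval)
offloaded = effective_nps(search_time, roundtrip)  # eval itself "free" in hardware
print(cpu_only, offloaded)
```

Under these assumed numbers the offloaded version is slower even though the eval computation itself costs nothing, because the round trip exceeds the on-CPU eval time. Moving search onto the device too would amortize or eliminate the round trips, which is Sesse's point.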
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: A more interesting question about GPU versus CPU

Post by Daniel Shawul »

hgm wrote: Tue Aug 14, 2018 11:54 pm

Code: Select all

  29,53%  stockfish  stockfish            [.] Eval::evaluate                                                                                                     
  27,86%  stockfish  stockfish            [.] (anonymous namespace)::search<((anonymous namespace)::NodeType)0>                                                  
  19,63%  stockfish  stockfish            [.] MovePicker::next_move                                                                                              
   9,97%  stockfish  stockfish            [.] (anonymous namespace)::qsearch<((anonymous namespace)::NodeType)0>                                                 
   3,51%  stockfish  stockfish            [.] Position::see_ge                                                                                                   
   1,52%  stockfish  stockfish            [.] generate<(GenType)5>                                                                                               
   1,50%  stockfish  [unknown]            [k] 0xffffffff81600a17                                                                                                 
   1,23%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)0>                                                                             
   1,10%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)1>                                                                             
   0,71%  stockfish  stockfish            [.] Position::pseudo_legal                                                                                             
   0,60%  stockfish  stockfish            [.] Material::probe                  
So even if you reduced the Eval and MovePicker time to zero by implementing them in hardware, you would only gain a factor of 2 in speed. And you would lose much more by going from 48 CPUs to the single CPU that suffices for running LC0.
The point is not by how much you can accelerate Stockfish's current eval, but the possibilities that the availability of new hardware acceleration opens up for adding a ton of evaluation features that in the past were deemed not worth it Elo-wise.

For my NN eval, the profile tells me 35% of the time is spent in the Winograd transform function. The current Stockfish eval could be 1000x faster than this NN eval, so for instance one could keep adding subtle eval features manually until it is as slow as the NN eval, and then use specialized hardware to accelerate it for "free".
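[Editor's note] The Winograd transform mentioned here trades multiplications for additions in small convolutions, which is why it dominates the profile of a CNN eval. A minimal sketch of the 1D F(2,3) case, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6:

```python
def conv_direct(d, g):
    # Direct 1D convolution: two outputs of a 3-tap filter, 6 multiplications.
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def conv_winograd_f23(d, g):
    # Winograd F(2,3): the same two outputs with only 4 multiplications.
    # (The filter-side transforms can be precomputed once per filter.)
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 0.25]
print(conv_direct(d, g))        # [-0.75, -1.0]
print(conv_winograd_f23(d, g))  # same result, fewer multiplications
```

The same idea, applied as F(2x2, 3x3) over tiles of a 2D image, is what NN libraries use to accelerate the 3x3 convolutions in networks like LC0's.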

Edit: I quote this text from a Deep Blue paper:
Hardware evaluation. The Deep Blue evaluation function is implemented in
hardware. In a way, this simplifies the task of programming Deep Blue. In a software
chess program, one must carefully consider adding new features, always keeping in
mind that a "better" evaluation function may take too long to execute, slowing down
the program to the extent that it plays more weakly. In Deep Blue, one does not need
to constantly re-weigh the worth of a particular evaluation function feature versus
its execution time: time to execute the evaluation function is a fixed constant. On
the other hand, it is not possible to add new features to the hardware evaluation,
and software patches are painful and problematic, as noted above about Deep
Thought 2. For the most part, one must learn to either get by without a desired new
feature, or manufacture some surrogate out of the features that are already available.
Additionally, the extra complexity that is possible in the hardware evaluation function
creates an "embarrassment of riches". There are so many features (8000) that tuning
the relative values of the features becomes a difficult task. The evaluation function is
described in Section 7.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: A more interesting question about GPU versus CPU

Post by Joost Buijs »

bob wrote: Tue Aug 14, 2018 11:03 pm Actually, DB used ASICs; Belle used FPGAs in 1980. Both are similar at some level (ignoring cost, of course). The main goal was to produce a finite-state machine that could execute a tree search + evaluation, which is a bit tricky when cycles per operation can vary dramatically depending on what is going on.
Belle used discrete logic; back in 1980 FPGAs didn't exist at all. The first FPGA, from Xilinx, came around 1985 or 1986.
User avatar
yurikvelo
Posts: 710
Joined: Sat Dec 06, 2014 1:53 pm

Re: A more interesting question about GPU versus CPU

Post by yurikvelo »

Both GPU and CPU engines require "learning".
Stockfish "learns" using the fishtest cluster network, which has currently accumulated CPU time = 54.87 years.
All of this, roughly 55 years' worth of knowledge, is compiled into the x86 binary build.

The GPU NN also requires learning; Google spent some time training their engine.

In terms of "Elo gain per TFLOP spent", I think the GPU NN is more efficient.

In the case of Stockfish, one can download those 55 years of CPU work for free, instantly :)
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: A more interesting question about GPU versus CPU

Post by Dann Corbit »

hgm wrote: Mon Aug 13, 2018 3:42 pm
Daniel Shawul wrote: Mon Aug 13, 2018 3:15 pm
Well, the LC0 eval is a gigantic matrix multiplication and clearly needs hardware optimized for that to perform well. As far as I am concerned, a GPU is specialty hardware; one could also design an FPGA, like Deep Blue did, to accelerate Stockfish's eval by the same amount. Sure, the FPGA would be much more expensive than the GPU, but I am not concerned about cost. Cost is driven by need anyway, and your assessment of algorithms based on that is going to vary with it.

LC0 probably does the same amount of branching in the search part as Stockfish. The only difference is in the massively vectorized evaluation function that is suitable for the GPU. LC0 resorted to an inefficient vectorized eval (in terms of number of operations) to exploit GPUs; Stockfish could benefit from different hardware while keeping the branching in its eval.
AlphaZero, with its 4 TPUs, needed three orders of magnitude fewer nodes per move than Stockfish. For LC0 on a GPU this factor is probably even larger. So if the search part needs the same number of branches per node (which is likely what you mean), the number of branches per move would still be orders of magnitude smaller.

You are probably right that Stockfish's eval could be accelerated by specially programmed FPGA hardware. But by how much? What fraction of Stockfish's CPU utilization is spent on evaluation? Even if you reduced evaluation time to zero, Stockfish would still be slow, because it must search so many more nodes than LC0.
Imagine if the TPU and CPU had transparent access to the same memory. At some point I expect this to happen. There are already some kit systems that do it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.