A more interesting question about GPU versus CPU

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A more interesting question about GPU versus CPU

Post by hgm »

Daniel Shawul wrote: Mon Aug 13, 2018 3:15 pm
Well, the LC0 eval is a gigantic matrix multiplication and clearly needs hardware optimized for that to perform well. As far as I am concerned, a GPU is specialty hardware; one could also design an FPGA, like Deep Blue did, to accelerate Stockfish's eval by the same amount. Sure, the FPGA would be much more expensive than the GPU, but I am not concerned about cost. Cost is driven by need anyway, and your assessment of algorithms based on that is going to vary with it.

LC0 probably does the same amount of branching in the search part as Stockfish. The only difference is in the massively vectorized evaluation function that is suitable for the GPU. LC0 resorted to an inefficient vectorized eval (in terms of number of operations) to exploit GPUs; Stockfish could benefit from different hardware while keeping the branching in its eval.
AlphaZero, with its 4 TPUs, needed three orders of magnitude fewer nodes per move than Stockfish. For LC0 on a GPU this factor is probably even larger. So if the search part needs the same number of branches per node (which is likely what you mean), the number of branches per move would still be orders of magnitude smaller.

You are probably right that Stockfish's eval could be accelerated by specially programmed FPGA hardware. But by how much? What fraction of Stockfish's CPU utilization is spent on evaluation? Even if you reduced evaluation time to zero, Stockfish would still be slow, because it must search so many more nodes than LC0.
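[Editor's note] Daniel's description of the LC0 eval as "a gigantic matrix multiplication" can be illustrated with a toy contrast between a branchy handcrafted eval and a dense NN-style eval. All sizes and features here are made up for illustration; LC0's real network is a deep residual CNN, not two dense layers:

```python
# Toy contrast: branchy handcrafted eval vs. dense NN-style eval.
# Sizes and features are hypothetical; LC0's real network is a residual CNN.
import random

def branchy_eval(material, passed_pawn, king_open):
    # Handcrafted eval: very cheap, but full of data-dependent branches.
    score = material
    if passed_pawn:
        score += 50
    if king_open:
        score -= 30
    return score

def dense_eval(x, w1, w2):
    # NN-style eval: no branches, just matrix-vector products plus ReLU.
    # Cost: n_in * n_hidden + n_hidden multiplications, but fully vectorizable.
    h = [max(0.0, sum(wij * xj for wij, xj in zip(row, x))) for row in w1]
    return sum(w2j * hj for w2j, hj in zip(w2, h))

random.seed(0)
n_in, n_hidden = 8, 16
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
w2 = [random.uniform(-1, 1) for _ in range(n_hidden)]
x = [random.uniform(-1, 1) for _ in range(n_in)]

print(branchy_eval(100, True, False))  # 150
print(dense_eval(x, w1, w2))
```

The dense version does far more arithmetic per call (here 144 multiplications versus a handful of adds), which is exactly the "inefficient in number of operations" trade-off Daniel describes: the arithmetic maps onto GPU matrix units, while the branchy version maps onto a CPU.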
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A more interesting question about GPU versus CPU

Post by Sesse »

What fraction of Stockfish's CPU utilization is spent on evaluation?
A profiler will tell you this very quickly:

Code: Select all

  29,53%  stockfish  stockfish            [.] Eval::evaluate                                                                                                     
  27,86%  stockfish  stockfish            [.] (anonymous namespace)::search<((anonymous namespace)::NodeType)0>                                                  
  19,63%  stockfish  stockfish            [.] MovePicker::next_move                                                                                              
   9,97%  stockfish  stockfish            [.] (anonymous namespace)::qsearch<((anonymous namespace)::NodeType)0>                                                 
   3,51%  stockfish  stockfish            [.] Position::see_ge                                                                                                   
   1,52%  stockfish  stockfish            [.] generate<(GenType)5>                                                                                               
   1,50%  stockfish  [unknown]            [k] 0xffffffff81600a17                                                                                                 
   1,23%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)0>                                                                             
   1,10%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)1>                                                                             
   0,71%  stockfish  stockfish            [.] Position::pseudo_legal                                                                                             
   0,60%  stockfish  stockfish            [.] Material::probe                  
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A more interesting question about GPU versus CPU

Post by bob »

Daniel Shawul wrote: Mon Aug 13, 2018 3:15 pm
hgm wrote: Sun Aug 12, 2018 4:42 pm To 'compare' algorithms you should first define a metric for algorithmic complexity. You seem to focus (completely arbitrarily) on the number of multiplications. One might just as well only consider the number of branches. With the same number of branches per second, Stockfish would not be a match for LC0 at all.

So without an objective criterion for selecting the metric, you can make the comparison come out any way you want.
Well, the LC0 eval is a gigantic matrix multiplication and clearly needs hardware optimized for that to perform well. As far as I am concerned, a GPU is specialty hardware; one could also design an FPGA, like Deep Blue did, to accelerate Stockfish's eval by the same amount. Sure, the FPGA would be much more expensive than the GPU, but I am not concerned about cost. Cost is driven by need anyway, and your assessment of algorithms based on that is going to vary with it.

LC0 probably does the same amount of branching in the search part as Stockfish. The only difference is in the massively vectorized evaluation function that is suitable for the GPU. LC0 resorted to an inefficient vectorized eval (in terms of number of operations) to exploit GPUs; Stockfish could benefit from different hardware while keeping the branching in its eval.
Actually, DB used ASICs; Belle used FPGAs in 1980. Both are similar at some level (ignoring cost, of course). The main goal was to produce a finite-state machine that could execute a tree search + evaluation, which is a bit tricky when cycles per operation can vary dramatically depending on what is going on.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A more interesting question about GPU versus CPU

Post by hgm »

Code: Select all

  29,53%  stockfish  stockfish            [.] Eval::evaluate                                                                                                     
  27,86%  stockfish  stockfish            [.] (anonymous namespace)::search<((anonymous namespace)::NodeType)0>                                                  
  19,63%  stockfish  stockfish            [.] MovePicker::next_move                                                                                              
   9,97%  stockfish  stockfish            [.] (anonymous namespace)::qsearch<((anonymous namespace)::NodeType)0>                                                 
   3,51%  stockfish  stockfish            [.] Position::see_ge                                                                                                   
   1,52%  stockfish  stockfish            [.] generate<(GenType)5>                                                                                               
   1,50%  stockfish  [unknown]            [k] 0xffffffff81600a17                                                                                                 
   1,23%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)0>                                                                             
   1,10%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)1>                                                                             
   0,71%  stockfish  stockfish            [.] Position::pseudo_legal                                                                                             
   0,60%  stockfish  stockfish            [.] Material::probe                  
So even if you reduced the Eval and MovePicker time to zero by implementing them in hardware, you would only gain a factor of 2 in speed. And you would lose much more by going from 48 CPUs to the single CPU that suffices for running LC0.
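[Editor's note] The factor-of-2 claim is just Amdahl's law applied to the perf profile above. A quick check, using the two largest eval-related entries from that output (the fractions are specific to that particular build and workload):

```python
# Amdahl's law: speedup when a fraction p of the runtime is accelerated
# by a factor s, while the remaining (1 - p) is unchanged.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Fractions from the perf profile above.
p_eval_and_movepicker = 0.2953 + 0.1963  # Eval::evaluate + MovePicker::next_move

# Limit as s -> infinity: even "free" hardware eval cannot beat this.
max_speedup = 1.0 / (1.0 - p_eval_and_movepicker)
print(round(max_speedup, 2))  # 1.97
```

So zeroing out both evaluation and move picking yields a speedup just under 2x, which is hgm's point.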
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: A more interesting question about GPU versus CPU

Post by Sesse »

I doubt anyone would want to farm out evaluation only to an FPGA. (For one, the delay getting data back and forth would be counterproductive.) You would certainly want to move at least parts of the search out as well.
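[Editor's note] A back-of-the-envelope calculation shows why round-trip latency makes eval-only offloading counterproductive. The per-node and round-trip times below are assumptions for illustration, not measurements:

```python
# Effective nodes/sec when every eval incurs a host<->accelerator round trip.
def effective_nps(search_time_s, extra_per_node_s):
    return 1.0 / (search_time_s + extra_per_node_s)

search_time = 1e-6   # assume ~1 us of search bookkeeping per node on the CPU
cpu_eval = 1e-6      # assume ~1 us for the eval computed on-CPU
roundtrip = 2e-6     # assume ~2 us bus round trip per offloaded eval

cpu_only = effective_nps(search_time, cpu_eval)
offloaded = effective_nps(search_time, roundtrip)  # eval itself "free" in hardware
print(cpu_only, offloaded)
```

Under these assumed numbers the offloaded version is slower even though the eval computation itself costs nothing, because the round trip exceeds the on-CPU eval time. Moving search onto the device too would amortize or eliminate the round trips, which is Sesse's point.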
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: A more interesting question about GPU versus CPU

Post by Daniel Shawul »

hgm wrote: Tue Aug 14, 2018 11:54 pm

Code: Select all

  29,53%  stockfish  stockfish            [.] Eval::evaluate                                                                                                     
  27,86%  stockfish  stockfish            [.] (anonymous namespace)::search<((anonymous namespace)::NodeType)0>                                                  
  19,63%  stockfish  stockfish            [.] MovePicker::next_move                                                                                              
   9,97%  stockfish  stockfish            [.] (anonymous namespace)::qsearch<((anonymous namespace)::NodeType)0>                                                 
   3,51%  stockfish  stockfish            [.] Position::see_ge                                                                                                   
   1,52%  stockfish  stockfish            [.] generate<(GenType)5>                                                                                               
   1,50%  stockfish  [unknown]            [k] 0xffffffff81600a17                                                                                                 
   1,23%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)0>                                                                             
   1,10%  stockfish  stockfish            [.] Pawns::Entry::do_king_safety<(Color)1>                                                                             
   0,71%  stockfish  stockfish            [.] Position::pseudo_legal                                                                                             
   0,60%  stockfish  stockfish            [.] Material::probe                  
So even if you reduced the Eval and MovePicker time to zero by implementing them in hardware, you would only gain a factor of 2 in speed. And you would lose much more by going from 48 CPUs to the single CPU that suffices for running LC0.
The point is not by how much you can accelerate Stockfish's current eval, but the possibilities that the availability of new hardware acceleration opens up for adding a ton of evaluation features that in the past were deemed not worth it Elo-wise.

For my NN eval, the profile tells me 35% of the time is spent in the Winograd transform function. The current Stockfish eval could be 1000x faster than this NN eval, so for instance one could keep adding subtle eval features manually until it is as slow as the NN eval, and then use specialized hardware to accelerate it for "free".
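[Editor's note] The Winograd transform mentioned here trades multiplications for additions in small convolutions, which is why it dominates the profile of a CNN eval. A minimal sketch of the 1D F(2,3) case, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6:

```python
def conv_direct(d, g):
    # Direct 1D convolution: two outputs of a 3-tap filter, 6 multiplications.
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def conv_winograd_f23(d, g):
    # Winograd F(2,3): the same two outputs with only 4 multiplications.
    # (The filter-side transforms can be precomputed once per filter.)
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 0.25]
print(conv_direct(d, g))        # [-0.75, -1.0]
print(conv_winograd_f23(d, g))  # same result, fewer multiplications
```

The same idea, applied as F(2x2, 3x3) over tiles of a 2D image, is what NN libraries use to accelerate the 3x3 convolutions in networks like LC0's.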

Edit: I quote this text from a Deep Blue paper:
Hardware evaluation. The Deep Blue evaluation function is implemented in
hardware. In a way, this simplifies the task of programming Deep Blue. In a software
chess program, one must carefully consider adding new features, always keeping in
mind that a "better" evaluation function may take too long to execute, slowing down
the program to the extent that it plays more weakly. In Deep Blue, one does not need
to constantly re-weigh the worth of a particular evaluation function feature versus
its execution time: time to execute the evaluation function is a fixed constant. On
the other hand, it is not possible to add new features to the hardware evaluation,
and software patches are painful and problematic, as noted above about Deep
Thought 2. For the most part, one must learn to either get by without a desired new
feature, or manufacture some surrogate out of the features that are already available.
Additionally, the extra complexity that is possible in the hardware evaluation function
creates an "embarrassment of riches". There are so many features (8000) that tuning
the relative values of the features becomes a difficult task. The evaluation function is
described in Section 7.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: A more interesting question about GPU versus CPU

Post by Joost Buijs »

bob wrote: Tue Aug 14, 2018 11:03 pm Actually, DB used ASICs; Belle used FPGAs in 1980. Both are similar at some level (ignoring cost, of course). The main goal was to produce a finite-state machine that could execute a tree search + evaluation, which is a bit tricky when cycles per operation can vary dramatically depending on what is going on.
Belle used discrete logic; back in 1980 FPGAs didn't exist at all. The first FPGA, from Xilinx, came around 1985 or 1986.
User avatar
yurikvelo
Posts: 710
Joined: Sat Dec 06, 2014 1:53 pm

Re: A more interesting question about GPU versus CPU

Post by yurikvelo »

Both GPU and CPU engines require "learning".
Stockfish "learns" using the fishtest cluster network, which has currently accumulated CPU time = 54.87 years.
All of this, roughly 55 years' worth of knowledge, is compiled into the x86 binary build.

The GPU NN also requires learning; Google spent some time training their engine.

In terms of "Elo gain per TFLOP spent", I think the GPU NN is more efficient.

In the case of Stockfish, one can download those 55 years of CPU work for free, instantly :)
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: A more interesting question about GPU versus CPU

Post by Dann Corbit »

hgm wrote: Mon Aug 13, 2018 3:42 pm
Daniel Shawul wrote: Mon Aug 13, 2018 3:15 pm
Well, the LC0 eval is a gigantic matrix multiplication and clearly needs hardware optimized for that to perform well. As far as I am concerned, a GPU is specialty hardware; one could also design an FPGA, like Deep Blue did, to accelerate Stockfish's eval by the same amount. Sure, the FPGA would be much more expensive than the GPU, but I am not concerned about cost. Cost is driven by need anyway, and your assessment of algorithms based on that is going to vary with it.

LC0 probably does the same amount of branching in the search part as Stockfish. The only difference is in the massively vectorized evaluation function that is suitable for the GPU. LC0 resorted to an inefficient vectorized eval (in terms of number of operations) to exploit GPUs; Stockfish could benefit from different hardware while keeping the branching in its eval.
AlphaZero, with its 4 TPUs, needed three orders of magnitude fewer nodes per move than Stockfish. For LC0 on a GPU this factor is probably even larger. So if the search part needs the same number of branches per node (which is likely what you mean), the number of branches per move would still be orders of magnitude smaller.

You are probably right that Stockfish's eval could be accelerated by specially programmed FPGA hardware. But by how much? What fraction of Stockfish's CPU utilization is spent on evaluation? Even if you reduced evaluation time to zero, Stockfish would still be slow, because it must search so many more nodes than LC0.
Imagine if the TPU and CPU had transparent access to the same memory. At some point I expect this to happen. There are already some kit systems that do it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.