Candidates for the computationally most efficient engine?

Discussion of chess software programming and technical issues.

Moderator: Ras

Alexander Schmidt
Posts: 1235
Joined: Thu May 10, 2007 2:49 pm

Re: Candidates for the computationally most efficient engine?

Post by Alexander Schmidt »

hgm wrote: Sat Jul 16, 2022 1:15 pm Is that really comparable to a CCRL rating?
It is an SSDF rating, which at this level is the most comparable to human ratings.

BTW, Leela with one node sometimes blunders, but it can also win against stronger opponents.
hgm wrote: Sat Jul 16, 2022 1:15 pm how to force it to use only 1 node).
It supports the UCI go nodes command, so set it to 1 node within the GUI.
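For example, a minimal UCI exchange that forces a single-node search might look like this (the engine's reply lines are omitted, and the exact bestmove will of course vary):

```
uci
isready
ucinewgame
position startpos
go nodes 1
```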
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Candidates for the computationally most efficient engine?

Post by dangi12012 »

I looked a bit closer this evening into what is actually needed:

Here is the relevant line in the relevant file that does the actual work (depending on the backend: cudnn, cublas, or plain cuda)
https://github.com/LeelaChessZero/lc0/b ... rs.cc#L479

Which is abstracted as a network here:
https://github.com/LeelaChessZero/lc0/b ... da.cc#L537

Behind this global interface:
https://github.com/LeelaChessZero/lc0/b ... work.h#L55

MCTS tree nodes that need evaluation are handled inside this branch:
https://github.com/LeelaChessZero/lc0/b ... h.cc#L1340

The MCTS tree is iterated over here:
https://github.com/LeelaChessZero/lc0/b ... h.cc#L1049

Now, removing everything related to MCTS and looking only at Eval() (per layer) for one specific backend, a minimal UCI-subset 2k-Elo lc0 would probably fit in 1k lines of code and a single file.

I am actually surprised that the core of MCTS does not run on the device too - so there is a lot of copying from GPU to CPU for the Nvidia backends (which are by far the fastest per dollar and per watt).

If somebody knows exactly how many floating-point operations are needed for one layer of inference, that would be great!
Topology of the network (according to the docs): https://github.com/hujie-frank/SENet
How many floating point operations are needed for a single full pass of 320x24?
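As a rough back-of-the-envelope sketch (my own estimate, not a figure from the lc0 docs): a 3x3 convolution with C_in input and C_out output channels on the 8x8 board costs about 2 * C_out * C_in * 9 * 64 FLOPs, and a 320x24 residual tower runs two such convolutions per block. Ignoring the SE layers, the input convolution and the output heads (which only add a few percent), that gives:

```python
def conv3x3_flops(c_in: int, c_out: int, board: int = 8) -> int:
    # One multiply-add counted as 2 FLOPs, applied at every board square.
    return 2 * c_out * c_in * 3 * 3 * board * board

def tower_flops(filters: int = 320, blocks: int = 24) -> int:
    # Two 3x3 convolutions per residual block; SE layers, input
    # convolution and output heads are ignored (a few percent extra).
    return blocks * 2 * conv3x3_flops(filters, filters)

print(tower_flops())  # 5662310400, i.e. roughly 5.7 GFLOPs per position
```

So a single full pass of a 320x24 network should be on the order of 5-6 GFLOPs, before batching effects.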
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Candidates for the computationally most efficient engine?

Post by dangi12012 »

With CUDA it should actually not be a problem to run and store the MCTS tree fully on the GPU.
In other threads I mentioned that tablebase probing can be done there too (but no syzygy_probe.cu has been written yet).

The only thing the cpu would need to do is handle UCI and spit out the PV.

Back to the topic:

If we get the exact floating-point operation count for a single full pass of 320x24, maybe we can cheat by adding this instruction: https://docs.nvidia.com/cuda/cuda-c-pro ... e-function
and calling go nodes 1.
Then we can compare the computational cost to a single-threaded Stockfish given as many nodes as it needs to equal 1 node of lc0, across devices.
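As a purely illustrative version of that comparison (both per-eval costs below are my own order-of-magnitude assumptions, not measured figures - in particular the NNUE cost per node is a guess):

```python
# Both per-eval costs are rough, illustrative assumptions - not measurements.
LC0_FLOPS_PER_EVAL = 5.7e9   # order of magnitude for one 320x24 network pass
NNUE_FLOPS_PER_EVAL = 5e4    # order of magnitude for one incremental NNUE eval

sf_nodes_per_lc0_node = LC0_FLOPS_PER_EVAL / NNUE_FLOPS_PER_EVAL
print(f"{sf_nodes_per_lc0_node:,.0f} NNUE evals per lc0 eval")
```

Under these assumptions, one lc0 evaluation costs on the order of a hundred thousand NNUE evaluations - which is the kind of ratio such a benchmark would have to confirm or refute.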

Then the answer to which of these approaches is the most computationally efficient way to reach 2k Elo would be clear for 2022:
Pure NN (lc0)
Alphabeta - NN (SF nnue)
Pure Alphabeta (SF no nnue)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Candidates for the computationally most efficient engine?

Post by dangi12012 »

I think it's answered (at least for me).

In the rollout phase of an MCTS engine, none of the make/unmake machinery is needed.
All that is needed is a noisy "best move picker" that plays your best move and your opponent's best move and so on until the game is over.

So from that perspective there is no tree to search - only the move-ordering algorithm of Stockfish or the eval network of lc0 is needed:
something that can look at all the moves of a position and correctly (within reason) predict the best move.
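A minimal sketch of such a rollout loop might look like this (best_move here is a hypothetical stand-in for whatever noisy policy is plugged in - Stockfish's move ordering, lc0's policy head, etc.; the game-interface callbacks are likewise placeholders):

```python
import random

def best_move(position, moves):
    # Hypothetical noisy policy: a random pick stands in for SF move
    # ordering or the lc0 policy head.
    return random.choice(moves)

def rollout(position, legal_moves, apply_move, is_terminal, result,
            max_plies=512):
    # Play best-vs-best until the game ends: no make/unmake, no tree -
    # each move just produces a fresh successor position.
    for _ in range(max_plies):
        if is_terminal(position):
            return result(position)
        position = apply_move(position, best_move(position, legal_moves(position)))
    return 0.0  # adjudicate over-long games as a draw
```

The whole "search" collapses into repeatedly asking the policy for one move, which is why only the move picker's quality matters here.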

The nice thing about a move picker without unmake is that there are some very fast acceleration structures that can also just access the previous position for a sense of progress and don't need to cache values inside a tree.

In that sense it's best to just measure the strength of the system when picking different policies for the rollout (NNUE or LC0 etc.), and computational efficiency might be the wrong goal. (The goal is to maximise the strength of the whole system.)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Candidates for the computationally most efficient engine?

Post by Sesse »

hgm wrote: Fri Jul 15, 2022 10:14 pm Ratings are measured by comparing engines at equal time usage. So wouldn't this simply be the engine with the highest rating?
Approximately, yes, but not perfectly. For instance, the most efficient engine would almost certainly be single-threaded (if you run two threads, you use twice the CPU cycles but don't arrive at the move twice as fast), so if A is stronger than B on 16-core but it's due to superior scaling, B might be stronger than A on a single core. Similarly, some engines do relatively better at longer TCs than shorter (and an engine going to only 1800 would certainly use very little time).
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Candidates for the computationally most efficient engine?

Post by hgm »

Sesse wrote: Tue Jul 19, 2022 1:40 pm
hgm wrote: Fri Jul 15, 2022 10:14 pm Ratings are measured by comparing engines at equal time usage. So wouldn't this simply be the engine with the highest rating?
Approximately, yes, but not perfectly. For instance, the most efficient engine would almost certainly be single-threaded (if you run two threads, you use twice the CPU cycles but don't arrive at the move twice as fast), so if A is stronger than B on 16-core but it's due to superior scaling, B might be stronger than A on a single core. Similarly, some engines do relatively better at longer TCs than shorter (and an engine going to only 1800 would certainly use very little time).
True, but I was thinking about single-CPU ratings. Even then two engines could scale differently with thinking time. But that would probably only be an issue between NN and AB engines; within such a group all engines of similar rating are basically the same program.