Simplest way to implement quick and dirty lazy smp

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
User avatar
silentshark
Posts: 236
Joined: Sat Mar 27, 2010 6:15 pm
Contact:

Simplest way to implement quick and dirty lazy smp

Post by silentshark » Fri Jan 04, 2019 1:16 pm

Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?

User avatar
xr_a_y
Posts: 760
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Simplest way to implement quick and dirty lazy smp

Post by xr_a_y » Fri Jan 04, 2019 1:22 pm

Just run the search in multiple forks using a shared hash table.

In C++11, it is much fun using std::thread/mutex/condition. Stockfish is a very simple and easy to understand implementation.
The one in Minic is derived from Stockfish and I guess quite copy/paste-able.

smatovic
Posts: 905
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: Simplest way to implement quick and dirty lazy smp

Post by smatovic » Sat Jan 05, 2019 3:19 pm

silentshark wrote:
Fri Jan 04, 2019 1:16 pm
Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
1) Make sure all threads have their own memory stacks except global hash table
2) use the SHT, shared hash table, in all threads for TT-moves and TT-scores
3) test RMO - randomized move order in helper threads,
- root thread performs a normal AB search
- helper threads pick a random move after Nullmove, TT-move, IID-move etc.
- search terminates when any thread finishes its search
4) consider a quick but not dirty ABDADA implementation

--
Srdja

odomobo
Posts: 59
Joined: Thu Jul 05, 2018 11:09 pm
Location: Chicago, IL
Full name: Josh Odom

Re: Simplest way to implement quick and dirty lazy smp

Post by odomobo » Sat Jan 05, 2019 3:26 pm

Does randomized move order provide a benefit for lazy SMP?

smatovic
Posts: 905
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: Simplest way to implement quick and dirty lazy smp

Post by smatovic » Sat Jan 05, 2019 3:37 pm

odomobo wrote:
Sat Jan 05, 2019 3:26 pm
Does randomized move order provide a benefit for lazy SMP?
I implemented RMO on gpu and it was superior to plain SHT,
i tried some lazy smp ideas like different search depths,
but these did not work for me.

My ABDADA implementation is superior to RMO,
at least up to 32 threads.

--
Srdja

User avatar
silentshark
Posts: 236
Joined: Sat Mar 27, 2010 6:15 pm
Contact:

Re: Simplest way to implement quick and dirty lazy smp

Post by silentshark » Sat Jan 05, 2019 5:59 pm

All great suggestions, many thanks :-)

User avatar
JVMerlino
Posts: 1003
Joined: Wed Mar 08, 2006 9:15 pm
Location: San Francisco, California

Re: Simplest way to implement quick and dirty lazy smp

Post by JVMerlino » Sat Jan 05, 2019 6:31 pm

silentshark wrote:
Fri Jan 04, 2019 1:16 pm
Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
Well, if your code is not thread-safe and you REALLY want the QUICKEST and DIRTIEST implementation, you could do what I did in Myrddin (thanks to an idea from H.G. Muller). Instead of threads, I used processes which are extremely easy to launch in parallel. You just have to make your various hash tables accessible by all processes via shared memory, and then have the master process feed moves via pipes to all of the slaves while they are in analyze mode. They fill up the hash tables, which the main process uses to speed up its search.

It worked a fair bit better than expected. My testing showed +90 elo for 4 CPU, and this is borne out by CCRL testing, which shows +93 (although CCRL has only played about 260 games with the 4 CPU version).

jm

User avatar
hgm
Posts: 23714
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Simplest way to implement quick and dirty lazy smp

Post by hgm » Sat Jan 05, 2019 8:47 pm

Yeah, I used that in my engine HaQiKi D, also with satisfactory results (eventually). I don't connect the processes (by pipes) in a star topology, though, but as a linear chain. When a process receives the command "cores N" with N > 1 it launches a daughter process, and passes the command "cores N-1" to it. That will set up the chain.

The process that receives the "xboard" command (which is not passed along the chain) knows it is the master process communicating with the GUI. This then allocates the hash memory in a sharable way, and sends its name in a (non-standard) "attach" command (which is passed along to daughters) to the daughter, so the latter can map it in its own memory instead of allocating hash memory.

Game-state changing commands are passed to daughters, as well as any move the master makes itself. This keeps the processes in sync. Whenever the master starts searching, it sends an "analyze" command to its daughter, whenever it stops searching it sends "exit" to stop the analysis (both also passed along).

Basically the whole thing is implemented in the protocol driver loop.

User avatar
JVMerlino
Posts: 1003
Joined: Wed Mar 08, 2006 9:15 pm
Location: San Francisco, California

Re: Simplest way to implement quick and dirty lazy smp

Post by JVMerlino » Sat Jan 05, 2019 9:23 pm

hgm wrote:
Sat Jan 05, 2019 8:47 pm
Yeah, I used that in my engine HaQiKi D, also with satisfactory results (eventually). I don't connect the processes (by pipes) in a star topology, though, but as a linear chain. When a process receives the command "cores N" with N > 1 it launches a daughter process, and passes the command "cores N-1" to it. That will set up the chain.

The process that receives the "xboard" command (which is not passed along the chain) knows it is the master process communicating with the GUI. This then allocates the hash memory in a sharable way, and sends its name in a (non-standard) "attach" command (which is passed along to daughters) to the daughter, so the latter can map it in its own memory instead of allocating hash memory.

Game-state changing commands are passed to daughters, as well as any move the master makes itself. This keeps the processes in sync. Whenever the master starts searching, it sends an "analyze" command to its daughter, whenever it stops searching it sends "exit" to stop the analysis (both also passed along).

Basically the whole thing is implemented in the protocol driver loop.
The only significant difference between your implementation and mine is that I pass the address for the hash tables as commandline parameters to the slaves. How do you pass commands down the chain if not via pipes?

User avatar
lucasart
Posts: 3040
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart
Contact:

Re: Simplest way to implement quick and dirty lazy smp

Post by lucasart » Sun Jan 06, 2019 12:44 am

JVMerlino wrote:
Sat Jan 05, 2019 6:31 pm
silentshark wrote:
Fri Jan 04, 2019 1:16 pm
Ok, so just say you haven't got a clue about lazy smp, and are looking for a very quick and dirty way to add to a non-smp program.. which is written in C.

What's the easiest and simplest way to implement? Where's the best start point?
Well, if your code is not thread-safe and you REALLY want the QUICKEST and DIRTIEST implementation, you could do what I did in Myrddin (thanks to an idea from H.G. Muller). Instead of threads, I used processes which are extremely easy to launch in parallel. You just have to make your various hash tables accessible by all processes via shared memory, and then have the master process feed moves via pipes to all of the slaves while they are in analyze mode. They fill up the hash tables, which the main process uses to speed up its search.

It worked a fair bit better than expected. My testing showed +90 elo for 4 CPU, and this is borne out by CCRL testing, which shows +93 (although CCRL has only played about 260 games with the 4 CPU version).

jm
I fail to see how this is easier. Honestly, managing subprocesses, and communications via pipes, is a lot more complicated in my experience (especially on POSIX systems, with the fork logic being quite mind bending). And it's hideous, from a design/coding standpoint. Simply start threads in an id_loop() and join when done. That's all there is to it. The rest is details, which you can improve on as you go along.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Post Reply