Pi64: Raspberry Pi 2B 64 element bramble

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64: Raspberry Pi 2B 64 element bramble

Post by sje »

Pi64: Raspberry Pi 2B 64 element bramble

Idea: Build an 8x8 element processing grid using 64 quad core Raspberry Pi model 2B single board computers. Each Raspberry Pi 2B includes a quad core 900 MHz ARM7 CPU, 1 GiB RAM, a MicroSD card slot, and a 100 Mbps Ethernet port.

Parts list:

For each processing element (can be mapped to one chessboard square):

(1) Raspberry Pi model 2B ($45)
(1) 8 GiB MicroSD card to hold Debian Linux + user files ($12)
(1) 5V 2A wall wart power supply ($8)
(1) Mounting hardware ($5)
(1) 20 cm Cat 5 Ethernet cable ($3)
Total: $73

For each four processing elements, one first level assembly (can be mapped to four chessboard squares):

(1) Five port 100 Mbps Ethernet switch w/PS ($9)
(1) 5 socket power strip ($7)
(1) Mounting hardware ($5)
(1) 30 cm Cat 5 Ethernet cable ($4)
Total: $25 + 4 x $73 = $317

For each four first level assemblies, one second level assembly (can be mapped to sixteen chessboard squares):

(1) Five port 100 Mbps Ethernet switch w/PS ($9)
(1) Mounting hardware ($5)
(1) 40 cm Cat 5 Ethernet cable ($5)
Total: $19 + 4 x $317 = $1,287

For the four second level assemblies, the single third level assembly (can be mapped to sixty four chessboard squares):

(1) 5 m Cat 5 Ethernet cable ($15)
(1) Five port 100 Mbps Ethernet switch w/PS ($9)
(1) 5 socket power strip ($7)
(1) Mounting hardware ($5)
Total: $36 + 4 x $1,287 = $5,184
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Mapping

Post by sje »

Pi64 Mapping

The intent is to allow flexible mappings of chess tasks to the 64 processing elements.

First mapping idea:

Map one chessboard square to each processing element with the four cores of that element working together on subtasks associated with that square. The Pi64 would have all its elements synchronized on a single position. Each element would know the entire position but would concentrate only on the tasks associated with its mapped square.

Examples: generating moves from the square, generating moves to the square, and positional evaluation calculations related to the square.

A processing element could also dispatch commands and retrieve results from all the elements in its locality. The locality would be the four, sixteen, or all sixty four adjacent elements in the 1:4:16 spanning tree.

Alternative: Map squares to cores and so use only sixteen processing elements total.

Alternative: Map 2^n (n=0..5) physical elements/cores to 64 virtual elements/cores.


Second mapping idea:

Map all elements/cores to tasks irrespective of any square mapping. Tasks running on a single element can share data via local memory; otherwise data is shared via the Pi64 LAN.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Control

Post by sje »

Pi64 Control

Each processing element has its own static IPv4 address from (say) 10.0.1.100 to 10.0.1.163 (here, IPv4 is likely faster than IPv6). There is no need for a router because there is no need for DHCP service. The grid connects to the house LAN via the uplink on the single five port uppermost switch. Switches instead of hubs are used for speed gain at a slightly higher cost. The protocol in use is TCP/IP to guard against packet loss.

A controlling computer on the house LAN can communicate directly with any of the processing elements. Alternatively, the controlling computer can talk with only one element and that element in turn talks with the others via tree spanning to distribute the total communications work.

Idea: Map each core to its own IP address and have each core run a thread dedicated to talking to the assigned address. Each physical Ethernet interface on a processing element then becomes four virtual interfaces.

Idea: Have two physical Ethernet interfaces on the controlling computer, and have these configured so that the house LAN traffic and the Pi64 traffic do not intermix. This should result in a speed gain because of enforced collision avoidance.

Idea: use 1 Gbps switches instead of 100 Mbps switches. This could result in a speed gain even though the processing element interfaces are limited to 100 Mbps. This would add perhaps $600 to the grid total cost.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Pi64: Raspberry Pi 2B 64 element bramble

Post by Michael Sherwin »

Ok! But then what? I see that the design hints at chess programming and that the word chessboard is mentioned. So you're cutting up a chessboard into 64 individual squares and then gluing them back together in chunks. Regardless though it is implied that the overall chess program will be square centric as though a single square is a meaningful object in and of itself. So in the original position the 'd2' processor will ask the 'd3' processor if it is empty and if it is then it will ask the 'd4' processor if it is empty and then ask 'c3', 'b4' and 'a5' if they have a bishop or queen on them that will attack the king. If everything is okay then 'd2' will toss its pawn over to 'd4'. And 'empty' squares that are not asked any questions will set there idle with nothing to do. Also where is the main controlling program running. If I had a desktop computer with 64 logical processors it would make no sense to me to assign each to one square. I'm sorry but I just do not get it said the fish to the fisherman.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Throughput

Post by sje »

Pi64 Throughput

It's difficult to give good throughput estimations without access to a physical Raspberry Pi model 2B.

Based on tests running Symbolic (a bitboard program) on a single core 700 MHz ARM6 Raspberry Pi model B, my guess is that a Pi64 grid would have a throughput of a 10 GHz quad core Intel Core i7, if such a thing existed.

A 32 bit piece-set program like Oscar might see twice the throughput of a bitboard program.

Very much depends on the defined task (perft/search), the algorithm in use, the distribution of work, and the communication overhead.

If each core is assigned work such that tens of milliseconds elapse between communication requests, then the tens of microseconds used per request will not be a significant chuck of the overall time used.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 SquareMap Mode

Post by sje »

Pi64 SquareMap Mode

A Pi64 is said to be in SquareMap mode when its 64 elements/core are mapped one-to-one with the 64 squares of a chessboard.

The famous Belle hardware grid from the Old Days operated in SquareMap mode. Each element didn't have either a CPU nor a network connection; instead, there was some RAM/ROM, a grid-wide bus, some hardwired logic, and connections to the 3-to-8 adjacent elements and to the 2-to-8 knight-move adjacent elements. So, no element knew about the entire board position, rather only the contents of its 5-to-16 neighbors. Signals along ray directions were propagated through the 3-to-8 adjacent wiring.

A Pi64 running in SquareMap mode is much different. Each element/thread knows the entire position. Each element/thread can communicate via the Pi64 LAN to any other element/thread at wire speed through the switches without involving any other elements/threads. With the right algorithm, each element/thread can intelligently assign itself work by dividing the task at hand to calculate a partial result and these partial results can be summed over the whole grid.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Positional Evaluation

Post by sje »

Pi64 Positional Evaluation

A Pi64 running in SquareMap mode provides an interesting opportunity to experiment with parallel positional evaluation. A static positional evaluation function is constructed an a combination of 64 separate evaluations, one per chessboard square.

One idea here is to expand a traditional static evaluation to include a sophisticated square control scoring including static exchange analysis. Each element/core could conduct a mini quiescence search for more accurate results. Both occupied and unoccupied squares would have evaluations for control, safety, and other metrics. The results could be used for move ordering as well as for static evaluation.

Because each element/core knows about the entire position, the only communication necessary after an evaluation request would be only a few bytes from the element/core reporting the score. Other kinds of requests might require tens of bytes. The results could be sent for merging either to the controlling computer or to a prior designated parent element/core.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Pi64 Move Mapping

Post by sje »

Pi64 Move Mapping

A Pi64 can also run in MoveMap mode. Each element/core knows the whole position, so each can calculate the moves from the position. Then, each element/core selects all the moves such that the move ordinal mod 64 is examined only by the element/core whose address mod 64 matches the move ordinal. Thus the work is evenly partitioned with zero communication overhead.

In the above, there will be some idle element/core time if the move count is not a multiple of the element/core count. Maybe on average half the elements/cores will be idle. However, for some move/path based evaluations, the moves/positions for the two or three ply tree could be examined in parallel, and this also being done with move ordinal based distribution. This would work quite well with perft() tasks.

A big win here is for calculating preliminary scores for move ordering. The controlling computer could say "Grid, the current position is a PV node or a direct descendent, so I want an accurate preliminary score calculated for each move and have a list of the moves with scores returned in descending score order." After a brief moment, the grid would reply "Boss, here it is and it was done without any unnecessary chit-chat with you."
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Automatic selection of a piece of the Pi

Post by sje »

In SquareMap mode, each element/core already knows the entire position. Processing tasks can be organized accordingly and automatically without communication among the elements/nodes. Only requests and results are communicated so as to minimize the total drag from network overhead.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Pi64 SquareMap Mode

Post by Michael Sherwin »

sje wrote:Pi64 SquareMap Mode

A Pi64 is said to be in SquareMap mode when its 64 elements/core are mapped one-to-one with the 64 squares of a chessboard.

The famous Belle hardware grid from the Old Days operated in SquareMap mode. Each element didn't have either a CPU nor a network connection; instead, there was some RAM/ROM, a grid-wide bus, some hardwired logic, and connections to the 3-to-8 adjacent elements and to the 2-to-8 knight-move adjacent elements. So, no element knew about the entire board position, rather only the contents of its 5-to-16 neighbors. Signals along ray directions were propagated through the 3-to-8 adjacent wiring.

A Pi64 running in SquareMap mode is much different. Each element/thread knows the entire position. Each element/thread can communicate via the Pi64 LAN to any other element/thread at wire speed through the switches without involving any other elements/threads. With the right algorithm, each element/thread can intelligently assign itself work by dividing the task at hand to calculate a partial result and these partial results can be summed over the whole grid.
There is software running on general purpose processors and there are hardwired processors designed for a specific task. You are suggesting the former and treating it as though it were the later. There is no advantage to be had from this. There is plenty of disadvantage though. You create an artificial construct where none need be created. You create the need for communications that can only drag down performance and greatly increase the complexity of the program. On a pc with say 64 logical processors you can instantly point a thread to any code you want. Since there are only 32 pieces anyway you hardly need to firm_wire any logical processor to any one specific square. In reality a whole chessboard can be treated as an object with functions that work on that object. I would challenge you to write a C++ class that just dealt with single squares that made sense and can be utilized efficiently. I suppose that I could write a chess engine that had eight different arrays one for each rank of the chess board, etc, but that makes no sense at all.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through