Discussion of Special Purpose hardware for chess search

rfadden · Post by **rfadden** » Fri Apr 25, 2008 8:18 pm

I just posted the following at the bottom of an interesting thread (here) concerning the use of 8/16 bit microprocessors, possibly in parallel, to achieve a fast search.

The problem is that a high end x86 processor in it's typical implementation in a PC is just so efficient in it's use of a Hierarchy of Cache Memory (now mostly on-chip), and this leads to the situation where innexpensive microprocessors can't come close in speed to the x86 engines. Vast resources within an x86 are devoted to Cache issues and also to having many instructions being operated on in parallel within the chip to give a "Superscalar" performance (the chips can average multiple instructions per clock).

Relatively weak "control computer" chips might end up running at 1/100 th this speed, or possibly worse, they could run at 1/1000 th of the speed of a (relatively innexpensive) X86 chip, and so then imagine the effort of trying to get 1000 processors in a system, also with all of the parallel processing programming in an attempt to match one X86 chip that costs $50.

I would call this being on the wrong side of technology. Cheap, innexpensive microprocessors do not help in the task of trying to contruct a mega-engine of chess. To make a mega-engine of chess you *start* with the fastest x86 processors, already in a cluster as large as you can get. One key is that these processors should be connected in a "Tightly Coupled" interconnect scheme. In contrast, when the N processors are "Loosely Coupled", connected in a relatively slow Serial I/O scheme such as 1 Gigabit Ethernet, then this less tightly coupled interconnect tends to lead to serious problems in programming the set of processors and ultimately this causes real problems in getting the speed of search that is desired. (This is due to inefficiency in search.)

So for this reason, for this discussion, we stick with the largest N that we can arrange where the N fast X86 processors are tightly coupled with interconnect hardware. With current off-the-shelf hardware, N = 8.

To get further speed we consider special purpose hardware. In the other discussion there was initial mention of using FPGAs coupled with microprocessors and my suggestions start from that point, below:

-------------------------

I'm a hardware architect and mostly I design special purpose graphics hardware but essentially all of this work could be called the design of special purpose computers.

Yes, Belle, Hitech, then Deep Blue used this same technology of special purpose hardware for chess position evaluation, or for evaluation and for the last levels of search.

We have been using devices with FPGA's as the prototype units for testing our logic in the development of graphics chips and this still is one of the most important tools used in the field that I'm in. The FPGA chips have of course been getting more and more capable.

One of the key ideas is something like this: First take the computing problem that you have and try to use the fastest inexpensive or reasonable cost general purpose computer to solve the problem. If your goal is to compute faster than this, then look first toward the portion of your program that takes the majority of the calculation time.

For a chess engine, start with the Eval function and the use of the Transposition table within the innermost loop, and consider how fast your search engine would run if these operations took essentially zero time. (Not that the special purpose hardware would produce a result in zero time, but this is a rule-of-thumb way of considering the potential speed of special purpose hardware.)

So first decide if this is fast enough for what you want. We could talk about the actual computation time of Eval or the Transposition table use as a separate topic.

In some cases you may not be happy even if Eval essentially took zero time.

Imagine one chess position per clock going into Eval, and imagine at the end of a pure hardware pipeline, one Eval result per clock coming out of the special purpose hardware. That by the way is so fast that the general purpose computer might have a hard time even sustaining this rate of generating positions that are worth evaluating. Also note that any pipeline delay of such hardware would cause a different style of problem. There is a separation in time between when the positions are submitted and when the "results per clock" answers come back and are available for further use. In the world of special purpose hardware this "pipeline delay" issue or "transport delay" sometime ends up being the Tail that Wags the Dog. A chess program would typically prefer to get the results of Eval immediately so that the next move chosen for Eval can be based upon this Eval. That's a key characteristic of a typical alphabeta search.

So anyway decide if "Zero time for Eval" is fast enough for you.

If not, then the special purpose logic needs to include the last N levels of tree search as part of it's logic.

Roughly speaking I would say that the special purpose hardware is much easier to construct if you stay away from tree search and the sequential application of the AlphaBeta algorithm in hardware. For example do you want a hardware based Null Move algorithm in there? Well anyway, much of this kind of logic is placed into a state machine, and the logic cycles through chess board positions rather like a typical software solution, except the hardware units would be good at sustaining a "One position per clock" type of rate. Internally the special purpose hardware still has to feed the "one result per clock" style of Eval hardware, so that rate of operation roughly still limits the speed of the special purpose hardware.

Yes, in hardware you can have multiple copies of this Eval Unit, so then the ideas go off into a wide open area of many possibilities so for now I think we should talk about initially having one ultra-fast parallel Eval unit (to prime the pump and to start this discussion).

------------

I think for most applications you would be better off to first concentrate solely on Placing all of Eval and the use of the Transposition table into special purpose hardware, and then see how far you can go with "Zero Time Eval." Note that as I mention this all of the rest of the problem should be handled quite well on these ultra fast 8 processor computers.

So can all of Eval be done on an FPGA? Yes.

The clock rate of the FPGA will be significantly lower than the clock rate of the very latest special purpose chips, but on the other hand the clock rate of FPGA that are available today is about as fast as the special purpose chips were a handful of years back (like 5 years). This kind of clock rate seems fast enough to me.

-----------------

Here is an experiment that I have been thinking about constructing. Take one of your favorite chess engines that is available in source code and modify the engine as follows:

At the Eval function write new software routines to save the result of Eval, and also arrange that during an actual search all tree search is forced to follow exactly the same path in a repeatable fashion (avoid any randomness, avoid having the search controlled by elapsed time, since this will disrupt the experiment).

Save the results of Eval in an array in RAM. Add a small piece of logic that will optionally grab the Eval results from RAM instead of going through the calculation. Then run the chess engine in this "Speed Test" mode where for Eval you get an immediate "Zero calculation delay" result.

During this search you should get the same answer as before, and this would be used to confirm that the technique is working. Then simply measure this speed of search on your computer.

In other words you can simulate what it is like to have awesome special purpose hardware that is added to the computer for chess search. You can see the speed of search on your current computer.

Measure the speed-up. Use this technique to see the factor of speed improvement or to see how deep the engine could search given a position and given a desired time constraint.

Note that even if you want the special purpose hardware to do more than Eval, the above described "simulation" approach can still be used. Simply select the portion of the chess engine that you intend to put into hardware, and create a software "black box" that simulates the presense of this hardware.

Using the above described technique then run your chess engine and watch how fast it can search.

You can only use this technique while *repeating* a search. The first time through you use the software to record the answers that the hardware would otherwise produce, and in the second pass through, the timing pass, you simulate the hardware giving answers as fast as your software can accept these answers. You then see the speed of your chess engine which if finished would eventually be using special purpose hardware for part of the search.

As I said, this is the technique that I would like to try...

I mention this in the hopes of stirring up some interest or at least some interesting discussion...

Thanks.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Apr 26, 2008 12:10 am

Some special purpose super fast digital-analogue neuronal-network hardware seems suited for almost "zero time" evaluation + qsearch. Feeding in the position, let say a bus of 4*64 + some wires. Some combinatorial feature detection logic + digital-analogue dot-products with back-propagated and trained weight vectors for a final (analogue) evaluation-score. One may implement some ply minimax search with a few thousands of those evaluators with an in-build parallel move-generator. To get a three ply + qsearch in a few cycles

How expensive is 1 Tera byte of ram with random access latency of 1E-12 seconds (1 picosecond) by the way?

Zach Wegner · Post by **Zach Wegner** » Sat Apr 26, 2008 12:53 am

Gerd Isenberg wrote:Some special purpose super fast digital-analogue neuronal-network hardware seems suited for almost "zero time" evaluation + qsearch. Feeding in the position, let say a bus of 4*64 + some wires. Some combinatorial feature detection logic + digital-analogue dot-products with back-propagated and trained weight vectors for a final (analogue) evaluation-score. One may implement some ply minimax search with a few thousands of those evaluators with an in-build parallel move-generator. To get a three ply + qsearch in a few cycles

How expensive is 1 Tera byte of ram with random access latency of 1E-12 seconds (1 picosecond) by the way?

Haha, Gerd, you are always out there in outer space. I remember your idea to interface an analog evaluator with a Moog synthesizer... did you ever finish that?

But really, I think there is some possibility here. I think I would use quad-bitboards on an FPGA, and keep the entire board state on the FPGA, with some simple make-unmake-getnextmove commands. I'm interested in doing hardware search, too. Though I wouldn't map out the tree like you say, it might be possible to do something like it. Basically, you could have N copies of the move generator/move selector/eval, and some conditional logic between them that decides whether to go a ply deeper, back up a ply, or whatever. Of course, this would probably be pretty slow, but being able to run some part of the tree entirely in hardware seems very tantalizing.

I remember that Deep Blue had hardware search. Does anybody know how they did it?

Dann Corbit · Post by **Dann Corbit** » Sat Apr 26, 2008 1:32 am

Zach Wegner wrote:
Gerd Isenberg wrote:Some special purpose super fast digital-analogue neuronal-network hardware seems suited for almost "zero time" evaluation + qsearch. Feeding in the position, let say a bus of 4*64 + some wires. Some combinatorial feature detection logic + digital-analogue dot-products with back-propagated and trained weight vectors for a final (analogue) evaluation-score. One may implement some ply minimax search with a few thousands of those evaluators with an in-build parallel move-generator. To get a three ply + qsearch in a few cycles

How expensive is 1 Tera byte of ram with random access latency of 1E-12 seconds (1 picosecond) by the way?
Haha, Gerd, you are always out there in outer space. I remember your idea to interface an analog evaluator with a Moog synthesizer... did you ever finish that?

But really, I think there is some possibility here. I think I would use quad-bitboards on an FPGA, and keep the entire board state on the FPGA, with some simple make-unmake-getnextmove commands. I'm interested in doing hardware search, too. Though I wouldn't map out the tree like you say, it might be possible to do something like it. Basically, you could have N copies of the move generator/move selector/eval, and some conditional logic between them that decides whether to go a ply deeper, back up a ply, or whatever. Of course, this would probably be pretty slow, but being able to run some part of the tree entirely in hardware seems very tantalizing.

I remember that Deep Blue had hardware search. Does anybody know how they did it?

From:
Deep Blue
Murray Campbell
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, USA
mcam@us.ibm.com
A. Joseph Hoane Jr.
Sandbridge Technologies
1 N. Lexington Avenue
White Plains, NY 10601, USA
Joe.Hoane@SandbridgeTech.com
Feng-hsiung Hsu
Compaq Computer Corporation
Western Research Laboratory
250 University Avenue
Palo Alto, CA 94301, USA
Feng-hsiung.Hsu@compaq.com
June 29, 2001

“Parallel search
Deep Blue is composed of a 30-node RS/6000 SP computer and 480 chess chips, with 16 chips per node. The SP nodes communicate with each other using the MPI (Message Passing Interface) standard [10]. Communication is via a high-speed switch. The chess chips communicate with their host node via a Microchannel bus. This heterogeneous architecture has a strong influence on the parallel search algorithm used in Deep Blue, as discussed below.
6.1 Parallel Search Algorithm
To characterize the parallel search algorithm used in Deep Blue, we will use the taxonomy given in [5].
1) Processor hierarchy. Deep Blue uses a static processor tree, with one SP node controlling the other 29 nodes, which in turn control 16 chess chips each. The static nature of the hierarchy is in part determined by the fact that the chess chips are not general purpose processors and can only act as slaves. In addition, the chess chips can only communicate directly with their host node.
2) Control distribution. Deep Blue uses a centralized control of the parallel search. The control is managed on the SP nodes, since the chess chips do not have this functionality.
3) Parallelism possible. Deep Blue permits parallelism under the following conditions:
a) Type 1 (PV) nodes. After the first move has been examined at a PV node, all the alternatives may be examined in parallel (with an o
set window to enable the selective search algorithm). Null move searches, used to generate threat information for the selective search, can also be carried out in parallel. Finally, PV nodes at the hardware/software boundary can be searched in parallel. Because the hardware search allows only null-window searches, a number of searches are required to determine an exact score within a window.
b) Good type 2 nodes (nodes where the first move “fails high", or exceeds expectations). Most parallel algorithms do not allow or need parallelism in this case. Deep Blue executes reduced depth o
set searches as well as the null move searches in parallel after the first move fails high. As above, these searches generate information for use by the selective search.
c) Bad type 2 nodes (nodes where the fail high move is not searched first). The moves after the first are searched in parallel. After a fail high move is found, all the alternatives are searched in parallel with a reduced depth offset search. Null move searches also can execute in parallel at this point.
d) Type 3 nodes (nodes where all the moves fail low). These moves can all be searched in parallel. 4. Synchronization. Type 1 and type 2 nodes are synchronization points for Deep Blue. The first move must be evaluated before parallelism is allowed. There are also global synchronization points at the end of iterations.

[5] M.G. Brockington. A taxonomy of parallel game-tree search algorithms. ICCA Journal, 19(3):162{174, 1996.
[10] W. Gropp, E. Lusk, and A. Skjellum. Using MPI, 2nd Edition. MIT Press, 1999. ISBN 0-262-57132-3.29
[19] D.E. Knuth and R.W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293{326, 1975.”

CThinker · Post by **CThinker** » Sat Apr 26, 2008 3:29 am

Dann Corbit wrote:
Zach Wegner wrote:
Gerd Isenberg wrote:Some special purpose super fast digital-analogue neuronal-network hardware seems suited for almost "zero time" evaluation + qsearch. Feeding in the position, let say a bus of 4*64 + some wires. Some combinatorial feature detection logic + digital-analogue dot-products with back-propagated and trained weight vectors for a final (analogue) evaluation-score. One may implement some ply minimax search with a few thousands of those evaluators with an in-build parallel move-generator. To get a three ply + qsearch in a few cycles

How expensive is 1 Tera byte of ram with random access latency of 1E-12 seconds (1 picosecond) by the way?
Haha, Gerd, you are always out there in outer space. I remember your idea to interface an analog evaluator with a Moog synthesizer... did you ever finish that?

But really, I think there is some possibility here. I think I would use quad-bitboards on an FPGA, and keep the entire board state on the FPGA, with some simple make-unmake-getnextmove commands. I'm interested in doing hardware search, too. Though I wouldn't map out the tree like you say, it might be possible to do something like it. Basically, you could have N copies of the move generator/move selector/eval, and some conditional logic between them that decides whether to go a ply deeper, back up a ply, or whatever. Of course, this would probably be pretty slow, but being able to run some part of the tree entirely in hardware seems very tantalizing.

I remember that Deep Blue had hardware search. Does anybody know how they did it?
From:
Deep Blue
Murray Campbell
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, USA
mcam@us.ibm.com
A. Joseph Hoane Jr.
Sandbridge Technologies
1 N. Lexington Avenue
White Plains, NY 10601, USA
Joe.Hoane@SandbridgeTech.com
Feng-hsiung Hsu
Compaq Computer Corporation
Western Research Laboratory
250 University Avenue
Palo Alto, CA 94301, USA
Feng-hsiung.Hsu@compaq.com
June 29, 2001

“Parallel search
Deep Blue is composed of a 30-node RS/6000 SP computer and 480 chess chips, with 16 chips per node. The SP nodes communicate with each other using the MPI (Message Passing Interface) standard [10]. Communication is via a high-speed switch. The chess chips communicate with their host node via a Microchannel bus. This heterogeneous architecture has a strong influence on the parallel search algorithm used in Deep Blue, as discussed below.
6.1 Parallel Search Algorithm
To characterize the parallel search algorithm used in Deep Blue, we will use the taxonomy given in [5].
1) Processor hierarchy. Deep Blue uses a static processor tree, with one SP node controlling the other 29 nodes, which in turn control 16 chess chips each. The static nature of the hierarchy is in part determined by the fact that the chess chips are not general purpose processors and can only act as slaves. In addition, the chess chips can only communicate directly with their host node.
2) Control distribution. Deep Blue uses a centralized control of the parallel search. The control is managed on the SP nodes, since the chess chips do not have this functionality.
3) Parallelism possible. Deep Blue permits parallelism under the following conditions:
a) Type 1 (PV) nodes. After the first move has been examined at a PV node, all the alternatives may be examined in parallel (with an o
set window to enable the selective search algorithm). Null move searches, used to generate threat information for the selective search, can also be carried out in parallel. Finally, PV nodes at the hardware/software boundary can be searched in parallel. Because the hardware search allows only null-window searches, a number of searches are required to determine an exact score within a window.
b) Good type 2 nodes (nodes where the first move “fails high", or exceeds expectations). Most parallel algorithms do not allow or need parallelism in this case. Deep Blue executes reduced depth o
set searches as well as the null move searches in parallel after the first move fails high. As above, these searches generate information for use by the selective search.
c) Bad type 2 nodes (nodes where the fail high move is not searched first). The moves after the first are searched in parallel. After a fail high move is found, all the alternatives are searched in parallel with a reduced depth offset search. Null move searches also can execute in parallel at this point.
d) Type 3 nodes (nodes where all the moves fail low). These moves can all be searched in parallel. 4. Synchronization. Type 1 and type 2 nodes are synchronization points for Deep Blue. The first move must be evaluated before parallelism is allowed. There are also global synchronization points at the end of iterations.

[5] M.G. Brockington. A taxonomy of parallel game-tree search algorithms. ICCA Journal, 19(3):162{174, 1996.
[10] W. Gropp, E. Lusk, and A. Skjellum. Using MPI, 2nd Edition. MIT Press, 1999. ISBN 0-262-57132-3.29
[19] D.E. Knuth and R.W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293{326, 1975.”

And when it was time to play Kasparov, some guy had to sit there, be enslaved to the computer, and manually make the moves on behalf of the computer. Isn't that sad?

I think we have a lot of chess programs, and too few chess robots. These chips that we are discussing are well suited to these types of robotic applications.

I want a human-sized robot that can play with me on these giant chess pieces.
http://users.owt.com/rpeto/randy/gigs_7 ... chess.html

Dann, I believe you know this place.

Dann Corbit · Post by **Dann Corbit** » Sat Apr 26, 2008 3:51 am

Sure do know it.
Some big shots play there from time to time, as you probably know.
Driving directions from the Mall to my place of work:

FROM: NE 8th St AT 156th Ave NE
Bellevue, WA 98007 TO: 2039 152nd Ave NE
Redmond, WA 98052-5521

STEPS: 7 EST. DRIVE TIME: 5 minutes EST. DISTANCE: 1.6 miles
1 You are at NE 8th St AT 156th Ave NE, Bellevue, WA 98007
2 Go South on 156th Av NE
3 Turn onto NE 8th St
4 Turn onto 148th Av NE
5 Turn onto NE 20th St (Northup Wy)
6 Turn onto 152nd Av NE
7 You are at 2039 152nd Ave NE, Redmond, WA 98052-5521

ROUTE SUMMARY
STEPS: 7 EST. DRIVE TIME: 5 minutes EST. DISTANCE: 1.6 miles

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Apr 26, 2008 11:26 am

Zach Wegner wrote:Haha, Gerd, you are always out there in outer space. I remember your idea to interface an analog evaluator with a Moog synthesizer... did you ever finish that?

No not yet, lack of time and resources. My 30+years old self-build moog-like synthesizer has only three voltage controlled oscillators, two filters and amplifiers. I am not sure either how to transform the chess position into the frequency domain. One may use a prime frequencies for each square plus shape and phase for the piece codes

CThinker · Post by **CThinker** » Sat Apr 26, 2008 8:41 pm

Dann Corbit wrote:Sure do know it.
Some big shots play there from time to time, as you probably know.
Driving directions from the Mall to my place of work:

FROM: NE 8th St AT 156th Ave NE
Bellevue, WA 98007 TO: 2039 152nd Ave NE
Redmond, WA 98052-5521

STEPS: 7 EST. DRIVE TIME: 5 minutes EST. DISTANCE: 1.6 miles
1 You are at NE 8th St AT 156th Ave NE, Bellevue, WA 98007
2 Go South on 156th Av NE
3 Turn onto NE 8th St
4 Turn onto 148th Av NE
5 Turn onto NE 20th St (Northup Wy)
6 Turn onto 152nd Av NE
7 You are at 2039 152nd Ave NE, Redmond, WA 98052-5521

ROUTE SUMMARY
STEPS: 7 EST. DRIVE TIME: 5 minutes EST. DISTANCE: 1.6 miles

I too am just about a mile from there. Once in a while (when it is not raining), I go there for lunch and watch people play.

Zach Wegner · Post by **Zach Wegner** » Sat Apr 26, 2008 9:24 pm

Gerd Isenberg wrote:
Zach Wegner wrote:Haha, Gerd, you are always out there in outer space. I remember your idea to interface an analog evaluator with a Moog synthesizer... did you ever finish that?
No not yet, lack of time and resources. My 30+years old self-build moog-like synthesizer has only three voltage controlled oscillators, two filters and amplifiers. I am not sure either how to transform the chess position into the frequency domain. One may use a prime frequencies for each square plus shape and phase for the piece codes

That's pretty cool, I wish I had the time, knowledge, and resources to build an analog. That's my other hobby, you know, playing sythesizers. I guess it is monophonic? You'd need some pretty complex electronics to do a polyphonic, and with 3 voices, what's the point really...

Of course, the all-time best synthesizer solo is Don Preston's Moog solo on Waka/Jawaka. I'm sure you're familiar. I love cranking that one up on vinyl.

And to keep this running on chess, for each search it could play the PV as a melody, each note being the board position as you describe (we could call them "Godel waves"). Sounds pretty sweet... Now that I think about it, I could probably do this in software pretty easily. Maybe a little plugin-option for ZCT?

Zach Wegner · Post by **Zach Wegner** » Sat Apr 26, 2008 9:28 pm

Dann Corbit wrote:[snip reference]

Hmm, not quite. That seems mainly about their parallel search, still software based. I want to say that they had some basic search logic in the chips, so that the last N plies are entirely hardware. Maybe this isn't the case, but I think that would probably be the most interesting application of special hardware. What's the use of having move generation and eval in hardware, when you need to call it every time from software and just wait for the result? For real efficiency you need to be able to just set the hardware in motion and let it search without needing software.

Discussion of Special Purpose hardware for chess search

Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search

Re: Discussion of Special Purpose hardware for chess search