Low-RAM engine

hgm · Post by **hgm** » Wed Jan 28, 2015 3:25 pm

Although the Arduino Uno tournament seems to be off, it still seems a nice project to develop an engine that uses only minimal amounts of RAM. The Arduino Uno is based on an AtMega328 chip, which has 2KB of RAM and 32KB of Flash memory. (The latter is ROM for practical purposes, as writing there takes 2msec per 8-byte word, and to do that you have to erase a memory block of 64 bytes first.)

32KB for program and static data tables is quite a lot, and the true bottle neck is the 2KB of RAM. This should hold the engine's core data: board, piece list, game history (for rep detection), killer table, possibly an atack map. But also the stack with local variables. Squeezing the latter in 512 byte would already be a challenge (e.g. if you want to be able to do 20 ply, including QS, you can afford only about 20 bytes in local variable, and some of the variables would definitely have to be multiple bytes). You would have to be really selective in what variables you want to survive the recursive call (so they would have to be saved and restored), and which you rather rederive from others.

A tri-angular array for keeping track of the PV would already be expensive: at 20 ply it would need to contain ~200 moves, so if each move would be 2 bytes, another quarter of your memory is spoken for. A normal history table of 64x64 moves and 2-byte counters would take 8KB, 4 times what you have in total, and is obviously unaffordable. Even storing the moves in each node, for the purpose of sorting, might be too memory consuming: with an average 40 moves per position going 20 ply needs 800 moves. Even compressing the moves to 1 byte would then take 0.8KB, 40% of your memory! Of course any transposition table or pawn hash is totally out of the question.

So it seems we have to trade memory space for execution time quite extensively, redoing many calculations when the need arises, rather than storing them for later use. In particular it would be nice if we would be able to generate moves in the order of the static move sorting, without too much overhead. This could be done by staged move generation, with selective generation of captures, non-captures, checks, promotions etc., rather than doing full move generation every time while filtering out the moves needed for that stage (as mailbox engines typically geenrate captures). The latter technique could still be used in combination with selective stages, e.g. to separate good captures from bad captures you can run the capture generator twice, the first time filtering for LxH and equal captures, and unprotected pieces, the second time for exactly the opposite.

So I am thinking of a move generator with many stages, based on a capture generator that selectively generates captures of a given victim (e.g. by running through your piece list, subjecting all your pieces to 0x88 tests to see if they can capture that victim), so that you can run through the victims in MVV order. This in combination with non-capture generation by running through your own piece list, and generating the moves of every piece as you encounter it there. Promotions could be generated looping through the piece list to find 7th-rank Pawns. And checks could be generated by tabulating for each piece type which two steps would align it with the King as a function of the piece-King vector. (Such tables would take 211 bytes per piece type, which is easily affordable in 32KB.)

This way moves could be generated, and then immediately searched, in the order:

1) PV move (from previous iteration)
2) safe promotions (to unattacked square)
3) good & equal captures in MVV order (filter for Low x High, Equal x Equal and Any x Unprotected)
4) dubious captures in MVV order (filter for Low x High with lowest protector valuable enough that SEE could be >= 0)
5) unsafe promotions (and under-promotions?)
6) killers
7) safe non-captures (skip those to squares attacked by lower-valued piece)
8) unsafe non-captures
9) bad captures in MVV order (filter for LxH and low-valued protector)

This would require running the capture generator (plus successive testing for classification) three times, and the non-capture generator two times. An advantage of the staged design is that in the low-depth nodes you can abort generation once you reach the futile victims. At depths where you still search all checks you can then replace stages 6-9 by a dedicated checks generator, perhaps also running it twice, for safe and unsafe checks.

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Jan 29, 2015 7:52 pm

Yes, very nice. 2K for globals and say 8-bytes per ply for an iterative search sounds huge compared to the 160 nibbles of Mini Chess or 128 bytes of Novag Micro Chess

Ply-array of an iterative search may go with 8 bytes per ply entry:

1. alpha (beta is always -alpha[ply-1])
2. movegen stage including last move for unmake, 2 bytes
3. 2 packed killers one byte each
4. depth due to extensions/reductions, return address, 2 bytes

hgm · Post by **hgm** » Thu Jan 29, 2015 9:58 pm

Actually I was planning to take beta as the (saved) alpha from the previous ply. I.e. test if(alpha + parentAlpha >= 0) in stead of the usual if(alpha >= beta). Because in vanilla alpha-beta beta = -parentAlpha. I am not sure if PVS would pay if you have no hash, so that re-searches will be expensive.

As to the return address, this is probably also a waste of space. There might be only 2 or 3 calls to Search (e.g. from the root, after making a null move, and after making a normal move. With simple branches you could decide after which one to continue. From the stack pointer you could see if you 'return' from the root. And from the move you have to undo in other cases, you would know if you return from a null-move search or not. So these 2 bytes might be used more profitably, even when on something not strictly necessary.

stegemma · Post by **stegemma** » Sat Jan 31, 2015 10:41 am

hgm wrote:Actually I was planning to take beta as the (saved) alpha from the previous ply. [...]

This is what i do since Drago (and even before Drago) and i called it "alfagemma", because i've seen that is equivalent to alfabeta (alphabeta).

In my sw, i always tested with the "current value of previous node" and assigned the initial value of any node, when entering in that node, to "the value of 2 nodes before". You need two nodes before the root, to make it works, but so you can avoid passing -beta and -alfa to alfabeta. Any node stores only current alfa, of course.

hgm · Post by **hgm** » Sat Jan 31, 2015 1:49 pm

Because these modern microcontrollers have RISC architectures, which lack indexed addressing modes, I was planning on keeping the alpha of the two latest levels in global variables (kept in registers, of which there are plenty), and on recursion push the oldest on the save stack, and then swap the two registers. Then you would not need complete stack frames for dummy nodes, but just the alpha values. They would represent alpha and -beta, and before the search they would both be initialized to -INF (or whatever values you would want to use for aspiration).

I guess it is a bit tricky when you do IID, because then you would need to reinitialize alpha to its original value for each iteration, and it can only be found on the stack. While for ID in the root it would not be on the stack, unless you pushed a dummy. But I guess the code for starting a new iteration would need to be different in the root anyway (if only because of how you would choose the next depth).

Zenmastur · Post by **Zenmastur** » Wed Feb 04, 2015 11:26 am

I think if I were going to try something like this I would start out on an AtMega2560 based system. It has 256K Flash, 4K EEPROM and 8K static ram. This should ease the development considerably. They can be had online for less than $20 with a USB cable for programming if you look around. Look for the Arduino Mega 2560 R3.

Regards,

Forrest

stegemma · Post by **stegemma** » Wed Feb 04, 2015 3:17 pm

Zenmastur wrote:I think if I were going to try something like this I would start out on an AtMega2560 based system. It has 256K Flash, 4K EEPROM and 8K static ram. This should ease the development considerably. They can be had online for less than $20 with a USB cable for programming if you look around. Look for the Arduino Mega 2560 R3.

Regards,

Forrest

Of course more memory is better but i think that 2 Kb could be an interesting challenge for anyone. If we need more memory... maybe it is better to develop for vintage computers, as Commodore 64 or Spectrum, because it would interest more people.

So i would like to start with Arduino UNO but i can change device, the cost is still low.

Low-RAM engine

Low-RAM engine

Re: Low-RAM engine

Re: Low-RAM engine

Re: Low-RAM engine

Re: Low-RAM engine

Re: Low-RAM engine

Re: Low-RAM engine