SMP, first shot at implementation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

abulmo2 wrote: Tue Sep 15, 2020 2:27 pm
mvanthoor wrote: Tue Sep 15, 2020 1:42 pm
Joost Buijs wrote: Tue Sep 15, 2020 12:49 pm I have to admit that I don't know much about recent languages like D or Rust, if these languages offer default memory safety this must have some impact on performance, this is inevitable.
With a garbage collected language such as C#, you're correct.
There is a lot of myths about garbage collection. Here is the relative time spent in the garbage collector (GC) in Amoeba (programmed in D):

Code: Select all

0.00%  amoeba   amoeba              [.] nothrow void* gc.impl.conservative.gc.Gcx.smallAlloc(ulong, ref ulong, uint, const(TypeInfo))
0.00%  amoeba   amoeba              [.] nothrow uint gc.impl.conservative.gc.ConservativeGC.runLocked!(gc.impl.conservative.gc.ConservativeGC.getAttr(void*).go(gc.impl.conservative.gc.Gcx*, void*), gc.impl.conservative.gc.otherTime, gc.impl.conservative.gc.numOthers, gc.impl.conservative.gc.Gcx*, void*).runLocked(ref gc.impl.conservative.gc.Gcx*, ref void*)
0.00%  amoeba   amoeba              [.] nothrow void* gc.impl.conservative.gc.ConservativeGC.runLocked!(gc.impl.conservative.gc.ConservativeGC.mallocNoSync(ulong, uint, ref ulong, const(TypeInfo)), gc.impl.conservative.gc.mallocTime, gc.impl.conservative.gc.numMallocs, ulong, uint, ulong, const(TypeInfo)).runLocked(ref ulong, ref uint, ref ulong, ref const(TypeInfo))
0.00%  amoeba   amoeba              [.] nothrow uint gc.impl.conservative.gc.ConservativeGC.getAttr(void*).go(gc.impl.conservative.gc.Gcx*, void*)
0.00%  amoeba   amoeba              [.] nothrow void gc.impl.conservative.gc.Pool.setBits(ulong, uint)
0.00%  amoeba   amoeba              [.] nothrow int gc.impl.conservative.gc.Gcx.collectAllRoots(bool).__foreachbody3(ref core.gc.gcinterface.Range)
0.00%  amoeba   amoeba              [.] nothrow void gc.impl.conservative.gc.Gcx.scanBackground()
All these sum to 0.00%.
Garbage is collected only when memory allocation is requested. In Amoeba, memory is allocated only during program initialisation and when reading input from the user/GUI, not in critical code section, so costs nothing measurable during the program execution.
I my view garbage collection and memory safety are two totally different things, maybe I'm wrong.

My interpretation of garbage collection is freeing memory allocations that are not needed any longer, like artificially fixing a program that contains memory leaks.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: SMP, first shot at implementation

Post by mar »

Joost Buijs wrote: Tue Sep 15, 2020 4:01 pm I my view garbage collection and memory safety are two totally different things, maybe I'm wrong.

My interpretation of garbage collection is freeing memory allocations that are not needed any longer, like artificially fixing a program that contains memory leaks.
No, garbage collectors are absolutely essential to ensure runtime memory safety in a multi-threaded environment. Reference counting is not thread safe (read the pointer is not [doesn't matter if you RC atomically] - so data races can still lead to dangling pointers etc)
Also, GC will not prevent you from leaking memory, that's another misconception.
Martin Sedlak
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

mar wrote: Tue Sep 15, 2020 5:13 pm
Joost Buijs wrote: Tue Sep 15, 2020 4:01 pm I my view garbage collection and memory safety are two totally different things, maybe I'm wrong.

My interpretation of garbage collection is freeing memory allocations that are not needed any longer, like artificially fixing a program that contains memory leaks.
No, garbage collectors are absolutely essential to ensure runtime memory safety in a multi-threaded environment. Reference counting is not thread safe (read the pointer is not [doesn't matter if you RC atomically] - so data races can still lead to dangling pointers etc)
Also, GC will not prevent you from leaking memory, that's another misconception.
You are probably right. I never used one of these scripting/functional programming languages, and never studied them in detail.
I always thought that GC was basically meant to relief the programmer from doing manual memory management.
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: SMP, first shot at implementation

Post by fabianVDW »

mvanthoor wrote: Tue Sep 15, 2020 3:06 am You can get away with 16 bits, for storing the entire move information? Interesting. (Or did you mean 16 bytes? But that would be a lot.)

I already have these:

Code: Select all

PIECE       	:   3        0-7 (use only 0-6)
FROM        	:   6        0-63
TO          	:   6        0-63
CAPTURE     	:   3        0-7 (captured piece)
PROMOTION   	:   3        0-7 (piece promoted to)
ENPASSANT   	:   1        0-1
DOUBLESTEP  	:   1        0-1
CASTLING    	:   1        0-1
That's 24 bits already, and I still need to store the depth (8 bits) and evaluation (don't know yet, but I assume 12-16 bits).
Throw away the 3 piece bits as they can be inferred from the from square and the gamestate. Throw away the capture because it can be inferred from the to square. Instead of a flag for enpassant, doublestep and so on, you can use the remaining 4 bits to encode for intotal 16 move types, while only 8 are needed: Namely Quiet, Capture, EnPassant, Castle, Promo Knight, Promo Bishop, Promo Rook, Promo Queen - atleast those are the cases I track in FabChess.
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: SMP, first shot at implementation

Post by Pio »

fabianVDW wrote: Tue Sep 15, 2020 7:09 pm
mvanthoor wrote: Tue Sep 15, 2020 3:06 am You can get away with 16 bits, for storing the entire move information? Interesting. (Or did you mean 16 bytes? But that would be a lot.)

I already have these:

Code: Select all

PIECE       	:   3        0-7 (use only 0-6)
FROM        	:   6        0-63
TO          	:   6        0-63
CAPTURE     	:   3        0-7 (captured piece)
PROMOTION   	:   3        0-7 (piece promoted to)
ENPASSANT   	:   1        0-1
DOUBLESTEP  	:   1        0-1
CASTLING    	:   1        0-1
That's 24 bits already, and I still need to store the depth (8 bits) and evaluation (don't know yet, but I assume 12-16 bits).
Throw away the 3 piece bits as they can be inferred from the from square and the gamestate. Throw away the capture because it can be inferred from the to square. Instead of a flag for enpassant, doublestep and so on, you can use the remaining 4 bits to encode for intotal 16 move types, while only 8 are needed: Namely Quiet, Capture, EnPassant, Castle, Promo Knight, Promo Bishop, Promo Rook, Promo Queen - atleast those are the cases I track in FabChess.
You can throw away another 8 bits from your encoding since everything can be inferred from the move number 🙄
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: SMP, first shot at implementation

Post by fabianVDW »

Pio wrote: Tue Sep 15, 2020 8:02 pm
fabianVDW wrote: Tue Sep 15, 2020 7:09 pm
mvanthoor wrote: Tue Sep 15, 2020 3:06 am You can get away with 16 bits, for storing the entire move information? Interesting. (Or did you mean 16 bytes? But that would be a lot.)

I already have these:

Code: Select all

PIECE       	:   3        0-7 (use only 0-6)
FROM        	:   6        0-63
TO          	:   6        0-63
CAPTURE     	:   3        0-7 (captured piece)
PROMOTION   	:   3        0-7 (piece promoted to)
ENPASSANT   	:   1        0-1
DOUBLESTEP  	:   1        0-1
CASTLING    	:   1        0-1
That's 24 bits already, and I still need to store the depth (8 bits) and evaluation (don't know yet, but I assume 12-16 bits).
Throw away the 3 piece bits as they can be inferred from the from square and the gamestate. Throw away the capture because it can be inferred from the to square. Instead of a flag for enpassant, doublestep and so on, you can use the remaining 4 bits to encode for intotal 16 move types, while only 8 are needed: Namely Quiet, Capture, EnPassant, Castle, Promo Knight, Promo Bishop, Promo Rook, Promo Queen - atleast those are the cases I track in FabChess.
You can throw away another 8 bits from your encoding since everything can be inferred from the move number 🙄
Can you elaborate, I don't get it, what is the move number?

I also don't get the smiley- using less data to store the move in the hash entry and removing the data which you can get back from the gamestate FAST seems like a very sensible thing to do for me in order to store more in the TT.
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: SMP, first shot at implementation

Post by Pio »

fabianVDW wrote: Tue Sep 15, 2020 9:42 pm
Pio wrote: Tue Sep 15, 2020 8:02 pm
fabianVDW wrote: Tue Sep 15, 2020 7:09 pm
mvanthoor wrote: Tue Sep 15, 2020 3:06 am You can get away with 16 bits, for storing the entire move information? Interesting. (Or did you mean 16 bytes? But that would be a lot.)

I already have these:

Code: Select all

PIECE       	:   3        0-7 (use only 0-6)
FROM        	:   6        0-63
TO          	:   6        0-63
CAPTURE     	:   3        0-7 (captured piece)
PROMOTION   	:   3        0-7 (piece promoted to)
ENPASSANT   	:   1        0-1
DOUBLESTEP  	:   1        0-1
CASTLING    	:   1        0-1
That's 24 bits already, and I still need to store the depth (8 bits) and evaluation (don't know yet, but I assume 12-16 bits).
Throw away the 3 piece bits as they can be inferred from the from square and the gamestate. Throw away the capture because it can be inferred from the to square. Instead of a flag for enpassant, doublestep and so on, you can use the remaining 4 bits to encode for intotal 16 move types, while only 8 are needed: Namely Quiet, Capture, EnPassant, Castle, Promo Knight, Promo Bishop, Promo Rook, Promo Queen - atleast those are the cases I track in FabChess.
You can throw away another 8 bits from your encoding since everything can be inferred from the move number 🙄
Can you elaborate, I don't get it, what is the move number?

I also don't get the smiley- using less data to store the move in the hash entry and removing the data which you can get back from the gamestate FAST seems like a very sensible thing to do for me in order to store more in the TT.
What I meant was that if you for a position generate all the moves (actually you don’t even have to generate all moves you only have to make sure they are generated in a deterministic order so you can reproduce them) you won’t have more than 255 moves so you can save the move index of the best move in the TT. When probing the TT you regenerate the moves from the node and pick the saved move index from the TT and take the move that has the same index as that from the TT.

I guess it might be possible to make a better scheme where you don’t save the index in the TT but a combined (pieceType, pieceNr, pseudomoveIndexForThepieceNr). In that way you only have to pick out the pieceNr bit of the bitboard for the pieceType, look up the attackbitboard for that position and extract the pseudomoveIndex bit for that attackbitboard to generate the toSq. To make it possible you will have to make an advanced lookup scheme that will change the representation depending on the number of pieces for each pieceType so that pieces that has lots of twins will be represented by fewer bits. The same goes for pieces with high mobility.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: SMP, first shot at implementation

Post by Pio »

Pio wrote: Tue Sep 15, 2020 10:00 pm
fabianVDW wrote: Tue Sep 15, 2020 9:42 pm
Pio wrote: Tue Sep 15, 2020 8:02 pm
fabianVDW wrote: Tue Sep 15, 2020 7:09 pm
mvanthoor wrote: Tue Sep 15, 2020 3:06 am You can get away with 16 bits, for storing the entire move information? Interesting. (Or did you mean 16 bytes? But that would be a lot.)

I already have these:

Code: Select all

PIECE       	:   3        0-7 (use only 0-6)
FROM        	:   6        0-63
TO          	:   6        0-63
CAPTURE     	:   3        0-7 (captured piece)
PROMOTION   	:   3        0-7 (piece promoted to)
ENPASSANT   	:   1        0-1
DOUBLESTEP  	:   1        0-1
CASTLING    	:   1        0-1
That's 24 bits already, and I still need to store the depth (8 bits) and evaluation (don't know yet, but I assume 12-16 bits).
Throw away the 3 piece bits as they can be inferred from the from square and the gamestate. Throw away the capture because it can be inferred from the to square. Instead of a flag for enpassant, doublestep and so on, you can use the remaining 4 bits to encode for intotal 16 move types, while only 8 are needed: Namely Quiet, Capture, EnPassant, Castle, Promo Knight, Promo Bishop, Promo Rook, Promo Queen - atleast those are the cases I track in FabChess.
You can throw away another 8 bits from your encoding since everything can be inferred from the move number 🙄
Can you elaborate, I don't get it, what is the move number?

I also don't get the smiley- using less data to store the move in the hash entry and removing the data which you can get back from the gamestate FAST seems like a very sensible thing to do for me in order to store more in the TT.
What I meant was that if you for a position generate all the moves (actually you don’t even have to generate all moves you only have to make sure they are generated in a deterministic order so you can reproduce them) you won’t have more than 255 moves so you can save the move index of the best move in the TT. When probing the TT you regenerate the moves from the node and pick the saved move index from the TT and take the move that has the same index as that from the TT.

I guess it might be possible to make a better scheme where you don’t save the index in the TT but a combined (pieceType, pieceNr, pseudomoveIndexForThepieceNr). In that way you only have to pick out the pieceNr bit of the bitboard for the pieceType, look up the attackbitboard for that position and extract the pseudomoveIndex bit for that attackbitboard to generate the toSq. To make it possible you will have to make an advanced lookup scheme that will change the representation depending on the number of pieces for each pieceType so that pieces that has lots of twins will be represented by fewer bits. The same goes for pieces with high mobility.
I save my move as 16 bits now, but in my representation I also save the pieceType. It is possible because after fromSq and toSq you have 4 bits left. 3 of those bits can be used to encode all piece types plus a special pieceType for uncastled rooks and one for enpassant pawn. The last bit says if the previous 3 bits are special. If they are 2 of the 3 bits represent the promotion piece type. I think I represented it in that way but my memory isn’t the best
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: SMP, first shot at implementation

Post by mvanthoor »

fabianVDW wrote: Tue Sep 15, 2020 9:42 pm Can you elaborate, I don't get it, what is the move number?

I also don't get the smiley- using less data to store the move in the hash entry and removing the data which you can get back from the gamestate FAST seems like a very sensible thing to do for me in order to store more in the TT.
Storing less gives you the capability to store more moves in the hash table. I now store the 64-bit hash key and the 64-bit move information, which is 16 bytes; if I can have 4 buckets of 16 bytes each, that's 64 bytes for one cache line. (Storing more than 64 bytes exceeds the cache line, which causes a 5-10% performance hit. It's a detail I was made aware of by HGM.)

If one entry takes 16 bytes, I can store 65K moves per megabyte. At very fast time controls, I've noticed that a hash table over 512 MB is often a waste of space, because the engine doesn't have the time to actually fill it up. 512MB stores 33.5 million moves; 2GB stores 134 million moves, and on my computer, it takes Stockfish on 4 threads about 1.5 minutes to fill it up.

While testing perft, the biggest performance increases were noticed with hash tables up to +/- 32 MB. (+/- 2 million stored moves.) For every megabyte, the performance increase was clearly visible, although diminishing. It was unnecessary to increase hash table size beyond 2GB, because it just didn't bring any more performance; at least not in the depths that I'd be willing to wait for.

So... storing more in the hash table is at this point not really my priority. If I want that, I can just make the hash table bigger. I've noticed that, even with bitboards, caching stuff can be useful. I now infer the king_square from the king bitboard. (bb_pieces[white][king].trailing_zeros() as square, so a king on b1 will have have one trailing 0... and is thus on A1 + 1 square.) I'm going to test if it's useful to actually cache the king's square...

I first didn't have any piece lists, because I thought the bitboards were fast enough... I also didn't have any materials list... but now I do. If I can store it and then incrementally update it instead of calculating it, I do so, from the point that memory usage is not an issue.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: SMP, first shot at implementation

Post by mvanthoor »

mvanthoor wrote: Tue Sep 15, 2020 10:21 pm keeping an incremental king square...
Tried it... and it doesn't work. It's 5-6% slower to do it that way as compared to just calculating the king square from the bitboard..
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL