Perft speed optimization (renamed)

mvanthoor · Post by **mvanthoor** » Fri May 08, 2020 6:22 pm

Hi Terje,

I've pulled the lastest master for Weiss after your last 1.5% update. In KiwiPete6, Weiss is now down to the higher 200.xx and lower 201.xx range.

I also cleaned up Rustic a bit more (basically reversing the "last correct option + everything" in match), removing some unnecessary type casts that aren't necessary anymore because of earlier changes, some consistency... It shouldn't have made any difference, but sometimes, Rustic now indeed achieves a 199.xx run for KiwiPete6.

Weiss: 200.98 seconds
Rustic: 199.33 seconds

Starpos7 are still the same for both (actually, Weiss lost about 0.2 seconds):

Weiss: 76.01 seconds
Rustic: 78.10 seoncds

After running tests with other positions, it seems Weiss and Rustic are always within +/- 2 seconds of one another; sometimes Weiss is a bit faster, sometimes it's Rustic, depending on the position.

I think that, for all intents and purposes, we can call the speed to be equal.

=== About engine and code construction; skip if not interested... ===

I've also tried a rearrangements of the code; basically, I moved gamestate updates such as castling permissions and side to move AFTER the legality check. (If the move isn't legal, those updates aren't necessary as unmake() will undo them.) Both with an early return from make(), and when using an if-statement, the performance drops by 6%.

It seems Rust (or the CPU...) is very sensitive to nested if-statements and upsetting the flow of straightforward code. Sometimes, trying to prevent doing something with if-statements is more expensive than just doing the work and then either throw it away or reverse it if it turned out to be unnecessary. (I saw the same thing in add_move() with the promotions.)

I do have a big refactor ahead, because I wrote the move generator as a module that does not change the board, but when I want to test the pinning stuff with the knight, I must change the board. As it is now, I'm in the weird position that some functions in the move generator must change the board (i.e. the bitboards) and not change the board (i.e., the move generator itself that is incorporated in the board), at the same time, and that's obviously not possible.

I did it this way so I don't have to pass the zobrist module, move generator, evaluator, and so on into each function.

In many chess engines, things such as the move generator are just a bunch of global variables and functions that you can change and call from anywhere, but that's not allowed in Rust. (Because, in that case, Rust can't guarantee thread saftey.)

I'm going to package the non-changing modules (move generator, zobrist-handler, evaluation, maybe others such as search...) into a single immutable core, which is initialized and created at the program start, and then passed into each function that also accepts a board.

That way, it resolves the mutability conflict, and I have to only pass one pointer to a global "core" struct; a struct that will never change. The only changing object in the engine will be the board.

So, how am I going to call that struct? "Core" is lame. I'm partial to "Cerebrus" or just "Brain" :p

===== output from perft =====

Weiss, No PEXT:

Code: Select all

Marcel@WORKSTATION MINGW64 /c/code/weiss/src
$ ./weiss-dev.exe
perft 6 r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1

Starting perft to depth 6

move 1 : g2h3 : 158328615
...
move 48 : e1f1 : 139601450

Perft complete:
Time  : 200975ms
Leaves: 8031647685
LPS   : 39963416

Rustic (doesn't have PEXT yet)

Code: Select all

Benchmarking perft 1-6:

8   r . . . k . . r
7   i . i i q i b .
6   b n . . i n i .
5   . . . I N . . .
4   . i . . I . . .
3   . . N . . Q . i
2   I I I B B I I I
1   R . . . K . . R

    A B C D E F G H

Zobrist key:        fb86acbc2034870e
Active Color:       White
Castling:           KQkq
En Passant:         -
Half-move clock:    0
Full-move number:   1

Perft 1: 48 (0 ms, inf leaves/sec)
Perft 2: 2039 (0 ms, inf leaves/sec)
Perft 3: 97862 (2 ms, 48931000 leaves/sec)
Perft 4: 4085603 (101 ms, 40451514 leaves/sec)
Perft 5: 193690690 (4741 ms, 40854395 leaves/sec)
Perft 6: 8031647685 (199331 ms, 40293018 leaves/sec)
Total time spent: 204175 ms
Execution speed: 40306227 leaves/second

mvanthoor · Post by **mvanthoor** » Sun May 10, 2020 11:10 pm

So... a big part of the refactor has been finished. The code is now much cleaner. (The refactor allowed me to remove several forwarding functions and references-to-references.) It did gain some speed, but not too much:

Startpos7 is now consistently in the lower 77.x range.
KiwiPete6 is now consistently in the lower 198.x range.

I can't yet test a function that omits all the moves of a knight if the knight is pinned, because of this function:

Code: Select all

    pub fn gen_all_moves(&self, ml: &mut MoveList) {
        self.move_generator.gen_all_moves(self, ml);
    }

Basically, Rustic is wrtten in a pseudo-object oriented way. The move generator is in the board, so I don't have to pass the move generator down to each function seperately.

The above function in board calls the move generator, while passing a reference to the object it belongs to (&self, in this case the board). Rust doens't have any problems with that, as long as gen_all_moves() doesn't change the board. However, if I want to test this pin trick, I need to temporarily remove the knight from the board, and thus the board must be able to change:

Code: Select all

    pub fn gen_all_moves(&mut self, ml: &mut MoveList) {
        self.move_generator.gen_all_moves(self, ml);
    }

(Note the &mut in the function signature.)

Now Rust is going to complaining. You can have as many read-only references as you want, or ONE and ONLY ONE mutating reference. In this case, I already have a non-mutating reference (self.move_generator.gen_all_moves() uses a reference to the board because of "self.move_..."), and I'm also passing a mutating reference (the incoming &mut self) to the same function.

Obviously the complaint is that I now have a non-mutating and a mutating reference at the same time, and thus Rust can't guarantee either one. So the only solution is to split the move generator from the board. I'll have to think about how I can handle this (maybe using an encompassing struct that only has a single level of components, such as Board, Search, Evaluate...) or in another way. Not being able to instantiate a global variable/struct and then just refer to it makes this somewhat harder than I thought.

Pio · Post by **Pio** » Sun May 10, 2020 11:57 pm

mvanthoor wrote: ↑Sun May 10, 2020 11:10 pm So... a big part of the refactor has been finished. The code is now much cleaner. (The refactor allowed me to remove several forwarding functions and references-to-references.) It did gain some speed, but not too much:

Startpos7 is now consistently in the lower 77.x range.
KiwiPete6 is now consistently in the lower 198.x range.

I can't yet test a function that omits all the moves of a knight if the knight is pinned, because of this function:
Code: Select all
    pub fn gen_all_moves(&self, ml: &mut MoveList) {
        self.move_generator.gen_all_moves(self, ml);
    }
Basically, Rustic is wrtten in a pseudo-object oriented way. The move generator is in the board, so I don't have to pass the move generator down to each function seperately.

The above function in board calls the move generator, while passing a reference to the object it belongs to (&self, in this case the board). Rust doens't have any problems with that, as long as gen_all_moves() doesn't change the board. However, if I want to test this pin trick, I need to temporarily remove the knight from the board, and thus the board must be able to change:
Code: Select all
    pub fn gen_all_moves(&mut self, ml: &mut MoveList) {
        self.move_generator.gen_all_moves(self, ml);
    }
(Note the &mut in the function signature.)

Now Rust is going to complaining. You can have as many read-only references as you want, or ONE and ONLY ONE mutating reference. In this case, I already have a non-mutating reference (self.move_generator.gen_all_moves() uses a reference to the board because of "self.move_..."), and I'm also passing a mutating reference (the incoming &mut self) to the same function.

Obviously the complaint is that I now have a non-mutating and a mutating reference at the same time, and thus Rust can't guarantee either one. So the only solution is to split the move generator from the board. I'll have to think about how I can handle this (maybe using an encompassing struct that only has a single level of components, such as Board, Search, Evaluate...) or in another way. Not being able to instantiate a global variable/struct and then just refer to it makes this somewhat harder than I thought.

Why do you have to modify the board in the first place? Can’t you just generate queen moves from your kings position and see if it passes your knights before the opponents queen, rook or bishop?

Terje · Post by **Terje** » Mon May 11, 2020 1:33 am

Make a bitboard with all pieces, remove the knight in question from this bitboard, lookup rook/bishop attacks from your king with this bitboard as blockers, and see if it hit an enemy rook+queen or bishop+queen. Unless your attack lookup needs to use the current board as blockers this should work?

mvanthoor · Post by **mvanthoor** » Mon May 11, 2020 1:41 am

@Pio: Could be done that way.

I think removing the piece (not only a knight, in the future) and seeing whichever piece can attack the king is faster. This can ve extended for an in check mive generation, to capture or block the checker.

Having the situation as described above, where I can't pass a mutable board to a function (where the board is the only mutable element in the program at this point) is undesirable.

The solution I'll implement will also be needed for the hash tables in the future. (Probably not put the hash tables in the board and attach functions to them; or it'll cause the same problem as with the move generator.)

mvanthoor · Post by **mvanthoor** » Mon May 11, 2020 1:46 am

Terje wrote: ↑Mon May 11, 2020 1:33 am Make a bitboard with all pieces, remove the knight in question from this bitboard, lookup rook/bishop attacks from your king with this bitboard as blockers, and see if it hit an enemy rook+queen or bishop+queen. Unless your attack lookup needs to use the current board as blockers this should work?

Yes, I already have all of that in the board. square_attacked already works like this. I could make a separate function that receives a copy of the occupancy bitboards to work with, but that's an extra 8 byte copy to perform, and the undesirable design of the board/movegen connection still exists.

mvanthoor · Post by **mvanthoor** » Mon May 11, 2020 9:32 pm

I've finished the refactor for extracting the move generator from Board into a new base struct named Engine.

During the refactor, StartPos7 fluctuated somewhat in speed, and finally settled on ~78.5 seconds (so it lost a second).
KiwiPete6 however, steadily dropped perft time, and it's now in the upper 196.x range (without PEXT).

Now I can (probably) try that "pinned knight" trick I mentioned. I'm curious to see how much speed it gains. After that I'll probably shelve it as a future enhancement, to be implemented during pin detection.

Code: Select all

Benchmarking perft 1-6:

8   r . . . k . . r
7   i . i i q i b .
6   b n . . i n i .
5   . . . I N . . .
4   . i . . I . . .
3   . . N . . Q . i
2   I I I B B I I I
1   R . . . K . . R

    A B C D E F G H

Zobrist key:        fb86acbc2034870e
Active Color:       White
Castling:           KQkq
En Passant:         -
Half-move clock:    0
Full-move number:   1

Perft 1: 48 (0 ms, inf leaves/sec)
Perft 2: 2039 (0 ms, inf leaves/sec)
Perft 3: 97862 (3 ms, 32620666 leaves/sec)
Perft 4: 4085603 (104 ms, 39284644 leaves/sec)
Perft 5: 193690690 (4661 ms, 41555608 leaves/sec)
Perft 6: 8031647685 (196973 ms, 40775373 leaves/sec)
Total time spent: 201741 ms
Execution speed: 40792520 leaves/second

mvanthoor · Post by **mvanthoor** » Fri Oct 09, 2020 2:46 am

During the progress on Rustic, I never stopped to keep an eye on the engine's speed.

The last version, compiled in this thread, ran perft 7 from the starting position at around 78 seconds. It ran Perft 6 from the KiwiPete position at around 196 seconds.

During the upgrade to Rust 1.45, I encountered a massive 20% speed drop. (i.e., a compile with version 1.45.x was 20% slower than a compile with version 1.44.x) I reported this as a bug, with complete explanation and the conclusion that the problem was caused by the upgrade to LLVM 10, from version 9, but the Rust team at some point removed the "Priority" label from this, and replaced it by "Medium". Therefore I reverted all of my refactors, and did them again, using Rust 1.46 as a basis, keeping a close eye on engine speed with every line I changed.

(In the process, I removed incremental material updates in preparation of implementing a tapered evaluation. Incremental material values are of no use in this case, because piece values can change at any moment.)

Refactoring again on the basis of Rust 1.46, and by removing the incremental material update, I got Perft 7 from the starting position down to 72 seconds; the KiwiPete position ran at a speed of 182 seconds. That is an improvement of almost 10 percent.

Now Rust 1.47 was released. I just recompiled Rustic and tested again. The main difference between Rust 1.46 and 1.47 is the upgrade from LLVM 10 to 11. Now the speeds are 85 seconds for Perft 7 Startpos, and 185 seconds for Perft 6 Kiwipete.

Those are 82 -> 85 3.6% and 1.6% speed drops. I know that speed differences that are so small do not really matter for playing strength, but it's still making me tired of testing new Rust versions; I can't just be sure that a new version won't slow down the engine. (I have seen similar complaints about newer versions of GCC, generating measurably slower code than older versions for the same programs.) Maybe, if the speed drop isn't too severe, I should just ignore it, and hope that it will be improved again with the next version of Rust.

Apart from this, writing software in this language is _easy_, if you can get past the nested types that look like Russian Dolls ( I'm looking at you, Arc<Option<Mutex<Board>>> ... ) , and get to grok the borrow checker. If it compiles, and you have your logic correct, it WILL work as intended.

hgm · Post by **hgm** » Fri Oct 09, 2020 10:38 am

mvanthoor wrote: ↑Fri Oct 09, 2020 2:46 am(In the process, I removed incremental material updates in preparation of implementing a tapered evaluation. Incremental material values are of no use in this case, because piece values can change at any moment.)

Actually they are. You just have to keep track of more than one incremental total. But that is still way faster than calculating everything from scratch.

The standard method for doing this is to keep two incremental material/PST evals, 'opening' and 'endGame'. For the final evaluation the linear interpolation (material * opening + (MAXMATERIAL - material) * endGame)/MAXMATERIAL is then taken. MAXMATERIAL is a constant (often even a power of 2), so there is no real division here (which could be slow), just a shift, or perhaps a multiplication and a shift. This assumes piece values vary linearly with 'material' (which is also updated incrementally); for higher-order dependence you would have to keep track of more total PST evals.

In case you like micro-optimalizations you can even remove the need for two incremental PST evals, by packing them into one word: in KingSlayer all the PST and piece-value tables are initialized as ((8*(opening - endGame) + 1)/3 << 13) + endGame. The material count ranges from 0 to 24 (B=N=1, R=2, Q=4), and the single incremental pstEval is then multiplied in the final evaluation with (1<<19) + material, to produce

((1<<19) + material) * (((8*(opening - endGame) + 1)/3 << 13) + endGame) =
material * ((8*(opening - endGame) + 1)/3 << 13) + (endGame << 19) + material*endGame,

as the product of the two large terms overflows out of the 32-bit word. Right-shifting the product 19 bits gets rid of the third term (which is never larger than 1<<19), and leaves you with

material*(opening - endGame)/24 + rounding + endGame =
material*opening/24 + (1 - material/24)*endGame + rounding,

the sought linear interpolation. (For material=0 is equals endGame, for material=24 it equals opening.)

When MAXMATERIAL is a power of 2 things are even simpler, and you can do with only shifts, and no rounding; in KingSlayer an extra division by 3 is needed, leading to some rounding noise, which is minimized by effectively storing the difference part as fixed-point value with 3-bit fraction (because of the 8*). The rounding error is thus never more than 1 LSB of the total result. (It effectively forces the difference between endGame and opening values to be a multiple of 3 centi-Pawn, which would require an adjustment of at most one centi-Pawn of one or the other.)

mvanthoor · Post by **mvanthoor** » Fri Oct 09, 2020 11:01 am

hgm wrote: ↑Fri Oct 09, 2020 10:38 am Actually they are. You just have to keep track of more than one incremental total. But that is still way faster than calculating everything from scratch.

Yes, I know; I realised this as well yesterday. I can just keep BOTH material evaluations, obviously, update them incrementally, and then interpolate between them.

I'm still miffed about the fact that a performance drop always seems to be in the cards when using Rust, because of changing LLVM versions. (AFAIK, they're now focusing on compilation speed and stabilizing features.)

Perft speed optimization (renamed)

Re: Can't get Minic to run perft?

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)