Perft speed optimization (renamed)

mvanthoor · Post by **mvanthoor** » Tue Apr 07, 2020 7:15 pm

I found the mistake. If there's one element in the array, I should return index 0. (So should decrement the counter before returning the element).

Now it works. It seems not to have made any difference. Perft time is still 114 seconds. Rusts Vector is as efficient as a fixed-size array backed by a counter wrapped in a Struct that implements push() and pop() (without actually clearing the element on pop()), even though the array is on the stack, and the vector is on the heap.

I can only see in the profiler that "perft" takes most of the time (like... duh), and it highlights the "make_move" function as a hotspot. (Double... duh...). I can't actually see what part within make_move takes the most time and why

Maybe I'll do a test and call make_move and unmake_move like half a million times separately, just for profiling... (edit: exactly the same. Now the profiler pinpoints main() as the function consuming the most time, with the make_move() line highlighted. WHY is it not showing make_move() as the spending the most time? In this situation with the profiler only showing the function calling make_move, I can't pinpoint the hotspots within make_move().)

mvanthoor · Post by **mvanthoor** » Tue Apr 07, 2020 11:22 pm

Grr. I've found one thing in the move generator that was superfluous.

I checked for being in check, even before looking to see if the side to move could actually castle. So, I've move the calls to square_attacked() (for E1/F1, and so on) way down, to the point where everything must be correct (has correct castling rights, no squares blocked between king and rook) before I actually start to look if these squares are attacked.

It gained nothing.

make_move basically obliterates anything and everything. The profiler says that perft() runs for 55 seconds of the 114 seconds the run takes. So, I can open perft.rs directly in the profiler, and it (obviously...) highlights make_move. From there on out, I'm stuck, because the profiler doesn't list make_move in the list of functions that are used most. Strangely enough, it does list unmake_move.

Code: Select all

Function / Call Stack	CPU Time	Instructions Retired	Microarchitecture Usage	Module	Function (Full)	Source File	Start Address
rustic::extra::perft::perft	55.899s	798,580,000,000	85.1%	rustic.exe	rustic::extra::perft::perft	perft.rs	0x140002300
rustic::board::unmake_move::unmake_move	25.627s	409,800,000,000	93.8%	rustic.exe	rustic::board::unmake_move::unmake_move	unmake_move.rs	0x140001f70
rustic::movegen::info::square_attacked	20.475s	248,936,000,000	80.2%	rustic.exe	rustic::movegen::info::square_attacked	info.rs	0x1400039f0
rustic::movegen::gen::add_move	17.172s	242,528,000,000	85.3%	rustic.exe	rustic::movegen::gen::add_move	gen.rs	0x1400036a0

square_attacked is called a lot; twice for generating the castling moves (if there are castling rights and there aren't any blocking pieces), and once at the end of make_move. I don't think I can make that function any faster; it basically puts each piece type on the square where the king is, then generates its moves, and sees if a piece of that type of the opponent is actually on any of those genrated squares. (The so called "super-piece" method.)

add_move also seems (too) slow to me. Looking if a piece was captured on a square takes the most time in that function. (The rest are basically comparisons and bit flips and such.)

Even so, move generation in Perft only takes 7.4 seconds of the entire run. That's only 6.5% of the entire perft run; and perft doesn't do anything but generate and execute moves. So, I feel optimizing the move generator is not worth it at this point. Adding things such as pinned pieces and a special in-check move generator and such makes the move generator faster, but the *REAL* gain in that case would be to call make_move and unmake_move *LESS*.

If I could only profile make_move somehow, and see where it spends most of its time. Probably moving a piece from one square to another, as most other things have an if() statement around it. Compared to moving pieces, things such as castling occur way less often.

Sesse · Post by **Sesse** » Wed Apr 08, 2020 12:52 pm

Are you sure it's not being inlined?

Any profiler worth its salt should allow you to go in and look at individual lines or instructions these days, so you should be able to peek there.

mvanthoor · Post by **mvanthoor** » Wed Apr 08, 2020 6:12 pm

Sesse wrote: ↑Wed Apr 08, 2020 12:52 pm Are you sure it's not being inlined?

Any profiler worth its salt should allow you to go in and look at individual lines or instructions these days, so you should be able to peek there.

Hi

I'm going to investigate this by dropping optimizations down. Rust is VERY aggressive with inlining when you have maximum optimizations set. I'll first disable LTO, and then, if necessary, start dropping opt-level down from 3. I'm assuming that if a non-optimized function pinpoints line X as being the slowest, that it'll also be the slowest in the optimized function.

I've discovered a few things though, yesterday evening. "if"-statements in Rust can be very expensive. For example:

Code: Select all

let x = <do some calculation and comparisons that seem to be very heavy... and might deliver 0 as the end result>

After some thinking, I reach the conclusion that I need to do the calculation only in a certain condition; otherwise, it will always be 0 as a result. So I then write:

Code: Select all

let mut x = 0;
if <some condition here> {
	x = <heavy calculation>
}

Result: slower code. (One of those constructions increased my perft 7 time from 114 seconds to 120...) I have seen this in more than one place. Very often, at least in Rust, it seems that it is faster to just do perform X (possibly for nothing), than to first check if you should go and do it. The check is more expensive than actually performing X for nothing.

So, I went through some of the functions in the move generator and regrouped some things to minimize IF-checks, and then also inlined square_attacked(). (In Rust, on the highest optimization settings, the compiler decides on inlining, except if you state explicity always/never for a function).

This regrouping and inlining dropped perft 7 from the starting position from 114 to 100 seconds. So that's a 13% speedup right there, without actually changing any code or logic. square_attacked() and add_move() completely dropped off the proiler's radar.

When I first got Perft working, it ran at 277 seconds for perft 7. (Without hash, without bulk-counting, or special move generators.) Now, some time later, it runs at 100 seconds, still without hash, bulk-counting or special move generators. That's a 64% speedup, without actually changing any logic or adding tricks or functionality; it's just a matter of replacing technique X with Y, regrouping, inlining a function, and cutting out superfluous/double IF-checks.

I'll start having a look at less strongly optimized code to see if I can get some more information from the profiler.

Some of the progress was made through direct help by this community; even if there wasn't help like 'try this or that', posts often contained usable hints or idea's in the right direction. Thanks people

Terje · Post by **Terje** » Wed Apr 08, 2020 6:46 pm

mvanthoor wrote: ↑Wed Apr 08, 2020 6:12 pm
Sesse wrote: ↑Wed Apr 08, 2020 12:52 pm Are you sure it's not being inlined?

Any profiler worth its salt should allow you to go in and look at individual lines or instructions these days, so you should be able to peek there.
Hi I'm going to investigate this by dropping optimizations down. Rust is VERY aggressive with inlining when you have maximum optimizations set. I'll first disable LTO, and then, if necessary, start dropping opt-level down from 3. I'm assuming that if a non-optimized function pinpoints line X as being the slowest, that it'll also be the slowest in the optimized function.

I've discovered a few things though, yesterday evening. "if"-statements in Rust can be very expensive. For example:
Code: Select all
let x = <do some calculation and comparisons that seem to be very heavy... and might deliver 0 as the end result>
After some thinking, I reach the conclusion that I need to do the calculation only in a certain condition; otherwise, it will always be 0 as a result. So I then write:
Code: Select all
let mut x = 0;
if <some condition here> {
	x = <heavy calculation>
}
Result: slower code. (One of those constructions increased my perft 7 time from 114 seconds to 120...) I have seen this in more than one place. Very often, at least in Rust, it seems that it is faster to just do perform X (possibly for nothing), than to first check if you should go and do it. The check is more expensive than actually performing X for nothing.

So, I went through some of the functions in the move generator and regrouped some things to minimize IF-checks, and then also inlined square_attacked(). (In Rust, on the highest optimization settings, the compiler decides on inlining, except if you state explicity always/never for a function).

This regrouping and inlining dropped perft 7 from the starting position from 114 to 100 seconds. So that's a 13% speedup right there, without actually changing any code or logic. square_attacked() and add_move() completely dropped off the proiler's radar.

When I first got Perft working, it ran at 277 seconds for perft 7. (Without hash, without bulk-counting, or special move generators.) Now, some time later, it runs at 100 seconds, still without hash, bulk-counting or special move generators. That's a 64% speedup, without actually changing any logic or adding tricks or functionality; it's just a matter of replacing technique X with Y, regrouping, inlining a function, and cutting out superfluous/double IF-checks.

I'll start having a look at less strongly optimized code to see if I can get some more information from the profiler.

Some of the progress was made through direct help by this community; even if there wasn't help like 'try this or that', posts often contained usable hints or idea's in the right direction. Thanks people

Branches are expensive if they are hard to predict as the CPU will guess which outcome will happen and start working on that path while still doing the branch test. If it then finds that the guess was wrong all the work it did was wasted.

Inlined functions drop off the profiler's radar as they are no longer called, but are just code inserted in the function surrounding it. They still take time, counted as the surrounding functions time, although with less overhead and possibly more optimizations from knowing the context for the code.

Nice speedup!

mvanthoor · Post by **mvanthoor** » Wed Apr 08, 2020 8:14 pm

Terje wrote: ↑Wed Apr 08, 2020 6:46 pm ]Branches are expensive if they are hard to predict as the CPU will guess which outcome will happen and start working on that path while still doing the branch test. If it then finds that the guess was wrong all the work it did was wasted.

Yes, I know. I just hadn't expected that an if-statement could actually be more expensive than a function call and a computation (that might actually end up yielding nothing, such as finding all the rook moves if there's no rook).

Inlined functions drop off the profiler's radar as they are no longer called, but are just code inserted in the function surrounding it. They still take time, counted as the surrounding functions time, although with less overhead and possibly more optimizations from knowing the context for the code.

I know how inlining works, but I thought the profiler would have been able to split them out again using the debug symbols. Apparently not.

I needed to drop back to opt-level = 1 and no LTO to actually see all of the functions in the profiler.

The most expensive functions are remove_piece() and put_piece() in make_move(). Obviously.

A long time ago, I was debating between:

Code: Select all

bb_pieces [2] [6]

and:

Code: Select all

bb_white [6]
bb_black [6]

I chose the second method, thinking that I could just take a pointer to the bitboard in C-style and keep everything nicely split:

Code: Select all

bb_use = (side == WHITE) ? bb_white : bb_black

That was before I decided to code the engine in Rust, and in the very beginning, I didn't know / realize that this doesn't work in Rust. Because you have a mutating reference (which in C would be a pointer) to a component of the board, you can't do anything else with that board. You can only have many readers, or one reader/writer, not both, because of Rust's memory safety guarantees.

So... I now have to put an "if side == WHITE" in both remove_piece() and put_piece(), which apparently is very expensive. So I'm going to try and switch the bitboard representation to method 1, so I can write "bitboard_for_pieces [WHITE] [KING]" and take out two if-statements that now get called A LOT.

Then I'll see from there.

Nice speedup!

Thanks. Let's see what this bitboard change is gonna bring this evening

Should I hit 90 seconds or less with Perft 7, I'll promise I'll stop optimizing move generation and making and finish the search and UCI functionality

(PS: I still at some time need to run the original version of VICE. I'm curious to see how fast the engine is which was written by my "teacher"

)

Sesse · Post by **Sesse** » Wed Apr 08, 2020 11:00 pm

If you have debugging information, at least perf will de-inline it for you when showing the profile; pass the --inline flag. (It's not perfect, but it's a lot better than nothing.) This is assuming Rust's tool chain gives out full DWARF, which I'd assume it does.

mvanthoor · Post by **mvanthoor** » Wed Apr 08, 2020 11:18 pm

Sesse wrote: ↑Wed Apr 08, 2020 11:00 pm If you have debugging information, at least perf will de-inline it for you when showing the profile; pass the --inline flag. (It's not perfect, but it's a lot better than nothing.) This is assuming Rust's tool chain gives out full DWARF, which I'd assume it does.

I can't use perf because I'm on Windows. Even though Intel VTune should understand DWARF according to the docs, it doesn't. The function names are mangled. I've switched to the MSVC toolchain. Basically, the only difference should be the output of the debug symbols and using link.exe from MS instead of ld.exe from GNU... but Rustic compiled with the MSVC toolchain is MUCH faster (and I mean... like 40%) than when it's compiled with the GNU toolchain.

I've implemented the bitboards in the "bb_side[side][piece]" way now, so I can write "bb_side[WHITE][KING]". It solved many strange constructions throughout the engine, and it dropped perft 7 times from 100.1 seconds to 95.5 (it fluctuates between 94.9 and 95.7). There are some hotspots in make_move (if-statements...) that I might yet be able to optimize away, especially the one for removing castling permissions when moving away from the start square. I'll test VICE/Weiss/... CastlePerm table for this. I purposely didn't implement it like that because VICE did it like that

Perft test:

Code: Select all

Benchmarking perft 1-7:

8   r n b q k b n r
7   i i i i i i i i
6   . . . . . . . .
5   . . . . . . . .
4   . . . . . . . .
3   . . . . . . . .
2   I I I I I I I I
1   R N B Q K B N R

    A B C D E F G H

Zobrist key:        819aa694337673fb
Active Color:       White
Castling:           KQkq
En Passant:         -
Half-move clock:    0
Full-move number:   1

Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (6 ms, 32880166 leaves/sec)
Perft 5: 4865609 (145 ms, 33555924 leaves/sec)
Perft 6: 119060324 (3607 ms, 33008129 leaves/sec)
Perft 7: 3195901860 (95511 ms, 33461086 leaves/sec)
Total time spent: 99269 ms
Execution speed: 33444825 leaves/second

There's one strange thing though. In info.rs, which has square_attacked() in it, I found a calculation which was left over, from the time when Rustic didn't have a full bitboard implementation yet:

Code: Select all

A:

let bb_square = 1u64 << square;
let bb_pawns = if attacker == WHITE {
        (bb_square & !board.bb_files[FILE_A]) >> 9 | (bb_square & !board.bb_files[FILE_H]) >> 7
} else {
        (bb_square & !board.bb_files[FILE_A]) << 7 | (bb_square & !board.bb_files[FILE_H]) << 9
};

This calculates which pawns attack a certain square.

I thought to be smart, and replace it with a call to the move generator, which should be able to just copy the relevant attack bitboard to bb_pawns:

Code: Select all

B:

let bb_pawns = board.get_pawn_attacks(attacker ^ 1, square);

This works, but my perft time shot up from 95 to 107 seconds.

In the profiler, the call to get_pawn_attacks() doesn't take any more time than any of the other calls such as get_non_slider_attacks(). Still, how is it possible that this ugly construction in code snippet A is _SO MUCH_ faster? The only thing I can imagine is that the compiler somehow completely optimizes it away. Obviously, it's not possible to calculate every other piece in this function on the spot.

Tomorrow evening I'll run a full perftsuite, with option A and option B, and see which one is faster over the entire suite; but I suspect it to be ugly option A. In that case, I'll leave it in, even though it pains my eyes

Terje · Post by **Terje** » Wed Apr 08, 2020 11:46 pm

mvanthoor wrote: ↑Wed Apr 08, 2020 11:18 pm There's one strange thing though. In info.rs, which has square_attacked() in it, I found a calculation which was left over, from the time when Rustic didn't have a full bitboard implementation yet:
Code: Select all
A:

let bb_square = 1u64 << square;
let bb_pawns = if attacker == WHITE {
        (bb_square & !board.bb_files[FILE_A]) >> 9 | (bb_square & !board.bb_files[FILE_H]) >> 7
} else {
        (bb_square & !board.bb_files[FILE_A]) << 7 | (bb_square & !board.bb_files[FILE_H]) << 9
};
This calculates which pawns attack a certain square.

I thought to be smart, and replace it with a call to the move generator, which should be able to just copy the relevant attack bitboard to bb_pawns:
Code: Select all
B:

let bb_pawns = board.get_pawn_attacks(attacker ^ 1, square);
This works, but my perft time shot up from 95 to 107 seconds.

In the profiler, the call to get_pawn_attacks() doesn't take any more time than any of the other calls such as get_non_slider_attacks(). Still, how is it possible that this ugly construction in code snippet A is _SO MUCH_ faster? The only thing I can imagine is that the compiler somehow completely optimizes it away. Obviously, it's not possible to calculate every other piece in this function on the spot.

Tomorrow evening I'll run a full perftsuite, with option A and option B, and see which one is faster over the entire suite; but I suspect it to be ugly option A. In that case, I'll leave it in, even though it pains my eyes

For getting pawn attacks, having a function that shifts a bitboard in some direction can make the code a lot prettier. For example:

Code: Select all

const Direction up = (color == WHITE ? NORTH : SOUTH);
return ShiftBB(up+WEST, pawns) | ShiftBB(up+EAST, pawns);

Also useful for pawn movegen and eval

Of course, I don't know Rust, so appologies if this isn't applicable for some reason.

mvanthoor · Post by **mvanthoor** » Thu Apr 09, 2020 1:41 am

Of course that calculation can be wrapped in a function.

What I find strange is the fact that taking the result of this calculation out of the move generator's attack table is so incredibly much slower than just calculating the result again (and again... and again...). If I wrap the code in a function, it might not be much different than just calling get_pawn_attacks() (which just returns "pawns_attacks[side][square]"...)

Perft speed optimization (renamed)

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?