Perft speed optimization (renamed)
Moderators: hgm, Rebel, chrisw
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Can't get Minic to run perft?
I found the mistake. If there's one element in the array, I should return index 0. (So should decrement the counter before returning the element).
Now it works. It seems not to have made any difference. Perft time is still 114 seconds. Rusts Vector is as efficient as a fixed-size array backed by a counter wrapped in a Struct that implements push() and pop() (without actually clearing the element on pop()), even though the array is on the stack, and the vector is on the heap.
I can only see in the profiler that "perft" takes most of the time (like... duh), and it highlights the "make_move" function as a hotspot. (Double... duh...). I can't actually see what part within make_move takes the most time and why
Maybe I'll do a test and call make_move and unmake_move like half a million times separately, just for profiling... (edit: exactly the same. Now the profiler pinpoints main() as the function consuming the most time, with the make_move() line highlighted. WHY is it not showing make_move() as the spending the most time? In this situation with the profiler only showing the function calling make_move, I can't pinpoint the hotspots within make_move().)
Now it works. It seems not to have made any difference. Perft time is still 114 seconds. Rusts Vector is as efficient as a fixed-size array backed by a counter wrapped in a Struct that implements push() and pop() (without actually clearing the element on pop()), even though the array is on the stack, and the vector is on the heap.
I can only see in the profiler that "perft" takes most of the time (like... duh), and it highlights the "make_move" function as a hotspot. (Double... duh...). I can't actually see what part within make_move takes the most time and why
Maybe I'll do a test and call make_move and unmake_move like half a million times separately, just for profiling... (edit: exactly the same. Now the profiler pinpoints main() as the function consuming the most time, with the make_move() line highlighted. WHY is it not showing make_move() as the spending the most time? In this situation with the profiler only showing the function calling make_move, I can't pinpoint the hotspots within make_move().)
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Can't get Minic to run perft?
Grr. I've found one thing in the move generator that was superfluous.
I checked for being in check, even before looking to see if the side to move could actually castle. So, I've move the calls to square_attacked() (for E1/F1, and so on) way down, to the point where everything must be correct (has correct castling rights, no squares blocked between king and rook) before I actually start to look if these squares are attacked.
It gained nothing.
make_move basically obliterates anything and everything. The profiler says that perft() runs for 55 seconds of the 114 seconds the run takes. So, I can open perft.rs directly in the profiler, and it (obviously...) highlights make_move. From there on out, I'm stuck, because the profiler doesn't list make_move in the list of functions that are used most. Strangely enough, it does list unmake_move.
square_attacked is called a lot; twice for generating the castling moves (if there are castling rights and there aren't any blocking pieces), and once at the end of make_move. I don't think I can make that function any faster; it basically puts each piece type on the square where the king is, then generates its moves, and sees if a piece of that type of the opponent is actually on any of those genrated squares. (The so called "super-piece" method.)
add_move also seems (too) slow to me. Looking if a piece was captured on a square takes the most time in that function. (The rest are basically comparisons and bit flips and such.)
Even so, move generation in Perft only takes 7.4 seconds of the entire run. That's only 6.5% of the entire perft run; and perft doesn't do anything but generate and execute moves. So, I feel optimizing the move generator is not worth it at this point. Adding things such as pinned pieces and a special in-check move generator and such makes the move generator faster, but the *REAL* gain in that case would be to call make_move and unmake_move *LESS*.
If I could only profile make_move somehow, and see where it spends most of its time. Probably moving a piece from one square to another, as most other things have an if() statement around it. Compared to moving pieces, things such as castling occur way less often.
I checked for being in check, even before looking to see if the side to move could actually castle. So, I've move the calls to square_attacked() (for E1/F1, and so on) way down, to the point where everything must be correct (has correct castling rights, no squares blocked between king and rook) before I actually start to look if these squares are attacked.
It gained nothing.
make_move basically obliterates anything and everything. The profiler says that perft() runs for 55 seconds of the 114 seconds the run takes. So, I can open perft.rs directly in the profiler, and it (obviously...) highlights make_move. From there on out, I'm stuck, because the profiler doesn't list make_move in the list of functions that are used most. Strangely enough, it does list unmake_move.
Code: Select all
Function / Call Stack CPU Time Instructions Retired Microarchitecture Usage Module Function (Full) Source File Start Address
rustic::extra::perft::perft 55.899s 798,580,000,000 85.1% rustic.exe rustic::extra::perft::perft perft.rs 0x140002300
rustic::board::unmake_move::unmake_move 25.627s 409,800,000,000 93.8% rustic.exe rustic::board::unmake_move::unmake_move unmake_move.rs 0x140001f70
rustic::movegen::info::square_attacked 20.475s 248,936,000,000 80.2% rustic.exe rustic::movegen::info::square_attacked info.rs 0x1400039f0
rustic::movegen::gen::add_move 17.172s 242,528,000,000 85.3% rustic.exe rustic::movegen::gen::add_move gen.rs 0x1400036a0
add_move also seems (too) slow to me. Looking if a piece was captured on a square takes the most time in that function. (The rest are basically comparisons and bit flips and such.)
Even so, move generation in Perft only takes 7.4 seconds of the entire run. That's only 6.5% of the entire perft run; and perft doesn't do anything but generate and execute moves. So, I feel optimizing the move generator is not worth it at this point. Adding things such as pinned pieces and a special in-check move generator and such makes the move generator faster, but the *REAL* gain in that case would be to call make_move and unmake_move *LESS*.
If I could only profile make_move somehow, and see where it spends most of its time. Probably moving a piece from one square to another, as most other things have an if() statement around it. Compared to moving pieces, things such as castling occur way less often.
-
- Posts: 300
- Joined: Mon Apr 30, 2018 11:51 pm
Re: Can't get Minic to run perft?
Are you sure it's not being inlined?
Any profiler worth its salt should allow you to go in and look at individual lines or instructions these days, so you should be able to peek there.
Any profiler worth its salt should allow you to go in and look at individual lines or instructions these days, so you should be able to peek there.
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Can't get Minic to run perft?
Hi I'm going to investigate this by dropping optimizations down. Rust is VERY aggressive with inlining when you have maximum optimizations set. I'll first disable LTO, and then, if necessary, start dropping opt-level down from 3. I'm assuming that if a non-optimized function pinpoints line X as being the slowest, that it'll also be the slowest in the optimized function.
I've discovered a few things though, yesterday evening. "if"-statements in Rust can be very expensive. For example:
Code: Select all
let x = <do some calculation and comparisons that seem to be very heavy... and might deliver 0 as the end result>
Code: Select all
let mut x = 0;
if <some condition here> {
x = <heavy calculation>
}
So, I went through some of the functions in the move generator and regrouped some things to minimize IF-checks, and then also inlined square_attacked(). (In Rust, on the highest optimization settings, the compiler decides on inlining, except if you state explicity always/never for a function).
This regrouping and inlining dropped perft 7 from the starting position from 114 to 100 seconds. So that's a 13% speedup right there, without actually changing any code or logic. square_attacked() and add_move() completely dropped off the proiler's radar.
When I first got Perft working, it ran at 277 seconds for perft 7. (Without hash, without bulk-counting, or special move generators.) Now, some time later, it runs at 100 seconds, still without hash, bulk-counting or special move generators. That's a 64% speedup, without actually changing any logic or adding tricks or functionality; it's just a matter of replacing technique X with Y, regrouping, inlining a function, and cutting out superfluous/double IF-checks.
I'll start having a look at less strongly optimized code to see if I can get some more information from the profiler.
Some of the progress was made through direct help by this community; even if there wasn't help like 'try this or that', posts often contained usable hints or idea's in the right direction. Thanks people
-
- Posts: 347
- Joined: Tue Nov 19, 2019 4:34 am
- Location: https://github.com/TerjeKir/weiss
- Full name: Terje Kirstihagen
Re: Can't get Minic to run perft?
Branches are expensive if they are hard to predict as the CPU will guess which outcome will happen and start working on that path while still doing the branch test. If it then finds that the guess was wrong all the work it did was wasted.mvanthoor wrote: ↑Wed Apr 08, 2020 6:12 pmHi I'm going to investigate this by dropping optimizations down. Rust is VERY aggressive with inlining when you have maximum optimizations set. I'll first disable LTO, and then, if necessary, start dropping opt-level down from 3. I'm assuming that if a non-optimized function pinpoints line X as being the slowest, that it'll also be the slowest in the optimized function.
I've discovered a few things though, yesterday evening. "if"-statements in Rust can be very expensive. For example:
After some thinking, I reach the conclusion that I need to do the calculation only in a certain condition; otherwise, it will always be 0 as a result. So I then write:Code: Select all
let x = <do some calculation and comparisons that seem to be very heavy... and might deliver 0 as the end result>
Result: slower code. (One of those constructions increased my perft 7 time from 114 seconds to 120...) I have seen this in more than one place. Very often, at least in Rust, it seems that it is faster to just do perform X (possibly for nothing), than to first check if you should go and do it. The check is more expensive than actually performing X for nothing.Code: Select all
let mut x = 0; if <some condition here> { x = <heavy calculation> }
So, I went through some of the functions in the move generator and regrouped some things to minimize IF-checks, and then also inlined square_attacked(). (In Rust, on the highest optimization settings, the compiler decides on inlining, except if you state explicity always/never for a function).
This regrouping and inlining dropped perft 7 from the starting position from 114 to 100 seconds. So that's a 13% speedup right there, without actually changing any code or logic. square_attacked() and add_move() completely dropped off the proiler's radar.
When I first got Perft working, it ran at 277 seconds for perft 7. (Without hash, without bulk-counting, or special move generators.) Now, some time later, it runs at 100 seconds, still without hash, bulk-counting or special move generators. That's a 64% speedup, without actually changing any logic or adding tricks or functionality; it's just a matter of replacing technique X with Y, regrouping, inlining a function, and cutting out superfluous/double IF-checks.
I'll start having a look at less strongly optimized code to see if I can get some more information from the profiler.
Some of the progress was made through direct help by this community; even if there wasn't help like 'try this or that', posts often contained usable hints or idea's in the right direction. Thanks people
Inlined functions drop off the profiler's radar as they are no longer called, but are just code inserted in the function surrounding it. They still take time, counted as the surrounding functions time, although with less overhead and possibly more optimizations from knowing the context for the code.
Nice speedup!
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Can't get Minic to run perft?
Yes, I know. I just hadn't expected that an if-statement could actually be more expensive than a function call and a computation (that might actually end up yielding nothing, such as finding all the rook moves if there's no rook).
I know how inlining works, but I thought the profiler would have been able to split them out again using the debug symbols. Apparently not.Inlined functions drop off the profiler's radar as they are no longer called, but are just code inserted in the function surrounding it. They still take time, counted as the surrounding functions time, although with less overhead and possibly more optimizations from knowing the context for the code.
I needed to drop back to opt-level = 1 and no LTO to actually see all of the functions in the profiler.
The most expensive functions are remove_piece() and put_piece() in make_move(). Obviously.
A long time ago, I was debating between:
Code: Select all
bb_pieces [2] [6]
Code: Select all
bb_white [6]
bb_black [6]
Code: Select all
bb_use = (side == WHITE) ? bb_white : bb_black
So... I now have to put an "if side == WHITE" in both remove_piece() and put_piece(), which apparently is very expensive. So I'm going to try and switch the bitboard representation to method 1, so I can write "bitboard_for_pieces [WHITE] [KING]" and take out two if-statements that now get called A LOT.
Then I'll see from there.
Thanks. Let's see what this bitboard change is gonna bring this eveningNice speedup!
Should I hit 90 seconds or less with Perft 7, I'll promise I'll stop optimizing move generation and making and finish the search and UCI functionality (PS: I still at some time need to run the original version of VICE. I'm curious to see how fast the engine is which was written by my "teacher" )
-
- Posts: 300
- Joined: Mon Apr 30, 2018 11:51 pm
Re: Can't get Minic to run perft?
If you have debugging information, at least perf will de-inline it for you when showing the profile; pass the --inline flag. (It's not perfect, but it's a lot better than nothing.) This is assuming Rust's tool chain gives out full DWARF, which I'd assume it does.
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Can't get Minic to run perft?
I can't use perf because I'm on Windows. Even though Intel VTune should understand DWARF according to the docs, it doesn't. The function names are mangled. I've switched to the MSVC toolchain. Basically, the only difference should be the output of the debug symbols and using link.exe from MS instead of ld.exe from GNU... but Rustic compiled with the MSVC toolchain is MUCH faster (and I mean... like 40%) than when it's compiled with the GNU toolchain.Sesse wrote: ↑Wed Apr 08, 2020 11:00 pm If you have debugging information, at least perf will de-inline it for you when showing the profile; pass the --inline flag. (It's not perfect, but it's a lot better than nothing.) This is assuming Rust's tool chain gives out full DWARF, which I'd assume it does.
I've implemented the bitboards in the "bb_side[side][piece]" way now, so I can write "bb_side[WHITE][KING]". It solved many strange constructions throughout the engine, and it dropped perft 7 times from 100.1 seconds to 95.5 (it fluctuates between 94.9 and 95.7). There are some hotspots in make_move (if-statements...) that I might yet be able to optimize away, especially the one for removing castling permissions when moving away from the start square. I'll test VICE/Weiss/... CastlePerm table for this. I purposely didn't implement it like that because VICE did it like that
Perft test:
Code: Select all
Benchmarking perft 1-7:
8 r n b q k b n r
7 i i i i i i i i
6 . . . . . . . .
5 . . . . . . . .
4 . . . . . . . .
3 . . . . . . . .
2 I I I I I I I I
1 R N B Q K B N R
A B C D E F G H
Zobrist key: 819aa694337673fb
Active Color: White
Castling: KQkq
En Passant: -
Half-move clock: 0
Full-move number: 1
Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (6 ms, 32880166 leaves/sec)
Perft 5: 4865609 (145 ms, 33555924 leaves/sec)
Perft 6: 119060324 (3607 ms, 33008129 leaves/sec)
Perft 7: 3195901860 (95511 ms, 33461086 leaves/sec)
Total time spent: 99269 ms
Execution speed: 33444825 leaves/second
Code: Select all
A:
let bb_square = 1u64 << square;
let bb_pawns = if attacker == WHITE {
(bb_square & !board.bb_files[FILE_A]) >> 9 | (bb_square & !board.bb_files[FILE_H]) >> 7
} else {
(bb_square & !board.bb_files[FILE_A]) << 7 | (bb_square & !board.bb_files[FILE_H]) << 9
};
I thought to be smart, and replace it with a call to the move generator, which should be able to just copy the relevant attack bitboard to bb_pawns:
Code: Select all
B:
let bb_pawns = board.get_pawn_attacks(attacker ^ 1, square);
In the profiler, the call to get_pawn_attacks() doesn't take any more time than any of the other calls such as get_non_slider_attacks(). Still, how is it possible that this ugly construction in code snippet A is _SO MUCH_ faster? The only thing I can imagine is that the compiler somehow completely optimizes it away. Obviously, it's not possible to calculate every other piece in this function on the spot.
Tomorrow evening I'll run a full perftsuite, with option A and option B, and see which one is faster over the entire suite; but I suspect it to be ugly option A. In that case, I'll leave it in, even though it pains my eyes
-
- Posts: 347
- Joined: Tue Nov 19, 2019 4:34 am
- Location: https://github.com/TerjeKir/weiss
- Full name: Terje Kirstihagen
Re: Can't get Minic to run perft?
For getting pawn attacks, having a function that shifts a bitboard in some direction can make the code a lot prettier. For example:mvanthoor wrote: ↑Wed Apr 08, 2020 11:18 pm There's one strange thing though. In info.rs, which has square_attacked() in it, I found a calculation which was left over, from the time when Rustic didn't have a full bitboard implementation yet:
This calculates which pawns attack a certain square.Code: Select all
A: let bb_square = 1u64 << square; let bb_pawns = if attacker == WHITE { (bb_square & !board.bb_files[FILE_A]) >> 9 | (bb_square & !board.bb_files[FILE_H]) >> 7 } else { (bb_square & !board.bb_files[FILE_A]) << 7 | (bb_square & !board.bb_files[FILE_H]) << 9 };
I thought to be smart, and replace it with a call to the move generator, which should be able to just copy the relevant attack bitboard to bb_pawns:
This works, but my perft time shot up from 95 to 107 seconds.Code: Select all
B: let bb_pawns = board.get_pawn_attacks(attacker ^ 1, square);
In the profiler, the call to get_pawn_attacks() doesn't take any more time than any of the other calls such as get_non_slider_attacks(). Still, how is it possible that this ugly construction in code snippet A is _SO MUCH_ faster? The only thing I can imagine is that the compiler somehow completely optimizes it away. Obviously, it's not possible to calculate every other piece in this function on the spot.
Tomorrow evening I'll run a full perftsuite, with option A and option B, and see which one is faster over the entire suite; but I suspect it to be ugly option A. In that case, I'll leave it in, even though it pains my eyes
Code: Select all
const Direction up = (color == WHITE ? NORTH : SOUTH);
return ShiftBB(up+WEST, pawns) | ShiftBB(up+EAST, pawns);
Of course, I don't know Rust, so appologies if this isn't applicable for some reason.
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Can't get Minic to run perft?
Of course that calculation can be wrapped in a function.
What I find strange is the fact that taking the result of this calculation out of the move generator's attack table is so incredibly much slower than just calculating the result again (and again... and again...). If I wrap the code in a function, it might not be much different than just calling get_pawn_attacks() (which just returns "pawns_attacks[side][square]"...)
What I find strange is the fact that taking the result of this calculation out of the move generator's attack table is so incredibly much slower than just calculating the result again (and again... and again...). If I wrap the code in a function, it might not be much different than just calling get_pawn_attacks() (which just returns "pawns_attacks[side][square]"...)