Profiling performance test (perft) related question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Profiling performance test (perft) related question

Post by maksimKorzh »

I've written two engines that use slightly different move generation philosophy. Now I'm working with some sort of a hybrid solution to get the most out of both of them. I'm using gprof profiler to compare perft test results. These results confused me totally, I'm asking for help to explain how is the following is actually possible... :shock: :shock: :shock:
So here are my tests.

in this position
[d]r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1
at depth 3 we have 97862 nodes visited

My first engine's(Chenglite) flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
45.45 0.05 0.05 100362 0.00 0.00 MakeMove
45.45 0.10 0.05 48 1.04 2.29 Perft
9.09 0.11 0.01 107164 0.00 0.00 IsSquareAttacked
0.00 0.11 0.00 100362 0.00 0.00 AddMove
0.00 0.11 0.00 2088 0.00 0.00 GenerateMoves
0.00 0.11 0.00 2 0.00 0.00 GetTimeMs
0.00 0.11 0.00 1 0.00 0.00 InitSearch
0.00 0.11 0.00 1 0.00 0.00 ParseFen
0.00 0.11 0.00 1 0.00 110.00 PerftTest

% the percentage of the total running time of the
time program used by this function.

which is pretty satisfying

Chenglite's design:

0x88 board representation
VICE - like move generator, but copy-make make move approach instead of make/unmake and NO PIECE LIST


my second engine's(Tighty) flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
77.85 3.55 3.55 4280842 0.00 0.00 B // eval position to score moves for move ordering
13.38 4.16 0.61 100507 0.00 0.00 G // generate moves
5.37 4.41 0.24 1 0.24 4.55 F
1.86 4.49 0.09 8530175 0.00 0.00 M //make move is called twice - first time in movegen, second - in perft
1.32 4.55 0.06 8530175 0.00 0.00 U //unmake move
0.11 4.55 0.01 X
0.11 4.56 0.01 frame_dummy
0.00 4.56 0.00 1 0.00 0.00 A // parse fen

% the percentage of the total running time of the
time program used by this function.

almost 5 times slower...

Tighty's design is similar to micro-Max
it uses 3 nested loops in move gen, just like microMax but has separated make/unmake move routines and move list
Tighty is a king capture engine just like micro-Max so it doesn't have is_square_attacked() routine.

The first thing confused me the most at this point is number of movegen calls - 2088 in Chenglite vs 100507 in Tighty, WTF?????
please ignore eval() calls in Tighty's movegen, it takes the most of time, but doesn't matter at the moment, also make/unmake move is called twice - first in movegen for under promotion(just like in micro-Max Tighty is OR'ing piece in target square if this is a promoted pawn AFTER the move is made) and move scoring purposes which is quite inefficient practically but allows to ignore all known move-ordering techniques and sort the move list before the very first search iteration. So I'll emphasize the question again - movegen calls - 2088 in Chenglite vs 100507 in Tighty - how on earth is that possible???

And at the end the engine I'm currently working at(MCP - minimalist chess program) flat profile:

Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
68.85 1.26 1.26 673873 0.00 0.00 square_att
13.11 1.50 0.24 99950 0.00 0.00 generate_moves
9.84 1.68 0.18 1 0.18 1.83 perft
5.46 1.78 0.10 100362 0.00 0.00 make_move
2.73 1.83 0.05 4237213 0.00 0.00 add_move
0.00 1.83 0.00 1 0.00 0.00 parse_fen

% the percentage of the total running time of the
time program used by this function.

it's at least 3 times faster than Tighty, but still 2 times slower than Chenglite which I consider to be strange...
MCP design.

0x88 board representation
3 nested loops in movegen, like micro-Max, BUT NO make/unmake move in movegen(underpromotions are now generated separately from making moves) no eval for move scoring. IMPORTANT: MCP compare to Tighty and micro-Max is NOT a KING CAPTURE engine, it has is_square_attacked function written in the style of move gen - same 3 nested loops(over pieces, directions and squares). It has PIECE LIST(the very first piece list in my chess programming life) which is the same as VICE's piece list - piecec[piece][pieceNum[piece]] (pieceNum array is used to iterate over pieces of the same piece type). BUT generate moves is called 99950 times!!!! it's less than 100507 times in tighty (I believe this is because is_square_attacked saves some movegen calls) but still much more than in Chenglite - 2088 calls.

I REALLY WONDER WHY is that happening. I mean could someone please kindly explain to me why the difference in move gen (and is_square_attecked) calls is so HUGE? What am I doing wrong and what particularly influences the number of function calls?
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Profiling performance test (perft) related question

Post by Richard Allbert »

It's quite difficult to understand, but are the nodes visited the same for each version? You state 97862 in the first example, and then not for the others.

Next point: Have you done some perft testing to check the move generators are working correctly?
User avatar
maksimKorzh
Posts: 771
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Re: Profiling performance test (perft) related question

Post by maksimKorzh »

Richard Allbert wrote: Thu Nov 15, 2018 12:27 pm It's quite difficult to understand, but are the nodes visited the same for each version? You state 97862 in the first example, and then not for the others.

Next point: Have you done some perft testing to check the move generators are working correctly?
Thanks for your reply, Richard, yes, the move generator is indeed working correctly, and 97862 was in both cases. The problem was a simple typo bug - I was working for many hours that day and felt completely exhausted - the problem was I've placed if(!depth) { nodes++; return } in the loop, traversing over move list so it didn't return from function immediately.

Currently I'm facing another problem - I have to almost the same make move functions it MCP and Chenglite(both have move encoded into 32 - bit integer, same move flags, the only difference that MCP keeps track of the rookSq - a variable containing the rook to move on castling and at the same time flagging that the particular move is indeed castling. This variable was in micro-Max), but the problem is that in the above position this time at depth 4 Chenglite spends 0,9 ms on make move while MCP - 2,56 :( :( :( :( Now I'm checking out the theory if the nubmer of rejected illegal moves is different.

MCP:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
47.94 3.71 3.71 1 3.71 7.74 perft
31.42 6.15 2.44 4248664 0.00 0.00 make_move
14.26 7.25 1.10 4539743 0.00 0.00 square_att
5.68 7.70 0.44 99950 0.00 0.00 generate_moves
0.58 7.74 0.04 4237213 0.00 0.00 add_move
0.06 7.75 0.01 1 0.01 0.01 parse_fen
0.06 7.75 0.01 frame_dummy


Chenglite:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
34.10 1.49 1.49 48 0.03 0.09 Perft
31.12 2.85 1.36 4539743 0.00 0.00 IsSquareAttacked
24.71 3.93 1.08 4248664 0.00 0.00 MakeMove
8.01 4.28 0.35 99950 0.00 0.00 GenerateMoves
2.06 4.37 0.09 4248664 0.00 0.00 AddMove
0.00 4.37 0.00 2 0.00 0.00 GetTimeMs
0.00 4.37 0.00 1 0.00 0.00 InitSearch
0.00 4.37 0.00 1 0.00 0.00 ParseFen
0.00 4.37 0.00 1 0.00 4.37 PerftTest