engine speed issues

Discussion of chess software programming and technical issues.

Moderator: Ras

Sazgr
Posts: 66
Joined: Thu Dec 09, 2021 8:26 pm
Full name: Kyle Zhang

engine speed issues

Post by Sazgr »

I recently profiled my 2mnps engine, doing a 1 minute search on the start position.

As any further pruning (I already have null move, LMR, futility, delta) is not going to make the pure speed faster, and I have seen many many engines faster than 2mnps on here, there must be something wrong with the speed of my engine. However, the most surprising functions in the profiling report are:

1. make_move() - 12.61%
2. undo_move() - 9.6%
As my engine uses bitboards, I was not expecting make_move to take such a big portion of the running time. Even though make_move was called 102.6 million times, quiescence was called the same number of times and only took up around 3% of the time. A possible explanation is that I do incremental PST's in make_move which might slow it down. Full 20% of the time used for making moves is suspicious

For reference here is the make_move function, undo_move is similar

Code: Select all

void Position::make_move(Move move) {
    hash ^= castling_rights.back() ^ zobrist_enpassant[enpassant_square.back()];
    hash ^= zobrist_black;
    side_to_move = !side_to_move;
    switch (move.flag()) {
        case none:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece());
            break;
        case knight_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 2);
            break;
        case bishop_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 4);
            break;
        case rook_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 6);
            break;
        case queen_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 8);
            break;
        case k_castling:
            fill_sq(move.start(), empty_square);
            fill_sq(move.start() + 3, empty_square);
            fill_sq(move.end(), move.piece());
            fill_sq(move.start() + 1, move.piece() - 4);
            break;
        case q_castling:
            fill_sq(move.start(), empty_square);
            fill_sq(move.start() - 4, empty_square);
            fill_sq(move.end(), move.piece());
            fill_sq(move.start() - 1, move.piece() - 4);
            break;
        case enpassant:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end() ^ 8, empty_square);//ep square
            fill_sq(move.end(), move.piece());
            break;
    }
    enpassant_square.push_back((!(move.piece() & ~1) && move.end() == (move.start() ^ 16)) ? ((move.start() + move.end()) >> 1) : 64);
    castling_rights.push_back(castling_rights.back() & ~((1ull << move.start()) | (1ull << move.end())));
    hash ^= castling_rights.back() ^ zobrist_enpassant[enpassant_square.back()];
}
Notes:
castling rights is a bitboard with up to 6 bits filled for the unmoved kings and rooks
fill_sq flips a bit in two bitboards, does 2 xors on the hash, and updates 2 sets of PSTs and a phase variable

Any suggestions would be very helpful !
User avatar
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

2Mnps is very fast, not? For a bitboard engine.

If make/unmake takes 22.2%, and QS only 3%, what is the remaining 75% spent on? I would think that is the most-likely candidate for improvement.
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: engine speed issues

Post by clayt »

I can't say I know too much about what's going on here, but I can offer some guesses.
1. Is your search single-threaded? In this case, 2Mnps is actually not super unreasonable.
2. You seem to be calling `move.start()` and `move.end()` a lot. I assume that each time, it masks and shifts some information from your move - in this case, I recommend storing those values to save some effort.

Spending about that much time on making moves seems roughly correct, but 9.6% of your time on undo is a bit high. Are you doing copy-and-make or are you actively updating the board on undo?

Lastly: how much time is spent on move generation? I suggest that you use your profiling data to generate a flamegraph, since that will make it a little easier to see where time is being spent.
Sazgr
Posts: 66
Joined: Thu Dec 09, 2021 8:26 pm
Full name: Kyle Zhang

Re: engine speed issues

Post by Sazgr »

hgm wrote: Tue Jul 19, 2022 6:50 pm 2Mnps is very fast, not? For a bitboard engine.

If make/unmake takes 22.2%, and QS only 3%, what is the remaining 75% spent on? I would think that is the most-likely candidate for improvement.
In my engine

search 28%
movegen 25%
make/unmake 23%
TT query/insert 10%
check detection 6%
vector create/destroy 4%
other 3%
Sazgr
Posts: 66
Joined: Thu Dec 09, 2021 8:26 pm
Full name: Kyle Zhang

Re: engine speed issues

Post by Sazgr »

clayt wrote: Tue Jul 19, 2022 6:55 pm I can't say I know too much about what's going on here, but I can offer some guesses.
1. Is your search single-threaded? In this case, 2Mnps is actually not super unreasonable.
2. You seem to be calling `move.start()` and `move.end()` a lot. I assume that each time, it masks and shifts some information from your move - in this case, I recommend storing those values to save some effort.

Spending about that much time on making moves seems roughly correct, but 9.6% of your time on undo is a bit high. Are you doing copy-and-make or are you actively updating the board on undo?

Lastly: how much time is spent on move generation? I suggest that you use your profiling data to generate a flamegraph, since that will make it a little easier to see where time is being spent.
1. Yes my search is single threaded. So it is only slightly unreasonable? What would you expect for a single threaded search?

2. OK I will try that.

I am doing make-unmake, my unmake function looks pretty similar except i just pop castling and ep rights off the stacks

25-30% of the time is for move generation, and it is mostly generating the attack map (6% of the 25-30%) (for king move and castling legality checking) and slider attacks

And yes I will try the flamegraph :)
User avatar
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

I would think 1Mnps is normal.

What puzzles me is that search takes so much, and QS so little. Normally 85% of all nodes are QS nodes. Given that factor a call to search takes 65 times more time than a call to QS. While these usually do so much the same thing that I never bother to make separate routines for them.

Do you have some complicated move sorting going on in search?
Sazgr
Posts: 66
Joined: Thu Dec 09, 2021 8:26 pm
Full name: Kyle Zhang

Re: engine speed issues

Post by Sazgr »

hgm wrote: Tue Jul 19, 2022 8:00 pm I would think 1Mnps is normal.

What puzzles me is that search takes so much, and QS so little. Normally 85% of all nodes are QS nodes. Given that factor a call to search takes 65 times more time than a call to QS. While these usually do so much the same thing that I never bother to make separate routines for them.

Do you have some complicated move sorting going on in search?
Weirdly out of 18million nodes only 3million are Qnodes, I will disable all QS pruning and check for bugs.

I do not do any more complicated sort in search than in qsearch, just staged move generation (hash move, captures, quiets) the last two are sorted with std::sort(). I once tried picking the best move every time but somehow it got even slower. I might re-try that though.
User avatar
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: engine speed issues

Post by hgm »

Well, sorting algorithms usually scale worse than linear with the number of moves, and there are many more quiets than captures. I am not sure though if std::sort would not be mentioned separately in the profile (as part of the 3%). Low number of QS nodes is certainly suspect; most of the search nodes must be leaves, so you have very few QS calls per leaf.

What does search do anyway, if you count move generation, make/unmake and check testing separately anyway. It sounds like it should just be a tight loop over moves where an iteration does almost nothing. (Because if(score > alpha) only rarely would be true.) Surely make + unmake should do by far the most work in that loop?
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: engine speed issues

Post by clayt »

Sazgr wrote: Tue Jul 19, 2022 7:45 pm
clayt wrote: Tue Jul 19, 2022 6:55 pm I can't say I know too much about what's going on here, but I can offer some guesses.
1. Is your search single-threaded? In this case, 2Mnps is actually not super unreasonable.
2. You seem to be calling `move.start()` and `move.end()` a lot. I assume that each time, it masks and shifts some information from your move - in this case, I recommend storing those values to save some effort.

Spending about that much time on making moves seems roughly correct, but 9.6% of your time on undo is a bit high. Are you doing copy-and-make or are you actively updating the board on undo?

Lastly: how much time is spent on move generation? I suggest that you use your profiling data to generate a flamegraph, since that will make it a little easier to see where time is being spent.
1. Yes my search is single threaded. So it is only slightly unreasonable? What would you expect for a single threaded search?

2. OK I will try that.

I am doing make-unmake, my unmake function looks pretty similar except i just pop castling and ep rights off the stacks

25-30% of the time is for move generation, and it is mostly generating the attack map (6% of the 25-30%) (for king move and castling legality checking) and slider attacks

And yes I will try the flamegraph :)
2Mnps is quite fast for single-threaded search, assuming you're doing evaluations. If you want a more accurate estimate as to whether your make-unmake process is fast enough, consider writing a perft routine and benchmarking that, so that time spent on updating PST values and other such things isn't wasted.

You might find that you can get performance improvements from copy-on-make in some circumstances. It helps to have some perft data to verify which approach is better for you.

For some data, my engine, which uses copy-on-make and likely spends a little more time on evaluation, runs at 1.5-1.9Mnps in a single thread. In terms of proportional calls observed by perf, here's an example case of my engine searching to a depth of 10:
  • 59.6% Quiescence.
  • 22.1%: move generation (of which 12.7% of the total is spent tagging moves with their effect on incremental evaluation).
  • 14.7%: Move make.
  • 9.9%: Transposition table probe.
  • 9.1%: Leaf evaluation.
  • 7.7%: Game over
  • 5.8%: Move sorting (using a lazy insertion sort).
  • 3.9%: Transposition table storage.
  • 3.6%: Move unmake.
Chessnut1071
Posts: 313
Joined: Tue Aug 03, 2021 2:41 pm
Full name: Bill Beame

Re: engine speed issues

Post by Chessnut1071 »

Sazgr wrote: Tue Jul 19, 2022 4:28 pm I recently profiled my 2mnps engine, doing a 1 minute search on the start position.

As any further pruning (I already have null move, LMR, futility, delta) is not going to make the pure speed faster, and I have seen many many engines faster than 2mnps on here, there must be something wrong with the speed of my engine. However, the most surprising functions in the profiling report are:

1. make_move() - 12.61%
2. undo_move() - 9.6%
As my engine uses bitboards, I was not expecting make_move to take such a big portion of the running time. Even though make_move was called 102.6 million times, quiescence was called the same number of times and only took up around 3% of the time. A possible explanation is that I do incremental PST's in make_move which might slow it down. Full 20% of the time used for making moves is suspicious

For reference here is the make_move function, undo_move is similar

Code: Select all

void Position::make_move(Move move) {
    hash ^= castling_rights.back() ^ zobrist_enpassant[enpassant_square.back()];
    hash ^= zobrist_black;
    side_to_move = !side_to_move;
    switch (move.flag()) {
        case none:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece());
            break;
        case knight_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 2);
            break;
        case bishop_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 4);
            break;
        case rook_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 6);
            break;
        case queen_pr:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end(), move.piece() + 8);
            break;
        case k_castling:
            fill_sq(move.start(), empty_square);
            fill_sq(move.start() + 3, empty_square);
            fill_sq(move.end(), move.piece());
            fill_sq(move.start() + 1, move.piece() - 4);
            break;
        case q_castling:
            fill_sq(move.start(), empty_square);
            fill_sq(move.start() - 4, empty_square);
            fill_sq(move.end(), move.piece());
            fill_sq(move.start() - 1, move.piece() - 4);
            break;
        case enpassant:
            fill_sq(move.start(), empty_square);
            fill_sq(move.end() ^ 8, empty_square);//ep square
            fill_sq(move.end(), move.piece());
            break;
    }
    enpassant_square.push_back((!(move.piece() & ~1) && move.end() == (move.start() ^ 16)) ? ((move.start() + move.end()) >> 1) : 64);
    castling_rights.push_back(castling_rights.back() & ~((1ull << move.start()) | (1ull << move.end())));
    hash ^= castling_rights.back() ^ zobrist_enpassant[enpassant_square.back()];
}
Notes:
castling rights is a bitboard with up to 6 bits filled for the unmoved kings and rooks
fill_sq flips a bit in two bitboards, does 2 xors on the hash, and updates 2 sets of PSTs and a phase variable

Any suggestions would be very helpful !
2Mnps is about the same as my PEXT bitboard engine, which includes evaluation. If you're using an older chip, single thread i7, most of the recently developed bitboard engines are very close to your speed. Obviously, there are developers on this board with significantly faster engines, but, that comes after years of tweaking and adjusting. The biggest increase you have control over is the optimized evaluation function, assuming you have an efficient hash and history. My engine doubled its speed after Muhler corrected my hash table. With the ideas and suggestions on this board you should be able to at least double your speed in a year; however, they keep getting speedier as well. A year ago I had what I thought was a speedy 5,600 nodes/sec VBA engine in Excel and never used C++ or C#. I'm not sure why I'm doing this because it sure isn't making me any money. bet of luck