I also want to hear what others think—maybe you’ve run into something similar or have ideas I haven’t considered.
I set up a gauntlet tournament with 15 engines around Lambergar’s strength and ran 5000 games at 30+0.3s time control. The results confirmed that my engine is about where I expected—right in the middle of the pack.
To analyze the games, I built a small database where I track:
- Game ID
- FEN string
- Game phase (based on Lambergar's material)
- Lambergar's evaluation
- Game result
- Stockfish 17 evaluation at depth 10
So far, I’ve processed 1200 games, skipping positions where there’s a forced mate.
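For anyone curious how the records might be extracted, here is a minimal sketch. The phase definition is my assumption (a plain piece count, which matches the 0–32 range discussed below but may not be exactly what Lambergar uses), and the mate-score convention is illustrative:

```python
# Sketch of per-position record extraction.
# ASSUMPTION: "phase" here is simply the number of pieces on the board
# (32 at the start), which fits the 0-32 range below but is a guess at
# Lambergar's actual material-based formula.
# ASSUMPTION: forced mates are reported as huge centipawn scores.

def game_phase(fen: str) -> int:
    """Piece count from the FEN board field: 32 at the start of the game."""
    board = fen.split()[0]
    return sum(c.isalpha() for c in board)

MATE_THRESHOLD = 30000  # hypothetical cutoff for mate-like scores

def is_forced_mate(score_cp: int) -> bool:
    """Skip positions where either engine reports a forced mate."""
    return abs(score_cp) >= MATE_THRESHOLD

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(game_phase(start))  # -> 32
```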
- Figure 1: Difference between Lambergar’s eval and Stockfish’s
- Figure 2: Stockfish eval (red) vs. Lambergar eval (blue)


Surprisingly, Lambergar’s evaluations are pretty close to Stockfish’s—closer than I expected. I thought I’d need to apply some scaling factor, but that doesn’t seem necessary.
What’s Happening in Different Phases:
• Phases 10–32 → Mostly looks good.
• Phases 6–10 → Also fine.
• Phases 3 and 5 → More problematic.
• Phase 0 → Definitely weird.
• Phases 31, 30, 28, 25 → Outliers within the otherwise-good 10–32 range; not sure what’s happening here yet.
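A per-phase breakdown like the one above can be produced by grouping the eval differences by phase. A minimal sketch (field names are illustrative, evals assumed in centipawns):

```python
from collections import defaultdict

def eval_diff_by_phase(rows):
    """Mean absolute Lambergar-vs-Stockfish eval gap per game phase.
    Field names ("phase", "lambergar_cp", "stockfish_cp") are hypothetical."""
    diffs = defaultdict(list)
    for row in rows:
        gap = abs(int(row["lambergar_cp"]) - int(row["stockfish_cp"]))
        diffs[int(row["phase"])].append(gap)
    return {p: sum(d) / len(d) for p, d in sorted(diffs.items())}

rows = [
    {"phase": 20, "lambergar_cp": 50, "stockfish_cp": 40},
    {"phase": 20, "lambergar_cp": -30, "stockfish_cp": -10},
    {"phase": 3, "lambergar_cp": 200, "stockfish_cp": 20},
]
print(eval_diff_by_phase(rows))  # -> {3: 180.0, 20: 15.0}
```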
So yes, the endgame is a problem, but it’s not as bad as I feared. I was expecting something much messier.
I want to take a closer look at bad draws—games where Lambergar had a clear winning eval but still ended in a draw. Specifically, I’m defining a bad draw as a game where Lambergar had at least +1.0 eval for two consecutive moves but couldn’t convert.
If anyone has thoughts, ideas, or has run into similar issues, I’d love to hear them!