I also want to hear what others think—maybe you’ve run into something similar or have ideas I haven’t considered.
I set up a gauntlet tournament with 15 engines around Lambergar’s strength and ran 5000 games at 30+0.3s time control. The results confirmed that my engine is about where I expected—right in the middle of the pack.
To analyze the games, I built a small database where I track:
- Game ID
- FEN string
- Game phase (based on Lambergar's material)
- Lambergar's evaluation
- Game result
- Stockfish 17 evaluation at depth 10
So far, I’ve processed 1200 games, skipping positions where there’s a forced mate.
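For anyone curious how the records might be extracted, here is a minimal sketch. The phase definition is my assumption (a plain piece count, which matches the 0–32 range discussed below but may not be exactly what Lambergar uses), and the mate-score convention is illustrative:

```python
# Sketch of per-position record extraction.
# ASSUMPTION: "phase" here is simply the number of pieces on the board
# (32 at the start), which fits the 0-32 range below but is a guess at
# Lambergar's actual material-based formula.
# ASSUMPTION: forced mates are reported as huge centipawn scores.

def game_phase(fen: str) -> int:
    """Piece count from the FEN board field: 32 at the start of the game."""
    board = fen.split()[0]
    return sum(c.isalpha() for c in board)

MATE_THRESHOLD = 30000  # hypothetical cutoff for mate-like scores

def is_forced_mate(score_cp: int) -> bool:
    """Skip positions where either engine reports a forced mate."""
    return abs(score_cp) >= MATE_THRESHOLD

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(game_phase(start))  # -> 32
```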
- Figure 1: Difference between Lambergar’s eval and Stockfish’s
- Figure 2: Stockfish eval (red) vs. Lambergar eval (blue)


Surprisingly, Lambergar’s evaluations are pretty close to Stockfish’s—closer than I expected. I thought I’d need to apply some scaling factor, but that doesn’t seem necessary.
What’s Happening in Different Phases:
• Phases 10–32 → Mostly looks good.
• Phases 6–10 → Also fine.
• Phases 3 and 5 → More problematic.
• Phase 0 → Definitely weird.
• Phases 31, 30, 28, 25 → Outliers within the otherwise-good 10–32 range; not sure what’s happening here yet.
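A per-phase breakdown like the one above can be produced by grouping the eval differences by phase. A minimal sketch (field names are illustrative, evals assumed in centipawns):

```python
from collections import defaultdict

def eval_diff_by_phase(rows):
    """Mean absolute Lambergar-vs-Stockfish eval gap per game phase.
    Field names ("phase", "lambergar_cp", "stockfish_cp") are hypothetical."""
    diffs = defaultdict(list)
    for row in rows:
        gap = abs(int(row["lambergar_cp"]) - int(row["stockfish_cp"]))
        diffs[int(row["phase"])].append(gap)
    return {p: sum(d) / len(d) for p, d in sorted(diffs.items())}

rows = [
    {"phase": 20, "lambergar_cp": 50, "stockfish_cp": 40},
    {"phase": 20, "lambergar_cp": -30, "stockfish_cp": -10},
    {"phase": 3, "lambergar_cp": 200, "stockfish_cp": 20},
]
print(eval_diff_by_phase(rows))  # -> {3: 180.0, 20: 15.0}
```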
So yes, the endgame is a problem, but it’s not as bad as I feared. I was expecting something much messier.
I want to take a closer look at bad draws—games where Lambergar had a clear winning eval but still ended in a draw. Specifically, I’m defining a bad draw as a game where Lambergar had at least +1.0 eval for two consecutive moves but couldn’t convert.
If anyone has thoughts, ideas, or has run into similar issues, I’d love to hear them!