New Guest Engine: Maverick NNUE...

Steve Maughan · Post by **Steve Maughan** » Sun May 24, 2026 6:32 pm

It's been an absolute pleasure to work with Ed Schröder and Chris Whittington on the release of Maverick NNUE. The NNUE was trained on 1 billion positions from Maverick's self-play games, augmented with Leela data. Ed processed the training data, and Chris contributed the search code that brings it all together.

Based on a battery of test games, the engine weighs in at approximately 3450 Elo, an increase of around 900 Elo over Maverick 1.5.

You can read more and download the engine here:

Rebel Guest Engines

Many thanks to Ed and Chris for putting in the work to make this a reality.

— Steve

Modern Times · Post by **Modern Times** » Sun May 24, 2026 7:43 pm

Steve Maughan wrote: ↑Sun May 24, 2026 6:32 pm Based on a battery of test games, the engine weighs in at approximately 3450 Elo, an increase of around 900 Elo over Maverick 1.5.

Enormous gain!

Peter Berger · Post by **Peter Berger** » Mon May 25, 2026 7:58 pm

One slowly gets the impression that the guest engines optimized for strength end up at pretty similar strength, even if their starting points were different, and however different their styles may look.
Is there some kind of rating list that has all the engines using the CSTal search? It would be interesting to be able to compare.

jorose · Post by **jorose** » Tue May 26, 2026 8:21 am

This is an interesting project, though I am not sure at this moment what the point is.

Based on the writing on the page, we are training NNUEs based on self play data of various engines. This is then used with the CSTal search code. In other words, we have replaced both the entire search and evaluation of the "Guest Engine" with another engine. From this perspective it is quite unsurprising we get a very different rating from the initial engine's rating, but I am assuming that is not the point of this project.

I am assuming we are trying to get a new engine with a similar style to the "Guest Engine". Do we have evidence that the guest engine in fact has a similar style to the new engine? Does it prefer similar openings?
This community has worked on different style benchmarks and similarity tools, I assume it is possible to verify whether the style is maintained in some way?

Also, why are we augmenting the training data with Leela training data? I can only assume this is to increase strength, but why? If the only thing tethering our resulting engine to the "guest engine" is training data, I feel there should be a stronger reasoning than strength. At that point I would suggest improving the engine further by completely replacing the training data with Leela training data.

chrisw · Post by **chrisw** » Tue May 26, 2026 11:21 am

jorose wrote: ↑Tue May 26, 2026 8:21 am This is an interesting project, though I am not sure at this moment what the point is.

Based on the writing on the page, we are training NNUEs based on self play data of various engines. This is then used with the CSTal search code. In other words, we have replaced both the entire search and evaluation of the "Guest Engine" with another engine. From this perspective it is quite unsurprising we get a very different rating from the initial engine's rating, but I am assuming that is not the point of this project.

I am assuming we are trying to get a new engine with a similar style to the "Guest Engine". Do we have evidence that the guest engine in fact has a similar style to the new engine? Does it prefer similar openings?
This community has worked on different style benchmarks and similarity tools, I assume it is possible to verify whether the style is maintained in some way?

Also, why are we augmenting the training data with Leela training data? I can only assume this is to increase strength, but why? If the only thing tethering our resulting engine to the "guest engine" is training data, I feel there should be a stronger reasoning than strength. At that point I would suggest improving the engine further by completely replacing the training data with Leela training data.

I imagine it’s a question of the number of new engine training games that are available. I guess this depends on the resources of the donating/new engine programmer. Needs 100 million imo.

jorose · Post by **jorose** » Tue May 26, 2026 1:58 pm

Is the data based on the evaluations of the positions from the self play games or just the game outcomes?

In the case we are just training on game outcome I am very skeptical there is much relation between the NNUE's outputs and the initial engine's outputs.

In the case we are training on the engine's evaluation outputs, things are more interesting and I could see how there might be some stylistic similarities (as the NNUE's output is then kind of a rough estimate of the depth N output from the guest engine). At least intuitively it would make sense?
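To make the two cases concrete, here is a minimal sketch of how a training target can be blended from the engine's eval and the game outcome, as many NNUE trainers do. Nothing here is confirmed as the pipeline used for the guest engines: the function names, the `scale` of 400 and the `lam` parameter are assumptions for illustration only.

```python
import math


def eval_to_expected_score(cp: float, scale: float = 400.0) -> float:
    """Squash a centipawn eval into an expected score in (0, 1) with a sigmoid.

    `scale` is a hypothetical constant; real trainers tune it.
    """
    return 1.0 / (1.0 + math.exp(-cp / scale))


def training_target(cp: float, result: float, lam: float, scale: float = 400.0) -> float:
    """Blend the engine's eval with the game outcome.

    lam = 1.0 trains on the eval only, lam = 0.0 on the outcome only.
    `result` is 1.0 / 0.5 / 0.0 from the side to move's point of view.
    """
    return lam * eval_to_expected_score(cp, scale) + (1.0 - lam) * result


# A dead-equal eval (0 cp) in a game the side to move went on to win:
for lam in (1.0, 0.5, 0.0):
    print(lam, training_target(0.0, 1.0, lam))
```

With `lam` close to 1 the net is pushed toward the guest engine's own scores, which is the case where some stylistic carry-over seems plausible; with `lam` close to 0 only the results matter.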

It would be cool to see some kind of data to see how similar the engines actually are. I am imagining something like this: you currently have something like 5 engines that you distilled into NNUE weights. We can compare each of these engines against CSTal with the different sets of weights. If the similarity hypothesis has merit, we would expect the guest engines to most resemble the CSTal with weights based on the training data generated by the particular guest engine.
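A rough sketch of the comparison described above, assuming each engine's chosen move has already been collected for the same list of test positions. The engine names, net names and moves below are made-up toy data, and plain move agreement is only one possible similarity measure.

```python
def match_rate(moves_a, moves_b):
    """Fraction of positions where both engines chose the same move."""
    if len(moves_a) != len(moves_b) or not moves_a:
        raise ValueError("need two non-empty move lists of equal length")
    same = sum(a == b for a, b in zip(moves_a, moves_b))
    return same / len(moves_a)


def best_matching_net(guest_moves, net_moves):
    """For each guest engine, name the net whose moves agree with it most often."""
    return {
        guest: max(net_moves, key=lambda net: match_rate(moves, net_moves[net]))
        for guest, moves in guest_moves.items()
    }


# Hypothetical toy data: one move per test position, same position order everywhere.
guest_moves = {
    "GuestA": ["e2e4", "g1f3", "d2d4", "c2c4"],
    "GuestB": ["d2d4", "c2c4", "g1f3", "e2e3"],
}
net_moves = {
    "CSTal+NetA": ["e2e4", "g1f3", "d2d4", "b1c3"],
    "CSTal+NetB": ["d2d4", "c2c4", "b1c3", "e2e3"],
}

print(best_matching_net(guest_moves, net_moves))
```

If the similarity hypothesis holds, each guest engine should map to the net trained on its own games; if the mapping looks random, the nets are probably new entities.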

chrisw · Post by **chrisw** » Tue May 26, 2026 2:18 pm

jorose wrote: ↑Tue May 26, 2026 1:58 pm Is the data based on the evaluations of the positions from the self play games or just the game outcomes?

In the case we are just training on game outcome I am very skeptical there is much relation between the NNUE's outputs and the initial engine's outputs.

In the case we are training on the engine's evaluation outputs, things are more interesting and I could see how there might be some stylistic similarities (as the NNUE's output is then kind of a rough estimate of the depth N output from the guest engine). At least intuitively it would make sense?

It would be cool to see some kind of data to see how similar the engines actually are. I am imagining something like this: you currently have something like 5 engines that you distilled into NNUE weights. We can compare each of these engines against CSTal with the different sets of weights. If the similarity hypothesis has merit, we would expect the guest engines to most resemble the CSTal with weights based on the training data generated by the particular guest engine.

I believe the training is targeting eval, although you’ll need to confirm with Ed. We are 8000 km apart in different time zones.
The original CSTal NNUE was trained by me, and I forget now what it was trained on; probably a great mix of stuff, weighted. Ed uses his own training data, so I would guess he has a base of his own and then uses the guest engine on top. It all depends on game count. You would probably get a more consistent test between guest engines by not trying to compare against the original CSTal.

Addendum: the degree of similarity/usage of guest engine characteristics is going to depend pretty much on getting as large a game set as possible from the guest author. When I tested in the past, you could get a fairly exciting engine by training just a few iterations, but the more iterations, the more stable it gets. It’s down to what the guest author can provide in terms of game count.

Peter Berger · Post by **Peter Berger** » Tue May 26, 2026 2:50 pm

I can give a qualitative answer to the less interesting part of your question, as I've watched some of the guest engines playing for several hours by now. At least for S13, it clearly does have a distinct playing style, very different from Rebel and CSTal. This will be measurable, I am sure.

The more interesting question is whether it somehow keeps the style of the original guest engine, or whether it is just some new entity. I have no idea.

Rebel · Post by **Rebel** » Tue May 26, 2026 4:17 pm

jorose wrote: ↑Tue May 26, 2026 8:21 am This is an interesting project, though I am not sure at this moment what the point is.

Based on the writing on the page, we are training NNUEs based on self play data of various engines. This is then used with the CSTal search code. In other words, we have replaced both the entire search and evaluation of the "Guest Engine" with another engine. From this perspective it is quite unsurprising we get a very different rating from the initial engine's rating, but I am assuming that is not the point of this project.

I am assuming we are trying to get a new engine with a similar style to the "Guest Engine". Do we have evidence that the guest engine in fact has a similar style to the new engine? Does it prefer similar openings?
This community has worked on different style benchmarks and similarity tools, I assume it is possible to verify whether the style is maintained in some way?

Also, why are we augmenting the training data with Leela training data? I can only assume this is to increase strength, but why? If the only thing tethering our resulting engine to the "guest engine" is training data, I feel there should be a stronger reasoning than strength. At that point I would suggest improving the engine further by completely replacing the training data with Leela training data.

I think you will remember the energy and computer time you needed to create data for NNUE evaluation, playing millions of self-play games, especially with limited hardware. For guest engines I demand an absolute minimum of one billion positions, which in practice means playing 14-15 million self-play games; depending on your hardware, that takes weeks or months.

And 1B is peanuts; AI flourishes with more data and still more data. I offer authors a choice: increase the volume by playing another 14-15 million self-play games to reach 2B positions, or (to ease the pain) add 1B ready-to-use Leela positions. After the 1B self-play experience you may guess the choice. And mixing incompatible data from 2 engines with colliding moves and scores has its own set of problems.
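As a quick check on the figures in this post, the arithmetic below shows how many positions per game the 1B requirement implies. It only divides the numbers quoted above; how positions are actually sampled from each game is not stated in the thread.

```python
def positions_per_game(positions: int, games: int) -> float:
    """Average number of training positions each game must contribute."""
    return positions / games


TARGET_POSITIONS = 1_000_000_000  # the 1B minimum quoted above

for games in (14_000_000, 15_000_000):
    print(f"{games:,} games -> {positions_per_game(TARGET_POSITIONS, games):.1f} positions per game")
```

So the quoted range works out to roughly 67 to 71 positions per self-play game.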

Regarding strength, here is a quick round-robin match between the 4 guest engines.

No. Name             Win Draw Loss Unf.  Score Games       %
------------------------------------------------------------
  1 SP3-NNUE        +222 =208  -39   *0  326.0   469   69.5%
  2 Maverick-NNUE   +198 =225  -47   *0  310.5   470   66.1%
  3 S13-NNUE        +128 =221 -121   *0  238.5   470   50.7%
  4 Strong-Malt-1.0  +14 =100 -355   *0   64.0   469   13.6%

Total Games:     939
White Wins:      337 (35.9%)
Black Wins:      225 (24.0%)
Draws:           377 (40.1%)
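For readers who prefer Elo to percentages, the sketch below converts each score in the table above into an Elo difference with the standard logistic formula. The result is each engine's performance against this 4-engine pool only, not an absolute rating, and the helper name `elo_from_score` is made up for this example.

```python
import math


def elo_from_score(score: float, games: int) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model."""
    p = score / games
    if not 0.0 < p < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / p - 1.0)


# (score, games) copied from the table above
results = {
    "SP3-NNUE": (326.0, 469),
    "Maverick-NNUE": (310.5, 470),
    "S13-NNUE": (238.5, 470),
    "Strong-Malt-1.0": (64.0, 469),
}

for name, (score, games) in results.items():
    print(f"{name:16s} {elo_from_score(score, games):+7.1f}")
```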

Regarding similarity: https://rebel7775.wixsite.com/rebel/gue ... similarity

Steve Maughan · Post by **Steve Maughan** » Tue May 26, 2026 7:24 pm

I'm shocked that Maverick-NNUE is stronger than S13-NNUE.

My first thought is that this implies Maverick has a better positional evaluation than Shredder 13, which I highly doubt. Could the reason be that Maverick was trained on more positions (Leela positions were added)? I guess it's possible. Or could it be that S13 is much more selective than Maverick, which in normal games results in much deeper searches, while in shallow 7-ply fixed-depth games the selectivity hurts Shredder's playing strength compared to Maverick's less selective search?

Thoughts?

— Steve
