Stockfish NNUE and testsuites

Jouni · Post by **Jouni** » Wed Jul 29, 2020 4:28 pm

This Stockfish was a wonderful surprise! Lczero speed on my GPU was 10-20 positions/s, and suddenly you have an NN engine running at 5 Mn/s without a hardware update. Clearly it's better than SF dev in play, but how about testsuites? After doing a lot of tests with different nets, I think it's equal to SF dev. With the same search it can't be better? Yes, it solves some classic positions very fast, but there seem to be more easy ones that remain unsolved. One example: the Arasan test suite at 3 minutes/4 cores. SF dev got 185/200 (80 min) and NNUE 178/200 (95 min). NNUE is a good reminder that testsuites are useless: they can't even detect a 60 Elo difference.
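To put the 185 vs 178 result in perspective, here is a minimal sketch of a two-proportion z-test on the two solve counts. It assumes the two runs can be treated as independent samples, which is a simplification because both engines saw the same 200 positions; a paired test would need the per-position results, which are not given here. The function name is only illustrative.

```python
import math

def two_proportion_z(solved_a, solved_b, total):
    """z-score for the gap between two solve rates on suites of equal size."""
    p_a = solved_a / total
    p_b = solved_b / total
    # Pooled solve rate under the hypothesis that both engines are equally good.
    pooled = (solved_a + solved_b) / (2 * total)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / total)
    return (p_a - p_b) / std_err

# Arasan result from the post: SF dev 185/200, NNUE 178/200.
z = two_proportion_z(185, 178, 200)
print(f"z = {z:.2f}")  # |z| below 1.96 means not significant at the 95% level
```

Under these assumptions the gap of 7 positions gives a z-score of roughly 1.2, so this single suite cannot separate the two engines in either direction.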

Leto · Post by **Leto** » Wed Jul 29, 2020 5:07 pm

Or perhaps that's an indication that the testsuite is flawed.

Vinvin · Post by **Vinvin** » Wed Jul 29, 2020 5:46 pm

As I pointed out here: http://talkchess.com/forum3/viewtopic.p ... 14#p853414
Stockfish-NNUE is faster than all the A/B engines on 14 positions of the Hard-Talkchess-2020 set:

 85) h5-h6       : 3   seconds
 97) .. Qh5-f5   : 12  seconds
100) a5-a6       : 1   seconds
133) Bh5-f3      : 38  seconds
139) .. Ne8-c7   : 6   seconds
146) Rd1-d8      : 0   seconds
155) Bd3-g6      : 7   seconds
160) Ne4-g5      : 0   seconds
184) .. Bf5-g4   : 123 seconds
185) .. Kg8-g7   : 1   seconds
186) Rf3-f6      : 4   seconds
189) Qf3xf4      : 4   seconds
193) .. c3xb2    : 0   seconds
197) .. Qb2-c2   : 25  seconds

All positions are here : http://talkchess.com/forum3/viewtopic.p ... 35#p827135

Jouni · Post by **Jouni** » Fri Jul 31, 2020 3:41 pm

But there is one exception: TTT1 at http://dorszcz.blogspot.com/p/ttt1.html. In my test SF dev solved 34/100, but SF NNUE solved 63/100.

Jouni · Post by **Jouni** » Mon Aug 10, 2020 8:33 pm

After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless and 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?

dkappe · Post by **dkappe** » Mon Aug 10, 2020 8:51 pm

Jouni wrote: ↑Mon Aug 10, 2020 8:33 pm After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless and 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?

You should try some of the other nets: Toga III, Frosty, LizardFish, Night Nurse. They, NiNu in particular, may give you a different opinion, especially when you avoid the hybrid mode.

peter · Post by **peter** » Mon Aug 10, 2020 8:59 pm

Jouni wrote: ↑Mon Aug 10, 2020 8:33 pm After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless and 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?

How many games against which opponent pool do you need to get your 100 Elo difference, Jouni?
Or do you (as is usual at the moment) simply take selfplay Elo for granted?
Matches against SF 11 or against SF dev from before NNUE are to me somewhat like "advanced selfplay", aren't they?

You can combine several test suites if a single one doesn't give you enough statistical significance, as e.g. Ferdinand Mosca does:

http://talkchess.com/forum3/viewtopic.p ... 57#p854457

Add HTC, Eret and maybe STS too; I guess you'll still be faster at showing statistically significant differences than with ordinary rating-list Elo using standard hardware TC, openings and a mixed pool of opponents, of course including some LC0-like engines too.

The error is always in seeing a need to convert differences in test suites to "standard" Elo, whatever this term might mean nowadays. Who compares the Elo of human players to engine-engine Elo or to correspondence chess Elo?
Are you sure you can reproduce and compare your 100 Elo at TCEC? With an error bar smaller than the performance difference? Or get within a 95% confidence interval of any standard rating-list Elo with your estimated 100?

Testsuite differences are measurements of their own, just as selfplay Elo, TCEC Elo or rating-list Elo are. One thing is true for all of these measurements: playing strength is always position-dependent. Engine-engine matches are nothing but played-out testsuites too, testsuites of (opening) test positions just like other testsuites.
If you play out from very short openings (which make some sense for normal chess) or bookless from the starting position itself, the draw death of modern computer chess will kill your 100 Elo at any hardware TC on modern hardware from, let's say, 30'+5" upwards, even in advanced selfplay, so what?
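The question about error bars can be made concrete with a minimal sketch that turns a match result into an Elo difference with a 95% interval. It assumes the usual logistic Elo model and a normal approximation for the score; the win/draw/loss counts in the example are invented for illustration and are not results from this thread.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400 * math.log10(1 / score - 1)

def elo_interval(wins, draws, losses, z=1.96):
    """Return (low, estimate, high) Elo for a match, using a normal approximation."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0 points) around the mean score.
    variance = (wins * (1 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0 - score) ** 2) / games
    margin = z * math.sqrt(variance / games)
    return (elo_from_score(score - margin),
            elo_from_score(score),
            elo_from_score(score + margin))

# Hypothetical 100-game match: 40 wins, 50 draws, 10 losses.
low, estimate, high = elo_interval(40, 50, 10)
print(f"{estimate:+.0f} Elo, 95% interval {low:+.0f} to {high:+.0f}")
```

With only 100 games the interval spans several dozen Elo on either side of the estimate, so a claimed 100 Elo gain needs far more games, or a much higher draw-free score, before the error bar is smaller than the difference itself.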

Dann Corbit · Post by **Dann Corbit** » Tue Aug 11, 2020 2:11 am

Jouni wrote: ↑Mon Aug 10, 2020 8:33 pm After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless and 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?

For most test suites, the new SF will play exactly like the old SF.
That is because as soon as the score becomes unbalanced (usually right off the bat for a test suite), the eval used is SF's classical handcrafted eval rather than the net.
I guess if you run the special NNUE-only SF builds, they will clobber most of the test suites because they use NNUE all the time.
https://github.com/joergoster/Stockfish-NNUE
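The switching described above can be sketched roughly as follows. This is only an illustration of the idea, not Stockfish's actual code: the threshold value, the function names and the use of a cheap first score estimate are all assumptions made for the example.

```python
# Hypothetical cutoff in centipawns; the real value is not given in this thread.
IMBALANCE_THRESHOLD = 500

def hybrid_eval(rough_score, nnue_eval, classical_eval):
    """Pick the evaluator based on how lopsided a cheap first estimate is.

    rough_score    -- quick estimate of the position in centipawns (assumed input)
    nnue_eval      -- callable returning the network evaluation
    classical_eval -- callable returning the handcrafted evaluation
    """
    if abs(rough_score) < IMBALANCE_THRESHOLD:
        return nnue_eval()      # balanced position: use the net
    return classical_eval()     # lopsided position: fall back to handcrafted eval

# Stand-in evaluators that just return fixed numbers.
balanced = hybrid_eval(30, lambda: 42, lambda: 35)
lopsided = hybrid_eval(900, lambda: 880, lambda: 910)
print(balanced, lopsided)
```

If most test-suite positions are already lopsided, a scheme like this sends nearly all of them down the handcrafted branch, which would explain why the hybrid build scores about the same as the old SF.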

RogerC · Post by **RogerC** » Tue Aug 11, 2020 2:57 am

Hi,

5 positions and best moves for testing NNUE nets :

q7/4P3/8/6pk/1Q1Bn1b1/8/2r3PK/3R4 w - - 0 1 bm Qb7
2q1k3/1Npp2K1/1pP2P2/3Pp3/8/8/3P1P2/8 w - - 0 1 bm Nd6+
rn1qrnk1/p4pp1/1p1pp3/6P1/2Pp1PN1/2PQ4/P5P1/2KR3R w - - 0 1 bm Nh6+
4k3/4Pp2/1P1p1P1P/pPpPpK2/pr2pbP1/7r/3RP3/NN5b w - - 0 1 bm Rb2
7q/6pk/1R6/5K1p/3B2Pp/7P/3P4/8 w - - 0 1 bm Rh6+
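For anyone who wants to feed these to a script, here is a minimal sketch that splits each line into its FEN and best move and checks an engine's answer against it. It assumes the format used above (a full FEN followed by "bm" and the move in SAN); the helper names are made up for the example, and actually running an engine on the positions is left out.

```python
# The five positions exactly as posted: a full FEN followed by "bm <move>".
SUITE = """\
q7/4P3/8/6pk/1Q1Bn1b1/8/2r3PK/3R4 w - - 0 1 bm Qb7
2q1k3/1Npp2K1/1pP2P2/3Pp3/8/8/3P1P2/8 w - - 0 1 bm Nd6+
rn1qrnk1/p4pp1/1p1pp3/6P1/2Pp1PN1/2PQ4/P5P1/2KR3R w - - 0 1 bm Nh6+
4k3/4Pp2/1P1p1P1P/pPpPpK2/pr2pbP1/7r/3RP3/NN5b w - - 0 1 bm Rb2
7q/6pk/1R6/5K1p/3B2Pp/7P/3P4/8 w - - 0 1 bm Rh6+
"""

def parse_line(line):
    """Split one posted line into (fen, list_of_best_moves)."""
    fen, _, moves = line.partition(" bm ")
    # Tolerate EPD-style trailing semicolons and several alternative moves.
    return fen.strip(), moves.replace(";", " ").split()

def is_solved(engine_move, best_moves):
    """Compare SAN moves, ignoring check/mate suffixes such as '+' and '#'."""
    def clean(move):
        return move.rstrip("+#")
    return clean(engine_move) in {clean(m) for m in best_moves}

positions = [parse_line(line) for line in SUITE.splitlines() if line.strip()]
for fen, best in positions:
    print(best, fen)
```

The engine's move has to be converted to SAN before calling `is_solved`, since UCI engines report moves in coordinate form such as b4b7.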

mwyoung · Post by **mwyoung** » Tue Aug 11, 2020 6:09 am

Jouni wrote: ↑Mon Aug 10, 2020 8:33 pm SF NNUE is about 100 ELO better than SF11, but this is not visible in any "standard" testsuite I use!

You need to test SF NNUE in real games. It is not even close to 100 Elo better, unless you play on 1 core at ultra-fast time controls. SF NNUE does not scale "very well".
