Stockfish NNUE and testsuites

Jouni
Posts: 3283
Joined: Wed Mar 08, 2006 8:15 pm

Stockfish NNUE and testsuites

Post by Jouni »

This Stockfish was a wonderful surprise! Lczero speed on my GPU was 10-20 positions/s, and suddenly you have an NN engine running at 5 Mn/s without a hardware upgrade :D . It's clearly better than SF dev in play, but how about testsuites? After doing a lot of tests with different nets, I think it's equal to SF dev. With the same search it can't be better, can it? Yes, it solves some classic positions very fast, but there seem to be more easy ones that remain unsolved. One example: the Arasan test suite with 3 minutes on 4 cores. SF dev got 185/200 (80 min) and NNUE 178/200 (95 min). NNUE is a good reminder that testsuites are useless: they can't even detect a 60 Elo difference.
Jouni
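To put the Arasan numbers above in perspective, here is a rough sketch of a two-proportion z-test on 185/200 versus 178/200. It assumes the two runs are independent samples, which is not strictly true, since both engines saw the same 200 positions (a paired test on per-position results would be better, but those are not given in the post). The function name is made up for illustration.

```python
import math

def two_proportion_z(solved_a, solved_b, n):
    """Two-sided z-test for two solve counts out of the same number of positions.

    Assumption: the two runs are treated as independent samples.
    """
    p_a, p_b = solved_a / n, solved_b / n
    pooled = (solved_a + solved_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p = two_proportion_z(185, 178, 200)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions the 7-position gap gives a z of roughly 1.2, short of the usual 1.96 threshold, so on its own it says little about which engine is stronger.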
Leto
Posts: 2071
Joined: Thu May 04, 2006 3:40 am
Location: Dune

Re: Stockfish NNUE and testsuites

Post by Leto »

Or perhaps that's an indication that the testsuite is flawed.
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: Stockfish NNUE and testsuites

Post by Vinvin »

As I pointed out here: http://talkchess.com/forum3/viewtopic.p ... 14#p853414
Stockfish-NNUE is faster than all the A/B engines on 14 positions of the Hard-Talkchess-2020 set:

 85) h5-h6       : 3   seconds
 97) .. Qh5-f5   : 12  seconds
100) a5-a6       : 1   seconds
133) Bh5-f3      : 38  seconds
139) .. Ne8-c7   : 6   seconds
146) Rd1-d8      : 0   seconds
155) Bd3-g6      : 7   seconds
160) Ne4-g5      : 0   seconds
184) .. Bf5-g4   : 123 seconds
185) .. Kg8-g7   : 1   seconds
186) Rf3-f6      : 4   seconds
189) Qf3xf4      : 4   seconds
193) .. c3xb2    : 0   seconds
197) .. Qb2-c2   : 25  seconds 
All positions are here: http://talkchess.com/forum3/viewtopic.p ... 35#p827135
Jouni
Posts: 3283
Joined: Wed Mar 08, 2006 8:15 pm

Re: Stockfish NNUE and testsuites

Post by Jouni »

But there is one exception: TTT1 at http://dorszcz.blogspot.com/p/ttt1.html. In my test SF dev solved 34/100, but SF NNUE solved 63/100 :!: :) .
Jouni
Jouni
Posts: 3283
Joined: Wed Mar 08, 2006 8:15 pm

Re: Stockfish NNUE and testsuites

Post by Jouni »

After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless, or 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?
Jouni
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: Stockfish NNUE and testsuites

Post by dkappe »

Jouni wrote: Mon Aug 10, 2020 8:33 pm After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless, or 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?
You should try some of the other nets: Toga III, Frosty, LizardFish, Night Nurse. They, NiNu in particular, may give you a different opinion, especially when avoiding the hybrid mode.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
peter
Posts: 3186
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: Stockfish NNUE and testsuites

Post by peter »

Jouni wrote: Mon Aug 10, 2020 8:33 pm After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless, or 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?
How many games, against which pool of opponents, do you need to get your 100 Elo difference, Jouni?
Or do you (as is the custom at the moment) simply take self-play Elo for granted?
To me, matches against SF 11 or against SF dev from before NNUE are somewhat like "advanced self-play", aren't they?

You can combine several test suites if a single one doesn't give you enough statistical significance, as e.g. Ferdinand Mosca does:

http://talkchess.com/forum3/viewtopic.p ... 57#p854457

Add HTC, Eret and maybe STS too. I guess you'll still be faster at showing statistically significant differences than with ordinary rating-list Elo using standard hardware TC, openings and a mixed pool of opponents, of course including some LC0-like engines too.

The error is always in seeing a need to convert differences in test suites to "standard" Elo, whatever this term might mean nowadays. Who compares the Elo of human players to engine-engine Elo or to correspondence-chess Elo?
Are you sure you can reproduce and compare your 100 Elo at TCEC? With an error bar smaller than the performance difference? Or land within a 95% confidence interval of any standard rating-list Elo with your estimated 100?
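As a rough illustration of the error-bar question, the sketch below converts a match result into an Elo difference with an approximate 95% interval. It uses the usual logistic Elo formula and a normal approximation on the per-game score; the win/draw/loss counts are hypothetical, chosen only so that both matches have the same 64% score.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_ci(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result, using a normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

# Hypothetical results with the same 64% score, but different match lengths
for w, d, l in [(39, 50, 11), (390, 500, 110)]:
    elo, lo, hi = elo_with_ci(w, d, l)
    print(f"{w + d + l:5d} games: {elo:+.0f} Elo, 95% interval [{lo:+.0f}, {hi:+.0f}]")
```

The interval shrinks roughly with the square root of the number of games, so a claimed 100 Elo only means something once the match is long enough for the error bar to be clearly smaller than the difference.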

Test suite differences are measurements of their own, just as self-play Elo, TCEC Elo or rating-list Elo are. One thing is true for all of these measurements: playing strength is always position-dependent. Engine-engine matches are nothing but played-out test suites either: test suites of (opening) test positions, just like other test suites.
If you play out games from very short openings (of normal chess, for some reason) or bookless from the starting position itself, the draw death of modern computer chess will kill your 100 Elo at any hardware TC on modern hardware from, let's say, 30'+5" upwards, even in advanced self-play, so what?
Peter.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish NNUE and testsuites

Post by Dann Corbit »

Jouni wrote: Mon Aug 10, 2020 8:33 pm After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) testsuites are mostly useless, or 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?
For most test suites, the new SF will play exactly like the old SF.
That is because as soon as the score becomes unbalanced (usually right off the bat for a test suite), the eval used is SF's classical handcrafted eval.
I guess if you run the special NNUE-only SF builds, they will clobber most of the test suites because they use NNUE all the time.
https://github.com/joergoster/Stockfish-NNUE
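The switch described above could be sketched like this. It is only an illustration of the idea, not Stockfish's actual code: the function names, the centipawn units and the threshold of 500 are all assumptions made for the example.

```python
# Hypothetical threshold (in centipawns) above which a position counts as unbalanced
IMBALANCE_THRESHOLD = 500

def hybrid_eval(balance, nnue_eval, classical_eval):
    """Pick an evaluation based on how lopsided the position is.

    balance: imbalance measure in centipawns (assumed input)
    nnue_eval, classical_eval: zero-argument callables returning a score
    """
    if abs(balance) < IMBALANCE_THRESHOLD:
        return "nnue", nnue_eval()
    return "classical", classical_eval()

# Toy stand-ins for the two evaluation functions
print(hybrid_eval(0, lambda: 35, lambda: 20))
print(hybrid_eval(900, lambda: 870, lambda: 910))
```

With a scheme like this, a test position that is already heavily unbalanced never reaches the net at all, which would explain near-identical suite results for the hybrid build.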
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
RogerC
Posts: 41
Joined: Tue Oct 29, 2019 8:33 pm
Location: French Polynesia
Full name: Roger C.

Re: Stockfish NNUE and testsuites

Post by RogerC »

Hi,

5 positions and best moves for testing NNUE nets :

q7/4P3/8/6pk/1Q1Bn1b1/8/2r3PK/3R4 w - - 0 1 bm Qb7
2q1k3/1Npp2K1/1pP2P2/3Pp3/8/8/3P1P2/8 w - - 0 1 bm Nd6+
rn1qrnk1/p4pp1/1p1pp3/6P1/2Pp1PN1/2PQ4/P5P1/2KR3R w - - 0 1 bm Nh6+
4k3/4Pp2/1P1p1P1P/pPpPpK2/pr2pbP1/7r/3RP3/NN5b w - - 0 1 bm Rb2
7q/6pk/1R6/5K1p/3B2Pp/7P/3P4/8 w - - 0 1 bm Rh6+
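For anyone who wants to feed these to a script, here is a minimal sketch that splits lines in the format above into the FEN and the expected best move. It assumes every line has the shape "FEN bm move" as posted; the function name is made up for illustration.

```python
def parse_position(line):
    """Split a line of the form '<FEN> bm <move>' into (fen, best_move)."""
    fen, sep, move = line.partition(" bm ")
    if not sep:
        raise ValueError(f"no 'bm' field in: {line!r}")
    # Tolerate a trailing ';' as used in EPD files
    return fen.strip(), move.strip().rstrip(";")

lines = [
    "q7/4P3/8/6pk/1Q1Bn1b1/8/2r3PK/3R4 w - - 0 1 bm Qb7",
    "7q/6pk/1R6/5K1p/3B2Pp/7P/3P4/8 w - - 0 1 bm Rh6+",
]
for fen, best in map(parse_position, lines):
    print(f"{best:6s} {fen}")
```

The FEN part can then be sent to an engine with the UCI command "position fen ..." and the engine's reply compared against the best move.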
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Stockfish NNUE and testsuites

Post by mwyoung »

Jouni wrote: Mon Aug 10, 2020 8:33 pm SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" testsuite I use!
:lol:

You need to test SF NNUE in real games. It is not even close to 100 Elo better, unless you play on 1 core at ultra-fast time controls. SF NNUE does not scale "very well".
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.