Jouni wrote: ↑Mon Aug 10, 2020 8:33 pm
After testing a lot of different nets, my conclusion: SF NNUE is about 100 Elo better than SF11, but this is not visible in any "standard" test suite I use! In some suites, like Arasan, SF NNUE is worse than handcrafted SF. Two possible reasons: 1) test suites are mostly useless, and 2) SF NNUE is satisfied with finding a winning move even if it's not the best move?
How many games, against which pool of opponents, do you need to get your 100 Elo difference, Jouni?
Or do you (as is common practice at the moment) simply take self-play Elo for granted?
Matches against SF 11 or against SF dev from before NNUE are, to me, somewhat like "advanced self-play", aren't they?
You can combine several test suites if a single one doesn't give you enough statistical significance, as e.g. Ferdinand Mosca does:
http://talkchess.com/forum3/viewtopic.p ... 57#p854457
Add HTC, Eret and maybe STS too. I guess you'll still reach statistically significant differences faster than with ordinary rating-list Elo, which needs standard hardware TC, openings and a mixed pool of opponents, of course including some LC0-like engines too.
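To make "statistically significant" on pooled test suites a bit more concrete, here is a minimal sketch in Python. It assumes both engines ran the same positions, so the results are paired, and it uses a McNemar-style z statistic on the positions where only one engine found the solution. The function name `paired_suite_test` and the input format (one solved/unsolved flag per position, suites simply concatenated) are my assumptions for illustration, not anything a particular tool outputs.

```python
import math


def paired_suite_test(solved_a, solved_b):
    """Compare two engines on the same pooled list of test positions.

    solved_a, solved_b: sequences of bools, one entry per position,
    in the same order (hypothetical input format).
    Returns (only_a, only_b, z), where z is a McNemar-style statistic:
    positions solved by both or by neither carry no information.
    """
    if len(solved_a) != len(solved_b):
        raise ValueError("both engines must be run on the same positions")

    only_a = sum(1 for a, b in zip(solved_a, solved_b) if a and not b)
    only_b = sum(1 for a, b in zip(solved_a, solved_b) if b and not a)
    discordant = only_a + only_b

    if discordant == 0:
        return only_a, only_b, 0.0

    # Under "no difference", each discordant position is a fair coin flip.
    z = (only_a - only_b) / math.sqrt(discordant)
    return only_a, only_b, z
```

As a rule of thumb, |z| above roughly 1.96 corresponds to the usual 95% level. With few discordant positions the normal approximation is rough and an exact binomial test would be the safer choice.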
The error is always in seeing a need to convert differences in test suites into "standard" Elo, whatever this term might mean nowadays. Who compares the Elo of human players to engine-vs-engine Elo or to correspondence chess Elo?
Are you sure you can reproduce and compare your 100 Elo at TCEC? With an error bar smaller than the performance difference? Or get within a 95% confidence interval of any standard rating-list Elo with your estimated 100?
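For the error bar question, a rough sketch of how an Elo difference and its 95% interval are usually derived from a match result. It assumes the standard logistic Elo formula and a normal approximation on the per-game score; the function names are mine, and the clamping constant is just a guard against taking the logarithm of zero.

```python
import math


def score_to_elo(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_interval(wins, draws, losses, z=1.96):
    """Return (elo, elo_low, elo_high) for a win/draw/loss match result.

    Uses the per-game score variance of the win/draw/loss distribution
    and a normal approximation (z=1.96 for roughly 95%).
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")

    score = (wins + 0.5 * draws) / games
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    std_error = math.sqrt(variance / games)

    # Clamp so that the Elo conversion stays defined at the extremes.
    eps = 1e-9
    clamp = lambda s: min(max(s, eps), 1.0 - eps)

    return (
        score_to_elo(clamp(score)),
        score_to_elo(clamp(score - z * std_error)),
        score_to_elo(clamp(score + z * std_error)),
    )
```

For example, `elo_with_interval(400, 400, 200)` gives about +70 Elo with an interval of roughly +54 to +87. Note that this interval only covers sampling noise within that one match; it says nothing about how the number transfers to another opponent pool, hardware or TC.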
Test suite differences are measurements of their own, just as self-play Elo, TCEC Elo or rating-list Elo are. One thing is true for all of these measurements: playing strength is always position-dependent. Engine-vs-engine matches are nothing but played-out test suites either: suites of (opening) test positions, just like other test suites.
If you let the engines play out from very short openings (for some reason, for the sake of normal chess) or from the starting position itself without a book, the draw death of modern computer chess will kill your 100 Elo at any hardware TC on modern hardware from, let's say, 30'+5" upwards, even in advanced self-play. So what?
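The arithmetic behind the draw-death point can be sketched as follows. If a fraction of the games is drawn, even winning every decisive game caps the score, and with it the measurable Elo difference. This assumes the usual logistic Elo formula; the function names are hypothetical and the draw rates are illustrative inputs, not measured values.

```python
import math


def max_elo_at_draw_rate(draw_rate):
    """Largest Elo difference a match can show if `draw_rate` of the
    games are drawn and the stronger side wins all the others."""
    if not 0.0 <= draw_rate <= 1.0:
        raise ValueError("draw_rate must be between 0 and 1")
    if draw_rate == 0.0:
        return math.inf  # a 100% score has no finite Elo equivalent

    best_score = 1.0 - draw_rate / 2.0
    return -400.0 * math.log10(1.0 / best_score - 1.0)


def max_draw_rate_for_elo(elo):
    """Highest draw rate at which a positive difference of `elo`
    is still arithmetically reachable."""
    needed_score = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    return 2.0 * (1.0 - needed_score)
```

With 90% draws, `max_elo_at_draw_rate(0.9)` is about 35 Elo, and `max_draw_rate_for_elo(100)` is about 0.72: once more than roughly 72% of the games are drawn, a 100 Elo gap cannot show up in the result at all, no matter how the decisive games go.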