SPCC: Testrun of Lc0 J92260 finished
NN-testrun of Lc0 0.26.3 J92260 finished: a new impressive high score!
https://www.spcc.de
(You may have to clear your browser cache or reload the website.)
Re: SPCC: Testrun of Lc0 J92260 finished
The nets trained by James Horsfall Thomas (jhorthos) are really amazing.
The number of training games for the 320 x 24 net is ~130,000,000. That is roughly 3x the number of training games used for AlphaZero.
Re: SPCC: Testrun of Lc0 J92260 finished
That is quite a leap, +34 Elo better than the previous top score.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: SPCC: Testrun of Lc0 J92260 finished
Dann Corbit wrote: ↑Sun Oct 18, 2020 2:41 am
"That is quite a leap, +34 Elo better than the previous top score"
I found it surprising, but I think that the "advanced scoring Armageddon" ("ASA") scheme used here can introduce a very large bias compared to standard Elo. From S. Pohl's website (https://www.spcc.de/nbscarmageddonopenings.htm):
"My Advanced Armageddon scoring is:
Win for white = 1 point for white
Draw = 1 point for black
Win for black = 2 points for black (!!!)"
When White loses a game, it is recorded as two losses, on top of the standard Armageddon rule "a draw with white is a loss". This is getting far from the regular chess scoring scheme. For example, assume that Engine A and Engine B each play a reference engine, Engine C (like Stockfish in this case, with Elo = 0):
Engine A vs Engine C:
Engine A score as white: +1, -0, =4 -> 1/5
Engine A score as black: +2, -2, =1 -> 5/7 (the draw as black counts as a win, and the two wins as black count as four wins, so 2 extra games)
"Advanced scoring Armageddon" total -> 6/12 (Elo = 0).
Standard chess total (+3, -2, =5) -> 5.5/10 (about +35 Elo).
There is a downward bias of about 35 Elo here.
Engine B vs Engine C:
Engine B score as white: +2, -2, =1 -> 2/7 (two extra games, because losses as white count twice)
Engine B score as black: +0, -0, =5 -> 5/5
"Advanced scoring Armageddon" total -> 7/12 (about +58 Elo).
Standard chess total (+2, -2, =6) -> 5/10 (Elo = 0).
There is an upward bias of about 58 Elo here.
The gap between the two rating systems is over 90 Elo in this example, and Engine B is better than Engine A with ASA, whereas in standard chess Engine A is superior. The bias can be large and unsystematic. This might explain the unexpected leap in Elo.
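The arithmetic of the Engine A / Engine B example can be checked with a short script. This is only a sketch: the function names are made up for illustration, and the Elo conversion assumes the usual logistic formula, so the exact figures may differ slightly from what a rating tool such as ORDO reports.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction (0 < score < 1), logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

def classical(white, black):
    """white and black are (wins, losses, draws) tuples for the engine under test."""
    ww, wl, wd = white
    bw, bl, bd = black
    points = ww + bw + 0.5 * (wd + bd)
    games = ww + wl + wd + bw + bl + bd
    return points, games

def advanced_armageddon(white, black):
    """Return (points, counted games) under the ASA rules quoted in the post."""
    ww, wl, wd = white
    bw, bl, bd = black
    own = ww + bd + 2 * bw        # win as White = 1, draw as Black = 1, win as Black = 2
    opponent = bl + wd + 2 * wl   # the same rules seen from the opponent's side
    return own, own + opponent

# (wins, losses, draws) as White and as Black, taken from the example in the post
ENGINES = {
    "A": ((1, 0, 4), (2, 2, 1)),
    "B": ((2, 2, 1), (0, 0, 5)),
}

for name, (white, black) in ENGINES.items():
    for label, (points, games) in (("classical", classical(white, black)),
                                   ("ASA", advanced_armageddon(white, black))):
        elo = elo_from_score(points / games)
        print(f"Engine {name} {label}: {points}/{games} -> {elo:+.0f} Elo")
```

Note that `elo_from_score` is undefined for a 0 % or 100 % score, which is fine for this example but would need handling in a real tool.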
As for the J92 nets, they seem to be generally improving. The author tests them at fast time control (https://github.com/jhorthos/lczerotrai ... aTraining ), with thousands of games each. The nets are only a few Elo apart from each other:
Code: Select all
1 lc0.net.J92270 : 20.4 5.1 8000 53 2132 1678 4190 52 59
2 lc0.net.J92190 : 19.5 5.3 8000 53 2069 1635 4296 54 83
3 lc0.net.J92210 : 15.8 5.4 8000 52 2007 1655 4338 54 67
4 lc0.net.J92B205 : 14.2 5.3 8000 52 2026 1710 4264 53 51
5 lc0.net.J92240 : 14.1 5.2 8000 52 1984 1670 4346 54 55
6 lc0.net.J92180 : 13.6 5.2 8000 52 1990 1686 4324 54 56
7 lc0.net.J92220 : 13.1 5.3 8000 52 2031 1739 4230 53 66
8 lc0.net.J92145 : 11.5 5.4 8000 52 1963 1706 4331 54 72
9 lc0.net.J92160 : 9.2 5.3 8000 51 1995 1789 4216 53 58
10 lc0.net.J92130 : 8.5 5.3 8000 51 1987 1798 4215 53 60
Re: SPCC: Testrun of Lc0 J92260 finished
MMarco wrote: ↑Fri Oct 23, 2020 9:05 am
"[...] The bias can be important and unsystematic. This might explain the unexpected leap in elo."
That is false. Advanced Armageddon Scoring makes the Elo spread between results wider. With classical scoring, a lot of the results of my Lc0 testing would be so close that it is purely random which net comes out better or a little bit worse.
Armageddon scoring could lead to distortions, because there are only 2 scores for 3 possible game outcomes (a draw and a win for black are both a point for black). But my new Advanced Armageddon Scoring has 3 scores for 3 possible game outcomes, like the classical scoring system. That is the main progress and advantage of my scoring system and the reason why there are no distortions, only a spreading of results!
Here is what I mean:
Classical scoring of my NN games (first 10 places in the list):
Code: Select all
   Program                          Elo   +   -  Games   Score  Av.Op.   Draws
 1 Lc0 0.26.3 J92260 (30x384)    : 3631  22  22    500  58.9 %    3568  42.2 %
 2 Lc0 0.26.2 J92130 (30x384)    : 3617  22  22    500  57.0 %    3568  45.2 %
 3 Lc0 0.26.3 65536 (24x320)     : 3610  21  21    500  56.0 %    3568  41.2 %
 4 Lc0 0.26.3 65411 (24x320)     : 3610  22  22    500  56.0 %    3568  44.4 %
 5 Lc0 0.24.1 LS 14.3 (20x256)   : 3609  23  23    500  55.8 %    3568  44.4 %
 6 Lc0 0.25.1 LS 15 (20x256)     : 3608  22  22    500  55.7 %    3568  45.0 %
 7 Lc0 0.24.1 LS 14.2 (20x256)   : 3607  21  21    500  55.6 %    3568  43.6 %
 8 Lc0 0.26.2 T60B.7105 (24x320) : 3606  21  21    500  55.4 %    3568  47.6 %
 9 Lc0 0.26.2 J92160 (30x384)    : 3603  22  22    500  55.0 %    3568  45.6 %
10 Lc0 0.26.1 t604619 (30x384)   : 3600  22  22    500  54.6 %    3568  45.2 %
And here the Advanced Armageddon Scoring:
Code: Select all
 1 Lc0 0.26.3 J92260 (30x384)    : 3689  515 (+343,=  0,-172), 66.6 %
 2 Lc0 0.26.2 J92130 (30x384)    : 3655  521 (+324,=  0,-197), 62.2 %
 3 Lc0 0.26.3 65536 (24x320)     : 3648  514 (+315,=  0,-199), 61.3 %
 4 Lc0 0.24.1 LS 14.3 (20x256)   : 3644  513 (+311,=  0,-202), 60.6 %
 5 Lc0 0.25.1 LS 15 (20x256)     : 3643  512 (+310,=  0,-202), 60.5 %
 6 Lc0 0.26.3 65411 (24x320)     : 3641  519 (+313,=  0,-206), 60.3 %
 7 Lc0 0.26.2 J92160 (30x384)    : 3635  511 (+304,=  0,-207), 59.5 %
 8 Lc0 0.26.2 T60B.7105 (24x320) : 3634  519 (+308,=  0,-211), 59.3 %
 9 Lc0 0.24.1 LS 14.2 (20x256)   : 3633  520 (+308,=  0,-212), 59.2 %
10 Lc0 0.25.1 LS 15 Kayra4       : 3624  513 (+297,=  0,-216), 57.9 %
As you can see, in the classical scoring, J92260 is clearly the best, too. But from place 3 to 10, all results are within a 10 Elo interval, which is so small that the placing of the nets in that list is purely random unless you play some thousands of games with each net. With Advanced Armageddon Scoring, the Elo interval from place 3 to 10 is 24 Elo, which is 2.4x wider (around 2.25x was what I predicted!). And what that means can be read on my website:
"But note that the usage of my NBSC Armageddon openings spreads the Elo results around 2.25x wider than using classical openings for testing(!), so with classical openings, you would need an error bar of +/- 9 Elo for the same statistical quality of the results (= the rankings of Lc0 nets here). And for an error bar of +/- 9 Elo, you need around 3000 games, not 500, which means 6x more games (and 6x more PC time)!!"
Look at the scores of the 2 new T60 nets 65536 and 65411: with classical scoring, the learning progress disappears. Both have 3610 Elo. With Advanced Armageddon Scoring, 65536 is +7 Elo better. Not much, but progress.
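The "around 3000 games, 6x more" figure in the website quote can be reproduced with a rough rule of thumb. The sketch below assumes that error bars shrink in proportion to 1/sqrt(number of games), and takes +/- 22 Elo at 500 games from the classical list; the function name is hypothetical. It only checks the arithmetic of the quote, not whether error bars under the two scoring schemes are actually comparable.

```python
import math

def games_needed(games_now, error_now, error_target):
    """Games required to shrink an error bar, assuming error ~ 1/sqrt(games)."""
    return math.ceil(games_now * (error_now / error_target) ** 2)

# +/- 22 Elo after 500 games, target +/- 9 Elo
needed = games_needed(500, 22, 9)
print(needed, "games, i.e. about", round(needed / 500, 1), "times as many")
```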
What else can I say?
Re: SPCC: Testrun of Lc0 J92260 finished
pohl4711 wrote: ↑Fri Oct 23, 2020 9:51 am
"[...] As you can see, in the classical scoring, J92260 is clearly the best, too. But from place 3 to 10, all results are in a 10 Elo interval, which is so small that the place of the nets in that list is pure random, if you don't play some thousand games with each net. With Advanced Armageddon Scoring, the Elo interval from place 3 to 10 is 24 Elo. Which is 2.4x wider (around 2.25x was what I predicted!). [...]"
"That is false"
LOL. Can you explain what's wrong with what I wrote?
"so with classical openings, you would need an errorbar of +/- 9 Elo for the same statistical quality of the results (= the rankings of Lc0 nets here)."
Your "Armageddon Advanced scoring" is just introducing noise. The lists you showed just proved it. The Armageddon list doesn't preserve the ranking of the standard Elo list: 65411 goes down from #4 to #6, LS 14.3 and LS 15 go up a rank while LS 14.2 loses two ranks, t604619 is kicked out of the top ten, etc. This is what an unsystematic bias is: sometimes it works in favor of an engine, sometimes against it (or against another engine). This extra noise is added on top of the random fluctuation we usually get from the sample results.
Your list is interesting anyway, and I just saw that I can download the "non-armageddonized" database if I want. Thanks for running the games and making them available. I'll look into them in the future. I do watch your list quite a bit, but I prefer to have a standard chess scoring and rating scheme.
The "advanced scoring scheme" does not give better statistical quality with fewer games; it gives worse statistical quality with the same number of games!
Think about it. You have a basic set of information in your data (the standard chess results), then you rescore it according to some fancy rules (Advanced Armageddon) that preserve neither the rankings nor the number of games, and then, as if by magic, you would get better statistical estimates??!! It doesn't work like that... I can agree that it seems to stretch the ratings (it did in my numerical example above), but it also distorts the list. Apart from the wrong rankings, the error bars associated with them have now also become untrustworthy, and the same goes for the LOS.
Re: SPCC: Testrun of Lc0 J92260 finished
MMarco wrote: ↑Fri Oct 23, 2020 10:39 am
"Your 'Armageddon Advanced scoring' is just introducing noise. The lists you shown just proved it. The Armageddon list doesn't preserve the ranking of the standard elo list. ie 65411 goes down from #4 to #6, LS 14.3 and 15 go up a rank while LS 14.2 loses two ranks. T604619 is kicked out of the top ten etc. This is what an unsystematic bias is: sometimes in favor of an engine, sometimes it goes against it (or against another engine). This extra noise is added on top of the random fluctuation we usually get from the sample results."
The differences in ranking between the two performance lists come from the fact that 7 engines/nets share a 10 Elo interval of strength with classical scoring. So, with 500 games played, the ranking from place 3 to 10 is purely random (these results are the "noise") with classical scoring. With Advanced Armageddon Scoring it is not, because the Elo interval is 2.4x bigger. Is it really a surprise that the ranking is different when one list has a random ranking and the other does not?!? And this is, by the way, the reason for me to use NBSC openings and Advanced Armageddon Scoring for the testruns of lc0 and its nets. The nets are often so close in strength that it is impossible to get a valid ranking beyond random when playing only 500 games. And more than 500 games are difficult to play, because with lc0 it is not possible to play games simultaneously on one machine. I need 2.5 days for one 500-game testrun of lc0 vs. Stockfish.
And I will definitely not use classical openings and classical scoring, because I will not waste my PC time getting random rankings (or measuring noise).
Re: SPCC: Testrun of Lc0 J92260 finished
Anybody who wants to should look at the test results of my NBSC openings with Advanced Armageddon Scoring compared to classical openings and scoring:
https://www.spcc.de/nbscarmageddonopenings.htm
Just 2 examples from there: classical Stockfish openings from Fishtest compared with my Noomen 3-move NBSC openings using Advanced Armageddon Scoring (these openings are used for my lc0 testing).
Testing conditions:
2'+1'', single thread, i7-8750H hexacore mobile CPU, 256 MB hash, cutechess-cli (no TB for engines, but 5-piece Syzygy for cutechess), Contempt=0 for all Stockfish versions. All openings replayed with reversed colors. Round robin with 1500 games with SF 11, SF 10, SF 9 and SF 8. Each SF played 250 games vs. each of the 3 opponents = 1500 games per testrun. ORDO for the ratings (3400 Elo base value).
Code: Select all
Stockfish Framework 8moves v3:
   Program              Elo   +   -  Games   Score  Av.Op.   Draws
 1 Stockfish 11 bmi2 : 3466  14  14    750  62.1 %    3378  68.7 %
 2 Stockfish 10 bmi2 : 3431  13  13    750  55.8 %    3390  75.6 %
 3 Stockfish 9 bmi2  : 3383  14  14    750  46.7 %    3406  71.3 %
 4 Stockfish 8 bmi2  : 3320  14  14    750  35.4 %    3427  63.9 %
Draw rate : 69.9 % (smaller is better)
Elo spreading (first to last): 146 Elo (bigger is better)
Code: Select all
NBSC Advanced Armageddon Noomen 3moves:
   Program              Elo   +   -  Games   Score  Av.Op.   Draws
 1 Stockfish 11 bmi2 : 3543  15  15    803  74.0 %    3347   0.0 %
 2 Stockfish 10 bmi2 : 3468  14  14    785  61.8 %    3374   0.0 %
 3 Stockfish 9 bmi2  : 3359  14  14    794  42.8 %    3414   0.0 %
 4 Stockfish 8 bmi2  : 3229  16  16    816  22.1 %    3459   0.0 %
Draw rate : 0 % (smaller is better)
Elo spreading (first to last): 314 Elo (bigger is better)
White Score : 46.5 %
Number of wins for Black (= 2 points for Black in advanced scoring): 99
And it is proof that saying "The advanced scoring scheme do not give better statistical quality with fewer games, it gives worse statistical quality with the same number of games!" is obviously wrong!
Interesting additional facts are the comparisons between the individual engines:
Distance SF 11 to 10: 35 Elo becomes 75 Elo = 2.14x bigger
Distance SF 10 to 9: 48 Elo becomes 109 Elo = 2.27x bigger
Distance SF 9 to 8: 63 Elo becomes 130 Elo = 2.06x bigger
(Overall distance SF 11 to 8: 146 Elo becomes 314 Elo = 2.15x bigger)
So the rating distances are all scaled by very nearly the same "zooming factor" that my Advanced Armageddon Scoring adds. The results are spread enormously, but are definitely not distorted.
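The zooming factors listed above can be recomputed directly from the Elo columns of the two Stockfish tables. A minimal sketch, with the dictionary and function names chosen here purely for illustration:

```python
# Elo values copied from the two Stockfish tables in this post
CLASSICAL = {"SF 11": 3466, "SF 10": 3431, "SF 9": 3383, "SF 8": 3320}
ASA = {"SF 11": 3543, "SF 10": 3468, "SF 9": 3359, "SF 8": 3229}

def zoom_factors(base, spread):
    """Ratio of neighbouring rating gaps: spread list divided by base list."""
    names = list(base)  # dicts keep insertion order, strongest engine first
    factors = {}
    for upper, lower in zip(names, names[1:]):
        factors[(upper, lower)] = (spread[upper] - spread[lower]) / (base[upper] - base[lower])
    return factors

for (upper, lower), factor in zoom_factors(CLASSICAL, ASA).items():
    print(f"{upper} to {lower}: {factor:.2f}x")

overall = (ASA["SF 11"] - ASA["SF 8"]) / (CLASSICAL["SF 11"] - CLASSICAL["SF 8"])
print(f"overall: {overall:.2f}x")
```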
Again: not so bad for openings and a scoring system that "adds noise" to results, I believe...
Re: SPCC: Testrun of Lc0 J92260 finished
And that is your main mistake. Advanced Armageddon Scoring is not applied to a bare "basic set of information". The rescoring is done on games that were played from openings that give a clear and measurable advantage for white (black is not allowed to castle short). So the basic set of information here is that a draw is a big success for black, because white had a clear advantage at the beginning of the game. And a win for black is an even bigger success for black. So the Advanced Armageddon rescoring is not a new "fancy rule", but a rescoring that takes white's advantage at the beginning of the game into account.
Using Advanced Armageddon rescoring on games played from classical, balanced openings would be nonsense. True, and I agree. But that is not what I am doing. Advanced Armageddon scoring must always (and only!) be used on games where white has a clear advantage at the beginning. So the rescoring must be used on my NBSC openings or on my forthcoming Unbalanced Human Openings (to be released soon), not on games played with classical, balanced openings. That is very important!
Re: SPCC: Testrun of Lc0 J92260 finished
I wrote this last month in another thread:
"Armageddon variants theoretically work, but are much less appealing. Removing the drawing margin theoretically increases the skill ceiling, but that's compared to start-position chess. Compared to 'TCEC chess' with forced openings that try to be close to the draw/win frontier, you'll be able to display a 2x bigger Elo spread because you mark draws by the weak side as wins, but really the complexity is equivalent. Take NBSC as the start position and score draws just as draws: how is it in any way an inferior game to NBSC Armageddon, assuming reverse games are played, which is always true in engine chess? Stefan even went the whole way to score black wins as +2; just keep classical scoring at this point, it will reward those black wins automatically!"
The fundamental insight is that while picking a position very close to the draw/win frontier does improve the skill ceiling by mostly avoiding chess's very wide draw margin, scoring draws from such positions as wins for the weak side is cosmetic.
Assuming the weak side never wins, Armageddon scoring doubles the displayed score difference and nonlinearly increases the computed Elo difference (it doubles it at minimum, but the factor can be arbitrarily large), while the underlying data is exactly the same. The standard deviation of the % score is exactly doubled too, and the Elo error bars are nonlinearly increased by the same factor as the Elo difference.
In no way, shape or form does this scoring change increase statistical certainty.
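A small sketch of the point above, under the stated assumption that the weak side (Black) never wins: rescoring the same games with plain Armageddon rules doubles the distance from 50 % and doubles the per-game spread (standard deviation) for each color, because every Armageddon score is just a linear function of the classical score. The sample results and the names used here are invented for illustration.

```python
from statistics import mean, pstdev

# Points for the engine under test, keyed by (its color, game result).
# Black never wins here, so only "1-0" and "1/2" occur.
CLASSICAL = {("white", "1-0"): 1.0, ("white", "1/2"): 0.5,
             ("black", "1-0"): 0.0, ("black", "1/2"): 0.5}
ARMAGEDDON = {("white", "1-0"): 1.0, ("white", "1/2"): 0.0,
              ("black", "1-0"): 0.0, ("black", "1/2"): 1.0}

def scores(games, table):
    """Per-game points of the engine under the given scoring table."""
    return [table[game] for game in games]

# Hypothetical sample: 5 games with each color
white_games = [("white", "1-0")] * 3 + [("white", "1/2")] * 2
black_games = [("black", "1-0")] * 1 + [("black", "1/2")] * 4
all_games = white_games + black_games

edge_classical = mean(scores(all_games, CLASSICAL)) - 0.5
edge_armageddon = mean(scores(all_games, ARMAGEDDON)) - 0.5
print("edge over 50 %:", edge_classical, "vs", edge_armageddon)

for label, games in (("white", white_games), ("black", black_games)):
    print(label, "spread:",
          pstdev(scores(games, CLASSICAL)), "vs", pstdev(scores(games, ARMAGEDDON)))
```

Since both the measured edge and its spread scale by the same factor, the ratio between them, which is what decides statistical certainty, stays the same in this setting.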
Now, let's assume that the weak side sometimes manages to win games despite the strong initial disadvantage. Armageddon scoring actually suppresses relevant information about the engines' strengths. That's why you came up with "Advanced Armageddon scoring". But why not keep classical scoring then?
Take 12 games: 3 white wins, 3 white draws, 3 black losses, 3 black draws. Advanced Armageddon: 6/12. Classical: 6/12. Now take 3 white wins, 3 white draws, 4 black losses, 2 black wins. Advanced Armageddon: 7/14. Classical: 6.5/12. The relative reward for a weak-side win is bigger with classical scoring than with "Advanced Armageddon" scoring.
pohl4711 wrote: ↑Fri Oct 23, 2020 8:13 pm
"The differences of rankings in the two performance lists comes from the fact, that 7 engines/nets share a 10 Elo interval of strength with classical scoring. So, with 500 played games the ranking from place 3-10 is pure random (these results are the 'noise') with classical scoring.
...
And I will definitly not use classical openings and classical scoring, because I will not waste my PC time, getting random rankings (or measuring noise)"
Changing the scoring method like you did does absolutely nothing to help avoid stochastic noise.
The error bar numbers you're getting from ordo are simply off. The lucky case for you is that they are still correct upper bounds with your scoring method and are more than twice too big for classical scoring. The less lucky case is that your new error bars underestimate the real variance and you think your rankings are more reliable than they really are.
Personally, I don't trust a 500-game sample to have a low enough variance when comparing engines that are close in strength.
See also this thread by Laskos : http://talkchess.com/forum3/viewtopic.php?f=2&t=75080