SPCC: Testrun of Lc0 J92-260 finished

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

SPCC: Testrun of Lc0 J92-260 finished

Post by pohl4711 »

NN-testrun of Lc0 0.26.3 J92-260 finished - a new impressive highscore!

https://www.sp-cc.de

(You may have to clear your browser cache or reload the website.)
mehmet123
Posts: 670
Joined: Sun Jan 26, 2020 10:38 pm
Location: Turkey
Full name: Mehmet Karaman

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by mehmet123 »

The nets trained by James Horsfall Thomas (jhorthos) are really amazing.
The 320 x 24 net was trained on ~130,000,000 games. That is ~3x the number of games AlphaZero was trained on.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by Dann Corbit »

That is quite a leap, +34 Elo better than the previous top score.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
MMarco
Posts: 195
Joined: Sun Apr 12, 2020 1:09 am
Full name: Marc-O Moisan-Plante

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by MMarco »

Dann Corbit wrote: Sun Oct 18, 2020 4:41 am That is quite a leap, +34 Elo better than the previous top score
I found it surprising, but I think that the "advanced scoring Armageddon" ("ASA") scheme used here can introduce a very large bias compared to standard elo. From S. Pohl's website (https://www.sp-cc.de/nbsc-armageddon-openings.htm):
My Advanced Armageddon scoring is:
Win for white = 1 point for white
Draw = 1 point for black
Win for black = 2 points for black (!!!)
When white loses a game, it is recorded as two losses, on top of the standard Armageddon rule that a draw with white counts as a loss. This is getting far from the regular chess scoring scheme. For example, assume that Engine A and Engine B each play a reference engine, Engine C (like Stockfish in this case, with elo = 0):

Engine A vs Engine C:
Engine A score as white: +1,-0,=4 --> 1/5
Engine A score as black: +2,-2,=1 --> 5/7 (the draw as black counts as a win, and the two wins as black count as four wins, hence 2 extra games)
"Advanced scoring Armageddon" total --> 6/12 (elo = 0).
Standard chess total (+3,-2,=5) --> 5.5/10 (about + 40 elo).
There is a downward bias of 40 elo here.

Engine B vs Engine C:
Engine B score as white: +2,-2,=1 --> 2/7 (two extra games, because white losses count twice)
Engine B score as black: +0,-0,=5 --> 5/5
"Advanced scoring Armageddon" total --> 7/12 (about + 65 elo).
Standard chess total (+2,-2,=6) --> 5/10 (elo = 0)
There is an upward bias of 65 elo here.

The gap between the two rating systems is over 100 elo in this example, and Engine B is better than Engine A with ASA, whereas in standard chess Engine A is superior. The bias can be important and unsystematic. This might explain the unexpected leap in elo.
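
The Elo figures in this example can be checked with the usual logistic Elo formula. The sketch below assumes that formula (rating tools such as Ordo fit ratings differently, so their output need not match), and `elo_from_score` is a hypothetical helper name. Under that assumption the exact values come out somewhat below the rounded ones quoted above: roughly +35 for 5.5/10 and roughly +58 for 7/12.

```python
import math

def elo_from_score(points: float, games: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    s = points / games
    if not 0.0 < s < 1.0:
        raise ValueError("score fraction must lie strictly between 0 and 1")
    return 400.0 * math.log10(s / (1.0 - s))

# Score totals taken from the Engine A / Engine B example above.
results = {
    "Engine A, ASA": (6, 12),
    "Engine A, standard": (5.5, 10),
    "Engine B, ASA": (7, 12),
    "Engine B, standard": (5, 10),
}
for label, (points, games) in results.items():
    print(f"{label}: {elo_from_score(points, games):+.1f} Elo")
```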

As for the J92 nets, they seem to be improving in general. The author tests them at a fast time control (https://github.com/jhorthos/lczero-trai ... a-Training ) with thousands of games. The nets are only a few elo apart from each other:

   1 lc0.net.J92-270     :    20.4    5.1    8000    53   2132   1678   4190    52      59
   2 lc0.net.J92-190     :    19.5    5.3    8000    53   2069   1635   4296    54      83
   3 lc0.net.J92-210     :    15.8    5.4    8000    52   2007   1655   4338    54      67
   4 lc0.net.J92B-205    :    14.2    5.3    8000    52   2026   1710   4264    53      51
   5 lc0.net.J92-240     :    14.1    5.2    8000    52   1984   1670   4346    54      55
   6 lc0.net.J92-180     :    13.6    5.2    8000    52   1990   1686   4324    54      56
   7 lc0.net.J92-220     :    13.1    5.3    8000    52   2031   1739   4230    53      66
   8 lc0.net.J92-145     :    11.5    5.4    8000    52   1963   1706   4331    54      72
   9 lc0.net.J92-160     :     9.2    5.3    8000    51   1995   1789   4216    53      58
  10 lc0.net.J92-130     :     8.5    5.3    8000    51   1987   1798   4215    53      60
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by pohl4711 »

MMarco wrote: Fri Oct 23, 2020 11:05 am
The gap between the two rating systems is over 100 elo in this example, and Engine B is better than Engine A with ASA, whereas in standard chess Engine A is superior. The bias can be important and unsystematic. This might explain the unexpected leap in elo.

That is false. The Advanced Armageddon Scoring makes the Elo spread between results wider. With classical scoring, a lot of the results of my Lc0 testing would be so close that it is purely random which net comes out a little better or a little worse.
Plain Armageddon scoring could lead to distortions, because there are only 2 scores for 3 possible game outcomes (a draw and a win for black are both one point for black). But my new Advanced Armageddon Scoring has 3 scores for 3 possible game outcomes, like the classical scoring system. That is the main progress and advantage of my scoring system, and the reason why there are no distortions, only a spreading of results!

Here is what I mean:

Classical scoring of my NN-games (first 10 places in the list)

     Program                               Elo    +    -   Games   Score   Av.Op.  Draws

   1 Lc0 0.26.3 J92-260 (30x384)         : 3631   22   22   500    58.9 %   3568   42.2 %
   2 Lc0 0.26.2 J92-130 (30x384)         : 3617   22   22   500    57.0 %   3568   45.2 %
   3 Lc0 0.26.3 65536 (24x320)           : 3610   21   21   500    56.0 %   3568   41.2 %
   4 Lc0 0.26.3 65411 (24x320)           : 3610   22   22   500    56.0 %   3568   44.4 %
   5 Lc0 0.24.1 LS 14.3 (20x256)         : 3609   23   23   500    55.8 %   3568   44.4 %
   6 Lc0 0.25.1 LS 15 (20x256)           : 3608   22   22   500    55.7 %   3568   45.0 %
   7 Lc0 0.24.1 LS 14.2 (20x256)         : 3607   21   21   500    55.6 %   3568   43.6 %
   8 Lc0 0.26.2 T60B.7-105 (24x320)      : 3606   21   21   500    55.4 %   3568   47.6 %
   9 Lc0 0.26.2 J92-160 (30x384)         : 3603   22   22   500    55.0 %   3568   45.6 %
  10 Lc0 0.26.1 t60-4619 (30x384)        : 3600   22   22   500    54.6 %   3568   45.2 %
And here the Advanced Armageddon Scoring:

1  Lc0 0.26.3 J92-260 (30x384)      : 3689 515 (+343,=  0,-172), 66.6 %
2  Lc0 0.26.2 J92-130 (30x384)      : 3655 521 (+324,=  0,-197), 62.2 %
3  Lc0 0.26.3 65536 (24x320)        : 3648 514 (+315,=  0,-199), 61.3 %
4  Lc0 0.24.1 LS 14.3 (20x256)      : 3644 513 (+311,=  0,-202), 60.6 %
5  Lc0 0.25.1 LS 15 (20x256)        : 3643 512 (+310,=  0,-202), 60.5 %
6  Lc0 0.26.3 65411 (24x320)        : 3641 519 (+313,=  0,-206), 60.3 %
7  Lc0 0.26.2 J92-160 (30x384)      : 3635 511 (+304,=  0,-207), 59.5 %
8  Lc0 0.26.2 T60B.7-105 (24x320)   : 3634 519 (+308,=  0,-211), 59.3 %
9  Lc0 0.24.1 LS 14.2 (20x256)      : 3633 520 (+308,=  0,-212), 59.2 %
10 Lc0 0.25.1 LS 15 Kayra4          : 3624 513 (+297,=  0,-216), 57.9 %
As you can see, J92-260 is clearly the best in the classical scoring, too. But from place 3 to 10, all results lie within a 10-Elo interval, which is so small that the order of the nets in that list is purely random unless you play some thousand games with each net. With Advanced Armageddon Scoring, the Elo interval from place 3 to 10 is 24 Elo, which is 2.4x wider (I predicted around 2.25x!). What that means can be read on my website:
"But mention, that the usage of my NBSC-Armageddon openings spreads the Elo-results around 2.25x wider, than using classical openings for testing(!), so with classical openings, you would need an errorbar of +/- 9 Elo for the same statistical quality of the results (= the rankings of Lc0 nets here). And for an errorbar of +/- 9 elo, you need around 3000 games, not 500, which means 6x more games (and 6x more PC-time)!!"

Look at the scores of the 2 new T60 nets 65536 and 65411: with classical scoring, the learning progress disappears. Both have 3610 Elo. With Advanced Armageddon Scoring, 65536 is 7 Elo better. Not much, but progress.
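
The 6x figure in the website quote above follows if error bars shrink with the square root of the number of games. A rough sketch under that assumption, taking the +/- 22 Elo error bar at 500 games from the classical table above; `games_needed` is a hypothetical helper, not something from the website.

```python
def games_needed(games_now: int, bar_now: float, bar_target: float) -> float:
    """Games required to shrink an error bar, assuming it scales as 1/sqrt(games)."""
    return games_now * (bar_now / bar_target) ** 2

needed = games_needed(500, 22.0, 9.0)
print(f"about {needed:.0f} games, i.e. {needed / 500:.1f}x as many")
```
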
What else can I say?
MMarco
Posts: 195
Joined: Sun Apr 12, 2020 1:09 am
Full name: Marc-O Moisan-Plante

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by MMarco »

pohl4711 wrote: Fri Oct 23, 2020 11:51 am
That is false
LOL. Can you explain what's wrong with what I wrote?
so with classical openings, you would need an errorbar of +/- 9 Elo for the same statistical quality of the results (= the rankings of Lc0 nets here).
Your "Armageddon Advanced scoring" is just introducing noise. The lists you showed just proved it. The Armageddon list doesn't preserve the ranking of the standard elo list: 65411 goes down from #4 to #6, LS 14.3 and LS 15 go up a rank while LS 14.2 loses two ranks, t60-4619 is kicked out of the top ten, etc. This is what an unsystematic bias is: sometimes it works in favor of an engine, sometimes against it (or against another engine). This extra noise is added on top of the random fluctuation we usually get from the sample results.
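
The rank changes listed above can be tabulated mechanically. A minimal sketch, with the engine names abbreviated from the two top-ten tables posted earlier in this thread and `rank_shifts` a hypothetical helper; a positive shift means the net moved up under Advanced Armageddon Scoring.

```python
# Top-ten order of each list, abbreviated from the two tables in this thread.
classical = ["J92-260", "J92-130", "65536", "65411", "LS 14.3",
             "LS 15", "LS 14.2", "T60B.7-105", "J92-160", "t60-4619"]
armageddon = ["J92-260", "J92-130", "65536", "LS 14.3", "LS 15",
              "65411", "J92-160", "T60B.7-105", "LS 14.2", "LS 15 Kayra4"]

def rank_shifts(before, after):
    """Map each name to (old rank - new rank); None if it left the list."""
    new_rank = {name: i for i, name in enumerate(after, start=1)}
    return {name: (old - new_rank[name]) if name in new_rank else None
            for old, name in enumerate(before, start=1)}

shifts = rank_shifts(classical, armageddon)
for name, shift in shifts.items():
    print(f"{name:12s} {'dropped out' if shift is None else format(shift, '+d')}")
```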

Your list is interesting anyway, and I just saw that I can download the "non-armageddonized" database if I want. Thanks for running the games and making them available. I'll look into them in the future. I do watch your list quite a bit, but I prefer to have a standard chess scoring and rating scheme.

The "advanced scoring scheme" does not give better statistical quality with fewer games, it gives worse statistical quality with the same number of games!

Think about it. You have a basic set of information in your data (the standard chess results), then you rescore them according to some fancy rules (Advanced Armageddon) that don't preserve the rankings (nor the number of games), and then, as if by magic, you would get better statistical estimates??!! It doesn't work like that... I can agree that it seems to stretch the ratings (it did in my numerical example above), but it also distorts the list. Apart from the wrong rankings, the error bars associated with them have now also become untrustworthy, and the same goes for the LOS.
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by pohl4711 »

MMarco wrote: Fri Oct 23, 2020 12:39 pm
Your "Armageddon Advanced scoring" is just introducing noise. The lists you showed just proved it. The Armageddon list doesn't preserve the ranking of the standard elo list: 65411 goes down from #4 to #6, LS 14.3 and LS 15 go up a rank while LS 14.2 loses two ranks, t60-4619 is kicked out of the top ten, etc. This is what an unsystematic bias is: sometimes it works in favor of an engine, sometimes against it (or against another engine). This extra noise is added on top of the random fluctuation we usually get from the sample results.
The differences in ranking between the two performance lists come from the fact that 7 engines/nets share a 10-Elo interval of strength with classical scoring. So, with 500 games played, the ranking from place 3 to 10 is purely random with classical scoring (these results are the "noise"). With Advanced Armageddon Scoring it is not, because the Elo interval is 2.4x bigger. Is it really a surprise that the ranking is different when one list has a random ranking and the other does not?!? And this is, by the way, my reason for using NBSC openings and Advanced Armageddon Scoring for the testruns of lc0 and its nets. The nets are often so close in strength that it is impossible to get a valid ranking beyond random when playing only 500 games. And more than 500 games are difficult to play, because with lc0 it is not possible to play games simultaneously on one machine. I need 2.5 days for one 500-game testrun of lc0 vs Stockfish.
And I will definitely not use classical openings and classical scoring, because I will not waste my PC time getting random rankings (or measuring noise).
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by pohl4711 »

Anybody who wants to should look at the test results of my NBSC openings with Advanced Armageddon Scoring compared to classical openings and scoring:
https://www.sp-cc.de/nbsc-armageddon-openings.htm

Just 2 examples from there: classical Stockfish openings from Fishtest compared with my Noomen 3-move NBSC openings using Advanced Armageddon Scoring (these openings are used for my lc0 testing).

Testing conditions:
2'+1'', single thread, i7-8750H hexacore mobile CPU, 256 MB hash, cutechess-cli (no TB for engines, but 5 Syzygy for cutechess), Contempt=0 for all Stockfish versions. All openings replayed with reversed colors. Round robin of 1500 games with SF 11, SF 10, SF 9 and SF 8: each SF played 250 games vs. each of the 3 opponents = 1500 games per testrun. ORDO for the ratings (3400 Elo base value).

Stockfish Framework 8moves v3:

     Program                Elo    +    -   Games   Score   Av.Op.  Draws
   1 Stockfish 11 bmi2    : 3466   14   14   750    62.1 %   3378   68.7 %
   2 Stockfish 10 bmi2    : 3431   13   13   750    55.8 %   3390   75.6 %
   3 Stockfish 9 bmi2     : 3383   14   14   750    46.7 %   3406   71.3 %
   4 Stockfish 8 bmi2     : 3320   14   14   750    35.4 %   3427   63.9 %

Draw-rate                    : 69.9 % (smaller is better)
Elo spreading (first to last): 146 Elo (bigger is better)

NBSC Advanced Armageddon Noomen 3-moves:

     Program                Elo    +    -   Games   Score   Av.Op.  Draws
   1 Stockfish 11 bmi2    : 3543   15   15   803    74.0 %   3347    0.0 %
   2 Stockfish 10 bmi2    : 3468   14   14   785    61.8 %   3374    0.0 %
   3 Stockfish 9 bmi2     : 3359   14   14   794    42.8 %   3414    0.0 %
   4 Stockfish 8 bmi2     : 3229   16   16   816    22.1 %   3459    0.0 %

Draw-rate                    : 0 % (smaller is better)
Elo spreading (first to last): 314 Elo (bigger is better)
White Score                  : 46.5 %
Number of wins for Black (= 2 points for Black in advanced scoring): 99
Not so bad for openings and a scoring system that "adds noise" to results, I believe...
And proof that saying "The advanced scoring scheme does not give better statistical quality with fewer games, it gives worse statistical quality with the same number of games!" is obviously wrong!

Interesting additional facts are the comparisons between the individual engines:
Distance SF 11 to 10: 35 Elo gets to 75 Elo = 2.14x bigger
Distance SF 10 to 9: 48 Elo gets to 109 Elo = 2.27x bigger
Distance SF 9 to 8: 63 Elo gets to 130 Elo = 2.06x bigger
(Overall Distance SF 11 to 8: 146 Elo gets to 314 Elo = 2.15x bigger)
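
These ratios can be recomputed directly from the two Ordo tables above. A small sketch; `zoom_factors` is a hypothetical helper name and the ratings are copied from the tables.

```python
# Ratings copied from the two tables above.
classical = {"SF 11": 3466, "SF 10": 3431, "SF 9": 3383, "SF 8": 3320}
armageddon = {"SF 11": 3543, "SF 10": 3468, "SF 9": 3359, "SF 8": 3229}

def zoom_factors(base, zoomed):
    """Ratio of rating gaps between neighbouring engines, plus first-to-last."""
    names = list(base)
    pairs = list(zip(names, names[1:])) + [(names[0], names[-1])]
    return {f"{a} to {b}": (zoomed[a] - zoomed[b]) / (base[a] - base[b])
            for a, b in pairs}

for pair, factor in zoom_factors(classical, armageddon).items():
    print(f"{pair}: {factor:.2f}x")
```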

So the rating distances are scaled by a very similar "zooming factor" that my Advanced Armageddon Scoring adds. The results are spread enormously, but definitely not distorted.
Again: not so bad for openings and a scoring system that "adds noise" to results, I believe...
pohl4711
Posts: 2434
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by pohl4711 »

MMarco wrote: Fri Oct 23, 2020 12:39 pm Think about it. You have a basic set of information in your data (the standard chess results), then you rescore them according to some fancy rules (Advanced Armageddon)
And that is your main mistake. Advanced Armageddon Scoring is not applied on its own to a "basic set of information". The rescoring is done on games that were played from openings that give a clear and measurable advantage to white (black is not allowed to castle short). So the basic set of information here is that a draw is a big success for black, because white had a clear advantage at the beginning of the game. And a win for black is an even bigger success for black. So the Advanced Armageddon rescoring is not a new "fancy rule", but a rescoring that takes white's advantage at the beginning of the game into account.

Using Advanced Armageddon rescoring on games played from classical, balanced openings would be nonsense. True, and I agree. But that is not what I am doing. Advanced Armageddon scoring must always (and only!) be used on games where white has a clear advantage at the beginning. So the rescoring must be used with my NBSC openings or with my forthcoming Unbalanced Human Openings (to be released soon), not on games played with classical, balanced openings. That is very important!
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: SPCC: Testrun of Lc0 J92-260 finished

Post by Alayan »

I wrote this last month in another thread:
Armageddon variants theoretically work, but are much less appealing. Removing the drawing margin theoretically increases the skill ceiling, but that's compared to start position chess. Compared to "TCEC chess" with forced openings that try to be close to the draw/win frontier, you'll be able to display a 2x bigger elo spread because you mark draws by the weak side as wins, but really the complexity is equivalent. Take NBSC as the start position and score draws just as draws: how is it in any way an inferior game to NBSC Armageddon, assuming reverse games are played, which is always true in engine chess? Stefan even went the whole way to score black wins as +2; just keep classical scoring at this point, it will reward those black wins automatically!
The fundamental insight is that while picking a position very close to the draw/win frontier does improve the skill ceiling by mostly avoiding chess's very wide draw margin, scoring draws from such positions as wins for the weak side is cosmetic.

Assuming the weak side never wins, armageddon scoring doubles the displayed score difference and non-linearly increases the computed elo difference (it doubles it at minimum, but the factor can be arbitrarily large), but the underlying data is exactly the same. The spread (standard deviation) of the % score is exactly doubled too, and the elo error bars are non-linearly increased by the same factor as the elo difference.

In no way, shape or form does this scoring change increase statistical certainty.
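
The doubling can be illustrated with a small sketch. It assumes equal numbers of white and black games and that the weak side (Black) never wins; the samples are made-up illustrations, not data from this thread. Under those assumptions the Armageddon score fraction equals twice the classical fraction minus 0.5, so the differences between samples and their spread double together.

```python
from statistics import pstdev

def classical_fraction(white, black):
    """Classical score fraction; each entry is 1, 0.5 or 0 from the engine's view."""
    return (sum(white) + sum(black)) / (len(white) + len(black))

def armageddon_fraction(white, black):
    """Plain Armageddon: a white win scores for White, a draw scores for Black.
    Assumes Black never wins, as in the paragraph above."""
    assert 0 not in white and 1 not in black, "weak side must never win"
    points = sum(1 for r in white if r == 1) + sum(1 for r in black if r == 0.5)
    return points / (len(white) + len(black))

# Hypothetical repeated samples of the same match: (white results, black results).
samples = [
    ([1, 1, 0.5, 0.5], [0.5, 0.5, 0, 0]),
    ([1, 1, 1, 0.5], [0.5, 0.5, 0.5, 0]),
    ([1, 0.5, 0.5, 0.5], [0.5, 0, 0, 0]),
]
classical = [classical_fraction(w, b) for w, b in samples]
armageddon = [armageddon_fraction(w, b) for w, b in samples]

print("classical :", classical, "spread", pstdev(classical))
print("armageddon:", armageddon, "spread", pstdev(armageddon))
```

Because signal and noise are stretched by the same factor, their ratio is unchanged in this simplified setting.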

Now, let's assume that the weak side sometimes manages to win games despite the strong initial disadvantage. Plain armageddon scoring actually suppresses relevant information about the engine strengths. That's why you came up with "Advanced armageddon scoring". But why not keep classical scoring then?

Take 12 games: 3 white wins, 3 white draws, 3 black losses, 3 black draws. Advanced armageddon: 6/12. Classical: 6/12. Now take 3 white wins, 3 white draws, 4 black losses, 2 black wins. Advanced armageddon: 7/14. Classical: 6.5/12. The relative reward for a weak-side win is bigger with classical scoring than with "advanced armageddon" scoring.
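
The two cases above can be recomputed with a short sketch. The function names are hypothetical, and the Advanced Armageddon rule is the one quoted earlier in this thread (White win = 1 point, draw = 1 point for Black, Black win = 2 points for Black).

```python
def classical_score(ww, wd, wl, bw, bd, bl):
    """(points, games) with 1 / 0.5 / 0 scoring; w* = as White, b* = as Black."""
    return ww + bw + 0.5 * (wd + bd), ww + wd + wl + bw + bd + bl

def advanced_armageddon_score(ww, wd, wl, bw, bd, bl):
    """(points, total points): White win = 1, draw = 1 for Black, Black win = 2."""
    own = ww + bd + 2 * bw
    opponent = wd + 2 * wl + bl
    return own, own + opponent

# Counts are (wins, draws, losses) as White, then (wins, draws, losses) as Black.
before = (3, 3, 0, 0, 3, 3)
after = (3, 3, 0, 2, 0, 4)
for counts in (before, after):
    print(classical_score(*counts), advanced_armageddon_score(*counts))
```
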
pohl4711 wrote: Fri Oct 23, 2020 10:13 pm The differences in ranking between the two performance lists come from the fact that 7 engines/nets share a 10-Elo interval of strength with classical scoring. So, with 500 games played, the ranking from place 3 to 10 is purely random with classical scoring (these results are the "noise").
...
And I will definitely not use classical openings and classical scoring, because I will not waste my PC time getting random rankings (or measuring noise).
Changing the scoring method like you did does absolutely nothing to help avoid stochastic noise.

The error bar numbers you're getting from ordo are simply off. The lucky case for you is that they're still correct upper bounds with your scoring method and are more than twice too big for classical scoring. The less lucky case is that your new error bars underestimate the real variance and you think your rankings are more reliable than they really are.

Personally, I don't trust a 500-game sample to have low enough variance when comparing engines that are close in strength.

See also this thread by Laskos: http://talkchess.com/forum3/viewtopic.php?f=2&t=75080