AlphaZero No Castling Chess

lkaufman · Post by **lkaufman** » Tue Dec 10, 2019 6:26 pm

Laskos wrote: ↑Tue Dec 10, 2019 1:26 pm
lkaufman wrote: ↑Tue Dec 10, 2019 7:07 am
lkaufman wrote: ↑Mon Dec 09, 2019 5:37 pm
Javier Ros wrote: ↑Mon Dec 09, 2019 11:11 am
Nordlandia wrote: ↑Mon Dec 09, 2019 5:10 am Engines need to know that they're playing Armageddon, so they need to be taught to play in that mode.
I agree. In the same way that AlphaZero has been trained to play No Castling Chess, Lc0 must learn No Castling Chess or BNC with draw advantage, and alpha-beta programs must also be modified. A new opening repertoire will be created for each variation of chess. The contempt factor for each side is not enough.

The experiments of Laskos and Larry Kaufman are very interesting, but when the programs take the draw advantage into account the results will vary.
Anyway the proposal seems balanced and acceptable.
Although knowing the Armageddon rule is ideal, I'm pretty sure that using a White Contempt of 75 in Komodo comes close enough to this for most practical purposes. I think that the results indicate that the exact value doesn't matter much, because as the value goes higher, both sides modify their play more towards the Armageddon rule, and the effects cancel out.
I added Fat Fritz to the experiment, although it doesn't have Contempt so you may consider the result less reliable than the Komodo results. It ran on an RTX 2080 at 1' + 1". Result: 177 White wins, 8 Black wins, 185 draws, so 177/370 points = 47.8%. So this brings the overall results for all engines tested down to between 50% and 51% (depending on how you weight them). It is really amazing how fair this variant appears to be, at least between engines!
I realized that it may be a flaw in the testing of this idea to run only identical or very similar engines against each other. In the real world, engines and humans don't generally play against clones, so this is not a proper test. So I'm running some unrelated engine matches. They don't have to be of equal strength; as long as they are within a hundred Elo or so of each other this should still work, since each side gets half White and half Black. Of course, if the engines were a thousand Elo apart the result would come out 50% for each color, since the stronger engine would win all the games, half with each color, but with moderate Elo gaps any White-Black bias should show up.
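To put rough numbers on the Elo-gap argument, here is a minimal sketch using the standard logistic Elo expectation formula. The choice of that formula is an assumption; the post does not say which rating model it has in mind, and the function name is illustrative.

```python
def expected_score(elo_gap: float) -> float:
    """Expected score of the stronger side under the logistic Elo model (assumed)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Compare an even match, a moderate gap, and an extreme gap.
for gap in (0, 100, 1000):
    print(f"{gap:5d} Elo gap -> stronger side expects {expected_score(gap):.3f}")
```

With a 100 Elo gap the stronger side expects about 64%, which still leaves room for a color bias to show. With a 1000 Elo gap it expects well over 99%, so wins would split almost evenly between the colors regardless of any bias.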
My first test was Stockfish 10 on 6 fast cpu cores vs. Fat Fritz on RTX 2080, at 1' +0.6". Stockfish won the match 60 to 40, but that's not what matters. White won 58 games, with 42 draws and not a single Black win for either engine! This is a bit worrisome, as 58 to 42 is rather significant. We'll have to see how other unrelated pairings come out. It may turn out that NBC Armageddon isn't as fair as we thought, in which case we can fall back on NBSC, no Black short castling, which would obviously raise Black's prospects. But let's wait for results first.
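One rough way to gauge how significant 58 White wins in 100 games is would be a normal-approximation test against a 50% White win rate. This is only a sketch: it assumes the games are independent, and the helper names are illustrative rather than taken from any testing tool.

```python
import math


def white_win_z(white_wins: int, games: int, p0: float = 0.5) -> float:
    """z-score of the observed White win rate against a reference rate p0."""
    standard_error = math.sqrt(p0 * (1.0 - p0) / games)
    return (white_wins / games - p0) / standard_error


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under the normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z = white_win_z(58, 100)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.3f}")
```

Under these assumptions 58 out of 100 sits about 1.6 standard errors above 50%, which is suggestive but would usually call for more games before drawing a firm conclusion.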
I had this in the morning, an RR of 600 games with the top 3 engines at 60 + 0.6:
Code:
   # PLAYER        : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 SF_9          :  46.58  19.48     232.5     400    58.1      95    
   2 Houdini_6     :  19.35  18.85     213.5     400    53.4     100    
   3 Komodo_131    : -65.93  19.52     154.0     400    38.5     ---    

White advantage = 157.98 +/- 11.35
Draw rate (equal opponents) = 44.33 % +/- 2.40
Komodo surprisingly performs quite poorly, although I put White Contempt = 75, and the default small Contempt for the other two, as they have no Colored Contempt. I hope high Contempt in Komodo doesn't harm its performance and doesn't skew the overall result.

Here is the important aspect:
Code:
Games        : 600 (finished)

White Wins   : 314 (52.3 %)
Black Wins   : 70 (11.7 %)
Draws        : 216 (36.0 %)
Unfinished   : 0

White Score  : 70.3 %
Black Score  : 29.7 %
White wins are 52.3%, which is not that bad. My theory was that for determining the borderline at longer TC, both the White score (52.3%, the border is 50%) and the White performance in normal scoring (70.3%, the border is 75%) should be considered. At shorter TC more "accidents" happen, and a Black win (there are many here) is often an accident between engines of similar strength. But "accidents" happen the other way around too; some of the White wins are also "accidents". At the same time, the performance at 60 + 0.6 in normal scoring is significantly below 75%, which, combined with 52.3% White wins, means it is debatable what sort of opening this is at much longer TC.
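The two measures discussed here can be recomputed from the raw counts. A minimal sketch, assuming Armageddon scoring counts only White wins for White (draws go to Black) while normal scoring gives half a point per draw; the function name is illustrative.

```python
def score_summary(white_wins: int, black_wins: int, draws: int) -> tuple[float, float]:
    """Return (Armageddon White score, normal White score) as fractions."""
    games = white_wins + black_wins + draws
    armageddon_white = white_wins / games              # draws count for Black
    normal_white = (white_wins + 0.5 * draws) / games  # draws count as half a point
    return armageddon_white, normal_white


armageddon, normal = score_summary(white_wins=314, black_wins=70, draws=216)
print(f"Armageddon: {armageddon:.1%}, normal: {normal:.1%}")
```

For the 600-game RR this reproduces the 52.3% and 70.3% figures in the summary above.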

I am now testing the same RR of 600 games at 240 + 2.4. It will probably take almost a day, but if my theory stands, White might score even below 52.3% at longer TC, while White performance in normal scoring might be above 70.3% (though maybe not above the 75% threshold). That would be due to fewer Black wins at longer TC, and possibly fewer White wins. Let's see, but the statistical fluctuations are not that small even in 600 games, so one has to be cautious about inferring too much.

My overnight NBC test was latest Komodo on 7 threads vs. Fat Fritz on RTX 2080 at 1' +0.6". The result was that Komodo won by a single game, 88.5 to 87.5, so Komodo doesn't seem to be weak in this variant in general. The breakdown was: White won 98, Black won 9, Draws 69. So 98 to 78 Armageddon score, which is 55.7%. So again somewhat higher than you are getting, and worrisome. Note that this was with a lot of horsepower (7 fast cpu threads and the RTX 2080), so it's roughly like a 5' +3" test on one thread. It may be that the NNs behave quite differently on this than the A/B engines (or at least when playing against them), or else it just favors White to have more horsepower.
I don't quite agree about 75% in normal scoring being the win/draw dividing line. I know this is true with zero Black wins, but until we actually reach a level where that happens I don't see how it is relevant, and if it does happen then both measures give the same result.
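The link between the 50% and 75% lines can be written out directly. A small sketch with an illustrative helper name: normal score equals the White win rate plus half the draw rate, so the two borders coincide only when Black never wins.

```python
def normal_score(white_win_rate: float, black_win_rate: float) -> float:
    """White's normal-scoring result given the White and Black win rates."""
    draw_rate = 1.0 - white_win_rate - black_win_rate
    return white_win_rate + 0.5 * draw_rate


# With no Black wins, 50% White wins is exactly a 75% normal score.
print(normal_score(0.5, 0.0))
# With Black winning 11.7% of games (the rate in the 600-game RR),
# the same 50% White win rate maps to a lower normal score.
print(normal_score(0.5, 0.117))
```

So as long as Black still wins a meaningful share of games, the normal-scoring figure that corresponds to an even Armageddon split is below 75%.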
I don't know why Komodo did so poorly in your test; on one thread at 1 min it is somewhat weaker than Houdini 6 but not that much weaker, and my result vs FF suggests that at least with multiple threads it is fine at this tc. I suppose the Contempt setting should hurt it slightly in classical scoring but should help it with Armageddon scoring.
Regarding my laptop: if you want to know how it's possible, ask Dell (Alienware Area 51).
I'm trying SF9 vs Komodo now on one thread at 2' + 1".

Laskos · Post by **Laskos** » Tue Dec 10, 2019 10:43 pm

I am not saying that a "normal" score of 75% is the "true borderline", only that close to 50% White wins together with a normal score just below 75% at some time control might denote a borderline position for NBC with Armageddon scoring. I don't know why you cling to a strict 50% of White wins at some time controls between unequal engines with weird contempts. Take two Komodos with an uncolored Contempt=100 (both), a +0.60 eval position, and play games on one core at 60+0.6. You might easily get 50%+ White wins. Then set Contempt=0 and play games from the same +0.60 position at 600+6, and you will probably get less than 50% White wins. So where does that leave your "50%" rule?

First let's agree on the term "borderline" in this Armageddon variant. By it I mean that we cannot decide whether the position is a win or a draw at indefinitely long time controls. It's important in Armageddon scoring that the rate of White wins stays fairly stable as time controls get longer and longer, and does not drift toward 0% or 100% with very strong play. Stable and close to 50%, but not necessarily exactly 50%, and certainly not when using weird contempts and different engines. Ideally, for a borderline position, stronger and stronger play between engines close in strength might converge to 50% White wins and 0% Black wins.

I do not want to argue much, but I have studied in the past what I consider "borderline", the White/Draw "strong play" boundary, and there is a specific path from the "weak play" boundary to the "strong play" one. See for example here (pictures are gone, unfortunately):

The flow of Komodo eval

viewtopic.php?f=2&t=60212

After 400+ games at a 4x longer time control (240+2.4), it seems plausible that the position is borderline, but let's wait until the end of the 600-game RR. Also, statistical fluctuations are not negligible in 600 games, so differences of less than 1.0-1.5% should be considered inconclusive.
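The size of the fluctuations in a 600-game sample can be estimated with a binomial standard error. This is a rough sketch that assumes independent games and uses the normal approximation; the function name and the 1.96 multiplier (a conventional 95% level) are choices made here, not something stated in the post.

```python
import math


def win_rate_margin(win_rate: float, games: int, z: float = 1.96) -> float:
    """Approximate margin of error on an observed win rate."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / games)


# White win rate from the earlier 600-game RR at 60 + 0.6.
margin = win_rate_margin(314 / 600, 600)
print(f"95% margin on the White win rate: +/- {margin:.1%}")
```

Under these assumptions the margin comes out at roughly four percentage points, so differences of a percent or two between runs of this size are well within the noise.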

I am not sure why we are getting slightly different results for White performance, but the engines are different, and in fact it can still be attributed to fluctuations. The stranger thing is the difference in Komodo's performance against Houdini and SF; I checked and re-checked and nothing seems wrong.

lkaufman · Post by **lkaufman** » Wed Dec 11, 2019 12:58 am

I ran 279 games of latest Komodo vs. Stockfish 9 (K using 75 White Contempt, SF default settings) at 2' + 1" single thread; White won 143, Black won 24, with 112 draws, so White scored 143/279 = 51.3%. Stockfish won by 155 to 124 points (normal scoring), maybe roughly in line with what we would get in normal chess at this level. So overall it seems that there isn't much difference between our results if we exclude the NN vs A/B tests; without those tests I think it's right around 51% for White overall, which is of course quite OK, especially as it doesn't seem to be too sensitive to time control. But somehow when an NN (at least Fat Fritz) plays vs. A/B, results seem to be noticeably better than this for White. I don't know what to conclude from this; perhaps it just means that which side you would choose depends on your style and skill set relative to your opponent. Anyway, unless your 600 game test shows a surprisingly good White result, I think we won't be able to improve on the idea in any acceptable way. Allowing Black to castle long would be the simplest acceptable way to help Black, but it would almost certainly push the White score well below 50% and would probably be further from the magic threshold between win and draw. Given that White scores 55% or so in normal chess, I suppose even a 52 or 53% result for White in NBC Armageddon should be acceptable.
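How an overall White percentage comes out depends on the weighting. The sketch below pools only the two A/B-only results whose counts appear in this thread (314 of 600 and 143 of 279); the earlier same-engine tests that also feed the "around 51%" estimate are not quoted here, so they are left out. The function names are illustrative.

```python
def pooled_rate(results: list[tuple[int, int]]) -> float:
    """White win rate weighting every game equally."""
    wins = sum(w for w, _ in results)
    games = sum(g for _, g in results)
    return wins / games


def mean_of_rates(results: list[tuple[int, int]]) -> float:
    """White win rate weighting every match equally."""
    return sum(w / g for w, g in results) / len(results)


# (White wins, games) for the two A/B-only tests quoted above.
ab_tests = [(314, 600), (143, 279)]
print(f"pooled by games: {pooled_rate(ab_tests):.1%}")
print(f"mean of matches: {mean_of_rates(ab_tests):.1%}")
```

Both weightings land close to 52% for these two tests alone, inside the 52 to 53% range described above as acceptable.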

Laskos · Post by **Laskos** » Wed Dec 11, 2019 10:58 pm

lkaufman wrote: ↑Wed Dec 11, 2019 12:58 am
Laskos wrote: ↑Tue Dec 10, 2019 10:43 pm
lkaufman wrote: ↑Tue Dec 10, 2019 6:26 pm
Laskos wrote: ↑Tue Dec 10, 2019 1:26 pm
lkaufman wrote: ↑Tue Dec 10, 2019 7:07 am
lkaufman wrote: ↑Mon Dec 09, 2019 5:37 pm
Javier Ros wrote: ↑Mon Dec 09, 2019 11:11 am
Nordlandia wrote: ↑Mon Dec 09, 2019 5:10 am Engines need to know that they're playing armageddon. So they need to be taught playing that mode.
I agree, in the same way AlphaZero has been trained to play No Castling Chess, Lc0 must learn No Castling Chess or BNC with draw advantage, while alpha-beta programs must be also modified. A new opening repertoire will be created for each variation of chess. The contempt factor for each side is not enough.

The experiments of Laskos and Larry Kaufman are very interesting but when the programs take into account the draw advantage the results will vary.
Anyway the proposal seems balanced and acceptable.
Although knowing the Armageddon rule is ideal, I'm pretty sure that using a White Contempt of 75 in Komodo comes close enough to this for most practical purposes. I think that the results indicate that the exact value doesn't matter much, because as the value goes higher, both sides modify their play more towards the Armageddon rule, and the effects cancel out.
I added Fat Fritz to the experiment, although it doesn't have Contempt so you may consider the result less reliable than the Komodo results. It ran on an RTX 2080 at 1' + 1". Result: 177 White wins, 8 Black wins, 185 draws, so 177/370 points = 47.8%. So this brings the overall results for all engines tested down to between 50% and 51% (depending on how you weight them). It is really amazing how fair this variant appears to be, at least between engines!
I realized that it may be a flaw in the testing of this idea to run only identical or very similar engines against each other. In the real world, engines and humans don't generally play against clones, this is not a proper test. So I'm running some unrelated engine matches. They don't have to be of equal strength, as long as they are within a hundred elo or so of each other this should still work, since each side gets half White and half Black. Of course if the engines were a thousand elo apart the result would come out 50% for each color since the stronger engine would win all the games, half with each color, but with moderate elo gaps any White-Black bias should show up.
My first test was Stockfish 10 on 6 fast cpu cores vs. Fat Fritz on RTX 2080, at 1' +0.6". Stockfish won the match 60 to 40, but that's not what matters. White won 58 games, with 42 draws and not a single Black win for either engine! This is a bit worrisome, as 58 to 42 is rather significant. We'll have to see how other unrelated pairings come out. It may turn out that NBC Armageddon isn't as fair as we thought, in which case we can fall back on NBSC, no Black short castling, which would obviously raise Black's prospects. But let's wait for results first.
I had this in the morning, a RR of 600 games with top 3 engines at 60 + 0.6:
Code: Select all
   # PLAYER        : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 SF_9          :  46.58  19.48     232.5     400    58.1      95    
   2 Houdini_6     :  19.35  18.85     213.5     400    53.4     100    
   3 Komodo_131    : -65.93  19.52     154.0     400    38.5     ---    

White advantage = 157.98 +/- 11.35
Draw rate (equal opponents) = 44.33 % +/- 2.40
Komodo surprisingly performs quite poorly, although I put White Contempt = 75, and the default small Contempt for the other two, as they have no Colored Contempt. I hope high Contempt in Komodo doesn't harm its performance and doesn't skew the overall result.

Here is the important aspect:
Code: Select all
Games        : 600 (finished)

White Wins   : 314 (52.3 %)
Black Wins   : 70 (11.7 %)
Draws        : 216 (36.0 %)
Unfinished   : 0

White Score  : 70.3 %
Black Score  : 29.7 %
White wins are 52.3%, which is not that bad. My theory was that for determining the borderline at longer TC, both the White score (52.3%, the border is 50%) and the White performance in normal scoring (70.3%, the border is 75%) should be considered. At shorter TC more "accidents" happen, and a Black win (many here) is often an accident between similar in strength engines. But "accidents" do happen the other way around too, some of White wins are also "accidents". At the same time, the performance at 60 + 0.6 in normal scoring is significantly below 75%, which combined 52.3% White wins, denotes that it is debatable what sort of opening it is at much longer TC.

I am now testing at 240 + 2.4 the same RR in 600 games. It will take almost a day probably, but if my theory stands, White might score even below 52.3% at longer TC, but White performance might be above 70.3% in normal scoring (maybe not above the threshold of 75%). That is due to less Black wins at longer TC, and possibly, less White wins. Let's see, but the statistical fluctuations are not that small even in 600 games, so one has to be cautious inferring too many things.
My overnight NBC test was latest Komodo on 7 threads vs. Fat Fritz on RTX 2080 at 1' +0.6". The result was that Komodo won by a single game, 88.5 to 87.5, so Komodo doesn't seem to be weak in this variant in general. The breakdown was: White won 98, Black won 9, Draws 69. So 98 to 78 Armageddon score, which is 55.7%. So again somewhat higher than you are getting, and worrisome. Note that this was with a lot of horsepower (7 fast cpu threads and the RTX 2080), so it's roughly like a 5' +3" test on one thread. It may be that the NNs behave quite differently on this than the A/B engines (or at least when playing against them), or else it just favors White to have more horsepower.
I don't quite agree about 75% in normal scoring being the win/draw dividing line. I know this is true with zero Black wins, but until we actually reach a level where that happens I don't see how it is relevant, and if it does happen then both measures give the same result.
I don't know why Komodo did so poorly in your test; on one thread at 1 min it is somewhat weaker than Houdini 6 but not that much weaker, and my result vs FF suggests that at least with multiple threads it is fine at this tc. I suppose the Contempt setting should hurt it slightly in classical scoring but should help it with Armageddon scoring.
Regarding my laptop if you want to know how it's possible ask Dell (Alienware Area 51).
I'm trying SF9 vs Komodo now on one thread at 2' + 1".
I am not saying that "normal" scoring of 75% is the "true borderline", only that close to 50% White wins and close below 75% at some time control might denote borderline for the NBC with Armageddon scoring. I don't know why you cling by strict 50% White wins at some time controls between unequal engines with weird contempts. Take two Komodos with an uncolored Contempt=100 (both), a +0.60 eval position and play games on one core at 60+0.6. You might easily get 50%+ White wins. Then set Contempt=0, and play from the same +0.60 position games at 600+6, and you will probably get less than 50% White wins. So, where now is your decision about "50%" rule?

First let's agree on the term "borderline" in this Armageddon variant. I mean by it here that we cannot decide whether it's a win or a draw at indefinitely long time controls. What matters in Armageddon scoring is that the rate of White wins stays fairly stable at longer and longer time controls and does not drift towards 0% or 100% as play gets very, very strong. Stable and close to 50%, but not necessarily exactly 50%, and certainly not when using weird contempts and different engines. Ideally, stronger and stronger play between engines close in strength might converge to 50% White wins and 0% Black wins for a borderline position.
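
The link between the two thresholds mentioned here (50% White wins, 75% in normal scoring) can be sketched in a few lines. This is only an illustration; `normal_score` is a hypothetical helper, not something used in the tests above, and it takes win rates as fractions of all games.

```python
def normal_score(white_wins: float, black_wins: float) -> float:
    """White's normal-scoring fraction, given win rates as fractions of all games.

    Draws are whatever remains after the decisive games and count half a point.
    """
    draws = 1.0 - white_wins - black_wins
    return white_wins + 0.5 * draws


# Idealised borderline case: 50% White wins and no Black wins at all.
print(f"{normal_score(0.50, 0.00):.1%}")

# With some Black wins, the same 50% White win rate lands below 75%.
print(f"{normal_score(0.50, 0.10):.1%}")
```

With zero Black wins the normal score reduces to (1 + White win rate) / 2, which is why 50% White wins and 75% normal score coincide only in that limit.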

I do not want to argue much, but I have studied in the past what I consider "borderline", the White win / Draw boundary under "strong play", and there is a specific path from "weak play" to the "strong play" boundary. See for example here (pictures are gone, unfortunately):

The flow of Komodo eval

viewtopic.php?f=2&t=60212

After 400+ games at a 4x longer time control (240+2.4), it seems plausible that the position is borderline, but let's wait for the end of the 600-game RR. Also, statistical fluctuations are not negligible in 600 games, so differences of less than 1.0-1.5% should be considered inconclusive.

I am not sure why we are getting slightly different results for White performance, but the engines are different, and in fact it can still be attributed to fluctuations. The stranger thing is the difference in Komodo's performance against Houdini and SF; I checked and re-checked and nothing seems wrong.
I ran 279 games of latest Komodo vs. Stockfish 9 (K using 75 White Contempt, SF default settings) at 2' + 1" single thread. White won 143, Black won 24, with 112 draws, so White scored 143/279 = 51.3%. Stockfish won by 155 to 124 points (normal scoring), maybe roughly in line with what we would get in normal chess at this level. So overall it seems that there isn't much difference between our results if we exclude the NN vs A/B tests. Without those tests I think it's right around 51% for White overall, which is of course quite ok, especially as it doesn't seem to be too sensitive to time control. But somehow when an NN (at least Fat Fritz) plays vs. A/B, results seem to be noticeably better than this for White. I don't know what to conclude from this; perhaps it just means that which side you would choose depends on your style and skill set relative to your opponent. Anyway, unless your 600 game test shows a surprisingly good White result, I think we won't be able to improve on the idea in any acceptable way. Allowing Black to castle long would be the simplest acceptable way to help Black, but it would almost certainly push the White score well below 50% and would probably be further from the magic threshold between win and draw. Given that White scores 55% or so in normal chess, I suppose even a 52 or 53% result for White in NBC Armageddon should be acceptable.
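
For readers who want to reproduce the two percentages from raw game counts, here is a minimal sketch. The function name `scores` is an assumption made for illustration; the counts are the ones quoted in the paragraph above.

```python
def scores(white_wins: int, black_wins: int, draws: int) -> tuple[float, float]:
    """Return (armageddon, normal) scores for White as fractions.

    Armageddon scoring: a draw counts as a Black win, so White's score is
    simply the share of games White won.
    Normal scoring: a draw is worth half a point to each side.
    """
    games = white_wins + black_wins + draws
    armageddon = white_wins / games
    normal = (white_wins + 0.5 * draws) / games
    return armageddon, normal


# Komodo vs. Stockfish 9: 143 White wins, 24 Black wins, 112 draws.
arm, norm = scores(143, 24, 112)
print(f"Armageddon: {arm:.1%}, normal: {norm:.1%}")
# prints "Armageddon: 51.3%, normal: 71.3%"
```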

I forgot to test the elementary, easy things before the larger experiments (which mostly serve theoretical purposes).
Leela mimics human play the best, far better than AB engines.

In my experience the top T30 and T40 nets play at IM level at, say, 50 nodes, and at top GM level at about 1000 nodes. Here are the respective matches for NBC Chess, 1000 games each, between the top T40 net and the top T30 net (about 70 Elo points difference):

IM, 50 nodes/move

Games        : 1000 (finished)

White Wins   : 537 (53.7 %)
Black Wins   : 260 (26.0 %)
Draws        : 203 (20.3 %)
Unfinished   : 0

White Score  : 63.9 %
Black Score  : 36.2 %

Strong GM, 1000 nodes/move

Games        : 1000 (finished)

White Wins   : 528 (52.8 %)
Black Wins   : 91 (9.1 %)
Draws        : 381 (38.1 %)
Unfinished   : 0

White Score  : 71.9 %
Black Score  : 28.1 %

The flow is as I would like it to be: convergent from above towards 50% White wins, and from below towards 75% as White performance with normal scoring. That is my experience with truly borderline positions.

====================================================================================

Now, the long Round-Robins of 600 games with Komodo, SF and Houdini. The result is satisfactory for accepting NBC Chess as theoretically borderline: a position where it is hard to decide between White win and Draw even in theory, scoring stubbornly close to 50% White wins (Armageddon scoring) across different time controls, strengths and engines.

The results are here:

RR at 60s + 0.6s, 600 games, 3 top AB engines:

Games        : 600 (finished)

White Wins   : 314 (52.3 %)
Black Wins   : 70 (11.7 %)
Draws        : 216 (36.0 %)
Unfinished   : 0

White Score  : 70.3 %
Black Score  : 29.7 %

RR at 240s + 2.4s, 600 games, same 3 top AB engines:

Games        : 600 (finished)

White Wins   : 317 (52.8 %)
Black Wins   : 40 (6.7 %)
Draws        : 243 (40.5 %)
Unfinished   : 0

White Score  : 73.1 %
Black Score  : 26.9 %

The flow is towards 75% White performance from below, which is good, and White wins are almost stable (314/600 to 317/600, a statistically insignificant difference). It would have been good if the longer TC test had given slightly fewer White wins, but really, one would have to play several thousand games to observe whether that happens with any statistical significance.
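
As a rough check of the "statistically insignificant" remark, one could run a pooled two-proportion z-test on the two White-win counts. This is a sketch under the assumption that the games are independent trials, which is only approximately true for a round-robin from a fixed position; `two_proportion_z` is a hypothetical helper name.

```python
import math


def two_proportion_z(wins_a: int, games_a: int, wins_b: int, games_b: int) -> float:
    """z statistic for the difference between two win rates (pooled variance)."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / games_a + 1.0 / games_b))
    return (p_b - p_a) / std_err


# White wins at 60+0.6 (314/600) versus 240+2.4 (317/600).
z = two_proportion_z(314, 600, 317, 600)
print(f"z = {z:.2f}")
```

A |z| well below 1 means the two win rates cannot be told apart at this sample size, which matches the point that several thousand games would be needed.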

==============
==============

All in all, it seems to me this variant is sound for human players and it removes draws from the game, which is important. I also analyzed the statistical properties, namely |Elo difference| / |Elo error margins|, i.e. the resolving power of this variant. It is about 1.35 times higher than that of regular Chess, meaning that it needs about 1.8 times fewer games for the same statistical significance (say the same LOS, p-value or SPRT stop). This is quite an achievement per se: I had a hard time building more sensitive opening suites for regular Chess, and NBC Chess is far above all of them in sensitivity.
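
The step from a resolving-power ratio of 1.35 to a factor of about 1.8 in games can be sketched as below. It rests on the usual assumption that the Elo error margin shrinks like 1/sqrt(games); the function name is hypothetical.

```python
def games_saving_factor(resolution_ratio: float) -> float:
    """Factor by which the required number of games shrinks.

    Assumes the Elo error margin scales as 1 / sqrt(games), so a setup whose
    |Elo difference| / |Elo error| is `resolution_ratio` times higher reaches
    the same significance with resolution_ratio**2 times fewer games.
    """
    return resolution_ratio ** 2


print(f"{games_saving_factor(1.35):.2f}")  # prints "1.82"
```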

lkaufman · Post by **lkaufman** » Thu Dec 12, 2019 6:57 pm

I ran 614 games of SF9 vs latest Komodo with NBC Arm. on four threads at 2' + 1". The result agrees remarkably well with yours, namely 321 White wins, 48 Black wins, 245 draws, so 52.3% with Armageddon scoring and 72.2% with normal scoring. It seems that with the longer and higher quality (mp) tests the results are mostly in the 52 to 53% range, let's call it 52.5%. This is acceptable, but it does confirm my initial belief that White is for choice, and it suggests that the position is more likely a theoretical win than a theoretical draw, although of course no one knows.
But I think we should consider the alternative NBSC (No Black Short Castling). Based on comparing evals of top engines, my best guess is that the normal score for White will drop about two percentage points or a bit more, while the Armageddon score will drop about 3 percentage points. If I am right this would put it at about 49.5%, which is much more balanced than 52.5% (five times as close to equal!). In view of the psychological advantage of playing for the win, it is better to be slightly under 50% in engine play than slightly over. It is also closer to normal chess (Black can still castle queenside). It might seem like giving Black half of his castling rights is too much, but in actual practice it is rather difficult for Black to castle queenside without making concessions if White plays aggressively, and when Black does manage to castle long, White will almost surely have castled short, leading to games with a low draw probability, which is what White wants. So the swing from NBC to NBSC is not nearly as big as it might look at first glance. Anyway, I think we should test this and then compare to see which is really better.

D Sceviour · Post by **D Sceviour** » Sun Dec 15, 2019 1:02 am

D Sceviour wrote: ↑Sat Dec 14, 2019 6:21 pm Does "no castling chess" have an official variant name?

Winboard says it is "nocastle" I suppose.

From trying to get it to play in Winboard, it seems "nocastle" is a Fischer Random Castling variant without castling. So the question still remains: is there a variant name for standard chess without castling?

Sorry for hijacking the thread, but this can be re-posted under a new topic if anybody wants.

Laskos · Post by **Laskos** » Mon Dec 16, 2019 6:56 am

lkaufman wrote: ↑Thu Dec 12, 2019 6:57 pm
Laskos wrote: ↑Wed Dec 11, 2019 10:58 pm
lkaufman wrote: ↑Wed Dec 11, 2019 12:58 am
Laskos wrote: ↑Tue Dec 10, 2019 10:43 pm
lkaufman wrote: ↑Tue Dec 10, 2019 6:26 pm
Laskos wrote: ↑Tue Dec 10, 2019 1:26 pm
lkaufman wrote: ↑Tue Dec 10, 2019 7:07 am
lkaufman wrote: ↑Mon Dec 09, 2019 5:37 pm
Javier Ros wrote: ↑Mon Dec 09, 2019 11:11 am
Nordlandia wrote: ↑Mon Dec 09, 2019 5:10 am Engines need to know that they're playing armageddon. So they need to be taught playing that mode.
I agree, in the same way AlphaZero has been trained to play No Castling Chess, Lc0 must learn No Castling Chess or BNC with draw advantage, while alpha-beta programs must be also modified. A new opening repertoire will be created for each variation of chess. The contempt factor for each side is not enough.

The experiments of Laskos and Larry Kaufman are very interesting but when the programs take into account the draw advantage the results will vary.
Anyway the proposal seems balanced and acceptable.
Although knowing the Armageddon rule is ideal, I'm pretty sure that using a White Contempt of 75 in Komodo comes close enough to this for most practical purposes. I think that the results indicate that the exact value doesn't matter much, because as the value goes higher, both sides modify their play more towards the Armageddon rule, and the effects cancel out.
I added Fat Fritz to the experiment, although it doesn't have Contempt so you may consider the result less reliable than the Komodo results. It ran on an RTX 2080 at 1' + 1". Result: 177 White wins, 8 Black wins, 185 draws, so 177/370 points = 47.8%. So this brings the overall results for all engines tested down to between 50% and 51% (depending on how you weight them). It is really amazing how fair this variant appears to be, at least between engines!
I realized that it may be a flaw in the testing of this idea to run only identical or very similar engines against each other. In the real world, engines and humans don't generally play against clones, this is not a proper test. So I'm running some unrelated engine matches. They don't have to be of equal strength, as long as they are within a hundred elo or so of each other this should still work, since each side gets half White and half Black. Of course if the engines were a thousand elo apart the result would come out 50% for each color since the stronger engine would win all the games, half with each color, but with moderate elo gaps any White-Black bias should show up.
My first test was Stockfish 10 on 6 fast cpu cores vs. Fat Fritz on RTX 2080, at 1' +0.6". Stockfish won the match 60 to 40, but that's not what matters. White won 58 games, with 42 draws and not a single Black win for either engine! This is a bit worrisome, as 58 to 42 is rather significant. We'll have to see how other unrelated pairings come out. It may turn out that NBC Armageddon isn't as fair as we thought, in which case we can fall back on NBSC, no Black short castling, which would obviously raise Black's prospects. But let's wait for results first.
I had this in the morning, a RR of 600 games with top 3 engines at 60 + 0.6:
Code: Select all
   # PLAYER        : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 SF_9          :  46.58  19.48     232.5     400    58.1      95    
   2 Houdini_6     :  19.35  18.85     213.5     400    53.4     100    
   3 Komodo_131    : -65.93  19.52     154.0     400    38.5     ---    

White advantage = 157.98 +/- 11.35
Draw rate (equal opponents) = 44.33 % +/- 2.40
Komodo surprisingly performs quite poorly, although I put White Contempt = 75, and the default small Contempt for the other two, as they have no Colored Contempt. I hope high Contempt in Komodo doesn't harm its performance and doesn't skew the overall result.

Here is the important aspect:
Code: Select all
Games        : 600 (finished)

White Wins   : 314 (52.3 %)
Black Wins   : 70 (11.7 %)
Draws        : 216 (36.0 %)
Unfinished   : 0

White Score  : 70.3 %
Black Score  : 29.7 %
White wins are 52.3%, which is not that bad. My theory was that for determining the borderline at longer TC, both the White score (52.3%, the border is 50%) and the White performance in normal scoring (70.3%, the border is 75%) should be considered. At shorter TC more "accidents" happen, and a Black win (many here) is often an accident between similar in strength engines. But "accidents" do happen the other way around too, some of White wins are also "accidents". At the same time, the performance at 60 + 0.6 in normal scoring is significantly below 75%, which combined 52.3% White wins, denotes that it is debatable what sort of opening it is at much longer TC.

I am now testing at 240 + 2.4 the same RR in 600 games. It will take almost a day probably, but if my theory stands, White might score even below 52.3% at longer TC, but White performance might be above 70.3% in normal scoring (maybe not above the threshold of 75%). That is due to less Black wins at longer TC, and possibly, less White wins. Let's see, but the statistical fluctuations are not that small even in 600 games, so one has to be cautious inferring too many things.
My overnight NBC test was latest Komodo on 7 threads vs. Fat Fritz on RTX 2080 at 1' +0.6". The result was that Komodo won by a single game, 88.5 to 87.5, so Komodo doesn't seem to be weak in this variant in general. The breakdown was: White won 98, Black won 9, Draws 69. So 98 to 78 Armageddon score, which is 55.7%. So again somewhat higher than you are getting, and worrisome. Note that this was with a lot of horsepower (7 fast cpu threads and the RTX 2080), so it's roughly like a 5' +3" test on one thread. It may be that the NNs behave quite differently on this than the A/B engines (or at least when playing against them), or else it just favors White to have more horsepower.
I don't quite agree about 75% in normal scoring being the win/draw dividing line. I know this is true with zero Black wins, but until we actually reach a level where that happens I don't see how it is relevant, and if it does happen then both measures give the same result.
I don't know why Komodo did so poorly in your test; on one thread at 1 min it is somewhat weaker than Houdini 6 but not that much weaker, and my result vs FF suggests that at least with multiple threads it is fine at this tc. I suppose the Contempt setting should hurt it slightly in classical scoring but should help it with Armageddon scoring.
Regarding my laptop if you want to know how it's possible ask Dell (Alienware Area 51).
I'm trying SF9 vs Komodo now on one thread at 2' + 1".
I am not saying that "normal" scoring of 75% is the "true borderline", only that close to 50% White wins and close below 75% at some time control might denote borderline for the NBC with Armageddon scoring. I don't know why you cling by strict 50% White wins at some time controls between unequal engines with weird contempts. Take two Komodos with an uncolored Contempt=100 (both), a +0.60 eval position and play games on one core at 60+0.6. You might easily get 50%+ White wins. Then set Contempt=0, and play from the same +0.60 position games at 600+6, and you will probably get less than 50% White wins. So, where now is your decision about "50%" rule?

First let's agree on the term "borderline" in this Armageddon variant. I mean by it here that we cannot decide whether it's a win or a draw at indefinitely long time controls, and it's important in Armageddon scoring that the rate of White wins to longer and longer time controls stays fairly stable and does not diverge to 0% or 100% to very very strong play. Stable, close to 50%, but not necessarily exactly 50%. Not when using weird contempts and different engines. The stronger and stronger play between close in strength engines might converge to 50% White wins and 0% Black wins for a borderline position, ideally.

I do not want to argue much, but I have studied in the past what I consider "borderline", White/Draw "strong play" boundary, and there is a specific path from "weak play" to "strong play" boundary. See for example here (pictures are gone, unfortunately):

The flow of Komodo eval

viewtopic.php?f=2&t=60212

After 400+ games at 4x larger time control (240+2.4), it seems plausible that the position is borderline, but let's wait to the end of RR of 600 games. Also, statistical fluctuations are not negligible in 600 games, so differences less than 1.0-1.5% should be considered as inconclusive.

I am not sure why we are getting a bit different results of White performance, but the engines are different, and in fact it still can be attributed to fluctuations. The stranger thing is the Komodo's performance difference against Houdini and SF, I checked and re-checked and nothing seems wrong.
I ran 279 games of latest Komodo vs. Stockfish 9 (K using 75 White Contempt, SF default settings) at 2' + 1" single thread; White won 143, Black won 24, with 112 draws, so White scored 143/279 = 51.3%. Stockfish won by 155 to 124 points (normal scoring), maybe roughly in line with what we would get in normal chess at this level. So overall it seems that there isn't much difference between our results excluding the NN vs A/B tests, I think overall without those tests it's right around 51% for White, which is of course quite ok especially as it doesn't seem to be too sensitive to time control. But somehow when an NN (at least Fat Fritz) plays vs. A/B, results seem to be somewhat significantly better than this for White. I don't know what to conclude from this, perhaps it just means that which side you would choose depends on your style and skill set relative to your opponent. Anyway unless your 600 game test shows a surprisingly good White result, I think we won't be able to improve on the idea in any acceptable way. Allowing Black to castle long would be the simplest acceptable way to help Black, but it would almost certainly push the White score well below 50% and would probably be further from the magic threshold between win and draw. Given that White scores 55% or so in normal chess, I suppose even a 52 or 53% result for White in NBC Armageddon should be acceptable.
I forgot the elementary easy things to test before the larger experiments (which serve mostly the theoretical purposes).
Leela mimics the best the game of humans, far better than AB engines.

In my experience top T30 and T40 nets are the level of IM at say 50 nodes, and the level of top GM at about 1000 nodes. Here are the respective matches for NBC Chess, each 1000 games, between the top T40 net and the top T30 net (about 70 Elo points difference):

IM, 50 nodes/move
Games        : 1000 (finished)

White Wins   : 537 (53.7 %)
Black Wins   : 260 (26.0 %)
Draws        : 203 (20.3 %)
Unfinished   : 0

White Score  : 63.9 %
Black Score  : 36.2 %

Strong GM, 1000 nodes/move
Games        : 1000 (finished)

White Wins   : 528 (52.8 %)
Black Wins   : 91 (9.1 %)
Draws        : 381 (38.1 %)
Unfinished   : 0

White Score  : 71.9 %
Black Score  : 28.1 %
The flow is as I would like it to be: convergent from above towards 50% White wins, and from below towards 75% as White performance with normal scoring. That is my experience with truly borderline positions.
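
For anyone who wants to recompute the two percentages being compared here, a minimal Python sketch follows. It assumes only the definitions used in this thread: under Armageddon scoring a draw counts as a Black win, so White's score is just the White win fraction, while normal scoring gives half a point per draw. The function names are made up for illustration and do not come from any testing tool.

```python
def armageddon_score(white_wins, draws, black_wins):
    """White's score when every draw counts as a win for Black."""
    games = white_wins + draws + black_wins
    return white_wins / games


def normal_score(white_wins, draws, black_wins):
    """White's score under normal scoring (a draw is half a point)."""
    games = white_wins + draws + black_wins
    return (white_wins + 0.5 * draws) / games


# (White wins, draws, Black wins) copied from the two 1000-game matches above.
matches = {
    "50 nodes/move": (537, 203, 260),
    "1000 nodes/move": (528, 381, 91),
}

for label, (w, d, l) in matches.items():
    print(f"{label}: Armageddon {armageddon_score(w, d, l):.3f}, "
          f"normal {normal_score(w, d, l):.4f}")
```

The same two functions apply unchanged to the round-robin and NBSC results quoted later in the thread.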

====================================================================================

Now, the long round-robins of 600 games each with Komodo, SF and Houdini. The results are satisfactory for accepting NBC Chess as theoretically borderline, a position where it is hard to decide between White win and draw even in theory: it scores a stubborn close-to-50% White wins (Armageddon scoring) across different time controls, strengths and engines.

The results are here:

RR at 60s + 0.6s, 600 games, 3 top AB engines:
Games        : 600 (finished)

White Wins   : 314 (52.3 %)
Black Wins   : 70 (11.7 %)
Draws        : 216 (36.0 %)
Unfinished   : 0

White Score  : 70.3 %
Black Score  : 29.7 %

RR at 240s + 2.4s, 600 games, same 3 top AB engines:
Games        : 600 (finished)

White Wins   : 317 (52.8 %)
Black Wins   : 40 (6.7 %)
Draws        : 243 (40.5 %)
Unfinished   : 0

White Score  : 73.1 %
Black Score  : 26.9 %
The flow is towards 75% White performance from below, which is good, and almost stable in White wins (314/600 to 317/600, statistically insignificant difference). It would have been good if the longer TC test gave a tiny bit smaller number of White wins, but really, one would have to play several thousand games to observe whether that happens with some statistical significance.
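
A rough way to see why 314/600 vs 317/600 is statistically insignificant is a two-sample z-score on the White win rates. The sketch below is only that, a sketch: it treats games as independent coin flips (binomial errors), which ignores opening pairing and engine correlations, and the 2 percentage point gap in the second part is a hypothetical effect size chosen for illustration, not a measured one.

```python
import math


def win_rate_se(wins, games):
    """Binomial standard error of a win rate."""
    p = wins / games
    return math.sqrt(p * (1.0 - p) / games)


def z_score(wins_a, games_a, wins_b, games_b):
    """Difference of two win rates in units of its standard error."""
    diff = wins_b / games_b - wins_a / games_a
    se = math.hypot(win_rate_se(wins_a, games_a), win_rate_se(wins_b, games_b))
    return diff / se


def games_per_run(delta, p=0.5, sigmas=2.0):
    """Games needed in EACH run so a win-rate gap `delta` equals `sigmas` errors.

    Assumes both runs have the same size and a win rate near `p`.
    """
    return 2.0 * p * (1.0 - p) * (sigmas / delta) ** 2


# The two round-robins above: 314/600 at 60s + 0.6s, 317/600 at 240s + 2.4s.
print(f"z = {z_score(314, 600, 317, 600):.2f}")

# Hypothetical: how many games per run to resolve a 2 point gap at two sigma?
print(f"games per run = {games_per_run(0.02):.0f}")
```

Under these assumptions the observed gap is a small fraction of one standard error, and resolving a gap of a couple of percentage points takes on the order of thousands of games per time control.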

==============
==============

All in all, it seems to me this variant is sound for human players and it removes draws from the game, which is important. I also analyzed the statistical properties, the |Elo difference| / |Elo error margins|, i.e. the resolving power of this variant. It is about 1.35 times higher than that of regular Chess, meaning that it needs about 1.8 times fewer games for the same statistical significance (say the same LOS, p-value, or SPRT stop). This is quite an achievement per se: I had a hard time building more sensitive opening suites for regular Chess, and NBC Chess is far above all of them in sensitivity.
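
The step from a 1.35 ratio in resolving power to a factor of about 1.8 in games can be spelled out, assuming the usual behaviour that the Elo error margin shrinks like 1/sqrt(N) with the number of games N. Then |Elo difference| / |error margin| grows like sqrt(N), so a variant whose ratio is r times higher reaches the same significance with N / r^2 games. A small Python check (the function name is illustrative only):

```python
def games_reduction_factor(resolving_power_ratio):
    """Factor by which the required number of games shrinks.

    Assumes the error margin scales as 1/sqrt(N), so significance
    scales as sqrt(N) and the game count as the square of the ratio.
    """
    return resolving_power_ratio ** 2


print(f"{games_reduction_factor(1.35):.2f}")  # → 1.82
```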
I ran 614 games of SF9 vs latest Komodo with NBC Arm. on four threads at 2' + 1". The result agrees remarkably well with your results, namely 321 White wins, 48 Black wins, 245 draws, so 52.3% with Armageddon scoring, 72.2% normal scoring. It seems that with the longer and higher quality (mp) tests the results are mostly in the 52 to 53% range, let's call it 52.5%. This is acceptable, but does confirm my initial belief that White is for choice and suggests that it is more likely a theoretical win than a theoretical draw, although of course no one knows.
But I think we should consider the alternative NBSC (No Black Short castling). Based on comparing evals of top engines, my best guess is that the normal scoring for White will drop about two percentage points or a bit more while the Armageddon score will drop about 3 percentage points. If I am right this would put it at about 49.5%, which is much more balanced than 52.5% (five times as close to equal!). In view of the psychological advantage of playing for the win, it is better to be slightly under 50% in engine play than slightly over. It is also closer to normal chess (Black can still castle queenside). It might seem like giving Black half of his castling rights is too much, but in actual practice it is rather difficult for Black to castle queenside without making concessions if White plays aggressively, and when Black does manage to castle long White will almost surely have castled short, leading to games with a low draw probability, which is what White wants. So the swing from NBC to NSBC is not nearly as big as it might look at first glance. Anyway, I think we should test this and then compare to see which is really better.

Yes, you were right, and your NBSC (or NSBC?) comes as both very balanced in Armageddon type scoring and widening the Elo range even more than NBC, which was already widening the ratings compared to the regular Chess.

Here is the file of 358 different 5-mover openings, built with the help of Komodo on 4 threads with Contempt=75. Maybe you will have a look at some positions to see whether Komodo plays the openings reasonably while still getting variety. I am not sure I have better tools than Komodo to build the opening suite, as I do not trust Lc0 in variants (besides, it doesn't have Contempt).

http://s000.tinyupload.com/?file_id=818 ... 6325639490

The results are very promising for NBSC.

First, 600 games RR using Komodo, SF and Houdini.

30 + 0.3s
White score: 48.3%
+290 =221 -89 66.7% (normal scoring)

120 + 1.2s
White score: 48.3%
+290 =243 -67 68.5% (normal scoring)

Very stable White win score (Armageddon scoring), very close to 50% (just a bit lower). More draws at the longer TC, but I guess they will hardly go over 50%.

Second, Lc0 testing, between best t40 and t30 nets:

2000 games at 100 nodes/move:
+1064 =386 -550 62.8%
White score: 53.2%

1000 games at 1000 nodes/move
+514 =303 -183 66.5%
White score: 51.4%

Draw rate increases to longer TC, Armageddon score decreases from above towards 50%. Again, a promising result.

The only one to give a bad result is Komodo self-play (Contempt=75):

400 games at 30 + 0.3s:
+168 =203 -29 67.3%
White score: 42.0%

But I think we can safely conclude that the draw rate here is artificially inflated.

The sensitivity to strength of NBSC is even higher than that of NBC, and much higher than that of normal Chess. Also, the close-to-50% White win score seems to depend very weakly on time control, which is important for fair scoring in a single game. Playing each game twice with sides reversed would be even fairer, because some may prefer White or Black in this variant; the two sides are very asymmetric in playing style.

Laskos · Post by **Laskos** » Mon Dec 16, 2019 4:13 pm

Laskos wrote: ↑Mon Dec 16, 2019 6:56 am
Second, Lc0 testing, between best t40 and t30 nets:

2000 games at 100 nodes/move:
+1064 =386 -550 62.8%
White score: 53.2%

1000 games at 1000 nodes/move
+514 =303 -183 66.5%
White score: 51.4%

Draw rate increases to longer TC, Armageddon score decreases from above towards 50%. Again, a promising result.

And again good result at 5000 nodes/move Lc0 testing between best t40 and t30 nets:

1000 games at 5000 nodes/move
+509 =366 -125 69.2%
White score: 50.9%

Nice progression from lower nodes count. AB engines have 48.3% White score at similar strength. All in all, very balanced and seems to slowly converge towards close to 50%.

lkaufman · Post by **lkaufman** » Mon Dec 16, 2019 6:37 pm

Laskos wrote: ↑Mon Dec 16, 2019 4:13 pm
Laskos wrote: ↑Mon Dec 16, 2019 6:56 am
Second, Lc0 testing, between best t40 and t30 nets:

2000 games at 100 nodes/move:
+1064 =386 -550 62.8%
White score: 53.2%

1000 games at 1000 nodes/move
+514 =303 -183 66.5%
White score: 51.4%

Draw rate increases to longer TC, Armageddon score decreases from above towards 50%. Again, a promising result.

And again good result at 5000 nodes/move Lc0 testing between best t40 and t30 nets:

1000 games at 5000 nodes/move
+509 =366 -125 69.2%
White score: 50.9%

Nice progression from lower nodes count. AB engines have 48.3% White score at similar strength. All in all, very balanced and seems to slowly converge towards close to 50%.

I decided to check out your one bad result, Komodo self-play, to see if it got better or worse with more time and more threads. At 2' + 1" on four threads, latest Komodo dev. got 79 wins, 98 draws, and one loss for 44.4% Armageddon score, somewhat better than your result for selfplay. So at least it's moving in the right direction. With your results for non-selfplay averaging around 49.6% (using only the longest TC for Lc0), looks like this version is most likely as close as we can hope to get to being fair without crossing the line of White having a forced win (it seems). It's better to be below this line than above it, so I think this may be the perfect Armageddon variant at last! I'll try to get some more data and look for ways to promote the idea.
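
The percentages in this post can be reproduced from the counts quoted in the thread. The sketch below makes one assumption that the post does not state explicitly: that the "around 49.6%" figure is the unweighted mean of the AB round-robin result (290 White wins in 600 games) and the longest Lc0 run (509 White wins in 1000 games at 5000 nodes/move). Other weightings would give slightly different numbers.

```python
def white_win_fraction(white_wins, draws, black_wins):
    """Armageddon score for White: draws count as Black wins."""
    return white_wins / (white_wins + draws + black_wins)


# Komodo self-play at 2' + 1" on four threads: 79 wins, 98 draws, 1 loss.
komodo_selfplay = white_win_fraction(79, 98, 1)

# Non-selfplay NBSC results quoted above (assumed inputs to the average).
ab_round_robin = white_win_fraction(290, 243, 67)    # 120 + 1.2s RR
lc0_5000_nodes = white_win_fraction(509, 366, 125)   # 5000 nodes/move

non_selfplay_average = (ab_round_robin + lc0_5000_nodes) / 2

print(f"Komodo self-play: {komodo_selfplay:.3f}")
print(f"Non-selfplay average: {non_selfplay_average:.3f}")
```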

Jonathan003 · Post by **Jonathan003** » Tue Dec 17, 2019 3:36 pm

I wonder whether new tools for opening preparation could help to lower the draw rate in classical chess.
For example, the true percentile output in Fat Fritz, which displays the expected wins, draws and losses for a variation.
Also better opening training software, like the (hopefully) upcoming Chess Position Trainer 6, the Chesstempo opening trainer, which is in beta, and the new opening training in Fritz 17.
Maybe grandmasters will feel more confident taking their chances on alternative opening lines with the help of this new software.
Judging by the growing number of YouTube videos about chess, it seems chess is more popular than ever.
I would like to see 'No castling Chess' and 'Larry' chess as alternatives to Fischer random chess. But personally, at this moment, I would like to keep classical chess as it is, the main game of chess.
