Applying your adjudications (not very mathematically sound), we still get a pretty significant result. So, count all draws from the advantageous starting position as losses.
Laskos wrote: ↑Mon Nov 12, 2018 11:51 pm
In 4 days I managed to perform the test you propose, and then I interpreted the result using a mathematically sound pentanomial variance (error margin) for paired (side-reversed) games, developed and derived by Michel Van den Bergh and me and described briefly here: https://www.chessprogramming.org/Match_Statistics. My openings are pretty markedly unbalanced (80cp-100cp advantage for White), are played with sides reversed, and the draw rate is kept pretty low. The correct pentanomial error margins in this case are 1.8-2.2 times smaller than the naive trinomial error margins usually shown in rating tools, because the outcomes in paired games are pretty highly correlated.
lkaufman wrote: ↑Fri Nov 09, 2018 5:42 am
A couple points worth mentioning. If you want to eliminate the possible distortion of one engine simply being much stronger than the other, I suggest you test Komodo 12.2 MCTS (or wait for the bugfix 12.2.1) vs. Komodo 9 (or 9.02 or 9.1 if you prefer), which is our best free version and is very evenly matched with Komodo 12.2 MCTS in my tests. But most likely you will find that normal Komodo scales better from 1 to 10 seconds on one thread. The reason is that at one second per move, Komodo MCTS doesn't have enough time to really "do its thing" and is more or less a crippled normal Komodo. But at ten seconds per move (or even five) the MCTS aspect is in full effect. So my main point is that how Komodo MCTS scales from 1 to 10 seconds on one thread is not predictive of how it would scale from 5" to 50". My data is inconclusive on this point; I think the scaling is pretty similar. We'll know when CCRL has ratings for both 40/4 and 40/40 for Komodo MCTS, or CEGT for 40/4 and 40/20, or fastgm for 10' and 60', which can be compared with Komodo 9.
I think your three queens solution to the draw problem is interesting, but perhaps not so predictive of normal chess. My preferred solution to the draw problem is to start with positions evaluated around 0.7 or so by Komodo, counting draws as wins for the bad side. With alternating colors, no draws at all, equal chances, and reasonably normal chess.
The tests are at 6'' per move and 60'' per move, on 1 i7 3.8 GHz thread (4 concurrent games are running on 4 cores). I set hash at 512MB in both cases. The results are:
6'' per move: -6.9 Elo points with 15.9 Elo points 1 sigma pentanomial error margin.
Games Completed = 100 of 100 (Avg game length = 804.127 sec)
Settings = Gauntlet/512MB/6000ms per move/M 1500cp for 5 moves, D 160 moves/EPD:C:\LittleBlitzer\OP_08_10_W_Trim.epd(5840)
Time = 20394 sec elapsed, 0 sec remaining
1. Komodo 9.1 49.0/100 37-39-24 (L: m=39 t=0 i=0 a=0) (D: r=4 i=18 f=0 s=1 a=1) (tpm=5736.5 d=27.38 nps=2090592)
2. Komodo 12.2 MCTS 51.0/100 39-37-24 (L: m=37 t=0 i=0 a=0) (D: r=4 i=18 f=0 s=1 a=1) (tpm=5934.5 d=12.90 nps=2064)
60'' per move: 38.4 Elo points with 15.2 Elo points 1 sigma pentanomial error margin.
Games Completed = 100 of 100 (Avg game length = 9338.734 sec)
Settings = Gauntlet/512MB/60000ms per move/M 1500cp for 5 moves, D 160 moves/EPD:C:\LittleBlitzer\OP_08_10_W_Trim.epd(5840)
Time = 241335 sec elapsed, 0 sec remaining
1. Komodo 9.1 55.5/100 40-29-31 (L: m=29 t=0 i=0 a=0) (D: r=8 i=17 f=1 s=0 a=5) (tpm=57467.1 d=33.96 nps=2194737)
2. Komodo 12.2 MCTS 44.5/100 29-40-31 (L: m=39 t=0 i=0 a=1) (D: r=8 i=17 f=1 s=0 a=5) (tpm=59428.3 d=17.66 nps=1658)
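To illustrate why the paired (pentanomial) error margin comes out smaller than the naive trinomial one, here is a rough Python sketch. It assumes the usual logistic Elo model and a first-order (delta method) conversion from score error to Elo error. The function names are made up for this example, and the pair counts in HYPOTHETICAL_PAIRS are invented purely for illustration (the per-pair breakdown of the real matches is not shown in the output above), so only the trinomial number comes from the actual 6'' result.

```python
import math

def elo_slope(mu):
    """Derivative of the logistic Elo curve with respect to the score fraction mu."""
    return 400.0 / (math.log(10) * mu * (1.0 - mu))

def trinomial_sigma_elo(wins, draws, losses):
    """Naive 1-sigma Elo error: every game is treated as independent."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mu) ** 2
           + draws * (0.5 - mu) ** 2
           + losses * mu ** 2) / n
    return elo_slope(mu) * math.sqrt(var / n)

def pentanomial_sigma_elo(pair_counts):
    """1-sigma Elo error with each side-reversed pair as the independent unit.

    pair_counts[i] = number of pairs in which the engine scored i/2 points
    out of 2 (pair scores 0, 0.5, 1, 1.5, 2).
    """
    n_pairs = sum(pair_counts)
    per_game = [i / 4.0 for i in range(5)]  # pair score rescaled to a per-game score
    mu = sum(c * x for c, x in zip(pair_counts, per_game)) / n_pairs
    var = sum(c * (x - mu) ** 2 for c, x in zip(pair_counts, per_game)) / n_pairs
    return elo_slope(mu) * math.sqrt(var / n_pairs)

# Real W-D-L of Komodo 9.1 at 6'' per move (from the match output).
naive = trinomial_sigma_elo(37, 24, 39)

# HYPOTHETICAL pair counts (50 pairs, 49 points in total), not the real data.
HYPOTHETICAL_PAIRS = [1, 12, 26, 10, 1]
paired = pentanomial_sigma_elo(HYPOTHETICAL_PAIRS)

print(f"naive trinomial sigma: {naive:.1f} Elo")
print(f"pentanomial sigma (hypothetical pairs): {paired:.1f} Elo")
```

With unbalanced openings most pairs end near 1 point out of 2, so the pair scores cluster around the mean and the paired variance shrinks. For the 6'' match the naive trinomial figure comes out at about 30 Elo, against the 15.9 Elo pentanomial margin quoted above, which sits inside the 1.8-2.2 ratio mentioned earlier.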
============================================================
- Difference: 45.3 Elo points.
1 sigma (pentanomial) for the difference: 22.0 Elo points.
98.0% likelihood that Komodo 12.2 MCTS scales worse than Komodo 9.1 A/B. That is already a fairly significant result.
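The arithmetic behind these three numbers can be checked with a short Python sketch. It assumes the standard logistic Elo formula, that the two matches are independent (so their 1 sigma margins add in quadrature), and a normal approximation for the likelihood. The Elo values are taken from the Komodo 9.1 side, which is what reproduces the -6.9 and 38.4 quoted above. Function names are my own.

```python
import math

def elo_from_score(score_fraction):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Komodo 9.1 scores from the two matches.
elo_stc = elo_from_score(49.0 / 100)   # 6'' per move
elo_ltc = elo_from_score(55.5 / 100)   # 60'' per move

diff = elo_ltc - elo_stc

# Quoted 1-sigma pentanomial margins, combined assuming independent matches.
sigma_diff = math.hypot(15.9, 15.2)

# One-sided likelihood that the true difference is positive.
likelihood = normal_cdf(diff / sigma_diff)

print(f"difference: {diff:.1f} Elo, sigma: {sigma_diff:.1f} Elo, "
      f"likelihood: {100 * likelihood:.1f}%")
```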
For STC (6'' per move) we have:
52:48 MCTS against A/B
For LTC (60'' per move) we have:
40:60 MCTS against A/B
That gives a 96% likelihood that A/B scales better, which is again fairly significant, although the mathematically sound pentanomial variance gives an even higher 98% likelihood.
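One way to arrive at roughly 96% from these adjudicated scores is a plain two-sample binomial comparison. This is only a sketch: it assumes 100 independent decisive games per time control and a normal approximation, and the formula actually used for the figure above is not stated. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_scaling_likelihood(p_short, p_long, n_short, n_long):
    """Likelihood that the true score drops from the short to the long time control.

    Treats every game as an independent win/loss trial (no draws), which is
    what the adjudication rule produces.
    """
    var = p_short * (1.0 - p_short) / n_short + p_long * (1.0 - p_long) / n_long
    z = (p_short - p_long) / math.sqrt(var)
    return normal_cdf(z)

# MCTS score fractions after adjudication: 52:48 at STC, 40:60 at LTC.
likelihood = binomial_scaling_likelihood(0.52, 0.40, 100, 100)
print(f"likelihood that A/B scales better: {100 * likelihood:.0f}%")
```

Because this treats the paired games as independent, it ignores the correlation that the pentanomial variance exploits, which is in line with it giving the slightly lower figure.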