TCEC S15, END of an ERA event is much more Brutal than I thought!

peter · Post by **peter** » Tue May 28, 2019 3:02 pm

Laskos wrote: ↑Tue May 28, 2019 2:57 pm
peter wrote: ↑Tue May 28, 2019 2:52 pm
Laskos wrote: ↑Tue May 28, 2019 2:39 pm
peter wrote: ↑Tue May 28, 2019 2:09 pm
Laskos wrote: ↑Tue May 28, 2019 1:48 pm LOS for 10-3 score is 97%, and the confidence interval is 94%, a bit lower than 2 standard deviations.
LOS?
Likelihood of Superiority.
Ah, I see, but that wasn't my question, was it?
Superiority could be for a single game out of 100 won with 99 draws too.

I thought we measured engine strength only in centi-Elo nowadays, so my question was whether, with an 87% draw rate and 100 games, 26 Elo would be inside or outside the 95% confidence error bar.
Not within 95%, but within 94%. I already wrote that. So, the result is inside usual 2 standard deviations error margins, but barely so.

As far as I know, 95% is not the maximum but the minimum confidence needed to speak of "power" or "significance" in statistics, in every science except computer chess, obviously.
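
For readers who want to check the error-bar question themselves, here is a minimal sketch. It assumes the +10 =87 -3 split implied by the figures quoted above, the usual logistic Elo formula, and a plain normal approximation on the per-game (trinomial) variance. That is not necessarily the method Laskos used, and the function names are made up for illustration.

```python
import math

def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def trinomial_interval(wins, draws, losses, z=1.96):
    """Return (score, low, high): mean score and a normal-approximation interval.

    z=1.96 corresponds to roughly 95% two-sided confidence (an assumption of
    this sketch, not a figure taken from the thread).
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)
    return score, score - margin, score + margin

# +10 =87 -3 is inferred from the "10-3", "87% draw rate" and "100 games" above.
score, low, high = trinomial_interval(10, 87, 3)
print(f"score {score:.3f}, Elo {elo_from_score(score):+.1f}")
print(f"interval in Elo: {elo_from_score(low):+.1f} to {elo_from_score(high):+.1f}")
```

With so few decisive games the normal approximation is rough, so the bounds should be read as indicative only.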

Laskos · Post by **Laskos** » Tue May 28, 2019 3:10 pm

peter wrote: ↑Tue May 28, 2019 3:02 pm
Laskos wrote: ↑Tue May 28, 2019 2:57 pm

LOS for 10-3 score is 97%, and the confidence interval is 94%, a bit lower than 2 standard deviations.

Likelihood of Superiority.
Ah, I see, but that wasn't my question, was it?
Superiority could be for a single game out of 100 won with 99 draws too.

I thought we measured engine strength only in centi-Elo nowadays, so my question was whether, with an 87% draw rate and 100 games, 26 Elo would be inside or outside the 95% confidence error bar.
Not within 95%, but within 94%. I already wrote that. So, the result is inside usual 2 standard deviations error margins, but barely so.
As far as I know, 95% is not the maximum but the minimum confidence needed to speak of "power" or "significance" in statistics, in every science except computer chess, obviously.

Confidence here is related to the p-value. There is a lot to say about p-values and testing methodology, but the subject is complicated enough that most scientists who use them don't understand them well. My simple advice for chess matches would be to use an LOS of 99.9% as a stopping rule, and one can even take the LOS literally when following a very strict testing methodology.

Raphexon · Post by **Raphexon** » Tue May 28, 2019 4:14 pm

Kanizsa wrote: ↑Tue May 28, 2019 1:05 pm A new possible challenge for LC0.

In his latest book, Garry Kasparov still states that “Centaur mode” is the best expression of strength in chess games.
I am doubtful after witnessing how LC0 defeated Stockfish.

In order to test Kasparov's hypothesis I would suggest the following challenge:
GM + best collection of Alpha-Beta programs currently available vs. best Neural Network program (LC0 or AlphaZero)

Principal additional rules:
- GM Centaur mode has access to the screen of analysis of AB programs, but without expanding analysis tree handmade
- on the other hand, GM centaur mode has the possibility to withdraw the move after LC0 reply and is allowed to play another single substitute move, with a time penalty.

In your opinion, who would win a match of 16-24 games under these conditions?
GM centaur mode or LC0?

Why wouldn't the GM be allowed to use the best NN-engine?

peter · Post by **peter** » Tue May 28, 2019 5:37 pm

Laskos wrote: ↑Tue May 28, 2019 3:10 pm
peter wrote: ↑Tue May 28, 2019 3:02 pm
Laskos wrote: ↑Tue May 28, 2019 2:57 pm

LOS for 10-3 score is 97%, and the confidence interval is 94%, a bit lower than 2 standard deviations.

Likelihood of Superiority.
Ah, I see, but that wasn't my question, was it?
Superiority could be for a single game out of 100 won with 99 draws too.

I thought we measured engine strength only in centi-Elo nowadays, so my question was whether, with an 87% draw rate and 100 games, 26 Elo would be inside or outside the 95% confidence error bar.
Not within 95%, but within 94%. I already wrote that. So, the result is inside usual 2 standard deviations error margins, but barely so.
As far as I know, 95% is not the maximum but the minimum confidence needed to speak of "power" or "significance" in statistics, in every science except computer chess, obviously.

Confidence here is related to the p-value. There is a lot to say about p-values and testing methodology, but the subject is complicated enough that most scientists who use them don't understand them well. My simple advice for chess matches would be to use an LOS of 99.9% as a stopping rule, and one can even take the LOS literally when following a very strict testing methodology.

But here your LOS is only 97, not 99.9.

Ozymandias · Post by **Ozymandias** » Tue May 28, 2019 5:41 pm

Raphexon wrote: ↑Tue May 28, 2019 4:14 pm
Kanizsa wrote: ↑Tue May 28, 2019 1:05 pm A new possible challenge for LC0.

In his latest book, Garry Kasparov still states that “Centaur mode” is the best expression of strength in chess games.
I am doubtful after witnessing how LC0 defeated Stockfish.

In order to test Kasparov's hypothesis I would suggest the following challenge:
GM + best collection of Alpha-Beta programs currently available vs. best Neural Network program (LC0 or AlphaZero)

Principal additional rules:
- GM Centaur mode has access to the screen of analysis of AB programs, but without expanding analysis tree handmade
- on the other hand, GM centaur mode has the possibility to withdraw the move after LC0 reply and is allowed to play another single substitute move, with a time penalty.

In your opinion, who would win a match of 16-24 games under these conditions?
GM centaur mode or LC0?
Why wouldn't the GM be allowed to use the best NN-engine?

That's a good question, but far from the only one. Why a GM? They have a very poor record as centaur players. Why can't they expand "analysis tree handmade"? Whatever that means. Why the "possibility to withdraw the move after LC0 reply"? No need for that under normal tournament conditions.

Not to mention that the main question can't even begin to be answered without clarifying book conditions for both players.

peter · Post by **peter** » Tue May 28, 2019 5:58 pm

Ozymandias wrote: ↑Tue May 28, 2019 5:41 pm
Raphexon wrote: ↑Tue May 28, 2019 4:14 pm
Kanizsa wrote: ↑Tue May 28, 2019 1:05 pm A new possible challenge for LC0.

In his latest book, Garry Kasparov still states that “Centaur mode” is the best expression of strength in chess games.
I am doubtful after witnessing how LC0 defeated Stockfish.

In order to test Kasparov's hypothesis I would suggest the following challenge:
GM + best collection of Alpha-Beta programs currently available vs. best Neural Network program (LC0 or AlphaZero)

Principal additional rules:
- GM Centaur mode has access to the screen of analysis of AB programs, but without expanding analysis tree handmade
- on the other hand, GM centaur mode has the possibility to withdraw the move after LC0 reply and is allowed to play another single substitute move, with a time penalty.

In your opinion, who would win a match of 16-24 games under these conditions?
GM centaur mode or LC0?
Why wouldn't the GM be allowed to use the best NN-engine?
That's a good question, but far from the only one. Why a GM? They have a very poor record as centaur players. Why can't they expand "analysis tree handmade"? Whatever that means. Why the "possibility to withdraw the move after LC0 reply"? No need for that under normal tournament conditions.

Not to mention that the main question can't even begin to be answered without clarifying book conditions for both players.

Let's not make it too complicated. Simply let any correspondence-chess IM play some games against a bookless LC0. The human master would kick LC0's ass for sure, and quite significantly (even if the sample stayed rather small). He would of course use LC0 too, just to be sure, to see its mistakes coming early enough when building traps out of opening and game databases, and for that same reason he surely wouldn't use LC0 alone.
So Kasparov's Centaur thesis can be proved right or wrong rather easily. As far as I know, he didn't explicitly exclude correspondence time controls, did he?
So what?

Laskos · Post by **Laskos** » Tue May 28, 2019 6:04 pm

peter wrote: ↑Tue May 28, 2019 5:37 pm
Laskos wrote: ↑Tue May 28, 2019 3:10 pm
peter wrote: ↑Tue May 28, 2019 3:02 pm
Laskos wrote: ↑Tue May 28, 2019 2:57 pm

LOS for 10-3 score is 97%, and the confidence interval is 94%, a bit lower than 2 standard deviations.

Likelihood of Superiority.
Ah, I see, but that wasn't my question, was it?
Superiority could be for a single game out of 100 won with 99 draws too.

I thought we measured engine strength only in centi-Elo nowadays, so my question was whether, with an 87% draw rate and 100 games, 26 Elo would be inside or outside the 95% confidence error bar.
Not within 95%, but within 94%. I already wrote that. So, the result is inside usual 2 standard deviations error margins, but barely so.
As far as I know, 95% is not the maximum but the minimum confidence needed to speak of "power" or "significance" in statistics, in every science except computer chess, obviously.

Confidence here is related to the p-value. There is a lot to say about p-values and testing methodology, but the subject is complicated enough that most scientists who use them don't understand them well. My simple advice for chess matches would be to use an LOS of 99.9% as a stopping rule, and one can even take the LOS literally when following a very strict testing methodology.
But here your LOS is only 97, not 99.9.

Here it would not be that far off to say "Leela is 97% likely to be stronger than SF", as the "testing" was rigid and strict. 99.9% should be used as a stopping rule, meaning a rule under which you can stop the match whenever you like the result. But in that case one cannot say "Leela is 99.9% likely to be stronger", only something like "above 95%". There are methodological differences between the two, and scientists are often in fact using the latter, lax methodology while interpreting the result as if the strict methodology had been followed. There are a great many "false positive" and non-reproducible papers, even in important scientific journals, due to bad use of p-values and similar criteria (equivalent to LOS and error margins here).
Also, with small samples like this one (+10 -3), one should be doubly careful. Remove or add 1-2 data points, and we get a quite different LOS.
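
The sensitivity to one or two data points can be illustrated with a short sketch. It uses the normal-approximation LOS formula that is common in computer chess testing, in which draws do not enter. Whether the 97% quoted above came from exactly this formula is an assumption.

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only (normal approximation)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# The +10 -3 result from the thread, then the same result with a single
# decisive game flipped, removed or added.
for w, l in [(10, 3), (9, 4), (11, 2), (9, 3), (11, 3)]:
    print(f"+{w} -{l}: LOS = {los(w, l):.3f}")
```

Flipping a single decisive game moves the LOS by several percentage points, which is why a figure near 97% on 13 decisive games deserves caution.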

Ozymandias · Post by **Ozymandias** » Tue May 28, 2019 7:32 pm

peter wrote: ↑Tue May 28, 2019 5:58 pm Let's not make it too complicated. Simply let any correspondence-chess IM play some games against a bookless LC0. The human master would kick LC0's ass for sure, and quite significantly (even if the sample stayed rather small). He would of course use LC0 too, just to be sure, to see its mistakes coming early enough when building traps out of opening and game databases, and for that same reason he surely wouldn't use LC0 alone.
So Kasparov's Centaur thesis can be proved right or wrong rather easily. As far as I know, he didn't explicitly exclude correspondence time controls, did he?
So what?

I don't think it'd be a bloodbath, unless the centaur player were able to find an opening trap that Lc0 would fall for every time. But that would be an artificial result: after all, you're throwing away the hash info, which would save Lc0 if it could be accessed. It'd be more realistic if Lc0 had access to a book developed with Lc0 evals, like the latest Cerebellum, updated after every game. That's what I'd do for a more realistic result.

jorose · Post by **jorose** » Tue May 28, 2019 8:32 pm

I haven't been following this closely and I don't really want to get involved, but I can't help but point out that throwing out 8 decisive games because the same side happened to win is a fairly extreme thing to do. I don't know how much confidence can be had in the results as soon as you start being that hand wavy.

I doubt the engines in a few years will lose the weaker side of those positions against current SF and Leela.

Perhaps if we ran this same set several times, the other minimatches that ended in one engine's favor would end up as two wins for the same color, and the drawless 1-1 minimatches would end in decisive minimatches or double draws.

I also don't understand why we are even discussing this. TCEC is not intending to do the impossible thing of determining the "best" engine under all circumstances. There are many factors and either side can argue TCEC is totally unfair for their side.

What we know is LC0 won the S15 TCEC SuFi with a score of 53.5-46.5 and it was great entertainment. I also felt it was a much better battle of ice and fire than the last GoT season.

Michel · Post by **Michel** » Tue May 28, 2019 9:00 pm

jorose wrote: ↑Tue May 28, 2019 8:32 pm I haven't been following this closely and I don't really want to get involved, but I can't help but point out that throwing out 8 decisive games because the same side happened to win is a fairly extreme thing to do. I don't know how much confidence can be had in the results as soon as you start being that hand wavy.

It has nothing to do with being hand wavy. On the contrary. The trinomial model is simply wrong in the case of unbalanced positions and to get accurate results one should use the pentanomial model instead.

However if there are no double wins (were there?) then the pentanomial model degenerates again to a trinomial model where reciprocal wins should be discarded (if one wants to reject the null hypothesis of equal strength).
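
A rough sketch of the difference between the two models follows. The counts are inferred from figures quoted in this thread (53.5-46.5 over 100 games, 10-3 after setting aside 8 reciprocal decisive games) under the assumption that there were no double wins. They have not been checked against the actual game records, and the function names are hypothetical.

```python
import math

PAIR_SCORES = (0.0, 0.5, 1.0, 1.5, 2.0)  # points one engine can score in a game pair

def pentanomial_z(pair_counts):
    """z-score against equal strength, treating each game pair as one observation."""
    n_pairs = sum(pair_counts)
    mean = sum(s * c for s, c in zip(PAIR_SCORES, pair_counts)) / n_pairs
    var = sum(c * (s - mean) ** 2 for s, c in zip(PAIR_SCORES, pair_counts)) / n_pairs
    return (mean - 1.0) / math.sqrt(var / n_pairs)

def trinomial_z(wins, draws, losses):
    """z-score against equal strength, treating each game as one observation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    return (score - 0.5) / math.sqrt(var / n)

# ASSUMED counts: 3 lost pairs (0.5), 37 level pairs (1.0, of which 4 are
# reciprocal wins), 10 won pairs (1.5), and no pairs scoring 0 or 2.
pairs = (0, 3, 37, 10, 0)
z_pairs = pentanomial_z(pairs)
# The same 100 games counted individually: +14 =79 -7 (also inferred).
z_games = trinomial_z(14, 79, 7)
print(f"pentanomial z = {z_pairs:.2f}, trinomial z = {z_games:.2f}")
```

Under these assumed counts the pair-based z-score comes out higher than the per-game one, because reciprocal wins from the same opening add variance to the per-game model without saying anything about which engine is stronger.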
