Expected performance and eval of Komodo 8 and SF 6
Moderator: Ras
-
nimh
- Posts: 46
- Joined: Sun Nov 30, 2014 12:06 am
Re: Expected performance and eval of Komodo 8 and SF 6
Thanks, Laskos! Once I've finished gathering and analyzing all the data, I'll experiment a little to see if it really improves the reliability of the results. Reliability can be measured by comparing the coefficient of determination (R²) values of logarithmic or exponential fits in the rating-vs-accuracy plots.
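One way that comparison could be done, as a rough sketch with hypothetical rating/accuracy data (the numbers below are illustrative, not from the study):

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def log_fit_r2(rating, accuracy):
    """Fit accuracy = a*ln(rating) + b and return the fit's R^2."""
    a, b = np.polyfit(np.log(rating), accuracy, 1)
    return r_squared(accuracy, a * np.log(rating) + b)

# Hypothetical data: accuracy rising roughly logarithmically with rating.
rating = np.array([2000, 2200, 2400, 2600, 2800, 3000], dtype=float)
accuracy = np.array([0.52, 0.58, 0.62, 0.66, 0.69, 0.71])

print(round(log_fit_r2(rating, accuracy), 3))
```

The same `r_squared` helper can be reused for an exponential fit by fitting `log(accuracy)` against `rating`, so the two functional forms can be compared on equal footing.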
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
-
Adam Hair
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
-
Isaac
- Posts: 265
- Joined: Sat Feb 22, 2014 8:37 pm
Re: Expected performance and eval of Komodo 8 and SF 6
Thanks a lot Kai for all these interesting "stats".
I hope Miguel Ballicora will see the graphs you've made for Gaviota.
-
Adam Hair
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
Re: Expected performance and eval of Komodo 8 and SF 6
Yes, Miguel has seen them. We are hoping that we (or more precisely, Miguel) can use this type of information to improve Gaviota.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
Adam Hair wrote:Yes, Miguel has seen them. We are hoping that we (or more precisely, Miguel) can use this type of information to improve Gaviota.
Yes, I think the engine would do better to evaluate in score expectancies rather than in imponderable quantities related to pawn value; that's why I was curious to perform such a test.
-
michiguel
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Expected performance and eval of Komodo 8 and SF 6
Laskos wrote:Yes, I think the engine would do better to evaluate in score expectancies rather than in imponderable quantities related to pawn value; that's why I was curious to perform such a test.
This is very interesting. I looked at it, but I have been a bit short of time lately to post an opinion here.
The bottom line is that it is sort of embarrassing for the engine. I also agree with Peter, who said in previous posts that something should be done. The first temptation is to hack the evaluation and multiply it by some scaling factor based on what phase of the game the engine is in. But I suspect it may not solve anything, since this could be an indication that the engine sucks in the endgame. In other words, the eval may be ok, but the engine does not know how to win or draw.
Sometimes in science you have a parameter that is a good indicator of how well or badly you are doing, but you cannot use it to influence your decisions; otherwise, it stops being independent and useful. I have the intuition (I may be wrong) that this is similar. In other words, we should try to improve the endgame, and then check whether things got better with this test, rather than multiplying by a factor to make it look better.
A lot of food for thought here...
Miguel
-
Ajedrecista
- Posts: 2204
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Expected performance and eval of Komodo 8 and SF 6.
Hello:
World Chess Cup 2015 is being played these days and can be followed here:
LIVE FIDE World Chess Cup 2015.
Below the SF multi-PV 3 analysis, one can click on the Analysis tab and the site displays expected probabilities of white win (WW), draw (D) and black win (BW). They are independent of the phase of the game: an eval of +1 will return the same probabilities with 27 pieces on the board as with 11 pieces.
Evals are from SF and from White's POV. They are symmetric; I mean: if there is {Eval; WW, D, BW}, then there is {Eval' = -Eval; WW' = BW, D' = D, BW' = WW}. Some evals have no data because I did not find all the probabilities, but you can expect them to be near their neighbours. I am aware that {WW(Eval), D(Eval), BW(Eval)} are not monotonic functions, due to a few points. Here is the data that I collected:
Code: Select all
Eval WW D BW Comments
0.00 0.17 0.66 0.17 Equal.
0.01 0.17 0.66 0.17
0.02 0.17 0.66 0.17
0.03 0.17 0.67 0.16
0.04 0.18 0.66 0.16
0.05 0.18 0.66 0.16
0.06 0.18 0.66 0.16
0.07 0.18 0.66 0.16
0.08 0.18 0.66 0.16
0.09 0.19 0.66 0.15
0.10 0.19 0.66 0.15
0.11 0.19 0.66 0.15
0.12 0.19 0.66 0.15
0.13 0.19 0.66 0.15
0.14 0.19 0.66 0.15
0.15 0.20 0.66 0.14
0.16 0.20 0.66 0.14
0.17 0.20 0.66 0.14
0.18 0.20 0.66 0.14
0.19 0.20 0.66 0.14
0.20 0.21 0.65 0.14
0.21 0.21 0.65 0.14
0.22 0.21 0.65 0.14
0.23 0.21 0.65 0.14
0.24 0.22 0.65 0.13
0.25 0.22 0.65 0.13
0.26 0.22 0.65 0.13
0.27 0.22 0.65 0.13
0.28 0.22 0.65 0.13
0.29 0.23 0.64 0.13
0.30 0.23 0.64 0.13
0.31 0.23 0.64 0.13
0.32 0.24 0.64 0.12
0.33 0.24 0.64 0.12
0.34 0.24 0.64 0.12
0.35 0.24 0.64 0.12
0.36 0.25 0.63 0.12
0.37 0.25 0.63 0.12
0.38 0.25 0.63 0.12
0.39 0.25 0.63 0.12
0.40 0.26 0.63 0.11
0.41 0.26 0.63 0.11
0.42 0.26 0.63 0.11
0.43 0.27 0.62 0.11
0.44 0.27 0.62 0.11
0.45 0.27 0.62 0.11
0.46 0.27 0.62 0.11
0.47 0.28 0.61 0.11
0.48 0.28 0.61 0.11
0.49 0.28 0.61 0.11
0.50 0.29 0.61 0.10 Equal.
0.51 0.29 0.61 0.10 White is slightly better.
0.52 0.30 0.60 0.10
0.53 0.30 0.60 0.10
0.54 0.30 0.60 0.10
0.55 0.30 0.60 0.10
0.56 0.31 0.59 0.10
0.57 0.31 0.59 0.10
0.58 0.31 0.59 0.10
0.59 0.32 0.58 0.10
0.60 0.32 0.58 0.10
0.61 0.32 0.58 0.10
0.62 0.33 0.58 0.09
0.63 0.33 0.58 0.09
0.64 0.34 0.57 0.09
0.65 0.34 0.57 0.09
0.66 0.34 0.57 0.09
0.67 0.35 0.56 0.09
0.68 0.35 0.56 0.09
0.69 0.35 0.56 0.09
0.70 0.36 0.55 0.09
0.71 0.36 0.55 0.09
0.72 0.37 0.54 0.09
0.73 0.37 0.54 0.09
0.74 0.38 0.54 0.08
0.75 0.38 0.54 0.08
0.76 0.39 0.53 0.08
0.77 0.39 0.53 0.08
0.78 0.39 0.53 0.08
0.79 0.40 0.52 0.08
0.80 0.40 0.52 0.08
0.81 0.41 0.51 0.08
0.82 0.41 0.51 0.08
0.83 0.41 0.51 0.08
0.84 0.42 0.50 0.08
0.85 0.42 0.50 0.08
0.86 0.43 0.49 0.08
0.87 0.43 0.49 0.08
0.88 0.44 0.49 0.07
0.89 0.44 0.48 0.08
0.90 0.45 0.48 0.07
0.91 0.45 0.48 0.07
0.92 0.46 0.47 0.07
0.93 0.46 0.47 0.07
0.94 0.47 0.46 0.07
0.95 0.47 0.46 0.07
0.96 0.48 0.45 0.07
0.97 0.48 0.45 0.07
0.98 0.49 0.44 0.07
0.99
1.00 0.49 0.44 0.07 White is slightly better.
1.01 0.50 0.43 0.07 White is much better.
1.02 0.51 0.43 0.06
1.03 0.51 0.42 0.07
1.04 0.52 0.42 0.06
1.05 0.52 0.42 0.06
1.06 0.53 0.41 0.06
1.07 0.53 0.41 0.06
1.08 0.54 0.40 0.06
1.09 0.54 0.40 0.06
1.10 0.55 0.39 0.06
1.11 0.55 0.39 0.06
1.12 0.56 0.38 0.06
1.13 0.56 0.38 0.06
1.14 0.57 0.37 0.06
1.15 0.57 0.37 0.06
1.16 0.58 0.36 0.06
1.17 0.58 0.36 0.06
1.18 0.59 0.36 0.05
1.19 0.59 0.35 0.06
1.20 0.60 0.35 0.05
1.21 0.61 0.34 0.05
1.22 0.61 0.34 0.05
1.23 0.62 0.33 0.05
1.24 0.62 0.33 0.05
1.25 0.63 0.32 0.05
1.26 0.63 0.32 0.05
1.27 0.64 0.31 0.05
1.28 0.64 0.31 0.05
1.29 0.65 0.30 0.05
1.30 0.65 0.30 0.05
1.31 0.66 0.29 0.05
1.32 0.66 0.29 0.05
1.33 0.67 0.29 0.04
1.34 0.67 0.28 0.05
1.35 0.68 0.28 0.04
1.36 0.68 0.27 0.05
1.37 0.69 0.27 0.04
1.38 0.69 0.27 0.04
1.39 0.70 0.26 0.04
1.40 0.70 0.26 0.04
1.41 0.71 0.25 0.04
1.42 0.71 0.25 0.04
1.43 0.72 0.24 0.04
1.44 0.72 0.24 0.04
1.45 0.73 0.23 0.04
1.46 0.73 0.23 0.04
1.47 0.73 0.23 0.04
1.48 0.74 0.22 0.04
1.49 0.74 0.22 0.04
1.50 0.75 0.21 0.04
1.51 0.75 0.21 0.04
1.52 0.76 0.21 0.03
1.53 0.76 0.20 0.04
1.54 0.77 0.20 0.03
1.55 0.77 0.20 0.03
1.56 0.78 0.19 0.03
1.57 0.78 0.19 0.03
1.58 0.79 0.18 0.03
1.59 0.79 0.18 0.03
1.60 0.79 0.18 0.03
1.61 0.80 0.17 0.03
1.62 0.80 0.17 0.03
1.63 0.80 0.17 0.03
1.64 0.81 0.16 0.03
1.65 0.81 0.16 0.03
1.66 0.82 0.15 0.03
1.67 0.82 0.15 0.03
1.68 0.82 0.15 0.03
1.69 0.83 0.14 0.03
1.70 0.83 0.14 0.03
1.71 0.83 0.14 0.03
1.72 0.84 0.13 0.03
1.73
1.74 0.85 0.13 0.02
1.75 0.85 0.12 0.03
1.76 0.85 0.12 0.03
1.77 0.86 0.12 0.02
1.78 0.86 0.12 0.02
1.79 0.86 0.11 0.03
1.80 0.87 0.11 0.02
1.81 0.87 0.11 0.02
1.82
1.83 0.88 0.10 0.02
1.84 0.88 0.10 0.02
1.85 0.88 0.10 0.02
1.86 0.88 0.10 0.02
1.87 0.89 0.09 0.02
1.88 0.89 0.09 0.02
1.89 0.89 0.09 0.02
1.90 0.89 0.09 0.02
1.91 0.90 0.08 0.02
1.92 0.90 0.08 0.02
1.93 0.90 0.08 0.02
1.94 0.90 0.08 0.02
1.95 0.91 0.07 0.02
1.96 0.91 0.07 0.02
1.97 0.91 0.07 0.02
1.98 0.91 0.07 0.02
1.99
2.00 0.92 0.06 0.02 White is much better.
2.01 0.92 0.06 0.02 White is winning.
2.02 0.92 0.06 0.02
2.03 0.92 0.06 0.02
2.04 0.93 0.06 0.01
2.05 0.93 0.06 0.01
2.06 0.93 0.06 0.01
2.07 0.93 0.05 0.02
2.08 0.93 0.05 0.02
2.09 0.94 0.05 0.01
2.10 0.94 0.05 0.01
2.11 0.94 0.05 0.01
2.12 0.94 0.05 0.01
2.13 0.94 0.05 0.01
2.14 0.94 0.05 0.01
2.15 0.95 0.04 0.01
2.16 0.95 0.04 0.01
2.17 0.95 0.04 0.01
2.18 0.95 0.04 0.01
2.19 0.95 0.04 0.01
2.20
2.21 0.95 0.04 0.01
2.22 0.95 0.04 0.01
2.23 0.96 0.03 0.01
2.24 0.96 0.03 0.01
2.25 0.96 0.03 0.01
2.26 0.96 0.03 0.01
2.27 0.96 0.03 0.01
2.28 0.96 0.03 0.01
2.29 0.96 0.03 0.01
2.30 0.96 0.03 0.01
2.31 0.96 0.03 0.01
2.32 0.97 0.02 0.01
2.33 0.97 0.02 0.01
2.34 0.97 0.02 0.01
2.35 0.97 0.02 0.01
2.36 0.97 0.02 0.01
2.37 0.97 0.02 0.01
2.38 0.97 0.02 0.01
2.39 0.97 0.02 0.01
2.40 0.97 0.02 0.01
2.41 0.97 0.02 0.01
2.42 0.97 0.02 0.01
2.43 0.97 0.02 0.01
2.44 0.97 0.02 0.01
2.45 0.97 0.02 0.01
2.46 0.98 0.01 0.01
2.47 0.98 0.01 0.01
2.48 0.98 0.01 0.01
2.49 0.98 0.01 0.01
2.50 0.98 0.01 0.01
2.51 0.98 0.01 0.01
2.52 0.98 0.01 0.01
2.53 0.98 0.01 0.01
2.54 0.98 0.01 0.01
2.55 0.98 0.01 0.01
2.56 0.98 0.01 0.01
2.57 0.98 0.01 0.01
2.58 0.98 0.01 0.01
2.59 0.98 0.01 0.01
2.60 0.99 0.01 0.00
2.61 0.99 0.01 0.00
2.62 0.99 0.01 0.00
2.63 0.99 0.01 0.00
2.64 0.99 0.01 0.00
2.65 0.99 0.01 0.00
2.66 0.99 0.01 0.00
2.67 0.99 0.01 0.00
2.68
2.69 0.99 0.01 0.00
2.70 0.99 0.01 0.00
2.71 0.99 0.01 0.00
2.72 0.99 0.01 0.00
2.73 0.99 0.01 0.00
2.74 0.99 0.01 0.00
2.75 0.99 0.01 0.00
2.76 0.99 0.01 0.00
2.77 0.99 0.01 0.00
2.78 0.99 0.01 0.00
2.79 0.99 0.01 0.00
2.80 0.99 0.01 0.00
2.81 0.99 0.01 0.00
2.82 0.99 0.01 0.00
2.83 0.99 0.01 0.00
2.84 0.99 0.01 0.00
2.85 0.99 0.01 0.00
2.86 0.99 0.01 0.00
2.87 1.00 0.00 0.00
2.88 1.00 0.00 0.00
2.89 1.00 0.00 0.00 White is winning.
[...]
I hope there are no typos. An acceptable fit, obtained by trying some random coefficients in your proposed formula, is:
Code: Select all
(Expected white score) = WW + D/2 ~ 0.5 + 0.5*[sign(Eval)]*tanh[0.5*|Eval|^(1.51)]
Roundings on {WW, D, BW} are to 0.01 = 1%, so the resolution in (Expected white score) could be poor.
I am sure that anyone can find a better fit, and I am also curious about WW = WW(Eval), D = D(Eval) and BW = BW(Eval). I hope this info will be somewhat helpful.
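For what it's worth, the quoted fit is easy to check directly against the table; here is a minimal sketch, where the comparison values are WW + D/2 read off the rows for evals 0.00, 1.00 and 2.00:

```python
import math

def expected_white_score(ev):
    """The quoted fit: 0.5 + 0.5*sign(Eval)*tanh(0.5*|Eval|^1.51)."""
    # copysign carries the sign of ev onto the tanh term (0.0 stays 0.0).
    return 0.5 + 0.5 * math.copysign(math.tanh(0.5 * abs(ev) ** 1.51), ev)

# Compare against WW + D/2 from the table above.
for ev, table_score in [(0.00, 0.17 + 0.66 / 2),   # 0.50
                        (1.00, 0.49 + 0.44 / 2),   # 0.71
                        (2.00, 0.92 + 0.06 / 2)]:  # 0.95
    print(f"eval {ev:+.2f}: fit {expected_white_score(ev):.3f}, table {table_score:.3f}")
```

The fit lands within about two points of the table at these anchors (0.500, 0.731 and 0.945 versus 0.50, 0.71 and 0.95), and the symmetry {Eval' = -Eval} holds by construction.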
------------------------
Someone posted this Twitter account a few weeks ago in TCEC chat:
Chess Forecaster
I paste the URL here just in case someone finds it interesting.
Regards from Spain.
Ajedrecista.
-
DustyMonkey
- Posts: 61
- Joined: Wed Feb 19, 2014 10:11 pm
Re: Expected performance and eval of Komodo 8 and SF 6
michiguel wrote:The bottom line is that it is sort of embarrassing for the engine. I also agree with Peter, who said in previous posts that something should be done. The first temptation is to hack the evaluation and multiply it by some scaling factor based on what phase of the game the engine is in. But I suspect it may not solve anything, since this could be an indication that the engine sucks in the endgame. In other words, the eval may be ok, but the engine does not know how to win or draw.
...
Miguel
Hello, I am Joseph, who compiled the graphs referenced in the 2nd post of this thread. The original intent of my study was to check the veracity of Robert's claim about Houdini 4's eval modeling the expected value of the game.
The original study showed pretty convincingly that H4's eval is largely no different from the other engines', and I concluded that he probably isn't doing anything other than maybe some final-step scaling at the output. H4, SFDD and K6 all showed the same problem to essentially the same degree: the graphs have the same shapes.
Also, part of the study (sorry, the graphs aren't available anymore) showed that the depth of the search had an insignificant effect on the resulting graphs. I decided that this observation wasn't unexpected at all, even though I originally thought otherwise.
I wanted to run the tests again but generate graphs using material instead of move number (among other things!), but I got sidetracked in a major way on an unrelated machine learning project.
In any event the move numbers I was using (15, 30, 45, 60) were just proxies for "game stage", and "game stage" is just a proxy for the material makeup of the position.
I think the material makeup of the position would allow for a good linear separator for approximating the error between the evals and the expected values, but I think ultimately the function is largely non-linear.
I think the way to tackle this study properly at this point is to grab a million random positions from CCRL's top-rated 40/40 set (*), keeping track of the games final result for each position, and doing some k-means(**) or other clustering on a vector representing the material on the board of each position. Then for each cluster compute the expected value, as well as tabulate the engine evals for each position within each cluster.
(*) The reason for this is to minimize the number of conversions within the games.
(**) k-means is fast and very easy to implement, but may have issues with the almost-binary nature of a material vector, and the choice of k will be up to trial and error.
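A rough sketch of the proposed clustering step, assuming hypothetical material vectors (piece counts per side) and a plain NumPy k-means; the position extraction, the choice of k, and the engine evals are all left out:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each material vector to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids (keep the old one if a cluster empties).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Hypothetical material vectors: [wP, wN, wB, wR, wQ, bP, bN, bB, bR, bQ],
# plus each position's game result from White's POV (1, 0.5, 0).
X = np.array([
    [8, 2, 2, 2, 1,  8, 2, 2, 2, 1],   # opening-ish material
    [7, 2, 2, 2, 1,  7, 2, 2, 2, 1],
    [5, 1, 1, 1, 0,  5, 1, 1, 1, 0],   # middlegame-ish
    [5, 1, 1, 1, 0,  4, 1, 1, 1, 0],
    [3, 0, 0, 1, 0,  3, 0, 0, 1, 0],   # endgame-ish
    [2, 0, 0, 1, 0,  3, 0, 0, 1, 0],
], dtype=float)
results = np.array([0.5, 1.0, 0.5, 1.0, 0.0, 0.5])

centroids, labels = kmeans(X, k=3)
# Per-cluster expected value = mean result of the positions in the cluster.
for j in range(3):
    mask = labels == j
    if mask.any():
        print(j, round(results[mask].mean(), 3))
```

As the footnote warns, a material vector is close to binary in many coordinates, so a different distance or a one-hot recoding might cluster better; this is only the Euclidean baseline.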
-
DustyMonkey
- Posts: 61
- Joined: Wed Feb 19, 2014 10:11 pm
Re: Expected performance and eval of Komodo 8 and SF 6
Laskos wrote:Thanks. I see that move 15 is almost identical to move 30, so the trimming to move 40 was a pretty lucky one to get a stable result. I don't fully understand the method. The graph seems normalized (0.00 is 50%), but CCRL games are against unequal opponents, and the mapping from unequal opponents to equal opponents seems non-trivial.
The methodology was exactly this:
From the set of CCRL 40/40 games, I pulled out all the games where both engines were rated 3000+. IIRC, there were over 30,000 games that met this criterion.
From each of these I then extracted the positions at moves 15, 30, 45, and 60, and kept track of the game's result for each position: a bit over 120,000 positions in total.
Then, for each position, I generated the evals of H4, SFDD, and K6, regardless of which engines actually played the game (this took days of wall-clock time).
I then simply tabulated the actual game results for each eval (I think I used +/- 5 centipawn buckets), coming up with the mapping from eval -> expected value.
I do not believe bias is an issue, because the engines playing were not the ones that produced the evals used. The superior engine is just as likely to be white as it is to be black, so whatever bias is here is normal and averages out.
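The bucketing step described above can be sketched like this; the data is hypothetical, and `bucket_cp=5` mirrors the +/- 5 centipawn buckets mentioned:

```python
import numpy as np

def eval_to_expected(evals_cp, results, bucket_cp=5):
    """Round evals (in centipawns) to the nearest bucket_cp-wide bucket and
    average the game results in each bucket: an eval -> expected-value map."""
    evals_cp = np.asarray(evals_cp, dtype=float)
    results = np.asarray(results, dtype=float)
    buckets = (np.round(evals_cp / bucket_cp) * bucket_cp).astype(int)
    return {int(b): float(results[buckets == b].mean())
            for b in np.unique(buckets)}

# Hypothetical (eval, result) pairs; result is 1 / 0.5 / 0 from White's POV.
evals = [2, 4, 48, 52, 101, 98, -3]
res   = [0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5]
print(eval_to_expected(evals, res))
```

With real data one would also want a minimum sample count per bucket, since thinly populated buckets are what make the tabulated curves non-monotonic.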

