Scaling of engines from FGRL rating list

cdani · Post by **cdani** » Sat Apr 08, 2017 2:53 am

JJJ wrote: This confirms my intuition about Komodo scaling better than Stockfish 8 with time.

As some patches tend to scale well, most are probably neutral, and some scale badly, one can expect the net scaling of Stockfish to be near 0, as is the case.

mjlef · Post by **mjlef** » Sat Apr 08, 2017 2:53 am

cdani wrote:
Laskos wrote:
Dann Corbit wrote:So using this measure, Andscacs scales best with longer time and Fritz the worst.
Yes.
You can bet I tried very hard to obtain this

Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.

Uri Blass · Post by **Uri Blass** » Sat Apr 08, 2017 5:50 am

mjlef wrote:
cdani wrote:
Laskos wrote:
Dann Corbit wrote:So using this measure, Andscacs scales best with longer time and Fritz the worst.
Yes.
You can bet I tried very hard to obtain this
Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.

I think that you need to sacrifice playing strength to achieve better scaling.

A possible idea is to test every accepted patch with a lot of games at different time controls, in order to get a good idea of whether the patch scales well or not. I think that this knowledge can help to come up with better ideas.

Let's say 200,000 games at STC and LTC.
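To give a feel for why a number like 200,000 games is in the right range, here is a rough sketch of the 95% error margin of a match result as a function of the number of games. The draw rate of 0.6 and the 50% score are assumptions chosen purely for illustration (they are not taken from FGRL or from the Stockfish testing framework), and the function name `elo_margin_95` is hypothetical.

```python
import math

def elo_margin_95(games: int, draw_rate: float = 0.6, score: float = 0.5) -> float:
    """Approximate 95% error margin (in Elo) of a match result.

    draw_rate and score are illustrative assumptions, not measured data.
    """
    win_rate = score - draw_rate / 2.0
    loss_rate = 1.0 - score - draw_rate / 2.0
    if games <= 0 or not 0.0 < score < 1.0 or win_rate < 0.0 or loss_rate < 0.0:
        raise ValueError("inconsistent games, draw_rate or score")
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = win_rate + 0.25 * draw_rate - score * score
    std_error_score = math.sqrt(variance / games)
    # Slope of Elo = 400*log10(s/(1-s)) with respect to the score s.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return 1.96 * slope * std_error_score

for n in (2_000, 20_000, 200_000):
    print(f"{n:>7} games: +/- {elo_margin_95(n):.2f} Elo")
```

Under these assumptions the margin at 200,000 games comes out a little under 1 Elo, and it only halves each time the number of games is quadrupled.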

Unfortunately the Stockfish team is not interested in a better estimate of the value of a patch: once a patch passes, they accept it without further testing to see its exact value.

I think that it is more interesting to know how much Elo you get from every patch. If I decided how to use the machine time for Stockfish, I would stop testing new ideas for some months, start with a simple version of Stockfish without a lot of code, and test how much you get from each piece of the existing code.

Stockfish is already strong enough. I do not like the target of making it stronger; it is more interesting to have more information about why it is strong.

Something like this: we start with a simple, relatively weak engine A that is 2000 Elo weaker than Stockfish.

You get 100+-1 Elo from patch A1 at STC and 90+-1 Elo from patch A1 at LTC.
After adding patch A1 you get 80+-1 Elo from patch A2 at STC and 80+-1 Elo from patch A2 at LTC.

Continue in this way and you basically have something relatively simple plus patches A1+A2+A3+...+A150, where every Ai has an estimate of the Elo that it gives.
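The bookkeeping this proposal implies can be sketched as follows, using the two example patches from the post (A1: 100+-1 at STC and 90+-1 at LTC; A2: 80+-1 at both). The `PatchEstimate` class, the `verdict` helper and the two-sigma threshold are hypothetical, and combining the error bars in quadrature assumes the STC and LTC measurements are independent.

```python
import math
from dataclasses import dataclass

@dataclass
class PatchEstimate:
    name: str
    stc_elo: float
    stc_err: float
    ltc_elo: float
    ltc_err: float

    def scaling(self) -> tuple[float, float]:
        """Return (LTC gain minus STC gain, combined error bar)."""
        diff = self.ltc_elo - self.stc_elo
        # Assumption: independent errors, so they add in quadrature.
        err = math.hypot(self.stc_err, self.ltc_err)
        return diff, err

def verdict(patch: PatchEstimate, sigmas: float = 2.0) -> str:
    """Classify a patch; the threshold of 2 combined errors is arbitrary."""
    diff, err = patch.scaling()
    if diff > sigmas * err:
        return "scales better"
    if diff < -sigmas * err:
        return "scales worse"
    return "neutral"

# Example values from the post.
patches = [
    PatchEstimate("A1", 100.0, 1.0, 90.0, 1.0),
    PatchEstimate("A2", 80.0, 1.0, 80.0, 1.0),
]

for p in patches:
    diff, err = p.scaling()
    print(f"{p.name}: LTC - STC = {diff:+.1f} +- {err:.1f} Elo ({verdict(p)})")
```

With per-patch records like these, the overall scaling of an engine could be read off as the sum of the individual differences, provided the gains really are additive.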

I think that this type of knowledge can help computer chess more than adding more Elo to Stockfish, and I do not understand why the Stockfish team cares about making Stockfish stronger, a target that I consider unimportant when Stockfish is not a commercial program.

I wonder whether all the people who give computer time to Stockfish think that making Stockfish stronger is more important than understanding it, or whether there are people who would prefer to give computer time to obtain the type of knowledge that seems more interesting to me, so that at least part of the computer time that the Stockfish team uses could be devoted to this target.

Laskos · Post by **Laskos** » Sat Apr 08, 2017 8:57 am

A bit more sound, but equivalent, scaling rating would use so-called "Wilos" (instead of "Elos") (http://www.talkchess.com/forum/viewtopic.php?t=57482), basically computing Wins/(Wins+Losses) ratios ("drawless Elo"). In that approach it was empirically shown that using the logistic for "Wilo" ratings allows for additivity over large spans of these ratings. That is not a problem here (we don't have large intervals), but it's good to stay consistent. In the OP I used log10((W2*L1)/(W1*L2)) as a "scaling rating", but it's unclear whether it's additive.

So, from the logistic Score=1/(1+10^(-Wilo/400)), where Score=Wins/(Wins+Losses), I get Wilo1 from the short time control in the list and Wilo2 from the long time control, and compute their difference to get the scaling. The result is here:

Scaling to Long Time Control on one core:

     Engine                    Scaling Wilo
  ------------------------------------------
   1 Andscacs 0.89       :          38.9 
   2 Fire 5              :          29.9
   3 Komodo 10.4         :          21.0
   4 Deep Shredder 13    :          16.7 
   5 Stockfish 8         :           6.6 
   6 Houdini 5.01        :          -9.7
   7 Chiron 4            :         -10.8
   8 Gull 3              :         -14.1 
   9 Fizbo 1.9           :         -23.8
  10 Fritz 15            :         -54.8

It is (up to a general multiplicative factor) very similar to the first result, as it should be, but now we are (almost) sure the ratings are additive (and their differences can be compared).
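For anyone who wants to reproduce this kind of table from their own results, a minimal sketch of the computation follows. It inverts the logistic given above to get a Wilo from win and loss counts and then takes the difference between the two time controls. The function names and the win/loss counts in the usage lines are made up for illustration; they are not the FGRL numbers.

```python
import math

def wilo(wins: int, losses: int) -> float:
    """Drawless rating from Score = W/(W+L) = 1/(1 + 10^(-Wilo/400))."""
    if wins <= 0 or losses <= 0:
        raise ValueError("need at least one win and one loss")
    score = wins / (wins + losses)
    return 400.0 * math.log10(score / (1.0 - score))

def scaling_wilo(w1: int, l1: int, w2: int, l2: int) -> float:
    """Wilo at the long time control minus Wilo at the short one."""
    return wilo(w2, l2) - wilo(w1, l1)

# Hypothetical counts: (wins, losses) at short and at long time control.
w1, l1 = 300, 200
w2, l2 = 280, 150
print(f"Wilo1 = {wilo(w1, l1):.1f}")
print(f"Wilo2 = {wilo(w2, l2):.1f}")
print(f"Scaling Wilo = {scaling_wilo(w1, l1, w2, l2):+.1f}")
```

Draws do not enter the computation at all, which is the point of the "drawless" rating; an engine with zero wins or zero losses has no finite Wilo, hence the explicit check.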

Laskos · Post by **Laskos** » Sat Apr 08, 2017 10:15 am

mjlef wrote:
It would be great to calculate the same kind of scaling based on number of cores/threads. Of course more cores help you search deeper, just as longer time does, so that would have to be taken into account.

Kai, as always, great stuff! Thanks.

Mark

I don't seem to have data for Stockfish 8 doublings in threads; there is Andreas' very important data for doubling in threads vs. doubling in time. Also, those (Komodo or older Stockfishes) are self-play games, whereas here we have rating lists with round-robins of engines.

cdani · Post by **cdani** » Sat Apr 08, 2017 11:09 am

Laskos wrote: It is (up to a general multiplicative factor) very similar to the first result, as it should be, but now we are (almost) sure the ratings are additive (and their differences can be compared).

Many thanks again for your very useful work! It helps to shed concrete light on very interesting topics.

mjlef wrote:Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.

Even though I started long ago to try to make changes scale well, I'm not sure what % of the apparent success is due to that continuous work and what is due to just a few changes that luckily scaled very well. As you said, it is not easy to be sure, as resources are limited.

I can add that the scaling is already measurable at shorter time controls than the ones you use for Komodo, as I don't have a lot of resources and I only run tests of several minutes as verification tests between distant versions.

Dann Corbit wrote: I guess that all the efforts to obtain this are via pruning, since it has to do with all experiments running a single thread (so it has nothing to do with SMP).

I think that this is the right direction for a giant win (next big revolution like null move and LMR were in their day).

I also think that basic search techniques like LMR and others can be tuned to scale better, and that there is probably something bigger to be won from it. And of course I already have several working things like this, for example in the LMR of Andscacs. Its LMR is completely different from that of all the other engines I know.

cdani wrote:
JJJ wrote: This confirms my intuition about Komodo scaling better than Stockfish 8 with time.
As some patches tend to scale well, most are probably neutral, and some scale badly, one can expect the net scaling of Stockfish to be near 0, as is the case.

I can add that, without seeing any results, and knowing that the Komodo guys take care of scaling, unlike the Stockfish tests, one can expect Komodo to scale better.

Anyway, things are not necessarily simple, as at TCEC-like time controls other things may weigh more than those measured here.

Houdart said that he doesn't know what made Stockfish strong enough to win TCEC. The answer may lie here, as we see that Stockfish scales a little better than Houdini, or maybe it is something else, who knows.

Laskos wrote:
Scaling to Long Time Control on one core:

     Engine                    Scaling Wilo
  ------------------------------------------
...
   7 Chiron 4            :         -10.8
   8 Gull 3              :         -14.1 
   9 Fizbo 1.9           :         -23.8
  10 Fritz 15            :         -54.8

I bet that these engines are tuned with very short time control games, of just a few seconds. It is not easy to have enough computing power, of course. Some scaling changes need at least 30 seconds to be observed. So some king safety changes, to name the easiest example, can seem good at, say, 5 seconds for STC and 15 seconds for LTC, but can be quite bad at 60 seconds.

Ajedrecista · Post by **Ajedrecista** » Sat Apr 08, 2017 12:01 pm

Hello Kai:

As far as I understand, your two proposed metrics are exactly the same, other than the multiplicative factors:

Proposal 1 (original post):
http://talkchess.com/forum/viewtopic.ph ... 88&t=63687

Proposal 2 (Wilos):
http://talkchess.com/forum/viewtopic.ph ... 47&t=63687

Proposal 1:

Scaling = (W2/L2)/(W1/L1)
(Metric 1) = 100*log10(Scaling) = 100*log10[(W2/L2)/(W1/L1)]

------------

Proposal 2:

Score1 = W1/(W1 + L1)
Score2 = W2/(W2 + L2)
1 - Score1 = 1 - W1/(W1 + L1) = L1/(W1 + L1)
1 - Score2 = 1 - W2/(W2 + L2) = L2/(W2 + L2)

Wilo1 = 400*log10[Score1/(1 - Score1)] = 400*log10(W1/L1)
Wilo2 = 400*log10[Score2/(1 - Score2)] = 400*log10(W2/L2)

Wilo2 - Wilo1 = 400*[log10(W2/L2) - log10(W1/L1)] = 400*log10[(W2/L2)/(W1/L1)]

------------

The only difference is the multiplicative factor (100 or 400). I computed your two proposed metrics and found some small errors in your values of metric 1, while I got the same values as you for metric 2. This is the reason why you did not find them completely equivalent.

So, (metric 2) = 4*(metric 1) from my POV. Other than that, the method is quite interesting, so thank you for sharing it with us.

Regards from Spain.

Ajedrecista.
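The identity derived in this post, (metric 2) = 4*(metric 1), is easy to check numerically. The sketch below implements both proposals exactly as written in the two derivations and compares them on a few made-up win/loss counts; the function names and the counts are illustrative only.

```python
import math

def metric1(w1: int, l1: int, w2: int, l2: int) -> float:
    """Proposal 1: 100*log10[(W2/L2)/(W1/L1)]."""
    return 100.0 * math.log10((w2 / l2) / (w1 / l1))

def metric2(w1: int, l1: int, w2: int, l2: int) -> float:
    """Proposal 2: Wilo2 - Wilo1, computed via the drawless scores."""
    score1 = w1 / (w1 + l1)
    score2 = w2 / (w2 + l2)
    wilo1 = 400.0 * math.log10(score1 / (1.0 - score1))
    wilo2 = 400.0 * math.log10(score2 / (1.0 - score2))
    return wilo2 - wilo1

# Hypothetical (W1, L1, W2, L2) tuples, not taken from any rating list.
samples = [(120, 80, 90, 70), (50, 150, 40, 100), (333, 77, 250, 60)]

for counts in samples:
    m1 = metric1(*counts)
    m2 = metric2(*counts)
    # Equal up to floating-point rounding.
    assert math.isclose(m2, 4.0 * m1, rel_tol=1e-9, abs_tol=1e-9)
    print(f"{counts}: metric 1 = {m1:+.3f}, metric 2 = {m2:+.3f}")
```

Any table that does not satisfy this factor of 4 row by row therefore contains an arithmetic slip in one of the two columns.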

Laskos · Post by **Laskos** » Sat Apr 08, 2017 12:27 pm

Ajedrecista wrote:Hello Kai:

As far as I understand, your two proposed metrics are exactly the same, other than the multiplicative factors:

Proposal 1 (original post):
http://talkchess.com/forum/viewtopic.ph ... 88&t=63687

Proposal 2 (Wilos):
http://talkchess.com/forum/viewtopic.ph ... 47&t=63687

Proposal 1:
Scaling = (W2/L2)/(W1/L1)
(Metric 1) = 100*log10(Scaling) = 100*log10[(W2/L2)/(W1/L1)]
------------

Proposal 2:
Score1 = W1/(W1 + L1)
Score2 = W2/(W2 + L2)
1 - Score1 = 1 - W1/(W1 + L1) = L1/(W1 + L1)
1 - Score2 = 1 - W2/(W2 + L2) = L2/(W2 + L2)

Wilo1 = 400*log10[Score1/(1 - Score1)] = 400*log10(W1/L1)
Wilo2 = 400*log10[Score2/(1 - Score2)] = 400*log10(W2/L2)

Wilo2 - Wilo1 = 400*[log10(W2/L2) - log10(W1/L1)] = 400*log10[(W2/L2)/(W1/L1)]
------------

The only difference is the multiplicative factor (100 or 400). I computed your two proposed metrics and found some small errors in your values of metric 1, while I got the same values as you for metric 2. This is the reason why you did not find them completely equivalent.

So, (metric 2) = 4*(metric 1) from my POV. Other than that, the method is quite interesting, so thank you for sharing it with us.

Regards from Spain.

Ajedrecista.

Thank you very much. So, there is only a multiplicative factor of 4 between the two, and the model is more sound now by using logistic Wilos. Simply use method 1 and compute 400*log10[(W2/L2)/(W1/L1)]; method 2 takes a little longer to compute. I was unsure whether the logarithms in method 1 are additive (they are, in fact), but method 2 had already been tested empirically.

jhellis3 · Post by **jhellis3** » Sun Apr 09, 2017 6:52 am

The problem with all of this is that how one interprets the data depends on one's frame of reference.

For example, swap the order around, and I imagine SF will be near the top as the most "graceful" failure as time approaches 0.

So is it really a matter of some engines scaling well with increasing time, or of other engines scaling better with decreasing time?

The other effect unaccounted for is natural Elo compression which will occur with greater time. Obviously, those at the top of the heap (at STC) have the most to lose...

That isn't to say engines won't scale differently with time, but the above factors certainly need to be accounted for and will likely compress the actual scaling differences considerably.

jhellis3 · Post by **jhellis3** » Sun Apr 09, 2017 7:11 am

A good example of Elo compression can be seen comparing SF7 to SF8.

Here we see the ratings differential compress from 78 Elo (10 minutes) to 61 Elo (60 minutes), so 17 Elo.

And I'm pretty sure Stockfish is not out-scaling Stockfish...
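To put the two gaps quoted above on the score scale, here is a small sketch that converts them to expected scores with the standard logistic Elo formula and reports the compression in absolute and relative terms. Only the 78 and 61 Elo figures come from the post; the helper name `expected_score` is hypothetical.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

gap_10min = 78.0  # SF8 vs SF7 at 10 minutes, as quoted in the post
gap_60min = 61.0  # SF8 vs SF7 at 60 minutes, as quoted in the post

compression = gap_10min - gap_60min
relative = compression / gap_10min

print(f"expected score at 10 min: {expected_score(gap_10min):.3f}")
print(f"expected score at 60 min: {expected_score(gap_60min):.3f}")
print(f"compression: {compression:.0f} Elo ({relative:.0%} of the 10 min gap)")
```

A correction for compression along these lines would have to be applied before reading the remaining differences as genuine scaling.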
