Scaling of engines from FGRL rating list

Moderator: Ras

cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Scaling of engines from FGRL rating list

Post by cdani »

JJJ wrote: This confirms my intuition about Komodo scaling better than Stockfish 8 with time.
Since some patches tend to scale well, most are probably neutral, and some scale badly, one can expect the scaling of Stockfish to be near 0, as is the case.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Scaling of engines from FGRL rating list

Post by mjlef »

Dann Corbit wrote: So using this measure, Andscacs scales best with longer time and Fritz the worst.
Laskos wrote: Yes.
cdani wrote: You can bet I tried very hard to obtain this :-)
Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.
Uri Blass
Posts: 11182
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Scaling of engines from FGRL rating list

Post by Uri Blass »

Dann Corbit wrote: So using this measure, Andscacs scales best with longer time and Fritz the worst.
Laskos wrote: Yes.
cdani wrote: You can bet I tried very hard to obtain this :-)
mjlef wrote: Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.
I think that you need to sacrifice playing strength to achieve better scaling.

A possible idea is to test every accepted patch with a lot of games at different time controls, in order to get a good idea of whether the patch scales well or not. I think that this knowledge can help to come up with better ideas.

Let's say 200,000 games at STC and LTC.

Unfortunately, the Stockfish team is not interested in a better estimate of the value of a patch: after a patch passes, they accept it without more testing to find its exact value.

I think that it is more interesting to know how much Elo you get from every patch. If I could decide how to use the machine time for Stockfish, I would stop testing new ideas for some months, start with a simple version of Stockfish without a lot of code, and test how much you get from the existing code.


Stockfish is already strong enough. I do not like the target of making it stronger; it is more interesting to have more information about why it is strong.

Something like this: we start with a simple, relatively weak engine A that is 2000 Elo weaker than Stockfish.

You get 100+-1 Elo from patch A1 at STC and 90+-1 Elo from patch A1 at LTC.
After adding patch A1, you get 80+-1 Elo from patch A2 at STC and 80+-1 Elo from patch A2 at LTC.

Continue in this way and you basically have something relatively simple plus patches A1+A2+A3+...+A150, where every Ai has an estimate of the Elo that it gives.

I think that this type of knowledge can help computer chess more than adding more Elo to Stockfish. I do not understand why the Stockfish team cares about the target of making Stockfish stronger, which I consider unimportant when Stockfish is not a commercial program.

I wonder whether all the people who give computer time to Stockfish think that making Stockfish stronger is more important than understanding, or whether there are people who would prefer to give computer time to get the type of knowledge that seems more interesting to me, so that at least part of the computer time that the Stockfish team uses could be devoted to this target.
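
As a rough illustration of why estimates like "+-1 Elo" call for game counts of this size, here is a minimal sketch of the error margin on an Elo estimate from a single match. It assumes a normal approximation, independent games, and a logistic Elo curve; the function name `elo_margin` and the win/draw/loss counts are hypothetical and are not taken from any rating list.

```python
import math

def elo_margin(wins: int, draws: int, losses: int, z: float = 1.96) -> float:
    """Approximate +/- Elo margin at z standard errors (normal approximation)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    std_err = math.sqrt(var / n)
    # Slope of the logistic Elo curve at this score (delta method).
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return z * slope * std_err

# Hypothetical 200,000-game match with 60% draws and an even score.
print(round(elo_margin(40_000, 120_000, 40_000), 2))

# The same proportions with 100 times fewer games give a margin 10 times wider.
print(round(elo_margin(400, 1_200, 400), 2))
```

The margin shrinks only with the square root of the number of games, so measuring each patch to this precision at LTC would be expensive.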
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling of engines from FGRL rating list

Post by Laskos »

A slightly more sound but equivalent scaling rating would use so-called "Wilos" instead of "Elos" (http://www.talkchess.com/forum/viewtopic.php?t=57482), basically computed from Wins/(Wins+Losses) ratios ("drawless Elo"). In that approach it was empirically shown that using the logistic for "Wilo" ratings allows for additivity over large spans of these ratings. It's not a problem here (we don't have large intervals), but it's good to stay consistent. In the OP I used log10((W2*L1)/(W1*L2)) as a "scaling rating", but it's unclear whether it's additive.

So, from logistic Score=1/(1+10^(-Wilo/400)), where Score=Wins/(Wins+Losses), I get Wilo1 from short time control in the list, and Wilo2 from long time control, and compute their difference to get the scaling. The result is here:

Scaling to Long Time Control on one core:

     Engine                    Scaling Wilo
  ------------------------------------------
   1 Andscacs 0.89       :          38.9 
   2 Fire 5              :          29.9
   3 Komodo 10.4         :          21.0
   4 Deep Shredder 13    :          16.7 
   5 Stockfish 8         :           6.6 
   6 Houdini 5.01        :          -9.7
   7 Chiron 4            :         -10.8
   8 Gull 3              :         -14.1 
   9 Fizbo 1.9           :         -23.8
  10 Fritz 15            :         -54.8
It is (up to a general multiplicative factor) very similar to the first result, as it should be, but now we are (almost) sure the ratings are additive (and their differences can be compared).
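
The computation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `wilo` and `scaling_wilo` and the win/loss counts are hypothetical, not the FGRL data, and draws are simply ignored, as in the Wilo definition.

```python
import math

def wilo(wins: int, losses: int) -> float:
    """Drawless rating: inverts Score = 1/(1 + 10^(-Wilo/400)), Score = W/(W+L)."""
    score = wins / (wins + losses)
    return 400.0 * math.log10(score / (1.0 - score))

def scaling_wilo(w1: int, l1: int, w2: int, l2: int) -> float:
    """Wilo at the long time control (w2, l2) minus Wilo at the short one (w1, l1)."""
    return wilo(w2, l2) - wilo(w1, l1)

# Hypothetical counts: 300 wins / 200 losses at the short time control,
# 330 wins / 180 losses at the long time control.
print(round(scaling_wilo(300, 200, 330, 180), 1))
```

An engine with zero wins or zero losses at either time control has no finite Wilo, so this sketch would fail with a division by zero or a math domain error on such input.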
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling of engines from FGRL rating list

Post by Laskos »

mjlef wrote:
It would be great to calculate the same kind of scaling based on number of cores/threads. Of course more cores help you search deeper, just as longer time does, so that would have to be taken into account.

Kai, as always, great stuff! Thanks.

Mark
I don't seem to have data for Stockfish 8 doublings in threads; there is Andreas' very important data for doublings in threads vs. doublings in time. Also, those (Komodo or older Stockfishes) are self-games, while here we have rating lists with round-robins of engines.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Scaling of engines from FGRL rating list

Post by cdani »

Laskos wrote: It is (up to a general multiplicative factor) very similar to the first result, as it should be, but now we are (almost) sure the ratings are additive (and their differences can be compared).
Many thanks again for your very useful work! It helps to shed concrete light on very interesting topics :-)
mjlef wrote:Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.
Even though I started trying to make my changes scale well a long time ago, I'm not sure what % of the apparent success is due to the continuous work and what is due to just a few changes that luckily scaled very well. As you said, it is not easy to be sure, as the resources are limited.

I can add that the scaling is already measurable with shorter time controls than what you use for Komodo, as I don't have a lot of resources and I only run tests of several minutes as verification tests between distant versions.
Dann Corbit wrote: I guess that all the efforts to obtain this are via pruning, since it has to do with all experiments running a single thread (so it has nothing to do with SMP).

I think that this is the right direction for a giant win (next big revolution like null move and LMR were in their day).
I also think that basic search stuff like LMR and others can be optimized to scale better, and that there is probably something bigger to be won from it. And of course I already have several working things like this, for example in the LMR of Andscacs. Its LMR is absolutely different from that of all the other engines I know.
cdani wrote:
JJJ wrote: This confirms my intuition about Komodo scaling better than Stockfish 8 with time.
Since some patches tend to scale well, most are probably neutral, and some scale badly, one can expect the scaling of Stockfish to be near 0, as is the case.
I can add that, without viewing any results, and knowing that the Komodo guys take care of scaling while the Stockfish tests do not, one can expect that Komodo scales better.

Anyway, things are not necessarily simple, as at TCEC-like time controls other things may weigh more than those measured here.

Houdart said that he doesn't know what made Stockfish strong enough to win TCEC. The answer may lie here, as we see that Stockfish scales a little better than Houdini, or maybe it is something else, who knows.
Laskos wrote:
Scaling to Long Time Control on one core:

     Engine                    Scaling Wilo
  ------------------------------------------
...
   7 Chiron 4            :         -10.8
   8 Gull 3              :         -14.1 
   9 Fizbo 1.9           :         -23.8
  10 Fritz 15            :         -54.8
I bet that these engines are tuned with very short time control games, of just a few seconds. Not easy to have enough computer power, of course. Some scaling changes need at least 30 seconds to be observed. So some king safety stuff, to take the easiest example, can seem good at, say, 5 seconds for STC and 15 seconds for LTC, but can be quite bad at 60 seconds.
Ajedrecista
Posts: 2211
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Scaling of engines from FGRL rating list.

Post by Ajedrecista »

Hello Kai:

As far as I understand, your two proposed metrics are exactly the same, other than the multiplicative factors:

Proposal 1 (original post):
http://talkchess.com/forum/viewtopic.ph ... 88&t=63687

Proposal 2 (Wilos):
http://talkchess.com/forum/viewtopic.ph ... 47&t=63687

Proposal 1:

Scaling = (W2/L2)/(W1/L1)
(Metric 1) = 100*log10(Scaling) = 100*log10[(W2/L2)/(W1/L1)]
------------

Proposal 2:

Score1 = W1/(W1 + L1)
Score2 = W2/(W2 + L2)
1 - Score1 = 1 - W1/(W1 + L1) = L1/(W1 + L1)
1 - Score2 = 1 - W2/(W2 + L2) = L2/(W2 + L2)

Wilo1 = 400*log10[Score1/(1 - Score1)] = 400*log10(W1/L1)
Wilo2 = 400*log10[Score2/(1 - Score2)] = 400*log10(W2/L2)

Wilo2 - Wilo1 = 400*[log10(W2/L2) - log10(W1/L1)] = 400*log10[(W2/L2)/(W1/L1)]
------------

The only difference is the multiplicative factor (100 or 400). I computed your two proposed metrics and I found some small errors in your values of metric 1, while I got the same values as you for metric 2. This is the reason why you did not find them completely equivalent.

So, (metric 2) = 4*(metric 1) from my POV. Other than that, the method is quite interesting, so thank you for sharing it with us.

Regards from Spain.

Ajedrecista.
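
The identity derived above is easy to check numerically. The following is a small sketch that codes both proposals exactly as written and compares them; the function names and the win/loss counts are made up for illustration.

```python
import math

def metric1(w1: int, l1: int, w2: int, l2: int) -> float:
    """Proposal 1: 100*log10[(W2/L2)/(W1/L1)]."""
    return 100.0 * math.log10((w2 / l2) / (w1 / l1))

def metric2(w1: int, l1: int, w2: int, l2: int) -> float:
    """Proposal 2: Wilo2 - Wilo1, computed through the drawless scores."""
    score1 = w1 / (w1 + l1)
    score2 = w2 / (w2 + l2)
    wilo1 = 400.0 * math.log10(score1 / (1.0 - score1))
    wilo2 = 400.0 * math.log10(score2 / (1.0 - score2))
    return wilo2 - wilo1

# Hypothetical (W1, L1, W2, L2) counts for three imaginary engines.
for counts in [(300, 200, 330, 180), (150, 250, 140, 260), (500, 100, 480, 90)]:
    m1, m2 = metric1(*counts), metric2(*counts)
    # The last column is True when metric 2 equals 4 * metric 1 up to rounding.
    print(counts, round(m1, 3), round(m2, 3), math.isclose(m2, 4.0 * m1, abs_tol=1e-9))
```

Any remaining difference between the two columns is floating-point rounding, not a difference between the metrics.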
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling of engines from FGRL rating list.

Post by Laskos »

Ajedrecista wrote:Hello Kai:

So, (metric 2) = 4*(metric 1) from my POV. Other than that, the method is quite interesting, so thank you for sharing it with us.

Regards from Spain.

Ajedrecista.
Thank you very much. So, it's only a multiplicative factor of 4 between the two, and the model is more sound now by using logistic Wilos. Simply, use method 1 and do 400*log10[(W2/L2)/(W1/L1)]. Method 2 is a little longer to compute. I was unsure if those logarithms in method 1 are additive (they are, in fact), but method 2 was already tested empirically.
jhellis3
Posts: 548
Joined: Sat Aug 17, 2013 12:36 am

Re: Scaling of engines from FGRL rating list.

Post by jhellis3 »

The problem with all of this is that how one interprets the data depends on one's frame of reference.

For example, swap the order around, and I imagine SF will be near the top as the most "graceful" failure as time approaches 0.

So is it really a matter of some engines scaling well with increasing time, or of other engines scaling better with decreasing time?

The other effect unaccounted for is natural Elo compression which will occur with greater time. Obviously, those at the top of the heap (at STC) have the most to lose...

That isn't to say engines won't scale differently with time, but the above factors certainly need to be accounted for and will likely compress the actual scaling differences considerably.
jhellis3
Posts: 548
Joined: Sat Aug 17, 2013 12:36 am

Re: Scaling of engines from FGRL rating list.

Post by jhellis3 »

A good example of Elo compression can be seen comparing SF7 to SF8.

Here we see the ratings differential compress from 78 Elo (10 minutes) to 61 Elo (60 minutes), so 17 Elo.

And I'm pretty sure Stockfish is not out-scaling Stockfish...
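
To see what this compression means in terms of results, the two rating gaps can be converted to expected scores. This is a minimal sketch that treats each gap as a plain logistic Elo difference between the two versions, which is an assumption: a rating list fits its ratings over games against a whole pool of opponents. The function name `expected_score` is hypothetical.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# Rating gaps quoted above for SF8 over SF7 at the two time controls.
for label, diff in (("10 minutes", 78), ("60 minutes", 61)):
    print(f"{label}: {diff} Elo -> expected score {expected_score(diff):.3f}")
```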