JJJ wrote: This confirm my intuition, about Komodo scaling better than Stockfish 8 with time.
As some patches tend to have good scaling, most are probably neutral, and some have bad scaling, one can expect the scaling of Stockfish to be near 0, as is the case.
Scaling of engines from FGRL rating list
Moderator: Ras
-
cdani
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Scaling of engines from FGRL rating list
Daniel José -
http://www.andscacs.com
-
mjlef
- Posts: 1494
- Joined: Thu Mar 30, 2006 2:08 pm
Re: Scaling of engines from FGRL rating list
Dann Corbit wrote: So using this measure, Andscacs scales best with longer time and Fritz the worst.
Laskos wrote: Yes.
cdani wrote: You can bet I tried very hard to obtain this
Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.
-
Uri Blass
- Posts: 11182
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Scaling of engines from FGRL rating list
Dann Corbit wrote: So using this measure, Andscacs scales best with longer time and Fritz the worst.
Laskos wrote: Yes.
cdani wrote: You can bet I tried very hard to obtain this
mjlef wrote: Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.
I think that you need to sacrifice playing strength to achieve better scaling.
A possible idea is to test every accepted patch with a lot of games at different time controls, to get a good idea of whether the patch scales well or not. I think that this knowledge can help to produce better ideas.
Let's say 200,000 games at STC and at LTC.
Unfortunately the Stockfish team is not interested in a better estimate of the value of a patch: after a patch passes, they accept it without more testing to find its exact value.
I think that it is more interesting to know how much Elo you get from every patch. If I decided how to use the machine time for Stockfish, I would stop testing new ideas for some months, start with a simple version of Stockfish without a lot of code, and test how much you get from the existing code.
Stockfish is already strong enough. I do not like the target of making it stronger; it is more interesting to have more information about why it is strong.
Something like this: we start with a simple, relatively weak engine A that is 2000 Elo weaker than Stockfish.
You get 100+-1 Elo from patch A1 at STC and 90+-1 Elo from patch A1 at LTC.
After adding patch A1 you get 80+-1 Elo from patch A2 at STC and 80+-1 Elo from patch A2 at LTC.
Continue in this way and you basically have something relatively simple plus patches A1+A2+A3+...+A150, where every Ai has an estimate of the Elo that it gives.
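The bookkeeping proposed here could be sketched as follows. This is only an illustration, not anything the Stockfish project runs: the `PatchResult` class and its field names are hypothetical, and the two entries reuse the example figures from this post (100 vs 90 Elo for A1, 80 vs 80 Elo for A2). Taking the per-patch scaling as the LTC estimate minus the STC estimate, and summing the gains, are both assumptions about how one might summarize such a ledger.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Hypothetical record of one patch measured at two time controls."""
    name: str
    stc_elo: float  # Elo gain measured at short time control
    ltc_elo: float  # Elo gain measured at long time control

    @property
    def scaling(self) -> float:
        # Assumed summary: positive means the patch gains more at LTC.
        return self.ltc_elo - self.stc_elo


def summarize(patches):
    """Return cumulative STC Elo, cumulative LTC Elo, and per-patch scaling."""
    total_stc = sum(p.stc_elo for p in patches)
    total_ltc = sum(p.ltc_elo for p in patches)
    per_patch = {p.name: p.scaling for p in patches}
    return total_stc, total_ltc, per_patch


# Example figures taken from the post: A1 gives 100/90, A2 gives 80/80.
ledger = [PatchResult("A1", 100.0, 90.0), PatchResult("A2", 80.0, 80.0)]
total_stc, total_ltc, per_patch = summarize(ledger)
print(total_stc, total_ltc, per_patch)
```

With 150 such entries, sorting by `scaling` would show at a glance which patches gain or lose value at longer time controls.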
I think that this type of knowledge can help computer chess more than adding more Elo to Stockfish. I do not understand why the Stockfish team cares about the target of making Stockfish stronger, which I consider unimportant when Stockfish is not a commercial program.
I wonder whether all the people who give computer time to Stockfish think that making Stockfish stronger is more important than understanding, or whether there are people who would prefer to give computer time to get the type of knowledge that seems more interesting to me, so that at least part of the computer time that the Stockfish team uses could be devoted to this target.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Scaling of engines from FGRL rating list
A bit more sound, but equivalent, scaling rating would use so-called "Wilos" (instead of "Elos") (http://www.talkchess.com/forum/viewtopic.php?t=57482), basically computing Wins/(Wins+Losses) ratios ("drawless Elo"). In that approach it was empirically shown that using the logistic for "Wilo" ratings allows for additivity over large spans of these ratings. It's not a problem here (we don't have large intervals), but it's good to keep consistent. In the OP I used log10((W2*L1)/(W1*L2)) as a "scaling rating", but it's unclear whether it's additive.
So, from the logistic Score = 1/(1 + 10^(-Wilo/400)), where Score = Wins/(Wins + Losses), I get Wilo1 from the short time control in the list and Wilo2 from the long time control, and compute their difference to get the scaling.
The result, listed below, is (up to a general multiplicative factor) very similar to the first result, as it should be, but now we are (almost) sure the ratings are additive (and their differences can be compared).
Scaling to Long Time Control on one core:
Code: Select all
    Engine               Scaling Wilo
------------------------------------------
 1  Andscacs 0.89      :   38.9
 2  Fire 5             :   29.9
 3  Komodo 10.4        :   21.0
 4  Deep Shredder 13   :   16.7
 5  Stockfish 8        :    6.6
 6  Houdini 5.01       :   -9.7
 7  Chiron 4           :  -10.8
 8  Gull 3             :  -14.1
 9  Fizbo 1.9          :  -23.8
10  Fritz 15           :  -54.8
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Scaling of engines from FGRL rating list
mjlef wrote: It would be great to calculate the same kind of scaling based on number of cores/threads. Of course more cores help you search deeper, just as longer time does, so that would have to be taken into account. Kai, as always, great stuff! Thanks. Mark
I don't seem to have data for Stockfish 8 doublings in threads; there is Andreas' very important data for doublings in threads vs. doublings in time. Also, these (Komodo or older Stockfishes) are self-games, while here we have rating lists with round-robins of engines.
-
cdani
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Scaling of engines from FGRL rating list
Laskos wrote: It is (up to a general multiplicative factor) very similar to the first result, as it should be, but now we are (almost) sure the ratings are additive (and their differences can be compared).
Many thanks again for your very useful work! It helps to shed concrete light on very interesting topics.
mjlef wrote: Impressive! Is there anything you do that you think helps in time scaling? We try to test at a few time controls to get a feel for scaling, but it is difficult to get enough games at much more than a few minutes a game.
Even though I started trying to make changes scale well a long time ago, I'm not sure what percentage of the apparent success is due to the continuous work and what is due to just a few changes that luckily scaled very well. As you said, it is not easy to be sure, as resources are limited.
I can add that the scaling is already measurable at shorter time controls than what you use for Komodo, as I don't have a lot of resources and I only run tests of several minutes as verification tests between distant versions.
Dann Corbit wrote: I guess that all the efforts to obtain this are via pruning, since it has to do with all experiments running a single thread (so it has nothing to do with SMP). I think that this is the right direction for a giant win (next big revolution like null move and LMR were in their day).
I also think that the scaling of basic search stuff like LMR and others can be optimized to scale better, and that there is probably something bigger to be won from it. And of course I already have several working things like this, for example in the LMR of Andscacs. Its LMR is absolutely different from all the other engines I know.
JJJ wrote: This confirm my intuition, about Komodo scaling better than Stockfish 8 with time.
cdani wrote: As some patches tend to have good scaling, mostly are probably neutral, and some have bad scaling, one can expect the scaling of Stockfish to be near 0, as is the case.
I can add that without viewing any results, and knowing that the Komodo guys take care of scaling, unlike the Stockfish tests, one can expect that Komodo scales better.
Anyway, things are not necessarily simple, as maybe at TCEC-like time controls other things weigh more than those measured here.
Houdart said that he doesn't know what made Stockfish strong enough to win TCEC. The possible answer maybe lies here, as we see that Stockfish scales a little better than Houdini, or maybe it is something else, who knows.
Laskos wrote: Scaling to Long Time Control on one core:
Code: Select all
    Engine               Scaling Wilo
------------------------------------------
...
 7  Chiron 4           :  -10.8
 8  Gull 3             :  -14.1
 9  Fizbo 1.9          :  -23.8
10  Fritz 15           :  -54.8
I bet that these engines are tuned with very short time control games, of just a few seconds. Not easy to have enough computer power, of course. Some scaling changes require at least 30 seconds to be observed. So some king safety stuff, to name the easiest, can seem good at, say, 5 seconds for STC and 15 seconds for LTC, but can be quite bad at 60 seconds.
Daniel José -
http://www.andscacs.com
-
Ajedrecista
- Posts: 2211
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Scaling of engines from FGRL rating list.
Hello Kai:
As far as I understand, your two proposed metrics are exactly the same, other than the multiplicative factors:
Proposal 1 (original post):
http://talkchess.com/forum/viewtopic.ph ... 88&t=63687
Proposal 2 (Wilos):
http://talkchess.com/forum/viewtopic.ph ... 47&t=63687
Proposal 1:
------------
Code: Select all
Scaling = (W2/L2)/(W1/L1)
(Metric 1) = 100*log10(Scaling) = 100*log10[(W2/L2)/(W1/L1)]
Proposal 2:
------------
Code: Select all
Score1 = W1/(W1 + L1)
Score2 = W2/(W2 + L2)
1 - Score1 = 1 - W1/(W1 + L1) = L1/(W1 + L1)
1 - Score2 = 1 - W2/(W2 + L2) = L2/(W2 + L2)
Wilo1 = 400*log10[Score1/(1 - Score1)] = 400*log10(W1/L1)
Wilo2 = 400*log10[Score2/(1 - Score2)] = 400*log10(W2/L2)
Wilo2 - Wilo1 = 400*[log10(W2/L2) - log10(W1/L1)] = 400*log10[(W2/L2)/(W1/L1)]
The only difference is the multiplicative factor (100 or 400). I computed your two proposed metrics and found some small errors in your values for metric 1, while I got the same values as you for metric 2. This is the reason why you did not find them completely equivalent.
So, (metric 2) = 4*(metric 1) from my POV. Other than that, the method is quite interesting, so thank you for sharing it with us.
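The identity (metric 2) = 4*(metric 1) can also be checked numerically. The following is a minimal sketch, not part of the original computation: the function names `metric1`, `wilo` and `metric2` and the win/loss counts are hypothetical, chosen only for illustration. It computes proposal 1 directly and proposal 2 through the logistic Score, then confirms that the second is four times the first.

```python
import math


def metric1(w1, l1, w2, l2):
    """Proposal 1: 100*log10 of the ratio of win/loss ratios."""
    return 100.0 * math.log10((w2 / l2) / (w1 / l1))


def wilo(wins, losses):
    """Wilo from the drawless score, inverting Score = 1/(1 + 10^(-Wilo/400))."""
    score = wins / (wins + losses)
    return 400.0 * math.log10(score / (1.0 - score))


def metric2(w1, l1, w2, l2):
    """Proposal 2: Wilo at long time control minus Wilo at short time control."""
    return wilo(w2, l2) - wilo(w1, l1)


# Hypothetical counts: (W1, L1) at short TC, (W2, L2) at long TC.
w1, l1, w2, l2 = 300, 200, 150, 90
m1 = metric1(w1, l1, w2, l2)
m2 = metric2(w1, l1, w2, l2)
print(f"metric 1 = {m1:.2f}, metric 2 = {m2:.2f}")
assert math.isclose(m2, 4.0 * m1)
```

The check goes through the Score explicitly in `wilo` rather than using the simplified 400*log10(W/L), so it exercises the full route of proposal 2.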
Regards from Spain.
Ajedrecista.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Scaling of engines from FGRL rating list.
Ajedrecista wrote: Hello Kai:
As far as I understand, your two proposed metrics are exactly the same, other than the multiplicative factors:
Proposal 1 (original post):
http://talkchess.com/forum/viewtopic.ph ... 88&t=63687
Proposal 2 (Wilos):
http://talkchess.com/forum/viewtopic.ph ... 47&t=63687
Proposal 1:
------------
Code: Select all
Scaling = (W2/L2)/(W1/L1)
(Metric 1) = 100*log10(Scaling) = 100*log10[(W2/L2)/(W1/L1)]
Proposal 2:
------------
Code: Select all
Score1 = W1/(W1 + L1)
Score2 = W2/(W2 + L2)
1 - Score1 = 1 - W1/(W1 + L1) = L1/(W1 + L1)
1 - Score2 = 1 - W2/(W2 + L2) = L2/(W2 + L2)
Wilo1 = 400*log10[Score1/(1 - Score1)] = 400*log10(W1/L1)
Wilo2 = 400*log10[Score2/(1 - Score2)] = 400*log10(W2/L2)
Wilo2 - Wilo1 = 400*[log10(W2/L2) - log10(W1/L1)] = 400*log10[(W2/L2)/(W1/L1)]
The only difference is the multiplicative factor (100 or 400). I computed your two proposed metrics and I found some little errors on your values of metric 1, while I got the same values as you in metric 2. This is the reason why you did not found them completely equivalent.
So, (metric 2) = 4*(metric 1) from my POV. Other than that, the method is quite interesting, so thank you for sharing it with us.
Regards from Spain.
Ajedrecista.
Thank you very much. So, it's only a multiplicative factor of 4 between the two, and the model is more sound now by using logistic Wilos. Simply use method 1 and compute 400*log10[(W2/L2)/(W1/L1)]. Method 2 is a little longer to compute. I was unsure whether those logarithms in method 1 are additive (they are, in fact), but method 2 was already tested empirically.
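The shortcut 400*log10[(W2/L2)/(W1/L1)] can be wrapped in a small helper and used to rank engines. This is a minimal sketch under stated assumptions: the function name `scaling_wilo`, the dictionary layout, the engine names and all win/loss counts below are hypothetical placeholders, not values from the FGRL list.

```python
import math


def scaling_wilo(w1, l1, w2, l2):
    """Scaling = 400*log10[(W2/L2)/(W1/L1)].

    (w1, l1) are wins/losses at the short time control,
    (w2, l2) are wins/losses at the long time control.
    """
    if min(w1, l1, w2, l2) <= 0:
        raise ValueError("wins and losses must be positive for the ratio to exist")
    return 400.0 * math.log10((w2 / l2) / (w1 / l1))


# Hypothetical counts per engine: (W1, L1, W2, L2). Not FGRL data.
results = {
    "Engine X": (400, 300, 220, 150),
    "Engine Y": (350, 350, 160, 180),
    "Engine Z": (300, 420, 140, 190),
}

# Sort from best to worst scaling, as in the table format used in this thread.
ranking = sorted(
    ((name, scaling_wilo(*counts)) for name, counts in results.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, value) in enumerate(ranking, start=1):
    print(f"{rank:2d} {name:<10} : {value:6.1f}")
```

The guard against zero wins or losses is a design choice of this sketch: the ratio, and hence the Wilo, is undefined in that case.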
-
jhellis3
- Posts: 548
- Joined: Sat Aug 17, 2013 12:36 am
Re: Scaling of engines from FGRL rating list.
The problem with all of this is how one interprets the data depends on one's frame of reference.
For example, swap the order around, and I imagine SF will be near the top as the most "graceful" failure as time approaches 0.
So is it really a matter of some engines scaling well with increasing time, or of other engines scaling better with decreasing time?
The other effect unaccounted for is natural Elo compression which will occur with greater time. Obviously, those at the top of the heap (at STC) have the most to lose...
That isn't to say engines won't scale differently with time, but the above factors certainly need to be accounted for and will likely compress the actual scaling differences considerably.
-
jhellis3
- Posts: 548
- Joined: Sat Aug 17, 2013 12:36 am
Re: Scaling of engines from FGRL rating list.
A good example of Elo compression can be seen comparing SF7 to SF8.
Here we see the ratings differential compress from 78 Elo (10 minutes) to 61 Elo (60 minutes), so 17 Elo.
And I'm pretty sure Stockfish is not out-scaling Stockfish...
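One commonly cited mechanism for this kind of compression is a higher draw rate at longer time controls; that is an assumption here, not something stated in the post. A minimal sketch with hypothetical game counts (not FGRL data) shows how extra draws shrink the Elo difference while the drawless Wilo difference stays unchanged. The function names `elo_diff` and `wilo_diff` are illustrative.

```python
import math


def elo_diff(wins, losses, draws):
    """Logistic Elo difference from the score, counting a draw as half a point."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return 400.0 * math.log10(score / (1.0 - score))


def wilo_diff(wins, losses):
    """Drawless rating difference: 400*log10(W/L)."""
    return 400.0 * math.log10(wins / losses)


# Hypothetical counts: same win/loss ratio, but many more draws at long TC.
short_tc = dict(wins=300, losses=100, draws=100)
long_tc = dict(wins=300, losses=100, draws=600)

for label, games in (("short TC", short_tc), ("long TC", long_tc)):
    elo = elo_diff(**games)
    wilo = wilo_diff(games["wins"], games["losses"])
    print(f"{label}: Elo diff = {elo:.1f}, Wilo diff = {wilo:.1f}")
```

Under these assumed counts the Elo difference drops at the long time control purely because of the added draws, which is one reason a drawless measure may separate compression from genuine scaling differences.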