lazy smp questions

Laskos · Post by **Laskos** » Fri Sep 11, 2015 6:21 am

bob wrote:
Michel wrote:
Robert Hyatt wrote:You have to look at all the data. For example, look at the average opponent rating for cheng 4cpu vs cheng 1cpu. 1cpu played against an opponent average about 50 Elo stronger than cheng 4cpu. What would you expect that to cause? Make cheng 1cpu look weaker? You can't compare Elo numbers between partially or fully disjoint sets of opponents...
Why not? As long if the graph is connected the comparison is fine.

If A plays B and B plays C and C plays D you can still compare A and D. The comparison via intermediate engines just blows up the error bars.
That was my point. If the average ratings for player A's opponents is X, and the average rating for player B's opponents is X+50, it is going to be VERY difficult to compare their ratings with any accuracy and use the resulting Elo numbers to predict outcome between the two versions. The two versions of the original program are different, the average opponents are different, WHICH is responsible for the Elo gain or loss?

So in that specific CEGT comparison, the error bars are not +/-26. They are more like +/- 75...

In this case, saying A is +130 better than B is quite inaccurate. It is most likely better, to be sure. But how much better is much harder to determine without more data points.

Bob, the error bars are those shown by rating list, which uses one of the tools like Ordo or BayesElo. I took several days ago an extreme case to check the usual logisitc ELO model, and it worked quite well even with extreme values. Komodo 9.2 versus much weaker SOS 3. Then SOS 3 versus still much weaker Bikjump, a "beginner" engine, to get the rating of top engine Komodo 9.2 against the "beginner" engine Bikjump via the "intermediate" engine SOS 3:

Score of Komodo 9.2 vs SOS 3: 9865 - 32 - 103 [0.992] 10000
ELO difference: 830

Score of SOS 3 vs Bikjump: 6863 - 636 - 514 [0.889] 8013
ELO difference: 361

The ELO model predicts that Komodo 9.2 is stronger than Bikjump by 830+361 = 1191 ELO points, predicted only via the intermediate engine SOS 3. The "real" rating is:

Score of Komodo 9.2 vs Bikjump: 19967 - 6 - 27 [0.999] 20000
ELO difference: 1204

Prediction is only 13 ELO points off on the 1200 ELO difference span. The error is both statistical and due to the ELO model, but it shows that the model is not bad even in such extreme case.

michiguel · Post by **michiguel** » Fri Sep 11, 2015 7:00 am

Michel wrote:
Martin Sedlak wrote:Well, I get +136 on CEGT 40/4, 4 cores vs 1. Error bars are very high, sure.
The error bars are not so high. Combining them I get +-26. So still 110 elo in worst case (with 95% confidence).

I downloaded the CCRL 4040 database and calculated the rating wit Ordo, fixing a given engine_x to zero. In that way, the engine_x_4CPU will have a given rating with an error that reflects the error of the difference between 1 and 4 CPU. Then, everything is taken into account. The errors are calculated from 100 simulations. The command is for Cheng 4.39

ordo -p ccrl4040.pgn -WD -o ccrl.txt -A "Cheng 4.39 64-bit" -a 0 -s100 -n4

Then, I did it for other versions and engines.

Code: Select all

Increase from 1 to 4 CPU &#40;95% confidence&#41;

Cheng  4.39     159 +/- 47
Cheng4 0.38     114 +/- 32
Cheng4 0.36c    117 +/- 35

Gaviota 1.0     111 +/- 27 &#40;has more games&#41;

Crafty 23.8     116 +/- 47
Crafty 24.1     108 +/- 39
Crafty 23.4     163 +/- 48
Crafty 23.3     105 +/- 42

That is good data to compare per se. But, if we assume that the scalability of those versions did not significantly change, we can do a weighted average. Then,

Increase from 1 to 4 CPU

Code: Select all

Cheng           124 +/- 22
Crafty          120 +/- 20
Gaviota 1.0     111 +/- 27

No matter how we look at it, we cannot statistically say that any of these are different.

Miguel

Dann Corbit · Post by **Dann Corbit** » Fri Sep 11, 2015 7:05 am

So then, being lazy is not so bad after all.

michiguel · Post by **michiguel** » Fri Sep 11, 2015 7:08 am

Dann Corbit wrote:So then, being lazy is not so bad after all.

Not at all...

Miguel

Desperado · Post by **Desperado** » Fri Sep 11, 2015 9:15 am

Good morning,

just an idea before the weekend. Although i have a ybwc like approach at hand i will definitely spent some time on lazy smp.
The main purpose and advantage i see is that it would be less error-prone with better maintenance characteristics and finally a progress compared to single cpu usage in strength.
Obviously at this point, i do not care if there are "theoretically" better solutions (for 512+ cores

for example).

The most natural way to start with it for me, would (will) be to use it in context to MTD or a MTD like approach.
I really wonder why nobody is talking about this combination. I think the ideas are obvious in the context.
But let's talk about it anyway...

I did not open another thread on this topic because i do not have the time to work on it right now, but i want to throw out the idea at least.
Maybe some others who are working on lazy smp at the moment can already start to explore this combination.

Happy weekend.

JVMerlino · Post by **JVMerlino** » Fri Sep 11, 2015 9:47 pm

michiguel wrote:
Dann Corbit wrote:So then, being lazy is not so bad after all.

Not at all...

Miguel

The wacky part of my implementation is that it takes quite a few centiseconds from the time the master process updates the slaves to the time when the master can start searching itself. Hence, finding Kb1 in Fine 70 at depth 8 with four cores.

With one core it needs to get to a more typical depth 20.

But, in either case, scores don't start to REALLY jump (to around the value of a queen) until depth 36, because that delay is no longer significant.

Code: Select all

 1    100      0            2 a1a2
 1    101      0            4 a1b2 &#40;1 KNPS&#41;
 2    100      0         1488 a1b2 a7b7 &#40;186 KNPS&#41;
 2    102      1         9792 a1b1 a7b7 &#40;699 KNPS&#41;
 3    102      2        16093 a1b1 a7b7 &#40;804 KNPS&#41;
 4    102      2        26233 a1b1 a7b7 &#40;936 KNPS&#41;
 5    100      3        34097 a1b1 a7b7 b1a2 b7c7 a2b3 &#40;1002 KNPS&#41;
 5    102      4        45160 a1b2 a7a8 &#40;1101 KNPS&#41;
 6    102      4        53188 a1b2 a7a8 &#40;1181 KNPS&#41;
 7    102      4        60931 a1b2 a7a8 &#40;1269 KNPS&#41;
 8    102      5        68999 a1b2 a7a8 &#40;1326 KNPS&#41;
 8    118      5        79182 a1b1! &#40;1389 KNPS&#41;
 8    134      6        86749 a1b1! &#40;1422 KNPS&#41;
 8    166      6        93449 a1b1! &#40;1460 KNPS&#41;
 8    214      6       101092 a1b1! &#40;1486 KNPS&#41;
 8    250      7       107746 a1b1 a7b7 &#40;1496 KNPS&#41;
 9    250      7       114858 a1b1 a7b7 &#40;1511 KNPS&#41;
10    250      8       121619 a1b1 a7b7 &#40;1501 KNPS&#41;
11    234      8       124573 a1b1? a7a8 &#40;1483 KNPS&#41;
11    266      8       126564 a1b1! &#40;1422 KNPS&#41;
11    218      9       127046 a1b1? a7b7 &#40;1380 KNPS&#41;
11    170      9       127578 a1b1? a7b7 &#40;1342 KNPS&#41;
11    106      9       128105 a1b1? a7b7 &#40;1293 KNPS&#41;
11    104     10       128803 a1b1 a7b7 b1c2 b7b8 c2b3 b8c7 b3c3 c7b7 c3c4 b7b6
c4d3 &#40;1262 KNPS&#41;
12    120     10       129225 a1b1! &#40;1185 KNPS&#41;
12    136     11       129278 a1b1! &#40;1154 KNPS&#41;
12    168     11       129331 a1b1! &#40;1124 KNPS&#41;
12    216     11       129384 a1b1! &#40;1087 KNPS&#41;
12    106     12       129514 a1b1 a7b7 b1c1 b7a8 c1b2 a8b8 b2c2 b8c7 c2b3 c7d7
b3c4 d7c7 &#40;1061 KNPS&#41;
13    122     13       129692 a1b1! &#40;997 KNPS&#41;
13    138     13       129757 a1b1! &#40;975 KNPS&#41;
13    170     13       129822 a1b1! &#40;961 KNPS&#41;
13    218     14       129887 a1b1! &#40;927 KNPS&#41;
13    104     14       130320 a1b1 a7b7 b1c1 b7a8 c1b2 a8b8 b2c2 b8c7 c2b3 c7d7
b3c4 d7c7 c4b5 &#40;911 KNPS&#41;
14    120     15       131870 a1b1! &#40;879 KNPS&#41;
14    136     15       138782 a1b1! &#40;895 KNPS&#41;
14    168     15       145713 a1b1! &#40;916 KNPS&#41;
14    216     16       153097 a1b1! &#40;945 KNPS&#41;
14    264     16       160817 a1b1 a7a8 &#40;968 KNPS&#41;
15    264     17       169998 a1b1 a7a8 &#40;994 KNPS&#41;
16    264     17       178685 a1b1 a7a8 &#40;1021 KNPS&#41;
17    264     17       188151 a1b1 a7a8 &#40;1051 KNPS&#41;
18    264     18       201033 a1b1 a7a8 &#40;1092 KNPS&#41;
19    264     18       210460 a1b1 a7a8 &#40;1119 KNPS&#41;
20    280     19       221316 a1b1! &#40;1146 KNPS&#41;
20    248     19       226881 a1b1? a7b7 &#40;1151 KNPS&#41;
20    268     20       232500 a1b1 a7b7 b1c1 b7a8 c1b2 a8b8 b2c2 &#40;1156 KNPS&#41;
21    268     21       261831 a1b1 a7b7 b1c1 b7a8 c1d1 a8b8 d1e2 b8c7 e2d3 &#40;1235
 KNPS&#41;
22    256     22       274845 a1b1 a7b7 b1c1 b7a8 c1d1 a8b8 d1e2 b8c7 e2d3 c7d7
d3c4 d7c7 c4b5 c7d7 b5b6 d7e7 b6c6 e7f7 c6d6 f7g7 d6e5 g7g6 &#40;1249 KNPS&#41;
23    272     23       296122 a1b1! &#40;1276 KNPS&#41;
23    288     23       306869 a1b1! &#40;1289 KNPS&#41;
23    281     24       310673 a1b1 a7b7 b1c1 b7a8 c1d1 a8b8 d1c2 b8c7 c2d3 c7d7
d3c4 d7c7 c4b3 c7d7 b3c3 d7e7 c3c4 e7f7 c4b5 f7e7 b5c6 &#40;1283 KNPS&#41;
24    281     25       335500 a1b1 a7b7 b1c1 b7a8 c1d1 a8b8 d1c2 b8c7 c2b3 c7d7
b3c3 d7c7 c3d3 c7d7 d3c4 d7c7 c4b5 c7d7 b5b6 &#40;1300 KNPS&#41;

jm

lkaufman · Post by **lkaufman** » Sat Sep 12, 2015 9:12 pm

Laskos wrote:
bob wrote:
Michel wrote:
Robert Hyatt wrote:You have to look at all the data. For example, look at the average opponent rating for cheng 4cpu vs cheng 1cpu. 1cpu played against an opponent average about 50 Elo stronger than cheng 4cpu. What would you expect that to cause? Make cheng 1cpu look weaker? You can't compare Elo numbers between partially or fully disjoint sets of opponents...
Why not? As long if the graph is connected the comparison is fine.

If A plays B and B plays C and C plays D you can still compare A and D. The comparison via intermediate engines just blows up the error bars.
That was my point. If the average ratings for player A's opponents is X, and the average rating for player B's opponents is X+50, it is going to be VERY difficult to compare their ratings with any accuracy and use the resulting Elo numbers to predict outcome between the two versions. The two versions of the original program are different, the average opponents are different, WHICH is responsible for the Elo gain or loss?

So in that specific CEGT comparison, the error bars are not +/-26. They are more like +/- 75...

In this case, saying A is +130 better than B is quite inaccurate. It is most likely better, to be sure. But how much better is much harder to determine without more data points.
Bob, the error bars are those shown by rating list, which uses one of the tools like Ordo or BayesElo. I took several days ago an extreme case to check the usual logisitc ELO model, and it worked quite well even with extreme values. Komodo 9.2 versus much weaker SOS 3. Then SOS 3 versus still much weaker Bikjump, a "beginner" engine, to get the rating of top engine Komodo 9.2 against the "beginner" engine Bikjump via the "intermediate" engine SOS 3:

Score of Komodo 9.2 vs SOS 3: 9865 - 32 - 103 [0.992] 10000
ELO difference: 830

Score of SOS 3 vs Bikjump: 6863 - 636 - 514 [0.889] 8013
ELO difference: 361

The ELO model predicts that Komodo 9.2 is stronger than Bikjump by 830+361 = 1191 ELO points, predicted only via the intermediate engine SOS 3. The "real" rating is:

Score of Komodo 9.2 vs Bikjump: 19967 - 6 - 27 [0.999] 20000
ELO difference: 1204

Prediction is only 13 ELO points off on the 1200 ELO difference span. The error is both statistical and due to the ELO model, but it shows that the model is not bad even in such extreme case.

This test strongly suggests that the Logistic model is more realistic than the normal (Gaussian) one for elo ratings, the opposite of your previous conclusion. Which model do you now think is more accurate, or would you call it a tossup at this point?

Laskos · Post by **Laskos** » Sat Sep 12, 2015 9:42 pm

lkaufman wrote:
Laskos wrote:
bob wrote:
Michel wrote:
Robert Hyatt wrote:You have to look at all the data. For example, look at the average opponent rating for cheng 4cpu vs cheng 1cpu. 1cpu played against an opponent average about 50 Elo stronger than cheng 4cpu. What would you expect that to cause? Make cheng 1cpu look weaker? You can't compare Elo numbers between partially or fully disjoint sets of opponents...
Why not? As long if the graph is connected the comparison is fine.

If A plays B and B plays C and C plays D you can still compare A and D. The comparison via intermediate engines just blows up the error bars.
That was my point. If the average ratings for player A's opponents is X, and the average rating for player B's opponents is X+50, it is going to be VERY difficult to compare their ratings with any accuracy and use the resulting Elo numbers to predict outcome between the two versions. The two versions of the original program are different, the average opponents are different, WHICH is responsible for the Elo gain or loss?

So in that specific CEGT comparison, the error bars are not +/-26. They are more like +/- 75...

In this case, saying A is +130 better than B is quite inaccurate. It is most likely better, to be sure. But how much better is much harder to determine without more data points.
Bob, the error bars are those shown by rating list, which uses one of the tools like Ordo or BayesElo. I took several days ago an extreme case to check the usual logisitc ELO model, and it worked quite well even with extreme values. Komodo 9.2 versus much weaker SOS 3. Then SOS 3 versus still much weaker Bikjump, a "beginner" engine, to get the rating of top engine Komodo 9.2 against the "beginner" engine Bikjump via the "intermediate" engine SOS 3:

Score of Komodo 9.2 vs SOS 3: 9865 - 32 - 103 [0.992] 10000
ELO difference: 830

Score of SOS 3 vs Bikjump: 6863 - 636 - 514 [0.889] 8013
ELO difference: 361

The ELO model predicts that Komodo 9.2 is stronger than Bikjump by 830+361 = 1191 ELO points, predicted only via the intermediate engine SOS 3. The "real" rating is:

Score of Komodo 9.2 vs Bikjump: 19967 - 6 - 27 [0.999] 20000
ELO difference: 1204

Prediction is only 13 ELO points off on the 1200 ELO difference span. The error is both statistical and due to the ELO model, but it shows that the model is not bad even in such extreme case.
This test strongly suggests that the Logistic model is more realistic than the normal (Gaussian) one for elo ratings, the opposite of your previous conclusion. Which model do you now think is more accurate, or would you call it a tossup at this point?

Just one test, I have several tending towards Gaussian, other towards logistic, some in between. Also, there seem to be differences between fixed depth, fixed time, self-play, etc. One has to take a large database of games in fairly normal conditions and perform hypothesis test.

Michel · Post by **Michel** » Sat Sep 19, 2015 11:46 am

bob wrote:
Michel wrote:
Robert Hyatt wrote:You have to look at all the data. For example, look at the average opponent rating for cheng 4cpu vs cheng 1cpu. 1cpu played against an opponent average about 50 Elo stronger than cheng 4cpu. What would you expect that to cause? Make cheng 1cpu look weaker? You can't compare Elo numbers between partially or fully disjoint sets of opponents...
Why not? As long if the graph is connected the comparison is fine.

If A plays B and B plays C and C plays D you can still compare A and D. The comparison via intermediate engines just blows up the error bars.
That was my point. If the average ratings for player A's opponents is X, and the average rating for player B's opponents is X+50, it is going to be VERY difficult to compare their ratings with any accuracy and use the resulting Elo numbers to predict outcome between the two versions. The two versions of the original program are different, the average opponents are different, WHICH is responsible for the Elo gain or loss?

So in that specific CEGT comparison, the error bars are not +/-26. They are more like +/- 75...

In this case, saying A is +130 better than B is quite inaccurate. It is most likely better, to be sure. But how much better is much harder to determine without more data points.

Ok I had a look to see if you have any point.

You object to the use of the Pythagorean formula to estimate the error bars on the elo difference between two engines.

I am going to simply the analysis a bit by assuming that we are dealing with a large database so that varying the elo of two engines has little effect on the average elo. A more precise analysis gives the same result.

Furthermore I am going to take a Bayesian viewpoint (elo's are random variables) since this is easier to explain.

The formula for the variance of the difference of two random variables is

Var(X-Y)=Var(X)+Var(Y)-2 rho(X,Y) * sqrt(Var(X) * Var(Y))

where rho(X,Y) is the correlation coefficient. So negative rho's are bad since they make us under estimate the error bars.

To see how negative the rho's can be I hacked up a BayesElo program in Python and applied it to the CCRL404 database (12,000,000 million games).

It seems the most negative correlation coefficient is -0.17 which makes us under estimate the error bars by at most 8%.

So I believe that your objection has been refuted and that we can safely use the Pythagorean formula.

Michel · Post by **Michel** » Sat Sep 19, 2015 12:29 pm

Oops, 12 million games (not 12,000,000 million).

lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions