HYATT: Here's Yet Another Testing Thread

bob · Post by **bob** » Sat Aug 16, 2008 7:29 pm

OK, I now have three complete runs from the cluster. Remember that the original intention was to make one of these long runs with version A, and then one with version A', to try to determine whether A' was better or not. Remi (author of BayesElo) suggested that the two tests not be rated separately, but that they be rated together, using a different name for A and A'. OK. With three runs, there are three tests to compare: A, B and C. The names I used are crafty-run1, crafty-run2 and crafty-run3, which are rated as different programs, although we know they are _exactly_ the same. The hope is that when I compare run1 to run2, I get the same result as when I compare run2 to run3, and run1 to run3.

So, on to the results:

Rank Name                   Elo    +    - games score oppo. draws 
   1 Glaurung 2-epsilon/5   111    5    5 15564   67%   -16   21% 
   2 Fruit 2.1               66    5    5 15564   61%   -16   24% 
   3 opponent-21.7           25    5    5 15564   56%   -16   33% 
   4 Glaurung 1.1 SMP        13    5    5 15564   54%   -16   20% 
   5 Crafty-run1            -16    4    3 38910   47%     7   23% 
   6 Crafty-run2            -17    4    4 38910   47%     7   24% 
   7 Arasan 10.0           -182    5    6 15564   28%   -16   19% 

Rank Name                   Elo    +    - games score oppo. draws 
   1 Glaurung 2-epsilon/5   111    5    5 15564   67%   -16   21% 
   2 Fruit 2.1               65    5    5 15564   61%   -16   24% 
   3 opponent-21.7           24    5    5 15564   56%   -16   33% 
   4 Glaurung 1.1 SMP        12    5    5 15563   54%   -16   20% 
   5 Crafty-run3            -15    3    4 38909   47%     6   23% 
   6 Crafty-run1            -16    4    4 38910   47%     6   23% 
   7 Arasan 10.0           -182    6    5 15564   28%   -16   19% 

Rank Name                   Elo    +    - games score oppo. draws 
   1 Glaurung 2-epsilon/5   113    5    5 15564   67%   -16   21% 
   2 Fruit 2.1               66    5    5 15564   61%   -16   23% 
   3 opponent-21.7           25    5    5 15564   56%   -16   33% 
   4 Glaurung 1.1 SMP        12    5    5 15563   54%   -16   21% 
   5 Crafty-run3            -14    3    4 38909   47%     6   23% 
   6 Crafty-run2            -17    3    4 38910   47%     6   24% 
   7 Arasan 10.0           -185    5    6 15564   28%   -16   19%

So, not bad. Two -17's and one -16. If you take the three tests by themselves:

Rank Name                   Elo    +    - games score oppo. draws 
   1 Glaurung 2-epsilon/5   105    7    7  7782   67%   -19   21% 
   2 Fruit 2.1               62    7    6  7782   61%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -19   20% 
   5 Crafty-run1            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -181    7    7  7782   29%   -19   19% 

Rank Name                   Elo    +    - games score oppo. draws 
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -19   21% 
   2 Fruit 2.1               64    7    6  7782   62%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP         9    7    6  7782   54%   -19   20% 
   5 Crafty-run2            -19    4    4 38910   47%     4   24% 
   6 Arasan 10.0           -188    7    7  7782   28%   -19   20% 

Rank Name                   Elo    +    - games score oppo. draws 
   1 Glaurung 2-epsilon/5   111    7    7  7782   67%   -17   21% 
   2 Fruit 2.1               63    6    7  7782   61%   -17   23% 
   3 opponent-21.7           21    6    6  7782   56%   -17   33% 
   4 Glaurung 1.1 SMP         9    7    6  7781   54%   -17   21% 
   5 Crafty-run3            -17    3    4 38909   47%     3   23% 
   6 Arasan 10.0           -187    6    7  7782   28%   -17   18%

I did not include the first three runs because I only have the BayesElo output and not the PGN. I have one more of these runs in progress; I will re-run all of this when that is done...

But the numbers are now pretty repeatable, for 6 consecutive runs, with none that are producing the non-overlapping Elo ranges that were happening previously with more than the expected frequency.

hgm · Post by **hgm** » Sat Aug 16, 2008 8:35 pm

Oh yeah? What frequency was that, exactly?

Dirt · Post by **Dirt** » Sat Aug 16, 2008 10:28 pm

Could you post the likelihood of superiority (LOS) tables? It looks like 2 vs 3 should be fairly large.

bob · Post by **bob** » Sat Aug 16, 2008 11:28 pm

hgm wrote:Oh yeah? What frequency was that, exactly?

Jesus. You have the attention span of a fruit fly. Maybe the last two big runs where they were separated by a margin you called "a 6-sigma event" hmmm???

Or any of the other runs I have posted since starting this about 1.5 years or so ago, where you also pointed out "this data is cherry-picked" or "this can't possibly happen as often as you are reporting". Etc. Any of those examples will do nicely so pick one.

bob · Post by **bob** » Sat Aug 16, 2008 11:30 pm

Dirt wrote:Could you post the likelihood of superiority (LOS) tables? It looks like 2 vs 3 should be fairly large.

Not sure I followed. You mean comparing crafty-run2 to crafty-run3????

hgm · Post by **hgm** » Sat Aug 16, 2008 11:58 pm

bob wrote:
hgm wrote:Oh yeah? What frequency was that, exactly?
Jesus. You have the attention span of a fruit fly. Maybe the last two big runs where they were separated by a margin you called "a 6-sigma event" hmmm???

Or any of the other runs I have posted since starting this about 1.5 years or so ago, where you also pointed out "this data is cherry-picked" or "this can't possibly happen as often as you are reporting". Etc. Any of those examples will do nicely so pick one.

That does not answer the question. As I stated in the other thread, I have only said that twice in the past year, and all the other data you then showed to 'corroborate' it actually subverted the claim because it was perfectly normal. So even within the set of results that you have posted here, the frequency was pretty low. (Although of course not as low as the 1-in-a-billion probability it should have.) But that does not mean a thing, as you actually only posted a couple of dozen results here, which is an extremely small subset of the data you have actually been taking.

How much data did you actually take that you did not post? 100 times more? 10,000 times more? How am I supposed to know the variance in that data that you did not post?

So I ask a perfectly normal question. You fail to answer it, and substitute a childish remark instead. Why is that? Could it in fact be that you have no idea what the frequency you are talking about actually is, and hope to hide that by diversion tactics?

Well, no such luck! Let me detail the question:

1) Which fraction of the runs you did which should have given equal results deviated from each other by no more than +/- 1 standard deviation?
2) Which fraction by no more than +/- 2 sigma?
3) In which fraction did the difference exceed 3 sigma, in either direction?
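
For reference, the three fractions being asked for can be compared against what a normal distribution predicts. The sketch below is only an illustration: `observed_fractions` and its inputs are hypothetical names, `sigma` is assumed to be the standard deviation of the difference between two runs, and the sample differences are made-up numbers, not data from this thread.

```python
import math

def expected_fractions():
    """Fractions predicted by a normal distribution:
    within 1 sigma, within 2 sigma, beyond 3 sigma (either direction)."""
    within_1 = math.erf(1 / math.sqrt(2))
    within_2 = math.erf(2 / math.sqrt(2))
    beyond_3 = math.erfc(3 / math.sqrt(2))
    return within_1, within_2, beyond_3

def observed_fractions(differences, sigma):
    """Tally measured Elo differences between nominally identical runs
    against the same three bands. `sigma` is the SD of the difference."""
    if not differences:
        raise ValueError("need at least one difference")
    z = [abs(d) / sigma for d in differences]
    n = len(z)
    return (
        sum(v <= 1 for v in z) / n,
        sum(v <= 2 for v in z) / n,
        sum(v > 3 for v in z) / n,
    )

if __name__ == "__main__":
    w1, w2, b3 = expected_fractions()
    print(f"expected: {w1:.1%} within 1 sigma, {w2:.1%} within 2, {b3:.2%} beyond 3")
    # Hypothetical differences, for illustration only.
    print(observed_fractions([0.5, -1.0, 2.0, -3.5, 8.0], sigma=2.5))
```

With enough run pairs, observed fractions far from roughly 68%, 95% and 0.3% would suggest the runs are not behaving like independent samples of the same quantity.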

If you want to create the impression that your statements are based on fact, rather than imagination, now is your chance...

Dirt · Post by **Dirt** » Sun Aug 17, 2008 12:32 am

bob wrote:
Dirt wrote:Could you post the likelihood of superiority (LOS) tables? It looks like 2 vs 3 should be fairly large.
Not sure I followed. You mean comparing crafty-run2 to crafty-run3????

Yes, of course. Have you used BayesElo to produce a LOS table yet?
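
As a rough stand-in until a real LOS table is available, the likelihood of superiority can be approximated from the posted Elo values and error bars. This is not how BayesElo itself produces its table; the sketch assumes the two ratings have independent normal errors, that the quoted +/- values are 95% bounds, and it averages the asymmetric bounds. The function name `approx_los` is made up for this example.

```python
import math

def approx_los(elo_a, plus_a, minus_a, elo_b, plus_b, minus_b):
    """Approximate probability that A is stronger than B.

    Assumes independent normal errors and 95% error bars
    (half-width = 1.96 * SD); asymmetric bars are averaged.
    """
    sd_a = (plus_a + minus_a) / 2.0 / 1.96
    sd_b = (plus_b + minus_b) / 2.0 / 1.96
    sd_diff = math.sqrt(sd_a ** 2 + sd_b ** 2)
    z = (elo_a - elo_b) / sd_diff
    # Standard normal CDF written with erf.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

if __name__ == "__main__":
    # Crafty-run3 (-14, +3/-4) vs Crafty-run2 (-17, +3/-4) from the third table.
    print(f"approx LOS run3 over run2: {approx_los(-14, 3, 4, -17, 3, 4):.1%}")
```

Because both runs played the same opponents in a joint rating, their errors are not truly independent, so the number this prints is only indicative.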

bob · Post by **bob** » Sun Aug 17, 2008 2:30 am

Dirt wrote:
bob wrote:
Dirt wrote:Could you post the likelihood of superiority (LOS) tables? It looks like 2 vs 3 should be fairly large.
Not sure I followed. You mean comparing crafty-run2 to crafty-run3????
Yes, of course. Have you used BayesElo to produce a LOS table yet?

Not on any results I have published here, no. To date, the issue has been Elo +/- error, where the ugly cases had two different Elo values such that even when you took max(elo(i)) - error, it was still greater than min(elo(i)) + error. Those were odd enough to make me want to first get a stable elo(i) set of data, and then start trying to determine how many games would be necessary to give me some sense that A' is better than A (or not).
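
The non-overlap condition described above can be written out as a small check. This is a minimal sketch: the `Rating` type and `intervals_disjoint` function are hypothetical names, and the second example uses invented values purely to show a disjoint case.

```python
from typing import NamedTuple

class Rating(NamedTuple):
    name: str
    elo: float
    plus: float   # upper error bar from the "+" column
    minus: float  # lower error bar from the "-" column

def intervals_disjoint(a: Rating, b: Rating) -> bool:
    """True when max(elo) - error is still greater than min(elo) + error."""
    hi, lo = (a, b) if a.elo >= b.elo else (b, a)
    return hi.elo - hi.minus > lo.elo + lo.plus

if __name__ == "__main__":
    # Values taken from the third combined table in this thread.
    run3 = Rating("Crafty-run3", -14, 3, 4)
    run2 = Rating("Crafty-run2", -17, 3, 4)
    print(intervals_disjoint(run3, run2))  # intervals overlap here

    # Invented values, only to show what an "ugly" case looks like.
    x = Rating("hypothetical-x", 10, 4, 4)
    y = Rating("hypothetical-y", -5, 4, 4)
    print(intervals_disjoint(x, y))
```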

swami · Post by **swami** » Sun Aug 17, 2008 8:01 am

HYATT: Here's Yet Another Testing Thread

That's creative

How did you come up with that?!

Like, just thinking what possible letters/words (HY) you could substitute to already existing one YATT, to match your last name?

Sorry for OT, but thumbs up for that!

hgm · Post by **hgm** » Sun Aug 17, 2008 8:15 am

bob wrote: To date, the issue has been Elo +/- error, where the ugly cases had two different Elo values that even when you took max(elo(i)) - error, it was still greater than min(elo(i)) + error.

Perhaps it is good to point out here that it is quite normal for this to happen, occasionally:

The error bars given by BayesElo are not hard limits, but only 95% confidence intervals. That means that in 5% of the cases, the 'true' value of the performance rating will lie outside the quoted error range.

When you compare two values that each have an independent statistical error, the error bar in their difference is not the simple sum of their individual error bars; the errors have to be added in quadrature: square them, add, and take the square root. For two equal error bars this means the error bar of the rating difference is 1.41 times as large as that in an individual result. In practice, this means that the 95% intervals of the individual ratings will fail to overlap about once every 178 times you measure identical quantities. If such 'flukes' happen any less frequently than that, you should start to become suspicious that there is something wrong with your testing (e.g. too many positions in your test set that give a dead-cert outcome, and do not help to differentiate the quality of the engines you are testing).

For the math-inclined: the 1:178 probability is obtained as follows: a 95% confidence interval is +/- 1.96*SD, where SD (aka sigma) is the standard deviation of the normal distribution describing the statistical sampling noise. So for non-overlapping intervals, the central values are at least 3.92 sigma apart. But you have to compare that to the sigma of the difference, which is larger by a factor 1.41, or you would be comparing apples with oranges. So the central values are 3.92/1.41 = 2.77 sigma apart. Looking in the table of the normal distribution, one sees the probability to exceed 2.77 sigma, in either direction, is 0.56%.
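
The arithmetic above can be reproduced without a printed table. The following is a small sketch using only the Python standard library; the function names are made up for this example, and it assumes two independent measurements with equal standard deviations, as in the argument above.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2.0))

def non_overlap_probability(confidence_z=1.96):
    """Chance that two equal-width confidence intervals around
    measurements of the same quantity fail to overlap.

    The intervals are disjoint when the central values differ by more
    than 2 * confidence_z individual SDs; the SD of the difference of
    two independent values is sqrt(2) times the individual SD.
    """
    z_diff = 2.0 * confidence_z / math.sqrt(2.0)
    return z_diff, two_sided_tail(z_diff)

if __name__ == "__main__":
    z, p = non_overlap_probability()
    print(f"{z:.2f} sigma of the difference, p = {p:.2%}, about 1 in {1 / p:.0f}")
```

Computed without rounding the intermediate values, the odds come out within a unit or so of the 1:178 figure, which was obtained from table values rounded to two decimals.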
