A word for casual testers

Don · Post by **Don** » Wed Dec 26, 2012 10:15 am

Werner wrote:Hi Don,
I have 2 examples:
Nemo 1.0.1 x64 - Toga II 1.4 Beta5c 1CPU  11.0 - 9.0  55.00%   
Nemo 1.0.1 x64 - Toga II 1.4.2 JD 1CPU  12.5 - 7.5  62.50%   
Nemo 1.0.1 x64 - Toga II 1.4.3JDbeta19a  10.5 - 9.5  52.50%   
Nemo 1.0.1 x64 - Toga II 2.02 JA  13.0 - 7.0  65.00%   
Nemo 1.0.1 x64 - Toga Returns 1.0  11.0 - 9.0  55.00% 
Toga II 2.02 is not 30 Points ahead of Toga II 1.4.3JDbeta19a: wrong?

or

924 Delphil 2.9g w32 1CPU 2321
Delphil 2.9g x64 1CPU    2217 - Djinn 0.969 x64          2356   15.5 - 34.5    +7/-26/=17    31.00%
Delphil 2.9g x64 1CPU    2267 - Rodin 4.0                2330   20.5 - 29.5    +13/-22/=15    41.00%
It seems Delphil 2.9g x64 1CPU cannot reach the Rating of 32bit Version: wrong ?

I'm not sure what you are asking. When you play a very short match the results cannot be trusted: the longer the match, the more confidence you can have in the final result. This applies not just to which engine is stronger, but also to how large the difference between them is.

This is because the outcome of a game has a lot of randomness built in. If you and I play a game and you are 50 Elo stronger, I still have a good chance of beating you. 50 Elo isn't much; all it means is that you are slightly more likely to win.
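To put a number on "slightly more likely", here is a minimal sketch using the standard logistic Elo expectation formula. The function name `expected_score` is my own choice for illustration, and the score counts a draw as half a point, so it is not a pure win probability.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (win = 1, draw = 0.5) for the side that is
    elo_diff points stronger, under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Print the expectation for a few rating gaps, including the 50 Elo case.
for diff in (0, 50, 100, 200):
    print(f"{diff:>4} Elo stronger -> expected score {expected_score(diff):.3f}")
```

A 50 Elo edge works out to an expected score of roughly 57%, which leaves the weaker side plenty of room to win individual games.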

It doesn't mean the results are wrong; it only means that you should not place too much confidence in the answer.
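As a rough illustration of how little a 20-game match pins down, the sketch below puts an approximate 95% interval around two of the results Werner listed (13.0 - 7.0 and 10.5 - 9.5). It assumes a normal approximation to the binomial and ignores draws, which makes the interval somewhat too wide, and at 20 games the approximation is crude anyway. The helper names are hypothetical.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    if score <= 0.0:
        return float("-inf")
    if score >= 1.0:
        return float("inf")
    return -400.0 * math.log10(1.0 / score - 1.0)


def score_interval(points: float, games: int, z: float = 1.96):
    """Approximate confidence interval for the true score fraction.

    Assumption: binomial standard error (draws ignored), normal approximation.
    """
    p = points / games
    se = math.sqrt(p * (1.0 - p) / games)
    return max(p - z * se, 0.0), min(p + z * se, 1.0)


for points, games in ((13.0, 20), (10.5, 20)):
    lo, hi = score_interval(points, games)
    print(f"{points}/{games}: score between {lo:.2f} and {hi:.2f}, "
          f"Elo between {elo_from_score(lo):+.0f} and {elo_from_score(hi):+.0f}")
```

Both intervals include a 50% score, so under these assumptions neither 20-game result shows which engine is stronger, let alone separates the two Toga versions from each other.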

In my high school, if you played a game against someone and beat them, it was assumed that you were the stronger player. A one-game sample doesn't prove anything. However, that doesn't mean you are not stronger; it just means you need a lot more games to prove it.

I like tennis. Same thing. Sometimes a match goes 5 sets. The player who wins the match did not win every game or even every set.

lucasart · Post by **lucasart** » Wed Dec 26, 2012 10:31 am

Don wrote: You may notice that in some cases
someone will make a minor modification to some open source program and
based on some 100 game match declare a breakthrough.

This is SO true !

The world is swamped with derivatives of IvanHoe and Stockfish, that claim to be some kind of breakthrough, because they solved this position faster than the original, or whatever futile criterion they may find.

But proper testing always confirms that the derivatives are weaker than the original. It takes a lot more work and humility to improve significantly on something like IvanHoe or Stockfish...

Laskos · Post by **Laskos** » Wed Dec 26, 2012 10:34 am

Adam Hair wrote:
Don wrote:

Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.

That also looks a lot like the cumulative distribution function of the normal distribution with standard deviation 14 Elo points, i.e. erf(z/14), or even better normalized, (1+erf(z/14))/2. It shows that the results in Don's test are pretty normally distributed.

Kai

mar · Post by **mar** » Wed Dec 26, 2012 11:48 am

lucasart wrote: This is SO true !

The world is swamped with derivatives of IvanHoe and Stockfish, that claim to be some kind of breakthrough, because they solved this position faster than the original, or whatever futile criterion they may find.

Yes, unfortunately.
Some creative guys spit out 3000+ engines like a volcano, those who never had to debug their own move generator of course; no problem as long as they comply with the license, sure.
What I don't understand, however, is that these get room in some tournaments (yes, I'm referring to CCRL). PS: I wonder why there's still the grey area; quite pointless nowadays IMHO.

So basically they test x versions of the same engine, just renamed and "improved". All that remains is to test hexedited masterpieces.

Laskos · Post by **Laskos** » Wed Dec 26, 2012 12:09 pm

Don wrote:

Don

There is not much need to test this, as there is a theoretical background for these normal distributions. If A=3003 is the average Elo, N=120 the number of bins, and S=14 points the standard deviation for each bin (300 games), then one only needs to plot the function N*(erf((x - A)/S) + 1)/2.
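For anyone who wants to tabulate that curve, here is a minimal sketch with the values of A, N and S taken from the post. It evaluates the formula as posted and, next to it, the textbook normal CDF with standard deviation S, which has an extra factor of sqrt(2) in the denominator. Which of the two was intended is not stated in the thread, so treat the second function as my assumption.

```python
import math

A = 3003.0  # average Elo (from the post)
N = 120     # number of bins (from the post)
S = 14.0    # standard deviation per 300-game bin, in Elo (from the post)


def curve_as_posted(x: float) -> float:
    """N*(erf((x - A)/S) + 1)/2, exactly as written in the post."""
    return N * (math.erf((x - A) / S) + 1.0) / 2.0


def curve_normal_cdf(x: float) -> float:
    """N times the normal CDF with mean A and standard deviation S.

    Assumption: uses the textbook form with S*sqrt(2) in the denominator.
    """
    return N * (math.erf((x - A) / (S * math.sqrt(2.0))) + 1.0) / 2.0


# Tabulate both curves across roughly three standard deviations either side.
for x in range(2960, 3050, 10):
    print(f"{x}: as posted {curve_as_posted(x):6.1f}   normal CDF {curve_normal_cdf(x):6.1f}")
```

Both curves pass through N/2 at x = A; the version as posted is simply steeper, corresponding to a normal distribution with standard deviation S/sqrt(2).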

Adam Hair · Post by **Adam Hair** » Wed Dec 26, 2012 1:55 pm

Still, it was another way to demonstrate that the result of a match is itself a random variable, and that short matches have more variability than longer matches.
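The same point can be seen with simulated games instead of real ones. This is only a sketch under assumed numbers: two equal engines with 35% wins each and 30% draws (values chosen for illustration, not taken from Don's data), independent games, and a fixed seed so the run is repeatable.

```python
import random
import statistics


def play_match(games: int, p_win: float, p_draw: float, rng: random.Random) -> float:
    """Simulate one match and return the score fraction of the first engine."""
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < p_win:
            points += 1.0
        elif r < p_win + p_draw:
            points += 0.5
    return points / games


def score_spread(games: int, trials: int = 2000, p_win: float = 0.35,
                 p_draw: float = 0.30, seed: int = 1):
    """Mean and standard deviation of the match score over many simulated matches."""
    rng = random.Random(seed)
    results = [play_match(games, p_win, p_draw, rng) for _ in range(trials)]
    return statistics.mean(results), statistics.stdev(results)


for games in (20, 300):
    mean, spread = score_spread(games)
    print(f"{games:>3}-game matches: mean score {mean:.3f}, std dev {spread:.3f}")
```

The engines are identical by construction, yet the 20-game matches scatter several times more widely around 50% than the 300-game matches do.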

Adam Hair · Post by **Adam Hair** » Wed Dec 26, 2012 2:00 pm

Laskos wrote:
Adam Hair wrote:
Don wrote:

Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.
That also looks a lot like the cumulative distribution function of the normal distribution with standard deviation 14 Elo points, i.e. erf(z/14), or even better normalized, (1+erf(z/14))/2. It shows that the results in Don's test are pretty normally distributed.

Kai

Yeah, I agree. I had the Elo model on my brain, and so I wrote logistic distribution. Then again, the logit and probit functions are very much alike.
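How alike the two are can be checked numerically. The sketch below compares the standard normal CDF with a logistic CDF rescaled by the commonly used constant 1.702; that constant is my choice for the comparison and is not something mentioned in this thread.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF (the inverse of the probit function)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logistic_cdf(x: float, scale: float = 1.0) -> float:
    """Logistic CDF (the inverse of the logit function) with the given scale."""
    return 1.0 / (1.0 + math.exp(-x / scale))


# Assumption: rescale the logistic so it tracks the standard normal closely.
SCALE = 1.0 / 1.702

# Largest vertical gap between the two curves on a fine grid from -6 to 6.
xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(normal_cdf(x) - logistic_cdf(x, SCALE)) for x in xs)
print(f"largest gap between the two CDFs: {max_gap:.4f}")
```

With that rescaling the two curves stay within about one percentage point of each other everywhere, so on a plot like Don's they would be hard to tell apart by eye.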

Don · Post by **Don** » Wed Dec 26, 2012 2:04 pm

Laskos wrote:
Adam Hair wrote:
Don wrote:

Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.
That also looks a lot like the cumulative distribution function of the normal distribution with standard deviation 14 Elo points, i.e. erf(z/14), or even better normalized, (1+erf(z/14))/2. It shows that the results in Don's test are pretty normally distributed.

Kai

I do want to point out that I usually plot using gnuplot with curve smoothing ("smooth bezier").

The unsmoothed lines look the same; they are just a little more jagged.

Don · Post by **Don** » Wed Dec 26, 2012 2:08 pm

Laskos wrote:
Don wrote:

Don
There is not much need to test this, as there is a theoretical background for these normal distributions. If A=3003 is the average Elo, N=120 the number of bins, and S=14 points the standard deviation for each bin (300 games), then one only needs to plot the function N*(erf((x - A)/S) + 1)/2.

The reason I actually ran the numbers is because of the audience I was trying to appeal to - they won't necessarily respect theory but need to see data from real games.

Laskos · Post by **Laskos** » Wed Dec 26, 2012 2:16 pm

Don wrote:
Laskos wrote:
Don wrote:

Don
There is not much need to test this, as there is a theoretical background for these normal distributions. If A=3003 is the average Elo, N=120 the number of bins, and S=14 points the standard deviation for each bin (300 games), then one only needs to plot the function N*(erf((x - A)/S) + 1)/2.

The reason I actually ran the numbers is because of the audience I was trying to appeal to - they won't necessarily respect theory but need to see data from real games.

OK, I understand, though you are fighting a losing battle with the audience; they won't listen.

Kai
