A word for casual testers


Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A word for casual testers

Post by Don »

Werner wrote:Hi Don,
I have 2 examples:

Nemo 1.0.1 x64 - Toga II 1.4 Beta5c 1CPU     11.0 -  9.0   55.00%
Nemo 1.0.1 x64 - Toga II 1.4.2 JD 1CPU       12.5 -  7.5   62.50%
Nemo 1.0.1 x64 - Toga II 1.4.3JDbeta19a      10.5 -  9.5   52.50%
Nemo 1.0.1 x64 - Toga II 2.02 JA             13.0 -  7.0   65.00%
Nemo 1.0.1 x64 - Toga Returns 1.0            11.0 -  9.0   55.00%
Toga II 2.02 is not 30 Points ahead of Toga II 1.4.3JDbeta19a: wrong?

or

924 Delphil 2.9g w32 1CPU 2321

Delphil 2.9g x64 1CPU    2217 - Djinn 0.969 x64          2356   15.5 - 34.5    +7/-26/=17    31.00%
Delphil 2.9g x64 1CPU    2267 - Rodin 4.0                2330   20.5 - 29.5    +13/-22/=15    41.00%
It seems Delphil 2.9g x64 1CPU cannot reach the Rating of 32bit Version: wrong ?
I'm not sure what you are asking. When you play a very short match, the results cannot be trusted; the longer the match, the more confidence you can have in the final result. This applies not just to who is stronger, but also to how large the difference between them is.

This is because the outcome of a single game has a lot of randomness built in. If you and I play a game and you are 50 Elo stronger, I still have a good chance of beating you. 50 Elo isn't much; all it means is that you are slightly more likely to win.
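
To put a rough number on "slightly more likely", here is a minimal sketch assuming the usual logistic Elo model. The function name `expected_score` is only illustrative, and the value it returns is an expected score (wins plus half of the draws), not a pure win probability.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (win = 1, draw = 0.5) for the side that is elo_diff
    points stronger, assuming the usual logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    # Print the expected score for a few rating edges.
    for diff in (0, 50, 100, 200):
        print(f"{diff:+4d} Elo -> expected score {expected_score(diff):.3f}")
```

Under that model a 50 Elo edge works out to an expected score of about 57%, so the weaker side still collects roughly 43% of the points.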

It doesn't mean the results are wrong; it only means that you should not place too much confidence in the answer.

In my high school, if you played a game against someone and beat them, it was assumed that you were the stronger player. A one-game sample doesn't prove anything. However, that doesn't mean you are not stronger; it just means you need a lot more games to prove it.

I like tennis. Same thing. Sometimes a match goes 5 sets. The player who wins the match did not win every game or even every set.
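
To see how little a 20-game match pins down, here is a rough sketch that turns a match result into an approximate 95% interval for the Elo difference. It assumes the logistic Elo model and a plain normal approximation, and it bounds the per-game variance by p*(1-p), which is on the cautious side when there are draws. The helper names `elo_from_score` and `elo_interval` are made up for this example.

```python
import math


def elo_from_score(score: float) -> float:
    """Invert the logistic Elo model: score fraction -> Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(points: float, games: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for the Elo difference implied by a result.

    Assumption: per-game variance is at most p*(1-p), which overstates the
    spread when draws occur, so the interval is conservative.
    """
    p = points / games
    half_width = z * math.sqrt(p * (1.0 - p) / games)
    # Clamp so the logarithm stays defined for very lopsided results.
    low = min(max(p - half_width, 1e-6), 1.0 - 1e-6)
    high = min(max(p + half_width, 1e-6), 1.0 - 1e-6)
    return elo_from_score(low), elo_from_score(high)


if __name__ == "__main__":
    for points, games in ((13.0, 20), (10.5, 20), (650.0, 1000)):
        low, high = elo_interval(points, games)
        print(f"{points}/{games}: Elo difference roughly {low:+.0f} to {high:+.0f}")
```

For the 20-game results quoted above, both 13.0 - 7.0 and 10.5 - 9.5 give intervals several hundred Elo wide that overlap heavily, so those matches cannot separate the two Toga versions. The same 65% score over 1000 games gives a much narrower interval.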
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: A word for casual testers

Post by lucasart »

Don wrote: You may notice that in some cases someone will make a minor modification to some open source program and, based on some 100-game match, declare a breakthrough.
This is SO true!

The world is swamped with derivatives of IvanHoe and Stockfish that claim to be some kind of breakthrough because they solved some position faster than the original, or by whatever futile criterion they may find.

But proper testing always confirms that the derivatives are weaker than the original. It takes a lot more work and humility to improve significantly on something like IvanHoe or Stockfish...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A word for casual testers

Post by Laskos »

Adam Hair wrote:
Don wrote:Image


Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.
That also looks a lot like the cumulative distribution function of the normal distribution with standard deviation 14 Elo points, i.e. erf(z/14), or better, normalized: (1 + erf(z/14))/2. It shows that the results in Don's test are pretty normally distributed.

Kai
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A word for casual testers

Post by mar »

lucasart wrote: This is SO true!

The world is swamped with derivatives of IvanHoe and Stockfish that claim to be some kind of breakthrough because they solved some position faster than the original, or by whatever futile criterion they may find.
Yes, unfortunately.
Some creative guys spit out 3000+ engines like a volcano, the ones who never had to debug their own move generator, of course. No problem as long as they comply with the license, sure.
What I don't understand, however, is that these get room in some tournaments (yes, I'm referring to CCRL). PS: I wonder why there's still the grey area; quite pointless nowadays IMHO :)
So basically they test x versions of the same engine, just renamed and "improved". All that remains is to test hexedited masterpieces.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A word for casual testers

Post by Laskos »

Don wrote:Image


Don
There is not much need to test this, as there is a theoretical background for these normal distributions. If A=3003 is the average Elo, N=120 the number of bins, and S=14 points the standard deviation for each bin (300 games), then one only needs to plot the function N*(erf((x - A)/S) + 1)/2.
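
For anyone who wants to evaluate that curve rather than read it off a plot, here is a small sketch using the values from the post. The function names are made up for the example. One general fact about erf worth keeping in mind: the normal CDF with standard deviation sigma is (1 + erf((x - A)/(sigma*sqrt(2))))/2, so in the formula as written S plays the role of sigma*sqrt(2). The second function shows the variant where 14 is taken as sigma itself.

```python
import math

A = 3003.0  # average Elo, from the post
N = 120     # number of bins, from the post
S = 14.0    # scale in Elo points, from the post


def curve_as_posted(x: float) -> float:
    """N*(erf((x - A)/S) + 1)/2, exactly as written in the post."""
    return N * (math.erf((x - A) / S) + 1.0) / 2.0


def curve_normal_cdf(x: float, sigma: float = S) -> float:
    """N times the normal CDF with mean A and standard deviation sigma."""
    return N * (math.erf((x - A) / (sigma * math.sqrt(2.0))) + 1.0) / 2.0


if __name__ == "__main__":
    # Tabulate both variants over the Elo range around the mean.
    for x in range(2970, 3040, 10):
        print(x, round(curve_as_posted(x), 1), round(curve_normal_cdf(x), 1))
```

Both variants pass through N/2 = 60 at x = A; they differ only in how quickly they rise around it.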

Image
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: A word for casual testers

Post by Adam Hair »

Still, it was another way to demonstrate that the result of a match is itself a random variable, and that short matches have more variability than longer matches.
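
The same point can be reproduced with a toy simulation. This is only a sketch: the two engines are hypothetical and equally strong, the 30% win / 40% draw / 30% loss split is an assumption chosen for illustration, and the function names are made up.

```python
import random
import statistics


def match_score(games: int, win: float, draw: float, rng: random.Random) -> float:
    """Play one simulated match and return the first engine's score fraction."""
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < win:
            points += 1.0
        elif r < win + draw:
            points += 0.5
    return points / games


def score_spread(games: int, matches: int = 1000, seed: int = 1) -> float:
    """Standard deviation of the match score over many repeated matches."""
    rng = random.Random(seed)
    # Assumed outcome probabilities: 30% wins, 40% draws, 30% losses.
    scores = [match_score(games, 0.30, 0.40, rng) for _ in range(matches)]
    return statistics.stdev(scores)


if __name__ == "__main__":
    for games in (20, 100, 300, 1000):
        print(f"{games:5d} games: score std dev ~ {score_spread(games):.3f}")
```

The spread of the match score shrinks roughly with the square root of the number of games, which is why a 20-game match scatters so much more than a 300-game one.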
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: A word for casual testers

Post by Adam Hair »

Laskos wrote:
Adam Hair wrote:
Don wrote:Image


Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.
That also looks a lot like the cumulative distribution function of the normal distribution with standard deviation 14 Elo points, i.e. erf(z/14), or better, normalized: (1 + erf(z/14))/2. It shows that the results in Don's test are pretty normally distributed.

Kai
Yeah, I agree. I had the Elo model on the brain, and so I wrote logistic distribution. Though the logit and probit functions are very much alike.
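
How alike the two are can be checked numerically. The sketch below compares the logistic CDF with a rescaled normal CDF; the scale factor 1.702 is a commonly quoted constant for matching the two curves, not something from this thread, and the function names are made up.

```python
import math


def logistic_cdf(x: float) -> float:
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_gap(scale: float = 1.702, lo: float = -8.0, hi: float = 8.0,
            steps: int = 3201) -> float:
    """Largest |logistic(x) - Phi(x / scale)| over an evenly spaced grid."""
    gap = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        gap = max(gap, abs(logistic_cdf(x) - normal_cdf(x / scale)))
    return gap


if __name__ == "__main__":
    print(f"largest gap between the two curves: {max_gap():.4f}")
```

With that rescaling the two curves stay within about 0.01 of each other everywhere, so on a plot like Don's it is hard to tell them apart by eye.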
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A word for casual testers

Post by Don »

Laskos wrote:
Adam Hair wrote:
Don wrote:Image


Don
That looks quite close to what it should look like, which is the quantile function of the logistic distribution.
That also looks a lot like the cumulative distribution function of the normal distribution with standard deviation 14 Elo points, i.e. erf(z/14), or better, normalized: (1 + erf(z/14))/2. It shows that the results in Don's test are pretty normally distributed.
I do want to point out that I usually plot using gnuplot with curve smoothing ("smooth bezier").

Without smoothing the lines look the same; they are just a little more jagged.

Kai
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A word for casual testers

Post by Don »

Laskos wrote:
Don wrote:Image


Don
There is not much need to test this, as there is a theoretical background for these normal distributions. If A=3003 is the average Elo, N=120 the number of bins, and S=14 points the standard deviation for each bin (300 games), then one only needs to plot the function N*(erf((x - A)/S) + 1)/2.

Image
The reason I actually ran the numbers is because of the audience I was trying to appeal to - they won't necessarily respect theory but need to see data from real games.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: A word for casual testers

Post by Laskos »

Don wrote:
Laskos wrote:
Don wrote:Image


Don
There is not much need to test this, as there is a theoretical background for these normal distributions. If A=3003 is the average Elo, N=120 the number of bins, and S=14 points the standard deviation for each bin (300 games), then one only needs to plot the function N*(erf((x - A)/S) + 1)/2.

Image
The reason I actually ran the numbers is because of the audience I was trying to appeal to - they won't necessarily respect theory but need to see data from real games.
Ok, I understand, though you are fighting a losing battle with the audience; they won't listen ;)

Kai