An objective test process for the rest of us?

Discussion of chess software programming and technical issues.


nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

My current working setup:

Code:

Rank  Engine          Score    Bl Po Ro Ce Z2 Ex Pi Va Mu At Ne Ed Sh Ed     S-B
 01   Blikskottel     18,0/26  ·· 01 11 =0 1= 1= 1= 11 10 =1 =1 10 11 01  225,75
 02   PolarEngine     17,5/26  10 ·· 00 00 11 =1 11 11 11 10 11 11 0= =1  213,50
 03   Roque           15,5/26  00 11 ·· 11 00 1= 11 == 10 11 00 01 11 10  195,50
 04   Cefap           15,0/26  =1 11 00 ·· 11 0= 10 00 11 10 == 11 00 11  194,00
 05   Zotron          14,5/26  0= 00 11 00 ·· =0 =1 01 0= 10 11 1= 11 11  169,50
 06   Exacto          14,0/26  0= =0 0= 1= =1 ·· 01 11 00 10 10 11 11 0=  172,25
 07   Piranha         13,0/26  0= 00 00 01 =0 10 ·· 10 11 11 1= 01 10 =1  153,00
 08   Vanillachess    12,5/26  00 00 == 11 10 00 01 ·· 10 01 10 01 1= 11  148,50
 09   Murderhole      11,5/26  01 00 01 00 1= 11 00 01 ·· 10 1= =0 01 01  147,25
 10   Atlanchess      11,5/26  =0 01 00 01 01 01 00 10 01 ·· 01 11 10 01  142,50
 11   Needle          11,5/26  =0 00 11 == 00 01 0= 01 0= 10 ·· 1= 01 1=  142,00
 12   Eden0.0.12 JA    9,5/26  01 00 10 00 0= 00 10 10 =1 00 0= ·· 11 10  116,25
 13   SharpChess2      9,0/26  00 1= 00 11 00 00 01 0= 10 01 10 00 ·· 10  119,00
 14   Eden_0.0.13      9,0/26  10 =0 01 00 00 1= =0 00 10 10 0= 01 01 ··  117,00
I should probably exclude Eden 0.0.12 JA from the actual lineup for making a decision, but other than that, this is what I'll use as a base.

What result would Eden 0.0.14 have to achieve before you would consider it better than Eden 0.0.13?
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

nczempin wrote:My current working setup:

Code:

Rank  Engine          Score    Bl Po Ro Ce Z2 Ex Pi Va Mu At Ne Ed Sh Ed     S-B
 01   Blikskottel     18,0/26  ·· 01 11 =0 1= 1= 1= 11 10 =1 =1 10 11 01  225,75
 02   PolarEngine     17,5/26  10 ·· 00 00 11 =1 11 11 11 10 11 11 0= =1  213,50
 03   Roque           15,5/26  00 11 ·· 11 00 1= 11 == 10 11 00 01 11 10  195,50
 04   Cefap           15,0/26  =1 11 00 ·· 11 0= 10 00 11 10 == 11 00 11  194,00
 05   Zotron          14,5/26  0= 00 11 00 ·· =0 =1 01 0= 10 11 1= 11 11  169,50
 06   Exacto          14,0/26  0= =0 0= 1= =1 ·· 01 11 00 10 10 11 11 0=  172,25
 07   Piranha         13,0/26  0= 00 00 01 =0 10 ·· 10 11 11 1= 01 10 =1  153,00
 08   Vanillachess    12,5/26  00 00 == 11 10 00 01 ·· 10 01 10 01 1= 11  148,50
 09   Murderhole      11,5/26  01 00 01 00 1= 11 00 01 ·· 10 1= =0 01 01  147,25
 10   Atlanchess      11,5/26  =0 01 00 01 01 01 00 10 01 ·· 01 11 10 01  142,50
 11   Needle          11,5/26  =0 00 11 == 00 01 0= 01 0= 10 ·· 1= 01 1=  142,00
 12   Eden0.0.12 JA    9,5/26  01 00 10 00 0= 00 10 10 =1 00 0= ·· 11 10  116,25
 13   SharpChess2      9,0/26  00 1= 00 11 00 00 01 0= 10 01 10 00 ·· 10  119,00
 14   Eden_0.0.13      9,0/26  10 =0 01 00 00 1= =0 00 10 10 0= 01 01 ··  117,00
I should probably exclude Eden 0.0.12 JA from the actual lineup for making a decision, but other than that, this is what I'll use as a base.

What result would Eden 0.0.14 have to achieve before you would consider it better than Eden 0.0.13?
Or, perhaps an easier question, given that 0.0.14 is not even trying to make the list yet: which of the programs would you argue is "significantly likely" to be stronger than Eden 0.0.13?
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine equally strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one that is equal to pass. So after a failed test you really should increase your standards.
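
For reference, a minimal sketch (in Python) of the arithmetic behind these numbers. The draw rate is an assumption, not something taken from the crosstable, so the thresholds only come out at roughly 2, 2.8 and 5.5 points.

Code:

import math

N = 26            # games in the gauntlet
draw_rate = 0.35  # assumed fraction of draws (an assumption, not measured)

# For two roughly equal engines the variance of a single game score (0, 1/2, 1)
# is 0.25 * (1 - draw_rate).
sigma_game  = math.sqrt(0.25 * (1.0 - draw_rate))
sigma_score = sigma_game * math.sqrt(N)      # ~2 points on a 26-game score
sigma_diff  = sigma_score * math.sqrt(2.0)   # ~2.8 points for a difference of two such scores

print("1 sigma on one score      :", round(sigma_score, 2))
print("1 sigma on a difference   :", round(sigma_diff, 2))
print("~95% (2 sigma) threshold  :", round(2.0 * sigma_diff, 2))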
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice, or 5.5 pts. So an engine equally strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger.
Okay, so if Cefap from this table were Eden 0.0.14, I could be 95 % confident that 14 is stronger than 13.

95 % could even be considered too high a requirement, as before trying the test, I have some reason to believe the new version is stronger.

Intuitively, I would have chosen Exacto, because of that full-point advantage. But then again I would not have chosen Needle despite the 2-point advantage over 12 JA.



So it is actually possible to be confident about such changes even with a mere 200 games, as long as the difference is high enough. That's what I've been trying to say all the time.

I'm pretty sure that Bob does not get such high differences, and thus he needs more games. Requiring smaller-granularity changes will increase the number of required games, as will using a common book.


So, how can I use this more formally; where can I look up the methodology; under which heading should I open the Statistics book?


Another thing: we were discussing the opening book issue. I think I would like to repeat this tournament with a common book and see how much the significance changes. My guess would be that the point spread will be much smaller, and thus the significance level may well lie outside of the actual results (so that even if 14 were to win the tournament by 0.5 points, that would still not be significant evidence that it is stronger).
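
The relevant heading is hypothesis testing: a one-sided test on match scores, using the normal approximation. A minimal sketch of what such a test could look like here, assuming the games are independent and an assumed draw rate; the scores plugged in at the bottom are taken from the crosstable above purely as an illustration.

Code:

import math

def p_value_one_sided(score_new, score_old, n_games, draw_rate=0.35):
    """Probability of a score difference at least this large if the two
    versions were in fact equally strong (normal approximation)."""
    sigma_game = math.sqrt(0.25 * (1.0 - draw_rate))    # per-game standard deviation
    sigma_diff = sigma_game * math.sqrt(2.0 * n_games)  # std. dev. of the score difference
    z = (score_new - score_old) / sigma_diff
    return 0.5 * math.erfc(z / math.sqrt(2.0))          # one-sided normal tail

# Illustration with numbers from the table: Cefap 15.0 vs Eden 0.0.13 9.0, 26 games each
print(p_value_one_sided(15.0, 9.0, 26))   # ~0.02 -> significant at the 95% level
print(p_value_one_sided(12.0, 9.0, 26))   # ~0.15 -> not significant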
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.
Please note that the Eden 0.0.12 mentioned here is the JA version, which is significantly stronger (by my intuition :-)) than the regular version.

Just to make sure that no-one says "HA! Caught you, Eden 13 is not stronger than 12!" :-)
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one that is equal to pass. So after a failed test you really should increase your standards.
That is a very, very good point.

Reminds me of those technical traders, incidentally.

I also think there's a name for this pitfall; isn't this "curve fitting" or something like that? I have heard about it in the context of my casual reading on machine learning/data mining, and it was always one of the subjects that intrigued me most.


Alas, I really want to improve my stats background. If you have some suggestions for literature, please name them. I am not a complete beginner and I do not shy away from mathematical treatments (actually, I would shy away from "Statistics for Dummies" even if that is perhaps the level I would need right now :-))
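
The usual statistical names for this pitfall are multiple testing or multiple comparisons; in the machine-learning reading it is closely related to overfitting (to the test results). A small sketch of the effect described above, plus the standard Bonferroni-style correction:

Code:

def false_pass_probability(tries, confidence=0.84):
    """Chance that at least one of `tries` equally-strong versions passes
    a one-sided test at the given confidence purely by luck."""
    return 1.0 - confidence ** tries

def expected_false_passes(tries, confidence=0.84):
    """Expected number of lucky passes among `tries` non-improvements."""
    return tries * (1.0 - confidence)

def corrected_confidence(tries, overall_alpha=0.05):
    """Per-test confidence needed (Bonferroni) so the overall false-pass
    rate over this many tries stays at `overall_alpha`."""
    return 1.0 - overall_alpha / tries

for k in (1, 3, 7):
    print(k, round(false_pass_probability(k), 2),
             round(expected_false_passes(k), 2),
             round(corrected_confidence(k), 3))
# At 7 tries the expected number of lucky passes is about 1, as noted above,
# and the per-test confidence would have to rise to ~99.3% to compensate.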
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:
bob wrote:
hgm wrote:
bob wrote: Using complete iterations is also bad. How do you balance the two programs so that their searches are equal? N plies for one can be equivalent to M plies for another, where M != N.
Equivalent in what respect? Time use? Playing strength?

I would only be doing this for programs that are very similar, two versions with a minor change between them. If you would be testing, say, addition of null-move pruning, this would indeed make little sense. But you would also not need to do it, because the Elo difference will be huge.
I am missing something. You are going to play A vs A'? I don't think that kind of testing is worth anything. You are going to play A vs a set of opponents and A' vs the same set? That is the case I was asking about. How can you pick some search depth D for A or A', vs some search E for opponent E1, F for opponent F1, etc, to make things reasonable? Do you adjust the depth during the game as material comes off or do you end up playing moves instantaneously in endgames?

I still believe the fairest and most accurate way to play games is to use a time limit, just like we do in real games. How you allocate time is a specific part of an engine. How it behaves when it fails low, when it fails high, etc. are all part of the thing. I would not want, for example, to take my evaluation function, and stick it into a vanilla program and test it that way.

The point is that I want to eliminate any randomness that can be eliminated. To this end I will also exploit the tendency of two closely related versions to play the same move in the same position to the maximum. Everything inducing variability for reasons not related to the change under test should therefore be eliminated. Scientific measurements should be done under controlled conditions.

But one important principle is to use a "high-impedance probe". You don't want the test methodology itself to bias the results or change them in any way. I think it best to eliminate all issues but time, and then deal with that one by playing enough games. Playing to fixed depth, for example, would influence how you write a parallel search. Yet in real games, it would be worthless.
Another way of saying this is that comparing results of different games, where the two versions are thinking about different positions, is a very indirect comparison, with a lot of noise due to selections of the positions. I want to compare the two engines by the moves they pick in the same position, so I should present them with the same position, rather than having them select their own positions independently. And then disturb them as little as possible when they are at it.
Then why play games? Why not just feed in thousands of positions, and compare the moves chosen by each version. Eliminate the duplicates and look at the positions where the two versions play a different move...
Games are needed to avoid bugs.
It is possible that some bug does not happen from a FEN position but does happen in the course of a game, so it is better to play games and see the difference.

Uri
By far the most common source of "bugs" is time usage/allocation. Which was my point. If you bypass that code by testing in some oddball way, the test is not very thorough...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:I am missing something. You are going to play A vs A'? I don't think that kind of testing is worth anything. You are going to play A vs a set of opponents and A' vs the same set? That is the case I was asking about. How can you pick some search depth D for A or A', vs some search E for opponent E1, F for opponent F1, etc, to make things reasonable? Do you adjust the depth during the game as material comes off or do you end up playing moves instantaneously in endgames?
I would play A and A' against the same set of opponents. To ensure a realistic time usage, A would play under a time management scheme where iterations are completely finished, but not started after a certain number of seconds. Just as you would usually do in an engine that always finishes an iteration.

Engine A' (and A", etc.) would be told how many iterations they should do (namely as many as A did) as long as they are searching on positions that A has played before. Only when they get into positions that no member of the A family has ever seen will they be allowed to determine the number of iterations by reading the clock. That number would be remembered, together with the position, for future use, when comparing A' with other, future versions of A.
But one important principle is to use a "high-impedance probe". You don't want the test methodology itself to bias the results or change them in any way. I think it best to eliminate all issues but time, and then deal with that one by playing enough games. Playing to fixed depth, for example, would influence how you write a parallel search. Yet in real games, it would be worthless.
I don't consider what I propose as a very invasive measurement technique. Finishing iterations is a very natural and acceptable way of time management (only ~5 Elo inferior to more elaborate schemes). And it doesn't really matter to the engine if it reads the clock or the iteration counter to decide if it should continue.
That is _exactly_ the kind of comment that makes me cringe. ~5 Elo? Based on what? Time usage can have a _far_ greater impact than that. As for "finishing iterations is very natural", it really isn't. Not since the original Chess 4.x, which is where iterated search was first seen, so far as I know. Very few actually try to figure out whether they can finish an iteration or not, because by its very nature that is difficult to guess. Quite often you might think you can, but the branching factor goes wild and the last iteration takes 10x longer than expected...



If your engine were strongly dependent on special time management, running an extra iteration (or some other emergency procedure) on a hefty score drop in the last iteration, I would simply count that as part of the previous iteration. I don't think there are real problems there. The main point is to resolve 'race conditions' between the clock and the deepening process only once, and then force all other engines that are in the same position to copy the outcome (stop / continue deepening), rather than having them resolve the same race condition themselves, in a possibly different way.
So somehow modify all the other engines to use something provided by the first? Are we trying to test/debug or introduce bugs to find? I'm not going to modify other programs and then debug that...

:)


Then why play games? Why not just feed in thousands of positions, and compare the moves chosen by each version. Eliminate the duplicates and look at the positions where the two versions play a different move...
Two reasons: Playing games is a convenient way to sample positions in proportion to the likelihood that the engine will work itself into such a position.

The second reason is that, when they play a different move, you would like to know which move was actually better. Of course I could use Rybka to decide that, but then I would just be reverse engineering Rybka (which is beneath me), and my engine would certainly never become any better than Rybka (which is my aim). So continuing the game is the way to determine which move was better.
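
For what it is worth, the bookkeeping for the iteration-replay scheme discussed above (replay the reference version's iteration counts in shared positions, fall back to the finish-the-iteration clock rule only in new positions) is small. A minimal sketch; `search_one_iteration` and `position_key` are hypothetical engine hooks, not real APIs.

Code:

import time

class IterationBook:
    """Records how many iterations the reference version A used in each
    position, so that A', A'', ... can replay exactly the same search
    effort whenever they reach a position A has already seen."""

    def __init__(self):
        self.iterations = {}                    # position key -> iteration count

    def think(self, position, budget_seconds, search_one_iteration, position_key):
        key = position_key(position)
        forced = self.iterations.get(key)       # None if no A-family member saw this position
        start = time.time()
        depth, best_move = 0, None
        while True:
            depth += 1
            best_move = search_one_iteration(position, depth)   # iterations are always finished
            if forced is not None:
                if depth >= forced:             # replay the reference version's stopping decision
                    break
            elif time.time() - start >= budget_seconds:
                break                           # clock rule: do not start another iteration
        if forced is None:
            self.iterations[key] = depth        # remember for comparisons with future versions
        return best_move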
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine equally strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one that is equal to pass. So after a failed test you really should increase your standards.
The "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26-game matches several times. You will _not_ get just a 2-point variance. I've already posted results from several programs including Fruit, Glaurung, Arasan, GnuChess and Crafty. So talking about standard error is really meaningless here, as is the +/- Elostat output. It just is not applicable, based on a _huge_ number of small matches...
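
One way to check this empirically on one's own data: take a long sequence of game results against the same opposition, cut it into consecutive 26-game blocks, and measure the spread of the block scores directly, then compare with the independent-games prediction of roughly 2 points. A minimal sketch; `results` is a hypothetical list of per-game scores (1, 0.5 or 0) in playing order.

Code:

import math

def block_score_spread(results, block_size=26):
    """Sample standard deviation of total scores over consecutive blocks of games."""
    blocks = [results[i:i + block_size] for i in range(0, len(results), block_size)]
    blocks = [b for b in blocks if len(b) == block_size]   # drop an incomplete final block
    assert len(blocks) >= 2, "need at least two full blocks"
    scores = [sum(b) for b in blocks]
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (len(scores) - 1))

# A measured spread well above ~2 points would indicate that the games are not
# behaving as independent trials (shared openings, correlated opponents, etc.).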
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine equally strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one that is equal to pass. So after a failed test you really should increase your standards.
The "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26-game matches several times. You will _not_ get just a 2-point variance. I've already posted results from several programs including Fruit, Glaurung, Arasan, GnuChess and Crafty. So talking about standard error is really meaningless here, as is the +/- Elostat output. It just is not applicable, based on a _huge_ number of small matches...
Could you give me the correct way to analyse my example statistically?