An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: An objective test process for the rest of us?

Post by Carey »

bob wrote: Not since the original chess 4.x which is where the iterated search was first seen so far as I know.
Actually it was Jim Gillogly, with TECH. At that time, Chess was still at v3.5.

I asked him about that once, and here is what he said on Oct 18, 2006:
I invented and named iterative deepening, in fact. I remember the occasion: I came up with the concept and started by calling it "progressive deepening" from the cognitive concept used by masters described by A. D. DeGroot in "Thought and Choice in Chess." I described it to my thesis advisor, Allen Newell, and he said I needed to come up with another name for it, because it was significantly different from DeGroot's concept. After some thought I came up with iterative deepening, and it stuck. It was implemented in time for the computer chess championship, and I described it in the panel discussion there. The following year other teams (including, I think, Northwestern) were also using it.
Unfortunately, the original TECH, written in BLISS, appears to be lost. He thought he had a copy but couldn't find it.

So, for the record, Mr. Gillogly is the one who deserves the credit.
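For anyone who has not seen it in code, the idea itself is tiny: a driver loop that repeats the full search at increasing depth and keeps the result of the deepest completed iteration. A minimal Python sketch (search() and the time budget below are placeholders for illustration, not TECH's actual code):

Code:

import random
import time

START = time.time()
TIME_BUDGET = 0.05  # seconds; arbitrary for this demo

def search(position, depth):
    # stand-in for a real fixed-depth search
    return random.random(), "move-from-depth-%d" % depth

def iterative_deepening(position, max_depth=64):
    best_move = None
    for depth in range(1, max_depth + 1):
        score, best_move = search(position, depth)  # keep the deepest completed result
        if time.time() - START > TIME_BUDGET:       # the clock, not the depth, ends the loop
            break
    return best_move

print(iterative_deepening("startpos"))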
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty.
Okay, so I need to play more 26-game matches with this setup, so that we find the variance of this particular setup?

My hypothesis is that at this level and with my settings (own book in particular), the variance is significantly lower than what you got for your tests.

What would I need to do to confirm or reject this hypothesis?

I'll go and look up the values you posted unless someone beats me to it.

For now, I am changing the 2-each tournament to an n-game gauntlet (because to get just the points scored by A against A', I don't need the other matches, as hgm correctly pointed out).

The tournament is a good way for me to find out which set of opponents to choose, however (I usually weed out the engines that are totally overmatched, as well as those that are totally superior).

There is still the controversy on whether to use engines close in strength vs. engines that are much stronger. I am not sure whether Bob's reasoning still applies in the context of using an own book.

Which engines would you suggest instead of the ones I'm using (which have been selected to be slightly stronger than Eden 0.0.13)? Surely not Crafty, Fruit or Rybka? Because the current version of Eden will not even draw a single game under any conditions against any of these programs, and neither would the next one. So there must be some optimum level, probably higher but not too high.
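Coming back to the variance question: for concreteness, this is the kind of bookkeeping I have in mind once several 26-game match scores are available (the numbers below are made up, just to show the calculation):

Code:

import math

# points scored per 26-game match; invented placeholder values, not real results
scores = [14.5, 15.0, 13.5, 16.0, 14.0, 15.5]

n = len(scores)
mean = sum(scores) / n
sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # unbiased sample variance

print("mean score      : %.2f / 26" % mean)
print("sample variance : %.2f (sd = %.2f points)" % (sample_var, math.sqrt(sample_var)))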
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Carey wrote:
bob wrote: Not since the original chess 4.x which is where the iterated search was first seen so far as I know.
Actually it was Jim Gillogly, with TECH. At that time, Chess was still at v3.5.

I asked him about that once, and here is what he said on Oct 18, 2006:
I invented and named iterative deepening, in fact. I remember the occasion: I came up with the concept and started by calling it "progressive deepening" from the cognitive concept used by masters described by A. D. DeGroot in "Thought and Choice in Chess." I described it to my thesis advisor, Allen Newell, and he said I needed to come up with another name for it, because it was significantly different from DeGroot's concept. After some thought I came up with iterative deepening, and it stuck. It was implemented in time for the computer chess championship, and I described it in the panel discussion there. The following year other teams (including, I think, Northwestern) were also using it.
Unfortunately, the original TECH, written in BLISS, appears to be lost. He thought he had a copy but couldn't find it.

So, for the record, Mr. Gillogly is the one who deserves the credit.
Good. While I haven't talked to Jim in ages, he was a good guy in computer chess. Certainly deserves the credit based on what you quoted...

Bob
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
hgm wrote:Standard error on 26 games is 2 pts, so for a difference it is 2.8 pts. For 95% confidence this is about twice that, or 5.5 pts. (Or was that 97.5%, because this is a one-sided test? I would have to calculate that to be sure.) So an engine equally strong as Eden 0.0.12 would make 15 points in this gauntlet only once in 20 times. That means Cefap and those above it are significantly stronger than Eden 0.0.12, and you could add Zotron to that for Eden 0.0.13.

If you want to be 95% (97.5%) sure that Eden 0.0.14 is better than 0.0.13, it would have to make at least 14 points out of 26, on the first try. For 84% confidence you would have to be only 1 sigma better, i.e. 3 points. I guess I would be happy with that, if it was achieved on the first try.

The main trap is that you are going to keep trying and trying with marginal improvements until you find one that passes. That is cheating. Out of 7 tries to pass the 84% test, you would expect one version that is merely equal to pass. So after a failed test you really should increase your standards.
the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Could you give me the correct way to analyse my example statistically?
I don't think we have such a methodology yet. At least I have not found one after all the games I have played using our cluster. This is really a difficult problem to address. I'm simply trying to point out to everyone that a few dozen games are a poor indicator. Again the easiest way to see why is to run the same "thing" more than once and look at how unstable the results are.
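Even under the simplest possible model (fixed, independent per-game probabilities, which is the best case), a toy simulation shows how much a 26-game total bounces around; the probabilities below are invented, purely for illustration:

Code:

import random

WIN, DRAW = 0.35, 0.30        # invented per-game win/draw probabilities
GAMES, MATCHES = 26, 20

def one_match():
    score = 0.0
    for _ in range(GAMES):
        r = random.random()
        score += 1.0 if r < WIN else 0.5 if r < WIN + DRAW else 0.0
    return score

results = sorted(one_match() for _ in range(MATCHES))
print("26-game match totals:", results)
print("spread (max - min)  :", results[-1] - results[0])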
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty.
Okay, so I need to play more 26-game matches with this setup, so that we find the variance of this particular setup?

My hypothesis is that at this level and with my settings (own book in particular), the variance is significantly lower than what you got for your tests.

What would I need to do to confirm or reject this hypothesis?

I'll go and look up the values you posted unless someone beats me to it.

For now, I am changing the 2-each tournament to an n-game gauntlet (because to get just the points scored by A against A', I don't need the other matches, as hgm correctly pointed out).

The tournament is a good way for me to find out which set of opponents to choose, however (I usually weed out the engines that are totally overmatched, as well as those that are totally superior).

There is still the controversy on whether to use engines close in strength vs. engines that are much stronger. I am not sure whether Bob's reasoning still applies in the context of using an own book.

Which engines would you suggest instead of the ones I'm using (which have been selected to be slightly stronger than Eden 0.0.13)? Surely not Crafty, Fruit or Rybka? Because the current version of Eden will not even draw a single game under any conditions against any of these programs, and neither would the next one. So there must be some optimum level, probably higher but not too high.
If you don't find them, we can start a new thread and I can post some data. The only problem is that my results are in chunks of 80 games against the same opponent: 40 starting positions, each played with alternating colors. Probably the best would be to post four 80-game matches between several programs in my test gauntlet...
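Roughly, the schedule behind those 80-game chunks looks like the sketch below (the position names are placeholders for the 40 real openings, and the opponent list is only an example):

Code:

# 40 starting positions, each played with both color assignments = 80 games
# per opponent. Names below are placeholders, not the actual test suite.
positions = ["pos%02d" % i for i in range(1, 41)]
opponents = ["Fruit", "Glaurung", "Arasan", "GnuChess"]

schedule = []
for opp in opponents:
    for pos in positions:
        schedule.append(("Crafty", opp, pos))   # Crafty plays White
        schedule.append((opp, "Crafty", pos))   # same position, colors reversed

print(len(schedule), "games,", len(schedule) // len(opponents), "per opponent")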

I agree that if you choose an opponent too strong, such that your results are 0-80 no matter what you do, you get no useful information from the score, and would have to look at each game carefully to see if the new version plays better. This is a hard assignment.

I believe you need at least one opponent that is worse than you, one that is significantly (but not overwhelmingly) better, and a couple that are pretty close to you. We have seen cases where we make a change and do better against a stronger opponent, and worse against a far weaker opponent. So evaluating this stuff is still difficult...

What you ought to hope is that over time, the equal programs become much weaker than yours, and the better program becomes more equal, so that you need to find a good "tough opponent" to replace the old weak opponent that is now way too weak...
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:That is _exactly_ the kind of comment that makes me cringe. ~5 elo? Based on what?
Based on theoretical considerations and the general Elo-vs-search-time relation, as in the derivation I gave in the other thread. In the absence of actual data I see no reason to disbelieve plausible theoretical predictions. I do not have the equipment to reliably measure such small Elo differences. You yourself told me that you never measured this. So the theoretical prediction stands until it is convincingly falsified by experimental data.


bob wrote:So somehow modify all the other engines to use something provided by the first? Are we trying to test/debug or introduce bugs to find? I'm not going to modify other programs and then debug that...
Of course not! :shock: The opponents don't have to be modified at all, as they play only once in every position, as I described above. Their moves go into the database, and if another engine of my A family brings them into a position they have already seen, the move comes from the database. That guarantees 100% reproducible play by the opponents, without having to modify them in any way.
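In code the database is nothing more than a lookup table keyed on the position; the ask_engine hook below is just a placeholder for however you query the unmodified opponent:

Code:

import random

opponent_moves = {}   # position key (e.g. a FEN string) -> move the opponent chose there

def opponent_reply(fen, ask_engine):
    if fen not in opponent_moves:
        opponent_moves[fen] = ask_engine(fen)   # one real engine query, then cached
    return opponent_moves[fen]                  # later visits replay the stored move

# demo with a stand-in "engine" that would otherwise be nondeterministic
noisy_engine = lambda fen: random.choice(["e2e4", "d2d4", "c2c4"])
print(opponent_reply("startpos", noisy_engine))
print(opponent_reply("startpos", noisy_engine))   # identical to the first call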
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Wait a minute! Are we talking here about 'variance' in the usual statistical meaning, as the square of the standard deviation?

Then it is safe to say that what you claim is just nonsense. The addition law for variances of combined events (and the resulting central limit theorem) can be mathematically _proven_ under very weak conditions. The only thing that is required is that the variances be finite (which in Chess they necessarily are, as all individual scores are finite), and that the events are independent.

So what you claim could only be true if the results of games within one match were dependent on each other, in a way that winning one would give you a larger probability of winning the next. As we typically restart our engines, that doesn't seem possible without violating causality...

If you play a large number of 26-game matches, the results will be distributed with a SD of at most 0.5*sqrt(26) = 2.5, and you can only reach that if the engines cannot play draws, and score near 50% on average. With normal draw percentages it will be 2 points (i.e. the variance will be 4 square points).

No way the variance is ever going to be any higher.
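Anyone who wants to check the arithmetic can do so in a few lines; the win/draw rates below are only example values:

Code:

import math

def match_sd(p_win, p_draw, games=26):
    # assumes independent games, so the per-game variances simply add
    mean = p_win + 0.5 * p_draw                     # expected score of one game
    var = p_win * 1.0 + p_draw * 0.25 - mean ** 2   # E[X^2] - E[X]^2 for X in {0, 1/2, 1}
    return math.sqrt(games * var)

print(match_sd(0.50, 0.00))   # no draws, 50% score: 0.5*sqrt(26), about 2.55
print(match_sd(0.35, 0.30))   # a more typical draw rate: roughly 2 points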
Rob

Re: An objective test process for the rest of us?

Post by Rob »

hgm wrote:
bob wrote:the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Wait a minute! Are we talking here about 'variance' in the usual statistical meaning, as the square of the standard deviation?

Then it is safe to say that what you claim is just nonsense. The addition law for variances of combined events (and the resulting central limit theorem) can be mathematically _proven_ under very weak conditions. The only thing that is required is that the variances be finite (which in Chess they necessarily are, as all individual scores are finite), and that the events are independent.

So what you claim could only be true if the results of games within one match were dependent on each other, in a way that winning one would give you a larger probability of winning the next. As we typically restart our engines, that doesn't seem possible without violating causality...

If you play a large number of 26-game matches, the results will be distributed with a SD of at most 0.5*sqrt(26) = 2.5, and you can only reach that if the engines cannot play draws, and score near 50% on average. With normal draw percentages it will be 2 points (i.e. the variance will be 4 square points).

No way the variance is ever going to be any higher.
Could it be that chess doesn't have a variance? I've been reading a bit on Wikipedia, and there are distributions that don't have one.
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Yes, variance does not always have to be finite. But to have infinite variance, a quantity should be able to attain arbitrarily large values (implying that it should also be able to attain infinitely many different values).

For chess a game can result in only 3 scores: 0, 1/2 or 1. That means that the variance can be at most 1/2 squared = 1/4. The standard deviation is always limited to half the range between the maximum and the minimum outcome, and pathological cases can only occur when this range is infinite. It can then even happen that the expectation value does not exist. But no such thing happens with chess scores.
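A brute-force check over all win/draw/loss mixes confirms the 1/4 bound:

Code:

# Grid search over win/draw probabilities: the variance of a single game score
# in {0, 1/2, 1} never exceeds 1/4 (reached at 50% wins, 50% losses, no draws).
best = 0.0
steps = 200
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        p_win, p_draw = i / steps, j / steps
        mean = p_win + 0.5 * p_draw
        var = p_win + 0.25 * p_draw - mean ** 2
        best = max(best, var)
print(best)   # 0.25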
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Rob wrote:
Could it be that chess doesn't have a variance? I've been reading a bit on Wikipedia, and there are distributions that don't have one.
No.

Results of matches between engines do have variance, due to factors that have been discussed here and in other threads: random numbers for Zobrist keys (if nondeterministic seeds are used) and time management are just two examples, and probably the biggest factor is an opening book that offers choices.
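As a small illustration of the book effect: with weighted choices in the book, even the very first moves can differ from run to run (the book contents below are invented):

Code:

import random

# invented example book: (move, weight) choices for a position
book = {
    "startpos": [("e2e4", 50), ("d2d4", 35), ("c2c4", 15)],
}

def book_move(position):
    entries = book.get(position)
    if not entries:
        return None                                   # out of book: fall back to search
    moves, weights = zip(*entries)
    return random.choices(moves, weights=weights)[0]  # weighted random pick

print([book_move("startpos") for _ in range(5)])      # varies from run to run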