different kinds of testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Jan Brouwer
Posts: 201
Joined: Thu Mar 22, 2007 7:12 pm
Location: Netherlands

Re: different kinds of testing

Post by Jan Brouwer »

Aser Huerga wrote:
Jan Brouwer wrote: It's called Rotor (http://home.kpn.nl/f2hjbrouwer120/index.html).


Hello Jan,

Please, could you check this link? Clicking on Rotor 0.5 sends you to a page where there is no clear download link.

Thanks.
Hi Aser,

You are right. My provider recently changed its domain name and I had not noticed that the links to the zip files needed to be updated.
You can use this link http://home.kpn.nl/f2hjbrouwer120/rotor_05.zip to download the zip file. I will update my page.

Thanks for mentioning it.

Jan
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: different kinds of testing

Post by Dirt »

bob wrote:1. time control. Is it possible to choose a "bad one" that will provide misleading results? Absolutely. Is it probable? Based on my testing results, no. This means that very fast time controls are pretty reliable in predicting performance at longer time controls, when you are trying to determine if A' is better than A (A' is just a modified A). Have I seen cases where this is false? Yes. For example, it is not uncommon to see a ramped-up king safety score better at fast time controls, where it gets the opponent into trouble, while at longer time controls the opponent can see through the superficiality of some moves and avoid the problems. But that only requires a bit of common sense when testing.
Maybe you could get the testing to supply the common sense, too.

How do you think this would work? Instead of running 40,000 games at 10s, run 20,000 at 8s and 20,000 at 12s. Report the combined results just the way you do now, except also show how much variance (in Elo?) there was between the two time controls. If the variance is ever much larger than is statistically expected then tests at longer time controls could be run.
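One way the arithmetic could look in C (a sketch only; the score fractions and the per-game variance below are invented numbers, assuming roughly 30% draws):

#include <math.h>
#include <stdio.h>

/* score fraction -> Elo difference, logistic model */
static double elo(double p) { return -400.0 * log10(1.0 / p - 1.0); }

int main(void) {
    double p1 = 0.540;    /* invented: A' scores 54.0% in 20,000 games at 8s  */
    double p2 = 0.525;    /* invented: A' scores 52.5% in 20,000 games at 12s */
    int    n  = 20000;    /* games per half-run */
    double var = 0.175;   /* per-game score variance, assuming ~30% draws */

    /* combined result over all 40,000 games, reported as usual */
    printf("combined: %+.1f Elo\n", elo((p1 + p2) / 2.0));

    /* standard error of the difference between two independent means */
    double se = sqrt(2.0 * var / n);
    printf("TC gap: %.1f sigma\n", (p1 - p2) / se);
    /* a gap well beyond ~2 sigma would argue for rerunning at longer TCs */
    return 0;
}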
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: different kinds of testing

Post by hgm »

What you propose is basically a special case of orthogonal multi-testing: instead of just testing a single change, vary other parameters at the same time, in this case the TC, and get the extra info for free. If you play N games, in principle you can simultaneously test the effect of log2(N) changes that way, all with the same accuracy that goes with N games.
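As a sketch of the bookkeeping involved (illustrative only, not code from any actual tester): with K binary changes, each game plays one of the 2^K option combinations, and every change is still estimated from all N games:

#include <stdio.h>

#define K 3       /* binary changes tested simultaneously; K can go up to log2(N) */
#define N 4096    /* total games; each change is still measured over all N */

double score[N];  /* result of game i (0, 0.5 or 1), filled in by the tester */

int main(void) {
    double on[K] = {0}, off[K] = {0};
    for (int i = 0; i < N; i++) {
        int config = i % (1 << K);   /* cycle through all 2^K combinations */
        /* ... play game i with the engine options given by the bits of
           config, store the result in score[i] ... */
        for (int j = 0; j < K; j++) {
            if (config & (1 << j)) on[j] += score[i];
            else                   off[j] += score[i];
        }
    }
    /* each change's effect, in score fraction, from N/2 games on each side */
    for (int j = 0; j < K; j++)
        printf("change %d: %+.4f\n", j, (on[j] - off[j]) / (N / 2));
    return 0;
}

With N = 4096 you could in principle push K to 12, though the independence assumption behind the scheme gets shakier with every factor you add.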
Will Singleton
Posts: 128
Joined: Thu Mar 09, 2006 5:14 pm
Location: Los Angeles, CA

Re: different kinds of testing

Post by Will Singleton »

Don wrote:
Will Singleton wrote:I'm wondering whether it's better to test from a set of opening positions, flipping the colors, or rather from game start using books. I guess you'd get more volatility using books, so you'd need more games. But if you eliminate the books, then you're not testing from the positions your program would typically get into.

And if you use a set of opening positions, I guess you'd have to have a large set if you wanted to play 1000 games without duplication. Anyone have a set like that?

Will
I created a set of openings from several hundred thousand master games. My book goes to depth 10 ply. This is not very deep, but I am more interested in testing the program than the book. A custom book can be designed later.

I sanitized the huge PGN file I extracted these from so that no games repeated. I required that any given move be played at least N times so that I am not looking at ridiculous positions. I don't remember what N was, but it generated more than 3785 different starting positions, 5 moves (or 10 ply) deep.

I should also mention that I checked them for transpositions. The resulting end positions are all unique.

I also checked for "likely" transpositions by letting the program play through all the openings, and I removed an opening if it transposed to another opening within 4 ply of leaving book. Or something like that, I don't remember the exact rule. Nothing you can easily do would guarantee no eventual transpositions (you could eventually transpose into a rook vs king ending if you go crazy).

So I ended up with 3785 carefully selected starting positions for testing. For any 2 players I can then generate 7570 games. Sometimes I want bigger samples than this, but my tester does round robins between up to 256 players, so even 3 players will give me 15000 games per player.

My autotester also takes care that openings are not played in the same sequence. (Actually, the sequence is deterministic, based on a hash of the two players' names.) Even between players, you will not play the same first white opening as your opponent played against you. I think this is fairly important, so that your shorter tests are not always hammering the same openings.
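One plausible implementation of such a deterministic, pairing-specific order (a sketch; the FNV-1a hash and the Fisher-Yates shuffle are illustrative choices, not necessarily what the actual autotester uses):

#include <stdio.h>
#include <stdlib.h>

#define BOOK_SIZE 3785

/* FNV-1a hash over both engine names; the pairing fixes the sequence */
static unsigned pair_hash(const char *a, const char *b) {
    unsigned h = 2166136261u;
    for (; *a; a++) { h ^= (unsigned char)*a; h *= 16777619u; }
    h ^= '|';  h *= 16777619u;    /* separator, so ("ab","c") != ("a","bc") */
    for (; *b; b++) { h ^= (unsigned char)*b; h *= 16777619u; }
    return h;
}

/* fill order[] with a permutation of 0 .. BOOK_SIZE-1, fixed per pairing */
void opening_order(const char *white, const char *black, int order[]) {
    srand(pair_hash(white, black));
    for (int i = 0; i < BOOK_SIZE; i++) order[i] = i;
    for (int i = BOOK_SIZE - 1; i > 0; i--) {    /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = order[i]; order[i] = order[j]; order[j] = t;
    }
}

Since pair_hash("A", "B") and pair_hash("B", "A") generally differ, the reversed pairing automatically gets a different opening sequence, which matches the behaviour described above.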

I can give my openings to anyone who wants them. It's in a C file that looks like this:


#define BOOK_SIZE 3785

char *book[BOOK_SIZE] = {

"d2d4 g8f6 c2c4 e7e6 b1c3 f8b4 c1d2 e8g8 g1f3 d7d5",
"e2e4 d7d6 d2d4 g8f6 b1c3 g7g6 f1e2 f8g7 h2h4 c7c5",
"d2d4 d7d6 c2c4 e7e5 g1f3 e5e4 f3d2 f7f5 e2e3 g8f6",
"e2e4 e7e5 g1f3 b8c6 f1b5 g8f6 e1g1 d7d6 d2d4 e5d4",
"c2c4 g8f6 b1c3 e7e6 g1f3 d7d5 d2d4 f8b4 e2e3 e8g8",
...
Great, please send to smocfi@aol.com. Thanks much.

Will
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

Dirt wrote:
bob wrote:1. time control. Is it possible to choose a "bad one" that will provide misleading results? Absolutely. Is it probable? Based on my testing results, no. This means that very fast time controls are pretty reliable in predicting performance at longer time controls, when you are trying to determine if A' is better than A (A' is just a modified A). Have I seen cases where this is false? Yes. For example, it is not uncommon to see a ramped-up king safety score better at fast time controls, where it gets the opponent into trouble, while at longer time controls the opponent can see through the superficiality of some moves and avoid the problems. But that only requires a bit of common sense when testing.
Maybe you could get the testing to supply the common sense, too.

How do you think this would work? Instead of running 40,000 games at 10s, run 20,000 at 8s and 20,000 at 12s. Report the combined results just the way you do now, except also show how much variance (in Elo?) there was between the two time controls. If the variance is ever much larger than is statistically expected then tests at longer time controls could be run.
I'm not sure that works. The two 20K runs are effectively two different experiments, with different versions, since the time control changes. I can keep the groups separate by using a different program name for each time control. But there are lots of things that change, and the variance between different time-control runs is much larger than one would suspect, particularly when you use very fast and then much slower time controls, or a very small increment vs a very large one, etc...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

hgm wrote:What you propose is basically a special case of orthogonal multi-testing: instead of just testing a single change, vary other parameters at the same time, in this case the TC, and get the extra info for free. If you play N games, in principle you can simultaneously test the effect of log2(N) changes that way, all with the same accuracy that goes with N games.
So you believe it reasonable to ignore any differences in how a program behaves at different time controls, as if that has no effect on the final result? That running 10K games at very fast, 10K games at very slow, and then combining them to get 20K games, is just as good as running 20K games fast and then 20K games slow? I see too much change as time controls are varied. It is quite easy to find a program that varies by well over +/- 100 Elo against Crafty depending on the time control used. Changing the time control changes too many things (all opponents + Crafty will behave differently).
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

bob wrote:
hgm wrote:What you propose is basically a special case of orthogonal multi-testing: instead of just testing a single change, vary other parameters at the same time, in this case the TC, and get the extra info for free. If you play N games, in principle you can simultaneously test the effect of log2(N) changes that way, all with the same accuracy that goes with N games.
So you believe it reasonable to ignore any differences in how a program behaves at different time controls, as if that has no effect on the final result? That running 10K games at very fast, 10K games at very slow, and then combining them to get 20K games, is just as good as running 20K games fast and then 20K games slow? I see too much change as time controls are varied. It is quite easy to find a program that varies by well over +/- 100 Elo against Crafty depending on the time control used. Changing the time control changes too many things (all opponents + Crafty will behave differently).
Bob,

Orthogonal testing makes the assumption that the individual things being tested have little interaction, that they are independent. As you observe, that is not always the case.

However, so does the kind of testing you do. If you test 20 different things over the course of the month, did you test each change in combination with every other change you ever made? Of course you didn't. You assume that if you can get an improvement over the PREVIOUS version it is good in general. If it tests badly do you back out previous changes to see if there is a bad interaction? Probably not very often.

This kind of testing that we all do is pretty ugly but we have no choice but to make huge simplifying assumptions and just try to apply a little common sense to each of them.

I have noticed that you have the most incredible imagination. You can completely ignore something with a hand wave or it can become a major issue with you, depending on the point you want to make.

This all started with a simple observation that is a special case of orthogonal multi-testing, a very keen observation that I would have never thought of but seems self-evident when pointed out. That you even challenged it seemed incredible to me, like you are looking for some excuse to prove yourself. I think everyone here already has your respect so what gives?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

Don wrote:
bob wrote:
hgm wrote:What you propose is basically a special case of orthogonal multi-testing: instead of just testing a single change, vary other parameters at the same time, in this case the TC, and get the extra info for free. If you play N games, in principle you can simultaneously test the effect of log2(N) changes that way, all with the same accuracy that goes with N games.
So you believe it reasonable to ignore any differences in how a program behaves at different time controls, as if that has no effect on the final result? That running 10K games at very fast, 10K games at very slow, and then combining them to get 20K games, is just as good as running 20K games fast and then 20K games slow? I see too much change as time controls are varied. It is quite easy to find a program that varies by well over +/- 100 Elo against Crafty depending on the time control used. Changing the time control changes too many things (all opponents + Crafty will behave differently).
Bob,

Orthogonal testing makes the assumption that the individual things being tested have little interaction, that they are independent. As you observe, that is not always the case.

However, so does the kind of testing you do. If you test 20 different things over the course of the month, did you test each change in combination with every other change you ever made? Of course you didn't. You assume that if you can get an improvement over the PREVIOUS version it is good in general. If it tests badly do you back out previous changes to see if there is a bad interaction? Probably not very often.

This kind of testing that we all do is pretty ugly but we have no choice but to make huge simplifying assumptions and just try to apply a little common sense to each of them.

I have noticed that you have the most incredible imagination. You can completely ignore something with a hand wave or it can become a major issue with you, depending on the point you want to make.

This all started with a simple observation that is a special case of orthogonal multi-testing, a very keen observation that I would have never thought of but seems self-evident when pointed out. That you even challenged it seemed incredible to me, like you are looking for some excuse to prove yourself. I think everyone here already has your respect so what gives?
If my "challenge" of the idea seems incredible, perhaps you might think about it a bit more. I play 40K matches for each different test. To suggest that I get the same accuracy by playing 20K fast games and 20K slow games is not exactly a reasonable assumption. If you believe it is, more power to you.

I believe it is _far_ easier to simply do the tests I want to do, in the most accurate way possible. Everyone wants to "cheat the sample sizes" and find ways to get equivalent results with fewer games. I agree that orthogonal testing works _if_ the individual tests are really independent. In my case, which you cited, your comments make _zero_ sense. Yes, it is possible that there are interactions between changes. But I only test one change at a time. I am not testing two at once and _hoping_ that they are orthogonal. I am testing the full 40K games _knowing_ there are no interactions, because I am only testing one thing at a time. It really is that simple, and it really does work. Doing the orthogonal stuff is a neat idea, but with some assumptions built in that are difficult, if not impossible, to verify. My way has no assumptions at all.

As far as "what gives?" I simply want to see some factual information provided when dealing with testing. Ideas that are suggested (fixed nodes) have already been tried, for millions of games, and I found a problem that I pointed out. If you want to ignore that problem, that's OK. I simply intended to point out that there _is_ a problem. Just because some don't seem to grasp/understand the problem does not mean it doesn't exist. And there is no doubt it exists because I spent quite a bit of time trying to figure out why I was producing results that seemed to be counter-intuitive when we were playing with the fixed node test after discovering that timed tests had so incredibly much variation in them.

I almost think it would be better to simply remain silent and let everyone test the way they want, and not try to share things that I discover. Because many don't like "truth". I'll keep that in mind. We've clearly identified a way to improve Crafty. And the improvement has been verified over time by independent testers and rating lists. I'll just leave it at that and let everyone figure out how to do this stuff on their own, and re-invent the wheel a few hundred times, as that will certainly lead to a lot less aggravation, IMHO. So. Do your mixed testing. Different time controls. Orthogonal (supposedly) changes. Etc. And base your decisions on the results. After all, it really isn't going to hurt +me+. I have nothing to gain by sharing information. And in a way, have more to lose since it would be better to let everyone test in a broken way (how long did the commercial guys use 40 starting positions before my test results showed how bad that was... Ask Theron about it, for more information.)

So, for me, "mum is the word" with respect to testing. Let the "I believe," "I think," "it seems," "it must be," "that can't possibly be right," and such run their course. After all, it won't hurt +me+. And it might even help by improving our rate of progress compared to those that use sub-optimal (or flawed) testing approaches...

ciao.
jesper_nielsen

Re: different kinds of testing

Post by jesper_nielsen »

Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests from the 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to support any conclusion except for very big Elo jumps.
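To put a rough number on "not enough": assuming about 30% draws and a score near 50%, the 95% error bar for a 640-game run works out to roughly +/- 23 Elo, as this quick C check shows:

#include <math.h>
#include <stdio.h>

int main(void) {
    int    n     = 640;    /* games in the run */
    double draws = 0.30;   /* assumed draw ratio */
    /* per-game variance of the score at p ~ 0.5 is 0.25 * (1 - draw ratio) */
    double se = sqrt(0.25 * (1.0 - draws) / n);
    /* near 50%, one unit of score fraction is roughly 695 Elo */
    printf("95%% margin: +/- %.0f Elo\n", 1.96 * se * 695.0);
    return 0;
}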

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?

That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contributions to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

jesper_nielsen wrote:Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests from the 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to support any conclusion except for very big Elo jumps.

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at the fast time controls which even I use on the cluster to get results faster. So fast games limit the potential candidates for testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains more quickly than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contributions to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.