No, I wrote it on the fly in the forum text editor. It is just a sketchy idea, never meant to be a working patch. Ralph Stoesser wrote: Have you tried to compile your code?
BTW you are of course free to trash/ignore it if you don't like it.
Sorry, no offense meant. mcostalba wrote: No, I wrote it on the fly in the forum text editor. It is just a sketchy idea, never meant to be a working patch. Ralph Stoesser wrote: Have you tried to compile your code?
To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it is only that it seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks. bob wrote: The only exceptions I have found to this process revolve around two themes: Ralph Stoesser wrote: If this is true, and I'm a firm believer, it should be rather pointless to outvote a test with 20000 games using a test with 1000 games, especially in this case where opposite-side castling must be possible or must have happened to find eval differences at all. I still assume 1000 games are probably not enough to reveal an effect, regardless of the time control used. bob wrote: This +has+ been verified to work. I have played millions of very fast games, dealing primarily with evaluation changes or changes that make the program faster, and then verified that with longer and longer time controls, the results remain consistent. mcostalba wrote: Ralph, please no offence, but for me the definition of reliable is not: reliable = what I would like to see out of the results. Ralph Stoesser wrote: I think this is a good example where a test with many more games at an unreasonably fast time control is more reliable than a test with few games at a reasonable time control.
I consider Joona's comment much more to the point: probably that single piece of code has no effect at all, in either direction.
P.S.: Regarding testing at unreasonably fast time controls, you have not proved that they work. To prove they work, it is necessary to verify the changes at longer time controls and check that the results are the same. So IMHO tests at unreasonably fast time controls are not validated; they can be useful to quickly filter out some bad patches, but not to validate candidate good ones.
Search changes are a bit more difficult, but at least 80% of those changes have been verified with both fast and slow games...
All you have to avoid is time controls where a program loses too many games due to flag falling, rather than by getting beat.
Also it would be easy for Marco to verify my result. For a 20000-game test at 1 sec/game we don't exactly need a cluster.
(1) time allocation. Changes where you use more time, or alter the way time is allocated to each move, extending the time on fail-lows or other considerations, etc. You have to check representative time controls to be sure that things work the same when you have more time vs less time.
(2) search changes where you might see an exponential problem pop up. For example, in 1994 we had some odd failures in Cray Blitz due to singular extensions. Our testing was on slow hardware, but we played (in 1994) on hardware that would peak at about 7M nps. The significant extra depth led to many more (than expected) singular extensions. So things that might well cause tree explosion at deeper depths (extensions, reductions, etc) need to be verified at longer time controls. But in our testing, which is now beyond the 100M game level, most of these changes remain consistent.
Note that by consistent, I mean that A and A' (original and modified) show about the same "gap" (in terms of Elo) across all time controls. I have lots of examples of where program A does worse against B (two different programs) at different time controls. But with the same program, just two versions, this has not been a problem. A and A' might do worse against B at fast times, vs slow times, but if A' is better than A, it will be consistently better so that measuring the Elo gap between them produces a near-constant number.
1000 games have such a high error bar that, unless the change is dramatic in nature, such a match will produce noise but no usable results. If you think a change is in the 2-3-4 Elo range, you need to produce between 40,000 and 100,000 games.
By the way, the 1,000 slow games vs 20,000 fast games myth is worthy of "The Myth Busters" TV program. It sounds plausible, but is far from it in reality. Yet we will continue to see this over and over.
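For reference, the error bars quoted in this thread follow from simple binomial arithmetic. Here is a rough back-of-the-envelope sketch (not taken from any engine or testing framework; the 32% draw ratio is just an assumed, plausible self-play figure) that computes the approximate 1-sigma and 95% Elo error of an n-game match played near a 50% score:

#include <cmath>
#include <cstdio>

// Approximate 1-sigma Elo error of an n-game match with the given draw ratio,
// assuming the overall score is close to 50%.
double elo_sigma(int n, double draw_ratio)
{
    // Per-game score variance around a 0.5 mean: wins/losses score 1/0, draws 0.5.
    double var_per_game = 0.25 - draw_ratio / 4.0;
    double sigma_score  = std::sqrt(var_per_game / n);
    // Slope of Elo(s) = -400*log10(1/s - 1) at s = 0.5 is about 695 Elo per unit of score.
    return 695.0 * sigma_score;
}

int main()
{
    for (int n : {1000, 20000, 60000})
    {
        double s = elo_sigma(n, 0.32);  // ~32% draws assumed
        std::printf("%6d games: 1 sigma = %4.1f Elo, 95%% = +/- %4.1f Elo\n", n, s, 1.96 * s);
    }
}

With no draws the 1000-game window comes out near the +/-20 Elo quoted above (1 sigma is about 11 Elo); with roughly a third of the games drawn, 1 sigma is close to the +9 Elo figure Joona quotes below, and only the 20000- and 60000-game runs shrink the 95% window to a few Elo.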
I can't speak for Marco, but the fact is that in the last 1.5 years we have been able to increase Stockfish's strength by around 200 Elo points with our current 1000 games 1+0 system. Now, when you have a testing methodology which perhaps is not in full agreement with statistical theory, but which in practice seems to work very well, you definitely don't want to change it easily. Ralph Stoesser wrote: To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it is only that it seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks.
I think Marco will not contradict this, because in the other thread too his comment was only
"Thanks, of course you have much more experience then me and everybody else here !".
I'm not sure what it means, but probably it does not mean "I completely agree".
The problem with 1000 games is that the significant error margin (roughly +/-20 Elo for 1,000 games) makes this kind of testing _extremely_ risky unless your improvements are massive and lie well outside that +/-20 window. zamar wrote: I can't speak for Marco, but the fact is that in the last 1.5 years we have been able to increase Stockfish's strength by around 200 Elo points with our current 1000 games 1+0 system. Now, when you have a testing methodology which perhaps is not in full agreement with statistical theory, but which in practice seems to work very well, you definitely don't want to change it easily. Ralph Stoesser wrote: To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it is only that it seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks.
I rarely have success with gcc PGO if I try to profile the parallel code. With just one thread it seems to work well, but the instant I try a second thread, I get that same "corrupted" file error... Ralph Stoesser wrote: Sorry, no offense meant. mcostalba wrote: No, I wrote it on the fly in the forum text editor. It is just a sketchy idea, never meant to be a working patch. Ralph Stoesser wrote: Have you tried to compile your code?
I have not changed the whole movegen code. So far I have only removed the color parameter for queen, rook, bishop and knight moves (generate_piece_moves()).
This unfinished change was for testing purposes only. I didn't want to change a lot, because it's always possible that it is not worth the effort. Furthermore, this PGO compile error was annoying and time-consuming today. I have tried other gcc versions, but with no success. Again, it's not that I need "./" before the executable. I have worked with unix/linux for many years, so I'm able to fix something like this.
After this first test I think it is worth optimizing the movegen code the way you suggest. It's a natural optimization, and if it's done right, I would say it doesn't mess up the code style.
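For what it's worth, here is a minimal, self-contained sketch of the pattern under discussion: making the side to move a template parameter instead of a runtime argument. It is a toy knight generator, not the actual generate_piece_moves() from Stockfish; the Position and Move types are simplified stand-ins, and __builtin_ctzll is the gcc builtin for "index of the lowest set bit":

#include <cstdint>
#include <cstdio>

enum Color { WHITE, BLACK, COLOR_NB };
enum PieceType { KNIGHT, BISHOP, ROOK, QUEEN, PIECE_TYPE_NB };
typedef std::uint64_t Bitboard;

struct Position
{
    Bitboard byColorType[COLOR_NB][PIECE_TYPE_NB];
    Bitboard byColor[COLOR_NB];
};

struct Move { int from, to; };

// Toy attack lookup, computed on the fly instead of from a precomputed table.
Bitboard knight_attacks(int sq)
{
    static const int dr[] = { 2, 2, 1, 1, -1, -1, -2, -2 };
    static const int df[] = { 1, -1, 2, -2, 2, -2, 1, -1 };
    int r = sq / 8, f = sq % 8;
    Bitboard b = 0;
    for (int i = 0; i < 8; ++i)
    {
        int rr = r + dr[i], ff = f + df[i];
        if (rr >= 0 && rr < 8 && ff >= 0 && ff < 8)
            b |= Bitboard(1) << (rr * 8 + ff);
    }
    return b;
}

// Before: Move* generate_piece_moves(const Position&, Move*, PieceType, Color us);
// After: one instantiation per color, so "us" is a compile-time constant and no
// color argument has to be threaded through the generator at run time.
template<Color Us>
Move* generate_knight_moves(const Position& pos, Move* mlist)
{
    Bitboard knights = pos.byColorType[Us][KNIGHT];
    Bitboard notOwn  = ~pos.byColor[Us];
    while (knights)
    {
        int from = __builtin_ctzll(knights);
        knights &= knights - 1;
        Bitboard targets = knight_attacks(from) & notOwn;
        while (targets)
        {
            int to = __builtin_ctzll(targets);
            targets &= targets - 1;
            Move m = { from, to };
            *mlist++ = m;
        }
    }
    return mlist;
}

int main()
{
    Position pos = Position();                          // empty toy position
    pos.byColorType[WHITE][KNIGHT] = Bitboard(1) << 1;  // white knight on b1
    pos.byColor[WHITE]             = pos.byColorType[WHITE][KNIGHT];

    Move list[64];
    Move* end = generate_knight_moves<WHITE>(pos, list);
    std::printf("generated %d knight moves\n", int(end - list));
    return 0;
}

The caller would pick the <WHITE> or <BLACK> instantiation once at the top of move generation, so no color branch or parameter is left inside the inner loops.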
We don't need such a large error margin. I believe one-sided one-sigma confidence (roughly +9 Elo for 1000 games) is more than enough. For six steps forward, you are prepared to take one step backward and live with that fact. In practice we've had good results even including all >+5 Elo changes, and still we are going more forward than backward. bob wrote: The problem with 1000 games is that the significant error margin (roughly +/-20 Elo for 1,000 games) makes this kind of testing _extremely_ risky unless your improvements are massive and lie well outside that +/-20 window.
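To put a number on that trade-off (again just normal-approximation arithmetic, not the actual acceptance rule used for Stockfish), a small sketch: with roughly 9 Elo of 1-sigma noise in a 1000-game match and a "measured gain must be at least one sigma" bar, the chance that a patch of a given true strength passes is:

#include <cmath>
#include <cstdio>

// Standard normal CDF.
double normal_cdf(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

int main()
{
    const double sigma      = 9.0;  // ~1-sigma Elo noise of a 1000-game match
    const double threshold  = 9.0;  // accept only if the measured gain is >= one sigma
    const double true_elo[] = { -5.0, 0.0, 5.0, 10.0 };
    for (int i = 0; i < 4; ++i)
    {
        double p_accept = 1.0 - normal_cdf((threshold - true_elo[i]) / sigma);
        std::printf("truly %+5.1f Elo -> passes %4.1f%% of the time\n",
                    true_elo[i], 100.0 * p_accept);
    }
}

A truly neutral patch slips through about one time in six, roughly in line with the "six steps forward, one step backward" figure, while a patch genuinely worth +10 Elo still fails the bar almost half the time; lowering the bar to +5 Elo trades fewer missed improvements for more accepted regressions.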
I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game as opposed to 1000 games at 1 min/game. That way there are practically _no_ "steps backward"... zamar wrote: We don't need such a large error margin. I believe one-sided one-sigma confidence (roughly +9 Elo for 1000 games) is more than enough. For six steps forward, you are prepared to take one step backward and live with that fact. In practice we've had good results even including all >+5 Elo changes, and still we are going more forward than backward. bob wrote: The problem with 1000 games is that the significant error margin (roughly +/-20 Elo for 1,000 games) makes this kind of testing _extremely_ risky unless your improvements are massive and lie well outside that +/-20 window.
With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.) bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game as opposed to 1000 games at 1 min/game. That way there are practically _no_ "steps backward"...
But the two tests Bob mentions take (theoretically) the same amount of time/CPU power (60,000 x 1 sec = 1000 x 60 sec). rvida wrote: With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.) bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game as opposed to 1000 games at 1 min/game. That way there are practically _no_ "steps backward"...
P.S.: Sorry for my horrible English.