No, I wrote it on the fly in the forum text editor. It is just a sketchy idea, never meant to be a working patch. Ralph Stoesser wrote: Have you tried to compile your code?
BTW you are of course free to trash/ignore it if you don't like it.
Sorry, no offense meant. mcostalba wrote: No, I wrote it on the fly in the forum text editor. It is just a sketchy idea, never meant to be a working patch. Ralph Stoesser wrote: Have you tried to compile your code?
To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it is only that it seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks. bob wrote: The only exceptions I have found to this process revolve around two themes: Ralph Stoesser wrote: If this is true, and I'm a firm believer, it should be rather pointless to outvote a test with 20000 games using a test with 1000 games, especially in this case where opposite-side castling must be possible or must have happened to find eval differences at all. I still assume 1000 games are probably not enough to reveal an effect, regardless of the time control used. bob wrote: This +has+ been verified to work. I have played millions of very fast games, dealing primarily with evaluation changes or changes that make the program faster, and then verified that with longer and longer time controls, the results remain consistent. mcostalba wrote: Ralph, please no offence, but for me the definition of reliable is not: reliable = what I would like to see out of the results. Ralph Stoesser wrote: I think this is a good example where a test with many more games at an unreasonably fast time control is more reliable than a test with few games at a reasonable time control.
I consider Joona's comment much more to the point: probably that single piece of code has no effect at all, in either direction.
P.S.: Regarding testing at unreasonably fast time controls, you have not proved that they work. To prove they work, it is necessary to verify the changes at longer time controls and check that the results are the same. So IMHO tests at unreasonably fast time controls are not validated; they can be useful to quickly filter out some bad patches, but not to validate candidate good ones.
Search changes are a bit more difficult, but at least 80% of those changes have been verified with both fast and slow games...
All you have to avoid is time controls where a program loses too many games due to flag falling, rather than by getting beat.
Also it would be easy for Marco to verify my result. For a 20000-game test at 1 sec/game we don't exactly need a cluster.
(1) time allocation. Changes where you use more time, or alter the way time is allocated to each move, extending the time on fail-lows or other considerations, etc. You have to check representative time controls to be sure that things work the same when you have more time vs less time.
(2) search changes where you might see an exponential problem pop up. For example, in 1994 we had some odd failures in Cray Blitz due to singular extensions. Our testing was on slow hardware, but we played (in 1994) on hardware that would peak at about 7M nps. The significant extra depth led to many more (than expected) singular extensions. So things that might well cause tree explosion at deeper depths (extensions, reductions, etc) need to be verified at longer time controls. But in our testing, which is now beyond the 100M game level, most of these changes remain consistent.
Note that by consistent, I mean that A and A' (original and modified) show about the same "gap" (in terms of Elo) across all time controls. I have lots of examples of where program A does worse against B (two different programs) at different time controls. But with the same program, just two versions, this has not been a problem. A and A' might do worse against B at fast times, vs slow times, but if A' is better than A, it will be consistently better so that measuring the Elo gap between them produces a near-constant number.
1000 games have such a high error bar that, unless the change is dramatic in nature, such a match will produce noise but no usable results. If you think a change is in the 2-3-4 Elo range, you need to produce between 40,000 and 100,000 games.
By the way, the 1,000 slow games vs 20,000 fast games myth is worthy of "The Myth Busters" TV program. It sounds plausible, but is far from it in reality. Yet we will continue to see this over and over.
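For reference, the error bars quoted in this thread follow from simple binomial arithmetic. Here is a rough back-of-the-envelope sketch (not taken from any engine or testing framework; the 32% draw ratio is just an assumed, plausible self-play figure) that computes the approximate 1-sigma and 95% Elo error of an n-game match played near a 50% score:

#include <cmath>
#include <cstdio>

// Approximate 1-sigma Elo error of an n-game match with the given draw ratio,
// assuming the overall score is close to 50%.
double elo_sigma(int n, double draw_ratio)
{
    // Per-game score variance around a 0.5 mean: wins/losses score 1/0, draws 0.5.
    double var_per_game = 0.25 - draw_ratio / 4.0;
    double sigma_score  = std::sqrt(var_per_game / n);
    // Slope of Elo(s) = -400*log10(1/s - 1) at s = 0.5 is about 695 Elo per unit of score.
    return 695.0 * sigma_score;
}

int main()
{
    for (int n : {1000, 20000, 60000})
    {
        double s = elo_sigma(n, 0.32);  // ~32% draws assumed
        std::printf("%6d games: 1 sigma = %4.1f Elo, 95%% = +/- %4.1f Elo\n", n, s, 1.96 * s);
    }
}

With no draws the 1000-game window comes out near the +/-20 Elo quoted above (1 sigma is about 11 Elo); with roughly a third of the games drawn, 1 sigma is close to the +9 Elo figure Joona quotes below, and only the 20000- and 60000-game runs shrink the 95% window to a few Elo.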
I can't speak for Marco, but the fact is that in the last 1.5 years we have been able to increase Stockfish's strength by around 200 Elo points with our current 1000 games 1+0 system. Now, when you have a testing methodology which perhaps is not in full agreement with statistical theory, but which in practice seems to work very well, you definitely don't want to change it easily. Ralph Stoesser wrote: To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it is only that it seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks.
I think Marco will not contradict this, because in the other thread too his comment was only
"Thanks, of course you have much more experience then me and everybody else here !".
I'm not sure what it means, but probably it does not mean "I completely agree".
The problem with 1000 games is that the significant error margin (roughly +/-20 Elo for 1,000 games) makes this kind of testing _extremely_ risky unless your improvements are massive and lie well outside that +/-20 window. zamar wrote: I can't speak for Marco, but the fact is that in the last 1.5 years we have been able to increase Stockfish's strength by around 200 Elo points with our current 1000 games 1+0 system. Now, when you have a testing methodology which perhaps is not in full agreement with statistical theory, but which in practice seems to work very well, you definitely don't want to change it easily. Ralph Stoesser wrote: To me it sounds plausible that fast games are good enough for testing tiny eval changes like flipping a few innocent bonuses. But I have no experience; it is only that it seems plausible to me. What you have found is of course far more than this. Thank you for the detailed remarks.
I rarely have success with gcc PGO if I try to profile the parallel code. With just one thread it seems to work well, but the instant I try a second thread, I get that same "corrupted" file error... Ralph Stoesser wrote: Sorry, no offense meant. mcostalba wrote: No, I wrote it on the fly in the forum text editor. It is just a sketchy idea, never meant to be a working patch. Ralph Stoesser wrote: Have you tried to compile your code?
I have not changed the whole movegen code. So far I have only removed the color parameter for queen, rook, bishop and knight moves (generate_piece_moves()).
This unfinished change was for testing purposes only. I didn't want to change a lot, because it's always possible that it is not worth the effort. Furthermore, this PGO compile error was annoying and time-consuming today. I have tried other gcc versions, but with no success. Again, it's not that I need "./" before the executable. I have worked with unix/linux for many years, so I'm able to fix something like this.
After this first test I think it is worth optimizing the movegen code the way you suggest. It's a natural optimization, and if it's done right, I would say it doesn't mess up the code style.
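For what it's worth, here is a minimal, self-contained sketch of the pattern under discussion: making the side to move a template parameter instead of a runtime argument. It is a toy knight generator, not the actual generate_piece_moves() from Stockfish; the Position and Move types are simplified stand-ins, and __builtin_ctzll is the gcc builtin for "index of the lowest set bit":

#include <cstdint>
#include <cstdio>

enum Color { WHITE, BLACK, COLOR_NB };
enum PieceType { KNIGHT, BISHOP, ROOK, QUEEN, PIECE_TYPE_NB };
typedef std::uint64_t Bitboard;

struct Position
{
    Bitboard byColorType[COLOR_NB][PIECE_TYPE_NB];
    Bitboard byColor[COLOR_NB];
};

struct Move { int from, to; };

// Toy attack lookup, computed on the fly instead of from a precomputed table.
Bitboard knight_attacks(int sq)
{
    static const int dr[] = { 2, 2, 1, 1, -1, -1, -2, -2 };
    static const int df[] = { 1, -1, 2, -2, 2, -2, 1, -1 };
    int r = sq / 8, f = sq % 8;
    Bitboard b = 0;
    for (int i = 0; i < 8; ++i)
    {
        int rr = r + dr[i], ff = f + df[i];
        if (rr >= 0 && rr < 8 && ff >= 0 && ff < 8)
            b |= Bitboard(1) << (rr * 8 + ff);
    }
    return b;
}

// Before: Move* generate_piece_moves(const Position&, Move*, PieceType, Color us);
// After: one instantiation per color, so "us" is a compile-time constant and no
// color argument has to be threaded through the generator at run time.
template<Color Us>
Move* generate_knight_moves(const Position& pos, Move* mlist)
{
    Bitboard knights = pos.byColorType[Us][KNIGHT];
    Bitboard notOwn  = ~pos.byColor[Us];
    while (knights)
    {
        int from = __builtin_ctzll(knights);
        knights &= knights - 1;
        Bitboard targets = knight_attacks(from) & notOwn;
        while (targets)
        {
            int to = __builtin_ctzll(targets);
            targets &= targets - 1;
            Move m = { from, to };
            *mlist++ = m;
        }
    }
    return mlist;
}

int main()
{
    Position pos = Position();                          // empty toy position
    pos.byColorType[WHITE][KNIGHT] = Bitboard(1) << 1;  // white knight on b1
    pos.byColor[WHITE]             = pos.byColorType[WHITE][KNIGHT];

    Move list[64];
    Move* end = generate_knight_moves<WHITE>(pos, list);
    std::printf("generated %d knight moves\n", int(end - list));
    return 0;
}

The caller would pick the <WHITE> or <BLACK> instantiation once at the top of move generation, so no color branch or parameter is left inside the inner loops.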
We don't need such a large error margin. I believe one-sided one-sigma confidence (roughly +9 Elo for 1000 games) is more than enough. For six steps forward, you are prepared to take one step backward and live with that fact. In practice we've had good results even including all >+5 Elo changes, and still we are going more forward than backward. bob wrote: The problem with 1000 games is that the significant error margin (roughly +/-20 Elo for 1,000 games) makes this kind of testing _extremely_ risky unless your improvements are massive and lie well outside that +/-20 window.
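To put a number on that trade-off (again just normal-approximation arithmetic, not the actual acceptance rule used for Stockfish), a small sketch: with roughly 9 Elo of 1-sigma noise in a 1000-game match and a "measured gain must be at least one sigma" bar, the chance that a patch of a given true strength passes is:

#include <cmath>
#include <cstdio>

// Standard normal CDF.
double normal_cdf(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

int main()
{
    const double sigma      = 9.0;  // ~1-sigma Elo noise of a 1000-game match
    const double threshold  = 9.0;  // accept only if the measured gain is >= one sigma
    const double true_elo[] = { -5.0, 0.0, 5.0, 10.0 };
    for (int i = 0; i < 4; ++i)
    {
        double p_accept = 1.0 - normal_cdf((threshold - true_elo[i]) / sigma);
        std::printf("truly %+5.1f Elo -> passes %4.1f%% of the time\n",
                    true_elo[i], 100.0 * p_accept);
    }
}

A truly neutral patch slips through about one time in six, roughly in line with the "six steps forward, one step backward" figure, while a patch genuinely worth +10 Elo still fails the bar almost half the time; lowering the bar to +5 Elo trades fewer missed improvements for more accepted regressions.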
I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game as opposed to 1000 games at 1 min/game. That way there are practically _no_ "steps backward"... zamar wrote: We don't need such a large error margin. I believe one-sided one-sigma confidence (roughly +9 Elo for 1000 games) is more than enough. For six steps forward, you are prepared to take one step backward and live with that fact. In practice we've had good results even including all >+5 Elo changes, and still we are going more forward than backward. bob wrote: The problem with 1000 games is that the significant error margin (roughly +/-20 Elo for 1,000 games) makes this kind of testing _extremely_ risky unless your improvements are massive and lie well outside that +/-20 window.
With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.) bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game as opposed to 1000 games at 1 min/game. That way there are practically _no_ "steps backward"...
But the two tests Bob mentions take (theoretically) the same amount of time/CPU power (60,000 x 1 sec = 1000 x 60 sec). rvida wrote: With all respect to your testing methods (which may be scientifically accurate) - in the real world, without some university-sponsored cluster, we want to make some progress without waiting two weeks' worth of self-play to accept a simple change in the codebase. We must take some shortcuts, and the SF team's progress showed that these shortcuts do indeed work. While I have more relaxed rules than Marco, Critter's progress is pretty evident too. (I only wish I had the SF team's "secret" autotuner.) bob wrote: I'm not willing to do that any longer. I'd rather play 60,000 games at 1 sec/game as opposed to 1000 games at 1 min/game. That way there are practically _no_ "steps backward"...
P.S.: Sorry for my horrible English.