A poor man's testing environment

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27817
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A poor man's testing environment

Post by hgm »

lucasart wrote:Cutechess-cli is perfectly capable of handling concurrent writing into the resulting PGN if that's your point.
Indeed, WinBoard would also have no problem with that. (It uses file locking to synchronize the different playing agents.)

But I think his problem is not with the GUI, but that his engine keeps a log file with a fixed name in its installation directory. So if you run it 8 times concurrently, all instances would write to the same log file, which would become an absolute mess (writes to log files are typically not buffered, so as not to lose data in case of a crash).
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

hgm wrote:
lucasart wrote:Cutechess-cli is perfectly capable of handling concurrent writing into the resulting PGN if that's your point.
Indeed, WinBoard would also have no problem with that. (It uses file locking to synchronize the different playing agents.)

But I think his problem is not with the GUI, but that his engine keeps a log file with a fixed name in its installation directory. So if you run it 8 times concurrently, all instances would write to the same log file, which would become an absolute mess (writes to log files are typically not buffered, so as not to lose data in case of a crash).
Komodo has the same issue. It would be good if testing tools provided some sort of symbolic shortcut or macro that replaces %i (for example) with testing instance number. So then you could say, ./myProgram -log %i.log and the tester would make the substitution before invoking the program. That would not be rocket science.
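
Just as an illustration, a minimal sketch of what that tester-side substitution could look like (Python; the %i token, the command template and the engine name are made up for the example, no existing tool is claimed to support this):

Code: Select all

import shlex
import subprocess

def launch_engines(cmd_template, instances):
    """Start one engine process per instance, replacing %i in the
    command line with the instance number so each instance can use
    its own log file."""
    procs = []
    for i in range(instances):
        cmd = cmd_template.replace("%i", str(i))   # "./myProgram -log %i.log" -> "./myProgram -log 3.log"
        procs.append(subprocess.Popen(shlex.split(cmd),
                                      stdin=subprocess.PIPE,
                                      stdout=subprocess.PIPE))
    return procs

# hypothetical usage:
# launch_engines("./myProgram -log %i.log", 8)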
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Rebel wrote:I would like to present a starter page on how to test a chess engine with limited hardware. I am interested in some feedback for further improvement.

http://www.top-5000.nl/tuning.htm
Hi Ed,

All your pages are valuable resources to the computer chess community. Another good page.

I have a comment on one thing you said:

Code: Select all

Understanding LOS
 
LOS means: Likelihood of superiority.
 
A LOS of 95% means that the match result statistically gives you 95% certainty that the version is superior. It says nothing about the Elo gain, just that you have 95% certainty the version is better by more than 0.00001 Elo.
 
Among programmers there is (I think by now) a kind of consensus that 95% is the norm for keeping a change, provided enough games are played. "Enough" is still undefined, but 5000 games seems to be the lower limit currently and I tend to agree.
 
The latter automatically moves us to the final and important chapter: when do you terminate a running match? A couple of pieces of advice:
 
Don't give up too early on a version. A 40-60 score after 100 games statistically means nothing. After 500 games, if the result is still 40%, it's high time to terminate.

Don't accept changes even with a LOS of 95% before 1000 games.
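
For reference, the LOS of a given match result can be computed directly from the wins and losses; a minimal sketch using the usual normal approximation (draws cancel out; this is just an illustration, not necessarily the exact formula behind the numbers quoted above):

Code: Select all

from math import erf, sqrt

def los(wins, losses):
    """Likelihood of superiority from the decisive games,
    using the normal approximation (draws cancel out)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

# e.g. 300 wins vs 250 losses (draws ignored) gives roughly 98% LOS
print(round(los(300, 250), 3))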
Larry and I have come to appreciate that the biggest factor in making steady progress is more than just the error bars. If you test 20 changes and honor the error margins, you can still easily produce a regression when comparing the initial version to the version after the 20 changes. We assume you kept some of those changes and rejected others. I don't know if this is addressed anywhere in scientific testing methodology, but chaining "improvements" together like this is not a very reliable way to test. I know this goes way beyond the scope of your article.

The reason is that given some number of candidate versions to test, your success will have a lot more to do with the percentage of those candidates that happen to represent good changes. And it's not simply because you make faster progress when you have more good changes although that of course is always good.

Suppose you have 20 versions and 10 of them are improvements. Somebody else starts with the same program and makes 50 versions that also contain 10 improvements. We assume that the 10 improvements in both cases are of equal magnitude. You test your 20, one version at a time, and are very careful to honor the error margins and use good testing methodology. The "other guy" tests his 50 versions one at a time, and he is very careful to obey the error margins and use good testing methods. In theory both of you reject all the bad versions, keep the 10 good versions, and in the end come up with the same total improvement. Right?

In PRACTICE it doesn't work that way. In PRACTICE you will both conclude that some of the regressions are actually improvements, because that is just the nature of the beast - we are operating within a certain noise threshold and it becomes impractical to run hundreds of thousands of games to resolve tiny differences. The guy who had to test 50 versions will have kept more regressions and will have made less total progress.
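
A toy Monte Carlo run makes the effect visible; this is only a sketch with made-up numbers (a +/-3 Elo change size, 5000-game matches and a deliberately naive "scored above 50%" acceptance rule - not how Komodo is actually tested):

Code: Select all

import random
from math import sqrt

def kept_regressions(num_candidates, num_good, games=5000, elo=3.0, trials=10000):
    """Average number of truly-bad changes that pass a naive
    'scored above 50%' acceptance rule (all numbers illustrative)."""
    sigma = 0.4 / sqrt(games)            # rough noise on the measured score fraction
    good_e = 0.5 + elo / 695.0           # approx. expected score of a +elo change
    bad_e = 0.5 - elo / 695.0            # approx. expected score of a -elo change
    kept_bad = 0
    for _ in range(trials):
        for i in range(num_candidates):
            true_e = good_e if i < num_good else bad_e
            measured = random.gauss(true_e, sigma)   # simulated match result
            if measured > 0.5 and i >= num_good:
                kept_bad += 1
    return kept_bad / trials

# Same 10 good ideas, different number of bad ones:
print(kept_regressions(50, 10))   # roughly 9 regressions kept on average
print(kept_regressions(20, 10))   # roughly 2 regressions kept on average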

What is the solution? One solution is to make sure that at least half of your ideas are good ;-) Other than that, I have no solution, but you can minimize the problem, and as a result of these findings Larry and I have modified our testing methodology. We now often run over 100,000 games and rarely fewer than 50,000. We don't keep anything that is a close call unless we have strong reason to believe that it cannot be bad. For example, if we find a way to gain a slight speedup without risk, we know it is a good thing even if the test is really close. The test in this case is more of a sanity check against bugs.

A good percentage of our ideas ARE in fact good ideas and that has enabled us to make remarkable progress - but you cannot assume it will always be that way.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
hgm
Posts: 27817
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A poor man's testing environment

Post by hgm »

Don wrote:It would be good if testing tools provided some sort of symbolic shortcut or macro that replaces %i (for example) with testing instance number. So then you could say, ./myProgram -log %i.log and the tester would make the substitution before invoking the program. That would not be rocket science.
That is a cool idea. WinBoard already does something like that for its own log file (by default winboard.debug): if the specified -debugfile contains a %d, it will be replaced (in a tourney) by the game number. There is no way to do that in the engine command yet, however. My latest engine projects all log through the GUI, by sending whatever they want logged as an 'engine comment' (any line starting with '#') to the GUI. So saving a separate GUI log file for each game was sufficient. (The disadvantage is that you also log the opponent.)
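
Just as an illustration of that 'log through the GUI' approach, a minimal sketch for a CECP engine (the message text is made up):

Code: Select all

import sys

def log(msg):
    """Send a debug line to the GUI as an engine comment.
    CECP GUIs treat lines starting with '#' as comments, and
    WinBoard writes them to its own debug file."""
    sys.stdout.write("# " + msg + "\n")
    sys.stdout.flush()

log("starting search, depth 18")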

I will give it some thought.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

hgm wrote:
Don wrote:It would be good if testing tools provided some sort of symbolic shortcut or macro that replaces %i (for example) with testing instance number. So then you could say, ./myProgram -log %i.log and the tester would make the substitution before invoking the program. That would not be rocket science.
That is a cool idea. WinBoard already does something like that for its own log file (by default winboard.debug): if the specified -debugfile contains a %d, it will be replaced (in a tourney) by the game number. There is no way to do that in the engine command yet, however. My latest engine projects all log through the GUI, by sending whatever they want logged as an 'engine comment' (any line starting with '#') to the GUI. So saving a separate GUI log file for each game was sufficient. (The disadvantage is that you also log the opponent.)

I will give it some thought.
I have not even done this in my own tester, but I plan to. There are times when I want to log output via Komodo's -log option, but this causes all output to go to the same file. Larry has a 16-core machine and often sets it up to run 32 instances, so imagine him having to sort through all of that!
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

mcostalba wrote:
Rebel wrote:I am interested in some feedback for further improvement.
Thanks!

The interesting part for me is the testing from starting positions that do not come out of a book.

In your start-position pool you have different kinds of late/endgame positions. Normally I always test from opening books, possibly truncated at, say, 10 moves.

But your approach is interesting; on one side you probably enhance the test's sensitivity to the patch you are interested in verifying. For instance, if it is an endgame evaluation tweak, I guess starting just from late-middlegame positions enhances the signal-to-noise ratio and so would require fewer games for a reliable result (although current Elo estimators still do not take the noise level into account, but that is a different topic).

On the other hand it is easier to introduce artifacts and unwanted bias if you apply it blindly, without understanding what you are doing.

As a general approach I'd still prefer to start out of the book; it seems safer to me and, most importantly, better approximates the "real world" conditions you'll find in public rating-list tests.
Indeed, I use those early | normal | light | rook | late PGN sets exclusively for endgame changes, as a first smell test. When things look good, the normal approach should confirm the changes.

Besides, I can create those sets from human or comp collections.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

Hi Don,
Don wrote:Larry and I have come to appreciate that the biggest factor in making steady progress is more than just the error bars. If you test 20 changes and honor the error margins, you can still easily produce a regression when comparing the initial version to the version after the 20 changes. We assume you kept some of those changes and rejected others. I don't know if this is addressed anywhere in scientific testing methodology, but chaining "improvements" together like this is not a very reliable way to test. I know this goes way beyond the scope of your article.
Indeed :lol:
The reason is that given some number of candidate versions to test, your success will have a lot more to do with the percentage of those candidates that happen to represent good changes. And it's not simply because you make faster progress when you have more good changes although that of course is always good.
Yes, I know, and that issue requires a loooong second page. Separate improvements that might bite each other when you combine them: don't get me started :wink: Usually when I have collected 5, 6 or 7 of them I combine them and use common sense to find the optimal settings, confirmed by (a couple of) new matches; that is, if I am lucky. I never got CLOP to work and I am not surprised by that. I am actually very interested in how others tackle this nightmare.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A poor man's testing environment

Post by Don »

Rebel wrote:Hi Don,
Don wrote:Larry and I have come to appreciate that the biggest factor in making steady progress is more than just the error bars. If you test 20 changes and honor the error margins, you can still easily produce a regression when comparing the initial version to the version after the 20 changes. We assume you kept some of those changes and rejected others. I don't know if this is addressed anywhere in scientific testing methodology, but chaining "improvements" together like this is not a very reliable way to test. I know this goes way beyond the scope of your article.
Indeed :lol:
The reason is that given some number of candidate versions to test, your success will have a lot more to do with the percentage of those candidates that happen to represent good changes. And it's not simply because you make faster progress when you have more good changes although that of course is always good.
Yes, I know, and that issue requires a loooong second page. Separate improvements that might bite each other when you combine them: don't get me started :wink: Usually when I have collected 5, 6 or 7 of them I combine them and use common sense to find the optimal settings, confirmed by (a couple of) new matches; that is, if I am lucky. I never got CLOP to work and I am not surprised by that. I am actually very interested in how others tackle this nightmare.
This takes nothing away from your article of course - that would be a separate issue to address. The first thing to tackle is how to get naive testers educated in the basics, and that is like pulling teeth.

For this issue we are very liberal about rejecting a change and very stubborn about keeping one. We actually have to believe a change is good (with lots of pre-testing) before we start testing it in earnest. It does seem to be paying off, though.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: A poor man's testing environment

Post by Rebel »

Houdini wrote:
Rebel wrote:Are the 2 questions related?
Possibly, that depends on you...

The concurrency argument will tell cutechess-cli to play multiple simultaneous games. If you want to use 8 cores, set "-concurrency 8" and cutechess-cli will play 8 simultaneous games.

If I understand correctly, you create 8 engine files in 8 folders and run 8 separate testing processes (clicking 8 times on the batch file) creating 8 PGN output files that need to be combined.

I have 1 engine file and run a single testing process using "-concurrency 8" creating a single PGN output file. KISS!
Tried "-concurrency 4"

Code: Select all

 c:\cc\cutechess-cli -engine name=MAIN1 cmd=yourengine.exe dir=C:\a\main1 proto=uci -engine name=WORK1 cmd=yourengine.exe dir=c:\a\work1 proto=uci -each tc=inf -draw 160 100 -resign 5 500 -rounds 1000 -repeat -pgnout c:\cc\all.pgn -pgnin c:\cc\1.pgn -pgndepth 20 -concurrency 4
What happens is that indeed 4 threads are started, but processor activity is only 25% and only 1 game is being played. Aborting the match kills only one thread, leaving the other 6 executables idle in the Task Manager, and I must remove them manually. Totally unusable for me.

Perhaps a WB2UCI problem?

Anyway, it's not very encouraging for adding a "-concurrency" option recommendation to my starter page.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: A poor man's testing environment

Post by lucasart »

Don wrote:
hgm wrote:
lucasart wrote:Cutechess-cli is perfectly capable of handling concurrent writing into the resulting PGN if that's your point.
Indeed, WinBoard would also have no problem with that. (It uses file locking to synchronize the different playing agents.)

But I think his problem is not with the GUI, but that his engine keeps a log file with a fixed name in its installation directory. So if you run it 8 times concurrently, all instances would write to the same log file, which would become an absolute mess (writes to log files are typically not buffered, so as not to lose data in case of a crash).
Komodo has the same issue. It would be good if testing tools provided some sort of symbolic shortcut or macro that replaces %i (for example) with testing instance number. So then you could say, ./myProgram -log %i.log and the tester would make the substitution before invoking the program. That would not be rocket science.
How about calling your log file log_%pid.log? That solves the problem, no?
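
A minimal sketch of that idea (shown in Python for brevity; a C engine would do the same with getpid(), and the file name pattern is just an example):

Code: Select all

import os

# Give every engine process its own log file by putting the process id
# in the file name, so concurrent instances never share a log.
log_name = "log_%d.log" % os.getpid()
log = open(log_name, "a", buffering=1)   # line-buffered: a crash loses at most one line
log.write("engine instance with pid %d started\n" % os.getpid())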
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.