cutechess-cli question

lech · Post by **lech** » Thu Nov 08, 2012 4:54 pm

I'll try to explain why it is worth using only one process to test changes to an engine.
If you start a test between two identical engines under these conditions:
1 CPU, fixed time per move (not too short!), hash cleared before each game, and e.g. 10 games (5 random positions x 2), you should get a 5:5 result. If not, it means the conditions were not the same for both engines (hardware, software (OS) or cutechess-cli). Most likely the operating system? If you use the same engine in one process (the same process plays both WHITE and BLACK), there is a chance that the result will be closer to (or exactly) 5:5.

ilari · Post by **ilari** » Thu Nov 08, 2012 5:22 pm

lech wrote: An engine is not a die (more like a Rubik's cube).
The second question is (if I may):
I'm trying to test with the option st=x (time per move).
Why does cutechess-cli return <x loses on time>? I tested it with two original Stockfish-23-32-ja engines as opponents.

The engines lose on time because they use more than <x> time per move. It happens a lot with "time per move" time controls. You should use the "timemargin=N" option to allow engines to go N msec over the limit.

The third question is (if I may):
To get less random results it would be good to clear hash before each game. Can I do it?

Unfortunately cutechess-cli doesn't yet support setting options between games, so you'll have to clear the hash by yourself when a new game starts.

I'll try to explain why it is worth using only one process to test changes to an engine.
If you start a test between two identical engines under these conditions:
1 CPU, fixed time per move (not too short!), hash cleared before each game, and e.g. 10 games (5 random positions x 2), you should get a 5:5 result.

It doesn't work quite like that. Even if you set a fixed amount of time per move, the amount of CPU cycles used per move will vary, even if you used only one engine process. If you want the results to be as reproducible as possible you should set a node and/or depth limit instead of a time limit.
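As a rough illustration of the two suggestions in this reply, here is a minimal Python sketch that only assembles a cutechess-cli command line, either with `st` plus `timemargin` or with a node limit. The helper name `build_selfplay_command`, the engine path, the engine names `A` and `B`, and `proto=uci` are assumptions made for the example; check the option spellings against the help text of your cutechess-cli version before relying on them.

```python
def build_selfplay_command(engine_cmd, games, st=None, timemargin_ms=None, nodes=None):
    """Return an argv list for a self-play match (hypothetical helper, a sketch only)."""
    if st is None and nodes is None:
        raise ValueError("give either st (seconds per move) or nodes")

    each = ["proto=uci"]  # assumption: the engine speaks UCI
    if nodes is not None:
        # Node limit instead of a clock: the search effort no longer depends on timing.
        each += ["tc=inf", f"nodes={nodes}"]
    else:
        each.append(f"st={st}")  # fixed time per move, in seconds
        if timemargin_ms is not None:
            # Let the engines overshoot the limit by this many milliseconds.
            each.append(f"timemargin={timemargin_ms}")

    return [
        "cutechess-cli",
        "-engine", f"cmd={engine_cmd}", "name=A",
        "-engine", f"cmd={engine_cmd}", "name=B",
        "-each", *each,
        "-games", str(games),
    ]


if __name__ == "__main__":
    # "./stockfish" is a placeholder path, not a real installation.
    print(" ".join(build_selfplay_command("./stockfish", 10, st=1, timemargin_ms=100)))
    print(" ".join(build_selfplay_command("./stockfish", 10, nodes=100000)))
```

The returned list can be passed to `subprocess.run` as is. Keeping the time-based and node-based variants in one helper makes it easy to compare how repeatable the two kinds of match are on the same machine.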

lech · Post by **lech** » Thu Nov 08, 2012 6:01 pm

Thanks for your explanation.

ilari wrote: It doesn't work quite like that. Even if you set a fixed amount of time per move, the amount of CPU cycles used per move will vary, even if you used only one engine process.

Maybe someone has experience with single-process testing?

hgm · Post by **hgm** » Thu Nov 08, 2012 6:44 pm

When I developed micro-Max I was testing it in self-play as a single process. The results were just as random as normal, of course. The chances that self-play would end in 5-5 are quite small with any engine I know. Even highly deterministic engines like micro-Max and Eden usually don't play the same game twice. Let alone engines that have hash tables and more advanced time management.
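To put a rough number on this, here is a small Python sketch under a deliberately simple model: every game is an independent trial with fixed win, draw and loss probabilities, and identical engines win and lose equally often. Real matches with paired openings and colour advantage will not follow this model exactly, and the function name `prob_exact_tie` and the draw rates are made up for the example.

```python
from math import comb


def prob_exact_tie(games, p_win, p_draw):
    """Probability that a match ends exactly level, assuming independent games."""
    p_loss = 1.0 - p_win - p_draw
    total = 0.0
    # A level score needs as many wins as losses; the remaining games are draws.
    for wins in range(games // 2 + 1):
        losses = wins
        draws = games - wins - losses
        ways = comb(games, wins) * comb(games - wins, losses)
        total += ways * p_win**wins * p_loss**losses * p_draw**draws
    return total


if __name__ == "__main__":
    # No draws at all: 252 of the 1024 equally likely outcomes are 5:5.
    print(prob_exact_tie(10, 0.5, 0.0))
    # Half of the games drawn (an assumed draw rate, not a measured one).
    print(prob_exact_tie(10, 0.25, 0.5))
```

With no draws the first call gives 252/1024, about 0.246, so even between perfectly equal opponents an exact 5:5 is far from guaranteed under this model.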

Tom Likens · Post by **Tom Likens** » Fri Nov 09, 2012 5:36 am

lech wrote: If you start a test between two identical engines under these conditions:
1 CPU, fixed time per move (not too short!), hash cleared before each game, and e.g. 10 games (5 random positions x 2), you should get a 5:5 result. If not, it means the conditions were not the same for both engines (hardware, software (OS) or cutechess-cli).

I know it seems like this should be the case, but in reality it won't be. The timer resolution
simply isn't accurate enough and you will see time jitter between the game runs. Even if the
difference is one or two extra nodes in the search it will eventually result in one of the
engines playing a different move--and BANG! you're done.

To get the 50% result you're looking for, you will need to play *thousands* of games and
the result will asymptotically approach 50%, jittering slightly above and below it. In the
old days we used to play 100 games and declare victory. Remi's analysis and Bob's
cluster disabused us of that naive notion.
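A back-of-the-envelope Python sketch of why 100 games is too few, assuming independent games between equal engines (wins and losses equally likely, draws scoring half a point). The function name `score_margin` and the normal-approximation factor 1.96 are choices made for this illustration, not something taken from Remi's or Bob's work.

```python
from math import sqrt


def score_margin(games, p_draw=0.0, z=1.96):
    """Approximate 95% margin on the score fraction of equal engines."""
    # Per-game variance around the 0.5 mean: wins and losses each deviate by 0.5,
    # draws do not deviate at all, so the variance is (1 - p_draw) / 4.
    variance = (1.0 - p_draw) / 4.0
    return z * sqrt(variance / games)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, round(score_margin(n), 4))
```

Without draws the margin is roughly ±0.098 of the score at 100 games and about ±0.0098 at 10,000 games, which is why small differences between versions only show up after thousands of games.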

Unfortunately, it's the required sacrifice we all make to the Goddess of Statistics and Chaos.

regards,
--tom

lech · Post by **lech** » Fri Nov 09, 2012 10:21 pm

I think it is only a problem of calibration.
If some programmers need more than 100 games to know which version is better, it means that they use engines and computers as dice, not as a Rubik's cube.
