A couple of questions regarding engine testing?

MahmoudUthman · Post by **MahmoudUthman** » Fri Oct 19, 2018 1:00 pm

1- Should hyper-threading and Intel Turbo Boost be disabled, or is it okay to leave them on?
2- Do I always need to retest using LTCs (long time controls), or does it depend on the Elo difference between the old and new versions of the engine?
3- Can I run fishtest locally and use it to manage tests for my engine, or is it hard-coded to work only with Stockfish and its repositories?
4- What is a good SPRT window for testing different changes to the engine? Do I need different windows depending on the modification (eval, search, etc.)? And what is the shortest time control I can use to test different types of modifications? Can I use a few seconds (<10 s) per game, or is that too short?

Robert Pope · Post by **Robert Pope** » Fri Oct 19, 2018 2:57 pm

1- Should hyper-threading and Intel Turbo Boost be disabled, or is it okay to leave them on?

I think it is not okay to leave them on, but I don't have a CPU with either, so I haven't had to worry about it. The issue is that an engine that gets a core to itself or gets turbo-boosted will have an advantage over the other engine, and you can't assume that the assignment will be random and average out. Even if it does average out over time, you are still adding noise, which means your tests will have less power.

2- Do I always need to retest using LTCs (long time controls), or does it depend on the Elo difference between the old and new versions of the engine?
Unless your fix is intended to impact only longer time controls, I wouldn't generally retest at LTC.

3- Can I run fishtest locally and use it to manage tests for my engine, or is it hard-coded to work only with Stockfish and its repositories?

4- What is a good SPRT window for testing different changes to the engine? Do I need different windows depending on the modification (eval, search, etc.)? And what is the shortest time control I can use to test different types of modifications? Can I use a few seconds (<10 s) per game, or is that too short?

I use [0,10] at something like 4+0.2, which I think is in the same ballpark as Stockfish. I also cut off after 2000 games for practical reasons - that's about what I can run in an overnight test. The narrower the window, the more games SPRT requires to finish. The wider the window, the more likely a good patch will be rejected.
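To make the window trade-off concrete, here is a minimal sketch of an SPRT stopping rule for a [0,10] window. It is an illustration only: it assumes logistic Elo, a normal approximation to the log-likelihood ratio, and alpha = beta = 0.05, none of which are stated in the thread, and the function names are made up for this example. It is not necessarily what fishtest or any particular match runner computes.

```python
import math


def elo_to_score(elo):
    """Expected score for a given logistic Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0).

    Uses a normal approximation based on the observed mean score and
    per-game variance (an assumption of this sketch).
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    score = (wins + 0.5 * draws) / n
    variance = (wins + 0.25 * draws) / n - score ** 2
    if variance <= 0.0:
        return 0.0  # not enough varied results to estimate anything yet
    s0 = elo_to_score(elo0)
    s1 = elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * variance)


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


def sprt_decision(wins, draws, losses, elo0=0.0, elo1=10.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0', or 'continue'."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    # Hypothetical results, not taken from any real test run.
    print(sprt_decision(600, 800, 400))
    print(sprt_decision(10, 10, 10))
```

In this approximation the log-likelihood ratio grows in proportion to the distance between the two hypotheses, which is why a narrower window needs more games to reach either bound.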

jdart · Post by **jdart** » Sat Oct 20, 2018 5:54 pm

Re 1. I have hyper-threading disabled on my test machines. Turbo boost might not matter if you are going to use all cores anyway.

Robert Pope wrote: ↑Fri Oct 19, 2018 2:57 pm I use [0,10] at something like 4+0.2, which I think is in the same ballpark as Stockfish. I also cut off after 2000 games for practical reasons - that's about what I can run in an overnight test. The narrower the window, the more games SPRT requires to finish. The wider the window, the more likely a good patch will be rejected.

2000 games is not even close to enough. The error bars are too high, unless you have an exceptionally good/bad patch.
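A rough way to see the size of those error bars is to compute a 95% confidence interval on the Elo difference from a win/draw/loss count. The sketch below assumes logistic Elo and a normal approximation to the mean score; the function names and the sample counts are hypothetical and only serve as an illustration.

```python
import math


def elo_from_score(score):
    """Convert a mean score in (0, 1) to a logistic Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_error_bars(wins, draws, losses, z=1.96):
    """Return (elo, elo_low, elo_high) for an approximate 95% interval."""
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / n - score ** 2
    stderr = math.sqrt(max(variance, 0.0) / n)

    # Clamp to the open interval (0, 1) so the Elo conversion stays finite.
    eps = 1e-9

    def clamp(x):
        return min(max(x, eps), 1.0 - eps)

    return (elo_from_score(clamp(score)),
            elo_from_score(clamp(score - z * stderr)),
            elo_from_score(clamp(score + z * stderr)))


if __name__ == "__main__":
    # Made-up results with the same proportions at two sample sizes.
    for w, d, l in [(700, 600, 700), (14000, 12000, 14000)]:
        elo, low, high = elo_with_error_bars(w, d, l)
        print(f"{w + d + l} games: {elo:+.1f} Elo [{low:+.1f}, {high:+.1f}]")
```

With a made-up 700/600/700 split over 2000 games, the interval comes out at roughly ±13 Elo; with the same proportions over 40,000 games it shrinks to roughly ±3 Elo. A patch worth only a few Elo is therefore hard to distinguish from noise at 2000 games.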

4+0.2 is a very slow time control for automated testing. It is probably good to run that TC once in a while for regression testing. But for routine testing I use 0:08+0.1. I have 150+ cores and can run 40,000 games in about 2 hours.

--Jon

Robert Pope · Post by **Robert Pope** » Mon Oct 22, 2018 4:43 am

jdart wrote: ↑Sat Oct 20, 2018 5:54 pm Re 1. I have hyper-threading disabled on my test machines. Turbo boost might not matter if you are going to use all cores anyway.

Robert Pope wrote: ↑Fri Oct 19, 2018 2:57 pm I use [0,10] at something like 4+0.2, which I think is in the same ballpark as Stockfish. I also cut off after 2000 games for practical reasons - that's about what I can run in an overnight test. The narrower the window, the more games SPRT requires to finish. The wider the window, the more likely a good patch will be rejected.
2000 games is not even close to enough. The error bars are too high, unless you have an exceptionally good/bad patch.

4+0.2 is a very slow time control for automated testing. It is probably good to run that TC once in a while for regression testing. But for routine testing I use 0:08+0.1. I have 150+ cores and can run 40,000 games in about 2 hours.

--Jon

I meant 0:04+0.2, which isn't much different than what you use. And since I don't have 150 cores, I do the best I can with what I have.
