MahmoudUthman wrote: 1- How many opponents to use?
I do mostly self-play, and only periodically run a gauntlet against a half-dozen opponents (plus an older benchmark version of my own engine). I used to rely strictly on gauntlets against 8 players, but I was using PSWBTM, which didn't let me play games fast enough, so there was always too much noise in my testing.
2- How much should the rating difference between the weakest and the strongest opponents be?
I think a range of (-100, +200) Elo relative to your own engine is good for lower-tier engines. You still score roughly 25% or better against everyone, and it allows you to keep using many of the same engines over time. It's important to know that the engines you are using are solid.
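To put numbers on that range, here is a minimal sketch using the standard logistic Elo formula. The function name is just for illustration, and real results will drift from this depending on draw rates and how accurate the ratings are.

```python
def expected_score(elo_diff):
    """Expected score (0..1) for a player rated elo_diff above the opponent.

    Standard logistic Elo model; a negative elo_diff means we are weaker.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Opponents from 100 Elo weaker than us to 200 Elo stronger than us.
for opponent_offset in (-100, 0, 100, 200):
    score = expected_score(-opponent_offset)
    print(f"opponent {opponent_offset:+4d} Elo: expected score {score:.1%}")
```

Under this model the strongest opponent in the range (+200) still leaves you with an expected score of about a quarter of the points.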
3- How many different starting positions should I use?
At least as many as the number of games you are playing. You don't want your testing to repeat the same game, as that generates false precision. I use 8moves_GM_LB.pgn, which has at least a couple thousand openings.
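If you script your own matches, one simple way to guarantee no opening is reused is to sample without replacement. This is only a sketch: the function name is made up, and it works on opening indices rather than parsing the PGN file itself.

```python
import random


def assign_openings(num_openings, num_games, seed=None):
    """Pick a distinct opening index for every game (no repeats).

    Raises ValueError when the opening book is too small for the run,
    which is exactly the case where games would start repeating.
    """
    if num_games > num_openings:
        raise ValueError("not enough openings to give every game a unique one")
    rng = random.Random(seed)
    return rng.sample(range(num_openings), num_games)


# Example: a 2000-opening book is enough for a 500-game run.
schedule = assign_openings(num_openings=2000, num_games=500, seed=1)
print(len(schedule), len(set(schedule)))
```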
4- Smallest sufficient time control? And should I test at different time controls, or is a single fixed time control enough?
Most of the time, I just test as fast as I can get away with. You almost have to, in order to get enough games in. I use 6+0.25, which is about as fast as my engine can handle without side effects.
5- How many games to play?
I use SPRT, with a 2000-game maximum (the number my computer can play overnight).
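For anyone who hasn't seen SPRT written out, here is a rough sketch of the stopping rule with a game cap. It uses the common normal-approximation form of the log-likelihood ratio, not necessarily what any particular testing tool computes. The hypotheses (elo0=0, elo1=5) and the error rates (alpha=beta=0.05) are assumptions chosen for illustration; only the 2000-game cap comes from my own setup.

```python
import math


def expected_score(elo_diff):
    """Logistic Elo model: expected score for a player elo_diff stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the observed results (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    if var == 0.0:
        return 0.0  # every game had the same result; no information on spread yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05, max_games=2000):
    """Return 'accept H1', 'accept H0', 'inconclusive' or 'continue'."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    if wins + draws + losses >= max_games:
        return "inconclusive"
    return "continue"


print(sprt_decision(10, 10, 10))      # far too few games to decide
print(sprt_decision(900, 600, 500))   # clearly better than H0
```

You would call `sprt_decision` after each finished game (or batch of games) and stop the match as soon as it returns anything other than "continue".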
7- Is it okay to use concurrent games?
Yes, as long as you have the CPU cores (not threads) to support it. I use 3 on my 4-core machine to leave a core free for Windows processing and occasional browsing.
6- Anything else to consider?
Stockfish has really highlighted the importance of a rigorous testing process. It's way too easy to code 5 things at once, pat yourself on the back and move on. But then you don't know whether those changes were actually harmful, because you didn't take the time to play enough games, or which of the changes were actually helpful. It might be a 10 Elo improvement overall, but was it +2+2+2+2+2, or +30+0+2-15-7? Things that I thought were going to be obvious leaps forward ended up being -50 failures after adequate testing.
The worst feeling in the world is spending two weeks on 3 unrelated items, having a huge step back, and having no idea which piece ruined your program.
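To give a feel for why "enough games" matters so much, here is a back-of-the-envelope sketch of the 95% error margin on a match between two roughly equal engines. The 30% draw ratio is an assumption I picked for illustration (your engine's draw rate will differ), and the function names are made up.

```python
import math


def elo_from_score(score):
    """Invert the logistic Elo model: Elo difference implied by a score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_margin_95(games, draw_ratio=0.3):
    """Approximate 95% error margin (in Elo) for evenly matched engines.

    With an expected score of 0.5, the per-game variance of the result
    is (1 - draw_ratio) / 4.
    """
    variance = (1.0 - draw_ratio) * 0.25
    std_error = math.sqrt(variance / games)
    return elo_from_score(0.5 + 1.96 * std_error)


for n in (200, 2000, 20000):
    print(f"{n:6d} games: about +/- {elo_margin_95(n):.1f} Elo")
```

Under these assumptions, even 2000 games leaves a margin of roughly 13 Elo either way, so a single +2 change is invisible in one overnight run, while a -50 failure shows up quickly.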