lkaufman wrote:Laskos wrote:lkaufman wrote:Laskos wrote:lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HT off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I got my first i7 desktop with 4 physical cores and Windows 7, some 5 years ago, I have always used HT ON. I divide match tests roughly in two: as many games as possible, or as long a time control as possible (with as many games as possible too). For the first case I use 8-1=7 or 8-2=6 logical cores. For the second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed 1-2% lower than when setting affinity to physical cores or disabling HT in the BIOS, but this is spread equally, so it doesn't bother me. The last time I performed a statistical analysis of the match outcomes for the 8-1=7 setup was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to the Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.
With HT ON and concurrency 8 on 4 physical cores, the speed of a single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 it is maybe 25% lower. So, to compare different results at the same time control, one has to be consistent, as the "effective time control" is different for 4-0, 8-1 or 8-2. Sometimes I have to use 8-2=6 because a YouTube video or an antimalware program eats one full thread, and there are other things happening with the OS or the browser.
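To put rough numbers on the "effective time control" point, here is a minimal Python sketch. It assumes the slowdowns quoted above (about 25% at concurrency 7, about 30% at concurrency 8, relative to concurrency 4) and that search effort scales linearly with engine speed, which is a simplification. The function names and the `SETUPS` table are made up for illustration.

```python
# Rough arithmetic only: slowdown figures are the approximate ones quoted in
# the post, and "effort scales linearly with speed" is an assumption.

def effective_tc(nominal_tc_s: float, slowdown: float) -> float:
    """Time control (seconds) that an unslowed engine would need
    to do the same amount of search as a slowed one at nominal_tc_s."""
    return nominal_tc_s * (1.0 - slowdown)


def nominal_tc_for(effective_tc_s: float, slowdown: float) -> float:
    """Nominal time control needed to reach a given effective time control."""
    return effective_tc_s / (1.0 - slowdown)


def relative_throughput(concurrency: int, slowdown: float,
                        base_concurrency: int = 4) -> float:
    """Games per hour relative to the 4-0 setup, at equal effective time control."""
    return concurrency * (1.0 - slowdown) / base_concurrency


# Hypothetical table: setup name -> (concurrent games, assumed slowdown)
SETUPS = {"4-0": (4, 0.00), "8-1": (7, 0.25), "8-0": (8, 0.30)}

if __name__ == "__main__":
    for name, (games, slowdown) in SETUPS.items():
        print(name,
              "effective TC at 30s nominal:", effective_tc(30.0, slowdown),
              "relative throughput:", relative_throughput(games, slowdown))
```

Under these assumptions, 8-1 at 40s/game gets roughly the same search effort per game as 4-0 at 30s/game, while finishing about 1.3 times as many games per hour.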
Thanks. A couple follow-up questions if you don't mind:
1. What is "longer time control" where you only use actual number of physical cores?
2. Why use fewer threads for longer time controls? Since using 7 or 8 is way more efficient than using 4 (based on your numbers, which are similar to my own), you must really distrust using 7 or 8 in longer games for some reason.
I meant that in these games I do care about scaling with "effective time", that is, a longer TC or a higher NPS. The test can often last for days, and I still use my PC for browsing and text editing. I feel comfortable with 8-1 only for runs of a few hours, during which I usually don't disturb this PC much. In these longer 4-0 games (say above 30s/game), overhead, granularity of the Windows timer, and move selection noise are all of smaller importance, and I want to preserve the "quality" of the test. 8-1 games at 10s/game or less are noisy anyway.
3. How do you know in advance whether to leave 1 or 2 threads free for things that are unpredictable?
4. Why leave 1 (or 2) threads free when aiming for 8 but none free when aiming for 4?
With 8-0, the impact of a 2-3% task on fast games is bad; it can even occasionally disrupt a game, as it could stick for, say, 30ms to a thread used by an engine. At least 8-1 is mandatory. With 4-0 and HT ON (8 logical cores), a 2-3% task usually interferes minimally with the active threads, as there are plenty of free ones available, and even if it does interfere, on longer TC the Windows scheduler is fast enough not to keep it stuck. But I am not a specialist on these issues, and some things might be subjective.
I see. So you think that for "real" games (30" or longer) the improved quality of the test justifies the use of only four threads. So even though you could run on 7 threads at about 20" instead of 30" and get similar search depth with more games per minute, the test might be less "fair" and so you prefer to use only four threads. If that is the case, then we are probably doing the right thing to turn off hyperthreading and use (for example) 15 threads on a 16 core machine.
But your response is interesting in another way. I always thought that the only point of testing at say 30" vs 5" is that you get more depth, since many things behave differently at different depths. But I think you are suggesting that there is just much more randomness at 5" than at 30", so even if search depth had no effect on a given idea to be tested, there is still an argument for the longer time limit. The question is this: are the random factors at the 5" level ones that will become insignificant with say 10,000 games, or are we talking about factors that might bias a test even if you played a million games? That is actually very important to know.
You mean 7 threads at longer TC than 4 threads for the same depth, with more games per minute? Yes, but the issue is mostly practical. When I am back home and the test is running, I often use my desktop PC anyway, as it is usually more responsive than my weak laptop or tablet. While I consider this small interference negligible with 4-0, it might be a bit disturbing with 8-1. But I forgot to mention one aspect of my testing: I often test at fixed depth or nodes, and in this case one can safely go to 8-0, or, with extremely fast games, when the time lag between moves is comparable to movetime, to concurrency higher than 8 on 8 logical cores (4 physical). Sometimes Cutechess-Cli output flows at something like 5-10 games per second in this case, and there is no noise problem with concurrency. Also, do not consider 8-1 games at fixed time as always sub-standard. The first thing I do when Komodo is released is go to LittleBlitzer, set 50ms/move (with LittleBlitzer there are no Komodo time forfeits with that form of TC), which gets rid of possible time control modifications, take care of modifications to overhead, set concurrency to 8-1, and leave it for 15-20 minutes or so playing against a previous Komodo. Then I have 1000 games, with usually very informative results. Strength, NPS, depth, time used (aside from granularity) are all there in the output. Usually the strength difference is amplified a bit, but I am no rating list.
Games at 30'' compared to 2'' differ by a systematic bias, not by white noise decaying as 1/(N games)^0.5. Even if there were no scaling issues (although at 2''/game there are very many scaling issues), the noise at short TC blurs the outcome to such a degree that the strength difference might be systematically distorted. Besides, there is non-random noise, say due to the timer routines used or the overhead used (which I often set smaller). I only use these TC for very close engines, say SF-related devs. The 8-1 issue is the least of my concerns. And again, 8-1 testing can be as good as and faster than 4-0 if you dedicate the whole PC to testing. You can leave 8-1 for days at long TC (say 30''/game), and if your OS tasks or antivirus don't pop up actively at some point, you should be perfectly fine. The thread scheduler in Windows is good for, say, 500ms per move play. I don't know about the Linux thread scheduler.
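For reference, the purely random part of the error, the part that does decay as 1/(N games)^0.5, can be estimated from the win/draw/loss counts. The sketch below is a standard textbook calculation, not anything taken from a particular testing tool; the function names are made up, and it assumes independent games. A systematic bias of the kind described above would not show up in this number and would not shrink with more games.

```python
import math


def score_and_stderr(wins: int, draws: int, losses: int):
    """Mean score per game and its standard error, treating games as
    independent draws from a win/draw/loss distribution (an assumption)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    return score, math.sqrt(var / n)


def elo_from_score(score: float) -> float:
    """Logistic Elo difference corresponding to an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Approximate Elo interval covering random error only."""
    score, se = score_and_stderr(wins, draws, losses)
    return elo_from_score(score - z * se), elo_from_score(score + z * se)


if __name__ == "__main__":
    # Same win/draw/loss proportions, four times as many games:
    # the standard error halves.
    print(score_and_stderr(300, 400, 300))
    print(score_and_stderr(1200, 1600, 1200))
    print(elo_interval(300, 400, 300))
```

Quadrupling the number of games halves the interval width, which is why the question of whether short-TC effects are random or systematic matters: only the random part can be played away.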