Hyperthreading and Computer Chess: Intel i5-3210M

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:And maybe ONE day a little science will work its way into these discussions.

(hint: "ONE PROGRAM"? :)

And based on 1K games, with ONE program, HT is a good idea for all? :)
Let's see:
bob wrote:Regardless of urban legend, I have NEVER seen one example where using hyper threading improves the performance of a chess engine. Not a single one.
bob wrote:There's something badly wrong with your testing. I can post a ton of data relative to Crafty and SMT (hyper-threading). And it has ALWAYS been worse on than off. Including the recent test on my macbook dual i7 with SMT enabled (I can't turn it off).
If it doesn't work for Crafty, it can't work for any other engine? (Well, except for "poorly implemented" engines like Houdini, I guess.)

Btw, I do agree that Mike's results are not statistically significant because of the small sample size. But is it much different for the old papers on which most "conventional knowledge" is based? Just an example: 24 test positions?
You DO realize that I run multi-threaded programs regularly? I have posted results of cluster testing with threads using things like stockfish, to measure Elo improvement.
I'm looking at what you wrote, and what you wrote is that it doesn't work for Crafty with a strong suggestion that therefore it can't work for other engines.
bob wrote:I chose the "24 test positions" because that was what everyone else was using at the time, and it made our results directly comparable, even if not as accurate as might be desired. In my DTS paper I didn't use those positions, I changed the idea to following a real game, move by move, one that Cray Blitz had actually played...
That everyone else was doing the same doesn't change the fact that results from 24 positions are statistically insignificant. Let today be the day that some science gets into this discussion...
bob wrote:There was no significant locking overhead in Cray Blitz. NPS scaled perfectly up to the 32-cpu T90 we used. 12% increase in NPS will NOT offset the 30% increase in nodes searched, however...
Let's do the math. Directly from your DTS paper: 4 processors give a speedup of 3.7, 8 processors give a speedup of 6.6.

This means that on a 4-core machine with HT off, the speedup is 3.7. This is easy.

With 8 processors the speedup is 6.6, but that is including the 2x nps speedup compared to 4 processors. So on a 4-core machine with HT on and assuming 12% higher nps, the speedup is (6.6 / 2) x 1.12 = 3.7. This is still elementary.

So with more than 12% higher nps due to HT, HT is a win.

Again, this is based directly on your DTS paper. The math is not difficult.
#1. I have not seen ANYONE get that kind of speedup today. Back then the trees were pretty much a fixed depth with no forward pruning, or reductions, or any of the other things that make today's trees much more irregular and unstable. So until you find ANY current program that can produce those kinds of speedups, we have to use today's numbers. Crafty is nowhere NEAR the DTS numbers. And I have not seen any program yet that has beaten Crafty's numbers, since everyone is basically using the SAME parallel search approach that is used in Crafty. Ergo, 12% is meaningless.

#2. My measurements, over hundreds of positions, repeated dozens of times (you can find discussions of that here in past threads, where others took my raw data (log files) and computed the speedups for themselves to verify my methodology), show that each additional thread increases the tree by about 30% for that thread; or, more clearly, each added thread searches about 70% useful stuff and 30% completely wasted stuff. Hence the 1.7x speedup I typically see for 2 threads, or the 3.1x speedup I typically see for 4 threads. The HT speedup has to offset that 30% for each thread (except the first) just to break even, and I've not seen numbers like that. I have repeatedly said that most tests for me (with HT on) are break-even at best, but doubling the threads really ramps up the variability of the results (time required to reach a depth). When I first got my PIV box, I thought it was worthwhile, barely. Our first PIV cluster test, however, showed it was actually playing a little worse with SMT on, and at that point I stopped using it. If, by some magic bullet, SMT suddenly becomes more efficient, that might change. But my macbook is an i7, and the tests I ran and posted here a few weeks back suggest it still is not a "winner". Granted, it might not always be a loser either. But I have not seen any evidence that it produces a measurable improvement.
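To make the break-even arithmetic concrete, here is a minimal sketch (Python) that plugs in the figures quoted in this thread; the 70% useful-work figure is the Crafty measurement above for low thread counts, and the real per-thread overhead at 8 threads would have to be measured rather than assumed.

Code: Select all

def effective_speedup(threads, total_nps_factor, extra_thread_useful=0.70):
    # total_nps_factor: raw NPS of the whole configuration relative to one
    #   thread on one core (4.0 for 4 cores, 4.0 * 1.12 for 4 cores with HT
    #   and a 12% NPS gain)
    # extra_thread_useful: fraction of each additional thread's nodes that is
    #   useful work rather than search overhead (~0.70 per the numbers above)
    useful_fraction = (1 + (threads - 1) * extra_thread_useful) / threads
    return total_nps_factor * useful_fraction

print(effective_speedup(4, 4.0))         # 4 cores, HT off -> ~3.1
print(effective_speedup(8, 4.0 * 1.12))  # 4 cores, HT on; only meaningful if
                                         # the 30% overhead still held at 8 threads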

If you want to measure the overhead, it is pretty simple to do. Just run a position to a fixed depth with one cpu, then two, and so on. Here are one- and two-cpu runs for a few positions, on my macbook:


log.001: time=32.29 mat=0 n=140444828 fh=93% nps=4.3M
log.002: time=24.72 mat=0 n=179333906 fh=93% nps=7.3M

log.001: time=1:27 mat=0 n=350299195 fh=93% nps=4.0M
log.002: time=54.90 mat=0 n=374227506 fh=93% nps=6.8M

log.001: time=6.14 mat=0 n=36634945 fh=94% nps=6.0M
log.002: time=4.03 mat=0 n=41061761 fh=94% nps=10.2M

log.001: time=27.45 mat=0 n=121794283 fh=92% nps=4.4M
log.002: time=32.32 mat=0 n=248567560 fh=92% nps=7.7M


As you can see, in each case the second run (two threads) searches more nodes than the first run (one thread).

This is the killer for parallel search performance: imperfect move ordering. Crafty counts "aborts", which are directly related to overhead (abort this search, the results are not needed, because we are failing high on another move at what we thought was an ALL node).
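To quantify the overhead from logs like these, a minimal sketch (Python); the tuples are simply the times and node counts from the four log.001/log.002 pairs above (1:27 = 87 seconds), nothing else is assumed.

Code: Select all

# (time_1cpu, nodes_1cpu, time_2cpu, nodes_2cpu) per position
runs = [
    (32.29, 140444828, 24.72, 179333906),
    (87.00, 350299195, 54.90, 374227506),
    ( 6.14,  36634945,  4.03,  41061761),
    (27.45, 121794283, 32.32, 248567560),
]

for t1, n1, t2, n2 in runs:
    overhead = n2 / n1 - 1.0    # extra nodes searched by the 2-thread run
    speedup = t1 / t2           # wall-clock speedup to the same depth
    print(f"overhead {overhead:6.1%}   speedup {speedup:4.2f}x")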

#3. I generally quote Crafty results because I know what it does and how it does it. I let others report their own parallel search results (few do, however) since it is their code. Crafty's behavior in a parallel search is not going to be that different from what anyone else gets, for obvious reasons: everyone is using YBW as the basic idea, and many are using a near-identical approach to that used in Crafty, since the source (including the parallel search code) has been available for almost 20 years now.
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by hgm »

bob wrote:You want randomness in the GAMES. We provide that by choosing a large number of different opening positions to start playing from. You do NOT want randomness introduced by varying the hardware speed, or by doing silly things like EGTB probes to a remote file-system where network latency introduces randomness, etc. There are plenty of examples where program A plays worse at a faster (or slower time control) than program B. If you slow down the hardware randomly, you essentially make one or both play at a faster effective speed, which skews the results in random ways. Not wanted.

If you have hardware randomness, you have to play enough games to be sure that the randomness affects both sides equally, but with that kind of randomness, more games doesn't help.
Have you tried that? It could easily be simulated in software, e.g. by randomly assigning one of the two sides time odds of (say) 10% for the entire game. I bet you didn't, because if you had, you would have seen that it doesn't make the slightest difference to the outcome of the test matches.
syzygy
Posts: 5777
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:#1. I have not seen ANYONE get that kind of speedup today. Back then the trees were pretty much a fixed depth with no forward pruning, or reductions, or any of the other things that make today's trees much more irregular and unstable. So until you find ANY current program that can produce those kinds of speedups, we have to use today's numbers. Crafty is nowhere NEAR the DTS numbers. And I have not seen any program yet that has beaten Crafty's numbers, since everyone is basically using the SAME parallel search approach that is used in Crafty. Ergo, 12% is meaningless.
Are you saying DTS would be less efficient on x86? I was referring to a hypothetical Cray Blitz that would run on a 4-core cpu with HT.

It's interesting that just a few posts ago you were referring me to a 1988 paper, while now the DTS results are suddenly meaningless for today's programs. Cherry picking is not exactly scientific.
bob wrote:#3. I generally quote Crafty results because I know what it does and how it does it. I let others report their own parallel search results (few do, however) since it is their code. Crafty's behavior in a parallel search is not going to be that different from what anyone else gets, for obvious reasons: everyone is using YBW as the basic idea, and many are using a near-identical approach to that used in Crafty, since the source (including the parallel search code) has been available for almost 20 years now.
So when someone reports results that don't suit you, you dismiss them with the very scientific hint: "ONE PROGRAM"?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

hgm wrote:
bob wrote:You want randomness in the GAMES. We provide that by choosing a large number of different opening positions to start playing from. You do NOT want randomness introduced by varying the hardware speed, or by doing silly things like EGTB probes to a remote file-system where network latency introduces randomness, etc. There are plenty of examples where program A plays worse at a faster (or slower time control) than program B. If you slow down the hardware randomly, you essentially make one or both play at a faster effective speed, which skews the results in random ways. Not wanted.

If you have hardware randomness, you have to play enough games to be sure that the randomness affects both sides equally, but with that kind of randomness, more games doesn't help.
Have you tried that? It could easily be simulated in software, e.g. by randomly assigning one of the two sides time odds of (say) 10% for the entire game. I bet you didn't, because if you had, you would have seen that it doesn't make the slightest difference to the outcome of the test matches.
Yes I have. And it changes the results. I had a version of Crafty that played maybe 100 Elo better than Fruit at very fast time controls, but pretty much equal at double that time per move. In my testing, I used a cluster with 2 cpus per node and played a normal 6K-game match with Fruit: 3000 positions, playing two games per position with colors reversed. The Elo came out about as expected, somewhere between "equal" and "+100" (never really reaching either extreme, as I did not run the test long enough). To screw up the timing, what I did was run one other process on the node that was 100% compute-bound. It therefore competed equally with the engines for CPU cycles (two games at a time on a node, no pondering, plus the cpu burner). It wrecked my comparisons. How common it is that "A is better than B at one T/C and B is better than A at another," I have NOT tried to measure. Part of it is related to time usage, part of it is related to eval and search tuning. If one were unaware of that problem, it would be easy to either keep a bad change or throw out a good change, simply because the CPU-usage variation made one version look better than the other without any code changes at all...

So yes, I ran EXACTLY what you are describing, albeit in a different way. All I could depend on was that at any instant in time, I had 4 engines running, of which two were actually executing and two were blocked waiting on the opponent to move, plus one compute-bound task. So an instantaneous "peek" would show one of the two running engines getting 100% of a CPU while the other was only getting 50%. A 4-to-1 time variance (anywhere from 0.5:1 to 1:0.5). That really highlighted the particular Fruit/Crafty difference...

I ran the test, not because I was using hyper-threading (we never enable it on our clusters although today it seems safe enough if you limit processes to physical cores), but because the IBRIX filesystem we were using was introducing some ugly (and unpredictable) lag in writing log-files while the games were in progress. When I discovered the lag issue, I wondered "is this affecting my results?" and the answer was "definitely" due to the way the two programs play at different time controls...

Can't say it is a problem today, but I also can't say it is not. I see no reason to intentionally introduce randomness that CAN affect things. You logically want to think that if one program gets a 10% bonus this game, and the other program gets a 10% bonus the next game, it is a "wash". But it isn't, if the two programs are sensitive to the time control as in the Crafty/Fruit case.
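Here is a toy simulation (Python) of the random time-odds scenario being discussed; the model is entirely made up for illustration (the Elo value of the speed bonus for each engine is an invented parameter and draws are ignored), but it shows why equal time-control sensitivities roughly wash out while unequal sensitivities, as in the Crafty/Fruit case, shift the measured result.

Code: Select all

import random

def expected_score(elo_diff):
    # standard logistic Elo expectation for the first player
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def simulate_match(games, base_diff, bonus_a, bonus_b):
    # base_diff: A's true Elo edge over B at the nominal time control
    # bonus_a / bonus_b: Elo each engine is assumed to gain from the random
    #   speed bonus (illustrative values only)
    score = 0.0
    for _ in range(games):
        if random.random() < 0.5:
            diff = base_diff + bonus_a   # A got the extra speed this game
        else:
            diff = base_diff - bonus_b   # B got it
        score += random.random() < expected_score(diff)
    return score / games

random.seed(1)
print(simulate_match(200000, 30, 15, 15))  # equal sensitivity: roughly washes out
print(simulate_match(200000, 30, 30,  5))  # unequal sensitivity: measured result shifts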

If you are interested, I could try a few time controls today between recent Crafty and Fruit, just to see if the difference is still there to some extent. I'd suspect so since I never tried to intentionally tune it away.
geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by geots »

Arguing with you about technical things such as this reminds me of people either treading water or pissing against the wind. While one gets you nowhere- they both get you all wet.



george
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

For confirmations and discussion of this bug, see here:

http://talkchess.com/forum/viewtopic.ph ... 714#493714
http://talkchess.com/forum/viewtopic.ph ... 728#493728
http://talkchess.com/forum/viewtopic.ph ... 742#493742
http://talkchess.com/forum/viewtopic.ph ... 819#493819

This is the behaviour under Windows 7 x64 WITH affinities set (without affinities a comparison would be useless anyhow). I believe this behaviour also shows up when using all threads (including hyperthreading), only much more rarely, and Houdini "recovers" much faster; that's why normally nobody (maybe Robert included) noticed it. It also seems to happen more often the more cores you have (e.g. it is more likely on a 6/12 machine than on a 4/8). But I must also say it is still open whether this is a scheduling problem of Windows or a Houdini race-condition bug. Calling it the latter was premature.
hgm wrote: I don't follow the logic of your point III. What you say there sounds very nonsensical. Software can never emulate extra computational power delivered by the hardware (in this case through HT).
It can if it is not the "extra" computational power which leads to better performance, but the shape of the tree searched. Think about it like this.

1) Assume that the 12%-20% gain through hyperthreading is not enough to offset the additional parallel search overhead.
2) If the HT version with 2x the number of threads then still outperforms the non-hyperthreaded version, the gain is most likely due to the different shape (and somewhat larger size) of the tree searched.
3) Nowadays, with all sorts of reductions/extensions/pruning going on and the additional splitting in the HT-on version, it could very well be that some reductions which would normally kick in no longer do in the additionally searched tree space. This would also explain the faster solution of tactical tests, as "key" moves are usually NOT at the top of the previous move ordering.
4) But if the hypothetical gain is due to the different shape of the tree, through mechanisms such as 3), then this behaviour could be replicated in software by adjusting the reductions/extensions/pruning, or even the alpha/beta pruning itself, or by introducing a probabilistic element (a toy sketch follows below).
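As a purely hypothetical illustration of point 4, the kind of software knob that widens the tree without any help from HT might look like the sketch below (Python); the reduction formula and the 5% figure are invented for the example and are not taken from any engine.

Code: Select all

import random

def lmr_reduction(depth, move_number, skip_probability=0.05):
    # late-move reduction with a small probabilistic opt-out: occasionally
    # search a late move at full depth, mimicking the "wider" tree that an
    # HT-on search might produce as a side effect
    if depth < 3 or move_number < 4:
        return 0                        # too shallow / too early to reduce
    if random.random() < skip_probability:
        return 0                        # randomly skip the reduction
    return 1 + (move_number > 10)       # reduce very late moves a bit more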

This could all easily be put to rest if some of the kind testers ran the same test with hyperthreading turned off in the BIOS (and all other variables like turbo-boost controlled), playing against an identical machine with hyperthreading on.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

Here's a sample I found in my archives. I ran Crafty version 20.6 in a normal test, but for this data, I extracted just the games vs Fruit and Toga2 (which are probably reasonably similar although Toga2 seems to be stronger as you can see):

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   1 Crafty-20.6-22  2656    5    5 12000   69%  2522   27%
   2 Crafty-20.6-33  2645    5    5 12000   67%  2522   25%
   3 Crafty-20.6-11  2639    5    5 12000   66%  2522   22%
   4 Crafty-20.6-44  2615    5    5 12000   64%  2522   34%
   5 Toga2           2566    3    3 24000   40%  2639   28%
   6 Fruit 2.1       2477    3    3 24000   28%  2639   26%
The four versions (-11, -22, etc) were identical except for the time controls used.

version -11 used a time control of 3 seconds + 0.01 second increment
version -22 used a time control of 10 seconds + 0.1 second increment
version -33 used a time control of 5 seconds + 0.05 second increment
version -44 used a time control of 60 seconds + 1 second increment

Notice that the difference between -11 and -33 is 6 Elo, which is possibly not significant given that the error bar for 12,000 games is +/- 5.

-22 to -33 is an 11 Elo difference, and that is significant even though one had just 2x the time of the other.

I could run more definitive tests, but this gives the general idea that against a specific opponent, one can do better at faster (or slower) games, and worse at slower (or faster) games. There's a 40 Elo gap between the top and bottom, even though the opponents are exactly the same. If you run on a hardware platform that randomly gives time odds by slowing things down (effectively reducing the speed of the CPU by sharing a physical core between two logical cores, or between two different processes), that hurts one engine more than the other, or helps one more than the other. And it makes measuring small Elo changes problematic, which is the heart of testing...
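For reference, a rough way to sanity-check error bars like the +/- 5 above is the simple trinomial approximation sketched here (Python); this is not necessarily the calculation the rating program used for the table performs, but it lands in the same ballpark.

Code: Select all

import math

def elo_error_bar(games, score, draw_ratio, z=1.96):
    # approximate 95% error bar (in Elo) for a match result, using a simple
    # win/draw/loss (trinomial) model of the per-game score variance
    p_win = score - 0.5 * draw_ratio
    var = p_win + 0.25 * draw_ratio - score ** 2   # per-game variance
    se = math.sqrt(var / games)                    # standard error of the score

    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)

    return (to_elo(score + z * se) - to_elo(score - z * se)) / 2.0

# first row of the table above: 12000 games, 69% score, 27% draws
print(elo_error_bar(12000, 0.69, 0.27))   # roughly +/- 5 to 6 Elo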
syzygy
Posts: 5777
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

IQ wrote:But also i must say it is still open whether this is a scheduling problem of windows or a houdini race condition bug. Calling it the latter was premature.
It has been indicated several times that not respecting affinities is not an engine problem but an OS problem.
IQ wrote:It can if it is not the "extra" computational power which leads to better performance, but the shape of the tree searched.
The only reason HT could be of benefit is because of the extra computational power: more nodes searched per second. The reason HT is not necessarily of benefit is that the higher nps is achieved by more threads, and increasing the number of threads lowers the efficiency of parallel search: a lot of the extra nodes searched are "wasted" and in particular don't translate to higher depth. I don't think there is any dispute on this.

So the question is if the decreased efficiency is offset by the increased nps. Now the shape of the tree came into the discussion because it is conceivable that at least some of the nodes that seem "wasted" because they do not translate to higher depth might still be of benefit to the quality of the search.

Whether or not the quality of a same-depth search is increased by having more threads, there is no doubt that without the nps increase HT can only be a loss.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

geots wrote:Arguing with you about technical things such as this reminds me of people either treading water or pissing against the wind. While one gets you nowhere- they both get you all wet.



george
I look at it a bit differently. If one is going to test reliably, it is absolutely essential to understand both the nature of the experimental set-up, and how that can affect the test. I quickly discovered that changing the time controls changed Elo in unexpected ways, because I have always tested at more than one time control when making changes to the search. It is too easy to find a search extension/reduction that helps (or hurts) depending on the time control. I want to make sure that I don't make my program a great blitz chess player at the expense of playing poorly at long time controls where search extensions have a better opportunity to explode the tree...

I have literally played billions of games over the past 10 years or so.

One unfinished experiment is dealing with the previously discussed easy move and such. To date, I have discovered the following:

(1) setting a target time and stopping when that time expires produces an Elo of X.

(2) setting a target time, but allowing the search to continue up to 6x that target to try to find a better move when the score drops, produces an Elo of X+80 (see the sketch after this list).

(3) the conservative approach of setting a target time and, once that time is reached, continuing to search until the iteration is completed produces an Elo of X-15.

(4) removing my old "easy move" code produces an Elo of X (zero change, a surprise).

(5) removing the code to always finish searching any "in progress" root moves after time runs out produces an Elo of X (also zero change).
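A minimal sketch (Python) of idea (2), just to make the mechanism concrete; the 30-centipawn margin and the exact shape of the check are invented for the example and are not Crafty's actual logic.

Code: Select all

def should_stop(elapsed, target_time, current_score, prev_iter_score,
                max_extension=6.0, drop_margin=30):
    # stop at the target time unless the score has dropped since the last
    # completed iteration, in which case keep searching, up to 6x the target
    if elapsed >= max_extension * target_time:
        return True                                  # hard limit reached
    if current_score < prev_iter_score - drop_margin:
        return False                                 # score dropped: keep looking
    return elapsed >= target_time                    # normal case: stop at target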

I have investigated both of these (now) and have concluded the following:

(a) the easy move idea works most of the time and saves some time, but it is not triggered that frequently, and often an easy move fails low a small amount as it goes deeper, which disables the early time-out. It doesn't save time very often, and it also will occasionally move too quickly when a deeper search would have helped. These seem to offset each other in terms of frequency, and therefore my previous idea didn't work.

(b) same thing for the "always complete in-progress root moves after time has expired." It burns CPU time when the in-progress moves are not going to replace the best move, wasting the compute cycles, and on occasion it does find a better move. But the key parts of this are handled by the "extend time on fail-low" anyway, so this also seems to be a good idea that actually does nothing.

Both of the above have been used in my program since the late 80's, and have also been used in many others. A "good idea" that doesn't do a thing except make the code slightly more complicated, and possibly introduce a timing bug here and there.

I doubt anyone has tested those ideas separately, but I tested each many times, slightly altering the fixed time allocation to see if it could be better to set a larger (or smaller) target time for each idea, independently. By the time I had finished, each of my 5 test opponents had played 180,000 games. A total of 900,000 games. All done in a week of sporadic testing. :)
IQ
Posts: 162
Joined: Thu Dec 17, 2009 10:46 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by IQ »

syzygy wrote:
IQ wrote:But also i must say it is still open whether this is a scheduling problem of windows or a houdini race condition bug. Calling it the latter was premature.
It has been indicated several times that not respecting affinities is not an engine problem but an OS problem.
Please reread the links given - the affinities are respected in all cases. The "behaviour" might be due to an OS problem, but it has nothing to do with affinities not being obeyed.
syzygy wrote:
It can if it is not the "extra" computational power which leads to better performance, but the shape of the tree searched.
The only reason HT could be of benefit is because of the extra computational power: more nodes searched per second. The reason HT is not necessarily of benefit is that the higher nps is achieved by more threads, and increasing the number of threads lowers the efficiency of parallel search: a lot of the extra nodes searched are "wasted" and in particular don't translate to higher depth. I don't think there is any dispute on this.

So the question is if the decreased efficiency is offset by the increased nps. Now the shape of the tree came into the discussion because it is conceivable that at least some of the nodes that seem "wasted" because they do not translate to higher depth might still be of benefit to the quality of the search.

Whether or not the quality of a same-depth search is increased by having more threads, there is no doubt that without the nps increase HT can only be a loss.

Again, please reread - especially assumption 1. My argument is hypothetical in that it only holds if 1) the HT-on improvement some claim holds up with proper testing (HT off in the BIOS vs. HT on), and 2) the slight 12% (or even 20%) nps increase does not offset the parallel search overhead (which in my case, going from 6 to 12 threads, is most likely true). Then and only then would the only possible explanation left for an increase in playing strength be the shape of the tree, and only then could the same be achieved in software. I hope this clarifies the assumptions of this argument. I personally only believe in assumption 2 but am very sceptical about assumption 1 - that the increase in strength would hold up in proper (BIOS HT off) testing. But maybe some kind tester will clear this up.