Hyperthreading and Computer Chess: Intel i5-3210M

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:This plays right into the hands of a parallel search that by its very nature tends to do better when move ordering is sub-optimal.
Isn't it interesting that YBW is "known" to have no overhead compared to a sequential search with optimal move ordering?

I have a suspicion that with a good implementation of parallel search most search overhead is due to missed transpositions.
You would be wrong. We typically see 90% of fail highs on the first move searched, which means 10% of the tree is NOT optimally ordered. In Crafty, I measure the "stops" that occur, which are caused by failing high on an ALL node, something that should not happen. Yet it does.

Optimal move ordering does not, and never will exist, otherwise we would not need search in the first place. If you find a copy of the paper I wrote for the Journal of Parallel Computing, circa 1988, it very precisely analyzes this problem and predicts the tree size growth that it causes.
So which of the two is it:
- parallel search overhead is due to non-optimal search ordering, or
- parallel search by its very nature tends to do better when move ordering is sub-optimal.

Are you saying that both are true?
I'm sure some transpositions are missed, but this error analysis was done back in the days of 5-ply searches on the Kopec positions, where transpositions were anything but a major problem. And the overhead was the same then as today...
I'm not sure why people put so much faith in prehistoric research into parallel search algorithms from the days that 5 ply searches were the norm.
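(For concreteness: YBW, Young Brothers Wait, means the first move at a node is searched by a single thread, and only if it fails to produce a cutoff are the remaining "young brothers" offered to idle threads. A minimal C++ sketch of that rule, with hypothetical Position/MoveList/SplitPoint helpers, not code from any real engine:)

Code:
// YBW rule, sketched: the eldest brother is searched sequentially;
// splitting is legal only after it returns without a beta cutoff.
int ybw_search(Position &pos, int alpha, int beta, int depth) {
    if (depth == 0) return evaluate(pos);
    MoveList moves = generate_moves(pos);
    int best = -ybw_search(pos.make(moves[0]), -beta, -alpha, depth - 1);
    if (best >= beta) return best;            // cutoff: no split, no overhead
    if (best > alpha) alpha = best;
    SplitPoint sp(pos, alpha, beta, depth, moves, /*next=*/1);
    offer_to_idle_threads(sp);                // helpers grab moves from sp
    for (Move m; sp.next_move(m); )           // the master thread helps too
        sp.update(-ybw_search(pos.make(m), -beta, -sp.alpha(), depth - 1));
    sp.wait_for_helpers();                    // join before returning
    return sp.best();
}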
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:I think you are hoping for "good luck". In reality, what is happening is that the parallel search is simply searching nodes that are completely unnecessary to produce the right score.
Well, you don't have to read what I write, but I'm not sure why you still bother to answer?
I responded precisely to what you wrote...

syzygy wrote:
It is also not strange to believe that the tree being larger, all else being equal, contributes positively to playing strength. With pure alpha-beta it would contribute nothing, but the top engines all use a highly selective search.

A more selective search is usually good because the added depth outweighs the errors introduced by more selectivity. But if you can search a somewhat larger tree in the same time and reaching the same depth, the impact of these errors is reduced and play may be expected to be stronger.
It is most definitely "strange" to believe that a larger tree is better. Turn off alpha/beta and it will get a LOT larger. Zero better.
So you simply don't read what I wrote. I gave an explanation and you ignore it and instead come up with a comparison with minimax.
Second. If you believe that using extra CPUs to search broader rather than deeper is better, why not search broader rather than deeper with a single CPU?
Obviously if you can reach the same depth in the same time with a broader search (i.e. less pruning), your search is better. I'm not sure why you take issue with this.
As I said, this is a pretty ridiculous way of looking at a tree search issue, parallel or not.
The problem is your refusal to read.
Why would I take issue with it?

1. It is anything but true. Looking at irrelevant parts of the tree does not improve accuracy whatsoever, just wastes time.

2. The parallel search does NOT intentionally try to widen the search and make it more accurate. In fact, quite the opposite. It tries to do EXACTLY no more than what the sequential search does, it just tries to do it faster.

This is about speed, NOT additional accuracy. There is absolutely ZERO empirical data suggesting that a parallel search is more accurate than a sequential search, if you disregard the additional depth the parallel search gains. That is the only measurable benefit.

This nonsense about more accuracy is just that. Please provide some actual data, or some theory, or anything that would support this. Parallel search is about more depth, nothing else. Just because you OCCASIONALLY might find something at a depth where the sequential search misses it does NOT mean more accuracy. Just "more accuracy in that particular position" while in other similar positions you get LESS accuracy due to extra nodes searched.

Let's move on from the ridiculous and get back to technically accurate discussions, eh?
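(For reference, the "selectivity" the two posts above are arguing over is machinery like late move reductions. A hypothetical move-loop fragment, with the move-count and depth conditions invented for illustration, showing where a reduced search can err and how a fail-high re-search catches part of it:)

Code:
// Late quiet moves get a shallower search: depth is bought at the price of
// occasionally missing a refutation that full depth would have found.
for (int i = 0; i < moves.size(); ++i) {
    Position child = pos.make(moves[i]);
    int r = (i >= 4 && depth >= 3 && moves[i].is_quiet()) ? 1 : 0;
    int score = -search(child, -beta, -alpha, depth - 1 - r);
    if (r && score > alpha)                    // reduced search looked wrong:
        score = -search(child, -beta, -alpha, depth - 1);  // verify full depth
    if (score > alpha) alpha = score;
    if (alpha >= beta) break;                  // cutoff
}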
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

geots wrote:
bob wrote:
geots wrote:
bob wrote:
syzygy wrote:
bnemias wrote:This subject comes up every so often, and it's hard to believe people still think searching a larger tree with a small increase in NPS is beneficial. I'm not completely convinced there's a relationship between time to depth and playing strength either. But lacking any data of my own, I tend to believe Bob.
I don't know why you doubt that there is a relationship between time to depth and playing strength. For sure these are strongly correlated. The same depth in less time, all else being equal, definitely results in stronger play.

It is also not strange to believe that the tree being larger, all else being equal, contributes positively to playing strength. With pure alpha-beta it would contribute nothing, but the top engines all use a highly selective search.

A more selective search is usually good because the added depth outweighs the errors introduced by more selectivity. But if you can search a somewhat larger tree in the same time and reaching the same depth, the impact of these errors is reduced and play may be expected to be stronger.

So even if time-to-depth increases with HT on, it just might be the case that overall this is outweighed by the positive impact of the larger tree.

Bob's HT tests were probably limited to Crafty. I also believe they were done on a system that was not overclocked. The higher the overclock, the better the performance of HT. This is because memory latency has a bigger impact at higher clock frequencies, and HT shines at hiding these latencies.

From the tests I have done myself I could not conclude that my engine benefits from HT, but I cannot say anything for engines that I have not tested.
I think you are hoping for "good luck". In reality, what is happening is that the parallel search is simply searching nodes that are completely unnecessary to produce the right score.

Sometimes a parallel search will find the answer much more quickly than expected, but this is generally a result of poor move ordering where the parallel search looks at the supposedly bad move sooner than the sequential search does, and quickly establishes a better bound that makes things go faster. But that is uncommon, not expected to happen regularly.

I'm likely the only person on the planet to have actually run 30K game matches with 1 cpu, then with 2, then with 4, and finally with 8. And I have found NO circumstance where time to depth suggests one thing, and actual games suggest another. That is, the speed of the parallel search is the thing that gains Elo, not some bizarre tree shape that happens regularly. Such is just "statistical noise" if the test is large enough...

The only exceptions I have seen are those where so few games are played, the statistical variance makes the results statistically insignificant.

The only exception to "hyperthreading is not good for chess" would be a poorly implemented program which gets an unusual boost from HT that a well-designed implementation would not get. Such tricks (HT) tend to help poorly written code more than code that has been optimized to access memory efficiently and to reduce, as much as possible, unnecessary data dependencies or unnecessary computation that stalls/clogs pipelines...

Dr. Bob, I have a question. I am running two Intel i5 4-core systems. Neither has hyperthreading. I have on the way a new Intel i7 6-core system, on which HT can be disabled, which I plan to do. But in theory, if I did not disable HT on the i7, would that create a variable I don't want in testing: two systems with no HT and one system with HT, with the results all being thrown into the same batch?

Best,

george
My advice is as before. If you have an x86 with 6 cores, and you never use more than 6 cores at once (whether you play 6 games with ponder=off, 3 games with ponder=on, or one game with an engine using 6 cores ponder=off), the results should be reliable.

If you go further and use more than 6 cores, some of those physical cores get split into what is effectively 2 pieces, each 1/2 as fast as the original. Some programs will be running on a CPU 1/2 as fast as normal, some won't, and that can certainly add noise.

Further, since all of these things use turbo-boost, it gets a little trickier yet, since using just one core engages the overclocking, but when you use several, overclocking turns off completely or almost completely. Again, that introduces CPU speed variability. I turn this off on whatever boxes I can, Apple being the exception where there is no BIOS to fiddle with.
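(One way to enforce "never more than 6" on Linux: pin each engine thread to its own physical core so the scheduler cannot double two of them onto one HT pair. Sketch below; it assumes, and you should verify with lscpu, that logical CPUs 0-5 are the six distinct physical cores, with the HT siblings numbered 6-11 as on many Intel parts:)

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin a thread to one logical CPU; pass a distinct physical core (0..5)
// for each engine thread so no two share a hyperthreaded pair.
void pin_to_core(pthread_t t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}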

Thank you Dr. I missed one thing. Did you say you turned the turbo-boost or the overclocking OFF when you could?

Thanks again,

george


PS: One last thing. If I were running 6 single-core matches, my real cores are all used up. There are 6 interfaces that in total will require a bit of a core, though maybe not much. So if a virtual core comes into play here, how do I know that the power the 6 GUIs need is not coming from a real core, with one engine getting back from a virtual core what was taken away from it to feed the GUIs? (Assuming what I have said makes any sense at all)
You almost have to turn it off when you do the kinds of things I do here. I want to measure parallel speedup, which means comparing the 8-cpu time to the 1-cpu time. With turbo-boost on, that is impossible, since the 1-cpu run uses a faster CPU (thanks to TB) compared to what you see when you use all 8...

Unfortunately, on my Mac, I can't do that. And for max performance, it is likely a bad idea anyway, since theoretically one can turbo-boost even with both cores running. But it does screw up performance measurements, for sure.
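(The measurement itself is simple once turbo-boost is out of the picture: same position, same depth, wall clock both ways. A sketch, where search_to_depth() stands in for whatever fixed-depth entry point the engine exposes:)

Code:
#include <chrono>
#include <cstdio>

void search_to_depth(int depth, int threads);   // hypothetical engine call

double wall_seconds(int threads) {
    auto t0 = std::chrono::steady_clock::now();
    search_to_depth(24, threads);               // same position, same depth
    return std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    double t1 = wall_seconds(1), t8 = wall_seconds(8);
    std::printf("speedup = %.2f (ideal 8.00)\n", t1 / t8);
}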
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:This plays right into the hands of a parallel search that by its very nature tends to do better when move ordering is sub-optimal.
Isn't it interesting that YBW is "known" to have no overhead compared to a sequential search with optimal move ordering?

I have a suspicion that with a good implementation of parallel search most search overhead is due to missed transpositions.
You would be wrong. We typically see 90% of fail highs on the first move searched, which means 10% of the tree is NOT optimally ordered. In Crafty, I measure the "stops" that occur, which are caused by failing high on an ALL node, something that should not happen. Yet it does.

Optimal move ordering does not, and never will exist, otherwise we would not need search in the first place. If you find a copy of the paper I wrote for the Journal of Parallel Computing, circa 1988, it very precisely analyzes this problem and predicts the tree size growth that it causes.
So which of the two is it:
- parallel search overhead is due to non-optimal search ordering, or
- parallel search by its very nature tends to do better when move ordering is sub-optimal.

Are you saying that both are true?
I'm sure some transpositions are missed, but this error analysis was done back in the days of 5-ply searches on the Kopec positions, where transpositions were anything but a major problem. And the overhead was the same then as today...
I'm not sure why people put so much faith in prehistoric research into parallel search algorithms from the days that 5 ply searches were the norm.
Let's define "better". My Ph.D. dissertation directly addressed this issue. I showed that for a worst-first move ordering, the parallel speedup is nearly perfectly linear. Of course, the overall search is MUCH slower since this breaks alpha/beta.

It is not a "which is it?" issue. It is a "which is rational". I want the best move ordering possible. That makes the most efficient search. Worse ordering can occasionally let a parallel search find something significantly faster than it should, but more often, it just wrecks the tree and slows things down.

I don't care about that one in a hundred occurrence. I care about the typical performance, which is exactly as I stated. Give me the best possible move ordering, and that produces the most efficient parallel search one can do. Breaking ordering does NOT help anything, except for the infrequent case where one encounters the "super-linear speedup". But that is rare. If it were common, I'd simply write a sequential algorithm that multiplexes between branches to search in a similar fashion as the parallel search. Nobody would do that for obvious reasons, it would play weaker because MOST of the time alpha/beta works as it should and ordering is critical...

You look for argument when none is present. This is not that hard to grasp.

As far as your last statement, I am not sure why people put so much emphasis on pure guesswork rather than actually MEASURING what is going on...
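(The numbers behind "ordering is critical": Knuth and Moore showed that with perfect ordering, alpha-beta on a uniform tree of branching factor b and depth d visits only about b^ceil(d/2) + b^floor(d/2) - 1 leaves, against b^d for worst-first ordering. The worst-first tree has no cutoffs at all, which is exactly why it splits perfectly across processors; the minimal tree leaves almost no independent work to hand out. A quick look at the gap:)

Code:
#include <cmath>
#include <cstdio>

int main() {
    const double b = 38.0;                  // rough chess branching factor
    for (int d = 2; d <= 12; d += 2) {
        double minimal = std::pow(b, (d + 1) / 2) + std::pow(b, d / 2) - 1;
        double full    = std::pow(b, d);    // worst-first must visit it all
        std::printf("d=%2d  minimal=%10.3g  full=%10.3g  ratio=%.3g\n",
                    d, minimal, full, full / minimal);
    }
    return 0;
}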
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:1. It is anything but true. Looking at irrelevant parts of the tree does not improve accuracy whatsoever, just wastes time.
Many reductions are triggered depending on the value of alpha. If alpha is sub-optimal some reductions will not be triggered when otherwise they would be.

With pure alpha/beta every additional node is wasted. But no engine implements pure alpha/beta.

In multi-PV mode engines will often find a better "best move" at the same depth. Why would that be? By your reasoning the extra nodes spent on the other variations are spent on irrelevant parts of the tree (as far as single-PV is concerned) and therefore cannot improve accuracy whatsoever.
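(Concretely, tests of this shape are common in modern engines; the margin here is invented for illustration, not taken from any particular program:)

Code:
// Static pruning measured against alpha: a better-informed (higher) alpha
// prunes more, a sub-optimal one prunes less, so the tree that actually
// gets searched depends on the bounds and not just on the depth.
int search(Position &pos, int alpha, int beta, int depth) {
    int eval = evaluate(pos);
    if (depth <= 2 && eval + 150 * depth <= alpha)  // futility-style test
        return eval;             // judged unable to reach alpha; dropped
    // otherwise fall through to the normal move loop (omitted)
    return move_loop(pos, alpha, beta, depth);      // hypothetical remainder
}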
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by syzygy »

bob wrote:
syzygy wrote:So which of the two is it:
- parallel search overhead is due to non-optimal search ordering, or
- parallel search by its very nature tends to do better when move ordering is sub-optimal.

Are you saying that both are true?
Let's define "better". My Ph.D. dissertation directly addressed this issue. I showed that for a worst-first move ordering, the parallel speedup is nearly perfectly linear. Of course, the overall search is MUCH slower since this breaks alpha/beta.
With perfect move ordering the speedup is just as "perfectly linear". (In both cases you have to ignore overhead due to missed transpositions, though.)
It is not a "which is it?" issue. It is a "which is rational". I want the best move ordering possible. That makes the most efficient search. Worse ordering can occasionally let a parallel search find something significantly faster than it should, but more often, it just wrecks the tree and slows things down.
Why are you not answering the question?

Everybody wants best move ordering possible, but we are not discussing move ordering techniques now. Move ordering and its imperfections are a given. The question is how it impacts parallel search efficiency. We have two statements, the first is generally accepted and the second is yours:
- parallel search overhead is due to non-optimal search ordering, or
- parallel search by its very nature tends to do better when move ordering is sub-optimal.

Which of the two is true?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:1. It is anything but true. Looking at irrelevant parts of the tree does not improve accuracy whatsoever, just wastes time.
Many reductions are triggered depending on the value of alpha. If alpha is sub-optimal some reductions will not be triggered when otherwise they would be.
You do realize that most engines use PVS? That means all nodes but a very few are searched with (alpha, alpha+1) as the window.
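(For anyone following along, the PVS shape in question as a self-contained toy: a synthetic uniform tree with deterministic leaf values, so re-searches see consistent scores. Only the first move at each node gets the full window; everything else gets (alpha, alpha+1) and is re-searched only on a fail high. Toy code, not engine code:)

Code:
#include <algorithm>
#include <cstdio>

// Deterministic fake leaf score keyed by node id (range -100..100).
int leaf_eval(unsigned id) { return (int)(id * 2654435761u % 201) - 100; }

int pvs(unsigned id, int depth, int alpha, int beta) {
    if (depth == 0) return leaf_eval(id);
    for (unsigned i = 0; i < 4; ++i) {           // 4 moves per node
        unsigned child = id * 4 + i + 1;
        int score;
        if (i == 0) {                            // PV move: full window
            score = -pvs(child, depth - 1, -beta, -alpha);
        } else {                                 // others: null window
            score = -pvs(child, depth - 1, -alpha - 1, -alpha);
            if (score > alpha && score < beta)   // fail high: re-search
                score = -pvs(child, depth - 1, -beta, -alpha);
        }
        alpha = std::max(alpha, score);
        if (alpha >= beta) break;                // cutoff
    }
    return alpha;
}

int main() { std::printf("pvs(depth 8) = %d\n", pvs(0, 8, -1000, 1000)); }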


With pure alpha/beta every additional node is wasted. But no engine implements pure alpha/beta.

In multi-PV mode engines will often find a better "best move" at the same depth. Why would that be? By your reasoning the extra nodes spent on the other variations are spent on irrelevant parts of the tree (as far as single-PV is concerned) and therefore cannot improve accuracy whatsoever.
If they find a better best move than the sequential search finds, then there is a problem. There is ALWAYS the well-known TT grafting problem, which explains how one can solve Fine #70 with a search of less than 26 plies. But you can't depend on that to happen. And intentionally forcing it to happen by using sub-optimal move ordering does not work.

Why don't you try what I have tried? Namely, modify a search to write out the hash signature for EVERY position it searches, the depth remaining when it was searched, a timestamp, and process/thread id. Now you can directly investigate your claim about transpositions being the main cause, because you can see when the same position was searched on two different threads, when it happened, and whether a sequential search would have had that same missed transposition or not. It is not THAT big a problem. I pointed this out the last time this discussion came up. The problem is, always has been, and always will be, search overhead caused by searching parts of the tree that the sequential search skips, due to poor move ordering which produces bad split points...
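(That instrumentation is easy to bolt on; here is a sketch of the record and the per-node write. The field layout is a guess at what suffices, not Crafty's actual format, and one log file per thread avoids write races:)

Code:
#include <cstdint>
#include <cstdio>

struct NodeRecord {
    uint64_t key;        // Zobrist signature of the position
    int32_t  depth;      // remaining depth when it was searched
    int32_t  thread_id;
    uint64_t timestamp;  // e.g. a global node counter sampled at entry
};

// Call once per node entered; f is this thread's private log file.
void log_node(FILE *f, uint64_t key, int depth, int tid, uint64_t now) {
    NodeRecord r{key, depth, tid, now};
    fwrite(&r, sizeof r, 1, f);
}
// Offline: merge the files, sort by key, and flag keys that were searched
// on two different threads -- those are candidate missed transpositions.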

Anyone that has worked methodically on developing their own parallel search has beaten this subject to death. If they don't understand the problem, they can never work toward improvements. One doesn't have to guess as to what a problem might be, one can investigate to find out what it really is.

I'll try to find my old data from when I did this and post a sample...

In Cray Blitz, I actually displayed this "overhead" by simply counting the nodes a thread searched after a fail-high caused some threads to abort what they were doing. It takes some effort since the only overhead is that for nodes searched that follow the fail-high move in the normal move ordering. You will discover that this is the major part of the overhead.
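(Roughly the bookkeeping that measurement needs, sketched against assumed split-point machinery; only nodes spent on moves ordered after the fail-high move count as overhead:)

Code:
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> overhead_nodes{0};

// Called when a split point fails high: helpers are told to stop, and the
// nodes they spent on moves AFTER the fail-high move are tallied as waste.
void on_split_fail_high(SplitPoint &sp, int fail_high_index) {
    sp.abort_helpers();
    for (Helper &h : sp.helpers())
        if (h.move_index() > fail_high_index)
            overhead_nodes += h.nodes_searched();
}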
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:So which of the two is it:
- parallel search overhead is due to non-optimal search ordering, or
- parallel search by its very nature tends to do better when move ordering is sub-optimal.

Are you saying that both are true?
Let's define "better". My Ph.D. dissertation directly addressed this issue. I showed that for a worst-first move ordering, the parallel speedup is nearly perfectly linear. Of course, the overall search is MUCH slower since this breaks alpha/beta.
With perfect move ordering the speedup is just as "perfectly linear". (In both cases you have to ignore overhead due to missed transpositions, though.)
I would certainly ignore that which I did not investigate carefully. But I have spent literally years looking at parallel search issues. Distributed search issues. Shared/non-shared TT issues. Etc. In the case of parallel search, missed transpositions are a small part of the overhead. If I could figure out a way to eliminate the "stopped threads" (fail high at a split point) I would be within reach of a linear parallel speedup. But without perfect move ordering, perfect split point selection is impossible.

And the overhead piles up...

It is not a "which is it?" issue. It is a "which is rational". I want the best move ordering possible. That makes the most efficient search. Worse ordering can occasionally let a parallel search find something significantly faster than it should, but more often, it just wrecks the tree and slows things down.
Why are you not answering the question?

Everybody wants best move ordering possible, but we are not discussing move ordering techniques now. Move ordering and its imperfections are a given. The question is how it impacts parallel search efficiency. We have two statements, the first is generally accepted and the second is yours:
- parallel search overhead is due to non-optimal search ordering, or
- parallel search by its very nature tends to do better when move ordering is sub-optimal.

Which of the two is true?
The latter is NOT true. Perfect ordering produces perfect speedup. See my dissertation. Worst-first ordering produces perfect speedup. Also see my dissertation. Anything in between produces sub-linear speedup. And is therefore worse.

You want to re-cast what I wrote to turn it into your "typically" where I did not say that at all. I said that on rare occasions, a parallel search will do BETTER on a search than expected, because you get to the good move quicker. But that does NOT mean "this happens typically". It is rare. Yet that seems to be what you want to hang the argument on. MOST of the time, the ONLY gain from parallel search is a deeper tree which would provide a more accurate result. On RARE occasions, the odd move ordering issues might expose something faster (super-linear speedup).

The TT grafting problem is well-known and independent of parallel vs sequential search, it is an issue in both.

Now how about addressing the technical issues rather than this "might" or "could" nonsense. Offer some data.
User avatar
geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by geots »

bob wrote:
geots wrote:Thank you Dr. I missed one thing. Did you say you turned the turbo-boost or the overclocking OFF when you could?
You almost have to turn it off when you do the kinds of things I do here. I want to measure parallel speedup, which means comparing the 8-cpu time to the 1-cpu time. With turbo-boost on, that is impossible, since the 1-cpu run uses a faster CPU (thanks to TB) compared to what you see when you use all 8...

Unfortunately, on my Mac, I can't do that. And for max performance, it is likely a bad idea anyway, since theoretically one can turbo-boost even with both cores running. But it does screw up performance measurements, for sure.

Thanks for taking the time to help me here. We have a number of extremely intelligent people around here, but for me the last word lies with you.

All the best,

george
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hyperthreading and Computer Chess: Intel i5-3210M

Post by bob »

geots wrote:
bob wrote:
Modern Times wrote:
bob wrote: Bottom line is to NOT use hyperthreading when playing chess. You don't have to disable it, as new operating-system process schedulers understand the issues and will make sure each thread runs on a physical core, unless you run more threads than physical cores. At that point, you start to hurt performance.
Which new operating systems are you referring to ? Do you include Windows 7 in that ?
Yes. Windows 7/8, recent Linux kernels (no more than 3 years old), Mac OS X, etc.

I thank everyone who posted here with their thoughts. Here is what I come away with: First, I always end up wishing I knew just 1/1,000,000 of what Bob does. Second, I take away from this that HT could possibly, in some situations, help a little bit, but you have to be careful, because if mishandled it could hurt as well.

But what no one has thought about, or at least no one has mentioned, is another important issue, I would think. When, for example, you are testing for CCRL, we all benchmarked our computers using the "crafty benchmark" to try to get the hardware as close to the same as possible. There is no perfection there; you just have to do all you can. I have no idea if this is applicable, but right now I am running two Intel i5 4-core systems and neither has HT. When you are always trying to get rid of as many hardware variables as possible in testing, it looks like it could be a bad idea to run 2 machines with no HT and then turn around and use another system WITH HT. And then all the results go in the same batch. I would think that ideally, you should run HT on all 3 (which I cannot), or run all 3 with NO HT.

Maybe I am wrong, but this is something I considered. I would be very interested to read Dr. Hyatt's response to this thought.

And best to all of you,

george
Equal hardware is not easy. HT changes things. As does turbo-boost. As does different memory sizes/speeds. As does different cache sizes. About the only way I get sanity here is that each of our clusters is made of identical nodes, so as long as I test on a single cluster, I can eliminate that single point of worry.
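(On Linux there is a quick way to see how a given box presents its cores: each logical CPU reports its HT siblings, and two entries per line means hyperthreading is enabled. Runnable as-is:)

Code:
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    for (int cpu = 0; ; ++cpu) {
        std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu)
                        + "/topology/thread_siblings_list");
        if (!f) break;                     // no more logical CPUs
        std::string siblings;
        std::getline(f, siblings);         // e.g. "0,4" when HT is on
        std::printf("cpu%d siblings: %s\n", cpu, siblings.c_str());
    }
    return 0;
}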