New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
There are compilers and there are compilers
Moderators: hgm, Rebel, chrisw
-
- Posts: 64
- Joined: Fri Oct 18, 2013 11:40 pm
- Location: New York
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: There are compilers and there are compilers
TSX supports a totally different parallel programming paradigm (from the lock-based and lock-free stuff we already have), so I don't think the compiler would be smart enough to use it.ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
With transactional memory you can do things like this -
Let's say you have 2 variables that always have to be incremented together.
Code: Select all
int a;
int b;
Code: Select all
mutex m;
int a;
int b;
void threadMain()
{
// in each thread
lock(m);
a++;
b++;
unlock(m);
}
Code: Select all
int a;
int b;
void threadMain()
{
// in each thread
__transaction_atomic
{
a++;
b++;
}
}
If, while you are in the block, a memory write (from another thread) happens to any of the memory locations read in the transactional block, all your writes up to that point are discarded, and the transactional block is retried. When a transactional block is aborted, the memory state is as if it never ran.
This saves manual management of locks, and prevents deadlocks. However, it does not save you from starvation, and a very unlucky transactional block can be made to keep on retrying. That's why most programs using transactional blocks have fallback lock-based synchronization. They would switch back to lock-based sync if the transactional block fails for 3 times for example.
The nice thing about TSX is that there is almost no cost at all if blocks are not retried.
In cases like the transposition table where collisions are very rare, TSX gives you close to the performance and code simplicity of using no synchronization at all, while still being safe. Of course, for the TT there are other lockless implementations. But transactional memory is more general, and makes the code simpler (and potentially faster, if the lockless implementation require extra instructions).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: There are compilers and there are compilers
TSX is disabled on all Haswell processors because there is an error in the implementation.ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
BTW this is a v3 chip:Joost Buijs wrote:TSX is disabled on all Haswell processors because there is an error in the implementation.ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: There are compilers and there are compilers
Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).bob wrote:I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
For compiled speed, NPS is ALL that matters. The parallel search works the same no matter what compiles the code. In this case, Intel compiler produces an executable that runs over 10% faster. There is no other measurement that matters when comparing two compiler executables. Given the same program, if the NPS is 10% slower the parallel search is going to also be 10% slower. They are directly proportional.stegemma wrote:Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).bob wrote:I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
For compiled speed, NPS is ALL that matters. The parallel search works the same no matter what compiles the code. In this case, Intel compiler produces an executable that runs over 10% faster. There is no other measurement that matters when comparing two compiler executables. Given the same program, if the NPS is 10% slower the parallel search is going to also be 10% slower. They are directly proportional.stegemma wrote:Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).bob wrote:I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
To show this, same position for both versions, searched to depth=35, 20 cores, 64 gb RAM (8gb hash entries):
time=1:04(94%) nodes=5413354307(5.4B) fh1=90% nps=83.9M
time=47.83(94%) nodes=4410508225(4.4B) fh1=91% nps=92.2M
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: There are compilers and there are compilers
The v3 doesn't have TSX either.bob wrote:BTW this is a v3 chipJoost Buijs wrote:TSX is disabled on all Haswell processors because there is an error in the implementation.ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: There are compilers and there are compilers
V3 is Haswell I believe. The one I have access to is V2, which is IvyBridge.Joost Buijs wrote:The v3 doesn't have TSX either.bob wrote:BTW this is a v3 chipJoost Buijs wrote:TSX is disabled on all Haswell processors because there is an error in the implementation.ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.