There are compilers and there are compilers

ymatioun · Post by **ymatioun** » Wed Jun 24, 2015 1:46 pm

New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.

matthewlai · Post by **matthewlai** » Wed Jun 24, 2015 2:18 pm

ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.

TSX supports a totally different parallel programming paradigm (from the lock-based and lock-free stuff we already have), so I don't think the compiler would be smart enough to use it.

With transactional memory you can do things like this -

Let's say you have 2 variables that always have to be incremented together.

Code: Select all

int a;
int b;

In normal locking code you would write something like this -

Code: Select all

mutex m;
int a;
int b;

void threadMain&#40;)
&#123;
     // in each thread
     lock&#40;m&#41;;
     a++;
     b++;
     unlock&#40;m&#41;;
&#125;

In transactional code you can write it like this:

Code: Select all

int a;
int b;

void threadMain&#40;)
&#123;
     // in each thread
     __transaction_atomic
     &#123;
          a++;
          b++;
     &#125;
&#125;

What it does is that when you enter a transactional block, the states of all memory locations to be written to in the block are checkpointed.

If, while you are in the block, a memory write (from another thread) happens to any of the memory locations read in the transactional block, all your writes up to that point are discarded, and the transactional block is retried. When a transactional block is aborted, the memory state is as if it never ran.

This saves manual management of locks, and prevents deadlocks. However, it does not save you from starvation, and a very unlucky transactional block can be made to keep on retrying. That's why most programs using transactional blocks have fallback lock-based synchronization. They would switch back to lock-based sync if the transactional block fails for 3 times for example.

The nice thing about TSX is that there is almost no cost at all if blocks are not retried.

In cases like the transposition table where collisions are very rare, TSX gives you close to the performance and code simplicity of using no synchronization at all, while still being safe. Of course, for the TT there are other lockless implementations. But transactional memory is more general, and makes the code simpler (and potentially faster, if the lockless implementation require extra instructions).

Joost Buijs · Post by **Joost Buijs** » Wed Jun 24, 2015 3:01 pm

ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.

TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.

bob · Post by **bob** » Wed Jun 24, 2015 4:36 pm

Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.

I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.

bob · Post by **bob** » Wed Jun 24, 2015 4:38 pm

Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.

BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

stegemma · Post by **stegemma** » Wed Jun 24, 2015 4:50 pm

bob wrote:
Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.

Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).

bob · Post by **bob** » Wed Jun 24, 2015 5:48 pm

stegemma wrote:
bob wrote:
Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).

For compiled speed, NPS is ALL that matters. The parallel search works the same no matter what compiles the code. In this case, Intel compiler produces an executable that runs over 10% faster. There is no other measurement that matters when comparing two compiler executables. Given the same program, if the NPS is 10% slower the parallel search is going to also be 10% slower. They are directly proportional.

bob · Post by **bob** » Wed Jun 24, 2015 5:54 pm

stegemma wrote:
bob wrote:
Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes.
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).

For compiled speed, NPS is ALL that matters. The parallel search works the same no matter what compiles the code. In this case, Intel compiler produces an executable that runs over 10% faster. There is no other measurement that matters when comparing two compiler executables. Given the same program, if the NPS is 10% slower the parallel search is going to also be 10% slower. They are directly proportional.

To show this, same position for both versions, searched to depth=35, 20 cores, 64 gb RAM (8gb hash entries):

time=1:04(94%) nodes=5413354307(5.4B) fh1=90% nps=83.9M
time=47.83(94%) nodes=4410508225(4.4B) fh1=91% nps=92.2M

Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.

Joost Buijs · Post by **Joost Buijs** » Wed Jun 24, 2015 6:02 pm

bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

The v3 doesn't have TSX either.

http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz

matthewlai · Post by **matthewlai** » Wed Jun 24, 2015 6:09 pm

Joost Buijs wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v3 doesn't have TSX either.

http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz

V3 is Haswell I believe. The one I have access to is V2, which is IvyBridge.

There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers

Re: There are compilers and there are compilers