There are compilers and there are compilers

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

ymatioun
Posts: 64
Joined: Fri Oct 18, 2013 11:40 pm
Location: New York

Re: There are compilers and there are compilers

Post by ymatioun »

New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: There are compilers and there are compilers

Post by matthewlai »

ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX supports a totally different parallel programming paradigm (from the lock-based and lock-free stuff we already have), so I don't think the compiler would be smart enough to use it.

With transactional memory you can do things like this -

Let's say you have 2 variables that always have to be incremented together.

Code: Select all

int a;
int b;
In normal locking code you would write something like this -

Code: Select all

mutex m;
int a;
int b;

void threadMain()
{
     // in each thread
     lock(m);
     a++;
     b++;
     unlock(m);
}
In transactional code you can write it like this:

Code: Select all

int a;
int b;

void threadMain()
{
     // in each thread
     __transaction_atomic
     {
          a++;
          b++;
     }
}
What it does is that when you enter a transactional block, the states of all memory locations to be written to in the block are checkpointed.

If, while you are in the block, a memory write (from another thread) happens to any of the memory locations read in the transactional block, all your writes up to that point are discarded, and the transactional block is retried. When a transactional block is aborted, the memory state is as if it never ran.

This saves manual management of locks, and prevents deadlocks. However, it does not save you from starvation, and a very unlucky transactional block can be made to keep on retrying. That's why most programs using transactional blocks have fallback lock-based synchronization. They would switch back to lock-based sync if the transactional block fails for 3 times for example.

The nice thing about TSX is that there is almost no cost at all if blocks are not retried.

In cases like the transposition table where collisions are very rare, TSX gives you close to the performance and code simplicity of using no synchronization at all, while still being safe. Of course, for the TT there are other lockless implementations. But transactional memory is more general, and makes the code simpler (and potentially faster, if the lockless implementation require extra instructions).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: There are compilers and there are compilers

Post by Joost Buijs »

ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes. :)
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: There are compilers and there are compilers

Post by stegemma »

bob wrote:
Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes. :)
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

stegemma wrote:
bob wrote:
Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes. :)
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).
For compiled speed, NPS is ALL that matters. The parallel search works the same no matter what compiles the code. In this case, Intel compiler produces an executable that runs over 10% faster. There is no other measurement that matters when comparing two compiler executables. Given the same program, if the NPS is 10% slower the parallel search is going to also be 10% slower. They are directly proportional.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

stegemma wrote:
bob wrote:
Aleks Peshkov wrote:Nodes count is different. I can say that GCC build is 20% smarter in 20 nodes. :)
Seriously, nodes and nps does not matter. Time to solution matters.
I didn't search to fixed depth, I searched to a fixed time limit. That would make the node counts different. If I search to fixed depth they match exactly as they should....

In the given case, time to solution is significantly faster on Intel compiler because it searches faster.
Theoretically if you want to compare two compilers you must compare at fixed depth, not at a fixed time limit. Nodes far from the root have statically less pieces than those near the root and they would be completed in less time. The faster compiled executable can reach those nodes more than the slower and this would increase its nps (IMHO).
For compiled speed, NPS is ALL that matters. The parallel search works the same no matter what compiles the code. In this case, Intel compiler produces an executable that runs over 10% faster. There is no other measurement that matters when comparing two compiler executables. Given the same program, if the NPS is 10% slower the parallel search is going to also be 10% slower. They are directly proportional.

To show this, same position for both versions, searched to depth=35, 20 cores, 64 gb RAM (8gb hash entries):

time=1:04(94%) nodes=5413354307(5.4B) fh1=90% nps=83.9M
time=47.83(94%) nodes=4410508225(4.4B) fh1=91% nps=92.2M

Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: There are compilers and there are compilers

Post by Joost Buijs »

bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v3 doesn't have TSX either.

http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: There are compilers and there are compilers

Post by matthewlai »

Joost Buijs wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v3 doesn't have TSX either.

http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz
V3 is Haswell I believe. The one I have access to is V2, which is IvyBridge.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.