multithreading questions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

hgm wrote:
Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.
I also never have done any multi-core programming, but what would be wrong in simply using the inc/dec instruction with a lock prefix (or using xchg, which is always locked)? Being only a single instruction, that cannot be disastrously slow, can it?
When you add the lock prefix to an instruction it becomes disastrously slow. :) You can test this out pretty quickly in a sample program.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

hgm wrote:
Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.
I also never have done any multi-core programming, but what would be wrong in simply using the inc/dec instruction with a lock prefix (or using xchg, which is always locked)? Being only a single instruction, that cannot be disastrously slow, can it?
The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data with mutual exclusion. Entering a critical section is something like this...

Code: Select all

EnterCriticalSection:
    mov   eax, 1
    xchg  [sema], eax
    test  eax, eax
    jnz   EnterCriticalSection

    ; mutual exclusive access
    inc  [someSharedAddress]

LeaveCriticalSection:
    xor   eax, eax
    xchg  [sema], eax
... polling a semaphore until it becomes false - it might become slow if other threads request the shared resource at the same time.
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Gerd Isenberg wrote:
hgm wrote:
Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.
I also never have done any multi-core programming, but what would be wrong in simply using the inc/dec instruction with a lock prefix (or using xchg, which is always locked)? Being only a single instruction, that cannot be disastrously slow, can it?
The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data with mutual exclusion. Entering a critical section is something like this...

Code: Select all

EnterCriticalSection:
    mov   eax, 1
    xchg  [sema], eax
    test  eax, eax
    jnz   EnterCriticalSection

    ; mutual exclusive access
    inc  [someSharedAddress]

LeaveCriticalSection:
    xor   eax, eax
    xchg  [sema], eax
... polling a semaphore until it becomes false - it might become slow if other threads request the shared resource at the same time.
With high contention, the above code would be faster with a test before the xchg.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

Pradu wrote:

Code: Select all

EnterCriticalSection:
    mov   eax, 1
    xchg  [sema], eax
    test  eax, eax
    jnz   EnterCriticalSection

    ; mutual exclusive access
    inc  [someSharedAddress]

LeaveCriticalSection:
    xor   eax, eax
    xchg  [sema], eax
With high contention, the above code would be faster with a test before the xchg.
sorry, I don't get that. Faster but wrong imho...
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Gerd Isenberg wrote:
Pradu wrote:

Code: Select all

EnterCriticalSection:
    mov   eax, 1
    xchg  [sema], eax
    test  eax, eax
    jnz   EnterCriticalSection

    ; mutual exclusive access
    inc  [someSharedAddress]

LeaveCriticalSection:
    xor   eax, eax
    xchg  [sema], eax
With high contention, the above code would be faster with a test before the xchg.
sorry, I don't get that. Faster but wrong imho...
You do a test on the variable that you are using for a lock before the exchange, then you test the returned result after the exchange. An example of this is posted here: http://www.intel.com/cd/ids/developer/a ... /20464.htm

Code: Select all

if (GETLOCK(lock) != 0)
{
  while (VARIOUS_CRITERIA)
  {
    _asm pause;  // pause instruction in spin-loop
    if (*(volatile int*)lock == 0) // spin on dirty read, not on lock
    {
      if (GETLOCK(lock) == 0)
      {
        goto PROTECTED_CODE;
      }
    }
  }
  BACK_OFF_LOCK(lock); // back off lock or do other work
}
PROTECTED_CODE:
Here the variable is always tested to see if it is unlocked before an exchange is attempted.
One common mistake made by developers writing their own spin-wait loops is spinning on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of repeatedly attempting to acquire the lock consumes less time and fewer resources, because the application attempts to acquire the lock only when it is free.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

Pradu wrote:
Gerd Isenberg wrote:
Pradu wrote:

Code: Select all

EnterCriticalSection:
    mov   eax, 1
    xchg  [sema], eax
    test  eax, eax
    jnz   EnterCriticalSection

    ; mutual exclusive access
    inc  [someSharedAddress]

LeaveCriticalSection:
    xor   eax, eax
    xchg  [sema], eax
With high contention, the above code would be faster with a test before the xchg.
sorry, I don't get that. Faster but wrong imho...
You do a test on the variable that you are using for a lock before the exchange, then you test the returned result after the exchange. An example of this is posted here: http://www.intel.com/cd/ids/developer/a ... /20464.htm

Code: Select all

if (GETLOCK(lock) != 0)
{
  while (VARIOUS_CRITERIA)
  {
    _asm pause;  // pause instruction in spin-loop
    if (*(volatile int*)lock == 0) // spin on dirty read, not on lock
    {
      if (GETLOCK(lock) == 0)
      {
        goto PROTECTED_CODE;
      }
    }
  }
  BACK_OFF_LOCK(lock); // back off lock or do other work
}
PROTECTED_CODE:
Here the variable is always tested to see if it is unlocked before an exchange is attempted.
One common mistake made by developers writing their own spin-wait loops is spinning on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of repeatedly attempting to acquire the lock consumes less time and fewer resources, because the application attempts to acquire the lock only when it is free.
Seems to me that is related to hyperthreading - if no more than two threads really ask at the same time...
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: multithreading questions

Post by hgm »

Gerd Isenberg wrote:The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data with mutual exclusion. Entering a critical section is something like this...
OK, of course. The purpose of the semaphore is to make the thread requesting access wait until the other one is done. There is no such thing as 'faster waiting'. :lol: So I guess that the speed of semaphore implementations is measured by how much time it takes to enter the critical section when it is not in use.

If 'busy waiting' is too expensive, the CPU should look for something useful to do in the meantime. But I have no doubt that task switches are made incredibly expensive by the OS.

Are there SMP engines that implement their own multitasking in a single thread? For non-recursive search implementations (implementing their own stack, indexed by a depth variable) this would not be so hard to do. Just give each virtual thread its own set of 'globals' containing their stack, accessed through a pointer, and all you would have to do is remember where the current task should resume, load the pointer to the globals for the new task, and jump to where it left off. With cooperative multitasking this could work very fast, just put a switch point directly after the critical section (so that a thread restarting there for its next time slice always has something to do), and an optional switch point just before the critical section, where only threads that are denied access pass the CPU to the next virtual thread that is not stalled there. As all virtual threads would run the same code, no flushing of the I-cache would ever be needed, and if you are lucky the working sets used by the various threads would not collide in the D-cache either. So it could be very fast. Of course, dividing the power of a single CPU over multiple virtual threads would make you suffer a scaling penalty.
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Gerd Isenberg wrote:Seems to me that is related to hyperthreading - if no more than two threads really ask at the same time...
OK, you might be correct. I haven't actually tested it and just believed the website I linked to. I hacked up a quick test, but I don't have a multi-processor computer to test it on:
http://www.prism.gatech.edu/~gtg365v/Bu ... cktest.zip

Basically this is what it is:

Code: Select all

SpinLock lock = {0};
unsigned int count=0;
Thread threads[NUM_THREADS];

THREAD_PROC_RET inc(void* args)
{
	int i,j;
	for (i = 0; i < INC_I_END; i++)
	    for (j = 0; j < INC_J_END; j++)
	    {
			Lock(lock);
			count++;
			Release(lock);
		}
	return 0;
}

int main()
{
	int i;
	unsigned int t0 = getms();
	for (i = 0; i < NUM_THREADS; i++)
        threads[i] = BeginThread(inc, NULL);
	for (i = 0; i < NUM_THREADS; i++)
        Join(threads[i]);
	printf("Time = %u\n", getms() - t0);
}
We can test whether spinning on the "dirty read" or on the exchange is faster by changing these lines in thread.h:

Code: Select all

static INLINE int volatile TryLock(SpinLock_P s) {return !(IsLocked(s) || LockedExchange(s,1));}
static INLINE void volatile Lock(SpinLock_P s) {while(IsLocked(s) || LockedExchange(s,1));}

To change to spinning on Mutex:
static INLINE int volatile TryLock(SpinLock_P s) {return !(LockedExchange(s,1));}
static INLINE void volatile Lock(SpinLock_P s) {while(LockedExchange(s,1));}
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

Pradu wrote:
Gerd Isenberg wrote:Seems to me that is related to hyperthreading - if no more than two threads really ask at the same time...
OK, you might be correct. I haven't actually tested it and just believed the website I linked to. I hacked up a quick test, but I don't have a multi-processor computer to test it on:
http://www.prism.gatech.edu/~gtg365v/Bu ... cktest.zip

Basically this is what it is:

Code: Select all

SpinLock lock = {0};
unsigned int count = 0;
Thread threads[NUM_THREADS];

THREAD_PROC_RET inc(void* args)
{
	int i, j;
	for (i = 0; i < INC_I_END; i++)
	    for (j = 0; j < INC_J_END; j++)
	    {
			Lock(lock);
			count++;
			Release(lock);
		}
	return 0;
}

int main()
{
	int i;
	unsigned int t0 = getms();
	for (i = 0; i < NUM_THREADS; i++)
        threads[i] = BeginThread(inc, NULL);
	for (i = 0; i < NUM_THREADS; i++)
        Join(threads[i]);
	printf("Time = %u\n", getms() - t0);
}
We can test whether spinning on the "dirty read" or on the exchange is faster by changing these lines in thread.h:

Code: Select all

static INLINE int volatile TryLock(SpinLock_P s) {return !(IsLocked(s) || LockedExchange(s,1));}
static INLINE void volatile Lock(SpinLock_P s) {while(IsLocked(s) || LockedExchange(s,1));}

To change to spinning on Mutex:
static INLINE int volatile TryLock(SpinLock_P s) {return !(LockedExchange(s,1));}
static INLINE void volatile Lock(SpinLock_P s) {while(LockedExchange(s,1));}
I think I get the idea of the dirty read now. If the semaphore is true (locked) after the dirty read, one saves the xchg - otherwise it only enables one to try the locked xchg - it is likely, but still not certain, that the thread becomes owner of the resource. Makes sense, and saves some memory writes in high-traffic situations.

Code: Select all

EnterCriticalSection:
    mov   eax, [sema] ; dirty read as precondition
    test  eax, eax
    jnz   EnterCriticalSection
    mov   eax, 1
    xchg  [sema], eax
    test  eax, eax
    jnz   EnterCriticalSection

    ; mutual exclusive access
    inc  [someSharedAddress]

LeaveCriticalSection:
    xor   eax, eax
    xchg  [sema], eax
But isn't it possible that one thread never gets the resource, because some other spinning thread always takes the chance to write a "one" between the dirty read of a "zero" and the locked xchg?

I had a look in the K10 optimization manual and 32-bit [LOCK] xchg is direct-path, 15 cycles - of course relatively slow, but it ensures that no other thread/process may modify the content between the read and write cycles of the instruction, due to the implicit lock. I didn't inspect the assembly of the lock intrinsics yet.

While "usual" applications may better sleep while resources are blocked, we like our threads to be immediately ready to search - and to keep the associated processor at 100% even if idle.
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Gerd Isenberg wrote:I had a look in the K10 optimization manual and 32-bit [LOCK] xchg is direct-path, 15 cycles - of course relatively slow, but it ensures that no other thread/process may modify the content between the read and write cycles of the instruction, due to the implicit lock. I didn't inspect the assembly of the lock intrinsics yet.

While "usual" applications may better sleep while resources are blocked, we like our threads to be immediately ready to search - and to keep the associated processor at 100% even if idle.
Yeah, I'm not really sure what that "pause" instruction does - whether it makes the CPU not use up 100% or not - but from what I've read it's highly recommended to use it in a spin-wait loop:
Intel wrote:The PAUSE instruction is first introduced for the Pentium 4 processor. Technically, it is a hint to
the hardware that says that the code being executed is a spin-wait loop. What the hardware does
with this hint is specific to a particular processor. On the Pentium 4 processor, the instruction
behaves as described above, resulting in a net performance improvement. However for all
known existing IA-32 processors the instruction is interpreted as a NOP (no operation) and does
not affect the program in any way. It has been verified that PAUSE is a NOP for all known Intel®
architectures prior to the Pentium 4 processor. It is even known to behave as a NOP on the non-Intel
x86 family processors that were available at the time of testing.

For this reason, it is highly recommended that you insert the PAUSE instruction into all spin-wait
code immediately. Using the PAUSE instruction does not affect the correctness of programs on
existing platforms, and it improves performance on Pentium 4 processor platforms.
My guess is that "pause" possibly relates only to hyper-threading and, like you said, possibly doesn't concern us.