When you add the lock prefix to an instruction it becomes disastorously slow. You can test this out pretty quick in a sample program.hgm wrote:I also never have done any multi-core programming, but what would be wrong in simply using the inc/dec instruction with a lock prefix (or using xchg, which is always locked)? Being only a single instruction, that cannot be disastrously slow, can it?Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.
multithreading questions
Moderators: hgm, Rebel, chrisw
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
Re: multithreading questions
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: multithreading questions
The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data mutual exclusive. Entering a critical section is something like this...hgm wrote:I also never have done any multi-core programming, but what would be wrong in simply using the inc/dec instruction with a lock prefix (or using xchg, which is always locked)? Being only a single instruction, that cannot be disastrously slow, can it?Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.
Code: Select all
EnterCriticalSection:
mov eax, 1
xchg [sema], eax
test eax, eax
jnz EnterCritilaSection
; mutual exclusive access
inc [someSharedAddress]
LeaveCriticalSection:
xor eax, eax
xchg [sema], eax
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
Re: multithreading questions
With high contention, the above could would be faster with a test before the xchg.Gerd Isenberg wrote:The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data mutual exclusive. Entering a critical section is something like this...hgm wrote:I also never have done any multi-core programming, but what would be wrong in simply using the inc/dec instruction with a lock prefix (or using xchg, which is always locked)? Being only a single instruction, that cannot be disastrously slow, can it?Aleks Peshkov wrote:Integer flag decrement is not thread safe. If you make it safe it will be very slow.... polling a semaphore, until it is false - it might become slow if other threads request the shared ressource at the same time.Code: Select all
EnterCriticalSection: mov eax, 1 xchg [sema], eax test eax, eax jnz EnterCritilaSection ; mutual exclusive access inc [someSharedAddress] LeaveCriticalSection: xor eax, eax xchg [sema], eax
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: multithreading questions
sorry, I don't get that. Faster but wrong imho...Pradu wrote:With high contention, the above could would be faster with a test before the xchg.Code: Select all
EnterCriticalSection: mov eax, 1 xchg [sema], eax test eax, eax jnz EnterCritilaSection ; mutual exclusive access inc [someSharedAddress] LeaveCriticalSection: xor eax, eax xchg [sema], eax
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
Re: multithreading questions
You do a test on the variable that you are using for a lock before the exchange, then you test the returned result after the exchange. An example of this posted here: http://www.intel.com/cd/ids/developer/a ... /20464.htmGerd Isenberg wrote:sorry, I don't get that. Faster but wrong imho...Pradu wrote:With high contention, the above could would be faster with a test before the xchg.Code: Select all
EnterCriticalSection: mov eax, 1 xchg [sema], eax test eax, eax jnz EnterCritilaSection ; mutual exclusive access inc [someSharedAddress] LeaveCriticalSection: xor eax, eax xchg [sema], eax
Code: Select all
if (GETLOCK(lock) != 0)
{
While (VARIOUS_CRITERIA)
{
_asm pause; // pause instruction in spin-loop
if ((volatile int*)lock) == 0) // spin on dirty read, not on lock
{
if (GETLOCK(lock)) == 0)
{
goto PROTECTED_CODE;
}
}
}
BACK_OFF_LOCK(lock); // back off lock or do other work
}
PROTECTED_CODE:
One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock only when it is free.
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: multithreading questions
Seems to me that is related to hyperthreading - if no more than two threads really ask at the same time...Pradu wrote:You do a test on the variable that you are using for a lock before the exchange, then you test the returned result after the exchange. An example of this posted here: http://www.intel.com/cd/ids/developer/a ... /20464.htmGerd Isenberg wrote:sorry, I don't get that. Faster but wrong imho...Pradu wrote:With high contention, the above could would be faster with a test before the xchg.Code: Select all
EnterCriticalSection: mov eax, 1 xchg [sema], eax test eax, eax jnz EnterCritilaSection ; mutual exclusive access inc [someSharedAddress] LeaveCriticalSection: xor eax, eax xchg [sema], eax
Here the variable is always tested before the exchange to see if it is unlocked before an exchange is attempted.Code: Select all
if (GETLOCK(lock) != 0) { While (VARIOUS_CRITERIA) { _asm pause; // pause instruction in spin-loop if ((volatile int*)lock) == 0) // spin on dirty read, not on lock { if (GETLOCK(lock)) == 0) { goto PROTECTED_CODE; } } } BACK_OFF_LOCK(lock); // back off lock or do other work } PROTECTED_CODE:
One common mistake made by developers developing their own spin-wait loops is attempting to spin on an atomic instruction instead of spinning on a volatile read. Spinning on a dirty read instead of attempting to acquire a lock consumes less time and resources. This allows an application to only attempt to acquire a lock only when it is free.
-
- Posts: 27811
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: multithreading questions
OK, of course. The purpose of the samaphore is to make the thread requesting access wait until the other one is done. There is no such thing as 'faster waiting'. So I guess that the speed of semphore implementations is measured by how much time it takes to enter the critical section if it is not in use.Gerd Isenberg wrote:The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data mutual exclusive. Entering a critical section is something like this...
If 'busy waiting' is too expensive, the CPU should look for something useful to do in the mean time. But I have no doubt that task switches are made increadibly expensive by the OS.
Are there SMP engines that implement their own multitasking in a single thread? For non-recursive search implementations (implementing their own stack, indexed by a depth variable) this would not be so hard to do. Just give each virtual thread its own set of 'globals' containing their stack, accessed through a pointer, and all you would have to do is remember where the current task should resume, load the pointer to the globals for the new task, and jump to where it left off. With cooperative multitasking this could work very fast, just put a switch point directly after the critical section (so that a thread restarting there for its next time slice always has something to do), and an optional switch point just before the critical section, where only threads that are denied access pass the CPU to the next virtual thread that is not stalled there. As all virtual threads would run the same code, no flushing of the I-cache would ever be needed, and if you are lucky the working sets used by the various threads would not collide in the D-cache either. So it could be very fast. Of course, dividing the power of a single CPU over multiple virtual threads would make you suffer a scaling penalty.
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
Re: multithreading questions
Ok you might be correct. I have not tested it actually and believed that website I linked to. I just hacked up a quick test but don't have a multi-processor computer to test it on:Gerd Isenberg wrote:Seems to me that is related to hyperthreading - if no more than two threads really ask at the same time...
http://www.prism.gatech.edu/~gtg365v/Bu ... cktest.zip
Basically this is what it is:
Code: Select all
SpinLock lock = {0};
unsigned int count=0;
Thread threads[NUM_THREADS];
THREAD_PROC_RET inc(void* args)
{
int i,j;
for(i=0;i<INC_I_END;i++)
for(j=0;j<INC_J_END;j++)
{
Lock(lock);
count++;
Release(lock);
}
return 0;
}
int main()
{
int i;
unsigned int t0 = getms();
for(i=0;i<NUM_THREADS;i++)
threads[i]=BeginThread(inc,NULL);
for(i=0;i<NUM_THREADS;i++)
Join(threads[i]);
printf("Time = %u\n",getms()-t0);
}
Code: Select all
static INLINE int volatile TryLock(SpinLock_P s) {return !(IsLocked(s) || LockedExchange(s,1));}
static INLINE void volatile Lock(SpinLock_P s) {while(IsLocked(s) || LockedExchange(s,1));}
To change to spinning on Mutex:
static INLINE int volatile TryLock(SpinLock_P s) {return !(LockedExchange(s,1));}
static INLINE void volatile Lock(SpinLock_P s) {while(LockedExchange(s,1));}
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: multithreading questions
I think i get the idea of the dirty read now. If the semapore is true (locked) after the dirty read, one safes the xchg - otherwise it only enables to try the lock xchg - likely but still not sure that the thread becomes owner of the ressource. Makes sense and safes some memory writes in high traffic situations.Pradu wrote:Ok you might be correct. I have not tested it actually and believed that website I linked to. I just hacked up a quick test but don't have a multi-processor computer to test it on:Gerd Isenberg wrote:Seems to me that is related to hyperthreading - if no more than two threads really ask at the same time...
http://www.prism.gatech.edu/~gtg365v/Bu ... cktest.zip
Basically this is what it is:We can test whether spinning on the "dirty read" or on the exchange is faster by changing these lines in thread.h:Code: Select all
SpinLock lock = {0}; unsigned int count=0; Thread threads[NUM_THREADS]; THREAD_PROC_RET inc(void* args) { int i,j; for(i=0;i<INC_I_END;i++) for(j=0;j<INC_J_END;j++) { Lock(lock); count++; Release(lock); } return 0; } int main() { int i; unsigned int t0 = getms(); for(i=0;i<NUM_THREADS;i++) threads[i]=BeginThread(inc,NULL); for(i=0;i<NUM_THREADS;i++) Join(threads[i]); printf("Time = %u\n",getms()-t0); }
Code: Select all
static INLINE int volatile TryLock(SpinLock_P s) {return !(IsLocked(s) || LockedExchange(s,1));} static INLINE void volatile Lock(SpinLock_P s) {while(IsLocked(s) || LockedExchange(s,1));} To change to spinning on Mutex: static INLINE int volatile TryLock(SpinLock_P s) {return !(LockedExchange(s,1));} static INLINE void volatile Lock(SpinLock_P s) {while(LockedExchange(s,1));}
Code: Select all
EnterCriticalSection:
mov eax, [sema] ; dirty read as precondition
test eax, eax
jnz EnterCritilaSection
mov eax, 1
xchg [sema], eax
test eax, eax
jnz EnterCritilaSection
; mutual exclusive access
inc [someSharedAddress]
LeaveCriticalSection:
xor eax, eax
xchg [sema], eax
I had a look in the K10 optimization manual and 32-bit [LOCK] xchg is direct-path 15 cycles - of course relatively slow but it ensurses no other thread/process may modify the content between the read- and write cycle of the instruction due to the imlicit lock. I didn't inspect the assembly of the lock-instrinsics yet.
While "usual" applications may better sleep while ressources are blocked, we like our threads to be immediatly ready to search - and to keep the associated processor with 100% even if idle.
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
Re: multithreading questions
Yeah, I'm not really sure what that "pause" instruction does .. whether makes the CPU not use up 100% or not ... but from what I've read it's highly reccomended to use it in a spin-wait loop:Gerd Isenberg wrote:I had a look in the K10 optimization manual and 32-bit [LOCK] xchg is direct-path 15 cycles - of course relatively slow but it ensurses no other thread/process may modify the content between the read- and write cycle of the instruction due to the imlicit lock. I didn't inspect the assembly of the lock-instrinsics yet.
While "usual" applications may better sleep while ressources are blocked, we like our threads to be immediatly ready to search - and to keep the associated processor with 100% even if idle.
My guess is that "pause" is possibly related to only hyper-threading and, like you said, possibly doesn't concern us.Intel wrote:The PAUSE instruction is first introduced for the Pentium 4 processor. Technically, it is a hint to
the hardware that says that the code being executed is a spin-wait loop. What the hardware does
with this hint is specific to a particular processor. On the Pentium 4 processor, the instruction
behaves as described above, resulting in a net performance improvement. However for all
known existing IA-32 processors the instruction is interpreted as a NOP (no operation) and does
not affect the program in any way. It has been verified that PAUSE is a NOP for all known Intel ®
architectures prior to the Pentium 4 processor. It is even known to behave as a NOP on the non-Intel
x86 family processors that were available at the time of testing.
For this reason, it is highly recommended that you insert the PAUSE instruction into all spin-wait
code immediately. Using the PAUSE instruction does not affect the correctness of programs on
existing platforms, and it improves performance on Pentium 4 processor platforms.