Gerd Isenberg wrote:The single instruction with lock prefix is not the problem - but you have to enter a critical section before you may modify shared data mutual exclusive. Entering a critical section is something like this...
OK, of course. The purpose of the samaphore is to make the thread requesting access wait until the other one is done. There is no such thing as 'faster waiting'.

So I guess that the speed of semphore implementations is measured by how much time it takes to enter the critical section if it is
not in use.
If 'busy waiting' is too expensive, the CPU should look for something useful to do in the mean time. But I have no doubt that task switches are made increadibly expensive by the OS.
Are there SMP engines that implement their own multitasking in a single thread? For non-recursive search implementations (implementing their own stack, indexed by a depth variable) this would not be so hard to do. Just give each virtual thread its own set of 'globals' containing their stack, accessed through a pointer, and all you would have to do is remember where the current task should resume, load the pointer to the globals for the new task, and jump to where it left off. With cooperative multitasking this could work very fast, just put a switch point directly after the critical section (so that a thread restarting there for its next time slice always has something to do), and an optional switch point just before the critical section, where only threads that are denied access pass the CPU to the next virtual thread that is not stalled there. As all virtual threads would run the same code, no flushing of the I-cache would ever be needed, and if you are lucky the working sets used by the various threads would not collide in the D-cache either. So it could be very fast. Of course, dividing the power of a single CPU over multiple virtual threads would make you suffer a scaling penalty.