Lock-free multi-thread solutions for multi-core and more - Parallel Scalable Solutions Main menu for Parallel Scalable Solutions
TransparentArrow hider
Blank background bar
The digital road leads to massively multi-core computers needing non-blocking programming

   -  NON-BLOCKING 1(4)
   -  NON-BLOCKING 2(4)
   -  NON-BLOCKING 3(4)
   -  NON-BLOCKING 4(4)




Acrobat Reader Portable Document Format (PDF)

Multithread & Multiprocess: Problems and Solutions

DOWNLOAD & READ Arrow right

Services – Technologies – Non-blocking

1 (4)

In order to make benefits of the parallelism with multi-core and multi-processors and thus achieve higher experienced performance, the programs must obviously be designed in such a way that it can be divided into pieces that can be run more or less individually on each logical processor. This can be done in several ways. One way is to use a distributed model similar to that used for computer clusters, where the program is divided into separate program parts that regularly send messages between each others. However, this can be inefficient, especially if the messages must be sent often. More efficient, and maybe also more intuitive, is that the program parts communicate through common shared variables. This model fits especially well on the new processors which have several logical processors that is directly connected with a common shared work memory. It is within this model that multi-thread programming fits. Multi-thread programming has previously been used mostly for designing programs in a manageable way when one has several tasks that should be performed regularly. Now, it becomes more or less necessary to design program multi-threaded in order to get increased performance. 

Parallel programming with common shared data

We are now describing more concrete what type of problems that can show up with this type of parallel programming. Assume we have a number of thread, i.e., different program pieces that can be run in parallel, and that these threads want to increment a common shared counter. This counter we have defined as a global variable:  

int volatile counter=0;

In order to increment the counter, we try to do as usual; all threads contain a program line as follows: 


This can seem to be perfectly ok, but what happens really when we are executing our multi-threaded program and two or more "counter++" are executed simultaneously? Unfortunately, there is a great risk that the counter gets wrong value, as can be seen in the figure below.

concurrently incrementing a variable can result in incorrect result

The reason for this is that "counter++" normally is not compromised of only one machine instruction, and that operations on the work memory normally is done step by step with either reading or writing - not both. In the case of incrementing the counter this is done in 3 steps; first the variable's current value is read, then the value is incremented by one within the processor's arithmetic unit, and finally the new value is written back to the variable in memory. With several simultaneous of "counter++" can thus the internal steps towards the memory be interleaved in such an unfortunate way that the computation becomes wrong. There has of course been solutions on this problems since long. What we want to achieve is that "counter++" should be run in one single pass like a transaction, without being interrupted by anything else when the execution of "counter++" has started. 

The standard solutions for handling critical regions, that "counter++" is an example of in this case, is to use some kind of locking mechanism. The system are mostly often providing some kind of locking mechanism of the "mutex" or "semaphore" types. The greatest difference between "mutex" and semaphore" are normally that "semaphore" can handling longer waiting periods on a lock in more efficient way (it simply lets the operating system run another thread during the time the lock is busy, after having waited for a certain while on it to become available). With a lock surrounding "counter++" the increment will work painlessly, as is seen in the figure below.

using locks guarantee consistency but can impose waiting

Now, another problem shows up - what happened to the parallelism? As the purpose with the lock was to stop other threads to change the variable while we are working on it, the result will be that only one thread is doing something. This wasn't any problem with a processor that can only execute one thread at a time, but with a processor with several logical processors this becomes poor utilization. Moreover are operations with "mutex" and "semaphore" relatively expensive in time in themselves (especially if compare with a simple increment of a variable). As locking mechanisms by definition blocks other threads from being run, this can also cause several other problems which are of special importance for time-critical systems. These problems can be difficulties with safely guaranteeing deadlines (as it is hard to guarantee exactly when a certain lock gets released and becomes available) and avoiding risk for dead-locks (if the lock is never released).

There are solutions without locks

If we are recalling our example with "counter++", what we really wanted to was to be able to run all the steps in the increment as one single non-interrupted and atomic unit. Fortunately there is support for doing shorter atomic steps in the hardware. In all modern processors, there are special machine-code instructions that can perform a read directly followed by a write to the memory as on single unit. This type of operations are called atomic and is run like a transaction, either completely with all steps or not at all. The most powerful of these atomic instructions are called "compare-and-swap" (CAS), and is available directly or indirectly in most processors and systems. See the code below for a semantic description of this operation (observe that this is not the actual code, the real implementation is in hardware and/or system libraries).

int CAS(int *p, int old, int new) {
  atomic {
    if(*p == old) {
      return 1;
    else return 0;

The operation is mostly used in combination with an earlier read of the value in a variable and makes sure that no one has changed the value before one has calculated the new value and written this to the variable.

Also larger operations can be done atomically

The fundamental atomic operations that are available in the hardware is relatively limited, e.g., they can only operate on one single memory word at a time and has a limited semantic. By combining several atomic operations, one can write programs that actually can replace critical regions of larger sizes and complexity. However, this not trivial, it can be seen like an advanced mathematical puzzle where every piece is simple and limited. The thing that makes it so complex, is that every single line of code in the program can be executed simultaneously as any other line or be interrupted for an indefinitely long time period. One has thus no clue what is happening between every line of code in the program, and must be able to take care of this and adjust for eventual simultaneous changes in concerned variables. In order to re-attach to our problem with "counter++", we will now design a program that can increase a variable with an arbitrary value atomically.   

int FAA(int volatile *p, int value) {
  int old;
  do {
  } while(!CAS(p,old,old+value));
  return old;

This function first reads the old value of the variable, then calculates the new value, and lastly makes sure using CAS that the variable has not been changed before the variable gets updated with the new value. If the value has been changed after that this function read the value of the variable, the CAS will fail and the function will repeat. Thus, we can now make use of "FAA(&counter,1)" and in this way make "counter++" atomic - without any locking mechanism. Now, one might ask if this function will always succeed for all thread, or is there any possibility that some thread are spinning forever in the program loop? In theory, the answer is yes to this possibility, although in practice this is clearly very unlikely under normal loads.

Arrow left PREVIOUS

NEXT Arrow right

Click here to view the swedish language version of the siteCreated by Kobolt.se, www.kobolt.se. ©2008 Parallel Scalable Solutions AB.
To home page