Services Technologies
Non-blocking
In order to benefit from the parallelism of
multi-core and multi-processor systems, and
thus achieve higher perceived performance,
programs must obviously be designed in such
a way that they can be divided into pieces
that can run more or less independently on
each logical processor. This can be done in
several ways. One way is to use a
distributed model similar to that used for
computer clusters, where the program is
divided into separate parts that regularly
exchange messages. However, this can be
inefficient, especially if messages must be
sent often. More efficient, and perhaps
also more intuitive, is to let the program
parts communicate through common shared
variables. This model fits especially well
on the new processors, which have several
logical processors directly connected to a
common shared working memory. It is within
this model that multi-threaded programming
fits. Multi-threaded programming has
previously been used mostly for designing
programs in a manageable way when there are
several tasks that should be performed
regularly. Now it becomes more or less
necessary to design programs multi-threaded
in order to gain performance.
Parallel programming with common shared data
We now describe more concretely what kind
of problems can arise with this type of
parallel programming. Assume that we have a
number of threads, i.e., different program
pieces that can run in parallel, and that
these threads want to increment a common
shared counter. We have defined this
counter as a global variable:
int volatile counter = 0;
To increment the counter, we try to do it
the usual way; every thread contains a
program line like this:
counter++;
This may seem perfectly OK, but what really
happens when we execute our multi-threaded
program and two or more "counter++" are
executed simultaneously? Unfortunately,
there is a great risk that the counter gets
the wrong value, as can be seen in the
figure below.
The reason is that "counter++" is normally
not composed of only one machine
instruction, and that operations on the
working memory are normally done step by
step, with either a read or a write - not
both. In the case of incrementing the
counter, this is done in three steps: first
the variable's current value is read, then
the value is incremented by one within the
processor's arithmetic unit, and finally
the new value is written back to the
variable in memory. With several
simultaneous executions of "counter++", the
internal steps towards the memory can thus
be interleaved in such an unfortunate way
that the computation becomes wrong.
Solutions to this problem have of course
existed for a long time. What we want to
achieve is that "counter++" is run in one
single pass like a transaction, without
being interrupted by anything else once its
execution has started.
The standard solution for handling critical
regions, of which "counter++" is an example
in this case, is to use some kind of
locking mechanism. The system most often
provides locking mechanisms of the "mutex"
or "semaphore" type. The greatest
difference between "mutex" and "semaphore"
is normally that a "semaphore" can handle
longer waiting periods on a lock more
efficiently (it simply lets the operating
system run another thread while the lock is
busy, after having waited a certain while
for it to become available). With a lock
surrounding "counter++", the increment will
work painlessly, as is seen in the figure
below.
Now another problem shows up - what
happened to the parallelism? As the purpose
of the lock was to stop other threads from
changing the variable while we are working
on it, the result is that only one thread
at a time is doing something. This was not
a problem with a processor that can only
execute one thread at a time, but with a
processor with several logical processors
it amounts to poor utilization. Moreover,
operations on a "mutex" or "semaphore" are
relatively expensive in time in themselves
(especially compared with a simple
increment of a variable). As locking
mechanisms by definition block other
threads from running, they can also cause
several other problems of special
importance for time-critical systems.
Examples are the difficulty of safely
guaranteeing deadlines (as it is hard to
guarantee exactly when a certain lock gets
released and becomes available) and the
risk of deadlocks (if the lock is never
released).
There are solutions without locks
If we recall our example with "counter++",
what we really wanted was to be able to run
all the steps of the increment as one
single uninterrupted and atomic unit.
Fortunately, there is support for
performing shorter atomic steps in the
hardware. All modern processors provide
special machine-code instructions that can
perform a read directly followed by a write
to memory as one single unit. Operations of
this type are called atomic and run like a
transaction: either completely, with all
steps, or not at all. The most powerful of
these atomic instructions is called
"compare-and-swap" (CAS), and is available
directly or indirectly in most processors
and systems. See the code below for a
semantic description of this operation
(observe that this is not the actual code;
the real implementation is in hardware
and/or system libraries).
int CAS(int *p, int old, int new) {
  atomic {
    if (*p == old) {
      *p = new;
      return 1;
    }
    else
      return 0;
  }
}
The operation is mostly used in combination
with an earlier read of the variable's
value, and makes sure that no one has
changed the value between the moment it was
read and the moment the newly calculated
value is written back.
Also larger operations can be done atomically
The fundamental atomic operations that are
available in the hardware are relatively
limited; e.g., they can only operate on one
single memory word at a time and have
limited semantics. By combining several
atomic operations, one can nevertheless
write programs that can replace critical
regions of larger size and complexity.
However, this is not trivial; it can be
seen as an advanced mathematical puzzle
where every piece is simple and limited.
What makes it so complex is that every
single line of code in the program can be
executed simultaneously with any other
line, or be interrupted for an indefinitely
long period. One thus has no clue what
happens between every line of code in the
program, and must be able to handle this
and adjust for possible simultaneous
changes in the variables concerned. To
reconnect to our problem with "counter++",
we will now design a program that can
increase a variable by an arbitrary value
atomically.
int FAA(int volatile *p, int value) {
  int old;
  do {
    old = *p;
  } while (!CAS(p, old, old + value));
  return old;
}
This function first reads the old value of
the variable, then calculates the new
value, and finally uses CAS to make sure
that the variable has not been changed
before it is updated with the new value. If
the value has changed after this function
read it, the CAS will fail and the function
will retry. We can thus now use
"FAA(&counter,1)" and in this way make
"counter++" atomic - without any locking
mechanism. One might now ask whether this
function will always succeed for all
threads, or whether some thread could spin
forever in the loop. In theory, the answer
is yes, this is possible, although in
practice it is clearly very unlikely under
normal loads.