Lock-free - Non-blocking - Parallel Scalable Solutions

Lock-free multi-thread solutions for multi-core and more - Parallel Scalable Solutions

The digital road leads to massively multi-core computers needing non-blocking programming

SERVICES

- OFFERS

- TECHNOLOGIES

- NON-BLOCKING 1(4)

- NON-BLOCKING 2(4)

- NON-BLOCKING 3(4)

- NON-BLOCKING 4(4)

Services – Technologies – Non-blocking

4 (4)

Dynamic data structures

If we want to be able to design data structures which maximal size is not fixed limited, as example a linked list, we must also be able to handle dynamic allocation of memory. As we now must share pointer references to these memory blocks to other threads, a new problem arises - when can we safely re-use the memory of blocks that are no longer used? If we for example has taken away an element in our linked list, and thus wants to return the memory for this to the system, we must be sure that no other thread has a reference to this memory and is going to read or write to it. We must in some sense thus have some kind of "garbage collection" facility, i.e., some reclamation system. If we want our non-blocking functions still to be purely non-blocking, then all sub-functions must also be so. This consequently includes also function calls to the garbage handler. Moreover, we want to be able to allocate and return the memory dynamically, and then must also this be done non-blocking.

A lock-free memory handler

In the code below is shown a very simple "lock-free" memory handler that can manage to allocate memory blocks of fixed size and also to return back the memory when the memory blocks are no longer reference to in any way. It must be pointed out that this example definitely is not the most efficient way to do it, but it is an illustrative and simple example the clearly alights the fundamental problems.

typedef struct Node {
int volatile refcount;
struct Node * volatile next;
void *data;
} Node;

Node nodes[MAX_NODES];
Node * volatile freeList;

void InitMem() {
for(int i=0;i<MAX_NODES;i++) {
    nodes[i].refcount=1;
    nodes[i].next=&nodes[i+1];
}
nodes[MAX_NODES-1].next=NULL;
freeList=nodes;
}

Node * AllocNode() {
   for(;;) {
    Node *p = freeList;
    FAA(&(p->refcount), 2);
    if(CAS(&freeList,p,p->next)) {
      FAA(&(p->refcount),-1);
      return p;
    }
    ReleaseNode(p);
}
}

void ReleaseNode(struct Node * node) {
FAA(&(node->refcount),-2);
if(CAS(&node->refcount,0,1)) {
    if(node->next) ReleaseNode(node->next);
    Node *q; do {
      q = freeList;
      node->next = q;
    } while(!CAS(&freeList,q,node));
}
}

Node *DeRefLink(Node * volatile *link) {
for(;;) {
    Node *q = *link;
    if(q == NULL) return q;
    FAA(&(q->refcount), 2);
    if(q == *link) return q;
    ReleaseNode(q);
}
}

The actual memory blocks for allocation is kept in a linked list, called "freeList", and reference counting is used for keeping watch over references. For this purpose, each memory block has a variable that indicates the number of references that exist to this block. This means concrete that every time we are following a shared pointer (which is normally done with the *-operator) we must also increment the reference counter to the referred memory block. The problem is that all steps for this can not be done at once with help from the basic atomic operations. When we are increasing the counter (with FAA for example) it might be the case that the pointer no longer refers to that memory block, and that memory block might also have been re-used and allocated by another thread. The solution in this example is to always keep the variable for reference counting at the same memory address, even when other parts of the memory block has been used for other purposes than was done by this thread.

Similar problems appears at the allocation of memory blocks. What guarantees that "p->next" really has the value we read before, when the CAS operation finally succeeded? If the memory block "p" has been allocated and thereafter returned after that we read "p->next" and before "CAS", it might actually be that "p->next" has changed. By increasing the reference counter for the memory block "p" on before hand, we can make sure that any reclamation can not have occurred.

The reason why we are increasing the reference counter with 2 every time, is that we are using the least significant bit (with value 1) in the counter to something else. The problem is that when we are decrementing the counter, because a reference is no longer used, we must be able to decide when the memory should be returned (probably when the counter is 0) and more importantly to decide which thread that should do it. It may actually be the case that several threads have read the counter and seen it to be 0, but only the one that manages to atomically change it to 1 gets the task to return the memory block.

A lock-free dynamic queue

We are now using our non-blocking memory handler in order to design a non-blocking dynamic data structure. In the code below is shown a simple "lock-free" queue data structure.

Node * volatile head, * volatile tail;

void InitQueue() {
head = tail = AllocNode();
FAA(&(head->refcount),2);
head->next = NULL;
}

void *Dequeue() {
Node * node1, *next;
for(;;) {
    node1 = DeRefLink(&head);
    next = DeRefLink(&node1->next);
    if(next == NULL) {
      ReleaseNode(node1);
      return NULL;
    }
    if(CAS(&head,node1,next))
      break;
    ReleaseNode(node1);
    ReleaseNode(next);
}
void *value = next->data;
ReleaseNode(node1);
ReleaseNode(node1);
return value;
}

void Enqueue(void *value) {
Node * node = AllocNode();
FAA(&(node->refcount),2);
node->next = NULL;
node->data = value;
Node * old = DeRefLink(&tail);
Node * prev = old;
do {
    while(prev->next != NULL) {
      Node * prev2 = DeRefLink(&prev->next);
      ReleaseNode(prev);
      prev = prev2;
    }
} while(!CAS(&prev->next,NULL,node));
ReleaseNode(prev);
if(CAS(&tail,old,node))
    ReleaseNode(old);
else ReleaseNode(node);
}

Observe how the calls to the memory handler are used in the code; "AllocNode" for allocating new memory blocks, "DeRefLink" for following (de-referencing) pointers, and "ReleaseNode" for releasing references that are no longer used. The code is based on a linked list, where the first link is referenced by the "head" pointer and the last link always is a "dummy" element and contains no data. In order to remove an element, "head" is changed by the use of CAS to point to the next element and to add a new item a new element is added after the last element with the use of CAS, as illustrated in the figure below.

lock-free insert in a linked list implementation of a queue using compare-and-swap

As the "tail" pointer can not be updated at the same time as the new element is added, it will be behind in time, which means that every thread must step along the linked list until it reaches the very last element. This also means that we must be sure that it is safe to follow the "->next" pointers in every element, which is done by making sure that the reference count in the referred element includes this pointer. Moreover, we must consequently also decrease the reference counter when it is time to return the memory block containing this pointer, which is done in the function "ReleaseNode" by recursively calling "ReleaseNode" for all pointers contained in the memory block.

More advanced

Theoretically, one has been able to prove that it is possible to design parallel and "wait-free" operation of any size and complexity by the use of CAS. In practice it is although quite harder. Research in the area has been done internationally since over 30 years ago. As an example, there is a research group at Chalmers University of Technology which have worked with finding practical and specific applications on the technique with results coming relatively far towards its goal. There are also general methods that easily can be applied on almost whatever, but has on the other hand high overhead and low performance relatively to specific constructions. Something that also is in advantage for specific solutions is that there are now a large number of constructions, primarily within "lock-free" that is known and published. Parallel Scalable Solutions provide commercial solutions for developers, that span the greater part of what is known within the academic world on non-blocking techniques.

Also large companies like Intel, Microsoft, Sun and so on have initiated their own projects, even though yet in smaller scale. Intel merchandize their "Threading Building Blocks", Microsoft provide their "Interlocked" class in ".NET" and Win32 API, and Sun has implemented several non-blocking classes in Java 1.6. In the Linux community promising work has been done on combining simple non-blocking and lock-based mechanisms, especially in the kernel. This is mainly in focus on performance, as the whole construction generally achieves the same properties as a lock-based solution. The future will tell, although it is completely clear that the current non-blocking techniques can match or even significantly surpass conventional lock-based techniques, something that will definitely be more and more interesting as the number of logical processors continuously increase in the future.