Monday, June 8, 2009

Performance Improvement in Amino C++ Library

The major focus of version 0.5.1 of the Amino C++ library is performance. Many of the data structures are now noticeably faster. The biggest gains from tuning are in LockFreeList and LockFreeOrderedList, which improved by roughly 100% and 132% respectively on PPC.

The steps for performance tuning are as follows:
  1. Profile all the original data structures on X86 and PPC (oprofile on X86, tprof on PPC).
  2. Analyze the oprofile results and the tprof *.etm files with the VPA (Visual Performance Analyzer) tool; find the bottlenecks and infer the likely causes.
  3. Go back to the source code and fix the bottlenecks found in step 2.
  4. Retest the data structures at different operation scales, fixing any bugs along the way, and compare the performance with the original.

All performance tests were run on X86 and PowerPC servers after the bug fixes and code refinements, with performance measured as throughput per second. The approximate average improvements of all the C++ data structures are as follows.

Data Structure          Improvement (X86)   Improvement (PPC)
LockFreeStack           50%                 31%
LockFreeEBStack         30%                 N/A
LockFreeQueue           N/A                 N/A
LockFreeShavit-Queue    N/A                 N/A
LockFreeDeque           33%                 298%
LockFreeList            20%                 100%
LockFreeOrderedList     N/A                 132%
LockFreeSet             40%                 90%

Diagram 1: Average performance improvement ratios

Some explanation of the diagram:
  1. The figures are approximate and only indicate the degree of improvement.
  2. LockFreeDeque's huge improvement on PPC is because its original implementation was not lock-free; it has now been re-implemented in a lock-free way.
  3. Because of hardware differences, the improvement figures are not the same on X86 and PPC.

The performance charts of LockFreeOrderedList:

Diagram 2: Original performance of LockFreeOrderedList on PPC
Diagram 3: Tuned performance of LockFreeOrderedList on PPC

The main points for improving the performance are as follows:

1. Change improper memory order types.
As we know, the overhead of lock-free components comes mainly from the synchronization of atomic operations such as CAS and LL/SC, and the cost of these operations is largely determined by the memory order they request. There are five memory order types (memory_order_relaxed, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cst), and their ordering requirements grow stronger in that sequence. Requesting a stronger memory order than an operation actually needs wastes a lot of time on synchronization.
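
As a minimal sketch of the idea (written in std::atomic syntax; the library's own atomic wrapper differs in detail), a pure statistics counter needs only relaxed ordering, and a producer/consumer handoff needs only release/acquire rather than full sequential consistency:

    #include <atomic>
    #include <cstddef>

    struct Node { int data; Node* next; };

    std::atomic<long> opCount(0);
    std::atomic<Node*> head(NULL);

    // A statistics counter publishes no other data, so relaxed ordering
    // is enough; it avoids a full memory fence on PPC.
    void recordOp() {
        opCount.fetch_add(1, std::memory_order_relaxed);
    }

    // Publishing a node: the release store pairs with the acquire load
    // below. Asking for memory_order_seq_cst here would add a much more
    // expensive full barrier without providing any needed guarantee.
    void publish(Node* n) {
        head.store(n, std::memory_order_release);
    }

    Node* take() {
        return head.load(std::memory_order_acquire);
    }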


2. Fix bugs.
Some tough bugs existed before the performance tuning, and they could lead to wrong performance data.
For example, LockFreeStack has a peekTop() function whose original implementation read top->data without protecting the top pointer, so the node top pointed to could already have been deleted when top->data was read. Because the pointer was not protected by SMR, the necessary synchronization was missing, and the measured performance was therefore wrong.
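
Below is a hedged sketch of the corrected pattern, using hypothetical hazard-pointer names rather than the actual Amino SMR API: announce the pointer in a hazard slot, re-validate it, and only then dereference it.

    #include <atomic>
    #include <cstddef>

    struct Node { int data; Node* next; };

    std::atomic<Node*> top(NULL);
    std::atomic<Node*> hazardSlot(NULL);  // stands in for a per-thread SMR slot

    // Protected peek: the node cannot be freed by the reclaimer while it
    // is announced in the hazard slot and re-validated against top.
    bool peekTop(int& result) {
        Node* t;
        do {
            t = top.load(std::memory_order_acquire);
            if (t == NULL)
                return false;
            hazardSlot.store(t, std::memory_order_seq_cst);   // announce
        } while (t != top.load(std::memory_order_acquire));   // re-validate
        result = t->data;                                     // safe while announced
        hazardSlot.store(NULL, std::memory_order_release);    // clear the slot
        return true;
    }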


3. Refine redundant code.
Some code can be made more succinct without changing its logic, especially in time-consuming loop sections. The LockFreeQueue destructor below is one example.

~LockFreeQueue() {
-    Node* first = head.load(memory_order_relaxed)->next.load(
-            memory_order_relaxed);
-    while (NULL != first) {
-        if (first == tail.load(memory_order_relaxed)) { /* last node in the queue */
-            delete first;
-            break;
-        }
-        Node* next = first->next.load(memory_order_relaxed);
+    Node* first = head.load(memory_order_relaxed);
+    Node* next = NULL;
+    while (NULL != first) {
+        next = (first->next).load(memory_order_relaxed);
         delete first;
         first = next;
     }
-
-    delete head.load(memory_order_relaxed);
}




Diagram 4: Code snippet before and after refining ("-" marks the original code lines, "+" marks the refined ones).
As you can see, the refined destructor is five lines shorter, but the logic is the same.


4. Reduce unnecessary function invocations.
Sometimes we can cache a function's result in a local variable instead of calling the function every time.
For example, every time we want SMR to protect a pointer we have to call getHPRec(). This function is time-consuming, and its return value is used several times within a large function, so caching it in a local variable saves a lot of time.
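
Here is a hedged sketch of the pattern; HPRecType and the protect()/unprotect() helpers are illustrative stand-ins, not the actual Amino SMR signatures:

    #include <cstddef>

    struct Node { int data; Node* next; };

    // Minimal stand-ins for the library's SMR interface.
    struct HPRecType { Node* slot; };
    HPRecType* getHPRec() {
        // Stub: the real per-thread record lookup is far more expensive.
        static thread_local HPRecType rec = { NULL };
        return &rec;
    }
    inline void protect(HPRecType* rec, Node* p)   { rec->slot = p; }
    inline void unprotect(HPRecType* rec, Node* p) { rec->slot = NULL; (void)p; }

    // Before: each protection paid for getHPRec() again:
    //     protect(getHPRec(), p); ... unprotect(getHPRec(), p);
    // After: resolve the record once per operation and reuse it.
    void traverse(Node* start) {
        HPRecType* rec = getHPRec();        // cached once in a local variable
        for (Node* p = start; p != NULL; p = p->next) {
            protect(rec, p);                // reuse the cached record
            // ... read p->data ...
            unprotect(rec, p);
        }
    }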


5. Use good tools.
As we all know, C++ programs are prone to memory errors, so correct memory usage is critical to the library's correctness. I can still remember how common segmentation faults were.
valgrind is useful for checking memory usage (for example, valgrind --leak-check=full ./your_test); I fixed a bug in the EBStack destructor with it.
oprofile and tprof can be used to find a program's bottlenecks, and VPA (developed by IBM) makes it easy to see the time-consuming code sections in assembly form.


6. Choose suitable optimization options.
Before running the performance test cases, it is necessary to choose the compiler's best optimization options, and different compilers offer different ones. One note about XLC on AIX: according to the compiler manual, the highest optimization settings are -O5, -qinline, and -qipa, but when I compiled the data structures that contain String with -qipa, I got a segmentation fault, so I removed that option when compiling String-based data types.

All in all, after nearly two months of work, the C++ components' performance has improved significantly. Besides the performance gains, some tough bugs were fixed, and SMR has been extended to support aligned memory management.