Monday, August 31, 2009

Lock-free Deque in Amino project

At fist, Amino Deque implements the algorithm invented by Maged M. Michael in the great paper "CAS-Based Lock-Free Algorithm for Shared Deques".

The deque is a doublely ended queue. Insertion and deletion can happen at both ends. Internally, this deque is represented as a doubly linked list. For each deque, there is an anchor object which comprise of 2 pointers and 1 flag. The pseudo-code is as shown below:
Anchor{
Node left, right;
enum {STABLE, LEFT, RIGHT} flag;
} anchor;
The key idea of this deque algorithm is to automatically updating this anchor field when deque is being changed. An anchor object occupies more than a machine word. Until now, there is no portable way to automatically updating such an "wide" object in Java. Actually, we don't need a wider "CAS" for implementing lock free deque. The idea is simple:
  1. We make Anchor immutable and use an AtomicReference to wrap it
  2. Whenever we want to update an anchor, we create a new Anchor object and use AtomicReference.compareAndSwap() to update the anchor automatically.
Above approach behaves differently as a wider "CAS" for some special cases. For example, it will judge equalism by comparing identity. It means even two objects has exactly same internal states, "CAS" will fail in comparison phase.
Besides, there is an object allocated for each updating. Will this be a big overhead? Since memory allocation in Java is relatively cheaper than other languages, this approach works well when frequency is moderate. Performance charts are shown below:



This deque is not completely compatible with standard JDK API. For example, it doesn't support element deletion by iterator.remove(). This is normal to Amino lock-free components. To implement such an unusual method for deque, performance of other frequently used methods such as addFirst()/pollFirst()/addLast()/pollLast() will become much slower than before. That's the reason why Amino library choose not to support such methods. I personally don't think it's a big problem. After all, if a data structure supports deletion in the middle, can it still be called a deque/queue/stack?

Tuesday, August 11, 2009

Performance of Amino Queue

Now we got the performance result of Amino Queue in an 8-core X86 machine which is running Linux and IBM JDK v6. We compared the performance of Amino's LockFreeQueue and java.util.concurrent.ConcurrentLinkedQueue. Both of the two queues used lock-free algorithm. Amino queue's algorithm comes from the brilliant paper "An Optimistic Approach to Lock-Free FIFO Queues" of Edya Ladan-Mozes and Nir Shavit. Result of the micro-benchmark shows Amino queue has some advantage:


The key idea behind the new algorithm is a novel way of replacing the singly-linked list of Michael and Scott, whose pointers are inserted using a costly compare-and-swap (CAS) operation, by an optimistic doubly-linked list, whose pointers are updated using a simple store, yet can be fixed if a bad ordering of events causes them to be inconsistent.

Tuesday, August 4, 2009

Performance of Amino Stack

In Amino library, stack is the simplest and yet the most useful component. According to our performance test, Amino's stack is far more faster than a lock-protected stack. Today I rerun the performance test on a 8-core X86-64 machine and result is as shown in below diagrams:










In the performance test, the 1st step is warm-up with 8 threads. That's the reason that first column of every bar charts has a label of "8". After that, we increase the thread number from "1" to "128". The test complete in 1,039.557 second.

The code (Apache license) of LockFreeStack can be downloaded at Amino's SF.net SVN

Monday, June 8, 2009

Performance Improvement in Amino C++ Library

In version 0.5.1, major improvement of Amino C++ library is about performance. Now performance of many data structures are relatively improved. The biggest improvement is the LockFreeList and LockFreeOrderedList. Each of them has made nearly 100% improvements on PPC.

The steps for performance tuning are as follows:
  1. Profile all the original data structure under X86 and PPC. ( Use oprofile under X86, tprof under PPC)
  2. Analyze the oprofile results and tprof results of *.etm files by VPA (Virtual Performance Analyzer) tool. Find the bottle neck and infer the possible reason.
  3. Trace back to source code to fix the bottle necks get from step 2.
  4. Retest the data structure with different scale of operation numbers, meanwhile fix bugs if any. Compare the performance with the original one.

All the performance tests have been done on X86 and PowerPC servers after bug fixing and code refine, and the performance is evaluated by the throughput per second. The approximate average improvements of all the C++ data structures are as follows.

Data Structure

Performance Improved

X86

PPC

LockFreeStack

50%

31%

LockFreeEBStack

30%

N/A

LockFreeQueue

N/A

N/A

LockFreeShavit-Queue

N/A

N/A

LockFreeDeque

33%

298%

LockFreeList

20%

100%

LockFreeOrderedList

N/A

132%

LockFreeSet

40%

90%


Diagram 1 Performance average improvements ratio

Some explanation about the diagram:
  1. The figures are approximate ones which only show the improvement degree.
  2. LockFreeDeque’s huge improvement on PPC because of its implementation originally is not lock free. This time it was re-implemented in lock free way.
  3. Because of the hardware difference, the performance improvement figures are not the same between X86 and PPC.
The performance chart of LockFreeOrderedList:
Diagram 2 Original performance of LockFreeOrderedList on PPC.
Diagram 3 Tuned performance of LockFreeOrderedList on PPC

The main points for improving the performance are as follows:

1.Change the improper memory order type.
As we know, the overhead time of lock free components are mainly on the synchronization of atomic operations, such as CAS, LL/SC, whereas these operations are mainly controlled by the memory ordering types. The memory order has five different types (memory_order_relaxed, memory_order_acquire, memory_order_release, memory_order_acq_rel, memory_order_seq_cs), and the sequential requirements of them becomes stronger as the given sequence. So if misuse the memory order type could waste a lot of time for synchronization.


2.Fixing bugs.
There are some tough bugs existed before I did the performance tuning which could lead to get wrong performance data.
For example, there are peekTop() functions in LockFreeStack, the original implementation didn’t protect the top pointer and get the value of top->data. It’s possible the top pointer had been deleted before the operation of top->data. Because of no protection by SMR, there is no synchronization, so the performance result is wrong.


3.Refine redundant code.
Some codes can be refined to be more succinct while the logic is not changed, especially for the time-consuming loop sections.

~LockFreeQueue() {

- Node* first = head.load(memory_order_relaxed)->next.load(

- memory_order_relaxed);

- while (NULL != first) {

- if (first == tail.load(memory_order_relaxed)) { /*last node in the queue*/

- delete first;

- break;

- }

- Node* next = first->next.load(memory_order_relaxed);

+ Node* first = head.load(memory_order_relaxed);

+ Node* next = NULL;

+ while (NULL != first) {

+ next = (first->next).load(memory_order_relaxed);

delete first;

first = next;

}

-

- delete head.load(memory_order_relaxed);

}




Diagram 4 Code snippets for refine
( “-” means the original code lines, “+” means refined code lines.)
As you can see, after refining, the destructor has reduced 5 lines of code, but the logic is the same.


4.Reduce unnecessary function invocation.
Some times, we can cache the function invocation results by local variables instead of calling the function every time.
For example, each time if we want to use SMR to protect a pointer, we have to call getHPRec(), this function is time consuming and the function return value will be used several times in a big function. If we can have a local variable to cache the value, we can save a lot of time.


5.Utilize good tools.
As we all know, C++ program is memory error prone, so the usage of memory in our library is very critical for the correctness. I can still remember segment fault is very usual.
We can use valgrind to check the memory usage. It’s useful, since I fixed a bug in ebstack destructor with this tool.
We also can use oprofile and tprofile to analyze the bottle neck of our program. It’s easy to use VPA (developed by IBM) to get the time consuming code sections in ASM form.


6.Choose the suitable optimization options.
Before running the performance testing case, it’s necessary to choose the best optimization option of the compiler. Different compilers have different options. I want to mention the XLC on AIX, according to the compile manual, the highest optimization option is -O5, –qinline and -qipa, but when I compile the data structure which contains String, if added the –qipa option, there was segment fault, so I removed this option when compiling data types of String.

All in all, through nearly two month work, the C++ components’ performance has got big improvement. Not only for the performance, but also fixed some tough bugs. At the same time, the SMR has been extended to support aligned memory management.

Wednesday, May 6, 2009

ANN: Concurrent Building Blocks: 0.5.3 released!

The Amino team is pleased to announce a new release of Concurrent
Building Blocks (Apache License v2).
(1) News in this release:
  • Added PPC version of C++ components. Now C++ components support three platforms(X86_32/Linux/GCC, X86_64/Linux/GCC, PPC32/AIX/XLC)
  • Fixed several tough bugs
  • Dramatically improved performance of C++ components
  • Extended SMR for managing aligned memory

(2)The primary goal is to develop concurrent Java/C++ libraries that
can be used by programmers. Components are grouped into four
categories:
  • Data Structures
  • Parallel Patterns
  • Parallel functions
  • Atomics and STM

(3) Links

And always, your feedbacks are welcome!

Amino Team

MTRAT helps opensource projects find concurrent errors

Developing, testing, and debugging multithreaded programs are still very difficult; it is all too easy to create concurrent programs that appear to work, but fail when it matters most: in production, under heavy load. The most common parallel errors are data race and deadlock. These errors are typically harmful and hard to locate. Multi-Thread Run-time Analysis Tool(MTRAT) is a tool that detects and analyzes potential data race and deadlock conditions that might occur in multithreaded Java programs.

Apache FtpServer is a popular open source Java FTP server. In order to handle multiple requests at the same time, threading and concurrency are introduced in FtpServer core and its dependent library MINA, a network application framework. Let's exploit MTRAT on FtpServer to see what happens and apply MTRAT in FtpServer is easy(my operating system is Linux, so this showcase is done under Linux).

  1. Download MTRAT from its webiste http://www.alphaworks.ibm.com/tech/mtrat, and configure it following the user manual. Suppose mtrat is extracted under ~/MTRAT.
  2. The FtpServer startup script file ftpd.sh locates under bin folder, open ftpd.sh and find "$JAVACMD" -classpath "$FTPD_CLASSPATH" $MAIN_CLASS $@ in last few lines, this command is for invoking FtpServer application. Comment this line with heading "#" and add one statement to replace "$JAVACMD" with "mtrat" like this,
    ~/MTRAT/mtrat/mtrat -x java.*:sun.*:javax.*:com.*:org.eclipse.*
    :org.apache.xerces.*:org.apache.lucene.* -Dcom.ibm.mtrat.dbg.cl=false -Dcom.ibm.mtrat.novolatile=true -Dcom.ibm.mtrat.threadcache=true -Dcom.ibm.mtrat.osm=true
    -classpath "$FTPD_CLASSPATH" $MAIN_CLASS $@
    #"$JAVACMD" -classpath "$FTPD_CLASSPATH" $MAIN_CLASS $@
  3. Save ftpd.sh and run it.


After Ftpserver launches with MTRAT, we open two new consoles and login as admin and anonymous respectively, then admin uploads a file to ftpserver while anonymous download a file from ftpserver. You will see data races are reported in Ftpserver console.

We submit two data race bugs found by MTRAT, one is in FtpServer([#FTPSERVER-122]Data races are found in FtpStatisticsImpl) while the other is in MINA([#DIRMINA-651]Data Race in org.apache.mina.core.session.AbstractIoSession), and attach the data race report generated by MTRAT.

Positive feedback is received from the Apache community, these two data race bugs are confirmed by the community leader and [#DIRMINA-651] is committed to be fixed in MINA 2.0.0-RC1 while [#FTPSERVER-122] will be fixed after component refactor.

  1. "Thanks a lot for the analysis! The reason why I did not fix the race conditions already is that I'm planning to bring FtpServers statistics handling more in line with MINA.
    Anyways, I would much appreciate further analysis on the source code, especially after the statistics implementation is updated."

    -- Niklas Gustavsson
  2. "Some of these races are in the idle time checking, which is a core functionality in MINA of course. Those we have to fix."
    -- Niklas Gustavsson


In addition to data race bugs found in FtpServer, Bug 45608 in Tomcat is confirmed by community too. MTRAT is proven to be practical for real-world application and useful for parallel program developers.

Multi-threaded Runtime Analysis Tool Link

http://www.alphaworks.ibm.com/tech/mtrat

Tuesday, March 3, 2009

ANN: Multi-Thread Run-time Analysis Tool for Java: V2.2 released!

The Amino team is pleased to announce a new release of Multi-Thread Run-time Analysis Tool for Java (MTRAT)

(1) News in this release:

- More stable by fixing existing bugs
- Supports Linux/X86_64 and AIX/PPC64 platforms, now MTRAT support platforms as below

Operating SystemHardware PlatformJDK Version
Windows XPi386JDK 6 32-bit version
GNU/Linuxi386JDK 6 32-bit version
GNU/Linuxx86_64JDK 6 x64 version
AIXPPC64JDK 6 32-bit version


(2) Multi-Thread Run-time Analysis Tool is a tool that detects and analyzes potential data race and deadlock conditions that might occur in multi-threaded Java programs.

(3) Links
- Home: http://www.alphaworks.ibm.com/tech/mtrat
- Forum:http://www.alphaworks.ibm.com/tech/mtrat/forum
- Project Blog:http://aminoprj.blogspot.com

Your feedbacks are welcome!
Amino Team

Tuesday, February 17, 2009

Say goodbye to awful concurrency bugs -- Showcase of MTRAT on Tomcat

Multicore hardware trend drives more and more program archtects and progreammers take the double-edged sward -- parallel program, even it has been very notorious for its non-determined bugs, such race condition and deadlock, however mostly their injuries were beaten black and blue by this double-edged sward. What should and could we do? MTRAT might be one of the live-saving straw.

Today, I will show how do we use MTRAT to find concurrency defects in Tomcat, and hope this tool and article could comfort ones heart if he/she has been hurt by concurrency bugs a lot.

In one word, MTRAT instrument java classes when they are loaded by JVM, and runtime analysis engine is running during the program's execution. Once it finds some potential problems, it will report them with sufficient information, such as thread backtrace, class name, object id, etc.

Now, let us start the journey of MTRAT on Tomcat (my operating system is Linux, so this showcase is done under Linux),

  1. Download MTRAT from its webiste http://www.alphaworks.ibm.com/tech/mtrat, and configure it following the user manual.
  2. Start tomcat.
    [qiyao@mtrat-test:~/SourceCode/apache-tomcat-6.0.16]$ ./bin/startup.sh
  3. check java process of tomcat, record the process information of tomcat java process.
    [qiyao@mtrat-test:~/SourceCode/apache-tomcat-6.0.16]$ ps -elf | grep tomcat
    0 S qiyao 7415 1 4 81 0 - 143807 stext 11:13 pts/1 00:00:10 /opt/ibm-java-i386-60/bin/java -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/home/qiyao/SourceCode/apache-tomcat-6.0.16/conf/logging.properties -Djava.endorsed.dirs=/home/qiyao/SourceCode/apache-tomcat-6.0.16/endorsed -classpath :/home/qiyao/SourceCode/apache-tomcat-6.0.16/bin/bootstrap.jar -Dcatalina.base=/home/qiyao/SourceCode/apache-tomcat-6.0.16 -Dcatalina.home=/home/qiyao/SourceCode/apache-tomcat-6.0.16 -Djava.io.tmpdir=/home/qiyao/SourceCode/apache-tomcat-6.0.16/temp org.apache.catalina.startup.Bootstrap start


  4. stop tomcat,
    [qiyao@mtrat-test:~/SourceCode/apache-tomcat-6.0.16]$ ./bin/shutdown.sh

  5. Replace java with mtrat in process command line, and run it,
    ~/MTRAT/mtrat/mtrat -x java.*:com.*:sun.*:org.apache.xerces.*:javax.managment.* -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/home/qiyao/SourceCode/apache-tomcat-6.0.16/conf/logging.properties -Djava.endorsed.dirs=/home/qiyao/SourceCode/apache-tomcat-6.0.16/endorsed -classpath :/home/qiyao/SourceCode/apache-tomcat-6.0.16/bin/bootstrap.jar -Dcatalina.base=/home/qiyao/SourceCode/apache-tomcat-6.0.16 -Dcatalina.home=/home/qiyao/SourceCode/apache-tomcat-6.0.16 -Djava.io.tmpdir=/home/qiyao/SourceCode/apache-tomcat-6.0.16/temp org.apache.catalina.startup.Bootstrap start
  6. Access tomcat via browser and click examples shipped in tomcat, then you will see some data races are report in console,
Actually, user could deploy some applications to tomcat, and drive http requests to tomcat. Then, MTRAT could find more potential race conditions. In our experiments, we find some potential race conditions in tomcat 6.0.16, and one of them has been confirmed by tomcat commnity. See bug 45608

ANN: Concurrent Building Blocks: 0.3.2 released!

The Amino team is pleased to announce a new release of Concurrent
Building Blocks (Apache License v2).

(1) News in this release:
- Fix existing bugs
- Improved performance of C++ components
- A new bounded queue in Java
- Better makefile scripts

(2)The primary goal is to develop concurrent Java/C++ libraries that
can be used by programmers. Components are grouped into four
categories:
- Data Structures
- Parallel Patterns
- Parallel functions
- Atomics and STM

(3) Links
- Download and learn more from http://amino-cbbs.sourceforge.net/
- Discussion: https://groups.google.com/group/aminoprj
- Project Blog: http://aminoprj.blogspot.com/
- SourceForge: https://sourceforge.net/projects/amino-cbbs/
- Issue Tracker: https://sourceforge.net/tracker2/?group_id=209259
Your feedbacks are welcome!

Amino Team

Tuesday, February 10, 2009

Find parallel errors using MTRAT

Multicore architectures have become very popular. To take full advantage of such processor architectures, developers have to introduce parallelism in their software. Developing parallel software is hard and error prone. The most common parallel errors are data race and deadlock. When these errors occur, they are typically hard to locate. The Amino: Multi-threaded Runtime Analysis Tool (MTRAT) for Java can be used by developers to find these hard-to-find bugs.

Here I give two examples on how MTRAT finds data race and deadlock errors in parallel programs.

Data race occurs when two threads access the same memory location with no ordering constraints enforced between the accesses, where at least one of these accesses is a "write”. The Java sample below has potential data race. Class Value declares an integer field “x”, a synchronization method “add” to get x value and modify its value, a non-synchronization method “get” to return x value. We construct class Task by two Value instance and start two Task thread in main function. Run sample.DataRace with MTRAT, we can check potential data races in program runtime.

MTRAT detection result shows below. One data race is found by MTRAT, and it is reported with class fields, threads which access this field, and stack back trace information. This Read/Write data race happens on field x class sample/Value because two Task threads will visit instance v1 and v2, but add and get method in class Value are not protected by a same lock.

Run MTRAT in eclipse:



A deadlock is a situation in which two or more competing actions are waiting for the other to finish, so that neither ever does. Both situations are destructive and are common when building multi-threaded programs. The Java sample below has potential deadlock. In method harness2 of class Deadlock, two instances of StringBuffer L1 and L2 are created as parameters of class T3 constructor. In method run of class T3, thread will acquire two object locks sequentially, and then release them in opposite order. Since L1 and L2 are passed to t1 and t2 in opposite order, two threads will acquire two locks in opposite order, deadlock occurs. Run sample.Deadlock with MTRAT, we can check potential deadlocks in program runtime.


Run MTRAT in command line:

Run MTRAT in eclipse:


Multi-threaded Runtime Analysis Tool Link

http://www.alphaworks.ibm.com/tech/mtrat

Parallel Patterns for C++ Programmer

In Amino project, we've created several experimental parallel patterns in C++. There are already advanced parallel patterns for Java programmers, such as Fork/Join from Doug Lea. But things are so ready for C++ world.

Under the current release of Amino library project, we have created three parallelized version of existing patterns from STL. The method signature is very close to the original one:


  1. Foreach

  2. Pattern Usage Computing Kernel

    vector<int> dataV;

    ThreadPoolExecutor exec;

    for_each(exec, 2, dataV.begin(), dataV.end(), sum);
    exec.shutdown();
    exec.waitTermination();


    void
    sum (int n)
    {
    result += n;
    }


  3. Transform

  4. Pattern Usage Computing Kernel
    UnaryFunc<int> uf;
    vector<int> dataV;

    int i = 0;

    for ( ; i1);
    }

    ThreadPoolExecutor exec;

    // change each elemet to its twice
    transform(exec, 2, dataV.begin(), dataV.end(), dataV.begin(), uf);
    exec.shutdown();
    exec.waitTermination();


    template<typename ParaType>
    class UnaryFunc {
    public:
    ParaType operator()(ParaType element) {
    return 2 * element;
    }
    };


  5. Accumulate

  6. Pattern Usage Computing Kernel
    vector<int> dataV;

    // Test the function 1
    int result = accumulate<int>::iterator, int, ThreadPoolExecutor>(exec,
    2, dataV.begin(), dataV.end());
    exec.shutdown();
    exec.waitTermination();


    template<typename ParaType>
    class UnaryFunc {
    public:
    ParaType operator()(ParaType element) {
    return 2 * element;
    }
    };




Please note these patterns are in pretty early stage. The performance is still ridiculous now. Please let us know your opinion about the API design. And contributions are always welcome!