Wednesday, 22 June 2011

Dissecting the Disruptor: What's so special about a ring buffer?

Recently we open sourced the LMAX Disruptor, the key to what makes our exchange so fast.  Why did we open source it?  Well, we've realised that conventional wisdom around high performance programming is... a bit wrong. We've come up with a better, faster way to share data between threads, and it would be selfish not to share it with the world.  Plus it makes us look dead clever.

On the site you can download a technical article explaining what the Disruptor is and why it's so clever and fast.  I even get a writing credit on it, which is gratifying when all I really did is insert commas and re-phrase sentences I didn't understand.

However I find the whole thing a bit much to digest all at once, so I'm going to explain it in smaller pieces, as suits my NADD audience.

First up - the ring buffer.  Initially I was under the impression the Disruptor was just the ring buffer.  But I've come to realise that while this data structure is at the heart of the pattern, the clever bit about the Disruptor is controlling access to it.

What on earth is a ring buffer?
Well, it does what it says on the tin - it's a ring (it's circular and wraps), and you use it as a buffer to pass stuff from one context (one thread) to another:


(OK, I drew it in Paint.  I'm experimenting with sketch styles and hoping my OCD doesn't kick in and demand perfect circles and straight lines at precise angles).

So basically it's an array with a pointer to the next available slot.


As you keep filling up the buffer (and presumably reading from it too), the sequence keeps incrementing, wrapping around the ring:

To find the slot in the array that the current sequence points to you use a mod operation:
sequence mod array length = array index
So for the above ring buffer (using Java mod syntax): 12 % 10 = 2. Easy.

Actually it was a total accident that the picture had ten slots.  Powers of two work better because computers think in binary - when the ring size is a power of two, the relatively expensive mod can be replaced with a cheap bitwise AND.
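To make that concrete, here's a tiny sketch in Java (the Disruptor's own language) of mapping a sequence number to an array index - the names are mine, not the real API:

```java
public class SequenceToIndex {
    public static void main(String[] args) {
        int ringSize = 8;   // a power of two
        long sequence = 12;

        // The general mapping: sequence mod array length = array index
        int index = (int) (sequence % ringSize);

        // When the size is a power of two, the mod can be replaced
        // with a cheap bitwise AND against (size - 1)
        int fastIndex = (int) (sequence & (ringSize - 1));

        System.out.println(index + " " + fastIndex);  // prints "4 4"
    }
}
```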

So what?
If you look at Wikipedia's entry on Circular Buffers, you'll see one major difference from the way we've implemented ours - we don't have a pointer to the end.  We only have the next available sequence number.  This is deliberate - the original reason we chose a ring buffer was so we could support reliable messaging.  We needed a store of the messages the service had sent, so that when another service sent a nak to say it hadn't received some messages, we would be able to resend them.

The ring buffer seems ideal for this.  It stores the sequence to show where the end of the buffer is, and if it gets a nak it can replay everything from that point to the current sequence:


The difference between the ring buffer as we've implemented it, and the queues we had traditionally been using, is that we don't consume the items in the buffer - they stay there until they get over-written.  Which is why we don't need the "end" pointer you see in the Wikipedia version.  Deciding whether it's OK to wrap or not is managed outside of the data structure itself (this is part of the producer and consumer behaviour - if you can't wait for me to get round to blogging about it, check out the Disruptor site).
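Putting publishing and nak-based replay together, a minimal sketch might look like this (plain Java with made-up names, not the real Disruptor code; it assumes the nak'd entries haven't been over-written yet - managing that is exactly the producer and consumer behaviour that lives outside the data structure):

```java
import java.util.ArrayList;
import java.util.List;

// A minimal replay sketch: entries stay in the ring until over-written,
// so everything from a nak'd sequence to the current one can be re-read.
public class ReplayableRing {
    private final String[] entries;
    private long nextSequence = 0;  // the next available sequence number

    public ReplayableRing(int size) {
        entries = new String[size];
    }

    public void publish(String message) {
        entries[(int) (nextSequence % entries.length)] = message;
        nextSequence++;
    }

    // Replay everything from the nak'd sequence up to the current one.
    // Assumes those entries haven't wrapped - preventing an unwanted
    // wrap is handled outside the data structure.
    public List<String> replayFrom(long nakSequence) {
        List<String> replayed = new ArrayList<>();
        for (long seq = nakSequence; seq < nextSequence; seq++) {
            replayed.add(entries[(int) (seq % entries.length)]);
        }
        return replayed;
    }
}
```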

And it's so great because...?
So we use this data structure because it gives us some nice behaviour for reliable messaging.  It turns out though that it has some other nice characteristics.

Firstly, it's faster than something like a linked list because it's an array, and has a predictable pattern of access.  This is nice and CPU-cache-friendly - at the hardware level the entries can be pre-loaded, so the machine is not constantly going back to main memory to load the next item in the ring.

Secondly, it's an array and you can pre-allocate it up front, making the objects effectively immortal.  This means the garbage collector has pretty much nothing to do here.  Again, this is unlike a linked list, which creates an object for every item added to the list - these then all need to be cleaned up when they're no longer in the list.
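The pre-allocation idea can be sketched like this (my own illustration, not the Disruptor's actual code) - every entry is created once, up front, and then mutated in place on each publish:

```java
public class PreallocatedRing {
    // A mutable entry, allocated once and re-used forever
    static final class Entry {
        long value;
    }

    private final Entry[] entries;
    private long nextSequence = 0;

    public PreallocatedRing(int size) {
        // Allocate every entry up front - after this point no new
        // objects are created, so the garbage collector stays idle
        entries = new Entry[size];
        for (int i = 0; i < size; i++) {
            entries[i] = new Entry();
        }
    }

    public void publish(long value) {
        // Overwrite the existing entry rather than allocating a new one
        entries[(int) (nextSequence % entries.length)].value = value;
        nextSequence++;
    }

    public long get(long sequence) {
        return entries[(int) (sequence % entries.length)].value;
    }
}
```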

The missing pieces
I haven't talked about how to prevent the ring wrapping, or specifics around how to write stuff to and read things from the ring buffer.  You'll also notice I've been comparing it to a data structure like a linked list, which I don't think anyone believes is the answer to the world's problems.

The interesting part comes when you compare the Disruptor with an implementation like a queue.  Queues usually take care of all the stuff like the start and end of the queue, adding and consuming items, and so forth.  All the stuff I haven't really touched on with the ring buffer.  That's because the ring buffer itself isn't responsible for these things, we've moved these concerns outside of the data structure.

For more details you're just going to have to read the paper or check out the code.  Or watch Mike and Martin at QCon San Francisco last year.  Or wait for me to have a spare five minutes to get my head around the rest of it.

20 comments:

  1. If you don't consume elements from your ring buffer then you're keeping them reachable and preventing them from being deallocated. This can obviously have an adverse effect on the throughput and latency of the garbage collector.

    Writing references into different locations in your ring buffer incurs the write barrier, which can also adversely affect throughput and latency.

    I wonder what the trade-offs are concerning these disadvantages and when they come into play.

  2. With regards to use of memory, no real trade-offs are made by the Disruptor. Unlike a queue, you have a choice about how to make use of memory. If the solution is a soft real-time system, reducing GC pauses is paramount. Therefore you can re-use the entries in the ring buffer, e.g. copying byte arrays to and from network I/O buffers in and out of the ring buffer (our most common usage pattern). As the amount of memory used by the system remains static, this reduces the frequency of garbage collection.

    It is also possible to implement an Entry that contains a reference to an immutable object. However in that situation it may be necessary for the consumer to null out the message object to reduce the amount of memory that needs to be promoted from Eden. So a little more effort is required from the programmer to build the most appropriate solution. We believe that the flexibility provided justifies this small bit of extra effort.

    Considering the write barrier, the primary goal of the Disruptor is to pass messages between threads. We make no trade-offs regarding ordering or consistency, therefore it is necessary to use memory barriers in the appropriate places. We've done our utmost to keep this to a minimum. However, we are many times faster than the popular alternatives, as most of them use locks to provide consistency.

    Mike.

  3. How does this approach compare to the Pool approach and other approaches used here: http://cacm.acm.org/magazines/2011/3/105308-data-structures-in-the-multicore-age/fulltext

    Why not use a Pool instead of a queue? Is the LIFO requirement essential?

  4. Unfortunately I can't read that article because I don't have an account at that site.

    FIFO (not LIFO) is absolutely essential - our exchange depends upon predictable ordering, and if you play the same events into it you will always get the same outcome. The Disruptor ensures this ordering without taking the performance penalties usually associated with FIFO structures.

  5. @Flying Frog Consultancy said "If you don't consume elements from your ring buffer then you're keeping them reachable and preventing them from being deallocated. This can obviously have an adverse effect on the throughput and latency of the garbage collector."

    The whole point is *not* to invoke the garbage collector. The Disruptor pattern allows data to be passed between CPUs at pretty much the theoretical maximum of the hardware - it's been well thought out.

  6. I'm new to Disruptor pattern. I have a very basic question. How do I add messages to the Ring Buffer from a Multi Threaded Producer. Should the add calls to the Ring Buffer be synchronized?

    Thank you.

  7. Generally the aim is not to run anything multi-threaded. Producers and consumers should be single-threaded. But you can have more than one producer: http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-writing-to-ring.html - this is a slightly out of date post, the naming conventions have changed and the producer barrier is now managed by the ring buffer, but I think this might be a good place to start to think about how to solve your problem.

  8. Hi Trisha,

    Thanks for the interesting article.
    I am not sure if I understand, but the concept of holding on to memory and reusing already allocated objects to avoid GC pauses does not seem to be new. How is the ring buffer different from an object pool?

    Thanks!

  9. Hi Sourabh,

    Avoiding GC is not the main aim of the RingBuffer, although it does help towards the speed of the Disruptor. The interesting characteristics of the RingBuffer are that it's FIFO, and it enables some really nice batching when you read from it. The RingBuffer is not the secret sauce in the Disruptor's performance, in fact in the current version of the Disruptor you don't need it at all.

    It's worth noting that there's nothing new in the Disruptor at all, in fact many of the ideas have been around for years. But I don't think there are any other frameworks in Java that pull together these concepts in this way to give the kind of performance we see when using the Disruptor.

  10. Hi Trisha,
    A few days ago I discovered the LMAX architecture and the Disruptor too. It's not so clear to me how exactly the consumers extract the messages from the RingBuffer, and how exactly a consumer, for example C1, knows which messages are for it and not for another consumer, C2?
    Thanks
    Sorin.

    1. Hi Sorin,

      Actually the messages are for both consumers. The default behaviour is that all consumers (or EventHandlers as they are now) read all messages in the RingBuffer. If you have different types of events that are handled by different consumers, then it's up to the consumer to decide whether to ignore the event or not. So, if C1 handles all blue messages and C2 handles all red ones (over simplification of course) then C1 needs to check it's a blue message before proceeding.

      In terms of extracting the messages - you don't. Messages live on the ring buffer to be read by (and processed by) all consumers, until every consumer has done what it needs to do with it (i.e. every consumer has incremented its sequence number to at least that message's number) then it will get over-written when the ring wraps. If you want to do something with that message, then you simply read it and do whatever you want with it, even if that's passing it on to another Disruptor or another part of the system.
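      To make the colour example concrete, a consumer's filtering might look something like this sketch (plain Java with made-up names - the real interface these days is EventHandler, but the idea is the same):

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of a consumer that handles only "blue" messages. Every
// consumer sees every message; it's up to each one to ignore events
// that aren't relevant to it (the colour field is made up).
public class BlueConsumer {
    static final class Message {
        final String colour;
        final String payload;

        Message(String colour, String payload) {
            this.colour = colour;
            this.payload = payload;
        }
    }

    private final List<String> handled = new ArrayList<>();

    public void onEvent(Message message) {
        if (!"blue".equals(message.colour)) {
            return;  // not ours - skip it, but leave it in the ring
        }
        handled.add(message.payload);
    }

    public List<String> handled() {
        return handled;
    }
}
```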

  11. Hi,

    Great Work.

    Could you tell me how the RingBuffer handles overflow?

    Thanks

    1. By overflow, you mean what happens when the RingBuffer fills up? Well, the Producer (EventPublisher), the thing that writes into the RingBuffer, has to block until there is an available slot in the RingBuffer. See my (now very old) post on writing to the buffer: http://mechanitis.blogspot.co.uk/2011/07/dissecting-disruptor-writing-to-ring.html.

      This is done very deliberately as you want to create back pressure in your system rather than running out of memory: http://mechanical-sympathy.blogspot.co.uk/2012/05/apply-back-pressure-when-overloaded.html

  12. This comment has been removed by the author.

  13. Hi Trisha,
    Thanks for this and other presentations.

    I have a question regarding the disruptor which is rather basic.

    The consumers (event processors) don't implement any of the Callable or Runnable interfaces - they implement EventHandler. So how can they run in parallel? For example, I have a disruptor implementation with a diamond pattern like this:



    P1 - [c1,c2,c3] - c4 - c5



    Where c1 to c3 can work in parallel after p1, and C4 and C5 work after them.

    So conventionally I'd have something like this (with P1 and C1-C5 being runnables/callables)

    p1.start();
    p1.join();

    c1.start();
    c2.start();
    c3.start();
    c1.join();
    c2.join();
    c3.join();

    c4.start();
    c4.join();
    c5.start();
    c5.join();

    But in case of the Disruptor none of my event handlers implement Runnable or Callable, so how'd the disruptor framework end up running them in parallel?

    Take the following scenario:

    My consumer C2 needs to make a web service call to annotate the Event. In SEDA I can start up 10 threads for 10 such C2 requests [for pulling the message off the queue, making the web service call and updating the next SEDA queue], and that makes sure I don't wait sequentially for a web service response for each of the 10 requests, whereas in this case my event processor C2 (if it's a single instance) would wait sequentially for the 10 C2 requests.

    http://stackoverflow.com/questions/17019203/disruptor-are-the-consumers-multithreaded

    -Regards
    Nikhil

  14. Hi Trisha,

    In Java, creating an array of Java objects does not allocate memory for the objects. It only allocates memory for references to the objects. How does an array of object references help improve CPU caching efficiency because the actual objects are still scattered in the heap?

    Thanks,
    Richard

    1. You're absolutely right, which is why in the LMAX case we have an array of byte arrays, not an array of objects - at least for the high performance instances of the disruptor. An array of object references is still valuable in a lot of cases, but as you say it doesn't necessarily give you the cache line affinity.

      This has come up several times in the Google Group discussions (https://groups.google.com/forum/#!forum/lmax-disruptor), I think you'll find more detailed discussion there.

  15. Hi Trisha;

    I've just spent a good part of my day going through your disruptor, and I can see from your example the hundreds of millions of ops per second, and also from the presentation the 6 million trades per second. I've just written an example with the producer retrieving the operation request from a web service, with 2 consumers - one for marshalling and the other for business logic - and my throughput is just over 1000 ops per second (source: https://github.com/ejosiah/activemq-vs-distruptor).

    My question: do any of your metrics include I/O operations from other consumers, such as those for journalling, replication, serialization, etc.?

    1. No, the metrics quoted are for the business logic only, not for I/O etc. I think if you check the Google Group history you'll find more specific information about what was measured and how, this question has definitely come up before:

      https://groups.google.com/forum/#!forum/lmax-disruptor
