When Nanoseconds Matter: Ultrafast Trading Systems in C++

These are my personal notes on the talk “When Nanoseconds Matter: Ultrafast Trading Systems in C++” from CppCon.

Order Book

Properties

  • Two ordered sequences
    • Bids: from highest to lowest price
    • Asks: from lowest to highest price
  • Price Level
    • Price
    • Size: sum of all orders’ size at this price level
  • Each order has an ID (uint64_t) which is unique throughout the trading session
  • A typical stock order book has ~1,000 price levels “per side”
std::map<Price, Volume, std::greater<Price>> mBidLevels;
std::map<Price, Volume, std::less<Price>> mAskLevels;

template <class T>
typename T::iterator AddOrder(T& levels, Price price, Volume volume)
{
    auto [it, inserted] = levels.try_emplace(price, volume);
    if (!inserted)
        it->second += volume;
    return it;
}

template <class T>
void DeleteOrder(typename T::iterator it, T& levels, Volume volume)
{
    it->second -= volume;
    if (it->second <= 0)
        levels.erase(it);
}

Principle #1: “Most of the time, you don’t want node containers”

Using std::vector

Use two sorted vectors (bids and asks) and find the insertion point with std::lower_bound (binary search).

std::vector<std::pair<Price, Volume>> mBidLevels;
std::vector<std::pair<Price, Volume>> mAskLevels;
  • AddOrder
    • log(N) if the price level exists
    • log(N) + N if a new price level is inserted
  • ModifyOrder: log(N)
    • We can’t store iterators/pointers, as they are invalidated by std::vector::insert
  • DeleteOrder
    • log(N) if the order only reduces the level’s volume
    • log(N) + N if the price level is removed (the remaining elements are shifted)
void AddOrder(Side side, Price price, Volume volume)
{
    if (side == Side::Bid)
        return AddOrder(mBidLevels, price, volume, std::greater<Price>());
    else
        return AddOrder(mAskLevels, price, volume, std::less<Price>());
}

template <class T, class Compare>
void AddOrder(T& levels, Price price, Volume volume, Compare comp)
{
    auto it = std::lower_bound(levels.begin(), levels.end(), price,
        [comp](const auto& p, Price price) { return comp(p.first, price); });
    
    if (it != levels.end() && it->first == price)
        it->second += volume;
    else
        levels.insert(it, {price, volume});
}

“reverse” vector

// Reversing the vector to minimize the number of elements being moved
auto GetBestPrices() const
{
    return std::pair{mBidLevels.back().first, mAskLevels.back().first};
}

Principle #2: “Understanding your problem (by looking at data!)”

Principle #3: “Hand tailored (specialized) algorithms are key to achieve performance”

Running perf on a benchmark

Start perf only after initialization has completed, so the benchmark itself is measured without the cost of initialization.

void RunPerf()
{
    pid_t pid = fork();
    if (pid == 0)
    {
        const auto parentPid = std::to_string(getppid());
        std::cout << "Running perf on parent process " << parentPid << std::endl;
        // Changing the current process into perf
        execlp("perf", "perf", ..., parentPid.c_str(), (char*)nullptr);
        throw std::runtime_error("execlp failed");
    }
}

void InitAndRunBenchmark()
{
    InitBenchmark();  // might take a long time!
    RunPerf();
    RunBenchmark();
}

First perf measurements should stay broad; drill into specific counters only once the hotspots are known.

Because of branch prediction, a misprediction flushes the pipeline and the CPU has to fetch instructions from the correct address again, wasting roughly 10-20 clock cycles.

By using branchless logic, the waste of clock cycles is avoided.

template <class ForwardIt, class T, class Compare>
ForwardIt branchless_lower_bound(ForwardIt first, ForwardIt last, 
                                  const T& value, Compare comp)
{
    auto length = last - first;
    
    while (length > 0)
    {
        auto half = length / 2;
        
        // multiplication (by 1) is needed for GCC to generate CMOV
        first += comp(first[half], value) * (length - half);
        
        length = half;
    }
    
    return first;
}

For small ranges, linear search can beat binary search thanks to better cache locality and predictable branches.

Principle #4: “Simplicity is the ultimate sophistication”

Principle #5: “Mechanical sympathy”

  • I-Cache misses - likely/unlikely attributes
  • I-Cache misses - Immediately Invoked Function Expressions (IIFE)
  • Lambda, Functor vs std::function

Transport: networking & concurrency

Userspace Networking

  • Solarflare
  • OpenOnload
  • TCPDirect
  • EF_VI

Principle #6: “True efficiency is found not in the layers of complexity we add, but in the unnecessary layers we remove”

  • Shared Memory
  • Concurrent Queues

Principle #7: “Choose the right tool for the right task”

FastQueue - Design

struct FastQueue
{
    // Both counters are written by the producer, read by the consumer(s)
    // Before/after a Write operation, both counters contain the same value

    // [WriteCounter, ReadCounter] defines the area of reading
    alignas(CACHE_LINE_SIZE) std::atomic<uint64_t> mReadCounter{0};
    
    // [ReadCounter, WriteCounter] defines the area of writing
    alignas(CACHE_LINE_SIZE) std::atomic<uint64_t> mWriteCounter{0};
    
    alignas(CACHE_LINE_SIZE) uint8_t mBuffer[0];
};

// simplified code!
void QProducer::Write(std::span<std::byte> buffer)
{
    const int32_t size = static_cast<int32_t>(buffer.size());
    const int32_t payloadSize = sizeof(int32_t) + size;
    mLocalCounter += payloadSize;
    
    mQ->mWriteCounter.store(mLocalCounter, std::memory_order_release);
    
    std::memcpy(mNextElement, &size, sizeof(int32_t));
    std::memcpy(mNextElement + sizeof(int32_t), buffer.data(), size);
    
    mQ->mReadCounter.store(mLocalCounter, std::memory_order_release);
    
    mNextElement += payloadSize;
}

int32_t QConsumer::TryRead(std::span<std::byte> buffer)
{
    if (mLocalCounter == mQ->mWriteCounter.load(std::memory_order_acquire))
        return 0;

    int32_t size;
    std::memcpy(&size, mNextElement, sizeof(int32_t));

    int32_t writeCounter = mQ->mWriteCounter.load(std::memory_order_acquire);
    EXPECT(writeCounter - mLocalCounter <= QUEUE_SIZE, "queue overflow");
    EXPECT(size <= buffer.size(), "Buffer space isn't large enough");

    std::memcpy(buffer.data(), mNextElement + sizeof(size), size);
    
    const int32_t payloadSize = sizeof(size) + size;
    mLocalCounter += payloadSize;
    mNextElement += payloadSize;

    writeCounter = mQ->mWriteCounter.load(std::memory_order_acquire);
    EXPECT(writeCounter - mLocalCounter <= QUEUE_SIZE, "queue overflow");

    return size;
}

Optimization #1: Caching the Write Counter

void QProducer::Write(std::span<std::byte> buffer)
{
    const int32_t size = static_cast<int32_t>(buffer.size());
    const int32_t payloadSize = sizeof(int32_t) + size;
    mLocalCounter += payloadSize;
    // we "reserve" more space (X% of the total queue)
    // to avoid touching this cache line on every message written
    if (mCachedWriteCounter < mLocalCounter) {
        mCachedWriteCounter = Align<Q_WRITE_COUNTER_BLOCK_BYTES>(mLocalCounter);
        mQ->mWriteCounter.store(mCachedWriteCounter, std::memory_order_release);
    }
    
    std::memcpy(mNextElement, &size, sizeof(int32_t));
    std::memcpy(mNextElement + sizeof(int32_t), buffer.data(), size);
    
    mQ->mReadCounter.store(mLocalCounter, std::memory_order_release);
    
    mNextElement += payloadSize;
}

Optimization #2: Data Alignment

Optimization #3: Caching the Read Counter

Going further - Zero copy

Measurements in low-latency trading systems

void Executor::PollMarketData()
{
    while (poll([&](const auto& msg) {
        if (IsInteresting(msg))
        {
            SendOrder();
        }
    }))
    {
    }
}

Clang Xray Instrumentation

Principle #8: “Being fast is good - staying fast is better”

Principle #9: “Thinking about the system as a whole”

Principle #10: “The performance of your code depends on your colleagues’ code as much as yours”