These are my personal notes on the talk “When Nanoseconds Matter: Ultrafast Trading Systems in C++” from CppCon.
Order Book
Properties
- Two ordered sequences:
  - Bids: from highest to lowest price
  - Asks: from lowest to highest price
- Price level:
  - Price
  - Size: sum of all orders’ sizes at this price level
- Each order has an ID (uint64_t) which is unique throughout the trading session
- A typical stock order book has ~1000 price levels “per side”
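The snippets below use `Price` and `Volume` without defining them; here is a minimal, hypothetical set of aliases (integer ticks, signed volume), not from the talk, just to make the examples self-contained:

```cpp
#include <cstdint>

// Hypothetical aliases, not from the talk: prices as integer ticks,
// signed volume so that "volume <= 0" can signal an empty level.
using Price   = int64_t;
using Volume  = int64_t;
using OrderId = uint64_t; // unique throughout the trading session
```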
```cpp
std::map<Price, Volume, std::greater<Price>> mBidLevels; // best bid first
std::map<Price, Volume, std::less<Price>> mAskLevels;    // best ask first

template <class T>
typename T::iterator AddOrder(T& levels, Price price, Volume volume)
{
    // try_emplace either creates the level or returns the existing one
    auto [it, inserted] = levels.try_emplace(price, volume);
    if (!inserted)
        it->second += volume;
    return it;
}

template <class T>
void DeleteOrder(typename T::iterator it, T& levels, Price price, Volume volume)
{
    it->second -= volume;
    if (it->second <= 0)
        levels.erase(it); // the level is empty: remove it
}
```
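A quick usage sketch with made-up values, to show the flow:

```cpp
// Hypothetical usage of the map-based book
std::map<Price, Volume, std::greater<Price>> bids;
auto it = AddOrder(bids, Price{101}, Volume{50}); // creates the 101 level
AddOrder(bids, Price{101}, Volume{25});           // level now holds 75
DeleteOrder(it, bids, Price{101}, Volume{75});    // volume hits 0: erased
```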
Principle #1: “Most of the time, you don’t want node containers”
Using std::vector
Use two vectors (bids and asks) and locate price levels with std::lower_bound (binary search).
```cpp
std::vector<std::pair<Price, Volume>> mBidLevels;
std::vector<std::pair<Price, Volume>> mAskLevels;
```
- AddOrder:
  - log(N) if the price level exists
  - log(N) + N if a new price level is inserted (elements must be shifted)
- ModifyOrder:
  - log(N)
  - We can’t store iterators/pointers, as they are invalidated by std::vector::insert
- DeleteOrder:
  - log(N) if the price level keeps a positive volume
  - log(N) + N if the level becomes empty and is erased (elements must be shifted)
```cpp
void AddOrder(Side side, Price price, Volume volume)
{
    if (side == Side::Bid)
        AddOrder(mBidLevels, price, volume, std::greater<Price>());
    else
        AddOrder(mAskLevels, price, volume, std::less<Price>());
}

template <class T, class Compare>
void AddOrder(T& levels, Price price, Volume volume, Compare comp)
{
    // Binary search for the first level that does not compare before `price`
    auto it = std::lower_bound(levels.begin(), levels.end(), price,
        [comp](const auto& level, Price p) { return comp(level.first, p); });
    if (it != levels.end() && it->first == price)
        it->second += volume;               // existing level: accumulate
    else
        levels.insert(it, {price, volume}); // new level: O(N) shift
}
```
“Reverse” vector: store the levels so that the best price sits at the end of the vector. Most updates happen near the best price, so inserting and erasing there moves only a few elements.
```cpp
// The levels are stored in reverse order, so the best prices are at the
// back; this minimizes the number of elements moved on insert/erase
std::pair<Price, Price> GetBestPrices() const
{
    return {mBidLevels.rbegin()->first, mAskLevels.rbegin()->first};
}
```
Principle #2: “Understanding your problem (by looking at data!)”
Principle #3: “Hand-tailored (specialized) algorithms are key to achieving performance”
Running perf on a benchmark
Start perf only once initialization has completed, so that the benchmark itself is measured without the initialization cost.
```cpp
#include <iostream>
#include <stdexcept>
#include <string>
#include <unistd.h>

void RunPerf()
{
    pid_t pid = fork();
    if (pid == 0)
    {
        // Child process: attach perf to the parent (the benchmark process)
        const auto parentPid = std::to_string(getppid());
        std::cout << "Running perf on parent process " << parentPid << std::endl;
        // Replace the child process image with perf, e.g. "perf record -p <pid>"
        // (the exact perf arguments were elided on the slide)
        execlp("perf", "perf", "record", "-p", parentPid.c_str(), (char*)nullptr);
        throw std::runtime_error("execlp failed");
    }
}

void InitAndRunBenchmark()
{
    InitBenchmark(); // might take a long time!
    RunPerf();
    RunBenchmark();
}
```
The first perf measurements should never be too specific: start broad (e.g. perf stat) and only then drill down into specific events.
Branchless binary search
On a branch misprediction, the pipeline is flushed and the CPU must fetch instructions from the correct address again, wasting roughly 10-20 clock cycles.
Branchless logic avoids this penalty by letting the compiler emit conditional moves (CMOV) instead of conditional jumps.
```cpp
// Requires random-access iterators (iterator subtraction and indexing)
template <class RandomIt, class T, class Compare>
RandomIt branchless_lower_bound(RandomIt first, RandomIt last,
                                const T& value, Compare comp)
{
    auto length = last - first;
    while (length > 0)
    {
        auto half = length / 2;
        // The comparison result (0 or 1) selects the advance without a branch;
        // the multiplication (by 1) is needed for GCC to generate CMOV
        first += comp(first[half], value) * (length - half);
        length = half;
    }
    return first;
}
```
For small ranges, linear search can be even faster, because of better cache locality and trivially predictable branches.
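A minimal sketch of such a linear scan (not from the talk), matching the lower_bound contract:

```cpp
// Sketch: a sequential scan is prefetch-friendly, so for a handful of
// price levels it can beat binary search despite its O(N) complexity
template <class It, class T, class Compare>
It linear_lower_bound(It first, It last, const T& value, Compare comp)
{
    while (first != last && comp(*first, value))
        ++first;
    return first;
}
```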
Principle #4: “Simplicity is the ultimate sophistication”
Principle #5: “Mechanical sympathy”
- I-cache misses: likely/unlikely attributes
- I-cache misses: Immediately Invoked Function Expressions (IIFE), sketched below
- Lambda/functor vs. std::function
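A small hypothetical sketch combining the first two ideas: the C++20 [[unlikely]] attribute hints the compiler to lay cold code away from the hot path, and an IIFE confines the cold logic to one const-initializing expression. All names here are made up for illustration:

```cpp
#include <cstdint>

struct Message
{
    int64_t price;
    int64_t quantity;
    bool    isAuction;
};

int64_t ComputeAuctionFee(const Message&); // hypothetical cold-path helper

int64_t ProcessMessage(const Message& msg)
{
    // IIFE: the lambda is invoked immediately, so `fee` can be const and
    // the cold branch is confined to one small, outlinable expression
    const int64_t fee = [&]() -> int64_t {
        if (msg.isAuction) [[unlikely]] // rare path: keep it off the hot path
            return ComputeAuctionFee(msg);
        return 0;
    }();
    return msg.quantity * msg.price + fee;
}
```

As for the third bullet: on the hot path a plain lambda or functor is preferable to std::function, which type-erases its target and may allocate.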
Transport: networking & concurrency
Userspace Networking
- Solarflare NICs, with three kernel-bypass layers:
  - OpenOnload (transparent, BSD-sockets compatible)
  - TCPDirect (lower latency, simplified sockets-like API)
  - EF_VI (lowest level, raw frame access)
Principle #6: “True efficiency is found not in the layers of complexity we add, but in the unnecessary layers we remove”
- Shared Memory
- Concurrent Queues
Principle #7: “Choose the right tool for the right task”
FastQueue - Design
```cpp
struct FastQueue
{
    // Both counters are written by the producer, read by the consumer(s).
    // Before/after a Write operation, both counters contain the same value.
    // [WriteCounter, ReadCounter] defines the area of reading
    alignas(CACHE_LINE_SIZE) std::atomic<uint64_t> mReadCounter{0};
    // [ReadCounter, WriteCounter] defines the area of writing
    alignas(CACHE_LINE_SIZE) std::atomic<uint64_t> mWriteCounter{0};
    alignas(CACHE_LINE_SIZE) uint8_t mBuffer[0]; // flexible array member (non-standard)
};

// simplified code!
void QProducer::Write(std::span<std::byte> buffer)
{
    const int32_t size = buffer.size();
    const int32_t payloadSize = sizeof(int32_t) + size;
    mLocalCounter += payloadSize;
    // Publish the write counter first: [ReadCounter, WriteCounter] now
    // marks the in-flight area that consumers must not trust
    mQ->mWriteCounter.store(mLocalCounter, std::memory_order_release);
    std::memcpy(mNextElement, &size, sizeof(int32_t));
    std::memcpy(mNextElement + sizeof(int32_t), buffer.data(), buffer.size());
    // Publishing the read counter makes the message visible to consumers
    mQ->mReadCounter.store(mLocalCounter, std::memory_order_release);
    mNextElement += payloadSize;
}

int32_t QConsumer::TryRead(std::span<std::byte> buffer)
{
    // Nothing to read until the producer has published a complete message
    if (mLocalCounter == mQ->mReadCounter.load(std::memory_order_acquire))
        return 0;
    int32_t size;
    std::memcpy(&size, mNextElement, sizeof(int32_t));
    uint64_t writeCounter = mQ->mWriteCounter.load(std::memory_order_acquire);
    EXPECT(writeCounter - mLocalCounter <= QUEUE_SIZE, "queue overflow");
    EXPECT(size <= buffer.size(), "Buffer space isn't large enough");
    std::memcpy(buffer.data(), mNextElement + sizeof(size), size);
    const int32_t payloadSize = sizeof(size) + size;
    mLocalCounter += payloadSize;
    mNextElement += payloadSize;
    // Re-check after copying: if the producer lapped us in the meantime,
    // the bytes we just copied may have been overwritten
    writeCounter = mQ->mWriteCounter.load(std::memory_order_acquire);
    EXPECT(writeCounter - mLocalCounter <= QUEUE_SIZE, "queue overflow");
    return size;
}
```
Optimization #1: Caching the Write Counter
```cpp
void QProducer::Write(std::span<std::byte> buffer)
{
    const int32_t size = buffer.size();
    const int32_t payloadSize = sizeof(int32_t) + size;
    mLocalCounter += payloadSize;
    // We "reserve" more space (X% of the total queue) ahead of time,
    // to avoid touching this cache line on every message written
    if (mCachedWriteCounter < mLocalCounter) {
        mCachedWriteCounter = Align<Q_WRITE_COUNTER_BLOCK_BYTES>(mLocalCounter);
        mQ->mWriteCounter.store(mCachedWriteCounter, std::memory_order_release);
    }
    std::memcpy(mNextElement, &size, sizeof(int32_t));
    std::memcpy(mNextElement + sizeof(int32_t), buffer.data(), buffer.size());
    mQ->mReadCounter.store(mLocalCounter, std::memory_order_release);
    mNextElement += payloadSize;
}
```
Optimization #2: Data Alignment
Optimization #3: Caching the Read Counter
Going further - Zero copy
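The talk only names this direction; here is a hypothetical sketch of what a zero-copy read could look like under the same design: instead of memcpy-ing the payload into a caller buffer, hand out a span pointing directly into the queue's buffer. The caller must consume (or re-validate) the view before the producer laps the queue.

```cpp
// Hypothetical zero-copy variant of TryRead (not from the talk):
// returns a view into the queue's shared buffer instead of copying
std::span<const std::byte> QConsumer::TryReadView()
{
    if (mLocalCounter == mQ->mReadCounter.load(std::memory_order_acquire))
        return {}; // nothing new to read
    int32_t size;
    std::memcpy(&size, mNextElement, sizeof(int32_t));
    std::span<const std::byte> view{
        reinterpret_cast<const std::byte*>(mNextElement + sizeof(int32_t)),
        static_cast<size_t>(size)};
    const int32_t payloadSize = sizeof(int32_t) + size;
    mLocalCounter += payloadSize;
    mNextElement += payloadSize;
    // The overrun check must now happen *after* the caller has consumed
    // the view, since the producer may overwrite those bytes at any time
    return view;
}
```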
Measurements in low-latency trading systems
```cpp
void Executor::PollMarketData()
{
    // Drain the market-data queue, handling one message per callback
    while (poll([&](const auto& msg)
    {
        if (IsInteresting(msg))
        {
            SendOrder();
        }
    }))
    {
    }
}
```
Clang XRay instrumentation
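A minimal way to try XRay; the flags, attribute, and environment variable below are real Clang/LLVM features, but the function name is made up:

```cpp
// Build:  clang++ -fxray-instrument -fxray-instruction-threshold=1 app.cpp
// Run:    XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic" ./a.out
// XRay patches nop sleds at function entry/exit, so instrumentation can be
// toggled at runtime with near-zero cost when disabled; logs are analyzed
// with the llvm-xray tool.
[[clang::xray_always_instrument]]
void PollMarketData(); // hypothetical hot function to always instrument
```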
Principle #8: “Being fast is good - staying fast is better”
Principle #9: “Thinking about the system as a whole”
Principle #10: “The performance of your code depends on your colleagues’ code as much as yours”