Building a Sub-Millisecond Order Matching Engine in Rust
rust, low-latency, systems-design, trading-engine, data-structures

A deep dive into the data structures, algorithms, and systems design behind a low-latency order matching engine I built for a prediction market, covering BTreeMap vs HashMap tradeoffs, market normalization, balance reservation, and benchmarking methodology.

A prediction market lets users trade on the outcome of real-world events. Behind every trade is an order matching engine, the component that decides which buy meets which sell, at what price, and in what order. Latency here directly affects user experience and market fairness.

I recently built one of these engines in Rust as part of a prediction market platform. In this article, I will walk through the actual implementation, covering the data structure choices, the matching algorithm, and the systems design that keeps it fast. No theory, just the code and the reasoning behind each decision.

The Orderbook Data Structure

The core of any matching engine is the orderbook. Here's the struct I ended up with:

pub struct OrderbookData {
    pub market_id: u64,
    pub asks: BTreeMap<u64, u64>,
    pub bids: BTreeMap<u64, u64>,
    pub ask_queue: HashMap<u64, Vec<u64>>,
    pub bid_queue: HashMap<u64, Vec<u64>>,
    pub orders: HashMap<u64, Order>,
    pub last_price: Option<u64>,
}

I used three different map types because they serve three different access patterns:

BTreeMap<u64, u64> for price levels. asks and bids map a price to its aggregate quantity. BTreeMap keeps keys sorted, which means getting the best ask (lowest price) is book.asks.first_key_value() and the best bid (highest price) is book.bids.last_key_value(), both O(log n). A HashMap would require scanning every key to find the min or max. Insertion and deletion are also O(log n), which is acceptable since the number of distinct price levels in a prediction market (prices range 0 to 100) is bounded.

HashMap<u64, Vec<u64>> for order queues. ask_queue and bid_queue map a price to the list of order IDs resting at that level, in FIFO order. New orders append via Vec::push. During matching, I peek at the front via order_ids.first() and pop it with remove(0) once the maker is fully filled. This gives O(1) price-level lookup and preserves time priority within each level.

HashMap<u64, Order> for order storage. Every live order is stored by its ID. During a fill, I need to look up the maker order, decrement its remaining quantity, and possibly remove it. O(1) lookup by ID makes this fast.

The separation matters. The BTreeMap handles price discovery (what's the best price?). The HashMap queues handle time priority (who was first at that price?). The order HashMap handles mutation (update or remove a specific order). Merging these concerns into a single structure would compromise at least one of these access patterns.

The Matching Algorithm

I implemented price-time priority matching, the standard for most exchanges. When a new bid comes in, it matches against the lowest-priced asks first. Within the same price level, the oldest order fills first.

Here's the core of match_bid_against_asks:

async fn match_bid_against_asks(
    order: &mut Order,
    book: &mut OrderbookData,
    users: &mut HashMap<u64, User>,
) -> Result<(), String> {
    while order.remaining_qty > 0 {
        let Some((&ask_price, _)) = book.asks.first_key_value() else {
            break;
        };

        if matches!(order.order_type, OrderType::Limit)
            && order.price < ask_price
        {
            break;
        }

        let Some(order_ids) = book.ask_queue.get_mut(&ask_price) else {
            book.asks.remove(&ask_price);
            continue;
        };

        while let Some(&maker_order_id) = order_ids.first() {
            let Some(maker_order) = book.orders.get_mut(&maker_order_id) else {
                order_ids.remove(0);
                continue;
            };

            let fill_qty = order.remaining_qty.min(maker_order.remaining_qty);
            let fill_price = ask_price;

            order.remaining_qty -= fill_qty;
            maker_order.remaining_qty -= fill_qty;
            *book.asks.get_mut(&ask_price).unwrap() -= fill_qty;

            // ... balance updates, event publishing ...

            if maker_order.remaining_qty == 0 {
                book.orders.remove(&maker_order_id);
                order_ids.remove(0);
            } else {
                // Maker only partially filled, so the taker is exhausted.
                break;
            }
            if order.remaining_qty == 0 {
                break;
            }
        }

        // Drop the level once its queue is empty. Doing this after the
        // loop (rather than inside it) also covers the stale-order path
        // above, which could otherwise leave an empty queue behind and
        // spin the outer loop forever on the same price.
        if book.ask_queue.get(&ask_price).is_some_and(|q| q.is_empty()) {
            book.ask_queue.remove(&ask_price);
            book.asks.remove(&ask_price);
        }
    }
    Ok(())
}

The outer while loop walks the price levels from best to worst. The inner while loop drains the FIFO queue at each price level. Here is the flow:

  1. Grab the best ask price via first_key_value() on the BTreeMap
  2. If this is a limit order and the best ask is above the bid price, stop because no match is possible
  3. Walk the FIFO queue at that price level, filling against each maker order
  4. fill_qty = min(taker.remaining_qty, maker.remaining_qty) handles partial fills
  5. Fully filled maker orders are removed from orders, the queue, and if the queue empties, from the BTreeMap price level itself

The ask-side mirror (match_ask_against_bids) follows the same logic but uses book.bids.last_key_value() to start from the highest bid.

Price improvement is built in. If a bid at 55 matches an ask at 50, the fill executes at 50. The taker gets a refund of (55 - 50) * fill_qty, the difference between what they were willing to pay and what they actually paid.

Market Normalization: Collapsing Yes/No into One Book

Prediction markets have a twist that stock exchanges don't. Every outcome has both a Yes and a No market. "Buy Yes at 60" is economically identical to "Sell No at 40" (since prices are on a 0 to 100 scale representing probability). If I maintained two separate orderbooks per outcome naively, I would miss the cross-matching opportunity.

The solution I went with is normalization. All No-market orders are converted to their Yes-market equivalent before entering the engine:

pub fn normalize_order(order: &mut Order, market_store: &MarketStore) -> Result<u64, String> {
    let Some(market) = market_store.get_market(order.market_id) else {
        return Err("Market not found".into());
    };

    if let Some(side) = &market.side {
        match side {
            MarketSide::No => {
                if let Some(paired_id) = market.paired_market_id {
                    order.market_id = paired_id;
                    order.price = 100 - order.price;
                    order.side = match order.side {
                        OrderSide::Bid => OrderSide::Ask,
                        OrderSide::Ask => OrderSide::Bid,
                    };
                }
            }
            MarketSide::Yes => {}
        }
    }

    Ok(order.market_id)
}

Three mutations happen atomically: the market ID is redirected to the paired Yes market, the price is flipped (100 - price), and the side is inverted (Bid becomes Ask, Ask becomes Bid). I also maintain an alias_map: HashMap<u64, u64> that maps every No market ID to its canonical Yes market ID, so lookups always resolve to the right orderbook.

This gives two concrete benefits:

  • Half the orderbooks. One book per outcome instead of two. Less memory, simpler state management.
  • Concentrated liquidity. A "Buy No at 40" order now sits in the same book as a "Sell Yes at 60" order. They can match directly instead of existing in isolated books.

Responses are denormalized before being sent back to the user, so the original market ID, price, and side are restored and the API remains clean.

Balance Reservation: Preventing Overselling Without Locks

A common problem in trading engines is what happens when two orders try to spend the same balance at once. Traditional systems reach for locks. I avoided them entirely: a reservation pattern handles the overselling problem, and (as explained below) the actor design makes locks unnecessary in the first place.

Before any matching happens, I reserve the maximum possible cost:

pub async fn reserve_balance(order: &Order, users: &mut HashMap<u64, User>) -> Result<(), String> {
    match order.side {
        OrderSide::Bid => {
            let total_cost = (order.original_qty as i64) * (order.price as i64);
            update_balance(users, order.user_id, -total_cost)
        }
        OrderSide::Ask => {
            if !check_position_sufficient(users, order.user_id, order.market_id, order.original_qty) {
                return Err("Insufficient positions quantity".into());
            }
            update_position(users, order.user_id, order.market_id, -(order.original_qty as i64))
        }
    }
}

The three-phase flow:

  1. Reserve. For bids, deduct qty * price from the user's balance upfront. For asks, deduct the position quantity. If insufficient, reject the order immediately with no partial state to clean up.
  2. Match. Run the matching algorithm. Fills may happen at better prices than the reservation assumed.
  3. Return unused. If the order filled at a better price (price improvement), the difference is refunded. If the order is fully filled and won't rest in the book, any excess reservation is returned.

On cancellation, return_reserved_balance gives back remaining_qty * price for bids or restores positions for asks.

Why does this work without locks? Because I designed the engine as a single-threaded actor. All commands are processed sequentially through one mpsc channel. There is no concurrent access to user balances, ever. The reservation pattern is an application-level invariant, not a concurrency primitive.

The Actor That Ties It Together

All the pieces above (orderbooks, users, balances, positions) live inside a single Tokio task:

pub fn spawn_orderbook_actor(market_store: MarketStore) -> Orderbook {
    let (tx, mut rx) = mpsc::channel::<Command>(1000);

    tokio::spawn(async move {
        let mut orderbooks: HashMap<u64, OrderbookData> = HashMap::new();
        let mut users: HashMap<u64, User> = HashMap::new();
        let mut alias_map: HashMap<u64, u64> = HashMap::new();

        while let Some(cmd) = rx.recv().await {
            match cmd {
                Command::PlaceOrder(mut order, reply) => { /* ... */ }
                Command::CancelOrder(market_id, order_id, reply) => { /* ... */ }
                Command::GetBestBid(market_id, reply) => { /* ... */ }
                // ... more commands
            }
        }
    });

    Orderbook::new(tx)
}

The mpsc::channel with a capacity of 1000 is the only entry point. Every command (place, cancel, modify, query) goes through this channel. The actor processes them one at a time. This design has specific consequences:

Zero locks on the hot path. The orderbook state is owned exclusively by the actor task. No Mutex, no RwLock, no atomic operations on the matching path. The only Arc<RwLock> in the system is on the MarketStore, which is read-heavy metadata that rarely changes.

Backpressure is built in. The bounded channel (capacity 1000) means that if the engine falls behind, send futures wait until space frees up. This prevents unbounded memory growth and gives the system a natural flow-control mechanism.

Persistence never blocks matching. After a trade executes, I publish events to a Redis Stream via tokio::spawn as fire-and-forget. A separate db_worker service consumes these events and writes to PostgreSQL asynchronously. The matching engine never waits for a database write.

Request/response via oneshot channels. Each command carries a tokio::sync::oneshot::Sender for its response. The caller awaits the oneshot receiver. This gives callers a clean async API while the actor processes commands sequentially internally.

The tradeoff is clear: this is a single-threaded bottleneck. One actor, one core. But for a prediction market with bounded price ranges (0 to 100) and moderate order volumes, the sequential throughput is more than sufficient, and the simplicity of reasoning about state is a significant engineering advantage over lock-based concurrent designs.

Benchmarking Methodology

Claims about latency mean nothing without measurement. I wrote a Criterion benchmark suite with six scenarios:

criterion_group!(
    name = benches;
    config = Criterion::default()
        .warm_up_time(Duration::from_secs(5))
        .measurement_time(Duration::from_secs(15))
        .sample_size(100);
    targets =
        bench_matching_heavy,
        bench_orderbook_building,
        bench_concurrent_users_max_capacity,
        bench_single_order_latency,
        bench_order_cancellation,
        bench_orderbook_queries
);

Matching-heavy workload. Pre-populates the book with ask orders at price 50, then floods matching bid orders at the same price. Tested at 1K, 10K, and 50K order volumes. This measures the worst case where every order triggers a fill, which means BTreeMap removals, HashMap mutations, balance updates, and event publishing all happen per order.

Orderbook building. Submits non-matching limit orders at varied prices (bids 40 to 50, asks 51 to 61). This isolates insertion performance without matching overhead. It tests the BTreeMap insertion and HashMap queue growth paths.

Concurrent users. Spawns 100/500/1000 Tokio tasks, each submitting 100 orders (50% matching, 50% non-matching). This stress-tests the mpsc channel under contention and measures how the actor handles concurrent senders funneling through a single consumer.

Single order latency. Runs with a sample size of 1000 for percentile accuracy. Measures both a matching order (place maker, then measure taker) and a non-matching order (just insertion). Criterion reports p50, p95, and p99 from these samples.

Cancellation. Places an order, then measures the cancel round-trip. Tests the remove_order_from_book path, which includes queue removal via Vec::retain, the BTreeMap quantity update, and the balance return.

Query operations. After populating with 2000 orders (1000 bids, 1000 asks), benchmarks best_bid, best_ask, get_full_orderbook, and get_user_open_orders. These go through the actor channel, so the measurement includes channel round-trip overhead, not just the data structure access.

Each benchmark creates a fresh Tokio runtime and a fresh orderbook actor per iteration. Users are pre-funded with 100M units of balance and 50M units of positions to avoid rejection noise in the measurements.

Closing Thoughts

The engine's design can be reduced to a few principles: use sorted maps where you need sorted access, use hash maps where you need O(1) lookups, normalize away redundant state, reserve before matching, and funnel everything through a single owner to avoid locks. None of these ideas are novel individually. The value is in how they compose into a system where the matching hot path touches zero locks, zero disk I/O, and zero network calls.
