Building a Sub-Millisecond Order Matching Engine in Rust
rust · low-latency · systems-design · trading-engine · data-structures


A deep dive into the data structures, algorithms, and systems design behind a low-latency order matching engine I built for a prediction market, covering sorted tree vs hash map tradeoffs, market normalization, balance reservation, and the actor-based architecture that eliminates locks on the hot path.

A prediction market lets users trade on the outcome of real-world events. Behind every trade is an order matching engine, the component that decides which buy meets which sell, at what price, and in what order. Latency here directly affects user experience and market fairness.

I recently built one of these engines in Rust as part of a prediction market platform. In this article, I will walk through the engineering decisions: the data structure choices, the matching algorithm, and the systems design that keeps it fast.

The Orderbook Data Structure

The core of any matching engine is the orderbook. It holds all outstanding buy and sell orders and is organized for two things: finding the best available price quickly, and making sure the earliest order at that price gets filled first.

The orderbook is split into three layers, each optimized for a different job.

Layer 1: Price levels. The engine needs to constantly answer the question "what is the best price available right now?" To answer it efficiently, I store all ask (sell) prices and bid (buy) prices in sorted tree maps. Because the keys are always sorted, grabbing the lowest ask or highest bid is an O(log n) tree walk. With a regular hash map, I would have to scan every single price level to find the best one, an O(n) operation on every lookup. In a prediction market where prices range from 0 to 100, the number of distinct price levels is small, so the sorted tree stays lightweight.
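The price-level layer can be sketched with Rust's standard `BTreeMap`. The type alias and helper names here are illustrative, not the engine's actual definitions:

```rust
use std::collections::BTreeMap;

// Price level (in cents, 0..=100) -> total resting quantity at that price.
// BTreeMap keeps keys sorted, so the best price is simply the first or last key.
type Levels = BTreeMap<u32, u64>;

// Best ask = lowest key: an O(log n) descent into the tree, not an O(n) scan.
fn best_ask(asks: &Levels) -> Option<u32> {
    asks.keys().next().copied()
}

// Best bid = highest key.
fn best_bid(bids: &Levels) -> Option<u32> {
    bids.keys().next_back().copied()
}

fn main() {
    let asks: Levels = BTreeMap::from([(52, 10), (48, 5), (60, 20)]);
    let bids: Levels = BTreeMap::from([(40, 7), (45, 3)]);
    assert_eq!(best_ask(&asks), Some(48)); // lowest ask wins
    assert_eq!(best_bid(&bids), Some(45)); // highest bid wins
}
```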

Layer 2: Order queues. At each price level, multiple orders can be waiting. The engine needs to know who placed their order first, because that person should get filled first. So at every price, I maintain a first-in-first-out (FIFO) queue of order IDs. New orders go to the back of the line. When a match happens, orders are taken from the front.

Layer 3: Order storage. Every live order is stored by its unique ID in a separate lookup table. When a fill happens, the engine needs to find the resting order, reduce its remaining quantity, and possibly remove it. Storing orders by ID makes each of these an O(1) hash lookup.

This three-layer separation is intentional. The sorted tree handles price discovery. The queues handle time priority. The order table handles mutation. Combining all of this into one structure would make at least one of these operations slower.
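Putting the layers together, one side of the book might look like the following minimal sketch. All type and field names are hypothetical, chosen to mirror the description above:

```rust
use std::collections::{BTreeMap, HashMap, VecDeque};

// Illustrative order record: ID, limit price in cents, unfilled quantity.
struct Order {
    id: u64,
    price: u32,
    remaining: u64,
}

struct BookSide {
    // Layer 1: sorted price levels -> aggregate quantity (price discovery).
    levels: BTreeMap<u32, u64>,
    // Layer 2: FIFO queue of order IDs per price (time priority).
    queues: HashMap<u32, VecDeque<u64>>,
    // Layer 3: order ID -> live order (O(1) mutation on fills).
    orders: HashMap<u64, Order>,
}

impl BookSide {
    fn new() -> Self {
        Self {
            levels: BTreeMap::new(),
            queues: HashMap::new(),
            orders: HashMap::new(),
        }
    }

    // Inserting touches all three layers: bump the level aggregate,
    // append to the back of the FIFO queue, and index the order by ID.
    fn insert(&mut self, order: Order) {
        *self.levels.entry(order.price).or_insert(0) += order.remaining;
        self.queues.entry(order.price).or_default().push_back(order.id);
        self.orders.insert(order.id, order);
    }
}

fn main() {
    let mut side = BookSide::new();
    side.insert(Order { id: 1, price: 50, remaining: 10 });
    side.insert(Order { id: 2, price: 50, remaining: 5 });
    assert_eq!(side.levels[&50], 15);               // aggregate at the level
    assert_eq!(side.queues[&50].front(), Some(&1)); // earliest order is first
}
```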

The Matching Algorithm

I implemented price-time priority matching, the standard algorithm used by most exchanges. When a new buy order (bid) comes in, it matches against the lowest-priced sell orders (asks) first. Within the same price level, the oldest order fills first.

The algorithm has two nested loops.

Outer loop: walk price levels from best to worst.

  1. Retrieve the best ask price from the sorted tree map
  2. If this is a limit order and the best ask exceeds the bid price, stop. No match is possible at any remaining price level
  3. If the price level's queue is empty (stale entry), clean it up and continue to the next level

Inner loop: drain the FIFO queue at the current price level.

  1. Peek at the first order in the queue
  2. Compute the fill quantity as min(taker's remaining quantity, maker's remaining quantity)
  3. Execute at the ask price (the resting order's price), giving the taker price improvement if they bid higher
  4. Decrement both orders' remaining quantities and the aggregate quantity at that price level
  5. If the maker order is fully filled, remove it from the order store and dequeue it. If the queue is now empty, remove the price level from the sorted tree entirely
  6. If the taker order is fully filled, stop

Price improvement is a natural consequence of this design. If a bid at 55 matches an ask at 50, the trade executes at 50. The taker is refunded the difference of (55 - 50) × fill_qty; they pay less than their maximum willingness to pay.

The sell-side mirror (match ask against bids) is identical, but starts from the highest bid price instead of the lowest ask.
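The two loops above can be sketched as follows. This is a simplified model that tracks only remaining quantities per resting order, not the full order records or balance updates:

```rust
use std::collections::{BTreeMap, VecDeque};

// Match an incoming bid against resting asks with price-time priority.
// Each price level holds a FIFO queue of remaining maker quantities.
// Returns (fill_qty, exec_price) pairs. Illustrative, not the engine's code.
fn match_bid(
    asks: &mut BTreeMap<u32, VecDeque<u64>>,
    bid_price: u32,
    mut bid_qty: u64,
) -> Vec<(u64, u32)> {
    let mut fills = Vec::new();
    // Outer loop: walk ask levels from best (lowest) to worst.
    while bid_qty > 0 {
        let Some((&ask_price, _)) = asks.iter().next() else { break };
        if ask_price > bid_price {
            break; // no match possible at any remaining level
        }
        let queue = asks.get_mut(&ask_price).unwrap();
        // Inner loop: drain the FIFO queue at this level, oldest first.
        while bid_qty > 0 {
            let Some(maker_qty) = queue.front_mut() else { break };
            let fill = bid_qty.min(*maker_qty);
            // Execute at the resting order's price (price improvement for taker).
            fills.push((fill, ask_price));
            bid_qty -= fill;
            *maker_qty -= fill;
            if *maker_qty == 0 {
                queue.pop_front(); // maker fully filled: dequeue it
            }
        }
        if queue.is_empty() {
            asks.remove(&ask_price); // drop the empty level from the tree
        }
    }
    fills
}

fn main() {
    let mut asks = BTreeMap::new();
    asks.insert(50, VecDeque::from([4u64, 10]));
    asks.insert(52, VecDeque::from([8u64]));
    // A bid at 55 for 6 units fills 4 then 2, both at the 50 ask.
    assert_eq!(match_bid(&mut asks, 55, 6), vec![(4, 50), (2, 50)]);
}
```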

Market Normalization: Collapsing Yes/No into One Book

Prediction markets have a structural twist that stock exchanges don't. Every outcome has both a Yes and a No market. "Buy Yes at 60" is economically identical to "Sell No at 40" since prices represent probabilities on a 0 to 100 scale. If I maintained two separate orderbooks per outcome, I would miss these cross-matching opportunities entirely.

The solution is normalization. All No-market orders are converted to their Yes-market equivalent before entering the matching engine. Three transformations happen atomically:

  1. Redirect the market ID. The order is routed to the paired Yes-market's orderbook
  2. Flip the price. new_price = 100 - original_price
  3. Invert the side. Buy becomes Sell, Sell becomes Buy

I also maintain an alias map from every No market ID to its canonical Yes market ID, so all lookups resolve to the correct book.
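The three transformations fit in a few lines. In this sketch, `yes_market_id` stands in for the alias-map lookup, and the types are illustrative:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Side { Buy, Sell }

// Normalize a No-market order into its Yes-market equivalent.
fn normalize(side: Side, price: u32, yes_market_id: u64) -> (u64, u32, Side) {
    let flipped = 100 - price;          // 2. flip the price
    let inverted = match side {         // 3. invert the side
        Side::Buy => Side::Sell,
        Side::Sell => Side::Buy,
    };
    (yes_market_id, flipped, inverted)  // 1. redirect to the paired Yes book
}

fn main() {
    // "Buy No at 40" enters the book as "Sell Yes at 60".
    assert_eq!(normalize(Side::Buy, 40, 7), (7, 60, Side::Sell));
}
```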

This gives two concrete benefits:

  • Half the orderbooks. One book per outcome instead of two. Less memory, simpler state management.
  • Concentrated liquidity. A "Buy No at 40" order becomes a "Sell Yes at 60" order after normalization, so it can match directly against a resting "Buy Yes at 60" bid instead of sitting in an isolated book that never sees it.

Responses are denormalized before being sent back to the user. The original market ID, price, and side are restored so the API surface remains clean and intuitive.

Balance Reservation: Preventing Overselling Without Locks

A common problem in trading engines is the double-spend problem. What happens when two orders simultaneously try to spend the same balance? Traditional systems solve this with database locks or pessimistic concurrency control. I avoided locking entirely by using a reservation pattern with a three-phase flow.

Phase 1: Reserve. Before any matching begins, deduct the maximum possible cost upfront. For buy orders, deduct quantity × price from the user's cash balance. For sell orders, deduct the position quantity from the user's holdings. If the user has insufficient funds or positions, reject immediately. There is no partial state to clean up.

Phase 2: Match. Run the matching algorithm. Fills may occur at better prices than the reservation assumed (price improvement).

Phase 3: Return unused. After matching completes, if the order filled at a better price, refund the difference. If the order was only partially filled and rests in the book, the remaining reservation stays locked. On cancellation, return remaining_quantity × price to the user's balance or restore positions for sell orders.

Why this works without locks: the engine is designed as a single-threaded actor (more on this below). All commands are processed sequentially through a single message channel. There is no concurrent access to user balances, ever. The reservation pattern is an application-level invariant, not a concurrency primitive.
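A minimal sketch of the buy-side reserve-and-refund flow, assuming integer cents and illustrative names throughout:

```rust
// Hypothetical account type: cash balance in cents.
struct Account { cash: u64 }

// Phase 1: deduct the maximum possible cost before matching begins.
fn reserve_buy(acct: &mut Account, qty: u64, limit_price: u64) -> Result<(), &'static str> {
    let max_cost = qty * limit_price;
    if acct.cash < max_cost {
        return Err("insufficient funds"); // reject up front, nothing to unwind
    }
    acct.cash -= max_cost;
    Ok(())
}

// Phase 3: refund price improvement on the filled portion. The reservation
// backing any unfilled remainder stays locked while the order rests.
fn refund_improvement(acct: &mut Account, filled_qty: u64, limit_price: u64, exec_price: u64) {
    acct.cash += filled_qty * (limit_price - exec_price);
}

fn main() {
    let mut acct = Account { cash: 1_000 };
    reserve_buy(&mut acct, 10, 55).unwrap();   // reserve 10 * 55 = 550
    assert_eq!(acct.cash, 450);
    refund_improvement(&mut acct, 10, 55, 50); // all 10 filled at 50
    assert_eq!(acct.cash, 500);                // (55 - 50) * 10 returned
}
```

Because the actor processes one command at a time, these two functions never race: the check in `reserve_buy` and the deduction that follows are effectively one atomic step.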

The Actor Architecture

All the pieces above (orderbooks, users, balances, positions) live inside a single long-running task, structured as an actor.

┌──────────────────────────────────────────────┐
│  Actor Task (single-threaded)                │
│                                              │
│  ┌──────────┐ ┌──────────┐ ┌─────────────┐   │
│  │Orderbooks│ │  Users   │ │ Alias Map   │   │
│  │(per mkt) │ │(balances)│ │(No→Yes IDs) │   │
│  └──────────┘ └──────────┘ └─────────────┘   │
│                                              │
│     ▲  Commands processed one at a time      │
└─────┼────────────────────────────────────────┘
      │
 ┌────┴──────┐
 │  Bounded  │ ← PlaceOrder, CancelOrder,
 │  Channel  │   GetBestBid, GetOrderbook, ...
 │ (cap:1000)│
 └────┬──────┘
      │
 ┌────┴───────────────────────┐
 │  HTTP / WebSocket Handlers │
 │  (concurrent, multi-core)  │
 └────────────────────────────┘

The bounded message channel (capacity 1,000) is the only entry point to the engine. Every command (place, cancel, modify, query) goes through this channel. The actor processes them one at a time. This design has specific consequences.

Zero locks on the hot path. The orderbook state is owned exclusively by the actor task. No mutexes, no reader-writer locks, no atomic operations on the matching path. The only shared lock in the system is on market metadata, which is read-heavy and rarely changes.

Backpressure is built in. The bounded channel means if the engine falls behind, senders block naturally. This prevents unbounded memory growth and gives the system a flow control mechanism without explicit rate limiting.

Persistence never blocks matching. After a trade executes, events are published to a Redis Stream asynchronously as fire-and-forget. A separate database worker service consumes these events and writes to PostgreSQL. The matching engine never waits for a disk write.

Clean request/response model. Each command carries a one-shot response channel. The caller sends a command and awaits the response asynchronously. This gives external callers a clean async API while the actor processes commands sequentially internally.
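The same shape can be sketched with standard-library threads and channels. The real engine uses async tasks and one-shot reply channels, but the ownership structure is identical: one thread owns all state, and every caller goes through the bounded channel. Command variants here are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// Each command carries its own reply channel, giving callers a
// request/response API while the actor stays strictly sequential.
enum Command {
    PlaceOrder { qty: u64, reply: mpsc::Sender<u64> },
    GetBestBid { reply: mpsc::Sender<Option<u32>> },
}

fn spawn_actor() -> mpsc::SyncSender<Command> {
    // Bounded channel: if the actor falls behind, senders block (backpressure).
    let (tx, rx) = mpsc::sync_channel::<Command>(1000);
    thread::spawn(move || {
        // State owned exclusively by this thread: no mutexes, no atomics.
        let mut next_id = 0u64;
        let best_bid = Some(45u32); // placeholder for real book state
        for cmd in rx {
            match cmd {
                Command::PlaceOrder { qty: _, reply } => {
                    next_id += 1;
                    let _ = reply.send(next_id);
                }
                Command::GetBestBid { reply } => {
                    let _ = reply.send(best_bid);
                }
            }
        }
    });
    tx
}

fn main() {
    let engine = spawn_actor();
    let (reply_tx, reply_rx) = mpsc::channel();
    engine.send(Command::PlaceOrder { qty: 10, reply: reply_tx }).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), 1); // first order gets ID 1
    let (reply_tx, reply_rx) = mpsc::channel();
    engine.send(Command::GetBestBid { reply: reply_tx }).unwrap();
    assert_eq!(reply_rx.recv().unwrap(), Some(45));
}
```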

The deliberate tradeoff is that this is a single-threaded bottleneck. One actor, one core. But for a prediction market with bounded price ranges (0 to 100) and moderate order volumes, sequential throughput is more than sufficient. The simplicity of reasoning about state with zero concurrency bugs is a massive engineering advantage over lock-based concurrent designs.

Benchmarking Methodology

Claims about latency are meaningless without measurement. I wrote a benchmark suite with six scenarios. Each uses 5 seconds of warmup, 15 seconds of measurement, and 100 samples.

| Benchmark | What It Measures | Setup |
|---|---|---|
| Matching-heavy | Worst-case fill throughput | Pre-populate book, then flood with matching orders at 1K / 10K / 50K volumes |
| Orderbook building | Insertion without matching | Non-crossing limit orders at varied prices |
| Concurrent users | Channel contention under load | 100 / 500 / 1,000 concurrent tasks, each submitting 100 orders (50% matching) |
| Single order latency | p50, p95, p99 latency | 1,000 samples; one matching order and one non-matching order |
| Cancellation | Cancel round-trip cost | Place then cancel cycle, including queue removal and balance return |
| Query operations | Read path performance | best_bid, best_ask, full_orderbook, user_open_orders after populating 2,000 orders |

Why these scenarios matter:

The matching-heavy benchmark is the worst case. Every order triggers a fill, which means sorted tree removals, hash map mutations, balance updates, and event publishing all happen per order.

The concurrent users benchmark stress-tests the message channel under contention and measures how the actor handles many senders funneling through a single consumer.

The single order latency benchmark provides the percentile numbers (p50/p95/p99) that matter most for real-world user experience.

The query benchmarks include the channel round-trip overhead, not just data structure access, because that is what users actually experience.

Each benchmark creates a fresh runtime and a fresh actor per iteration. Users are pre-funded with large balances to avoid rejection noise in measurements.
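The percentile reporting behind the latency benchmark is simple to sketch. Here a trivial closure stands in for the real place-order round trip, and the function name is illustrative:

```rust
use std::time::Instant;

// Sort the recorded latencies and pick nearest-rank percentiles.
fn percentiles(mut samples: Vec<u128>) -> (u128, u128, u128) {
    samples.sort_unstable();
    let pick = |p: f64| samples[((samples.len() - 1) as f64 * p) as usize];
    (pick(0.50), pick(0.95), pick(0.99))
}

fn main() {
    let mut samples = Vec::with_capacity(1_000);
    for _ in 0..1_000 {
        let start = Instant::now();
        // Stand-in for one order round trip through the engine's channel.
        std::hint::black_box(2u64 + 2);
        samples.push(start.elapsed().as_nanos());
    }
    let (p50, p95, p99) = percentiles(samples);
    assert!(p50 <= p95 && p95 <= p99);
}
```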

Closing Thoughts

The engine's design reduces to a handful of principles: use sorted maps where you need sorted access, use hash maps where you need O(1) lookups, normalize away redundant state, reserve before matching, and funnel everything through a single owner to avoid locks. None of these ideas are novel individually. The value is in how they compose into a system where the matching hot path touches zero locks, zero disk I/O, and zero network calls.
