Fixing DNS tail latency with a 5-line config and a 50-line function
If you’re using reqwest for small HTTP/2 payloads, you probably have a tail latency problem you don’t know about. Hyper’s default flow control windows are 10,000× oversized for anything under 1 KB, and its dispatch channel adds periodic 40-140ms stalls that don’t show up in median benchmarks.
I hit this building Numa’s DoH forwarding path. Median was 10ms, mean was 23ms — the tail was dragging everything.
The fix was a 5-line reqwest config and a 50-line hedging function. This post is also an advertisement for Dean & Barroso’s 2013 paper “The Tail at Scale” — a decade-old idea that still demolishes dispatch spikes. The same ideas later took my cold recursive p99 from 2.3 seconds to 538ms.
The cause: hyper’s dispatch channel
Reqwest sits on top of hyper, which interposes an mpsc dispatch channel and a separate `ClientTask` between `.send()` and the h2 stream. I instrumented the forwarding path and confirmed: 100% of the spike time lives in the `send()` phase, and a parallel heartbeat task showed zero runtime lag during spikes. The tokio runtime was fine — the stall was internal to hyper’s request scheduling.
Hickory-resolver doesn’t have this issue. It holds `h2::SendRequest<Bytes>` directly and calls `ready().await; send_request()` in the caller’s task — no channel, no scheduling dependency. I used it as a reference point throughout.
Fix #1 — HTTP/2 window sizes
Reqwest inherits hyper’s HTTP/2 defaults: 2 MB stream window, 5 MB connection window. For DNS responses (~200 bytes), that’s ~10,000× oversized — unnecessary WINDOW_UPDATE frames, bloated bookkeeping on every poll, and different server-side scheduling behavior.
Setting both windows to the h2 spec default (64 KB) dropped my median from 13.3ms to 10.1ms:
```rust
reqwest::Client::builder()
    .use_rustls_tls()
    .http2_initial_stream_window_size(65_535)
    .http2_initial_connection_window_size(65_535)
    .http2_keep_alive_interval(Duration::from_secs(15))
    .http2_keep_alive_while_idle(true)
    .http2_keep_alive_timeout(Duration::from_secs(10))
    .pool_idle_timeout(Duration::from_secs(300))
    .pool_max_idle_per_host(1)
    .build()
```

Any Rust code using reqwest for tiny-payload HTTP/2 workloads — DoH, API polling, metric scraping — is probably hitting this.
Fix #2 — Request hedging
“The Tail at Scale” (Dean & Barroso, 2013): fire a request, and if it doesn’t return within your P50 latency, fire the same request in parallel. First response wins.
The intuition: if 5% of requests spike due to independent random events, two parallel requests means only 0.25% of pairs spike on both. The tail collapses.
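The arithmetic behind that claim, as a one-line sketch (the 5% figure is the illustrative rate from the paragraph above, not a measurement; the function name is mine):

```rust
/// Probability that BOTH legs of a hedged pair spike, assuming each
/// request spikes independently with probability `p`.
fn hedged_spike_probability(p: f64) -> f64 {
    p * p
}

// With a 5% per-request spike rate:
// hedged_spike_probability(0.05) = 0.05 * 0.05 = 0.0025,
// i.e. 0.25% of pairs spike on both legs: a ~20x smaller tail.
```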
The surprise: hedging against the same upstream works. HTTP/2 multiplexes streams — two `send_request()` calls on one connection become independent h2 streams. If one stalls in the dispatch channel, the other keeps making progress.
```rust
pub async fn forward_with_hedging_raw(
    wire: &[u8],
    primary: &Upstream,
    secondary: &Upstream,
    hedge_delay: Duration,
    timeout_duration: Duration,
) -> Result<Vec<u8>> {
    let primary_fut = forward_query_raw(wire, primary, timeout_duration);
    tokio::pin!(primary_fut);

    let delay = tokio::time::sleep(hedge_delay);
    tokio::pin!(delay);

    // Phase 1: wait for primary to return OR the hedge delay.
    tokio::select! {
        result = &mut primary_fut => return result,
        _ = &mut delay => {}
    }

    // Phase 2: hedge delay expired — fire secondary, keep primary alive.
    let secondary_fut = forward_query_raw(wire, secondary, timeout_duration);
    tokio::pin!(secondary_fut);

    // First successful response wins.
    tokio::select! {
        r = primary_fut => r,
        r = secondary_fut => r,
    }
}
```

The production version adds error handling — if one leg fails, it waits for the other. In production, Numa passes the same `&Upstream` twice when only one is configured. I extended hedging to all protocols — UDP (rescues packet loss on WiFi), DoT (rescues TLS handshake stalls). Configurable via `hedge_ms`; set to 0 to disable.
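One way to choose `hedge_ms` rather than hardcoding it: Dean & Barroso’s recipe is to hedge at roughly the observed p50, so about half of requests never send a second copy. A minimal sketch, assuming you keep a window of recent latency samples (the function and its bookkeeping are hypothetical, not Numa’s code):

```rust
/// Pick a hedge delay from recent latency samples: hedging at the p50
/// means roughly half of all requests never fire a second copy.
/// Returns None until there are enough samples to trust the estimate,
/// so the caller can fall back to a fixed default (e.g. 10ms).
fn hedge_delay_ms(recent_latencies_ms: &[u64], min_samples: usize) -> Option<u64> {
    if recent_latencies_ms.len() < min_samples {
        return None;
    }
    let mut sorted = recent_latencies_ms.to_vec();
    sorted.sort_unstable();
    Some(sorted[sorted.len() / 2]) // median of the window
}
```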
Caveat: hedging hurts on degraded networks. When latency is consistently high (no random spikes, just slow), the hedge adds overhead with nothing to rescue. Hedging is a variance reducer, not a latency reducer — it only helps when spikes are random.
Forwarding results
5 iterations × 101 domains × 10 rounds, 5,050 samples per method. Hickory-resolver included as a reference (it uses h2 directly, no dispatch channel):
| | Single | Hedged | Hickory (ref) |
|---|---|---|---|
| mean | 17.4ms | 14.3ms | 16.8ms |
| median | 10.4ms | 10.2ms | 13.3ms |
| p95 | 52.5ms | 28.6ms | 37.7ms |
| p99 | 113.4ms | 71.3ms | 98.1ms |
| σ | 30.6ms | 13.2ms | 19.1ms |
The internal improvement: hedging cut p95 by 45%, p99 by 37%, σ by 57%. The exact margin vs hickory varies with network conditions; the σ reduction is consistent across runs.
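Those percentages fall straight out of the table; a trivial helper shows the arithmetic (the function name is mine, for illustration):

```rust
/// Percentage reduction from `before` to `after`.
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

// p95: reduction_pct(52.5, 28.6)  -> ~45.5
// p99: reduction_pct(113.4, 71.3) -> ~37.1
// sigma: reduction_pct(30.6, 13.2) -> ~56.9
```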
Recursive resolution: from 2.3 seconds to 538ms
Forwarding is one job. Recursive resolution — walking from root hints through TLD nameservers to the authoritative server — is a different one. I started 15× behind Unbound on cold recursive p99 and traced it to four root causes.
1. Missing NS delegation caching. I cached glue records (ns1’s IP) but not the delegation itself. Every `.com` query walked from root. Fix: cache NS records from referral authority sections. (10 lines)
2. Expired cache entries caused full cold resolutions. Fix: serve-stale (RFC 8767) — return expired entries with TTL=1 while revalidating in the background. (20 lines)
3. Wasting 1,900ms per unreachable server. 800ms UDP timeout + unconditional 1,500ms TCP fallback. Fix: 400ms UDP, TCP only for truncation. (5 lines)
4. Sequential NS queries on cold starts. Fix: fire to the top 2 nameservers simultaneously. First response wins, SRTT recorded for both. Same hedging principle. (50 lines)
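As an illustration of fix #2, here is a minimal serve-stale classification sketch using std types only. The entry shape, the `max_stale` bound, and all names are hypothetical, not Numa’s actual cache:

```rust
use std::time::{Duration, Instant};

/// Hypothetical cache entry for the serve-stale (RFC 8767) sketch.
struct CacheEntry {
    expires_at: Instant,
}

/// Outcome of a cache lookup under serve-stale.
#[derive(Debug, PartialEq)]
enum Lookup {
    /// Within TTL: serve normally.
    Fresh,
    /// Expired but within the stale window: serve with TTL=1 and
    /// revalidate in the background instead of blocking the client.
    ServeStale,
    /// Too stale: fall through to a full cold resolution.
    Miss,
}

fn classify(entry: &CacheEntry, now: Instant, max_stale: Duration) -> Lookup {
    if now <= entry.expires_at {
        Lookup::Fresh
    } else if now <= entry.expires_at + max_stale {
        Lookup::ServeStale
    } else {
        Lookup::Miss
    }
}
```

The point is that a `ServeStale` hit answers the client immediately; only the background refresh pays the cold-resolution cost.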
Genuine cold benchmarks — unique subdomains, 1 query per domain, 5 iterations, 505 samples per server:
| | Baseline | Final | Unbound (ref) |
|---|---|---|---|
| p99 | 2,367ms | 538ms | 748ms |
| σ | 254ms | 114ms | 457ms |
| median | — | 77.6ms | 74.7ms |
Unbound wins median by ~4%. Where hedging shines is the tail — domains with slow or unreachable nameservers, where parallel queries turn worst-case sequential timeouts into races. Cache hits are tied at 0.1ms across Numa, Unbound, and AdGuard Home.
What I’m exploring next: persistent SRTT data across restarts (currently cold-starts lose all server timing), aggressive NSEC caching to shortcut negative lookups, and adaptive hedge delays that tune themselves to observed network conditions instead of a fixed 10ms.
Takeaways
The real hero of this post is Dean & Barroso. Hedging works because spikes are random, and two random draws rarely both lose. It’s effective for any HTTP/2 client, any language, any forwarder topology. Nobody we know of ships it by default.
If you’re building a Rust service that makes many small HTTP/2 requests to the same backend: check your flow control window sizes first, then implement hedging. Don’t rewrite the client.
Benchmarks are in `benches/recursive_compare.rs` — run them yourself. If you’re using reqwest for tiny-payload workloads and try the window size fix, I’d love to hear if you see the same improvement.
Numa is a DNS resolver that runs on your laptop or phone. DoH, DoT, .numa local domains, ad blocking, developer overrides, a REST API, and all the optimization work in this post. github.com/razvandimescu/numa.