Data Quality¶
Understanding and handling the real-world quality issues in Binance orderbook data.
The Problem¶
Raw Binance orderbook data has significant quality issues that most libraries ignore. Our empirical analysis of live BTCUSDT data found:
| Issue | Finding | Impact |
|---|---|---|
| Dust orders | 27-49% of top 100 levels have < $66 notional | Inflates apparent depth, distorts imbalance signals |
| Book sparsity | 75-79% of levels have price gaps > 1 tick | Misleading level count; gaps of 28-113 ticks are common |
| Depth asymmetry | 3:1 bid/ask imbalance ($1.53M vs $481K) | Normal, but important to understand |
| Unbounded growth | Without pruning, the book grows from 1,000 to 4,000+ rows in 13 hours | Unbounded memory use in long-running processes |
| Latency spikes | Feed delays spike to seconds during volatile events | Stale quotes misrepresent the book |
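The dust and sparsity findings above are easy to reproduce on any snapshot. A minimal sketch in plain Python, using illustrative (price, quantity) bid levels and the $66 dust cutoff from the table (the data and tick size are made up for demonstration):

```python
# Hypothetical bid levels as (price, quantity) tuples -- illustrative data,
# not real BTCUSDT output.
bids = [(97000.0, 0.0002), (96999.9, 1.5), (96990.0, 0.00005), (96960.0, 2.1)]

DUST_THRESHOLD_USD = 66.0  # notional cutoff used in the analysis above
TICK_SIZE = 0.1            # illustrative tick size

# Share of levels whose notional (price * qty) falls below the dust cutoff.
dust = [p * q for p, q in bids if p * q < DUST_THRESHOLD_USD]
dust_pct = 100 * len(dust) / len(bids)

# Share of adjacent levels separated by more than one tick.
gaps = [
    round((bids[i][0] - bids[i + 1][0]) / TICK_SIZE)
    for i in range(len(bids) - 1)
]
sparse_pct = 100 * sum(g > 1 for g in gaps) / len(gaps)

print(f"dust levels: {dust_pct:.0f}%, gapped levels: {sparse_pct:.0f}%")
```

Running the same two checks over live snapshots is how figures like "27-49% dust" are obtained.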
Built-in Filters¶
Dust Filter¶
Removes levels with notional value below a threshold.
```python
# Default: removes levels < $5 notional
ob = book.ob_snapshot("BTCUSDT", max_levels=50, clean=["dust"])

# Custom threshold
from binance_book.filters.dust import filter_dust

raw = book.ob_snapshot("BTCUSDT", max_levels=50)
clean = filter_dust(raw, min_notional_usd=100)  # Only keep levels > $100
```
Stale Filter¶
Removes levels with timestamps older than a threshold.
```python
from binance_book.filters.stale import filter_stale

clean = filter_stale(raw, staleness_ms=5000)  # Remove anything > 5 seconds old
```
Gap Filter¶
Removes levels with large price gaps from their neighbor.
```python
from binance_book.filters.gap import filter_gap

clean = filter_gap(raw, max_gap_ticks=50, tick_size=0.01)
```
Anomaly Filter¶
Removes statistical size outliers (potential spoof walls).
```python
from binance_book.filters.anomaly import filter_anomalies

clean = filter_anomalies(raw, sigma=3.0)  # Remove sizes > 3σ from mean
```
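Conceptually, a 3σ cutoff just compares each level's size to the mean and standard deviation of sizes on that side of the book. A sketch of that logic in plain Python (the actual filter_anomalies implementation may differ):

```python
import statistics

# Illustrative level sizes; the final entry is an obvious wall.
sizes = [0.5, 0.7, 0.6, 0.8, 0.55, 0.65, 0.6, 0.7, 0.5, 0.75,
         0.6, 0.65, 0.55, 0.7, 0.6, 0.8, 0.5, 0.65, 0.6, 0.7, 25.0]

mean = statistics.fmean(sizes)
sigma = statistics.stdev(sizes)

# Keep only sizes within 3 standard deviations of the mean.
kept = [s for s in sizes if abs(s - mean) <= 3 * sigma]
```

Note that an outlier inflates the standard deviation it is measured against, so very small samples can let a wall slip through; the filter is most reliable with a few dozen levels or more.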
Annotation Mode¶
Instead of removing rows, add quality columns for analysis:
```python
ob = book.ob_snapshot("BTCUSDT", max_levels=20, annotate=True)
# Each row gets: IS_DUST, NOTIONAL_USD, GAP_TICKS, IS_OUTLIER, IS_STALE, STALENESS_MS
```
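With the quality columns attached you can combine conditions yourself instead of relying on a single built-in filter. A sketch assuming each row exposes the annotation columns by name, with illustrative stand-in data rather than real snapshot output:

```python
# Illustrative annotated rows -- stands in for book.ob_snapshot(..., annotate=True).
rows = [
    {"PRICE": 97000.0, "QTY": 0.0002, "IS_DUST": True,  "IS_STALE": False, "IS_OUTLIER": False},
    {"PRICE": 96999.9, "QTY": 1.5,    "IS_DUST": False, "IS_STALE": False, "IS_OUTLIER": False},
    {"PRICE": 96990.0, "QTY": 2.1,    "IS_DUST": False, "IS_STALE": True,  "IS_OUTLIER": False},
]

# Keep only levels that pass every quality flag.
clean = [r for r in rows if not (r["IS_DUST"] or r["IS_STALE"] or r["IS_OUTLIER"])]
```

Annotation mode is the better choice when you want to study the excluded levels (e.g. count dust per side) rather than silently drop them.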
Token Budget Reality¶
A full 5000-level BTCUSDT orderbook is roughly 332 KB (~85,000 tokens). The table below shows how many symbols you can fit in different LLM context windows:
| Detail Level | Tokens/Symbol | GPT-4o (64k avail) | Claude 3.5 (100k avail) |
|---|---|---|---|
"minimal" |
~34 | 1,280 symbols | 2,000 symbols |
"summary" |
~136 | 426 symbols | 666 symbols |
"standard" |
~500 | 128 symbols | 200 symbols |
"detailed" |
~2,500 | 25 symbols | 40 symbols |
"full" |
~100,000 | 0 symbols | 1 symbol |
Rule of thumb: Use `detail="auto"` and let binance-book pick the right level based on your context budget.
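To illustrate the kind of selection `detail="auto"` implies, here is a hypothetical helper that picks the richest detail level whose total cost fits the budget. The per-symbol costs come from the table above; `pick_detail` and its greedy strategy are assumptions for illustration, not the library's documented heuristic:

```python
# Approximate per-symbol token costs from the table above.
COSTS = {"minimal": 34, "summary": 136, "standard": 500,
         "detailed": 2500, "full": 100_000}

def pick_detail(n_symbols: int, budget_tokens: int) -> str:
    """Richest detail level that fits n_symbols into the token budget."""
    for level in ("full", "detailed", "standard", "summary", "minimal"):
        if n_symbols * COSTS[level] <= budget_tokens:
            return level
    return "minimal"  # nothing fits; fall back to the cheapest level

pick_detail(100, 64_000)  # "standard": 100 * 500 = 50,000 tokens fits in 64k
```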
Streaming Bandwidth¶
| Stream | Per Symbol | 10 Symbols | 100 Symbols |
|---|---|---|---|
| @depth@100ms | ~12.7 KB/s, 44.6 MB/hr | 446 MB/hr | 4.4 GB/hr |
| @depth (1s) | ~1.3 KB/s, 4.5 MB/hr | 45 MB/hr | 450 MB/hr |
| @bookTicker | ~5 KB/s | 50 KB/s | 500 KB/s |
| @trade | ~1.5 KB/s | 15 KB/s | 150 KB/s |
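The hourly figures follow from the per-second rates: multiply by 3,600 seconds and convert to MiB (assuming the table's KB and MB are binary units). A quick check of the @depth@100ms row:

```python
kb_per_s = 12.7                       # @depth@100ms per-symbol rate from the table
mb_per_hr = kb_per_s * 3600 / 1024    # KiB/s -> MiB per hour

print(round(mb_per_hr, 1))            # ≈ 44.6, matching the table
```

The 10- and 100-symbol columns are simply that figure scaled linearly, which holds as long as the symbols' update rates are comparable.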