Data Quality¶
Understanding and handling the real-world quality issues in Binance orderbook data.
The Problem¶
Raw Binance orderbook data has significant quality issues that most libraries ignore. Our empirical analysis of live BTCUSDT data found:
| Issue | Finding | Impact |
|---|---|---|
| Dust orders | 27-49% of top 100 levels have < $66 notional | Inflates apparent depth, distorts imbalance signals |
| Book sparsity | 75-79% of levels have price gaps > 1 tick | Misleading level count; gaps of 28-113 ticks are common |
| Depth asymmetry | 3:1 bid/ask imbalance ($1.53M vs $481K) | Normal, but important to understand |
| Unbounded growth | Without pruning, the book grows from 1,000 to 4,000+ rows in 13 hours | Unbounded memory use in long-running processes |
| Latency spikes | Feed delays spike to seconds during volatile events | Stale quotes misrepresent the book |
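The dust and sparsity findings above are easy to reproduce on any snapshot. A minimal sketch in plain Python, using illustrative (price, quantity) bid levels and the $66 dust cutoff from the table (the data and tick size are made up for demonstration):

```python
# Hypothetical bid levels as (price, quantity) tuples -- illustrative data,
# not real BTCUSDT output.
bids = [(97000.0, 0.0002), (96999.9, 1.5), (96990.0, 0.00005), (96960.0, 2.1)]

DUST_THRESHOLD_USD = 66.0  # notional cutoff used in the analysis above
TICK_SIZE = 0.1            # illustrative tick size

# Share of levels whose notional (price * qty) falls below the dust cutoff.
dust = [p * q for p, q in bids if p * q < DUST_THRESHOLD_USD]
dust_pct = 100 * len(dust) / len(bids)

# Share of adjacent levels separated by more than one tick.
gaps = [
    round((bids[i][0] - bids[i + 1][0]) / TICK_SIZE)
    for i in range(len(bids) - 1)
]
sparse_pct = 100 * sum(g > 1 for g in gaps) / len(gaps)

print(f"dust levels: {dust_pct:.0f}%, gapped levels: {sparse_pct:.0f}%")
```

Running the same two checks over live snapshots is how figures like "27-49% dust" are obtained.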
Built-in Filters¶
Dust Filter¶
Removes levels with notional value below a threshold.
```python
# Default: removes levels < $5 notional
ob = book.ob_snapshot("BTCUSDT", max_levels=50, clean=["dust"])

# Custom threshold
from binance_book.filters.dust import filter_dust

raw = book.ob_snapshot("BTCUSDT", max_levels=50)
clean = filter_dust(raw, min_notional_usd=100)  # Only keep levels > $100
```
Stale Filter¶
Removes levels with timestamps older than a threshold.
```python
from binance_book.filters.stale import filter_stale

clean = filter_stale(raw, staleness_ms=5000)  # Remove anything > 5 seconds old
```
Gap Filter¶
Removes levels with large price gaps from their neighbor.
```python
from binance_book.filters.gap import filter_gap

clean = filter_gap(raw, max_gap_ticks=50, tick_size=0.01)
```
Anomaly Filter¶
Removes statistical size outliers (potential spoof walls).
```python
from binance_book.filters.anomaly import filter_anomalies

clean = filter_anomalies(raw, sigma=3.0)  # Remove sizes > 3σ from mean
```
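Conceptually, a 3σ cutoff just compares each level's size to the mean and standard deviation of sizes on that side of the book. A sketch of that logic in plain Python (the actual filter_anomalies implementation may differ):

```python
import statistics

# Illustrative level sizes; the final entry is an obvious wall.
sizes = [0.5, 0.7, 0.6, 0.8, 0.55, 0.65, 0.6, 0.7, 0.5, 0.75,
         0.6, 0.65, 0.55, 0.7, 0.6, 0.8, 0.5, 0.65, 0.6, 0.7, 25.0]

mean = statistics.fmean(sizes)
sigma = statistics.stdev(sizes)

# Keep only sizes within 3 standard deviations of the mean.
kept = [s for s in sizes if abs(s - mean) <= 3 * sigma]
```

Note that an outlier inflates the standard deviation it is measured against, so very small samples can let a wall slip through; the filter is most reliable with a few dozen levels or more.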
Annotation Mode¶
Instead of removing rows, add quality columns for analysis:
```python
ob = book.ob_snapshot("BTCUSDT", max_levels=20, annotate=True)
# Each row gets: IS_DUST, NOTIONAL_USD, GAP_TICKS, IS_OUTLIER, IS_STALE, STALENESS_MS
```
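With the quality columns attached you can combine conditions yourself instead of relying on a single built-in filter. A sketch assuming each row exposes the annotation columns by name, with illustrative stand-in data rather than real snapshot output:

```python
# Illustrative annotated rows -- stands in for book.ob_snapshot(..., annotate=True).
rows = [
    {"PRICE": 97000.0, "QTY": 0.0002, "IS_DUST": True,  "IS_STALE": False, "IS_OUTLIER": False},
    {"PRICE": 96999.9, "QTY": 1.5,    "IS_DUST": False, "IS_STALE": False, "IS_OUTLIER": False},
    {"PRICE": 96990.0, "QTY": 2.1,    "IS_DUST": False, "IS_STALE": True,  "IS_OUTLIER": False},
]

# Keep only levels that pass every quality flag.
clean = [r for r in rows if not (r["IS_DUST"] or r["IS_STALE"] or r["IS_OUTLIER"])]
```

Annotation mode is the better choice when you want to study the excluded levels (e.g. count dust per side) rather than silently drop them.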
Token Budget Reality¶
A full 5000-level BTCUSDT orderbook is roughly 332 KB (~85,000 tokens). The table below shows how many symbols you can fit in different LLM context windows:
| Detail Level | Tokens/Symbol | GPT-4o (64k avail) | Claude 3.5 (100k avail) |
|---|---|---|---|
"minimal" |
~34 | 1,280 symbols | 2,000 symbols |
"summary" |
~136 | 426 symbols | 666 symbols |
"standard" |
~500 | 128 symbols | 200 symbols |
"detailed" |
~2,500 | 25 symbols | 40 symbols |
"full" |
~100,000 | 0 symbols | 1 symbol |
Rule of thumb: Use `detail="auto"` and let binance-book pick the right level based on your context budget.
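To illustrate the kind of selection `detail="auto"` implies, here is a hypothetical helper that picks the richest detail level whose total cost fits the budget. The per-symbol costs come from the table above; `pick_detail` and its greedy strategy are assumptions for illustration, not the library's documented heuristic:

```python
# Approximate per-symbol token costs from the table above.
COSTS = {"minimal": 34, "summary": 136, "standard": 500,
         "detailed": 2500, "full": 100_000}

def pick_detail(n_symbols: int, budget_tokens: int) -> str:
    """Richest detail level that fits n_symbols into the token budget."""
    for level in ("full", "detailed", "standard", "summary", "minimal"):
        if n_symbols * COSTS[level] <= budget_tokens:
            return level
    return "minimal"  # nothing fits; fall back to the cheapest level

pick_detail(100, 64_000)  # "standard": 100 * 500 = 50,000 tokens fits in 64k
```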
Streaming Bandwidth¶
| Stream | Per Symbol | 10 Symbols | 100 Symbols |
|---|---|---|---|
| @depth@100ms | ~12.7 KB/s, 44.6 MB/hr | 446 MB/hr | 4.4 GB/hr |
| @depth (1s) | ~1.3 KB/s, 4.5 MB/hr | 45 MB/hr | 450 MB/hr |
| @bookTicker | ~5 KB/s | 50 KB/s | 500 KB/s |
| @trade | ~1.5 KB/s | 15 KB/s | 150 KB/s |
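The hourly figures follow from the per-second rates: multiply by 3,600 seconds and convert to MiB (assuming the table's KB and MB are binary units). A quick check of the @depth@100ms row:

```python
kb_per_s = 12.7                       # @depth@100ms per-symbol rate from the table
mb_per_hr = kb_per_s * 3600 / 1024    # KiB/s -> MiB per hour

print(round(mb_per_hr, 1))            # ≈ 44.6, matching the table
```

The 10- and 100-symbol columns are simply that figure scaled linearly, which holds as long as the symbols' update rates are comparable.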