Orderbook Snapshot Schema
Status: valid
This document covers the Checkpoint 5 normalized order-book sample. The raw gzip JSONL files remain the source of truth. Normalized rows are derived records for quick inspection and later quality checks.
Normalized Snapshot
Schema name: normalized_orderbook_snapshot
Schema version: 1
File format: gzip JSONL, one JSON object per line.
Sample location:
data/normalized_sample/polymarket/orderbooks/<run_id>/polymarket_orderbooks_normalized_<run_id>.jsonl.gz
Every normalized row must reference exactly one raw gzip JSONL source row:
- `raw_file`: repository-relative path to the raw gzip JSONL file.
- `raw_line_number`: 1-based line number inside that raw gzip JSONL file.
Derived data is invalid if either lineage field is missing or points to a missing raw file.
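
The lineage rule above can be sketched as a small per-row check. This is a minimal illustration, not the project's validator: the function name `lineage_errors` and the `repo_root` parameter are hypothetical, and it assumes `raw_file` is resolved relative to the repository root.

```python
from pathlib import Path


def lineage_errors(row: dict, repo_root: Path) -> list[str]:
    """Return lineage problems for one normalized row (empty list if none)."""
    errors = []
    raw_file = row.get("raw_file")
    raw_line_number = row.get("raw_line_number")

    if not isinstance(raw_file, str) or not raw_file:
        errors.append("raw_file missing")
    elif not (repo_root / raw_file).is_file():
        errors.append(f"raw file not found: {raw_file}")

    # bool is a subclass of int in Python, so it is rejected explicitly.
    if (
        isinstance(raw_line_number, bool)
        or not isinstance(raw_line_number, int)
        or raw_line_number < 1
    ):
        errors.append("raw_line_number missing or not a 1-based integer")
    return errors
```

A row is treated as invalid when the returned list is non-empty. The sketch only checks that the raw file exists; it does not confirm that the referenced line number falls inside that file.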
Field Contract
| Field | Type | Meaning |
|---|---|---|
| `schema_name` | string | Always normalized_orderbook_snapshot. |
| `schema_version` | number | Schema version, currently 1. |
| `market_name` | string | Market source name from the raw envelope. |
| `market_slug` | string | Polymarket market slug from the raw envelope. |
| `condition_id` | string | Polymarket condition ID from the raw envelope. |
| `token_id` | string | Polymarket CLOB token ID from the raw envelope. |
| `outcome` | string | Outcome label associated with token_id. |
| `collected_at_utc` | string | Collector timestamp from the raw envelope. |
| `best_bid` | string or null | Maximum bid price, or null when no bids exist. |
| `best_ask` | string or null | Minimum ask price, or null when no asks exist. |
| `spread` | string or null | best_ask - best_bid when both sides exist. |
| `midpoint` | string or null | (best_bid + best_ask) / 2 when both sides exist. |
| `bid_depth_total` | string | Sum of all bid sizes. |
| `ask_depth_total` | string | Sum of all ask sizes. |
| `bid_depth_within_1c` | string | Sum of bid sizes priced at least best_bid - 0.01. |
| `ask_depth_within_1c` | string | Sum of ask sizes priced at most best_ask + 0.01. |
| `bid_depth_within_2c` | string | Sum of bid sizes priced at least best_bid - 0.02. |
| `ask_depth_within_2c` | string | Sum of ask sizes priced at most best_ask + 0.02. |
| `bid_depth_within_5c` | string | Sum of bid sizes priced at least best_bid - 0.05. |
| `ask_depth_within_5c` | string | Sum of ask sizes priced at most best_ask + 0.05. |
| `raw_file` | string | Repository-relative raw gzip JSONL path. |
| `raw_line_number` | number | 1-based source line number in raw_file. |
Numeric Encoding
Prices and sizes are parsed with Python Decimal. Derived numeric values are
emitted as exact decimal strings rather than JSON numbers. This keeps precision
visible and avoids binary floating-point rounding.
Missing price-derived values are emitted as null. Depth totals and depth bands
are emitted as decimal strings and use "0" when the relevant side is empty.
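
As an illustration of the encoding described above, the sketch below formats `Decimal` values as plain strings and maps missing values to JSON null. The helper name `encode_decimal` and the choice of fixed-point formatting via `format(value, "f")` are assumptions made for this example; the document does not state which string form the normalizer uses.

```python
import json
from decimal import Decimal
from typing import Optional


def encode_decimal(value: Optional[Decimal]) -> Optional[str]:
    """Encode a Decimal as a fixed-point string; None stays None (JSON null)."""
    if value is None:
        return None
    # "f" avoids exponent notation that str() can produce for some values.
    return format(value, "f")


# Binary floats drift, while Decimal arithmetic on parsed strings stays exact.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

row = {
    "spread": encode_decimal(Decimal("0.53") - Decimal("0.51")),
    "midpoint": encode_decimal(None),  # one side of the book is missing
    "bid_depth_total": encode_decimal(sum([], Decimal("0"))),  # empty side
}
print(json.dumps(row))
```

Under these assumptions an empty side sums to `Decimal("0")`, which encodes as the string "0", matching the rule for depth totals and bands.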
Calculation Rules
- `best_bid`: maximum bid price.
- `best_ask`: minimum ask price.
- `spread`: `best_ask - best_bid` when both sides exist.
- `midpoint`: `(best_bid + best_ask) / 2` when both sides exist.
- `bid_depth_total`: sum of all bid sizes.
- `ask_depth_total`: sum of all ask sizes.
- `bid_depth_within_1c`: sum bid sizes with price greater than or equal to `best_bid - 0.01`.
- `ask_depth_within_1c`: sum ask sizes with price less than or equal to `best_ask + 0.01`.
- The same band rule is used for `0.02` and `0.05`.
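
The calculation rules can be sketched in a few lines of Python. This is a hedged illustration only: `derive_metrics` is a hypothetical name, and the input shape (lists of `(price, size)` string pairs per side) is an assumption, since this document does not describe the layout of the raw envelope.

```python
from decimal import Decimal

# Band widths in price units, keyed by the field-name suffix.
BANDS = {"1c": Decimal("0.01"), "2c": Decimal("0.02"), "5c": Decimal("0.05")}
ZERO = Decimal("0")


def _fmt(value):
    """Fixed-point decimal string, or None for a missing value."""
    return None if value is None else format(value, "f")


def derive_metrics(bids, asks):
    """Compute derived fields from (price, size) string pairs (assumed shape)."""
    bids = [(Decimal(p), Decimal(s)) for p, s in bids]
    asks = [(Decimal(p), Decimal(s)) for p, s in asks]

    best_bid = max((p for p, _ in bids), default=None)
    best_ask = min((p for p, _ in asks), default=None)
    both = best_bid is not None and best_ask is not None

    out = {
        "best_bid": _fmt(best_bid),
        "best_ask": _fmt(best_ask),
        "spread": _fmt(best_ask - best_bid) if both else None,
        "midpoint": _fmt((best_bid + best_ask) / 2) if both else None,
        "bid_depth_total": _fmt(sum((s for _, s in bids), ZERO)),
        "ask_depth_total": _fmt(sum((s for _, s in asks), ZERO)),
    }
    for label, width in BANDS.items():
        # An empty side yields no levels, so the band sum falls back to "0".
        out[f"bid_depth_within_{label}"] = _fmt(
            sum((s for p, s in bids if p >= best_bid - width), ZERO)
        )
        out[f"ask_depth_within_{label}"] = _fmt(
            sum((s for p, s in asks if p <= best_ask + width), ZERO)
        )
    return out


metrics = derive_metrics(
    bids=[("0.48", "100"), ("0.47", "50"), ("0.40", "25")],
    asks=[("0.52", "80"), ("0.54", "40"), ("0.60", "10")],
)
```

With this sample book the 1c bid band covers the 0.48 and 0.47 levels, while the 1c ask band covers only the 0.52 level because 0.54 lies above `best_ask + 0.01`.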
Sanity Rules
A normalized file should pass these checks:
- Output row count equals raw input row count unless skipped rows are recorded.
- Every row has `raw_file` and `raw_line_number`.
- Every referenced raw file exists.
- `spread` is non-negative whenever both sides exist.
- `midpoint` is between `best_bid` and `best_ask` whenever both sides exist.
- Depth totals and band depths are non-negative.
- At least one `Up` row and one `Down` row exist in the sample.
- The gzip JSONL file decompresses and every line parses as JSON.
- The manifest checksum matches the normalized output file.
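
Several of the checks above can be sketched as a single pass over a normalized file. The function names `row_failures` and `file_failures` are hypothetical, and the sketch deliberately leaves out the row-count comparison, the raw-file existence check, and the manifest checksum, because this document does not describe the skipped-row record or the manifest format.

```python
import gzip
import json
from decimal import Decimal

DEPTH_FIELDS = [
    f"{side}_depth_{suffix}"
    for side in ("bid", "ask")
    for suffix in ("total", "within_1c", "within_2c", "within_5c")
]


def row_failures(row: dict) -> list[str]:
    """Return the sanity-rule violations found in one normalized row."""
    failures = []
    for field in ("raw_file", "raw_line_number"):
        if row.get(field) is None:
            failures.append(f"missing {field}")

    bid, ask = row.get("best_bid"), row.get("best_ask")
    if bid is not None and ask is not None:
        spread, midpoint = row.get("spread"), row.get("midpoint")
        if spread is None or Decimal(spread) < 0:
            failures.append("spread missing or negative")
        if midpoint is None or not Decimal(bid) <= Decimal(midpoint) <= Decimal(ask):
            failures.append("midpoint missing or outside [best_bid, best_ask]")

    for field in DEPTH_FIELDS:
        value = row.get(field)
        if value is None or Decimal(value) < 0:
            failures.append(f"{field} missing or negative")
    return failures


def file_failures(path: str) -> list[str]:
    """Decompress a normalized gzip JSONL file and check every line."""
    failures = []
    outcomes = set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                failures.append(f"line {line_number}: not valid JSON")
                continue
            if not isinstance(row, dict):
                failures.append(f"line {line_number}: not a JSON object")
                continue
            outcomes.add(row.get("outcome"))
            failures.extend(f"line {line_number}: {msg}" for msg in row_failures(row))
    for required in ("Up", "Down"):
        if required not in outcomes:
            failures.append(f"no {required} row in sample")
    return failures
```

A file passes this partial check when `file_failures` returns an empty list. A file that fails to decompress raises an exception from `gzip` rather than returning a failure entry, which a real validator may prefer to catch and report.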
Current Known Gaps
- This schema covers a derived sample extract only.
- It does not define sustained daily normalized partitions.
- It does not include upload, daemon runtime, dashboards, databases, strategy code, backtests, trading behavior, or wallet behavior.
- Long-run schema stability still depends on future collection and soak-test evidence.