Orderbook Snapshot Schema

Status: valid

This document covers the Checkpoint 5 normalized order-book sample. The raw gzip JSONL files remain the source of truth. Normalized rows are derived records for quick inspection and later quality checks.

Normalized Snapshot

Schema name: normalized_orderbook_snapshot

Schema version: 1

File format: gzip JSONL, one JSON object per line.

Sample location:

data/normalized_sample/polymarket/orderbooks/<run_id>/polymarket_orderbooks_normalized_<run_id>.jsonl.gz

Every normalized row must reference exactly one raw gzip JSONL source row:

raw_file: repository-relative path to the raw gzip JSONL file.
raw_line_number: 1-based line number inside that raw gzip JSONL file.

A normalized row is invalid if either lineage field is missing, or if raw_file points to a raw file that does not exist.
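The lineage rule can be sketched as a small lookup that resolves a normalized row back to its raw line. This is a minimal illustration, not the project's own tooling: the function name `load_raw_line` and the `repo_root` parameter are hypothetical, and the sketch assumes raw files are UTF-8 text once decompressed.

```python
import gzip
from pathlib import Path


def load_raw_line(repo_root: Path, raw_file: str, raw_line_number: int) -> str:
    """Return the raw JSONL line that a normalized row points at.

    Hypothetical helper: raw_file is resolved relative to repo_root and
    raw_line_number is treated as 1-based, as the schema requires.
    """
    if raw_line_number < 1:
        raise ValueError("raw_line_number is 1-based")
    path = repo_root / raw_file
    if not path.is_file():
        raise FileNotFoundError(f"missing raw file: {raw_file}")
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if number == raw_line_number:
                return line.rstrip("\n")
    raise ValueError(f"{raw_file} has fewer than {raw_line_number} lines")
```

A row whose lookup raises here would count as invalid derived data under the rule above.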

Field Contract

Field	Type	Meaning
`schema_name`	string	Always `normalized_orderbook_snapshot`.
`schema_version`	number	Schema version, currently `1`.
`market_name`	string	Market source name from the raw envelope.
`market_slug`	string	Polymarket market slug from the raw envelope.
`condition_id`	string	Polymarket condition ID from the raw envelope.
`token_id`	string	Polymarket CLOB token ID from the raw envelope.
`outcome`	string	Outcome label associated with `token_id`.
`collected_at_utc`	string	Collector timestamp from the raw envelope.
`best_bid`	string or null	Maximum bid price, or null when no bids exist.
`best_ask`	string or null	Minimum ask price, or null when no asks exist.
`spread`	string or null	`best_ask - best_bid` when both sides exist, otherwise null.
`midpoint`	string or null	`(best_bid + best_ask) / 2` when both sides exist, otherwise null.
`bid_depth_total`	string	Sum of all bid sizes.
`ask_depth_total`	string	Sum of all ask sizes.
`bid_depth_within_1c`	string	Sum of bid sizes priced at least `best_bid - 0.01`.
`ask_depth_within_1c`	string	Sum of ask sizes priced at most `best_ask + 0.01`.
`bid_depth_within_2c`	string	Sum of bid sizes priced at least `best_bid - 0.02`.
`ask_depth_within_2c`	string	Sum of ask sizes priced at most `best_ask + 0.02`.
`bid_depth_within_5c`	string	Sum of bid sizes priced at least `best_bid - 0.05`.
`ask_depth_within_5c`	string	Sum of ask sizes priced at most `best_ask + 0.05`.
`raw_file`	string	Repository-relative raw gzip JSONL path.
`raw_line_number`	number	1-based source line number in `raw_file`.

Numeric Encoding

Prices and sizes are parsed with Python Decimal. Derived numeric values are emitted as exact decimal strings rather than JSON numbers. This keeps precision visible and avoids binary floating-point rounding.

Missing price-derived values are emitted as null. Depth totals and depth bands are emitted as decimal strings and use "0" when the relevant side is empty.
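The encoding rule can be illustrated with a short sketch. The helper name `encode_decimal` is hypothetical, and the use of `format(value, "f")` to avoid scientific notation is an assumption; the actual normalizer may serialize Decimal values differently.

```python
from decimal import Decimal
from typing import Optional


def encode_decimal(value: Optional[Decimal]) -> Optional[str]:
    """Return a plain decimal string, or None (JSON null) for a missing value.

    Assumption: fixed-point formatting is used so small values are never
    written in scientific notation.
    """
    if value is None:
        return None
    return format(value, "f")


# Decimal arithmetic stays exact where binary floats do not.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

# An empty side sums to Decimal("0"), which is emitted as "0".
empty_side_total = sum([], Decimal("0"))
print(encode_decimal(empty_side_total))   # 0
print(encode_decimal(Decimal("0.10")))    # 0.10
print(encode_decimal(None))               # None
```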

Calculation Rules

best_bid: maximum bid price.
best_ask: minimum ask price.
spread: best_ask - best_bid when both sides exist.
midpoint: (best_bid + best_ask) / 2 when both sides exist.
bid_depth_total: sum of all bid sizes.
ask_depth_total: sum of all ask sizes.
bid_depth_within_1c: sum of bid sizes with price greater than or equal to best_bid - 0.01.
ask_depth_within_1c: sum of ask sizes with price less than or equal to best_ask + 0.01.
The same band rule applies to the 2c and 5c fields, using 0.02 and 0.05 in place of 0.01.
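The rules above can be sketched as one function over a single book. This is an illustration under stated assumptions: the function name `summarize_book` is hypothetical, and the input shape (lists of objects with string `price` and `size` keys) is assumed rather than taken from the raw envelope definition.

```python
from decimal import Decimal
from typing import Dict, Iterable, List, Optional

# Band widths in price units; the labels match the field name suffixes.
BANDS = {"1c": Decimal("0.01"), "2c": Decimal("0.02"), "5c": Decimal("0.05")}


def _text(value: Optional[Decimal]) -> Optional[str]:
    """Encode a Decimal as a plain string, keeping None as None."""
    return None if value is None else format(value, "f")


def _total(sizes: Iterable[Decimal]) -> Decimal:
    """Sum sizes as Decimal, yielding Decimal('0') for an empty side."""
    return sum(sizes, Decimal("0"))


def summarize_book(bids: List[Dict[str, str]], asks: List[Dict[str, str]]) -> dict:
    """Compute the price and depth fields for one order-book snapshot."""
    bid_levels = [(Decimal(l["price"]), Decimal(l["size"])) for l in bids]
    ask_levels = [(Decimal(l["price"]), Decimal(l["size"])) for l in asks]
    best_bid = max((p for p, _ in bid_levels), default=None)
    best_ask = min((p for p, _ in ask_levels), default=None)
    both = best_bid is not None and best_ask is not None
    row = {
        "best_bid": _text(best_bid),
        "best_ask": _text(best_ask),
        "spread": _text(best_ask - best_bid) if both else None,
        "midpoint": _text((best_bid + best_ask) / 2) if both else None,
        "bid_depth_total": _text(_total(s for _, s in bid_levels)),
        "ask_depth_total": _text(_total(s for _, s in ask_levels)),
    }
    for label, width in BANDS.items():
        # The filters never run on an empty side, so best_* is not None here.
        row[f"bid_depth_within_{label}"] = _text(
            _total(s for p, s in bid_levels if p >= best_bid - width)
        )
        row[f"ask_depth_within_{label}"] = _text(
            _total(s for p, s in ask_levels if p <= best_ask + width)
        )
    return row
```

For example, with bids at 0.48 (size 100), 0.47 (size 50), and 0.40 (size 25), best_bid is 0.48, bid_depth_total is 175, and all three bid bands are 150, because the 0.40 level sits more than 0.05 below the best bid.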

Sanity Rules

A normalized file should pass these checks:

Output row count equals raw input row count; any difference must be accounted for by recorded skipped rows.
Every row has raw_file and raw_line_number.
Every referenced raw file exists.
spread is non-negative whenever both sides exist.
midpoint is between best_bid and best_ask whenever both sides exist.
Depth totals and band depths are non-negative.
At least one row with outcome Up and one row with outcome Down exist in the sample.
The gzip JSONL file decompresses and every line parses as JSON.
The manifest checksum matches the normalized output file.
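Several of these checks can be sketched as a single pass over the decoded rows. The names `read_rows`, `check_rows`, and `sha256_of` are hypothetical, and SHA-256 is an assumption, since this document does not state which digest the manifest uses. The row-count check is left out because it depends on how skipped rows are recorded.

```python
import gzip
import hashlib
import json
from decimal import Decimal
from pathlib import Path
from typing import List

DEPTH_FIELDS = [
    f"{side}_depth_{suffix}"
    for side in ("bid", "ask")
    for suffix in ("total", "within_1c", "within_2c", "within_5c")
]


def read_rows(path: Path) -> List[dict]:
    """Decompress the file and parse every line; a bad line raises an error."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


def check_rows(rows: List[dict], repo_root: Path) -> List[str]:
    """Return a list of human-readable problems; an empty list means a pass."""
    problems = []
    outcomes = set()
    for index, row in enumerate(rows, start=1):
        outcomes.add(row.get("outcome"))
        raw_file = row.get("raw_file")
        line_number = row.get("raw_line_number")
        if not raw_file or not isinstance(line_number, int) or line_number < 1:
            problems.append(f"row {index}: missing or invalid lineage")
        elif not (repo_root / raw_file).is_file():
            problems.append(f"row {index}: raw file not found: {raw_file}")
        bid, ask = row.get("best_bid"), row.get("best_ask")
        if bid is not None and ask is not None:
            if Decimal(row["spread"]) < 0:
                problems.append(f"row {index}: negative spread")
            if not Decimal(bid) <= Decimal(row["midpoint"]) <= Decimal(ask):
                problems.append(f"row {index}: midpoint outside bid/ask")
        for field in DEPTH_FIELDS:
            if Decimal(row[field]) < 0:
                problems.append(f"row {index}: negative {field}")
    for label in ("Up", "Down"):
        if label not in outcomes:
            problems.append(f"no {label} row in sample")
    return problems


def sha256_of(path: Path) -> str:
    """Hex digest of the file bytes (assumed digest; compare to the manifest)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()
```

Returning a list of problems rather than raising on the first failure keeps every violation visible in one run, which suits a quality check on a sample file.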

Current Known Gaps

This schema covers a derived sample extract only.
It does not define sustained daily normalized partitions.
It does not include upload, daemon runtime, dashboards, databases, strategy code, backtests, trading behavior, or wallet behavior.
Long-run schema stability still depends on future collection and soak-test evidence.
