# Orderbook Snapshot Schema

Status: valid

This document covers the Checkpoint 5 normalized order-book sample. The raw gzip JSONL files remain the source of truth. Normalized rows are derived records for quick inspection and later quality checks.

## Normalized Snapshot

Schema name: `normalized_orderbook_snapshot`

Schema version: `1`

File format: gzip JSONL, one JSON object per line.

Sample location:

```text
data/normalized_sample/polymarket/orderbooks//polymarket_orderbooks_normalized_.jsonl.gz
```

Every normalized row must reference exactly one raw gzip JSONL source row:

- `raw_file`: repository-relative path to the raw gzip JSONL file.
- `raw_line_number`: 1-based line number inside that raw gzip JSONL file.

Derived data is invalid if either lineage field is missing or points to a missing raw file.

## Field Contract

| Field | Type | Meaning |
| --- | --- | --- |
| `schema_name` | string | Always `normalized_orderbook_snapshot`. |
| `schema_version` | number | Schema version, currently `1`. |
| `market_name` | string | Market source name from the raw envelope. |
| `market_slug` | string | Polymarket market slug from the raw envelope. |
| `condition_id` | string | Polymarket condition ID from the raw envelope. |
| `token_id` | string | Polymarket CLOB token ID from the raw envelope. |
| `outcome` | string | Outcome label associated with `token_id`. |
| `collected_at_utc` | string | Collector timestamp from the raw envelope. |
| `best_bid` | string or null | Maximum bid price, or null when no bids exist. |
| `best_ask` | string or null | Minimum ask price, or null when no asks exist. |
| `spread` | string or null | `best_ask - best_bid` when both sides exist. |
| `midpoint` | string or null | `(best_bid + best_ask) / 2` when both sides exist. |
| `bid_depth_total` | string | Sum of all bid sizes. |
| `ask_depth_total` | string | Sum of all ask sizes. |
| `bid_depth_within_1c` | string | Sum of bid sizes priced at least `best_bid - 0.01`. |
| `ask_depth_within_1c` | string | Sum of ask sizes priced at most `best_ask + 0.01`. |
| `bid_depth_within_2c` | string | Sum of bid sizes priced at least `best_bid - 0.02`. |
| `ask_depth_within_2c` | string | Sum of ask sizes priced at most `best_ask + 0.02`. |
| `bid_depth_within_5c` | string | Sum of bid sizes priced at least `best_bid - 0.05`. |
| `ask_depth_within_5c` | string | Sum of ask sizes priced at most `best_ask + 0.05`. |
| `raw_file` | string | Repository-relative raw gzip JSONL path. |
| `raw_line_number` | number | 1-based source line number in `raw_file`. |

## Numeric Encoding

Prices and sizes are parsed with Python `Decimal`. Derived numeric values are emitted as exact decimal strings rather than JSON numbers. This keeps precision visible and avoids binary floating-point rounding.

Missing price-derived values are emitted as `null`. Depth totals and depth bands are emitted as decimal strings and use `"0"` when the relevant side is empty.

## Calculation Rules

- `best_bid`: maximum bid price.
- `best_ask`: minimum ask price.
- `spread`: `best_ask - best_bid` when both sides exist.
- `midpoint`: `(best_bid + best_ask) / 2` when both sides exist.
- `bid_depth_total`: sum of all bid sizes.
- `ask_depth_total`: sum of all ask sizes.
- `bid_depth_within_1c`: sum bid sizes with price greater than or equal to `best_bid - 0.01`.
- `ask_depth_within_1c`: sum ask sizes with price less than or equal to `best_ask + 0.01`.
- The same band rule is used for `0.02` and `0.05`.

## Sanity Rules

A normalized file should pass these checks:

- Output row count equals raw input row count unless skipped rows are recorded.
- Every row has `raw_file` and `raw_line_number`.
- Every referenced raw file exists.
- `spread` is non-negative whenever both sides exist.
- `midpoint` is between `best_bid` and `best_ask` whenever both sides exist.
- Depth totals and band depths are non-negative.
- At least one `Up` row and one `Down` row exist in the sample.
- The gzip JSONL file decompresses and every line parses as JSON.
- The manifest checksum matches the normalized output file.

## Current Known Gaps

- This schema covers a derived sample extract only.
- It does not define sustained daily normalized partitions.
- It does not include upload, daemon runtime, dashboards, databases, strategy code, backtests, trading behavior, or wallet behavior.
- Long-run schema stability still depends on future collection and soak-test evidence.
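
## Example: Deriving the Numeric Fields

As an illustration of the Numeric Encoding and Calculation Rules sections, the sketch below derives the price and depth fields for one book. It is not the project's normalizer. The function name `derive_metrics`, the helper names, and the assumed raw level shape (a list of objects with string `price` and `size` keys) are hypothetical; the raw envelope layout is not defined in this document.

```python
from decimal import Decimal

# Band widths in price units for the 1c, 2c, and 5c depth fields.
BANDS = {"1c": Decimal("0.01"), "2c": Decimal("0.02"), "5c": Decimal("0.05")}
ZERO = Decimal("0")


def _levels(side):
    """Parse levels into (price, size) Decimal pairs.

    Assumption: each level looks like {"price": "0.47", "size": "100"}.
    """
    return [(Decimal(level["price"]), Decimal(level["size"])) for level in side]


def _fmt(value):
    """Render a Decimal as a plain decimal string; pass None through as null."""
    return None if value is None else format(value, "f")


def derive_metrics(bids, asks):
    """Compute the price-derived and depth fields for one order book."""
    bids, asks = _levels(bids), _levels(asks)
    best_bid = max((price for price, _ in bids), default=None)
    best_ask = min((price for price, _ in asks), default=None)
    both_sides = best_bid is not None and best_ask is not None

    row = {
        "best_bid": _fmt(best_bid),
        "best_ask": _fmt(best_ask),
        "spread": _fmt(best_ask - best_bid) if both_sides else None,
        "midpoint": _fmt((best_bid + best_ask) / 2) if both_sides else None,
        "bid_depth_total": _fmt(sum((size for _, size in bids), ZERO)),
        "ask_depth_total": _fmt(sum((size for _, size in asks), ZERO)),
    }
    for name, width in BANDS.items():
        # On an empty side the filter never runs, so the sum stays at "0".
        row[f"bid_depth_within_{name}"] = _fmt(
            sum((s for p, s in bids if p >= best_bid - width), ZERO)
        )
        row[f"ask_depth_within_{name}"] = _fmt(
            sum((s for p, s in asks if p <= best_ask + width), ZERO)
        )
    return row
```

Starting each sum from `Decimal("0")` keeps every total a `Decimal`, so an empty side serializes as `"0"`. The lineage and identity fields (`raw_file`, `raw_line_number`, `token_id`, and so on) would be copied from the raw envelope and are omitted here.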
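
## Example: Checking the Sanity Rules

The row-level and file-level checks listed under Sanity Rules could be scripted roughly as follows. This is a minimal sketch under stated assumptions: `check_row`, `check_file`, and the `repo_root` parameter are hypothetical names, and the row-count and manifest-checksum rules are left out because this document does not specify the skipped-row record or the manifest format.

```python
import gzip
import json
from decimal import Decimal
from pathlib import Path

# The eight depth fields named in the field contract.
DEPTH_FIELDS = [
    f"{side}_depth_{suffix}"
    for side in ("bid", "ask")
    for suffix in ("total", "within_1c", "within_2c", "within_5c")
]


def check_row(row, repo_root="."):
    """Return a list of problems found in one normalized row."""
    problems = []

    # Lineage: both fields present, and the raw file exists on disk.
    raw_file = row.get("raw_file")
    if not raw_file or row.get("raw_line_number") is None:
        problems.append("missing lineage field")
    elif not (Path(repo_root) / raw_file).is_file():
        problems.append(f"raw file not found: {raw_file}")

    # Price rules apply only when both sides of the book exist.
    bid, ask = row.get("best_bid"), row.get("best_ask")
    if bid is not None and ask is not None:
        spread, midpoint = row.get("spread"), row.get("midpoint")
        if spread is None or midpoint is None:
            problems.append("spread or midpoint missing with both sides present")
        else:
            if Decimal(spread) < 0:
                problems.append("negative spread")
            if not Decimal(bid) <= Decimal(midpoint) <= Decimal(ask):
                problems.append("midpoint outside best_bid..best_ask")

    for field in DEPTH_FIELDS:
        value = row.get(field)
        if value is None:
            problems.append(f"missing depth field: {field}")
        elif Decimal(value) < 0:
            problems.append(f"negative depth: {field}")
    return problems


def check_file(path, repo_root="."):
    """Check every row of a normalized gzip JSONL file."""
    problems, outcomes = [], set()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            row = json.loads(line)  # raises if a line is not valid JSON
            outcomes.add(row.get("outcome"))
            for problem in check_row(row, repo_root):
                problems.append(f"line {line_number}: {problem}")
    for expected in ("Up", "Down"):
        if expected not in outcomes:
            problems.append(f"no {expected} row in sample")
    return problems
```

An empty list from `check_file` means the covered rules passed. A corrupt gzip stream or an unparseable line surfaces as an exception rather than as a list entry.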