# Orderbook Snapshot Schema

Status: valid

This document covers the Checkpoint 5 normalized order-book sample. The raw
gzip JSONL files remain the source of truth. Normalized rows are derived records
for quick inspection and later quality checks.

## Normalized Snapshot

Schema name: `normalized_orderbook_snapshot`

Schema version: `1`

File format: gzip JSONL, one JSON object per line.

Sample location:

```text
data/normalized_sample/polymarket/orderbooks/<run_id>/polymarket_orderbooks_normalized_<run_id>.jsonl.gz
```

Every normalized row must reference exactly one raw gzip JSONL source row:

- `raw_file`: repository-relative path to the raw gzip JSONL file.
- `raw_line_number`: 1-based line number inside that raw gzip JSONL file.

Derived data is invalid if either lineage field is missing or points to a
missing raw file.

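The lineage rule can be checked mechanically. The following is a minimal sketch, not the project's validator: the helper name `lineage_error` and the `repo_root` parameter are hypothetical, and it assumes each row has already been parsed into a `dict`.

```python
from pathlib import Path
from typing import Optional


def lineage_error(row: dict, repo_root: Path) -> Optional[str]:
    """Return a reason string if the row's lineage is invalid, else None.

    Hypothetical helper: `repo_root` is the directory that the
    repository-relative `raw_file` path is resolved against.
    """
    raw_file = row.get("raw_file")
    raw_line_number = row.get("raw_line_number")
    if not isinstance(raw_file, str) or not raw_file:
        return "missing raw_file"
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw_line_number, bool) or not isinstance(raw_line_number, int):
        return "missing raw_line_number"
    if raw_line_number < 1:
        return "raw_line_number must be 1-based"
    if not (repo_root / raw_file).is_file():
        return "raw file not found"
    return None
```

A caller would treat any non-`None` result as grounds to reject the derived row. The sketch does not check that the referenced line actually exists inside the raw file.
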
## Field Contract

| Field | Type | Meaning |
| --- | --- | --- |
| `schema_name` | string | Always `normalized_orderbook_snapshot`. |
| `schema_version` | number | Schema version, currently `1`. |
| `market_name` | string | Market source name from the raw envelope. |
| `market_slug` | string | Polymarket market slug from the raw envelope. |
| `condition_id` | string | Polymarket condition ID from the raw envelope. |
| `token_id` | string | Polymarket CLOB token ID from the raw envelope. |
| `outcome` | string | Outcome label associated with `token_id`. |
| `collected_at_utc` | string | Collector timestamp from the raw envelope. |
| `best_bid` | string or null | Maximum bid price, or null when no bids exist. |
| `best_ask` | string or null | Minimum ask price, or null when no asks exist. |
| `spread` | string or null | `best_ask - best_bid` when both sides exist, otherwise null. |
| `midpoint` | string or null | `(best_bid + best_ask) / 2` when both sides exist, otherwise null. |
| `bid_depth_total` | string | Sum of all bid sizes. |
| `ask_depth_total` | string | Sum of all ask sizes. |
| `bid_depth_within_1c` | string | Sum of bid sizes priced at least `best_bid - 0.01`. |
| `ask_depth_within_1c` | string | Sum of ask sizes priced at most `best_ask + 0.01`. |
| `bid_depth_within_2c` | string | Sum of bid sizes priced at least `best_bid - 0.02`. |
| `ask_depth_within_2c` | string | Sum of ask sizes priced at most `best_ask + 0.02`. |
| `bid_depth_within_5c` | string | Sum of bid sizes priced at least `best_bid - 0.05`. |
| `ask_depth_within_5c` | string | Sum of ask sizes priced at most `best_ask + 0.05`. |
| `raw_file` | string | Repository-relative raw gzip JSONL path. |
| `raw_line_number` | number | 1-based source line number in `raw_file`. |

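The table above maps naturally onto a type check per field. The sketch below is one possible way to express that contract in Python. The function name `contract_violations` and the grouping constants are hypothetical and are not taken from the project code.

```python
from typing import List

# Field groups copied from the Field Contract table.
STRING_FIELDS = (
    "schema_name", "market_name", "market_slug", "condition_id", "token_id",
    "outcome", "collected_at_utc",
    "bid_depth_total", "ask_depth_total",
    "bid_depth_within_1c", "ask_depth_within_1c",
    "bid_depth_within_2c", "ask_depth_within_2c",
    "bid_depth_within_5c", "ask_depth_within_5c",
    "raw_file",
)
NULLABLE_STRING_FIELDS = ("best_bid", "best_ask", "spread", "midpoint")
NUMBER_FIELDS = ("schema_version", "raw_line_number")


def contract_violations(row: dict) -> List[str]:
    """List every field in `row` that is missing or has the wrong JSON type."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(row.get(name), str):
            problems.append(f"{name}: expected string")
    for name in NULLABLE_STRING_FIELDS:
        if name not in row:
            problems.append(f"{name}: missing")
        elif row[name] is not None and not isinstance(row[name], str):
            problems.append(f"{name}: expected string or null")
    for name in NUMBER_FIELDS:
        value = row.get(name)
        # JSON true/false decode to bool, which Python treats as int.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected number")
    if row.get("schema_name") != "normalized_orderbook_snapshot":
        problems.append("schema_name: unexpected value")
    return problems
```

An empty list means the row matches the table's names and types. The sketch checks types only, not the values of individual fields beyond `schema_name`.
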
## Numeric Encoding

Prices and sizes are parsed with Python `Decimal`. Derived numeric values are
emitted as exact decimal strings rather than JSON numbers. This keeps precision
visible and avoids binary floating-point rounding.

Missing price-derived values are emitted as `null`. Depth totals and depth bands
are emitted as decimal strings and use `"0"` when the relevant side is empty.

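To illustrate why strings are used, here is a small sketch of the encoding step. The helper `encode_decimal` is hypothetical, and the choice of fixed-point formatting via `format(value, "f")` is an assumption. The schema only requires an exact decimal string.

```python
import json
from decimal import Decimal
from typing import Optional


def encode_decimal(value: Optional[Decimal]) -> Optional[str]:
    """Encode a Decimal as a plain string; None passes through as JSON null."""
    if value is None:
        return None
    # Assumption: fixed-point notation, so values never appear as "1E+2".
    return format(value, "f")


# Binary floats pick up rounding error; Decimal keeps the written digits.
assert 0.1 + 0.2 != 0.3
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")

two_sided = {"spread": encode_decimal(Decimal("0.52") - Decimal("0.48"))}
empty_side = {
    "best_bid": encode_decimal(None),
    # Summing an empty side from a Decimal zero yields Decimal("0").
    "bid_depth_total": encode_decimal(sum([], Decimal("0"))),
}
print(json.dumps(two_sided))   # {"spread": "0.04"}
print(json.dumps(empty_side))  # {"best_bid": null, "bid_depth_total": "0"}
```

Consumers should parse these strings back with `Decimal(...)` rather than `float(...)` if they need to preserve the exact values.
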
## Calculation Rules

- `best_bid`: maximum bid price, or `null` when no bids exist.
- `best_ask`: minimum ask price, or `null` when no asks exist.
- `spread`: `best_ask - best_bid` when both sides exist, otherwise `null`.
- `midpoint`: `(best_bid + best_ask) / 2` when both sides exist, otherwise `null`.
- `bid_depth_total`: sum of all bid sizes.
- `ask_depth_total`: sum of all ask sizes.
- `bid_depth_within_1c`: sum of bid sizes with price greater than or equal to
  `best_bid - 0.01`.
- `ask_depth_within_1c`: sum of ask sizes with price less than or equal to
  `best_ask + 0.01`.
- The `_within_2c` and `_within_5c` fields apply the same band rule with
  offsets of `0.02` and `0.05`.

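The rules above can be sketched as a single function. This is an illustrative implementation under stated assumptions, not the normalizer itself: the name `summarize` is hypothetical, and each book side is assumed to arrive as a list of `(price, size)` pairs already parsed to `Decimal`. The raw envelope layout is not specified here.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

Level = Tuple[Decimal, Decimal]  # assumed shape: (price, size)

BANDS = {"1c": Decimal("0.01"), "2c": Decimal("0.02"), "5c": Decimal("0.05")}


def summarize(bids: List[Level], asks: List[Level]) -> Dict[str, Optional[Decimal]]:
    """Compute the derived price and depth fields for one snapshot."""
    zero = Decimal("0")
    best_bid = max((price for price, _ in bids), default=None)
    best_ask = min((price for price, _ in asks), default=None)
    both = best_bid is not None and best_ask is not None
    out: Dict[str, Optional[Decimal]] = {
        "best_bid": best_bid,
        "best_ask": best_ask,
        "spread": best_ask - best_bid if both else None,
        "midpoint": (best_bid + best_ask) / 2 if both else None,
        "bid_depth_total": sum((size for _, size in bids), zero),
        "ask_depth_total": sum((size for _, size in asks), zero),
    }
    for label, width in BANDS.items():
        # On an empty side the generator yields nothing, so the None
        # best price is never used and the band depth stays at zero.
        out[f"bid_depth_within_{label}"] = sum(
            (size for price, size in bids if price >= best_bid - width), zero
        )
        out[f"ask_depth_within_{label}"] = sum(
            (size for price, size in asks if price <= best_ask + width), zero
        )
    return out
```

The function returns `Decimal` values. Converting them to strings for output is a separate step, which keeps the arithmetic and the encoding independent.
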
## Sanity Rules

A normalized file should pass these checks:

- Output row count equals raw input row count unless skipped rows are recorded.
- Every row has `raw_file` and `raw_line_number`.
- Every referenced raw file exists.
- `spread` is non-negative whenever both sides exist.
- `midpoint` is between `best_bid` and `best_ask` (inclusive) whenever both sides exist.
- Depth totals and band depths are non-negative.
- At least one `Up` row and one `Down` row exist in the sample.
- The gzip JSONL file decompresses and every line parses as JSON.
- The manifest checksum matches the normalized output file.

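Several of these checks are simple to automate per row. The sketch below covers only the decompress-and-parse, spread, midpoint, and non-negative depth checks. The function names `read_rows` and `row_sanity_errors` are hypothetical, and the code assumes rows already satisfy the field contract (for example, that `spread` is a string whenever both sides exist). Row counts, lineage, the `Up`/`Down` check, and the manifest checksum are left out because their inputs are not described here.

```python
import gzip
import json
from decimal import Decimal
from typing import Iterator, List

DEPTH_FIELDS = (
    "bid_depth_total", "ask_depth_total",
    "bid_depth_within_1c", "ask_depth_within_1c",
    "bid_depth_within_2c", "ask_depth_within_2c",
    "bid_depth_within_5c", "ask_depth_within_5c",
)


def read_rows(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per line; raises if gzip or JSON is bad."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)


def row_sanity_errors(row: dict) -> List[str]:
    """Return the price and depth sanity rules that this row violates."""
    errors = []
    bid, ask = row["best_bid"], row["best_ask"]
    if bid is not None and ask is not None:
        if Decimal(row["spread"]) < 0:
            errors.append("negative spread")
        # Inclusive bounds: a zero spread puts the midpoint on both prices.
        if not Decimal(bid) <= Decimal(row["midpoint"]) <= Decimal(ask):
            errors.append("midpoint outside [best_bid, best_ask]")
    for name in DEPTH_FIELDS:
        if Decimal(row[name]) < 0:
            errors.append(f"negative {name}")
    return errors
```

A file-level check could then iterate `read_rows(path)` and collect `row_sanity_errors(row)` for every row, failing if any list is non-empty.
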
## Current Known Gaps

- This schema covers a derived sample extract only.
- It does not define sustained daily normalized partitions.
- It does not include upload, daemon runtime, dashboards, databases, strategy
  code, backtests, trading behavior, or wallet behavior.
- Long-run schema stability still depends on future collection and soak-test
  evidence.