orderbooks/docs/ORDERBOOK_SCHEMA.md
philipp 284e465588
Prepare Kubernetes orderbooks deployment
2026-04-18 11:23:28 +02:00


# Orderbook Snapshot Schema
Status: valid

This document covers the Checkpoint 5 normalized order-book sample. The raw
gzip JSONL files remain the source of truth. Normalized rows are derived records
for quick inspection and later quality checks.
## Normalized Snapshot
Schema name: `normalized_orderbook_snapshot`

Schema version: `1`

File format: gzip JSONL, one JSON object per line.

Sample location:
```text
data/normalized_sample/polymarket/orderbooks/<run_id>/polymarket_orderbooks_normalized_<run_id>.jsonl.gz
```
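
Reading such a file needs only the standard library. The sketch below is illustrative: `iter_normalized_rows` is a hypothetical helper name, and skipping blank lines is an assumption rather than documented behavior.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_normalized_rows(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip JSONL file."""
    # "rt" makes gzip decompress and decode to text in one step.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # assumption: blank lines carry no record
                yield json.loads(line)
```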
Every normalized row must reference exactly one raw gzip JSONL source row:
- `raw_file`: repository-relative path to the raw gzip JSONL file.
- `raw_line_number`: 1-based line number inside that raw gzip JSONL file.

A normalized row is invalid if either lineage field is missing or if `raw_file`
points to a raw file that does not exist.
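
As a rough illustration of how the two lineage fields resolve back to a source row, the following sketch opens the raw file and reads the referenced line. `load_raw_row` and the `repo_root` parameter are hypothetical names, not part of the project.

```python
import gzip
import json
from itertools import islice
from pathlib import Path


def load_raw_row(row: dict, repo_root: Path) -> dict:
    """Return the raw JSON object that a normalized row points at."""
    raw_path = repo_root / row["raw_file"]  # repository-relative path
    line_number = row["raw_line_number"]    # 1-based
    if line_number < 1:
        raise ValueError("raw_line_number is 1-based")
    with gzip.open(raw_path, "rt", encoding="utf-8") as handle:
        # islice is 0-based, so line N sits at index N - 1.
        line = next(islice(handle, line_number - 1, line_number), None)
    if line is None:
        raise ValueError(f"{raw_path} has fewer than {line_number} lines")
    return json.loads(line)
```

A missing lineage field surfaces as `KeyError` and a missing raw file as `FileNotFoundError`, which matches the two invalid cases described above.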
## Field Contract
| Field | Type | Meaning |
| --- | --- | --- |
| `schema_name` | string | Always `normalized_orderbook_snapshot`. |
| `schema_version` | number | Schema version, currently `1`. |
| `market_name` | string | Market source name from the raw envelope. |
| `market_slug` | string | Polymarket market slug from the raw envelope. |
| `condition_id` | string | Polymarket condition ID from the raw envelope. |
| `token_id` | string | Polymarket CLOB token ID from the raw envelope. |
| `outcome` | string | Outcome label associated with `token_id`. |
| `collected_at_utc` | string | Collector timestamp from the raw envelope. |
| `best_bid` | string or null | Maximum bid price, or null when no bids exist. |
| `best_ask` | string or null | Minimum ask price, or null when no asks exist. |
| `spread` | string or null | `best_ask - best_bid` when both sides exist. |
| `midpoint` | string or null | `(best_bid + best_ask) / 2` when both sides exist. |
| `bid_depth_total` | string | Sum of all bid sizes. |
| `ask_depth_total` | string | Sum of all ask sizes. |
| `bid_depth_within_1c` | string | Sum of bid sizes priced at least `best_bid - 0.01`. |
| `ask_depth_within_1c` | string | Sum of ask sizes priced at most `best_ask + 0.01`. |
| `bid_depth_within_2c` | string | Sum of bid sizes priced at least `best_bid - 0.02`. |
| `ask_depth_within_2c` | string | Sum of ask sizes priced at most `best_ask + 0.02`. |
| `bid_depth_within_5c` | string | Sum of bid sizes priced at least `best_bid - 0.05`. |
| `ask_depth_within_5c` | string | Sum of ask sizes priced at most `best_ask + 0.05`. |
| `raw_file` | string | Repository-relative raw gzip JSONL path. |
| `raw_line_number` | number | 1-based source line number in `raw_file`. |
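
The type column above can be checked mechanically. This is a minimal sketch, not the project's validator: `contract_violations` and the constant names are hypothetical, and it assumes that nullable fields must still be present as keys.

```python
from typing import List

SCHEMA_NAME = "normalized_orderbook_snapshot"
BANDS = ("1c", "2c", "5c")
STRING_FIELDS = (
    "schema_name", "market_name", "market_slug", "condition_id", "token_id",
    "outcome", "collected_at_utc", "bid_depth_total", "ask_depth_total",
    "raw_file",
) + tuple(f"{side}_depth_within_{band}" for band in BANDS for side in ("bid", "ask"))
NULLABLE_STRING_FIELDS = ("best_bid", "best_ask", "spread", "midpoint")
NUMBER_FIELDS = ("schema_version", "raw_line_number")


def contract_violations(row: dict) -> List[str]:
    """List every field whose JSON type disagrees with the field contract."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(row.get(name), str):
            problems.append(f"{name}: expected string")
    for name in NULLABLE_STRING_FIELDS:
        # Assumption: the key must exist even when its value is null.
        if name not in row or not (row[name] is None or isinstance(row[name], str)):
            problems.append(f"{name}: expected string or null")
    for name in NUMBER_FIELDS:
        value = row.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected number")
    if row.get("schema_name") != SCHEMA_NAME:
        problems.append("schema_name: unexpected value")
    return problems
```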
## Numeric Encoding
Prices and sizes are parsed with Python `Decimal`. Derived numeric values are
emitted as exact decimal strings rather than JSON numbers, which keeps precision
visible and avoids binary floating-point rounding.

Price-derived values that cannot be computed (`best_bid`, `best_ask`, `spread`,
`midpoint`) are emitted as `null`. Depth totals and depth bands are always
emitted as decimal strings and use `"0"` when the relevant side is empty.
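
The difference between float and `Decimal` arithmetic is easy to see in a small example. The helper name `depth_total` is hypothetical, and using `str()` to serialize the `Decimal` is an assumption about how the strings are produced.

```python
import json
from decimal import Decimal
from typing import List


def depth_total(sizes: List[str]) -> str:
    """Sum size strings exactly and return the result as a decimal string."""
    return str(sum((Decimal(size) for size in sizes), Decimal("0")))


float_total = 0.1 + 0.2                      # 0.30000000000000004
exact_total = depth_total(["0.1", "0.2"])    # "0.3"
empty_total = depth_total([])                # "0" for an empty side

# Missing price-derived values serialize as JSON null, depths as strings.
line = json.dumps({"spread": None, "bid_depth_total": empty_total})
# line == '{"spread": null, "bid_depth_total": "0"}'
```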
## Calculation Rules
- `best_bid`: maximum bid price.
- `best_ask`: minimum ask price.
- `spread`: `best_ask - best_bid` when both sides exist.
- `midpoint`: `(best_bid + best_ask) / 2` when both sides exist.
- `bid_depth_total`: sum of all bid sizes.
- `ask_depth_total`: sum of all ask sizes.
- `bid_depth_within_1c`: sum of bid sizes with price greater than or equal to
  `best_bid - 0.01`.
- `ask_depth_within_1c`: sum of ask sizes with price less than or equal to
  `best_ask + 0.01`.
- The `within_2c` and `within_5c` fields apply the same band rule with offsets
  `0.02` and `0.05`.
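
The rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's normalizer: `derive_metrics` is a hypothetical name, and the input is assumed to be lists of `(price, size)` string pairs, which may differ from the raw envelope's actual layout.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

Levels = List[Tuple[str, str]]  # hypothetical input shape: (price, size) strings
BANDS = (("1c", Decimal("0.01")), ("2c", Decimal("0.02")), ("5c", Decimal("0.05")))


def _text(value: Optional[Decimal]) -> Optional[str]:
    return None if value is None else str(value)


def derive_metrics(bids: Levels, asks: Levels) -> Dict[str, Optional[str]]:
    bid_levels = [(Decimal(p), Decimal(s)) for p, s in bids]
    ask_levels = [(Decimal(p), Decimal(s)) for p, s in asks]
    zero = Decimal("0")

    best_bid = max((p for p, _ in bid_levels), default=None)
    best_ask = min((p for p, _ in ask_levels), default=None)
    both_sides = best_bid is not None and best_ask is not None

    out = {
        "best_bid": _text(best_bid),
        "best_ask": _text(best_ask),
        "spread": _text(best_ask - best_bid) if both_sides else None,
        "midpoint": _text((best_bid + best_ask) / 2) if both_sides else None,
        "bid_depth_total": str(sum((s for _, s in bid_levels), zero)),
        "ask_depth_total": str(sum((s for _, s in ask_levels), zero)),
    }
    for label, width in BANDS:
        # On an empty side the filter never runs, so the band sums to "0".
        out[f"bid_depth_within_{label}"] = str(sum(
            (s for p, s in bid_levels if p >= best_bid - width), zero))
        out[f"ask_depth_within_{label}"] = str(sum(
            (s for p, s in ask_levels if p <= best_ask + width), zero))
    return out
```

For example, with made-up bids at `0.48`, `0.47`, and `0.40` and asks at `0.52` and `0.55`, the 1c bid band covers the first two bid levels only, and the 5c ask band covers both ask levels.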
## Sanity Rules
A normalized file should pass these checks:
- Output row count equals raw input row count unless skipped rows are recorded.
- Every row has `raw_file` and `raw_line_number`.
- Every referenced raw file exists.
- `spread` is non-negative whenever both sides exist.
- `midpoint` is between `best_bid` and `best_ask` whenever both sides exist.
- Depth totals and band depths are non-negative.
- At least one `Up` row and one `Down` row exist in the sample.
- The gzip JSONL file decompresses and every line parses as JSON.
- The manifest checksum matches the normalized output file.
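
Several of these checks operate on rows alone and can be sketched as follows. The function names are hypothetical, and the sketch covers only the lineage, spread, midpoint, depth, and outcome checks. Row counts, raw-file existence, and the manifest checksum depend on files and a manifest format not described here.

```python
from decimal import Decimal
from typing import Iterable, List

DEPTH_FIELDS = ["bid_depth_total", "ask_depth_total"] + [
    f"{side}_depth_within_{band}"
    for band in ("1c", "2c", "5c") for side in ("bid", "ask")
]


def row_problems(row: dict) -> List[str]:
    """Return the per-row sanity failures for one normalized row."""
    problems = []
    for name in ("raw_file", "raw_line_number"):
        if row.get(name) in (None, ""):
            problems.append(f"missing {name}")
    bid, ask = row.get("best_bid"), row.get("best_ask")
    if bid is not None and ask is not None:
        spread, midpoint = row.get("spread"), row.get("midpoint")
        if spread is None or midpoint is None:
            problems.append("spread or midpoint missing while both sides exist")
        else:
            if Decimal(spread) < 0:
                problems.append("negative spread")
            if not Decimal(bid) <= Decimal(midpoint) <= Decimal(ask):
                problems.append("midpoint outside best_bid..best_ask")
    for name in DEPTH_FIELDS:
        value = row.get(name)
        if value is None or Decimal(value) < 0:
            problems.append(f"{name} missing or negative")
    return problems


def sample_problems(rows: Iterable[dict]) -> List[str]:
    """Run row checks over a sample and require both outcome labels."""
    problems, outcomes = [], set()
    for index, row in enumerate(rows, start=1):
        outcomes.add(row.get("outcome"))
        problems += [f"row {index}: {p}" for p in row_problems(row)]
    for expected in ("Up", "Down"):
        if expected not in outcomes:
            problems.append(f"no {expected} row in sample")
    return problems
```

An empty list from `sample_problems` means the covered checks passed. It does not replace the file-level checks in the list above.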
## Current Known Gaps
- This schema covers a derived sample extract only.
- It does not define sustained daily normalized partitions.
- It does not include upload, daemon runtime, dashboards, databases, strategy
code, backtests, trading behavior, or wallet behavior.
- Long-run schema stability still depends on future collection and soak-test
evidence.