orderbooks/docs/ORDERBOOK_SCHEMA.md
philipp 284e465588
Prepare Kubernetes orderbooks deployment
2026-04-18 11:23:28 +02:00

Orderbook Snapshot Schema

Status: valid

This document covers the Checkpoint 5 normalized order-book sample. The raw gzip JSONL files remain the source of truth. Normalized rows are derived records for quick inspection and later quality checks.

Normalized Snapshot

Schema name: normalized_orderbook_snapshot

Schema version: 1

File format: gzip JSONL, one JSON object per line.

Sample location:

data/normalized_sample/polymarket/orderbooks/<run_id>/polymarket_orderbooks_normalized_<run_id>.jsonl.gz

Every normalized row must reference exactly one raw gzip JSONL source row:

  • raw_file: repository-relative path to the raw gzip JSONL file.
  • raw_line_number: 1-based line number inside that raw gzip JSONL file.

Derived data is invalid if either lineage field is missing, or if raw_file points to a raw file that does not exist.
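The lineage lookup can be sketched as follows. This is a minimal illustration, not the project's own code: the helper name load_raw_row and the repo_root parameter are hypothetical, and the raw row is assumed to be a JSON object.

```python
import gzip
import json
from pathlib import Path


def load_raw_row(repo_root: Path, raw_file: str, raw_line_number: int) -> dict:
    """Return the raw JSON object that a normalized row points at."""
    path = repo_root / raw_file  # raw_file is repository-relative
    if not path.is_file():
        raise FileNotFoundError(f"raw file missing: {raw_file}")
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        # raw_line_number is 1-based, so enumerate starts at 1.
        for line_number, line in enumerate(fh, start=1):
            if line_number == raw_line_number:
                return json.loads(line)
    raise ValueError(f"{raw_file} has no line {raw_line_number}")
```

Streaming the file line by line avoids loading a large raw capture into memory just to resolve one reference.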

Field Contract

| Field | Type | Meaning |
|---|---|---|
| schema_name | string | Always normalized_orderbook_snapshot. |
| schema_version | number | Schema version, currently 1. |
| market_name | string | Market source name from the raw envelope. |
| market_slug | string | Polymarket market slug from the raw envelope. |
| condition_id | string | Polymarket condition ID from the raw envelope. |
| token_id | string | Polymarket CLOB token ID from the raw envelope. |
| outcome | string | Outcome label associated with token_id. |
| collected_at_utc | string | Collector timestamp from the raw envelope. |
| best_bid | string or null | Maximum bid price, or null when no bids exist. |
| best_ask | string or null | Minimum ask price, or null when no asks exist. |
| spread | string or null | best_ask - best_bid when both sides exist, otherwise null. |
| midpoint | string or null | (best_bid + best_ask) / 2 when both sides exist, otherwise null. |
| bid_depth_total | string | Sum of all bid sizes. |
| ask_depth_total | string | Sum of all ask sizes. |
| bid_depth_within_1c | string | Sum of bid sizes priced at or above best_bid - 0.01. |
| ask_depth_within_1c | string | Sum of ask sizes priced at or below best_ask + 0.01. |
| bid_depth_within_2c | string | Sum of bid sizes priced at or above best_bid - 0.02. |
| ask_depth_within_2c | string | Sum of ask sizes priced at or below best_ask + 0.02. |
| bid_depth_within_5c | string | Sum of bid sizes priced at or above best_bid - 0.05. |
| ask_depth_within_5c | string | Sum of ask sizes priced at or below best_ask + 0.05. |
| raw_file | string | Repository-relative raw gzip JSONL path. |
| raw_line_number | number | 1-based source line number in raw_file. |

Numeric Encoding

Prices and sizes are parsed with Python's decimal.Decimal. Derived numeric values are emitted as exact decimal strings rather than JSON numbers. This keeps precision visible and avoids binary floating-point rounding.

Missing price-derived values are emitted as null. Depth totals and depth bands are emitted as decimal strings and use "0" when the relevant side is empty.
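A small sketch of this encoding, under stated assumptions: the helper name encode_decimal is hypothetical, and fixed-point formatting via format(value, "f") is one plausible way to produce an exact decimal string, not necessarily the one the normalizer uses.

```python
import json
from decimal import Decimal


def encode_decimal(value):
    """Encode a Decimal as an exact string; pass None through as JSON null."""
    # format(..., "f") is an assumption: it avoids exponent notation.
    return None if value is None else format(value, "f")


# Binary floats accumulate rounding error; Decimal does not.
print(0.1 + 0.2)                                        # 0.30000000000000004
print(encode_decimal(Decimal("0.1") + Decimal("0.2")))  # 0.3

# An empty bid side: price-derived field is null, depth total is "0".
row = {
    "best_bid": encode_decimal(None),
    "bid_depth_total": encode_decimal(sum([], Decimal("0"))),
}
print(json.dumps(row))  # {"best_bid": null, "bid_depth_total": "0"}
```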

Calculation Rules

  • best_bid: maximum bid price.
  • best_ask: minimum ask price.
  • spread: best_ask - best_bid when both sides exist.
  • midpoint: (best_bid + best_ask) / 2 when both sides exist.
  • bid_depth_total: sum of all bid sizes.
  • ask_depth_total: sum of all ask sizes.
  • bid_depth_within_1c: sum of bid sizes with price greater than or equal to best_bid - 0.01.
  • ask_depth_within_1c: sum of ask sizes with price less than or equal to best_ask + 0.01.
  • The _within_2c and _within_5c fields apply the same band rule with widths of 0.02 and 0.05.
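The rules above can be sketched in a few lines of Python. This is an illustration under assumptions: the function name summarize_book, the fmt helper, and the input shape (lists of (price, size) string pairs) are hypothetical, since this document does not describe the raw envelope layout.

```python
from decimal import Decimal

# Band widths in price units (1, 2, and 5 cents).
BANDS = {"1c": Decimal("0.01"), "2c": Decimal("0.02"), "5c": Decimal("0.05")}
ZERO = Decimal("0")


def fmt(value):
    """Exact decimal string, or None for a missing price-derived value."""
    return None if value is None else format(value, "f")


def summarize_book(bids, asks):
    """Compute the derived fields from (price, size) string pairs."""
    bids = [(Decimal(p), Decimal(s)) for p, s in bids]
    asks = [(Decimal(p), Decimal(s)) for p, s in asks]
    best_bid = max((p for p, _ in bids), default=None)
    best_ask = min((p for p, _ in asks), default=None)
    both = best_bid is not None and best_ask is not None

    out = {
        "best_bid": fmt(best_bid),
        "best_ask": fmt(best_ask),
        "spread": fmt(best_ask - best_bid) if both else None,
        "midpoint": fmt((best_bid + best_ask) / 2) if both else None,
        "bid_depth_total": fmt(sum((s for _, s in bids), ZERO)),
        "ask_depth_total": fmt(sum((s for _, s in asks), ZERO)),
    }
    for label, width in BANDS.items():
        # An empty side yields an empty generator, so the band sum stays 0
        # and a None best price is never used in arithmetic.
        out[f"bid_depth_within_{label}"] = fmt(
            sum((s for p, s in bids if p >= best_bid - width), ZERO))
        out[f"ask_depth_within_{label}"] = fmt(
            sum((s for p, s in asks if p <= best_ask + width), ZERO))
    return out
```

For example, with bids of 100 at 0.48, 50 at 0.47, and 25 at 0.45, the 1c band covers prices from 0.47 upward, so bid_depth_within_1c is "150", while the 5c band reaches down to 0.43 and equals the full bid depth of "175".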

Sanity Rules

A normalized file should pass these checks:

  • Output row count equals raw input row count unless skipped rows are recorded.
  • Every row has raw_file and raw_line_number.
  • Every referenced raw file exists.
  • spread is non-negative whenever both sides exist.
  • midpoint lies between best_bid and best_ask (inclusive) whenever both sides exist.
  • Depth totals and band depths are non-negative.
  • At least one Up row and one Down row exist in the sample.
  • The gzip JSONL file decompresses and every line parses as JSON.
  • The manifest checksum matches the normalized output file.
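A minimal sketch of the row-level checks from this list, assuming rows have already been decoded into dicts that follow the field contract. The name check_rows and the repo_root parameter are hypothetical, and "Up" and "Down" are assumed to be values of the outcome field. The row-count and manifest-checksum checks are left out because the skip log and manifest formats are not described here.

```python
from decimal import Decimal
from pathlib import Path

DEPTH_FIELDS = [
    f"{side}_depth_{band}"
    for side in ("bid", "ask")
    for band in ("total", "within_1c", "within_2c", "within_5c")
]


def check_rows(rows, repo_root):
    """Return a list of human-readable violations; empty means all passed."""
    errors = []
    outcomes = set()
    for i, row in enumerate(rows, start=1):
        outcomes.add(row.get("outcome"))

        # Lineage: both fields present, and the raw file exists.
        raw_file = row.get("raw_file")
        if raw_file is None or row.get("raw_line_number") is None:
            errors.append(f"row {i}: missing lineage field")
        elif not (Path(repo_root) / raw_file).is_file():
            errors.append(f"row {i}: raw file not found: {raw_file}")

        # Price checks apply only when both sides exist.
        bid, ask = row.get("best_bid"), row.get("best_ask")
        if bid is not None and ask is not None:
            low, high = sorted((Decimal(bid), Decimal(ask)))
            if Decimal(row["spread"]) < 0:
                errors.append(f"row {i}: negative spread")
            if not low <= Decimal(row["midpoint"]) <= high:
                errors.append(f"row {i}: midpoint outside best_bid..best_ask")

        for name in DEPTH_FIELDS:
            if Decimal(row[name]) < 0:
                errors.append(f"row {i}: negative {name}")

    for label in ("Up", "Down"):
        if label not in outcomes:
            errors.append(f"no {label} row in sample")
    return errors
```

Collecting every violation rather than stopping at the first one makes a single run enough to see how widespread a problem is.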

Current Known Gaps

  • This schema covers a derived sample extract only.
  • It does not define sustained daily normalized partitions.
  • It does not include upload, daemon runtime, dashboards, databases, strategy code, backtests, trading behavior, or wallet behavior.
  • Long-run schema stability still depends on future collection and soak-test evidence.