orderbooks/docs/DATA_CONTRACT.md
philipp 284e465588
Some checks failed
deploy / deploy (push) Has been cancelled
Prepare Kubernetes orderbooks deployment
2026-04-18 11:23:28 +02:00

3.9 KiB

Data Contract

The archive is raw-first. Raw market data must be preserved before normalization, aggregation, upload, or analysis.

Storage Principles

  • Store the raw response payload exactly as received whenever practical.
  • Add collector metadata beside the raw payload, not inside it.
  • Use UTC timestamps in ISO 8601 format with a Z suffix.
  • Use gzip JSONL for high-frequency snapshot data.
  • Rotate live collection files by hour or run.
  • Include checksums in manifests for all closed files.
  • Keep normalized files derived and traceable back to raw files.
  • Never store secrets, cookies, private keys, wallet material, or authenticated session state.

Directory Layout

Initial expected layout:

data/
  probes/
  discovery/
  live_sample/
  normalized_sample/
  manifests/
reports/
  checkpoints/

Future sustained collection layout:

data/
  raw/
    polymarket/
      orderbooks/
        YYYY/
          MM/
            DD/
              HH/
                polymarket_orderbooks_YYYYMMDDTHHMMSSZ.jsonl.gz
  normalized/
    polymarket/
      orderbooks/
        YYYY/
          MM/
            DD/
              polymarket_orderbooks_normalized_YYYYMMDD.jsonl.gz
  manifests/

Do not create a database until compressed file archives are proven painful.

Raw Orderbook Snapshot Envelope

Checkpoint 4 should store one JSON object per line using this envelope or a documented successor:

{
  "schema_name": "raw_orderbook_snapshot",
  "schema_version": 1,
  "collector": {
    "name": "polymarket_orderbook_collector",
    "version": "0.1.0"
  },
  "market": {
    "market_name": "polymarket",
    "market_slug": "example-slug",
    "condition_id": "0x...",
    "token_id": "123",
    "outcome": "Yes"
  },
  "collection": {
    "collected_at_utc": "2026-04-14T20:53:49Z",
    "sequence": 1
  },
  "request": {
    "method": "GET",
    "url": "https://example.invalid/orderbook",
    "params": {
      "token_id": "123"
    },
    "status_code": 200,
    "duration_ms": 123
  },
  "raw": {}
}

raw is the unmodified response payload. If the endpoint returns text or bytes, record encoding and store a lossless representation.

Discovery Record Fields

Checkpoint 3 normalized market records should include:

  • market_name
  • market_slug
  • title or question
  • condition_id
  • tokens
  • outcomes
  • start_time_utc, if available
  • end_time_utc, if available
  • active
  • closed
  • endpoint_source
  • fetched_at_utc
  • raw_ref

tokens should preserve the mapping between outcome labels and token IDs.

Normalized Snapshot Fields

Checkpoint 5 normalized records should include:

  • market_name
  • market_slug
  • condition_id
  • token_id
  • outcome
  • collected_at_utc
  • best_bid
  • best_ask
  • spread
  • midpoint
  • bid_depth_total
  • ask_depth_total
  • bid_depth_within_1c
  • ask_depth_within_1c
  • bid_depth_within_2c
  • ask_depth_within_2c
  • bid_depth_within_5c
  • ask_depth_within_5c
  • raw_file
  • raw_line_number, when feasible

Normalized data is invalid if it cannot reference the raw source record.

Manifest Requirements

Collection and transformation manifests should include:

  • manifest schema name and version
  • checkpoint or process name
  • start and end timestamps
  • market names and market IDs tracked
  • input files
  • output files
  • request counts
  • success and failure counts
  • status-code counts
  • row counts
  • checksums for closed files
  • command used
  • config path or config digest
  • warnings and known gaps
  • gate status

Checksums should use SHA-256 unless a later report explains why another hash is used.

Timestamp Policy

  • collected_at_utc: local collector timestamp taken as close as possible to receipt of data.
  • fetched_at_utc: timestamp for metadata or discovery fetches.
  • Endpoint-provided timestamps must be preserved under their original field names in raw.
  • If endpoint timestamp semantics are unclear, write the ambiguity into the probe report.