# Data Contract The archive is raw-first. Raw market data must be preserved before normalization, aggregation, upload, or analysis. ## Storage Principles - Store the raw response payload exactly as received whenever practical. - Add collector metadata beside the raw payload, not inside it. - Use UTC timestamps in ISO 8601 format with a `Z` suffix. - Use gzip JSONL for high-frequency snapshot data. - Rotate live collection files by hour or run. - Include checksums in manifests for all closed files. - Keep normalized files derived and traceable back to raw files. - Never store secrets, cookies, private keys, wallet material, or authenticated session state. ## Directory Layout Initial expected layout: ```text data/ probes/ discovery/ live_sample/ normalized_sample/ manifests/ reports/ checkpoints/ ``` Future sustained collection layout: ```text data/ raw/ polymarket/ orderbooks/ YYYY/ MM/ DD/ HH/ polymarket_orderbooks_YYYYMMDDTHHMMSSZ.jsonl.gz normalized/ polymarket/ orderbooks/ YYYY/ MM/ DD/ polymarket_orderbooks_normalized_YYYYMMDD.jsonl.gz manifests/ ``` Do not create a database until compressed file archives are proven painful. ## Raw Orderbook Snapshot Envelope Checkpoint 4 should store one JSON object per line using this envelope or a documented successor: ```json { "schema_name": "raw_orderbook_snapshot", "schema_version": 1, "collector": { "name": "polymarket_orderbook_collector", "version": "0.1.0" }, "market": { "market_name": "polymarket", "market_slug": "example-slug", "condition_id": "0x...", "token_id": "123", "outcome": "Yes" }, "collection": { "collected_at_utc": "2026-04-14T20:53:49Z", "sequence": 1 }, "request": { "method": "GET", "url": "https://example.invalid/orderbook", "params": { "token_id": "123" }, "status_code": 200, "duration_ms": 123 }, "raw": {} } ``` `raw` is the unmodified response payload. If the endpoint returns text or bytes, record encoding and store a lossless representation. ## Discovery Record Fields Checkpoint 3 normalized market records should include: - `market_name` - `market_slug` - `title` or `question` - `condition_id` - `tokens` - `outcomes` - `start_time_utc`, if available - `end_time_utc`, if available - `active` - `closed` - `endpoint_source` - `fetched_at_utc` - `raw_ref` `tokens` should preserve the mapping between outcome labels and token IDs. ## Normalized Snapshot Fields Checkpoint 5 normalized records should include: - `market_name` - `market_slug` - `condition_id` - `token_id` - `outcome` - `collected_at_utc` - `best_bid` - `best_ask` - `spread` - `midpoint` - `bid_depth_total` - `ask_depth_total` - `bid_depth_within_1c` - `ask_depth_within_1c` - `bid_depth_within_2c` - `ask_depth_within_2c` - `bid_depth_within_5c` - `ask_depth_within_5c` - `raw_file` - `raw_line_number`, when feasible Normalized data is invalid if it cannot reference the raw source record. ## Manifest Requirements Collection and transformation manifests should include: - manifest schema name and version - checkpoint or process name - start and end timestamps - market names and market IDs tracked - input files - output files - request counts - success and failure counts - status-code counts - row counts - checksums for closed files - command used - config path or config digest - warnings and known gaps - gate status Checksums should use SHA-256 unless a later report explains why another hash is used. ## Timestamp Policy - `collected_at_utc`: local collector timestamp taken as close as possible to receipt of data. - `fetched_at_utc`: timestamp for metadata or discovery fetches. - Endpoint-provided timestamps must be preserved under their original field names in `raw`. - If endpoint timestamp semantics are unclear, write the ambiguity into the probe report.