Data Contract
The archive is raw-first. Raw market data must be preserved before normalization, aggregation, upload, or analysis.
Storage Principles
- Store the raw response payload exactly as received whenever practical.
- Add collector metadata beside the raw payload, not inside it.
- Use UTC timestamps in ISO 8601 format with a Z suffix.
- Use gzip JSONL for high-frequency snapshot data.
- Rotate live collection files by hour or run.
- Include checksums in manifests for all closed files.
- Keep normalized files derived and traceable back to raw files.
- Never store secrets, cookies, private keys, wallet material, or authenticated session state.
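As an illustration of the gzip JSONL principle, the following is a minimal sketch rather than the collector's actual code. The function names `append_jsonl_gz` and `read_jsonl_gz` are hypothetical. It relies on the fact that appending to a gzip file adds a new gzip member, which Python's `gzip` module reads back as one continuous stream.

```python
import gzip
import json
from pathlib import Path


def append_jsonl_gz(path: Path, records: list[dict]) -> int:
    """Append records to a gzip JSONL file, one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # "at" = append, text mode. Each append session writes a new gzip member.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        for record in records:
            line = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
            fh.write(line + "\n")
    return len(records)


def read_jsonl_gz(path: Path) -> list[dict]:
    """Read every non-empty line of a gzip JSONL file back into dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Rotation by hour or run would then amount to choosing a new `path` and never reopening a file once it is closed and checksummed.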
Directory Layout
Initial expected layout:
data/
probes/
discovery/
live_sample/
normalized_sample/
manifests/
reports/
checkpoints/
Future sustained collection layout:
data/
  raw/
    polymarket/
      orderbooks/
        YYYY/
          MM/
            DD/
              HH/
                polymarket_orderbooks_YYYYMMDDTHHMMSSZ.jsonl.gz
  normalized/
    polymarket/
      orderbooks/
        YYYY/
          MM/
            DD/
              polymarket_orderbooks_normalized_YYYYMMDD.jsonl.gz
  manifests/
Do not create a database until compressed file archives are proven painful.
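The sustained collection layout can be derived from a single timestamp. The sketch below is one possible way to build those paths. The helper names `raw_orderbook_path` and `normalized_orderbook_path` and the `root` parameter are assumptions for illustration, not part of the contract.

```python
from datetime import date, datetime, timezone
from pathlib import PurePosixPath


def raw_orderbook_path(opened_at: datetime, root: str = "data") -> PurePosixPath:
    """Path of an hourly raw file, keyed by the UTC time the file was opened."""
    if opened_at.tzinfo is None:
        raise ValueError("opened_at must be timezone-aware")
    ts = opened_at.astimezone(timezone.utc)
    filename = f"polymarket_orderbooks_{ts:%Y%m%dT%H%M%SZ}.jsonl.gz"
    return PurePosixPath(
        root, "raw", "polymarket", "orderbooks",
        f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}", f"{ts:%H}", filename,
    )


def normalized_orderbook_path(day: date, root: str = "data") -> PurePosixPath:
    """Path of the daily normalized file for a given UTC date."""
    filename = f"polymarket_orderbooks_normalized_{day:%Y%m%d}.jsonl.gz"
    return PurePosixPath(
        root, "normalized", "polymarket", "orderbooks",
        f"{day:%Y}", f"{day:%m}", f"{day:%d}", filename,
    )
```

Rejecting naive datetimes keeps a collector running in a local timezone from silently writing into the wrong hour directory.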
Raw Orderbook Snapshot Envelope
Checkpoint 4 should store one JSON object per line using this envelope or a documented successor:
{
  "schema_name": "raw_orderbook_snapshot",
  "schema_version": 1,
  "collector": {
    "name": "polymarket_orderbook_collector",
    "version": "0.1.0"
  },
  "market": {
    "market_name": "polymarket",
    "market_slug": "example-slug",
    "condition_id": "0x...",
    "token_id": "123",
    "outcome": "Yes"
  },
  "collection": {
    "collected_at_utc": "2026-04-14T20:53:49Z",
    "sequence": 1
  },
  "request": {
    "method": "GET",
    "url": "https://example.invalid/orderbook",
    "params": {
      "token_id": "123"
    },
    "status_code": 200,
    "duration_ms": 123
  },
  "raw": {}
}
raw is the unmodified response payload. If the endpoint returns text or bytes rather than JSON, record the encoding and store a lossless representation.
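A minimal sketch of wrapping a response in this envelope follows. The function name `build_envelope` is hypothetical, and the `{"encoding": "base64", "data": ...}` wrapper for byte payloads is only one possible lossless representation, chosen here as an assumption. Collector metadata stays beside the payload and is never merged into it.

```python
import base64


def build_envelope(body, *, market: dict, collected_at_utc: str,
                   sequence: int, request: dict) -> dict:
    """Wrap a response body in the raw_orderbook_snapshot envelope."""
    if isinstance(body, (bytes, bytearray)):
        # Assumed convention: bytes are not JSON-representable, so store
        # them base64-encoded and record that encoding next to the data.
        raw = {
            "encoding": "base64",
            "data": base64.b64encode(bytes(body)).decode("ascii"),
        }
    else:
        # Decoded JSON (dict, list, str, number) is stored untouched.
        raw = body
    return {
        "schema_name": "raw_orderbook_snapshot",
        "schema_version": 1,
        "collector": {
            "name": "polymarket_orderbook_collector",
            "version": "0.1.0",
        },
        "market": market,
        "collection": {
            "collected_at_utc": collected_at_utc,
            "sequence": sequence,
        },
        "request": request,
        "raw": raw,
    }
```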
Discovery Record Fields
Checkpoint 3 normalized market records should include:
- market_name
- market_slug
- title or question
- condition_id
- tokens
- outcomes
- start_time_utc, if available
- end_time_utc, if available
- active
- closed
- endpoint_source
- fetched_at_utc
- raw_ref
tokens should preserve the mapping between outcome labels and token IDs.
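One way to keep that mapping explicit is to store each outcome label together with its token ID instead of relying on the positions of two separate lists. The sketch below assumes, hypothetically, that the discovery endpoint yields parallel lists of outcomes and token IDs. The real payload shape should be confirmed by a probe.

```python
def pair_tokens(outcomes: list[str], token_ids: list[str]) -> list[dict]:
    """Pair outcome labels with token IDs so the mapping survives reordering."""
    if len(outcomes) != len(token_ids):
        # A length mismatch means the mapping is ambiguous; fail loudly
        # rather than guess which token belongs to which outcome.
        raise ValueError(
            f"{len(outcomes)} outcomes but {len(token_ids)} token IDs"
        )
    return [
        {"outcome": outcome, "token_id": token_id}
        for outcome, token_id in zip(outcomes, token_ids)
    ]
```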
Normalized Snapshot Fields
Checkpoint 5 normalized records should include:
- market_name
- market_slug
- condition_id
- token_id
- outcome
- collected_at_utc
- best_bid
- best_ask
- spread
- midpoint
- bid_depth_total
- ask_depth_total
- bid_depth_within_1c
- ask_depth_within_1c
- bid_depth_within_2c
- ask_depth_within_2c
- bid_depth_within_5c
- ask_depth_within_5c
- raw_file
- raw_line_number, when feasible
Normalized data is invalid if it cannot reference the raw source record.
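The sketch below shows how these fields might be derived from one raw envelope. It rests on several assumptions that a probe report should confirm: that raw contains `bids` and `asks` lists of `{"price", "size"}` levels, that prices are quoted on a 0 to 1 scale so one cent is 0.01, and that "within Nc" means within N cents of the best price on the same side. The function name `normalize_snapshot` is hypothetical.

```python
from decimal import Decimal


def _levels(side: list[dict]) -> list[tuple[Decimal, Decimal]]:
    # Parse via str() so float noise never enters the Decimal values.
    return [(Decimal(str(lv["price"])), Decimal(str(lv["size"]))) for lv in side]


def _depth(levels, best, cents, is_bid: bool) -> Decimal:
    """Total size within `cents` of the best price (None = no limit)."""
    if best is None:
        return Decimal("0")
    if cents is None:
        return sum((s for _, s in levels), Decimal("0"))
    band = Decimal(cents) / 100  # assumed: 1 cent == 0.01 price units
    if is_bid:
        return sum((s for p, s in levels if p >= best - band), Decimal("0"))
    return sum((s for p, s in levels if p <= best + band), Decimal("0"))


def normalize_snapshot(envelope: dict, raw_file: str, raw_line_number: int) -> dict:
    raw = envelope["raw"]
    bids = _levels(raw.get("bids", []))
    asks = _levels(raw.get("asks", []))
    best_bid = max((p for p, _ in bids), default=None)
    best_ask = min((p for p, _ in asks), default=None)
    two_sided = best_bid is not None and best_ask is not None

    market = envelope["market"]
    record = {
        "market_name": market["market_name"],
        "market_slug": market["market_slug"],
        "condition_id": market["condition_id"],
        "token_id": market["token_id"],
        "outcome": market["outcome"],
        "collected_at_utc": envelope["collection"]["collected_at_utc"],
        "best_bid": best_bid,
        "best_ask": best_ask,
        "spread": best_ask - best_bid if two_sided else None,
        "midpoint": (best_bid + best_ask) / 2 if two_sided else None,
        "bid_depth_total": _depth(bids, best_bid, None, True),
        "ask_depth_total": _depth(asks, best_ask, None, False),
        # Traceability back to the raw record is mandatory.
        "raw_file": raw_file,
        "raw_line_number": raw_line_number,
    }
    for cents in (1, 2, 5):
        record[f"bid_depth_within_{cents}c"] = _depth(bids, best_bid, cents, True)
        record[f"ask_depth_within_{cents}c"] = _depth(asks, best_ask, cents, False)
    return record
```

The values are `Decimal` objects, so a writer would need to convert them (for example with `str`) before serializing to JSON.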
Manifest Requirements
Collection and transformation manifests should include:
- manifest schema name and version
- checkpoint or process name
- start and end timestamps
- market names and market IDs tracked
- input files
- output files
- request counts
- success and failure counts
- status-code counts
- row counts
- checksums for closed files
- command used
- config path or config digest
- warnings and known gaps
- gate status
Checksums should use SHA-256 unless a later report explains why another hash is used.
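Computing the checksum for a closed file could look like the sketch below. The helper names and the keys of the returned entry (`path`, `bytes`, `sha256`) are assumptions for illustration. The manifest schema itself is not fixed by this section.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large archives fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_file_entry(path: Path) -> dict:
    """Hypothetical manifest entry for one closed file."""
    return {
        "path": path.as_posix(),
        "bytes": path.stat().st_size,
        "sha256": sha256_file(path),
    }
```

The hash covers the compressed bytes on disk, so it should only be taken after the file is closed and will change if the file is ever recompressed.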
Timestamp Policy
- collected_at_utc: local collector timestamp taken as close as possible to receipt of data.
- fetched_at_utc: timestamp for metadata or discovery fetches.
- Endpoint-provided timestamps must be preserved under their original field names in raw.
- If endpoint timestamp semantics are unclear, write the ambiguity into the probe report.
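A minimal sketch of producing collector timestamps in the required form (UTC, ISO 8601, Z suffix) follows. The function names are hypothetical, and truncating to whole seconds is an assumption based on the envelope example. A collector that needs finer resolution could pass a different `timespec`.

```python
from datetime import datetime, timezone


def to_utc_z(dt: datetime, timespec: str = "seconds") -> str:
    """Format an aware datetime as ISO 8601 in UTC with a Z suffix."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: timezone is ambiguous")
    text = dt.astimezone(timezone.utc).isoformat(timespec=timespec)
    # isoformat() renders UTC as "+00:00"; the contract asks for "Z".
    return text.replace("+00:00", "Z")


def utc_now_z() -> str:
    """Collector-side timestamp, to be taken right after the response arrives."""
    return to_utc_z(datetime.now(timezone.utc))
```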