orderbooks/docs/DATA_CONTRACT.md
philipp 284e465588
Some checks failed
deploy / deploy (push) Has been cancelled
Prepare Kubernetes orderbooks deployment
2026-04-18 11:23:28 +02:00

168 lines
3.9 KiB
Markdown

# Data Contract
The archive is raw-first. Raw market data must be preserved before normalization, aggregation, upload, or analysis.
## Storage Principles
- Store the raw response payload exactly as received whenever practical.
- Add collector metadata beside the raw payload, not inside it.
- Use UTC timestamps in ISO 8601 format with a `Z` suffix.
- Use gzip JSONL for high-frequency snapshot data.
- Rotate live collection files by hour or run.
- Include checksums in manifests for all closed files.
- Keep normalized files derived and traceable back to raw files.
- Never store secrets, cookies, private keys, wallet material, or authenticated session state.
## Directory Layout
Initial expected layout:
```text
data/
probes/
discovery/
live_sample/
normalized_sample/
manifests/
reports/
checkpoints/
```
Future sustained collection layout:
```text
data/
raw/
polymarket/
orderbooks/
YYYY/
MM/
DD/
HH/
polymarket_orderbooks_YYYYMMDDTHHMMSSZ.jsonl.gz
normalized/
polymarket/
orderbooks/
YYYY/
MM/
DD/
polymarket_orderbooks_normalized_YYYYMMDD.jsonl.gz
manifests/
```
Do not create a database until compressed file archives are proven painful.
## Raw Orderbook Snapshot Envelope
Checkpoint 4 should store one JSON object per line using this envelope or a documented successor:
```json
{
"schema_name": "raw_orderbook_snapshot",
"schema_version": 1,
"collector": {
"name": "polymarket_orderbook_collector",
"version": "0.1.0"
},
"market": {
"market_name": "polymarket",
"market_slug": "example-slug",
"condition_id": "0x...",
"token_id": "123",
"outcome": "Yes"
},
"collection": {
"collected_at_utc": "2026-04-14T20:53:49Z",
"sequence": 1
},
"request": {
"method": "GET",
"url": "https://example.invalid/orderbook",
"params": {
"token_id": "123"
},
"status_code": 200,
"duration_ms": 123
},
"raw": {}
}
```
`raw` is the unmodified response payload. If the endpoint returns text or bytes, record encoding and store a lossless representation.
## Discovery Record Fields
Checkpoint 3 normalized market records should include:
- `market_name`
- `market_slug`
- `title` or `question`
- `condition_id`
- `tokens`
- `outcomes`
- `start_time_utc`, if available
- `end_time_utc`, if available
- `active`
- `closed`
- `endpoint_source`
- `fetched_at_utc`
- `raw_ref`
`tokens` should preserve the mapping between outcome labels and token IDs.
## Normalized Snapshot Fields
Checkpoint 5 normalized records should include:
- `market_name`
- `market_slug`
- `condition_id`
- `token_id`
- `outcome`
- `collected_at_utc`
- `best_bid`
- `best_ask`
- `spread`
- `midpoint`
- `bid_depth_total`
- `ask_depth_total`
- `bid_depth_within_1c`
- `ask_depth_within_1c`
- `bid_depth_within_2c`
- `ask_depth_within_2c`
- `bid_depth_within_5c`
- `ask_depth_within_5c`
- `raw_file`
- `raw_line_number`, when feasible
Normalized data is invalid if it cannot reference the raw source record.
## Manifest Requirements
Collection and transformation manifests should include:
- manifest schema name and version
- checkpoint or process name
- start and end timestamps
- market names and market IDs tracked
- input files
- output files
- request counts
- success and failure counts
- status-code counts
- row counts
- checksums for closed files
- command used
- config path or config digest
- warnings and known gaps
- gate status
Checksums should use SHA-256 unless a later report explains why another hash is used.
## Timestamp Policy
- `collected_at_utc`: local collector timestamp taken as close as possible to receipt of data.
- `fetched_at_utc`: timestamp for metadata or discovery fetches.
- Endpoint-provided timestamps must be preserved under their original field names in `raw`.
- If endpoint timestamp semantics are unclear, write the ambiguity into the probe report.