168 lines
3.9 KiB
Markdown
168 lines
3.9 KiB
Markdown
# Data Contract
|
|
|
|
The archive is raw-first. Raw market data must be preserved before normalization, aggregation, upload, or analysis.
|
|
|
|
## Storage Principles
|
|
|
|
- Store the raw response payload exactly as received whenever practical.
|
|
- Add collector metadata beside the raw payload, not inside it.
|
|
- Use UTC timestamps in ISO 8601 format with a `Z` suffix.
|
|
- Use gzip JSONL for high-frequency snapshot data.
|
|
- Rotate live collection files by hour or run.
|
|
- Include checksums in manifests for all closed files.
|
|
- Keep normalized files derived and traceable back to raw files.
|
|
- Never store secrets, cookies, private keys, wallet material, or authenticated session state.
|
|
|
|
## Directory Layout
|
|
|
|
Initial expected layout:
|
|
|
|
```text
|
|
data/
|
|
probes/
|
|
discovery/
|
|
live_sample/
|
|
normalized_sample/
|
|
manifests/
|
|
reports/
|
|
checkpoints/
|
|
```
|
|
|
|
Future sustained collection layout:
|
|
|
|
```text
|
|
data/
|
|
raw/
|
|
polymarket/
|
|
orderbooks/
|
|
YYYY/
|
|
MM/
|
|
DD/
|
|
HH/
|
|
polymarket_orderbooks_YYYYMMDDTHHMMSSZ.jsonl.gz
|
|
normalized/
|
|
polymarket/
|
|
orderbooks/
|
|
YYYY/
|
|
MM/
|
|
DD/
|
|
polymarket_orderbooks_normalized_YYYYMMDD.jsonl.gz
|
|
manifests/
|
|
```
|
|
|
|
Do not create a database until compressed file archives are proven painful.
|
|
|
|
## Raw Orderbook Snapshot Envelope
|
|
|
|
Checkpoint 4 should store one JSON object per line using this envelope or a documented successor:
|
|
|
|
```json
|
|
{
|
|
"schema_name": "raw_orderbook_snapshot",
|
|
"schema_version": 1,
|
|
"collector": {
|
|
"name": "polymarket_orderbook_collector",
|
|
"version": "0.1.0"
|
|
},
|
|
"market": {
|
|
"market_name": "polymarket",
|
|
"market_slug": "example-slug",
|
|
"condition_id": "0x...",
|
|
"token_id": "123",
|
|
"outcome": "Yes"
|
|
},
|
|
"collection": {
|
|
"collected_at_utc": "2026-04-14T20:53:49Z",
|
|
"sequence": 1
|
|
},
|
|
"request": {
|
|
"method": "GET",
|
|
"url": "https://example.invalid/orderbook",
|
|
"params": {
|
|
"token_id": "123"
|
|
},
|
|
"status_code": 200,
|
|
"duration_ms": 123
|
|
},
|
|
"raw": {}
|
|
}
|
|
```
|
|
|
|
`raw` is the unmodified response payload. If the endpoint returns text or bytes, record encoding and store a lossless representation.
|
|
|
|
## Discovery Record Fields
|
|
|
|
Checkpoint 3 normalized market records should include:
|
|
|
|
- `market_name`
|
|
- `market_slug`
|
|
- `title` or `question`
|
|
- `condition_id`
|
|
- `tokens`
|
|
- `outcomes`
|
|
- `start_time_utc`, if available
|
|
- `end_time_utc`, if available
|
|
- `active`
|
|
- `closed`
|
|
- `endpoint_source`
|
|
- `fetched_at_utc`
|
|
- `raw_ref`
|
|
|
|
`tokens` should preserve the mapping between outcome labels and token IDs.
|
|
|
|
## Normalized Snapshot Fields
|
|
|
|
Checkpoint 5 normalized records should include:
|
|
|
|
- `market_name`
|
|
- `market_slug`
|
|
- `condition_id`
|
|
- `token_id`
|
|
- `outcome`
|
|
- `collected_at_utc`
|
|
- `best_bid`
|
|
- `best_ask`
|
|
- `spread`
|
|
- `midpoint`
|
|
- `bid_depth_total`
|
|
- `ask_depth_total`
|
|
- `bid_depth_within_1c`
|
|
- `ask_depth_within_1c`
|
|
- `bid_depth_within_2c`
|
|
- `ask_depth_within_2c`
|
|
- `bid_depth_within_5c`
|
|
- `ask_depth_within_5c`
|
|
- `raw_file`
|
|
- `raw_line_number`, when feasible
|
|
|
|
Normalized data is invalid if it cannot reference the raw source record.
|
|
|
|
## Manifest Requirements
|
|
|
|
Collection and transformation manifests should include:
|
|
|
|
- manifest schema name and version
|
|
- checkpoint or process name
|
|
- start and end timestamps
|
|
- market names and market IDs tracked
|
|
- input files
|
|
- output files
|
|
- request counts
|
|
- success and failure counts
|
|
- status-code counts
|
|
- row counts
|
|
- checksums for closed files
|
|
- command used
|
|
- config path or config digest
|
|
- warnings and known gaps
|
|
- gate status
|
|
|
|
Checksums should use SHA-256 unless a later report explains why another hash is used.
|
|
|
|
## Timestamp Policy
|
|
|
|
- `collected_at_utc`: local collector timestamp taken as close as possible to receipt of data.
|
|
- `fetched_at_utc`: timestamp for metadata or discovery fetches.
|
|
- Endpoint-provided timestamps must be preserved under their original field names in `raw`.
|
|
- If endpoint timestamp semantics are unclear, write the ambiguity into the probe report.
|
|
|