orderbooks/docs/DATA_CONTRACT.md

# Data Contract

The archive is raw-first. Raw market data must be preserved before normalization, aggregation, upload, or analysis.

## Storage Principles

- Store the raw response payload exactly as received whenever practical.
- Add collector metadata beside the raw payload, not inside it.
- Use UTC timestamps in ISO 8601 format with a `Z` suffix.
- Use gzip JSONL for high-frequency snapshot data.
- Rotate live collection files by hour or run.
- Include checksums in manifests for all closed files.
- Keep normalized files derived and traceable back to raw files.
- Never store secrets, cookies, private keys, wallet material, or authenticated session state.

## Directory Layout

Initial expected layout:

```text
data/
  probes/
  discovery/
  live_sample/
  normalized_sample/
  manifests/
reports/
  checkpoints/
```

Future sustained collection layout:

```text
data/
  raw/
    polymarket/
      orderbooks/
        YYYY/
          MM/
            DD/
              HH/
                polymarket_orderbooks_YYYYMMDDTHHMMSSZ.jsonl.gz
  normalized/
    polymarket/
      orderbooks/
        YYYY/
          MM/
            DD/
              polymarket_orderbooks_normalized_YYYYMMDD.jsonl.gz
  manifests/
```

Do not create a database until compressed file archives are proven painful.

## Raw Orderbook Snapshot Envelope

Checkpoint 4 should store one JSON object per line using this envelope or a documented successor:

```json
{
  "schema_name": "raw_orderbook_snapshot",
  "schema_version": 1,
  "collector": {
    "name": "polymarket_orderbook_collector",
    "version": "0.1.0"
  },
  "market": {
    "market_name": "polymarket",
    "market_slug": "example-slug",
    "condition_id": "0x...",
    "token_id": "123",
    "outcome": "Yes"
  },
  "collection": {
    "collected_at_utc": "2026-04-14T20:53:49Z",
    "sequence": 1
  },
  "request": {
    "method": "GET",
    "url": "https://example.invalid/orderbook",
    "params": {
      "token_id": "123"
    },
    "status_code": 200,
    "duration_ms": 123
  },
  "raw": {}
}
```

`raw` is the unmodified response payload. If the endpoint returns text or bytes, record encoding and store a lossless representation.

## Discovery Record Fields

Checkpoint 3 normalized market records should include:

- `market_name`
- `market_slug`
- `title` or `question`
- `condition_id`
- `tokens`
- `outcomes`
- `start_time_utc`, if available
- `end_time_utc`, if available
- `active`
- `closed`
- `endpoint_source`
- `fetched_at_utc`
- `raw_ref`

`tokens` should preserve the mapping between outcome labels and token IDs.

## Normalized Snapshot Fields

Checkpoint 5 normalized records should include:

- `market_name`
- `market_slug`
- `condition_id`
- `token_id`
- `outcome`
- `collected_at_utc`
- `best_bid`
- `best_ask`
- `spread`
- `midpoint`
- `bid_depth_total`
- `ask_depth_total`
- `bid_depth_within_1c`
- `ask_depth_within_1c`
- `bid_depth_within_2c`
- `ask_depth_within_2c`
- `bid_depth_within_5c`
- `ask_depth_within_5c`
- `raw_file`
- `raw_line_number`, when feasible

Normalized data is invalid if it cannot reference the raw source record.

## Manifest Requirements

Collection and transformation manifests should include:

- manifest schema name and version
- checkpoint or process name
- start and end timestamps
- market names and market IDs tracked
- input files
- output files
- request counts
- success and failure counts
- status-code counts
- row counts
- checksums for closed files
- command used
- config path or config digest
- warnings and known gaps
- gate status

Checksums should use SHA-256 unless a later report explains why another hash is used.

## Timestamp Policy

- `collected_at_utc`: local collector timestamp taken as close as possible to receipt of data.
- `fetched_at_utc`: timestamp for metadata or discovery fetches.
- Endpoint-provided timestamps must be preserved under their original field names in `raw`.
- If endpoint timestamp semantics are unclear, write the ambiguity into the probe report.