orderbooks/docs/POLYMARKET_WEBSOCKET_RECORDER.md

Polymarket Websocket Sample Recorder

This document describes the bounded Checkpoint 10B sample path. It is separate from the live Kubernetes REST collector and does not replace it.

Scope

The recorder captures public Polymarket market websocket messages for active BTC up/down outcome tokens and writes REST /books checkpoints during the same run. It does not trade, sign requests, use private keys, require API keys, or handle private account data.

Discovery

Run the existing discovery first so token IDs are current:

python scripts/discover_polymarket_btc_markets.py

The recorder reads data/discovery/polymarket_btc_markets_latest.json, selects active BTC up/down markets, and preserves market_slug, condition_id, token_id, outcome, and end_time_utc in every raw websocket envelope.
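As a rough illustration of the selection step, the sketch below flattens a discovery document into one metadata record per outcome token. The layout of the discovery file is not described here, so the top-level `markets` list, the `active` flag, and the nested `tokens` list are assumptions made for the example; only the five preserved field names come from this document. All values are placeholders.

```python
# Hypothetical discovery layout: the "markets", "active", and "tokens" keys
# are assumptions for this sketch, not the documented file schema.
EXAMPLE_DISCOVERY = {
    "markets": [
        {
            "active": True,
            "market_slug": "<market_slug>",
            "condition_id": "<condition_id>",
            "end_time_utc": "<end_time_utc>",
            "tokens": [
                {"token_id": "<up_token_id>", "outcome": "Up"},
                {"token_id": "<down_token_id>", "outcome": "Down"},
            ],
        },
        {
            "active": False,
            "market_slug": "<expired_market_slug>",
            "condition_id": "<expired_condition_id>",
            "end_time_utc": "<expired_end_time_utc>",
            "tokens": [],
        },
    ]
}


def select_token_metadata(discovery, market_limit=None):
    """Return one metadata dict per outcome token of each active market."""
    markets = [m for m in discovery.get("markets", []) if m.get("active")]
    if market_limit is not None:
        markets = markets[:market_limit]
    selected = []
    for market in markets:
        for token in market.get("tokens", []):
            selected.append(
                {
                    "market_slug": market["market_slug"],
                    "condition_id": market["condition_id"],
                    "token_id": token["token_id"],
                    "outcome": token["outcome"],
                    "end_time_utc": market["end_time_utc"],
                }
            )
    return selected


rows = select_token_metadata(EXAMPLE_DISCOVERY, market_limit=2)
```

Each returned record carries the fields that the recorder attaches to every raw websocket envelope.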

Sample Run

Default bounded run:

python scripts/record_polymarket_ws_sample.py --config config/polymarket_ws_sample.example.yaml

Useful overrides:

python scripts/record_polymarket_ws_sample.py --market-limit 2 --duration-seconds 150 --rest-checkpoint-interval-seconds 30

The default endpoint is:

wss://ws-subscriptions-clob.polymarket.com/ws/market

The subscription body is:

{"assets_ids":["<token_id>"],"type":"market","custom_feature_enabled":true}

For multiple tokens, assets_ids contains all selected Up/Down token IDs.
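A minimal sketch of building that body for several tokens follows. The helper name `build_subscription` is hypothetical and the token IDs are placeholders; the field names and values match the body shown above.

```python
import json


def build_subscription(token_ids):
    """Serialize the market subscription body for the given token IDs."""
    if not token_ids:
        raise ValueError("at least one token ID is required")
    return json.dumps(
        {
            "assets_ids": list(token_ids),
            "type": "market",
            "custom_feature_enabled": True,
        },
        separators=(",", ":"),  # compact form, same as the example body
    )


# Placeholder IDs; real ones come from the discovery file.
body = build_subscription(["<up_token_id>", "<down_token_id>"])
```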

Raw Websocket Output

Websocket text messages are written as gzip JSONL under:

data/ws_sample/polymarket/ws_raw/<run_id>/polymarket_ws_raw_<run_id>.jsonl.gz

Each row preserves the raw text payload in raw_text, plus parsed JSON in json when parsing succeeds. Unknown message shapes are retained and counted in the manifest.
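For readers who want to inspect these archives, gzip JSONL can be written and read with the standard library alone. The sketch below uses a temporary directory and hypothetical helper names (`write_rows`, `read_rows`); it is not the recorder's own writer.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_rows(path, rows):
    """Write one compact JSON object per line into a gzip file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, separators=(",", ":")) + "\n")


def read_rows(path):
    """Read every non-empty line back as a parsed JSON object."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


sample_rows = [
    {"raw_text": "{\"a\":1}", "json": {"a": 1}, "json_error": None},
    {"raw_text": "PONG", "json": None, "json_error": "not JSON"},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.jsonl.gz"
    write_rows(sample_path, sample_rows)
    rows_back = read_rows(sample_path)
```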

Important envelope fields include:

  • received_at_utc
  • session_id
  • connection_sequence
  • message_sequence
  • global_message_sequence
  • websocket.url
  • subscription.assets_ids
  • tokens_tracked
  • opcode
  • payload_length_bytes
  • payload_sha256
  • raw_text
  • json
  • json_error
  • classified_event_types
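The sketch below shows how a subset of these fields could be derived from one text frame. It assumes that dotted names such as websocket.url denote nested objects and that the hash and length are computed over the UTF-8 bytes of the payload; both are assumptions, and `build_envelope` is a hypothetical helper rather than the recorder's function.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_envelope(raw_text, *, session_id, message_sequence, url, assets_ids):
    """Wrap one websocket text frame in an envelope, keeping the raw payload."""
    payload = raw_text.encode("utf-8")  # assumed encoding for hash and length
    envelope = {
        "received_at_utc": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "message_sequence": message_sequence,
        "websocket": {"url": url},
        "subscription": {"assets_ids": list(assets_ids)},
        "payload_length_bytes": len(payload),
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "raw_text": raw_text,
        "json": None,
        "json_error": None,
    }
    try:
        envelope["json"] = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        # Unparseable frames are kept, with the parser error recorded.
        envelope["json_error"] = str(exc)
    return envelope


WS_URL = "wss://ws-subscriptions-clob.polymarket.com/ws/market"
ok_row = build_envelope(
    '{"event_type":"book"}',
    session_id="s1", message_sequence=1, url=WS_URL, assets_ids=["<token_id>"],
)
bad_row = build_envelope(
    "PONG",
    session_id="s1", message_sequence=2, url=WS_URL, assets_ids=["<token_id>"],
)
```

Keeping raw_text alongside the parsed form means a frame that fails to parse is still recoverable byte for byte.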

REST Checkpoints

REST checkpoints are written as gzip JSONL under:

data/ws_sample/polymarket/rest_checkpoints/<run_id>/polymarket_rest_checkpoints_<run_id>.jsonl.gz

Each row records one POST to:

https://clob.polymarket.com/books

The request body contains the same token IDs as the websocket subscription. The response JSON is preserved in response.raw_response_json, with safe response headers only. Secret-bearing headers are not recorded.
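One way to keep only safe headers is an explicit allowlist, sketched below. The allowlist contents, the `request` layout, and the helper names are assumptions for illustration; only response.raw_response_json and the /books URL come from this document. No network call is made.

```python
BOOKS_URL = "https://clob.polymarket.com/books"

# Assumed allowlist; the recorder's actual set of safe headers may differ.
SAFE_HEADERS = {"content-type", "content-length", "date"}


def safe_headers(headers):
    """Keep only allowlisted headers, matching names case-insensitively."""
    return {k: v for k, v in headers.items() if k.lower() in SAFE_HEADERS}


def build_checkpoint_row(token_ids, status_code, headers, response_json):
    """Assemble one checkpoint row from an already-completed POST."""
    return {
        "request": {"url": BOOKS_URL, "token_ids": list(token_ids)},
        "response": {
            "status_code": status_code,
            "headers": safe_headers(headers),
            "raw_response_json": response_json,
        },
    }


row = build_checkpoint_row(
    ["<up_token_id>", "<down_token_id>"],
    200,
    {
        "Content-Type": "application/json",
        "Set-Cookie": "<redacted>",
        "Authorization": "<redacted>",
    },
    [],
)
```

An allowlist fails closed: a header nobody anticipated is dropped rather than recorded.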

Manifest And Gate

The checkpoint manifest is:

data/manifests/checkpoint_010b_ws_raw_sample.json

The report is:

reports/checkpoints/checkpoint_010b_ws_raw_sample.md

WS_RAW_SAMPLE_PASS requires at least one selected BTC market with both outcome tokens, at least one parseable websocket text message, at least two successful REST checkpoints, parseable gzip JSONL outputs, and checksum summaries.

If the websocket connects but no market messages arrive, the recorder must gate as WS_RAW_SAMPLE_NEEDS_REVIEW rather than report the websocket path as proven.
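The gate criteria can be read as a simple checklist. The sketch below is one possible encoding: the `SampleStats` fields and `evaluate_gate` are hypothetical names, and mapping every unmet criterion to WS_RAW_SAMPLE_NEEDS_REVIEW is an assumption, since this document only specifies that outcome for the no-messages case.

```python
from dataclasses import dataclass


@dataclass
class SampleStats:
    markets_with_both_tokens: int
    parseable_ws_text_messages: int
    successful_rest_checkpoints: int
    outputs_parseable: bool   # both gzip JSONL files could be read back
    checksums_present: bool   # checksum summaries were produced


def evaluate_gate(stats):
    """Return (status, unmet criteria) for a bounded sample run."""
    unmet = []
    if stats.markets_with_both_tokens < 1:
        unmet.append("no selected BTC market with both outcome tokens")
    if stats.parseable_ws_text_messages < 1:
        unmet.append("no parseable websocket text message")
    if stats.successful_rest_checkpoints < 2:
        unmet.append("fewer than two successful REST checkpoints")
    if not stats.outputs_parseable:
        unmet.append("gzip JSONL outputs not parseable")
    if not stats.checksums_present:
        unmet.append("checksum summaries missing")
    # Assumption: any unmet criterion is reported as NEEDS_REVIEW.
    status = "WS_RAW_SAMPLE_PASS" if not unmet else "WS_RAW_SAMPLE_NEEDS_REVIEW"
    return status, unmet


passing = evaluate_gate(SampleStats(1, 42, 2, True, True))
silent_feed = evaluate_gate(SampleStats(1, 0, 5, True, True))
```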

Checkpoint 10D Runtime Direction

The long-running runtime recorder is scripts/collect_polymarket_ws_orderbooks.py, which is separate from the bounded 10B sample script. It is intended to run as orderbooks-ws-recorder beside the existing REST collector. It preserves raw websocket messages under raw_orderbooks/polymarket/ws_raw/ and keeps REST /books checkpoints under raw_orderbooks/polymarket/rest_checkpoints/. It also rotates closed gzip archives hourly, writes manifests under /var/lib/orderbooks/manifests, and records reconnect, stale-feed, REST failure, parser, and divergence counters.

Gzip files that are still being written use hidden .open names until they are closed. The uploader skips open and temporary files, and it deletes local archives only when --cleanup-after-verify is used and rclone verification has succeeded.
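The open-file convention can be sketched as write under a hidden name, then rename on close. The exact naming pattern is not given here, so the leading dot plus `.open` suffix, the `.tmp` suffix, and the helper names below are assumptions for illustration.

```python
import gzip
import tempfile
from pathlib import Path


def open_name(final_path):
    """Hidden working name for an archive that is still being written."""
    # Assumed pattern: ".<final name>.open" in the same directory.
    return final_path.with_name("." + final_path.name + ".open")


def is_uploadable(path):
    """True only for closed archives; hidden, open, and temp files are skipped."""
    name = path.name
    return not (name.startswith(".") or name.endswith((".open", ".tmp")))


def finalize(final_path):
    """Publish a finished archive by renaming it to its final name."""
    open_name(final_path).replace(final_path)


with tempfile.TemporaryDirectory() as tmp:
    final = Path(tmp) / "archive.jsonl.gz"
    with gzip.open(open_name(final), "wt", encoding="utf-8") as fh:
        fh.write("{}\n")
    visible_before = [p.name for p in Path(tmp).iterdir() if is_uploadable(p)]
    finalize(final)
    visible_after = [p.name for p in Path(tmp).iterdir() if is_uploadable(p)]
```

Because the rename stays within one directory, an uploader scanning that directory sees either no archive or a complete one.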

Reliability Semantics

Checkpoint 10E fixed the stale reconnect loop by making stale timers session-local. A new websocket session starts with first_message_timeout_seconds grace before stale detection can fire. After the first text message, normal stale_feed_threshold_seconds applies to that session only. Run-level last_successful_ws_message_at_utc is still preserved for observability, but it is not used to stale-fail a fresh connection.
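The session-local rule can be sketched as a small timer object created once per connection. The class and method names are hypothetical, and the clock is passed in explicitly so the behavior is easy to check; the two thresholds correspond to first_message_timeout_seconds and stale_feed_threshold_seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionStaleTimer:
    """Stale detection scoped to one websocket session."""

    started_at: float
    first_message_timeout_seconds: float
    stale_feed_threshold_seconds: float
    last_message_at: Optional[float] = None  # None until the first text message

    def on_text_message(self, now):
        self.last_message_at = now

    def is_stale(self, now):
        if self.last_message_at is None:
            # Grace period: measured from session start, not from any
            # run-level last-message timestamp.
            return now - self.started_at > self.first_message_timeout_seconds
        return now - self.last_message_at > self.stale_feed_threshold_seconds


timer = SessionStaleTimer(
    started_at=100.0,
    first_message_timeout_seconds=30.0,
    stale_feed_threshold_seconds=10.0,
)
in_grace = timer.is_stale(120.0)        # 20 s after start, no message yet
timer.on_text_message(125.0)
fresh = timer.is_stale(131.0)           # 6 s since last message
gone_quiet = timer.is_stale(136.0)      # 11 s since last message
```

A reconnect builds a new timer, so a long silence on the previous session cannot immediately fail the new one.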

After max_consecutive_stale_reconnects_before_discovery_refresh stale reconnects, the recorder forces discovery before the next subscription. This prevents expired or rotated BTC token IDs from causing an endless reconnect loop. Manifests distinguish RUNNING_RECEIVING, RUNNING_RECONNECTING, and RUNNING_NO_ACTIVE_TOKENS, and include recent session summaries plus current open archive paths.
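A minimal sketch of the refresh trigger is a counter of consecutive stale reconnects. The class name is hypothetical, and resetting the counter when a message arrives or when discovery is forced is an assumption about what "consecutive" means here.

```python
class StaleReconnectPolicy:
    """Decide when stale reconnects should force a discovery refresh."""

    def __init__(self, max_consecutive_stale_reconnects_before_discovery_refresh):
        self.limit = max_consecutive_stale_reconnects_before_discovery_refresh
        self.consecutive_stale = 0

    def on_message_received(self):
        # Assumption: any received message ends the stale streak.
        self.consecutive_stale = 0

    def on_stale_reconnect(self):
        """Record one stale reconnect; return True if discovery must run next."""
        self.consecutive_stale += 1
        if self.consecutive_stale >= self.limit:
            self.consecutive_stale = 0  # the refresh starts a new streak
            return True
        return False


policy = StaleReconnectPolicy(3)
decisions = [policy.on_stale_reconnect() for _ in range(3)]
policy.on_stale_reconnect()
policy.on_message_received()
after_reset = policy.on_stale_reconnect()
```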