# Roadmap

Project: Cross-Market Live Orderbook Archive

Goal: build a reliable, minimal, always-on archive of live market microstructure data so future research agents can test whether strategies were actually observable, fillable, and reproducible in real time.

The roadmap is checkpoint-driven. Each checkpoint must leave durable artifacts, validation evidence, and an explicit gate result.

## Current Status

- Latest completed checkpoint: Checkpoint 7, Google Drive Offload
- Latest gate: PASS
- Next checkpoint: Checkpoint 8, 24h Soak Test Plan
- Initial market: Polymarket
- Future market work: gated until Polymarket is stable

## Checkpoint 1: Project Scaffold And Methodology

Goal: create the minimum repository structure and rules that keep future agents on track.

Artifacts:

- `AGENTS.md`
- `ROADMAP.md`
- `docs/METHODOLOGY.md`
- `docs/DATA_CONTRACT.md`
- `docs/OPERATIONS.md`
- `orchestration/prompts/`

Requirements:

- Define project goal.
- Define anti-fake-progress rules.
- Define raw-first storage policy.
- Define checkpoint reporting format.
- Define no-trading/no-private-key policy.
- Define how to label deprecated or misleading artifacts instead of deleting them.
- Define how new market connectors should be added later.

Pass condition: the repo contains durable project rules and the next checkpoint is specific enough to execute.

## Checkpoint 2: Polymarket Public Data Source Probe

Goal: determine exactly which public Polymarket endpoints can support live collection.

Questions:

- How to discover active Polymarket markets?
- How to filter BTC up/down markets?
- How to resolve conditionId and token IDs?
- How to fetch current order book for one token?
- Is there a batch order-book endpoint?
- Is there a market websocket for order-book updates?
- Is there a trade websocket or recent trades endpoint?
- What rate limits are documented or observed?
- What fields are returned?
- What timestamps exist?

Artifacts:

- `scripts/probe_polymarket_public_sources.py`
- `data/probes/polymarket_public_sources_probe_v1.json`
- `data/probes/polymarket_public_sources_probe_v1.md`

Pass condition: we know the exact endpoint set and can fetch at least one active market metadata record and one current order book.

## Checkpoint 3: Minimal BTC Market Discovery

Goal: build a small script that finds active BTC up/down Polymarket markets and resolves both outcome token IDs.

Artifacts:

- `scripts/discover_polymarket_btc_markets.py`
- `data/discovery/polymarket_btc_markets_latest.json`
- `data/discovery/polymarket_btc_markets_manifest.json`
- `data/discovery/polymarket_btc_markets.md`

Requirements:

- Public endpoints only.
- No trading.
- No API keys unless strictly needed for public data.
- Never store secrets in the repo.
- Preserve raw metadata responses.
- Write normalized market records with slug, question, conditionId, token IDs, outcomes, times, status, source, and `fetched_at_utc`.

Pass condition: the script reliably outputs currently active BTC markets with token IDs.

## Checkpoint 4: Minimal Orderbook Snapshot Collector

Goal: collect raw order-book snapshots for active BTC markets at a fixed interval.

Artifacts:

- `scripts/collect_polymarket_orderbooks.py`
- `config/polymarket_collector.example.yaml`
- `data/live_sample/...`
- `data/manifests/orderbook_collector_sample_manifest.json`
- `docs/POLYMARKET_COLLECTOR.md`

Requirements:

- Collect active BTC markets only.
- Fetch order books for both outcome tokens.
- Store raw API responses as gzip JSONL.
- Add local `collected_at_utc`, collector version, endpoint URL, and request params.
- Rotate files by hour or run.
- Include a manifest with timing, markets, request counts, status codes, rows, output files, and checksums.
- Handle graceful shutdown and rate limits.
- Do not add a database.

Pass condition: a 5-10 minute sample run creates valid compressed raw snapshots and a manifest.

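
The Checkpoint 4 storage requirements (raw responses as gzip JSONL, a local `collected_at_utc`, collector version, endpoint URL, request params, and checksums for the manifest) could be sketched roughly as follows. This is an illustration only, not the contents of `scripts/collect_polymarket_orderbooks.py`: the function names, the envelope keys other than `collected_at_utc`, and the `COLLECTOR_VERSION` value are all assumptions.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

COLLECTOR_VERSION = "0.0.0-sketch"  # hypothetical version string


def wrap_response(raw_body: str, endpoint_url: str, request_params: dict) -> dict:
    """Wrap one raw API response body in a local envelope without altering it."""
    return {
        "collected_at_utc": datetime.now(timezone.utc).isoformat(),
        "collector_version": COLLECTOR_VERSION,
        "endpoint_url": endpoint_url,
        "request_params": request_params,
        "raw_body": raw_body,  # kept verbatim as text, never re-serialized
    }


def append_records(path: Path, records: list[dict]) -> int:
    """Append records to a gzip JSONL file, one JSON object per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "at" adds a new gzip member; readers see one continuous stream.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, separators=(",", ":")) + "\n")
    return len(records)


def read_records(path: Path) -> list[dict]:
    """Read every record back from a gzip JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def sha256_of_file(path: Path) -> str:
    """Checksum of the compressed file, suitable for a manifest entry."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the response body as an unparsed string keeps the raw-first policy intact even if the upstream schema changes; parsing belongs to the later normalization step.
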
## Checkpoint 5: Normalized Snapshot Extract

Goal: create a derived normalized dataset from raw snapshots while preserving raw files as source of truth.

Artifacts:

- `scripts/normalize_polymarket_orderbooks.py`
- `data/normalized_sample/...`
- `data/manifests/orderbook_normalization_sample_manifest.json`
- `docs/ORDERBOOK_SCHEMA.md`

Pass condition: a sample raw file can be normalized and basic sanity checks pass.

## Checkpoint 6: VPS Runtime Package

Goal: make the collector deployable on a small VPS.

Artifacts:

- `systemd/polymarket-orderbook-collector.service`
- `config/polymarket_collector.vps.example.yaml`
- `scripts/run_polymarket_collector_cycle.sh`
- `docs/VPS_DEPLOYMENT.md`

Uploader service and timer units are deferred to Checkpoint 7 with Google Drive offload. Creating empty uploader units in Checkpoint 6 would be fake progress.

Pass condition: a user can follow docs on a VPS and run the collector.

## Checkpoint 7: Google Drive Offload

Goal: add periodic upload to Google Drive using `rclone`.

Artifacts:

- `scripts/upload_archive_rclone.sh`
- `config/rclone.example.md`
- `docs/GOOGLE_DRIVE_OFFLOAD.md`
- sample upload manifest format

Pass condition: a dry-run and a real small test upload succeed and are documented.

## Checkpoint 8: 24h Soak Test Plan

Goal: run the collector for a real 24h period and validate reliability.

Artifacts:

- `reports/soak_test_YYYY-MM-DD.md`
- `data/manifests/...`

Metrics:

- uptime
- markets tracked
- total snapshots
- missed interval estimate
- API errors
- rate limits
- file sizes
- compression ratio
- Google Drive upload status
- restart behavior
- disk usage
- data quality checks

Pass condition: a 24h run completes with acceptable data quality and documented issues.

## Checkpoint 9: Add Second Market Only After Polymarket Is Stable

Goal: prepare for NEAR or another market only after Polymarket collector reliability is proven.

Do not start this checkpoint until:

- Polymarket discovery works.
- Polymarket order-book collection works.
- Google Drive offload works.
- The 24h soak test is complete.

Architecture principles:

- Use `collectors//` only when adding the second market.
- Keep shared code minimal.
- Avoid abstract base classes until duplication is painful.
- Keep raw-first, normalized-second, manifest-always file format consistent across markets.

## Anti-Fake-Progress Gates

- No dashboard before 24h data reliability.
- No database before the file archive becomes painful.
- No strategy or backtest code in this project.
- No live trading.
- No generic multi-market abstraction before the second market exists.
- No claiming "production-ready" before a 24h soak test.
- No deleting bad artifacts; label them deprecated or invalid and write why.

## Next Smallest Step

Checkpoint 8 is next, as recorded under Current Status. It should run the collector for a real 24h period, record the Checkpoint 8 metrics, and document any issues found.
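
The roadmap lists a "missed interval estimate" among the Checkpoint 8 metrics without defining it. A minimal sketch of one possible definition follows, assuming snapshots are meant to arrive at a fixed interval and that a gap of roughly k intervals between consecutive snapshots means k - 1 were missed. The function name and the rounding rule are assumptions, not an agreed project definition.

```python
from datetime import datetime


def estimate_missed_intervals(timestamps: list[datetime], interval_seconds: float) -> int:
    """Estimate how many scheduled snapshots are missing from a run.

    Assumption: each gap is rounded to the nearest whole number of
    intervals, so small scheduling jitter does not count as a miss.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    ordered = sorted(timestamps)
    missed = 0
    for earlier, later in zip(ordered, ordered[1:]):
        gap = (later - earlier).total_seconds()
        missed += max(0, round(gap / interval_seconds) - 1)
    return missed
```

The input would be the `collected_at_utc` values for one token over the soak window. Gaps before the first or after the last snapshot are not counted, so the report should state the intended start and end times separately.
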