Roadmap
Project: Cross-Market Live Orderbook Archive
Goal: build a reliable, minimal, always-on archive of live market microstructure data so future research agents can test whether strategies were actually observable, fillable, and reproducible in real time.
The roadmap is checkpoint-driven. Each checkpoint must leave durable artifacts, validation evidence, and an explicit gate result.
Current Status
- Latest completed checkpoint: Checkpoint 7, Google Drive Offload
- Latest gate: PASS
- Next checkpoint: Checkpoint 8, 24h Soak Test Plan
- Initial market: Polymarket
- Future market work: gated until Polymarket is stable
Checkpoint 1: Project Scaffold And Methodology
Goal: create the minimum repository structure and rules that keep future agents on track.
Artifacts:
- AGENTS.md
- ROADMAP.md
- docs/METHODOLOGY.md
- docs/DATA_CONTRACT.md
- docs/OPERATIONS.md
- orchestration/prompts/
Requirements:
- Define project goal.
- Define anti-fake-progress rules.
- Define raw-first storage policy.
- Define checkpoint reporting format.
- Define no-trading/no-private-key policy.
- Define how to label deprecated or misleading artifacts instead of deleting them.
- Define how new market connectors should be added later.
Pass condition: the repo contains durable project rules and the next checkpoint is specific enough to execute.
Checkpoint 2: Polymarket Public Data Source Probe
Goal: determine exactly which public Polymarket endpoints can support live collection.
Questions:
- How to discover active Polymarket markets?
- How to filter BTC up/down markets?
- How to resolve conditionId and token IDs?
- How to fetch current order book for one token?
- Is there a batch order-book endpoint?
- Is there a market websocket for order-book updates?
- Is there a trade websocket or recent trades endpoint?
- What rate limits are documented or observed?
- What fields are returned?
- What timestamps exist?
Artifacts:
- scripts/probe_polymarket_public_sources.py
- data/probes/polymarket_public_sources_probe_v1.json
- data/probes/polymarket_public_sources_probe_v1.md
Pass condition: we know the exact endpoint set and can fetch at least one active market metadata record and one current order book.
Checkpoint 3: Minimal BTC Market Discovery
Goal: build a small script that finds active BTC up/down Polymarket markets and resolves both outcome token IDs.
Artifacts:
- scripts/discover_polymarket_btc_markets.py
- data/discovery/polymarket_btc_markets_latest.json
- data/discovery/polymarket_btc_markets_manifest.json
- data/discovery/polymarket_btc_markets.md
Requirements:
- Public endpoints only.
- No trading.
- No API keys unless strictly needed for public data.
- Never store secrets in the repo.
- Preserve raw metadata responses.
- Write normalized market records with slug, question, conditionId, token IDs, outcomes, times, status, source, and fetched_at_utc.
Pass condition: the script reliably outputs currently active BTC markets with token IDs.
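The normalized record listed in the requirements above could look roughly like the following. This is a minimal sketch, not the project's actual schema: the `MarketRecord` class and `normalize_market` function are hypothetical, and the raw field names (`conditionId`, `clobTokenIds`, `outcomes`, `startDate`, `endDate`, `active`, `closed`) are assumptions about the metadata response that Checkpoint 2's probe should confirm.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MarketRecord:
    """Hypothetical normalized market record; field set follows the roadmap."""
    slug: str
    question: str
    condition_id: str
    token_ids: list
    outcomes: list
    start_time_utc: Optional[str]
    end_time_utc: Optional[str]
    status: str
    source: str
    fetched_at_utc: str


def _as_list(value) -> list:
    # Assumption: list-valued fields may arrive as JSON-encoded strings.
    if isinstance(value, str):
        return json.loads(value)
    return list(value or [])


def normalize_market(raw: dict, source: str, now: Optional[datetime] = None) -> MarketRecord:
    """Build one normalized record from one raw metadata object."""
    token_ids = _as_list(raw.get("clobTokenIds"))
    outcomes = _as_list(raw.get("outcomes"))
    # Refuse to emit a record unless every outcome has a resolved token ID.
    if not token_ids or len(token_ids) != len(outcomes):
        raise ValueError(f"unresolved token IDs for {raw.get('slug')!r}")
    is_active = bool(raw.get("active")) and not raw.get("closed")
    fetched = now or datetime.now(timezone.utc)
    return MarketRecord(
        slug=raw["slug"],
        question=raw["question"],
        condition_id=raw["conditionId"],
        token_ids=token_ids,
        outcomes=outcomes,
        start_time_utc=raw.get("startDate"),
        end_time_utc=raw.get("endDate"),
        status="active" if is_active else "inactive",
        source=source,
        fetched_at_utc=fetched.isoformat(),
    )
```

Raising on a token/outcome mismatch keeps half-resolved markets out of the discovery output, which matches the pass condition that every listed market carries its token IDs. The raw response should still be written to disk separately, unchanged.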
Checkpoint 4: Minimal Orderbook Snapshot Collector
Goal: collect raw order-book snapshots for active BTC markets at a fixed interval.
Artifacts:
- scripts/collect_polymarket_orderbooks.py
- config/polymarket_collector.example.yaml
- data/live_sample/...
- data/manifests/orderbook_collector_sample_manifest.json
- docs/POLYMARKET_COLLECTOR.md
Requirements:
- Collect active BTC markets only.
- Fetch order books for both outcome tokens.
- Store raw API responses as gzip JSONL.
- Add local collected_at_utc, collector version, endpoint URL, and request params.
- Rotate files by hour or run.
- Include a manifest with timing, markets, request counts, status codes, rows, output files, and checksums.
- Handle graceful shutdown and rate limits.
- Do not add a database.
Pass condition: a 5-10 minute sample run creates valid compressed raw snapshots and a manifest.
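The storage requirements above (raw response kept intact, local metadata added, gzip JSONL, hourly rotation, checksums for the manifest) can be sketched as follows. This is an illustration under assumptions, not the collector's real code: the envelope key names, the `orderbooks_<hour>.jsonl.gz` file naming, and `COLLECTOR_VERSION` are all hypothetical.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

COLLECTOR_VERSION = "0.0.0-sketch"  # hypothetical version string


def hourly_path(out_dir: Path, when: datetime) -> Path:
    """One file per UTC hour (hypothetical naming scheme)."""
    return out_dir / f"orderbooks_{when:%Y%m%dT%H}.jsonl.gz"


def append_snapshot(out_dir, raw_response, endpoint_url, request_params,
                    status_code, now=None) -> Path:
    """Append one raw API response, wrapped in local metadata, to the hourly file."""
    now = now or datetime.now(timezone.utc)
    envelope = {
        "collected_at_utc": now.isoformat(),
        "collector_version": COLLECTOR_VERSION,
        "endpoint_url": endpoint_url,
        "request_params": request_params,
        "status_code": status_code,
        "raw": raw_response,  # stored unmodified: raw files are the source of truth
    }
    path = hourly_path(Path(out_dir), now)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode adds a new gzip member; readers see one continuous stream.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        fh.write(json.dumps(envelope, separators=(",", ":")) + "\n")
    return path


def sha256_of(path: Path) -> str:
    """Checksum of a finished file, for the run manifest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Checksums are best computed only after a file's hour has closed, since appending to the file changes its hash.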
Checkpoint 5: Normalized Snapshot Extract
Goal: create a derived normalized dataset from raw snapshots while preserving raw files as source of truth.
Artifacts:
- scripts/normalize_polymarket_orderbooks.py
- data/normalized_sample/...
- data/manifests/orderbook_normalization_sample_manifest.json
- docs/ORDERBOOK_SCHEMA.md
Pass condition: a sample raw file can be normalized and basic sanity checks pass.
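As an illustration of what a derived row and its "basic sanity checks" might involve, here is a small sketch. It assumes a raw book shaped like `{"bids": [{"price": "...", "size": "..."}], "asks": [...]}` with string-encoded decimals; that shape, the output column names, and both function names are assumptions, and the real schema belongs in docs/ORDERBOOK_SCHEMA.md.

```python
from decimal import Decimal


def top_of_book(raw_book: dict) -> dict:
    """Derive best bid/ask from a raw book without touching the raw data."""
    bids = [(Decimal(lv["price"]), Decimal(lv["size"])) for lv in raw_book.get("bids", [])]
    asks = [(Decimal(lv["price"]), Decimal(lv["size"])) for lv in raw_book.get("asks", [])]
    # Do not rely on the API's level ordering; pick extremes explicitly.
    best_bid = max(bids, key=lambda lv: lv[0], default=None)
    best_ask = min(asks, key=lambda lv: lv[0], default=None)
    return {
        "best_bid_price": best_bid[0] if best_bid else None,
        "best_bid_size": best_bid[1] if best_bid else None,
        "best_ask_price": best_ask[0] if best_ask else None,
        "best_ask_size": best_ask[1] if best_ask else None,
        "bid_levels": len(bids),
        "ask_levels": len(asks),
    }


def sanity_issues(row: dict) -> list:
    """Return human-readable problems; an empty list means the row looks sane."""
    issues = []
    bid, ask = row["best_bid_price"], row["best_ask_price"]
    for name, price in (("bid", bid), ("ask", ask)):
        # Assumption: outcome-token prices are quoted between 0 and 1.
        if price is not None and not (Decimal(0) <= price <= Decimal(1)):
            issues.append(f"{name} price {price} outside [0, 1]")
    if bid is not None and ask is not None and bid >= ask:
        issues.append(f"crossed or locked book: bid {bid} >= ask {ask}")
    return issues
```

Using `Decimal` rather than `float` keeps the normalized prices exactly equal to the strings in the raw file, so a normalized row can always be traced back to its source line.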
Checkpoint 6: VPS Runtime Package
Goal: make the collector deployable on a small VPS.
Artifacts:
- systemd/polymarket-orderbook-collector.service
- config/polymarket_collector.vps.example.yaml
- scripts/run_polymarket_collector_cycle.sh
- docs/VPS_DEPLOYMENT.md
Uploader service and timer units are deferred to Checkpoint 7 with Google Drive offload. Creating empty uploader units in Checkpoint 6 would be fake progress.
Pass condition: a user can follow docs on a VPS and run the collector.
Checkpoint 7: Google Drive Offload
Goal: add periodic upload to Google Drive using rclone.
Artifacts:
- scripts/upload_archive_rclone.sh
- config/rclone.example.md
- docs/GOOGLE_DRIVE_OFFLOAD.md
- sample upload manifest format
Pass condition: a dry-run and a real small test upload succeed and are documented.
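The dry-run-then-real-upload sequence in the pass condition can be illustrated with a thin wrapper around `rclone copy`. The project's own uploader is a shell script; this Python version is only a sketch, and the function names, the local directory, and the remote name used in the usage note are hypothetical.

```python
import subprocess


def build_rclone_copy_cmd(local_dir: str, remote_dest: str, dry_run: bool = True) -> list:
    """Build the argv for `rclone copy`; defaults to a dry run for safety."""
    cmd = ["rclone", "copy", str(local_dir), remote_dest]
    if dry_run:
        cmd.append("--dry-run")  # report what would be transferred, copy nothing
    return cmd


def run_upload(local_dir: str, remote_dest: str, dry_run: bool = True) -> int:
    """Run rclone and return its exit code (0 means success)."""
    result = subprocess.run(
        build_rclone_copy_cmd(local_dir, remote_dest, dry_run),
        capture_output=True,
        text=True,
    )
    return result.returncode
```

A cautious caller would run `run_upload("data/live_sample", "gdrive:orderbook-archive")` first with the default dry run, and repeat with `dry_run=False` only if the exit code is 0. `copy` is used rather than a mirroring command so that nothing on the remote is ever deleted.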
Checkpoint 8: 24h Soak Test Plan
Goal: run the collector for a real 24h period and validate reliability.
Artifacts:
- reports/soak_test_YYYY-MM-DD.md
- data/manifests/...
Metrics:
- uptime
- markets tracked
- total snapshots
- missed interval estimate
- API errors
- rate limits
- file sizes
- compression ratio
- Google Drive upload status
- restart behavior
- disk usage
- data quality checks
Pass condition: a 24h run completes with acceptable data quality and documented issues.
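Two of the metrics above, the missed interval estimate and the compression ratio, are simple enough to sketch. The definitions here are assumptions rather than the project's agreed formulas: a gap between consecutive snapshots counts as missing however many whole polling intervals fit inside it beyond the first, and the ratio is uncompressed bytes divided by compressed bytes.

```python
from datetime import datetime


def missed_interval_estimate(timestamps, interval_s: float) -> int:
    """Estimate skipped polls from gaps between consecutive snapshot times."""
    ordered = sorted(timestamps)
    missed = 0
    for prev, cur in zip(ordered, ordered[1:]):
        gap_s = (cur - prev).total_seconds()
        # A gap of about N intervals implies N - 1 missing snapshots.
        # round() absorbs small scheduling jitter around each interval.
        missed += max(0, round(gap_s / interval_s) - 1)
    return missed


def compression_ratio(raw_bytes: int, compressed_bytes: int) -> float:
    """Uncompressed size divided by compressed size (higher is better)."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return raw_bytes / compressed_bytes
```

For example, with a 10-second interval and snapshots at 0, 10, 20, 50, and 60 seconds, the single 30-second gap yields an estimate of 2 missed intervals. The estimate cannot see misses before the first or after the last snapshot, so the report should state the intended run window alongside it.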
Checkpoint 9: Add Second Market Only After Polymarket Is Stable
Goal: prepare for NEAR or another market only after Polymarket collector reliability is proven.
Do not start this checkpoint until:
- Polymarket discovery works.
- Polymarket order-book collection works.
- Google Drive offload works.
- The 24h soak test is complete.
Architecture principles:
- Use collectors/<market_name>/ only when adding the second market.
- Keep shared code minimal.
- Avoid abstract base classes until duplication is painful.
- Keep raw-first, normalized-second, manifest-always file format consistent across markets.
Anti-Fake-Progress Gates
- No dashboard before 24h data reliability.
- No database before the file archive becomes painful.
- No strategy or backtest code in this project.
- No live trading.
- No generic multi-market abstraction before the second market exists.
- No claiming "production-ready" before a 24h soak test.
- No deleting bad artifacts; label them deprecated or invalid and write why.
Next Smallest Step
Checkpoint 8 is next. It should run the collector for a real, continuous 24h period, record the metrics listed under Checkpoint 8 in reports/soak_test_YYYY-MM-DD.md, and document every issue found, including restart behavior and Google Drive upload status.