orderbooks/ROADMAP.md
philipp 284e465588
Prepare Kubernetes orderbooks deployment
2026-04-18 11:23:28 +02:00

# Roadmap
Project: Cross-Market Live Orderbook Archive

Goal: build a reliable, minimal, always-on archive of live market microstructure data so future research agents can test whether strategies were actually observable, fillable, and reproducible in real time.

The roadmap is checkpoint-driven. Each checkpoint must leave durable artifacts, validation evidence, and an explicit gate result.
## Current Status
- Latest completed checkpoint: Checkpoint 7, Google Drive Offload
- Latest gate: PASS
- Next checkpoint: Checkpoint 8, 24h Soak Test Plan
- Initial market: Polymarket
- Future market work: gated until Polymarket is stable
## Checkpoint 1: Project Scaffold And Methodology
Goal: create the minimum repository structure and rules that keep future agents on track.
Artifacts:
- `AGENTS.md`
- `ROADMAP.md`
- `docs/METHODOLOGY.md`
- `docs/DATA_CONTRACT.md`
- `docs/OPERATIONS.md`
- `orchestration/prompts/`
Requirements:
- Define project goal.
- Define anti-fake-progress rules.
- Define raw-first storage policy.
- Define checkpoint reporting format.
- Define no-trading/no-private-key policy.
- Define how to label deprecated or misleading artifacts instead of deleting them.
- Define how new market connectors should be added later.
Pass condition: the repo contains durable project rules and the next checkpoint is specific enough to execute.
## Checkpoint 2: Polymarket Public Data Source Probe
Goal: determine exactly which public Polymarket endpoints can support live collection.
Questions:
- How to discover active Polymarket markets?
- How to filter BTC up/down markets?
- How to resolve conditionId and token IDs?
- How to fetch current order book for one token?
- Is there a batch order-book endpoint?
- Is there a market websocket for order-book updates?
- Is there a trade websocket or recent trades endpoint?
- What rate limits are documented or observed?
- What fields are returned?
- What timestamps exist?
Artifacts:
- `scripts/probe_polymarket_public_sources.py`
- `data/probes/polymarket_public_sources_probe_v1.json`
- `data/probes/polymarket_public_sources_probe_v1.md`
Pass condition: we know the exact endpoint set and can fetch at least one active market metadata record and one current order book.
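
One way to keep the probe bounded and comparable across endpoints is to record the same small set of facts for every request. The sketch below is illustrative only and is not the contents of `scripts/probe_polymarket_public_sources.py`. The names `http_get` and `probe_endpoint`, the record fields, and the rule for spotting rate-limit headers are all assumptions. Endpoint URLs are deliberately left as parameters because they must come from the official Polymarket docs.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def http_get(url: str, timeout: float = 10.0) -> tuple[int, dict, bytes]:
    """Fetch one public URL and return (status, headers, body)."""
    # Hypothetical User-Agent string; no API keys or secrets are sent.
    req = Request(url, headers={"User-Agent": "orderbook-archive-probe/0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, dict(resp.headers.items()), resp.read()


def probe_endpoint(
    name: str,
    url: str,
    params: Optional[dict] = None,
    fetch: Callable[[str], tuple[int, dict, bytes]] = http_get,
) -> dict:
    """Issue a single request and summarize what came back."""
    full_url = f"{url}?{urlencode(params)}" if params else url
    record = {
        "name": name,
        "url": url,
        "params": params or {},
        "probed_at_utc": datetime.now(timezone.utc).isoformat(),
    }
    started = time.monotonic()
    try:
        status, headers, body = fetch(full_url)
        record["status"] = status
        record["bytes"] = len(body)
        # Assumed heuristic: keep headers whose name mentions rate limiting.
        record["rate_limit_headers"] = {
            k: v
            for k, v in headers.items()
            if "ratelimit" in k.lower().replace("-", "")
            or k.lower() == "retry-after"
        }
        payload = json.loads(body)
        # For list responses, inspect the first element's fields.
        sample = payload[0] if isinstance(payload, list) and payload else payload
        record["top_level_fields"] = sorted(sample) if isinstance(sample, dict) else []
        record["ok"] = 200 <= status < 300
    except Exception as exc:
        # HTTP errors, timeouts, and non-JSON bodies all land here.
        record["ok"] = False
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["elapsed_s"] = round(time.monotonic() - started, 3)
    return record
```

Passing `fetch` as a parameter lets the summarizing logic be exercised offline with a stub, so the probe format can be settled before any live request is made.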
## Checkpoint 3: Minimal BTC Market Discovery
Goal: build a small script that finds active BTC up/down Polymarket markets and resolves both outcome token IDs.
Artifacts:
- `scripts/discover_polymarket_btc_markets.py`
- `data/discovery/polymarket_btc_markets_latest.json`
- `data/discovery/polymarket_btc_markets_manifest.json`
- `data/discovery/polymarket_btc_markets.md`
Requirements:
- Public endpoints only.
- No trading.
- No API keys unless strictly needed for public data.
- Never store secrets in the repo.
- Preserve raw metadata responses.
- Write normalized market records with slug, question, conditionId, token IDs, outcomes, times, status, source, and `fetched_at_utc`.
Pass condition: the script reliably outputs currently active BTC markets with token IDs.
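
The normalized record described above could be built roughly as follows. This is a sketch under stated assumptions, not the actual `scripts/discover_polymarket_btc_markets.py`. The raw field names (`clobTokenIds`, `outcomes`, `startDate`, `endDate`, `active`, `closed`) and the possibility that list fields arrive as JSON-encoded strings are assumptions to be confirmed against the Checkpoint 2 probe output. The output key names other than `conditionId` and `fetched_at_utc` are likewise hypothetical.

```python
import json
from typing import Any


def _as_list(value: Any) -> list:
    """Accept a real list or a JSON-encoded list string; return a list."""
    if isinstance(value, str):
        value = json.loads(value)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


def normalize_market(raw: dict, source: str, fetched_at_utc: str) -> dict:
    """Map one raw metadata record onto the normalized market record."""
    token_ids = [str(t) for t in _as_list(raw["clobTokenIds"])]
    outcomes = [str(o) for o in _as_list(raw["outcomes"])]
    # An up/down market should resolve to exactly two outcome tokens.
    if len(token_ids) != 2 or len(outcomes) != 2:
        raise ValueError("expected exactly two outcomes and two token IDs")

    if raw.get("closed"):
        status = "closed"
    elif raw.get("active"):
        status = "active"
    else:
        status = "inactive"

    return {
        "slug": raw["slug"],
        "question": raw["question"],
        "conditionId": raw["conditionId"],
        "token_ids": token_ids,  # same order as "outcomes"
        "outcomes": outcomes,
        "start_time_utc": raw.get("startDate"),
        "end_time_utc": raw.get("endDate"),
        "status": status,
        "source": source,
        "fetched_at_utc": fetched_at_utc,
    }
```

Raising on a record without exactly two token IDs keeps malformed markets out of the normalized output while the raw response is still preserved separately.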
## Checkpoint 4: Minimal Orderbook Snapshot Collector
Goal: collect raw order-book snapshots for active BTC markets at a fixed interval.
Artifacts:
- `scripts/collect_polymarket_orderbooks.py`
- `config/polymarket_collector.example.yaml`
- `data/live_sample/...`
- `data/manifests/orderbook_collector_sample_manifest.json`
- `docs/POLYMARKET_COLLECTOR.md`
Requirements:
- Collect active BTC markets only.
- Fetch order books for both outcome tokens.
- Store raw API responses as gzip JSONL.
- Add local `collected_at_utc`, collector version, endpoint URL, and request params.
- Rotate files by hour or run.
- Include a manifest with timing, markets, request counts, status codes, rows, output files, and checksums.
- Handle graceful shutdown and rate limits.
- Do not add a database.
Pass condition: a 5-10 minute sample run creates valid compressed raw snapshots and a manifest.
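
The storage requirements above (raw gzip JSONL, local collection metadata, hourly rotation, checksums in the manifest) can be sketched as below. This is a minimal illustration, not the actual `scripts/collect_polymarket_orderbooks.py`. The file naming scheme, the envelope key names other than `collected_at_utc`, and `COLLECTOR_VERSION` are assumptions.

```python
import gzip
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

COLLECTOR_VERSION = "0.0.0-sketch"  # hypothetical version label


def hourly_path(root: Path, collected_at: datetime) -> Path:
    """Return the output file for the UTC hour containing collected_at."""
    utc = collected_at.astimezone(timezone.utc)
    return root / utc.strftime("%Y-%m-%d") / utc.strftime("orderbooks_%Y%m%dT%H.jsonl.gz")


def append_snapshot(
    root: Path,
    endpoint_url: str,
    request_params: dict,
    status_code: int,
    raw_body: str,
    collected_at: Optional[datetime] = None,
) -> Path:
    """Append one raw response, wrapped in collection metadata, as a JSONL row."""
    collected_at = collected_at or datetime.now(timezone.utc)
    envelope = {
        "collected_at_utc": collected_at.astimezone(timezone.utc).isoformat(),
        "collector_version": COLLECTOR_VERSION,
        "endpoint_url": endpoint_url,
        "request_params": request_params,
        "status_code": status_code,
        # The body is stored as the exact response text, not re-serialized.
        "raw_body": raw_body,
    }
    path = hourly_path(root, collected_at)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode adds a new gzip member; readers see one continuous stream.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        fh.write(json.dumps(envelope, separators=(",", ":")) + "\n")
    return path


def file_manifest_entry(path: Path) -> dict:
    """Summarize one finished output file for the run manifest."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        rows = sum(1 for _ in fh)
    return {"file": str(path), "bytes": path.stat().st_size, "rows": rows, "sha256": digest}
```

Keeping the response text verbatim inside the envelope means later normalization can always be redone from the raw file, which is the point of the raw-first policy.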
## Checkpoint 5: Normalized Snapshot Extract
Goal: create a derived normalized dataset from raw snapshots while preserving raw files as source of truth.
Artifacts:
- `scripts/normalize_polymarket_orderbooks.py`
- `data/normalized_sample/...`
- `data/manifests/orderbook_normalization_sample_manifest.json`
- `docs/ORDERBOOK_SCHEMA.md`
Pass condition: a sample raw file can be normalized and basic sanity checks pass.
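
As an illustration of what "normalized" and "basic sanity checks" might mean here, the sketch below flattens one raw order book into per-level rows and flags obviously broken data. It is not the actual `scripts/normalize_polymarket_orderbooks.py`. It assumes the raw book carries `bids` and `asks` lists of objects with string `price` and `size` fields, and that outcome-token prices fall strictly between 0 and 1. Both assumptions need checking against `docs/ORDERBOOK_SCHEMA.md` and real samples.

```python
from decimal import Decimal


def normalize_book(book: dict, token_id: str, collected_at_utc: str) -> list[dict]:
    """Flatten one raw book into rows, best level first on each side."""
    rows = []
    for side in ("bids", "asks"):
        levels = [(Decimal(lv["price"]), Decimal(lv["size"])) for lv in book.get(side, [])]
        # Best-first ordering: highest bid first, lowest ask first.
        levels.sort(key=lambda pair: pair[0], reverse=(side == "bids"))
        for rank, (price, size) in enumerate(levels):
            rows.append({
                "token_id": token_id,
                "collected_at_utc": collected_at_utc,
                "side": side[:-1],  # "bid" or "ask"
                "level": rank,
                "price": str(price),  # kept as strings to avoid float drift
                "size": str(size),
            })
    return rows


def sanity_check(rows: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for row in rows:
        price, size = Decimal(row["price"]), Decimal(row["size"])
        if not (Decimal(0) < price < Decimal(1)):
            problems.append(f"price out of range: {row['side']} {price}")
        if size <= 0:
            problems.append(f"non-positive size: {row['side']} {size}")
    bids = [Decimal(r["price"]) for r in rows if r["side"] == "bid"]
    asks = [Decimal(r["price"]) for r in rows if r["side"] == "ask"]
    if bids and asks and max(bids) >= min(asks):
        problems.append(f"crossed book: best bid {max(bids)} >= best ask {min(asks)}")
    return problems
```

A failed check should be reported in the normalization manifest rather than silently dropped, since the raw file remains the source of truth.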
## Checkpoint 6: VPS Runtime Package
Goal: make the collector deployable on a small VPS.
Artifacts:
- `systemd/polymarket-orderbook-collector.service`
- `config/polymarket_collector.vps.example.yaml`
- `scripts/run_polymarket_collector_cycle.sh`
- `docs/VPS_DEPLOYMENT.md`
Uploader service and timer units are deferred to Checkpoint 7 with Google Drive
offload. Creating empty uploader units in Checkpoint 6 would be fake progress.
Pass condition: a user can follow docs on a VPS and run the collector.
## Checkpoint 7: Google Drive Offload
Goal: add periodic upload to Google Drive using `rclone`.
Artifacts:
- `scripts/upload_archive_rclone.sh`
- `config/rclone.example.md`
- `docs/GOOGLE_DRIVE_OFFLOAD.md`
- sample upload manifest format
Pass condition: a dry-run and a real small test upload succeed and are documented.
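
The dry-run and real upload differ only by one flag, which the sketch below makes explicit. The repository's uploader is a shell script, so this Python version is purely illustrative. The function names, the `min_age` default, and the shape of the result record are assumptions, and the remote name in the usage note is hypothetical. The `rclone copy` subcommand and the `--dry-run`, `--checksum`, and `--min-age` flags are standard rclone options.

```python
import shlex
import subprocess


def build_rclone_copy(
    local_dir: str,
    remote: str,
    dry_run: bool = True,
    min_age: str = "15m",
) -> list[str]:
    """Build an rclone copy command; defaults to a dry run for safety."""
    cmd = [
        "rclone", "copy", local_dir, remote,
        "--checksum",          # compare by checksum, not only size and mtime
        "--min-age", min_age,  # skip files that may still be written to
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run_upload(cmd: list[str]) -> dict:
    """Run the command and return a small record for the upload manifest."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return {
        "command": shlex.join(cmd),
        "returncode": result.returncode,
        # Keep only the tail of stderr so the manifest stays small.
        "stderr_tail": result.stderr[-2000:],
    }
```

For example, `build_rclone_copy("data/live", "gdrive:orderbooks")` yields a dry-run command, and passing `dry_run=False` produces the real upload. Using `copy` rather than a mirroring subcommand means nothing is deleted on the remote, consistent with the rule against deleting artifacts.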
## Checkpoint 8: 24h Soak Test Plan
Goal: run the collector for a real 24h period and validate reliability.
Artifacts:
- `reports/soak_test_YYYY-MM-DD.md`
- `data/manifests/...`
Metrics:
- uptime
- markets tracked
- total snapshots
- missed interval estimate
- API errors
- rate limits
- file sizes
- compression ratio
- Google Drive upload status
- restart behavior
- disk usage
- data quality checks
Pass condition: a 24h run completes with acceptable data quality and documented issues.
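
Two of the metrics above, the missed interval estimate and the compression ratio, can be computed directly from the archived files. The sketch below shows one possible definition. The function names, the gap tolerance, and the rounding rule are assumptions, and the soak test report should state whichever definition is actually used. Timestamps are expected in ISO 8601 form with an explicit offset such as `+00:00`.

```python
from datetime import datetime


def estimate_missed_intervals(
    timestamps_utc: list[str],
    interval_s: float,
    tolerance: float = 0.5,
) -> dict:
    """Estimate skipped collection cycles from gaps between snapshots."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps_utc)
    missed = 0
    max_gap_s = 0.0
    for prev, cur in zip(times, times[1:]):
        gap = (cur - prev).total_seconds()
        max_gap_s = max(max_gap_s, gap)
        # A gap counts as a miss only if it exceeds the interval by the tolerance.
        if gap > interval_s * (1 + tolerance):
            missed += round(gap / interval_s) - 1
    expected = len(times) + missed
    return {
        "observed": len(times),
        "missed_estimate": missed,
        "missed_fraction": (missed / expected) if expected else 0.0,
        "max_gap_s": max_gap_s,
    }


def compression_ratio(uncompressed_bytes: int, compressed_bytes: int) -> float:
    """Return uncompressed size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return uncompressed_bytes / compressed_bytes
```

With a 10 second interval, snapshots at 0, 10, 20, 50, and 60 seconds contain one 30 second gap, which this definition counts as two missed cycles out of seven expected.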
## Checkpoint 9: Add Second Market Only After Polymarket Is Stable
Goal: prepare for NEAR or another market only after Polymarket collector reliability is proven.
Do not start this checkpoint until:
- Polymarket discovery works.
- Polymarket order-book collection works.
- Google Drive offload works.
- The 24h soak test is complete.
Architecture principles:
- Use `collectors/<market_name>/` only when adding the second market.
- Keep shared code minimal.
- Avoid abstract base classes until duplication is painful.
- Keep the raw-first, normalized-second, manifest-always file layout consistent across markets.
## Anti-Fake-Progress Gates
- No dashboard before 24h data reliability.
- No database before the file archive becomes painful.
- No strategy or backtest code in this project.
- No live trading.
- No generic multi-market abstraction before the second market exists.
- No claiming "production-ready" before a 24h soak test.
- No deleting bad artifacts; label them deprecated or invalid and write why.
## Next Smallest Step
Checkpoint 8 is next. It should run the collector for a real 24h period and report the metrics listed under Checkpoint 8, including uptime, missed intervals, API errors, rate limits, Google Drive upload status, restart behavior, disk usage, and data quality checks.