# Operations

This document defines operational rules before the collector exists. It should be updated with exact commands as checkpoints add scripts, services, and upload jobs.

## Current Operational Status

- Collector implementation: not started.
- Supported market: none yet; Polymarket is the first planned market.
- Deployment target: small VPS.
- Offload target: Google Drive through `rclone`.
- Reliability status: not production-ready until a documented 24h soak test passes.

## Safety Rules

- No trading.
- No order placement.
- No wallet signing.
- No private keys.
- No secrets in git.
- No dashboards, databases, ML, or strategy code before the roadmap gate allows them.

## Local Runtime Principles

Future scripts should:

- accept a configurable data directory
- write logs to a predictable location
- write raw gzip JSONL snapshots
- rotate files hourly or once per run
- close files cleanly on shutdown
- write manifests after runs
- avoid corrupting closed files on restart
- handle public endpoint errors and rate limits conservatively
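Several of these principles (configurable data directory, gzip JSONL output, hourly rotation, clean close, no damage to closed files on restart) can be sketched together. The following is a minimal illustration, not the collector's implementation: the class name `HourlyJsonlWriter`, the `snapshots-<hour>-<run_id>.jsonl.gz` naming scheme, and the `.part` suffix for open files are all assumptions made for this example.

```python
import gzip
import json
import os
from datetime import datetime, timezone
from pathlib import Path


class HourlyJsonlWriter:
    """Write snapshots as gzip JSONL, one file per UTC hour and per run.

    Hypothetical sketch: open files carry a ".part" suffix and are renamed
    to their final name only after a clean close.
    """

    def __init__(self, data_dir, run_id=None):
        self.data_dir = Path(data_dir)
        self.data_dir.mkdir(parents=True, exist_ok=True)
        # A per-run ID keeps a restarted process from reusing an old file name.
        self.run_id = run_id or f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}-{os.getpid()}"
        self._hour_key = None
        self._file = None
        self._part_path = None
        self._final_path = None
        self._closed = False

    def write(self, snapshot, now=None):
        if self._closed:
            raise RuntimeError("writer is closed")
        now = now or datetime.now(timezone.utc)
        hour_key = now.strftime("%Y%m%dT%H")
        if hour_key != self._hour_key:
            self._finish_current()
            final_path = self.data_dir / f"snapshots-{hour_key}-{self.run_id}.jsonl.gz"
            if final_path.exists():
                # Never overwrite a file that was already closed.
                raise FileExistsError(final_path)
            self._final_path = final_path
            self._part_path = final_path.with_name(final_path.name + ".part")
            # "xt" = exclusive create, text mode: fails if the .part file exists.
            self._file = gzip.open(self._part_path, "xt", encoding="utf-8")
            self._hour_key = hour_key
        self._file.write(json.dumps(snapshot, separators=(",", ":")) + "\n")

    def _finish_current(self):
        if self._file is None:
            return
        self._file.close()  # writes the gzip trailer
        self._file = None
        self._hour_key = None
        os.replace(self._part_path, self._final_path)

    def close(self):
        self._finish_current()
        self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
```

Used as `with HourlyJsonlWriter("data/raw") as w: w.write(snapshot)`, the file is closed and renamed even if the loop body raises. A file that still ends in `.part` after a crash is, under this naming assumption, a signal that it was not closed cleanly.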

## VPS Deployment Principles

Checkpoint 6 should document:

- Python version and virtualenv setup
- package installation
- environment variables
- systemd or Docker Compose runtime
- service user and file permissions
- data directory ownership
- log locations
- restart policy
- disk usage checks
- safe upgrade and rollback steps

## Google Drive Offload Principles

Checkpoint 7 should use `rclone` and must:

- avoid hardcoded credentials
- upload only closed or rotated files
- support dry-run mode
- verify upload success
- preserve local files until upload is verified
- maintain checksums
- keep the last N days locally
- write an upload manifest
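As a rough illustration of how these requirements might fit together, the sketch below selects only closed files, records SHA-256 checksums in a manifest, and builds `rclone` command lines with dry-run enabled by default. It is an assumption-laden example, not the Checkpoint 7 design: the function names, the `*.jsonl.gz` naming for closed files (with open files assumed to end in `.part`), the manifest layout, and the remote name `gdrive:orderbooks` are all hypothetical. The `rclone` flags used (`--include`, `--checksum`, `--dry-run`, `--one-way`) are standard ones; credentials stay in `rclone`'s own config, not in code.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def closed_files(data_dir):
    """Return closed files only (assumed naming: open files end in '.part')."""
    return sorted(Path(data_dir).glob("*.jsonl.gz"))


def write_upload_manifest(data_dir, manifest_path):
    """Record name, size, and checksum of every file eligible for upload."""
    entries = [
        {"name": p.name, "bytes": p.stat().st_size, "sha256": sha256_of(p)}
        for p in closed_files(data_dir)
    ]
    Path(manifest_path).write_text(json.dumps(entries, indent=2) + "\n", encoding="utf-8")
    return entries


def rclone_copy_command(data_dir, remote, dry_run=True):
    """Build an 'rclone copy' command; 'copy' never deletes local files."""
    cmd = ["rclone", "copy", str(data_dir), remote, "--include", "*.jsonl.gz", "--checksum"]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def rclone_check_command(data_dir, remote):
    """Build an 'rclone check' command to verify local files exist on the remote."""
    return ["rclone", "check", str(data_dir), remote, "--include", "*.jsonl.gz", "--one-way"]
```

A caller could run `subprocess.run(rclone_copy_command("data/raw", "gdrive:orderbooks", dry_run=False), check=True)` followed by the check command, and prune local files older than N days only after the check exits successfully. Note that `rclone check` compares whichever hash types the remote supports, so the SHA-256 values in the manifest serve as an independent local record.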

## Incident And Bad-Data Handling

If data looks wrong:

1. Preserve the raw files.
2. Stop relying on the affected derived files.
3. Label the artifact `invalid` or `deprecated`.
4. Write a short note explaining the issue and replacement, if any.
5. Record what was learned in docs or reports.

Examples of bad-data conditions:

- endpoint returned a schema different from expected
- token/outcome mapping was wrong
- timestamps were misunderstood
- rate limits caused large gaps
- gzip file was not closed cleanly
- upload succeeded but checksum did not match
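One of these conditions, a gzip file that was not closed cleanly, can be detected mechanically by decompressing the whole stream and watching for errors. The sketch below is one possible check; the function name `gzip_is_complete` is hypothetical, and treating a zero-byte file as incomplete is a choice made for this example.

```python
import gzip
import os
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the file decompresses end to end without error."""
    if os.path.getsize(path) == 0:
        # Assumption for this sketch: an empty file counts as not cleanly closed.
        return False
    try:
        with gzip.open(path, "rb") as f:
            # Read to the end so the trailer (CRC and length) is verified.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream truncated; OSError covers gzip.BadGzipFile
        # (bad header or CRC mismatch); zlib.error: corrupt deflate data.
        return False
    return True
```

Running this over every closed file before upload gives a simple pass/fail signal. A failing file should still be preserved, per step 1 above, rather than deleted.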

## Minimum Reliability Claim

A short sample run can prove that code writes files. It cannot prove 24/7 reliability.

The project may only claim production readiness after:

- discovery works
- raw order-book collection works
- offload works
- a documented 24h soak test passes
- data quality and gap metrics are documented
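What counts as a gap metric is not yet defined here. As one possible starting point, the sketch below reports every interval between consecutive snapshots that exceeds a threshold. The function name `gap_report`, the use of epoch-second collection timestamps, and the threshold value are assumptions for illustration only.

```python
def gap_report(timestamps, max_gap_seconds):
    """Summarize intervals between consecutive snapshots that exceed a threshold.

    `timestamps` are assumed to be collection times in epoch seconds.
    """
    ordered = sorted(timestamps)
    gaps = [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap_seconds
    ]
    return {
        "snapshots": len(ordered),
        "gap_count": len(gaps),
        "largest_gap_seconds": max((later - earlier for earlier, later in gaps), default=0),
        "gaps": gaps,
    }
```

For example, `gap_report([0, 10, 20, 100, 110], max_gap_seconds=15)` reports one gap, from 20 to 100, with a largest gap of 80 seconds. A soak-test report could include this summary per market and per hour.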