# Kubernetes Deployment

Status: draft runtime package for Checkpoint 8G

This document describes the Kubernetes package for the Polymarket raw order-book collector. It follows the shared Hetzner k3s cluster model from `../nuri/unrip3`: application code, Dockerfile, manifests, and Forgejo workflow live in this repository; platform services, the shared registry, and the shared Forgejo runner remain platform-owned.

This package does not claim production readiness. Production readiness still requires a real Kubernetes runtime smoke run with preserved evidence.

## Cluster Decisions

- Namespace: `orderbooks`
- Workstation kubeconfig for validation: `../nuri/unrip3/.state/hetzner/kubeconfig.yaml`
- Shared registry and shared Forgejo runner
- Existing rclone Secret: `orderbooks/orderbooks-rclone-config`
- Secret key mounted by the uploader: `rclone.conf`

Do not commit or print rclone config contents.

## Runtime Layout

The collector and uploader share one PVC:

```text
PVC: orderbooks-data
mount: /var/lib/orderbooks
raw files: /var/lib/orderbooks/raw_orderbooks
manifests: /var/lib/orderbooks/manifests
discovery: /var/lib/orderbooks/discovery
```

The REST snapshot collector uses one Deployment with one replica. The container runs `/app/scripts/run_polymarket_collector_loop.sh`, which repeatedly executes the existing bounded collector cycle and records loop failure/interruption manifests instead of relying on Kubernetes crash loops for normal operation.

The websocket recorder canary uses a separate Deployment named `orderbooks-ws-recorder`. It runs `/app/scripts/run_polymarket_ws_recorder_loop.sh` and does not replace or stop `orderbooks-collector`. It writes raw websocket archives under `/var/lib/orderbooks/raw_orderbooks/polymarket/ws_raw/`, REST checkpoint archives under `/var/lib/orderbooks/raw_orderbooks/polymarket/rest_checkpoints/`, and runtime manifests under `/var/lib/orderbooks/manifests/`.

The uploader uses one CronJob.
It runs the existing rclone uploader in execute mode, mounts the same PVC, mounts `orderbooks-rclone-config` read-only at `/etc/rclone/rclone.conf`, sets `RCLONE_CONFIG` to that file, uploads only closed/aged files, skips `.open`/temporary writer files, and uses `--cleanup-after-verify`. Local cleanup is allowed only after rclone copy and check succeed.

The Kubernetes retention setting is 3 days because websocket raw capture is materially larger than REST snapshots and the current PVC is 10Gi.

## Bootstrap This App Repo

Run the orderbooks-specific bootstrap from this repository:

```sh
scripts/deploy/bootstrap_orderbooks_k8s.sh
```

The bootstrap loads platform defaults and resolved secrets from the local platform state without printing secret values. It ensures namespace `orderbooks`, creates or updates `orderbooks-registry-creds`, verifies the existing `orderbooks-rclone-config` secret has key `rclone.conf`, creates or updates the Forgejo repo `philipp/orderbooks`, and upserts the required Actions secret and variables.

After bootstrap, push a clean source tree to Forgejo `main`. Do not push local `data/`, `artifacts/`, `reports/`, `orchestration/`, kubeconfigs, rclone config, `.env`, private keys, or other local evidence/secrets.

## Image Build And Deploy

The Forgejo workflow is `.forgejo/workflows/deploy.yml`. It follows the shared runner pattern:

1. load `KUBECONFIG_B64` from Forgejo secrets;
2. clone this repo inside the runner;
3. create an in-cluster Kaniko Job;
4. build and push `REGISTRY_HOST/orderbooks:`;
5. apply `deploy/k8s/base` with the built image;
6. wait for `deployment/orderbooks-collector` and `deployment/orderbooks-ws-recorder` rollout.

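
The final rollout-wait step can be sketched as follows. This is a minimal illustration, not the contents of `deploy.yml`: the function name `rollout_commands` and the `300s` timeout are hypothetical, and the sketch only builds `kubectl rollout status` command lines from a comma-separated deployment list without running them.

```python
import shlex


def rollout_commands(namespace, deployments_csv, timeout="300s"):
    """Build one `kubectl rollout status` argv per deployment name.

    `timeout` is an illustrative value, not one taken from the workflow.
    """
    names = [name.strip() for name in deployments_csv.split(",") if name.strip()]
    return [
        [
            "kubectl", "-n", namespace,
            "rollout", "status", f"deployment/{name}",
            f"--timeout={timeout}",
        ]
        for name in names
    ]


# Print the commands for the two Deployments named in this document.
for argv in rollout_commands("orderbooks", "orderbooks-collector,orderbooks-ws-recorder"):
    print(shlex.join(argv))
```

Building the argument lists separately from executing them keeps the wait scope easy to review, which matters when a deploy path is supposed to roll only one of the two Deployments.
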

Required Forgejo repo secret:

```text
KUBECONFIG_B64
```

Required Forgejo repo variable:

```text
REGISTRY_HOST
```

Project defaults used by the workflow:

```text
PROJECT_NAME=orderbooks
PROJECT_NAMESPACE=orderbooks
PROJECT_DEPLOYMENTS=orderbooks-collector,orderbooks-ws-recorder
PROJECT_REGISTRY_SECRET_NAME=orderbooks-registry-creds
```

The registry pull/build secret `orderbooks-registry-creds` must exist in the `orderbooks` namespace before the workflow builds and deploys.

Pushes to `main` are intentionally non-deploying during the websocket canary work. `workflow_dispatch` remains the broad release path and may roll both Deployments listed in `PROJECT_DEPLOYMENTS`. Do not use that broad workflow for websocket-only canary evidence.

## Websocket Canary-Only Deploy Path

Checkpoint 10D1 uses `scripts/deploy/deploy_ws_canary_kaniko.sh` for the websocket canary. The helper builds an image from the committed Forgejo `main` SHA with an in-cluster Kaniko Job, then applies only:

```text
namespace.yaml
configmap.yaml
pvc.yaml
cronjob-uploader.yaml
deployment-ws-recorder.yaml
```

It does not apply `deployment-collector.yaml`, does not set the `orderbooks-collector` image, and waits only for `deployment/orderbooks-ws-recorder`.

Validate the scoped apply set first:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  scripts/deploy/deploy_ws_canary_kaniko.sh --server-dry-run
```

After a clean source-only commit has been pushed to Forgejo `main`, deploy the canary with:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  scripts/deploy/deploy_ws_canary_kaniko.sh --git-ref "$(git rev-parse HEAD)"
```

The helper writes compact deploy evidence under `data/manifests/ws_canary_deploy_.json`.

## Websocket Recorder Canary

Checkpoint 10D adds the websocket recorder as a canary, not as a replacement for the REST snapshot collector.
The canary subscribes to public Polymarket market websocket messages for active BTC Up/Down token IDs, preserves every websocket text payload exactly in `raw_text`, and keeps periodic REST `/books` checkpoints for recovery and divergence evidence.

The script and example config default to `market_limit: 0`, which means all discovered active BTC Up/Down markets. The Kubernetes canary config currently sets `market_limit: 2` and `manifest_write_interval_seconds: 60` as explicit smoke/safety settings. The 10D local bounded run wrote about 3.35 MB of compressed websocket data in two minutes for two markets; running all active BTC markets on the current 10Gi PVC needs a separate sizing or retention decision before removing the cap. Do not use a cap silently in production evidence.

Raw/current file safety:

- completed archives end in `.jsonl.gz`;
- the recorder writes current gzip files with a hidden `.open` name and renames them only after close;
- the uploader skips `.open`, `.tmp`, and `.partial` files;
- verified cleanup deletes local files only after rclone verification succeeds.

## Pre-Deploy Validation

From this repository:

```sh
bash -n scripts/run_polymarket_collector_loop.sh
bash -n scripts/run_polymarket_ws_recorder_loop.sh
bash -n scripts/k8s_runtime_smoke_check.sh
bash -n scripts/k8s_ws_runtime_smoke_check.sh
python -m py_compile scripts/collect_polymarket_ws_orderbooks.py
kubectl kustomize deploy/k8s/base
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml kubectl apply -k deploy/k8s/base --dry-run=server
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml kubectl -n orderbooks get secret orderbooks-rclone-config \
  -o go-template='{{if index .data "rclone.conf"}}rclone_secret_key_present{{else}}rclone_secret_key_missing{{end}}{{"\n"}}'
```

The last command checks only whether the key exists. It must not print secret data.
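
The same presence-only check can be sketched in Python for cases where the Secret is already available as JSON (for example from `kubectl get secret ... -o json`). The function name `secret_key_status` is hypothetical and the sample input uses a dummy value; the sketch returns only a status string and never echoes the Secret's data.

```python
import json


def secret_key_status(secret_json, key="rclone.conf"):
    """Report whether `key` is present and non-empty, without exposing its value."""
    data = json.loads(secret_json).get("data") or {}
    # Mirror the go-template above: a missing or empty value counts as missing.
    if data.get(key):
        return "rclone_secret_key_present"
    return "rclone_secret_key_missing"


# Dummy Secret document for illustration only; it holds no real config.
sample = json.dumps({"kind": "Secret", "data": {"rclone.conf": "ZmFrZQ=="}})
print(secret_key_status(sample))
```

Returning a fixed status string rather than the looked-up value keeps the check safe to run in logged CI output.
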
## Runtime Smoke Gate

After the image is built and the workload is actually deployed, run:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml scripts/k8s_runtime_smoke_check.sh \
  --namespace orderbooks \
  --deployment orderbooks-collector \
  --cronjob orderbooks-uploader \
  --raw-dir /var/lib/orderbooks/raw_orderbooks \
  --manifest-dir /var/lib/orderbooks/manifests \
  --wait-seconds 1800 \
  --upload-min-age-seconds 600
```

The smoke gate uses `kubectl`, not systemd. It writes local JSON evidence under `data/manifests/k8s_runtime_smoke_.json` by default.

It verifies:

- collector pod is running;
- latest collector manifest has `gate_status: PASS`, `rows_written > 0`, and `failure_count: 0`;
- raw gzip JSONL parses and is under `/var/lib/orderbooks/raw_orderbooks`;
- deleting the collector pod does not corrupt the old raw file checksum or row count;
- a later post-restart collector cycle writes valid rows;
- an uploader Job created from the CronJob completes;
- the latest upload manifest records a verified rclone upload with at least one verified file.

A failed smoke run still writes JSON evidence and exits nonzero. Preserve failed manifests, raw files, upload manifests, and pod logs for review.

## Not Included

- No trading, signing, wallets, private keys, or API keys.
- No dashboard, database, strategy, backtest, or second-market connector.
- No websocket rewrite.
- No rclone config contents in this repository.
## Websocket Canary Smoke Gate

After the canary image is deployed and has run long enough to close at least one websocket and REST checkpoint archive, run:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml scripts/k8s_ws_runtime_smoke_check.sh \
  --namespace orderbooks \
  --deployment orderbooks-ws-recorder \
  --rest-deployment orderbooks-collector \
  --cronjob orderbooks-uploader \
  --wait-seconds 900 \
  --upload-min-age-seconds 600
```

The smoke gate verifies the websocket pod is running, raw websocket gzip JSONL parses, REST checkpoint gzip JSONL parses, manifests expose reconnect/stale and divergence counters, pod deletion/restart does not corrupt the prior closed raw file or produces a SIGTERM-closed archive when no prior closed file exists, a later pod writes new data, and the existing REST collector remains healthy.

For upload evidence it creates a one-off uploader Job from the deployed image and same PVC/secret with `ORDERBOOKS_UPLOAD_MIN_AGE_SECONDS=0`, then verifies the upload manifest has `UPLOAD_VERIFIED`, `gate_status: PASS`, and at least one verified websocket recorder raw or REST checkpoint file. Production CronJob upload min age remains 600 seconds.
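
The "gzip JSONL parses" checks can be approximated locally on a copied archive. The sketch below is an assumption about what such a check might look like, not the logic of `k8s_ws_runtime_smoke_check.sh`: the function name `count_jsonl_gz_rows` is hypothetical, and skipping blank lines is a choice made here for illustration.

```python
import gzip
import json
from pathlib import Path


def count_jsonl_gz_rows(path):
    """Return the number of JSON rows in a gzip JSONL archive.

    A truncated or corrupt gzip stream raises from the gzip module;
    a row that is not valid JSON raises ValueError with its line number.
    """
    rows = 0
    with gzip.open(Path(path), "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are tolerated, not counted
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON row") from exc
            rows += 1
    return rows
```

Comparing the row count of a closed archive before and after a pod deletion is one simple way to confirm that a restart did not alter it.
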