philipp 0d86f56514 Add websocket recorder canary deployment

2026-04-19 19:17:56 +02:00

Kubernetes Deployment

Status: draft runtime package for Checkpoint 8G

This document describes the Kubernetes package for the Polymarket raw order-book collector. It follows the shared Hetzner k3s cluster model from ../nuri/unrip3: application code, Dockerfile, manifests, and Forgejo workflow live in this repository; platform services, the shared registry, and the shared Forgejo runner remain platform-owned.

This package does not claim production readiness. Production readiness still requires a real Kubernetes runtime smoke run with preserved evidence.

Cluster Decisions

Namespace: orderbooks
Workstation kubeconfig for validation: ../nuri/unrip3/.state/hetzner/kubeconfig.yaml
Shared registry and shared Forgejo runner
Existing rclone Secret: orderbooks/orderbooks-rclone-config
Secret key mounted by the uploader: rclone.conf

Do not commit or print rclone config contents.

Runtime Layout

The collector and uploader share one PVC:

PVC: orderbooks-data
mount: /var/lib/orderbooks
raw files: /var/lib/orderbooks/raw_orderbooks
manifests: /var/lib/orderbooks/manifests
discovery: /var/lib/orderbooks/discovery

The REST snapshot collector uses one Deployment with one replica. The container runs /app/scripts/run_polymarket_collector_loop.sh, which repeatedly executes the existing bounded collector cycle and records loop failure/interruption manifests instead of relying on Kubernetes crash loops for normal operation.
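The idea of recording loop failures as manifests, rather than letting the container exit and restart, can be sketched as follows. This is an illustrative Python sketch only, not the contents of run_polymarket_collector_loop.sh: the function names, the manifest fields (event, exit_code, recorded_at), and the loop_failure_ file prefix are hypothetical, and interruption handling is omitted.

```python
import json
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def run_cycle_recording_failures(cycle_cmd: List[str], manifest_dir: Path) -> int:
    """Run one bounded cycle; on a nonzero exit, write a failure manifest."""
    result = subprocess.run(cycle_cmd)
    if result.returncode != 0:
        # Microsecond timestamp keeps file names distinct across fast failures.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
        manifest = {
            "event": "loop_failure",          # hypothetical field
            "exit_code": result.returncode,   # hypothetical field
            "recorded_at": stamp,             # hypothetical field
        }
        manifest_dir.mkdir(parents=True, exist_ok=True)
        out_path = manifest_dir / f"loop_failure_{stamp}.json"
        out_path.write_text(json.dumps(manifest), encoding="utf-8")
    return result.returncode


def run_loop(cycle_cmd: List[str], manifest_dir: Path,
             pause_seconds: float, max_cycles: Optional[int] = None) -> None:
    """Repeat the cycle; a failed cycle is recorded and the loop continues."""
    completed = 0
    while max_cycles is None or completed < max_cycles:
        run_cycle_recording_failures(cycle_cmd, manifest_dir)
        completed += 1
        time.sleep(pause_seconds)
```

With this shape the pod stays up across a failed cycle, and the failure remains visible as a file on the PVC rather than only as a restart count.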

The websocket recorder canary uses a separate Deployment named orderbooks-ws-recorder. It runs /app/scripts/run_polymarket_ws_recorder_loop.sh and does not replace or stop orderbooks-collector. It writes raw websocket archives under /var/lib/orderbooks/raw_orderbooks/polymarket/ws_raw/, REST checkpoint archives under /var/lib/orderbooks/raw_orderbooks/polymarket/rest_checkpoints/, and runtime manifests under /var/lib/orderbooks/manifests/.

The uploader uses one CronJob. It runs the existing rclone uploader in execute mode, mounts the same PVC, mounts orderbooks-rclone-config read-only at /etc/rclone/rclone.conf, sets RCLONE_CONFIG to that file, uploads only closed/aged files, skips .open/temporary writer files, and uses --cleanup-after-verify. Local cleanup is allowed only after rclone copy and check succeed. The Kubernetes retention setting is 3 days because websocket raw capture is materially larger than REST snapshots and the current PVC is 10Gi.
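The file selection rule described above (upload only closed, aged files and skip writer files) can be sketched as a small predicate. This is a minimal illustration, not the repository's uploader: the name is_upload_candidate, the suffix-based matching, and the use of file modification time as the age signal are assumptions.

```python
import time
from pathlib import Path
from typing import Optional

# Writer/temporary markers to skip; matching them as suffixes is an assumption.
SKIPPED_SUFFIXES = (".open", ".tmp", ".partial")
# Completed archives are expected to end in .jsonl.gz.
CLOSED_SUFFIX = ".jsonl.gz"


def is_upload_candidate(path: Path, min_age_seconds: float,
                        now: Optional[float] = None) -> bool:
    """Return True only for a closed archive that is old enough to upload."""
    now = time.time() if now is None else now
    name = path.name
    if name.endswith(SKIPPED_SUFFIXES):
        return False  # still being written, or a temporary file
    if not name.endswith(CLOSED_SUFFIX):
        return False  # not a completed archive
    age_seconds = now - path.stat().st_mtime
    return age_seconds >= min_age_seconds
```

For example, with min_age_seconds=600 a closed archive last modified 5 minutes ago is not yet a candidate, and a file whose name ends in .open is never one.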

Bootstrap This App Repo

Run the orderbooks-specific bootstrap from this repository:

scripts/deploy/bootstrap_orderbooks_k8s.sh

The bootstrap loads platform defaults and resolved secrets from the local platform state without printing secret values. It ensures namespace orderbooks, creates or updates orderbooks-registry-creds, verifies the existing orderbooks-rclone-config secret has key rclone.conf, creates or updates the Forgejo repo philipp/orderbooks, and upserts the required Actions secret and variables.

After bootstrap, push a clean source tree to Forgejo main. Do not push local data/, artifacts/, reports/, orchestration/, kubeconfigs, rclone config, .env, private keys, or other local evidence/secrets.

Image Build And Deploy

The Forgejo workflow is .forgejo/workflows/deploy.yml. It follows the shared runner pattern:

load KUBECONFIG_B64 from Forgejo secrets;
clone this repo inside the runner;
create an in-cluster Kaniko Job;
build and push REGISTRY_HOST/orderbooks:<git-sha>;
apply deploy/k8s/base with the built image;
wait for deployment/orderbooks-collector and deployment/orderbooks-ws-recorder rollout.

Required Forgejo repo secret:

KUBECONFIG_B64

Required Forgejo repo variable:

REGISTRY_HOST

Project defaults used by the workflow:

PROJECT_NAME=orderbooks
PROJECT_NAMESPACE=orderbooks
PROJECT_DEPLOYMENTS=orderbooks-collector,orderbooks-ws-recorder
PROJECT_REGISTRY_SECRET_NAME=orderbooks-registry-creds

The registry pull/build secret orderbooks-registry-creds must exist in the orderbooks namespace before the workflow builds and deploys.

Pushes to main are intentionally non-deploying during the websocket canary work. workflow_dispatch remains the broad release path and may roll both Deployments listed in PROJECT_DEPLOYMENTS. Do not use that broad workflow for websocket-only canary evidence.

Websocket Canary-Only Deploy Path

Checkpoint 10D1 uses scripts/deploy/deploy_ws_canary_kaniko.sh for the websocket canary. The helper builds an image from the committed Forgejo main SHA with an in-cluster Kaniko Job, then applies only:

namespace.yaml
configmap.yaml
pvc.yaml
cronjob-uploader.yaml
deployment-ws-recorder.yaml

It does not apply deployment-collector.yaml, does not set the orderbooks-collector image, and waits only for deployment/orderbooks-ws-recorder. Validate the scoped apply set first:

KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  scripts/deploy/deploy_ws_canary_kaniko.sh --server-dry-run

After a clean source-only commit has been pushed to Forgejo main, deploy the canary with:

KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  scripts/deploy/deploy_ws_canary_kaniko.sh --git-ref "$(git rev-parse HEAD)"

The helper writes compact deploy evidence under data/manifests/ws_canary_deploy_<UTC_TIMESTAMP>.json.

Websocket Recorder Canary

Checkpoint 10D adds the websocket recorder as a canary, not as a replacement for the REST snapshot collector. The canary subscribes to public Polymarket market websocket messages for active BTC Up/Down token IDs, preserves every websocket text payload exactly in raw_text, and keeps periodic REST /books checkpoints for recovery and divergence evidence.

The script and example config default to market_limit: 0, which means all discovered active BTC Up/Down markets. The Kubernetes canary config currently sets market_limit: 2 and manifest_write_interval_seconds: 60 as explicit smoke/safety settings. The 10D local bounded run wrote about 3.35 MB of compressed websocket data in two minutes for two markets; running all active BTC markets on the current 10Gi PVC needs a separate sizing or retention decision before removing the cap. Do not present evidence from a capped run as production evidence without stating the cap explicitly.
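A back-of-envelope extrapolation shows why the cap matters. The sketch below assumes the observed two-minute rate stays constant, that data scales linearly with market count, that "MB" is decimal, and that a full 3 days of capture may sit on the PVC at once; it ignores REST checkpoints and manifests. None of these assumptions is measured, so treat the result as an order-of-magnitude estimate only.

```python
# Observed in the 10D local bounded run (figures from the text above).
OBSERVED_BYTES = 3.35e6      # about 3.35 MB, decimal megabytes assumed
OBSERVED_SECONDS = 120       # two minutes
OBSERVED_MARKETS = 2

# Kubernetes settings from this document.
RETENTION_DAYS = 3
PVC_BYTES = 10 * 1024**3     # 10Gi

SECONDS_PER_DAY = 86400


def retained_bytes(markets: int) -> float:
    """Estimated websocket bytes kept on the PVC over the retention window."""
    per_market_per_second = OBSERVED_BYTES / OBSERVED_SECONDS / OBSERVED_MARKETS
    return per_market_per_second * markets * SECONDS_PER_DAY * RETENTION_DAYS


def max_markets_that_fit() -> int:
    """Largest market count whose estimated retained data fits the PVC."""
    return int(PVC_BYTES // retained_bytes(1))


if __name__ == "__main__":
    two = retained_bytes(2)
    print(f"2 markets, 3 days: {two / 1e9:.2f} GB "
          f"({two / PVC_BYTES:.0%} of the PVC)")
    print(f"markets that fit: {max_markets_that_fit()}")
```

Under these assumptions, two markets retain roughly 7.2 GB over three days, about two thirds of the 10Gi volume, and a third market would not fit.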

Raw/current file safety:

completed archives end in .jsonl.gz;
the recorder writes current gzip files with a hidden .open name and renames them only after close;
the uploader skips .open, .tmp, and .partial files;
verified cleanup deletes local files only after rclone verification succeeds.
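The write-then-rename rule in the list above can be sketched as follows. This is a minimal illustration under assumptions, not the recorder's implementation: the function name write_closed_archive and the exact hidden name (a leading dot plus a trailing .open) are hypothetical.

```python
import gzip
import json
import os
from pathlib import Path
from typing import Iterable


def write_closed_archive(final_path: Path, records: Iterable[dict]) -> Path:
    """Write gzip JSONL under a hidden .open name, then rename after close."""
    # Hypothetical naming: ".<final name>.open" in the same directory.
    open_path = final_path.with_name("." + final_path.name + ".open")
    with gzip.open(open_path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    # The gzip stream is fully closed here, so the trailer has been written.
    # os.replace is atomic when source and destination share a filesystem.
    os.replace(open_path, final_path)
    return final_path
```

Because the final .jsonl.gz name only appears after the stream is closed, a reader that ignores .open files never sees a partially written archive.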

Pre-Deploy Validation

From this repository:

bash -n scripts/run_polymarket_collector_loop.sh
bash -n scripts/run_polymarket_ws_recorder_loop.sh
bash -n scripts/k8s_runtime_smoke_check.sh
bash -n scripts/k8s_ws_runtime_smoke_check.sh
python -m py_compile scripts/collect_polymarket_ws_orderbooks.py
kubectl kustomize deploy/k8s/base
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  kubectl apply -k deploy/k8s/base --dry-run=server
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  kubectl -n orderbooks get secret orderbooks-rclone-config \
  -o go-template='{{if index .data "rclone.conf"}}rclone_secret_key_present{{else}}rclone_secret_key_missing{{end}}{{"\n"}}'

The last command checks only whether the key exists. It must not print secret data.

Runtime Smoke Gate

After the image is built and the workload is actually deployed, run:

KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  scripts/k8s_runtime_smoke_check.sh \
  --namespace orderbooks \
  --deployment orderbooks-collector \
  --cronjob orderbooks-uploader \
  --raw-dir /var/lib/orderbooks/raw_orderbooks \
  --manifest-dir /var/lib/orderbooks/manifests \
  --wait-seconds 1800 \
  --upload-min-age-seconds 600

The smoke gate uses kubectl, not systemd. It writes local JSON evidence under data/manifests/k8s_runtime_smoke_<UTC_TIMESTAMP>.json by default. It verifies:

collector pod is running;
latest collector manifest has gate_status: PASS, rows_written > 0, and failure_count: 0;
raw gzip JSONL parses and is under /var/lib/orderbooks/raw_orderbooks;
deleting the collector pod does not corrupt the old raw file checksum or row count;
a later post-restart collector cycle writes valid rows;
an uploader Job created from the CronJob completes;
the latest upload manifest records a verified rclone upload with at least one verified file.

A failed smoke run still writes JSON evidence and exits nonzero. Preserve failed manifests, raw files, upload manifests, and pod logs for review.
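The collector manifest conditions listed above can be expressed as a small check. This is a sketch, not the logic of k8s_runtime_smoke_check.sh: the field names gate_status, rows_written, and failure_count come from the list above, but the function name and the assumption that they are top-level keys of a flat JSON object are hypothetical.

```python
from typing import List


def collector_manifest_failures(manifest: dict) -> List[str]:
    """Return human-readable reasons the manifest fails the gate (empty = pass)."""
    failures = []
    if manifest.get("gate_status") != "PASS":
        failures.append(f"gate_status is {manifest.get('gate_status')!r}, expected 'PASS'")
    rows = manifest.get("rows_written")
    if not isinstance(rows, int) or rows <= 0:
        failures.append(f"rows_written is {rows!r}, expected an integer > 0")
    if manifest.get("failure_count") != 0:
        failures.append(f"failure_count is {manifest.get('failure_count')!r}, expected 0")
    return failures


if __name__ == "__main__":
    example = {"gate_status": "PASS", "rows_written": 0, "failure_count": 0}
    for reason in collector_manifest_failures(example):
        print(reason)
```

Returning every failing reason, rather than stopping at the first, keeps the preserved evidence complete when several conditions fail in the same run.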

Not Included

No trading, signing, wallets, private keys, or API keys.
No dashboard, database, strategy, backtest, or second-market connector.
No websocket rewrite.
No rclone config contents in this repository.

Websocket Canary Smoke Gate

After the canary image is deployed and has run long enough to close at least one websocket and REST checkpoint archive, run:

KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
  scripts/k8s_ws_runtime_smoke_check.sh \
  --namespace orderbooks \
  --deployment orderbooks-ws-recorder \
  --rest-deployment orderbooks-collector \
  --cronjob orderbooks-uploader \
  --wait-seconds 900 \
  --upload-min-age-seconds 600

The smoke gate verifies:

the websocket pod is running;
raw websocket gzip JSONL parses;
REST checkpoint gzip JSONL parses;
manifests expose reconnect/stale and divergence counters;
pod deletion/restart does not corrupt the prior closed raw file or, when no prior closed file exists, produces a SIGTERM-closed archive;
a later pod writes new data;
the existing REST collector remains healthy.

For upload evidence it creates a one-off uploader Job from the deployed image and the same PVC and Secret with ORDERBOOKS_UPLOAD_MIN_AGE_SECONDS=0, then verifies that the upload manifest has UPLOAD_VERIFIED, gate_status: PASS, and at least one verified websocket recorder raw or REST checkpoint file. The production CronJob upload minimum age remains 600 seconds.
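Both smoke gates rely on two file-level checks: that a gzip JSONL archive parses, and that a closed archive is unchanged across a pod restart. A minimal sketch of both, assuming one JSON value per non-empty line and SHA-256 as the checksum (the smoke scripts' actual checksum algorithm is not stated here, and the function names are hypothetical):

```python
import gzip
import hashlib
import json
from pathlib import Path


def count_jsonl_gz_rows(path: Path) -> int:
    """Parse every line of a gzip JSONL file and return the row count.

    Raises ValueError on invalid JSON; a truncated or corrupt gzip stream
    raises EOFError or OSError from the gzip module.
    """
    rows = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
            rows += 1
    return rows


def sha256_of(path: Path) -> str:
    """Checksum of the compressed file bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_after_restart(path: Path, checksum_before: str, rows_before: int) -> bool:
    """True if the closed archive still has the same checksum and row count."""
    return sha256_of(path) == checksum_before and count_jsonl_gz_rows(path) == rows_before
```

A caller would record sha256_of and count_jsonl_gz_rows for a closed archive before deleting the pod, then call unchanged_after_restart once the replacement pod is running.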
