Kubernetes Deployment
Status: draft runtime package for Checkpoint 8G
This document describes the Kubernetes package for the Polymarket raw
order-book collector. It follows the shared Hetzner k3s cluster model from
../nuri/unrip3: application code, Dockerfile, manifests, and Forgejo workflow
live in this repository; platform services, the shared registry, and the shared
Forgejo runner remain platform-owned.
This package does not claim production readiness. Production readiness still requires a real Kubernetes runtime smoke run with preserved evidence.
Cluster Decisions
- Namespace: orderbooks
- Workstation kubeconfig for validation: ../nuri/unrip3/.state/hetzner/kubeconfig.yaml
- Shared registry and shared Forgejo runner
- Existing rclone Secret: orderbooks/orderbooks-rclone-config
- Secret key mounted by the uploader: rclone.conf
Do not commit or print rclone config contents.
Runtime Layout
The collector and uploader share one PVC:
PVC: orderbooks-data
mount: /var/lib/orderbooks
raw files: /var/lib/orderbooks/raw_orderbooks
manifests: /var/lib/orderbooks/manifests
discovery: /var/lib/orderbooks/discovery
The REST snapshot collector uses one Deployment with one replica. The container
runs /app/scripts/run_polymarket_collector_loop.sh, which repeatedly executes
the existing bounded collector cycle and records loop failure/interruption
manifests instead of relying on Kubernetes crash loops for normal operation.
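The loop behaviour described above (keep running bounded cycles and record a manifest on failure or interruption rather than exiting) can be sketched roughly as follows. This is an illustration of the pattern only, not the contents of run_polymarket_collector_loop.sh: the function names, the status strings, the manifest file naming, and the max_cycles bound (added so the sketch terminates) are all assumptions.

```python
import json
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def write_loop_manifest(manifest_dir: Path, cycle: int, status: str, detail: str) -> Path:
    """Persist a small JSON record describing why a cycle did not finish."""
    manifest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical file naming; the real loop script may use a different scheme.
    path = manifest_dir / f"loop_{status.lower()}_{stamp}_{cycle}.json"
    path.write_text(json.dumps({"status": status, "cycle": cycle, "detail": detail}))
    return path


def run_loop(run_cycle: Callable[[], None], manifest_dir: Path,
             max_cycles: int, sleep_seconds: float = 0.0) -> int:
    """Run bounded cycles back to back; record failures instead of crashing."""
    failures = 0
    for cycle in range(max_cycles):
        try:
            run_cycle()
        except KeyboardInterrupt:
            # Interruption is recorded, then propagated so the process can stop.
            write_loop_manifest(manifest_dir, cycle, "LOOP_INTERRUPTED", "interrupted")
            raise
        except Exception as exc:
            # A failed cycle leaves evidence behind and the loop continues.
            failures += 1
            write_loop_manifest(manifest_dir, cycle, "LOOP_FAILURE", repr(exc))
        time.sleep(sleep_seconds)
    return failures
```

The point of the design is that a single bad cycle leaves a manifest on the PVC and the next cycle still runs, so pod restarts stay reserved for genuinely abnormal conditions.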
The websocket recorder canary uses a separate Deployment named
orderbooks-ws-recorder. It runs /app/scripts/run_polymarket_ws_recorder_loop.sh
and does not replace or stop orderbooks-collector. It writes raw websocket
archives under /var/lib/orderbooks/raw_orderbooks/polymarket/ws_raw/, REST
checkpoint archives under /var/lib/orderbooks/raw_orderbooks/polymarket/rest_checkpoints/,
and runtime manifests under /var/lib/orderbooks/manifests/.
The uploader uses one CronJob. It runs the existing rclone uploader in execute
mode, mounts the same PVC, mounts orderbooks-rclone-config read-only at
/etc/rclone/rclone.conf, sets RCLONE_CONFIG to that file, uploads only
closed/aged files, skips .open/temporary writer files, and uses
--cleanup-after-verify. Local cleanup is allowed only after rclone copy and
check succeed. The Kubernetes retention setting is 3 days because websocket raw
capture is materially larger than REST snapshots and the current PVC is 10Gi.
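The eligibility rule described above (upload only closed, aged files and skip in-progress writer files) can be sketched as a small predicate. This is a minimal illustration, not the repository's uploader: the function name is_upload_candidate, the suffix constants, and the use of file mtime as the age signal are assumptions, with the suffixes and the 600-second default taken from values mentioned elsewhere in this document.

```python
from pathlib import Path

# Assumed values for illustration; the real uploader's settings may differ.
COMPLETED_SUFFIX = ".jsonl.gz"
IN_PROGRESS_SUFFIXES = (".open", ".tmp", ".partial")


def is_upload_candidate(path: Path, now: float, min_age_seconds: float = 600.0) -> bool:
    """Return True only for closed archives that have been idle long enough."""
    name = path.name
    if name.endswith(IN_PROGRESS_SUFFIXES):
        return False  # still owned by a writer
    if not name.endswith(COMPLETED_SUFFIX):
        return False  # not a completed archive
    # Age is measured from the last modification time (an assumption).
    age_seconds = now - path.stat().st_mtime
    return age_seconds >= min_age_seconds
```

A caller would pass time.time() as now and iterate over the raw directory, handing only the files that pass this check to the copy-and-verify step.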
Bootstrap This App Repo
Run the orderbooks-specific bootstrap from this repository:
scripts/deploy/bootstrap_orderbooks_k8s.sh
The bootstrap loads platform defaults and resolved secrets from the local
platform state without printing secret values. It ensures namespace orderbooks,
creates or updates orderbooks-registry-creds, verifies the existing
orderbooks-rclone-config secret has key rclone.conf, creates or updates the
Forgejo repo philipp/orderbooks, and upserts the required Actions secret and
variables.
After bootstrap, push a clean source tree to Forgejo main. Do not push local
data/, artifacts/, reports/, orchestration/, kubeconfigs, rclone config,
.env, private keys, or other local evidence/secrets.
Image Build And Deploy
The Forgejo workflow is .forgejo/workflows/deploy.yml. It follows the shared
runner pattern:
- load KUBECONFIG_B64 from Forgejo secrets;
- clone this repo inside the runner;
- create an in-cluster Kaniko Job;
- build and push REGISTRY_HOST/orderbooks:<git-sha>;
- apply deploy/k8s/base with the built image;
- wait for deployment/orderbooks-collector and deployment/orderbooks-ws-recorder rollout.
Required Forgejo repo secret:
KUBECONFIG_B64
Required Forgejo repo variable:
REGISTRY_HOST
Project defaults used by the workflow:
PROJECT_NAME=orderbooks
PROJECT_NAMESPACE=orderbooks
PROJECT_DEPLOYMENTS=orderbooks-collector,orderbooks-ws-recorder
PROJECT_REGISTRY_SECRET_NAME=orderbooks-registry-creds
The registry pull/build secret orderbooks-registry-creds must exist in the
orderbooks namespace before the workflow builds and deploys.
Pushes to main are intentionally non-deploying during the websocket canary
work. workflow_dispatch remains the broad release path and may roll both
Deployments listed in PROJECT_DEPLOYMENTS. Do not use that broad workflow for
websocket-only canary evidence.
Websocket Canary-Only Deploy Path
Checkpoint 10D1 uses scripts/deploy/deploy_ws_canary_kaniko.sh for the
websocket canary. The helper builds an image from the committed Forgejo main
SHA with an in-cluster Kaniko Job, then applies only:
namespace.yaml
configmap.yaml
pvc.yaml
cronjob-uploader.yaml
deployment-ws-recorder.yaml
It does not apply deployment-collector.yaml, does not set the
orderbooks-collector image, and waits only for
deployment/orderbooks-ws-recorder. Validate the scoped apply set first:
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/deploy/deploy_ws_canary_kaniko.sh --server-dry-run
After a clean source-only commit has been pushed to Forgejo main, deploy the
canary with:
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/deploy/deploy_ws_canary_kaniko.sh --git-ref "$(git rev-parse HEAD)"
The helper writes compact deploy evidence under
data/manifests/ws_canary_deploy_<UTC_TIMESTAMP>.json.
Websocket Recorder Canary
Checkpoint 10D adds the websocket recorder as a canary, not as a replacement for
the REST snapshot collector. The canary subscribes to public Polymarket market
websocket messages for active BTC Up/Down token IDs, preserves every websocket
text payload exactly in raw_text, and keeps periodic REST /books checkpoints
for recovery and divergence evidence.
The script and example config default to market_limit: 0, which means all
discovered active BTC Up/Down markets. The Kubernetes canary config currently
sets market_limit: 2, manifest_write_interval_seconds: 60, first_message_timeout_seconds: 90, and stale_feed_threshold_seconds: 90 as explicit
smoke/safety settings. The 10D local bounded run
wrote about 3.35 MB of compressed websocket data in two minutes for two markets;
running all active BTC markets on the current 10Gi PVC needs a separate sizing
or retention decision before removing the cap. Production evidence must state
any cap in effect; do not apply one silently.
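To make the sizing concern concrete, the observed rate can be extrapolated linearly. The inputs (3.35 MB in two minutes for two markets, a 10Gi PVC, 3 days of retention) come from this document; everything else is an assumption: that 1 MB means 10^6 bytes, that the per-market rate stays constant, and that a full retention window of archives could sit on the PVC at once.

```python
# Observed in the 10D local bounded run (figures from this document).
OBSERVED_MB = 3.35
OBSERVED_MINUTES = 2
OBSERVED_MARKETS = 2

PVC_BYTES = 10 * 1024**3  # 10Gi
RETENTION_DAYS = 3


def projected_bytes(markets: int, days: float) -> float:
    """Linear extrapolation; assumes a constant per-market rate and 1 MB = 10^6 bytes."""
    mb_per_market_minute = OBSERVED_MB / OBSERVED_MINUTES / OBSERVED_MARKETS
    return mb_per_market_minute * 1_000_000 * markets * days * 24 * 60


def max_markets_within(pvc_fraction: float = 1.0) -> int:
    """Largest market count whose retention window fits in the given PVC share."""
    per_market = projected_bytes(1, RETENTION_DAYS)
    return int(PVC_BYTES * pvc_fraction // per_market)


if __name__ == "__main__":
    two_markets = projected_bytes(2, RETENTION_DAYS)
    print(f"two markets, {RETENTION_DAYS} days: {two_markets / 1e9:.2f} GB")
    print(f"share of PVC: {two_markets / PVC_BYTES:.0%}")
    print(f"markets that fit: {max_markets_within()}")
```

Under these assumptions, two markets held for three days project to roughly 7.2 GB, about 67% of the 10Gi volume, before counting REST snapshots and manifests that share the same PVC. A two-minute sample is a thin basis for a steady-state rate, so the result is an order-of-magnitude guide rather than a capacity plan.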
Raw/current file safety:
- completed archives end in .jsonl.gz;
- the recorder writes current gzip files with a hidden .open name and renames them only after close;
- the uploader skips .open, .tmp, and .partial files;
- verified cleanup deletes local files only after rclone verification succeeds.
Pre-Deploy Validation
From this repository:
bash -n scripts/run_polymarket_collector_loop.sh
bash -n scripts/run_polymarket_ws_recorder_loop.sh
bash -n scripts/k8s_runtime_smoke_check.sh
bash -n scripts/k8s_ws_runtime_smoke_check.sh
python -m py_compile scripts/collect_polymarket_ws_orderbooks.py
kubectl kustomize deploy/k8s/base
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml kubectl apply -k deploy/k8s/base --dry-run=server
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml kubectl -n orderbooks get secret orderbooks-rclone-config -o go-template='{{if index .data "rclone.conf"}}rclone_secret_key_present{{else}}rclone_secret_key_missing{{end}}{{"\n"}}'
The last command checks only whether the key exists. It must not print secret data.
Runtime Smoke Gate
After the image is built and the workload is actually deployed, run:
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/k8s_runtime_smoke_check.sh \
--namespace orderbooks \
--deployment orderbooks-collector \
--cronjob orderbooks-uploader \
--raw-dir /var/lib/orderbooks/raw_orderbooks \
--manifest-dir /var/lib/orderbooks/manifests \
--wait-seconds 1800 \
--upload-min-age-seconds 600
The smoke gate uses kubectl, not systemd. It writes local JSON evidence under
data/manifests/k8s_runtime_smoke_<UTC_TIMESTAMP>.json by default. It verifies:
- collector pod is running;
- latest collector manifest has gate_status: PASS, rows_written > 0, and failure_count: 0;
- raw gzip JSONL parses and is under /var/lib/orderbooks/raw_orderbooks;
- deleting the collector pod does not corrupt the old raw file checksum or row count;
- a later post-restart collector cycle writes valid rows;
- an uploader Job created from the CronJob completes;
- the latest upload manifest records a verified rclone upload with at least one verified file.
A failed smoke run still writes JSON evidence and exits nonzero. Preserve failed manifests, raw files, upload manifests, and pod logs for review.
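Two of the checks listed above, the manifest gate and the checksum/row-count comparison across a pod restart, can be sketched in a few lines. The field names gate_status, rows_written, and failure_count come from this document; the assumption that they are top-level manifest keys, the choice of SHA-256, and the helper names are illustrative and are not taken from k8s_runtime_smoke_check.sh.

```python
import gzip
import hashlib
import json
from pathlib import Path


def manifest_problems(manifest: dict) -> list[str]:
    """Return reasons the manifest fails the gate; an empty list means it passes."""
    problems = []
    if manifest.get("gate_status") != "PASS":
        problems.append("gate_status is not PASS")
    rows = manifest.get("rows_written")
    if not isinstance(rows, int) or rows <= 0:
        problems.append("rows_written is not > 0")
    if manifest.get("failure_count") != 0:
        problems.append("failure_count is not 0")
    return problems


def fingerprint(path: Path) -> tuple[str, int]:
    """Return (sha256 hex digest, parsed row count) for a gzip JSONL archive."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    rows = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed row
                rows += 1
    return digest, rows
```

Taking a fingerprint of a closed raw file before deleting the pod and comparing it with the fingerprint afterwards gives a direct test that the restart left the archive untouched.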
Websocket Reliability Observation
After deploying a websocket recorder reliability fix, run a read-only bounded observation before treating the canary as unattended:
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/k8s_ws_reliability_check.sh --wait-seconds 1800
The observation fails if websocket message counts and archive mtimes do not advance while active tokens exist, if REST checkpoints stop succeeding, if parse errors appear, or if reconnect/stale counters grow rapidly without recovery. It also records the REST collector image/readiness before and after the observation.
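The failure conditions above amount to comparing two snapshots of recorder state taken at the start and end of the observation window. The sketch below illustrates that comparison only; every field name, the Snapshot type, and the fixed reconnect budget are hypothetical stand-ins, and the "without recovery" condition is simplified to a growth limit.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # All field names are hypothetical stand-ins for manifest counters.
    active_tokens: int
    ws_messages: int
    latest_archive_mtime: float
    rest_checkpoints_ok: int
    parse_errors: int
    reconnects: int


def observation_failures(before: Snapshot, after: Snapshot,
                         max_reconnect_growth: int = 10) -> list[str]:
    """Compare two snapshots; return the reasons the observation should fail."""
    failures = []
    if after.active_tokens > 0:
        # Progress is only required while there is something to record.
        if after.ws_messages <= before.ws_messages:
            failures.append("websocket message count did not advance")
        if after.latest_archive_mtime <= before.latest_archive_mtime:
            failures.append("archive mtime did not advance")
    if after.rest_checkpoints_ok <= before.rest_checkpoints_ok:
        failures.append("REST checkpoints stopped succeeding")
    if after.parse_errors > before.parse_errors:
        failures.append("parse errors appeared")
    if after.reconnects - before.reconnects > max_reconnect_growth:
        failures.append("reconnect counter grew past the allowed budget")
    return failures
```

An empty result means none of the modelled conditions fired during the window. It does not replace the script's own evidence, which also covers the REST collector image and readiness.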
Not Included
- No trading, signing, wallets, private keys, or API keys.
- No dashboard, database, strategy, backtest, or second-market connector.
- No websocket rewrite.
- No rclone config contents in this repository.
Websocket Canary Smoke Gate
After the canary image is deployed and has run long enough to close at least one websocket and REST checkpoint archive, run:
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/k8s_ws_runtime_smoke_check.sh \
--namespace orderbooks \
--deployment orderbooks-ws-recorder \
--rest-deployment orderbooks-collector \
--cronjob orderbooks-uploader \
--wait-seconds 900 \
--upload-min-age-seconds 600
The smoke gate verifies:
- the websocket pod is running;
- raw websocket gzip JSONL parses;
- REST checkpoint gzip JSONL parses;
- manifests expose reconnect/stale and divergence counters;
- pod deletion/restart does not corrupt the prior closed raw file or, when no prior closed file exists, produces a SIGTERM-closed archive;
- a later pod writes new data;
- the existing REST collector remains healthy.
For upload evidence it creates a one-off uploader Job from the deployed image
and the same PVC/secret with ORDERBOOKS_UPLOAD_MIN_AGE_SECONDS=0, then verifies
that the upload manifest has UPLOAD_VERIFIED, gate_status: PASS, and at least
one verified websocket recorder raw or REST checkpoint file. The production
CronJob upload min age remains 600 seconds.