# Kubernetes Deployment

Status: draft runtime package for Checkpoint 8G

This document describes the Kubernetes package for the Polymarket raw
order-book collector. It follows the shared Hetzner k3s cluster model from
`../nuri/unrip3`: application code, Dockerfile, manifests, and Forgejo workflow
live in this repository; platform services, the shared registry, and the shared
Forgejo runner remain platform-owned.

This package does not claim production readiness. Production readiness still
requires a real Kubernetes runtime smoke run with preserved evidence.

## Cluster Decisions

- Namespace: `orderbooks`
- Workstation kubeconfig for validation: `../nuri/unrip3/.state/hetzner/kubeconfig.yaml`
- Shared registry and shared Forgejo runner
- Existing rclone Secret: `orderbooks/orderbooks-rclone-config`
- Secret key mounted by the uploader: `rclone.conf`

Do not commit or print rclone config contents.

## Runtime Layout

The collector and uploader share one PVC:

```text
PVC: orderbooks-data
mount: /var/lib/orderbooks
raw files: /var/lib/orderbooks/raw_orderbooks
manifests: /var/lib/orderbooks/manifests
discovery: /var/lib/orderbooks/discovery
```

The REST snapshot collector uses one Deployment with one replica. The container
runs `/app/scripts/run_polymarket_collector_loop.sh`, which repeatedly executes
the existing bounded collector cycle and records loop failure/interruption
manifests instead of relying on Kubernetes crash loops for normal operation.

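The loop behaviour described above can be sketched roughly as follows. This is
a minimal illustration, not the contents of `run_polymarket_collector_loop.sh`:
`run_loop`, `record_failure`, the manifest file name, and the JSON fields are
all hypothetical, and interruption handling is left out.

```sh
#!/bin/sh
# Hypothetical sketch of a "record failures, keep looping" wrapper.

# record_failure DIR EXIT_STATUS CYCLE: write a small JSON failure manifest.
record_failure() {
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  printf '{"event":"cycle_failed","exit_status":%s,"cycle":%s,"utc":"%s"}\n' \
    "$2" "$3" "$ts" > "$1/loop_failure_${ts}_cycle$3.json"
}

# run_loop MANIFEST_DIR MAX_CYCLES SLEEP_SECONDS CMD [ARGS...]
# Runs CMD repeatedly (MAX_CYCLES=0 means forever). A failing cycle is
# recorded as a manifest and the loop continues instead of exiting.
run_loop() {
  manifest_dir=$1 max_cycles=$2 sleep_seconds=$3
  shift 3
  cycle=0
  while [ "$max_cycles" -eq 0 ] || [ "$cycle" -lt "$max_cycles" ]; do
    cycle=$((cycle + 1))
    "$@" || record_failure "$manifest_dir" "$?" "$cycle"
    sleep "$sleep_seconds"
  done
}

# Example (hypothetical cycle command):
#   run_loop /var/lib/orderbooks/manifests 0 30 ./one_bounded_cycle.sh
```
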
The websocket recorder canary uses a separate Deployment named
`orderbooks-ws-recorder`. It runs `/app/scripts/run_polymarket_ws_recorder_loop.sh`
and does not replace or stop `orderbooks-collector`. It writes raw websocket
archives under `/var/lib/orderbooks/raw_orderbooks/polymarket/ws_raw/`, REST
checkpoint archives under `/var/lib/orderbooks/raw_orderbooks/polymarket/rest_checkpoints/`,
and runtime manifests under `/var/lib/orderbooks/manifests/`.

The uploader uses one CronJob. It runs the existing rclone uploader in execute
mode and mounts the same PVC. The `orderbooks-rclone-config` Secret is mounted
read-only at `/etc/rclone/rclone.conf`, and `RCLONE_CONFIG` is set to that file.
The uploader uploads only closed/aged files, skips `.open`/temporary writer
files, and uses `--cleanup-after-verify`: local cleanup is allowed only after
rclone copy and check succeed. The Kubernetes retention setting is 3 days
because websocket raw capture is materially larger than REST snapshots and the
current PVC is 10Gi.

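As a rough illustration of the upload selection rule, the sketch below lists
files that are old enough and are not writer temporaries. It is not the
uploader's actual code: `eligible_files` is a hypothetical name, and
`find -mmin` (available in GNU and BSD `find`) stands in for whatever age check
the real uploader performs.

```sh
#!/bin/sh
# eligible_files ROOT MIN_AGE_MINUTES
# Prints regular files under ROOT that were last modified more than
# MIN_AGE_MINUTES ago, skipping .open, .tmp, and .partial writer files.
eligible_files() {
  find "$1" -type f \
    ! -name '*.open' ! -name '*.tmp' ! -name '*.partial' \
    -mmin "+$2" | sort
}

# Example: a 600-second minimum age corresponds to roughly 10 minutes.
#   eligible_files /var/lib/orderbooks/raw_orderbooks 10
```
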
## Bootstrap This App Repo

Run the orderbooks-specific bootstrap from this repository:

```sh
scripts/deploy/bootstrap_orderbooks_k8s.sh
```

The bootstrap loads platform defaults and resolved secrets from the local
platform state without printing secret values. It ensures namespace `orderbooks`,
creates or updates `orderbooks-registry-creds`, verifies the existing
`orderbooks-rclone-config` secret has key `rclone.conf`, creates or updates the
Forgejo repo `philipp/orderbooks`, and upserts the required Actions secret and
variables.

After bootstrap, push a clean source tree to Forgejo `main`. Do not push local
`data/`, `artifacts/`, `reports/`, `orchestration/`, kubeconfigs, rclone config,
`.env`, private keys, or other local evidence/secrets.

## Image Build And Deploy

The Forgejo workflow is `.forgejo/workflows/deploy.yml`. It follows the shared
runner pattern:

1. load `KUBECONFIG_B64` from Forgejo secrets;
2. clone this repo inside the runner;
3. create an in-cluster Kaniko Job;
4. build and push `REGISTRY_HOST/orderbooks:<git-sha>`;
5. apply `deploy/k8s/base` with the built image;
6. wait for `deployment/orderbooks-collector` and `deployment/orderbooks-ws-recorder` rollout.

Required Forgejo repo secret:

```text
KUBECONFIG_B64
```

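Step 1 of the workflow can be pictured with the sketch below, which decodes the
secret into a private temporary file without echoing its contents. It assumes
that `KUBECONFIG_B64` holds a base64-encoded kubeconfig (the name suggests
this, but the workflow file is the authority) and that `base64 -d` from GNU
coreutils is available; `write_kubeconfig` is a hypothetical helper name.

```sh
#!/bin/sh
# write_kubeconfig: decode $KUBECONFIG_B64 into a temp file readable only by
# the current user and print the file path (never the contents).
# Runs in a subshell so the umask change does not leak into the caller.
write_kubeconfig() (
  umask 077
  path=$(mktemp) || exit 1
  if ! printf '%s' "$KUBECONFIG_B64" | base64 -d > "$path"; then
    rm -f "$path"
    exit 1
  fi
  printf '%s\n' "$path"
)

# Example:
#   KUBECONFIG=$(write_kubeconfig) && export KUBECONFIG
```
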
Required Forgejo repo variable:

```text
REGISTRY_HOST
```

Project defaults used by the workflow:

```text
PROJECT_NAME=orderbooks
PROJECT_NAMESPACE=orderbooks
PROJECT_DEPLOYMENTS=orderbooks-collector,orderbooks-ws-recorder
PROJECT_REGISTRY_SECRET_NAME=orderbooks-registry-creds
```

The registry pull/build secret `orderbooks-registry-creds` must exist in the
`orderbooks` namespace before the workflow builds and deploys.

Pushes to `main` are intentionally non-deploying during the websocket canary
work. `workflow_dispatch` remains the broad release path and may roll both
Deployments listed in `PROJECT_DEPLOYMENTS`. Do not use that broad workflow for
websocket-only canary evidence.

## Websocket Canary-Only Deploy Path

Checkpoint 10D1 uses `scripts/deploy/deploy_ws_canary_kaniko.sh` for the
websocket canary. The helper builds an image from the committed Forgejo `main`
SHA with an in-cluster Kaniko Job, then applies only:

```text
namespace.yaml
configmap.yaml
pvc.yaml
cronjob-uploader.yaml
deployment-ws-recorder.yaml
```

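One way to picture the scoped apply set is a helper that emits exactly these
five manifest paths and nothing else. This is only a sketch: `canary_manifests`
is a hypothetical name, and the `deploy/k8s/base` directory is an assumption
based on the kustomize path used elsewhere in this document, not a description
of how `deploy_ws_canary_kaniko.sh` is written.

```sh
#!/bin/sh
# canary_manifests [BASE_DIR]
# Prints one manifest path per line for the canary-only apply set.
# deployment-collector.yaml is deliberately absent from the list.
canary_manifests() {
  base=${1:-deploy/k8s/base}
  for name in namespace configmap pvc cronjob-uploader deployment-ws-recorder; do
    printf '%s/%s.yaml\n' "$base" "$name"
  done
}

# Example: server-side dry run of each file in the scoped set.
#   canary_manifests | while read -r m; do
#     kubectl apply --dry-run=server -f "$m"
#   done
```
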
The helper does not apply `deployment-collector.yaml`, does not set the
`orderbooks-collector` image, and waits only for
`deployment/orderbooks-ws-recorder`. Validate the scoped apply set first:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/deploy/deploy_ws_canary_kaniko.sh --server-dry-run
```

After a clean source-only commit has been pushed to Forgejo `main`, deploy the
canary with:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/deploy/deploy_ws_canary_kaniko.sh --git-ref "$(git rev-parse HEAD)"
```

The helper writes compact deploy evidence under
`data/manifests/ws_canary_deploy_<UTC_TIMESTAMP>.json`.

## Websocket Recorder Canary

Checkpoint 10D adds the websocket recorder as a canary, not as a replacement for
the REST snapshot collector. The canary subscribes to public Polymarket market
websocket messages for active BTC Up/Down token IDs, preserves every websocket
text payload exactly in `raw_text`, and keeps periodic REST `/books` checkpoints
for recovery and divergence evidence.

The script and example config default to `market_limit: 0`, which means all
discovered active BTC Up/Down markets. The Kubernetes canary config currently
sets `market_limit: 2`, `manifest_write_interval_seconds: 60`,
`first_message_timeout_seconds: 90`, and `stale_feed_threshold_seconds: 90` as
explicit smoke/safety settings. The 10D local bounded run wrote about 3.35 MB of
compressed websocket data in two minutes for two markets; running all active BTC
markets on the current 10Gi PVC needs a separate sizing or retention decision
before removing the cap. Do not use a cap silently in production evidence.

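To make the sizing concern concrete, the quoted figure can be extrapolated with
a little arithmetic. The sketch assumes the two-minute local rate holds
constantly, counts websocket data only, and treats the full 3-day retention
window as resident on the PVC; `scale_mb` is a hypothetical helper, and real
volumes may differ substantially.

```sh
#!/bin/sh
# scale_mb MB MINUTES DAYS
# Scales an observed volume (MB written in MINUTES) to DAYS of capture.
scale_mb() {
  awk -v mb="$1" -v min="$2" -v days="$3" \
    'BEGIN { printf "%.0f\n", mb / min * 60 * 24 * days }'
}

scale_mb 3.35 2 1   # prints 2412 (MB per day for two markets)
scale_mb 3.35 2 3   # prints 7236 (MB over a 3-day window)
```

Under these assumptions two markets alone would account for roughly two thirds
of a 10Gi volume over three days, which is why removing the cap needs its own
sizing or retention decision.
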
Raw/current file safety:

- completed archives end in `.jsonl.gz`;
- the recorder writes current gzip files with a hidden `.open` name and renames
  them only after close;
- the uploader skips `.open`, `.tmp`, and `.partial` files;
- verified cleanup deletes local files only after rclone verification succeeds.

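The write-then-rename rule in the list above follows a common pattern, sketched
here in shell. The exact file naming is an assumption (a leading dot plus an
`.open` suffix), `write_archive` is a hypothetical name, and the recorder
itself may implement this differently.

```sh
#!/bin/sh
# write_archive DIR NAME
# Reads JSONL from stdin, compresses it into a hidden .open file, and renames
# it to the final .jsonl.gz name only after the gzip stream closed cleanly.
# A rename within one directory is atomic on POSIX filesystems, so readers
# never observe a half-written file under the final name.
write_archive() {
  final="$1/$2.jsonl.gz"
  open="$1/.$2.jsonl.gz.open"
  gzip -c > "$open" && mv "$open" "$final"
}

# Example:
#   printf '{"raw_text":"..."}\n' | write_archive /tmp/demo sample
```
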
## Pre-Deploy Validation

From this repository:

```sh
bash -n scripts/run_polymarket_collector_loop.sh
bash -n scripts/run_polymarket_ws_recorder_loop.sh
bash -n scripts/k8s_runtime_smoke_check.sh
bash -n scripts/k8s_ws_runtime_smoke_check.sh
python -m py_compile scripts/collect_polymarket_ws_orderbooks.py
kubectl kustomize deploy/k8s/base
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
kubectl apply -k deploy/k8s/base --dry-run=server
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
kubectl -n orderbooks get secret orderbooks-rclone-config \
  -o go-template='{{if index .data "rclone.conf"}}rclone_secret_key_present{{else}}rclone_secret_key_missing{{end}}{{"\n"}}'
```

The last command checks only whether the key exists. It must not print secret
data.

## Runtime Smoke Gate

After the image is built and the workload is actually deployed, run:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/k8s_runtime_smoke_check.sh \
  --namespace orderbooks \
  --deployment orderbooks-collector \
  --cronjob orderbooks-uploader \
  --raw-dir /var/lib/orderbooks/raw_orderbooks \
  --manifest-dir /var/lib/orderbooks/manifests \
  --wait-seconds 1800 \
  --upload-min-age-seconds 600
```

The smoke gate uses `kubectl`, not systemd. It writes local JSON evidence under
`data/manifests/k8s_runtime_smoke_<UTC_TIMESTAMP>.json` by default. It verifies:

- collector pod is running;
- latest collector manifest has `gate_status: PASS`, `rows_written > 0`, and
  `failure_count: 0`;
- raw gzip JSONL parses and is under `/var/lib/orderbooks/raw_orderbooks`;
- deleting the collector pod does not corrupt the old raw file checksum or row
  count;
- a later post-restart collector cycle writes valid rows;
- an uploader Job created from the CronJob completes;
- the latest upload manifest records a verified rclone upload with at least one
  verified file.

A failed smoke run still writes JSON evidence and exits nonzero. Preserve failed
manifests, raw files, upload manifests, and pod logs for review.

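The checksum and row-count comparison in the list above can be illustrated on a
local copy of an archive. This sketch is not taken from
`k8s_runtime_smoke_check.sh`: `fingerprint` is a hypothetical helper, it assumes
`sha256sum` from GNU coreutils, and the real gate inspects files on the PVC
through `kubectl`.

```sh
#!/bin/sh
# fingerprint FILE
# Prints "<sha256> <row count>" for a gzip JSONL archive, or fails if the
# archive is truncated or otherwise not valid gzip.
fingerprint() {
  gzip -t "$1" || return 1
  sum=$(sha256sum "$1" | cut -d ' ' -f 1)
  rows=$(gzip -cd "$1" | wc -l)
  printf '%s %s\n' "$sum" "$((rows))"
}

# Example: take a fingerprint before and after the pod restart and compare.
#   before=$(fingerprint copy_before.jsonl.gz)
#   after=$(fingerprint copy_after.jsonl.gz)
#   [ "$before" = "$after" ] || echo "raw file changed across restart" >&2
```
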
## Websocket Reliability Observation

After deploying a websocket recorder reliability fix, run a read-only bounded
observation before treating the canary as unattended:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/k8s_ws_reliability_check.sh --wait-seconds 1800
```

The observation fails if websocket message counts and archive mtimes do not
advance while active tokens exist, if REST checkpoints stop succeeding, if parse
errors appear, or if reconnect/stale counters grow rapidly without recovery. It
also records the REST collector image/readiness before and after the observation.

## Not Included

- No trading, signing, wallets, private keys, or API keys.
- No dashboard, database, strategy, backtest, or second-market connector.
- No websocket rewrite.
- No rclone config contents in this repository.

## Websocket Canary Smoke Gate

After the canary image is deployed and has run long enough to close at least one
websocket and REST checkpoint archive, run:

```sh
KUBECONFIG=../nuri/unrip3/.state/hetzner/kubeconfig.yaml \
scripts/k8s_ws_runtime_smoke_check.sh \
  --namespace orderbooks \
  --deployment orderbooks-ws-recorder \
  --rest-deployment orderbooks-collector \
  --cronjob orderbooks-uploader \
  --wait-seconds 900 \
  --upload-min-age-seconds 600
```

The smoke gate verifies:

- the websocket pod is running;
- raw websocket gzip JSONL parses;
- REST checkpoint gzip JSONL parses;
- manifests expose reconnect/stale and divergence counters;
- pod deletion/restart does not corrupt the prior closed raw file or, when no
  prior closed file exists, produces a SIGTERM-closed archive;
- a later pod writes new data;
- the existing REST collector remains healthy.

For upload evidence it creates a one-off uploader Job from the deployed image
and the same PVC/secret with `ORDERBOOKS_UPLOAD_MIN_AGE_SECONDS=0`, then
verifies that the upload manifest has `UPLOAD_VERIFIED`, `gate_status: PASS`,
and at least one verified websocket recorder raw or REST checkpoint file.
Production CronJob upload min age remains 600 seconds.