philipp/doran

Philipp 2a32461e39 feat: bootstrap hetzner k3s deployment

2026-03-28 20:53:29 +01:00

near-intents-monitor

Production-shaped first slice of the trading system:

venue ingest: NEAR Intents solver-bus quote flow
bus: Redpanda first, Kafka-compatible by design
reactor: dummy decision engine emitting commands
executor: dummy execution worker with durable idempotency state
result consumer: downstream observer of execution outcomes

Canonical repo shape

src/
  apps/
    near-intents-ingest.mjs
    dummy-reactor.mjs
    dummy-executor.mjs
    dummy-consumer.mjs
  bus/
    kafka/
      producer.mjs
      consumer.mjs
  core/
    event-envelope.mjs
    executor-state-store.mjs
    log.mjs
    pair-filter.mjs
    schemas.mjs
  lib/
    config.mjs
    env.mjs
  venues/
    near-intents/
      ingest.mjs
      normalize.mjs
      ws.mjs
compose.yml
Dockerfile
docs/contracts.md
deploy/hetzner/README.md

Event flow

NEAR Intents WebSocket
        |
        +--> raw.near_intents.quote
        |
        v
norm.swap_demand
        |
        v
cmd.execute_trade
        |
        v
exec.trade_result

Core rule: services do not call each other directly for trading flow; they communicate through bus topics only.
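
To make the rule concrete, here is a minimal in-memory sketch of the four-topic flow. It is not the repo's Redpanda/Kafka code: `createBus`, `ingestQuote`, the handler bodies, and every payload field (`quote_id`, `pair`, `status`) are hypothetical stand-ins. Only the topic names and the stage order come from this README.

```javascript
// Minimal in-memory stand-in for the bus. The real system uses Redpanda/Kafka;
// this only illustrates that stages interact through topics, never direct calls.
function createBus() {
  const handlers = new Map(); // topic -> array of subscriber callbacks
  return {
    subscribe(topic, handler) {
      if (!handlers.has(topic)) handlers.set(topic, []);
      handlers.get(topic).push(handler);
    },
    publish(topic, event) {
      for (const handler of handlers.get(topic) ?? []) handler(event);
    },
  };
}

const bus = createBus();
const results = [];

// venue ingest: publishes the raw quote and its normalized form (fields are hypothetical)
function ingestQuote(raw) {
  bus.publish('raw.near_intents.quote', raw);
  bus.publish('norm.swap_demand', { quote_id: raw.quote_id, pair: raw.pair });
}

// reactor: swap demand -> trade command
bus.subscribe('norm.swap_demand', (demand) =>
  bus.publish('cmd.execute_trade', { command_id: `cmd-${demand.quote_id}`, pair: demand.pair }));

// executor: trade command -> trade result
bus.subscribe('cmd.execute_trade', (command) =>
  bus.publish('exec.trade_result', { command_id: command.command_id, status: 'completed' }));

// result consumer: observes outcomes only
bus.subscribe('exec.trade_result', (result) => results.push(result));

ingestQuote({ quote_id: 'q1', pair: 'asset_a->asset_b' });
```

Each stage only knows the topic it reads and the topic it writes, which is what allows a stage to be replaced or replayed independently.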

Contracts

See docs/contracts.md.

Current topics:

raw.near_intents.quote
norm.swap_demand
cmd.execute_trade
exec.trade_result

Primary deployment path: repo-driven Hetzner bootstrap

The primary production path is no longer a Compose-only VM workflow.

The intended operating model is:

Terraform provisions a Hetzner single-node environment
cloud-init installs k3s automatically on first boot
a local operator workstation performs the first repo-driven bootstrap
Kubernetes manifests install Redpanda, the app workloads, Forgejo, runner, registry, and ingress-related components
once the in-cluster Git + CI stack is alive, routine app deploys move to self-hosted CI

This is a two-phase model:

Phase 0: local workstation bootstrap of a brand-new cluster
Phase 1: self-hosted Forgejo + runner takes over app delivery

Compose still exists for local development and optional single-machine testing, but it is not the canonical production story.

Prerequisites for first deployment

Install locally on the operator workstation:

Terraform >= 1.6
kubectl
docker
curl

You also need:

a Hetzner Cloud API token
a local SSH public key file for Terraform node provisioning
DNS control for your chosen base domain and Forgejo hostname
preferably a Tailscale tailnet and auth key for private admin/control-plane access
the repo checked out locally

Required bootstrap secrets and inputs

Create the bootstrap env file:

cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env

Set at least:

HCLOUD_TOKEN
SSH_PUBLIC_KEY_PATH
PUBLIC_DOMAIN
recommended:
- TAILSCALE_AUTH_KEY
- TAILSCALE_CONTROL_PLANE_HOSTNAME
optional fallback:
- TF_ADMIN_CIDR_BLOCKS
BASE_DOMAIN
FORGEJO_DOMAIN
FORGEJO_ROOT_URL
REGISTRY_DOMAIN
LETSENCRYPT_EMAIL
REGISTRY_USERNAME
REGISTRY_PASSWORD
NEAR_INTENTS_API_KEY
FORGEJO_RUNNER_REGISTRATION_TOKEN
optional DNS automation:
- Cloudflare:
  - CLOUDFLARE_API_TOKEN
  - CLOUDFLARE_ZONE_ID
- Porkbun:
  - PORKBUN_API_KEY
  - PORKBUN_SECRET_API_KEY

Then load them:

source scripts/hetzner/bootstrap-secrets.env
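
Before running the bootstrap, a quick pre-flight check can catch missing inputs early. The sketch below is not part of the repo: `REQUIRED_BOOTSTRAP_VARS` and `missingBootstrapVars` are hypothetical names, and treating every non-optional entry in the list above as mandatory is an assumption, since the bootstrap script may apply its own defaults.

```javascript
// Names copied from the "Set at least" list, excluding the entries marked
// recommended or optional. Treating all of them as mandatory is an assumption.
const REQUIRED_BOOTSTRAP_VARS = [
  'HCLOUD_TOKEN',
  'SSH_PUBLIC_KEY_PATH',
  'PUBLIC_DOMAIN',
  'BASE_DOMAIN',
  'FORGEJO_DOMAIN',
  'FORGEJO_ROOT_URL',
  'REGISTRY_DOMAIN',
  'LETSENCRYPT_EMAIL',
  'REGISTRY_USERNAME',
  'REGISTRY_PASSWORD',
  'NEAR_INTENTS_API_KEY',
  'FORGEJO_RUNNER_REGISTRATION_TOKEN',
];

// Returns the names that are unset or contain only whitespace.
function missingBootstrapVars(env) {
  return REQUIRED_BOOTSTRAP_VARS.filter((name) => String(env[name] ?? '').trim() === '');
}
```

Calling `missingBootstrapVars(process.env)` from a small script would list what is still unset. Note that Node only sees variables the shell has exported, so this works only if the env file uses `export` or is sourced with `set -a` active.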

First bootstrap sequence

Run the end-to-end bootstrap from repo root:

bash scripts/hetzner/bootstrap.sh

The script currently does the following:

runs Terraform in infra/terraform/hetzner
optionally creates DNS records for the base, Forgejo, and registry hosts via Cloudflare or Porkbun
if configured, joins the node to Tailscale and prefers the Tailscale control-plane hostname for Kubernetes API access
waits for SSH and the k3s API endpoint to become ready
fetches the real k3s kubeconfig from the node and writes it to .state/hetzner/kubeconfig.yaml
renders the Hetzner single-node overlay from local operator inputs
creates registry pull/auth secrets
applies the Kubernetes bootstrap manifests
builds the app image locally and imports it into k3s on the node
performs the first rollout using the imported bootstrap image
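
The "waits for SSH and the k3s API endpoint to become ready" step is a bounded polling loop. The actual script is Bash and its retry parameters are not documented here, so the following is only a generic sketch of the pattern; `waitUntil`, `attempts`, and `delayMs` are hypothetical names and values.

```javascript
// Polls `check` until it returns (or resolves to) a truthy value, or until the
// attempt budget is exhausted. Resolves with the attempt number that succeeded.
async function waitUntil(check, { attempts = 30, delayMs = 5000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt += 1) {
    try {
      if (await check()) return attempt;
    } catch {
      // Errors such as "connection refused" are treated as "not ready yet".
    }
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`not ready after ${attempts} attempts`);
}
```

The `check` callback would wrap whatever probe is appropriate, for example a TCP connect to the SSH port or a request to the Kubernetes API endpoint. Bounding the attempts keeps a stuck cloud-init run from hanging the bootstrap indefinitely.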

Use the generated kubeconfig afterward:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n unrip get deploy,pods
kubectl -n forgejo get deploy,pods,svc

What is deployed into k3s

The repo-managed Kubernetes assets are under deploy/k8s/.

Current single-node target includes resources for:

unrip workloads in namespace unrip
Redpanda
Forgejo
Forgejo runner
private registry
ingress-nginx namespace/resources
cert-manager namespace/resources
ACME issuers and ingress definitions
a bootstrap job for Redpanda topic creation

Shared platform namespaces:

forgejo
registry
ingress-nginx
cert-manager

Project-specific namespaces:

unrip
future projects should get their own namespace rather than sharing unrip

Important current-state nuance:

the bootstrap script currently applies deploy/k8s/base
the longer-term intended target is deploy/k8s/overlays/hetzner-single-node

Executor persistence in k3s

The executor is stateful by design: it persists idempotency and execution-tracking state so that completed commands are not executed again.

Current persistence boundary:

app env uses EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state
in Kubernetes, the executor deployment mounts storage at that path
the Hetzner single-node overlay pins storage to the k3s local-path storage class
cloud-init also prepares the host directory boundary for executor state on first boot

Operational meaning:

executor state lives on node-backed storage in the single-node k3s environment
if that PVC or underlying node storage is lost, duplicate-suppression history is lost too
treat executor persistence as part of the minimal durable state of the cluster

Failure recovery and operator checks

If bootstrap fails before Terraform completes

Fix the local input problem, then re-run the bootstrap. Typical causes:

missing token
invalid CIDRs
invalid SSH public key path

If the infrastructure must be torn down:

source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh

If Terraform succeeds but Kubernetes is not ready

Check the public API and cluster state from the workstation:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50

Typical causes to rule out:

cloud-init may still be finishing
k3s may still be starting
a workload may be crash-looping due to missing secret values or image-delivery issues

If workloads do not roll out

Inspect the affected namespace:

kubectl -n unrip get pods
kubectl -n unrip describe pod <pod-name>
kubectl -n unrip logs deploy/dummy-executor --tail=100
kubectl -n forgejo logs deploy/forgejo --tail=100

If you need to recreate secrets

The workstation bootstrap creates these Secrets:

unrip/unrip-secrets
forgejo/forgejo-secrets

Verify them:

kubectl -n unrip get secret unrip-secrets
kubectl -n forgejo get secret forgejo-secrets

Current known limitations

One important gap has already been identified in the current state of the project:

bootstrap and CI are not yet fully production-hardened, even though the first deploy path now fetches the real kubeconfig and imports the bootstrap image directly into k3s

Treat the current bootstrap as a repo-driven first-deploy path suitable for testing, with hardening still pending.

Self-hosted CI handoff

After cluster bootstrap:

open Forgejo at https://${FORGEJO_DOMAIN}
seed or push this repo into Forgejo
create Forgejo repository secrets:
- KUBECONFIG_B64
- REGISTRY_USERNAME
- REGISTRY_PASSWORD
create Forgejo repository variables:
- REGISTRY_HOST=${REGISTRY_DOMAIN}
- optional: PROJECT_NAME=unrip
- optional: PROJECT_NAMESPACE=unrip
- optional: PROJECT_DEPLOYMENTS=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer
push to main
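
This README does not spell out the format of KUBECONFIG_B64. Assuming it is simply the base64-encoded contents of the kubeconfig file (an assumption based on the name), the value could be produced and checked as sketched below. `encodeKubeconfig` and `decodeKubeconfig` are hypothetical helper names.

```javascript
// Encodes kubeconfig text as single-line base64 (no line wrapping),
// which is convenient for pasting into a secret field.
function encodeKubeconfig(kubeconfigText) {
  return Buffer.from(kubeconfigText, 'utf8').toString('base64');
}

// Reverses the encoding, e.g. to verify the secret value before saving it.
function decodeKubeconfig(encoded) {
  return Buffer.from(encoded, 'base64').toString('utf8');
}
```

In practice the input would be the contents of .state/hetzner/kubeconfig.yaml. Decoding the value once before saving the secret is a cheap way to confirm nothing was truncated.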

Routine application deploys then follow .forgejo/workflows/deploy.yml:

build image as REGISTRY_HOST/PROJECT_NAME:${GIT_SHA}
push to the private registry
kubectl set image for each deployment listed in PROJECT_DEPLOYMENTS inside PROJECT_NAMESPACE
wait for rollout

If project variables are omitted, the workflow defaults to the current repo project:

PROJECT_NAME=unrip
PROJECT_NAMESPACE=unrip
PROJECT_DEPLOYMENTS=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer
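
The variable defaulting and the resulting deploy steps can be sketched as a pure function. This is an illustration, not the contents of .forgejo/workflows/deploy.yml: `planDeploy` is a hypothetical name, and the `*=` wildcard container selector is an assumption, since the real workflow may name containers explicitly.

```javascript
// Defaults mirror the fallback values listed above.
const PROJECT_DEFAULTS = {
  PROJECT_NAME: 'unrip',
  PROJECT_NAMESPACE: 'unrip',
  PROJECT_DEPLOYMENTS: 'near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer',
};

// Builds the image reference and the ordered command list for one deploy.
function planDeploy(vars, gitSha) {
  if (!vars.REGISTRY_HOST) throw new Error('REGISTRY_HOST is required');
  const value = (name) => vars[name] || PROJECT_DEFAULTS[name];
  const image = `${vars.REGISTRY_HOST}/${value('PROJECT_NAME')}:${gitSha}`;
  const namespace = value('PROJECT_NAMESPACE');
  const deployments = value('PROJECT_DEPLOYMENTS')
    .split(',')
    .map((name) => name.trim())
    .filter(Boolean);
  const commands = [
    `docker build -t ${image} .`,
    `docker push ${image}`,
    // '*=' updates every container in the deployment (assumption, see lead-in)
    ...deployments.map((name) => `kubectl -n ${namespace} set image deployment/${name} '*=${image}'`),
    ...deployments.map((name) => `kubectl -n ${namespace} rollout status deployment/${name}`),
  ];
  return { image, namespace, deployments, commands };
}
```

Tagging the image with the commit SHA rather than a moving tag means every rollout references an immutable image, so a rollback is just another `kubectl set image` with an older SHA.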

Infrastructure changes remain Terraform-driven from the operator workstation unless and until that responsibility is also automated.

For the detailed operator runbooks, see:

docs/hetzner-k3s-bootstrap.md
docs/hetzner-self-hosted-ci-runbook.md
deploy/k8s/projects/README.md
docs/next-session-architecture.md

Local development with Compose

Compose remains available for local development and debugging.

npm install
cp .env.example .env
# edit .env

docker compose build
docker compose up -d

Useful commands:

docker compose ps
docker compose logs -f
docker compose logs -f near-intents-ingest dummy-reactor dummy-executor dummy-consumer
docker compose restart dummy-executor
docker compose down
docker compose down -v

Individual services

npm run near-intents:ingest
npm run dummy-reactor
npm run dummy-executor
npm run dummy-consumer

Optional pair filter:

npm run near-intents:ingest -- --pair 'asset_a->asset_b'
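
The filter string has the shape `'asset_a->asset_b'`. A minimal sketch of how such a value could be parsed and applied follows; it is not the code in src/core/pair-filter.mjs. The function names, the `asset_in`/`asset_out` field names, and the reading of the arrow as "from asset, to asset" are all assumptions.

```javascript
// Parses a filter such as 'asset_a->asset_b' into its two sides.
function parsePairFilter(spec) {
  const parts = String(spec).split('->').map((part) => part.trim());
  if (parts.length !== 2 || parts.some((part) => part === '')) {
    throw new Error(`invalid pair filter "${spec}", expected "asset_in->asset_out"`);
  }
  return { assetIn: parts[0], assetOut: parts[1] };
}

// Returns true when a quote (with hypothetical asset_in/asset_out fields)
// matches the parsed filter exactly.
function matchesPair(filter, quote) {
  return quote.asset_in === filter.assetIn && quote.asset_out === filter.assetOut;
}
```

Rejecting malformed filters up front is preferable to silently matching nothing, because an ingest process that filters out every quote looks the same as a quiet venue.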

Idempotent executor behavior

every command has a command_id
commands carry idempotency_key and execution_key
executor persists state under EXECUTOR_STATE_DIR
completed commands are skipped after restart or replay
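
The behavior above can be sketched with a small file-backed store. This is not the implementation in src/core/executor-state-store.mjs: `createStateStore`, `handleCommand`, the one-file-per-key layout, and the choice of `idempotency_key` as the lookup key are assumptions made for illustration.

```javascript
import fs from 'node:fs';
import path from 'node:path';

// Keeps one JSON file per idempotency key under the state directory.
function createStateStore(stateDir) {
  fs.mkdirSync(stateDir, { recursive: true });
  const fileFor = (key) => path.join(stateDir, `${encodeURIComponent(key)}.json`);
  return {
    isCompleted(key) {
      return fs.existsSync(fileFor(key));
    },
    markCompleted(key, result) {
      // Write to a temp file, then rename, so a crash cannot leave a
      // half-written record under the final name.
      const target = fileFor(key);
      const temp = `${target}.tmp`;
      fs.writeFileSync(temp, JSON.stringify({ key, result }));
      fs.renameSync(temp, target);
    },
  };
}

// Skips commands whose key is already recorded as completed; otherwise
// executes the command and records the outcome.
function handleCommand(store, command, execute) {
  if (store.isCompleted(command.idempotency_key)) return { skipped: true };
  const result = execute(command);
  store.markCompleted(command.idempotency_key, result);
  return { skipped: false, result };
}
```

A store created on the same directory after a restart sees the earlier records, which is why replayed commands are skipped. A crash between `execute` and `markCompleted` would still cause a re-execution in this sketch, so it gives skip-if-recorded behavior rather than a strict exactly-once guarantee.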

Env

NEAR_INTENTS_API_KEY=your_solver_jwt
NEAR_INTENTS_WS_URL=wss://solver-relay-v2.chaindefuser.com/ws
KAFKA_BROKERS=redpanda:9092
KAFKA_CLIENT_ID=unrip
KAFKA_TOPIC_RAW_NEAR_INTENTS_QUOTE=raw.near_intents.quote
KAFKA_TOPIC_NORM_SWAP_DEMAND=norm.swap_demand
KAFKA_TOPIC_CMD_EXECUTE_TRADE=cmd.execute_trade
KAFKA_TOPIC_EXEC_TRADE_RESULT=exec.trade_result
KAFKA_CONSUMER_GROUP_DUMMY=dummy-reactor-v1
KAFKA_CONSUMER_GROUP_EXECUTOR=dummy-executor-v1
EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state
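
A sketch of how these variables could be read into a typed config object follows. It is not the code in src/lib/config.mjs: `loadConfig`, the shape of the returned object, and the use of the sample values above as fallbacks are assumptions.

```javascript
// Fallbacks mirror the sample values in the Env listing. Whether the repo's
// config module falls back to them is an assumption.
const ENV_DEFAULTS = {
  NEAR_INTENTS_WS_URL: 'wss://solver-relay-v2.chaindefuser.com/ws',
  KAFKA_BROKERS: 'redpanda:9092',
  KAFKA_CLIENT_ID: 'unrip',
  KAFKA_TOPIC_RAW_NEAR_INTENTS_QUOTE: 'raw.near_intents.quote',
  KAFKA_TOPIC_NORM_SWAP_DEMAND: 'norm.swap_demand',
  KAFKA_TOPIC_CMD_EXECUTE_TRADE: 'cmd.execute_trade',
  KAFKA_TOPIC_EXEC_TRADE_RESULT: 'exec.trade_result',
  KAFKA_CONSUMER_GROUP_DUMMY: 'dummy-reactor-v1',
  KAFKA_CONSUMER_GROUP_EXECUTOR: 'dummy-executor-v1',
  EXECUTOR_STATE_DIR: '/var/lib/unrip/executor-state',
};

// Reads settings from an env-like object, falling back to ENV_DEFAULTS.
function loadConfig(env) {
  const value = (name) => env[name] || ENV_DEFAULTS[name];
  return {
    nearIntents: {
      apiKey: env.NEAR_INTENTS_API_KEY, // no fallback: secrets must be supplied
      wsUrl: value('NEAR_INTENTS_WS_URL'),
    },
    kafka: {
      brokers: value('KAFKA_BROKERS').split(',').map((broker) => broker.trim()).filter(Boolean),
      clientId: value('KAFKA_CLIENT_ID'),
      topics: {
        rawQuote: value('KAFKA_TOPIC_RAW_NEAR_INTENTS_QUOTE'),
        swapDemand: value('KAFKA_TOPIC_NORM_SWAP_DEMAND'),
        executeTrade: value('KAFKA_TOPIC_CMD_EXECUTE_TRADE'),
        tradeResult: value('KAFKA_TOPIC_EXEC_TRADE_RESULT'),
      },
      groups: {
        reactor: value('KAFKA_CONSUMER_GROUP_DUMMY'),
        executor: value('KAFKA_CONSUMER_GROUP_EXECUTOR'),
      },
    },
    executorStateDir: value('EXECUTOR_STATE_DIR'),
  };
}
```

Passing the environment in as an argument (`loadConfig(process.env)`) rather than reading it globally keeps the function easy to exercise with different inputs.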
