doran/deploy/hetzner/README.md
2026-03-28 20:53:29 +01:00

Hetzner single-node bootstrap (Terraform + cloud-init + k3s)

This is the canonical first-production deployment path for the repo.

A local operator workstation drives the first deployment end to end:

  • Terraform provisions Hetzner infrastructure
  • cloud-init installs k3s automatically on first boot
  • the workstation waits for the public Kubernetes API
  • the workstation creates initial Kubernetes Secrets
  • the workstation applies repo-managed Kubernetes manifests
  • the workstation builds the bootstrap image and attempts the first delivery to the cluster
  • once Forgejo + runner are alive, routine app deploys are intended to move to self-hosted CI

Compose remains available for local development, but it is not the primary production deployment model.

Scope of this layer

The foundation under infra/terraform/hetzner provisions:

  • one Hetzner Cloud server
  • one SSH key resource based on your local public key
  • firewall rules for SSH, Kubernetes API, and HTTP/HTTPS ingress
  • a private network attachment for future growth
  • cloud-init user-data for unattended k3s installation and host preparation

The repo bootstrap then applies the Hetzner single-node overlay under deploy/k8s/overlays/hetzner-single-node, which composes Kubernetes resources under deploy/k8s/ for:

  • shared platform namespaces and services
  • Redpanda
  • unrip workloads
  • Forgejo
  • Forgejo runner
  • private registry
  • ingress/TLS-related resources
  • Redpanda topic bootstrap job

Prerequisites

Install on the operator workstation:

  • Terraform >= 1.6
  • kubectl
  • docker
  • curl

You also need:

  • a Hetzner Cloud API token
  • an SSH keypair already present locally
  • access to DNS for your chosen domains
  • admin CIDRs that can reach the future server on 22/tcp and 6443/tcp
  • this repo checked out locally

Required bootstrap secrets and inputs

Prepare the operator env file:

cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env

Set at least:

  • HCLOUD_TOKEN
  • SSH_PUBLIC_KEY_PATH
  • TF_ADMIN_CIDR_BLOCKS
  • BASE_DOMAIN
  • FORGEJO_DOMAIN
  • FORGEJO_ROOT_URL
  • NEAR_INTENTS_API_KEY
  • FORGEJO_RUNNER_REGISTRATION_TOKEN

Load it into the current shell:

source scripts/hetzner/bootstrap-secrets.env
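
Before running the bootstrap, it can help to confirm that the variables listed above are actually set in the current shell. The following is a minimal sketch, not part of the repo: require_vars is a hypothetical helper name, and the only assumption is that the env file defines the variables under the names listed above.

```shell
# Hypothetical helper (not part of the repo): report every named variable
# that is unset or empty, and return non-zero if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   require_vars HCLOUD_TOKEN SSH_PUBLIC_KEY_PATH TF_ADMIN_CIDR_BLOCKS \
#     BASE_DOMAIN FORGEJO_DOMAIN FORGEJO_ROOT_URL NEAR_INTENTS_API_KEY \
#     FORGEJO_RUNNER_REGISTRATION_TOKEN || echo "fix the env file first"
```

If the env file uses plain NAME=value assignments rather than export NAME=value (this README does not say which), the values will not reach the child process started by bash scripts/hetzner/bootstrap.sh. Wrapping the load in set -a and set +a exports them.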

Canonical bootstrap sequence

Run from repo root:

bash scripts/hetzner/bootstrap.sh

Current behavior of the script:

  1. validates local tooling
  2. runs terraform init and terraform apply in infra/terraform/hetzner
  3. reads Terraform outputs such as server IP and k3s_api_url
  4. waits for the k3s API readiness endpoint
  5. writes a local workstation kubeconfig to .state/hetzner/kubeconfig.yaml
  6. writes overlay secret env input files and creates:
    • unrip/unrip-secrets
    • unrip/unrip-registry-creds
    • forgejo/forgejo-secrets
    • registry/registry-secrets
  7. applies deploy/k8s/platform/base/namespace.yaml and deploy/k8s/overlays/hetzner-single-node
  8. builds the repo bootstrap image locally
  9. pushes it through the temporary local registry bridge using the active project name
  10. updates the deployments in the active project namespace and waits for their rollout status
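
Step 4 above is a poll-until-ready loop. The sketch below shows one way to write such a loop in plain shell. It is not taken from bootstrap.sh: retry_until is a hypothetical helper, and the K3S_API_URL variable and the /readyz path in the example are assumptions about what the script polls.

```shell
# Hypothetical helper (not taken from bootstrap.sh): run a command until it
# succeeds, up to a maximum number of attempts, sleeping between tries.
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    i=$((i + 1))
  done
}

# Example (assumes K3S_API_URL holds the k3s_api_url Terraform output):
#   retry_until 60 5 curl -ks -o /dev/null "$K3S_API_URL/readyz"
# Without -f, curl succeeds as soon as the server answers at all; add -f to
# also require a non-error HTTP status if the endpoint allows the request.
```

The attempt count and delay are arbitrary here; cloud-init and k3s startup times vary, so choose values that comfortably cover a first boot.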

After the script finishes:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n unrip get deploy,pods,jobs
kubectl -n forgejo get deploy,pods,svc
kubectl -n registry get pods,svc

Current manifest target

Important current-state detail:

  • scripts/hetzner/bootstrap.sh now applies deploy/k8s/platform/base/namespace.yaml
  • it then applies deploy/k8s/overlays/hetzner-single-node
  • bootstrap naming no longer assumes legacy trading-system kubeconfig contexts, image tags, or rollout namespaces

Executor persistence in k3s

The dummy executor persists durable idempotency state.

Current persistence model:

  • application path: EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state
  • cloud-init prepares the host boundary for executor storage on first boot
  • Kubernetes mounts storage at that same path for the executor workload
  • the Hetzner single-node overlay pins PVC-backed storage to k3s local-path

Operational consequence:

  • executor duplicate-suppression state lives on node-backed persistent storage
  • replacing the node or deleting the PVC without migration loses that history
  • treat executor state as required operational data, even though the executor is still a dummy implementation

Failure recovery runbook

A. Bootstrap fails before infrastructure exists

Typical causes:

  • invalid HCLOUD_TOKEN
  • wrong SSH_PUBLIC_KEY_PATH
  • malformed TF_ADMIN_CIDR_BLOCKS

Fix the input and rerun:

source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/bootstrap.sh

If you need to destroy partially created infrastructure:

source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh
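
For the malformed-CIDR cause listed above, a quick local check can catch typos before Terraform runs. This is a minimal sketch under stated assumptions: is_ipv4_cidr is a hypothetical helper, it handles IPv4 only, and it checks one entry at a time because this README does not state how TF_ADMIN_CIDR_BLOCKS delimits multiple entries. Terraform and the Hetzner API remain the final authority on what is accepted.

```shell
# Hypothetical pre-flight check (not part of the repo): return 0 if the
# argument looks like an IPv4 CIDR such as 203.0.113.10/32.
is_ipv4_cidr() {
  case "$1" in
    */*) ;;
    *) return 1 ;;
  esac
  addr="${1%/*}"
  prefix="${1##*/}"

  # The prefix length must be all digits and at most 32.
  case "$prefix" in
    '' | *[!0-9]*) return 1 ;;
  esac
  [ "$prefix" -le 32 ] || return 1

  # The address may contain only digits and dots and must not end in a dot.
  case "$addr" in
    '' | *[!0-9.]* | *.) return 1 ;;
  esac

  # Split on dots and require exactly four octets, each at most 255.
  old_ifs="$IFS"
  IFS=.
  set -- $addr
  IFS="$old_ifs"
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ -n "$octet" ] || return 1
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Example:
#   is_ipv4_cidr 203.0.113.10/32 && echo ok
#   is_ipv4_cidr 203.0.113.10 || echo "missing prefix length"
```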

B. Terraform succeeds but cluster access is not usable

Verify the generated kubeconfig and cluster health:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50

What to suspect first:

  • cloud-init still running
  • k3s still starting
  • the generated kubeconfig's server address or credentials do not match the cluster yet
  • public API reachable, but workloads not yet healthy

C. Secrets were wrong or missing

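The current bootstrap depends on the following secrets and keys (it also creates unrip/unrip-registry-creds and registry/registry-secrets, whose keys are not listed here):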

  • ${PROJECT_NAME:-unrip}-secrets
    • NEAR_INTENTS_API_KEY
  • forgejo-secrets
    • root_url
    • domain
    • runner_registration_token

Verify:

kubectl -n unrip get secret unrip-secrets
kubectl -n unrip get secret unrip-registry-creds
kubectl -n forgejo get secret forgejo-secrets
kubectl -n registry get secret registry-secrets

If any are missing or wrong, recreate them from the workstation, then restart the affected deployments so they pick up the new values.
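
One common way to recreate a secret idempotently is to render it client-side and pipe it into kubectl apply. The sketch below does this for unrip-secrets. It is not taken from the repo scripts: recreate_unrip_secret is a hypothetical helper, and the choice of near-intents-ingest as the deployment that consumes NEAR_INTENTS_API_KEY is an assumption.

```shell
# Hypothetical helper (not part of the repo): create or update the app secret,
# then restart one deployment so it picks up the new value.
# Assumes NEAR_INTENTS_API_KEY is set and KUBECONFIG points at the cluster.
recreate_unrip_secret() {
  kubectl -n unrip create secret generic unrip-secrets \
    --from-literal=NEAR_INTENTS_API_KEY="$NEAR_INTENTS_API_KEY" \
    --dry-run=client -o yaml \
    | kubectl apply -f - \
    && kubectl -n unrip rollout restart deployment/near-intents-ingest
}

# Example:
#   export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
#   recreate_unrip_secret
```

Extend the --from-literal list if the secret is expected to carry further keys, and restart every deployment that reads it.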

D. Workloads are present but not healthy

Inspect by namespace:

kubectl -n unrip get pods
kubectl -n unrip describe pod <pod-name>
kubectl -n unrip logs deploy/dummy-executor --tail=100
kubectl -n forgejo logs deploy/forgejo --tail=100
kubectl -n forgejo logs deploy/forgejo-runner --tail=100

Useful rollout checks:

kubectl -n unrip rollout status deployment/near-intents-ingest --timeout=300s
kubectl -n unrip rollout status deployment/dummy-reactor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-executor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-consumer --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo-runner --timeout=300s
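
The same checks can be run as a loop that keeps going after a failure and reports which deployments are not ready. This is a minimal sketch: check_rollouts is a hypothetical helper, not something the repo provides, and the deployment names are the ones listed above.

```shell
# Hypothetical helper (not part of the repo): check several deployments in one
# namespace and return non-zero if any of them fails to become ready.
check_rollouts() {
  ns="$1"
  shift
  failed=0
  for deploy in "$@"; do
    if ! kubectl -n "$ns" rollout status "deployment/$deploy" --timeout=300s; then
      echo "not ready: $ns/$deploy" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   check_rollouts unrip near-intents-ingest dummy-reactor dummy-executor dummy-consumer
#   check_rollouts forgejo forgejo forgejo-runner
```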

E. Need to inspect Terraform outputs directly

cd infra/terraform/hetzner
terraform output
terraform output server_ipv4
terraform output server_private_ipv4
terraform output k3s_api_url
terraform output kubeconfig_strategy
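
To reuse an output in a script without changing directory, terraform output -raw prints the bare value and -chdir selects the module directory. The sketch below wraps both. tf_output is a hypothetical helper, and it assumes the named outputs are plain strings, which -raw requires.

```shell
# Hypothetical helper (not part of the repo): print one Terraform output as a
# bare string. The -chdir path is relative, so run it from the repo root.
tf_output() {
  terraform -chdir=infra/terraform/hetzner output -raw "$1"
}

# Example, run from the repo root:
#   SERVER_IP="$(tf_output server_ipv4)"
#   K3S_API_URL="$(tf_output k3s_api_url)"
```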

Self-hosted CI handoff

After the cluster is reachable and workloads are up:

  1. reach Forgejo at the configured domain or by port-forward
  2. perform the initial admin/bootstrap steps in Forgejo
  3. create the target repository in Forgejo
  4. push or mirror this repo into that Forgejo instance
  5. confirm the runner is registered and healthy
  6. move routine application deploys to the self-hosted pipeline; it now derives image naming and rollout targets from Forgejo repository variables instead of hard-coding the legacy project

Known caveats in the current repo state:

  • first bootstrap is repo-driven from the workstation
  • the bootstrap path no longer relies on SSH/scp transport in control flow
  • the kubeconfig/auth result is not yet fully production-hardened
  • the first rollout still uses a temporary local registry bridge; routine CI deploys are intended to be registry-native
  • the Forgejo workflow now defaults to unrip while allowing per-repo overrides for image name, namespace, and deployment list
  • Forgejo admin creation, repo creation, and Actions configuration still require operator action after cluster bring-up
  • DNS automation is currently wired for Cloudflare when credentials are supplied during bootstrap
  • TLS is expected to come from cert-manager + Let's Encrypt once ingress hostnames resolve publicly

Terraform-only usage

If you only want the infra layer:

cd infra/terraform/hetzner
export TF_VAR_hcloud_token="<your-hetzner-token>"
export TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)"
export TF_VAR_admin_cidr_blocks='["203.0.113.10/32"]'

terraform init
terraform apply
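
TF_VAR_admin_cidr_blocks is passed as a list literal, which is easy to mistype by hand. The sketch below builds that literal from a comma-separated string. cidrs_to_tf_list is a hypothetical helper, not part of the repo, and it does not validate the CIDRs themselves.

```shell
# Hypothetical helper (not part of the repo): turn a comma-separated list of
# CIDRs into the JSON list syntax Terraform accepts for a list variable.
cidrs_to_tf_list() {
  printf '%s' "$1" | awk -F',' '{
    out = "["
    for (i = 1; i <= NF; i++) {
      v = $i
      gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim surrounding whitespace
      out = out (i > 1 ? "," : "") "\"" v "\""
    }
    print out "]"
  }'
}

# Example:
#   cidrs_to_tf_list '203.0.113.10/32, 198.51.100.0/24'
#   # prints ["203.0.113.10/32","198.51.100.0/24"]
#   export TF_VAR_admin_cidr_blocks="$(cidrs_to_tf_list '203.0.113.10/32')"
```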

Useful outputs:

  • server_ipv4
  • server_private_ipv4
  • server_name
  • server_fqdn
  • k3s_api_url
  • kubeconfig_strategy

For CI/CD details, also see:

  • docs/hetzner-k3s-bootstrap.md
  • docs/hetzner-self-hosted-ci-runbook.md

Compose status

Compose is still useful for:

  • local development
  • fast topology debugging
  • non-production single-machine testing

But it should be treated as optional/dev runtime support, not as the primary production deployment path.