Hetzner rebuild pipeline map

This document summarizes the currently intended rebuild flow for the repo-driven Hetzner single-node cluster.

It is a companion to the operator runbooks, not a competing source of truth. Consult these first for exact commands and required environment variables:

  • docs/hetzner-k3s-bootstrap.md
  • docs/hetzner-self-hosted-ci-runbook.md
  • docs/k8s-observability.md

High-level rebuild sequence

  1. prepare scripts/hetzner/bootstrap-secrets.env
  2. source it so *_PASS mappings resolve through pass
  3. optionally run scripts/hetzner/destroy.sh
  4. run scripts/hetzner/bootstrap.sh
  5. let bootstrap:
    • provision/update Hetzner infra with Terraform
    • configure DNS when provider credentials are present
    • fetch the real kubeconfig from the node
    • render .state/hetzner/generated-overlay/
    • apply platform + project manifests
    • bootstrap Forgejo admin, runner, repo, and Actions configuration
    • seed the repo into Forgejo
    • trigger the normal Forgejo Actions build/push/deploy path
  6. verify public/operator surfaces:
    • Forgejo
    • registry
    • Grafana
    • Headlamp
  7. verify workload health and CI success
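
Steps 1 to 4 can be sketched as a small wrapper run from the repo root. This is only an illustration, not the runbook procedure: the `rebuild` function name and the `DESTROY_FIRST` switch are hypothetical, and exporting the sourced variables with `set -a` is an assumption about what the scripts expect. The runbooks listed above remain authoritative for the exact commands.

```shell
# Hypothetical wrapper around steps 1-4; run from the repo root.
rebuild() {
  secrets_env="scripts/hetzner/bootstrap-secrets.env"

  # Step 1: the secrets env file must already be prepared.
  [ -f "$secrets_env" ] || { echo "missing $secrets_env" >&2; return 1; }

  # Step 2: source it. set -a exports every variable it defines so that
  # child scripts can see them (an assumption, not confirmed by the runbooks).
  set -a
  . "$secrets_env"
  set +a

  # Step 3 (optional): tear down first when DESTROY_FIRST=1 (hypothetical switch).
  if [ "${DESTROY_FIRST:-0}" = "1" ]; then
    scripts/hetzner/destroy.sh || return 1
  fi

  # Step 4: provision and bootstrap.
  scripts/hetzner/bootstrap.sh
}
```

Calling `rebuild` performs a bootstrap in place; setting `DESTROY_FIRST=1` beforehand runs the destroy script first.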

Ownership boundaries

Terraform owns

  • Hetzner VM
  • network
  • firewall
  • cloud-init user data

Cloud-init owns

  • OS package prep
  • optional Tailscale join
  • k3s installation
  • a marker file at /opt/unrip/bootstrap/README.txt

Cloud-init does not clone this repo or apply Kubernetes manifests.

Bootstrap script owns

  • pass-resolved secret loading
  • DNS automation
  • kubeconfig retrieval/rendering
  • generated overlay rendering under .state/hetzner/generated-overlay/
  • imperative registry auth secret creation
  • Forgejo bootstrap API calls
  • repo seeding
  • Headlamp token export to pass
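
As an illustration of pass-resolved secret loading, the sketch below shows one way a `*_PASS` mapping could be turned into a real value. It assumes that a variable such as `HCLOUD_TOKEN_PASS` (a hypothetical name) holds a pass entry path and that the resolved secret should land in `HCLOUD_TOKEN`; the actual convention lives in the bootstrap scripts and may differ. `resolve_pass_vars` is a hypothetical helper, not part of the repo.

```shell
# Hypothetical helper: for every exported variable NAME_PASS, look up its
# value as a pass entry and export the first line of that entry as NAME.
resolve_pass_vars() {
  for name in $(env | sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*_PASS\)=.*/\1/p'); do
    target=${name%_PASS}

    # Indirect lookup of the pass entry path stored in NAME_PASS.
    eval "entry=\${$name}"

    # By pass convention the first line of an entry is the secret itself.
    value=$(pass show "$entry" | head -n 1)
    [ -n "$value" ] || { echo "could not resolve $name" >&2; return 1; }

    export "$target=$value"
  done
}
```

Only exported variables are considered, so the env file has to be sourced with exporting enabled (or export its variables itself) before this helper runs.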

Kubernetes manifests own

  • platform services
  • project services
  • ingress/TLS resources
  • observability stack
  • persistent volume claims and workload specs

Current default runtime model

Platform services:

  • Forgejo
  • Forgejo runner
  • registry
  • cert-manager
  • Grafana
  • Loki
  • Promtail
  • Headlamp

Project services:

  • Redpanda
  • near-intents-ingest
  • dummy-reactor
  • dummy-executor
  • dummy-consumer

Ingress/controller model:

  • Traefik bundled with k3s
  • no ingress-nginx in the active path

Rebuild verification checklist

After bootstrap, verify:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n observability get deploy,ds,pods,svc,ingress,secrets
kubectl -n forgejo get deploy,pods,svc,ingress
kubectl -n registry get deploy,pods,svc,ingress
kubectl -n unrip get deploy,pods
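
The listing commands above show state at a single point in time. A possible complement, sketched here, is to block until the deployments in each namespace report Available using `kubectl wait`. The function name `wait_for_workloads` and the default timeout are hypothetical, and the sketch assumes every listed namespace contains at least one Deployment (`kubectl wait` fails when nothing matches).

```shell
# Hypothetical helper: wait until all Deployments in the namespaces used
# by this cluster are Available, or fail once the timeout expires.
wait_for_workloads() {
  timeout="${1:-300s}"
  for ns in observability forgejo registry unrip; do
    echo "waiting for deployments in $ns"
    kubectl -n "$ns" wait --for=condition=Available deployment --all \
      --timeout="$timeout" || return 1
  done
}
```

With KUBECONFIG exported as shown above, `wait_for_workloads 600s` returns non-zero as soon as one namespace fails to settle in time.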

Public/operator surfaces should respond:

  • https://git.<public-domain>/
  • https://registry.<public-domain>/v2/
  • https://grafana.<public-domain>/
  • https://headlamp.<public-domain>/
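
A minimal sketch for probing these URLs with curl follows. `probe_surfaces` is a hypothetical helper that takes the public domain as its only argument. It treats any HTTP status as "responding", on the assumption that a surface behind authentication (for example the registry's /v2/ endpoint) may legitimately answer 401 rather than 200; only connection or TLS failures count as errors.

```shell
# Hypothetical helper: print the HTTP status for each public surface and
# return non-zero if any of them cannot be reached at all.
probe_surfaces() {
  domain="$1"
  rc=0
  for url in \
    "https://git.$domain/" \
    "https://registry.$domain/v2/" \
    "https://grafana.$domain/" \
    "https://headlamp.$domain/"; do
    # -w '%{http_code}' prints only the status code; curl reports 000
    # and exits non-zero when no HTTP response was received.
    code=$(curl -sS -o /dev/null --max-time 10 -w '%{http_code}' "$url") || code="000"
    echo "$code $url"
    [ "$code" != "000" ] || rc=1
  done
  return "$rc"
}
```

Usage would look like `probe_surfaces example.org`, with the real public domain substituted; tighten the check to specific status codes if a stricter gate is wanted.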

CI should show a successful deploy workflow in Forgejo Actions.

Current caveat

The core Hetzner/k3s/Forgejo path has been rebuilt successfully before. Headlamp was added afterward and validated live on the rebuilt cluster, but a full destroy/rebuild rehearsal with Headlamp included has not yet been run from scratch.

The rebuild is therefore repo-driven and close to fully reproducible. One validation step remains: a final clean-room rebuild after the latest Headlamp and docs cleanup.