# Hetzner rebuild pipeline map
This document summarizes the currently intended rebuild flow for the repo-driven Hetzner single-node cluster.
It is a companion to the operator runbooks, not a competing source of truth.
Consult these first for exact commands and required environment variables:
- `docs/hetzner-k3s-bootstrap.md`
- `docs/hetzner-self-hosted-ci-runbook.md`
- `docs/k8s-observability.md`
## High-level rebuild sequence
1. prepare `scripts/hetzner/bootstrap-secrets.env`
2. source it so `*_PASS` mappings resolve through `pass`
3. optionally run `scripts/hetzner/destroy.sh`
4. run `scripts/hetzner/bootstrap.sh`
5. let bootstrap:
   - provision/update Hetzner infra with Terraform
   - configure DNS when provider credentials are present
   - fetch the real kubeconfig from the node
   - render `.state/hetzner/generated-overlay/`
   - apply platform + project manifests
   - bootstrap Forgejo admin, runner, repo, and Actions configuration
   - seed the repo into Forgejo
   - trigger the normal Forgejo Actions build/push/deploy path
6. verify public/operator surfaces:
   - Forgejo
   - registry
   - Grafana
   - Headlamp
7. verify workload health and CI success
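
Steps 1 to 4 can be sketched as a small wrapper. This is an illustration under stated assumptions, not the documented invocation: `DRY_RUN` and `DESTROY_FIRST` are hypothetical knobs invented for this example, the scripts are assumed to take no arguments and to run from the repo root, and `set -a` assumes they read the resolved values from the environment. The runbooks listed above remain authoritative.

```bash
#!/usr/bin/env bash
# Sketch only: DRY_RUN and DESTROY_FIRST are hypothetical, not part of the repo.

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

rebuild() {
  local env_file="scripts/hetzner/bootstrap-secrets.env"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ source %s\n' "$env_file"
  else
    # Step 2: source in the current shell. set -a exports every variable
    # the file assigns (assumption: the scripts read them from the env).
    set -a
    . "$env_file"
    local rc=$?
    set +a
    [ "$rc" -eq 0 ] || return "$rc"
  fi

  # Step 3 (optional): tear down the existing infrastructure first.
  if [ "${DESTROY_FIRST:-0}" = "1" ]; then
    run scripts/hetzner/destroy.sh || return 1
  fi

  # Step 4: bootstrap then performs everything listed under step 5.
  run scripts/hetzner/bootstrap.sh
}
```

Calling `DRY_RUN=1 DESTROY_FIRST=1 rebuild` prints the planned commands without touching any infrastructure, which is a cheap way to confirm the order before a real run.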
## Ownership boundaries
### Terraform owns
- Hetzner VM
- network
- firewall
- cloud-init user data
### Cloud-init owns
- OS package prep
- optional Tailscale join
- k3s installation
- a marker file at `/opt/unrip/bootstrap/README.txt`
Cloud-init does **not** clone this repo or apply Kubernetes manifests.
### Bootstrap script owns
- `pass`-resolved secret loading
- DNS automation
- kubeconfig retrieval/rendering
- generated overlay rendering under `.state/hetzner/generated-overlay/`
- imperative registry auth secret creation
- Forgejo bootstrap API calls
- repo seeding
- Headlamp token export to `pass`
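
To make the `pass`-resolved secret loading concrete, here is one way a `*_PASS` mapping could be resolved. The convention shown is an assumption for illustration: a variable `FOO_PASS` holds a `pass` entry name, and `FOO` receives the first line of that entry. The function name `resolve_pass_vars` is hypothetical, and the actual logic in `scripts/hetzner/bootstrap-secrets.env` may differ.

```bash
# Hypothetical helper: for every shell variable named <NAME>_PASS, export
# <NAME> with the first line of the pass entry that <NAME>_PASS points to.
resolve_pass_vars() {
  local name entry value
  # compgen -v is a bash builtin that lists all shell variable names.
  for name in $(compgen -v); do
    case "$name" in
      ?*_PASS)
        entry="${!name}"              # indirect expansion: value of $name
        [ -n "$entry" ] || continue
        # `pass show <entry>` prints the decrypted entry to stdout.
        value="$(pass show "$entry")" || return 1
        # By pass convention the secret itself is on the first line.
        value="${value%%$'\n'*}"
        export "${name%_PASS}=$value"
        ;;
    esac
  done
}
```

With `DEMO_TOKEN_PASS=infra/demo-token` set (a made-up entry name), calling `resolve_pass_vars` would export `DEMO_TOKEN` without the secret ever being written to the env file.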
### Kubernetes manifests own
- platform services
- project services
- ingress/TLS resources
- observability stack
- persistent volume claims and workload specs
## Current default runtime model
Platform services:
- Forgejo
- Forgejo runner
- registry
- cert-manager
- Grafana
- Loki
- Promtail
- Headlamp
Project services:
- Redpanda
- `near-intents-ingest`
- `dummy-reactor`
- `dummy-executor`
- `dummy-consumer`
Ingress/controller model:
- Traefik bundled with k3s
- no ingress-nginx in the active path
## Rebuild verification checklist
After bootstrap, verify:
```bash
export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n observability get deploy,ds,pods,svc,ingress,secrets
kubectl -n forgejo get deploy,pods,svc,ingress
kubectl -n registry get deploy,pods,svc,ingress
kubectl -n unrip get deploy,pods
```
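
The `get` commands above show state at a single moment. If a blocking check is preferred, for example right after bootstrap while pods are still starting, `kubectl wait` can be used instead. The sketch below is an assumption-laden example: the function name `wait_for_cluster` and the `300s` default are hypothetical, and it only covers nodes and Deployments in the namespaces listed above (DaemonSets are not covered by the `Available` condition).

```bash
# Hypothetical helper: block until nodes are Ready and Deployments Available.
wait_for_cluster() {
  local timeout="${1:-300s}"   # assumed default, tune for the cluster
  local ns

  # Wait for every node to report the Ready condition.
  kubectl wait --for=condition=Ready nodes --all --timeout="$timeout" || return 1

  # Wait for every Deployment in each namespace checked above.
  for ns in observability forgejo registry unrip; do
    kubectl -n "$ns" wait --for=condition=Available deployment --all \
      --timeout="$timeout" || return 1
  done
}
```

The function returns non-zero as soon as any wait times out, so it can gate later steps in a script.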
Public/operator surfaces should respond:
- `https://git.<public-domain>/`
- `https://registry.<public-domain>/v2/`
- `https://grafana.<public-domain>/`
- `https://headlamp.<public-domain>/`
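
The surface checks above can be scripted with `curl`. This is a minimal sketch, not part of the repo: the function names are hypothetical, and treating `401` as a pass is an assumption that fits a registry with authentication enabled, where `/v2/` answers `401` to anonymous requests. Redirects (for example to a login page) are also counted as a response.

```bash
# Hypothetical helpers for probing the four public/operator surfaces.
surface_urls() {
  local domain="$1"
  printf 'https://git.%s/\n' "$domain"
  printf 'https://registry.%s/v2/\n' "$domain"
  printf 'https://grafana.%s/\n' "$domain"
  printf 'https://headlamp.%s/\n' "$domain"
}

status_ok() {
  # 2xx/3xx: served or redirected. 401: reachable, auth required (assumption).
  case "$1" in
    2??|3??|401) return 0 ;;
    *) return 1 ;;
  esac
}

check_surfaces() {
  local domain="$1" url code failed=0
  for url in $(surface_urls "$domain"); do
    # -w '%{http_code}' prints only the status; connection errors yield 000.
    code="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || code="000"
    if status_ok "$code"; then
      echo "ok   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  return "$failed"
}
```

Run it as `check_surfaces <public-domain>`; a non-zero exit means at least one surface did not respond as expected.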
CI should show a successful deploy workflow in Forgejo Actions.
## Current caveat
The core Hetzner/k3s/Forgejo path has been rebuilt successfully before.
Headlamp was added afterward and validated live on the rebuilt cluster, but a full destroy/rebuild rehearsal that includes Headlamp has not yet been run from scratch.
The rebuild is therefore repo-driven and close to fully reproducible. One validation step remains: a clean-room destroy/rebuild run after the latest Headlamp and docs cleanup.