# Hetzner + k3s + self-hosted Git/CI bootstrap
Goal: provision and deploy everything from this repo to a single Hetzner machine with no manual server login.
## Stack
- Terraform provisions the Hetzner Cloud VM, private network, and firewall
- cloud-init installs Tailscale first when configured, then installs k3s automatically
- cloud-init leaves only a bootstrap marker on the node; it does not clone this repo or apply Kubernetes assets
- Kubernetes manifests deploy:
  - Redpanda
  - trading system services
  - private registry
  - Forgejo
  - Loki + Promtail + Grafana + Headlamp observability
  - k3s-bundled Traefik ingress resources
  - cert-manager
  - ACME issuers
- local bootstrap script:
  - runs Terraform
  - optionally creates DNS records via Cloudflare or Porkbun
  - fetches the real kubeconfig from the node
  - writes overlay secrets/host patches from local env
  - renders `.state/hetzner/generated-overlay/` from the checked-in Hetzner overlay template plus `deploy/k8s/platform/base/kustomization.yaml`
  - applies that generated overlay from the operator workstation checkout
  - in `local-image-import` delivery mode, builds the current app image locally
  - in `local-image-import` delivery mode, imports the bootstrap image into k3s for the first rollout
## Files
- `infra/terraform/hetzner/`
- `deploy/k8s/platform/`
- `deploy/k8s/overlays/hetzner-single-node/`
- `projects/unrip/deploy/k8s/base/`
- `projects/unrip/`
- `scripts/hetzner/bootstrap.sh`
- `scripts/hetzner/configure-cloudflare-dns.sh`
- `scripts/hetzner/destroy.sh`
- `scripts/k8s/logs.sh`
- `.forgejo/workflows/deploy.yml`
## Required local tools
Always required:
- `terraform`
- `kubectl`
- `curl`
- `python3`
- `ssh`
- `git`
- `base64`
- `realpath`
Conditionally required:
- `pass` when using any `*_PASS` mapping
- `htpasswd` is preferred for registry secret generation and avoids the docker fallback
- `docker` only for `BOOTSTRAP_DELIVERY_MODE=local-image-import`, or as a fallback when no native `htpasswd` binary is available locally
Required local Python modules:
- `PyYAML` (`python3 -m pip install PyYAML`) for kubeconfig rendering during bootstrap
- `PyNaCl` (`python3 -m pip install PyNaCl`) only when `BOOTSTRAP_DELIVERY_MODE=forgejo-actions` so bootstrap can encrypt Forgejo Actions secrets
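The tool and module lists above can be checked before a bootstrap run. The following is a minimal sketch, not part of `scripts/hetzner/bootstrap.sh`; the function names `missing_tools`, `missing_python_modules`, and `preflight` are hypothetical. It assumes PyYAML and PyNaCl are importable as `yaml` and `nacl`, which are their standard import names.

```bash
# Hypothetical preflight helpers; bootstrap.sh may perform its own checks.

# Print each command from the argument list that is not available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Print each Python module from the argument list that python3 cannot import.
missing_python_modules() {
  local mod
  for mod in "$@"; do
    python3 -c "import $mod" >/dev/null 2>&1 || printf '%s\n' "$mod"
  done
}

# Tools the list above marks as always required.
REQUIRED_TOOLS="terraform kubectl curl python3 ssh git base64 realpath"

# Return non-zero and report on stderr when anything is missing.
preflight() {
  local missing
  # Word splitting on REQUIRED_TOOLS is intentional.
  missing="$(missing_tools $REQUIRED_TOOLS; missing_python_modules yaml)"
  if [ -n "$missing" ]; then
    printf 'missing prerequisites:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

Calling `preflight` prints nothing on success. Conditional prerequisites can be checked the same way, for example `missing_tools pass htpasswd docker` or `missing_python_modules nacl` for the `forgejo-actions` delivery mode.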
## Required local env
Start from:
```bash
cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env
source scripts/hetzner/bootstrap-secrets.env
```
The mapping file should contain non-secret config plus `pass` entry references for secrets. Bootstrap and destroy load the first line from each configured pass entry without echoing it. Explicit env exports still override `pass` lookups.
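The precedence described above (explicit export first, then the first line of the `pass` entry) can be sketched as follows. This is an illustration under stated assumptions, not the code in `bootstrap.sh`: the function name `resolve_secret` is hypothetical, and the real scripts may handle errors differently.

```bash
# Hypothetical helper showing the lookup order; not the actual bootstrap code.
# Usage: token="$(resolve_secret HCLOUD_TOKEN)"
resolve_secret() {
  local name="$1" value entry
  # Read $NAME and $NAME_PASS indirectly; eval keeps this POSIX-compatible.
  eval "value=\${$name:-}"
  eval "entry=\${${name}_PASS:-}"
  if [ -n "$value" ]; then
    # An explicit export overrides the pass lookup.
    printf '%s\n' "$value"
  elif [ -n "$entry" ]; then
    # Only the first line of the pass entry is used.
    pass show "$entry" | head -n 1
  else
    printf 'resolve_secret: set %s or %s_PASS\n' "$name" "$name" >&2
    return 1
  fi
}
```

Capturing the result with `$(...)` keeps the secret out of terminal output, which matches the "without echoing it" behavior described above.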
When you run `scripts/hetzner/bootstrap.sh`, it uses this file to materialize local Kubernetes inputs before apply:
- overwrites `deploy/k8s/overlays/hetzner-single-node/secrets/unrip.env` with `NEAR_INTENTS_API_KEY`
- overwrites `deploy/k8s/overlays/hetzner-single-node/secrets/forgejo.env` with Forgejo `root_url` and `domain`
- overwrites `deploy/k8s/overlays/hetzner-single-node/secrets/observability.env` with Grafana bootstrap credentials and root URL
- renders `.state/hetzner/generated-overlay/` as the bootstrap-time source of truth
- copies the checked-in overlay patch behavior into that generated overlay
- imports platform resources from `deploy/k8s/platform/base/kustomization.yaml`, so newly added platform modules such as observability manifests are included automatically
- creates `registry-secrets` in namespace `registry` from `REGISTRY_USERNAME` and `REGISTRY_PASSWORD`
- creates the project docker-registry pull secret in `PROJECT_NAMESPACE` from the same registry credentials
This differs from running `kubectl apply -k deploy/k8s/overlays/hetzner-single-node` manually. A plain Kustomize apply consumes only the checked-in overlay files: it does not read `scripts/hetzner/bootstrap-secrets.env`, and it does not create the imperative registry auth secrets. Bootstrap applies the generated overlay copy instead.
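The `registry-secrets` step in the list above needs an htpasswd entry built from `REGISTRY_USERNAME` and `REGISTRY_PASSWORD`. A minimal sketch of the native-binary-then-docker fallback could look like this. The function name `registry_htpasswd_line` is hypothetical, and the `httpd:2` image used for the fallback is an assumption; the real script may use a different image or flags.

```bash
# Hypothetical helper; bootstrap.sh may generate the entry differently.
# Prints one bcrypt htpasswd line for the given user and password.
registry_htpasswd_line() {
  local user="$1" password="$2"
  if command -v htpasswd >/dev/null 2>&1; then
    # -B: bcrypt, -b: password from argv, -n: print to stdout instead of a file.
    htpasswd -Bbn "$user" "$password"
  else
    # Fallback: run htpasswd from a container image (image choice is an assumption).
    docker run --rm --entrypoint htpasswd httpd:2 -Bbn "$user" "$password"
  fi
}
```

Passing the password on the command line makes it briefly visible in the local process list, which is one reason to treat this as a workstation-only step.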
Required values:
- `HCLOUD_TOKEN_PASS` or `HCLOUD_TOKEN`
- `SSH_PUBLIC_KEY_PATH`
- `PUBLIC_DOMAIN`
- `BASE_DOMAIN`
- recommended Tailscale values:
  - `TAILSCALE_AUTH_KEY_PASS` or `TAILSCALE_AUTH_KEY`
  - optional `TAILSCALE_CONTROL_PLANE_HOSTNAME` to force a stable Tailscale DNS name for kube access
  - if `TAILSCALE_CONTROL_PLANE_HOSTNAME` is left empty, bootstrap auto-discovers the node via local `tailscale status --json`
- `FORGEJO_DOMAIN`
- `FORGEJO_ROOT_URL`
- `REGISTRY_DOMAIN`
- `GRAFANA_DOMAIN`
- `GRAFANA_ROOT_URL`
- `HEADLAMP_DOMAIN`
- `LETSENCRYPT_EMAIL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD_PASS` or `REGISTRY_PASSWORD`
- `NEAR_INTENTS_API_KEY_PASS` or `NEAR_INTENTS_API_KEY`
- `FORGEJO_ADMIN_USERNAME`
- `FORGEJO_ADMIN_EMAIL`
- `FORGEJO_ADMIN_PASSWORD_PASS` or `FORGEJO_ADMIN_PASSWORD`
- `GRAFANA_ADMIN_USERNAME` (defaults to `admin`)
- `GRAFANA_ADMIN_PASSWORD_PASS` or `GRAFANA_ADMIN_PASSWORD`
- optional `HEADLAMP_ADMIN_TOKEN_PASS` for storing the generated Headlamp login token back into `pass`
- optional repo settings: `FORGEJO_REPO_OWNER`, `FORGEJO_REPO_NAME`, `FORGEJO_REPO_PRIVATE`
- optional project path settings: `PROJECT_DIR`, `PROJECT_KUSTOMIZE_PATH`
Optional for automatic DNS:
- Cloudflare:
  - `CLOUDFLARE_API_TOKEN_PASS` or `CLOUDFLARE_API_TOKEN`
  - `CLOUDFLARE_ZONE_ID_PASS` or `CLOUDFLARE_ZONE_ID`
- Porkbun:
  - `PORKBUN_API_KEY_PASS` or `PORKBUN_API_KEY`
  - `PORKBUN_SECRET_API_KEY_PASS` or `PORKBUN_SECRET_API_KEY`
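As an illustration of how these values fit together, a mapping file might look like the fragment below. Every value is a placeholder, the `pass` entry paths are hypothetical, and the use of `export` is an assumption; `scripts/hetzner/bootstrap-secrets.env.example` remains the authoritative template.

```bash
# Illustrative bootstrap-secrets.env fragment. All values are placeholders.
export PUBLIC_DOMAIN="example.com"
export BASE_DOMAIN="example.com"
export FORGEJO_DOMAIN="git.${PUBLIC_DOMAIN}"
export FORGEJO_ROOT_URL="https://${FORGEJO_DOMAIN}/"
export REGISTRY_DOMAIN="registry.${PUBLIC_DOMAIN}"
export GRAFANA_DOMAIN="grafana.${PUBLIC_DOMAIN}"
export GRAFANA_ROOT_URL="https://${GRAFANA_DOMAIN}/"
export HEADLAMP_DOMAIN="headlamp.${PUBLIC_DOMAIN}"
export LETSENCRYPT_EMAIL="ops@example.com"
export SSH_PUBLIC_KEY_PATH="$HOME/.ssh/id_ed25519.pub"
export REGISTRY_USERNAME="registry"
export FORGEJO_ADMIN_USERNAME="forgejo-admin"
export FORGEJO_ADMIN_EMAIL="ops@example.com"

# Secrets are referenced by pass entry name (hypothetical paths), not stored here.
export HCLOUD_TOKEN_PASS="infra/hetzner/hcloud-token"
export TAILSCALE_AUTH_KEY_PASS="infra/tailscale/auth-key"
export REGISTRY_PASSWORD_PASS="infra/registry/password"
export NEAR_INTENTS_API_KEY_PASS="apps/unrip/near-intents-api-key"
export FORGEJO_ADMIN_PASSWORD_PASS="infra/forgejo/admin-password"
export GRAFANA_ADMIN_PASSWORD_PASS="infra/grafana/admin-password"
```

The subdomain pattern mirrors the records bootstrap manages when DNS credentials are present; adjust it if your hostnames differ.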
## Bootstrap
```bash
bash scripts/hetzner/bootstrap.sh
```
Outputs:
- Hetzner VM created
- Tailscale joined if configured
- k3s installed
- cloud-init writes `/opt/unrip/bootstrap/README.txt` as a marker that node-local repo bootstrap is not active yet
- kubeconfig written to `.state/hetzner/kubeconfig.yaml`
- CI kubeconfig written to `.state/hetzner/kubeconfig.incluster.yaml`
- overlay secrets and ingress host patches rendered from local env / `pass`
- `.state/hetzner/generated-overlay/` rendered and applied as the canonical bootstrap manifest set for that run
- namespaces, Redpanda, app deployments, Forgejo, registry, Traefik-targeted ingress resources, cert-manager, issuers, and any additional platform resources referenced by `deploy/k8s/platform/base/kustomization.yaml` applied
- Headlamp deployed and exposed on its configured public hostname (`HEADLAMP_DOMAIN`)
- bootstrap stores the generated Headlamp service-account token in `pass` when `HEADLAMP_ADMIN_TOKEN_PASS` is configured
- Forgejo admin account created automatically if missing
- Forgejo runner registration is generated automatically from inside the Forgejo pod and the resulting `/data/.runner` config is stored under the shared `forgejo-data` persistent volume used by the runner deployment
- Forgejo repository created automatically in either the admin user's namespace or a pre-existing organization named by `FORGEJO_REPO_OWNER`
- Forgejo Actions secrets and variables configured automatically
- repo pushed to Forgejo automatically in the default `forgejo-actions` delivery mode via authenticated HTTPS Git push
- first deployment triggered from Forgejo Actions by default
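The local artifacts in the list above can be verified after a run. The sketch below is a hypothetical helper (`check_bootstrap_outputs` is not part of the repo) that only checks the three `.state/hetzner/` paths named above; it does not contact the cluster.

```bash
# Hypothetical post-bootstrap check; paths come from the output list above.
# Usage: check_bootstrap_outputs [repo-root]   (defaults to the current directory)
check_bootstrap_outputs() {
  local root="${1:-.}" status=0 path
  for path in \
    .state/hetzner/kubeconfig.yaml \
    .state/hetzner/kubeconfig.incluster.yaml; do
    # -s: the file exists and is not empty.
    if [ ! -s "$root/$path" ]; then
      printf 'missing or empty: %s\n' "$path" >&2
      status=1
    fi
  done
  if [ ! -d "$root/.state/hetzner/generated-overlay" ]; then
    printf 'missing directory: %s\n' ".state/hetzner/generated-overlay" >&2
    status=1
  fi
  return "$status"
}
```

Run it from the repository root as `check_bootstrap_outputs`, or pass the checkout path as the first argument.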
## Tailscale-first admin access
Recommended mode:
- public firewall exposes only `80/443`
- admin access uses Tailscale
- Kubernetes API uses the Tailscale hostname when `TAILSCALE_CONTROL_PLANE_HOSTNAME` is set
`TF_ADMIN_CIDR_BLOCKS` remains only as a fallback if you intentionally want public admin/API exposure.
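When `TAILSCALE_CONTROL_PLANE_HOSTNAME` is left empty, bootstrap is described as discovering the node from local `tailscale status --json` output. One way such a lookup could work is sketched below. The function name `tailscale_dns_name` and the match-by-`HostName` rule are assumptions; the real script may select the node differently.

```bash
# Hypothetical helper: reads `tailscale status --json` output on stdin and
# prints the Tailscale DNS name of the peer whose HostName matches $1.
tailscale_dns_name() {
  python3 -c '
import json, sys

wanted = sys.argv[1]
status = json.load(sys.stdin)
# "Peer" maps node keys to peer records; it can be null when there are no peers.
for peer in (status.get("Peer") or {}).values():
    if peer.get("HostName") == wanted:
        # DNSName carries a trailing dot; strip it for use in a server URL.
        print(peer.get("DNSName", "").rstrip("."))
        break
else:
    # No peer matched, so signal failure to the caller.
    sys.exit(1)
' "$1"
}
```

A typical call would be `tailscale status --json | tailscale_dns_name <node-hostname>`, where the node hostname is whatever name the VM registered with Tailscale.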
## DNS and TLS
If DNS provider credentials are present, bootstrap updates:
- `${PUBLIC_DOMAIN}`
- `git.${PUBLIC_DOMAIN}`
- `registry.${PUBLIC_DOMAIN}`
- `grafana.${PUBLIC_DOMAIN}`
- `headlamp.${PUBLIC_DOMAIN}`
Supported scripted providers:
- Cloudflare
- Porkbun
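The record set above follows a fixed pattern, so it can be derived from `PUBLIC_DOMAIN` alone. The helper below is a hypothetical sketch (`dns_record_names` is not a repo function) that is handy for checking resolution after bootstrap.

```bash
# Hypothetical helper listing the record names bootstrap is described as managing.
dns_record_names() {
  local domain="$1" prefix
  # The apex record comes first, followed by the four service subdomains.
  printf '%s\n' "$domain"
  for prefix in git registry grafana headlamp; do
    printf '%s.%s\n' "$prefix" "$domain"
  done
}
```

For example, `dns_record_names "$PUBLIC_DOMAIN"` prints five names, one per line, which can be piped into a resolver check such as `dig +short` if `dig` is installed locally.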
TLS is handled in-cluster by cert-manager using Let's Encrypt issuers and the rendered ingress hosts.
Grafana and Headlamp are both exposed on public hostnames by default. Keep Grafana authenticated, and treat the Headlamp token as an operator credential.
The platform base assumes the default k3s Traefik ingress controller is present; it does not install ingress-nginx.
For clean-cluster applies, the base kustomization now includes cert-manager before the `ClusterIssuer` resources so the issuer CRs can be created in the same bootstrap flow.
## Observe the cluster
```bash
KUBECONFIG=.state/hetzner/kubeconfig.yaml kubectl get pods -A
bash scripts/k8s/logs.sh
```
For the web log UI and observability stack, see `docs/k8s-observability.md`.
## Self-hosted CI/CD handoff
Default bootstrap now automates the Forgejo handoff:
1. create the Forgejo repo in the admin namespace or in a pre-existing organization named by `FORGEJO_REPO_OWNER`
2. configure the repository Actions secrets:
   - `KUBECONFIG_B64`
   - `REGISTRY_USERNAME`
   - `REGISTRY_PASSWORD`
3. configure the repository Actions variables:
   - `REGISTRY_HOST=${REGISTRY_DOMAIN}`
   - `PROJECT_NAME`
   - `PROJECT_NAMESPACE`
   - `PROJECT_DEPLOYMENTS`
4. push the current repo to `main`
The workflow then:
- starts a Kubernetes Job in the target namespace
- checks out the repo inside that Job using the Forgejo job token via `Authorization: Bearer ...` HTTP auth
- uses Kaniko plus the Kubernetes registry auth secret to build and push `${REGISTRY_DOMAIN}/${PROJECT_NAME}:${GIT_SHA}` from `PROJECT_PATH` inside the repo checkout
- updates the app deployments in `PROJECT_NAMESPACE`
- waits for rollout
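Two values in the handoff above have a fixed shape: the base64-encoded kubeconfig secret and the image reference. The sketch below shows how they could be computed. Both function names are hypothetical, and which kubeconfig file feeds `KUBECONFIG_B64` is an assumption (the CI kubeconfig at `.state/hetzner/kubeconfig.incluster.yaml` is the likely candidate).

```bash
# Hypothetical helpers; bootstrap.sh and deploy.yml may compute these differently.

# Base64-encode a kubeconfig onto a single line, the shape a secret such as
# KUBECONFIG_B64 would need.
encode_kubeconfig_b64() {
  # tr strips the line wrapping that some base64 implementations add.
  base64 < "$1" | tr -d '\n'
}

# Compose the image reference the workflow is described as building and pushing.
image_ref() {
  local registry="$1" project="$2" sha="$3"
  printf '%s/%s:%s\n' "$registry" "$project" "$sha"
}
```

For example, `image_ref "$REGISTRY_DOMAIN" "$PROJECT_NAME" "$(git rev-parse HEAD)"` yields the tag a deployment update would reference, assuming `GIT_SHA` is the full commit hash.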
Legacy local-image bootstrap remains available with:
```bash
BOOTSTRAP_DELIVERY_MODE=local-image-import bash scripts/hetzner/bootstrap.sh
```
## Destroy everything
Default destroy only removes Terraform-managed Hetzner infrastructure:
```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh
```
Opt-in flags make destructive cleanup of bootstrap-managed leftovers explicit:
```bash
source scripts/hetzner/bootstrap-secrets.env
DESTROY_DNS=true \
DESTROY_LOCAL_STATE=true \
DESTROY_FORGEJO_REPO=true \
bash scripts/hetzner/destroy.sh
```
`destroy.sh` reads `HCLOUD_TOKEN`, optional `TAILSCALE_AUTH_KEY`, optional DNS provider credentials, and optional Forgejo admin credentials via the same `*_PASS` mapping mechanism as bootstrap.
It uses the same Terraform inputs as bootstrap for the infrastructure resources, then can optionally:
- delete the scripted DNS records for `${PUBLIC_DOMAIN}`, `git.${PUBLIC_DOMAIN}`, `registry.${PUBLIC_DOMAIN}`, `grafana.${PUBLIC_DOMAIN}`, and `headlamp.${PUBLIC_DOMAIN}`
- remove local bootstrap artifacts under `.state/hetzner/`, `deploy/k8s/overlays/hetzner-single-node/generated/`, and the local Terraform working/state files in `infra/terraform/hetzner/`
- delete the bootstrap-managed Forgejo repository via the Forgejo API
Supported scripted DNS cleanup providers:
- Cloudflare
- Porkbun
Cleanup defaults are intentionally conservative:
- `DESTROY_DNS=false` keeps provider records unless you explicitly opt in
- `DESTROY_LOCAL_STATE=false` keeps the last kubeconfigs and generated manifests for inspection
- `DESTROY_FORGEJO_REPO=false` keeps the remote Git repository unless you explicitly opt in
If any optional cleanup step is enabled but cannot run because credentials are missing, `destroy.sh` prints a skip message describing what was not removed.
If DNS cleanup or Forgejo repo deletion fails after Terraform teardown, rerun the same cleanup flags or remove the remaining resources manually.
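The opt-in behavior above amounts to treating each flag as false unless it is explicitly set to `true`. A minimal sketch of that logic follows; `flag_enabled` and `planned_cleanup` are hypothetical names, and `destroy.sh` may accept other truthy spellings.

```bash
# Hypothetical sketch of opt-in flag handling; destroy.sh may parse flags differently.

# Return 0 only when the named variable is exactly "true"; unset means false.
flag_enabled() {
  local value
  eval "value=\${$1:-false}"
  [ "$value" = "true" ]
}

# Print which optional cleanup steps the current environment would enable.
planned_cleanup() {
  local flag
  for flag in DESTROY_DNS DESTROY_LOCAL_STATE DESTROY_FORGEJO_REPO; do
    if flag_enabled "$flag"; then
      printf '%s\n' "$flag"
    fi
  done
}
```

Running `planned_cleanup` before `destroy.sh` gives a quick dry view of which destructive extras are switched on in the current shell.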
## Current limitations
- organization-owned repo bootstrap works only when `FORGEJO_REPO_OWNER` names a pre-existing organization that the configured admin can create repositories in; bootstrap does not create the organization itself
- unattended repo seeding now uses an authenticated HTTPS remote built from the configured Forgejo admin credentials, which leaves those credentials in `.git/config`; operators who do not want that should replace the local remote with a token, SSH, or credential-helper-backed remote after bootstrap
- cloud-init no longer clones a bootstrap repository onto the node; Kubernetes asset delivery is still workstation-driven after Terraform
- `bootstrap_repo_path` in Terraform is only a reserved marker for a future node-local bootstrap/GitOps flow
- bootstrap requires either a local `htpasswd` binary or local `docker` as a fallback to generate the registry htpasswd secret
- bootstrap and CI authentication paths should still be hardened before production use
- runner identity is persisted under the shared `forgejo-data` PVC, so deleting the `forgejo-runner` pod is safe but deleting that PVC forces re-registration on the next bootstrap run
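For the credential-bearing HTTPS remote mentioned in the limitations above, one cleanup option is to rewrite the remote URL without the embedded user and password. The helper below is a hypothetical sketch; the remote name `origin` in the usage comment is an assumption, since the name bootstrap uses is not stated here.

```bash
# Hypothetical helper: remove "user:password@" from an HTTP(S) remote URL.
strip_url_credentials() {
  # Keep the scheme, then drop everything up to the first "@" before any "/".
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1#'
}

# Example usage (remote name is an assumption):
#   git remote set-url origin "$(strip_url_credentials "$(git remote get-url origin)")"
```

After the rewrite, pushes will need a credential helper, token, or SSH remote, as the limitation above suggests.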