doran/docs/hetzner-k3s-bootstrap.md
2026-03-28 23:05:43 +01:00

# Hetzner + k3s + self-hosted Git/CI bootstrap
Goal: provision and deploy everything from this repo to a single Hetzner machine with no manual server login.
## Stack
- Terraform provisions the Hetzner Cloud VM, private network, and firewall
- cloud-init installs Tailscale first when configured, then installs k3s automatically
- cloud-init leaves only a bootstrap marker on the node; it does not clone this repo or apply Kubernetes assets
- Kubernetes manifests deploy:
  - Redpanda
  - trading system services
  - private registry
  - Forgejo
  - k3s-bundled Traefik ingress resources
  - cert-manager
  - ACME issuers
- local bootstrap script:
  - runs Terraform
  - optionally creates DNS records via Cloudflare or Porkbun
  - fetches the real kubeconfig from the node
  - writes overlay secrets/host patches from local env
  - applies the Hetzner single-node k8s overlay from the operator workstation checkout
  - builds the current app image locally
  - imports the bootstrap image into k3s for the first rollout
## Files
- `infra/terraform/hetzner/`
- `deploy/k8s/base/`
- `deploy/k8s/overlays/hetzner-single-node/`
- `scripts/hetzner/bootstrap.sh`
- `scripts/hetzner/configure-cloudflare-dns.sh`
- `scripts/hetzner/destroy.sh`
- `scripts/k8s/logs.sh`
- `.forgejo/workflows/deploy.yml`
## Required local tools
Always required:
- `terraform`
- `kubectl`
- `curl`
- `python3`
- `ssh`
- `git`
- `base64`
- `realpath`
Conditionally required:
- `pass` when using any `*_PASS` mapping
- `docker` only for `BOOTSTRAP_DELIVERY_MODE=local-image-import`, or as a fallback when no native `htpasswd` binary is available locally
- `htpasswd`, preferred for registry secret generation because it avoids the `docker` fallback
Required local Python modules:
- `PyYAML` (`python3 -m pip install PyYAML`) for kubeconfig rendering during bootstrap
- `PyNaCl` (`python3 -m pip install PyNaCl`) only when `BOOTSTRAP_DELIVERY_MODE=forgejo-actions` so bootstrap can encrypt Forgejo Actions secrets
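A preflight check can catch a missing tool before Terraform starts. The following is a minimal sketch, not part of `scripts/hetzner/bootstrap.sh`; the `missing_tools` function and the `REQUIRED_TOOLS` variable are hypothetical names introduced here for illustration.

```bash
# Hypothetical preflight helper; not part of scripts/hetzner/bootstrap.sh.
# Prints each command from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools this document lists as always required.
REQUIRED_TOOLS="terraform kubectl curl python3 ssh git base64 realpath"

# Example usage (word splitting of $REQUIRED_TOOLS is intentional):
#   missing="$(missing_tools $REQUIRED_TOOLS)"
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
#   python3 -c 'import yaml' || echo "PyYAML is not installed" >&2
```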
## Required local env
Start from:
```bash
cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env
source scripts/hetzner/bootstrap-secrets.env
```
The mapping file should contain non-secret config plus `pass` entry references for secrets. Bootstrap and destroy load the first line from each configured pass entry without echoing it. Explicit env exports still override `pass` lookups.
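The lookup order described above (explicit value first, then the first line of the `pass` entry) could look roughly like the sketch below. `resolve_secret` is a hypothetical name, and the real scripts may implement the lookup differently; only `pass show` is an actual `pass` command.

```bash
# Hypothetical helper illustrating the lookup order; not taken from bootstrap.sh.
# resolve_secret NAME prints $NAME if it is set and non-empty; otherwise it
# prints the first line of the pass entry named by $NAME_PASS.
resolve_secret() {
  local name="$1" value entry
  # Indirect lookup of the variables called NAME and NAME_PASS.
  eval "value=\${$name-}"
  eval "entry=\${${name}_PASS-}"
  if [ -n "$value" ]; then
    printf '%s\n' "$value"          # an explicit env value wins
  elif [ -n "$entry" ]; then
    pass show "$entry" | head -n 1  # only the first line of the entry is used
  else
    echo "neither $name nor ${name}_PASS is set" >&2
    return 1
  fi
}

# Example usage (capture the value instead of printing it to the terminal):
#   HCLOUD_TOKEN="$(resolve_secret HCLOUD_TOKEN)"
```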
When you run `scripts/hetzner/bootstrap.sh`, it uses this file to materialize local Kubernetes inputs before apply:
- overwrites `deploy/k8s/overlays/hetzner-single-node/secrets/unrip.env` with `NEAR_INTENTS_API_KEY`
- overwrites `deploy/k8s/overlays/hetzner-single-node/secrets/forgejo.env` with Forgejo `root_url` and `domain`
- renders generated ingress and issuer patch files under `.state/hetzner/generated-overlay/`
- creates `registry-secrets` in namespace `registry` from `REGISTRY_USERNAME` and `REGISTRY_PASSWORD`
- creates the project docker-registry pull secret in `PROJECT_NAMESPACE` from the same registry credentials
This is different from running `kubectl apply -k deploy/k8s/overlays/hetzner-single-node` manually: plain Kustomize apply only consumes the checked-in overlay files and only generates `unrip-secrets` and `forgejo-secrets`. It does not create registry auth secrets and does not read `scripts/hetzner/bootstrap-secrets.env` on its own.
Required values:
- `HCLOUD_TOKEN_PASS` or `HCLOUD_TOKEN`
- `SSH_PUBLIC_KEY_PATH`
- `PUBLIC_DOMAIN`
- `BASE_DOMAIN`
- recommended Tailscale values:
  - `TAILSCALE_AUTH_KEY_PASS` or `TAILSCALE_AUTH_KEY`
  - optional `TAILSCALE_CONTROL_PLANE_HOSTNAME` to force a stable Tailscale DNS name for kube access
  - if `TAILSCALE_CONTROL_PLANE_HOSTNAME` is left empty, bootstrap auto-discovers the node via local `tailscale status --json`
- `FORGEJO_DOMAIN`
- `FORGEJO_ROOT_URL`
- `REGISTRY_DOMAIN`
- `LETSENCRYPT_EMAIL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD_PASS` or `REGISTRY_PASSWORD`
- `NEAR_INTENTS_API_KEY_PASS` or `NEAR_INTENTS_API_KEY`
- `FORGEJO_ADMIN_USERNAME`
- `FORGEJO_ADMIN_EMAIL`
- `FORGEJO_ADMIN_PASSWORD_PASS` or `FORGEJO_ADMIN_PASSWORD`
- optional repo settings: `FORGEJO_REPO_OWNER`, `FORGEJO_REPO_NAME`, `FORGEJO_REPO_PRIVATE`
Optional for automatic DNS:
- Cloudflare:
  - `CLOUDFLARE_API_TOKEN_PASS` or `CLOUDFLARE_API_TOKEN`
  - `CLOUDFLARE_ZONE_ID_PASS` or `CLOUDFLARE_ZONE_ID`
- Porkbun:
  - `PORKBUN_API_KEY_PASS` or `PORKBUN_API_KEY`
  - `PORKBUN_SECRET_API_KEY_PASS` or `PORKBUN_SECRET_API_KEY`
## Bootstrap
```bash
bash scripts/hetzner/bootstrap.sh
```
Outputs:
- Hetzner VM created
- Tailscale joined if configured
- k3s installed
- cloud-init writes `/opt/unrip/bootstrap/README.txt` as a marker that node-local repo bootstrap is not active yet
- kubeconfig written to `.state/hetzner/kubeconfig.yaml`
- CI kubeconfig written to `.state/hetzner/kubeconfig.incluster.yaml`
- overlay secrets and ingress host patches rendered from local env / `pass`
- namespaces, Redpanda, app deployments, Forgejo, registry, Traefik-targeted ingress resources, cert-manager, and issuers applied
- Forgejo admin account created automatically if missing
- Forgejo runner registration generated automatically from inside the Forgejo pod; the resulting `/data/.runner` config is stored on the shared `forgejo-data` persistent volume used by the runner deployment
- Forgejo repository created automatically in either the admin user's namespace or a pre-existing organization named by `FORGEJO_REPO_OWNER`
- Forgejo Actions secrets and variables configured automatically
- repo pushed to Forgejo automatically in the default `forgejo-actions` delivery mode via authenticated HTTPS Git push
- first deployment triggered from Forgejo Actions by default
## Tailscale-first admin access
Recommended mode:
- public firewall exposes only `80/443`
- admin access uses Tailscale
- Kubernetes API uses the Tailscale hostname when `TAILSCALE_CONTROL_PLANE_HOSTNAME` is set
`TF_ADMIN_CIDR_BLOCKS` remains only as a fallback if you intentionally want public admin/API exposure.
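When `TAILSCALE_CONTROL_PLANE_HOSTNAME` is empty, the node has to be found in the local `tailscale status --json` output. The sketch below shows one way such a lookup might work. It assumes the JSON has a `Peer` map whose entries carry `HostName` and `DNSName` fields; `tailscale_dns_name` and the hostname `unrip-cp` are hypothetical, and the actual discovery logic in bootstrap may differ.

```bash
# Hypothetical helper: reads `tailscale status --json` on stdin and prints the
# DNS name of the first peer whose HostName equals $1. Exits non-zero if no
# peer matches.
tailscale_dns_name() {
  python3 -c '
import json, sys

wanted = sys.argv[1]
status = json.load(sys.stdin)
# "Peer" maps node keys to peer records; it can be null when there are no peers.
for peer in (status.get("Peer") or {}).values():
    if peer.get("HostName") == wanted:
        # DNSName is fully qualified and ends with a dot; strip the dot.
        print(peer.get("DNSName", "").rstrip("."))
        sys.exit(0)
sys.exit(1)
' "$1"
}

# Example usage, assuming the node registered under the hostname "unrip-cp":
#   tailscale status --json | tailscale_dns_name unrip-cp
```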
## DNS and TLS
If DNS provider credentials are present, bootstrap updates:
- `${BASE_DOMAIN}`
- `git.${BASE_DOMAIN}`
- `registry.${BASE_DOMAIN}`
Supported scripted providers:
- Cloudflare
- Porkbun
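The three records above all derive from `BASE_DOMAIN`, which makes a post-bootstrap resolution check easy to script. The sketch below is illustrative only; `bootstrap_hostnames` is a hypothetical helper, not a function from the repo.

```bash
# Hypothetical helper: prints the three hostnames bootstrap manages for a base domain.
bootstrap_hostnames() {
  local base="$1"
  printf '%s\n' "$base" "git.$base" "registry.$base"
}

# Example usage: confirm each name resolves before expecting certificates to issue.
#   for host in $(bootstrap_hostnames "$BASE_DOMAIN"); do
#     python3 -c 'import socket, sys; print(sys.argv[1], socket.gethostbyname(sys.argv[1]))' "$host"
#   done
```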
TLS is handled in-cluster by cert-manager using Let's Encrypt issuers and the rendered ingress hosts.
The platform base assumes the default k3s Traefik ingress controller is present; it does not install ingress-nginx.
For clean-cluster applies, the base kustomization now includes cert-manager before the `ClusterIssuer` resources so the issuer CRs can be created in the same bootstrap flow.
## Observe the cluster
```bash
KUBECONFIG=.state/hetzner/kubeconfig.yaml kubectl get pods -A
bash scripts/k8s/logs.sh
```
## Self-hosted CI/CD handoff
Default bootstrap now automates the Forgejo handoff:
1. create the Forgejo repo in the admin namespace or in a pre-existing organization named by `FORGEJO_REPO_OWNER`
2. configure the repository Actions secrets:
   - `KUBECONFIG_B64`
   - `REGISTRY_USERNAME`
   - `REGISTRY_PASSWORD`
3. configure the repository Actions variables:
   - `REGISTRY_HOST=${REGISTRY_DOMAIN}`
   - `PROJECT_NAME`
   - `PROJECT_NAMESPACE`
   - `PROJECT_DEPLOYMENTS`
4. push the current repo to `main`
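A secret such as `KUBECONFIG_B64` is usually a kubeconfig encoded as single-line base64. The sketch below shows that encoding step under the assumption that the CI kubeconfig at `.state/hetzner/kubeconfig.incluster.yaml` is the source file; `encode_b64_one_line` is a hypothetical helper, and bootstrap may produce the value differently.

```bash
# Hypothetical helper: encodes a file as single-line base64.
encode_b64_one_line() {
  # tr strips the line wrapping that GNU base64 adds by default.
  base64 < "$1" | tr -d '\n'
}

# Example usage, assuming the CI kubeconfig is the file behind KUBECONFIG_B64:
#   KUBECONFIG_B64="$(encode_b64_one_line .state/hetzner/kubeconfig.incluster.yaml)"
# Inside a job, the reverse step would be:
#   printf '%s' "$KUBECONFIG_B64" | base64 -d > "$HOME/.kube/config"
```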
The workflow then:
- starts a Kubernetes Job in the target namespace
- checks out the repo inside that Job using the Forgejo job token via `Authorization: Bearer ...` HTTP auth
- uses Kaniko plus the Kubernetes registry auth secret to build and push `${REGISTRY_DOMAIN}/${PROJECT_NAME}:${GIT_SHA}`
- updates the app deployments in `PROJECT_NAMESPACE`
- waits for rollout
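The last two workflow steps amount to pointing each deployment at the new image tag and waiting for it to settle. The sketch below prints the equivalent `kubectl` commands rather than running them. It assumes `PROJECT_DEPLOYMENTS` is a comma-separated list, which this document does not specify; `plan_rollout` is a hypothetical name, and the actual workflow in `.forgejo/workflows/deploy.yml` may differ.

```bash
# Hypothetical helper: prints the kubectl commands for a rollout instead of running them.
# Assumes the deployments argument is a comma-separated list; the real format may differ.
plan_rollout() {
  local registry="$1" project="$2" sha="$3" namespace="$4" deployments="$5"
  local image="${registry}/${project}:${sha}"
  local name
  for name in $(printf '%s' "$deployments" | tr ',' ' '); do
    # "*=image" sets the image on every container in the deployment.
    printf "kubectl -n %s set image deployment/%s '*=%s'\n" "$namespace" "$name" "$image"
    printf 'kubectl -n %s rollout status deployment/%s --timeout=300s\n' "$namespace" "$name"
  done
}

# Example usage:
#   plan_rollout "$REGISTRY_DOMAIN" "$PROJECT_NAME" "$GIT_SHA" "$PROJECT_NAMESPACE" "$PROJECT_DEPLOYMENTS"
```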
Legacy local-image bootstrap remains available with:
```bash
BOOTSTRAP_DELIVERY_MODE=local-image-import bash scripts/hetzner/bootstrap.sh
```
## Destroy everything
Default destroy only removes Terraform-managed Hetzner infrastructure:
```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh
```
Opt-in flags make destructive cleanup of bootstrap-managed leftovers explicit:
```bash
source scripts/hetzner/bootstrap-secrets.env
DESTROY_DNS=true \
DESTROY_LOCAL_STATE=true \
DESTROY_FORGEJO_REPO=true \
bash scripts/hetzner/destroy.sh
```
`destroy.sh` reads `HCLOUD_TOKEN`, optional `TAILSCALE_AUTH_KEY`, optional DNS provider credentials, and optional Forgejo admin credentials via the same `*_PASS` mapping mechanism as bootstrap.
It uses the same Terraform inputs as bootstrap for the infrastructure resources, then can optionally:
- delete the scripted DNS records for `${BASE_DOMAIN}`, `git.${BASE_DOMAIN}`, and `registry.${BASE_DOMAIN}`
- remove local bootstrap artifacts under `.state/hetzner/`, `deploy/k8s/overlays/hetzner-single-node/generated/`, and the local Terraform working/state files in `infra/terraform/hetzner/`
- delete the bootstrap-managed Forgejo repository via the Forgejo API
Supported scripted DNS cleanup providers:
- Cloudflare
- Porkbun
Cleanup defaults are intentionally conservative:
- `DESTROY_DNS=false` keeps provider records unless you explicitly opt in
- `DESTROY_LOCAL_STATE=false` keeps the last kubeconfigs and generated manifests for inspection
- `DESTROY_FORGEJO_REPO=false` keeps the remote Git repository unless you explicitly opt in
If any optional cleanup step is enabled but cannot run because credentials are missing, `destroy.sh` prints a skip message describing what was not removed.
If DNS cleanup or Forgejo repo deletion fails after Terraform teardown, rerun the same cleanup flags or remove the remaining resources manually.
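The opt-in behavior of the `DESTROY_*` flags can be pictured as the sketch below: every flag defaults to `false`, and only an explicit `true` enables a step. `cleanup_plan` and its step labels are hypothetical; `destroy.sh` may structure this differently.

```bash
# Hypothetical sketch of opt-in cleanup flags; destroy.sh may implement this differently.
# cleanup_plan prints the optional steps enabled by the DESTROY_* variables.
cleanup_plan() {
  # Every flag defaults to false, so nothing optional runs unless requested.
  [ "${DESTROY_DNS:-false}" = "true" ] && echo "dns-records"
  [ "${DESTROY_LOCAL_STATE:-false}" = "true" ] && echo "local-state"
  [ "${DESTROY_FORGEJO_REPO:-false}" = "true" ] && echo "forgejo-repo"
  return 0
}

# Example usage: preview what an invocation would clean up.
#   DESTROY_DNS=true cleanup_plan
```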
## Current limitations
- organization-owned repo bootstrap works only when `FORGEJO_REPO_OWNER` names a pre-existing organization that the configured admin can create repositories in; bootstrap does not create the organization itself
- unattended repo seeding now uses an authenticated HTTPS remote built from the configured Forgejo admin credentials, so operators should replace that local remote with a token, SSH, or credential-helper-backed remote after bootstrap if they do not want credentials stored in `.git/config`
- cloud-init no longer clones a bootstrap repository onto the node; Kubernetes asset delivery is still workstation-driven after Terraform
- `bootstrap_repo_path` in Terraform is only a reserved marker for a future node-local bootstrap/GitOps flow
- bootstrap requires either a local `htpasswd` binary or local `docker` as a fallback to generate the registry htpasswd secret
- bootstrap and CI authentication paths should still be hardened before production use
- runner identity is persisted under the shared `forgejo-data` PVC, so deleting the `forgejo-runner` pod is safe but deleting that PVC forces re-registration on the next bootstrap run
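For the credential-bearing HTTPS remote mentioned above, a quick check can tell whether a remote URL still embeds a password. This is a minimal sketch: `remote_has_credentials` is a hypothetical helper, the remote name `origin` is an assumption, and the SSH URL in the usage comment assumes Forgejo SSH access is reachable from the workstation, which this document does not confirm.

```bash
# Hypothetical helper: succeeds when an HTTP(S) remote URL embeds user:password@.
remote_has_credentials() {
  local rest authority
  case "$1" in
    http://*|https://*) rest="${1#*://}" ;;
    *) return 1 ;;
  esac
  authority="${rest%%/*}"   # host part, possibly with user:pass@ in front
  case "$authority" in
    *:*@*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (remote name and SSH URL form are assumptions):
#   url="$(git remote get-url origin)"
#   if remote_has_credentials "$url"; then
#     git remote set-url origin "ssh://git@${FORGEJO_DOMAIN}/${FORGEJO_REPO_OWNER}/${FORGEJO_REPO_NAME}.git"
#   fi
```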