doran/deploy/hetzner/README.md
2026-03-28 20:53:29 +01:00


# Hetzner single-node bootstrap (Terraform + cloud-init + k3s)
This is the canonical first-production deployment path for the repo.
A local operator workstation drives the first deployment end to end:
- Terraform provisions Hetzner infrastructure
- cloud-init installs k3s automatically on first boot
- the workstation waits for the public Kubernetes API
- the workstation creates initial Kubernetes Secrets
- the workstation applies repo-managed Kubernetes manifests
- the workstation performs the first image/bootstrap delivery attempt
- once Forgejo + runner are alive, routine app deploys are intended to move to self-hosted CI

Compose remains available for local development, but it is not the primary production deployment model.
## Scope of this layer
The foundation under `infra/terraform/hetzner` provisions:
- one Hetzner Cloud server
- one SSH key resource based on your local public key
- firewall rules for SSH, Kubernetes API, and HTTP/HTTPS ingress
- a private network attachment for future growth
- cloud-init user-data for unattended k3s installation and host preparation

The repo bootstrap then applies the Hetzner single-node overlay under `deploy/k8s/overlays/hetzner-single-node`, which composes Kubernetes resources under `deploy/k8s/` for:
- shared platform namespaces and services
- Redpanda
- unrip workloads
- Forgejo
- Forgejo runner
- private registry
- ingress/TLS-related resources
- Redpanda topic bootstrap job
## Prerequisites
Install on the operator workstation:
- Terraform `>= 1.6`
- `kubectl`
- `docker`
- `curl`

You also need:
- a Hetzner Cloud API token
- an SSH keypair already present locally
- access to DNS for your chosen domains
- admin CIDRs that can reach the future server on `22/tcp` and `6443/tcp`
- this repo checked out locally
## Required bootstrap secrets and inputs
Prepare the operator env file:
```bash
cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env
```
Set at least:
- `HCLOUD_TOKEN`
- `SSH_PUBLIC_KEY_PATH`
- `TF_ADMIN_CIDR_BLOCKS`
- `BASE_DOMAIN`
- `FORGEJO_DOMAIN`
- `FORGEJO_ROOT_URL`
- `NEAR_INTENTS_API_KEY`
- `FORGEJO_RUNNER_REGISTRATION_TOKEN`

Load it into the current shell:
```bash
source scripts/hetzner/bootstrap-secrets.env
```
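
A quick pre-flight check can catch unset inputs before Terraform runs. The following is a minimal sketch and not part of the repo: `require_vars` is a hypothetical helper, and it only confirms that each variable is non-empty in the current shell, not that its value is valid.

```bash
# Hypothetical helper (not from the repo): report every required variable
# that is unset or empty in the current shell.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [[ -z "${!name:-}" ]]; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, using the variable names listed above:
#   require_vars HCLOUD_TOKEN SSH_PUBLIC_KEY_PATH TF_ADMIN_CIDR_BLOCKS \
#     BASE_DOMAIN FORGEJO_DOMAIN FORGEJO_ROOT_URL \
#     NEAR_INTENTS_API_KEY FORGEJO_RUNNER_REGISTRATION_TOKEN
```

The helper reports all missing names in one pass rather than stopping at the first, so the env file can be fixed in a single edit.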
## Canonical bootstrap sequence
Run from repo root:
```bash
bash scripts/hetzner/bootstrap.sh
```
Current behavior of the script:
1. validates local tooling
2. runs `terraform init` and `terraform apply` in `infra/terraform/hetzner`
3. reads Terraform outputs such as server IP and `k3s_api_url`
4. waits for the k3s API readiness endpoint
5. writes a local workstation kubeconfig to `.state/hetzner/kubeconfig.yaml`
6. writes overlay secret env input files and creates:
   - `unrip/unrip-secrets`
   - `unrip/unrip-registry-creds`
   - `forgejo/forgejo-secrets`
   - `registry/registry-secrets`
7. applies `deploy/k8s/platform/base/namespace.yaml` and `deploy/k8s/overlays/hetzner-single-node`
8. builds the repo bootstrap image locally
9. pushes it through the temporary local registry bridge using the active project name
10. updates the deployments in the active project namespace and waits for their rollout status

After the script finishes:
```bash
export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n unrip get deploy,pods,jobs
kubectl -n forgejo get deploy,pods,svc
kubectl -n registry get pods,svc
```
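
Step 4 of the sequence above (waiting for the k3s API) is a poll-until-ready loop. The sketch below shows one way to write such a loop by hand, for example when retrying manually after a partial failure. It is not taken from `bootstrap.sh`: `wait_for` is a hypothetical helper, and the endpoint the script actually polls is not documented here.

```bash
# Hypothetical helper (not from bootstrap.sh): run a command repeatedly
# until it succeeds or the attempt budget is exhausted.
#   $1 = maximum attempts, $2 = delay in seconds, rest = command to run
wait_for() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last failure.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (assumes k3s_api_url is a full https:// URL; /readyz is the
# standard Kubernetes readiness path, but using it here is an assumption):
#   api_url="$(terraform -chdir=infra/terraform/hetzner output -raw k3s_api_url)"
#   wait_for 60 5 curl --silent --insecure --output /dev/null "${api_url}/readyz"
```

Without `--fail`, `curl` exits 0 on any HTTP response, so the example only shows that the API answers over TLS. Whether `/readyz` answers without credentials depends on how the cluster is configured.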
## Current manifest target
Important current-state detail:
- `scripts/hetzner/bootstrap.sh` now applies `deploy/k8s/platform/base/namespace.yaml`
- it then applies `deploy/k8s/overlays/hetzner-single-node`
- bootstrap naming no longer assumes legacy `trading-system` kubeconfig contexts, image tags, or rollout namespaces
## Executor persistence in k3s
The dummy executor persists durable idempotency state.
Current persistence model:
- application path: `EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state`
- cloud-init prepares the host boundary for executor storage on first boot
- Kubernetes mounts storage at that same path for the executor workload
- the Hetzner single-node overlay pins PVC-backed storage to k3s `local-path`

Operational consequence:
- executor duplicate-suppression state lives on node-backed persistent storage
- replacing the node or deleting the PVC without migration loses that history
- treat executor state as required operational data, even though the executor is still a dummy implementation
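
Because that state is lost with the node or the PVC, it can be worth copying it off the cluster before any destructive change. The following is a sketch under assumptions, not a repo-provided procedure: `backup_executor_state` is a hypothetical helper, and it assumes the `dummy-executor` container image ships `tar` and that the deployment runs in the `unrip` namespace.

```bash
# Hypothetical helper (not from the repo): stream the executor state
# directory out of the running pod into a local tar.gz archive.
#   $1 = optional output path (defaults to a timestamped file name)
backup_executor_state() {
  local out="${1:-executor-state-$(date +%Y%m%dT%H%M%S).tar.gz}"
  # Assumes tar exists inside the container; the path matches
  # EXECUTOR_STATE_DIR as documented above.
  kubectl -n unrip exec deploy/dummy-executor -- \
    tar -C /var/lib/unrip/executor-state -czf - . > "$out" || return 1
  echo "$out"
}

# Example:
#   export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
#   kubectl -n unrip get pvc          # confirm the claim is Bound first
#   backup_executor_state ./executor-state-backup.tar.gz
```

If the image has no `tar`, the same data would have to be copied from the node's `local-path` volume directory instead.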
## Failure recovery runbook
### A. Bootstrap fails before infrastructure exists
Typical causes:
- invalid `HCLOUD_TOKEN`
- wrong `SSH_PUBLIC_KEY_PATH`
- malformed `TF_ADMIN_CIDR_BLOCKS`

Fix the input and rerun:
```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/bootstrap.sh
```
If you need to destroy partially created infrastructure:
```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh
```
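
Since a malformed `TF_ADMIN_CIDR_BLOCKS` is one of the typical causes, individual entries can be sanity-checked before rerunning. This is a minimal sketch: `is_ipv4_cidr` is a hypothetical helper that validates a single IPv4 CIDR string. The overall format the variable expects (for example a JSON list or a comma-separated string) is defined by the env example file and is not assumed here.

```bash
# Hypothetical helper (not from the repo): succeed only if $1 looks like
# a syntactically valid IPv4 CIDR such as 203.0.113.10/32.
is_ipv4_cidr() {
  local cidr="$1" i
  local re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})/([0-9]{1,2})$'
  [[ "$cidr" =~ $re ]] || return 1
  for i in 1 2 3 4; do
    # 10# forces base 10 so octets like "08" are not parsed as octal.
    ((10#${BASH_REMATCH[i]} <= 255)) || return 1
  done
  ((10#${BASH_REMATCH[5]} <= 32))
}

# Example:
#   is_ipv4_cidr "203.0.113.10/32" || echo "bad admin CIDR" >&2
```

The check is syntactic only. It does not verify that host bits are zero or that the range actually covers the workstation's public address.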
### B. Terraform succeeds but cluster access is not usable
Verify the generated kubeconfig and cluster health:
```bash
export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```
What to suspect first:
- cloud-init still running
- k3s still starting
- bootstrap kubeconfig/auth not fully aligned yet
- public API reachable, but workloads not yet healthy
### C. Secrets were wrong or missing
The current bootstrap depends on:
- `${PROJECT_NAME:-unrip}-secrets`
  - `NEAR_INTENTS_API_KEY`
- `forgejo-secrets`
  - `root_url`
  - `domain`
  - `runner_registration_token`

Verify:
```bash
kubectl -n unrip get secret unrip-secrets
kubectl -n unrip get secret unrip-registry-creds
kubectl -n forgejo get secret forgejo-secrets
kubectl -n registry get secret registry-secrets
```
If needed, recreate them from the workstation before restarting the affected deployments.
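
One way to recreate a Secret idempotently from the workstation is to render it client-side and pipe it to `kubectl apply`. The sketch below covers only `unrip-secrets` with the `NEAR_INTENTS_API_KEY` key. It is not necessarily how `bootstrap.sh` creates the Secret, and `recreate_unrip_secret` is a hypothetical helper name.

```bash
# Hypothetical helper (not from bootstrap.sh): create or update the
# unrip-secrets Secret from the NEAR_INTENTS_API_KEY shell variable.
recreate_unrip_secret() {
  # Abort with a message if the variable is unset or empty.
  : "${NEAR_INTENTS_API_KEY:?NEAR_INTENTS_API_KEY must be set}"
  # --dry-run=client renders the manifest locally; apply then creates
  # the Secret or updates it in place if it already exists.
  kubectl -n unrip create secret generic unrip-secrets \
    --from-literal=NEAR_INTENTS_API_KEY="$NEAR_INTENTS_API_KEY" \
    --dry-run=client -o yaml |
    kubectl -n unrip apply -f -
}

# Example (assumes near-intents-ingest is the workload reading the key):
#   source scripts/hetzner/bootstrap-secrets.env
#   recreate_unrip_secret
#   kubectl -n unrip rollout restart deployment/near-intents-ingest
```

Note that `--from-literal` places the value on the command line, where it can appear in the local process list. On a shared workstation, a `--from-env-file` input may be preferable.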
### D. Workloads are present but not healthy
Inspect by namespace:
```bash
kubectl -n unrip get pods
kubectl -n unrip describe pod <pod-name>
kubectl -n unrip logs deploy/dummy-executor --tail=100
kubectl -n forgejo logs deploy/forgejo --tail=100
kubectl -n forgejo logs deploy/forgejo-runner --tail=100
```
Useful rollout checks:
```bash
kubectl -n unrip rollout status deployment/near-intents-ingest --timeout=300s
kubectl -n unrip rollout status deployment/dummy-reactor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-executor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-consumer --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo-runner --timeout=300s
```
### E. Need to inspect Terraform outputs directly
```bash
cd infra/terraform/hetzner
terraform output
terraform output server_ipv4
terraform output server_private_ipv4
terraform output k3s_api_url
terraform output kubeconfig_strategy
```
## Self-hosted CI handoff
After the cluster is reachable and workloads are up:
1. reach Forgejo at the configured domain or by port-forward
2. perform the initial admin/bootstrap steps in Forgejo
3. create the target repository in Forgejo
4. push or mirror this repo into that Forgejo instance
5. confirm the runner is registered and healthy
6. move routine application deploys to the self-hosted pipeline, which now derives image naming and rollout targets from Forgejo repository variables instead of hard-coding the legacy project

Known caveats in the current repo state:
- first bootstrap is repo-driven from the workstation
- the bootstrap path no longer relies on SSH/scp transport in control flow
- the kubeconfig/auth result is not yet fully production-hardened
- the first rollout still uses a temporary local registry bridge; routine CI deploys are intended to be registry-native
- the Forgejo workflow now defaults to `unrip` while allowing per-repo overrides for image name, namespace, and deployment list
- Forgejo admin creation, repo creation, and Actions configuration still require operator action after cluster bring-up
- DNS automation is currently wired for Cloudflare when credentials are supplied during bootstrap
- TLS is expected to come from cert-manager + Let's Encrypt once ingress hostnames resolve publicly
## Terraform-only usage
If you only want the infra layer:
```bash
cd infra/terraform/hetzner
export TF_VAR_hcloud_token="<your-hetzner-token>"
export TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)"
export TF_VAR_admin_cidr_blocks='["203.0.113.10/32"]'
terraform init
terraform apply
```
Useful outputs:
- `server_ipv4`
- `server_private_ipv4`
- `server_name`
- `server_fqdn`
- `k3s_api_url`
- `kubeconfig_strategy`

For CI/CD details, also see:
- `docs/hetzner-k3s-bootstrap.md`
- `docs/hetzner-self-hosted-ci-runbook.md`
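
In the Terraform-only commands above, `TF_VAR_admin_cidr_blocks` is written as a JSON list because Terraform parses list-typed `TF_VAR_` values as expressions rather than plain strings. If the CIDRs are kept as a comma-separated string, a small converter avoids hand-quoting. This is a sketch: `cidrs_to_json` is a hypothetical helper, and the comma-separated input format is an assumption for illustration.

```bash
# Hypothetical helper (not from the repo): turn "a,b,c" into ["a","b","c"].
cidrs_to_json() {
  local items item out=""
  # Split the first argument on commas into an array.
  IFS=',' read -ra items <<< "$1"
  for item in "${items[@]}"; do
    # Prefix a comma for every element after the first.
    out+="${out:+,}\"${item}\""
  done
  printf '[%s]\n' "$out"
}

# Example:
#   export TF_VAR_admin_cidr_blocks="$(cidrs_to_json '203.0.113.10/32,198.51.100.0/24')"
```

The helper does no escaping or validation, so it is suitable only for plain CIDR strings.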
## Compose status
Compose is still useful for:
- local development
- fast topology debugging
- non-production single-machine testing

Treat it as optional development runtime support, not as the primary production deployment path.