philipp/doran


Philipp 2a32461e39 feat: bootstrap hetzner k3s deployment

2026-03-28 20:53:29 +01:00


Hetzner single-node bootstrap (Terraform + cloud-init + k3s)

This is the canonical first-production deployment path for the repo.

A local operator workstation drives the first deployment end to end:

Terraform provisions Hetzner infrastructure
cloud-init installs k3s automatically on first boot
the workstation waits for the public Kubernetes API
the workstation creates initial Kubernetes Secrets
the workstation applies repo-managed Kubernetes manifests
the workstation performs the first image/bootstrap delivery attempt
once Forgejo + runner are alive, routine app deploys are intended to move to self-hosted CI

Compose remains available for local development, but it is not the primary production deployment model.

Scope of this layer

The foundation under infra/terraform/hetzner provisions:

one Hetzner Cloud server
one SSH key resource based on your local public key
firewall rules for SSH, Kubernetes API, and HTTP/HTTPS ingress
a private network attachment for future growth
cloud-init user-data for unattended k3s installation and host preparation

The repo bootstrap then applies the Hetzner single-node overlay under deploy/k8s/overlays/hetzner-single-node, which composes Kubernetes resources under deploy/k8s/ for:

shared platform namespaces and services
Redpanda
unrip workloads
Forgejo
Forgejo runner
private registry
ingress/TLS-related resources
Redpanda topic bootstrap job

Prerequisites

Install on the operator workstation:

Terraform >= 1.6
kubectl
docker
curl
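
A quick pre-flight check for the tools above could look like the following. This is a minimal sketch, not what scripts/hetzner/bootstrap.sh actually does; `missing_tools` is a hypothetical helper name, and it only checks presence on PATH, not the Terraform version.

```shell
# Print each required CLI tool that is not on PATH, one per line.
# Returns non-zero if at least one tool is missing.
# (missing_tools is an illustrative helper, not part of the repo.)
missing_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the operator workstation:
# missing_tools terraform kubectl docker curl || echo "install the tools listed above" >&2
```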

You also need:

a Hetzner Cloud API token
an SSH keypair already present locally
access to DNS for your chosen domains
admin CIDRs that can reach the future server on 22/tcp and 6443/tcp
this repo checked out locally

Required bootstrap secrets and inputs

Prepare the operator env file:

cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env

Set at least:

HCLOUD_TOKEN
SSH_PUBLIC_KEY_PATH
TF_ADMIN_CIDR_BLOCKS
BASE_DOMAIN
FORGEJO_DOMAIN
FORGEJO_ROOT_URL
NEAR_INTENTS_API_KEY
FORGEJO_RUNNER_REGISTRATION_TOKEN

Load it into the current shell:

source scripts/hetzner/bootstrap-secrets.env
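
After sourcing the file, it can help to confirm that none of the required values are empty before starting a long run. The sketch below assumes the variable names listed above; `missing_env` is a hypothetical helper and is not taken from the repo.

```shell
# Print the name of every listed variable that is unset or empty.
# Returns non-zero if any are missing.
# Only pass trusted, literal variable names: the lookup uses eval.
missing_env() {
  rc=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Usage in the same shell that sourced bootstrap-secrets.env:
# missing_env HCLOUD_TOKEN SSH_PUBLIC_KEY_PATH TF_ADMIN_CIDR_BLOCKS BASE_DOMAIN \
#   FORGEJO_DOMAIN FORGEJO_ROOT_URL NEAR_INTENTS_API_KEY FORGEJO_RUNNER_REGISTRATION_TOKEN
```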

Canonical bootstrap sequence

Run from repo root:

bash scripts/hetzner/bootstrap.sh

Current behavior of the script:

validates local tooling
runs terraform init and terraform apply in infra/terraform/hetzner
reads Terraform outputs such as server IP and k3s_api_url
waits for the k3s API readiness endpoint
writes a local workstation kubeconfig to .state/hetzner/kubeconfig.yaml
writes overlay secret env input files and creates these Secrets (namespace/name):
- unrip/unrip-secrets
- unrip/unrip-registry-creds
- forgejo/forgejo-secrets
- registry/registry-secrets
applies deploy/k8s/platform/base/namespace.yaml and deploy/k8s/overlays/hetzner-single-node
builds the repo bootstrap image locally
pushes it through the temporary local registry bridge using the active project name
updates the deployments and waits for their rollout status in the active project namespace
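
The "wait for the k3s API" step is essentially a bounded retry loop. The following is a generic sketch of that idea, not the script's actual code; `retry_until` and `K3S_API_URL` are hypothetical names, and the probe path in the usage comment is an assumption.

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
# usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical usage: K3S_API_URL would hold the k3s_api_url Terraform output.
# /readyz is the standard Kubernetes API server readiness path; the real script
# may probe a different path or need to send credentials.
# retry_until 60 5 curl --silent --fail --insecure --output /dev/null "$K3S_API_URL/readyz"
```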

After the script finishes:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n unrip get deploy,pods,jobs
kubectl -n forgejo get deploy,pods,svc
kubectl -n registry get pods,svc

Current manifest target

Important current-state details:

scripts/hetzner/bootstrap.sh now applies deploy/k8s/platform/base/namespace.yaml
it then applies deploy/k8s/overlays/hetzner-single-node
bootstrap naming no longer assumes legacy trading-system kubeconfig contexts, image tags, or rollout namespaces

Executor persistence in k3s

The dummy executor persists durable idempotency state.

Current persistence model:

application path: EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state
cloud-init prepares the host boundary for executor storage on first boot
Kubernetes mounts storage at that same path for the executor workload
the Hetzner single-node overlay pins PVC-backed storage to k3s local-path

Operational consequence:

executor duplicate-suppression state lives on node-backed persistent storage
replacing the node or deleting the PVC without migration loses that history
treat executor state as required operational data, even though the executor is still a dummy implementation
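
Before replacing the node or deleting the PVC, the state directory can be archived first. This is a minimal sketch under assumptions: `backup_state_dir` is a hypothetical helper, and it has to run somewhere the directory contents are actually readable (for example inside the executor pod). Where local-path stores the data on the host is not covered here.

```shell
# Archive a state directory into a timestamped tarball.
# usage: backup_state_dir <state-dir> <backup-dir>
# Prints the path of the archive on success.
# <backup-dir> should not live inside <state-dir>.
backup_state_dir() {
  src=$1
  dest=$2
  [ -d "$src" ] || { echo "state dir not found: $src" >&2; return 1; }
  mkdir -p "$dest" || return 1
  archive="$dest/executor-state-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  tar -czf "$archive" -C "$src" . || return 1
  printf '%s\n' "$archive"
}

# Hypothetical usage, with the application path from above:
# backup_state_dir /var/lib/unrip/executor-state /tmp/executor-backups
```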

Failure recovery runbook

A. Bootstrap fails before infrastructure exists

Typical causes:

invalid HCLOUD_TOKEN
wrong SSH_PUBLIC_KEY_PATH
malformed TF_ADMIN_CIDR_BLOCKS

Fix the input and rerun:

source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/bootstrap.sh
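
A malformed CIDR is cheap to catch locally before Terraform runs. The sketch below validates a single IPv4 CIDR entry; `is_ipv4_cidr` is a hypothetical helper. The exact format of TF_ADMIN_CIDR_BLOCKS is defined by the repo's env example file, so split it into individual entries first, in whatever way that format requires.

```shell
# Return 0 if the argument looks like a valid IPv4 CIDR (a.b.c.d/n), else 1.
# The body runs in a subshell so the IFS change does not leak to the caller.
is_ipv4_cidr() (
  cidr=$1
  case "$cidr" in
    */*) ;;
    *) return 1 ;;
  esac
  addr=${cidr%/*}
  prefix=${cidr##*/}
  # Prefix length: one or two digits, at most 32.
  case "$prefix" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "${#prefix}" -le 2 ] && [ "$prefix" -le 32 ] || return 1
  # Address: only digits and dots, no empty octets.
  case "$addr" in
    *[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  IFS=.
  set -- $addr
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  return 0
)

# Usage:
# is_ipv4_cidr "203.0.113.10/32" || echo "bad admin CIDR" >&2
```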

If you need to destroy partially created infrastructure:

source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh

B. Terraform succeeds but cluster access is not usable

Verify the generated kubeconfig and cluster health:

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50

What to suspect first:

cloud-init still running
k3s still starting
bootstrap kubeconfig/auth not fully aligned yet
public API reachable, but workloads not yet healthy

C. Secrets were wrong or missing

The current bootstrap depends on:

${PROJECT_NAME:-unrip}-secrets
- NEAR_INTENTS_API_KEY
forgejo-secrets
- root_url
- domain
- runner_registration_token

Verify:

kubectl -n unrip get secret unrip-secrets
kubectl -n unrip get secret unrip-registry-creds
kubectl -n forgejo get secret forgejo-secrets
kubectl -n registry get secret registry-secrets

If needed, recreate them from the workstation before restarting the affected deployments.
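
One way to recreate a Secret idempotently is to render it with a client-side dry run and pipe the result into `kubectl apply`. The sketch below does this for forgejo-secrets. The mapping from env variables to the keys root_url, domain, and runner_registration_token is an assumption based on the names above, and `render_forgejo_secret` and the `KUBECTL` override are hypothetical. Check bootstrap.sh for the real mapping before relying on it.

```shell
# Render a forgejo-secrets manifest from the operator env values.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
# Note: values passed via --from-literal are visible in the local process list.
render_forgejo_secret() {
  "${KUBECTL:-kubectl}" -n forgejo create secret generic forgejo-secrets \
    --from-literal=root_url="$FORGEJO_ROOT_URL" \
    --from-literal=domain="$FORGEJO_DOMAIN" \
    --from-literal=runner_registration_token="$FORGEJO_RUNNER_REGISTRATION_TOKEN" \
    --dry-run=client -o yaml
}

# Apply the rendered manifest, then restart the affected deployments:
# render_forgejo_secret | kubectl apply -f -
# kubectl -n forgejo rollout restart deployment/forgejo
# kubectl -n forgejo rollout restart deployment/forgejo-runner
```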

D. Workloads are present but not healthy

Inspect by namespace:

kubectl -n unrip get pods
kubectl -n unrip describe pod <pod-name>
kubectl -n unrip logs deploy/dummy-executor --tail=100
kubectl -n forgejo logs deploy/forgejo --tail=100
kubectl -n forgejo logs deploy/forgejo-runner --tail=100

Useful rollout checks:

kubectl -n unrip rollout status deployment/near-intents-ingest --timeout=300s
kubectl -n unrip rollout status deployment/dummy-reactor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-executor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-consumer --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo-runner --timeout=300s
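
The same checks can be wrapped in a small loop that keeps going after a failure and reports which deployments did not finish. This is a convenience sketch; `rollout_all`, `KUBECTL`, and `ROLLOUT_TIMEOUT` are hypothetical names, not part of the repo.

```shell
# Check rollout status for several deployments in one namespace.
# usage: rollout_all <namespace> <deployment> [deployment...]
# Continues past failures and returns non-zero if any rollout did not complete.
rollout_all() {
  ns=$1
  shift
  failed=0
  for name in "$@"; do
    if ! "${KUBECTL:-kubectl}" -n "$ns" rollout status "deployment/$name" \
        --timeout="${ROLLOUT_TIMEOUT:-300s}"; then
      echo "rollout not complete: $ns/$name" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage, with the deployment names from the commands above:
# rollout_all unrip near-intents-ingest dummy-reactor dummy-executor dummy-consumer
# rollout_all forgejo forgejo forgejo-runner
```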

E. Need to inspect Terraform outputs directly

cd infra/terraform/hetzner
terraform output
terraform output server_ipv4
terraform output server_private_ipv4
terraform output k3s_api_url
terraform output kubeconfig_strategy
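
To use an output in another command, `terraform output -raw` prints the bare value without quotes. The wrapper below is a small sketch that can be run from the repo root; `tf_output`, `TERRAFORM`, and `TF_DIR` are hypothetical names. `-raw` only works for string, number, and boolean outputs.

```shell
# Print a single Terraform output value without quoting.
# usage: tf_output <output-name>
# TF_DIR defaults to the Hetzner layer; TERRAFORM can be overridden for previews.
tf_output() {
  "${TERRAFORM:-terraform}" -chdir="${TF_DIR:-infra/terraform/hetzner}" output -raw "$1"
}

# Usage from repo root:
# K3S_API_URL=$(tf_output k3s_api_url)
# SERVER_IP=$(tf_output server_ipv4)
```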

Self-hosted CI handoff

After the cluster is reachable and workloads are up:

reach Forgejo at the configured domain or by port-forward
perform the initial admin/bootstrap steps in Forgejo
create the target repository in Forgejo
push or mirror this repo into that Forgejo instance
confirm the runner is registered and healthy
move routine application deploys to the self-hosted pipeline, which now derives image naming and rollout targets from Forgejo repository variables instead of hard-coding the legacy project

Known caveats in the current repo state:

first bootstrap is repo-driven from the workstation
the bootstrap path no longer relies on SSH/scp transport in control flow
the kubeconfig/auth result is not yet fully production-hardened
the first rollout still uses a temporary local registry bridge; routine CI deploys are intended to be registry-native
the Forgejo workflow now defaults to unrip while allowing per-repo overrides for image name, namespace, and deployment list
Forgejo admin creation, repo creation, and Actions configuration still require operator action after cluster bring-up
DNS automation is currently wired for Cloudflare when credentials are supplied during bootstrap
TLS is expected to come from cert-manager + Let's Encrypt once ingress hostnames resolve publicly

Terraform-only usage

If you only want the infra layer:

cd infra/terraform/hetzner
export TF_VAR_hcloud_token="<your-hetzner-token>"
export TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)"
export TF_VAR_admin_cidr_blocks='["203.0.113.10/32"]'

terraform init
terraform apply
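
TF_VAR_admin_cidr_blocks has to be a list expression, which is easy to get wrong when quoting by hand. A small helper can build it from plain arguments. This is a sketch; `cidr_json_list` is a hypothetical name, and it does not validate or escape its inputs.

```shell
# Build a JSON list of strings from the arguments.
# usage: cidr_json_list <cidr> [cidr...]
cidr_json_list() {
  out='['
  sep=''
  for cidr in "$@"; do
    out="$out$sep\"$cidr\""
    sep=','
  done
  printf '%s]\n' "$out"
}

# Usage:
# export TF_VAR_admin_cidr_blocks="$(cidr_json_list 203.0.113.10/32 198.51.100.0/24)"
```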

Useful outputs:

server_ipv4
server_private_ipv4
server_name
server_fqdn
k3s_api_url
kubeconfig_strategy

For CI/CD details, also see:

docs/hetzner-k3s-bootstrap.md
docs/hetzner-self-hosted-ci-runbook.md

Compose status

Compose is still useful for:

local development
fast topology debugging
non-production single-machine testing

But it should be treated as optional/dev runtime support, not as the primary production deployment path.
