# Hetzner single-node bootstrap (Terraform + cloud-init + k3s)

This is the canonical first-production deployment path for the repo.

A local operator workstation drives the first deployment end to end:
- Terraform provisions Hetzner infrastructure
- cloud-init installs k3s automatically on first boot
- the workstation waits for the public Kubernetes API
- the workstation creates initial Kubernetes Secrets
- the workstation applies repo-managed Kubernetes manifests
- the workstation performs the first image/bootstrap delivery attempt
- once Forgejo + runner are alive, routine app deploys are intended to move to self-hosted CI

Compose remains available for local development, but it is not the primary production deployment model.

## Scope of this layer

The foundation under `infra/terraform/hetzner` provisions:
- one Hetzner Cloud server
- one SSH key resource based on your local public key
- firewall rules for SSH, Kubernetes API, and HTTP/HTTPS ingress
- a private network attachment for future growth
- cloud-init user-data for unattended k3s installation and host preparation

The repo bootstrap then applies the Hetzner single-node overlay under `deploy/k8s/overlays/hetzner-single-node`, which composes Kubernetes resources under `deploy/k8s/` for:
- shared platform namespaces and services
- Redpanda
- unrip workloads
- Forgejo
- Forgejo runner
- private registry
- ingress/TLS-related resources
- Redpanda topic bootstrap job

## Prerequisites

Install on the operator workstation:
- Terraform `>= 1.6`
- `kubectl`
- `docker`
- `curl`

You also need:
- a Hetzner Cloud API token
- an SSH keypair already present locally
- access to DNS for your chosen domains
- admin CIDRs that can reach the future server on `22/tcp` and `6443/tcp`
- this repo checked out locally

## Required bootstrap secrets and inputs

Prepare the operator env file:

```bash
cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env
```

Set at least:
- `HCLOUD_TOKEN`
- `SSH_PUBLIC_KEY_PATH`
- `TF_ADMIN_CIDR_BLOCKS`
- `BASE_DOMAIN`
- `FORGEJO_DOMAIN`
- `FORGEJO_ROOT_URL`
- `NEAR_INTENTS_API_KEY`
- `FORGEJO_RUNNER_REGISTRATION_TOKEN`

Load it into the current shell:

```bash
source scripts/hetzner/bootstrap-secrets.env
```

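Before running the bootstrap, it can help to confirm that the required variables are actually set in the current shell. The sketch below is one way to do that. `require_vars` is a hypothetical helper and is not part of the repo scripts; the variable names in the example call simply mirror the list above.

```bash
# Hypothetical helper (not part of the repo): returns non-zero if any
# named variable is unset or empty, and prints each missing name to stderr.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval keeps this portable to plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, using the variable names listed above:
# require_vars HCLOUD_TOKEN SSH_PUBLIC_KEY_PATH TF_ADMIN_CIDR_BLOCKS \
#   BASE_DOMAIN FORGEJO_DOMAIN FORGEJO_ROOT_URL \
#   NEAR_INTENTS_API_KEY FORGEJO_RUNNER_REGISTRATION_TOKEN
```

The helper only checks for presence. It does not validate the values themselves.
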
## Canonical bootstrap sequence

Run from repo root:

```bash
bash scripts/hetzner/bootstrap.sh
```

Current behavior of the script:
1. validates local tooling
2. runs `terraform init` and `terraform apply` in `infra/terraform/hetzner`
3. reads Terraform outputs such as server IP and `k3s_api_url`
4. waits for the k3s API readiness endpoint
5. writes a local workstation kubeconfig to `.state/hetzner/kubeconfig.yaml`
6. writes overlay secret env input files and creates:
   - `unrip/unrip-secrets`
   - `unrip/unrip-registry-creds`
   - `forgejo/forgejo-secrets`
   - `registry/registry-secrets`
7. applies `deploy/k8s/platform/base/namespace.yaml` and `deploy/k8s/overlays/hetzner-single-node`
8. builds the repo bootstrap image locally
9. pushes it through the temporary local registry bridge using the active project name
10. updates the workloads and waits for their rollout status in the active project namespace

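The wait in step 4 is essentially a bounded retry loop. The sketch below shows that pattern in isolation. It is not the code from `scripts/hetzner/bootstrap.sh`: `wait_until` is a hypothetical helper, and the `/readyz` path, the `K3S_API_URL` variable, and the attempt counts in the example call are assumptions.

```bash
# Hypothetical helper: run a command until it succeeds, giving up after a
# fixed number of attempts. Usage: wait_until <attempts> <delay-seconds> <cmd...>
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Success on any attempt ends the wait immediately.
    "$@" && return 0
    # Stop without sleeping once the last attempt has failed.
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example call (assumed endpoint and variable; k3s serves a self-signed
# certificate by default, hence --insecure for a pure reachability probe):
# wait_until 60 5 curl --silent --fail --insecure --output /dev/null \
#   "${K3S_API_URL}/readyz"
```

Depending on how the API server is configured, unauthenticated requests to the readiness endpoint may be rejected. In that case the probe command would need credentials, for example a `kubectl` call that uses the generated kubeconfig.
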
After the script finishes:

```bash
export KUBECONFIG="$PWD/.state/hetzner/kubeconfig.yaml"
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n unrip get deploy,pods,jobs
kubectl -n forgejo get deploy,pods,svc
kubectl -n registry get pods,svc
```

## Current manifest target

Important current-state detail:
- `scripts/hetzner/bootstrap.sh` now applies `deploy/k8s/platform/base/namespace.yaml`
- it then applies `deploy/k8s/overlays/hetzner-single-node`
- bootstrap naming no longer assumes legacy `trading-system` kubeconfig contexts, image tags, or rollout namespaces

## Executor persistence in k3s

The dummy executor persists durable idempotency state.

Current persistence model:
- application path: `EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state`
- cloud-init prepares the host boundary for executor storage on first boot
- Kubernetes mounts storage at that same path for the executor workload
- the Hetzner single-node overlay pins PVC-backed storage to k3s `local-path`

Operational consequence:
- executor duplicate-suppression state lives on node-backed persistent storage
- replacing the node or deleting the PVC without migration loses that history
- treat executor state as required operational data, even though the executor is still a dummy implementation

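Because that state is lost with the node or the PVC, taking a copy before disruptive work is a reasonable precaution. The sketch below is one possible approach and not a procedure the repo provides. `backup_executor_state` is a hypothetical helper, and it assumes the executor container image ships `tar`, which may not hold.

```bash
# Hypothetical helper: stream the executor state directory out of the running
# pod into a local tarball. Assumes `tar` exists in the container image.
backup_executor_state() {
  # Default file name carries a timestamp; pass a path to override it.
  dest=${1:-executor-state-$(date +%Y%m%d-%H%M%S).tgz}
  kubectl -n unrip exec deploy/dummy-executor -- \
    tar -C /var/lib/unrip/executor-state -czf - . > "$dest"
}

# Example call:
# backup_executor_state ./executor-state-before-migration.tgz
```

Copying files while the executor is writing can capture an inconsistent snapshot, so it is safer to take the backup while the workload is idle.
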
## Failure recovery runbook

### A. Bootstrap fails before infrastructure exists
Typical causes:
- invalid `HCLOUD_TOKEN`
- wrong `SSH_PUBLIC_KEY_PATH`
- malformed `TF_ADMIN_CIDR_BLOCKS`

Fix the input and rerun:

```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/bootstrap.sh
```

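Two of the causes above can be checked locally before rerunning. The sketch below validates the shape of a single IPv4 CIDR and shows a readability check for the public key file. `is_ipv4_cidr` is a hypothetical helper, and how `TF_ADMIN_CIDR_BLOCKS` packs multiple CIDRs is not assumed here, so the function takes one CIDR at a time.

```bash
# Hypothetical helper: succeed if the argument looks like an IPv4 CIDR.
# This checks shape and prefix length (0-32) only; it does not verify
# that each octet is <= 255.
is_ipv4_cidr() {
  printf '%s\n' "$1" |
    grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/([0-9]|[12][0-9]|3[0-2])$'
}

# Example checks:
# is_ipv4_cidr "203.0.113.10/32" || echo "bad CIDR" >&2
# [ -r "$SSH_PUBLIC_KEY_PATH" ] || echo "public key not readable" >&2
```
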
If you need to destroy partially created infrastructure:

```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh
```

### B. Terraform succeeds but cluster access is not usable
Verify the generated kubeconfig and cluster health:

```bash
export KUBECONFIG="$PWD/.state/hetzner/kubeconfig.yaml"
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```

What to suspect first:
- cloud-init still running
- k3s still starting
- bootstrap kubeconfig/auth not fully aligned yet
- public API reachable, but workloads not yet healthy

### C. Secrets were wrong or missing
The current bootstrap depends on:
- `${PROJECT_NAME:-unrip}-secrets`
  - `NEAR_INTENTS_API_KEY`
- `forgejo-secrets`
  - `root_url`
  - `domain`
  - `runner_registration_token`

Verify:

```bash
kubectl -n unrip get secret unrip-secrets
kubectl -n unrip get secret unrip-registry-creds
kubectl -n forgejo get secret forgejo-secrets
kubectl -n registry get secret registry-secrets
```

If needed, recreate them from the workstation before restarting the affected deployments.

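Recreating a secret can be done idempotently by rendering it client-side and piping the result to `kubectl apply`. The sketch below does this for `forgejo-secrets`. `recreate_forgejo_secret` is a hypothetical helper, and the mapping from the `FORGEJO_*` environment variables to the secret keys is an assumption; check it against what `scripts/hetzner/bootstrap.sh` actually writes.

```bash
# Hypothetical helper: create or update forgejo-secrets from the loaded env.
# The env-var-to-key mapping below is an assumption, not taken from the repo.
recreate_forgejo_secret() {
  kubectl -n forgejo create secret generic forgejo-secrets \
    --from-literal=root_url="$FORGEJO_ROOT_URL" \
    --from-literal=domain="$FORGEJO_DOMAIN" \
    --from-literal=runner_registration_token="$FORGEJO_RUNNER_REGISTRATION_TOKEN" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Example follow-up so running pods pick up the new values:
# recreate_forgejo_secret
# kubectl -n forgejo rollout restart deployment/forgejo
# kubectl -n forgejo rollout restart deployment/forgejo-runner
```

Using `--dry-run=client -o yaml` with `apply` avoids the "already exists" error that a plain `kubectl create secret` would raise on a rerun.
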
### D. Workloads are present but not healthy
Inspect by namespace:

```bash
kubectl -n unrip get pods
kubectl -n unrip describe pod <pod-name>
kubectl -n unrip logs deploy/dummy-executor --tail=100
kubectl -n forgejo logs deploy/forgejo --tail=100
kubectl -n forgejo logs deploy/forgejo-runner --tail=100
```

Useful rollout checks:

```bash
kubectl -n unrip rollout status deployment/near-intents-ingest --timeout=300s
kubectl -n unrip rollout status deployment/dummy-reactor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-executor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-consumer --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo --timeout=300s
kubectl -n forgejo rollout status deployment/forgejo-runner --timeout=300s
```

### E. Need to inspect Terraform outputs directly

```bash
cd infra/terraform/hetzner
terraform output
terraform output server_ipv4
terraform output server_private_ipv4
terraform output k3s_api_url
terraform output kubeconfig_strategy
```

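When an output value feeds another command, the `-raw` flag prints it without quotes, and `-chdir` lets the call run from the repo root. The sketch below wraps both. `tf_output` is a hypothetical helper name, and `-raw` only works for string, number, and boolean outputs.

```bash
# Hypothetical helper: print one Terraform output as a plain, unquoted value.
# Run from the repo root; -chdir points Terraform at the Hetzner layer.
tf_output() {
  terraform -chdir=infra/terraform/hetzner output -raw "$1"
}

# Example calls:
# server_ip=$(tf_output server_ipv4)
# api_url=$(tf_output k3s_api_url)
```
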
## Self-hosted CI handoff

After the cluster is reachable and workloads are up:
1. reach Forgejo at the configured domain or by port-forward
2. perform the initial admin/bootstrap steps in Forgejo
3. create the target repository in Forgejo
4. push or mirror this repo into that Forgejo instance
5. confirm the runner is registered and healthy
6. move routine application deploys to the self-hosted pipeline, which now derives image naming and rollout targets from Forgejo repository variables instead of hard-coding the legacy project

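For step 1, a port-forward is useful while DNS or TLS is not yet in place. The sketch below is a guess at the shape of that command. `forward_forgejo` is a hypothetical helper, and both the Service name `forgejo` and the target port `3000` (Forgejo's default HTTP port) are assumptions; confirm them with `kubectl -n forgejo get svc`.

```bash
# Hypothetical helper: forward a local port to the Forgejo Service.
# Service name and port 3000 are assumptions; verify with `get svc` first.
forward_forgejo() {
  local_port=${1:-3000}
  kubectl -n forgejo port-forward svc/forgejo "${local_port}:3000"
}

# Example call, then open http://localhost:8080 in a browser:
# forward_forgejo 8080
```

The command blocks until interrupted, so run it in a separate terminal.
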
Current repo-state caveats already known:
- first bootstrap is repo-driven from the workstation
- the bootstrap path no longer relies on SSH/scp transport in control flow
- the kubeconfig/auth result is not yet fully production-hardened
- first rollout still uses a temporary local registry bridge; routine CI deploys are intended to be registry-native. The Forgejo workflow now defaults to `unrip` while allowing per-repo overrides for image name, namespace, and deployment list
- Forgejo admin creation, repo creation, and Actions configuration still require operator action after cluster bring-up
- DNS automation is currently wired for Cloudflare when credentials are supplied during bootstrap
- TLS is expected to come from cert-manager + Let's Encrypt once ingress hostnames resolve publicly

## Terraform-only usage

If you only want the infra layer:

```bash
cd infra/terraform/hetzner
export TF_VAR_hcloud_token="<your-hetzner-token>"
export TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)"
export TF_VAR_admin_cidr_blocks='["203.0.113.10/32"]'

terraform init
terraform apply
```

Useful outputs:
- `server_ipv4`
- `server_private_ipv4`
- `server_name`
- `server_fqdn`
- `k3s_api_url`
- `kubeconfig_strategy`

For CI/CD details, also see:
- `docs/hetzner-k3s-bootstrap.md`
- `docs/hetzner-self-hosted-ci-runbook.md`

## Compose status

Compose is still useful for:
- local development
- fast topology debugging
- non-production single-machine testing

Treat it as optional dev runtime support, not as the primary production deployment path.