philipp/doran

Philipp 61b973cccb feat: add cluster log aggregation with grafana

2026-03-29 00:38:24 +01:00

Hetzner + k3s + self-hosted Git/CI bootstrap

Goal: provision and deploy everything from this repo to a single Hetzner machine with no manual server login.

Stack

Terraform provisions the Hetzner Cloud VM, private network, and firewall
cloud-init installs Tailscale first when configured, then installs k3s automatically
cloud-init leaves only a bootstrap marker on the node; it does not clone this repo or apply Kubernetes assets
Kubernetes manifests deploy:
- Redpanda
- trading system services
- private registry
- Forgejo
- Loki + Promtail + Grafana observability
- k3s-bundled Traefik ingress resources
- cert-manager
- ACME issuers
local bootstrap script:
- runs Terraform
- optionally creates DNS records via Cloudflare or Porkbun
- fetches the real kubeconfig from the node
- writes overlay secrets/host patches from local env
- renders .state/hetzner/generated-overlay/ from the checked-in Hetzner overlay template plus deploy/k8s/platform/base/kustomization.yaml
- applies that generated overlay from the operator workstation checkout
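- builds the current app image locally (legacy BOOTSTRAP_DELIVERY_MODE=local-image-import only)
- imports the bootstrap image into k3s for the first rollout (same legacy mode; the default forgejo-actions mode builds and deploys from Forgejo Actions instead)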

Files

infra/terraform/hetzner/
deploy/k8s/base/
deploy/k8s/overlays/hetzner-single-node/
scripts/hetzner/bootstrap.sh
scripts/hetzner/configure-cloudflare-dns.sh
scripts/hetzner/destroy.sh
scripts/k8s/logs.sh
.forgejo/workflows/deploy.yml

Required local tools

Always required:

terraform
kubectl
curl
python3
ssh
git
base64
realpath

Conditionally required:

pass only when using any *_PASS mapping
docker only for BOOTSTRAP_DELIVERY_MODE=local-image-import, or as a fallback when no native htpasswd binary is available locally
htpasswd is preferred for registry secret generation and avoids the docker fallback

Required local Python modules:

PyYAML (python3 -m pip install PyYAML) for kubeconfig rendering during bootstrap
PyNaCl (python3 -m pip install PyNaCl) only when BOOTSTRAP_DELIVERY_MODE=forgejo-actions so bootstrap can encrypt Forgejo Actions secrets
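
A preflight check can catch a missing tool before Terraform starts. The following is a minimal sketch, not the check that bootstrap.sh itself performs; the function name missing_tools is hypothetical.

```shell
# Hypothetical helper: print every named command that is not on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `missing_tools terraform kubectl curl python3 ssh git base64 realpath || exit 1` would list the absent tools and stop. Add pass, docker, or htpasswd to the argument list according to the conditions above.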

Required local env

Start from:

cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
${EDITOR:-vi} scripts/hetzner/bootstrap-secrets.env
source scripts/hetzner/bootstrap-secrets.env

The mapping file should contain non-secret config plus pass entry references for secrets. Bootstrap and destroy load the first line from each configured pass entry without echoing it. Explicit env exports still override pass lookups.
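
The lookup order described above (explicit export first, then the first line of the pass entry) could look roughly like the sketch below. It illustrates the documented behavior and is not the code in bootstrap.sh; the function name resolve_secret is hypothetical.

```shell
# Hypothetical helper: resolve NAME from the environment, otherwise from the
# pass entry named by NAME_PASS. The value goes to stdout so the caller can
# capture it without echoing it to the terminal.
resolve_secret() {
  name=$1
  eval "value=\${$name:-}"
  eval "entry=\${${name}_PASS:-}"
  if [ -n "$value" ]; then
    # An explicit export overrides any pass mapping.
    printf '%s' "$value"
  elif [ -n "$entry" ]; then
    # Only the first line of the pass entry is used.
    pass show "$entry" | head -n 1 | tr -d '\n'
  else
    echo "missing $name or ${name}_PASS" >&2
    return 1
  fi
}
```

A caller would capture the result, for example `HCLOUD_TOKEN=$(resolve_secret HCLOUD_TOKEN) || exit 1`, and export it only for the child processes that need it.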

When you run scripts/hetzner/bootstrap.sh, it uses this file to materialize local Kubernetes inputs before apply:

overwrites deploy/k8s/overlays/hetzner-single-node/secrets/unrip.env with NEAR_INTENTS_API_KEY
overwrites deploy/k8s/overlays/hetzner-single-node/secrets/forgejo.env with Forgejo root_url and domain
overwrites deploy/k8s/overlays/hetzner-single-node/secrets/observability.env with Grafana bootstrap credentials and root URL
renders .state/hetzner/generated-overlay/ as the bootstrap-time source of truth
copies the checked-in overlay's patches into that generated overlay
imports platform resources from deploy/k8s/platform/base/kustomization.yaml, so newly added platform modules such as observability manifests are included automatically
creates registry-secrets in namespace registry from REGISTRY_USERNAME and REGISTRY_PASSWORD
creates the project docker-registry pull secret in PROJECT_NAMESPACE from the same registry credentials

This differs from running kubectl apply -k deploy/k8s/overlays/hetzner-single-node manually. A plain Kustomize apply consumes only the checked-in overlay files: it does not read scripts/hetzner/bootstrap-secrets.env and does not create the imperative registry auth secrets. Bootstrap instead applies the generated overlay copy.

Required values:

HCLOUD_TOKEN_PASS or HCLOUD_TOKEN
SSH_PUBLIC_KEY_PATH
PUBLIC_DOMAIN
BASE_DOMAIN
recommended Tailscale values:
- TAILSCALE_AUTH_KEY_PASS or TAILSCALE_AUTH_KEY
- optional TAILSCALE_CONTROL_PLANE_HOSTNAME to force a stable Tailscale DNS name for kube access
- if TAILSCALE_CONTROL_PLANE_HOSTNAME is left empty, bootstrap auto-discovers the node via local tailscale status --json
FORGEJO_DOMAIN
FORGEJO_ROOT_URL
REGISTRY_DOMAIN
GRAFANA_DOMAIN
GRAFANA_ROOT_URL
LETSENCRYPT_EMAIL
REGISTRY_USERNAME
REGISTRY_PASSWORD_PASS or REGISTRY_PASSWORD
NEAR_INTENTS_API_KEY_PASS or NEAR_INTENTS_API_KEY
FORGEJO_ADMIN_USERNAME
FORGEJO_ADMIN_EMAIL
FORGEJO_ADMIN_PASSWORD_PASS or FORGEJO_ADMIN_PASSWORD
GRAFANA_ADMIN_USERNAME (defaults to admin)
GRAFANA_ADMIN_PASSWORD_PASS or GRAFANA_ADMIN_PASSWORD
optional repo settings: FORGEJO_REPO_OWNER, FORGEJO_REPO_NAME, FORGEJO_REPO_PRIVATE

Optional for automatic DNS:

Cloudflare:
- CLOUDFLARE_API_TOKEN_PASS or CLOUDFLARE_API_TOKEN
- CLOUDFLARE_ZONE_ID_PASS or CLOUDFLARE_ZONE_ID
Porkbun:
- PORKBUN_API_KEY_PASS or PORKBUN_API_KEY
- PORKBUN_SECRET_API_KEY_PASS or PORKBUN_SECRET_API_KEY

Bootstrap

bash scripts/hetzner/bootstrap.sh

Outputs:

Hetzner VM created
Tailscale joined if configured
k3s installed
cloud-init writes /opt/unrip/bootstrap/README.txt as a marker that node-local repo bootstrap is not active yet
kubeconfig written to .state/hetzner/kubeconfig.yaml
CI kubeconfig written to .state/hetzner/kubeconfig.incluster.yaml
overlay secrets and ingress host patches rendered from local env / pass
.state/hetzner/generated-overlay/ rendered and applied as the canonical bootstrap manifest set for that run
namespaces, Redpanda, app deployments, Forgejo, registry, Traefik-targeted ingress resources, cert-manager, issuers, and any additional platform resources referenced by deploy/k8s/platform/base/kustomization.yaml applied
Forgejo admin account created automatically if missing
Forgejo runner registration is generated automatically from inside the Forgejo pod; the resulting /data/.runner config is stored on the shared forgejo-data persistent volume used by the runner deployment
Forgejo repository created automatically in either the admin user's namespace or a pre-existing organization named by FORGEJO_REPO_OWNER
Forgejo Actions secrets and variables configured automatically
repo pushed to Forgejo automatically in the default forgejo-actions delivery mode via authenticated HTTPS Git push
first deployment triggered from Forgejo Actions by default

Tailscale-first admin access

Recommended mode:

public firewall exposes only 80/443
admin access uses Tailscale
Kubernetes API uses the Tailscale hostname when TAILSCALE_CONTROL_PLANE_HOSTNAME is set

TF_ADMIN_CIDR_BLOCKS remains only as a fallback if you intentionally want public admin/API exposure.

DNS and TLS

If DNS provider credentials are present, bootstrap updates:

${PUBLIC_DOMAIN}
git.${PUBLIC_DOMAIN}
registry.${PUBLIC_DOMAIN}
grafana.${PUBLIC_DOMAIN}
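
To check the records after bootstrap, it helps to have the four names as a list. This is a minimal sketch that derives them from one domain argument; the function name scripted_dns_hosts is hypothetical and the subdomain prefixes are taken from the list above.

```shell
# Hypothetical helper: print the hostnames that bootstrap manages for a domain.
scripted_dns_hosts() {
  domain=$1
  if [ -z "$domain" ]; then
    echo "usage: scripted_dns_hosts DOMAIN" >&2
    return 1
  fi
  printf '%s\n' "$domain" "git.$domain" "registry.$domain" "grafana.$domain"
}
```

Calling `scripted_dns_hosts "$PUBLIC_DOMAIN"` yields one hostname per line, which can be fed to a resolver check of your choice.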

Supported scripted providers:

Cloudflare
Porkbun

TLS is handled in-cluster by cert-manager using Let's Encrypt issuers and the rendered ingress hosts.

Grafana is the default observability UI wired into the public hostname model. Keep Grafana authenticated.

The platform base assumes the default k3s Traefik ingress controller is present; it does not install ingress-nginx. For clean-cluster applies, the base kustomization now includes cert-manager before the ClusterIssuer resources so the issuer CRs can be created in the same bootstrap flow.

Observe the cluster

KUBECONFIG=.state/hetzner/kubeconfig.yaml kubectl get pods -A
bash scripts/k8s/logs.sh

For the web log UI and observability stack, see docs/k8s-observability.md.

Self-hosted CI/CD handoff

Default bootstrap now automates the Forgejo handoff:

create the Forgejo repo in the admin namespace or in a pre-existing organization named by FORGEJO_REPO_OWNER
configure the repository Actions secrets:
- KUBECONFIG_B64
- REGISTRY_USERNAME
- REGISTRY_PASSWORD
configure the repository Actions variables:
- REGISTRY_HOST=${REGISTRY_DOMAIN}
- PROJECT_NAME
- PROJECT_NAMESPACE
- PROJECT_DEPLOYMENTS
push the current repo to main
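
If the KUBECONFIG_B64 secret ever has to be regenerated by hand, the value is a base64 encoding of a kubeconfig file. The sketch below shows one way to produce a single-line value. It assumes the CI kubeconfig at .state/hetzner/kubeconfig.incluster.yaml is the file to encode, which this README does not state explicitly; the function name encode_kubeconfig is hypothetical.

```shell
# Hypothetical helper: base64-encode a kubeconfig file as one line.
# Newlines are stripped because some base64 implementations wrap output.
encode_kubeconfig() {
  base64 < "$1" | tr -d '\n'
}
```

For example, `encode_kubeconfig .state/hetzner/kubeconfig.incluster.yaml` prints the value to paste into the repository's Actions secrets. Decoding it with `base64 -d` should reproduce the original file.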

The workflow then:

starts a Kubernetes Job in the target namespace
checks out the repo inside that Job using the Forgejo job token via Authorization: Bearer ... HTTP auth
uses Kaniko plus the Kubernetes registry auth secret to build and push ${REGISTRY_DOMAIN}/${PROJECT_NAME}:${GIT_SHA}
updates the app deployments in PROJECT_NAMESPACE
waits for rollout
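
The last two workflow steps can be sketched as a dry run that prints the kubectl commands instead of executing them. This is an illustration, not the contents of .forgejo/workflows/deploy.yml. It assumes PROJECT_DEPLOYMENTS is a comma- or space-separated list of Deployment names and that every container in each Deployment should receive the new image; neither is specified here. The function names image_ref and plan_rollout are hypothetical.

```shell
# Hypothetical helper: build the image reference used by the workflow.
image_ref() {
  printf '%s/%s:%s' "$REGISTRY_DOMAIN" "$PROJECT_NAME" "$GIT_SHA"
}

# Hypothetical helper: print (do not run) the update and wait commands.
# Assumption: PROJECT_DEPLOYMENTS is comma- or space-separated.
plan_rollout() {
  image=$(image_ref)
  for deployment in $(printf '%s' "$PROJECT_DEPLOYMENTS" | tr ',' ' '); do
    # '*=IMAGE' sets the image on every container of the Deployment.
    echo "kubectl -n $PROJECT_NAMESPACE set image deployment/$deployment '*=$image'"
    echo "kubectl -n $PROJECT_NAMESPACE rollout status deployment/$deployment"
  done
}
```

Printing the plan first makes it easy to review which Deployments and which tag a run would touch before anything changes in the cluster.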

Legacy local-image bootstrap remains available with:

BOOTSTRAP_DELIVERY_MODE=local-image-import bash scripts/hetzner/bootstrap.sh

Destroy everything

Default destroy only removes Terraform-managed Hetzner infrastructure:

source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh

Opt-in flags make destructive cleanup of bootstrap-managed leftovers explicit:

source scripts/hetzner/bootstrap-secrets.env
DESTROY_DNS=true \
DESTROY_LOCAL_STATE=true \
DESTROY_FORGEJO_REPO=true \
bash scripts/hetzner/destroy.sh

destroy.sh reads HCLOUD_TOKEN, optional TAILSCALE_AUTH_KEY, optional DNS provider credentials, and optional Forgejo admin credentials via the same *_PASS mapping mechanism as bootstrap. It uses the same Terraform inputs as bootstrap for the infrastructure resources, then can optionally:

delete the scripted DNS records for ${BASE_DOMAIN}, git.${BASE_DOMAIN}, registry.${BASE_DOMAIN}, and grafana.${BASE_DOMAIN}
remove local bootstrap artifacts under .state/hetzner/, deploy/k8s/overlays/hetzner-single-node/generated/, and the local Terraform working/state files in infra/terraform/hetzner/
delete the bootstrap-managed Forgejo repository via the Forgejo API

Supported scripted DNS cleanup providers:

Cloudflare
Porkbun

Cleanup defaults are intentionally conservative:

DESTROY_DNS=false keeps provider records unless you explicitly opt in
DESTROY_LOCAL_STATE=false keeps the last kubeconfigs and generated manifests for inspection
DESTROY_FORGEJO_REPO=false keeps the remote Git repository unless you explicitly opt in
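
The opt-in behavior amounts to defaulting each flag to false. A minimal sketch of that pattern follows; it assumes only the literal value true enables a step, which may be stricter or looser than what destroy.sh accepts. The function name cleanup_plan is hypothetical.

```shell
# Hypothetical helper: show which optional cleanup steps the current
# environment would enable. Unset flags are treated as false.
cleanup_plan() {
  echo "terraform destroy: always"
  for flag in DESTROY_DNS DESTROY_LOCAL_STATE DESTROY_FORGEJO_REPO; do
    eval "value=\${$flag:-false}"
    if [ "$value" = "true" ]; then
      echo "$flag: enabled"
    else
      echo "$flag: skipped"
    fi
  done
}
```

Running `DESTROY_DNS=true cleanup_plan` before the real teardown gives a quick confirmation of what will and will not be removed.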

If any optional cleanup step is enabled but cannot run because credentials are missing, destroy.sh prints a skip message describing what was not removed. If DNS cleanup or Forgejo repo deletion fails after Terraform teardown, rerun the same cleanup flags or remove the remaining resources manually.

Current limitations

organization-owned repo bootstrap works only when FORGEJO_REPO_OWNER names a pre-existing organization that the configured admin can create repositories in; bootstrap does not create the organization itself
unattended repo seeding now uses an authenticated HTTPS remote built from the configured Forgejo admin credentials, so operators should replace that local remote with a token, SSH, or credential-helper-backed remote after bootstrap if they do not want credentials stored in .git/config
cloud-init no longer clones a bootstrap repository onto the node; Kubernetes asset delivery is still workstation-driven after Terraform
bootstrap_repo_path in Terraform is only a reserved marker for a future node-local bootstrap/GitOps flow
bootstrap requires either a local htpasswd binary or local docker as a fallback to generate the registry htpasswd secret
bootstrap and CI authentication paths should still be hardened before production use
runner identity is persisted under the shared forgejo-data PVC, so deleting the forgejo-runner pod is safe but deleting that PVC forces re-registration on the next bootstrap run
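
For the credentialed HTTPS remote mentioned above, one way to clean up after bootstrap is to rewrite the remote URL without the user:password part. The sketch below only transforms the URL string; the function name strip_url_credentials is hypothetical, and the name of the remote that bootstrap creates is not stated in this README.

```shell
# Hypothetical helper: drop "user:password@" from an http(s) URL.
# URLs without embedded credentials are returned unchanged.
strip_url_credentials() {
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1#'
}
```

A possible follow-up is `git remote set-url REMOTE "$(strip_url_credentials "$(git remote get-url REMOTE)")"`, with REMOTE replaced by the actual remote name, and then configuring a credential helper, token, or SSH remote for later pushes.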
