philipp/doran

Philipp 4340c903a3 fix: harden hetzner rebuild bootstrap flow

2026-03-28 23:05:43 +01:00

7.9 KiB

Hetzner self-hosted CI/CD runbook

This is the operator runbook for the handoff from local bootstrap to self-hosted Forgejo-based deployment.

Bootstrap prerequisites

From your workstation:

cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
source scripts/hetzner/bootstrap-secrets.env
python3 -c 'import yaml, nacl'  # verify PyYAML and PyNaCl are installed for forgejo-actions bootstrap
bash scripts/hetzner/bootstrap.sh

scripts/hetzner/bootstrap-secrets.env should contain non-secret bootstrap settings plus mappings to pass entries, set through variables such as HCLOUD_TOKEN_PASS, REGISTRY_PASSWORD_PASS, and FORGEJO_ADMIN_PASSWORD_PASS. If you explicitly export the raw env vars, they override the pass lookups.
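
The override rule can be pictured with a small resolver. This is a minimal sketch, not the code in bootstrap.sh: the function name resolve_secret is hypothetical, and it assumes each *_PASS variable holds the name of a pass entry whose first line is the secret.

```shell
# Hypothetical helper (not taken from bootstrap.sh).
# Prefers an exported raw variable; otherwise reads the pass entry named by <VAR>_PASS.
resolve_secret() {
  name="$1"
  # Indirect lookup of the raw variable, e.g. $HCLOUD_TOKEN.
  eval "raw=\${$name:-}"
  if [ -n "$raw" ]; then
    printf '%s\n' "$raw"
    return 0
  fi
  # Indirect lookup of the mapping variable, e.g. $HCLOUD_TOKEN_PASS.
  eval "entry=\${${name}_PASS:-}"
  if [ -z "$entry" ]; then
    echo "neither $name nor ${name}_PASS is set" >&2
    return 1
  fi
  # Assumption: the secret is the first line of the pass entry.
  pass show "$entry" | head -n 1
}
```

For example, resolve_secret HCLOUD_TOKEN prints the exported HCLOUD_TOKEN if there is one and only falls back to pass when it is unset.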

After bootstrap completes, you should have:

.state/hetzner/kubeconfig.yaml
.state/hetzner/kubeconfig.incluster.yaml
Forgejo reachable at https://${FORGEJO_DOMAIN}
the target Forgejo repo created automatically
repository Actions secrets/variables populated for CI
the current repo pushed to Forgejo automatically in default mode
Registry reachable at https://${REGISTRY_DOMAIN}
private admin/control-plane access over Tailscale if configured

Bootstrap repo automation requires FORGEJO_ADMIN_USERNAME and FORGEJO_ADMIN_PASSWORD, plus two local Python modules: PyYAML for kubeconfig rendering and, in the default forgejo-actions mode, PyNaCl so the script can encrypt Forgejo Actions secrets before upload. If either module is missing, bootstrap fails fast with an explicit preflight error. The same bootstrap flow also creates the initial Forgejo admin account and writes a durable /data/.runner config into the shared Forgejo PVC before the runner deployment is allowed to start.
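
A fail-fast module check of this kind might look like the sketch below. The function name preflight_python_modules and the PYTHON override are assumptions made for illustration; the actual preflight in bootstrap.sh may be structured differently.

```shell
# Hypothetical preflight (illustrative only).
# Tries to import each module and reports every missing one in a single error.
preflight_python_modules() {
  missing=""
  for mod in "$@"; do
    # PYTHON is an assumed override hook; it defaults to python3.
    if ! ${PYTHON:-python3} -c "import $mod" >/dev/null 2>&1; then
      missing="$missing $mod"
    fi
  done
  if [ -n "$missing" ]; then
    echo "preflight: missing Python modules:$missing" >&2
    return 1
  fi
  return 0
}
```

Calling preflight_python_modules yaml nacl before any cluster work mirrors the import check from the prerequisites: PyYAML is imported as yaml and PyNaCl as nacl.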

Repository bootstrap is now owner-aware:

if FORGEJO_REPO_OWNER matches FORGEJO_ADMIN_USERNAME, bootstrap creates the repo under the admin user's namespace
if FORGEJO_REPO_OWNER names an existing Forgejo organization that the admin can manage, bootstrap creates the repo under that organization instead
rerunning bootstrap remains idempotent because repo creation is skipped when the target repo already exists and secrets/variables are upserted in place
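
The owner-aware, idempotent behavior can be sketched against the Forgejo REST API. This is an assumption about how such logic could be written, not a copy of the bootstrap script: repo_create_endpoint and ensure_repo are hypothetical names, and basic auth with the admin credentials is assumed.

```shell
# Hypothetical helper: pick the create endpoint from the owner type.
# A user-owned repo goes to /user/repos, an organization-owned one to /orgs/<org>/repos.
repo_create_endpoint() {
  owner="$1"
  admin="$2"
  if [ "$owner" = "$admin" ]; then
    printf '/api/v1/user/repos\n'
  else
    printf '/api/v1/orgs/%s/repos\n' "$owner"
  fi
}

# Hypothetical helper: create the repo only if it does not exist yet.
# Arguments: base URL (e.g. https://$FORGEJO_DOMAIN), owner, repo name.
ensure_repo() {
  base="$1"
  owner="$2"
  repo="$3"
  auth="${FORGEJO_ADMIN_USERNAME}:${FORGEJO_ADMIN_PASSWORD}"
  status=$(curl -s -o /dev/null -w '%{http_code}' -u "$auth" \
    "${base}/api/v1/repos/${owner}/${repo}")
  if [ "$status" = "200" ]; then
    return 0   # already exists, so a rerun is a no-op
  fi
  curl -fsS -X POST -u "$auth" \
    -H 'Content-Type: application/json' \
    -d "{\"name\": \"${repo}\"}" \
    "${base}$(repo_create_endpoint "$owner" "$FORGEJO_ADMIN_USERNAME")" >/dev/null
}
```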

Verify the cluster

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n forgejo get deploy,pods,svc,ingress
kubectl -n registry get deploy,pods,svc,ingress
kubectl -n unrip get deploy,pods

Seed the repo into Forgejo

Default bootstrap already seeds the repo, using HTTPS Git auth derived from the configured admin credentials, by running:

bash scripts/hetzner/seed-forgejo-repo.sh

You only need to run it manually if you skipped seeding during bootstrap or want to push again after local changes.

seed-forgejo-repo.sh rewrites the configured forgejo remote to an authenticated HTTPS URL for non-interactive pushes. That makes bootstrap reruns hands-free, but it also means the local Git remote can contain embedded credentials. Replace it with a token-backed, SSH, or credential-helper-managed remote after bootstrap if you do not want secrets persisted in .git/config.
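
One way to drop the embedded credentials afterwards is to rewrite the remote URL without its userinfo part. The helper name strip_url_credentials is hypothetical, and the sketch assumes the remote is a plain https://user:password@host/... URL.

```shell
# Hypothetical helper: remove "user:password@" from an HTTP(S) remote URL.
strip_url_credentials() {
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1#'
}

# Possible usage against the remote named "forgejo" (run inside the repo):
#   git remote set-url forgejo "$(strip_url_credentials "$(git remote get-url forgejo)")"
```

After that, pushes need another credential source, such as a credential helper or an SSH remote.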

Configure Forgejo Actions secrets and variables

Bootstrap upserts these repository secrets automatically:

KUBECONFIG_B64
REGISTRY_USERNAME
REGISTRY_PASSWORD

Bootstrap upserts these repository variables automatically:

REGISTRY_HOST=${REGISTRY_DOMAIN}
PROJECT_NAME=${PROJECT_NAME}
PROJECT_NAMESPACE=${PROJECT_NAMESPACE}
PROJECT_DEPLOYMENTS as a comma-separated version of the bootstrap deployment list

The Forgejo repo configuration step is idempotent, so rerunning bootstrap updates the same repo secrets/variables in place.
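
To reproduce two of these values by hand, for instance when checking what CI received, the following sketch may help. Both function names are hypothetical, and the assumption is that KUBECONFIG_B64 is single-line standard base64 of the kubeconfig file.

```shell
# Hypothetical helper: base64-encode a kubeconfig file onto one line.
encode_kubeconfig() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: join deployment names into the comma-separated
# form used for PROJECT_DEPLOYMENTS.
join_deployments() {
  ( IFS=,; printf '%s\n' "$*" )
}
```

For example, join_deployments dummy-reactor dummy-executor yields dummy-reactor,dummy-executor.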

Workflow behavior

The workflow in .forgejo/workflows/deploy.yml now:

installs kubectl on the Forgejo runner
loads kubeconfig from KUBECONFIG_B64
computes IMAGE=${REGISTRY_HOST}/${PROJECT_NAME}:${GIT_SHA}
creates an in-cluster Kubernetes Job in PROJECT_NAMESPACE
in that Job, an init container checks out the repo with the Forgejo job token, sent as an Authorization: Bearer ... header instead of being embedded in the clone URL
in that Job, Kaniko builds and pushes the image using the Kubernetes registry auth secret
updates each deployment listed in PROJECT_DEPLOYMENTS inside PROJECT_NAMESPACE
waits for rollout after each image update
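
The image tag computation and the per-deployment update can be sketched as a loop. This is not the content of deploy.yml: deploy_image is a hypothetical name, KUBECTL is an assumed override that allows a dry run, and the *= wildcard (update every container in the deployment) is an assumption about the container names.

```shell
# Hypothetical sketch of the update-and-wait loop.
# Expects REGISTRY_HOST, PROJECT_NAME, GIT_SHA, PROJECT_NAMESPACE and
# PROJECT_DEPLOYMENTS (comma-separated) in the environment.
deploy_image() {
  image="${REGISTRY_HOST}/${PROJECT_NAME}:${GIT_SHA}"
  # Turn the comma-separated list into whitespace-separated words.
  for dep in $(printf '%s' "$PROJECT_DEPLOYMENTS" | tr ',' ' '); do
    # Assumption: every container in the deployment runs the same image.
    ${KUBECTL:-kubectl} -n "$PROJECT_NAMESPACE" set image "deployment/$dep" "*=$image" || return 1
    # Wait for this rollout before touching the next deployment.
    ${KUBECTL:-kubectl} -n "$PROJECT_NAMESPACE" rollout status "deployment/$dep" --timeout=300s || return 1
  done
}
```

Setting KUBECTL="echo kubectl" prints the commands instead of running them, which is a cheap way to check the variables before a real deploy.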

Default behavior if you do not set project variables:

PROJECT_NAME=unrip
PROJECT_NAMESPACE=unrip
PROJECT_DEPLOYMENTS=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer
PROJECT_REGISTRY_SECRET_NAME=unrip-registry-creds

For a future project, reuse the same workflow by changing only the Forgejo repository variables instead of copying the workflow.
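
The defaults listed above map naturally onto shell default assignment. The sketch below assumes that is roughly how a script would apply them; apply_project_defaults is a hypothetical name and the workflow may set its defaults another way.

```shell
# Hypothetical helper: fill in project variables only when they are unset or empty.
apply_project_defaults() {
  : "${PROJECT_NAME:=unrip}"
  : "${PROJECT_NAMESPACE:=unrip}"
  : "${PROJECT_DEPLOYMENTS:=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer}"
  : "${PROJECT_REGISTRY_SECRET_NAME:=unrip-registry-creds}"
}
```

Because := only assigns when a variable is unset or empty, values supplied as Forgejo repository variables win over the defaults.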

Default bootstrap now uses the same routine CI path for the first deploy:

bootstrap fetches the real kubeconfig from the node
bootstrap derives an in-cluster kubeconfig for the runner
bootstrap creates the Forgejo repo and Actions config
bootstrap pushes to main
Forgejo Actions builds the image in-cluster and deploys it

Legacy mode still exists if you explicitly set:

BOOTSTRAP_DELIVERY_MODE=local-image-import

That legacy mode still requires local docker to build and import the bootstrap image. In all modes, bootstrap also needs either a native htpasswd binary or local docker as a fallback to generate the registry auth secret.
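
The htpasswd fallback could be structured as in the sketch below. The function names are hypothetical, and the httpd:2 image for the docker path is an assumption; the bootstrap script may use a different image or hashing option.

```shell
# Hypothetical helper: report which backend can generate the htpasswd entry.
pick_htpasswd_backend() {
  if command -v htpasswd >/dev/null 2>&1; then
    echo native
  elif command -v docker >/dev/null 2>&1; then
    echo docker
  else
    echo "need htpasswd or docker to generate registry auth" >&2
    return 1
  fi
}

# Hypothetical helper: print a bcrypt htpasswd line for the registry user.
# Note: the password is passed on the command line and is visible to local process listings.
registry_htpasswd_line() {
  user="$1"
  password="$2"
  backend=$(pick_htpasswd_backend) || return 1
  case "$backend" in
    native) htpasswd -Bbn "$user" "$password" ;;
    # Assumption: the httpd image ships htpasswd.
    docker) docker run --rm --entrypoint htpasswd httpd:2 -Bbn "$user" "$password" ;;
  esac
}
```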

Trigger deploys

Push to main in Forgejo:

git push forgejo main

Observe deploys

export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl -n unrip rollout status deployment/near-intents-ingest --timeout=300s
kubectl -n unrip rollout status deployment/dummy-reactor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-executor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-consumer --timeout=300s
kubectl -n unrip get pods -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50

DNS and TLS

If DNS automation was enabled during bootstrap, A records for the base, Forgejo, and registry hosts are already managed from the repo-side bootstrap.

Currently supported DNS providers:

Cloudflare
Porkbun

Destroy does not remove those external DNS records unless you explicitly opt in with DESTROY_DNS=true when running scripts/hetzner/destroy.sh. Likewise, generated local kubeconfigs/manifests remain on disk unless you set DESTROY_LOCAL_STATE=true, and the seeded Forgejo repository remains unless you set DESTROY_FORGEJO_REPO=true with valid Forgejo admin credentials.
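
The opt-in flags can be read as a simple gate per cleanup step. The sketch below is illustrative only: destroy_optional_steps is a hypothetical name, and it assumes destroy.sh treats only the exact string true as opting in.

```shell
# Hypothetical helper: true only for the exact string "true".
is_true() {
  [ "${1:-}" = "true" ]
}

# Hypothetical helper: list the optional cleanup steps that are enabled.
destroy_optional_steps() {
  if is_true "${DESTROY_DNS:-}"; then echo dns; fi
  if is_true "${DESTROY_LOCAL_STATE:-}"; then echo local-state; fi
  if is_true "${DESTROY_FORGEJO_REPO:-}"; then echo forgejo-repo; fi
}
```

A full teardown that also removes DNS records and local state would then be invoked as DESTROY_DNS=true DESTROY_LOCAL_STATE=true bash scripts/hetzner/destroy.sh.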

TLS is issued by cert-manager using the rendered Let's Encrypt email and ingress hosts.

Current limitations

the bootstrap path now creates the initial admin account and runner config automatically from inside the Forgejo pod, but it still depends on the operator supplying the intended admin credentials up front
runner startup is now manifest-gated on a durable /data/.runner file stored under the shared forgejo-data PVC, so fresh applies no longer depend on a broken intermediate secret or a race against a crashing runner pod; deleting that Forgejo PVC still requires rerunning bootstrap to re-register the runner
organization-owned repo bootstrap works only when FORGEJO_REPO_OWNER names a pre-existing organization that the configured admin can create repositories in; bootstrap does not create the organization itself
seed-forgejo-repo.sh uses admin-password-backed HTTPS pushes by default for unattended bootstrap, so operators should swap to a token or SSH remote after initial seeding if they want to avoid storing credentials in .git/config
destroy.sh can now remove the seeded Forgejo repository, DNS records, and local bootstrap artifacts, but each destructive cleanup path is opt-in via DESTROY_FORGEJO_REPO=true, DESTROY_DNS=true, and DESTROY_LOCAL_STATE=true
the runner currently uses host-mode jobs and installs kubectl at job start; the image build itself runs in-cluster via Kaniko, which is functional but not yet optimized
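
Gating on a durable file, as described for /data/.runner, often comes down to a wait loop that runs before the main process starts. The sketch below shows that general pattern under the assumption of an init-container-style check. wait_for_file is a hypothetical name and the real manifests may gate differently.

```shell
# Hypothetical helper: block until a non-empty file exists or the timeout (seconds) expires.
wait_for_file() {
  path="$1"
  timeout="${2:-300}"
  waited=0
  while [ ! -s "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $path" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

For example, wait_for_file /data/.runner 300 returns once the registration file is present and fails after five minutes otherwise.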
