doran/docs/hetzner-self-hosted-ci-runbook.md
2026-03-29 10:28:09 +02:00


# Hetzner self-hosted CI/CD runbook
This is the operator runbook for the handoff from local bootstrap to self-hosted Forgejo-based deployment.
## Bootstrap prerequisites
From your workstation:
```bash
cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
source scripts/hetzner/bootstrap-secrets.env
python3 -c 'import yaml, nacl' # verify PyYAML and PyNaCl are installed for forgejo-actions bootstrap
bash scripts/hetzner/bootstrap.sh
```
`scripts/hetzner/bootstrap-secrets.env` should contain non-secret bootstrap settings plus mappings to `pass` entries, such as `HCLOUD_TOKEN_PASS`, `REGISTRY_PASSWORD_PASS`, and `FORGEJO_ADMIN_PASSWORD_PASS`. If you explicitly export the corresponding raw environment variables (for example `FORGEJO_ADMIN_PASSWORD`), they override the `pass` lookups.
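The override rule can be sketched as a small shell helper. This is an illustration, not the code in `bootstrap.sh`: the function name `resolve_secret` is hypothetical, and it assumes every raw variable `NAME` is paired with a mapping variable `NAME_PASS`, as the examples above suggest.

```bash
# Hypothetical helper: print the value for NAME, preferring an exported raw
# variable and falling back to the `pass` entry named by NAME_PASS.
# Assumption: the "<NAME>_PASS" naming convention holds for every secret.
resolve_secret() {
  rs_name="$1"
  # Indirect lookups via eval; rs_name must be a trusted variable name.
  eval "rs_raw=\${${rs_name}:-}"
  eval "rs_entry=\${${rs_name}_PASS:-}"
  if [ -n "$rs_raw" ]; then
    # A raw value wins over any pass mapping.
    printf '%s\n' "$rs_raw"
  elif [ -n "$rs_entry" ]; then
    # By convention the first line of a pass entry is the secret itself.
    pass show "$rs_entry" | head -n 1
  else
    echo "no value or pass mapping for ${rs_name}" >&2
    return 1
  fi
}
```

With this sketch, `resolve_secret FORGEJO_ADMIN_PASSWORD` returns the exported value when one is set and only calls `pass show` otherwise.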
After bootstrap completes, you should have:
- `.state/hetzner/kubeconfig.yaml`
- `.state/hetzner/kubeconfig.incluster.yaml`
- Forgejo reachable at `https://${FORGEJO_DOMAIN}`
- the target Forgejo repo created automatically
- repository Actions secrets/variables populated for CI
- the current repo pushed to Forgejo automatically in default mode
- Registry reachable at `https://${REGISTRY_DOMAIN}`
- Grafana reachable at `https://${GRAFANA_DOMAIN}`
- Headlamp reachable at `https://${HEADLAMP_DOMAIN}`
- private admin/control-plane access over Tailscale if configured
Bootstrap repo automation requires `FORGEJO_ADMIN_USERNAME` and `FORGEJO_ADMIN_PASSWORD`. It also needs two Python modules installed locally: `PyYAML` for kubeconfig rendering, and `PyNaCl` in the default `forgejo-actions` mode so the script can encrypt Forgejo Actions secrets before upload. Bootstrap fails fast with an explicit preflight error if a required module is missing. The same bootstrap flow also creates the initial Forgejo admin account and writes a durable `/data/.runner` config into the shared Forgejo PVC before the runner deployment is allowed to start.
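A fail-fast preflight of this kind might look like the following sketch. The function name `preflight_python_modules` is hypothetical and the real check in `bootstrap.sh` may differ; the import names `yaml` and `nacl` are the ones used in the prerequisite check above.

```bash
# Hypothetical preflight: report every missing Python module in one error.
# Arguments are import names (e.g. yaml, nacl), not package names.
preflight_python_modules() {
  pf_missing=""
  for pf_module in "$@"; do
    # A failed import (or a missing python3) marks the module as missing.
    if ! python3 -c "import ${pf_module}" >/dev/null 2>&1; then
      pf_missing="${pf_missing:+$pf_missing }${pf_module}"
    fi
  done
  if [ -n "$pf_missing" ]; then
    echo "preflight failed, missing Python modules: ${pf_missing}" >&2
    return 1
  fi
}
```

Calling `preflight_python_modules yaml nacl` before any remote work keeps a missing dependency from surfacing halfway through provisioning.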
Repository bootstrap is now owner-aware:
- if `FORGEJO_REPO_OWNER` matches `FORGEJO_ADMIN_USERNAME`, bootstrap creates the repo under the admin user's namespace
- if `FORGEJO_REPO_OWNER` names an existing Forgejo organization that the admin can manage, bootstrap creates the repo under that organization instead
- rerunning bootstrap remains idempotent because repo creation is skipped when the target repo already exists and secrets/variables are upserted in place
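The owner rules above can be sketched against the Forgejo REST API. This assumes bootstrap uses the Gitea-compatible endpoints `/api/v1/user/repos`, `/api/v1/orgs/{org}/repos`, and `/api/v1/repos/{owner}/{repo}`; the helper names are hypothetical and the actual script may be structured differently.

```bash
# Hypothetical helpers; bootstrap.sh may implement this differently.

# Print the API path used to create a repo for the given owner.
repo_create_endpoint() {
  rc_owner="$1"
  rc_admin="$2"
  if [ "$rc_owner" = "$rc_admin" ]; then
    # Owner is the admin user: create in the authenticated user's namespace.
    printf '/api/v1/user/repos\n'
  else
    # Otherwise treat the owner as a pre-existing organization.
    printf '/api/v1/orgs/%s/repos\n' "$rc_owner"
  fi
}

# Return 0 if the repo already exists; non-zero on any HTTP error
# or connection failure. Used to skip creation on reruns.
repo_exists() {
  re_base_url="$1"
  re_owner="$2"
  re_repo="$3"
  curl -fsS -o /dev/null \
    -u "${FORGEJO_ADMIN_USERNAME}:${FORGEJO_ADMIN_PASSWORD}" \
    "${re_base_url}/api/v1/repos/${re_owner}/${re_repo}"
}
```

Checking `repo_exists` first and only then posting to the path from `repo_create_endpoint` is one way to get the idempotent rerun behavior described above.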
## Verify the cluster
```bash
export KUBECONFIG="$PWD/.state/hetzner/kubeconfig.yaml"
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n forgejo get deploy,pods,svc,ingress
kubectl -n registry get deploy,pods,svc,ingress
kubectl -n observability get deploy,ds,pods,svc,ingress,secrets
kubectl -n unrip get deploy,pods
```
## Seed the repo into Forgejo
Default bootstrap already seeds the repo with HTTPS Git auth derived from the configured admin credentials. The manual equivalent is:
```bash
bash scripts/hetzner/seed-forgejo-repo.sh
```
You only need to run it manually if you skipped seeding during bootstrap or want to push again after local changes.
`seed-forgejo-repo.sh` rewrites the configured `forgejo` remote to an authenticated HTTPS URL for non-interactive pushes. That makes bootstrap reruns hands-free, but it also means the local Git remote can contain embedded credentials. Replace it with a token-backed, SSH, or credential-helper-managed remote after bootstrap if you do not want secrets persisted in `.git/config`.
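One way to scrub the embedded credentials afterwards is to strip the userinfo part from the remote URL. The helper below is a minimal sketch; `strip_url_credentials` is a hypothetical name, and it assumes the remote is a plain `http://` or `https://` URL.

```bash
# Hypothetical helper: remove "user:password@" from an HTTP(S) URL.
# URLs without embedded credentials pass through unchanged.
strip_url_credentials() {
  printf '%s\n' "$1" | sed -E 's#^(https?://)[^/@]*@#\1#'
}

# Example (not executed here): rewrite the `forgejo` remote in place.
#   git remote set-url forgejo \
#     "$(strip_url_credentials "$(git remote get-url forgejo)")"
```

After the rewrite, Git prompts for credentials or defers to a configured credential helper on the next push.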
## Configure Forgejo Actions secrets and variables
Bootstrap upserts these repository secrets automatically:
- `KUBECONFIG_B64`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD`
Bootstrap upserts these repository variables automatically:
- `REGISTRY_HOST=${REGISTRY_DOMAIN}`
- `PROJECT_NAME=${PROJECT_NAME}`
- `PROJECT_NAMESPACE=${PROJECT_NAMESPACE}`
- `PROJECT_DEPLOYMENTS` as a comma-separated version of the bootstrap deployment list
The Forgejo repo configuration step is idempotent, so rerunning bootstrap updates the same repo secrets/variables in place.
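If you ever need to reproduce these values by hand, the sketch below shows one plausible derivation. It assumes `KUBECONFIG_B64` is the single-line base64 encoding of a kubeconfig file and that the bootstrap deployment list can be passed as separate arguments; both helper names are hypothetical.

```bash
# Hypothetical helper: single-line base64 of a file.
# Stripping newlines keeps the output identical on GNU and BSD base64.
encode_b64() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical helper: join arguments with commas,
# e.g. for a PROJECT_DEPLOYMENTS value.
join_commas() {
  jc_out=""
  for jc_item in "$@"; do
    jc_out="${jc_out:+$jc_out,}${jc_item}"
  done
  printf '%s\n' "$jc_out"
}
```

For example, `encode_b64 .state/hetzner/kubeconfig.incluster.yaml` would produce a value suitable for a secret of this shape, assuming the in-cluster kubeconfig is the one the runner is meant to use.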
## Workflow behavior
The workflow in `.forgejo/workflows/deploy.yml` now:
1. installs `kubectl` on the Forgejo runner
2. loads kubeconfig from `KUBECONFIG_B64`
3. computes `IMAGE=${REGISTRY_HOST}/${PROJECT_NAME}:${GIT_SHA}`
4. creates an in-cluster Kubernetes Job in `PROJECT_NAMESPACE`
5. the Job checks out the repo in an init container, authenticating with the Forgejo job token through an `Authorization: Bearer ...` header instead of embedding the token in the clone URL
6. Kaniko builds and pushes the image from within that Job using the Kubernetes registry auth secret
7. the workflow updates each deployment listed in `PROJECT_DEPLOYMENTS` inside `PROJECT_NAMESPACE`
8. the workflow waits for rollout after each image update
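Steps 3, 7, and 8 can be sketched as a dry run that prints the `kubectl` commands instead of executing them. The function names are hypothetical, and the `'*=IMAGE'` form (which updates every container in a deployment) is an assumption; the real workflow may target a named container instead.

```bash
# Hypothetical helper: build the image reference from its three parts.
compute_image() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Hypothetical dry run: print the update and rollout commands for each
# deployment in a comma-separated list. Nothing is executed.
deploy_commands() {
  dc_namespace="$1"
  dc_deployments="$2"
  dc_image="$3"
  printf '%s\n' "$dc_deployments" | tr ',' '\n' | while IFS= read -r dc_name; do
    # Skip empty entries such as a trailing comma.
    [ -n "$dc_name" ] || continue
    printf "kubectl -n %s set image deployment/%s '*=%s'\n" \
      "$dc_namespace" "$dc_name" "$dc_image"
    printf 'kubectl -n %s rollout status deployment/%s --timeout=300s\n' \
      "$dc_namespace" "$dc_name"
  done
}
```

Printing the commands first makes it easy to review what a push to `main` would change before trusting the automated path.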
Default behavior if you do not set project variables:
- `PROJECT_NAME=unrip`
- `PROJECT_NAMESPACE=unrip`
- `PROJECT_DEPLOYMENTS=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer`
- `PROJECT_REGISTRY_SECRET_NAME=unrip-registry-creds`
For a future project, reuse the same workflow by changing only the Forgejo repository variables instead of copying the workflow.
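In shell terms, the defaulting behaves like the sketch below. This is only an illustration of the fallback semantics; how `.forgejo/workflows/deploy.yml` actually applies the defaults is not shown here, and `apply_project_defaults` is a hypothetical name.

```bash
# Hypothetical helper: fill in the documented defaults for any project
# variable that is unset or empty, leaving explicit values untouched.
apply_project_defaults() {
  : "${PROJECT_NAME:=unrip}"
  : "${PROJECT_NAMESPACE:=unrip}"
  : "${PROJECT_DEPLOYMENTS:=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer}"
  : "${PROJECT_REGISTRY_SECRET_NAME:=unrip-registry-creds}"
}
```

A new project would therefore set all four values explicitly, since each default points at the `unrip` project.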
Default bootstrap now uses the same routine CI path for the first deploy:
- bootstrap fetches the real kubeconfig from the node
- bootstrap derives an in-cluster kubeconfig for the runner
- bootstrap creates the Forgejo repo and Actions config
- bootstrap pushes to `main`
- Forgejo Actions builds the image in-cluster and deploys it
Legacy mode still exists if you explicitly set:
```bash
BOOTSTRAP_DELIVERY_MODE=local-image-import
```
That legacy mode still requires local `docker` to build and import the bootstrap image. In all modes, bootstrap also needs either a native `htpasswd` binary or local `docker` as a fallback to generate the registry auth secret.
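The `htpasswd` fallback can be sketched as follows. The function name `make_htpasswd_line` is hypothetical, and the use of the `httpd:2` image for the Docker path is an assumption borrowed from common registry setups, not something confirmed by the bootstrap script.

```bash
# Hypothetical helper: print a bcrypt htpasswd line for user/password,
# using a native htpasswd if present and Docker otherwise.
# Note: the password is visible in the process list while this runs.
make_htpasswd_line() {
  hp_user="$1"
  hp_password="$2"
  if command -v htpasswd >/dev/null 2>&1; then
    # -B bcrypt, -b password from argv, -n print to stdout.
    htpasswd -Bbn "$hp_user" "$hp_password"
  elif command -v docker >/dev/null 2>&1; then
    # Assumption: the httpd:2 image, which ships the htpasswd binary.
    docker run --rm --entrypoint htpasswd httpd:2 -Bbn "$hp_user" "$hp_password"
  else
    echo "need either htpasswd or docker to generate registry auth" >&2
    return 1
  fi
}
```

Installing the native binary (often packaged as `apache2-utils` or `httpd-tools`) avoids pulling an image just to hash one password.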
## Trigger deploys
Push to `main` in Forgejo:
```bash
git push forgejo main
```
## Observe deploys
```bash
export KUBECONFIG="$PWD/.state/hetzner/kubeconfig.yaml"
kubectl -n unrip rollout status deployment/near-intents-ingest --timeout=300s
kubectl -n unrip rollout status deployment/dummy-reactor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-executor --timeout=300s
kubectl -n unrip rollout status deployment/dummy-consumer --timeout=300s
kubectl -n unrip get pods -o wide
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```
## DNS and TLS
If DNS automation was enabled during bootstrap, the A records for the base, Forgejo, and registry hosts are already managed by the repo-side bootstrap.
Currently supported DNS providers:
- Cloudflare
- Porkbun
Destroy does not remove those external DNS records unless you explicitly opt in with `DESTROY_DNS=true` when running `scripts/hetzner/destroy.sh`.
Likewise, generated local kubeconfigs/manifests remain on disk unless you set `DESTROY_LOCAL_STATE=true`, and the seeded Forgejo repository remains unless you set `DESTROY_FORGEJO_REPO=true` with valid Forgejo admin credentials.
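The opt-in gating can be pictured with the sketch below, which lists the cleanup paths a given environment would enable. It assumes only the literal value `true` enables a path; `destroy_plan` and its output labels are hypothetical and not part of `destroy.sh`.

```bash
# Hypothetical helper: print one label per destructive cleanup path
# that the current environment opts in to. Prints nothing by default.
destroy_plan() {
  if [ "${DESTROY_DNS:-}" = "true" ]; then
    echo "dns-records"
  fi
  if [ "${DESTROY_LOCAL_STATE:-}" = "true" ]; then
    echo "local-state"
  fi
  if [ "${DESTROY_FORGEJO_REPO:-}" = "true" ]; then
    echo "forgejo-repo"
  fi
  return 0
}
```

A teardown that opts in to every path would be invoked as `DESTROY_DNS=true DESTROY_LOCAL_STATE=true DESTROY_FORGEJO_REPO=true bash scripts/hetzner/destroy.sh`.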
TLS is issued by cert-manager using the rendered Let's Encrypt email and ingress hosts.
For browser-based cluster inspection and pod logs, use Headlamp. For historical log search, use Grafana/Loki. Both are documented in `docs/k8s-observability.md`.
## Current limitations
- the bootstrap path now creates the initial admin account and runner config automatically from inside the Forgejo pod, but it still depends on the operator supplying the intended admin credentials up front
- runner startup is now manifest-gated on a durable `/data/.runner` file stored under the shared `forgejo-data` PVC, so fresh applies no longer depend on a broken intermediate secret or a race against a crashing runner pod; deleting that Forgejo PVC still requires rerunning bootstrap to re-register the runner
- organization-owned repo bootstrap works only when `FORGEJO_REPO_OWNER` names a pre-existing organization that the configured admin can create repositories in; bootstrap does not create the organization itself
- `seed-forgejo-repo.sh` uses admin-password-backed HTTPS pushes by default for unattended bootstrap, so operators should swap to a token or SSH remote after initial seeding if they want to avoid storing credentials in `.git/config`
- `destroy.sh` can now remove the seeded Forgejo repository, DNS records, and local bootstrap artifacts, but each destructive cleanup path is opt-in via `DESTROY_FORGEJO_REPO=true`, `DESTROY_DNS=true`, and `DESTROY_LOCAL_STATE=true` respectively
- the runner currently uses host-mode jobs and installs `kubectl` at job start; the image build itself runs in-cluster via Kaniko, which is functional but not yet optimized
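To illustrate the runner gating mentioned above, the sketch below waits for a non-empty file before proceeding. It is an assumption about how such a gate could work (for example in an init container); the actual manifests may use a different mechanism, and `wait_for_file` is a hypothetical name.

```bash
# Hypothetical gate: block until the file exists and is non-empty,
# or fail after roughly `timeout` seconds (default 60).
wait_for_file() {
  wf_path="$1"
  wf_timeout="${2:-60}"
  wf_waited=0
  while [ ! -s "$wf_path" ]; do
    if [ "$wf_waited" -ge "$wf_timeout" ]; then
      echo "timed out waiting for ${wf_path}" >&2
      return 1
    fi
    sleep 1
    wf_waited=$((wf_waited + 1))
  done
}
```

A gate such as `wait_for_file /data/.runner 300` ahead of the runner process would make the dependency on bootstrap explicit instead of letting the runner crash-loop.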