# Kubernetes observability on the Hetzner single-node cluster
This cluster now includes a minimal reproducible log stack in the `observability` namespace:
- `loki` for log storage and querying
- `promtail` as a DaemonSet that ships pod stdout/stderr logs from every node
- `grafana` as the web UI
## What gets collected
Promtail tails Kubernetes container log files under `/var/log/pods` on each node.
As a result, the logs of any container that writes to stdout/stderr show up in Loki and Grafana without per-app configuration.
This fits the current app setup in this repo because the services already log to stdout/stderr.
What is **not** collected automatically:
- arbitrary log files written somewhere else inside a container filesystem
- logs from external services that are not running as Kubernetes pods on this cluster
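To make the collection path concrete: the kubelet lays container logs out as `/var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart-count>.log`. The helper below is a minimal sketch that splits such a path into the parts that end up as labels. `pod_log_path_info` is a hypothetical name used for illustration only; it is not part of this repo or of promtail.

```bash
# Hypothetical helper: split a kubelet pod log path into its parts.
# Assumes the layout /var/log/pods/<namespace>_<pod>_<uid>/<container>/<n>.log
pod_log_path_info() {
  local path="$1"
  local rel="${path#/var/log/pods/}"   # strip the fixed prefix
  local pod_dir="${rel%%/*}"           # <namespace>_<pod>_<uid>
  local rest="${rel#*/}"               # <container>/<n>.log
  local namespace="${pod_dir%%_*}"     # text before the first underscore
  local pod_and_uid="${pod_dir#*_}"    # <pod>_<uid>
  local pod="${pod_and_uid%_*}"        # drop the trailing _<uid>
  local container="${rest%%/*}"        # first path segment after the pod dir
  printf 'namespace=%s pod=%s container=%s\n' "$namespace" "$pod" "$container"
}
```

Splitting on underscores is safe here because Kubernetes namespace and pod names cannot contain them. A file that does not live under `/var/log/pods` in this layout is outside what promtail tails.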
## Access
Grafana is exposed through Traefik + cert-manager at:
- `https://${GRAFANA_DOMAIN}` when bootstrapped from `scripts/hetzner/bootstrap-secrets.env`
- in the current live environment: `https://grafana.doran.133011.xyz/`
Admin credentials come from:
- `GRAFANA_ADMIN_USERNAME`
- `GRAFANA_ADMIN_PASSWORD_PASS` or `GRAFANA_ADMIN_PASSWORD`
The recommended option is to keep the password in `pass` rather than in a plain environment variable.
In the current live setup the password is stored at:
- `api/hetznerk3s/grafana-admin-password`
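The lookup order can be sketched as follows. This is an illustration under stated assumptions, not the bootstrap script's actual code: it assumes `GRAFANA_ADMIN_PASSWORD_PASS` holds the name of a `pass` entry and `GRAFANA_ADMIN_PASSWORD` holds the literal password as a fallback. The function name `resolve_grafana_admin_password` is hypothetical.

```bash
# Hypothetical helper: resolve the Grafana admin password.
# Assumption: GRAFANA_ADMIN_PASSWORD_PASS names a pass entry and
# GRAFANA_ADMIN_PASSWORD is a plain-text fallback.
resolve_grafana_admin_password() {
  if [ -n "${GRAFANA_ADMIN_PASSWORD_PASS:-}" ]; then
    # `pass show <entry>` prints the secret; keep only the first line.
    pass show "$GRAFANA_ADMIN_PASSWORD_PASS" | head -n 1
  elif [ -n "${GRAFANA_ADMIN_PASSWORD:-}" ]; then
    printf '%s\n' "$GRAFANA_ADMIN_PASSWORD"
  else
    echo "no Grafana admin password configured" >&2
    return 1
  fi
}
```

To read the live password by hand, `pass show api/hetznerk3s/grafana-admin-password` prints the entry listed above.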
## Reproducible bootstrap path
The observability stack is part of the repo-managed platform layer:
- `deploy/k8s/platform/base/observability.yaml`
- `deploy/k8s/platform/base/kustomization.yaml`
- `deploy/k8s/platform/base/namespace.yaml`
- `deploy/k8s/overlays/hetzner-single-node/storage-class.patch.yaml`
- `deploy/k8s/overlays/hetzner-single-node/kustomization.yaml`
- `deploy/k8s/overlays/hetzner-single-node/ingress-hosts.patch.yaml`
- `deploy/k8s/overlays/hetzner-single-node/secrets/observability.env.example`
Bootstrap materializes the Grafana secret from local env / `pass`:
- writes `deploy/k8s/overlays/hetzner-single-node/secrets/observability.env`
- copies it into `.state/hetzner/generated-overlay/`
- applies the generated overlay
## Verify the stack
```bash
export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl -n observability get pods
kubectl -n observability get pvc
kubectl -n observability get ingress
kubectl -n observability rollout status deployment/loki --timeout=300s
kubectl -n observability rollout status deployment/grafana --timeout=300s
kubectl -n observability rollout status daemonset/promtail --timeout=300s
```
## Verify logs are arriving
Generate some app logs, then query Loki directly:
```bash
export KUBECONFIG=$PWD/.state/hetzner/kubeconfig.yaml
kubectl -n observability port-forward svc/loki 3100:3100
```
In another shell:
```bash
curl -sS 'http://127.0.0.1:3100/loki/api/v1/labels' | jq
curl -G -sS 'http://127.0.0.1:3100/loki/api/v1/query' \
  --data-urlencode 'query={namespace="unrip"}' | jq
```
If those queries return labels/streams, pod logs are reaching Loki.
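To look at the log lines themselves over a time window, Loki also provides the `/loki/api/v1/query_range` endpoint, which is the one intended for log-stream queries; depending on the Loki version, a bare log selector may be rejected on the instant `/query` endpoint. The sketch below assumes the port-forward is still running on `127.0.0.1:3100`. The function names `loki_start_ns` and `loki_recent_logs` are hypothetical.

```bash
# Hypothetical helper: nanosecond Unix timestamp for "N seconds ago",
# one of the formats Loki accepts for the `start` parameter.
loki_start_ns() {
  local seconds_ago="${1:-3600}"
  echo "$(( ($(date +%s) - seconds_ago) * 1000000000 ))"
}

# Hypothetical helper: print recent log lines for one namespace.
# Assumes Loki is reachable on 127.0.0.1:3100 via the port-forward.
loki_recent_logs() {
  local namespace="$1" seconds_ago="${2:-3600}"
  curl -G -sS 'http://127.0.0.1:3100/loki/api/v1/query_range' \
    --data-urlencode "query={namespace=\"${namespace}\"}" \
    --data-urlencode "start=$(loki_start_ns "$seconds_ago")" \
    --data-urlencode 'limit=20' |
    jq -r '.data.result[].values[][1]'   # each value is [timestamp, line]
}
```

For example, `loki_recent_logs unrip 900` would print up to 20 lines from the last 15 minutes in the `unrip` namespace.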
## Use Grafana
After logging into Grafana:
1. open **Explore**
2. choose the default **Loki** datasource
3. run queries like:
   - `{namespace="unrip"}`
   - `{namespace="forgejo"}`
   - `{namespace="registry"}`
   - `{pod=~"near-intents-ingest.*"}`
   - `{container="app"}`
Useful labels added by promtail:
- `namespace`
- `pod`
- `container`
- `app`
- selected `app.kubernetes.io/*` labels
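To check which values a label currently has, Loki exposes `/loki/api/v1/label/<name>/values`. The helper below is a small sketch that builds that URL. `loki_label_values_url` and the `LOKI_URL` variable are hypothetical names, and the default assumes a local port-forward to Loki on port 3100.

```bash
# Hypothetical helper: URL that lists the known values of one Loki label.
# LOKI_URL is an assumed override; the default matches a local port-forward.
loki_label_values_url() {
  local label="$1" base="${LOKI_URL:-http://127.0.0.1:3100}"
  printf '%s/loki/api/v1/label/%s/values\n' "$base" "$label"
}

# Example (needs a running port-forward and jq):
#   curl -sS "$(loki_label_values_url namespace)" | jq -r '.data[]'
```

Listing the values of `namespace` or `container` first makes it easier to write a selector that matches something.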
## Day-to-day ops
The CLI remains useful for fast debugging:
```bash
kubectl get pods -A
kubectl -n unrip logs deploy/near-intents-ingest -f
kubectl -n forgejo logs deploy/forgejo -f
bash scripts/k8s/logs.sh
```
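When following a single deployment is too narrow, `kubectl logs` can also select pods by label and limit the time window. The wrapper below is a sketch only: `recent_logs` is a hypothetical name, and the label selector you pass depends on how the app's pods are labelled.

```bash
# Hypothetical wrapper: recent logs from every pod matching a label selector.
recent_logs() {
  local namespace="$1" selector="$2" since="${3:-15m}"
  # --prefix tags each line with its pod/container; --tail caps lines per pod.
  kubectl -n "$namespace" logs -l "$selector" \
    --all-containers --prefix --since="$since" --tail=100
}

# After a crash, the previous container's output is often more useful:
#   kubectl -n unrip logs deploy/near-intents-ingest --previous
```

For example, `recent_logs unrip app=near-intents-ingest 5m` would print the last five minutes from all matching pods, assuming the pods carry that `app` label.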
Use Grafana when you want:
- a browser UI
- historical log search
- multi-namespace filtering
- easier cross-pod inspection
## Security notes
Grafana is an admin/operator surface.
On this cluster it is reachable from the public internet and protected only by the Grafana login.
That is acceptable for this disposable single-node setup, but for a hardened production posture prefer one of:
- Tailscale-only access
- ingress auth in front of Grafana
- SSO/OIDC
## Add a new app and have its logs show up in Grafana
Nothing special is required as long as the new pod logs to stdout/stderr.
Once you deploy a new app to this cluster through the usual manifests/Ingress flow, promtail picks up its pod logs automatically.