doran/README.md
2026-03-28 20:53:29 +01:00

# near-intents-monitor
Production-shaped first slice of the trading system:
- **venue ingest**: NEAR Intents solver-bus quote flow
- **bus**: Redpanda first, Kafka-compatible by design
- **reactor**: dummy decision engine emitting commands
- **executor**: dummy execution worker with durable idempotency state
- **result consumer**: downstream observer of execution outcomes
## Canonical repo shape
```text
src/
  apps/
    near-intents-ingest.mjs
    dummy-reactor.mjs
    dummy-executor.mjs
    dummy-consumer.mjs
  bus/
    kafka/
      producer.mjs
      consumer.mjs
  core/
    event-envelope.mjs
    executor-state-store.mjs
    log.mjs
    pair-filter.mjs
    schemas.mjs
  lib/
    config.mjs
    env.mjs
  venues/
    near-intents/
      ingest.mjs
      normalize.mjs
      ws.mjs
compose.yml
Dockerfile
docs/contracts.md
deploy/hetzner/README.md
```
## Event flow
```text
NEAR Intents WebSocket
  |
  +--> raw.near_intents.quote
  |
  v
norm.swap_demand
  |
  v
cmd.execute_trade
  |
  v
exec.trade_result
```
Core rule: services do not call each other directly for trading flow; they communicate through bus topics only.
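
To illustrate the bus-only rule, here is a minimal sketch that uses an in-memory stand-in for the Kafka-compatible bus. Only the topic names come from this README. `createBus`, `wireServices`, `ingest`, and the `quote_id` field are hypothetical names for illustration, not the repo's actual modules or schemas.

```javascript
// Topic names are taken from this README; everything else is illustrative.
const TOPICS = {
  rawQuote: 'raw.near_intents.quote',
  swapDemand: 'norm.swap_demand',
  executeTrade: 'cmd.execute_trade',
  tradeResult: 'exec.trade_result',
};

// Hypothetical in-memory bus: synchronous fan-out to subscribers.
function createBus() {
  const handlers = new Map(); // topic -> array of subscriber callbacks
  const log = [];             // ordered record of every published topic
  return {
    log,
    subscribe(topic, handler) {
      if (!handlers.has(topic)) handlers.set(topic, []);
      handlers.get(topic).push(handler);
    },
    publish(topic, payload) {
      log.push(topic);
      for (const handler of handlers.get(topic) ?? []) handler(payload);
    },
  };
}

// Each service only subscribes and publishes; none calls another directly.
function wireServices(bus) {
  // reactor: swap demand in, trade command out
  bus.subscribe(TOPICS.swapDemand, (demand) =>
    bus.publish(TOPICS.executeTrade, { command_id: `cmd-${demand.quote_id}` }));
  // executor: trade command in, trade result out
  bus.subscribe(TOPICS.executeTrade, (command) =>
    bus.publish(TOPICS.tradeResult, { command_id: command.command_id, status: 'ok' }));
}

// ingest: publishes the raw quote and its normalized form
function ingest(bus, wsMessage) {
  bus.publish(TOPICS.rawQuote, wsMessage);
  bus.publish(TOPICS.swapDemand, { quote_id: wsMessage.quote_id });
}

const bus = createBus();
wireServices(bus);
const results = [];
bus.subscribe(TOPICS.tradeResult, (result) => results.push(result)); // result consumer
ingest(bus, { quote_id: 'q1' });
console.log(bus.log.join(' -> '));
```

Because every hop goes through `publish`, any stage can be replaced or replayed without touching the others, which is the property the core rule is meant to protect.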
## Contracts
See `docs/contracts.md`.
Current topics:
- `raw.near_intents.quote`
- `norm.swap_demand`
- `cmd.execute_trade`
- `exec.trade_result`
## Primary deployment path: repo-driven Hetzner bootstrap
The primary production path is no longer a Compose-only VM workflow.
The intended operating model is:
- Terraform provisions a Hetzner single-node environment
- cloud-init installs k3s automatically on first boot
- a local operator workstation performs the first repo-driven bootstrap
- Kubernetes manifests install Redpanda, the app workloads, Forgejo, runner, registry, and ingress-related components
- once the in-cluster Git + CI stack is alive, routine app deploys move to self-hosted CI
This is a two-phase model:
- **Phase 0:** local workstation bootstrap of a brand-new cluster
- **Phase 1:** self-hosted Forgejo + runner takes over app delivery
Compose still exists for local development and optional single-machine testing, but it is not the canonical production story.
## Prerequisites for first deployment
Install locally on the operator workstation:
- Terraform `>= 1.6`
- `kubectl`
- `docker`
- `curl`
You also need:
- a Hetzner Cloud API token
- a local SSH public key file for Terraform node provisioning
- DNS control for your chosen base domain and Forgejo hostname
- preferably a Tailscale tailnet and auth key for private admin/control-plane access
- the repo checked out locally
## Required bootstrap secrets and inputs
Create the bootstrap env file:
```bash
cp scripts/hetzner/bootstrap-secrets.env.example scripts/hetzner/bootstrap-secrets.env
```
Set at least:
- `HCLOUD_TOKEN`
- `SSH_PUBLIC_KEY_PATH`
- `PUBLIC_DOMAIN`
- recommended:
  - `TAILSCALE_AUTH_KEY`
  - `TAILSCALE_CONTROL_PLANE_HOSTNAME`
- optional fallback:
  - `TF_ADMIN_CIDR_BLOCKS`
- `BASE_DOMAIN`
- `FORGEJO_DOMAIN`
- `FORGEJO_ROOT_URL`
- `REGISTRY_DOMAIN`
- `LETSENCRYPT_EMAIL`
- `REGISTRY_USERNAME`
- `REGISTRY_PASSWORD`
- `NEAR_INTENTS_API_KEY`
- `FORGEJO_RUNNER_REGISTRATION_TOKEN`
- optional DNS automation:
  - Cloudflare:
    - `CLOUDFLARE_API_TOKEN`
    - `CLOUDFLARE_ZONE_ID`
  - Porkbun:
    - `PORKBUN_API_KEY`
    - `PORKBUN_SECRET_API_KEY`
Then load them:
```bash
source scripts/hetzner/bootstrap-secrets.env
```
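
A pre-flight check can catch missing inputs before Terraform runs. The sketch below is not part of the repo. `checkBootstrapEnv` is a hypothetical helper, and treating only the first three names as hard requirements is an assumption; adjust the lists to match `bootstrap-secrets.env.example`.

```javascript
// Variable names come from the list above; the grouping is an assumption.
const REQUIRED = ['HCLOUD_TOKEN', 'SSH_PUBLIC_KEY_PATH', 'PUBLIC_DOMAIN'];
const DNS_PROVIDERS = {
  cloudflare: ['CLOUDFLARE_API_TOKEN', 'CLOUDFLARE_ZONE_ID'],
  porkbun: ['PORKBUN_API_KEY', 'PORKBUN_SECRET_API_KEY'],
};

// Hypothetical helper: reports missing required inputs and soft warnings.
function checkBootstrapEnv(env) {
  const isSet = (name) => typeof env[name] === 'string' && env[name].trim() !== '';
  const missing = REQUIRED.filter((name) => !isSet(name));
  const warnings = [];

  // Tailscale is the recommended admin path; admin CIDR blocks are the fallback.
  if (!isSet('TAILSCALE_AUTH_KEY') && !isSet('TF_ADMIN_CIDR_BLOCKS')) {
    warnings.push('neither TAILSCALE_AUTH_KEY nor TF_ADMIN_CIDR_BLOCKS is set');
  }

  // Flag a DNS provider that is only half configured.
  for (const [provider, names] of Object.entries(DNS_PROVIDERS)) {
    const present = names.filter(isSet);
    if (present.length > 0 && present.length < names.length) {
      warnings.push(`${provider} DNS automation is partially configured`);
    }
  }
  return { missing, warnings };
}

const report = checkBootstrapEnv(process.env);
if (report.missing.length > 0) {
  console.error(`missing required inputs: ${report.missing.join(', ')}`);
}
```

Note that `process.env` only sees variables that were exported by the shell, so this check is meaningful only if the env file exports its values.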
## First bootstrap sequence
Run the end-to-end bootstrap from repo root:
```bash
bash scripts/hetzner/bootstrap.sh
```
Current repo behavior of that script:
1. runs Terraform in `infra/terraform/hetzner`
2. optionally creates DNS records for the base, Forgejo, and registry hosts via Cloudflare or Porkbun
3. if configured, joins the node to Tailscale and prefers the Tailscale control-plane hostname for Kubernetes API access
4. waits for SSH and the k3s API endpoint to become ready
5. fetches the real k3s kubeconfig from the node and writes it to `.state/hetzner/kubeconfig.yaml`
6. renders the Hetzner single-node overlay from local operator inputs
7. creates registry pull/auth secrets
8. applies the Kubernetes bootstrap manifests
9. builds the app image locally and imports it into k3s on the node
10. performs the first rollout using the imported bootstrap image
Use the generated kubeconfig afterward:
```bash
export KUBECONFIG="$PWD/.state/hetzner/kubeconfig.yaml"
kubectl get nodes -o wide
kubectl get pods -A
kubectl -n unrip get deploy,pods
kubectl -n forgejo get deploy,pods,svc
```
## What is deployed into k3s
The repo-managed Kubernetes assets are under `deploy/k8s/`.
Current single-node target includes resources for:
- `unrip` workloads in namespace `unrip`
- Redpanda
- Forgejo
- Forgejo runner
- private registry
- ingress-nginx namespace/resources
- cert-manager namespace/resources
- ACME issuers and ingress definitions
- a bootstrap job for Redpanda topic creation
Shared platform namespaces:
- `forgejo`
- `registry`
- `ingress-nginx`
- `cert-manager`
Project-specific namespaces:
- `unrip`
- future projects should get their own namespace rather than sharing `unrip`
Important current-state nuance:
- the bootstrap script currently applies `deploy/k8s/base`
- the longer-term intended target is `deploy/k8s/overlays/hetzner-single-node`
## Executor persistence in k3s
The executor is stateful by design because it persists idempotency/execution tracking.
Current persistence boundary:
- app env uses `EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state`
- in Kubernetes, the executor deployment mounts storage at that path
- the Hetzner single-node overlay pins storage to the k3s `local-path` storage class
- cloud-init also prepares the host directory boundary for executor state on first boot
Operational meaning:
- executor state lives on node-backed storage in the single-node k3s environment
- if that PVC or underlying node storage is lost, duplicate-suppression history is lost too
- treat executor persistence as part of the minimal durable state of the cluster
## Failure recovery and operator checks
### If bootstrap fails before Terraform completes
Re-run after fixing the local input problem:
- missing token
- invalid CIDRs
- invalid SSH public key path
If the infrastructure must be torn down:
```bash
source scripts/hetzner/bootstrap-secrets.env
bash scripts/hetzner/destroy.sh
```
### If Terraform succeeds but Kubernetes is not ready
Check the Kubernetes API and cluster state from the workstation:
```bash
export KUBECONFIG="$PWD/.state/hetzner/kubeconfig.yaml"
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
```
Typical next checks:
- cloud-init may still be finishing
- k3s may still be starting
- a workload may be crash-looping due to missing secret values or image-delivery issues
### If workloads do not roll out
Inspect the affected namespace:
```bash
kubectl -n unrip get pods
kubectl -n unrip describe pod <pod-name>
kubectl -n unrip logs deploy/dummy-executor --tail=100
kubectl -n forgejo logs deploy/forgejo --tail=100
```
### If you need to recreate secrets
The workstation bootstrap creates these Secrets:
- `unrip/unrip-secrets`
- `forgejo/forgejo-secrets`
Verify them:
```bash
kubectl -n unrip get secret unrip-secrets
kubectl -n forgejo get secret forgejo-secrets
```
### Current known limitations
One important gap is already known:
- bootstrap and CI are not yet fully production-hardened, even though the first deploy path now fetches the real kubeconfig and imports the bootstrap image directly into k3s
Treat the current bootstrap as a repo-driven first-deploy path suitable for testing, with hardening still pending.
## Self-hosted CI handoff
After cluster bootstrap:
- open Forgejo at `https://${FORGEJO_DOMAIN}`
- seed or push this repo into Forgejo
- create Forgejo repository secrets:
  - `KUBECONFIG_B64`
  - `REGISTRY_USERNAME`
  - `REGISTRY_PASSWORD`
- create Forgejo repository variables:
  - `REGISTRY_HOST=${REGISTRY_DOMAIN}`
  - optional: `PROJECT_NAME=unrip`
  - optional: `PROJECT_NAMESPACE=unrip`
  - optional: `PROJECT_DEPLOYMENTS=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer`
- push to `main`
Routine application deploys then follow `.forgejo/workflows/deploy.yml`:
- build image as `REGISTRY_HOST/PROJECT_NAME:${GIT_SHA}`
- push to the private registry
- `kubectl set image` for each deployment listed in `PROJECT_DEPLOYMENTS` inside `PROJECT_NAMESPACE`
- wait for rollout
If project variables are omitted, the workflow defaults to the current repo project:
- `PROJECT_NAME=unrip`
- `PROJECT_NAMESPACE=unrip`
- `PROJECT_DEPLOYMENTS=near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer`
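
The defaulting and image-tag rules above can be sketched as a small planner that turns the Forgejo variables into the commands a deploy would run. This is an illustration, not the contents of `.forgejo/workflows/deploy.yml`. `planDeploy` is a hypothetical name, and the exact `docker` invocations and the `*=` container wildcard for `kubectl set image` are assumptions about how the workflow might be written.

```javascript
// Defaults as documented above for the current repo project.
const DEFAULTS = {
  PROJECT_NAME: 'unrip',
  PROJECT_NAMESPACE: 'unrip',
  PROJECT_DEPLOYMENTS: 'near-intents-ingest,dummy-reactor,dummy-executor,dummy-consumer',
};

// Hypothetical planner: resolves variables, then lists the deploy commands.
function planDeploy(vars, gitSha) {
  // Empty or missing variables fall back to the documented defaults.
  const pick = (name) => {
    const value = (vars[name] ?? '').trim();
    return value !== '' ? value : DEFAULTS[name];
  };

  const registryHost = pick('REGISTRY_HOST');
  if (!registryHost) throw new Error('REGISTRY_HOST is required');

  const namespace = pick('PROJECT_NAMESPACE');
  const image = `${registryHost}/${pick('PROJECT_NAME')}:${gitSha}`;
  const deployments = pick('PROJECT_DEPLOYMENTS')
    .split(',')
    .map((name) => name.trim())
    .filter(Boolean);

  return {
    image,
    build: [`docker build -t ${image} .`, `docker push ${image}`],
    // Assumption: '*=' updates every container in the deployment.
    setImage: deployments.map(
      (name) => `kubectl -n ${namespace} set image deployment/${name} '*=${image}'`),
    rollout: deployments.map(
      (name) => `kubectl -n ${namespace} rollout status deployment/${name}`),
  };
}

const plan = planDeploy({ REGISTRY_HOST: 'registry.example.com' }, 'abc123');
console.log(plan.image);
```

Tagging by commit SHA keeps every rollout traceable to a single revision and makes a rollback a matter of re-running `set image` with an older tag.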
Infrastructure changes remain Terraform-driven from the operator workstation unless and until that responsibility is also automated.
For the detailed operator runbooks, see:
- `docs/hetzner-k3s-bootstrap.md`
- `docs/hetzner-self-hosted-ci-runbook.md`
- `deploy/k8s/projects/README.md`
- `docs/next-session-architecture.md`
## Local development with Compose
Compose remains available for local development and debugging.
```bash
npm install
cp .env.example .env
# edit .env
docker compose build
docker compose up -d
```
Useful commands:
```bash
docker compose ps
docker compose logs -f
docker compose logs -f near-intents-ingest dummy-reactor dummy-executor dummy-consumer
docker compose restart dummy-executor
docker compose down
docker compose down -v
```
### Individual services
```bash
npm run near-intents:ingest
npm run dummy-reactor
npm run dummy-executor
npm run dummy-consumer
```
Optional pair filter:
```bash
npm run near-intents:ingest -- --pair 'asset_a->asset_b'
```
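
For orientation, a filter string of this shape could be parsed and applied roughly as follows. This is a sketch, not the code in `core/pair-filter.mjs`. The function names, the `asset_in`/`asset_out` field names, and the reading of `->` as "from asset to asset" are all assumptions.

```javascript
// Hypothetical parser for a filter of the form 'asset_a->asset_b'.
function parsePairFilter(spec) {
  const parts = String(spec).split('->').map((part) => part.trim());
  if (parts.length !== 2 || parts.some((part) => part === '')) {
    throw new Error(`invalid pair filter: ${spec}`);
  }
  return { from: parts[0], to: parts[1] };
}

// Hypothetical matcher; asset_in / asset_out are assumed field names.
function matchesPair(filter, demand) {
  return demand.asset_in === filter.from && demand.asset_out === filter.to;
}

const filter = parsePairFilter('asset_a->asset_b');
console.log(matchesPair(filter, { asset_in: 'asset_a', asset_out: 'asset_b' }));
```

Under this reading the filter is directional, so the reverse pair would not match.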
## Idempotent executor behavior
- every command has a `command_id`
- commands carry `idempotency_key` and `execution_key`
- executor persists state under `EXECUTOR_STATE_DIR`
- completed commands are skipped after restart or replay
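
The skip-on-replay behavior can be sketched with an in-memory store. This is only an illustration: the real executor persists its state under `EXECUTOR_STATE_DIR`, whereas the `Map` below stands in for that durable store. `createExecutor` and the record shape are hypothetical, and keying on `idempotency_key` alone is an assumption.

```javascript
// Hypothetical executor wrapper: runs a command at most once per idempotency key.
function createExecutor(store, executeTrade) {
  return function handle(command) {
    const key = command.idempotency_key;
    const previous = store.get(key);
    if (previous && previous.status === 'completed') {
      // Already done: return the recorded result instead of executing again.
      return { command_id: command.command_id, status: 'skipped', result: previous.result };
    }
    const result = executeTrade(command);
    store.set(key, { status: 'completed', command_id: command.command_id, result });
    return { command_id: command.command_id, status: 'executed', result };
  };
}

// The Map plays the role of the durable state that survives a restart.
const store = new Map();
let executions = 0;
const fakeTrade = () => ({ fill: ++executions });
const command = { command_id: 'cmd-1', idempotency_key: 'idem-1', execution_key: 'exec-1' };

const first = createExecutor(store, fakeTrade)(command);
const replay = createExecutor(store, fakeTrade)(command); // new instance simulates a restart
console.log(first.status, replay.status, executions); // logs: executed skipped 1
```

The guarantee depends entirely on the store outliving the process. With a store that is lost on restart, the second call would execute the trade again.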
## Env
```env
NEAR_INTENTS_API_KEY=your_solver_jwt
NEAR_INTENTS_WS_URL=wss://solver-relay-v2.chaindefuser.com/ws
KAFKA_BROKERS=redpanda:9092
KAFKA_CLIENT_ID=unrip
KAFKA_TOPIC_RAW_NEAR_INTENTS_QUOTE=raw.near_intents.quote
KAFKA_TOPIC_NORM_SWAP_DEMAND=norm.swap_demand
KAFKA_TOPIC_CMD_EXECUTE_TRADE=cmd.execute_trade
KAFKA_TOPIC_EXEC_TRADE_RESULT=exec.trade_result
KAFKA_CONSUMER_GROUP_DUMMY=dummy-reactor-v1
KAFKA_CONSUMER_GROUP_EXECUTOR=dummy-executor-v1
EXECUTOR_STATE_DIR=/var/lib/unrip/executor-state
```
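
As a rough guide to how these variables might be consumed, the sketch below reads them into a config object. It is not the repo's `lib/config.mjs`. `loadConfig` and the shape of the returned object are hypothetical, and using the sample values above as defaults is an assumption.

```javascript
// Hypothetical loader: required values throw, others fall back to the sample values.
function loadConfig(env) {
  const read = (name, fallback) => {
    const value = (env[name] ?? '').trim();
    if (value !== '') return value;
    if (fallback === undefined) throw new Error(`missing required env var ${name}`);
    return fallback;
  };

  return {
    apiKey: read('NEAR_INTENTS_API_KEY'), // no default: must be provided
    wsUrl: read('NEAR_INTENTS_WS_URL', 'wss://solver-relay-v2.chaindefuser.com/ws'),
    // KAFKA_BROKERS may hold a comma-separated list of host:port entries.
    brokers: read('KAFKA_BROKERS', 'redpanda:9092')
      .split(',')
      .map((broker) => broker.trim())
      .filter(Boolean),
    clientId: read('KAFKA_CLIENT_ID', 'unrip'),
    topics: {
      rawQuote: read('KAFKA_TOPIC_RAW_NEAR_INTENTS_QUOTE', 'raw.near_intents.quote'),
      swapDemand: read('KAFKA_TOPIC_NORM_SWAP_DEMAND', 'norm.swap_demand'),
      executeTrade: read('KAFKA_TOPIC_CMD_EXECUTE_TRADE', 'cmd.execute_trade'),
      tradeResult: read('KAFKA_TOPIC_EXEC_TRADE_RESULT', 'exec.trade_result'),
    },
    executorStateDir: read('EXECUTOR_STATE_DIR', '/var/lib/unrip/executor-state'),
  };
}

const config = loadConfig({ NEAR_INTENTS_API_KEY: 'your_solver_jwt', KAFKA_BROKERS: 'a:9092, b:9092' });
console.log(config.brokers);
```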