Trading System Architecture Notes for Next Session

Objective

Build the first real version of the trading system as an event-driven, multi-service architecture.

Current implemented seed:

  • NEAR Intents ingest in Node.js
  • Kafka-compatible bus usage via kafkajs
  • dummy reactor / executor / result consumer loop

The next session should continue from this architecture and must not revert to a monolith, a local-only script, or a TUI.


Core Architecture

All components are independent services. They communicate only through a central Kafka-compatible bus (Redpanda first, Kafka-compatible by design).

Service classes

  • venue ingestors
  • normalizers
  • reactors / decision engines
  • executors
  • downstream consumers / monitors / archivers / replay tools

Service communication rule

No direct service-to-service calls for core trading flow. Use bus topics only.


Venue-Oriented Structure

The system should be organized by venue. Each venue can have different:

  • ingest/feed mechanics
  • normalization logic
  • execution mechanics

Per-venue responsibilities

  • ingest = venue-native intake
  • normalize = convert venue-native payload into canonical internal event
  • execute = venue-specific action logic

Planned shape:

src/
  apps/
  bus/
  core/
  venues/
    near-intents/
      ingest
      normalize
      execute

Bus Choice

Use Redpanda first, but stay fully Kafka-compatible.

Reason

Requirements:

  • high throughput
  • low latency
  • retention
  • replay
  • multiple producers/consumers
  • independent services
  • future scale-out
  • multi-language compatibility

Constraint

Do not use broker-specific features that make migration to Kafka difficult. Use standard Kafka clients and semantics.


Data Model Principles

Kafka/Redpanda is the operational event backbone.

Event model rules

  • append-only
  • immutable events
  • versioned schemas
  • raw and normalized events both preserved

Every event should include

  • event_id
  • event_type
  • venue
  • observed_at / ingested_at
  • schema_version
  • payload
  • optionally raw/original payload where appropriate
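
The field list above can be sketched as a small envelope helper. This is only an illustration: the `makeEvent` and `missingFields` names, the `raw` field name, and the choice to treat `observed_at` and `ingested_at` as two separate required fields are assumptions, not settled schema decisions.

```javascript
const { randomUUID } = require("node:crypto");

// Envelope fields taken from the list above. Treating observed_at and
// ingested_at as two separate required fields is an assumption.
const REQUIRED_FIELDS = [
  "event_id", "event_type", "venue",
  "observed_at", "ingested_at", "schema_version", "payload",
];

// Build an event envelope. `raw` is a hypothetical name for the optional
// venue-native payload.
function makeEvent({ eventType, venue, observedAt, schemaVersion, payload, raw }) {
  const event = {
    event_id: randomUUID(),
    event_type: eventType,
    venue,
    observed_at: observedAt,
    ingested_at: new Date().toISOString(),
    schema_version: schemaVersion,
    payload,
  };
  if (raw !== undefined) event.raw = raw;
  // Shallow freeze only: nested payload objects are not frozen here.
  return Object.freeze(event);
}

// Return the names of required fields that are absent from an event.
function missingFields(event) {
  return REQUIRED_FIELDS.filter((f) => event[f] === undefined || event[f] === null);
}
```

A producer could call `missingFields` before publishing and refuse to emit an event that is missing any envelope field.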

Raw vs normalized

Keep both.

  • raw topics = exact venue-native source truth
  • normalized topics = canonical research/trading inputs

This is required for:

  • replay
  • debugging
  • future backtesting
  • future Spark/batch processing

Current/Planned Topic Flow

Minimal 3-stage pipeline:

  1. ingest publishes normalized demand
  2. reactor publishes trade command
  3. executor publishes trade result

Topic classes

  • raw.* = raw venue-native events
  • norm.* = canonical normalized market events
  • cmd.* = execution commands
  • exec.* = execution outcomes
  • signal.* = reactor outputs before the command stage (later, only if needed)

Current minimal topics

  • norm.swap_demand
  • cmd.execute_trade
  • exec.trade_result
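
To make the three-stage flow concrete, here is a minimal in-memory sketch of the loop over the three topics above. `InMemoryBus` is a hypothetical stand-in for the real broker, and the event fields shown are placeholders; the actual services would talk to Redpanda through a Kafka client instead.

```javascript
// Hypothetical in-memory stand-in for the bus: topic name -> handlers.
class InMemoryBus {
  constructor() {
    this.handlers = new Map();
    this.log = []; // append-only record of everything published
  }
  subscribe(topic, handler) {
    if (!this.handlers.has(topic)) this.handlers.set(topic, []);
    this.handlers.get(topic).push(handler);
  }
  publish(topic, event) {
    this.log.push({ topic, event });
    for (const handler of this.handlers.get(topic) ?? []) handler(event);
  }
}

const bus = new InMemoryBus();
const results = [];

// Dummy reactor: turns each normalized demand into a trade command.
bus.subscribe("norm.swap_demand", (demand) => {
  bus.publish("cmd.execute_trade", {
    command_id: `cmd-${demand.event_id}`,
    source_event_id: demand.event_id,
  });
});

// Dummy executor: pretends to execute and reports a result.
bus.subscribe("cmd.execute_trade", (command) => {
  bus.publish("exec.trade_result", {
    command_id: command.command_id,
    status: "simulated",
  });
});

// Downstream result consumer.
bus.subscribe("exec.trade_result", (result) => results.push(result));

// Ingest publishes one normalized demand event, which drives the whole loop.
bus.publish("norm.swap_demand", { event_id: "demand-1", venue: "near-intents" });
```

No stage calls another stage directly; each one only subscribes to a topic and publishes to the next, which is the property the real services must keep.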

NEAR Intents

The NEAR Intents source currently feeds quote-demand-style events from the solver-bus websocket. It is one venue ingest source, not the whole trading system.


Execution Safety / Zero Downtime Requirements

This is critical.

Constraint

Multiple executors must never duplicate the same trade/action during deploys, restarts, or rebalances.

Must-have rules

  1. Every execution command must carry a unique command_id
  2. Commands must include deterministic idempotency information
  3. Executors must be idempotent
  4. Each executor role must have its own consumer group, shared by all instances of that role
  5. Commands should be partitioned by a stable execution key where ordering matters
  6. Executor state must be persisted durably enough to detect duplicate command execution
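
Rules 1, 2, and 5 can be sketched together as follows. The inputs to the idempotency key (`venue`, `sourceEventId`, `action`) and the function names are assumptions for illustration; the real key rules are still to be defined. The returned `{ key, value }` shape matches what a kafkajs producer accepts in its `messages` array.

```javascript
const { createHash } = require("node:crypto");

// Deterministic idempotency key: the same logical action always hashes to
// the same key, no matter which reactor instance emitted the command.
// The chosen inputs are an assumption, not a decided rule.
function idempotencyKey({ venue, sourceEventId, action }) {
  return createHash("sha256")
    .update(JSON.stringify([venue, sourceEventId, action]))
    .digest("hex");
}

// Build a bus message for an execution command. The message key is the
// stable execution key, so commands that share it land on one partition
// and keep their relative order.
function buildCommandMessage({ commandId, venue, sourceEventId, action, executionKey }) {
  return {
    key: executionKey,
    value: JSON.stringify({
      command_id: commandId,
      idempotency_key: idempotencyKey({ venue, sourceEventId, action }),
      venue,
      source_event_id: sourceEventId,
      action,
    }),
  };
}
```

Keeping `command_id` (unique per emitted command) separate from `idempotency_key` (stable per logical action) lets an executor recognize the same action even if a reactor emits it twice under different command IDs.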

Kafka consumer groups are not sufficient alone

Consumer groups assign partitions to consumers, but they do not prevent duplicate processing. A command whose offset was not yet committed can be redelivered after a restart or rebalance, so idempotency is still required.
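
A minimal sketch of why the executor itself has to deduplicate: the same command is delivered twice, as could happen after a rebalance, and only the first delivery causes a side effect. The in-memory `Set` and the `makeExecutor` name are illustrative assumptions; real executor state must be durable.

```javascript
// Wrap a side-effecting executeTrade function so that repeated deliveries
// of the same idempotency key are skipped. In-memory only, for illustration.
function makeExecutor(executeTrade) {
  const seen = new Set();
  return function handle(command) {
    if (seen.has(command.idempotency_key)) return "skipped_duplicate";
    seen.add(command.idempotency_key); // record the claim before acting
    executeTrade(command);
    return "executed";
  };
}

let sideEffects = 0;
const handle = makeExecutor(() => { sideEffects += 1; });

const command = { command_id: "cmd-1", idempotency_key: "near-intents:demand-1:swap" };
const first = handle(command);  // normal delivery
const second = handle(command); // redelivery after a simulated rebalance
```

Recording the key before executing means a crash between the two steps leaves the command claimed but not executed, so recovery logic has to decide what to do with such entries rather than blindly retrying.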

Rolling updates / zero downtime

Executors must support:

  • graceful shutdown
  • stop taking new work before exit
  • finish or safely recover in-flight work
  • commit offsets only after safe execution state transition
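
The shutdown requirements above can be sketched as a worker loop. `ExecutorWorker` and its three callbacks (`fetchNext`, `execute`, `commitOffset`) are hypothetical names supplied by the caller; this is not the API of any Kafka client, only the ordering the loop should preserve.

```javascript
// Worker loop for one executor instance. fetchNext, execute and
// commitOffset are hypothetical async callbacks supplied by the caller.
class ExecutorWorker {
  constructor({ fetchNext, execute, commitOffset }) {
    this.fetchNext = fetchNext;
    this.execute = execute;
    this.commitOffset = commitOffset;
    this.stopping = false;
    this.loop = Promise.resolve();
  }

  start() {
    this.loop = this.run();
  }

  async run() {
    while (!this.stopping) {
      const message = await this.fetchNext();
      if (message === null) break;             // source exhausted
      await this.execute(message);             // 1. safe execution state transition
      await this.commitOffset(message.offset); // 2. only then commit the offset
    }
  }

  // Stop taking new work, then wait for the in-flight message to finish.
  async stop() {
    this.stopping = true;
    await this.loop;
  }
}
```

In a deployed service, `stop()` would typically be wired to the termination signal, for example `process.once("SIGTERM", () => worker.stop())`, so a rolling update lets the current command finish before the pod exits.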

Persistence implication

Executor idempotency state is not optional metadata. It is operational state that must survive pod restarts.

Current single-node k3s direction:

  • executor state lives at /var/lib/unrip/executor-state
  • Kubernetes mounts that path through persistent storage
  • the Hetzner single-node overlay currently targets k3s local-path storage
  • node loss without storage migration means duplicate-suppression history is lost
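
One possible shape for durable duplicate-suppression state is a file per idempotency key under the mounted state directory. This is a sketch under assumptions: the `FileIdempotencyStore` class, the one-file-per-key layout, and the restriction of keys to filename-safe characters (for example a hex digest) are illustrative choices, not the implemented design.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical file-backed store: one JSON file per idempotency key.
class FileIdempotencyStore {
  constructor(dir) {
    this.dir = dir;
    fs.mkdirSync(dir, { recursive: true });
  }

  fileFor(key) {
    // Assumption: keys are filename-safe, e.g. a hex digest.
    if (!/^[A-Za-z0-9_-]+$/.test(key)) throw new Error(`unsafe key: ${key}`);
    return path.join(this.dir, `${key}.json`);
  }

  // Claim a key. Returns true on first claim, false if already claimed.
  // The "wx" flag makes creation fail when the file already exists.
  claim(key, record) {
    let fd;
    try {
      fd = fs.openSync(this.fileFor(key), "wx");
    } catch (err) {
      if (err.code === "EEXIST") return false;
      throw err;
    }
    try {
      fs.writeSync(fd, JSON.stringify(record));
      fs.fsyncSync(fd); // flush to disk before reporting success
    } finally {
      fs.closeSync(fd);
    }
    return true;
  }

  // Read back the record for a key, or null if the key was never claimed.
  get(key) {
    try {
      return JSON.parse(fs.readFileSync(this.fileFor(key), "utf8"));
    } catch (err) {
      if (err.code === "ENOENT") return null;
      throw err;
    }
  }
}
```

An executor pod would construct the store with the mounted path, for example `new FileIdempotencyStore("/var/lib/unrip/executor-state")`, so a restarted pod on the same node sees earlier claims.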

Deployment Target

First deployment phase

  • single machine on Hetzner
  • but still multiple independent services
  • no architecture shortcuts that prevent future clustering

Future target

  • split across multiple machines
  • cluster capable
  • fault tolerant
  • multi-node
  • zero-downtime deploys

Deployment rules from day 1

  • every component is a separate container/service
  • all config via env/config files
  • communication over network/bus only
  • persistent components use mounted volumes/PVCs
  • no manual SSH-based operational workflow
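
As a sketch of the "all config via env" rule, a service could derive its bus settings from environment variables at startup. The variable names `KAFKA_BROKERS` and `SERVICE_NAME`, the default client ID, and the `loadBusConfig` function are assumptions for illustration; the returned `{ clientId, brokers }` object is the shape the kafkajs `Kafka` constructor takes.

```javascript
// Read bus settings from the environment. KAFKA_BROKERS and SERVICE_NAME
// are hypothetical variable names.
function loadBusConfig(env = process.env) {
  const brokers = (env.KAFKA_BROKERS ?? "")
    .split(",")
    .map((entry) => entry.trim())
    .filter(Boolean);
  if (brokers.length === 0) {
    throw new Error("KAFKA_BROKERS must list at least one host:port");
  }
  return {
    clientId: env.SERVICE_NAME ?? "unrip-service",
    brokers,
  };
}
```

Failing fast on a missing broker list keeps a misconfigured container from starting silently, and the same code runs unchanged under Compose and Kubernetes because only the environment differs.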

Infrastructure / Ops Direction

Target environment:

  • Hetzner
  • self-hosted CI/CD
  • provisioning by code
  • no GitHub dependency

Desired stack direction

  • Terraform for Hetzner provisioning
  • Kubernetes-oriented target from the start
  • self-hosted Git + CI/CD
  • Kafka-compatible broker
  • object storage later for long-term archived event history

Single-node first, future cluster later

The first version may run on one machine, but deployment structure should already match a future distributed system.

Current canonical operator path

After the repo split, the primary deployment path is shared across two repos:

  • the separate platform repo owns Hetzner/OpenTofu, cloud-init, k3s bootstrap, Forgejo, the runner, and the shared registry
  • this repo owns the app image, app manifests, and the app rollout workflow

Phase 0: platform bootstrap (platform repo)

  1. A local operator workstation prepares platform bootstrap secrets in the platform repo.
  2. The operator runs the platform repo bootstrap flow.
  3. Terraform provisions the server, firewall, network, and cloud-init user-data.
  4. cloud-init installs k3s automatically and prepares persistence directories plus bootstrap artifacts.
  5. The workstation writes the public and in-cluster kubeconfigs.
  6. The workstation injects the shared platform secrets and applies the shared platform manifests.
  7. Forgejo and the runner come online in the cluster.

Phase 1: app repo handoff (this repo)

  1. The operator creates this app repo's Kubernetes secrets such as unrip-secrets and unrip-registry-creds.
  2. This repo is pushed or mirrored into Forgejo.
  3. On push to main, the Forgejo runner applies deploy/k8s/base, builds the app image in-cluster with Kaniko, and updates the unrip deployments.
  4. Terraform remains the infra mutation entrypoint, but app rollout is owned by this repo's workflow.

Failure-recovery expectation

The overall bootstrap path must be rerunnable from the workstation. Docs should keep treating recovery as:

  • fix local secrets/inputs
  • rerun the platform repo bootstrap script
  • inspect the cluster with the generated kubeconfig
  • destroy/recreate infra from the platform repo only when required

Current repo-state caveats

The direction is clear, but the implementation still has caveats:

  • this repo now assumes a pre-bootstrapped platform repo/cluster
  • kubeconfig/auth handling is only as strong as the external Forgejo secret management
  • base manifests still carry a placeholder bootstrap image until CI rolls the first real tag
  • app secrets and registry pull credentials still require operator setup
  • local Compose remains in the repo for development/testing, not as the canonical production path

Minimal repo layout target

deploy/
  hetzner/
    README.md
  k8s/
    base/
    overlays/
      hetzner-single-node/
infra/
  terraform/
    hetzner/

Guidelines:

  • infra/terraform/hetzner/ owns VM, firewall, networking, and cloud-init rendering
  • deploy/k8s/ owns Kubernetes-native manifests and overlays
  • app runtime manifests should remain Kubernetes-native so they can later move from single-node k3s to a larger cluster with minimal rewrite
  • secret material must not live in git in plaintext; bootstrap docs should describe workstation-driven injection or generated secret references

Local Development / Testing Direction

Do not assume manual multi-terminal operation long term.

Requirement

An orchestrated local/dev runtime is needed.

Local dev should preserve real boundaries

  • separate services
  • broker present
  • env/config driven
  • same event flow as production

Current local/dev answer

Compose is still acceptable for:

  • developer laptops
  • fast local iteration
  • debugging event flow
  • validating container boundaries before Kubernetes rollout

But Compose should remain explicitly secondary to the repo-driven Hetzner + k3s path for production operations.

Testing layers

  1. unit tests for normalizers / schema logic / helpers
  2. integration tests against Kafka-compatible broker
  3. replay/simulation tests using retained event streams
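
The third layer can be sketched as a replay over a retained list of events. This assumes the reactor's decision logic is kept as a pure function of its input event; `dummyReactor`, `replay`, and the sample event fields are hypothetical.

```javascript
// Hypothetical pure reactor: normalized demand in, command (or null) out.
function dummyReactor(demand) {
  if (demand.event_type !== "swap_demand") return null;
  return {
    command_id: `cmd-${demand.event_id}`,
    source_event_id: demand.event_id,
    venue: demand.venue,
  };
}

// Replay retained events through a reactor and collect emitted commands.
function replay(events, reactor) {
  return events.map(reactor).filter((command) => command !== null);
}

// Stand-in for events read back from a retained topic.
const retained = [
  { event_id: "demand-1", event_type: "swap_demand", venue: "near-intents" },
  { event_id: "other-1", event_type: "heartbeat", venue: "near-intents" },
  { event_id: "demand-2", event_type: "swap_demand", venue: "near-intents" },
];

const firstRun = replay(retained, dummyReactor);
const secondRun = replay(retained, dummyReactor);
```

Because events are immutable and the reactor is pure, two replays of the same stream should produce identical commands, which is the property a replay test asserts.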

Spark Readiness

Do not add Spark now, but keep the system ready for Spark later by:

  • preserving raw events
  • preserving normalized events
  • using immutable append-only event streams
  • versioning schemas
  • separating operational event log from future analytical processing

Spark later would be for:

  • large-scale backtesting
  • feature generation
  • archive processing
  • multi-venue analytics

Immediate Next Engineering Tasks

Next session should focus on the following.

1. Clean current repo structure

Remove duplicate/legacy paths and keep one canonical structure only.

2. Keep/complete the 3-stage loop

  • NEAR Intents ingest -> norm.swap_demand
  • dummy reactor -> cmd.execute_trade
  • dummy executor -> exec.trade_result
  • downstream result consumer

3. Define canonical schemas

Define concrete event schemas for:

  • normalized swap demand
  • execute trade command
  • trade result

4. Define executor idempotency model

Specify:

  • command_id
  • idempotency key rules
  • execution state transition rules
  • duplicate handling rules
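
As a starting point for that specification, the transition and duplicate-handling rules could be written as a small table. The state names (`received`, `started`, `succeeded`, `failed`) and the actions (`execute`, `reconcile`, `skip`) are assumptions to be replaced by whatever the model ends up defining.

```javascript
// Hypothetical execution states and the transitions allowed between them.
const TRANSITIONS = new Map([
  ["received", ["started"]],
  ["started", ["succeeded", "failed"]],
  ["succeeded", []],
  ["failed", []],
]);

// Move to the next state, rejecting anything the table does not allow.
function transition(current, next) {
  const allowed = TRANSITIONS.get(current) ?? [];
  if (!allowed.includes(next)) {
    throw new Error(`illegal transition ${current} -> ${next}`);
  }
  return next;
}

// Decide what to do when a command arrives, given the state already
// stored for its idempotency key (undefined if the key is new).
function onDelivery(priorState) {
  switch (priorState) {
    case undefined:
    case "received":
      return "execute";
    case "started":
      // Outcome unknown after a crash: check the venue before retrying.
      return "reconcile";
    case "succeeded":
    case "failed":
      return "skip";
    default:
      throw new Error(`unknown state ${priorState}`);
  }
}
```

The `started` case is the one that matters for safety: a redelivered command found in that state may or may not have reached the venue, so it should be reconciled rather than executed again.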

5. Move toward production-shaped deployment

Design for:

  • one service per container
  • single-node deployment first
  • future multi-node split without app rewrite

6. Harden provisioning/deployment path

Next infra work should continue improving:

  • Hetzner provisioning by code
  • workstation bootstrap rerunnability
  • self-hosted CI/CD handoff
  • registry-native image delivery
  • overlay convergence for the Hetzner single-node target

Status update:

  • minimal Terraform exists under infra/terraform/hetzner
  • first boot is cloud-init driven and installs k3s automatically
  • bootstrap now starts from a local operator workstation rather than manual host login
  • Kubernetes assets exist under deploy/k8s
  • executor persistence boundaries are explicit for single-node k3s
  • self-hosted CI handoff is documented, but still requires follow-up hardening

Non-Goals for Next Session

  • no dashboards
  • no UI/TUI
  • no monolith convenience architecture
  • no SQLite-first system of record
  • no direct coupling between ingest, decision, and execution
  • no temporary local-only shortcuts that block future cluster deployment

Guiding Principle

Build the single-node first version as if it were already a distributed system:

  • separate services
  • durable event bus
  • replayable events
  • explicit contracts
  • idempotent execution
  • production-compatible deployment boundaries
  • bootstrappable from scratch without manual SSH-based host setup