philipp/unrip

philipp 24a5002d1d Reduce ingest scope and bootstrap app deploy

2026-04-01 00:09:10 +02:00

Trading System Architecture Notes for Next Session

Objective

Build the first real version of the trading system as an event-driven, multi-service architecture.

Current implemented seed:

NEAR Intents ingest in Node.js
Kafka-compatible bus usage via kafkajs
dummy reactor / executor / result consumer loop

Next session should continue from this architecture, not revert to a monolith, local-only script, or TUI.

Core Architecture

All components are independent services. They communicate only through a central Kafka-compatible bus (Redpanda first, Kafka-compatible by design).

Service classes

venue ingestors
normalizers
reactors / decision engines
executors
downstream consumers / monitors / archivers / replay tools

Service communication rule

No direct service-to-service calls for core trading flow. Use bus topics only.

Venue-Oriented Structure

The system should be organized by venue. Each venue can have different:

ingest/feed mechanics
normalization logic
execution mechanics

Per-venue responsibilities

ingest = venue-native intake
normalize = convert venue-native payload into canonical internal event
execute = venue-specific action logic

Planned shape:

src/
  apps/
  bus/
  core/
  venues/
    near-intents/
      ingest
      normalize
      execute
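A minimal sketch of how a per-venue module could plug into a shared registry, assuming each venue exposes a `normalize` function. The raw field names (`asset_in`, `asset_out`, `amount_in`) are placeholders, not the real solver-bus wire format, and `normalizeFor` is a hypothetical helper.

```javascript
// Hypothetical venue module: only the normalize step is sketched here.
const nearIntentsVenue = {
  venue: "near-intents",
  // Convert a venue-native payload into a canonical internal event body.
  normalize(rawMsg) {
    return {
      event_type: "swap_demand",
      venue: "near-intents",
      payload: {
        asset_in: rawMsg.asset_in,
        asset_out: rawMsg.asset_out,
        // Amounts are kept as strings to avoid float rounding.
        amount_in: String(rawMsg.amount_in),
      },
      raw: rawMsg, // original payload preserved alongside the canonical one
    };
  },
};

// Registry keyed by venue name; new venues are added here.
const venueRegistry = new Map([[nearIntentsVenue.venue, nearIntentsVenue]]);

function normalizeFor(venue, rawMsg) {
  const mod = venueRegistry.get(venue);
  if (!mod) throw new Error(`unknown venue: ${venue}`);
  return mod.normalize(rawMsg);
}
```

Keeping venue logic behind one lookup means core code never branches on venue names directly.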

Bus Choice

Use Redpanda first, but stay fully Kafka-compatible.

Reason

Requirements:

high throughput
low latency
retention
replay
multiple producers/consumers
independent services
future scale-out
multi-language compatibility

Constraint

Do not use broker-specific features that make migration to Kafka difficult. Use standard Kafka clients and semantics.

Data Model Principles

Kafka/Redpanda is the operational event backbone.

Event model rules

append-only
immutable events
versioned schemas
raw and normalized events both preserved

Every event should include

event_id
event_type
venue
observed_at / ingested_at
schema_version
payload
optionally raw/original payload where appropriate
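The envelope fields above could be assembled as follows. This is a sketch only: `buildEvent`, its argument names, and the choice of `1` as the schema version are assumptions, and the caller is expected to supply the event ID.

```javascript
// Hypothetical envelope builder; field names follow the list above.
const SCHEMA_VERSION = 1;

function buildEvent({
  eventId,
  eventType,
  venue,
  observedAt,
  payload,
  raw,
  ingestedAt = new Date().toISOString(),
}) {
  if (!eventId || !eventType || !venue) {
    throw new Error("event_id, event_type and venue are required");
  }
  const event = {
    event_id: eventId,
    event_type: eventType,
    venue,
    observed_at: observedAt,
    ingested_at: ingestedAt,
    schema_version: SCHEMA_VERSION,
    payload,
  };
  // The raw/original payload is attached only where appropriate.
  if (raw !== undefined) event.raw = raw;
  // Shallow freeze: the envelope itself cannot be modified after creation.
  return Object.freeze(event);
}
```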

Raw vs normalized

Keep both.

raw topics = exact venue-native source truth
normalized topics = canonical research/trading inputs

This is required for:

replay
debugging
future backtesting
future Spark/batch processing

Current/Planned Topic Flow

Minimal 3-stage pipeline:

ingest publishes normalized demand
reactor publishes trade command
executor publishes trade result
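The reactor stage can be sketched as a pure function from a normalized demand event to a command message. The returned object has the `{ topic, messages: [{ key, value }] }` shape that kafkajs `producer.send` accepts; the command fields, the `cmd:` ID scheme, and keying by venue are assumptions for illustration.

```javascript
// Hypothetical dummy reactor: one command per demand event.
function reactToDemand(demandEvent) {
  const command = {
    // Derived from the source event so a replayed event yields the same ID.
    command_id: `cmd:${demandEvent.event_id}`,
    event_type: "execute_trade",
    venue: demandEvent.venue,
    source_event_id: demandEvent.event_id,
    payload: { action: "noop" }, // the dummy loop does not trade
  };
  return {
    topic: "cmd.execute_trade",
    // The message key decides the partition; venue is a placeholder key.
    messages: [{ key: demandEvent.venue, value: JSON.stringify(command) }],
  };
}
```

Because the function has no side effects, the same code can run against live events and replayed ones.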

Topic classes

raw.* = raw venue-native events
norm.* = canonical normalized market events
cmd.* = execution commands
exec.* = execution outcomes
signal.* = reactor outputs before the command stage (later, if needed)

Current minimal topics

norm.swap_demand
cmd.execute_trade
exec.trade_result
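One way to keep topic names in a single place and to check that a topic belongs to a known class. The constant and function names are hypothetical; the topic strings and prefixes come from the lists above.

```javascript
// Current minimal topics, named once so services do not hard-code strings.
const TOPICS = Object.freeze({
  swapDemand: "norm.swap_demand",
  executeTrade: "cmd.execute_trade",
  tradeResult: "exec.trade_result",
});

// Known topic classes; signal is reserved for later use.
const TOPIC_CLASSES = ["raw", "norm", "cmd", "exec", "signal"];

// Return the class prefix of a topic, or throw if it is not a known class.
function topicClass(topic) {
  const dot = topic.indexOf(".");
  const prefix = dot === -1 ? topic : topic.slice(0, dot);
  if (dot === -1 || dot === topic.length - 1 || !TOPIC_CLASSES.includes(prefix)) {
    throw new Error(`topic does not match a known class: ${topic}`);
  }
  return prefix;
}
```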

NEAR Intents

The NEAR Intents source currently feeds quote-demand-style events from the solver-bus websocket. It is one venue ingest source, not the whole trading system.

Execution Safety / Zero Downtime Requirements

This is critical.

Constraint

Multiple executors must never duplicate the same trade/action during deploys, restarts, or rebalances.

Must-have rules

Every execution command must carry a unique command_id
Commands must include deterministic idempotency information
Executors must be idempotent
Executors of the same role must share one consumer group (one group per executor role)
Commands should be partitioned by a stable execution key where ordering matters
Executor state must be persisted durably enough to detect duplicate command execution
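A minimal sketch of duplicate suppression under the rules above. The in-memory log stands in for the durable executor state, and `idempotencyKey`, `InMemoryExecutionLog`, and `executeOnce` are hypothetical names, not existing code in this repo.

```javascript
// Deterministic key: the same command always maps to the same key.
function idempotencyKey(command) {
  return [command.venue, command.command_id].join("|");
}

// Stand-in for durable executor state; a real store must survive restarts.
class InMemoryExecutionLog {
  constructor() {
    this.results = new Map();
  }
  get(key) {
    return this.results.get(key);
  }
  put(key, result) {
    this.results.set(key, result);
  }
}

// Run performTrade at most once per idempotency key seen by this log.
function executeOnce(command, log, performTrade) {
  const key = idempotencyKey(command);
  const previous = log.get(key);
  if (previous !== undefined) {
    // Already handled: return the recorded result instead of acting again.
    return { ...previous, duplicate: true };
  }
  const result = {
    command_id: command.command_id,
    status: "done",
    detail: performTrade(command),
  };
  log.put(key, result);
  return { ...result, duplicate: false };
}
```

This sketch leaves a gap if the process dies between `performTrade` and `log.put`; a durable version would record an in-progress marker before acting and recover from it on restart.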

Kafka consumer groups are not sufficient alone

Consumer groups assign partitions to consumers, but delivery is at-least-once: a command can be redelivered after a restart or rebalance if its offset was not yet committed. Idempotency is still required.

Rolling updates / zero downtime

Executors must support:

graceful shutdown
stop taking new work before exit
finish or safely recover in-flight work
commit offsets only after the execution state transition has been safely recorded
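The commit-after-processing rule can be sketched as a message handler that commits only once the command handler has returned. The handler takes any object with a `commitOffsets` method; with kafkajs that would be the consumer, run with `autoCommit: false`. `makeEachMessage` and `handleCommand` are hypothetical names.

```javascript
// Build an eachMessage-style handler that commits after successful handling.
function makeEachMessage(consumer, handleCommand) {
  return async ({ topic, partition, message }) => {
    const command = JSON.parse(message.value.toString());
    // handleCommand is expected to record its state transition durably
    // before it resolves. If it throws, nothing is committed.
    await handleCommand(command);
    // The committed offset is the next offset to read, hence +1.
    const next = (BigInt(message.offset) + 1n).toString();
    await consumer.commitOffsets([{ topic, partition, offset: next }]);
  };
}
```

If the handler fails, the offset stays uncommitted and the command is redelivered, which is why the executor itself must still be idempotent. Shutdown handling (stop taking new work on SIGTERM, then disconnect) is left out of this sketch.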

Persistence implication

Executor idempotency state is not optional metadata. It is operational state that must survive pod restarts.

Current single-node k3s direction:

executor state lives at /var/lib/unrip/executor-state
Kubernetes mounts that path through persistent storage
the Hetzner single-node overlay currently targets k3s local-path storage
node loss without storage migration means duplicate-suppression history is lost

Deployment Target

First deployment phase

single machine on Hetzner
but still multiple independent services
no architecture shortcuts that prevent future clustering

Future target

split across multiple machines
cluster-capable
fault-tolerant
multi-node
zero-downtime deploys

Deployment rules from day 1

every component is a separate container/service
all config via env/config files
communication over network/bus only
persistent components use mounted volumes/PVCs
no manual SSH-based operational workflow

Infrastructure / Ops Direction

Target environment:

Hetzner
self-hosted CI/CD
provisioning by code
no GitHub dependency

Desired stack direction

Terraform for Hetzner provisioning
Kubernetes-oriented target from the start
self-hosted Git + CI/CD
Kafka-compatible broker
object storage later for long-term archived event history

Single-node first, future cluster later

The first version may run on one machine, but deployment structure should already match a future distributed system.

Current canonical operator path

After the repo split, the primary deployment path is shared across two repos:

the separate platform repo owns Hetzner/OpenTofu, cloud-init, k3s bootstrap, Forgejo, the runner, and the shared registry
this repo owns the app image, app manifests, and the app rollout workflow

Phase 0: platform bootstrap (platform repo)

A local operator workstation prepares platform bootstrap secrets in the platform repo.
The operator runs the platform repo bootstrap flow.
Terraform provisions the server, firewall, network, and cloud-init user-data.
cloud-init installs k3s automatically and prepares persistence directories plus bootstrap artifacts.
The workstation writes the public and in-cluster kubeconfigs.
The workstation injects the shared platform secrets and applies the shared platform manifests.
Forgejo and the runner come online in the cluster.

Phase 1: app repo handoff (this repo)

The operator creates this app repo's Kubernetes secrets such as unrip-secrets and unrip-registry-creds.
This repo is pushed or mirrored into Forgejo.
On push to main, the Forgejo runner applies deploy/k8s/base, builds the app image in-cluster with Kaniko, and updates the unrip deployments.
Terraform remains the infra mutation entrypoint, but app rollout is owned by this repo's workflow.

Failure-recovery expectation

The overall bootstrap path must be rerunnable from the workstation. Docs should keep treating recovery as:

fix local secrets/inputs
rerun the platform repo bootstrap script
inspect the cluster with the generated kubeconfig
destroy/recreate infra from the platform repo only when required

Current repo-state caveats

The direction is clear, but the implementation still has caveats:

this repo now assumes a pre-bootstrapped platform repo/cluster
kubeconfig/auth handling is only as strong as the external Forgejo secret management
base manifests still carry a placeholder bootstrap image until CI rolls the first real tag
app secrets and registry pull credentials still require operator setup
local Compose remains in the repo for development/testing, not as the canonical production path

Minimal repo layout target

deploy/
  hetzner/
    README.md
  k8s/
    base/
    overlays/
      hetzner-single-node/
infra/
  terraform/
    hetzner/

Guidelines:

infra/terraform/hetzner/ owns VM, firewall, networking, and cloud-init rendering
deploy/k8s/ owns Kubernetes-native manifests and overlays
app runtime manifests should remain Kubernetes-native so they can later move from single-node k3s to a larger cluster with minimal rewrite
secret material must not live in git in plaintext; bootstrap docs should describe workstation-driven injection or generated secret references

Local Development / Testing Direction

Do not assume manual multi-terminal operation long term.

Requirement

An orchestrated local/dev runtime is needed.

Local dev should preserve real boundaries

separate services
broker present
env/config driven
same event flow as production

Current local/dev answer

Compose is still acceptable for:

developer laptops
fast local iteration
debugging event flow
validating container boundaries before Kubernetes rollout

But Compose should remain explicitly secondary to the repo-driven Hetzner + k3s path for production operations.

Testing layers

unit tests for normalizers / schema logic / helpers
integration tests against Kafka-compatible broker
replay/simulation tests using retained event streams
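A replay test can be as small as feeding retained events through a pure reactor and checking that two runs agree. In this sketch `replay` and `dummyReactor` are hypothetical, and the events are a hand-written array rather than a real retained stream.

```javascript
// Feed events through a reactor in order and collect the commands it emits.
function replay(events, reactor) {
  const commands = [];
  for (const event of events) {
    const out = reactor(event);
    if (out !== null) commands.push(out);
  }
  return commands;
}

// Stand-in reactor: emits one command per swap_demand event, else nothing.
const dummyReactor = (event) =>
  event.event_type === "swap_demand"
    ? { command_id: `cmd:${event.event_id}`, source_event_id: event.event_id }
    : null;

// Replaying the same stream twice should yield identical commands.
function isDeterministic(events, reactor) {
  const first = JSON.stringify(replay(events, reactor));
  const second = JSON.stringify(replay(events, reactor));
  return first === second;
}
```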

Spark Readiness

Do not add Spark now. But keep the system Spark-compatible later by:

preserving raw events
preserving normalized events
using immutable append-only event streams
versioning schemas
separating operational event log from future analytical processing

Spark later would be for:

large-scale backtesting
feature generation
archive processing
multi-venue analytics

Immediate Next Engineering Tasks

Next session should focus on the following.

1. Clean current repo structure

Remove duplicate/legacy paths and keep one canonical structure only.

2. Keep/complete the 3-stage loop

NEAR Intents ingest -> norm.swap_demand
dummy reactor -> cmd.execute_trade
dummy executor -> exec.trade_result
downstream result consumer

3. Define canonical schemas

Define concrete event schemas for:

normalized swap demand
execute trade command
trade result
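As a starting point for the schema work, a versioned required-field check could look like this. The payload field names are placeholders to be replaced once the real schemas are defined; `REQUIRED_FIELDS` and `missingFields` are hypothetical.

```javascript
// Required payload fields per event type and schema version (placeholders).
const REQUIRED_FIELDS = Object.freeze({
  "swap_demand@1": ["asset_in", "asset_out", "amount_in"],
  "execute_trade@1": ["command_id", "source_event_id"],
  "trade_result@1": ["command_id", "status"],
});

// Return the names of required payload fields that are absent.
function missingFields(event) {
  const schemaKey = `${event.event_type}@${event.schema_version}`;
  const required = REQUIRED_FIELDS[schemaKey];
  if (!required) throw new Error(`no schema registered for ${schemaKey}`);
  return required.filter((field) => event.payload?.[field] === undefined);
}
```

Keying on type plus version lets old and new schema versions coexist on the same topic during a migration.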

4. Define executor idempotency model

Specify:

command_id
idempotency key rules
execution state transition rules
duplicate handling rules
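One possible shape for the state transition and duplicate handling rules, as a sketch to argue against rather than a decided model. The state names and the actions returned by `onRedelivery` are assumptions.

```javascript
// Allowed transitions for one command's execution state (hypothetical).
const TRANSITIONS = Object.freeze({
  received: ["executing"],
  executing: ["succeeded", "failed"],
  succeeded: [],
  failed: [],
});

// Validate a transition and return the new state, or throw if not allowed.
function transition(current, target) {
  const allowed = TRANSITIONS[current];
  if (!allowed) throw new Error(`unknown state: ${current}`);
  if (!allowed.includes(target)) {
    throw new Error(`illegal transition: ${current} -> ${target}`);
  }
  return target;
}

// Decide what to do when a command is delivered again.
function onRedelivery(state) {
  if (state === "succeeded" || state === "failed") return "republish_result";
  if (state === "executing") return "recover_in_flight";
  return "execute"; // never started, or no record yet
}
```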

5. Move toward production-shaped deployment

Design for:

one service per container
single-node deployment first
future multi-node split without app rewrite

6. Harden provisioning/deployment path

Next infra work should continue improving:

Hetzner provisioning by code
workstation bootstrap rerunnability
self-hosted CI/CD handoff
registry-native image delivery
overlay convergence for the Hetzner single-node target

Status update:

minimal Terraform exists under infra/terraform/hetzner
first boot is cloud-init driven and installs k3s automatically
bootstrap now starts from a local operator workstation rather than manual host login
Kubernetes assets exist under deploy/k8s
executor persistence boundaries are explicit for single-node k3s
self-hosted CI handoff is documented, but still requires follow-up hardening

Non-Goals for Next Session

no dashboards
no UI/TUI
no monolith convenience architecture
no SQLite-first system of record
no direct coupling between ingest, decision, and execution
no temporary local-only shortcuts that block future cluster deployment

Guiding Principle

Build the first, single-node version as if it were already a distributed system:

separate services
durable event bus
replayable events
explicit contracts
idempotent execution
production-compatible deployment boundaries
bootstrappable from scratch without manual SSH-based host setup
