# Trading System Architecture Notes for Next Session

## Objective

Build the first real version of the trading system as an event-driven, multi-service architecture.

Current implemented seed:

- NEAR Intents ingest in Node.js
- Kafka-compatible bus usage via `kafkajs`
- dummy reactor / executor / result consumer loop

Next session should continue from this architecture, not revert to a monolith, local-only script, or TUI.

---

## Core Architecture

All components are independent services.
They communicate only through a central Kafka-compatible bus (Redpanda first, Kafka-compatible by design).

### Service classes

- venue ingestors
- normalizers
- reactors / decision engines
- executors
- downstream consumers / monitors / archivers / replay tools

### Service communication rule

No direct service-to-service calls for core trading flow.
Use bus topics only.

---

## Venue-Oriented Structure

The system should be organized by venue.

Each venue can have different:

- ingest/feed mechanics
- normalization logic
- execution mechanics

### Per-venue responsibilities

- `ingest` = venue-native intake
- `normalize` = convert venue-native payload into canonical internal event
- `execute` = venue-specific action logic

Planned shape:

```text
src/
  apps/
  bus/
  core/
  venues/
    near-intents/
      ingest
      normalize
      execute
```

---

## Bus Choice

Use **Redpanda** first, but stay fully **Kafka-compatible**.

### Reason

Requirements:

- high throughput
- low latency
- retention
- replay
- multiple producers/consumers
- independent services
- future scale-out
- multi-language compatibility

### Constraint

Do not use broker-specific features that make migration to Kafka difficult.
Use standard Kafka clients and semantics.

---

## Data Model Principles

Kafka/Redpanda is the operational event backbone.
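
The event rules in this section can be sketched as a small envelope builder. This is only an illustration under stated assumptions, not the repo's schema: `makeEvent`, `missingFields`, the `raw` field name, the integer `schema_version`, ISO-8601 timestamps, and the sample `swap_demand` values are all hypothetical, and it assumes both `observed_at` and `ingested_at` are carried on every event.

```javascript
const { randomUUID } = require("node:crypto");

// Envelope fields taken from this section's "Every event should include" list.
// ASSUMPTION: both timestamps are required; the notes write them as a pair.
const ENVELOPE_FIELDS = [
  "event_id",
  "event_type",
  "venue",
  "observed_at",
  "ingested_at",
  "schema_version",
  "payload",
];

// Build an immutable event. All parameter names here are hypothetical.
function makeEvent({ eventType, venue, observedAt, payload, schemaVersion = 1, raw }) {
  const event = {
    event_id: randomUUID(),
    event_type: eventType,
    venue,
    observed_at: observedAt,               // when the venue data was seen
    ingested_at: new Date().toISOString(), // when this service accepted it
    schema_version: schemaVersion,
    payload,
  };
  if (raw !== undefined) {
    event.raw = raw; // optional original venue-native payload
  }
  // Shallow freeze only: nested payload objects are not frozen by this call.
  return Object.freeze(event);
}

// Return the names of required envelope fields that are absent.
function missingFields(event) {
  return ENVELOPE_FIELDS.filter((name) => event[name] === undefined);
}

const demoEvent = makeEvent({
  eventType: "swap_demand",
  venue: "near-intents",
  observedAt: "2024-01-01T00:00:00.000Z",
  payload: { pair: "A/B" },
});
console.log(missingFields(demoEvent)); // prints: []
```

A consumer could call `missingFields` before trusting a message, which keeps the contract explicit without committing to a schema registry yet.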
### Event model rules

- append-only
- immutable events
- versioned schemas
- raw and normalized events both preserved

### Every event should include

- `event_id`
- `event_type`
- `venue`
- `observed_at` / `ingested_at`
- `schema_version`
- `payload`
- optionally raw/original payload where appropriate

### Raw vs normalized

Keep both.

- raw topics = exact venue-native source truth
- normalized topics = canonical research/trading inputs

This is required for:

- replay
- debugging
- future backtesting
- future Spark/batch processing

---

## Current/Planned Topic Flow

Minimal 3-stage pipeline:

1. ingest publishes normalized demand
2. reactor publishes trade command
3. executor publishes trade result

### Topic classes

- `raw.*` = raw venue-native events
- `norm.*` = canonical normalized market events
- `cmd.*` = execution commands
- `exec.*` = execution outcomes
- later `signal.*` if needed for reactor outputs before command stage

### Current minimal topics

- `norm.swap_demand`
- `cmd.execute_trade`
- `exec.trade_result`

### NEAR Intents

NEAR Intents source currently feeds quote-demand style events from solver-bus websocket.
This is a venue ingest source, not the whole trading system.

---

## Execution Safety / Zero Downtime Requirements

This is critical.

### Constraint

Multiple executors must never duplicate the same trade/action during deploys, restarts, or rebalances.

### Must-have rules

1. Every execution command must carry a unique `command_id`
2. Commands must include deterministic idempotency information
3. Executors must be idempotent
4. Executors must belong to a consumer group per executor role
5. Commands should be partitioned by a stable execution key where ordering matters
6. Executor state must be persisted durably enough to detect duplicate command execution

### Kafka consumer groups are not sufficient alone

They help assign work, but they do not guarantee no duplicate processing under restart/rebalance conditions.
Idempotency is still required.
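
The must-have rules above can be sketched as a small command handler. This is a minimal illustration under stated assumptions, not the repo's executor: `handleCommand`, the `started` / `done` statuses, the `in_doubt` outcome, and the `amount` field are hypothetical names, and an in-memory `Map` stands in for the durable state that would have to survive restarts. It is synchronous for brevity.

```javascript
// Idempotent command handling keyed by command_id (sketch).
// ASSUMPTION: `store` is a stand-in for durable executor state; a Map is used
// only so the example runs on its own and would NOT survive a restart.
function handleCommand(store, command, execute) {
  const previous = store.get(command.command_id);

  if (previous !== undefined && previous.status === "done") {
    // Redelivered after a completed run: return the recorded result, do not re-execute.
    return { outcome: "duplicate", result: previous.result };
  }
  if (previous !== undefined && previous.status === "started") {
    // An earlier attempt began but never recorded completion.
    // Blind re-execution could duplicate the trade, so flag it for recovery.
    return { outcome: "in_doubt", result: null };
  }

  // Record intent before acting, then record the result after acting.
  store.set(command.command_id, { status: "started", result: null });
  const result = execute(command);
  store.set(command.command_id, { status: "done", result });
  return { outcome: "executed", result };
}

// Usage: the same command delivered twice runs the side effect once.
const demoStore = new Map();
let demoCalls = 0;
const demoExecute = (command) => {
  demoCalls += 1;
  return { filled: command.amount }; // hypothetical result shape
};
const demoCommand = { command_id: "cmd-1", amount: 5 };

const firstDelivery = handleCommand(demoStore, demoCommand, demoExecute);
const redelivery = handleCommand(demoStore, demoCommand, demoExecute);
console.log(firstDelivery.outcome, redelivery.outcome, demoCalls);
// prints: executed duplicate 1
```

In a consumer loop, the offset for a message would be committed only after `handleCommand` returns, so a command redelivered after a restart or rebalance is recognized by its `command_id` instead of being executed twice.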
### Rolling updates / zero downtime

Executors must support:

- graceful shutdown
- stop taking new work before exit
- finish or safely recover in-flight work
- commit offsets only after safe execution state transition

### Persistence implication

Executor idempotency state is not optional metadata.
It is operational state that must survive pod restarts.

Current single-node k3s direction:

- executor state lives at `/var/lib/unrip/executor-state`
- Kubernetes mounts that path through persistent storage
- the Hetzner single-node overlay currently targets k3s `local-path` storage
- node loss without storage migration means duplicate-suppression history is lost

---

## Deployment Target

### First deployment phase

- single machine on Hetzner
- but still multiple independent services
- no architecture shortcuts that prevent future clustering

### Future target

- split across multiple machines
- cluster capable
- fault tolerant
- multi-node
- zero-downtime deploys

### Deployment rules from day 1

- every component is a separate container/service
- all config via env/config files
- communication over network/bus only
- persistent components use mounted volumes/PVCs
- no manual SSH-based operational workflow

---

## Infrastructure / Ops Direction

Target environment:

- Hetzner
- self-hosted CI/CD
- provisioning by code
- no GitHub dependency

### Desired stack direction

- Terraform for Hetzner provisioning
- Kubernetes-oriented target from the start
- self-hosted Git + CI/CD
- Kafka-compatible broker
- object storage later for long-term archived event history

### Single-node first, future cluster later

The first version may run on one machine, but deployment structure should already match a future distributed system.

### Current canonical operator path

The repo now documents and partially implements this path as the primary deployment workflow:

#### Phase 0: workstation bootstrap

1. A local operator workstation prepares bootstrap secrets in `scripts/hetzner/bootstrap-secrets.env`.
2. The operator runs `bash scripts/hetzner/bootstrap.sh`.
3. Terraform provisions the server, firewall, network, and cloud-init user-data.
4. cloud-init installs k3s automatically and prepares persistence directories plus bootstrap artifacts.
5. The workstation waits for the public k3s API endpoint to report ready.
6. The workstation writes `.state/hetzner/kubeconfig.yaml`.
7. The workstation injects initial Kubernetes Secrets for app and Forgejo bootstrap.
8. The workstation applies repo-managed Kubernetes manifests under `deploy/k8s/`.
9. The workstation performs the first image/bootstrap delivery attempt for the app workloads.
10. The workstation verifies rollout status.

#### Phase 1: self-hosted handoff

1. Forgejo becomes reachable in-cluster.
2. The operator completes initial Forgejo admin/repo setup.
3. This repo is pushed or mirrored into Forgejo.
4. The Forgejo runner becomes the routine app deployment mechanism.
5. Terraform remains the infra mutation entrypoint unless further automated later.

### Failure-recovery expectation

The bootstrap path must be rerunnable from the workstation.
Docs should keep treating recovery as:

- fix local secrets/inputs
- rerun the bootstrap script
- inspect the cluster with the generated kubeconfig
- destroy/recreate infra with `scripts/hetzner/destroy.sh` only when required

### Current repo-state caveats

The direction is clear, but the implementation is still mid-transition:

- the bootstrap script currently applies `deploy/k8s/base` directly rather than the Hetzner overlay
- kubeconfig/auth handling is not yet fully production-hardened
- first image delivery is still a bootstrap workaround rather than a final registry-native CI path
- Forgejo admin bootstrap, repo creation, and Actions configuration still require operator steps
- local Compose remains in the repo for development/testing, not as the canonical production path

### Minimal repo layout target

```text
deploy/
  hetzner/
    README.md
  k8s/
    base/
    overlays/
      hetzner-single-node/
infra/
  terraform/
    hetzner/
```

Guidelines:

- `infra/terraform/hetzner/` owns VM, firewall, networking, and cloud-init rendering
- `deploy/k8s/` owns Kubernetes-native manifests and overlays
- app runtime manifests should remain Kubernetes-native so they can later move from single-node k3s to a larger cluster with minimal rewrite
- secret material must not live in git in plaintext; bootstrap docs should describe workstation-driven injection or generated secret references

---

## Local Development / Testing Direction

Do not assume manual multi-terminal operation long term.

### Requirement

Need an orchestrated local/dev runtime.

### Local dev should preserve real boundaries

- separate services
- broker present
- env/config driven
- same event flow as production

### Current local/dev answer

Compose is still acceptable for:

- developer laptops
- fast local iteration
- debugging event flow
- validating container boundaries before Kubernetes rollout

But Compose should remain explicitly secondary to the repo-driven Hetzner + k3s path for production operations.

### Testing layers

1. unit tests for normalizers / schema logic / helpers
2. integration tests against Kafka-compatible broker
3. replay/simulation tests using retained event streams

---

## Spark Readiness

Do not add Spark now.

But keep the system Spark-compatible later by:

- preserving raw events
- preserving normalized events
- using immutable append-only event streams
- versioning schemas
- separating operational event log from future analytical processing

Spark later would be for:

- large-scale backtesting
- feature generation
- archive processing
- multi-venue analytics

---

## Immediate Next Engineering Tasks

Next session should focus on the following.

### 1. Clean current repo structure

Remove duplicate/legacy paths and keep one canonical structure only.

### 2. Keep/complete the 3-stage loop

- NEAR Intents ingest -> `norm.swap_demand`
- dummy reactor -> `cmd.execute_trade`
- dummy executor -> `exec.trade_result`
- downstream result consumer

### 3. Define canonical schemas

Define concrete event schemas for:

- normalized swap demand
- execute trade command
- trade result

### 4. Define executor idempotency model

Specify:

- `command_id`
- idempotency key rules
- execution state transition rules
- duplicate handling rules

### 5. Move toward production-shaped deployment

Design for:

- one service per container
- single-node deployment first
- future multi-node split without app rewrite

### 6. Harden provisioning/deployment path

Next infra work should continue improving:

- Hetzner provisioning by code
- workstation bootstrap rerunnability
- self-hosted CI/CD handoff
- registry-native image delivery
- overlay convergence for the Hetzner single-node target

Status update:

- minimal Terraform exists under `infra/terraform/hetzner`
- first boot is cloud-init driven and installs k3s automatically
- bootstrap now starts from a local operator workstation rather than manual host login
- Kubernetes assets exist under `deploy/k8s`
- executor persistence boundaries are explicit for single-node k3s
- self-hosted CI handoff is documented, but still requires follow-up hardening

---

## Non-Goals for Next Session

- no dashboards
- no UI/TUI
- no monolith convenience architecture
- no SQLite-first system of record
- no direct coupling between ingest, decision, and execution
- no temporary local-only shortcuts that block future cluster deployment

---

## Guiding Principle

Build the single-node first version as if it is already a distributed system:

- separate services
- durable event bus
- replayable events
- explicit contracts
- idempotent execution
- production-compatible deployment boundaries
- bootstrapable from scratch without manual SSH-based host setup