# Trading System Architecture Notes for Next Session

## Objective
Build the first real version of the trading system as an event-driven, multi-service architecture.

Current implemented seed:
- NEAR Intents ingest in Node.js
- Kafka-compatible bus usage via `kafkajs`
- dummy reactor / executor / result consumer loop

Next session should continue from this architecture, not revert to a monolith, local-only script, or TUI.

---

## Core Architecture
All components are independent services.
They communicate only through a central Kafka-compatible bus (Redpanda first, Kafka-compatible by design).

### Service classes
- venue ingestors
- normalizers
- reactors / decision engines
- executors
- downstream consumers / monitors / archivers / replay tools

### Service communication rule
No direct service-to-service calls for core trading flow.
Use bus topics only.

---

## Venue-Oriented Structure
The system should be organized by venue.
Each venue can have different:
- ingest/feed mechanics
- normalization logic
- execution mechanics

### Per-venue responsibilities
- `ingest` = venue-native intake
- `normalize` = convert venue-native payload into canonical internal event
- `execute` = venue-specific action logic

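
As a rough illustration of the three per-venue roles, each venue could expose the same three-function shape behind a small registry. The `venues` map, the `getVenue` helper, and the stub bodies are hypothetical; only the `ingest` / `normalize` / `execute` roles come from the list above.

```javascript
// Hypothetical registry: one entry per venue, each exposing the same three roles.
const venues = new Map([
  [
    "near-intents",
    {
      // ingest: wrap a venue-native message without interpreting it.
      ingest: (rawMessage) => ({ venue: "near-intents", raw: rawMessage }),
      // normalize: convert the venue-native payload into a canonical internal event.
      normalize: (rawEvent) => ({
        event_type: "swap_demand",
        venue: rawEvent.venue,
        payload: { ...rawEvent.raw },
      }),
      // execute: venue-specific action logic (stubbed here; nothing is sent anywhere).
      execute: (command) => ({ command_id: command.command_id, status: "simulated" }),
    },
  ],
]);

// Look up a venue adapter, failing loudly on names that were never registered.
function getVenue(name) {
  const adapter = venues.get(name);
  if (!adapter) throw new Error(`unknown venue: ${name}`);
  return adapter;
}
```
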
Planned shape:
```text
src/
  apps/
  bus/
  core/
  venues/
    near-intents/
      ingest
      normalize
      execute
```

---

## Bus Choice
Use **Redpanda** first, but stay fully **Kafka-compatible**.

### Reason
Requirements:
- high throughput
- low latency
- retention
- replay
- multiple producers/consumers
- independent services
- future scale-out
- multi-language compatibility

### Constraint
Do not use broker-specific features that make migration to Kafka difficult.
Use standard Kafka clients and semantics.

---

## Data Model Principles
Kafka/Redpanda is the operational event backbone.

### Event model rules
- append-only
- immutable events
- versioned schemas
- raw and normalized events both preserved

### Every event should include
- `event_id`
- `event_type`
- `venue`
- `observed_at` / `ingested_at`
- `schema_version`
- `payload`
- optionally raw/original payload where appropriate

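
A minimal sketch of an envelope builder for the fields listed above. It assumes both `observed_at` and `ingested_at` are carried (the notes leave open whether one is enough) and that timestamps are caller-supplied ISO strings. The `makeEvent` helper and the optional `raw` field name are hypothetical.

```javascript
// Fields every event must carry, per the list above.
const REQUIRED_FIELDS = [
  "event_id",
  "event_type",
  "venue",
  "observed_at",
  "ingested_at",
  "schema_version",
  "payload",
];

// Build an event envelope and reject it if a required field is absent.
function makeEvent({ eventId, eventType, venue, observedAt, ingestedAt, schemaVersion, payload, raw }) {
  const event = {
    event_id: eventId,
    event_type: eventType,
    venue,
    observed_at: observedAt,
    ingested_at: ingestedAt,
    schema_version: schemaVersion,
    payload,
  };
  // Optional original payload; the field name `raw` is an assumption.
  if (raw !== undefined) event.raw = raw;

  const missing = REQUIRED_FIELDS.filter((f) => event[f] === undefined || event[f] === null);
  if (missing.length > 0) throw new Error(`event is missing: ${missing.join(", ")}`);

  // Shallow freeze: the envelope itself cannot be mutated after creation.
  return Object.freeze(event);
}
```

The freeze is shallow, so nested payload objects would need their own freeze (or a copy-on-read convention) if strict immutability is wanted inside a single process.
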
### Raw vs normalized
Keep both.
- raw topics = exact venue-native source truth
- normalized topics = canonical research/trading inputs

This is required for:
- replay
- debugging
- future backtesting
- future Spark/batch processing

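
One way to keep both forms is to have the normalizer emit two bus records per input: the untouched raw event and a canonical event that links back to it. Everything venue-specific here is hypothetical: the raw payload fields (`token_in`, `token_out`, `amount_in`), the canonical field names, the `raw_event_id` link, and the `raw.near_intents` topic name. Only `norm.swap_demand` and the envelope fields come from these notes.

```javascript
const RAW_TOPIC = "raw.near_intents"; // hypothetical name inside the raw.* class
const NORM_TOPIC = "norm.swap_demand"; // topic name from these notes

// Turn one raw event into two bus records: the raw event as-is, plus a
// normalized event that references it so replay can trace back to the source.
function toBusRecords(rawEvent) {
  const p = rawEvent.payload;
  const normalized = {
    event_id: `${rawEvent.event_id}:norm`,
    event_type: "swap_demand",
    venue: rawEvent.venue,
    observed_at: rawEvent.observed_at,
    ingested_at: rawEvent.ingested_at,
    schema_version: 1,
    payload: {
      asset_in: p.token_in,
      asset_out: p.token_out,
      // Amounts stay strings so large integer values are not rounded.
      amount_in: String(p.amount_in),
    },
    raw_event_id: rawEvent.event_id,
  };
  return [
    { topic: RAW_TOPIC, event: rawEvent },
    { topic: NORM_TOPIC, event: normalized },
  ];
}
```
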

---

## Current/Planned Topic Flow
Minimal 3-stage pipeline:

1. ingest publishes normalized demand
2. reactor publishes trade command
3. executor publishes trade result

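
The three stages can be sketched with an in-memory stand-in for the broker, which is useful for reasoning about the flow without Redpanda running. `InMemoryBus` is hypothetical and delivers synchronously, unlike a real Kafka-compatible bus; the topic names are the ones listed under "Current minimal topics", and the event fields are placeholders.

```javascript
// Hypothetical in-memory bus: an append-only log plus per-topic subscribers.
class InMemoryBus {
  constructor() {
    this.handlers = new Map();
    this.log = [];
  }

  subscribe(topic, handler) {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  publish(topic, event) {
    this.log.push({ topic, event }); // record first, then deliver
    for (const handler of this.handlers.get(topic) ?? []) handler(event);
  }
}

const bus = new InMemoryBus();

// Stage 2: the reactor turns normalized demand into a trade command.
bus.subscribe("norm.swap_demand", (demand) => {
  bus.publish("cmd.execute_trade", {
    command_id: `cmd-${demand.event_id}`,
    demand_id: demand.event_id,
  });
});

// Stage 3: the executor turns a command into a trade result (simulated).
bus.subscribe("cmd.execute_trade", (command) => {
  bus.publish("exec.trade_result", { command_id: command.command_id, status: "simulated" });
});

// Downstream result consumer.
const results = [];
bus.subscribe("exec.trade_result", (result) => results.push(result));

// Stage 1: ingest publishes normalized demand, which drives the rest.
bus.publish("norm.swap_demand", { event_id: "evt-1" });
```

No stage calls another stage directly; each one only reads from and writes to topics, which is the property the real services need to keep.
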
### Topic classes
- `raw.*` = raw venue-native events
- `norm.*` = canonical normalized market events
- `cmd.*` = execution commands
- `exec.*` = execution outcomes
- later `signal.*` if needed for reactor outputs before command stage

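
A small guard can keep services from publishing to topics outside these classes. This is only a sketch: `topicClass` is a hypothetical helper, and `signal` is deliberately left out of the allowed set because the notes mark it as a later addition.

```javascript
// Topic classes currently in use; `signal` would be added here later if needed.
const TOPIC_CLASSES = new Set(["raw", "norm", "cmd", "exec"]);

// Return the class prefix of a topic, or throw if the name does not follow
// the <class>.<name> convention with a known class.
function topicClass(topic) {
  const dot = topic.indexOf(".");
  const prefix = dot > 0 ? topic.slice(0, dot) : "";
  const rest = dot > 0 ? topic.slice(dot + 1) : "";
  if (!TOPIC_CLASSES.has(prefix) || rest.length === 0) {
    throw new Error(`topic "${topic}" does not match <class>.<name>`);
  }
  return prefix;
}
```
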
### Current minimal topics
- `norm.swap_demand`
- `cmd.execute_trade`
- `exec.trade_result`

### NEAR Intents
The NEAR Intents source currently feeds quote-demand-style events from the solver-bus WebSocket.
This is a venue ingest source, not the whole trading system.

---

## Execution Safety / Zero Downtime Requirements
This is critical.

### Constraint
Multiple executors must never duplicate the same trade/action during deploys, restarts, or rebalances.

### Must-have rules
1. Every execution command must carry a unique `command_id`
2. Commands must include deterministic idempotency information
3. Executors must be idempotent
4. Executors must belong to a consumer group per executor role
5. Commands should be partitioned by a stable execution key where ordering matters
6. Executor state must be persisted durably enough to detect duplicate command execution

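
A minimal sketch of rules 1, 2, 3, and 6, assuming a key-value store for executor state. The `Map` used below is only a stand-in for durable storage. `idempotencyKey`, `IdempotentExecutor`, and the `venue` / `demand_id` / `action` command fields are hypothetical; only `command_id` comes from the rules above.

```javascript
// Deterministic idempotency key: the same logical action always maps to the
// same key, even if the command is re-emitted under a different command_id.
function idempotencyKey(command) {
  return [command.venue, command.demand_id, command.action].join("|");
}

class IdempotentExecutor {
  // `store` needs get/set; in production it must survive restarts.
  // `performTrade` is the side effect that must never run twice per key.
  constructor(store, performTrade) {
    this.store = store;
    this.performTrade = performTrade;
  }

  handle(command) {
    const key = idempotencyKey(command);
    const seen = this.store.get(key);
    if (seen) {
      // Duplicate delivery: report the recorded state, never re-run the trade.
      return { command_id: command.command_id, status: seen.status, duplicate: true };
    }
    // Record intent before the side effect. If the process dies after this
    // line, the key stays "in_flight" and needs reconciliation, not a retry.
    this.store.set(key, { status: "in_flight", command_id: command.command_id });
    const outcome = this.performTrade(command);
    this.store.set(key, { status: "done", command_id: command.command_id, outcome });
    return { command_id: command.command_id, status: "done", duplicate: false, outcome };
  }
}
```

Writing the `in_flight` marker before the side effect trades availability for safety: a crash leaves an ambiguous entry that blocks re-execution until it is reconciled, which matches the "never duplicate" constraint.
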
### Kafka consumer groups are not sufficient alone
They help assign work, but they do not prevent duplicate processing under restart/rebalance conditions: a message that was processed but whose offset was not yet committed is redelivered to the next owner of the partition.
Idempotency is still required.

### Rolling updates / zero downtime
Executors must support:
- graceful shutdown
- stop taking new work before exit
- finish or safely recover in-flight work
- commit offsets only after safe execution state transition

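
The shutdown behaviour above can be sketched as a loop that checks a stop flag before fetching, always finishes the message it has already taken, and commits only afterwards. `ExecutorLoop` and its three callbacks are hypothetical stand-ins for the real consumer, trade logic, and offset commit; offsets are plain numbers here for readability.

```javascript
// Minimal executor loop with graceful shutdown and commit-after-execute.
class ExecutorLoop {
  constructor({ fetchNext, execute, commitOffset }) {
    this.fetchNext = fetchNext;
    this.execute = execute;
    this.commitOffset = commitOffset;
    this.stopping = false;
  }

  // Intended to be called from a signal handler, for example:
  //   process.on("SIGTERM", () => loop.requestShutdown());
  requestShutdown() {
    this.stopping = true;
  }

  async run() {
    // The flag is checked before each fetch, so no new work is taken once
    // shutdown has been requested.
    while (!this.stopping) {
      const message = await this.fetchNext();
      if (message === null) break; // source drained (demo/test only)
      await this.execute(message); // work already taken runs to completion
      // Kafka convention: the committed offset is the next offset to read.
      await this.commitOffset(message.offset + 1);
    }
  }
}
```

If the process is killed between `execute` and `commitOffset`, the message is redelivered, which is why this loop still depends on an idempotent executor.
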
### Persistence implication
Executor idempotency state is not optional metadata.
It is operational state that must survive pod restarts.

Current single-node k3s direction:
- executor state lives at `/var/lib/unrip/executor-state`
- Kubernetes mounts that path through persistent storage
- the Hetzner single-node overlay currently targets k3s `local-path` storage
- node loss without storage migration means duplicate-suppression history is lost

---

## Deployment Target
### First deployment phase
- single machine on Hetzner
- but still multiple independent services
- no architecture shortcuts that prevent future clustering

### Future target
- split across multiple machines
- cluster-capable
- fault-tolerant
- multi-node
- zero-downtime deploys

### Deployment rules from day 1
- every component is a separate container/service
- all config via env/config files
- communication over network/bus only
- persistent components use mounted volumes/PVCs
- no manual SSH-based operational workflow

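
For the "all config via env/config files" rule, each service could read its settings through one small loader, so the same image runs unchanged under Compose and k3s. The variable names `KAFKA_BROKERS`, `SERVICE_NAME`, and `EXECUTOR_STATE_DIR` are assumptions; the default state path is the one named in these notes. The `brokers` and `clientId` keys are chosen to line up with the options the `kafkajs` client constructor accepts.

```javascript
// Build an immutable config object from an environment-like map.
// Typical call site: const config = loadConfig(process.env);
function loadConfig(env) {
  const brokers = (env.KAFKA_BROKERS ?? "")
    .split(",")
    .map((b) => b.trim())
    .filter((b) => b.length > 0);
  if (brokers.length === 0) {
    throw new Error("KAFKA_BROKERS must list at least one host:port");
  }
  return Object.freeze({
    brokers,
    clientId: env.SERVICE_NAME ?? "unnamed-service",
    executorStateDir: env.EXECUTOR_STATE_DIR ?? "/var/lib/unrip/executor-state",
  });
}
```
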

---

## Infrastructure / Ops Direction
Target environment:
- Hetzner
- self-hosted CI/CD
- provisioning by code
- no GitHub dependency

### Desired stack direction
- Terraform for Hetzner provisioning
- Kubernetes-oriented target from the start
- self-hosted Git + CI/CD
- Kafka-compatible broker
- object storage later for long-term archived event history

### Single-node first, future cluster later
The first version may run on one machine, but deployment structure should already match a future distributed system.

### Current canonical operator path
The repo now documents and partially implements this path as the primary deployment workflow:

#### Phase 0: workstation bootstrap
1. A local operator workstation prepares bootstrap secrets in `scripts/hetzner/bootstrap-secrets.env`.
2. The operator runs `bash scripts/hetzner/bootstrap.sh`.
3. Terraform provisions the server, firewall, network, and cloud-init user-data.
4. cloud-init installs k3s automatically and prepares persistence directories plus bootstrap artifacts.
5. The workstation waits for the public k3s API endpoint to report ready.
6. The workstation writes `.state/hetzner/kubeconfig.yaml`.
7. The workstation injects initial Kubernetes Secrets for app and Forgejo bootstrap.
8. The workstation applies repo-managed Kubernetes manifests under `deploy/k8s/`.
9. The workstation performs the first image/bootstrap delivery attempt for the app workloads.
10. The workstation verifies rollout status.

#### Phase 1: self-hosted handoff
1. Forgejo becomes reachable in-cluster.
2. The operator completes initial Forgejo admin/repo setup.
3. This repo is pushed or mirrored into Forgejo.
4. The Forgejo runner becomes the routine app deployment mechanism.
5. Terraform remains the infra mutation entrypoint unless further automated later.

### Failure-recovery expectation
The bootstrap path must be rerunnable from the workstation.
Docs should keep treating recovery as:
- fix local secrets/inputs
- rerun the bootstrap script
- inspect the cluster with the generated kubeconfig
- destroy/recreate infra with `scripts/hetzner/destroy.sh` only when required

### Current repo-state caveats
The direction is clear, but the implementation is still mid-transition:
- the bootstrap script currently applies `deploy/k8s/base` directly rather than the Hetzner overlay
- kubeconfig/auth handling is not yet fully production-hardened
- first image delivery is still a bootstrap workaround rather than a final registry-native CI path
- Forgejo admin bootstrap, repo creation, and Actions configuration still require operator steps
- local Compose remains in the repo for development/testing, not as the canonical production path

### Minimal repo layout target
```text
deploy/
  hetzner/
    README.md
  k8s/
    base/
    overlays/
      hetzner-single-node/
infra/
  terraform/
    hetzner/
```

Guidelines:
- `infra/terraform/hetzner/` owns VM, firewall, networking, and cloud-init rendering
- `deploy/k8s/` owns Kubernetes-native manifests and overlays
- app runtime manifests should remain Kubernetes-native so they can later move from single-node k3s to a larger cluster with minimal rewrite
- secret material must not live in git in plaintext; bootstrap docs should describe workstation-driven injection or generated secret references

---

## Local Development / Testing Direction
Do not assume manual multi-terminal operation long term.

### Requirement
Need an orchestrated local/dev runtime.

### Local dev should preserve real boundaries
- separate services
- broker present
- env/config driven
- same event flow as production

### Current local/dev answer
Compose is still acceptable for:
- developer laptops
- fast local iteration
- debugging event flow
- validating container boundaries before Kubernetes rollout

But Compose should remain explicitly secondary to the repo-driven Hetzner + k3s path for production operations.

### Testing layers
1. unit tests for normalizers / schema logic / helpers
2. integration tests against Kafka-compatible broker
3. replay/simulation tests using retained event streams

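
The third layer depends on reactors being deterministic: feeding the same retained events through the same reactor should yield the same commands every time. A minimal sketch, assuming retained events are available as an array; `replay` and `dummyReactor` are hypothetical, and the command fields are placeholders.

```javascript
// Feed retained events through a reactor in order and collect its commands.
function replay(retainedEvents, reactor) {
  const commands = [];
  for (const event of retainedEvents) {
    const command = reactor(event);
    if (command !== null) commands.push(command);
  }
  return commands;
}

// Deterministic dummy reactor: no clock reads and no randomness, so the
// output depends only on the input event.
function dummyReactor(event) {
  if (event.event_type !== "swap_demand") return null;
  return { command_id: `cmd-${event.event_id}`, demand_id: event.event_id };
}
```
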

---

## Spark Readiness
Do not add Spark now.
But keep the system Spark-compatible later by:
- preserving raw events
- preserving normalized events
- using immutable append-only event streams
- versioning schemas
- separating operational event log from future analytical processing

Spark later would be for:
- large-scale backtesting
- feature generation
- archive processing
- multi-venue analytics

---

## Immediate Next Engineering Tasks
Next session should focus on the following.

### 1. Clean current repo structure
Remove duplicate/legacy paths and keep one canonical structure only.

### 2. Keep/complete the 3-stage loop
- NEAR Intents ingest -> `norm.swap_demand`
- dummy reactor -> `cmd.execute_trade`
- dummy executor -> `exec.trade_result`
- downstream result consumer

### 3. Define canonical schemas
Define concrete event schemas for:
- normalized swap demand
- execute trade command
- trade result

### 4. Define executor idempotency model
Specify:
- `command_id`
- idempotency key rules
- execution state transition rules
- duplicate handling rules

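
As a starting point for that specification, the state transition and duplicate handling rules could be written down as a small table. This is one possible model, not a decision: the state names (`received`, `in_flight`, `unknown`, `succeeded`, `failed`) and the duplicate actions are all hypothetical.

```javascript
// Candidate transition table. "unknown" covers a crash between starting a
// trade and recording its outcome.
const ALLOWED_TRANSITIONS = new Map([
  ["received", ["in_flight"]],
  ["in_flight", ["succeeded", "failed", "unknown"]],
  ["unknown", ["succeeded", "failed"]],
  ["succeeded", []],
  ["failed", []],
]);

// Validate a state change and return the new state, or throw if it is illegal.
function nextState(current, target) {
  const allowed = ALLOWED_TRANSITIONS.get(current);
  if (!allowed) throw new Error(`unknown state: ${current}`);
  if (!allowed.includes(target)) throw new Error(`illegal transition: ${current} -> ${target}`);
  return target;
}

// Decide what to do when a command arrives whose key is already recorded.
function onDuplicate(state) {
  switch (state) {
    case "succeeded":
    case "failed":
      return "republish_result"; // outcome is known, safe to report again
    case "received":
    case "in_flight":
    case "unknown":
      return "reconcile_before_retry"; // outcome unclear, never blindly re-execute
    default:
      throw new Error(`unknown state: ${state}`);
  }
}
```

Terminal states have no outgoing transitions, so a finished command can never be moved back into execution by a late or repeated message.
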
### 5. Move toward production-shaped deployment
Design for:
- one service per container
- single-node deployment first
- future multi-node split without app rewrite

### 6. Harden provisioning/deployment path
Next infra work should continue improving:
- Hetzner provisioning by code
- workstation bootstrap rerunnability
- self-hosted CI/CD handoff
- registry-native image delivery
- overlay convergence for the Hetzner single-node target

Status update:
- minimal Terraform exists under `infra/terraform/hetzner`
- first boot is cloud-init driven and installs k3s automatically
- bootstrap now starts from a local operator workstation rather than manual host login
- Kubernetes assets exist under `deploy/k8s`
- executor persistence boundaries are explicit for single-node k3s
- self-hosted CI handoff is documented, but still requires follow-up hardening

---

## Non-Goals for Next Session
- no dashboards
- no UI/TUI
- no monolith convenience architecture
- no SQLite-first system of record
- no direct coupling between ingest, decision, and execution
- no temporary local-only shortcuts that block future cluster deployment

---

## Guiding Principle
Build the single-node first version as if it is already a distributed system:
- separate services
- durable event bus
- replayable events
- explicit contracts
- idempotent execution
- production-compatible deployment boundaries
- bootstrappable from scratch without manual SSH-based host setup