# Implementation Proof: runtime health sentinel, alert routing, and anomaly detection

Status: open
Opened: 2026-04-08

## Target outcome

This turn proves that `unrip` can detect and loudly escalate broken runtime behavior before operators mistake stale or disconnected systems for healthy trading state.

The concrete target is the currently live NEAR Intents BTC/EURe system:

- detect when quote flow stops or becomes stale
- detect when websocket-backed services are disconnected or stranded
- detect when durable pipeline components are reachable but no longer advancing truth
- surface those failures as durable alerts and obvious dashboard state
- deliver at least one external alert notification through a repo-controlled routing path
- contain risk safely by exposing or applying only explicitly approved non-fund-moving remediation

## Why this is a meaningful architecture test

The cluster incident on 2026-04-07 and 2026-04-08 showed a real failure mode:

- `near-intents-ingest` stopped receiving and publishing quotes after `2026-04-07T09:06:38Z`
- the service remained running in Kubernetes
- the dashboard still rendered the service as effectively healthy despite `30h+` freshness age
- no durable alert or external notification made the failure hard to miss

That is an architecture failure, not just an isolated websocket bug. If the system cannot detect that its truth pipeline has gone stale, it is not safe enough to broaden venue scope.

## Hypothesis

`unrip` becomes more trustworthy if runtime health is treated as first-class product truth:

- service-local state is not enough; cross-service health must be derived from the shared pipeline
- stale or broken quote flow must become durable, replayable alert truth
- dashboard health must be severity-driven, not merely reachability-driven
- anomaly detection should complement hard invariants, not replace them
- safe containment is more important than optimistic self-healing

The turn passes only if a quote-flow or service-health failure like the recent ingest outage would now produce obvious dashboard severity plus durable and external alert evidence within bounded time.

## Scope

- [I020] Runtime health sentinel, alert routing, and anomaly detection: detect stale or broken service behavior across ingest, execution, persistence, and operator surfaces; surface critical failures loudly in the dashboard; emit durable alerts plus external notifications; and support tightly scoped safe remediation for non-fund-moving failures.
- Extend the existing `ops-sentinel` into the runtime health authority for the current single-pair NEAR Intents system.
- Cover at least these services and paths:
  - `near-intents-ingest`
  - `trade-executor`
  - `history-writer`
  - `operator-dashboard`
  - the quote-flow path from ingest to PostgreSQL-backed operator visibility
- Add one repo-controlled external alert delivery path. Initial assumption: a webhook-oriented sink suitable for Alertmanager, Slack, Discord, or equivalent downstream routing.
- Add bounded anomaly detection based on rolling baselines and pipeline gaps for the active pair and live services. This is support for operator review, not a replacement for deterministic safety alerts.

## Assumptions

- The current control APIs remain the primary service-local truth surface for runtime state and safe controls.
- External notification should be added through one simple sink first, not a broad escalation matrix in the repo itself.
- Alertmanager or another downstream fan-out layer may exist later, but this turn only needs a truthful outbound integration point plus at least one working delivery path.
- The first anomaly layer should use deterministic rolling windows or baseline comparisons over repo-owned metrics and history, not opaque model training over free-text logs.
- Safe remediation in this turn means non-fund-moving actions only, such as reconnect, refresh, pause, resume, or explicit trading containment.

## Turn-shaping rules

- This is a runtime-truth turn, not a generic observability platform buildout.
- Do not add a broad metrics stack, tracing system, or large log-ML pipeline unless required by the proof.
- Do not pretend anomaly detection is safety-complete. Deterministic invariants remain the primary alert path.
- Do not let automatic remediation widen risk. If a condition is unsafe, containment or explicit alerting is preferred to speculative recovery.
- Browser clients must still talk only to repo-owned backend services.

## Non-goals

- No broad CoW Protocol implementation work in this turn beyond preserving the paused archive.
- No generalized self-healing orchestration across arbitrary workloads.
- No live-funds-moving automation such as bridge actions, funding, approvals, or trade submission retries.
- No full-blown ML log classifier or LLM-operated runtime manager.
- No replacement of Loki, Grafana, or external alert systems already in the cluster.

## Required runtime behavior

### Health truth

The system must be able to derive runtime health from more than mere pod liveness.

At minimum it must detect and represent:

- service reachable but stale
- websocket disconnected
- durable writer reachable but topic flow stalled
- operator dashboard upstream source degraded
- alerting subsystem stale or failing
- strategy or executor armed while critical upstream truth is stale
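
The difference between liveness and health truth can be illustrated with a minimal sketch. It assumes hypothetical names (`ServiceSnapshot`, `StalenessPolicy`, `classify`) and placeholder thresholds of 60 seconds and 5 minutes; none of these come from the `unrip` codebase, and real thresholds would be set by policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class Severity(Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class ServiceSnapshot:
    """Hypothetical view of one service as seen by the sentinel."""
    name: str
    reachable: bool
    websocket_connected: bool
    last_truth_at: datetime  # last time the service advanced its output


@dataclass(frozen=True)
class StalenessPolicy:
    """Assumed thresholds; real values would come from repo configuration."""
    warn_after: timedelta = timedelta(seconds=60)
    critical_after: timedelta = timedelta(minutes=5)


def classify(snapshot: ServiceSnapshot, policy: StalenessPolicy, now: datetime):
    """Return (severity, reason). Reachability alone never implies healthy."""
    if not snapshot.reachable:
        return Severity.CRITICAL, "unreachable"
    if not snapshot.websocket_connected:
        return Severity.CRITICAL, "websocket_disconnected"
    age = now - snapshot.last_truth_at
    if age >= policy.critical_after:
        return Severity.CRITICAL, "reachable_but_stale"
    if age >= policy.warn_after:
        return Severity.WARNING, "freshness_degrading"
    return Severity.HEALTHY, "fresh"


# Replays the incident shape: pod running, socket nominally up, no quotes
# since 2026-04-07T09:06:38Z. The evaluation time is a made-up example.
ingest = ServiceSnapshot(
    name="near-intents-ingest",
    reachable=True,
    websocket_connected=True,
    last_truth_at=datetime(2026, 4, 7, 9, 6, 38, tzinfo=timezone.utc),
)
severity, reason = classify(
    ingest, StalenessPolicy(), datetime(2026, 4, 8, 15, 30, tzinfo=timezone.utc)
)
print(severity.value, reason)  # prints: critical reachable_but_stale
```

Passing `now` in explicitly keeps the classification deterministic, which makes stale-state cases straightforward to cover in automated tests.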

### Alert truth

The system must durably emit alert events for critical runtime failures.

At minimum it must raise and clear alerts for:

- `near_intents_ingest_disconnected`
- `near_intents_quotes_stale`
- `near_intents_publish_stale`
- `trade_executor_relay_disconnected`
- `history_writer_stalled` or equivalent durable-write stall
- `operator_dashboard_source_degraded`
- `sentinel_stale`

The exact names may vary, but the semantic coverage must be equivalent and replayable from stored records.
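
One way to read "replayable from stored records" is that the set of currently raised alerts can be rebuilt from an append-only log of raise and clear transitions. The sketch below assumes a hypothetical `AlertEvent` record shape and an `active_alerts` helper; the actual storage schema is not specified by this proof.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Literal, Tuple


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical append-only record: one row per alert transition."""
    alert_name: str                       # e.g. "near_intents_quotes_stale"
    scope: str                            # e.g. "near-intents-ingest"
    transition: Literal["raised", "cleared"]
    reason: str
    at: str                               # ISO-8601 UTC timestamp


def active_alerts(events: Iterable[AlertEvent]) -> Dict[Tuple[str, str], AlertEvent]:
    """Replay stored events in order and return the alerts still raised."""
    active: Dict[Tuple[str, str], AlertEvent] = {}
    for event in events:
        key = (event.alert_name, event.scope)
        if event.transition == "raised":
            # Keep the first raise so the alert's start time is preserved.
            active.setdefault(key, event)
        else:
            # A clear for an alert that is not active is ignored.
            active.pop(key, None)
    return active


log = [
    AlertEvent("near_intents_quotes_stale", "near-intents-ingest", "raised",
               "no quotes within policy window", "2026-04-07T09:11:38Z"),
    AlertEvent("sentinel_stale", "ops-sentinel", "raised",
               "no evaluation tick", "2026-04-07T10:00:00Z"),
    AlertEvent("sentinel_stale", "ops-sentinel", "cleared",
               "evaluation resumed", "2026-04-07T10:05:00Z"),
]
print(sorted(active_alerts(log)))
# prints: [('near_intents_quotes_stale', 'near-intents-ingest')]
```

Because state is derived rather than stored, the same replay can back both the dashboard view and tests of raise and clear transitions.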

### Operator surface

Operators must be able to answer:

- which service is degraded or critical
- why it is degraded
- how stale the underlying truth is
- whether the issue is local-service, pipeline, or downstream-surface related
- whether alert delivery succeeded
- what safe control actions are available

The dashboard must no longer label a service healthy if its runtime truth is stale by policy.

### External notification

At least one external notification path must receive a raised alert from repo-controlled code.

That path must:

- dedupe by alert identity or equivalent key
- include severity, scope, reason, and timestamps
- support clear or recovery notification, or explicitly record why clear delivery is not supported
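
The three requirements above could be met by a small notifier along these lines. This is a sketch under stated assumptions: `WebhookNotifier`, `dedupe_key`, and the payload field names are hypothetical, and the transport is an injected callable rather than a real HTTP client, so no particular downstream format (Alertmanager, Slack, Discord) is implied.

```python
import hashlib
import json
from typing import Callable, Dict, Optional


def dedupe_key(alert_name: str, scope: str) -> str:
    """Stable identity for one alert on one scope (hypothetical scheme)."""
    return hashlib.sha256(f"{alert_name}|{scope}".encode("utf-8")).hexdigest()[:16]


class WebhookNotifier:
    """Sends one notification per state transition of each alert identity."""

    def __init__(self, send: Callable[[str], None]) -> None:
        # `send` is an injected transport (assumption): in a real service it
        # would POST the JSON body to the configured webhook URL.
        self._send = send
        self._last_state: Dict[str, str] = {}

    def notify(self, *, alert_name: str, scope: str, state: str, severity: str,
               reason: str, raised_at: str, cleared_at: Optional[str] = None) -> bool:
        """Return True if a notification was sent, False if deduplicated."""
        if state not in ("raised", "cleared"):
            raise ValueError(f"unknown alert state: {state}")
        key = dedupe_key(alert_name, scope)
        previous = self._last_state.get(key)
        if previous == state or (previous is None and state == "cleared"):
            return False  # repeat of the same state, or a clear with no raise
        payload = {
            "dedupe_key": key,
            "alert_name": alert_name,
            "scope": scope,
            "state": state,
            "severity": severity,
            "reason": reason,
            "raised_at": raised_at,
            "cleared_at": cleared_at,
        }
        self._send(json.dumps(payload, sort_keys=True))
        self._last_state[key] = state
        return True


outbox = []  # stands in for the webhook endpoint in this example
notifier = WebhookNotifier(outbox.append)
common = dict(alert_name="near_intents_quotes_stale", scope="near-intents-ingest",
              severity="critical", reason="no quotes within policy window",
              raised_at="2026-04-07T09:11:38Z")
notifier.notify(state="raised", **common)
notifier.notify(state="raised", **common)  # suppressed as a duplicate
notifier.notify(state="cleared", cleared_at="2026-04-08T16:00:00Z", **common)
print(len(outbox))  # prints: 2
```

The dedupe state here lives in memory and would be lost on restart; a durable variant would derive the last delivered state from the stored alert records, which also gives operators the delivery evidence the operator surface asks for.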

### Safe containment

For at least one critical stale-data condition, the system must have a truthful containment path.

Initial target:

- if ingest or quote truth is critically stale, operators can see a real safe action and the system can optionally force or recommend a non-fund-moving containment state such as pausing or disarming trade-driving components
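
The initial target can be sketched as a pure decision function that separates what is recommended to operators from what is applied automatically. Everything here is illustrative: `SAFE_ACTIONS`, `ContainmentDecision`, `decide_containment`, and the action names are assumptions, with the allowlist loosely modeled on the non-fund-moving actions listed under Assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed allowlist of non-fund-moving actions; "disarm" stands in for
# explicit trading containment. Real action names may differ.
SAFE_ACTIONS = frozenset({"reconnect", "refresh", "pause", "resume", "disarm"})


@dataclass(frozen=True)
class ContainmentDecision:
    recommended: Tuple[str, ...]   # shown to operators as available controls
    auto_applied: Tuple[str, ...]  # applied without waiting for an operator
    reason: str


def decide_containment(quote_truth_critical: bool, executor_armed: bool,
                       auto_contain: bool) -> ContainmentDecision:
    """Decide containment for critically stale quote truth."""
    if not quote_truth_critical:
        return ContainmentDecision((), (), "quote truth within policy")
    recommended = ["reconnect"]
    if executor_armed:
        recommended.insert(0, "disarm")
    # Only containment is ever automatic; recovery stays an operator choice.
    auto = ("disarm",) if (auto_contain and executor_armed) else ()
    for action in recommended:
        if action not in SAFE_ACTIONS:
            raise ValueError(f"{action} is not an approved safe action")
    return ContainmentDecision(tuple(recommended), auto,
                               "quote truth critically stale")


decision = decide_containment(quote_truth_critical=True, executor_armed=True,
                              auto_contain=False)
print(decision.recommended, decision.auto_applied)
# prints: ('disarm', 'reconnect') ()
```

Keeping automatic behavior limited to disarming reflects the preference for containment over speculative recovery: a reconnect is offered but never forced.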

### Bounded anomaly detection

The system must add one anomaly layer beyond named invariants.

At minimum it must be able to flag:

- quote rate collapse versus recent baseline
- reconnect frequency spike versus recent baseline
- topic-flow mismatch, such as raw or service-local activity without downstream durable progression

These anomaly signals may be warning-level and operator-facing rather than pager-level.
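
As an illustration of the first signal, a quote rate collapse can be flagged with a deterministic rolling baseline. The sketch assumes a hypothetical `RateCollapseDetector` fed one quotes-per-minute sample at a time; the window size, the 20% collapse ratio, and the use of a median baseline are placeholder choices, not values from this proof.

```python
from collections import deque
from statistics import median


class RateCollapseDetector:
    """Flags samples that fall far below a rolling median baseline."""

    def __init__(self, window: int = 30, collapse_ratio: float = 0.2,
                 min_samples: int = 10) -> None:
        self._history = deque(maxlen=window)  # recent non-anomalous samples
        self._collapse_ratio = collapse_ratio
        self._min_samples = min_samples

    def observe(self, quotes_per_minute: float) -> bool:
        """Record a sample; return True if it is a collapse versus baseline."""
        collapsed = False
        if len(self._history) >= self._min_samples:
            baseline = median(self._history)
            collapsed = (baseline > 0
                         and quotes_per_minute < baseline * self._collapse_ratio)
        if not collapsed:
            # Collapsed samples are kept out of the baseline so an ongoing
            # outage does not gradually redefine what "normal" means.
            self._history.append(quotes_per_minute)
        return collapsed


detector = RateCollapseDetector()
for _ in range(10):
    detector.observe(100.0)       # warm-up: builds the baseline
print(detector.observe(5.0))      # prints: True
print(detector.observe(90.0))     # prints: False
```

Excluding collapsed samples from the baseline has a cost: a genuine, lasting drop in quote volume keeps firing until the baseline is reset. That is acceptable for a warning-level, operator-facing signal, but it is one reason the deterministic staleness alerts remain the primary path.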

## Definition of done

- The paused CoW turn is archived and the live turn files govern this runtime-health turn.
- `ops-sentinel` or equivalent repo-owned runtime health authority evaluates service and pipeline health for the active live system.
- Durable alert events represent the critical stale or disconnected conditions that the recent incident exposed.
- The dashboard shows critical or stale health for those conditions rather than merely reporting large freshness ages under healthy badges.
- At least one external alert route is wired and demonstrated from repo-owned code.
- At least one safe containment action exists and is truthful.
- At least one anomaly signal is implemented using recent-history comparisons or rolling baselines.
- Tests cover alert raising and clearing, health classification, dashboard severity rendering, and alert delivery behavior.

For this turn to close with status `passed`, a reproduced or simulated quote-ingest failure must produce:

- a durable raised alert
- a clearly non-healthy dashboard state
- an external notification
- evidence of safe containment or explicit safe-control availability

## Validation evidence required

- direct evidence that a stale or disconnected ingest condition is detected within a bounded threshold
- direct evidence that the dashboard severity changes to warning or critical instead of healthy for that condition
- direct evidence that the corresponding alert is stored durably
- direct evidence that an external alert notification is emitted
- automated test evidence for alert raise and clear transitions
- automated test evidence for dashboard health classification from stale service state
- automated test evidence for anomaly detection logic on at least one rolling-baseline case

## Failure conditions

- The system still shows healthy or equivalent status while quote truth is stale by policy.
- Alerts exist only in logs and are not durable or operator-visible.
- External notification is absent or only manual.
- Anomaly detection is described but not implemented in repo-owned code.
- Safe containment would move funds or widen risk without explicit approval.
- The turn produces only docs, dashboards, or thresholds without stronger runtime truth.

## Current real before this turn

- The NEAR Intents BTC/EURe loop is live and already has durable history, strategy, execution, and dashboard surfaces.
- `ops-sentinel` already emits some durable alerts for price, inventory, funding, and executor submission failures.
- Grafana and Loki already exist in the cluster.
- Service control APIs already expose state and safe controls.

## Deliberately not built by this proof

- full multi-channel escalation policy management inside the repo
- ML-first anomaly detection over raw free-text logs
- broad auto-remediation beyond narrowly approved safe runtime actions
- second-venue CoW execution work