# Implementation Turn: runtime health sentinel, alert routing, and anomaly detection

Status: open
Opened: 2026-04-08

## Goal
Make the live NEAR Intents system loudly and durably aware of broken runtime truth.

The system must detect stale or disconnected quote flow, grade service health from actual runtime behavior instead of mere reachability, surface that severity aggressively in the dashboard, emit durable alerts plus at least one external notification, and support tightly scoped safe containment for non-fund-moving failures.

## Selected backlog items
- [I020] Runtime health sentinel, alert routing, and anomaly detection: detect stale or broken service behavior across ingest, execution, persistence, and operator surfaces; surface critical failures loudly in the dashboard; emit durable alerts plus external notifications; and support tightly scoped safe remediation for non-fund-moving failures.

## Design rules
- Keep runtime health truth repo-owned and tied to the existing Kafka plus PostgreSQL pipeline.
- Prefer deterministic invariants first. Add anomaly detection as a secondary layer for things we did not name ahead of time.
- Do not trust service-local `/healthz` alone for end-to-end health.
- Safe containment beats speculative auto-repair.
- No new fund-moving actions or risk widening.

## Problem statement for this turn
The live incident showed three separate weaknesses:
- `near-intents-ingest` could remain running while quote truth stopped
- the dashboard could display large freshness ages while still calling the service healthy
- no durable alert or external notification made the issue impossible to miss

The turn therefore needs to improve:
- runtime-state collection
- cross-service health evaluation
- durable alert generation
- operator severity rendering
- external alert delivery
- bounded anomaly detection

## Minimal architecture changes for this turn
- `ops-sentinel` becomes the runtime health authority, not just a narrow stale-data checker.
- services expose richer state and truthful health hints through existing control APIs.
- `operator-dashboard` consumes alert severity and derived health rather than inferring “healthy” from reachability.
- one outbound notifier module sends raised or cleared alerts to an external webhook-style sink.
- anomaly logic runs in repo-owned code over service state and durable or live counters rather than free-text logs.

## Architecture changes

### 1. Extend service-local state and health surfaces
Each relevant service should expose enough state for sentinel evaluation.

For `near-intents-ingest`:
- websocket connected flag
- last websocket message time
- last matching quote time
- last published quote time
- publish error count
- pair filter state

For `trade-executor`:
- solver relay connected flag
- pending requests or in-flight count
- last quote-status or relay message time
- last error
- armed and paused state

For `history-writer`:
- last write time
- last alert write time
- offsets by topic
- database connectivity
- last error

For `operator-dashboard`:
- recent upstream source errors
- websocket fan-out status if available
- backend bootstrap-source failures

Do not rely on Kubernetes liveness alone.
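
A minimal sketch of what the `near-intents-ingest` state surface could look like, assuming the service keeps its timestamps in memory as epoch milliseconds. The `buildIngestState` helper and every field name are hypothetical and would need to match the existing control API.

```javascript
// Hypothetical sketch: helper name and field names are illustrative,
// not the existing control API.
function buildIngestState(runtime, nowMs = Date.now()) {
  // Ages are derived at read time so a consumer never has to trust a stale "ok".
  const ageMs = (timestampMs) => (timestampMs == null ? null : nowMs - timestampMs);
  return {
    service: "near-intents-ingest",
    observed_at: new Date(nowMs).toISOString(),
    websocket_connected: runtime.websocketConnected === true,
    last_ws_message_age_ms: ageMs(runtime.lastWsMessageAt),
    last_matching_quote_age_ms: ageMs(runtime.lastMatchingQuoteAt),
    last_published_quote_age_ms: ageMs(runtime.lastPublishedQuoteAt),
    publish_error_count: runtime.publishErrorCount ?? 0,
    pair_filter: runtime.pairFilter ?? null,
  };
}

const snapshot = buildIngestState(
  {
    websocketConnected: true,
    lastWsMessageAt: 9000,
    lastMatchingQuoteAt: 4000,
    lastPublishedQuoteAt: null, // nothing published yet
    publishErrorCount: 2,
    pairFilter: "EXAMPLE_PAIR", // placeholder value
  },
  10000
);
console.log(snapshot);
```

Reporting `null` for a timestamp that was never set, rather than zero, lets the sentinel distinguish "never happened" from "just happened".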

### 2. Expand `ops-sentinel` into runtime-health evaluation
Add a service-state polling loop inside `ops-sentinel` for the existing control APIs.

The sentinel should evaluate at least:
- service reachability
- service-local freshness
- disconnected websocket clients
- pipeline stalls where source activity stops or downstream persistence stops
- unsafe armed-state combinations
- self-health staleness

Prefer deriving named condition objects first, then reconciling them into alert state.
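
The derive-then-reconcile shape could be sketched as follows. This is an illustration only: the condition codes, the polled state fields (`websocketConnected`, `lastActivityAt`), and the per-service threshold map are assumptions, and it covers just three of the checks listed above.

```javascript
// Hypothetical sketch: condition codes, state fields, and thresholds are assumptions.
function deriveConditions(states, nowMs, thresholdsMs) {
  const conditions = [];
  for (const [service, state] of Object.entries(states)) {
    if (state == null) {
      // A failed poll is recorded as a null state.
      conditions.push({ code: "service_unreachable", service });
      continue;
    }
    if (state.websocketConnected === false) {
      conditions.push({ code: "websocket_disconnected", service });
    }
    const limitMs = thresholdsMs[service];
    if (limitMs != null && state.lastActivityAt != null) {
      const ageMs = nowMs - state.lastActivityAt;
      if (ageMs > limitMs) {
        conditions.push({ code: "service_stale", service, ageMs, limitMs });
      }
    }
  }
  return conditions;
}

// Reconcile the current condition set against the previous one to get transitions.
function reconcile(previousKeys, conditions) {
  const currentKeys = new Set(conditions.map((c) => `${c.code}:${c.service}`));
  const raised = [...currentKeys].filter((key) => !previousKeys.has(key));
  const cleared = [...previousKeys].filter((key) => !currentKeys.has(key));
  return { raised, cleared, currentKeys };
}

const conditions = deriveConditions(
  {
    "near-intents-ingest": { websocketConnected: false, lastActivityAt: 0 },
    "history-writer": null,
  },
  120000,
  { "near-intents-ingest": 60000 }
);
const transitions = reconcile(new Set(["service_stale:trade-executor"]), conditions);
console.log(transitions.raised, transitions.cleared);
```

Keeping conditions as plain data makes the raise and clear transitions easy to unit test without a running sentinel.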

### 3. Introduce a runtime health model
Define explicit derived states per service:
- `healthy`
- `warning`
- `critical`
- `offline`
- `paused`

Service severity must be derived from:
- an explicit `ok: false` from `/healthz`
- freshness thresholds
- disconnected state
- active alert severity for that service
- pipeline mismatches tied to that service

This logic should live in repo-owned code shared or mirrored between sentinel and dashboard so the UI cannot drift into a softer interpretation.
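
One way the shared derivation could look, as a sketch under stated assumptions: the input shape is hypothetical, an unreachable service maps to `offline`, and `paused` is reported only when nothing worse is active. The plan does not fix that precedence, so treat it as one possible choice.

```javascript
// Hypothetical sketch: input fields, thresholds, and precedence are assumptions.
const SEVERITY_RANK = new Map([["healthy", 0], ["warning", 1], ["critical", 2]]);
const worst = (a, b) => (SEVERITY_RANK.get(a) >= SEVERITY_RANK.get(b) ? a : b);

function deriveServiceSeverity(input) {
  if (!input.reachable) return "offline";
  let severity = "healthy";
  if (input.healthzOk === false) severity = worst(severity, "critical");
  if (input.connected === false) severity = worst(severity, "critical");
  if (input.pipelineMismatch === true) severity = worst(severity, "critical");
  if (input.freshnessAgeMs != null) {
    if (input.freshnessAgeMs > input.criticalAfterMs) severity = worst(severity, "critical");
    else if (input.freshnessAgeMs > input.warningAfterMs) severity = worst(severity, "warning");
  }
  for (const alertSeverity of input.activeAlertSeverities ?? []) {
    // Unknown severities are ignored rather than guessed at.
    if (SEVERITY_RANK.has(alertSeverity)) severity = worst(severity, alertSeverity);
  }
  // Assumption: "paused" never masks a warning or critical state.
  if (severity === "healthy" && input.paused) return "paused";
  return severity;
}

// The incident shape: /healthz says ok, but freshness is 30.8 hours.
const incidentSeverity = deriveServiceSeverity({
  reachable: true,
  healthzOk: true,
  connected: true,
  freshnessAgeMs: 110880000,
  warningAfterMs: 60000,
  criticalAfterMs: 300000,
});
console.log(incidentSeverity); // "critical"
```

Because the function is pure, sentinel and dashboard can import the same module or mirror it and compare outputs in a test.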

### 4. Add new deterministic alert classes
Extend `src/core/alert-engine.mjs` to cover at least:
- ingest disconnected
- ingest last quote stale
- ingest last publish stale
- executor relay disconnected
- history writer stalled
- dashboard upstream source degraded
- sentinel stale
- strategy or executor armed while critical input truth is stale

Each alert should include:
- `alert_code`
- `severity`
- `service_scope`
- clear reason text
- relevant timestamps
- threshold values used
- enough details for replay and operator debugging
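
A sketch of an alert record carrying those fields. Only `alert_code`, `severity`, and `service_scope` are named in this plan; the `buildAlert` helper, the remaining field names, and the example alert code are assumptions and do not describe the current `alert-engine.mjs` API.

```javascript
// Hypothetical sketch: helper name and all fields other than alert_code,
// severity, and service_scope are assumptions.
function buildAlert({
  alertCode, severity, serviceScope, reason,
  observedAtMs, lastEventAtMs = null, thresholdMs = null, details = {},
}) {
  for (const [name, value] of Object.entries({ alertCode, severity, serviceScope, reason })) {
    if (typeof value !== "string" || value.length === 0) {
      throw new TypeError(`${name} must be a non-empty string`);
    }
  }
  return {
    alert_code: alertCode,
    severity,
    service_scope: serviceScope,
    reason,
    observed_at: new Date(observedAtMs).toISOString(),
    last_event_at: lastEventAtMs == null ? null : new Date(lastEventAtMs).toISOString(),
    thresholds: { max_age_ms: thresholdMs },
    details, // free-form context for replay and operator debugging
  };
}

const staleQuoteAlert = buildAlert({
  alertCode: "ingest_last_quote_stale", // illustrative code
  severity: "critical",
  serviceScope: "near-intents-ingest",
  reason: "no matching quote for 120000 ms (limit 60000 ms)",
  observedAtMs: 120000,
  lastEventAtMs: 0,
  thresholdMs: 60000,
  details: { age_ms: 120000 },
});
console.log(JSON.stringify(staleQuoteAlert, null, 2));
```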

### 5. Add a quote-flow path check
The recent incident was not merely “service disconnected.” It was “quote truth stopped.”

Add explicit checks for the active pair path:
- last matching quote age
- last raw quote persisted age if available
- last normalized quote persisted age if available
- mismatch between service-local quote timestamps and history-writer progress

This should allow the system to say whether the failure is:
- upstream relay disconnected
- ingest receiving traffic but not publishing
- history path stalled after publish
- dashboard path stale despite durable writes
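
The four failure locations above can be told apart by walking the path from source to operator surface and reporting the first stale hop. The sketch below assumes a single staleness limit for every hop and treats a missing age as stale; the input fields and return codes are hypothetical.

```javascript
// Hypothetical sketch: input fields, return codes, and the shared limit are assumptions.
function classifyQuoteFlow(path, limitMs) {
  // A missing age is treated as stale so an absent signal cannot read as healthy.
  const stale = (ageMs) => ageMs == null || ageMs > limitMs;
  if (!path.websocketConnected || stale(path.lastWsMessageAgeMs)) {
    return "upstream_relay_disconnected";
  }
  if (stale(path.lastPublishedQuoteAgeMs)) return "ingest_receiving_but_not_publishing";
  if (stale(path.lastPersistedQuoteAgeMs)) return "history_path_stalled_after_publish";
  if (stale(path.dashboardQuoteAgeMs)) return "dashboard_stale_despite_durable_writes";
  return "flowing";
}

const verdict = classifyQuoteFlow(
  {
    websocketConnected: true,
    lastWsMessageAgeMs: 800,
    lastPublishedQuoteAgeMs: 95000, // traffic arrives, nothing is published
    lastPersistedQuoteAgeMs: 95000,
    dashboardQuoteAgeMs: 95000,
  },
  60000
);
console.log(verdict); // "ingest_receiving_but_not_publishing"
```

Checking hops in order matters: once an upstream hop is stale, every downstream hop is stale too, so only the first stale hop names the cause.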

### 6. Add bounded anomaly detection
Do not build a free-text log learner. Use recent history and rolling baselines.

Initial anomaly detectors:
- quote rate collapse over recent windows versus a rolling baseline
- reconnect frequency spike over recent windows
- durable-topic advancement collapse relative to recent baseline
- raw-to-normalized or normalized-to-durable mismatch if both sides normally advance together

Implementation shape:
- compute rolling counters in sentinel state or read recent durable counts from PostgreSQL
- emit warning-level anomaly alerts or operator notices
- keep deterministic alerts as the primary critical path
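
A minimal sketch of the quote-rate collapse detector, assuming per-window quote counts are already available as an array ordered oldest to newest. The window count, collapse ratio, minimum baseline, and the `quote_rate_collapse` code are illustrative defaults, not tuned values.

```javascript
// Hypothetical sketch: parameters and the anomaly code are assumptions.
function detectRateCollapse(
  countsPerWindow,
  { baselineWindows = 6, collapseRatio = 0.2, minBaseline = 5 } = {}
) {
  if (countsPerWindow.length < baselineWindows + 1) return null; // not enough history
  const current = countsPerWindow[countsPerWindow.length - 1];
  // Baseline is the mean of the windows immediately before the current one.
  const baselineSlice = countsPerWindow.slice(-1 - baselineWindows, -1);
  const baseline = baselineSlice.reduce((sum, n) => sum + n, 0) / baselineSlice.length;
  if (baseline < minBaseline) return null; // too quiet to judge; avoids noisy notices
  if (current >= baseline * collapseRatio) return null;
  return { code: "quote_rate_collapse", severity: "warning", current, baseline, collapseRatio };
}

const anomaly = detectRateCollapse([50, 48, 52, 49, 51, 50, 3]);
console.log(anomaly);
```

The minimum-baseline guard is there for the noise failure mode: a pair that is normally quiet should not raise a collapse notice every time it goes silent for one window.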

### 7. Add external alert delivery
Add one notifier module and config for a webhook-oriented external sink.

Required behaviors:
- deliver raised alerts
- optionally deliver clear events
- dedupe by alert identity
- record delivery success or failure in service state
- never block core alert state on delivery failure

Prefer a generic webhook shape that can feed:
- Alertmanager receiver
- Slack or Discord webhook
- another downstream router

If a direct Alertmanager integration is convenient, keep the repo-owned payload explicit and narrow.
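
The required behaviors could be sketched with an injected transport, so the notifier itself never depends on a particular sink. Everything here is an assumption for illustration: the `createNotifier` name, the `send` callback (which in practice would wrap an HTTP POST to the configured webhook), the payload shape, and the choice of `alert_code` plus `service_scope` as the alert identity.

```javascript
// Hypothetical sketch: names, payload shape, and identity key are assumptions.
function createNotifier(send) {
  const lastSent = new Map(); // alert identity -> last delivered transition
  const state = { delivered: 0, failed: 0, lastError: null };

  async function notify(alert, transition) {
    const identity = `${alert.alert_code}:${alert.service_scope}`;
    if (lastSent.get(identity) === transition) return "deduped";
    try {
      await send({ transition, alert });
      lastSent.set(identity, transition);
      state.delivered += 1;
      return "delivered";
    } catch (error) {
      // Failure is recorded and swallowed so alert generation is never blocked.
      state.failed += 1;
      state.lastError = error instanceof Error ? error.message : String(error);
      return "failed";
    }
  }

  return { notify, state };
}

const demoNotifier = createNotifier(async (payload) => {
  console.log("would deliver", payload.transition, payload.alert.alert_code);
});
demoNotifier
  .notify({ alert_code: "ingest_disconnected", service_scope: "near-intents-ingest" }, "raised")
  .then((outcome) => console.log(outcome));
```

A failed delivery does not update the dedupe map, so the same transition is attempted again on the next evaluation cycle.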

### 8. Add safe containment and remediation hooks
Add only narrowly safe actions.

Allowed first-pass actions:
- reconnect or refresh a service-local websocket client
- pause or resume ingest-side or alert-side routines
- disarm or pause strategy or executor when critical stale-data conditions are active

Containment policy for this turn:
- if quote truth is critically stale, the system must at least surface a truthful safe control
- if automatic containment is added, it must be non-fund-moving and clearly auditable

Do not add:
- restarts of arbitrary deployments without explicit approval
- trade retries
- bridge, approval, or funding actions
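
A sketch of how an allow-list plus an audit record could enforce that policy. The action names, the `quote_truth_stale` condition code, and the in-memory audit array are assumptions; a real implementation would presumably write the audit entry to the durable alert path instead.

```javascript
// Hypothetical sketch: action names, condition codes, and audit shape are assumptions.
const SAFE_ACTIONS = new Set(["reconnect_websocket", "pause_routine", "disarm_executor"]);

function planContainment(conditions, executorArmed) {
  const trigger = conditions.find(
    (c) => c.severity === "critical" && c.code === "quote_truth_stale"
  );
  if (!trigger || !executorArmed) return null;
  return { action: "disarm_executor", reason: trigger.code, fundMoving: false };
}

function applyContainment(plan, auditLog, nowMs, execute) {
  if (!SAFE_ACTIONS.has(plan.action) || plan.fundMoving !== false) {
    throw new Error(`containment action not allowed: ${plan.action}`);
  }
  const entry = { ...plan, at: new Date(nowMs).toISOString(), outcome: "pending" };
  auditLog.push(entry); // record intent before acting so nothing changes silently
  try {
    execute(plan.action);
    entry.outcome = "applied";
  } catch (error) {
    entry.outcome = `failed: ${error instanceof Error ? error.message : String(error)}`;
  }
  return entry;
}

const auditLog = [];
const plan = planContainment([{ code: "quote_truth_stale", severity: "critical" }], true);
const auditEntry = applyContainment(plan, auditLog, 0, (action) => {
  console.log("executing", action);
});
console.log(auditEntry.outcome); // "applied"
```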

### 9. Make the dashboard “angry”
The dashboard should no longer use friendly reachability heuristics for serious failures.

Required UI changes:
- global critical banner when critical alerts are active
- service cards reflect derived severity, not just `health.ok`
- stale services show exact age and why they are stale
- the funds page, or another relevant page, clearly flags stale quotes
- armed-state plus stale-data conditions render as critical

The specific failure mode `Freshness 30.8 h` plus `healthy` must become impossible.
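
A sketch of view-model helpers that would make that combination unrepresentable on the rendering side. The helper names, card shape, and banner text are hypothetical; the guard simply refuses to show `healthy` once freshness exceeds the stale limit, whatever severity the caller passed in.

```javascript
// Hypothetical sketch: helper names and view-model shapes are assumptions.
function formatAge(ageMs) {
  const seconds = ageMs / 1000;
  if (seconds < 60) return `${seconds.toFixed(0)} s`;
  if (seconds < 3600) return `${(seconds / 60).toFixed(1)} min`;
  return `${(seconds / 3600).toFixed(1)} h`;
}

function buildServiceCard({ name, severity, freshnessAgeMs, staleAfterMs, reason = null }) {
  const stale = freshnessAgeMs != null && freshnessAgeMs > staleAfterMs;
  // Guard: a stale service is never rendered as healthy.
  const shownSeverity = stale && severity === "healthy" ? "critical" : severity;
  return {
    name,
    severity: shownSeverity,
    freshness: freshnessAgeMs == null ? "unknown" : formatAge(freshnessAgeMs),
    reason: stale ? reason ?? "freshness threshold exceeded" : reason,
  };
}

function buildBanner(activeAlerts) {
  const critical = activeAlerts.filter((alert) => alert.severity === "critical");
  if (critical.length === 0) return null;
  return {
    level: "critical",
    text: `${critical.length} critical alert(s) active`,
    codes: critical.map((alert) => alert.alert_code),
  };
}

const card = buildServiceCard({
  name: "near-intents-ingest",
  severity: "healthy", // what a reachability-only check would have claimed
  freshnessAgeMs: 110880000,
  staleAfterMs: 300000,
});
console.log(card.freshness, card.severity); // "30.8 h" "critical"
```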

### 10. Preserve durable truth
Runtime health changes must remain durable and replayable.

Use existing paths where practical:
- `ops.alert` stays the durable alert topic
- `ops_alerts` remains the canonical alert store

If anomaly notices need separate storage, keep them closely aligned with existing alerting rather than creating a parallel history system without need.

## Concrete implementation order

### Phase 1. Define the health contract
- enumerate service-local fields needed from each control API
- define severity thresholds for each relevant service
- define the first deterministic alert set
- define the first anomaly set and its data sources

### Phase 2. Strengthen service state surfaces
- update relevant `/state` and `/healthz` providers
- expose websocket connection and freshness timestamps where missing
- expose notifier and sentinel delivery state where needed

### Phase 3. Expand sentinel evaluation
- add service polling
- evaluate new deterministic conditions
- reconcile raised and cleared alert state
- keep current alert engine behavior passing

### Phase 4. Add quote-flow and pipeline checks
- compare ingest-local state to durable progression
- detect stalled or mismatched quote flow
- emit new alerts with precise reasons

### Phase 5. Add anomaly detection
- implement rolling or recent-window rate calculations
- compare against baseline windows
- emit warning-level anomaly alerts or notices

### Phase 6. Add external delivery
- add notifier config and delivery module
- send alert transitions to one external sink
- expose delivery health and failures

### Phase 7. Make the dashboard severity truthful
- replace soft healthy classification with derived severity
- add loud banners and clearer service-card rendering
- show freshness and cause directly

### Phase 8. Add safe containment
- wire at least one truthful safe containment action
- optionally auto-disarm or auto-pause on critical stale-data conditions if implemented cleanly
- keep changes auditable and non-fund-moving

### Phase 9. Validate against the recent incident
- reproduce or simulate a stranded ingest websocket condition
- confirm durable alert, dashboard critical state, external notification, and containment behavior

## Test plan
- unit tests for alert-engine raise and clear transitions for each new runtime-health alert
- unit tests for derived service severity from freshness and disconnected state
- unit tests for anomaly detectors on quote-rate collapse and reconnect spikes
- unit tests for notifier dedupe and failure handling
- integration tests for sentinel polling and alert emission using mocked service states
- integration tests for dashboard bootstrap or reducer behavior under critical alerts and stale service snapshots
- regression tests proving current price, inventory, funding, and executor submission alerts still work

No runtime-health bug fix is complete without a regression test covering the missed condition.

## Validation checklist against the proof
- stale ingest is detected by policy, not by manual interpretation
- dashboard shows warning or critical instead of healthy for stale or disconnected ingest
- alert event is durable and replayable
- external notification is emitted
- at least one safe containment path exists
- anomaly layer flags at least one non-predeclared abnormal pattern
- live NEAR spendable truth remains unchanged

## Failure modes to plan for
- service `/healthz` still says `ok` while runtime truth is stale
- dashboard severity logic diverges from sentinel logic
- alert delivery blocks or breaks alert generation
- anomaly detection floods operators with noise and hides real failures
- containment silently changes arming state without durable evidence
- service polling itself goes stale and no sentinel-stale alert fires
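
For the last failure mode, one option is a heartbeat that the poll loop updates and that some other component checks, so a hung sentinel cannot vouch for itself. The sketch below is an assumption about shape only: the function name, the heartbeat timestamp, and the allowed number of missed polls are all hypothetical.

```javascript
// Hypothetical sketch: name, inputs, and the missed-poll allowance are assumptions.
// Intended to run outside the sentinel's own poll loop (for example in the
// dashboard backend), reading a heartbeat the loop writes after each cycle.
function isSentinelStale(lastPollCompletedAtMs, nowMs, pollIntervalMs, missedPollsAllowed = 3) {
  if (lastPollCompletedAtMs == null) return true; // never completed a poll
  return nowMs - lastPollCompletedAtMs > pollIntervalMs * missedPollsAllowed;
}

console.log(isSentinelStale(0, 20000, 5000)); // true: 20 s since last poll, limit 15 s
console.log(isSentinelStale(10000, 20000, 5000)); // false: 10 s since last poll
```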