Observability as Correctness: End-to-End Correlation Plan

Date: 2026-03-03

Why this change

The current telemetry is useful but event-centric. We can debug failures, but we do not yet have a full lifecycle trace for each chat/edit request. AI editing failures often have non-obvious causes:
  • schema rejects
  • partial apply failures
  • repair/retry loops
  • model behavior drift
  • preview sync mismatches
Treating observability as part of correctness means every request should be inspectable as one correlated execution rather than as a set of disconnected log events.

Current state in repo

Already implemented

  • Structured chat telemetry phases in orchestrator:
    • received
    • forced_plan
    • deterministic_plan_generated
    • plan_attempt_failed
    • plan_generated
    • plan_apply_failed
    • repair_attempt
    • repair_generated
    • result
  • NDJSON persistence + API endpoints:
    • GET /telemetry/chat
    • GET /telemetry/chat/review
  • Per-request traceId already included in chat debug payloads.
  • Preview patch transport includes txId + patchAck handshake.

Gaps

  • No parent/child span model (flat events only).
  • No standardized duration per lifecycle stage.
  • No shared trace context across orchestrator + editor + preview bridge.
  • Preview patch ack latency is not linked to server request trace.
  • No first-class rollback span when progressive apply fails.

Target model

Every chat/edit request is one root trace: chat.request. Child spans:
  1. intent.detect
  2. plan.generate (attempt-aware)
  3. plan.normalize
  4. repair.attempt
  5. repair.generate
  6. plan.validate
  7. ops.apply
  8. ops.rollback (when needed)
  9. preview.sync (patch ack timing)
  10. response.finalize
All spans are correlated by shared identifiers (a sketch of the resulting span record follows this list):
  • traceId
  • spanId
  • parentSpanId
  • session
  • siteId
  • requestedSlug
  • effectiveSlug
  • provider
  • modelKey
  • modelUsed
  • promptHash
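
For concreteness, a minimal TypeScript sketch of one span record under this model; only the span names and identifiers above come from the plan, while the remaining field names (startedAt, status, attributes) are illustrative rather than a committed schema.

```typescript
// Illustrative shape of one span entry in the target model. Only the span
// names and identifiers listed above are committed; startedAt/status/attributes
// naming is a sketch, not the final telemetry schema.
type ChatSpanName =
  | "chat.request"
  | "intent.detect"
  | "plan.generate"
  | "plan.normalize"
  | "repair.attempt"
  | "repair.generate"
  | "plan.validate"
  | "ops.apply"
  | "ops.rollback"
  | "preview.sync"
  | "response.finalize";

interface ChatSpan {
  name: ChatSpanName;
  traceId: string;        // shared by every span of one chat/edit request
  spanId: string;
  parentSpanId?: string;  // absent only on the chat.request root span
  startedAt: number;      // epoch milliseconds
  durationMs?: number;    // filled in when the span ends
  attempt?: number;       // for attempt-aware spans (plan.generate, repair.*)
  status: "ok" | "error";
  attributes: {
    session: string;
    siteId: string;
    requestedSlug: string;
    effectiveSlug: string;
    provider: string;
    modelKey: string;
    modelUsed: string;
    promptHash: string;
  };
}
```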

Proposed implementation

Phase 1: Span model on top of existing telemetry

Goal: no behavior change, just richer telemetry.
  • Add a telemetry tracing helper in the orchestrator (example: src/telemetry/trace.ts; sketched after this list):
    • startChatTrace(...)
    • startSpan(name, attrs)
    • endSpan(status, attrs?)
    • recordException(error, attrs?)
  • Extend telemetry entry shape with optional:
    • traceId
    • spanId
    • parentSpanId
    • durationMs
    • attempt
  • Keep existing phase events for backward compatibility.
  • In runChatPipeline(...), wrap each existing stage with span boundaries.
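
A minimal sketch of that helper, assuming the existing NDJSON writer is callable as emitChatTelemetry (a placeholder name; the real function lives in chat-telemetry.ts and may differ) and honoring the CHAT_TRACE_SPANS gate from the rollout plan:

```typescript
// Sketch of src/telemetry/trace.ts. emitChatTelemetry stands in for the
// existing NDJSON chat telemetry writer (assumed name); span entries are
// purely additive and only emitted when CHAT_TRACE_SPANS=1.
import { randomUUID } from "node:crypto";

declare function emitChatTelemetry(entry: Record<string, unknown>): void;

export interface SpanHandle {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startedAt: number;
  attrs: Record<string, unknown>;
}

export function startChatTrace(attrs: Record<string, unknown> = {}): SpanHandle {
  return { traceId: randomUUID(), spanId: randomUUID(), name: "chat.request", startedAt: Date.now(), attrs };
}

export function startSpan(parent: SpanHandle, name: string, attrs: Record<string, unknown> = {}): SpanHandle {
  return {
    traceId: parent.traceId,
    parentSpanId: parent.spanId,
    spanId: randomUUID(),
    name,
    startedAt: Date.now(),
    attrs,
  };
}

export function endSpan(span: SpanHandle, status: "ok" | "error", attrs: Record<string, unknown> = {}): void {
  if (process.env.CHAT_TRACE_SPANS !== "1") return;
  emitChatTelemetry({
    phase: span.name,
    traceId: span.traceId,
    spanId: span.spanId,
    parentSpanId: span.parentSpanId,
    durationMs: Date.now() - span.startedAt,
    status,
    ...span.attrs,
    ...attrs,
  });
}

export function recordException(span: SpanHandle, error: unknown, attrs: Record<string, unknown> = {}): void {
  endSpan(span, "error", {
    errorMessage: error instanceof Error ? error.message : String(error),
    ...attrs,
  });
}
```

In runChatPipeline(...), each stage is then bracketed with startSpan/endSpan around its existing phase emission, so current telemetry consumers see no change.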

Phase 2: OpenTelemetry exporter and resource context

Goal: interoperable telemetry backend support.
  • Add dependencies in orchestrator:
    • @opentelemetry/api
    • @opentelemetry/sdk-node
    • @opentelemetry/exporter-trace-otlp-http
    • @opentelemetry/resources
  • Configure resource attributes:
    • service.name=ai-site-editor-orchestrator
    • service.version=<git sha or package version>
    • deployment.environment=<env>
  • Enable with an env switch (wiring sketched after this list):
    • OTEL_ENABLED=1
    • OTEL_EXPORTER_OTLP_ENDPOINT=...
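
A sketch of that wiring; exact constructor shapes vary across @opentelemetry package versions (Resource construction in particular changed between SDK major versions), and GIT_SHA/NODE_ENV as the version and environment sources are assumptions:

```typescript
// Outline of OTel startup for the orchestrator, gated on OTEL_ENABLED=1.
// The OTLP exporter reads OTEL_EXPORTER_OTLP_ENDPOINT from the environment.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";

export function maybeStartOtel(): NodeSDK | undefined {
  if (process.env.OTEL_ENABLED !== "1") return undefined;

  const sdk = new NodeSDK({
    resource: new Resource({
      "service.name": "ai-site-editor-orchestrator",
      "service.version": process.env.GIT_SHA ?? "unknown",        // assumed env var
      "deployment.environment": process.env.NODE_ENV ?? "development",
    }),
    traceExporter: new OTLPTraceExporter(),
  });

  sdk.start();
  return sdk;
}
```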

Phase 3: Cross-app correlation to preview ack

Goal: close the loop from plan/apply to user-visible preview sync.
  • Include traceId + operation index in op_applied SSE payload.
  • In editor usePreviewBridge, measure:
    • patch send timestamp
    • patch ack timestamp
    • ackMs
  • Add endpoint POST /telemetry/preview-ack in orchestrator to ingest:
    • traceId
    • txId
    • opIndex
    • ackMs
    • accepted
    • reason
  • Emit the preview.sync child span from this payload (reporting sketch below).
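
A sketch of the editor-side report; the endpoint path matches the list above, while the relative URL, fetch transport, and function name are illustrative:

```typescript
// Fire-and-forget preview ack report from usePreviewBridge. Failures are
// swallowed so a missing or unreachable endpoint can never block the UI.
export interface PreviewAckReport {
  traceId: string;
  txId: string;
  opIndex: number;
  ackMs: number;
  accepted: boolean;
  reason?: string;
}

export function reportPreviewAck(report: PreviewAckReport): void {
  void fetch("/telemetry/preview-ack", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(report),
    keepalive: true, // survives navigation away from the editor
  }).catch(() => undefined);
}

// Usage around the patch send/ack pair (timestamps from performance.now()):
//   const sentAt = performance.now();
//   ...patch sent, patchAck received...
//   reportPreviewAck({ traceId, txId, opIndex, ackMs: performance.now() - sentAt, accepted: true });
```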

Phase 4: Correctness-oriented metrics and SLOs

Metrics (see the instrument sketch after this list):
  • Histograms:
    • chat.plan.duration_ms
    • chat.apply.duration_ms
    • chat.preview_ack.duration_ms
  • Counters:
    • chat.retry.count
    • chat.repair.count
    • chat.rollback.count
    • chat.schema_reject.count
    • chat.partial_apply.count
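
A sketch of how these instruments could be declared with @opentelemetry/api, assuming a metrics provider (not part of the Phase 2 dependency list) is also configured:

```typescript
// Instrument names mirror the list above; keep attribute sets low-cardinality
// (provider, modelKey, reason), per the risks section.
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("ai-site-editor-orchestrator");

export const planDurationMs = meter.createHistogram("chat.plan.duration_ms", { unit: "ms" });
export const applyDurationMs = meter.createHistogram("chat.apply.duration_ms", { unit: "ms" });
export const previewAckDurationMs = meter.createHistogram("chat.preview_ack.duration_ms", { unit: "ms" });

export const retryCount = meter.createCounter("chat.retry.count");
export const repairCount = meter.createCounter("chat.repair.count");
export const rollbackCount = meter.createCounter("chat.rollback.count");
export const schemaRejectCount = meter.createCounter("chat.schema_reject.count");
export const partialApplyCount = meter.createCounter("chat.partial_apply.count");

// Example recording points inside the pipeline:
//   planDurationMs.record(durationMs, { provider, modelKey });
//   rollbackCount.add(1, { reason: "partial_apply" });
```
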
Initial thresholds to monitor:
  • p95 plan.generate latency
  • p95 preview.sync ack latency
  • repair rate
  • rollback rate
  • schema rejection rate

Suggested code touchpoints

  • apps/orchestrator/src/chat/chat-pipeline.ts
    • root trace and child spans around each lifecycle stage
  • apps/orchestrator/src/telemetry/chat-telemetry.ts
    • entry schema enrichment for span metadata and durations
  • apps/orchestrator/src/routes/chat.ts
    • include trace context in SSE op events (payload sketch after this list)
  • apps/editor/src/hooks/useChatEngine.ts
    • carry trace context through streaming apply path
  • apps/editor/src/hooks/usePreviewBridge.ts
    • measure and report patch ack timing
  • packages/preview-adapter/src/preview-bridge.tsx
    • keep ack semantics stable; optional payload enrichment
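
For the routes/chat.ts touchpoint, a sketch of the additive trace context on the op_applied SSE event; the event's existing fields are not documented here, so the base shape is left open and only the new fields are shown:

```typescript
// Additive trace context on the op_applied SSE event (Phase 3). The base
// event shape is assumed; only the new fields below are prescriptive.
interface OpAppliedTraceContext {
  traceId: string;   // root chat.request trace
  spanId?: string;   // ops.apply span that produced this op, if available
  opIndex: number;   // operation index within the applied plan
  txId: string;      // preview patch transaction id (already in the transport)
}

type OpAppliedEvent = Record<string, unknown> & OpAppliedTraceContext;
```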

Rollout strategy

  1. Ship Phase 1 behind CHAT_TRACE_SPANS=1 and keep existing telemetry output unchanged.
  2. Validate in local + integration tests (chat-pipeline-integration.test.ts).
  3. Enable Phase 2 in staging only; verify trace volume and cardinality.
  4. Add Phase 3 preview ack ingestion; ensure no UI regression when the endpoint is unavailable.
  5. Start alerting on correctness metrics (repair/rollback/schema reject trends).

Risks and mitigations

  • Risk: telemetry cardinality explosion.
    • Mitigation: cap high-cardinality attributes; hash long text; avoid raw prompts.
  • Risk: frontend reporting failures.
    • Mitigation: fire-and-forget preview ack endpoint; never block user flow.
  • Risk: migration breaks existing telemetry consumers.
    • Mitigation: additive schema only; keep old fields/phases.

Definition of done

  • Every chat/edit request has one root trace with child spans for planning, validation, apply, and preview sync.
  • Failed edits are searchable by traceId, with a clear indication of which stage failed.
  • Repair/retry/rollback rates are measurable over time.
  • Preview ack latency is visible and attributable to specific edit traces.