Observability, Edge, SRE

Edge Observability Playbook: From Logs to Decisions

A practical migration from “seeing incidents” to “making decisions fast” with metrics, logs, traces, and actionable alerts.

Our old incident workflow at the edge was simple: check logs, then guess.
It worked when traffic was small, but once we had multi-region rollouts and concurrent builds, troubleshooting became a manual marathon.

This migration aimed to move from “observable” to “decidable”:

identify whether incidents are regional, release-specific, or upstream in under 30 seconds;
produce a rollback/degrade decision in under 5 minutes;
reduce alert noise to a level humans can actually act on.

Metrics First: One Shared Vocabulary

We did not start with new dashboards.
We started by standardizing metric names and labels.

For example, latency, response_ms, and duration became request_duration_ms, always labeled with:

region
route
build_id
cache_status

Without this shared vocabulary, every later query and alert rule becomes fragile, because it has to match several names for the same measurement.
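As a rough illustration of the renaming step, the sketch below maps the three legacy names onto request_duration_ms and rejects samples that lack any of the four labels. The Sample class and the normalize function are hypothetical names used only for this example, and the sketch assumes the legacy values were already recorded in milliseconds; it is not the pipeline described in this post.

```python
from dataclasses import dataclass

# Legacy names mentioned in the text, mapped to the one canonical name.
LEGACY_NAMES = {
    "latency": "request_duration_ms",
    "response_ms": "request_duration_ms",
    "duration": "request_duration_ms",
}

# Labels every request_duration_ms sample is expected to carry.
REQUIRED_LABELS = ("region", "route", "build_id", "cache_status")


@dataclass
class Sample:
    """Hypothetical container for one normalized metric sample."""
    name: str
    value_ms: float
    labels: dict


def normalize(name: str, value_ms: float, labels: dict) -> Sample:
    """Rename a legacy metric and refuse samples with missing labels.

    Assumes value_ms is already in milliseconds; unit conversion for
    legacy metrics is out of scope for this sketch.
    """
    canonical = LEGACY_NAMES.get(name, name)
    missing = [key for key in REQUIRED_LABELS if key not in labels]
    if missing:
        raise ValueError(f"{canonical}: missing labels {missing}")
    return Sample(canonical, value_ms, dict(labels))
```

Rejecting incomplete samples at the point of emission, rather than patching them in queries, is one way to keep the vocabulary from drifting again.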

Logs: From Text Dumps to Structured Events

Instead of free-form log strings, every event now follows a strict JSON schema with:

trace_id
span_id
request_id
user_agent_bucket
degrade_mode

So when an alert fires, we can jump from metrics to relevant logs immediately.
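A minimal sketch of such an event builder, using only the Python standard library, might look like the following. The five required fields come from the list above; the ts and message keys, the build_event name, and the example values are assumptions made for illustration.

```python
import json
import time

# Fields the schema requires on every event.
REQUIRED_FIELDS = (
    "trace_id",
    "span_id",
    "request_id",
    "user_agent_bucket",
    "degrade_mode",
)


def build_event(message: str, **fields) -> str:
    """Serialize one log event as a JSON line, enforcing required fields.

    The "ts" and "message" keys are assumptions of this sketch, not part
    of the schema described in the text.
    """
    missing = [key for key in REQUIRED_FIELDS if key not in fields]
    if missing:
        raise ValueError(f"log event missing fields: {missing}")
    event = {"ts": time.time(), "message": message, **fields}
    # sort_keys keeps the output stable, which helps when diffing log lines.
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(build_event(
        "cache miss on origin fetch",
        trace_id="t-123",
        span_id="s-1",
        request_id="r-456",
        user_agent_bucket="mobile",
        degrade_mode="off",
    ))
```

Because trace_id and request_id are present on every line, a log search keyed on either value returns the events for a single request without parsing free-form text.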

Tracing: Critical Paths Only

We intentionally avoided full tracing across everything.
We traced only high-value paths:

home request
search request
project detail request

Each path keeps 3-5 meaningful spans.
The goal is readability under pressure, not maximal data volume.
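The policy above can be sketched without any tracing library: an allowlist of traced paths plus a hard cap on spans per path. The path identifiers, the PathTrace class, and the start_trace function are hypothetical names for this example; a real setup would apply the same policy through whatever tracing SDK is in use.

```python
from typing import Optional

# Upper bound from the text: each path keeps 3-5 meaningful spans.
MAX_SPANS_PER_PATH = 5

# Hypothetical identifiers for the three traced paths.
TRACED_PATHS = {"home", "search", "project_detail"}


class PathTrace:
    """Collects a bounded number of spans for one traced request."""

    def __init__(self, path: str, trace_id: str) -> None:
        self.path = path
        self.trace_id = trace_id
        self.spans: list = []

    def add_span(self, name: str, duration_ms: float) -> bool:
        """Record a span; return False once the cap is reached."""
        if len(self.spans) >= MAX_SPANS_PER_PATH:
            return False
        self.spans.append((name, duration_ms))
        return True


def start_trace(path: str, trace_id: str) -> Optional[PathTrace]:
    """Return a trace only for allowlisted paths, otherwise None."""
    if path not in TRACED_PATHS:
        return None
    return PathTrace(path, trace_id)
```

Returning None for other paths means callers skip tracing work entirely, and the cap keeps any single trace short enough to read during an incident.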

Alerts: Actionable by Default

Alert messages moved from symptom-only to action-first:

Before: p95 latency > threshold
After: [region=HKG][build=2026.04.22.3] p95 +42%, enable cache-priority mode, re-check in 3 minutes

This single change reduced escalation time more than any dashboard polish.
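To make the "After" format concrete, here is a small sketch that builds such a message from a baseline and a current p95. The format_alert function and its parameters are hypothetical, and how the baseline and the recommended action are chosen is left out; the post does not describe that logic.

```python
def format_alert(
    region: str,
    build_id: str,
    baseline_p95_ms: float,
    current_p95_ms: float,
    action: str,
    recheck_minutes: int,
) -> str:
    """Build an action-first alert line: scope, change, action, re-check."""
    if baseline_p95_ms <= 0:
        raise ValueError("baseline_p95_ms must be positive")
    # Signed percentage change of p95 relative to the baseline.
    change_pct = round((current_p95_ms - baseline_p95_ms) / baseline_p95_ms * 100)
    return (
        f"[region={region}][build={build_id}] p95 {change_pct:+d}%, "
        f"{action}, re-check in {recheck_minutes} minutes"
    )


if __name__ == "__main__":
    # A 200 ms baseline and a 284 ms current p95 give a +42% change.
    print(format_alert("HKG", "2026.04.22.3", 200.0, 284.0,
                       "enable cache-priority mode", 3))
```

Putting region and build first lets the on-call engineer see the scope of the problem before reading the number, which matches the order of the decisions listed at the top of this post.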

Final Note

Observability maturity is not a chart-count contest.
The real benchmark is whether your system helps humans decide clearly during stressful moments.