Centralised Cloud Run observability — Work

The problem

The platform ran eight or nine Cloud Run services across two environments, each emitting structured JSON to Cloud Logging. Diagnosing a failure meant clicking through the Logs Explorer in one project and the Cloud Run console in another. Measuring SLA meant exporting logs to a spreadsheet. Adding a new service to the picture meant repeating the process.

The fix wasn’t a new vendor. It was joining what was already there.

What we built

Folder-level aggregated log sinks routing every Cloud Run revision log into a per-environment BigQuery dataset in the platform project, scoped at the GCP folder so new services are picked up automatically.
Same-project IAM, no cross-project grants. The sink and the dataset live together, and the existing convention for warehouse placement is preserved.
A small library of monitoring queries covering SLA availability, error rate, latency percentiles, scaling and concurrency, and dependency health. Each is a saved BigQuery view, named so that it sits next to the raw log tables in the dataset.
A roll-out plan that started in UAT. We validated the data flow against real traffic, and checked the monitoring queries against the questions the platform team actually needed answered, before turning it on in production.
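
As a rough illustration of the sink described above, the sketch below builds the request body for a folder-level aggregated sink in Python. It is not the project's Terragrunt, only a minimal sketch under stated assumptions: the helper name build_folder_sink, the sink name, the project ID and the dataset naming scheme are all hypothetical. The field names (destination, filter, includeChildren, bigqueryOptions.usePartitionedTables) are those of the Cloud Logging LogSink resource.

```python
"""Sketch: request body for a folder-level aggregated log sink."""
import json


def build_folder_sink(environment: str, platform_project: str) -> dict:
    """Return a LogSink body routing Cloud Run revision logs to BigQuery.

    Assumption: one dataset per environment, named cloud_run_logs_<env>.
    """
    dataset = f"cloud_run_logs_{environment}"  # hypothetical naming scheme
    return {
        "name": f"cloud-run-to-bq-{environment}",  # hypothetical sink name
        "destination": (
            "bigquery.googleapis.com/"
            f"projects/{platform_project}/datasets/{dataset}"
        ),
        # Match every Cloud Run revision log, whichever service wrote it.
        "filter": 'resource.type="cloud_run_revision"',
        # Include logs from all projects under the folder, so a newly
        # added service is picked up without changing the sink.
        "includeChildren": True,
        # Write to date-partitioned tables rather than date-sharded ones.
        "bigqueryOptions": {"usePartitionedTables": True},
    }


if __name__ == "__main__":
    # "platform-project-id" is a placeholder, not a real project.
    print(json.dumps(build_folder_sink("uat", "platform-project-id"), indent=2))
```

A body like this would be created under the folder's sinks collection. The sink's writer identity then needs write access to the destination dataset (for example the BigQuery Data Editor role), which is the only grant involved.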

What we were careful about

Cost. Routing logs to BigQuery is billed by the volume ingested and stored, so the destination tables are partitioned and clustered, and the monitoring queries filter on those columns to keep the dashboards cheap. We measured before, during and after.
No application code changes. The services already emit structured JSON via a shared logging library, so the work happens entirely at the platform layer.
The legacy service. One older Cloud Run service in production has only partial structured logging. The sink captures what’s there; the gap is documented as a follow-on rather than blocking the rollout.

The shape that’s left behind

A pair of Terragrunt modules, a dataset full of logs, and a small set of named views. Nothing exotic. Everything queryable in SQL.
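
To give a sense of how little is going on inside the named views, here is the arithmetic behind availability, error rate and latency percentiles, sketched in Python over a handful of made-up request rows. The RequestRow type, its field names and the sample values are hypothetical, and the nearest-rank percentile is a stand-in chosen for clarity; a BigQuery view would more likely use something like APPROX_QUANTILES, which returns approximate results.

```python
"""Sketch: the arithmetic behind availability, error-rate and latency views."""
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRow:
    status: int        # HTTP response status code
    latency_ms: float  # request latency in milliseconds


def error_rate(rows: list[RequestRow]) -> float:
    """Fraction of requests that ended in a 5xx response."""
    if not rows:
        raise ValueError("no requests in the window")
    failures = sum(1 for row in rows if 500 <= row.status <= 599)
    return failures / len(rows)


def availability(rows: list[RequestRow]) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    return 1.0 - error_rate(rows)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(p / 100 * n)."""
    if not values:
        raise ValueError("no values in the window")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up rows standing in for one window of request logs.
    window = [
        RequestRow(200, 100.0),
        RequestRow(200, 200.0),
        RequestRow(200, 300.0),
        RequestRow(500, 400.0),
    ]
    latencies = [row.latency_ms for row in window]
    print("availability:", availability(window))
    print("error rate:  ", error_rate(window))
    print("p50 latency: ", percentile(latencies, 50))
    print("p95 latency: ", percentile(latencies, 95))
```

Counting only 5xx responses as failures is an assumption here; whether 4xx responses count against the SLA is a policy decision, not something the logs decide.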