Centralised Cloud Run observability
Cloud Run services were scattered across UAT and production, with no single place to query logs or measure SLAs. We routed their logs into BigQuery and put a set of monitoring skills on top of it.
- Outcome
- Log-derived SLAs and error budgets available without leaving BigQuery — and an extensible pattern for future services.
- Stack
- Google Cloud Run
- BigQuery
- Cloud Logging
- Terragrunt
- OpenTofu
- Tags
- GCP
- Cloud Run
- Terragrunt
- Observability
The problem
The platform ran eight or nine Cloud Run services across two environments, each emitting structured JSON to Cloud Logging. Diagnosing a failure meant clicking through the Logs Explorer in one project and the Cloud Run console in another. Measuring SLA meant exporting logs to a spreadsheet. Adding a new service to the picture meant repeating the process.
The fix wasn’t a new vendor. It was joining what was already there.
What we built
- Folder-level aggregated log sinks routing every Cloud Run revision log into a per-environment BigQuery dataset in the platform project, scoped at the GCP folder so new services are picked up automatically.
- Same-project IAM, no cross-project grants. The sink and the dataset live together, and the existing convention for warehouse placement is preserved.
- A small library of monitoring queries covering SLA availability, error rate, latency percentiles, scaling and concurrency, and dependency health. Each is a saved BigQuery view, named so that it appears in the BigQuery Explorer alongside the raw log tables.
- A roll-out plan that started in UAT. We validated the data flow against real traffic, and the monitoring queries against the questions the platform team actually needed answered, before turning it on in production.
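As a rough illustration of the folder-level sink in the first bullet, the sketch below builds the kind of request body the Cloud Logging API accepts when creating a sink under a folder. The field names (`destination`, `filter`, `includeChildren`, `bigqueryOptions`) and the `cloud_run_revision` resource type come from Cloud Logging; the folder ID, project, dataset and sink names are hypothetical, and enabling partitioned tables is an assumption rather than something the write-up confirms. The real definition lives in the Terragrunt modules, which are not shown here.

```python
import json


def build_folder_sink(folder_id: str, project_id: str, dataset_id: str, sink_name: str) -> dict:
    """Return a LogSink-shaped dict for a folder-level aggregated sink.

    All argument values are supplied by the caller; nothing here talks to GCP.
    """
    return {
        # Creating the sink on the folder is what lets it see every child project.
        "parent": f"folders/{folder_id}",
        "sink": {
            "name": sink_name,
            # BigQuery destinations use this fixed prefix followed by the dataset path.
            "destination": (
                f"bigquery.googleapis.com/projects/{project_id}/datasets/{dataset_id}"
            ),
            # Only Cloud Run revision logs are routed.
            "filter": 'resource.type="cloud_run_revision"',
            # Aggregated sink: include logs from projects under the folder,
            # so newly added services are picked up without further changes.
            "includeChildren": True,
            # Assumption: date-partitioned tables rather than date-sharded ones.
            "bigqueryOptions": {"usePartitionedTables": True},
        },
    }


# Hypothetical identifiers, for illustration only.
uat_sink = build_folder_sink(
    folder_id="123456789012",
    project_id="platform-uat",
    dataset_id="cloud_run_logs_uat",
    sink_name="cloud-run-to-bigquery-uat",
)
print(json.dumps(uat_sink, indent=2))
```

One sink per environment, each pointing at its own dataset, matches the per-environment layout described above; the same function would be called again with production values.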
What we were careful about
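- Cost. Exported logs are billed by volume, and every query is billed by the data it scans. The tables behind the queries are partitioned and clustered so that each query reads only the slice it needs, which keeps the dashboards cheap. We measured before, during and after.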
- No application code changes. The services already emit structured JSON via a shared logging library, so the work happens entirely at the platform layer.
- The legacy service. One older Cloud Run service in production has only partial structured logging. The sink captures what’s there; the gap is documented as a follow-on rather than blocking the rollout.
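To make the cost point concrete, here is a small sketch of why partitioning matters: a query restricted to a date range only reads the daily partitions inside that range. The partition sizes are invented for illustration, and the model is deliberately simplified (it ignores column pruning and clustering, which reduce the scanned volume further).

```python
from datetime import date, timedelta

GIB = 1024 ** 3


def bytes_scanned(partition_bytes: dict[date, int], start: date, end: date) -> int:
    """Sum the sizes of the daily partitions that fall in [start, end], inclusive."""
    return sum(size for day, size in partition_bytes.items() if start <= day <= end)


# Hypothetical table: 30 daily partitions of 2 GiB each, starting 1 January 2024.
first_day = date(2024, 1, 1)
partitions = {first_day + timedelta(days=i): 2 * GIB for i in range(30)}

# Without a partition filter, every partition is read.
full_scan = sum(partitions.values())

# A dashboard tile that only looks at the most recent day reads one partition.
last_day = first_day + timedelta(days=29)
pruned_scan = bytes_scanned(partitions, last_day, last_day)

print(full_scan // GIB, "GiB scanned without a date filter")
print(pruned_scan // GIB, "GiB scanned with a one-day filter")
```

Under these made-up numbers the one-day query reads 2 GiB instead of 60 GiB, which is the kind of difference worth measuring before and after a rollout.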
The shape that’s left behind
A pair of Terragrunt modules, a dataset full of logs, and a small set of named views. Nothing exotic. Everything queryable in SQL.
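As an illustration of what a view such as SLA availability computes, the sketch below reproduces the arithmetic in Python over a handful of request records. The real views are written in SQL and are not shown here. Treating any 5xx response as a failure, the record shape, the service name and the 99.5% target are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RequestLog:
    """Hypothetical, minimal shape of one request log row."""
    service: str
    status: int  # HTTP response status code


def availability(logs: Sequence[RequestLog]) -> Optional[float]:
    """Fraction of requests that did not fail with a 5xx; None if there is no traffic."""
    if not logs:
        return None
    good = sum(1 for r in logs if r.status < 500)
    return good / len(logs)


def error_budget_remaining(logs: Sequence[RequestLog], slo_target: float) -> Optional[float]:
    """Share of the error budget still unspent for the given SLO target.

    1.0 means no failures yet, 0.0 means the budget is exactly used up,
    and a negative value means the SLO has been breached.
    """
    allowed_bad = (1.0 - slo_target) * len(logs)
    if allowed_bad == 0:
        return None  # no traffic, or a 100% target that leaves no budget
    bad = sum(1 for r in logs if r.status >= 500)
    return 1.0 - bad / allowed_bad


# Hypothetical window: 1000 requests, 2 of which returned a 500.
window = [RequestLog("orders-api", 200)] * 998 + [RequestLog("orders-api", 500)] * 2

print("availability:", round(availability(window), 4))
print("budget left:", round(error_budget_remaining(window, slo_target=0.995), 3))
```

With a 99.5% target over 1000 requests, five failures are allowed; two have occurred, so roughly 60% of the budget remains.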