Pipeline monitoring for a regulated lender

The problem

The client’s analytics platform ran on Dagster OSS, dlt and dbt, writing to a cloud data warehouse. It worked — until it didn’t. When a run failed at 03:00, the first sign was usually a stakeholder noticing yesterday’s number was missing. There was no audit trail of which assets had materialised, no record of which dbt tests had failed and why, and no consistent way to route alerts to the team that owned each pipeline.

The brief was practical: keep the existing stack, don’t introduce a vendor, and make it observable.

What we built

A small, opinionated monitoring package that drops into an existing Dagster project:

A pluggable alert router. Severity-typed events (warn, error, critical), pluggable transports (Microsoft Teams via Adaptive Cards, SMTP email), cursor-based deduplication so a flapping job doesn’t drown the channel.
A warehouse-resident audit trail. Every run, every asset materialisation, every dbt test result captured to dedicated audit tables in BigQuery, with curated views that answer the questions an on-call engineer actually asks at 03:00.
Asset-level data-quality checks. dbt’s critical tag maps to Dagster’s AssetCheckSeverity.ERROR. Freshness checks are generated from a thin factory. dlt loads gain row-count and schema-drift checks as a matter of course.
dbt observability. Elementary OSS configured against the same warehouse, with reports posted to the same Teams channel as the framework’s own alerts.
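
As an illustration of the deduplication idea in the alert router, here is a minimal sketch in plain Python. It is not the package's code. The names AlertEvent, dedupe and window_seconds, the one-hour window, and the JSON cursor format are all assumptions made for the example. The caller (for instance a Dagster sensor) is assumed to store the returned cursor string and pass it back on the next evaluation.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Severity(str, Enum):
    WARN = "warn"
    ERROR = "error"
    CRITICAL = "critical"


@dataclass(frozen=True)
class AlertEvent:
    key: str           # hypothetical field: job or asset name
    severity: Severity
    message: str
    timestamp: float   # seconds since epoch


def dedupe(events, cursor, window_seconds=3600.0):
    """Return (events_to_send, new_cursor).

    The cursor is a JSON string mapping "key|severity" to the timestamp of
    the last alert actually sent for that pair (an assumed format). An event
    is suppressed if the same pair was alerted on within the window.
    """
    last_sent = json.loads(cursor) if cursor else {}
    to_send = []
    for event in sorted(events, key=lambda e: e.timestamp):
        fingerprint = f"{event.key}|{event.severity.value}"
        previous = last_sent.get(fingerprint)
        if previous is not None and event.timestamp - previous < window_seconds:
            continue  # the same alert fired recently, so stay quiet
        last_sent[fingerprint] = event.timestamp
        to_send.append(event)
    return to_send, json.dumps(last_sent, sort_keys=True)
```

Because the cursor only advances when an alert is sent, a job that flaps every few minutes produces one message per window rather than one per failure. A change in severity produces a new fingerprint and is therefore sent immediately.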

What we were careful about

A failing transport must never swallow the alert. Each transport is wrapped so an error in Teams doesn’t take email down with it. One broken channel is bad. Two is much worse.
The router has no Dagster dependency. It is plain Python with a Transport Protocol. That kept the unit tests small and fast, and it means the same router can be used by sidecar tools such as Elementary’s CLI.
Audit data is for humans. The curated views (vw_run_summary, vw_recent_failures) read like a runbook, not a database dump. The schema is partitioned and clustered so the dashboards stay cheap to run.
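
The first two points, transports that fail in isolation and a router with no Dagster dependency, can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the package's code: Alert, AlertRouter, route and the two fake transports are hypothetical names. Only the idea of a Transport Protocol comes from the description above.

```python
import logging
from dataclasses import dataclass
from typing import Protocol

logger = logging.getLogger("alert_router")


@dataclass(frozen=True)
class Alert:
    severity: str
    title: str
    body: str


class Transport(Protocol):
    """Anything with a name and a send() method can act as a transport."""

    name: str

    def send(self, alert: Alert) -> None: ...


class AlertRouter:
    def __init__(self, transports):
        self._transports = list(transports)

    def route(self, alert):
        """Try every transport and return the names of those that failed."""
        failed = []
        for transport in self._transports:
            try:
                transport.send(alert)
            except Exception:
                # Log and carry on: one broken channel must not block the rest.
                logger.exception("transport %s failed", transport.name)
                failed.append(transport.name)
        return failed


# Usage with two fake transports (stand-ins for Teams and SMTP).
class BrokenTeams:
    name = "teams"

    def send(self, alert):
        raise ConnectionError("webhook unreachable")


class RecordingEmail:
    name = "email"

    def __init__(self):
        self.sent = []

    def send(self, alert):
        self.sent.append(alert)


email = RecordingEmail()
router = AlertRouter([BrokenTeams(), email])
failed = router.route(Alert("critical", "dbt test failed", "see run URL"))
```

Here the Teams stand-in raises, the router logs the failure and records it in failed, and the email stand-in still receives the alert. Nothing in the sketch imports Dagster, which is what keeps unit tests like this one small.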

What it looks like in use

A nightly dbt run fails one critical test. Within seconds:

A Teams Adaptive Card lands in the data engineering channel with the asset name, the test name, the offending value, the run URL, and a one-click link to the dbt artefacts.
The same event is recorded in audit.dagster_asset_check_evaluations, tagged with severity and the run ID.
A nine-day rolling view shows whether this is a one-off or a trend.
The on-call engineer fixes the upstream source. The next run succeeds, and the dashboard quietly clears.

Nothing here is novel on its own. The point is that the parts are wired together in a way that turns “we run Dagster” into “we operate Dagster”.