Data Engineer

Pipelines and data quality.

Updated: 2026-04-24

The Data Engineer is the persona that builds and operates the data pipelines the rest of the org depends on. In an AI-native SDLC, the Data Engineer operates a Pipeline Keeper agent, four slash prompts, and a validated MCP catalog spanning Azure Data Factory, Synapse, Fabric, SQL Database, and Cosmos DB — not a forest of brittle notebooks.

Executive summary

The Data Engineer owns the movement, transformation, and quality of data between systems. In an AI-native SDLC, their primary deliverables are idempotent pipelines, tested schema migrations, quality gates enforced in CI, and a lineage map that any stakeholder can read. The toolkit is fixed: the Pipeline Keeper agent; the /etl-scaffold, /schema-diff, /quality-gate, and /lineage-map slash prompts; scoped instructions for SQL and pipeline definitions; and validated MCPs reaching into Azure Data Factory, Azure Synapse, Microsoft Fabric, Azure SQL Database, and Azure Cosmos DB.

The Data Engineer protects the organization from the most common data failure modes: silent schema drift, unexpected nulls turning into poisoned joins, late-arriving data breaking downstream contracts, and pipelines that succeed while producing subtly wrong results. Every pipeline ships with tests, a lineage entry, and a data-quality gate.

Data work is now inseparable from software work. Pipelines are code, reviewed in GitHub, tested in Actions, governed by Microsoft Purview, and observed through Azure Monitor.

Role and responsibilities

Think of the Data Engineer as the operator of a municipal water system. Pipes move water from reservoirs to neighborhoods; meters, valves, and testing stations ensure quality; any leak or contamination must be detected and contained before it reaches the kitchen tap. In an AI-native SDLC, the Data Engineer owns that plumbing for facts instead of liquid.

Primary responsibilities:

  • Design and operate ingestion pipelines in Azure Data Factory and Microsoft Fabric
  • Own transformations in Azure Synapse and curated layers in Fabric lakehouses
  • Model operational stores in Azure SQL Database and Azure Cosmos DB
  • Version every schema change with idempotent, reversible migrations
  • Enforce quality gates: schema validation, null rates, freshness, row-count deltas
  • Produce and maintain a lineage map for every dataset, queryable by downstream consumers
  • Operate the Pipeline Keeper agent and the /etl-scaffold, /schema-diff, /quality-gate, and /lineage-map prompts
  • Collaborate with the DBA on operational schemas and with the ML AI Engineer on feature stores
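
The quality-gate checks named in the list above (null rates, freshness, row-count deltas) can be sketched in a few lines. This is a minimal illustration only: the function names, thresholds, and the GateResult type are assumptions made for this example, not the behavior of the /quality-gate prompt, and schema validation is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class GateResult:
    check: str
    passed: bool
    detail: str


def null_rate_check(rows, column, max_null_rate):
    """Fail when the share of None values in `column` exceeds the limit."""
    total = len(rows)
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / total if total else 0.0
    return GateResult("null_rate", rate <= max_null_rate, f"{column}: {rate:.2%} null")


def freshness_check(latest_loaded_at, now, max_age):
    """Fail when the newest record is older than `max_age`."""
    age = now - latest_loaded_at
    return GateResult("freshness", age <= max_age, f"age={age}")


def row_count_delta_check(previous_count, current_count, max_change):
    """Fail when the row count moved by more than `max_change` (a fraction)."""
    if previous_count == 0:
        change = 0.0 if current_count == 0 else float("inf")
    else:
        change = abs(current_count - previous_count) / previous_count
    return GateResult("row_count_delta", change <= max_change, f"change={change:.2%}")


# Hypothetical sample data: half of the email values are missing.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
now = datetime(2026, 4, 24, 8, 0, tzinfo=timezone.utc)
results = [
    null_rate_check(rows, "email", max_null_rate=0.10),
    freshness_check(now - timedelta(hours=2), now, max_age=timedelta(hours=6)),
    row_count_delta_check(previous_count=1000, current_count=1040, max_change=0.20),
]
failed = [r.check for r in results if not r.passed]
```

Here only the null-rate check fails, because a 50 percent null rate exceeds the assumed 10 percent limit. Returning a result object per check, rather than raising on the first failure, lets a caller report every broken rule in one pass.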

Jobs to be done

  1. As a Data Engineer, I want to scaffold a new ETL pipeline in minutes, so that onboarding a new source is not a week-long ticket.
  2. As a Data Engineer, I want every schema change reviewed with a diff and impact report, so that no silent breakage reaches production.
  3. As a Data Engineer, I want a quality gate on every dataset, so that downstream consumers can trust freshness and completeness.
  4. As a Data Engineer, I want a live lineage map, so that impact analysis for a schema change is a query, not a meeting.
  5. As a Data Engineer, I want pipeline failures to produce actionable alerts in Azure Monitor, so that on-call work is prioritized correctly.
  6. As a Data Engineer, I want cost anomalies in Synapse and Fabric detected automatically, so that runaway jobs do not blow the monthly budget.
  7. As a Data Engineer, I want sensitive-data classifications from Microsoft Purview propagated into pipeline tests, so that PII handling is proven, not assumed.
  8. As a Data Engineer, I want every transformation expressed in version-controlled code, so that no one-off SQL lives only in a notebook.

Pain points before AI-native

  • Notebook archaeology. Critical logic lives in untested notebooks that reference columns by position.
  • Silent schema drift. A source adds a column; a join silently duplicates rows; a report is wrong for a week.
  • Quality checks after the fact. Data-quality rules run in a separate dashboard the team forgets to open.
  • Lineage in heads. Only the senior Data Engineer knows which table feeds which report.
  • Cost surprises. A misconfigured Synapse pool runs overnight; the finance team notices before engineering does.
  • Pipeline as snowflake. Each pipeline invented its own retry, logging, and alerting conventions.
  • No rollback. Migrations ran forward-only; reverting a bad change required a restore.
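
To make the silent-drift pain point concrete, the sketch below shows one way a join can duplicate rows without any error: the lookup side stops being unique on its key. The table contents and helper names are invented for illustration, and the join is a deliberately naive in-memory version rather than anything a real engine would run.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return key values that appear more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def left_join(left, right, key):
    """Naive left join: one output row per matching pair."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for l_row in left:
        for r_row in index.get(l_row[key], [None]):
            joined.append({**l_row, **(r_row or {})})
    return joined


# Hypothetical data: the source began emitting one customer row per region,
# so customer_id 7 is no longer unique.
orders = [
    {"order_id": 1, "customer_id": 7},
    {"order_id": 2, "customer_id": 7},
    {"order_id": 3, "customer_id": 9},
]
customers = [
    {"customer_id": 7, "region": "EU"},
    {"customer_id": 7, "region": "US"},
    {"customer_id": 9, "region": "EU"},
]

joined = left_join(orders, customers, "customer_id")
dupes = duplicate_keys(customers, "customer_id")
```

Three orders become five joined rows, and nothing fails. A uniqueness check on the join key, run before the join, is what turns this into a visible failure instead of a wrong report.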

AI-native daily workflow

The Data Engineer works from Visual Studio Code with GitHub Copilot and from the terminal with Claude Code, invoking the Pipeline Keeper agent throughout the day.

Morning setup

  1. Open Azure Monitor and Microsoft Fabric monitoring to review overnight pipeline runs.
  2. In VS Code, run /quality-gate --since=yesterday to surface any dataset that failed freshness or completeness checks.
  3. Review pending schema PRs; the Pipeline Keeper has pre-drafted /schema-diff reports.
  4. Confirm Microsoft Purview classification updates from the Governance team.
  5. Post the Daily Data Health digest to Microsoft Teams with links to any open incidents.

Midday execution

  1. For each new source, invoke /etl-scaffold with the source metadata; the Pipeline Keeper generates an Azure Data Factory pipeline definition, Fabric dataflow, or Synapse notebook with tests.
  2. For every schema PR, run /schema-diff to produce the impact report (downstream tables, reports, pipelines) and attach it to the PR.
  3. Implement transformations in SQL or PySpark with scoped instructions enforcing style and safety.
  4. Wire /quality-gate into the CI workflow so merges block on broken freshness or nullability rules.
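
Step 4 depends on the gate producing a non-zero exit code, since that is what makes a CI job fail and block the merge. A minimal sketch follows, assuming a hypothetical JSON report with a "checks" list; the real output format of /quality-gate is not specified here.

```python
import json


def failed_checks(report):
    """Return the names of checks that did not pass."""
    return [check["name"] for check in report["checks"] if not check["passed"]]


def gate_exit_code(report_json):
    """Print each failure and return 1 if any check failed, else 0."""
    failed = failed_checks(json.loads(report_json))
    for name in failed:
        print(f"quality gate failed: {name}")
    return 1 if failed else 0


# Hypothetical report: one freshness rule passes, one nullability rule fails.
sample = json.dumps({"checks": [
    {"name": "freshness:orders", "passed": True},
    {"name": "nullability:orders.customer_id", "passed": False},
]})

exit_code = gate_exit_code(sample)  # a CI entry point would pass this to sys.exit()
```

Keeping the decision in a pure function that returns an integer makes the gate easy to unit test; only the thin entry point needs to call sys.exit.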

Afternoon review

  1. Run /lineage-map to refresh the lineage graph; export to Microsoft Fabric and push a Markdown snapshot to the repo.
  2. Triage any cost anomalies reported by Azure Monitor and open follow-up issues.
  3. Pair with the DBA on upcoming migrations for operational stores.
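
The lineage refresh in step 1 can be pictured as a directed graph plus a text export. The sketch below is an assumption-laden illustration: the dataset names and the edge dictionary are invented, and the actual /lineage-map export format is not described in this document.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> direct downstream datasets.
LINEAGE = {
    "crm_api.contacts": ["lake.raw_contacts"],
    "lake.raw_contacts": ["lake.curated_customers"],
    "lake.curated_customers": ["report.churn", "ml.customer_features"],
}


def downstream(graph, start):
    """Breadth-first walk returning every dataset reachable from `start`."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def to_markdown(graph):
    """Render edges as a Markdown bullet list, sorted so diffs stay stable."""
    lines = ["# Lineage snapshot", ""]
    for parent in sorted(graph):
        for child in sorted(graph[parent]):
            lines.append(f"- {parent} -> {child}")
    return "\n".join(lines)


impact = downstream(LINEAGE, "lake.raw_contacts")
snapshot = to_markdown(LINEAGE)
```

Sorting the edges before rendering means the committed snapshot changes only when the lineage itself changes, which keeps the repository history readable.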

Agent

| Agent | File | Purpose |
|---|---|---|
| pipeline-keeper | .github/agents/pipeline-keeper.agent.md | Scaffolds pipelines, diffs schemas, enforces quality gates, refreshes lineage |

Slash prompts

| Command | File | Purpose |
|---|---|---|
| /etl-scaffold | .github/prompts/etl-scaffold.prompt.md | Generate an ingestion pipeline with tests, retries, quality gate |
| /schema-diff | .github/prompts/schema-diff.prompt.md | Diff schema changes and report downstream impact |
| /quality-gate | .github/prompts/quality-gate.prompt.md | Run data-quality checks in CI and block merge on failure |
| /lineage-map | .github/prompts/lineage-map.prompt.md | Refresh the lineage graph and export to Fabric |

Scoped instructions

| Scope (applyTo) | File | Purpose |
|---|---|---|
| **/*.sql | .github/instructions/sql.instructions.md | Idempotent migrations, reversible down steps, no destructive DDL outside review |
| pipelines/**/*.json | .github/instructions/adf.instructions.md | Azure Data Factory pipeline conventions, retries, alerts |
| fabric/**/*.py | .github/instructions/fabric.instructions.md | Microsoft Fabric notebook and dataflow standards |
| cosmos/**/*.ts | .github/instructions/cosmos.instructions.md | Azure Cosmos DB partition key rules and SDK usage |

Hooks

  • pre-commit: SQL lint, Purview classification check, secret scan
  • pre-push: run unit tests on transformations and schema diff on changed DDL
  • post-merge: deploy pipeline definitions via GitHub Actions to Azure Data Factory
  • nightly: run full quality gate and refresh lineage
  • on-cost-anomaly: open an issue when Azure Monitor cost metrics exceed a threshold
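
The on-cost-anomaly hook needs some rule for what counts as exceeding a threshold. One simple possibility, shown purely as a sketch, compares today's spend with a multiple of the trailing average. The factor of 2.0, the sample figures, and the issue-title wording are all assumptions; the document does not say how the threshold is defined.

```python
from statistics import mean


def is_cost_anomaly(history, today, factor=2.0):
    """Flag `today` when it exceeds `factor` times the trailing average."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    return today > factor * mean(history)


def issue_title(service, today, baseline):
    """Build a title for the follow-up issue (wording is illustrative)."""
    return f"Cost anomaly: {service} spent {today:.2f} against a baseline of {baseline:.2f}"


# Hypothetical daily cost history for one service, in arbitrary currency units.
history = [40.0, 42.0, 38.0, 41.0, 39.0]
baseline = mean(history)

overnight_spike = is_cost_anomaly(history, today=130.0)
normal_day = is_cost_anomaly(history, today=55.0)
title = issue_title("synapse-pool", 130.0, baseline)
```

A fixed multiplier is easy to explain to on-call engineers but will misfire on workloads with strong weekly seasonality; a per-weekday baseline would be a natural refinement.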

Validated MCPs

| MCP | Purpose | Owner |
|---|---|---|
| GitHub MCP Server | PRs, Actions runs, schema-diff reports | GitHub |
| Azure MCP Server | Drive Azure Data Factory, Synapse, Fabric, SQL, and Cosmos DB operations | Microsoft |
| Microsoft Learn Docs MCP | Resolve current Azure data-service documentation during implementation | Microsoft |
| Azure DevOps MCP Server | Pipeline runbooks and work-item tracking for data projects | Microsoft |
| Playwright MCP | End-to-end validation of data surfaces exposed via web apps | Microsoft |

Real examples

Example 1: onboarding a new SaaS source

A new CRM feed needs ingestion. The Data Engineer runs /etl-scaffold --source=crm-api --target=fabric-lakehouse. The Pipeline Keeper generates an Azure Data Factory pipeline with retries, checkpoint logic, and a /quality-gate rule for freshness and primary-key uniqueness. After review, a single PR merges 14 files; Actions deploys the pipeline to Azure Data Factory. The first run lands within the hour; the lineage graph updates overnight.
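
The checkpoint logic mentioned in this example is what lets an ingestion run be repeated safely. A minimal sketch, assuming each source row carries an "id" and an ISO 8601 "updated_at" string in a single consistent format (both field names are hypothetical), might look like this:

```python
def incremental_load(source_rows, target, checkpoint):
    """Upsert rows newer than `checkpoint` into `target`; return the new checkpoint.

    `target` is a dict keyed by primary key, standing in for a real table.
    Timestamps are ISO 8601 strings in one format, so string comparison
    matches chronological order.
    """
    new_checkpoint = checkpoint
    for row in source_rows:
        if row["updated_at"] > checkpoint:
            target[row["id"]] = row  # keyed upsert: reruns overwrite, never duplicate
            new_checkpoint = max(new_checkpoint, row["updated_at"])
    return new_checkpoint


# Hypothetical CRM rows; only the second is newer than the stored checkpoint.
source = [
    {"id": 1, "updated_at": "2026-04-23T10:00:00Z", "name": "Ada"},
    {"id": 2, "updated_at": "2026-04-24T01:00:00Z", "name": "Lin"},
]
target = {}
start = "2026-04-23T12:00:00Z"

first = incremental_load(source, target, start)
after_first = dict(target)
second = incremental_load(source, target, start)  # same input, same checkpoint
```

Running the load twice from the same checkpoint leaves the target unchanged, which is the idempotency property the pipeline is meant to have. The strict greater-than comparison can skip late-arriving rows that share the checkpoint timestamp, so a real pipeline would likely need an overlap window.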

Example 2: catching schema drift before production

A Synapse table gains a column. The PR triggers /schema-diff, which identifies three downstream Power BI reports and two Azure SQL Database consumers. The Pipeline Keeper flags one consumer that selects columns by position; the Data Engineer updates the SQL to name its columns and requests a compatibility test. The merge proceeds without surprising finance or analytics.
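
The core of a schema diff with impact analysis can be sketched as a comparison of two column maps followed by a lookup of who reads what. Everything below is illustrative: the consumer registry, its "positional" flag, and the rule that positional consumers are affected by any change are simplifying assumptions, not the documented behavior of /schema-diff.

```python
def schema_diff(old, new):
    """Compare two {column: type} mappings."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


def impacted(diff, consumers):
    """Return consumers at risk from the diff.

    Positional consumers are treated as affected by any change; named
    consumers only when a column they read was removed or retyped.
    """
    changed = set(diff["removed"]) | set(diff["retyped"])
    any_change = bool(changed or diff["added"])
    hits = []
    for name, spec in consumers.items():
        if spec["positional"] and any_change:
            hits.append(name)
        elif changed & set(spec["columns"]):
            hits.append(name)
    return sorted(hits)


# Hypothetical schemas: a `currency` column is added between existing ones.
old = {"id": "int", "amount": "decimal"}
new = {"id": "int", "currency": "string", "amount": "decimal"}
consumers = {
    "finance_report": {"positional": False, "columns": ["id", "amount"]},
    "legacy_export": {"positional": True, "columns": []},
}

diff = schema_diff(old, new)
at_risk = impacted(diff, consumers)
```

An added column is harmless to the consumer that names its columns, but the positional consumer is flagged, which mirrors the situation described in this example.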

Example 3: killing a runaway Synapse query

Overnight, Azure Monitor fires a cost anomaly. The on-cost-anomaly hook opens an issue assigned to the Pipeline Keeper owner. The next morning, the Data Engineer identifies a missing predicate; /schema-diff shows the affected query plan; a one-line fix merges and the cost returns to baseline.

Anti-patterns

  • Notebooks as production. If a notebook produces a signal the business relies on, it belongs in a versioned pipeline with tests.
  • Trust-me migrations. Migrations without a rollback plan are not safe, regardless of team confidence.
  • Quality checks in dashboards only. Quality rules belong in CI and in runtime hooks, not on a monitor someone might look at.
  • Lineage by memory. Every dataset should have a generated lineage entry; the graph is the source of truth.
  • Unclassified PII. Treat every un-labeled column as sensitive until Purview says otherwise.
  • Pipelines without alerts. A failing pipeline that does not page anyone is a bug.
  • Hand-rolled retries. Use pipeline-native retry policies; bespoke retry logic is a future incident.
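
The "Unclassified PII" rule above is a default-deny policy, and it is short enough to express as a test helper. In this sketch the label values "public" and "sensitive" are placeholders invented for the example, not Microsoft Purview's classification taxonomy, and the labels are passed in as a plain dictionary rather than fetched from any service.

```python
def effective_classification(column, labels):
    """Return the column's label, defaulting to sensitive when unlabeled."""
    return labels.get(column, "sensitive")


def unmasked_sensitive(columns, labels, masked):
    """List columns that count as sensitive but are not masked in the output."""
    return sorted(
        c for c in columns
        if effective_classification(c, labels) == "sensitive" and c not in masked
    )


# Hypothetical dataset: `signup_source` has no label yet.
columns = ["id", "email", "signup_source"]
labels = {"id": "public", "email": "sensitive"}
masked = {"email"}

violations = unmasked_sensitive(columns, labels, masked)
```

The unlabeled column is reported even though nobody has called it PII, so the pipeline test fails until someone either classifies the column or masks it.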

KPIs and impact metrics

| Metric | Baseline (manual) | Target (agentic) | Source |
|---|---|---|---|
| Pipeline onboarding time | 7 days | < 1 day | GitHub PR lead time |
| Schema-drift incidents per quarter | 8 | 0 | /schema-diff reports |
| Data-quality gate coverage | 40 percent | 100 percent | CI check |
| Mean time to detect pipeline failure | 2 hours | < 5 minutes | Azure Monitor alerts |
| Cost anomalies unresolved > 24h | 5 per quarter | 0 | Azure Monitor, cost hook |
| Lineage map freshness | Weekly | Live on merge | Fabric lineage export |
| PII classifications synced | Ad hoc | Daily | Microsoft Purview |

Maturity in four levels

  • L1 Manual: Notebooks in email, migrations in a shared drive, no tests, no lineage.
  • L2 Assisted: Copilot autocomplete in SQL, some pipelines in Azure Data Factory, but no schema diff and no quality gate.
  • L3 Augmented: Pipeline Keeper agent, four slash prompts, scoped instructions, quality gate in CI, lineage refreshed weekly.
  • L4 Autonomous: Quality gate blocking merges, cost anomalies opening issues within minutes, lineage refreshed on every merge, Purview classifications propagated into tests.

Integration with other personas

  • From Business Analyst: data requirements written in EARS (Easy Approach to Requirements Syntax), source-system contracts.
  • From Software Architect: storage and throughput constraints, target platform selection.
  • With DBA: shared schema governance for operational stores; migration reviews.
  • To ML AI Engineer: curated feature datasets in Microsoft Fabric with a documented lineage.
  • To BI Analyst: certified datasets with freshness and completeness guarantees.
  • To SRE: Azure Monitor runbooks for pipeline failures.
  • To Compliance Auditor: Purview classifications, lineage map, and quality-gate history as audit evidence.

Glossary

  • Pipeline: a version-controlled, testable definition that moves or transforms data.
  • Quality gate: a set of automated checks (freshness, completeness, schema) gating a merge or release.
  • Schema diff: a structured comparison of schema versions with downstream impact analysis.
  • Lineage: the directed graph from source system to consumer dataset, with transformations between.
  • Curated zone: the layer of Microsoft Fabric or Synapse where data is trusted and reported on.
  • Idempotent: describes a pipeline or migration that produces the same result whether run once or ten times on the same input.
  • Rollback: the reverse (down) step of a migration, which returns the schema to its previous state.
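
The last two glossary terms can be seen together in a small migration sketch. It uses Python's built-in sqlite3 module with an in-memory database purely so the example runs anywhere; Azure SQL Database uses different DDL syntax for conditional creation, and the table names are invented.

```python
import sqlite3


def tables(conn):
    """Return the names of all user tables, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def up(conn):
    """Forward step: safe to run repeatedly because of IF NOT EXISTS."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_tags ("
        "customer_id INTEGER NOT NULL, tag TEXT NOT NULL)"
    )


def down(conn):
    """Rollback step: returns the schema to its state before `up`."""
    conn.execute("DROP TABLE IF EXISTS customer_tags")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
before = tables(conn)

up(conn)
up(conn)  # second run is a no-op, not an error
after_up = tables(conn)

down(conn)
down(conn)  # rollback is idempotent as well
after_down = tables(conn)
```

After the rollback the table list matches the starting state. Note that dropping a table also discards its data, so a rollback restores the schema, not necessarily the contents.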

References