ML AI Engineer
Model training and evals.
The ML AI Engineer is the persona that brings models and LLM-powered features from prototype to production. In an AI-native SDLC, the ML AI Engineer operates an Eval Runner agent, four slash prompts, and a validated MCP catalog centered on Azure AI Foundry, Azure OpenAI, and Azure Machine Learning, rather than a pile of ad-hoc notebooks.
Executive summary
The ML AI Engineer is responsible for the reliability of AI behavior: model quality, prompt stability, routing decisions, and deployment topology. In an AI-native SDLC, the role is operationalized through a single Eval Runner agent, four slash prompts, scoped instructions, and validated MCPs reaching into Azure AI Foundry, Azure OpenAI, Azure Machine Learning, and Microsoft Learn Docs MCP.
Primary deliverables are versioned evaluation suites, model-routing configurations with traceable decisions, prompt regression dashboards, and Foundry deployment artifacts with rollback plans. The ML AI Engineer closes the loop from model change to production telemetry: no model ships without an eval baseline, and no regression survives overnight.
AI behavior is software behavior. Evals are tests, prompts are code, routing is configuration, and deployments are PRs. The ML AI Engineer makes the AI layer obey the same governance as every other system in production.
Role and responsibilities
Think of the ML AI Engineer as a pharmaceutical process engineer. The chemist invents the compound; the process engineer ensures every batch meets the monograph, with tests at each stage. In an AI-native SDLC, the ML AI Engineer turns exploratory model work into repeatable, auditable production behavior.
Primary responsibilities:
- Maintain the evaluation suite: datasets, metrics, regressions per feature
- Operate model routing (which model, which version, which tenant) in Azure AI Foundry
- Track prompt changes with regression tests and diff reports
- Deploy models and agents to Azure AI Foundry with rollback-ready artifacts
- Train and fine-tune models in Azure Machine Learning with versioned experiments
- Integrate safety systems (content filters, Purview sensitivity) into the model pipeline
- Operate the Eval Runner agent and the `/eval-suite`, `/model-route`, `/prompt-regression`, and `/foundry-deploy` prompts
- Collaborate with the Data Engineer on feature datasets and with the InfoSec Officer on threat models for AI
Jobs to be done
- As an ML AI Engineer, I want every prompt change gated by a regression test, so that quality does not silently degrade.
- As an ML AI Engineer, I want eval suites versioned with the model, so that I can reproduce any past baseline on demand.
- As an ML AI Engineer, I want safe production deploys via Azure AI Foundry with one-click rollback, so that incidents do not leave users stranded.
- As an ML AI Engineer, I want model routing to be code, not tribal knowledge, so that every routing decision is reviewable.
- As an ML AI Engineer, I want safety filters integrated and tested, so that content-safety violations are caught in CI.
- As an ML AI Engineer, I want training experiments logged in Azure Machine Learning, so that reproducibility is automatic.
- As an ML AI Engineer, I want Application Insights signals from deployed models to feed the next eval, so that the loop closes itself.
- As an ML AI Engineer, I want cost and latency budgets enforced per model and per tenant, so that runaway inference is impossible.
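The last job above, budgets enforced per model and per tenant, can be sketched as a small check over usage figures. This is a minimal illustration and not a prescribed implementation: the `Budget` and `Usage` types, the choice of p95 latency, the model name `chat-large`, and all numbers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical per-model, per-tenant limits; values are illustrative."""
    max_cost_usd_per_1k_requests: float
    max_p95_latency_ms: float


@dataclass(frozen=True)
class Usage:
    """Observed figures for one (model, tenant) pair over a time window."""
    cost_usd: float
    requests: int
    p95_latency_ms: float


def budget_violations(usage: Usage, budget: Budget) -> list[str]:
    """Return a readable list of exceeded limits; an empty list means OK."""
    violations = []
    if usage.requests > 0:
        cost_per_1k = usage.cost_usd / usage.requests * 1000
        if cost_per_1k > budget.max_cost_usd_per_1k_requests:
            violations.append(
                f"cost per 1K requests {cost_per_1k:.2f} exceeds "
                f"{budget.max_cost_usd_per_1k_requests:.2f}"
            )
    if usage.p95_latency_ms > budget.max_p95_latency_ms:
        violations.append(
            f"p95 latency {usage.p95_latency_ms:.0f} ms exceeds "
            f"{budget.max_p95_latency_ms:.0f} ms"
        )
    return violations


# Budgets keyed by (model, tenant); both names are made up for the example.
BUDGETS = {("chat-large", "acme"): Budget(5.0, 2000.0)}

usage = Usage(cost_usd=12.0, requests=2000, p95_latency_ms=1800.0)
print(budget_violations(usage, BUDGETS[("chat-large", "acme")]))
```

Keeping the limits in a keyed table rather than in application logic means a budget change is a reviewable diff, in the same way as a routing change.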
Pain points before AI-native
- Eval-free prompt changes. Prompts changed in a hurry regress quality; no one notices until a user complains.
- Model roulette. Which model served a given request? No one can tell; support cases take hours to reproduce.
- Deploy and pray. Models deployed by hand to Azure AI Foundry or OpenAI endpoints without a rollback plan.
- Cost explosions. A badly designed chain calls GPT-class models a hundred times per request; the bill arrives at month end.
- Safety as afterthought. Content filters and PII redaction added only after an incident.
- Training lineage lost. The best model’s training artifacts exist only on a laptop.
- Prompt-in-code. Prompts embedded in application code instead of version-controlled files with reviews.
AI-native daily workflow
The ML AI Engineer works from Visual Studio Code with GitHub Copilot and from the terminal with Claude Code, driving the Eval Runner across the day.
Morning setup
- Open Azure AI Foundry and Application Insights dashboards; review overnight inference metrics and safety-filter events.
- In VS Code, run `/eval-suite --since=yesterday` to see any eval regressions on merged prompts or models.
- Triage PRs that change prompts or routing; the Eval Runner has pre-posted `/prompt-regression` diffs.
- Confirm Azure Machine Learning experiment queue from the training track.
- Sync with the Data Engineer on new or changed feature datasets feeding training.
Midday execution
- For each feature change, run `/eval-suite` against the proposed model or prompt; fail the PR if the score drops below threshold.
- For routing changes, invoke `/model-route` to generate or update the routing configuration with decision rationale.
- Implement training runs through `/foundry-deploy --stage=experiment`; results are logged in Azure Machine Learning.
- Keep prompts in `prompts/*.prompt.md` with scoped instructions; no prompts inline in application code.
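The "fail the PR if the score drops below threshold" step can be reduced to a function that maps scores to a process exit code for CI. The sketch below assumes scores and thresholds arrive as plain dictionaries; the metric names and values are hypothetical, and how the Eval Runner actually produces scores is not specified here.

```python
def failing_metrics(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Return metrics below their minimum; a missing score also fails."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None or score < minimum:
            failures.append(metric)
    return failures


def gate_exit_code(scores: dict[str, float],
                   thresholds: dict[str, float]) -> int:
    """Map the gate result to an exit code: 0 passes, 1 blocks the PR."""
    return 1 if failing_metrics(scores, thresholds) else 0


# Illustrative values only.
thresholds = {"groundedness": 0.80, "helpfulness": 0.70}
scores = {"groundedness": 0.76, "helpfulness": 0.81}

print(failing_metrics(scores, thresholds))  # ['groundedness']
print(gate_exit_code(scores, thresholds))   # 1
```

Treating a missing score as a failure is a deliberate choice in this sketch: a metric that silently stops being reported should block the change rather than pass it.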
Afternoon review
- Deploy approved models via `/foundry-deploy --stage=prod`. Azure AI Foundry receives a versioned deployment with rollback metadata.
- Watch the next hour of Application Insights telemetry for anomalies; roll back automatically if guardrails trip.
- Review cost telemetry per tenant; open issues for outliers.
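One way to make "roll back automatically if guardrails trip" concrete is a pure decision function over the telemetry samples collected during the watch window. The sketch below is an assumption-laden illustration: the `Sample` and `Guardrails` shapes, the limits, and the rule that several consecutive breaches are required (to avoid reacting to a single spike) are all choices made for the example, and fetching samples from Application Insights is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One telemetry reading from the post-deploy watch window."""
    minute: int
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


@dataclass(frozen=True)
class Guardrails:
    """Hypothetical limits; real values belong in reviewed configuration."""
    max_error_rate: float
    max_p95_latency_ms: float
    consecutive_breaches: int  # bad samples in a row needed to trip


def should_roll_back(samples: list[Sample], guardrails: Guardrails) -> bool:
    """Trip when enough consecutive samples breach either limit."""
    streak = 0
    for sample in samples:
        breached = (
            sample.error_rate > guardrails.max_error_rate
            or sample.p95_latency_ms > guardrails.max_p95_latency_ms
        )
        streak = streak + 1 if breached else 0
        if streak >= guardrails.consecutive_breaches:
            return True
    return False


guardrails = Guardrails(0.02, 2500.0, 3)
window = [
    Sample(1, 0.01, 1200.0),
    Sample(2, 0.05, 1300.0),
    Sample(3, 0.04, 2600.0),
    Sample(4, 0.03, 1100.0),
]
print(should_roll_back(window, guardrails))  # True: minutes 2 to 4 all breach
```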
Recommended primitives
Agent
| Agent | File | Purpose |
|---|---|---|
| `eval-runner` | `.github/agents/eval-runner.agent.md` | Runs eval suites, prompt regressions, routing updates, Foundry deploys |
Slash prompts
| Command | File | Purpose |
|---|---|---|
| `/eval-suite` | `.github/prompts/eval-suite.prompt.md` | Run the evaluation suite and publish scored reports |
| `/model-route` | `.github/prompts/model-route.prompt.md` | Update and validate the model-routing configuration |
| `/prompt-regression` | `.github/prompts/prompt-regression.prompt.md` | Run regression tests against changed prompts |
| `/foundry-deploy` | `.github/prompts/foundry-deploy.prompt.md` | Stage or promote a model or agent in Azure AI Foundry with rollback metadata |
Instructions scoped
| Scope (`applyTo`) | File | Purpose |
|---|---|---|
| `prompts/**/*.prompt.md` | `.github/instructions/prompts.instructions.md` | Prompt structure, version header, safety block, eval reference |
| `evals/**/*.yaml` | `.github/instructions/evals.instructions.md` | Eval dataset structure, metric definitions, thresholds |
| `foundry/**/*.bicep` | `.github/instructions/foundry.instructions.md` | Azure AI Foundry deployment templates, identity, rollback |
| `aml/**/*.py` | `.github/instructions/aml.instructions.md` | Azure Machine Learning experiment scaffolding |
Hooks
- `pre-commit`: lint prompts, check required metadata headers, redact secrets
- `pre-push`: run eval subset on changed prompts and models
- `post-merge`: run full eval suite and publish the dashboard
- `pre-deploy`: enforce rollback metadata and Foundry policy gates
- `post-deploy`: watch Application Insights for 60 minutes; auto-rollback on anomaly
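The pre-commit check for required metadata headers might look like the sketch below. The header format is an assumption: the example supposes each prompt file opens with a front-matter block delimited by `---` lines and containing `version`, `eval`, and `safety` keys, which mirrors the version header, eval reference, and safety block named in the prompt instructions but is not a documented schema. The parser handles only flat `key: value` lines.

```python
# Hypothetical required keys; adjust to whatever the team's
# prompt instructions actually mandate.
REQUIRED_KEYS = ("version", "eval", "safety")


def missing_header_keys(text: str) -> list[str]:
    """Return required keys absent from the leading front-matter block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return list(REQUIRED_KEYS)  # no header block at all
    found = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the header block
        key, sep, _ = line.partition(":")
        if sep:
            found.add(key.strip())
    return [key for key in REQUIRED_KEYS if key not in found]


sample = "\n".join([
    "---",
    "version: 3",
    "eval: evals/support-assistant.yaml",
    "---",
    "You are a support assistant.",
])
print(missing_header_keys(sample))  # ['safety']
```

A hook would run this over each staged prompt file and reject the commit when any list is non-empty.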
Validated MCPs
| MCP | Purpose | Owner |
|---|---|---|
| GitHub MCP Server | PRs, Actions runs, eval report comments | GitHub |
| Azure MCP Server | Operate Azure AI Foundry, Azure OpenAI, Azure Machine Learning, Application Insights | Microsoft |
| Microsoft Learn Docs MCP | Look up current guidance on Foundry, OpenAI, and AML patterns | Microsoft |
| Azure DevOps MCP Server | Track ML experiments as work items when the team uses Azure DevOps | Microsoft |
| Playwright MCP | End-to-end evaluation of AI features embedded in web UIs | Microsoft |
Real examples
Example 1: gating a prompt change
A Product Owner requests a friendlier tone in a support assistant. A PR edits prompts/support-assistant.prompt.md. The pre-push hook runs /prompt-regression; the Eval Runner scores the change against the versioned eval suite. Two helpfulness metrics rise, but a groundedness metric drops below threshold. The PR is blocked; the ML AI Engineer adds a constraint to the prompt; the next run passes and the PR merges.
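The regression diff in this example can be pictured as a before/after comparison rendered for a PR comment. The following is a sketch under stated assumptions: the metric names, scores, and thresholds are invented to mirror the scenario, and the actual report format posted by the Eval Runner is not specified in this document.

```python
def regression_rows(baseline: dict[str, float],
                    candidate: dict[str, float],
                    thresholds: dict[str, float]) -> list[tuple]:
    """Build (metric, before, after, delta, passed) rows, sorted by metric."""
    rows = []
    for metric in sorted(thresholds):
        before, after = baseline[metric], candidate[metric]
        rows.append(
            (metric, before, after, after - before, after >= thresholds[metric])
        )
    return rows


def to_markdown(rows: list[tuple]) -> str:
    """Render the rows as a pipe table suitable for a PR comment."""
    lines = [
        "| Metric | Before | After | Delta | Status |",
        "|---|---|---|---|---|",
    ]
    for metric, before, after, delta, passed in rows:
        status = "pass" if passed else "BLOCK"
        lines.append(
            f"| {metric} | {before:.2f} | {after:.2f} | {delta:+.2f} | {status} |"
        )
    return "\n".join(lines)


# Illustrative numbers: tone and helpfulness rise, groundedness falls
# below its threshold, so the change is blocked.
baseline = {"groundedness": 0.85, "helpfulness": 0.72, "tone": 0.60}
candidate = {"groundedness": 0.78, "helpfulness": 0.80, "tone": 0.75}
thresholds = {"groundedness": 0.80, "helpfulness": 0.70, "tone": 0.55}

rows = regression_rows(baseline, candidate, thresholds)
print(to_markdown(rows))
print([row[0] for row in rows if not row[4]])  # ['groundedness']
```

Note that a metric can improve overall quality perception while another falls through its floor; the gate looks at each threshold independently rather than at an aggregate score.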
Example 2: safe model upgrade
A new version of a GPT-class model becomes available in Azure OpenAI. The ML AI Engineer opens a PR modifying the routing config via /model-route. /eval-suite runs against a held-out dataset; cost and latency budgets are respected; safety-filter activations stay inside tolerance. /foundry-deploy --stage=prod deploys; post-deploy monitoring for 60 minutes confirms no regression. The old version stays hot for 24 hours for rollback.
Example 3: closing the telemetry loop
Application Insights shows a spike in hallucinated entity mentions for a specific tenant. The ML AI Engineer invokes /eval-suite --from=prod-telemetry --tenant=acme; the Eval Runner builds a new eval slice from recent cases, runs it against two candidate prompts, and recommends the winner. A PR lands the change with the new eval slice added to the regression suite permanently.
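Building an eval slice from recent production cases can be sketched as a filter-and-deduplicate pass over telemetry records. The record shape here (`tenant`, `flagged`, `input`, `context`) is hypothetical, as is the idea that an upstream step has already marked suspect responses as flagged; exporting the records from Application Insights is outside the sketch.

```python
def build_eval_slice(records: list[dict], tenant: str,
                     limit: int = 50) -> list[dict]:
    """Turn flagged production records for one tenant into eval cases.

    Inputs that differ only in case or surrounding whitespace are treated
    as duplicates so one noisy query does not dominate the slice.
    """
    seen = set()
    cases = []
    for record in records:
        if record["tenant"] != tenant or not record["flagged"]:
            continue
        key = record["input"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        cases.append({
            "input": record["input"],
            "context": record["context"],
            "source": "prod-telemetry",
        })
        if len(cases) >= limit:
            break
    return cases


# Made-up records for illustration.
records = [
    {"tenant": "acme", "flagged": True, "input": "Who is our CFO?", "context": "doc-1"},
    {"tenant": "acme", "flagged": True, "input": "who is our cfo? ", "context": "doc-1"},
    {"tenant": "acme", "flagged": False, "input": "Reset my password", "context": "doc-2"},
    {"tenant": "other", "flagged": True, "input": "List vendors", "context": "doc-3"},
]
print(len(build_eval_slice(records, "acme")))  # 1
```

Tagging each case with its source keeps production-derived cases distinguishable from hand-written ones once they join the permanent regression suite.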
Anti-patterns
- Prompts in application code. Inline prompts cannot be reviewed as changes; keep them in `prompts/*`.
- Evals optional. Every production-facing prompt or model has at least one eval dataset with thresholds.
- Hand-built routing. Routing as a series of if-statements in code is unreviewable; keep it in a declarative config.
- Deploys without rollback. Every Foundry deployment ships with metadata enabling one-click rollback.
- Training on a laptop. Use Azure Machine Learning; experiments must be reproducible from the job definition.
- Safety filters bolted on. Content safety is part of the pipeline from the start, not an incident response.
- Evals that never change. Eval datasets rot; add real production cases regularly via telemetry loops.
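As a contrast to the hand-built routing anti-pattern above, routing can be expressed as data plus a small resolver. The sketch below is illustrative only: the rule schema, the first-match-wins policy, and the model names `model-a` and `model-b` are assumptions, and in a repository the table would more likely live in a reviewed config file than in Python source.

```python
# Declarative routing table: each rule lists the context fields it matches
# and the model, version, and parameters to use. An empty "match" acts as
# the default. All names and values are hypothetical.
ROUTES = [
    {
        "match": {"tenant": "acme", "task": "summarize"},
        "model": "model-a",
        "version": "2",
        "params": {"temperature": 0.2},
        "rationale": "Higher groundedness on the acme summarization evals",
    },
    {
        "match": {},
        "model": "model-b",
        "version": "1",
        "params": {"temperature": 0.7},
        "rationale": "Default route",
    },
]


def resolve(context: dict, routes: list[dict]) -> dict:
    """Return the first rule whose match fields all equal the context."""
    for route in routes:
        if all(context.get(k) == v for k, v in route["match"].items()):
            return route
    raise LookupError(f"no route matches context {context!r}")


route = resolve({"tenant": "acme", "task": "summarize"}, ROUTES)
print(route["model"], route["version"])  # model-a 2
print(resolve({"tenant": "globex"}, ROUTES)["model"])  # model-b
```

Because each rule carries its own rationale, a routing change shows up in review as a data diff with the decision attached, and the served model for any request context can be reproduced by calling the resolver.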
KPIs and impact metrics
| Metric | Baseline (manual) | Target (agentic) | Source |
|---|---|---|---|
| Prompt regressions shipped to prod | 4 per quarter | 0 | Eval Runner history |
| Eval coverage on production prompts | 30 percent | 100 percent | Repo scan |
| Time from model release to evaluated prod deploy | 2 weeks | < 2 days | GitHub + Azure AI Foundry |
| Safety-filter violations per 1M requests | Unmeasured | < 10 | Application Insights |
| Cost per 1K requests (p50) | Drift | Within 10 percent of budget | Azure cost data |
| Rollback time after deploy anomaly | Hours | < 5 minutes | Azure AI Foundry |
| Experiments reproducible from Git | 40 percent | 100 percent | Azure Machine Learning jobs |
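Two of the targets in the table are simple ratios, and a short worked example shows how they might be computed. The input figures are invented, and "within 10 percent of budget" is read here as an absolute deviation of at most 10 percent in either direction, which is one possible interpretation.

```python
def violations_per_million(violations: int, requests: int) -> float:
    """Normalize a violation count to a rate per 1,000,000 requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return violations * 1_000_000 / requests


def within_budget(actual: float, budget: float, tolerance: float = 0.10) -> bool:
    """True when actual deviates from budget by at most the tolerance."""
    return abs(actual - budget) / budget <= tolerance


# 42 violations across 6,000,000 requests is 7 per million, under the
# target of fewer than 10.
print(violations_per_million(42, 6_000_000))  # 7.0

# A p50 cost of 1.05 against a budget of 1.00 deviates by 5 percent.
print(within_budget(1.05, 1.00))  # True
print(within_budget(1.20, 1.00))  # False
```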
Maturity in four levels
- L1 Manual: Prompts in code, evals in notebooks, deploys by console clicks, no routing config.
- L2 Assisted: Copilot drafts prompts and training code, evals in a notebook, single-model setup.
- L3 Augmented: Eval Runner agent, four slash prompts, scoped instructions, Foundry deploys from GitHub Actions.
- L4 Autonomous: Prompt and routing changes gated by evals, telemetry feeding the next eval automatically, rollback on anomaly within five minutes.
Integration with other personas
- From Data Engineer: curated feature datasets in Microsoft Fabric with documented lineage.
- From Software Architect: AI topology decisions (which tier calls which model) and latency budgets.
- To Developer: prompt files and routing config consumed from the repo; no inline prompts.
- With InfoSec Officer: AI threat model, content-safety configuration, PII redaction policies.
- With SRE: production monitoring, rollback automation, cost telemetry.
- To Product Owner: eval dashboards and regression reports during release review.
- With Compliance Auditor: model cards, training provenance, safety filter evidence.
Glossary
- Eval suite: a versioned set of datasets, metrics, and thresholds used to score a model or prompt.
- Prompt regression: a comparison of eval scores before and after a prompt change.
- Routing: the declarative mapping from request context to model, version, and parameters.
- Foundry deployment: a versioned release of a model or agent in Azure AI Foundry with rollback metadata.
- Groundedness: a metric describing how faithfully a model’s answer reflects its retrieved context.
- Safety filter: an Azure content-safety component that blocks or flags disallowed outputs.
- Telemetry loop: the process of turning production signals into new eval cases.
References
- Azure AI Foundry documentation — agents, models, evaluations
- Azure OpenAI Service — OpenAI models via Azure
- Azure Machine Learning — training and experiment tracking
- Azure AI Content Safety — safety filtering
- Application Insights for LLM apps — production telemetry
- GitHub Copilot and agents — developer workflow integration
- Microsoft Purview for AI data governance — sensitivity classification of training data
- GitHub Actions — CI and deployment orchestration across the stack
- Microsoft Learn Docs MCP — first-party documentation retrieval at implementation time
- GitHub Advanced Security — CodeQL, Dependabot, Secret Scanning, Push Protection