MLOps: Model Management with Azure Machine Learning
Machine Learning Operations (MLOps) applies DevOps principles to the machine learning lifecycle, improving the quality, consistency, and efficiency of ML solutions.
MLOps enables faster experimentation, deployment, and iteration while maintaining quality assurance and end-to-end lineage tracking.
What is MLOps?
MLOps is based on DevOps principles that increase workflow efficiency:
Continuous Integration: Automated testing and validation of ML code and models
Continuous Deployment: Automated deployment of models to production
Continuous Delivery: Reliable release of ML solutions to users
Benefits of MLOps
Applying MLOps to machine learning results in:
Faster Experimentation:
Quick iteration on model architectures
Parallel experiment tracking
Reproducible training pipelines
Efficient hyperparameter tuning
Faster Deployment:
Automated model packaging
Streamlined approval workflows
Infrastructure as code
Zero-downtime deployments
Better Quality:
Automated model validation
A/B testing in production
Performance monitoring
Drift detection and alerting
MLOps Capabilities in Azure Machine Learning
1. Reproducible ML Pipelines
Define repeatable workflows for data preparation, training, and scoring:
from azure.ai.ml import dsl
from azure.ai.ml import Input, Output

@dsl.pipeline(
    name="training_pipeline",
    description="End-to-end training pipeline",
)
def ml_pipeline(pipeline_input_data):
    # Data preparation step
    prep_data = prep_component(raw_data=pipeline_input_data)
    # Training step
    train_model = train_component(training_data=prep_data.outputs.prepared_data)
    # Evaluation step
    evaluate_model = eval_component(
        model=train_model.outputs.model,
        test_data=prep_data.outputs.test_data,
    )
    return {
        "model": train_model.outputs.model,
        "metrics": evaluate_model.outputs.metrics,
    }

# Create and submit pipeline
pipeline_job = ml_pipeline(
    pipeline_input_data=Input(type="uri_folder", path="azureml://datastores/data")
)
ml_client.jobs.create_or_update(pipeline_job)
Reusability: Use the same pipeline with different datasets
Versioning: Track pipeline definitions over time
Parallelization: Run independent steps concurrently
Scheduling: Trigger pipelines on schedules or events
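As a sketch of the scheduling capability, a pipeline job can be attached to a recurrence trigger with the SDK's schedule entities (the schedule name and cadence here are illustrative, and `pipeline_job` and `ml_client` are assumed from the example above):

```python
from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger, RecurrencePattern

# Run the training pipeline every Monday at 02:00 (illustrative cadence)
schedule = JobSchedule(
    name="training-pipeline-weekly",
    trigger=RecurrenceTrigger(
        frequency="week",
        interval=1,
        schedule=RecurrencePattern(week_days=["monday"], hours=2, minutes=0),
    ),
    create_job=pipeline_job,
)
ml_client.schedules.begin_create_or_update(schedule)
```

Event-based triggers (for example, on new data arriving) follow the same pattern with a different trigger type.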
2. Reusable Software Environments
Ensure reproducible builds without manual configuration:
from azure.ai.ml.entities import Environment

env = Environment(
    name="sklearn-env",
    description="Scikit-learn environment",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file="environment.yml",
)
ml_client.environments.create_or_update(env)
3. Model Registration and Versioning
Store and track models in the Azure Machine Learning registry:
from azure.ai.ml.entities import Model

# Register model
model = Model(
    path="outputs/model",
    name="fraud-detection-model",
    description="XGBoost model for fraud detection",
    tags={"framework": "xgboost", "task": "classification"},
    properties={"accuracy": "0.95", "dataset": "fraud_v2"},
)
registered_model = ml_client.models.create_or_update(model)
print(f"Registered model: {registered_model.name} version {registered_model.version}")
Model Registry Features:
Automatic Versioning: Each registration increments the version number automatically
Metadata Tracking: Store tags and properties for searchability
Lineage: Link to the training job, dataset, and environment
Model Comparison: Compare metrics across versions
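As a sketch of cross-version comparison, the registry's `list` call yields every version of a model, and the properties stored at registration (the `accuracy` property from the example above) can be read back side by side; an authenticated `ml_client` is assumed:

```python
# Compare a stored metric across all registered versions of a model
for m in ml_client.models.list(name="fraud-detection-model"):
    print(f"version {m.version}: accuracy={m.properties.get('accuracy')}")
```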
4. Model Deployment as Endpoints
Deploy models for real-time or batch inference:
Online Endpoints
Real-time inference with managed infrastructure:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
)

# Create endpoint
endpoint = ManagedOnlineEndpoint(
    name="fraud-detection-endpoint",
    description="Fraud detection API",
    auth_mode="key",
)
ml_client.online_endpoints.begin_create_or_update(endpoint)

# Create deployment
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="fraud-detection-endpoint",
    model=registered_model,
    environment="azureml://registries/azureml/environments/sklearn-1.5/versions/1",
    code_configuration=CodeConfiguration(
        code="src",
        scoring_script="score.py",
    ),
    instance_type="Standard_DS3_v2",
    instance_count=2,
)
ml_client.online_deployments.begin_create_or_update(deployment)
Batch Endpoints
Process large datasets asynchronously:
from azure.ai.ml.entities import (
    BatchEndpoint,
    BatchDeployment,
    Model,
)

# Create batch endpoint
endpoint = BatchEndpoint(
    name="fraud-batch-endpoint",
    description="Batch fraud scoring",
)
ml_client.batch_endpoints.begin_create_or_update(endpoint)

# Create deployment
deployment = BatchDeployment(
    name="default",
    endpoint_name="fraud-batch-endpoint",
    model=registered_model,
    compute="batch-cluster",
    instance_count=3,
    max_concurrency_per_instance=2,
    mini_batch_size=10,
    output_file_name="predictions.csv",
)
ml_client.batch_deployments.begin_create_or_update(deployment)
MLflow Models
Deploy without a scoring script:
deployment = ManagedOnlineDeployment(
    name="mlflow-deployment",
    endpoint_name="my-endpoint",
    model=Model(path="model", type="mlflow_model"),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
MLflow models package their own scoring logic, so no custom scoring script is needed.
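Once a deployment is live, the endpoint can be called directly for a quick smoke test. A sketch, assuming the online endpoint created above and a local `sample-request.json` file containing the model's input payload:

```python
# Score a sample request against a specific deployment of the endpoint
response = ml_client.online_endpoints.invoke(
    endpoint_name="fraud-detection-endpoint",
    deployment_name="blue",
    request_file="sample-request.json",
)
print(response)
```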
5. Controlled Rollout
Safely deploy new model versions with traffic splitting:
# Deploy new model version to the "green" deployment
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="fraud-detection-endpoint",
    model=new_model_version,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(green_deployment)

# Gradually shift traffic from blue to green
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint)

# Monitor metrics, then complete the rollout
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint)
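After green serves 100% of traffic and metrics look healthy, the retired deployment can be deleted; a sketch, noting that deletion is irreversible, so teams often keep the old deployment for a rollback window first:

```python
# Remove the retired "blue" deployment once green serves all traffic
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="fraud-detection-endpoint",
).result()
```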
Traffic Management Strategies:
Shadow Deployment: Mirror traffic to the new deployment without affecting production responses
Canary Release: Route a small percentage of traffic to the new version
Blue-Green: Switch all traffic between versions instantly
A/B Testing: Compare the performance of multiple model versions
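A shadow deployment corresponds to the endpoint's mirror traffic setting; a sketch, assuming the blue/green endpoint from the rollout example above:

```python
# Mirror 10% of live traffic to "green" for evaluation; callers still
# receive responses only from the deployments listed in endpoint.traffic
endpoint.mirror_traffic = {"green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint)
```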
6. End-to-End Lineage Tracking
Azure Machine Learning captures end-to-end lineage:
Data Lineage
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Register dataset
data_asset = Data(
    name="fraud-training-data",
    version="2024-01",
    description="Fraud transactions dataset",
    path="azureml://datastores/data/paths/fraud/",
    type=AssetTypes.URI_FOLDER,
    tags={"year": "2024", "domain": "finance"},
)
ml_client.data.create_or_update(data_asset)
Job History
Automatic tracking of:
Code snapshots (Git commit)
Input datasets and versions
Hyperparameters
Metrics and outputs
Compute environment
Duration and costs
# Query job history
jobs = ml_client.jobs.list(parent_job_name="training-pipeline-run-123")
for job in jobs:
    print(f"{job.name}: {job.status} - {job.properties}")
Event-Driven Workflows
Trigger actions based on ML lifecycle events:
Event Grid Integration
from azure.eventgrid import EventGridEvent

# Subscribe to model registration events
event_types = [
    "Microsoft.MachineLearningServices.ModelRegistered",
    "Microsoft.MachineLearningServices.ModelDeployed",
    "Microsoft.MachineLearningServices.DatasetDriftDetected",
]

# Event handler
def handle_ml_event(event: EventGridEvent):
    if event.event_type == "Microsoft.MachineLearningServices.ModelRegistered":
        model_name = event.data["modelName"]
        model_version = event.data["modelVersion"]
        # Trigger deployment pipeline
        trigger_deployment(model_name, model_version)
Monitoring and Alerting
Model Monitoring
Track model performance in production:
from azure.ai.ml.entities import AlertNotification

# Configure monitoring (illustrative sketch; see the azure.ai.ml model
# monitoring entities for the exact schedule and signal definitions)
monitor = ModelMonitor(
    endpoint_name="fraud-detection-endpoint",
    deployment_name="blue",
    monitoring_signals=[
        "data_drift",
        "prediction_drift",
        "model_performance",
    ],
    alert_notification=AlertNotification(
        emails=["ml-team@company.com"],
    ),
)
Metrics to Monitor
Operational:
Request latency (P50, P95, P99)
Throughput (requests/second)
Error rate
CPU/GPU utilization
Memory usage
Model Performance:
Prediction accuracy
Precision and recall
F1 score
AUC-ROC
Confusion matrix
Data Quality:
Input data drift
Feature distribution changes
Missing values
Outliers
Schema validation
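Input data drift can also be quantified without any SDK. A common signal is the Population Stability Index (PSI) between a training-time sample of a feature and a production sample; a minimal sketch (the 0.25 alert threshold is a common convention, not an Azure default):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against constant features

    def frac(sample, i):
        # Fraction of the sample falling in bin i (top bin includes the max)
        in_bin = sum(
            1 for x in sample
            if lo + i * width <= x < lo + (i + 1) * width
            or (i == bins - 1 and x == hi)
        )
        return max(in_bin / len(sample), 1e-6)  # floor to avoid log(0)

    return sum(
        (frac(actual, i) - frac(expected, i))
        * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

train = [float(x) for x in range(100)]
serve = [x + 50.0 for x in train]   # shifted production distribution
print(psi(train, train) == 0.0)     # identical samples: no drift
print(psi(train, serve) > 0.25)     # shifted: above the usual alert level
```

A scheduled job can compute this per feature and raise an alert when the index crosses the team's chosen threshold.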
CI/CD with Azure Pipelines
Integrate Azure Machine Learning into DevOps workflows:
Azure DevOps Extension
The Machine Learning extension provides:
Azure ML workspace integration
Model training triggers
Automated deployment tasks
Environment management
GitHub Actions
name: Train and Deploy ML Model
on:
  push:
    branches: [ main ]
  workflow_dispatch:
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Install Azure ML CLI
        run: az extension add -n ml
      - name: Submit Training Job
        run: |
          az ml job create \
            --file jobs/train.yml \
            --resource-group ${{ secrets.RESOURCE_GROUP }} \
            --workspace-name ${{ secrets.WORKSPACE_NAME }}
      - name: Deploy Model
        run: |
          az ml online-endpoint create --file endpoints/endpoint.yml
          az ml online-deployment create --file endpoints/deployment.yml
Best Practices
Track versions for:
Training code (Git commits)
Data assets (versioned datasets)
Models (automatic versioning)
Environments (pinned dependencies)
Pipeline definitions (YAML configs)
Implement:
Unit tests for training code
Integration tests for pipelines
Model validation tests
Deployment smoke tests
Performance benchmarks
Set up:
Real-time dashboards
Automated alerts
Data drift detection
Model performance tracking
Cost monitoring
Use a feature store; benefits include:
Consistent feature definitions
Training-serving skew prevention
Feature reusability
Point-in-time correctness
Establish:
Model approval workflows
Access control policies
Compliance documentation
Audit trails
Responsible AI reviews
Next Steps
Set Up MLOps: Configure CI/CD with Azure DevOps
Model Deployment: Deploy models to endpoints
Model Monitoring: Monitor models in production
Azure Pipelines: Integrate with Azure DevOps