Deploy Models with Azure Machine Learning

After training machine learning models, deploy them to production for inference using Azure Machine Learning endpoints. Deploy for real-time predictions or batch processing at scale.
Azure ML provides managed endpoints with automatic scaling, monitoring, and security, so you do not manage the underlying infrastructure.

Inference and Endpoints

Inference is the process of applying new input data to a machine learning model to generate outputs (predictions, classifications, clusters, etc.). An endpoint is a stable, durable URL that can be used to request predictions from your model.

Online Endpoints

Real-time inference with low latency

Batch Endpoints

Asynchronous processing of large datasets

Endpoint Anatomy

Endpoint

Provides:
  • Stable URL: e.g., https://my-endpoint.eastus.inference.ml.azure.com
  • Authentication: Key-based or Microsoft Entra ID
  • Authorization: Role-based access control

Deployment

Contains:
  • Model: Trained model files
  • Code: Scoring script (optional for MLflow models)
  • Environment: Software dependencies
  • Compute: Resources to run inference
One endpoint can contain multiple deployments, enabling A/B testing and safe rollouts.
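Because one endpoint routes requests across several deployments, it can help to check a traffic map before sending it to the service. The sketch below is plain Python and does not call Azure. The helper name `validate_traffic` is hypothetical, and the rule that shares must sum to 0 or 100 is stated here as an assumption about how traffic allocation is validated.

```python
def validate_traffic(traffic: dict[str, int], deployments: set[str]) -> None:
    """Raise ValueError if the traffic map is not a usable allocation."""
    # Every key must name a deployment that exists under the endpoint.
    unknown = set(traffic) - deployments
    if unknown:
        raise ValueError(f"Unknown deployments: {sorted(unknown)}")
    # Each share is a percentage.
    if any(share < 0 or share > 100 for share in traffic.values()):
        raise ValueError("Each share must be between 0 and 100")
    # Assumption: an allocation is either all-off (0) or fully assigned (100).
    total = sum(traffic.values())
    if total not in (0, 100):
        raise ValueError(f"Shares must sum to 0 or 100, got {total}")


# Passes silently: both deployments exist and the shares sum to 100.
validate_traffic({"blue": 90, "green": 10}, {"blue", "green"})
```

Running a check like this locally gives a faster failure than waiting for the endpoint update to be rejected.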

Deployment Types

Best for: Real-time, low-latency inference
Features:
  • Fully managed compute and scaling
  • Built-in monitoring and logging
  • Traffic splitting for A/B testing
  • Zero-downtime updates
  • Cost tracking per deployment
Use when:
  • Response time is critical (<1 second)
  • Request-response pattern
  • Small payloads (fits in HTTP request)
  • Need to scale based on traffic
from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="my-endpoint",
    description="Production inference endpoint",
    auth_mode="key"
)

Quick Start: Deploy a Model

1. Register Your Model

from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace_name="<workspace>"
)

# Register model
model = Model(
    path="./model",
    name="sklearn-classifier",
    description="Iris classification model",
    type="mlflow_model"  # or "custom_model"
)

registered_model = ml_client.models.create_or_update(model)
print(f"Registered: {registered_model.name} v{registered_model.version}")

2. Create Endpoint

from azure.ai.ml.entities import ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="iris-classifier-endpoint",
    description="Iris species classification",
    auth_mode="key",
    tags={"environment": "production", "team": "ml-ops"}
)

ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(f"Endpoint created: {endpoint.name}")

3. Deploy Model

from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    OnlineRequestSettings,
    ProbeSettings,
)

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    model=registered_model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    # Use the SDK's typed settings objects rather than plain dicts
    request_settings=OnlineRequestSettings(
        request_timeout_ms=90000,
        max_concurrent_requests_per_instance=1
    ),
    liveness_probe=ProbeSettings(
        initial_delay=10,
        period=10,
        timeout=2,
        failure_threshold=3
    )
)

# Blocks until the deployment finishes provisioning
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route 100% traffic to deployment
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

4. Test the Deployment

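import json

# Test with sample data (two iris measurements).
# The payload shape depends on the scoring script; MLflow no-code
# deployments typically expect an "input_data" key instead of "data".
sample_data = {
    "data": [
        [5.1, 3.5, 1.4, 0.2],
        [6.2, 2.9, 4.3, 1.3]
    ]
}

# invoke() reads the request body from a file, so write the sample to disk
with open("sample_request.json", "w") as f:
    json.dump(sample_data, f)

# Invoke endpoint
response = ml_client.online_endpoints.invoke(
    endpoint_name="iris-classifier-endpoint",
    request_file="sample_request.json",
    deployment_name="blue"  # Optional: test specific deployment
)

print(f"Predictions: {response}")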

Deployment Patterns

Blue-Green Deployment

Switch traffic between two deployments instantly:
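# Deploy new version to "green"
# new_model_version is a newer registered Model, e.g. one returned by
# ml_client.models.create_or_update(...) or ml_client.models.get(...)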
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name="iris-classifier-endpoint",
    model=new_model_version,
    instance_type="Standard_DS3_v2",
    instance_count=1
)
ml_client.online_deployments.begin_create_or_update(green_deployment).result()

# Test green deployment
response = ml_client.online_endpoints.invoke(
    endpoint_name="iris-classifier-endpoint",
    request_file="test.json",
    deployment_name="green"
)

# Switch all traffic to green
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Delete old blue deployment
ml_client.online_deployments.begin_delete(
    name="blue",
    endpoint_name="iris-classifier-endpoint"
).result()

Canary Deployment

Gradually shift traffic to test new version:
# Start with 10% traffic to new version
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Monitor metrics, then increase
endpoint.traffic = {"blue": 50, "green": 50}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Complete rollout
endpoint.traffic = {"green": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
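The ramp above (10%, then 50%, then 100%) can be generated rather than typed by hand. This is a minimal sketch in plain Python; the function name `canary_schedule` and its parameters are hypothetical, and each returned map would still need to be applied with `begin_create_or_update` after the metrics for the previous step look healthy.

```python
def canary_schedule(old: str, new: str, steps: list[int]) -> list[dict[str, int]]:
    """Return one traffic map per rollout step; every map sums to 100."""
    if steps != sorted(steps) or any(not 0 <= s <= 100 for s in steps):
        raise ValueError("steps must be ascending percentages in [0, 100]")
    schedule = []
    for new_share in steps:
        if new_share == 100:
            # Final step: the old deployment no longer receives traffic.
            schedule.append({new: 100})
        else:
            schedule.append({old: 100 - new_share, new: new_share})
    return schedule


for traffic in canary_schedule("blue", "green", [10, 50, 100]):
    print(traffic)
```

Keeping the schedule as data makes it easy to pause between steps or to stop and reapply the previous map if a step regresses.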

A/B Testing

Compare two model versions in production:
# Split traffic evenly
endpoint.traffic = {
    "model-v1": 50,
    "model-v2": 50
}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Monitor business metrics to choose winner
# Then route 100% to better performing model
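Choosing a winner usually means checking that the difference in a business metric is not just noise. One common option, shown here as an illustration rather than as something the platform does for you, is a two-proportion z-test on a success rate such as conversions per request. The function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z(success_a: int, total_a: int,
                     success_b: int, total_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the rate of B minus the rate of A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled rate under the null hypothesis that both models perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        raise ValueError("rates are degenerate (all successes or all failures)")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: model-v1 converted 100 of 1000, model-v2 150 of 1000.
z, p = two_proportion_z(100, 1000, 150, 1000)
print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value suggests the difference is unlikely to be chance, but the threshold and the minimum sample size are decisions to make before the test starts.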

Scaling and Performance

Autoscaling

Configure automatic scaling based on metrics:
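from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    TargetUtilizationScaleSettings,
)

# Note: target-utilization scale settings are the SDK entity used for
# Kubernetes online deployments; managed online deployments are usually
# autoscaled with Azure Monitor autoscale rules instead.
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
    scale_settings=TargetUtilizationScaleSettings(
        min_instances=1,
        max_instances=10,
        target_utilization_percentage=70,
        polling_interval=10
    )
)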
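To build intuition for what a 70% utilization target implies, the sketch below applies the common proportional rule: scale the instance count by observed utilization over target utilization, then clamp to the configured bounds. This is an illustrative model only, not the service's actual autoscaling algorithm, and the function name is hypothetical.

```python
import math


def desired_instances(current: int, utilization_pct: float, target_pct: float,
                      min_instances: int, max_instances: int) -> int:
    """Proportional scaling rule, clamped to [min_instances, max_instances]."""
    if current < 1 or target_pct <= 0:
        raise ValueError("current must be >= 1 and target_pct must be > 0")
    # Enough instances to bring average utilization back down to the target.
    raw = math.ceil(current * utilization_pct / target_pct)
    return max(min_instances, min(max_instances, raw))


# 2 instances at 90% utilization with a 70% target.
print(desired_instances(2, 90, 70, min_instances=1, max_instances=10))
```

The clamp is what keeps a traffic spike from exceeding quota and an idle period from dropping below the floor.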

Resource Limits

Control compute resources:
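from azure.ai.ml.entities import ManagedOnlineDeployment, OnlineRequestSettings

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=2,
    # Typed settings object rather than a plain dict
    request_settings=OnlineRequestSettings(
        request_timeout_ms=90000,
        max_concurrent_requests_per_instance=5,
        max_queue_wait_ms=60000
    )
)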
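The concurrency limit and the instance count together bound how much traffic a deployment can absorb. A rough sizing estimate follows from Little's law: requests in flight equal arrival rate times average latency. The sketch below is a back-of-envelope helper with a hypothetical name; it ignores queueing, warm-up, and latency variance.

```python
import math


def required_instances(requests_per_second: float, avg_latency_ms: float,
                       max_concurrent_per_instance: int) -> int:
    """Estimate instances needed so in-flight requests fit the concurrency limit."""
    if max_concurrent_per_instance < 1:
        raise ValueError("max_concurrent_per_instance must be >= 1")
    # Little's law: in-flight requests = arrival rate * time in system.
    in_flight = requests_per_second * avg_latency_ms / 1000
    return max(1, math.ceil(in_flight / max_concurrent_per_instance))


# 50 requests/s at 310 ms average latency, 5 concurrent requests per instance.
print(required_instances(50, 310, 5))
```

In practice it is safer to size against a high latency percentile than against the average, and to leave headroom for spikes.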

Monitoring Deployments

View Metrics in Azure Portal

Key metrics to monitor:
  • Request latency (P50, P95, P99)
  • Requests per second
  • HTTP status codes
  • CPU/GPU utilization
  • Memory usage
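Latency percentiles are worth computing yourself at least once, because they show how a single slow request dominates the tail. The sketch below uses the nearest-rank definition on a hypothetical list of latency samples; monitoring tools may use interpolation and report slightly different values.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in [1, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 18, 17, 19]
for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(latencies_ms, pct)} ms")
```

With only ten samples, P95 and P99 both land on the outlier, which is why tail percentiles need a reasonably large window to be meaningful.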

Query Logs

# Get deployment logs
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="iris-classifier-endpoint",
    lines=100
)
print(logs)

Application Insights Integration

from applicationinsights import TelemetryClient

tc = TelemetryClient('<instrumentation-key>')

# Log custom events: string properties first, numeric measurements second
tc.track_event(
    'PredictionMade',
    {'model_version': '1.2.0'},
    {'latency_ms': 45}
)
tc.flush()

Security

Authentication

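import requests

# Get endpoint keys
keys = ml_client.online_endpoints.get_keys(
    name="iris-classifier-endpoint"
)

# Fetch the endpoint from the service so that scoring_uri is populated
endpoint = ml_client.online_endpoints.get(name="iris-classifier-endpoint")

# Make authenticated request
headers = {
    "Authorization": f"Bearer {keys.primary_key}",
    "Content-Type": "application/json"
}

response = requests.post(
    endpoint.scoring_uri,
    headers=headers,
    json=sample_data,
    timeout=30
)
response.raise_for_status()  # Surface 401/403 and other HTTP errors early
print(response.json())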
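For callers that cannot install `requests`, the same key-authenticated call can be assembled with the standard library. The sketch below only builds the request object and does not send it. The `/score` path in the example URL and the helper name are assumptions for illustration; the `azureml-model-deployment` header is how a caller can target one deployment, and it is included here as optional.

```python
import json
import urllib.request
from typing import Optional


def build_scoring_request(scoring_uri: str, key: str, payload: dict,
                          deployment: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a key-authenticated POST to a scoring URI."""
    headers = {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
    if deployment:
        # Optional: route to a specific deployment instead of the traffic split.
        headers["azureml-model-deployment"] = deployment
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


# Hypothetical URI and placeholder key; nothing is sent over the network here.
req = build_scoring_request(
    "https://my-endpoint.eastus.inference.ml.azure.com/score",
    "<primary-key>",
    {"data": [[5.1, 3.5, 1.4, 0.2]]},
    deployment="blue",
)
print(req.get_method(), req.full_url)
```

Passing the result to `urllib.request.urlopen(req, timeout=30)` would perform the call. Keys belong in a secret store or environment variable rather than in source code.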

Network Security

Deploy with private networking:
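from azure.ai.ml.entities import IdentityConfiguration, ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(
    name="secure-endpoint",
    # Block inbound scoring traffic from the public internet
    public_network_access="disabled",
    # Typed identity object rather than a plain dict
    identity=IdentityConfiguration(type="system_assigned")
)

ml_client.online_endpoints.begin_create_or_update(endpoint).result()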

Cost Optimization

Start with smaller instances and scale up:
Instance Type      vCPUs   RAM     Cost (Relative)
Standard_DS2_v2    2       7 GB    1x
Standard_DS3_v2    4       14 GB   2x
Standard_DS4_v2    8       28 GB   4x
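The table can be turned into a small lookup for picking the cheapest instance type that still fits a model. The sketch below copies the three rows above; the helper names are hypothetical, and the cost figures are the relative multipliers from the table, not prices.

```python
INSTANCE_TYPES = {
    "Standard_DS2_v2": {"vcpus": 2, "ram_gb": 7, "relative_cost": 1},
    "Standard_DS3_v2": {"vcpus": 4, "ram_gb": 14, "relative_cost": 2},
    "Standard_DS4_v2": {"vcpus": 8, "ram_gb": 28, "relative_cost": 4},
}


def cheapest_fit(min_vcpus: int, min_ram_gb: float) -> str:
    """Return the lowest-cost instance type meeting both requirements."""
    candidates = [
        name for name, spec in INSTANCE_TYPES.items()
        if spec["vcpus"] >= min_vcpus and spec["ram_gb"] >= min_ram_gb
    ]
    if not candidates:
        raise ValueError("no listed instance type is large enough")
    return min(candidates, key=lambda name: INSTANCE_TYPES[name]["relative_cost"])


def relative_cost(instance_type: str, instance_count: int, hours: float) -> float:
    """Relative cost units = multiplier * instances * hours."""
    return INSTANCE_TYPES[instance_type]["relative_cost"] * instance_count * hours


# A model needing 2 vCPUs and 10 GB of RAM, run on 2 instances for a day.
choice = cheapest_fit(min_vcpus=2, min_ram_gb=10)
print(choice, relative_cost(choice, instance_count=2, hours=24))
```

Because the multipliers double with each size, two smaller instances cost the same as one instance of the next size up, so the choice often comes down to memory per instance rather than total cost.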
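Keep the instance floor low during low-traffic periods:
from azure.ai.ml.entities import TargetUtilizationScaleSettings

scale_settings = TargetUtilizationScaleSettings(
    min_instances=1,  # Managed online deployments do not scale to zero
    max_instances=10
)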
Use batch endpoints for large datasets, where you pay for compute only while a job is running:
from azure.ai.ml.entities import BatchDeployment

batch_deployment = BatchDeployment(
    name="default",
    endpoint_name="batch-endpoint",
    model=model,
    compute="batch-cluster",  # Name of an existing compute cluster
    instance_count=5  # Parallel processing
)
Track spending in Azure Cost Management:
  • Filter by deployment tags
  • Set budget alerts
  • Analyze cost trends

Troubleshooting

If a deployment fails, check:
  1. Model files are valid
  2. Scoring script has no syntax errors
  3. Environment dependencies are correct
  4. Sufficient quota for instance type
View deployment logs:
logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="my-endpoint",
    lines=500
)
If latency is high:
  • Use GPU instances for deep learning models
  • Optimize model (quantization, pruning)
  • Increase concurrent requests per instance
  • Enable request batching
  • Use model caching
If the deployment runs out of memory:
  • Switch to larger instance type
  • Reduce batch size in scoring script
  • Optimize model memory usage
  • Use model compression techniques
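Reducing the batch size in a scoring script usually means splitting the incoming rows into fixed-size chunks and predicting one chunk at a time, so peak memory depends on the chunk size rather than on the request size. The sketch below shows the pattern with a stand-in `predict` callable; the function names and the default of 32 are hypothetical.

```python
from typing import Callable, Iterator, Sequence


def chunked(rows: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_chunks(rows: Sequence, predict: Callable[[Sequence], list],
                      batch_size: int = 32) -> list:
    """Run predict on each chunk and concatenate results in input order."""
    results: list = []
    for chunk in chunked(rows, batch_size):
        results.extend(predict(chunk))
    return results


# Stand-in for a real model: "predicts" the sum of each row.
def fake_predict(chunk: Sequence) -> list:
    return [sum(row) for row in chunk]


rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(predict_in_chunks(rows, fake_predict, batch_size=2))
```

Smaller chunks lower peak memory at the cost of more model calls, so the right size is usually found by measuring both latency and memory on the target instance type.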

Next Steps

Online Endpoints

Learn more about real-time inference

Batch Scoring

Deploy models for batch processing

Monitor Deployments

Track performance and costs

MLOps

Automate deployment pipelines