Azure Machine Learning v2: Managed Real-Time Inference
February 9, 2022
A guide to deploying a machine learning model to the new managed real-time inference endpoints using Azure ML v2.

Here at Quisitive, we are getting increasingly excited about the additional features coming to v2 of Azure Machine Learning (currently in public preview). One of the most game-changing features allows us to deploy ML models to infrastructure for real-time inference that is managed by Azure ML itself, without the need to maintain and manage an Azure Kubernetes Service (AKS) Cluster. Any workspace contributor can build these endpoints using v2 of the Azure ML Command Line Interface (CLI).

Architecture of an Endpoint

There are two key entities that users should be aware of within a real-time endpoint:

  1. The Endpoint – there is precisely one endpoint, which is the first entity to define. It consists of the endpoint URI and the expected Swagger schema.
  2. The Deployments – there can be many deployments under a single endpoint, each corresponding to a different version of the model. Traffic can be varied between the deployments over time to accomplish blue-green or canary deployment strategies.

Important note: the virtual machine sizes are defined at the deployment level. This means that you need at least one virtual compute node per deployment within the endpoint.


Setting Up the CLI

Azure ML v2 is currently available only through the Command Line Interface (CLI), so the following steps consist of bash commands to be executed in a Linux shell. Configuring a new virtual environment beforehand is recommended. We also recommend installing the latest version of the Azure ML SDK before setting up your first endpoints – you should have at least v1.37.0 installed:

pip install --upgrade azureml-sdk

Next, you need to install the v2 Azure ML CLI extension for the Azure CLI. This involves installing or updating the Azure CLI itself, and then adding the new ML extension.

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az extension add -n ml -y

Finally, log in to the CLI:

az login
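
To avoid passing the workspace (and resource group) on every subsequent command, you can optionally set CLI defaults at this point. A convenience sketch – the subscription ID, resource group, and workspace names below are placeholders for your own:

```shell
# Select the subscription containing your Azure ML workspace
az account set --subscription "<your-subscription-id>"

# Set defaults so -g/-w flags can be omitted from later az ml commands
az configure --defaults group=my_resource_group workspace=workspace_name
```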

Creating the Endpoint

The endpoint creation process is configuration-driven. This means the first step is to create an endpoint.yaml configuration file to describe the endpoint. An example configuration file is shown below:

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: cattestendpoint
auth_mode: key
description: 'Test endpoint for managed endpoints public preview'
tags:
  purpose: publicpreviewtest
  creator: quisitive

Once you have created this file, you can deploy an endpoint into your workspace through the command line as shown below:

az ml online-endpoint create -f endpoint.yaml -w workspace_name

This creates the endpoint and its URI, which will be visible in the “Endpoints” tab of your Azure ML workspace. In this example, key-based authentication means that a permanent API key is created, which must be passed in the request header for authentication.
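
Once a live deployment exists behind the endpoint, you can retrieve the key and call the endpoint from any HTTP client. A sketch, assuming the endpoint name above – the exact scoring URI is shown in the workspace’s “Endpoints” tab, and the request body must match what your scoring script expects:

```shell
# Retrieve the endpoint's API key
az ml online-endpoint get-credentials --name cattestendpoint -w workspace_name

# Call the endpoint, passing the key as a bearer token
curl -X POST "<scoring-uri>" \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"data": [[1.0, 2.0, 3.0]]}'
```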

Creating a Deployment

Now that an endpoint has been created, you can deploy a model within it. You can deploy many versions of the same model behind an endpoint, each of which will be a separate deployment. You can then vary the traffic that passes to each deployment over time.

To create a deployment, you need to predefine the following:

  • A registered model in Azure ML
  • A registered environment in Azure ML, containing the required python packages
  • A scoring script, in line with the usual Azure ML format for real-time endpoints

As with endpoints, deployments are configuration-driven, so you will need a deployment.yaml file for each deployment. For example:

name: catblue
endpoint_name: cattestendpoint
description: 'Blue deployment for online endpoint example'
app_insights_enabled: true
model: azureml:test_model:1
code_configuration:
  code:
    local_path: .
  scoring_script: score.py  # your scoring script
environment: azureml:test_environment:1

instance_type: Standard_F2s_v2
instance_count: 2
request_settings:
  request_timeout_ms: 3000
  max_concurrent_requests_per_instance: 1
liveness_probe:
  period: 10
  initial_delay: 10
  timeout: 2
  success_threshold: 1
  failure_threshold: 30
readiness_probe:
  period: 10
  initial_delay: 10
  timeout: 2
  success_threshold: 1
  failure_threshold: 30
environment_variables:
  test_variable: 'test'
tags:
  purpose: publicpreviewblue
  creator: quisitive

Note: when the deployment is initially created, it will deploy with a fixed number of nodes, with the VM size defined by instance_type. We will cover how to enable autoscaling in a future post.

You can then create the deployment through the CLI:

az ml online-deployment create -f deployment.yaml -w workspace_name
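
If the deployment fails to provision or the container crashes on startup, the deployment logs are the first place to look – for example, assuming the names used above:

```shell
# Retrieve the container logs for the catblue deployment
az ml online-deployment get-logs --name catblue --endpoint-name cattestendpoint -w workspace_name
```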

When this initial deployment completes, no traffic is routed to it by default. At this point, you may want to test the endpoint and deployment by passing an example .json file:

az ml online-endpoint invoke --name cattestendpoint --deployment catblue --request-file test.json
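
The contents of test.json must match whatever your scoring script parses. A sketch, assuming the common convention of a "data" key holding a batch of feature rows:

```json
{
  "data": [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
  ]
}
```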

Finally, make the endpoint live by routing 100% of the traffic to the new deployment:

az ml online-endpoint update --name cattestendpoint --traffic "catblue=100"

This update command can be used to alter the traffic if there are multiple deployments.
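
For example, with a second deployment created under the same endpoint (here a hypothetical catgreen), a canary rollout could shift a small slice of traffic to the new version first:

```shell
# Send 90% of traffic to catblue and 10% to the new catgreen deployment
az ml online-endpoint update --name cattestendpoint --traffic "catblue=90 catgreen=10"
```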