Lesson 4: Pipeline automation
https://docs.microsoft.com/fr-fr/azure/machine-learning/how-to-use-automlstep-in-pipelines#configure-and-create-the-automated-ml-pipeline-step
https://docs.microsoft.com/fr-fr/python/api/azureml-core/azureml.core.model.model?view=azure-ml-py
https://docs.microsoft.com/fr-fr/python/api/azureml-pipeline-core/azureml.pipeline.core.portdatareference?view=azure-ml-py#azureml-pipeline-core-portdatareference-path-on-datastore
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-azure-container-instance
https://docs.microsoft.com/en-us/azure/machine-learning/how-to-secure-web-service
Create a Pipeline
Summary
The most common SDK class is the `Pipeline` class, which you will use when creating a pipeline. Pipelines can take configuration and multiple steps, such as an AutoML step.
Different steps can have different arguments and parameters. Parameters are just like variables in a Python script.
There are several areas you can play with when creating a pipeline, and we covered the following:
- Use pipeline parameters
- Recurring Scheduled Pipelines
- Batch Inference Pipelines
Pipeline Class
This is the most common Python SDK class you will see when dealing with pipelines. Aside from accepting a workspace and allowing multiple steps to be passed in, it takes a description that is useful for identifying the pipeline later.
Using Pipeline Parameters
Pipeline parameters are also available as a class. You configure the `PipelineParameter` class with the parameters needed so that they can later be used.
In this example, the `avg_rate_param` is used in the `arguments` attribute of the `PythonScriptStep`.
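A minimal sketch of how this might look (the script name, folder, and compute target are assumptions, not from the lesson):

```python
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Pipeline parameter with a default value that can be overridden at submission time
avg_rate_param = PipelineParameter(name="avg_rate", default_value=0.5)

train_step = PythonScriptStep(
    script_name="train.py",            # hypothetical training script
    source_directory="./scripts",      # assumed script folder
    arguments=["--avg_rate", avg_rate_param],
    compute_target=compute_target,     # an existing AmlCompute target
)
```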
Scheduling a recurring Pipeline
To schedule a pipeline, you must use the `ScheduleRecurrence` class, which holds the information necessary to set the interval. Once that has been created, it has to be passed into the `create()` method of the `Schedule` class as a recurrence value.
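A minimal sketch, assuming a pipeline has already been published as `published_pipeline`:

```python
from azureml.pipeline.core.schedule import Schedule, ScheduleRecurrence

# Run once a day; frequency can be "Minute", "Hour", "Day", "Week", or "Month"
recurrence = ScheduleRecurrence(frequency="Day", interval=1)

schedule = Schedule.create(
    ws,                                   # an existing Workspace object
    name="daily-pipeline-schedule",       # hypothetical schedule name
    pipeline_id=published_pipeline.id,
    experiment_name="automlstep-classification",
    recurrence=recurrence,
)
```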
Batch Inference Pipeline
One of the core responsibilities of a batch inference pipeline is to run in parallel. For this to happen, you must use the ParallelRunConfig class which helps define the configuration needed to run in parallel.
Some important aspects of this are the script that will do the work (the `entry_script` parameter), how many failures it should tolerate (the `error_threshold` parameter), and how the workload is split into batches (the `mini_batch_size` parameter, 5 in this example).
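A hedged sketch of such a configuration (the script name, environment, and node count are illustrative, not from the lesson):

```python
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="./scripts",      # assumed folder holding the entry script
    entry_script="batch_score.py",     # hypothetical script that does the work
    error_threshold=10,                # failures to tolerate before aborting
    output_action="append_row",
    environment=batch_env,             # an existing Environment object
    compute_target=compute_target,     # an existing AmlCompute target
    node_count=2,
    mini_batch_size="5",               # 5 items per mini-batch, as in the example
)
```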
Definition
- Batch inference: The process of doing predictions using parallelism. In a pipeline, it will usually be on a recurring schedule
- Recurring schedule: A way to schedule pipelines to run at a given interval
- Pipeline parameters: Like variables in a Python script, these can be passed into a script argument
Exercise
Step 1: Create a Pipeline
Summary
Pipelines are very useful and are a foundation of automation and operations in general. Being able to create a Pipeline allows for easier interaction with model deployments.
This demo shows you how to use the Python SDK to create a pipeline with AutoML steps.
For this exercise, you will create a pipeline using the Python SDK.
First, create a pipeline using the Python SDK. (This is the part up to the Examine Results section in the provided notebook.)
Optionally, you can copy and run the cells in the Examine Results section to test the pipeline and retrieve the best model. This step runs an Automated ML experiment, so it will take about 30 minutes to complete. Please keep track of the remaining time before you run these cells.
Attention
Make sure you update the cells to match your dataset and other variables; these spots are noted in comments in the notebook.
Feel free to modify the code to explore the different pipeline features and parameters. To shorten the total time needed to train the model, you can change the `experiment_timeout_minutes` value in the notebook's AutoML settings, currently set to 20 minutes, from 20 to 10:
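For example (only `experiment_timeout_minutes` comes from the lesson; the other keys are illustrative):

```python
automl_settings = {
    "experiment_timeout_minutes": 10,   # lowered from 20 to shorten the run
    "max_concurrent_iterations": 4,     # illustrative value
    "primary_metric": "AUC_weighted",   # illustrative value
}
```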
Create and run the pipelines using the Python SDK.
Step 2: Publish a pipeline
In this part, you need to publish a pipeline using both Azure ML Studio and the Python SDK. Please re-use the pipeline created in the previous part.
You are encouraged to write your own code to publish the pipeline. If you get stuck, review the first few cells in the Publish and run from REST endpoint section in the provided notebook.
Azure Machine Learning Pipeline with AutoMLStep
We demonstrate the use of `AutoMLStep` in an Azure Machine Learning Pipeline.
Introduction
In this example we showcase how you can use an AzureML Dataset to load data for AutoML via an AML Pipeline.
If you are using an Azure Machine Learning Notebook VM, you are all set. Otherwise, make sure you have executed the configuration before running this notebook.
Here, you will learn how to:
- Create an `Experiment` in an existing `Workspace`.
- Create or attach an existing AmlCompute cluster to a workspace.
- Define data loading in a `TabularDataset`.
- Configure AutoML using `AutoMLConfig`.
- Use `AutoMLStep`.
- Train the model using AmlCompute.
- Explore the results.
- Test the best fitted model.
Initialize Workspace
Initialize a workspace object from persisted configuration. Make sure the config file is present at `.\config.json`.
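A minimal sketch of that initialization:

```python
from azureml.core import Workspace

# Reads subscription id, resource group, and workspace name from .\config.json
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, sep="\n")
```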
Create an Azure ML experiment
Let's create an experiment named "automlstep-classification" and a folder to hold the training scripts. The script runs will be recorded under the experiment in Azure.
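A minimal sketch, assuming the workspace object `ws` from the previous step:

```python
import os
from azureml.core import Experiment

experiment_name = "automlstep-classification"
experiment = Experiment(ws, experiment_name)

# A dedicated folder for this step's scripts (see the best-practice note below)
project_folder = "./automlstep-classification"
os.makedirs(project_folder, exist_ok=True)
```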
The best practice is to use separate folders for scripts and their dependent files for each step, and to specify that folder as the `source_directory` for the step. This helps reduce the size of the snapshot created for the step (only the specific folder is snapshotted). Since changes in any files in the `source_directory` trigger a re-upload of the snapshot, keeping the folder small helps preserve step reuse when nothing in the `source_directory` has changed.
Create or Attach an AmlCompute cluster
You will need to create a compute target for your AutoML run. In this tutorial, you get the default `AmlCompute` as your training compute resource.
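A typical create-or-attach sketch (the cluster name and VM size are assumptions):

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"  # hypothetical cluster name

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2",
                                                   max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```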
Data
Train
This creates a general AutoML settings object.
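Something along these lines, assuming a `TabularDataset` called `train_data` and a label column named `y` (both assumptions):

```python
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 4,
    "primary_metric": "AUC_weighted",
}

automl_config = AutoMLConfig(
    task="classification",
    compute_target=compute_target,
    training_data=train_data,      # a TabularDataset loaded earlier
    label_column_name="y",         # hypothetical label column
    **automl_settings,
)
```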
Create Pipeline and AutoMLStep
Define outputs
You can define outputs for the `AutoMLStep` using `TrainingOutput`.
- `name` (str, required): The name of the `PipelineData` object, which can contain only letters, digits, and underscores. `PipelineData` names are used to identify the outputs of a step. After a pipeline run has completed, you can use the step name with an output name to access a particular output. Names should be unique within a single step in a pipeline.
- `pipeline_output_name` (required): If provided, this output will be available by using `PipelineRun.get_pipeline_output()`. Pipeline output names must be unique in the pipeline.

More on `TrainingOutput`:
Definition
Defines a specialized output of certain `PipelineStep`s for use in a pipeline. `TrainingOutput` enables an automated machine learning metric or model to be made available as a step output to be consumed by another step in an Azure Machine Learning Pipeline. It can be used with `AutoMLStep` or `HyperDriveStep`.
`TrainingOutput` is used with `PipelineData` when constructing a Pipeline to enable other steps to consume the metrics or models generated by an `AutoMLStep` or `HyperDriveStep`.
Create an AutoMLStep
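A minimal sketch, wiring in the `AutoMLConfig` and the two outputs defined above:

```python
from azureml.pipeline.steps import AutoMLStep

automl_step = AutoMLStep(
    name="automl_module",
    automl_config=automl_config,           # AutoMLConfig defined earlier
    outputs=[metrics_data, model_data],    # TrainingOutput-backed PipelineData
    allow_reuse=True,
)
```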
Define the pipeline
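A sketch of the `Pipeline` object; note the description, which helps identify it later:

```python
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(
    description="pipeline_with_automlstep",  # useful for identifying it later
    workspace=ws,
    steps=[automl_step],
)
```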
Run the pipeline
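Submitting the pipeline to the experiment created earlier:

```python
pipeline_run = experiment.submit(pipeline)
pipeline_run.wait_for_completion()
```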
Examine Results
Retrieve the metrics of all child runs
Outputs of the above run can be used as inputs to other steps in the pipeline. In this tutorial, we will examine the outputs by retrieving the output data and running some tests.
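A sketch of retrieving the metrics output, assuming the `metrics_output` name defined earlier (downloading recreates the datastore-relative path locally):

```python
import json
import pandas as pd

metrics_output = pipeline_run.get_pipeline_output("metrics_output")
metrics_output.download(".", show_progress=True)

# The file lands locally under the same relative path it has on the datastore
with open(metrics_output.path_on_datastore) as f:
    metrics = json.load(f)
pd.DataFrame(metrics)
```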
Retrieve the Best Model
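Similarly, a sketch of fetching and deserializing the best model, assuming the `best_model_output` name defined earlier:

```python
import pickle

best_model_output = pipeline_run.get_pipeline_output("best_model_output")
best_model_output.download(".", show_progress=True)

with open(best_model_output.path_on_datastore, "rb") as f:
    best_model = pickle.load(f)
best_model.steps  # inspect the fitted model's steps
```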
Test the Model
Load Test Data
The test data should go through the same preparation steps as the training data; otherwise, the run might fail at the preprocessing step.
Testing Our Best Fitted Model
We will use a confusion matrix to see how our model performs.
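A minimal sketch, assuming `X_test` and `y_test` were produced in the Load Test Data step:

```python
from sklearn.metrics import confusion_matrix

y_pred = best_model.predict(X_test)       # hypothetical test features
print(confusion_matrix(y_test, y_pred))   # rows: actual, columns: predicted
```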
Publish and run from REST endpoint
Run the following code to publish the pipeline to your workspace. In your workspace in the portal, you can see metadata for the pipeline including run history and durations. You can also run the pipeline manually from the portal.
Additionally, publishing the pipeline enables a REST endpoint to rerun the pipeline from any HTTP library on any platform.
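A sketch of publishing the completed run as a pipeline (the name and version are assumptions):

```python
published_pipeline = pipeline_run.publish_pipeline(
    name="automlstep-pipeline",                      # hypothetical name
    description="Training pipeline with AutoMLStep",
    version="1.0",
)
print(published_pipeline.endpoint)  # the REST URL used below
```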
Authenticate once again to retrieve the `auth_header` so that the endpoint can be used.
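For example, with interactive authentication:

```python
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()
```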
Get the REST URL from the endpoint property of the published pipeline object. You can also find the REST URL in your workspace in the portal. Build an HTTP POST request to the endpoint, specifying your authentication header. Additionally, add a JSON payload object with the experiment name and the batch size parameter. As a reminder, `process_count_per_node` is passed through to `ParallelRunStep` because it is defined as a `PipelineParameter` object in the step configuration.
Make the request to trigger the run. Access the Id key from the response dict to get the value of the run id.
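A sketch of the request (the experiment name and parameter value are assumptions):

```python
import requests

rest_endpoint = published_pipeline.endpoint
response = requests.post(
    rest_endpoint,
    headers=auth_header,
    json={"ExperimentName": "automlstep-classification",
          "ParameterAssignments": {"process_count_per_node": 2}},
)
run_id = response.json().get("Id")  # id of the newly triggered run
```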
Use the run id to monitor the status of the new run. This will take another 10-15 min to run and will look similar to the previous pipeline run, so if you don't need to see another pipeline run, you can skip watching the full output.
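One way to monitor it from the SDK, assuming the experiment name used above:

```python
from azureml.pipeline.core.run import PipelineRun

published_pipeline_run = PipelineRun(ws.experiments["automlstep-classification"], run_id)
published_pipeline_run.wait_for_completion(show_output=True)
```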
Consume Pipeline Endpoint (API)
Summary
Pipeline endpoints can be consumed via HTTP, but it is also possible to do so via the Python SDK. Since there are different ways to interact with published Pipelines, this makes the whole pipeline environment very flexible.
It is key to find and use the correct HTTP endpoint to interact with a published pipeline. Sending a request over HTTP to a pipeline endpoint will require authentication in the request headers. We will talk more about it later.
Pipelines can perform several other tasks aside from training a model. Some of these tasks, or steps, are:
- Data Preparation
- Validation
- Deployment
- Combined tasks
Definition
- Pipeline endpoint: The URL of the published Pipeline
- HTTP Headers: Part of the HTTP specification, where a request can attach extra information, like authentication
- Automation: A core pillar of DevOps which is applicable to Machine Learning
- Batch inference: The process of doing predictions using parallelism. In a pipeline, it will usually be on a recurring schedule
- HTTP trigger: With configuration, a service can create an HTTP request based on certain conditions
- Pipeline parameters: Like variables in a Python script, these can be passed into a script argument
- Publishing a Pipeline: Allowing external access to a Pipeline over an HTTP endpoint
- Recurring schedule: A way to schedule pipelines to run at a given interval
Documentation
- PipelineEndpoint Class.
- Create and run machine learning pipelines with Azure Machine Learning SDK
- Tutorial: Create Training and Inferencing Pipelines with Azure ML Designer
- Build Repeatable ML Workflows with Azure Machine Learning Pipelines
- Tutorial: Build an End-to-End Azure ML Pipeline with the Python SDK
- Tutorial: Train Machine Learning Models with Automated ML Feature of Azure ML
Best Practices for Azure Machine Learning Pipelines
Attention
This is a rephrasing of a StackOverflow answer; go check it for the complete answers.
Most of the time, a pipeline has at least 4 steps.
- Input data
- Data transformation step
- Model Training step
- Model scoring step
There are a bunch of things that are completely unclear from the documentation and the examples and I'm struggling to fully grasp the concept.
- When I look at batch scoring examples, they are implemented as a Pipeline Step. This raises the question:
Question
Does this mean that the predicting part is part of the same pipeline as the training part, or should there be two separate pipelines for this?
Making 1 pipeline that combines both steps seems odd to me, because you don't want to run your predicting part every time you change something to the training part (and vice versa).
A pipeline architecture depends on whether:
- you need to predict live (else batch prediction is sufficient), and
- your data is already transformed and ready for scoring.
If you need live scoring, you should deploy your model. If batch scoring is sufficient, you could either have:
- a training pipeline at the end of which you register a model that is then used in a scoring pipeline, or
- have one pipeline that can be configured to do either using script arguments.
Attention
As of March 2021, `PipelineData` is no longer the preferred way: "PipelineData use DataReference underlying which is no longer the recommended approach for data access and delivery, please use OutputFileDatasetConfig instead".
In the batch scoring examples, the assumption is that there is already a trained model, which could be coming from another pipeline, or in the case of the notebook, it's a pre-trained model not built in a pipeline at all.
However, running both training and prediction in the same pipeline is a valid use-case. Use the `allow_reuse` param and set it to `True`, which will cache the step output in the pipeline to prevent unnecessary reruns.
Take a model training step for example, and consider the following input to that step:
- training script
- input data
- additional step params
If you set `allow_reuse=True`, and your training script, input data, and other step params are the same as the last time the pipeline ran, it will not rerun that step; it will use the cached output from the last time the pipeline ran. But if, say, your data input changed, then the step would rerun.
In general, pipelines are pretty modular and you can build them how you see fit. You could maintain separate pipelines for training and scoring, or bundle everything in one pipeline but leverage the automatic caching.
- What parts should be implemented as a Pipeline Step and what parts shouldn't? Should the creation of the Datastore and Dataset be implemented as a step? Should registering a model be implemented as a step?
All transformations you do to your data (munging, featurization, training, scoring) should take place inside of `PipelineStep`s, whose inputs and outputs should be `PipelineData` objects.
Azure ML artifacts should be:
- created in the pipeline control plane using `PipelineData`, and
- registered either ad-hoc, as opposed to with every run, or
- when you need to pass artifacts between pipelines.
In this way, `PipelineData` is the glue that connects pipeline steps directly, rather than connecting them indirectly with `.register()` and `.download()`. `PipelineData` objects are ultimately just ephemeral directories that can also be used as placeholders before steps are run to create and register artifacts.
`Dataset`s are abstractions of `PipelineData`s in that they make things easier to pass to `AutoMLStep`, `HyperDriveStep`, and `DataDrift`.
- What isn't shown anywhere is how to deal with the model registry. I create the model in the training step and then write it to the output folder as a pickle file. Then what? How do I get the model in the next step? Should I pass it on as a `PipelineData` object? Should `train.py` itself be responsible for registering the trained model?
During development, I recommend that you don't register your model and that the scoring step receives your model via `PipelineData` as a pickled file.
In production, the scoring step should use a previously registered model. A new model should be registered only after comparing its metrics with the current model's, triggering registration if it performs better.
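A hedged sketch of that conditional registration (the metric names and model path are assumptions):

```python
from azureml.core.model import Model

# Register only if the new model beats the currently registered one
if new_auc > current_auc:                 # hypothetical metric comparison
    Model.register(
        workspace=ws,
        model_path="outputs/model.pkl",   # pickled model written by the step
        model_name="automl-best-model",
        tags={"AUC": str(new_auc)},
    )
```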