
Azure MLOps Template Part 3: Data

Introduction

This is the third part of a 10-part blog article series about my Azure MLOps Template repository. For an overview of the whole series, read this article. I will present each of the 10 parts as a separate tutorial. This will allow an interested reader to easily replicate the setup of the template.

In the first part, we saw how to provision our Azure infrastructure with the open source Infrastructure as Code (IaC) tool Terraform. In the second part, we covered how to set up reliable, reproducible and easily maintainable Python environments using the popular open source package management system Conda as well as the Azure Machine Learning (AML) Environments feature. In this third part, we will see how to store and access data using AML. Specifically, we will download the Stanford Dogs dataset from the Internet, upload it to the default Azure Blob Storage of our AML workspace and then register it as an AML dataset.

Prerequisites to follow along:

  • You have an Azure Account
  • You either have a Pay-As-You-Go subscription (in which case you will incur cost) or you have some free Azure credits
  • You have followed along in the first part and the second part of the blog article series

AML Datastores and Datasets Overview

A key element of successful MLOps is to have a good data strategy in place. Data strategy is an aspect with numerous facets in the context of MLOps. One of these facets that we will focus on in this article is data access. Reliable, secure and continuous access to curated and refreshed data is one of the keys when it comes to successful data strategies. 

While it might be tempting to just upload some training data (e.g. a csv file) locally and then use it for model training, this is not a good idea from an MLOps perspective. This practice is short-sighted and leads to data replication and data swamps in the medium and long term. Instead, it is advisable to integrate the model training process with the enterprise data lake, where data is curated and refreshed on a regular basis by data engineering teams. With this approach, automated model training workflows can be established, reusability and reproducibility are improved, and data quality can be enforced.

In this blog article, we will see how the integration between AML and Azure storage works. There are two AML concepts/features that are important in this context: the AML datastore and the AML dataset.

With the AML datastore feature, we can seamlessly and securely connect our workspace to different types of Azure storage services. For each Azure storage service that we want to connect, we simply create an AML datastore in the AML workspace. You can create an AML datastore via the Azure Portal, the AML Python SDK, Azure Resource Manager (ARM) templates or the AML VS Code Extension. If you want to learn more about AML datastores, read this documentation.
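For illustration, here is a minimal sketch of how an additional Blob container could be registered as an AML datastore with the AML Python SDK (v1); the datastore name, container name, storage account name and key are placeholders, not values from the template:

```python
# Hypothetical sketch: registering an Azure Blob container as an AML datastore.
# All names and the account key below are placeholders.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()  # reads the workspace config available on the Compute Instance

blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="my_blob_datastore",   # name of the new AML datastore
    container_name="my-container",        # existing blob container
    account_name="mystorageaccount",      # storage account name
    account_key="<storage-account-key>",  # alternatively, pass sas_token
)
print(blob_datastore.name)
```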
 
The AML dataset feature allows us to version control and efficiently track and reuse the data that flows into our model training. This can be useful for a variety of reasons such as to understand how changes in data affect model performance but also from a compliance standpoint to understand which data a model was trained on. AML datasets are basically a reference to a data source location along with a copy of its metadata. 
 
In this article, we will see how to create an AML dataset that points to a location in an AML datastore. However, there are also other scenarios for AML datasets: for example, you can point to a URI instead of to an AML datastore. If you want to learn more about AML datasets, read this documentation.
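As a small, hedged sketch of that URI-based scenario, a File dataset could be created directly from a public URL (the URL below is purely illustrative):

```python
# Hypothetical sketch: a File dataset pointing at a public URL instead of a datastore.
from azureml.core import Dataset

web_dataset = Dataset.File.from_files(path="https://example.com/data/sample_images.zip")
```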
 

Next to the secure and reliable access and the reusability, versioning and tracking of data that comes with connecting ML systems to enterprise data lake storage, an additional concept has emerged over the last few years: the feature store. The feature store takes the concept of reusability one step further as it aims to reuse precomputed features instead of just raw data from the data lake. In this article, the feature store concept will not be mentioned any further. However, I will write a separate article about how to build a feature store in Azure Databricks in the future. For now, if you are interested, you can have a look at this article by Databricks.

In the following, I will give a step-by-step walkthrough of the dataset setup process using AML datastores and datasets. We will leverage the Azure infrastructure that was provisioned in part 1 of this blog article series as well as the cloned Git repository that was created inside your Azure DevOps Project. We will also leverage our work from part 2 of this blog article series: in part 2, we cloned your Azure MLOps Template repository to your AML Compute Instance and then went through the 00_environment_setup.ipynb notebook to set up reliable, reproducible and easily maintainable Python environments. We will make use of the Jupyter kernel that we created as part of this setup process.

2. Run the Dataset Setup Notebook

I have created the 01_dataset_setup.ipynb notebook, which is located in the notebooks directory of the template repository. Open this notebook on your AML Compute Instance (e.g. with Jupyter Lab). We will use the Jupyter kernel dogs_clf_dev_env that we have created in the 00_environment_setup.ipynb notebook to run this notebook. You can select the Jupyter kernel as follows once you have opened the notebook:

[Image: dataset_notebook_kernel.png]

Now follow the instructions from the notebook’s Markdown cells step by step. In the rest of this article, I will be giving supporting explanations and information in addition to the instructions of the Markdown cells. Feel free to skip reading through it if you prefer to go through the notebook yourself.

2.1 Dataset Setup Notebook Overview

In the 01_dataset_setup.ipynb notebook, I will first show how to download the Stanford Dogs dataset to your AML Compute Instance. We will use this dataset to train a multiclass dog classification model with PyTorch (the actual model training will be covered in the next part of the blog article series). We will have an initial look at the data and upload it to Azure storage for easy data access during model training. More specifically, we will upload the data to the default Azure Blob Storage that is provisioned together with the AML workspace.

Note: While we will use the default Azure Blob Storage of the AML workspace here, you could create a new AML datastore for your existing data lake storage account in an enterprise setting.

To conclude with this part of the blog article series, we will create an AML dataset referencing the location of the Azure Blob Storage where we have uploaded the data. As mentioned above, this will allow us to version control and easily track and reuse the data during multiple iterations of model training.

2.2 Download Data to the AML File Share

The Stanford Dogs dataset, which we will use for model training, consists of 20,580 images of 120 different dog breeds/classes. We will first make this data available for local access from our AML Compute Instance.
 
Each AML workspace comes not only with a default Azure Blob Storage but also with a default Azure File Share. This File Share is mounted to all AML Compute Instances that exist in the workspace, which allows for easy sharing of data and improved collaboration among Data Scientists. We will therefore download the data to this AML File Share instead of to the local disk of the AML Compute Instance.
 
Note: In an enterprise setting, be careful when storing sensitive information on the AML File Share as all Data Scientists that have access to the AML workspace will have access to it.
 
I have written two helper functions called download_stanford_dogs_archives() and extract_stanford_dogs_archives() that will download the archive files from the Stanford Dogs website and extract them into three folders called test, train and val. These folders reflect the data split for testing, training and validation respectively. If you are interested in the implementation details of these functions, have a look at the data_utils.py file in the src/utils directory.
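The actual implementation lives in data_utils.py; purely for illustration, a simplified version of the download-and-extract idea could look like the sketch below (the archive URL and target paths are assumptions, not necessarily what the template uses):

```python
# Simplified sketch of downloading and extracting a Stanford Dogs archive.
# The URL and paths are illustrative; see src/utils/data_utils.py for the real helpers.
import os
import tarfile
import urllib.request

IMAGES_URL = "http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar"  # assumed archive URL

def download_archive(url: str, target_dir: str) -> str:
    """Download a tar archive into target_dir and return its local path."""
    os.makedirs(target_dir, exist_ok=True)
    local_path = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path

def extract_archive(archive_path: str, extract_dir: str) -> None:
    """Extract a tar archive into extract_dir."""
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=extract_dir)

archive_path = download_archive(IMAGES_URL, target_dir="data")
extract_archive(archive_path, extract_dir="data")
```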
 
Note: Downloading and extracting the data can take quite some time as it is a large dataset.
 

Once the functions are called from the notebook, you can find the data in the data directory of your cloned template repository:

[Image: template_repo_folder_structure]

You might wonder why I was talking about downloading the data to the AML File Share but now it is located in the data directory of your template repository. The reason is that when we cloned the template repository earlier, we did not really clone it to the AML Compute Instance’s disk but to the mounted AML File Share. The AML File Share is mounted to the AML Compute Instance under the “code” directory. This directory opens automatically when you open Jupyter Lab or a Terminal:

[Image: compute_instance_terminal_path]

As a result, you can also find the data and the cloned template repository by navigating to the automatically provisioned Azure storage account in the Azure Portal and opening the File Share that starts with “code” using the Storage Explorer:

[Image: aml_file_share]

2.3 Upload Data to the AML Blob Store

Next, we will upload the data to the AML Blob Storage, from where we can easily access it during model training on an AML Compute Cluster. We can do this using the upload method of the datastore class of the AML Python SDK, as shown in the notebook. After the cell is executed, we can check the blob container with "azureml-blobstore" in its name to see the result. This is the default Azure Blob Storage that is attached to the AML workspace:

[Image: aml_datastore]

[Image: aml_datastores_default_blob_store]

Note: For you, the random number suffix of the storage account should match across these screenshots. In my case, I provisioned a new AML workspace in the meantime, which is why the suffixes in my screenshots do not match.
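For reference, the upload step in the notebook essentially boils down to a call like the following sketch; the local source directory and target path are assumptions for illustration:

```python
# Minimal sketch: uploading the local data folder to the default blob datastore.
# src_dir and target_path are illustrative, not necessarily the notebook's exact values.
from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()  # the "workspaceblobstore"

datastore.upload(
    src_dir="./data",             # local data folder on the AML File Share
    target_path="stanford_dogs",  # folder inside the azureml-blobstore container
    overwrite=False,
    show_progress=True,
)
```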

The AML workspace uses the storage account key for authentication to the workspaceblobstore. You can see this by clicking on the workspaceblobstore in the AML Datastores menu and then checking the "Authentication type" at the bottom of the page. This means that every Data Scientist who has access to the AML workspace will have full access to this Azure Blob Storage. It is therefore not recommended to use the storage account key for authentication when you connect your enterprise data lake to the AML workspace.

Instead, you have two better options. You can leverage a Service Principal (SP) for authentication when creating a new AML datastore. This allows you to create access-control lists (ACLs) to control the access of this SP. For more information on ACLs, read this documentation. While access can be controlled in a fine-grained way with this method, all Data Scientists will still have the same permissions (the permissions of the SP) when accessing data through the AML workspace. This means that the permissions of their own Azure Active Directory accounts are not honored by the AML datastore. This is why identity-based access has recently been introduced and is currently in experimental preview. Identity-based access essentially means credential passthrough: each Data Scientist's own permissions are used when accessing data through the AML workspace instead of a central SP granting the same permissions to all Data Scientists in the workspace.
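For illustration, connecting an ADLS Gen2 data lake with a Service Principal could look roughly like the sketch below; all names, IDs and secrets are placeholders:

```python
# Hypothetical sketch: registering an ADLS Gen2 datastore with Service Principal credentials.
from azureml.core import Workspace, Datastore

ws = Workspace.from_config()

adls_datastore = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="enterprise_data_lake",
    filesystem="curated",                       # ADLS Gen2 filesystem (container)
    account_name="mydatalakeaccount",
    tenant_id="<tenant-id>",
    client_id="<service-principal-app-id>",
    client_secret="<service-principal-secret>",
)
```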

2.4 Explore the Data

One important analysis that we will do before model training is to load the raw images and calculate their mean and standard deviation. We will later use the calculated values to normalize the images during model training. In order to calculate the mean and standard deviation of the images, I have created two utility functions: load_unnormalized_train_data() and get_mean_std(). The first function returns PyTorch dataloaders without normalization and the second function takes as input unnormalized PyTorch dataloaders to load all data and return the channel-wise mean and standard deviation of the image pixels. For implementation details, check the data_utils.py file in the src/utils directory.
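To make this concrete, a generic channel-wise mean/std computation over unnormalized dataloaders could look like the sketch below; this is not the template's exact implementation (that lives in src/utils/data_utils.py):

```python
# Generic sketch: compute per-channel mean and std over all images in a dataloader.
# Assumes image tensors of shape (batch, 3, H, W) with values scaled to [0, 1].
import torch

def get_mean_std(dataloader):
    channel_sum = torch.zeros(3)
    channel_sq_sum = torch.zeros(3)
    num_pixels = 0
    for images, _ in dataloader:
        num_pixels += images.size(0) * images.size(2) * images.size(3)
        channel_sum += images.sum(dim=[0, 2, 3])
        channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])
    mean = channel_sum / num_pixels
    std = (channel_sq_sum / num_pixels - mean ** 2).sqrt()  # Var = E[X^2] - E[X]^2
    return mean, std
```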

After having calculated the channel-wise mean and standard deviation of the images, I have inserted the resulting values into another utility function that will return the normalized PyTorch dataloaders and that we will leverage for model training: load_data().
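As a hedged sketch of how the computed statistics could feed into normalized dataloaders, something along these lines would work; the folder layout, image size and batch size are assumptions, not the template's exact code:

```python
# Illustrative sketch: build normalized PyTorch dataloaders from the extracted folders.
import torch
from torchvision import datasets, transforms

def load_data(data_dir, mean, std, batch_size=32):
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std),
    ])
    dataloaders = {}
    for split in ["train", "val", "test"]:
        dataset = datasets.ImageFolder(f"{data_dir}/{split}", transform=transform)
        dataloaders[split] = torch.utils.data.DataLoader(
            dataset, batch_size=batch_size, shuffle=(split == "train")
        )
    return dataloaders
```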

In addition, I have created two utility functions to display a single image and a batch of images respectively: show_image() and show_batch_of_images(). While the first function takes as input the path to an example image, the latter one takes as input a torch tensor containing a grid of images. Implementation details can again be found in the data_utils.py file in the src/utils directory.
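For illustration, such display helpers could be sketched as follows; the template's own versions in data_utils.py may differ in their details:

```python
# Illustrative sketch of the two display helpers.
import matplotlib.pyplot as plt
from PIL import Image

def show_image(image_path):
    """Display a single image loaded from disk."""
    plt.imshow(Image.open(image_path))
    plt.axis("off")
    plt.show()

def show_batch_of_images(image_grid):
    """Display a grid of images given as a (3, H, W) torch tensor,
    e.g. the output of torchvision.utils.make_grid."""
    plt.imshow(image_grid.permute(1, 2, 0).numpy())  # channels-last for matplotlib
    plt.axis("off")
    plt.show()
```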

2.5 Register an AML Dataset

The last step of the dataset setup is to register our data as an AML dataset. For this, we first need to create a dataset object, which we can do by passing a datastore and a source path. There are two different types of AML datasets: File and Tabular. In our case, since we are using unstructured data (.png files), we need to leverage the File dataset. More information about how to create an AML dataset and the different types of AML datasets can be found here.
 
After a dataset object has been created, we can register it in the AML workspace using the register method of the Dataset class. The registered dataset is then visible under the Datasets assets in the AML workspace:
 
[Image: aml_dataset]
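For reference, the creation and registration steps in the notebook boil down to something like the sketch below; the datastore path and dataset name are assumptions for illustration:

```python
# Minimal sketch: create a File dataset from the blob datastore and register it.
# The path and dataset name are illustrative placeholders.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Reference the uploaded images in the default blob datastore
dog_dataset = Dataset.File.from_files(path=(datastore, "stanford_dogs/**"))

# Register the dataset in the workspace so it is versioned and reusable
dog_dataset = dog_dataset.register(
    workspace=ws,
    name="stanford_dogs",
    description="Stanford Dogs images for multiclass dog breed classification",
    create_new_version=True,
)
```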
 

Outlook

In the next parts of my Azure MLOps Template blog article series, we will have a look at how to use our registered AML environments and AML dataset to develop and train a model. In part 4 specifically, we will cover how to use PyTorch and transfer learning to train our multiclass dog classification model. 

 

Sebastian Birk

I’m a Data Scientist and Azure Data & AI Consultant. My passion is to solve the world’s toughest problems with data & AI and I love to build intelligent solutions in the cloud.
