
MLOps Components & Platform

This section aims to deliver the following:

  • An overview of the various MLOps components that we will be interacting with for the rest of the guide.
  • A summary of the service(s) and tool(s) of choice for some components, and an access quickstart for each of them.

{% if cookiecutter.platform == 'gcp' -%}

Google Cloud Platform (GCP) Projects

A GCP project is required to access the GCP resources used in this guide. Such projects are accessible through the GCP console.

Authorisation

You can use GCP's Cloud SDK to interact with the various GCP services. When using the SDK for the first time, you need to authorise it with a user or service account.
See here for more information on authorising your SDK.

A simple command to authorise access:

# For authorisation with user account
gcloud auth login
# For authorisation with service account
gcloud auth login --cred-file=/path/to/service-account-key.json
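
A couple of follow-up commands are commonly run after authorisation; the project ID below is a placeholder to be substituted with your own:

# Verify which account is currently active
gcloud auth list
# Set the default project for subsequent gcloud commands
gcloud config set project <GCP_PROJECT_ID>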

With your user account, you should have access to the following GCP products/services:

{%- set kubeplat = 'GKE' %}

{% endif -%}

Kubernetes

Before we dive into the different MLOps components that you will be interacting with in the context of this guide, we have to first introduce Kubernetes as the underlying orchestration tool to execute pipelines and manage containerised applications and environments.

From the Kubernetes site:

Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery.

A number of services and applications that you will be interacting with (or deploying yourself) are, or will be, deployed within a Kubernetes cluster. Some of the MLOps components for which the Kubernetes cluster(s) will be relevant are:

  • Developer Workspace
  • Data Experimentation
  • Model Training & Evaluation
  • Experiment & Pipeline Tracking
  • Model Serving

These components will be elaborated on further in the upcoming sections.

Reference Link(s)

{% if cookiecutter.platform == 'onprem' -%}

Rancher

Rancher is a Kubernetes management platform that enables cluster administrators and users to manage Kubernetes clusters and facilitate Kubernetes workflows.

{%- set kubeplat = 'Rancher' %}

{% endif -%}

{% if cookiecutter.orchestrator == 'runai' -%} {%- set orch = 'Run:AI' -%} {%- set vs_orch = " VS " + orch -%} {%- set and_orch = " and " + orch -%} {% elif cookiecutter.orchestrator == 'polyaxon' -%} {%- set orch = 'Polyaxon' -%} {%- set vs_orch = " VS " + orch -%} {%- set and_orch = " and " + orch -%} {% elif cookiecutter.orchestrator == "noorch" -%} {%- set vs_orch = " " -%} {%- set and_orch = " " -%} {% endif -%}

Kubernetes VS {{kubeplat}}{{vs_orch}}

One might be confused as to how each of the aforementioned tools and platforms differs from the others. To put it simply, Kubernetes lies underneath the {{kubeplat}}{{and_orch}} platform/interface. {{kubeplat}}{{and_orch}} are abstraction layers on top of Kubernetes; they both essentially communicate with the Kubernetes API server to carry out actions or orchestrate workloads through their own interfaces.

{% if cookiecutter.orchestrator == "runai" -%} Developers can use {{kubeplat}}'s interface or Run:AI's interface/CLI to spin up workspaces, submit jobs, or deploy applications. However, the latter can better serve machine learning engineers in carrying out their machine learning workflows, as that is the intended usage of the platform. Moreover, Run:AI's unique selling point is its better utilisation of GPU resources (through fractionalisation and other features), so for workloads that require GPUs, such as model training and evaluation, the usage of Run:AI is recommended. Also, on the surface, it is easier to spin up developer workspaces on Run:AI. {%- endif %}

Reference Link(s)

{% if cookiecutter.platform == "onprem" %} - Rancher Docs - Rancher Server and Components {% elif cookiecutter.platform == "gcp" %} - GKE Overview {%- endif -%} {%- if cookiecutter.orchestrator == "runai" %} - Run:ai Docs - System Components {%- endif %}

MLOps Components

The diagram below showcases some of the components that this guide will cover, as well as how they relate to each other.

{% if cookiecutter.platform == 'gcp' -%} AISG's End-to-end MLOps Workflow & Components Diagram for GCP Run:ai

{% elif cookiecutter.platform == 'onprem' -%} AISG's End-to-end MLOps Workflow & Components Diagram for Onprem Run:ai

{% endif -%}

Note

Click on the image above for an interactive view of the diagram. You may interact with the layers to view the components in a sequential manner.

Developer Workspace

Developers begin by having their client (laptop/VM) authenticated with whichever platform they have been provided access to.

Following authentication, developers can make use of templates provided by the MLOps team to spin up developer workspaces (VSCode server, JupyterLab, etc.) on the respective platforms. Within these developer workspaces, developers can work on their codebase, execute light workloads, and carry out other steps of the end-to-end machine learning workflow.

A typical machine learning or AI project would require the team to carry out exploratory data analysis (EDA) on whatever domain-specific data is in question. This work is expected to be carried out within the development workspace with the assistance of virtual environment managers.

Reference Link(s)

Version Control

Within a developer workspace and environment, developers can interact (pull, push, etc.) with a Git registry, be it GitHub, GitLab, or another Git server.
This guide will reference GitLab as the preferred Git registry.
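
As a rough sketch of the day-to-day interactions (the repository URL, group/project path, and branch name below are placeholders, not actual project values):

# Clone the project repository from the Git registry
git clone https://gitlab.yourcompany.tld/<GROUP>/<PROJECT>.git
cd <PROJECT>
# Work on a feature branch and push it back to the registry
git checkout -b feat/data-prep-pipeline
git add .
git commit -m "Add data preparation pipeline"
git push -u origin feat/data-prep-pipeline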

Reference Link(s)

Continuous X

GitLab also serves as a DevOps platform where the Continuous X of things (Continuous Integration, Continuous Delivery, etc.) can be implemented and automated. This is done through GitLab CI/CD. Interactions made with repositories on GitLab can be made to trigger CI/CD workflows. The purpose of such workflows is to facilitate the development lifecycle and streamline the process of delivering a quality codebase.

  • At the very least, the workflows should include unit and integration testing, where the codebase is subjected to tests and linting tools to ensure that best practices and conventions are adhered to by contributors from the project team. This is known as Continuous Integration (CI).
  • Another important aspect is Static Application Security Testing (SAST) where application security tools are utilised to identify any vulnerabilities that exist within the codebase.
  • GitLab CI/CD can also invoke interactions with other MLOps components, such as submitting jobs (model training, data processing, etc.) to the aforementioned orchestration platforms or even deploying applications. This fulfils the Continuous Delivery (CD) and Continuous Training (CT) portion; a minimal sketch of such a pipeline configuration is shown below.
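
A minimal sketch of a .gitlab-ci.yml with a lint/test stage and an image build stage; the job names, base images, and script commands here are illustrative assumptions, not the pipeline this guide prescribes:

stages:
  - test
  - build

lint-and-test:
  stage: test
  image: python:3.11-slim
  script:
    # Install dependencies, then run linting and unit tests
    - pip install -r requirements.txt
    - pylint src
    - pytest tests

build-image:
  stage: build
  image: docker:24.0
  services:
    - docker:24.0-dind
  before_script:
    # Authenticate with the GitLab container registry using predefined CI variables
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
  script:
    # Build and push the container image, tagged with the commit SHA
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
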
Reference Link(s)

Container Image Registry

Images built through CI/CD workflows or manual builds can be pushed to container image registries.

{% if cookiecutter.platform == 'onprem' -%} Harbor Registry - Sample Screenshot

Harbor Registry

{% elif cookiecutter.platform == 'gcp' -%} GCP Artifact Registry - Sample Screenshot

GCP Artifact Registry

{% endif %}

Reference Link(s)

Data Preparation

Following the EDA phase, the project team would map out and work on data processing and preparation pipelines. These pipelines would first be developed with manual invocation in mind, but a team can strive towards automating the process so that the pipelines can be triggered by the CI/CD workflows that they have defined.

As the quality of data to be used for training the models is important, components like data preparation can be prefaced with data validation, where checks are done to examine the data’s adherence to conventions and standards set by the stakeholders of the project.

Model Training & Evaluation

Once the project team is more familiar with the domain-specific data and data preparation pipelines have been laid, they can look into model training and evaluation.

When working towards a base model or a model that can be settled on as the Minimum Viable Model (MVM), a lot of experimentation would have to be done as part of the model training process. Part of such experimentation includes hyperparameter tuning, where a search space is iterated through to find the set of configurations that best optimises the model's performance or objectives. Tools like Optuna can greatly assist in facilitating such workflows.

Experiment & Pipeline Tracking

As there would be a myriad of experiments to be carried out, there is a need for the configurations, results, artefacts, and any other relevant metadata of every experiment to be logged and persisted. Tracking such information allows for easy comparison of models' performances, and if there is a need to reproduce experiments, the relevant information can be referred back to. With the right information, metadata, and utilisation of containers for reproducible workflows, pipelines can be tracked as well. Carrying these out would provide a team with a model registry of sorts, where experiments with tagged models can be referred to when they are to be deployed and served.

A tool with relevant features would be MLflow.

Reference Link(s)

Model Serving

With the models that have been trained, applications that allow end-users to interact with the model can be deployed on test environments. Deployment of models is conventionally done using API frameworks. However, not all problem statements require such frameworks; scripts for executing batch inference might suffice in some cases.

One of the popular Python frameworks for building APIs is FastAPI. It is easy to pick up and has many useful out-of-the-box features.
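
As a rough sketch of how a FastAPI-based model service might be run and queried, the module path app.main:app, the port, and the /predict endpoint below are hypothetical placeholders rather than names defined by this guide:

# Serve the (hypothetical) FastAPI application with Uvicorn
uvicorn app.main:app --host 0.0.0.0 --port 8000
# Query the (hypothetical) prediction endpoint from another terminal
curl -X POST "http://localhost:8000/predict" \
    -H "Content-Type: application/json" \
    -d '{"text": "some sample input"}'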

Reference Link(s)

Push & Pull with HTTPS VS SSH

The usage of either the HTTPS or SSH protocol for communicating with a Git-based server depends on the environment in question. If an environment is made accessible to multiple developers, then HTTPS-based access, where passwords are prompted for, would be a better fit. SSH-based access would be more fitting for more isolated clients, such as a single Linux user account or a local machine accessible only by a single owner.
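
For illustration, the same repository would be addressed differently under each protocol; the host and group/project paths below are placeholders:

# HTTPS-based access; credentials are prompted for
git clone https://gitlab.yourcompany.tld/<GROUP>/<PROJECT>.git
# SSH-based access; relies on an SSH key pair registered with the server
git clone git@gitlab.yourcompany.tld:<GROUP>/<PROJECT>.git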

Reference Link(s)

{% if cookiecutter.orchestrator == 'runai' -%}

Run:AI

Run:AI is an enterprise orchestration and cluster management platform that works as an abstraction layer on top of the infrastructure to maximise the usage of its resources. The platform utilises Kubernetes in the backend. Orchestration platforms such as Run:AI allow end-users to easily spin up workloads, execute jobs, set up services, or carry out any interaction with the relevant resources.

The video below provides a quick and high-level overview of the platform's unique selling point.

The entry point for accessing the platform's front-end UI is through the login page given to you by your organisation.

Authentication

While one can make use of the platform's front-end UI to interact with the Kubernetes cluster in the backend, one might prefer a programmatic approach that relies on a CLI. Run:AI provides a CLI that can be used to interact with the platform's API.

To use the CLI, you need to be authenticated. For that, you need the following:

  • A Kubernetes configuration file, a.k.a. kubeconfig. This is provided by the MLOps team.
  • Run:AI CLI to be installed on your local machine (or any client).

kubeconfig

A client that intends to communicate with a Kubernetes cluster would have to rely on a configuration file called kubeconfig. The YAML-formatted kubeconfig would contain information such as cluster endpoints, authentication details, as well as any other access parameters. kubeconfig files are relied on by the kubectl CLI tool for information and credentials to access Kubernetes clusters.

In the context of being authenticated with the Run:ai cluster, end-users would be provided with a kubeconfig containing a default set of configurations. While you may place this kubeconfig in any (safe) location on your local machine, a reasonable location would be the $HOME/.kube directory.

Here is a sample of what the default kubeconfig can look like:

apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: https://runai-cluster.yourcompany.tld:6443
  name: runai-cluster-fqdn
contexts:
- context:
    cluster: runai-cluster-fqdn
    user: runai-authenticated-user
  name: runai-cluster-fqdn
current-context: runai-cluster-fqdn
kind: Config
preferences: {}
users:
- name: runai-authenticated-user
  user:
    auth-provider:
      config:
        airgapped: "true"
        auth-flow: remote-browser
        realm: yourcompany
        client-id: runai-cli
        idp-issuer-url: https://app.run.ai/auth/realms/yourcompany
        redirect-uri: https://yourcompany.run.ai/oauth-code
      name: oidc
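
Assuming kubectl is installed, you can sanity-check which cluster and context the CLI is pointing at; these are standard kubectl commands, shown here as a sketch:

# Point kubectl (and the Run:AI CLI) at the provided kubeconfig
export KUBECONFIG=$HOME/.kube/config
# List the contexts defined within the kubeconfig
kubectl config get-contexts
# Show the context currently in use
kubectl config current-context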

To understand more about managing configuration for Kubernetes, do refer to the reference document below.

Reference Link(s)

Run:AI CLI

With the aforementioned kubeconfig file, we can now use the Run:AI CLI for authentication. We first have to download the CLI.

Windows

  1. Head over to the Run:ai dashboard.
  2. On the top right-hand corner, click on the ❔ Help icon.
  3. Click on Researcher Command Line Interface.
  4. Select Windows.
  5. Click on ⬇ DOWNLOAD, rename the file as runai.exe and save the file to a location that is included in your PATH system variable.

macOS

  1. Head over to the Run:ai dashboard.
  2. On the top right-hand corner, click on the ❔ Help icon.
  3. Click on Researcher Command Line Interface.
  4. Select Mac.
  5. Click on ⬇ DOWNLOAD and save the file.
  6. Run the following commands:
    $ chmod +x runai
    $ sudo mv runai /usr/local/bin/runai

Linux

  1. Head over to the Run:ai dashboard.
  2. On the top right-hand corner, click on the ❔ Help icon.
  3. Click on Researcher Command Line Interface.
  4. Select Linux.
  5. Click on ⬇ DOWNLOAD and save the file.
  6. Run the following commands:
    $ chmod +x runai
    $ sudo mv runai /usr/local/bin/runai

To verify your installation, you may run the following command:

runai version

You should see an output similar to this:

Version: 2.XX.XX
BuildDate: YYYY-MM-DDThh:mm:ssZ
GitCommit: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
GoVersion: goX.XX.X
Compiler: gc

Now that the CLI has been successfully installed, you can use it to authenticate with the Run:ai cluster.

Linux/macOS

export KUBECONFIG=/path/to/provided/kubeconfig
runai login

You should see an interactive prompt similar to this:

Go to the following link in your browser:
        https://app.run.ai/auth/realms/yourcompany/protocol/openid-connect/auth?access_type=offline&client_id=runai-cli&redirect_uri=https%3A%2F%2Fyourcompany.run.ai%2Foauth-code&response_type=code&scope=email+openid+offline_access&state=xxxxxxx
Enter verification code:
INFO[0068] Logged in successfully

Windows (PowerShell)

$Env:KUBECONFIG='/path/to/provided/kubeconfig'
runai login

You should see an interactive prompt similar to this:

Go to the following link in your browser:
        https://app.run.ai/auth/realms/yourcompany/protocol/openid-connect/auth?access_type=offline&client_id=runai-cli&redirect_uri=https%3A%2F%2Fyourcompany.run.ai%2Foauth-code&response_type=code&scope=email+openid+offline_access&state=xxxxxxx
Enter verification code:
INFO[0068] Logged in successfully

As you can see from above, you would be required to use a browser to access the link provided by the CLI. Upon accessing the link, you would be prompted to log in with your Azure account. Once you have successfully logged in, you would be provided with a verification code. Copy the verification code and paste it into the terminal.

Info

What happens in the background when the runai login command is successfully executed is that the kubeconfig file is updated with the necessary authentication details, specifically the id-token and refresh-token fields, which are then used by the kubectl CLI tool to communicate with the Run:ai cluster.
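
Once logged in, you can do a quick sanity check against the cluster. The exact subcommands may differ between Run:AI CLI versions, so treat the following as a sketch:

# List the Run:AI projects that you have access to
runai list projects
# Set a default project for subsequent runai commands
runai config project <PROJECT_NAME>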

{%- elif cookiecutter.orchestrator == 'polyaxon' -%} {%- elif cookiecutter.orchestrator == 'none' -%} {% endif %}

{% if cookiecutter.platform == 'onprem' -%}

Docker CLI Authentication

While Harbor has its own front-end interface, one may use the Docker CLI to interact with the registry.

docker login registry.yourgitregistry.tld

You should have an interactive prompt similar to this:

Username: <YOUR_USERNAME_HERE>
Password:
Login Succeeded!

Upon a successful login through the Docker CLI, you can push or pull images to/from the Docker registry you've logged into.
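
For instance, a locally built image can be tagged for the registry and pushed like so; the project path, image name, and tag below are placeholders:

# Tag a locally built image with the registry's host and project path
docker tag my-image:0.1.0 registry.yourgitregistry.tld/<PROJECT>/my-image:0.1.0
# Push the tagged image to the registry
docker push registry.yourgitregistry.tld/<PROJECT>/my-image:0.1.0
# Pull the image on another machine that has been logged in
docker pull registry.yourgitregistry.tld/<PROJECT>/my-image:0.1.0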

AWS CLI for S3 Protocol

The S3 protocol may be used in the project to access S3-compatible buckets such as MinIO. We can make use of the AWS CLI's S3 commands to interact with the storage system. Instructions for installing the AWS CLI (v2) can be found here.

Following installation of the CLI, you would need to configure the settings to be used. The settings can be populated within separate files: config and credentials, usually located under $HOME/.aws. However, we can make do with just populating the credentials file. An example of a credentials file containing credentials for multiple profiles would look like the following:

[profile-1]
aws_access_key_id = project-1-user
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

[profile-2]
aws_access_key_id = project-2-user
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

The profile-1 and profile-2 are just arbitrary profile names that you can set for your own reference.

To list the buckets that a profile has access to, you may run a command similar to the following:

aws --profile profile-1 --endpoint-url="https://minio.yourcompany.tld" s3 ls

With a similar output to this:

YYYY-MM-DD hh:mm:ss bucket-1
YYYY-MM-DD hh:mm:ss bucket-2

The --endpoint-url flag is required for the AWS CLI to know where to send the requests to. In this case, we are sending requests to the MinIO endpoint specified above; the same applies to other S3-compatible endpoints, such as AI Singapore's ECS server.
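
Beyond listing buckets, the same profile and endpoint can be used to copy objects to and from a bucket; the bucket and object paths below are placeholders:

# Download an object from a bucket
aws --profile profile-1 --endpoint-url="https://minio.yourcompany.tld" \
    s3 cp s3://bucket-1/path/to/object.csv ./object.csv
# Upload a local file to a bucket
aws --profile profile-1 --endpoint-url="https://minio.yourcompany.tld" \
    s3 cp ./object.csv s3://bucket-1/path/to/object.csv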

Note

Some buckets may be hidden when listing buckets. This is due to various access permissions that might have been set by administrators. For some buckets, while you may not be able to list them, you may still be able to view the objects contained within them.

Reference Link(s)

{%- elif cookiecutter.platform == 'gcp' %}

Google Artifact Registry

AI Singapore's emphasis on reproducibility and portability of workflows and accompanying environments translates to heavy usage of containerisation. Throughout this guide, we will be building Docker images necessary for setting up development environments, jobs for the various pipelines and deployment of the predictive model.

Within the context of GCP, the GCP Artifact Registry will be used to store and version our Docker images. Following authorisation with gcloud, you can list the images in your project's registry like so:

gcloud container images list --repository={{cookiecutter.registry_project_path}}

To push or pull images to/from Artifact Registry, you would need to authenticate with the Google Cloud project that the registry is associated with. You can do so by running the following command:

gcloud auth configure-docker asia-southeast1-docker.pkg.dev

The command above will populate your Docker configuration file with the intended Artifact Registry Docker host. Host names for Google Artifact Registry end with -docker.pkg.dev.
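
Once the Docker credential helper has been configured, images can be tagged with the registry path and pushed; a sketch, with the image name and tag as placeholders:

# Tag a locally built image with the Artifact Registry repository path
docker tag my-image:0.1.0 {{cookiecutter.registry_project_path}}/my-image:0.1.0
# Push the tagged image to Artifact Registry
docker push {{cookiecutter.registry_project_path}}/my-image:0.1.0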

Reference Link(s)

Google Cloud Storage (GCS)

In the context of a Google Cloud infrastructure environment, there are two main storage mediums:

  1. Google Cloud Filestore for managed network file storage (NFS)
  2. Google Cloud Storage (GCS) for object storage

The usage of NFS storage is mainly observable through Persistent Volumes (PVs) or virtual machine disks.
As for GCS, one would be provided with access to one or more GCS buckets through the provided user or service account. Upon authorisation, one may list the contents of a bucket like so:

gsutil ls -p <GCP_PROJECT_ID> gs://<GCS_BUCKET_NAME>
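
Objects can likewise be copied to and from a bucket with gsutil; the bucket and object paths below are placeholders:

# Download an object from the bucket
gsutil cp gs://<GCS_BUCKET_NAME>/path/to/object.csv .
# Upload a local file to the bucket
gsutil cp ./object.csv gs://<GCS_BUCKET_NAME>/path/to/object.csv
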
Reference Link(s)

{%- endif %}

MLflow

For model experimentation and tracking needs, AI Singapore mainly relies on MLflow. MLflow is an open-source platform for the machine learning lifecycle. It has several components but we will mainly be using the Tracking server component.

Accessing Tracking Server Dashboard

Every project has a dedicated MLflow Tracking server, deployed in each project's Kubernetes namespace (or Run:ai project). To access these servers, end-users need their own credentials, which are provided by the MLOps team. In essence, you would need the following to make use of the MLflow Tracking server:

  • MLflow Tracking server URL(s)
  • Your own username and password for the same server(s)
  • (Optional) ECS credentials for artifact storage
  • (Optional) GCS credentials for artifact storage

One would be prompted for a username and password when accessing an MLflow Tracking server for the first time:

MLflow Tracking Server - Login Page

Following a successful login, most end-users would be brought to the Experiments page. Depending on whether one is an admin or a common user, the page would look different. Admin users would be able to view all experiments while common users would only be able to view experiments that they have been provided access to.

MLflow Tracking Server - First Login View

Reference Link(s)

Logging to Tracking Server

Now, to test your environment's ability to log to an MLflow Tracking server, you can run the sample script that has been provided in this repository. The script can be found at src/mlflow_test.py. It simply logs a few dummy metrics, parameters, and an artifact to an MLflow Tracking server.

Linux/macOS

conda create -n mlflow-test python=3.11.7
conda activate mlflow-test
pip install mlflow==2.9.2
# Install boto3 or google-cloud-storage packages if 
# custom object storage is used
export MLFLOW_TRACKING_USERNAME=<MLFLOW_TRACKING_USERNAME>
export MLFLOW_TRACKING_PASSWORD=<MLFLOW_TRACKING_PASSWORD>
python src/mlflow_test.py <MLFLOW_TRACKING_URI> <NAME_OF_DEFAULT_MLFLOW_EXPERIMENT>

Windows (PowerShell)

conda create -n mlflow-test python=3.11.7
conda activate mlflow-test
pip install mlflow==2.9.2
# Install boto3 or google-cloud-storage packages if
# custom object storage is used
$Env:MLFLOW_TRACKING_USERNAME=<MLFLOW_TRACKING_USERNAME>
$Env:MLFLOW_TRACKING_PASSWORD=<MLFLOW_TRACKING_PASSWORD>
python src/mlflow_test.py <MLFLOW_TRACKING_URI> <NAME_OF_DEFAULT_MLFLOW_EXPERIMENT>

A successful run of the script would present you with an experiment run that looks similar to the following:

MLflow Tracking Server - Post Test Script

Reference Link(s)