GitHub Actions For Machine Learning Beginners

GitHub Actions For Machine Learning Beginners
Image by Author

GitHub Actions is a powerful feature of the GitHub platform that allows for automating software development workflows, such as testing, building, and deploying code. This not only streamlines the development process but also makes it more reliable and efficient.

In this tutorial, we will explore how to use GitHub Actions for a beginner Machine Learning (ML) project. From setting up our ML project on GitHub to creating a GitHub Actions workflow that automates your ML tasks, we will cover everything you need to know.

What is GitHub Actions?

GitHub Actions is a powerful tool that provides a continuous integration and continuous delivery (CI/CD) pipeline for all GitHub repositories for free. It automates the entire software development workflow, from creating and testing to deploying code, all within the GitHub platform. You can use it to improve your development and deployment efficiency.

GitHub Actions key features

We will now learn about key components of workflow.

Workflows

Workflows are automated processes that you define in your GitHub repository. They are composed of one or more jobs and can be triggered by GitHub events such as a push, pull request, issue creation, or by workflows. Workflows are defined in a YML file within the .github/workflows directory of your repository. You can edit it and rerun the workflow right from the GitHub repository.

Jobs and Steps

Within a workflow, jobs define a set of steps that execute on the same runner. Each step in a job can run commands or actions, which are reusable pieces of code that can perform a specific task, such as formatting the code or training the model.

Events

Workflows can be triggered by various GitHub events, such as push, pull requests, forks, stars, releases, and more. You can also schedule workflows to run at specific times using cron syntax.

Runners

Runners are the virtual environments/machines where workflows are executed. GitHub provides hosted runners with Linux, Windows, and macOS environments, or you can host your own runner for more control over the environment.

Actions

Actions are reusable units of code that you can use as steps within your jobs. You can create your own actions or use actions shared by the GitHub community in the GitHub Marketplace.

GitHub Actions makes it straightforward for developers to automate their build, test, and deployment workflows directly within GitHub, helping to improve productivity and streamline the development process.

In this project, we will use two Actions:

actions/checkout@v3: for checking out your repository so that workflow can access the file and data.
iterative/setup-cml@v2: for displaying the model metrics and confusion matrix under the commit as a message.

Automating Machine Learning Workflow with GitHub Actions

We will work on a simple machine learning project using the Bank Churn dataset from Kaggle to train and evaluate a Random Forest Classifier.

Setting Up

We will create the GitHub repository by providing the name, and description, checking the readme file, and license.
Go to the project director and clone the repository.
Change the directory to the repository folder.
Launch the code editor. In our case, it is VSCode.

$ git clone https://github.com/kingabzpro/GitHub-Actions-For-Machine-Learning-Beginners.git    $ cd .GitHub-Actions-For-Machine-Learning-Beginners    $ code .

Please create a `requirements.txt` file and add all the necessary packages that are required to run the workflow successfully.

pandas  scikit-learn  numpy  matplotlib  skops  black

Download the data from Kaggle using the link and extract it in the main folder.
The dataset is big, so we have to install GitLFS into our repository and track the train CSV file.

$ git lfs install  $ git lfs track train.csv

Training and Evaluating Code

In this section, we will write the code that will train, evaluate, and save the model pipelines. The code is from my previous tutorial, Streamline Your Machine Learning Workflow with Scikit-learn Pipelines. If you want to know how the scikit-learn pipeline works, then you should read it.

Create a `train.py` file and copy and paste the following code.
The code uses ColumnTransformer and Pipeline for preprocessing the data and the Pipeline for feature selection and model training.
After evaluating the model performance, both metrics and the confusion matrix are saved in the main folder. These metrics will be used later by the CML action.
In the end, the scikit-learn final pipeline is saved for model inference.

import pandas as pd    from sklearn.compose import ColumnTransformer  from sklearn.ensemble import RandomForestClassifier  from sklearn.model_selection import train_test_split    from sklearn.feature_selection import SelectKBest, chi2  from sklearn.impute import SimpleImputer  from sklearn.pipeline import Pipeline  from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder    from sklearn.metrics import accuracy_score, f1_score  import matplotlib.pyplot as plt  from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix    import skops.io as sio    # loading the data  bank_df = pd.read_csv("train.csv", index_col="id", nrows=1000)  bank_df = bank_df.drop(["CustomerId", "Surname"], axis=1)  bank_df = bank_df.sample(frac=1)      # Splitting data into training and testing sets  X = bank_df.drop(["Exited"], axis=1)  y = bank_df.Exited    X_train, X_test, y_train, y_test = train_test_split(      X, y, test_size=0.3, random_state=125  )    # Identify numerical and categorical columns  cat_col = [1, 2]  num_col = [0, 3, 4, 5, 6, 7, 8, 9]    # Transformers for numerical data  numerical_transformer = Pipeline(      steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", MinMaxScaler())]  )    # Transformers for categorical data  categorical_transformer = Pipeline(      steps=[          ("imputer", SimpleImputer(strategy="most_frequent")),          ("encoder", OrdinalEncoder()),      ]  )    # Combine pipelines using ColumnTransformer  preproc_pipe = ColumnTransformer(      transformers=[          ("num", numerical_transformer, num_col),          ("cat", categorical_transformer, cat_col),      ],      remainder="passthrough",  )    # Selecting the best features  KBest = SelectKBest(chi2, k="all")    # Random Forest Classifier  model = RandomForestClassifier(n_estimators=100, random_state=125)    # KBest and model pipeline  train_pipe = Pipeline(      steps=[          ("KBest", KBest),          ("RFmodel", model),      ]  )    # Combining the preprocessing and training pipelines  complete_pipe = Pipeline(      steps=[          ("preprocessor", preproc_pipe),          ("train", train_pipe),      ]  )    # running the complete pipeline  complete_pipe.fit(X_train, y_train)      ## Model Evaluation  predictions = complete_pipe.predict(X_test)  accuracy = accuracy_score(y_test, predictions)  f1 = f1_score(y_test, predictions, average="macro")    print("Accuracy:", str(round(accuracy, 2) * 100) + "%", "F1:", round(f1, 2))      ## Confusion Matrix Plot  predictions = complete_pipe.predict(X_test)  cm = confusion_matrix(y_test, predictions, labels=complete_pipe.classes_)  disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=complete_pipe.classes_)  disp.plot()  plt.savefig("model_results.png", dpi=120)    ## Write metrics to file  with open("metrics.txt", "w") as outfile:      outfile.write(f"nAccuracy = {round(accuracy, 2)}, F1 Score = {round(f1, 2)}nn")         # saving the pipeline  sio.dump(complete_pipe, "bank_pipeline.skops")

We got a good result.

$ python train.py  Accuracy: 88.0% F1: 0.77

GitHub Actions For Machine Learning Beginners

You can learn more about the inner workings of the code mentioned above by reading "Streamline Your Machine Learning Workflow with Scikit-learn Pipelines"

We don't want Git to push output files as they are always generated at the end of the code so we will be adding the to .gitignore file.

Just type `.gitignore` in the terminal to launch the file.

$ .gitignore

Add the following file names.

metrics.txt  model_results.png  bank_pipeline.skops

This is how it should look like on your VSCode.

GitHub Actions For Machine Learning Beginners

We will now stage the changes, create a commit, and push the changes to the GitHub main branch.

git add .  git commit -m "new changes"  git push origin main

This is how your GitHub repository should look like.

GitHub Actions For Machine Learning Beginners

CML

Before we begin working on the workflow, it's important to understand the purpose of Continuous Machine Learning (CML) actions. CML functions are used in the workflow to automate the process of generating a model evaluation report. What does this mean? Well, when we push changes to GitHub, a report will be automatically generated under the commit. This report will include performance metrics and a confusion matrix, and we will also receive an email with all this information.

GitHub Actions For Machine Learning Beginners

GitHub Actions

It's time for the main part. We will develop a machine learning workflow for training and evaluating our model. This workflow will be activated whenever we push our code to the main branch or when someone submits a pull request to the main branch.

To create our first workflow, navigate to the “Actions” tab on the repository and click on the blue text “set up a workflow yourself.” It will create a YML file in the .github/workflows directory and provide us with the interactive code editor for adding the code.

GitHub Actions For Machine Learning Beginners

Add the following code to the workflow file. In this code, we are:

Naming our workflow.
Setting the triggers on push and pull request using `on` keyworks.
Providing the actions with written permission so that the CML action can create the message under the commit.
Use Ubuntu Linux runner.
Use `actions/checkout@v3` action to access all the repository files, including the dataset.
Using `iterative/setup-cml@v2` action to install the CML package.
Create the run for installing all of the Python packages.
Create the run for formatting the Python files.
Create the run for training and evaluating the model.
Create the run with GITHUB_TOKEN for moving the model metrics and confusion matrix plot to report.md file. Then, use the CML command to create the report under the commit comment.

name: ML Workflow  on:    push:      branches: [ "main" ]    pull_request:      branches: [ "main" ]    workflow_dispatch:      permissions: write-all    jobs:    build:      runs-on: ubuntu-latest      steps:        - uses: actions/checkout@v3          with:            lfs: true        - uses: iterative/setup-cml@v2        - name: Install Packages          run: pip install --upgrade pip && pip install -r requirements.txt        - name: Format          run: black *.py        - name: Train          run: python train.py        - name: Evaluation          env:            REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}          run: |             echo "## Model Metrics" > report.md            cat metrics.txt >> report.md                          echo '## Confusion Matrix Plot' >> report.md            echo '![Confusion Matrix](model_results.png)' >> report.md              cml comment create report.md

This is how it should look on your GitHub workflow.

GitHub Actions For Machine Learning Beginners

After committing the changes. The workflow will start executing the command sequentially.

GitHub Actions For Machine Learning Beginners

After completing the workflow, we can view the logs by clicking on the recent workflow in the "Actions" tab, opening the build, and reviewing each task's logs.

GitHub Actions For Machine Learning Beginners

We can now view the model evaluation under the commit messages section. We can access it by clicking on the commit link: fixed location in workflow · kingabzpro/GitHub-Actions-For-Machine-Learning-Beginners@44c74fa

GitHub Actions For Machine Learning Beginners

You will also receive an email from GitHub

GitHub Actions For Machine Learning Beginners

The code source is available on my GitHub repository: kingabzpro/GitHub-Actions-For-Machine-Learning-Beginners. You can clone it and try it yourself.

Final Thoughts

Machine learning operation (MLOps) is a vast field that requires knowledge of various tools and platforms to successfully build and deploy models in production. To get started with MLOps, it is recommended to follow a comprehensive tutorial, "A Beginner's Guide to CI/CD for Machine Learning". It will provide you with a solid foundation to effectively implement MLOps techniques.

In this tutorial, we covered what GitHub Actions are and how they can be used to automate your machine learning workflow. We also learned about CML Actions and how to write scripts in YML format to run jobs successfully. If you're still confused about where to start, I suggest taking a look at The Only Free Course You Need To Become a MLOps Engineer.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

GitHub Actions For Machine Learning Beginners

GitHub Actions key features

Workflows

Jobs and Steps

Events

Runners

Actions

Setting Up

Training and Evaluating Code

CML

GitHub Actions

More On This Topic

Trump Administration Orders US Chip Design Software program Companies to Cease Gross sales to China

Latest stories

Trump Administration Orders US Chip Design Software program Companies to...

Google DeepMind broadcasts SignGemma: AI for Signal Language

IIT Kharagpur, Singapore’s IME Signal MoU for Semiconductor Analysis

We Smoked NVIDIA’s Blackwell, Says Cerebras

Gemini will now routinely summarize your lengthy emails until you...

You might also like...

Trump Administration Orders US Chip Design Software program Companies to Cease Gross sales to China

Google DeepMind broadcasts SignGemma: AI for Signal Language

IIT Kharagpur, Singapore’s IME Signal MoU for Semiconductor Analysis