Categories: Data & Visualisation, Machine Learning
Date Published
March 10, 2022

Reduce the Cost and Complexity of Machine Learning Preprocessing

A week ago, Nate Bachmeier, AWS Senior Solutions Architect, and Marvin Fernandes, Solutions Architect, wrote an insightful article on the AWS Machine Learning Blog.

In it, they gave an overview of a solution that reduces both complexity and cost during machine learning preprocessing. That claim really caught our attention. Let’s dig deeper!

The solution

The solution revolves around the common use of Amazon Simple Storage Service (Amazon S3) and adjustments to pipelines that train on unstructured data (such as video, audio and free-form text). The refined AWS solution demonstrates a clear pattern for significantly reducing complexity and cost, and for centrally managing the second step (where Data Scientist/Engineer input is concerned).

The basic infrastructure, highlighting the normalising of datasets used to train machine learning models.

How the AWS ‘elegant’ solution works

“When ML algorithms process unstructured data like images and video, it requires various normalisation tasks (such as grey-scaling and resizing). This step exists to accelerate model convergence, avoid overfitting and improve prediction accuracy. You often perform these preprocessing steps on instances that later run the AI training. That approach creates inefficiencies, because those resources typically have more expensive processors (for example, GPUs) than these tasks require. Instead, our solution externalizes those operations across economic, horizontally scalable Amazon S3 Object Lambda functions.

This design pattern has three critical benefits. To begin, it centralises the shared data transformation steps, such as image normalisation and removing ML pipeline code duplication. Secondly, S3 Object Lambda functions avoid data consistency issues in derived data through JIT conversions. Finally, the serverless infrastructure reduces operational overhead, increases access time and limits costs to the per-millisecond time when running your code.

An elegant solution exists in which you can centralise these data preprocessing and data conversion operations with S3 Object Lambda. S3 Object Lambda enables you to add code that modifies data from Amazon S3 before returning it to an application. The code runs within an AWS Lambda function, a serverless compute service. Lambda can instantly scale to tens of thousands of parallel runs while supporting dozens of programming languages and even custom containers.” – AWS Solution Architecture team

“As an ML & AI specialist, we’re excited to be using Amazon SageMaker in both new and exciting ways. Cost reduction is always an important part of a customer project, and using the latest solutions to make that possible is what helps set us apart.”

Ajish Palakadan

Chief Technology Officer - Firemind

The finer details

In the solution shown in the infrastructure diagram above, the S3 bucket contains the raw images to be processed. You then create an S3 Access Point for the images. If you are building multiple machine learning models, you can create a separate S3 Access Point for each model.

Alternatively, AWS Identity and Access Management (IAM) policies for access points support sharing reusable functions across ML pipelines. Next, you attach a Lambda function containing your preprocessing business logic to the S3 Access Point. Whenever data is retrieved through that access point, the function performs the just-in-time (JIT) data transformations. Finally, you update your ML model to use the new S3 Object Lambda Access Point to retrieve data from Amazon S3.
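To make those steps more concrete, here is a minimal sketch of what the Lambda function and the training-side read might look like in Python. It is our own illustration, not the code from the AWS article. The event fields under getObjectContext and the boto3 calls write_get_object_response and get_object follow the S3 Object Lambda interface as we understand it. The helper names, the injectable s3_client, transform and fetch parameters, and the read_training_object function are hypothetical and exist only so the sketch can be exercised without an AWS account.

```python
import urllib.request


def _download(url):
    """Fetch the original object from the presigned URL that S3 supplies."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def handler(event, context, s3_client=None, transform=None, fetch=None):
    """Sketch of an S3 Object Lambda handler that transforms an object just in time.

    Lambda itself only passes event and context; the extra keyword
    arguments are assumptions added to make the sketch testable.
    """
    ctx = event["getObjectContext"]

    # Read the raw, untransformed object.
    fetch = fetch or _download
    original = fetch(ctx["inputS3Url"])

    # Placeholder for the shared preprocessing logic (resize, greyscale, ...).
    transformed = transform(original) if transform else original

    if s3_client is None:
        import boto3  # imported lazily so the sketch runs without AWS installed

        s3_client = boto3.client("s3")

    # Return the transformed bytes to whoever issued the GetObject request.
    s3_client.write_get_object_response(
        Body=transformed,
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"status_code": 200}


def read_training_object(s3_client, object_lambda_access_point_arn, key):
    """Training-side read: the access point ARN goes where a bucket name would."""
    response = s3_client.get_object(Bucket=object_lambda_access_point_arn, Key=key)
    return response["Body"].read()
```

With this shape, the training pipeline needs no preprocessing code of its own: it swaps the bucket name for the S3 Object Lambda Access Point ARN and receives data that has already been transformed.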

The original article then walks through creating the S3 Object Lambda access point and reviews a typical cost-savings analysis that takes the standard model training inefficiencies into account. We recommend checking it out, not least for the Python code sample that helps you create a Lambda function to perform the image resizing and conversion.
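For readers who want a feel for what "resizing and conversion" means before opening the original article, the sketch below shows the two operations on a tiny in-memory image. It is a standard-library-only illustration under our own assumptions, not the AWS authors' code: a real Lambda function would more likely use an imaging library, and the function names, the nested-list image representation and the nearest-neighbour resizing strategy are all hypothetical choices made for brevity. The greyscale weights are the widely used ITU-R BT.601 luma coefficients.

```python
def to_greyscale(pixels):
    """Convert rows of (R, G, B) tuples into rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def resize_nearest(pixels, new_width, new_height):
    """Resize a row-major pixel grid using nearest-neighbour sampling."""
    old_height = len(pixels)
    old_width = len(pixels[0])
    return [
        [
            # Map each target coordinate back to the closest source pixel.
            pixels[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


def normalise(pixels, size=(2, 2)):
    """Greyscale first, then resize to (width, height)."""
    grey = to_greyscale(pixels)
    return resize_nearest(grey, size[0], size[1])


# A 4x4 test image: left half white, right half black.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
image = [[WHITE, WHITE, BLACK, BLACK] for _ in range(4)]
small = normalise(image, size=(2, 2))
```

Functions like these are cheap, CPU-only work, which is why running them in a serverless function rather than on a GPU training instance can make sense.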

Mark Freeman - Digital Marketing Generalist