A Closer Look at Amazon SageMaker Data Wrangler

As we push further into the field of Machine Learning (ML) and explore the many tools provided by AWS, our data team have set their sights firmly on Amazon SageMaker Data Wrangler.

As a Data Scientist myself, I’m always looking for new tools that simplify data collation, cleaning, transformation and so on. Data handling can often be an arduous battle of time management, filled with laborious manual tasks that are necessary (but often soul-destroying).

Amazon SageMaker Data Wrangler has proven to be an impressive and necessary extension to Amazon SageMaker Studio, considerably cutting the time it takes to aggregate and prepare data for ML.

In this post I’ll be taking a deep dive into Data Wrangler (check out the video further below) as well as sharing some of my thoughts on the new tool.

Data Scientist, Metin Alisho, ready to deep dive Data Wrangler

Video Exploration of Data Wrangler

I was keen to get to grips with Data Wrangler and thought it best to record my findings and experience whilst using a sample dataset.

In the video below you can see how to get started with the tool: setting up data flows, crafting engineering steps in a pipeline, transforming data and more. I hope this video gives other Data Scientists a good overview of what Data Wrangler can do to speed up data preparation.

If you’d prefer to watch the video on YouTube, click here.

Ease of Use

With its integrations with Amazon Redshift, Amazon S3 and Amazon Athena, Data Wrangler makes it easy to connect to a data source and dive straight into analysis and transformation. In addition, any Data Scientist can build and view a complete data pipeline that carries out all the steps necessary to get their data into the right format for their intended model.

Main Advantages

The main advantage I found was the ability to cut the time spent on data cleansing and transformation by about 80-90%, often with the click of a button. Complex functions can be carried out without writing code and, more importantly, without researching how to write that code in the first place!

Speed Up

Suppose a client wants a classification model built. Before the model can be built, a Data Scientist has to carry out a few initial steps: analysing the data (such as studying the distributions and correlations), transforming the data, carrying out some post-transform analysis and perhaps producing a bias report or target leakage assessment.
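To give a feel for what these initial steps look like when written by hand, here is a minimal sketch in plain Python. It is not how Data Wrangler works internally; the toy columns (`age`, `refund_issued`), the `churned` label and the 0.95 threshold are all assumptions made up for illustration. It summarises a column, measures each feature's correlation with the target, and flags suspiciously strong correlations as possible target leakage.

```python
from statistics import mean, pstdev


def summarise(column):
    """Basic distribution summary for one numeric column."""
    return {
        "min": min(column),
        "max": max(column),
        "mean": mean(column),
        "std": pstdev(column),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def flag_possible_leakage(features, target, threshold=0.95):
    """Names of features whose absolute correlation with the target
    meets the threshold (0.95 is an arbitrary, illustrative cut-off)."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data: 'refund_issued' mirrors the label exactly,
# which is the kind of column a leakage check should surface.
features = {
    "age": [25, 32, 47, 51, 38, 29],
    "refund_issued": [0, 1, 1, 0, 1, 0],
}
churned = [0, 1, 1, 0, 1, 0]

print(summarise(features["age"]))
print(flag_possible_leakage(features, churned))
```

Even this simplified version needs writing, testing and debugging before any modelling starts, and a real project would repeat it for every column and every transform.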

Carrying out all of these steps by hand in SageMaker, although doable, is an extremely time-consuming process. With Data Wrangler, project delivery speeds up, and from a consultancy perspective clients see results faster because the Data Scientist is less likely to get bogged down in additional coding and debugging.

The Potential

I know Data Wrangler has a vast amount of potential here at Firemind. We can help our clients (current and future) get a head start in their data and machine learning journey by diving in and ‘getting our hands dirty’ with their data. One of the hardest parts of ML has always been the time needed to prepare client data. With this new tool, that issue is a thing of the past.

It also allows us to quickly offer the client opportunities with AWS infrastructure tools such as Amazon Redshift and Amazon S3. Finally, we can work with the client to progress with the transformed data and start making predictions with ML.

These predictions, and effective future analysis with predictive models, are vital for many companies. The companies that understand and embrace this will be able to evolve and grow through intelligent dashboards, visualisation tools and more.

Final Thoughts

In addition to being a transformation tool and data pipeline, Data Wrangler also opens a Data Scientist up to more advanced knowledge that is usually only found after hours of research. For example, a Data Scientist new to ML may never have had exposure to target leakage, F1 scores or feature scaling.

Because Data Wrangler surfaces these concepts at the click of a button, it raises awareness that they exist. In the ML world, where there is so much theoretical knowledge, this prompts a Data Scientist to research the theoretical background, already knowing its real-world applications.
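For readers meeting these terms for the first time, two of the concepts mentioned above can be sketched in a few lines of plain Python. This is an illustration under my own assumptions (hand-made labels and values, min-max scaling chosen as one common scaling method), not Data Wrangler's implementation. The F1 score is the harmonic mean of precision and recall, and min-max scaling maps a feature onto the range 0 to 1.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical labels and predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_score(y_true, y_pred))

# Hypothetical feature on a much larger scale than 0-1.
print(min_max_scale([10, 20, 30]))
```

Scaling matters because features measured in very different units can otherwise dominate distance-based or gradient-based models, and the F1 score is useful when accuracy alone hides poor performance on the minority class.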