Amazon SageMaker Data Wrangler for Data Scientists

AWS is making every Data Scientist’s wish come true with Amazon SageMaker Data Wrangler!

When AWS introduced Amazon SageMaker back in November 2017, it was a game changer for AI/ML in the cloud space. Before then, it used to take us days to get Data Scientists looking at data in a secure and friendly cloud environment. In some cases, almost 90% of the initial stage of a data project was spent on creating a stable and secure environment in which the Data Scientists could start working on AI/ML problems.

With the introduction of Amazon SageMaker, this was drastically cut from days to minutes. AWS didn’t call it a day after releasing SageMaker. Instead, they kept refining it and adding the tools and features the AI/ML crowd were asking for, alongside standalone AI services such as Rekognition and Textract. Each of those tools is extra ammunition in the Data Scientist’s arsenal for solving the ever-changing problems of the AI/ML space.

Reducing the time needed to prepare data

Amazon SageMaker Data Wrangler is certainly the new kid on the block, with the same level of magic SageMaker once brought to the AI/ML world. This single tool, part of the SageMaker suite, will unlock a plethora of AI/ML opportunities that could otherwise have been left behind due to the amount of data preparation needed.

Ask anyone in the data science world which part of being a Data Scientist they wish they could skip, and you will likely hear the same answer: most would put “spending less time preparing data” near the top of their wish list. Data Wrangler is aimed squarely at this frustration. In some cases, up to 95% of a Data Scientist’s total effort is spent inefficiently on cleaning and preparing datasets. With the arrival of Data Wrangler, this inefficient use of time can become a thing of the past. Data Scientists can now focus on what they were destined to do: solving complex problems with AI/ML, rather than spending the majority of their time cleaning and preparing data.

Amazon SageMaker Studio and the importing of fresh data.

Advancing data accessibility

One of the more time-consuming parts of data preparation is actually finding the data and making it accessible to the AI/ML teams. Data Wrangler allows AI/ML teams to access many of AWS’s most common data services, as well as some third-party services, with little or no manual configuration. Examples include S3, Redshift, Lake Formation and Snowflake. Data Wrangler also includes more than 300 built-in data transformations, which help Data Scientists cut down the time spent on data preparation without the burden of writing additional code.
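To make the time saving concrete, here is a minimal sketch of the kind of hand-written preparation code that built-in transformations are meant to replace. It is plain Python rather than Data Wrangler's own API, and the function names, column names and sample rows are all hypothetical, chosen purely for illustration. It fills missing values with the column median and then one-hot encodes a categorical column.

```python
from statistics import median


def fill_missing_with_median(rows, column):
    """Return new rows where a None in `column` is replaced by the column median."""
    present = [row[column] for row in rows if row[column] is not None]
    fill = median(present)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


def one_hot_encode(rows, column):
    """Return new rows where `column` is replaced by one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data: one customer has no recorded age.
rows = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 52, "plan": "basic"},
]

cleaned = one_hot_encode(fill_missing_with_median(rows, "age"), "plan")
for row in cleaned:
    print(row)
```

Even this small example needs two helper functions and careful handling of missing values. Multiply that across every column of a real dataset and the appeal of ready-made transformations becomes clear.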

Node exporting within a simplified flow structure.

Visualisation from sorted data sets

An important (and often overlooked) part of building your processing flow is understanding your data and presenting it in a way that is useful to viewers from multiple departments within a business. The best way to understand your data is to visualise it. Data Wrangler offers a selection of visualisation templates, including box-and-whisker plots, bar charts, histograms and scatter plots.
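For readers less familiar with these chart types, the sketch below shows the numbers that sit behind two of them: the five-number summary drawn by a whisker plot and the bin counts drawn by a histogram. It uses only the Python standard library and is not how Data Wrangler computes its charts. The function names and the sample order values are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the min, quartiles and max that a whisker plot is drawn from."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3, "max": max(values)}


def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [len(values)] + [0] * (bins - 1)
    counts = [0] * bins
    for value in values:
        # The maximum value would index one past the end, so clamp it.
        index = min(int((value - lowest) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample data with one unusually large order.
order_values = [10, 12, 15, 18, 20, 22, 25, 30, 50]

summary = five_number_summary(order_values)
counts = histogram_counts(order_values, bins=4)
print(summary)
print(counts)
```

The single large order stretches the upper whisker and leaves the last histogram bin almost empty, which is exactly the kind of pattern that is easy to spot in a chart and easy to miss in a table.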

Examples of visualisation templates within Amazon SageMaker.

Firemind and Data Wrangler

Every business that utilises vast amounts of data understands the importance of effectively managing that data and of understanding and predicting the trends and patterns within it. However, without tools such as Data Wrangler within Amazon SageMaker, this is a highly manual and laborious endeavour, full of tedious and error-prone tasks that consume a data team’s time, week after week.

Our team of Data Scientists have been working with Data Wrangler, testing its workflow and benefits against a range of data sets, and we couldn’t be more impressed with its accuracy. After a thorough evaluation, we are confident that Data Wrangler will be an exceptional tool that helps us deliver on data projects across our varied client base.

