Data Lake Operations Boosted Using AWS DMS

In this post, Behram Irani, Senior Analytics Solutions Architect, shows how you can use AWS DMS to replicate header columns from source tables to an Amazon S3 target, and how to use these columns to consolidate the data in the data lake for CRUD operations.

Migrating the raw data from source systems into a central repository is usually the first step in establishing a data lake. Many systems store source data in relational database tables, so a mechanism is required to ingest this data into the data lake and to capture relevant metadata about these tables, so that data in the lake can be consolidated for create, read, update and delete (CRUD) operations.

Amazon Simple Storage Service (Amazon S3) is the storage of choice for setting up data lakes on AWS. Source data is first ingested into Amazon S3, transformed, and then consumed from Amazon S3 using purpose-built AWS services. The first step in this process is to ingest data from a variety of source systems, including relational databases such as Oracle Database and Microsoft SQL Server.

AWS Database Migration Service (AWS DMS) is a web service that you can use to migrate data from these source databases to your S3 data lake. AWS DMS can migrate the initial full load from the source database tables into Amazon S3 as well as perform ongoing change data capture (CDC). The replication process is performed by creating and starting an AWS DMS task. The S3 endpoint you define as the target for this task accumulates data from the source tables in CSV or Parquet files. You can then transform and curate this raw data so different tools can consume it.
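As a rough illustration of the S3 target endpoint described above, the sketch below assembles the settings such an endpoint might use. The bucket name, folder, role ARN, endpoint identifier and the helper function itself are hypothetical placeholders, not values from the original post. The dictionary keys follow the parameters of the boto3 DMS create_endpoint call, which is shown only as a comment because it needs AWS credentials to run.

```python
def build_s3_target_endpoint_kwargs(bucket_name, role_arn, data_format="parquet"):
    """Build keyword arguments for a DMS target endpoint that writes to S3.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if data_format not in ("csv", "parquet"):
        raise ValueError("data_format must be 'csv' or 'parquet'")
    return {
        "EndpointIdentifier": "datalake-s3-target",  # hypothetical identifier
        "EndpointType": "target",
        "EngineName": "s3",
        "S3Settings": {
            "BucketName": bucket_name,
            "BucketFolder": "raw",  # hypothetical prefix for the raw zone
            "ServiceAccessRoleArn": role_arn,
            "DataFormat": data_format,  # CSV or Parquet, as noted above
        },
    }


endpoint_kwargs = build_s3_target_endpoint_kwargs(
    "example-datalake-bucket",  # placeholder bucket
    "arn:aws:iam::123456789012:role/example-dms-s3-role",  # placeholder role
)

# With AWS credentials configured, these could be passed to boto3:
#   import boto3
#   boto3.client("dms").create_endpoint(**endpoint_kwargs)
```

Keeping the settings in a plain dictionary makes them easy to review or version before any AWS call is made.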

AWS DMS also provides certain transformation rules that you can apply to any selected schema, table or view. For more information on specific transformation expressions, see Transformation rules and actions.
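To make the idea of a transformation rule concrete, here is a minimal sketch of a table mapping that selects tables and adds one header-derived column. The schema name "sales" and the column name "change_seq" are assumptions chosen for illustration. The rule layout and the $AR_H_CHANGE_SEQ header expression follow the AWS DMS transformation rules documentation as best understood here, so check them against that reference before use.

```python
import json

# Selection rule: include every table in a hypothetical "sales" schema.
selection_rule = {
    "rule-type": "selection",
    "rule-id": "1",
    "rule-name": "include-sales-tables",
    "object-locator": {"schema-name": "sales", "table-name": "%"},
    "rule-action": "include",
}

# Transformation rule: add a column populated from a replication header.
add_header_column_rule = {
    "rule-type": "transformation",
    "rule-id": "2",
    "rule-name": "add-change-seq",
    "rule-target": "column",
    "object-locator": {"schema-name": "sales", "table-name": "%"},
    "rule-action": "add-column",
    "value": "change_seq",  # hypothetical name for the new column
    "expression": "$AR_H_CHANGE_SEQ",  # header expression (assumed suitable here)
    "data-type": {"type": "string", "length": 50},
}

table_mappings = {"rules": [selection_rule, add_header_column_rule]}

# A DMS task takes its table mappings as a JSON string.
table_mappings_json = json.dumps(table_mappings, indent=2)
print(table_mappings_json)
```

The resulting JSON string is the kind of document supplied as the table mappings when creating a replication task.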

Infrastructure diagram highlighting the path from on-premises sources to end users.

Be sure to visit the full article, where Behram highlights useful header transformations in AWS DMS for migrating your source database table data into Amazon S3 and shows how to use them to perform consolidation operations in the data lake.
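The consolidation idea can be sketched in a few lines. This is not the method from the full article, only a minimal in-memory illustration under stated assumptions: each change record carries a primary key ("id"), an operation flag ("Op" with values I, U or D, as AWS DMS writes for CDC output), and an increasing sequence value ("change_seq", a hypothetical column added through a header transformation). The latest record per key wins, and keys whose latest operation is a delete are dropped.

```python
def consolidate(records, key="id", seq="change_seq", op="Op"):
    """Return the current state of a table from a list of change records.

    Assumes `seq` values are comparable and increase with each change.
    Column names are hypothetical defaults for illustration.
    """
    latest = {}
    for record in records:
        current = latest.get(record[key])
        # Keep only the most recent change seen for each primary key.
        if current is None or record[seq] > current[seq]:
            latest[record[key]] = record
    # Drop keys whose most recent change is a delete.
    survivors = [r for r in latest.values() if r[op] != "D"]
    return sorted(survivors, key=lambda r: r[key])


changes = [
    {"id": 1, "Op": "I", "change_seq": 1, "name": "alpha"},
    {"id": 2, "Op": "I", "change_seq": 2, "name": "beta"},
    {"id": 1, "Op": "U", "change_seq": 3, "name": "alpha-v2"},
    {"id": 2, "Op": "D", "change_seq": 4, "name": "beta"},
    {"id": 3, "Op": "I", "change_seq": 5, "name": "gamma"},
]

current_rows = consolidate(changes)
for row in current_rows:
    print(row["id"], row["name"])
```

In a real data lake the same logic would more likely run as a query or a distributed job over the files in Amazon S3 than over a Python list.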

New to data lakes? Need cloud storage that can contain your data types?

If you’re looking to craft a centralised repository that allows you to store all your structured and unstructured data, at any scale, get in touch.