Build a Genomics Data Lake on AWS Using Amazon EMR

How Amazon EMR Is Tackling the Challenge of Large-Scale Population Genomic Data Analysis

As an AWS Advanced Partner, we typically align ourselves with clients whose data needs deliver a financial return or insights of clear value, whether that’s analysis of sporting performance, market movements, financial trends or media statistics and growth. But within the realm of all things ‘data’ lie some of the most important data sets of all: the human genome and the advancement of genetics.

New to Genomics? We’ve got you covered! Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping and editing of genomes. A genome itself is an organism’s complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration.

Understanding the human genome to advance medicine and therapies, and to fill in the blanks in our own evolutionary history, is a massive undertaking that requires some of the most complex tools and computational power in existence. Fortunately, AWS is at hand to help map, align and process this data, using a data lake built on Amazon S3 and processed with Amazon EMR.

The data used in genomic mapping is tricky, to say the least! In order to perform more sophisticated tertiary analysis on genomic and clinical data, researchers need to be able to access, aggregate and query the data from a centralised data store in a secure and compliant manner. The Variant Call Format (VCF) is not an ideal way to store this data when it needs to be queried performantly, so data scientists need to convert it to an open format with efficient storage and query performance.
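For readers who haven’t worked with the format before, a VCF stores each variant as a tab-delimited text line beneath a large free-text header. The record below is purely illustrative; the values are made up:

    #CHROM  POS    ID           REF  ALT  QUAL  FILTER  INFO
    chr1    10177  rs367896724  A    AC   100   PASS    AC=2130;AF=0.425;AN=5008

Useful values such as allele frequency (the AF key) sit inside the semicolon-delimited INFO string, which is exactly the kind of structure that row-oriented text files handle poorly and columnar formats handle well.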

High-level architecture diagram for simplifying VCFs within AWS.

Simplifying the Genomic Data

Open, columnar formats like Parquet and ORC compress more efficiently and perform better in the types of queries typically run on genomic data. In the AWS blog post this walkthrough follows, the authors build a pipeline, downstream of secondary analysis, that converts Variant Call Format (VCF) files into these formats to populate a genomics data lake built on Amazon Simple Storage Service (Amazon S3). They use Hail, an open-source genomics tool from the Broad Institute, running on Amazon EMR to perform the data transformation.
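As a rough sketch of what that transformation can look like in practice, the snippet below uses Hail’s Python API on an EMR cluster to read a block-gzipped VCF from S3 and write the variant-level records out as Parquet. The bucket names, paths and choice of fields are our own illustrative assumptions, not the blog post’s exact code:

    import hail as hl

    # Initialise Hail on top of the EMR cluster's existing Spark context
    hl.init()

    # Import a block-gzipped VCF from S3 into a Hail MatrixTable
    # (placeholder bucket and path)
    mt = hl.import_vcf('s3://example-genomics-bucket/raw/cohort.vcf.bgz',
                       reference_genome='GRCh38')

    # Keep the variant-level (row) fields and reshape them into plain,
    # Spark-friendly columns
    ht = mt.rows()
    ht = ht.key_by()  # drop the table key so the fields can be reshaped freely
    ht = ht.select(
        chrom=ht.locus.contig,
        pos=ht.locus.position,
        rsid=ht.rsid,
        ref=ht.alleles[0],
        alt=ht.alleles[1:],
        qual=ht.qual,
    )

    # Hand the table to Spark and write it to the data lake as Parquet,
    # partitioned by chromosome for faster downstream queries
    df = ht.to_spark()
    df.write.mode('overwrite').partitionBy('chrom').parquet(
        's3://example-genomics-bucket/lake/variants/')

On EMR, a script like this would typically be submitted as a Spark step or run interactively from a notebook attached to the cluster.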

Part 1 of the article walks through building the AWS solution, including transforming a single VCF into Parquet, performing bulk transformations across many VCFs, and accessing the resulting data in a structured and reliable manner from EMR clusters.
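To give a feel for the “accessing the data” part, here is a hypothetical PySpark query over the Parquet output from the sketch above; the paths, view name and query are our own examples rather than anything from the article:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('variant-queries').getOrCreate()

    # Register the Parquet data lake as a temporary SQL view
    variants = spark.read.parquet('s3://example-genomics-bucket/lake/variants/')
    variants.createOrReplaceTempView('variants')

    # A simple tertiary-analysis style question: how many variants per chromosome?
    spark.sql("""
        SELECT chrom, COUNT(*) AS n_variants
        FROM variants
        GROUP BY chrom
        ORDER BY n_variants DESC
    """).show()

Because the data is columnar and partitioned by chromosome, queries like this only scan the columns and partitions they actually need, which is the performance gain the article is pointing at.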

Transforming your data, one column at a time!

Whilst you might not be mapping the human genome, it’s likely you have a wealth of data that needs structuring and ordering using the best tools available. Get in touch with us today to see how we can work with your data.