Rapid Python used on Big Data to Discover Human Genetic Variation

Description

Presented by Deniz Kural

Advances in genome sequencing have enabled large-scale projects such as the 1000 Genomes Project to sequence genomes across diverse populations around the world, resulting in very large data sets. I use Python for rapid development of algorithms for processing and analyzing genomes and for discovering thousands of new variants, including "Mobile Elements" that copy themselves across the genome.

Abstract

Recent advances in high-throughput sequencing now enable accurate sequencing of human genomes at low cost and high speed. This technology is now being used to launch projects that sequence many genomes at large scale. The 1000 Genomes Project aims to sequence 2500 genomes across 27 world populations, and has so far completed its pilot phase. The aim of the project is to discover and characterize novel variants. These variants enable association studies that investigate the link between genomic variation and phenotypes, including disease.

One class of variants, known as "Structural Variants", is a heterogeneous group of larger variants, such as inversions, duplications, deletions, and various kinds of insertions.

I use Python for rapid development of algorithms to process, analyze, and annotate very large data sets. In particular, I focus on Mobile Elements, pieces of DNA that copy themselves across the genome. These elements constitute roughly half of the genome, whereas protein-coding genes account for roughly 1.5% of the genome.

I will discuss distributed computing, genomics, and big data within the context of Python.