Graph-theoretic models for de novo genome assembly

Abstract

Recent breakthroughs in DNA sequencing technologies have made it possible to compute high-quality human genome assemblies at scale, providing a more complete picture of human genetic diversity. Moving towards a fully automated and robust computational pipeline for deployment in healthcare requires genome assembly algorithms that are both efficient in practice and provably good. Graph-theoretic models play a central role in genome assembly. Graph sparsification is commonly used during assembly to simplify the graph by removing redundant or spurious edges. However, a graph model must be 'coverage-preserving', i.e., it must ensure that the target genome can be spelled as a walk in the graph, given sufficient sequencing depth. Our work shows that the commonly used string graph model violates this property, both in theory and in practice. We then introduce a novel sparse read-overlap-based graph model that is motivated by theory. Finally, we demonstrate the empirical advantage of this model using human sequencing data.
This talk will be based on the following publication:
https://doi.org/10.1093/bioinformatics/btad124
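
To make the 'coverage-preserving' property concrete, here is a minimal toy sketch in Python. It is not the method from the publication: the function names, the exact-match overlap rule, the tiny error-free reads and the minimum overlap of 3 are all assumptions chosen for illustration. It builds a read-overlap graph, spells the target genome as a walk, and shows that the walk is lost once a needed edge is removed.

```python
def overlap_len(a, b, min_overlap):
    """Longest proper suffix of a that equals a prefix of b (0 if shorter than min_overlap)."""
    for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_overlap):
    """Map each directed edge (i, j) to the overlap length between reads i and j."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_len(a, b, min_overlap)
                if k:
                    graph[(i, j)] = k
    return graph


def spell_walk(reads, graph, walk):
    """Return the string spelled by a walk, or None if the walk uses a missing edge."""
    spelled = reads[walk[0]]
    for u, v in zip(walk, walk[1:]):
        if (u, v) not in graph:
            return None
        spelled += reads[v][graph[(u, v)]:]
    return spelled


# Hypothetical toy data: three error-free reads tiling a 12-base "genome".
genome = "ACGTTGCAAGCT"
reads = [genome[0:6], genome[3:9], genome[6:12]]

graph = build_overlap_graph(reads, min_overlap=3)
spelled = spell_walk(reads, graph, [0, 1, 2])

# Simulate an over-aggressive sparsification step that drops a needed edge.
sparsified = {edge: k for edge, k in graph.items() if edge != (0, 1)}
spelled_after = spell_walk(reads, sparsified, [0, 1, 2])

print(spelled == genome, spelled_after)
```

In this toy setting the full graph is coverage-preserving because the walk 0, 1, 2 spells the genome, while the graph with the edge removed is not. Real assemblers work with noisy reads, both strands and approximate overlaps, none of which is modelled here.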

Bio

Chirag Jain is an Assistant Professor and Pratiksha Trust Young Investigator in the Department of Computational and Data Sciences (CDS) at IISc. He leads the ATCG group (https://at-cg.github.io), which develops scalable algorithms and software tools for data-intensive problems in molecular biology. Prior to his appointment at IISc, he was a post-doctoral fellow at the US National Institutes of Health. He completed his PhD in 2019 at Georgia Tech, where his dissertation received the College of Computing Dissertation Award.
