NGS Workshop [Chairs: Kun Huang and Dongxiao Zhu]
August 11, 2013
Kui Zhang, Degui Zhi
Hidden Markov models (HMMs) based on the Li and Stephens model [1], which takes into account chromosome sharing among multiple individuals, have given rise to the mainstream haplotype phasing algorithms for genotyping arrays and next-generation sequencing (NGS) data. However, existing methods based on this model do not consider haplotype informative reads, i.e., reads that cover multiple heterozygous sites, which carry useful haplotype information. In our previous work [2], we developed a new HMM that incorporates a two-site joint emission term to capture the haplotype information across two adjacent sites. While that model improves the accuracy of genotype calling and haplotype phasing, haplotype information in reads covering non-adjacent sites and/or more than two adjacent sites is not used because of the severe computational burden. In this work, we develop a new probabilistic model for genotype calling and haplotype phasing from NGS data that incorporates haplotype information of multiple adjacent and/or non-adjacent sites covered by a read over an arbitrary distance. We also develop a new hybrid MCMC algorithm that combines the Gibbs sampling algorithm of HapSeq with the Metropolis-Hastings algorithm and is computationally feasible. We show by simulation and with real data from the 1000 Genomes Project that our model offers superior performance over existing methods for haplotype phasing as well as genotype calling for population NGS data.
1. Li, N. and M. Stephens, Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 2003. 165(4): p. 2213-33.
2. Zhi, D., et al., Genotype calling from next-generation sequencing data using haplotype information of reads. Bioinformatics, 2012. 28(7): p. 938-46.
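The notion of a haplotype informative read used in the abstract above (a read covering at least two heterozygous sites) can be illustrated with a minimal sketch. This is not the authors' model or HapSeq code; the read format (ID, inclusive start, inclusive end), the function name, and the toy coordinates are assumptions made for illustration, and each read is treated as a single contiguous interval.

```python
from bisect import bisect_left, bisect_right


def haplotype_informative_reads(reads, het_sites):
    """Return (read_id, covered_sites) for reads spanning >= 2 heterozygous sites.

    reads: iterable of (read_id, start, end), inclusive coordinates
           (a hypothetical format chosen for this sketch).
    het_sites: positions of heterozygous sites on the same chromosome.
    """
    sites = sorted(het_sites)
    informative = []
    for read_id, start, end in reads:
        # Binary search for the slice of heterozygous sites inside [start, end].
        lo = bisect_left(sites, start)
        hi = bisect_right(sites, end)
        covered = sites[lo:hi]
        if len(covered) >= 2:
            informative.append((read_id, covered))
    return informative


# Toy data (made up for illustration only).
reads = [("r1", 100, 199), ("r2", 150, 249), ("r3", 300, 399)]
het_sites = [120, 180, 240, 350]
result = haplotype_informative_reads(reads, het_sites)
```

Reads such as r3, which cover a single heterozygous site, contribute to genotype calling but carry no direct phasing information.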
Profiling of different types of de novo and rare inherited mutations in autism spectrum disorders revealed by whole genome sequencing
Autism Spectrum Disorder (ASD) demonstrates high heritability, familial clustering and a ~4:1 male-to-female bias, yet the genetic causes are only partially understood because of extensive clinical and genetic heterogeneity. Whole genome sequencing (WGS) promises added value in identifying novel ASD risk genes as well as new mutations in known loci, but an assessment of its full utility in an ASD group has not been performed. BGI and collaborators have initiated an international endeavor called the Autism Genome 10K-Project, which aims to sequence the genomes of 10,000 individuals from ASD families. In a pilot study, we used WGS to examine 76 families with ASD (33 from China, 33 from the US, 10 from the Middle East) to detect different types of de novo or rare inherited genetic variants, including SNVs, small indels, CNVs, SVs and retrotransposon insertions. Among ASD probands, we identified a number of deleterious de novo mutations and X-linked or autosomal inherited alterations (some probands had combinations of mutations). These deleterious variants were found in a number of novel, known, and candidate ASD risk genes. Taken together, these results suggest that WGS and thorough informatic analyses for de novo and rare inherited mutations will improve the detection of genetic variants likely associated with ASD or its accompanying clinical symptoms.
Spontaneous mutations in bacteria revealed by whole-genome re-sequencing
The advance of next-generation sequencing techniques has transformed the research paradigms in many fields of the life sciences, including microbiology. The low cost of NGS enables large-scale investigation of genomic variations in bacteria under distinct or no selection pressure. I will report here our findings on spontaneous mutations in the bacterium E. coli, revealed by whole-genome re-sequencing of mutation-accumulation (MA) E. coli strains, in which selection pressure is minimized. For the first time from genome-wide data, we derived the rate and spectrum of spontaneous mutations in both wild-type and mismatch repair (MMR) deficient strains of E. coli. We observed that the distribution of spontaneous mutations falls into a wave-like spatial pattern that is repeated in the two separately replicated halves of the E. coli chromosome. Finally, I will also report the large-scale genome rearrangements occurring in neutrally evolved E. coli strains.
This work is a collaboration with the labs of Pat Foster and Michael Lynch at Indiana University.
Computational approaches for metagenomic mining
Fueled by advances in next-generation sequencing, metagenomics is revolutionizing the field of microbiology by enabling analyses of microbial communities, and has excited researchers in many disciplines that could benefit from the study of environmental microbes. Annotating large metagenomic datasets, however, can be challenging, and directed computational approaches are often needed to make the best use of them. In this talk, I will first introduce FragGeneScan and RAPSearch, tools that we have developed for predicting genes (often fragmented) from metagenomic sequences and for fast similarity searches at the protein level. I will also talk about targeted computational approaches that we have developed for studying CRISPR–Cas immune systems, demonstrating the importance of targeted analyses of general metagenomic datasets. The CRISPR–Cas adaptive immune system is a primary defense system in prokaryotic organisms, providing targeted defense against invading DNAs (including viruses and plasmids): bacteria memorize invaders by incorporating pieces of the invaders’ sequences, called spacers, into CRISPR (clustered regularly interspaced short palindromic repeats) loci between repeats, forming arrays of repeat-spacer units. Our targeted computational approaches include a targeted assembly approach that greatly improves the identification of CRISPR arrays from metagenomic sequences, and an approach for fishing out the invaders of bacterial communities using spacers extracted from assembled CRISPR arrays. Application of our tools to human microbiomes has revealed a variety of CRISPR–Cas systems and a diverse collection of invasive mobile genetic elements, which hopefully will be an important resource for studying the interactions between bacteria and invaders.
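The repeat-spacer structure described above, and the idea of using spacers to fish out invader sequences, can be sketched in a few lines. This is only an illustration and not the speaker's targeted assembly method, FragGeneScan, or RAPSearch: it assumes a known, exactly conserved repeat and exact spacer matches on one strand, whereas real data need approximate matching and reverse-complement handling. All sequences and names below are made up.

```python
def extract_spacers(array_seq, repeat):
    """Split a CRISPR array into spacers, assuming an exactly conserved repeat."""
    parts = array_seq.split(repeat)
    # Segments strictly between two repeats are spacers; the first and last
    # segments are flanking sequence and are dropped.
    return [p for p in parts[1:-1] if p]


def find_invaders(spacers, contigs):
    """Return names of contigs that contain any spacer as an exact substring."""
    return sorted(
        name for name, seq in contigs.items()
        if any(spacer in seq for spacer in spacers)
    )


# Toy sequences (hypothetical, for illustration only).
repeat = "GTTTTAGAGC"
array_seq = "AAA" + repeat + "ACGTACGTAA" + repeat + "TTGCATGCAA" + repeat + "CCC"
spacers = extract_spacers(array_seq, repeat)

contigs = {
    "phage_like": "GGGGACGTACGTAATTTT",  # contains the first spacer
    "unrelated": "ATATATATATAT",
}
invaders = find_invaders(spacers, contigs)
```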
PePr: a Peak-calling and Prioritization Pipeline for Identifying DNA-binding Sites in Replicated ChIP-seq Experiments
ChIP-seq is now the standard method to identify genome-wide DNA-binding sites for transcription factors (TFs) and histone modifications. As use of this technique grows, there is an increasing demand to analyze experiments with biological replicates, especially for epigenomic experiments where variation among biological samples is most significant. However, even with TF binding, the observed variation among replicates is higher than most current peak finders assume. I will present a novel Peak-calling and Prioritization pipeline (PePr) for replicated ChIP-seq experiments. PePr uses a local negative binomial distribution, ranking consistent binding sites more favorably than sites with greater variability. Comparing PePr to commonly used approaches on several transcription factor ChIP-seq datasets, we show that PePr uniquely identifies regions with enriched tag counts, high motif occurrence rates and known characteristics of TF binding based on visual inspection. For histone modification data displaying substantial variation among samples, PePr achieved better specificity than alternative approaches by identifying regions that are more consistently different between the two sample groups. I will illustrate how PePr estimates the shift size to align alternate strand reads, the optimal window size across the genome, and the significance and FDR levels, and how its post-processing steps remove artifacts and improve peak resolution. PePr is made available as a Google Code project at https://code.google.com/p/pepr-chip-seq.
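One common way to estimate a shift size between forward- and reverse-strand reads is to pick the offset that maximizes the overlap between the two strands. The sketch below illustrates that general idea only; it is an assumption for illustration and not necessarily how PePr estimates the shift. The function name, the `max_shift` parameter, and the read positions are hypothetical.

```python
from collections import Counter


def estimate_strand_offset(forward_starts, reverse_starts, max_shift=300):
    """Return the offset d (in bp) that best aligns forward reads onto reverse reads.

    For each candidate d, count forward 5' positions p for which a reverse
    5' position exists at p + d, and keep the d with the highest count.
    """
    reverse_counts = Counter(reverse_starts)
    best_d, best_score = 0, -1
    for d in range(max_shift + 1):
        # Counter returns 0 for positions with no reverse read.
        score = sum(reverse_counts[p + d] for p in forward_starts)
        if score > best_score:
            best_d, best_score = d, score
    return best_d


# Toy 5' positions (made up): two binding sites plus one stray reverse read.
forward_starts = [100, 105, 500, 510]
reverse_starts = [250, 255, 650, 660, 900]
offset = estimate_strand_offset(forward_starts, reverse_starts)
shift_size = offset // 2  # each strand is moved half the offset toward the center
```

Shifting each strand by half the estimated offset places reads from both strands near the presumed binding position before windows are counted.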
Personalized mutation network analysis of putative cancer genes from next-generation sequencing data
A major challenge in interpreting the large volume of mutation data identified by next-generation sequencing (NGS) is to distinguish driver mutations from neutral passenger mutations, in order to facilitate the identification of targetable genes and new drugs. Current approaches are primarily based on mutation frequencies; they lack the power to detect infrequently mutated driver genes and ignore functional interconnection and regulation among cancer genes. We propose a novel, personalized mutation network method, VarWalker, which adjusts for each patient's mutation profile and builds on the joint frequency of interacting mutated genes in protein-protein interaction (PPI) networks. VarWalker fits a sample-specific generalized additive model to estimate the probabilities of mutation events in each patient’s genome. With a weighted resampling procedure, passenger mutations, which largely arise from random events, can be removed. VarWalker then examines mutated genes as well as their close interactors by applying the Random Walk with Restart algorithm. The method is further equipped with comprehensive resampling- and randomization-based tests for evaluation. In our applications to two large-scale NGS benchmark datasets (183 lung adenocarcinoma samples and 121 melanoma samples), we constructed a functionally connected, cancer-gene-driven subnetwork for each disease. Importantly, VarWalker enables identification of not only highly recurrently mutated genes, but also well-studied yet infrequently mutated genes. Using VarWalker, we demonstrated that network-assisted approaches can be effectively adapted to facilitate the detection of driver genes in cancer from NGS data. VarWalker prioritizes cancer genes, whether frequently or infrequently mutated, and illustrates their interactions in PPI networks.
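The Random Walk with Restart step mentioned above can be sketched generically as follows. This is a textbook formulation on a toy graph, not VarWalker's implementation: the restart probability of 0.5, the node names, the unweighted edges, and the handling of isolated nodes are all assumptions made for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities of a walk restarting at the seeds.

    adj: dict mapping each node to a list of neighbors (unweighted graph).
    seeds: set of start nodes, e.g. mutated genes in one sample.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed node.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            move = (1.0 - restart) * p[u]
            nbrs = adj[u]
            if nbrs:
                # Otherwise it moves to a uniformly chosen neighbor.
                for v in nbrs:
                    nxt[v] += move / len(nbrs)
            else:
                # Isolated node: return its mass to the seeds (a design choice here).
                for n in nodes:
                    nxt[n] += move * p0[n]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy interaction network with hypothetical node names; A is the mutated seed.
network = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
scores = random_walk_with_restart(network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes that are close to the seeds in the network receive higher scores, which is what allows close interactors of mutated genes to be examined alongside the mutated genes themselves.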
In this talk, I will discuss our strategy for studying transcriptional regulation using allele-specific expression levels derived from RNA-seq experiments. We will demonstrate this strategy using tissue-specific transcriptome data derived from F1 crosses between C57BL/6 and DBA/2J.
In this talk, I will discuss our ongoing work on generating and processing large whole-transcriptome datasets from RNA-seq technology to identify regulatory elements in pharmacogenomics applications. Using whole-transcriptome sequencing at high depth (100-180 million reads per sample), we are able to discover functional SNPs by identifying allele-specific expression sites. In addition, we can identify lncRNAs that show strong correlative relationships with mRNAs and compare them among multiple brain regions between smokers and non-smokers.
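A simple way to flag a candidate allele-specific expression site is to test whether the reference and alternate read counts at a heterozygous SNP depart from the 50:50 ratio expected under balanced expression. The sketch below uses an exact two-sided binomial test; it is an illustrative assumption, not the speaker's pipeline, and it ignores reference mapping bias and overdispersion. The function name and the read counts are hypothetical.

```python
from math import exp, lgamma, log


def ase_binomial_pvalue(ref_count, alt_count, p=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one SNP."""
    n = ref_count + alt_count
    if n == 0:
        return 1.0

    def pmf(k):
        # Binomial probability computed in log space to avoid overflow.
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return exp(log_choose + k * log(p) + (n - k) * log(1.0 - p))

    probs = [pmf(k) for k in range(n + 1)]
    observed = probs[ref_count]
    # Sum all outcomes at least as unlikely as the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(q for q in probs if q <= observed * (1.0 + 1e-7))
    return min(1.0, total)


# Toy read counts (made up): one balanced site and one strongly imbalanced site.
balanced = ase_binomial_pvalue(5, 5)
imbalanced = ase_binomial_pvalue(10, 0)
```

With many SNPs tested per sample, the resulting p-values would normally be corrected for multiple testing before sites are called.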
Computational analysis of pathology and radiology imaging is playing an increasing role in studies of cancer. Combining quantitative measurements of the macroscopic and microscopic properties of tumors with rich genomic and clinical descriptions can provide insights into mechanisms of tumor progression. This talk will present several in silico studies of glioma brain tumors where we have developed computational pipelines to integrate imaging and genomics using data from the National Cancer Institute's Cancer Genome Atlas (TCGA). TCGA contains a wide array of genomic characterizations of hundreds of patients from more than 20 tumor types, and presents a unique opportunity to link imaging observations to genetics and patient outcome. We show how these approaches can leverage multimodal datasets to link imaging phenotypes to molecular drivers, to define image-based subtypes of tumors, and to understand the role of the tissue microenvironment in establishing patterns of gene expression.