Program for NGS Workshop in ICIBM
April 22, 2012
Alt Event Finder: A Tool for Extracting Alternative Splicing Events from RNA-seq Data
Ao Zhou1,2, Marcus R. Breese2,3,4, Yangyang Hao2,3,4, Howard J. Edenberg2,3,4,5, Lang Li2,4,6, Todd C. Skaar6, Yunlong Liu2,3,4,*
1Bioinformatics Program, School of Informatics, Indiana University Purdue University Indianapolis, Indianapolis, IN 46202
2Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202
3Center for Medical Genomics, Indiana University School of Medicine, Indianapolis, IN 46202
4Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202
5Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202
6Division of Clinical Pharmacology, Department of Medicine, Indiana University School of Medicine, Indianapolis, IN 46202
* Corresponding Author
Alternative splicing increases proteome diversity by expressing multiple gene isoforms that often differ in function. Identifying alternative splicing events directly from RNA-seq experiments is an important step for detecting differentially regulated exons and for further investigating splicing regulation. Methods: We developed Alt Event Finder, a tool for identifying novel splicing events by using transcript annotation derived from genome-guided construction tools, such as Cufflinks and Scripture. With a proper combination of upstream alignment and transcript reconstruction tools, Alt Event Finder is capable of identifying many novel alternative splicing events in the human genome. We have evaluated the effect of upstream tool combinations on the performance of our strategy, and recommend different settings depending on the depth of sequencing coverage. We further applied Alt Event Finder to a set of RNA-seq data from rat liver cells, and identified dozens of events that are alternatively spliced after extensive alcohol exposure. Conclusion: Alt Event Finder is capable of identifying de novo splicing events from data-driven transcript annotation, and is a useful tool for studying splicing regulation.
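To make the idea of extracting events from transcript annotation concrete, the sketch below finds skipped (cassette) exon candidates in a toy set of transcript models. It is an illustration only and not the Alt Event Finder algorithm: the function name, the data layout (exon lists as (start, end) tuples), and the assumption that all transcripts belong to one gene on one strand are hypothetical.

```python
def find_skipped_exons(transcripts):
    """Report exons that one transcript includes and another splices over.

    `transcripts` maps a transcript name to its exons, given as
    (start, end) tuples sorted by position. All transcripts are assumed
    to come from the same gene, chromosome, and strand (an assumption of
    this sketch, not something stated in the abstract).
    """
    # Collect every splice junction (end of one exon, start of the next).
    junctions = set()
    for exons in transcripts.values():
        for left, right in zip(exons, exons[1:]):
            junctions.add((left[1], right[0]))

    # A middle exon is a skipping candidate if some transcript joins its
    # two neighbours directly.
    events = set()
    for exons in transcripts.values():
        for up, mid, down in zip(exons, exons[1:], exons[2:]):
            if (up[1], down[0]) in junctions:
                events.add((up[1], mid, down[0]))
    return sorted(events)


# Toy annotation with hypothetical coordinates.
toy = {
    "tx_inclusion": [(100, 200), (300, 400), (500, 600)],
    "tx_skipping": [(100, 200), (500, 600)],
}
events = find_skipped_exons(toy)
print(events)
```

Each reported tuple holds the upstream splice donor, the candidate cassette exon, and the downstream acceptor. A real tool would also need to handle other event types (alternative 5' and 3' splice sites, retained introns) and strand information.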
SASeq: A Selective and Adaptive Shrinkage Approach to Detect and Quantify Active Transcripts using RNA-Seq
Tin Nguyen, Nan Deng, and Dongxiao Zhu
Detection and quantification of active transcripts using RNA-Seq is a central task in transcriptomics research. Initial efforts at mathematical or statistical modeling of read counts or per-base exonic expression signal have been successful, but they face an increasing risk of model misspecification and the resulting overfitting. This is because the number of reference transcripts in the database is much larger than the number of active transcripts expressed under a single biological condition, and the difference is growing as transcript databases expand at an accelerating pace. Blind shrinkage of all the transcript parameters towards 0 does not necessarily lead to a set of active transcripts. Informed shrinkage approaches, motivated by the real data, are thus desirable. We present a novel selective and adaptive shrinkage approach to detect and quantify the active transcripts using RNA-Seq data. We propose a new mathematical model of the observed exonic expression signal and the underlying transcript structure. We introduce a tuning parameter to penalize the selected regions in the selected transcripts that are not supported by the observed exonic expression signal, and we develop a constrained least squares algorithm to adaptively adjust the shrinkage level based on the exonic expression signal. We also implement a fast yet accurate GUI tool to automate the detection and quantification of the active transcripts. Our tool takes a variety of RNA-Seq data formats, such as fasta, fastq, SAM or BAM, as input and outputs transcript abundance through a few mouse clicks. In simulation studies, our methods compare favorably with selected competing methods in terms of both time complexity and accuracy. We also demonstrate the potential applications by analyzing a real-world RNA-Seq data set.
Availability: Both the simulation data used for method comparisons and the GUI tool are freely available at http://asammate.sourceforge.net/.
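As a rough illustration of fitting transcript abundances to exonic signal under a constraint, the sketch below solves a plain non-negative least squares problem by projected gradient descent. It is not the SASeq method: the selective and adaptive penalty described in the abstract is omitted, and the structure matrix, signal values, and function name are assumptions made up for this example.

```python
def nnls_projected_gradient(A, y, steps=2000):
    """Minimize 0.5 * ||A x - y||^2 subject to x >= 0.

    A[e][t] is 1 if exon e belongs to transcript t, else 0.
    y[e] is the observed expression signal of exon e.
    """
    n_exons, n_tx = len(A), len(A[0])
    # 1 / ||A||_F^2 is a conservative step size, because the squared
    # Frobenius norm bounds the largest eigenvalue of A^T A from above.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_tx
    for _ in range(steps):
        # Residual of the current fit, one entry per exon.
        resid = [sum(A[e][t] * x[t] for t in range(n_tx)) - y[e]
                 for e in range(n_exons)]
        # Gradient A^T (A x - y), one entry per transcript.
        grad = [sum(A[e][t] * resid[e] for e in range(n_exons))
                for t in range(n_tx)]
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, x[t] - step * grad[t]) for t in range(n_tx)]
    return x


# Hypothetical gene with 3 exons and 3 annotated transcripts.
# Rows are exons, columns are transcripts.
A = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]
# Signal generated from abundances (10, 5, 0): the third transcript is inactive.
y = [15.0, 10.0, 15.0]
abundance = nnls_projected_gradient(A, y)
print([round(v, 3) for v in abundance])
```

In this toy case the non-negativity constraint alone is enough to drive the inactive transcript to zero. With many more annotated transcripts than exons, that is generally not true, which is the situation the abstract's informed shrinkage is meant to address.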
DFI: Gene Feature Discovery in RNA-seq Experiments from Multiple Sources
Hatice Gulcin Ozer1,*, Jeffrey D. Parvin1, Kun Huang1
Differential expression detection for RNA-seq experiments is often biased by normalization algorithms due to their sensitivity to parametric assumptions on the gene count distributions, extreme values of gene expression, gene length and total number of sequence reads. To overcome these limitations, we developed Differential Feature Index (DFI), a non-parametric method for characterizing distinctive gene features across any number of diverse RNA-seq experiments without inter-sample normalization. Validated with qRT-PCR datasets, DFI accurately detected differentially expressed genes regardless of expression level, and its results were consistent with tissue-selective expression. The accuracy of DFI was very similar to that of the currently accepted methods: edgeR, DESeq and Cuffdiff. In this study, we demonstrated that DFI can efficiently handle multiple groups of data simultaneously, can identify differential gene features for RNA-Seq experiments from different laboratories, tissue types, and cell origins, and is robust to extreme values of gene expression, size of the datasets and gene length.
Data Sharing: Between Promises and Practices
Christopher D. Coldren1
In this workshop session I'll present my perspective on genomic assay data sharing and the importance of experimental metadata. While many groups have focused on the logistical difficulty of sharing NGS data, the sharing of experimental metadata is also a complex and critical task. The successful and appropriate practice of sharing genomic assay data must include the communication of experimental records that are NGS-specific, including technical and experimental details of molecular biology, instrumentation, and computation. Groups sharing microarray data confronted similar issues, and I will present two lessons learned by that community. In the first case, the cryptic sharing of experimental metadata allowed flawed research to remain undetected for several years. In the second case, the ease of use of experimental metadata enabled research of otherwise unimaginable breadth. I will describe the approach to NGS data sharing that we have initiated in the Vanderbilt Genome Sciences Resource, and will invite discussion of the experiences of workshop participants.
Bayesian Inference and Modularity Analysis of the Hierarchical Structure for Dynamic ERα Regulatory Networks
Binhua Tang1, Hang-Kai Hsu2, Pei-Yin Hsu2, Russell Bonneville1, Su-Shing Chen3, Tim H-M Huang2, Victor X. Jin1*
1Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
In this study, we integrate both estrogen (E2)-stimulated time-series ChIP-seq and gene expression data to reverse engineer the estrogen receptor α (ERα) transcriptional regulatory network. We identify the ERα-centered transcription factor hubs and their target genes from the ChIP-seq data at the four time points after estrogen stimulation, and infer the time-variant hierarchical network structures with a Bayesian multivariate statistical approach. Furthermore, we statistically analyze the properties of the inferred network structures, including network connectivity distribution and the correlation between regulatory coefficients and components' signal-to-noise ratios with respect to the absolute rank value distribution of regulatory strength. Finally, we use inherent recurrent motif patterns to discover three self-embedded regulatory modules within the ERα-centered hierarchical networks. The Gene Ontology (GO) analysis shows that each of the three modules regulates distinct functions of ERα target genes at different time points, demonstrating that our modularity analysis is indeed capable of discovering the functional association of the self-embedded network modules with their target genes. In summary, our Bayesian inference and modularity analysis not only reveals the dynamics of ERα-centered regulatory networks and underlying biological mechanisms, but also provides a novel approach to the underlying biological motif design principles and module function analysis for time-series high-throughput genomic (binding and expression) data.
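One of the network properties mentioned above, the connectivity distribution, can be illustrated with a small sketch that tabulates out-degrees in a directed regulator-to-target edge list. The node names and edges are placeholders invented for this example, and the function is a generic degree count, not the authors' analysis pipeline.

```python
from collections import Counter


def out_degree_distribution(edges):
    """Return a Counter mapping out-degree -> number of nodes with that degree.

    `edges` is a list of (regulator, target) pairs in a directed network.
    """
    # Count how many targets each regulator has.
    out_degree = Counter(regulator for regulator, _target in edges)
    # Nodes that only ever appear as targets have out-degree 0.
    for node in {n for edge in edges for n in edge}:
        out_degree.setdefault(node, 0)
    return Counter(out_degree.values())


# Placeholder hierarchy: one hub, two intermediate factors, two target genes.
edges = [
    ("hub", "tf_a"),
    ("hub", "tf_b"),
    ("tf_a", "gene_1"),
    ("tf_a", "gene_2"),
    ("tf_b", "gene_2"),
]
distribution = out_degree_distribution(edges)
for degree in sorted(distribution):
    print(degree, distribution[degree])
```

Computing this table separately for the network inferred at each time point would give one simple way to compare how connectivity changes after stimulation.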
Next-generation Sequencing-based Analysis of DNA Methylation Using MethylCap-seq: Sample Exclusion, Validation, and the Contribution of Replicate Lanes
Michael P Trimarchi, Mark Murphy, David Frankhouser, Benjamin Rodriguez, John Curfman, Guido Marcucci, Pearlly Yan, Ralf Bundschuh
DNA methylation is an important epigenetic mark, and dysregulation of DNA methylation is associated with many diseases including cancer. Thus, it is of great interest to determine the genome-wide methylation status of entire patient cohorts with the goals of identifying novel classifications of patient subgroups and of understanding the biological mechanisms by which changes in methylation status contribute to disease. MethylCap-seq is a cost-effective solution for such genome-wide determination of methylation status, but since the raw sequencing data is a somewhat indirect read-out of methylation status, the reliability of methylation reconstruction from raw sequencing data is not very well understood. We analyze several MethylCap-seq data sets and perform two different studies to assess data quality. First, we investigate how data quality is affected by excluding samples that do not meet quality control cutoff requirements determined from our experience of working with hundreds of MethylCap-seq samples. Second, we compare the ability to call feature-by-feature methylation from technical replicates where the same MethylCap-seq library was sequenced on separate sequencing lanes. Lastly, we verify a method for the determination of the global amount of methylation from MethylCap-seq data by comparing to a spiked-in control DNA of known methylation status. We show that rejection of samples based on our quality control parameters leads to a significant improvement of methylation calling. We also find that for data that passes the quality filter, correlation between technical replicates is very high. Lastly, we find that a global methylation index calculated from MethylCap-seq data correlates well with the global methylation level of a sample as obtained from a spike-in DNA of known methylation.
We show that with appropriate quality control MethylCap-seq is a reliable tool that provides reproducible relative methylation information on a feature-by-feature basis as well as information about the global level of methylation, and that it can be applied to entire patient cohorts of hundreds of patients.
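The replicate comparison described above can be sketched as a Pearson correlation between per-feature read counts from two lanes of the same library. The counts below are invented for illustration, and the abstract does not state which correlation measure the authors used, so Pearson's r is an assumption of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented read counts per genomic feature (for example, per CpG island).
# Lane B is sequenced at twice the depth of lane A in this toy case.
lane_a = [12, 40, 7, 88, 23]
lane_b = [24, 80, 14, 176, 46]
r = pearson(lane_a, lane_b)
print(round(r, 6))
```

Because the coefficient is invariant to a constant scaling of either lane, a difference in sequencing depth between replicate lanes does not by itself lower the correlation.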