» CleanUpRNAseq checks whether RNA-seq data suffer from gDNA contamination. If so, it can correct for the contamination and reduce the false discovery rate of differentially expressed genes.
RNA sequencing (RNA-seq) has become a standard method for profiling gene expression, yet genomic DNA (gDNA) contamination carried over to the sequencing library poses a significant challenge to data integrity. Detecting and correcting this contamination is vital for accurate downstream analyses. In particular, when RNA samples are scarce and irreplaceable, it becomes essential not only to identify but also to correct gDNA contamination to maximize the data's utility. However, existing tools capable of correcting gDNA contamination are limited and lack thorough evaluation. To fill this gap, we developed CleanUpRNAseq, which offers a comprehensive set of functionalities for identifying and correcting gDNA-contaminated RNA-seq data.
» scATACpipe is a bioinformatic pipeline (powered by Nextflow) for single-cell ATAC-seq (scATAC-seq) data analysis.
scATACpipe enables users to perform end-to-end analysis of scATAC-seq data with three sub-workflow options for preprocessing that leverage 10x Genomics Cell Ranger ATAC software, the ultra-fast Chromap procedures, and a set of custom scripts implementing current best practices for scATAC-seq data preprocessing. The pipeline extends the R package ArchR for downstream analysis with added support for any eukaryotic species with an annotated reference genome.
» GS-Preprocess is a simple, 5-argument pipeline that generates input data for the GUIDEseq Bioconductor package (https://doi.org/doi:10.18129/B9.bioc.GUIDEseq) from raw Illumina sequencer output.
For off-target profiling, Bioconductor GUIDEseq only requires a 2-line guide RNA FASTA file, demultiplexed BAM files of "plus"- and "minus"-strands, and Unique Molecular Index (UMI) references for each read. The latter two are produced by GS-Preprocess.
» A Bioconductor package to find and visualize significantly enriched or depleted amino acid motifs or amino acid group patterns in proteomic datasets
(A collaboration with Dr. Acharya)
In addition to implementing iceLogo in R to visualize differential amino acid sequence patterns, dagLogo can test and visualize significant amino acid group patterns by classifying the amino acids into groups according to properties such as charge, chemistry, and hydrophobicity.
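The grouping idea can be illustrated with a minimal Python sketch. This is not dagLogo's code or API: the three charge classes below, the choice to count histidine as positive, and the function name group_counts_by_position are all assumptions made for illustration. The sketch only shows how recoding residues into groups turns per-position residue counts into per-position group counts, which is the kind of table a group-level test would start from.

```python
# Hypothetical charge-based grouping for illustration only;
# dagLogo ships its own grouping schemes, which may differ.
CHARGE_GROUPS = {
    "positive": "KRH",   # assumption: histidine is counted as positive here
    "negative": "DE",
    "neutral": "ACFGILMNPQSTVWY",
}

# Invert the grouping into a residue -> group-label lookup table.
RESIDUE_TO_GROUP = {
    residue: label
    for label, residues in CHARGE_GROUPS.items()
    for residue in residues
}


def group_counts_by_position(peptides):
    """Count group labels at each position of equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    counts = [dict.fromkeys(CHARGE_GROUPS, 0) for _ in range(length)]
    for peptide in peptides:
        for position, residue in enumerate(peptide.upper()):
            counts[position][RESIDUE_TO_GROUP[residue]] += 1
    return counts
```

Comparing such count tables between an experimental peptide set and a background set is where a statistical test would come in; that step is omitted here.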
» A web application for comprehensive and efficient analyses of RNA-seq data
OneStopRNAseq has user-friendly interfaces and offers workflows for common types of RNA-seq data analyses, such as comprehensive data-quality control, differential analysis of gene expression, exon usage, alternative splicing, transposable element expression, allele-specific gene expression quantification, and gene set enrichment analysis.
» A Bioconductor package for the bioinformatic analysis of NAD-seq data
(A collaboration with Dr. Kaufman)
The nucleolus is an important structure inside the nucleus of eukaryotic cells. It is the site for transcribing rDNA into rRNA and for assembling ribosomes, a process known as ribosome biogenesis. In addition, nucleoli are dynamic hubs through which numerous proteins shuttle and contact specific non-rDNA genomic loci. Deep sequencing analyses of DNA associated with isolated nucleoli (NAD-seq) have shown that specific loci, termed nucleolus-associated domains (NADs), form frequent three-dimensional associations with nucleoli. NAD-seq has been used to study the biological functions of NADs and the dynamics of NAD distribution during embryonic stem cell (ESC) differentiation. NADfinder is the first software designed specifically for the bioinformatic analysis of NAD-seq data, including baseline correction, smoothing, normalization, peak calling, and annotation.
» A Bioconductor package with minimalist design for plotting elegant track layers
This package visualizes multi-omics data and can be integrated into any analysis pipeline in R. trackViewer can be used not only to visualize coverage and annotation tracks, but also to generate lollipop and dandelion plots that depict sparse and dense methylation/mutation/variant data to facilitate an integrative analysis of diverse datasets. The updated trackViewer (versions 1.19.27 and higher) also has a web interface in addition to the R programming interface. Furthermore, with the ‘browseTracks’ function, users can generate interactive figures whose features can be customized by clicking, dragging, and typing.
» A Bioconductor package for quality assessment of ATAC-seq data
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a recently developed technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has a higher signal-to-noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. The ATACseqQC package generates various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, the package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling.
» A Bioconductor package for the visualization of motif alignment and the analysis of transcription factor binding site evolution
(A collaboration with Dr. Brodsky)
This package visualizes the alignment of motifs as a phylogenetic tree in different layout types. This tool facilitates the analysis of binding site diversity and conservation within families of TFs and the evolution of TFs among different species. motifStack can align DNA motifs; generate motif signatures for closely related motifs; and plot aligned motifs as a stack, a linear or a radial tree, or a word cloud of sequence logos. Different parameter settings can be used to generate diverse types of plots with color schemes highlighting important data features.
The package is also used in pipelines that find candidate binding sites for known transcription factors via sequence matching.
» A Bioconductor package for identifying off-targets with GUIDE-seq data
(A collaboration with Dr. Wolfe)
The package implements the GUIDE-seq analysis workflow in a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to different nuclease platforms that differ in the length and complexity of their guide and PAM recognition sequences or in their DNA cleavage position. They also enable users to customize sequence aggregation criteria and vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions.
Zhu LJ, Lawrence M, Gupta A, Pages H, Kucukural A, Garber M, Wolfe SA (2017). “GUIDEseq: A Bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases.” BMC Genomics, 18(1). http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3746-y.
» A Bioconductor package for analysis of high-throughput sequencing data processed by restriction enzyme digestion.
(A collaboration with Dr. Fazio)
The package includes functions to build a restriction enzyme cut site (RECS) map, distribute mapped sequences on the map with five different approaches, find enriched/depleted RECSs for a sample, and identify differentially enriched/depleted RECSs between samples.
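As a rough illustration of what building a RECS map involves, the Python sketch below scans a sequence for a recognition site and reports the cut position for each occurrence. It is not the package's implementation: the function name find_cut_sites and its parameters are hypothetical, only the forward strand is scanned, and degenerate recognition sites are not handled.

```python
import re


def find_cut_sites(sequence, recognition_site, cut_offset):
    """Return 0-based cut positions for every occurrence of recognition_site.

    cut_offset is the number of bases from the start of the site to the cut,
    e.g. 1 for EcoRI, which cuts G^AATTC.
    """
    # A zero-width lookahead reports overlapping occurrences as well.
    pattern = re.compile(f"(?={re.escape(recognition_site)})")
    return [m.start() + cut_offset for m in pattern.finditer(sequence.upper())]


# Toy sequence with two EcoRI sites (GAATTC) starting at positions 2 and 10.
ecori_cuts = find_cut_sites("AAGAATTCTTGAATTCAA", "GAATTC", 1)
```

Once cut positions are known, mapped reads can be assigned to them; the five assignment approaches mentioned above are not reproduced here.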
» A Bioconductor package for the design of target-specific guide RNAs in CRISPR-Cas9 genome-editing systems.
(A collaboration with Dr. Brodsky)
The package includes functions to find potential guide RNAs for input target sequences; optionally filter out guide RNAs without a restriction enzyme cut site or without paired guide RNAs; search genome-wide for off-targets; score and rank the candidates; fetch flanking sequences; and indicate whether the target and off-targets are located in exon regions. Potential guide RNAs are annotated with the total scores of the top 5 and top N off-targets, detailed top N mismatch sites, restriction enzyme cut sites, and paired guide RNAs. If GeneRfold is installed, the minimum free energy and bracket notation of the secondary structure of the gRNA and the gRNA backbone constant region are included in the summary file. This package leverages the Biostrings and BSgenome packages.
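The first step, finding candidate guide RNAs in a target sequence, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's API: it assumes a 20-nt protospacer followed by an NGG PAM (the common SpCas9 configuration), scans the forward strand only, and uses a hypothetical function name, find_grna_candidates. Off-target search, scoring, and ranking are not shown.

```python
import re


def find_grna_candidates(target, grna_length=20):
    """Return (start, protospacer, pam) tuples for NGG PAMs on the forward strand."""
    target = target.upper()
    # Zero-width lookahead so that overlapping candidates are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{grna_length}}})([ACGT]GG))")
    return [
        (match.start(), match.group(1), match.group(2))
        for match in pattern.finditer(target)
    ]


# Toy target: a 20-nt protospacer followed by the PAM "TGG".
candidates = find_grna_candidates("ACGTACGTACGTACGTACGT" + "TGG" + "AA")
```

A fuller sketch would also scan the reverse complement and allow other PAM patterns through a parameter.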
» A Bioconductor package for classifying putative polyadenylation sites as true or as false, internally oligo(dT)-primed sites
(A collaboration with Dr. Lawson)
This package uses the Naive Bayes classifier (from e1071) to assign probability values to putative polyadenylation sites (pA sites) based on training data from zebrafish. This allows users to separate true, biologically relevant pA sites from false, oligo(dT)-primed pA sites.
» Database of Drosophila TF DNA-binding Specificities
(A collaboration with Dr. Brodsky and Dr. Wolfe)
The FlyFactorSurvey database summarizes a project using the bacterial one-hybrid method to systematically describe the binding site preferences of transcription factors in Drosophila melanogaster.
» A Bioconductor package for annotating peaks identified from ChIP-seq, ChIP-chip, or any other high-throughput experiments
(A collaboration with Dr. Lawson and Dr. Green)
The package performs batch annotation of peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around each peak, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA, or custom features supplied by users, such as most conserved elements and other transcription factor binding sites. This package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages.
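Finding the nearest feature for a peak is, at its core, a sorted-position lookup. The Python sketch below shows one simple way to do it; it is not the package's implementation. The function name nearest_feature is hypothetical, features are reduced to a single coordinate (for example a transcription start site) on one chromosome, and strand is ignored.

```python
from bisect import bisect_left


def nearest_feature(peak_midpoint, features):
    """Return (name, signed distance) of the feature closest to peak_midpoint.

    features must be a non-empty list of (position, name) sorted by position.
    The distance is peak_midpoint - position, so it is negative when the
    peak lies before the feature coordinate.
    """
    positions = [pos for pos, _ in features]
    i = bisect_left(positions, peak_midpoint)
    # Only the neighbours on either side of the insertion point can be nearest.
    neighbours = features[max(i - 1, 0): i + 1]
    pos, name = min(neighbours, key=lambda f: abs(f[0] - peak_midpoint))
    return name, peak_midpoint - pos


# Hypothetical feature coordinates, used only to exercise the function.
toy_features = [(100, "geneA"), (500, "geneB"), (900, "geneC")]
hit = nearest_feature(450, toy_features)
```

Binary search keeps each lookup logarithmic in the number of features, which matters when annotating many peaks against a whole-genome annotation.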
» A Bioconductor package for the identification of novel alternative PolyAdenylation Sites (PAS)
(A collaboration with Dr. Green)
Alternative polyadenylation (APA) is an important post-transcriptional regulatory mechanism that occurs in most human genes. InPAS facilitates the discovery of novel APA sites from RNA-seq data. It leverages cleanUpdTSeq to fine-tune identified APA sites.
» Search tool for RNAiCore
» A tool to create motif logos of transcription factors for preview.