CeNGEN splicing data portal


Both the website and dataset are under active construction, and may undergo maintenance without notice. Feel free to send me feedback.

Differential splicing (local quantification)

Transcript quantification (less reliable)

Raw data in genome browser

Documentation

This module allows direct visualization of the sequencing data for any genomic locus. For each sample, a density plot indicates the number of reads aligned at each genomic position, divided by the total number of reads in that sample and multiplied by one million (unit: CPM). In addition, a splice junction track indicates the number of junction-spanning reads supporting each junction. For each neuron type, the coverage is averaged across all samples at each genomic position, to give a mean coverage for that neuron. Similarly, the junction-spanning reads are summed for each junction, to give a total junction usage track for that neuron. Finally, to allow rapid examination of a genomic locus across neurons, additional "global" tracks are available.

Genes that are expressed in a given sample typically have 0.2-1.5 CPM along the gene body. Coverage higher than 1 CPM suggests a highly expressed gene. A useful approach to examine a given genomic locus is to first visualize the "global" statistics, and look for variable regions. Then, examine these regions using the means per neuron. Finally, if a region of interest appears differentially used between neuron types, displaying the individual samples gives an idea of the robustness. Junctions and coverage often provide complementary information, so it is recommended to look at both.
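The normalization and aggregation described above can be sketched in a few lines. The tracks on this website were processed with R/Bioconductor; the Python below is only an illustration, and the function names, junction coordinates, and read counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def to_cpm(raw_coverage, total_reads):
    """Scale per-position read counts to counts per million (CPM)."""
    return [count * 1_000_000 / total_reads for count in raw_coverage]


def mean_coverage(sample_tracks):
    """Average several equal-length CPM tracks, position by position."""
    return [mean(values) for values in zip(*sample_tracks)]


def total_junctions(sample_junctions):
    """Sum junction-spanning read counts over samples, junction by junction."""
    totals = defaultdict(int)
    for junctions in sample_junctions:
        for junction, reads in junctions.items():
            totals[junction] += reads
    return dict(totals)


# Hypothetical toy data: two samples of one neuron type, four positions each.
sample_a = to_cpm([2, 4, 4, 0], total_reads=2_000_000)
sample_b = to_cpm([3, 3, 6, 0], total_reads=3_000_000)
neuron_mean = mean_coverage([sample_a, sample_b])

# Junctions are keyed by (chromosome, start, end); the counts stay raw.
neuron_junctions = total_junctions([
    {("I", 100, 250): 12, ("I", 100, 400): 3},
    {("I", 100, 250): 8},
])
```

Note that the coverage is depth-normalized before averaging, whereas the junction counts are summed as raw read numbers.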

There are two annotation tracks available: a copy of the Wormbase annotation this analysis is based on ("canonical_geneset"), and a "novel" extended annotation. The color indicates the strand a gene is annotated on. The "novel" annotation contains potential novel transcripts annotated by StringTie, which will appear in grey. Note that, while the Wormbase-annotated transcripts have exons and CDS (where the CDS contain only coding sequence, no UTR), the novel transcripts only have exons annotated, as their actual coding sequence is unknown.

The "sj" tracks are displayed by default as arcs, whose thickness is proportional to the logarithm of the number of reads. By clicking on the "..." next to the track name, you can select a LinearBasicDisplay, where each splice junction is represented by a rectangle, with the number of reads printed below.

Applications and limitations

This visualization makes no assumption about the genomic structure, and allows direct observation of constitutive and alternative splicing (whether differentially used or not), non-coding RNAs, or unannotated genes or exons. The coverage values are only normalized by sequencing depth, and the junction values are not normalized at all; this can be considered raw data.

Neither the coverage nor the junction values are normalized by gene; thus, when averaging between neurons with different expression levels, the highest expressing neuron can dominate the result. Further, the criteria of what represents "high" and "low" expression can differ between genes. Finally, this is not a proper quantification: the other tools on this website will usually give more accurate answers to splicing questions. Visualizing the read coverage is a good way to verify the predictions of the splicing quantification.
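A small worked example may help to see why the highest expressing neuron can dominate an average. The CPM values below are hypothetical, and the "inclusion ratio" (cassette exon coverage divided by constitutive exon coverage) is only a rough illustration, not one of the quantifications used on this website.

```python
# Hypothetical CPM values for a constitutive exon and a cassette exon.
coverage = {
    "neuron_high": {"constitutive": 10.0, "cassette": 9.0},   # cassette mostly included
    "neuron_low": {"constitutive": 0.5, "cassette": 0.05},    # cassette mostly skipped
}


def inclusion_ratio(track):
    """Rough inclusion estimate: cassette coverage over constitutive coverage."""
    return track["cassette"] / track["constitutive"]


# Ratio computed separately in each neuron.
per_neuron = {name: inclusion_ratio(track) for name, track in coverage.items()}

# Averaging the raw coverage first, as a mean track across neurons would do.
mean_track = {
    exon: sum(track[exon] for track in coverage.values()) / len(coverage)
    for exon in ("constitutive", "cassette")
}
ratio_of_means = inclusion_ratio(mean_track)

# Giving each neuron the same weight instead.
mean_of_ratios = sum(per_neuron.values()) / len(per_neuron)
```

Here the two neurons have opposite behaviors (about 0.9 and 0.1), yet the averaged coverage suggests an inclusion close to 0.86, because the neuron with 20-fold higher expression contributes almost all of the signal.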

Technical considerations

The tracks were processed using R/Bioconductor, and loaded in JBrowse2. The individual tracks can be downloaded from here and loaded into a different genome browser (e.g. Wormbase's JBrowse, or UCSC).

StringTie was used in quantification mode (Pertea et al., 2016) to estimate the expression level of each transcript. This view is often easier to interpret (as transcripts directly reflect the biology we are interested in), but is typically less reliable than local quantifications. The transcriptome used was Wormbase WS281, augmented by ~8,000 novel transcripts in existing genes. These were obtained using StringTie in transcriptome reconstruction mode with a mix of short reads and long reads (Kovaka et al., 2019; Shumate et al., 2021), followed by filtering. These transcripts have names starting with STRG, and can be visualized in the genome browser with the "novel" track.

The quantification is accessible by two modules.

Single-gene mode

In single-gene mode, you can input a single gene name and a combination of neurons. Use ALL for all neurons, individual neuron names (e.g. "AWA", "ASEL", or "OLL"), or keywords such as "ACh", "motor", "sensory", ... You can combine keywords and neuron names as needed. If more than one gene is entered, only the first one is taken into consideration.
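The way keywords and neuron names combine can be pictured with the sketch below. It is not the portal's code: the keyword table, the neuron list, the accepted separators (spaces or commas), and the function names are all assumptions made for illustration.

```python
# Hypothetical keyword table and neuron list; the real lists are not given here.
KEYWORDS = {
    "sensory": ["AWA", "ASEL", "OLL"],
    "motor": ["DA", "VB"],
}
ALL_NEURONS = ["ASEL", "AWA", "DA", "OLL", "VB"]


def resolve_neurons(query):
    """Expand ALL, keywords, and neuron names into a list without duplicates."""
    selected = []
    for token in query.replace(",", " ").split():
        if token == "ALL":
            names = ALL_NEURONS
        elif token in KEYWORDS:
            names = KEYWORDS[token]
        elif token in ALL_NEURONS:
            names = [token]
        else:
            raise ValueError(f"unknown neuron or keyword: {token}")
        for name in names:
            if name not in selected:
                selected.append(name)
    return selected


def first_gene(query):
    """Single-gene mode: keep only the first gene of the input."""
    genes = query.replace(",", " ").split()
    if not genes:
        raise ValueError("no gene given")
    return genes[0]
```

For instance, `resolve_neurons("motor AWA")` selects the two toy motor neurons plus AWA, and `first_gene("unc-40 ric-4")` keeps only `unc-40`.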

Three plots will be produced. The first one displays the average TPM value for that gene's transcripts, across all sequenced neurons. This is a convenient way to get a glimpse of the general usage of this gene in the nervous system. The second one displays, for that gene, the proportion of its expression that can be attributed to each of its transcripts. This makes identifying an isoform switch easy. However, when a gene is lowly or not expressed, this visualization can be misleading. The third plot allows direct visualization of the transcript expression levels, and enables distinguishing between neurons where no transcript of that gene is expressed, and neurons where some transcripts are highly expressed.
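The proportion plot, and the reason it can mislead for lowly expressed genes, can be illustrated as follows. This is a minimal sketch with hypothetical isoform names and TPM values, not the code behind the plots.

```python
def transcript_proportions(tpm_by_transcript):
    """Fraction of a gene's total TPM attributed to each of its transcripts."""
    total = sum(tpm_by_transcript.values())
    if total == 0:
        # No expression at all: the proportions are undefined.
        return {transcript: None for transcript in tpm_by_transcript}
    return {
        transcript: tpm / total
        for transcript, tpm in tpm_by_transcript.items()
    }


# Hypothetical TPM values for the same gene in two neurons.
well_expressed = {"isoform_a": 30.0, "isoform_b": 10.0}
barely_expressed = {"isoform_a": 0.01, "isoform_b": 0.03}

proportions_high = transcript_proportions(well_expressed)
proportions_low = transcript_proportions(barely_expressed)
```

The proportions flip between the two neurons (0.75/0.25 versus 0.25/0.75), which looks like an isoform switch, although the second neuron has a total of only 0.04 TPM. Checking the absolute expression levels in the third plot avoids this misreading.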

Selecting the checkbox "Plot individual samples" will represent each sample rather than neuron-level aggregates. The choice of color scale is customizable.

Heatmap

The choice of gene and neuron is as described above, except that several genes can be input at once. In addition, there are several options for normalization.

The choice of color scale is customizable.

Finally, you have the possibility to download the underlying data as a table (tab-separated values, can be opened with Excel). The downloaded dataset will contain the mean TPM value for selected neurons and transcripts (unnormalized). The heatmap itself can be downloaded as an SVG (compatible with Inkscape, Adobe Illustrator, Affinity Designer, ...) for further editing. The SVG reflects the displayed heatmap, using the same neurons and transcripts, normalizations, and color scale. The downloaded file name has the date and time of download (time in UTC, may differ from your local time), the name of the selected genes (truncated to 20 characters), and the name of the selected neurons (truncated to 20 characters).
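As an illustration of how such a file name can be assembled, here is a minimal sketch. The exact format used by the website is not documented here, so the timestamp layout, the separators, and the function names are assumptions; only the UTC time and the truncation to 20 characters come from the description above.

```python
from datetime import datetime, timezone


def truncate(items, limit=20):
    """Join the selected names and keep at most `limit` characters."""
    return "-".join(items)[:limit]


def download_name(genes, neurons, now=None, extension="tsv"):
    """Build a file name from a UTC timestamp, the genes, and the neurons."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d-%H%M%S")  # assumed timestamp layout
    return f"{stamp}_{truncate(genes)}_{truncate(neurons)}.{extension}"


# Fixed timestamp so that the example is reproducible.
example = download_name(
    ["unc-40", "ric-4", "unc-13", "eat-4"],
    ["AWA", "ASEL"],
    now=datetime(2022, 7, 21, 14, 30, 0, tzinfo=timezone.utc),
)
```

With a long gene selection the name is cut in the middle of the list, so the file name alone may not identify every gene that was downloaded.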

Applications and limitations

Because the data was produced with short reads, this quantification is less accurate than a local quantification.

The different visualizations available can make some aspects clearer or obscure them. It is crucial to combine several visualizations before drawing conclusions.

As the transcript level typically reflects the biology, this quantification can be easier to interpret. It can be a great way to explore one or several genes before using more detailed tools.

Technical considerations

The transcriptome reconstruction was performed with StringTie v2.2.1 using a random subsample of the reads used here, along with (unpublished) PacBio long reads. A custom script was used to select potential novel transcripts for existing genes, discarding transcripts spanning several genes or in intergenic regions. These new transcripts have identifiers of the form "STRG.1.1" and can be visualized in the genome browser on this website. WS281 was used for reference-guided reconstruction.

The quantification was performed with StringTie using the "-e" and "-B" options ("-eB"). TPM values were extracted for each gene in each sample.
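In StringTie, "-e" restricts the estimation to the transcripts of the provided annotation, and "-B" additionally writes Ballgown table files. The output GTF carries a TPM attribute on each "transcript" line. The sketch below shows one way to read these values; it is not the script used for this website, and the function name and example lines are hypothetical.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def transcript_tpm(gtf_lines):
    """Collect the TPM attribute of every 'transcript' line of a StringTie GTF."""
    tpm = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # exon lines carry no TPM
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        if "TPM" in attributes:
            tpm[attributes["transcript_id"]] = float(attributes["TPM"])
    return tpm


# Hypothetical lines in the layout of a StringTie output GTF.
example_gtf = [
    "# StringTie version 2.2.1",
    'I\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "geneA"; transcript_id "geneA.1"; cov "12.5"; FPKM "3.1"; TPM "7.25";',
    'I\tStringTie\texon\t100\t300\t1000\t+\t.\t'
    'gene_id "geneA"; transcript_id "geneA.1"; exon_number "1"; cov "12.5";',
    'I\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "geneA"; transcript_id "STRG.1.1"; cov "2.0"; FPKM "0.5"; TPM "1.5";',
]

tpm_values = transcript_tpm(example_gtf)
```

On a real file, the same function can be applied to the open file handle, since it only iterates over lines.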

References

Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278 (2019). DOI:10.1186/s13059-019-1910-1

Pertea M, Kim D, Pertea GM, Leek JT, Salzberg SL. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols 11, 1650-1667 (2016). DOI:10.1038/nprot.2016.095

Shumate A, Wong B, Pertea G, Pertea M. Improved Transcriptome Assembly Using a Hybrid of Long and Short Reads with StringTie. bioRxiv (2021) 2021.12.08.471868

Splicing was quantified using MAJIQ (see Vaquero-Garcia et al., 2016 for detailed explanations). Briefly, splice junctions (SJ) starting from the same exon (or splice junctions arriving into the same exon) are grouped in a Local Splicing Variation (LSV). For each LSV, the relative usage of each possible SJ is quantified, and a Percent Selected Index (PSI) is estimated with a Bayesian model. This way, a canonical exon-skipping event is represented as two LSVs: an upstream LSV containing two SJs (one that links the upstream exon to the cassette exon, and one that skips the cassette to link to the downstream exon), and a downstream LSV also containing two SJs (one from the upstream exon into the downstream exon, the other from the cassette exon into the downstream exon).
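To make the LSV representation concrete, the sketch below describes an exon-skipping event as two LSVs and computes a naive PSI as the fraction of reads supporting each SJ. This is a deliberate simplification: MAJIQ estimates PSI with a Bayesian model that accounts for uncertainty, which a simple ratio does not. The junction names and read counts are hypothetical.

```python
def naive_psi(junction_reads):
    """Fraction of the LSV's reads supporting each junction (no Bayesian model)."""
    total = sum(junction_reads.values())
    if total == 0:
        return None  # not quantifiable, e.g. gene not expressed in that neuron
    return {junction: reads / total for junction, reads in junction_reads.items()}


# Upstream LSV: both junctions start from the upstream exon.
upstream_lsv = {
    "upstream->cassette": 30,
    "upstream->downstream": 10,
}

# Downstream LSV: both junctions arrive into the downstream exon.
downstream_lsv = {
    "cassette->downstream": 28,
    "upstream->downstream": 10,
}

psi_upstream = naive_psi(upstream_lsv)
psi_downstream = naive_psi(downstream_lsv)
```

The skipping junction appears in both LSVs, so the same event is quantified twice, once from each side of the cassette exon.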

The portal generates one page for each gene. At the top, a splice graph displays all exons and SJs. The splice graph can be displayed for individual neuron types or samples to see the number of reads spanning each SJ. Below, each LSV is represented schematically, and violin plots display the PSI of each SJ in each neuron type. In many cases, there were insufficient reads to quantify an SJ in a given neuron (e.g. when the gene is not expressed in that neuron), and the violin plot is left blank.

Applications and limitations

This visualization is most appropriate to establish differential splicing across neuron types.

This quantification only considers an LSV when there are differences in SJ. Thus, two isoforms distinguished only by the number of exons but with no branch in the splice graph will not appear differentially spliced.

Technical considerations

The MAJIQ quantifications and the VOILA server used for displaying them are described in Vaquero-Garcia et al. (2016, 2021).

References

Jorge Vaquero-Garcia, Alejandro Barrera, Matthew R Gazzara, Juan González-Vallinas, Nicholas F Lahens, John B Hogenesch, Kristen W Lynch, Yoseph Barash, A new view of transcriptome complexity and regulation through the lens of local splicing variations, eLife 2016;5:e11752 DOI:10.7554/eLife.11752

Jorge Vaquero-Garcia, Joseph K. Aicher, Paul Jewell, Matthew R. Gazzara, Caleb M. Radens, Anupama Jha, Christopher J. Green, Scott S. Norton, Nicholas F. Lahens, Gregory R. Grant, Yoseph Barash, RNA splicing analysis using heterogeneous and large RNA-seq datasets, bioRxiv 2021.11.03.467086 DOI: 10.1101/2021.11.03.467086

Note: this website uses Google Analytics to monitor total numbers of connections. The exact pages visited (gene/neurons etc) are not recorded.

News

2022-12-01: update JBrowse to 2.2.1, add strand color in gff track, change sj tracks default display to Arcs, use "novel" gff3 version 221130 with exons.

2022-10-20: Google Analytics.

2022-07-21: transcript-level: multiple improvements (faster, buttons, download heatmap, heatmap color scales, Wormbase browser link, average across neurons, ...).

2022-07-19: transcript-level quantification backend improved (starts faster).

2022-04-08: transcript-level quantification app updated with a new interactive heatmap. Documentation updated.

2022-03-23: Data processing update; any multimapping read was removed, new transcripts added by comparing with (unpublished) long reads data (see e.g. Y37E3.30).

2022-02-09: Adding minimal app for isoform quantification.

2022-01-31: button order changed.

2022-01-25: Updated help. Aesthetic changes in landing page.

2022-01-18: splice junctions available in the browser.

2021-12-06: VOILA server available. The individual samples are again available in the browser.

2021-11-30: removing two samples that failed QC. The individual samples are no longer available.

2021-11-23: "lower" and "higher" now correspond to the 3rd lowest/highest sample.

2021-11-18: "lower" and "higher" now correspond to the 3rd lowest and highest neurons.

2021-11-12: "lower" and "higher" now correspond to the 3rd lowest and 4th highest neurons.

2021-11-12: annotation updated to WS281 to match bw tracks.

2021-11-10: alignments on WS281 (vs 277 previously). In addition, min and max tracks replaced by median, 10th percentile, and 90th percentile.