West Virginia University Genomics Core Facility

Bioinformatics
You've got data. We turn it into information


Here is an explanation of the files and figures I return to you for some standard analyses.

RNA-Seq Output

Files

results.txt

This file contains information on all the genes or transcripts included in the analysis.

The columns are:
GeneID Gene baseMean log2FoldChange pvalue padj

GeneID

The gene ID. Usually this will be an Ensembl gene ID, but for odd species it will be whatever annotation I can find.

Gene

The gene symbol, a more human friendly indication of what the gene is. If no gene symbol is available, the gene ID will be repeated.

baseMean

A measure of the gene's expression level, averaged across all samples.

log2FoldChange

The log2 fold change between the two groups. If the file is labeled AvB_results.txt, then this should be the fold change of A versus B. So a value of 1 means there is twice as much of A as B, and a value of -2 means there is four times as much B as A. If the direction of change is important to you, spot-check a couple of genes to make sure. Use the transformed counts file for this.
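If you want to turn a log2 fold change back into a plain ratio, the arithmetic is just 2 raised to that value. Here is a minimal Python sketch; the function names and the A/B labels are only for illustration and are not part of the files I send.

```python
def fold_change(log2_fc):
    """Convert a log2 fold change into a plain ratio of A over B."""
    return 2.0 ** log2_fc


def describe(log2_fc, a="A", b="B"):
    """Describe a log2 fold change in words, always as a ratio of 1 or more."""
    ratio = fold_change(log2_fc)
    if ratio >= 1:
        return f"{ratio:g} times as much {a} as {b}"
    # Negative log2 fold changes mean B is higher, so flip the ratio.
    return f"{1 / ratio:g} times as much {b} as {a}"


print(describe(1))   # 2 times as much A as B
print(describe(-2))  # 4 times as much B as A
```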

pvalue

The raw p value from the test of whether the gene differs between treatment and control. Don't use this value, as it is not corrected for multiple testing.

padj

The Benjamini-Hochberg adjusted p value. This is the value to use.
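For the curious, here is a sketch of the textbook Benjamini-Hochberg procedure in plain Python. It is meant only to show what the adjustment does. The analysis software may apply extra filtering before adjusting, so don't expect this to reproduce the padj column exactly from the pvalue column.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in the same order as the input."""
    m = len(pvalues)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so adjusted values never
    # increase as the raw p values get smaller.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p values, purely for illustration.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each raw p value is multiplied by the number of tests and divided by its rank, which is why the adjusted value is never smaller than the raw one.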

Significant.txt

This is the most important file I give you. It is simply a subset of the results file, containing only the genes that have an adjusted p value below 0.05.

Kinda significant.txt

This is also a subset of the results file, but here genes were selected as having an adjusted p value below 0.1 and a fold change (not log2 fold change) of 1.5 or larger. Only use this file if you are desperate.
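If you would rather apply your own cutoffs to the results file, the two subsets can be approximated with a few lines of Python. This is a sketch under some assumptions: that the file is tab-delimited with the header row listed above, that missing values appear as NA or blank, and that "a fold change of 1.5 or larger" applies in either direction. The example rows are made up.

```python
import csv
import io


def load_results(handle):
    """Read a tab-delimited results table into a list of dicts, one per gene."""
    return list(csv.DictReader(handle, delimiter="\t"))


def to_float(text):
    """Parse a number, returning None for blanks, NA, or other non-numbers."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def significant(rows, alpha=0.05):
    """Genes with an adjusted p value below alpha."""
    kept = []
    for row in rows:
        padj = to_float(row["padj"])
        if padj is not None and padj < alpha:
            kept.append(row)
    return kept


def kinda_significant(rows, alpha=0.1, min_fold=1.5):
    """Genes with padj below alpha and a fold change of at least min_fold."""
    kept = []
    for row in rows:
        padj = to_float(row["padj"])
        lfc = to_float(row["log2FoldChange"])
        if padj is None or lfc is None:
            continue
        # 2 ** |log2FC| is the fold change regardless of direction (assumption).
        if padj < alpha and 2 ** abs(lfc) >= min_fold:
            kept.append(row)
    return kept


# Made-up example rows, for illustration only.
EXAMPLE = (
    "GeneID\tGene\tbaseMean\tlog2FoldChange\tpvalue\tpadj\n"
    "g1\tAaa\t100\t2.0\t0.0001\t0.01\n"
    "g2\tBbb\t50\t-1.0\t0.02\t0.08\n"
    "g3\tCcc\t75\t0.2\t0.03\t0.08\n"
    "g4\tDdd\t5\t3.0\t0.2\tNA\n"
)

rows = load_results(io.StringIO(EXAMPLE))
print([r["GeneID"] for r in significant(rows)])
print([r["GeneID"] for r in kinda_significant(rows)])
```

To use it on a real file, pass an open file handle to load_results instead of the in-memory example.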

Transformed counts

For each gene, the transformed counts for each sample are given. These are counts which have been normalized across samples and rlog transformed. They are useful for checking the direction of the fold change, or for plotting in heatmaps.

Figures

I create several figures with the analysis. They are mostly for quality control purposes, but can give some insight into how the data is behaving.

Cluster Dendrogram

Distances between samples are calculated using the rlog transformed counts, and then hierarchical clustering is done. The height of the bars is proportional to the distance between the samples.
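To give a feel for what "distance between samples" means, here is a small Python sketch that computes pairwise distances from transformed counts. It assumes Euclidean distance, which is a common choice but is my assumption for this example. The sample names and numbers are made up, and the clustering step itself is left out.

```python
import math


def sample_distances(counts):
    """Pairwise Euclidean distances between samples.

    counts maps each sample name to its list of transformed counts,
    with genes in the same order for every sample.
    Returns a dict keyed by (sample_a, sample_b).
    """
    names = sorted(counts)
    distances = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distances[(a, b)] = math.dist(counts[a], counts[b])
    return distances


# Made-up transformed counts for three genes in three samples.
example = {
    "A1": [5.0, 7.0, 9.0],
    "A2": [5.0, 7.0, 10.0],
    "B1": [8.0, 3.0, 9.0],
}

for pair, d in sample_distances(example).items():
    print(pair, round(d, 2))
```

In this toy example the two A samples are much closer to each other than either is to B1, which is the pattern you hope to see in the dendrogram.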


Heatmap 1

This presents the same information as the cluster dendrogram, that is, the distance between samples. Dark colors mean the samples are similar.


Heatmap 2

To create this heatmap, I pick the 30 most differentially expressed genes, and display the rlog transformed counts for each sample.


MA Plot

Each gene is plotted here, with log2 fold change versus mean expression. Genes with significant (adjusted) p-values are in red. Truthfully, this one doesn't tell me a lot, but some people like it. If something is horribly wrong in the data, it can sometimes be seen here.


Volcano Plot

This one also plots every gene, but here it is -log10 p-value versus log2 fold change. Points are colored by level of significance.
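The coordinates of a volcano plot are easy to compute yourself if you want to redraw it. A minimal Python sketch follows. The floor argument is a hypothetical safeguard I added so that a p value of exactly 0 does not break the logarithm, and the example values are made up.

```python
import math


def volcano_point(log2_fc, pvalue, floor=1e-300):
    """Return the (x, y) position of one gene on a volcano plot.

    x is the log2 fold change and y is -log10 of the p value.
    floor (hypothetical) keeps a p value of 0 from raising an error.
    """
    return (log2_fc, -math.log10(max(pvalue, floor)))


# A gene with log2 fold change 1.5 and p value 0.001 lands at about y = 3.
print(volcano_point(1.5, 0.001))
```

Smaller p values end up higher on the plot, so the interesting genes sit in the upper left and upper right corners.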


PCA

Principal Component Analysis. Hopefully your samples cluster by treatment.





For questions, help, or to offer a beer, get in touch with the bioinformatician, Niel Infante