West Virginia University Genomics Core Facility

Bioinformatics
You've got data. We turn it into information


This is how I do some of the standard analyses. I am constantly evaluating and changing my procedures, so this may be out of date. You will want to change file names, directory structure, and such. If running on the HPC, you'll need to wrap the commands in a script to submit. You may (will) have to load modules, and some of the commands may be (are) slightly different on different machines.

RNA-Seq Workflow

For this analysis, I set up my working directory with the following sub-directories: reads which contains the raw fastq files, map which contains the output of the mapping program, count which will contain raw count files, and deseq which has the output of the deseq workflow. I generally do not do any quality trimming unless the quality is very bad. Be sure to look at your data first with fastQC to get a sense of what you have.

STAR Aligner

This produces an unsorted sam file. This has to be done for every pair of fastq files, so its ususaly easiest to put this in a loop.
STAR --genomeDir ~/genome/sheep/star/ --genomeLoad NoSharedMemory --runThreadN 10 --outFileNamePrefix map/HL3_ --readFilesIn reads/9_2131ln_R1.fastq  reads/9_2131ln_R2.fastq 

featureCounts

Before running I usually rename the sam files. STAR names them prefixAligned.sam, and I dont like that, so I use a quick: for f in *.sam; do echo $f; mv $f ${f%Aligned*}.sam; done . This will give you just prefix.sam. Not necessary, just personal preference.
featureCounts *.sam -a ~/genome/sheep/annotation/oar3.1.gtf -B -o ../count/all.count -p -T 12 --ignoreDup
-p Use this for paired end reads. It will count the fragment once, not both ends.
-B Only count fragments that have both ends mapping.
--ignoreDup Only count uniquely mapping reads.
-T is number of threads to use.

I like to break the output file into a separate count file for each sample, labeled by sample name. This makes it easier to use subsets of the data if you have multiple comparissons. I do this with the unix command: for i in {7..13}; do echo $i; sed '1d' all.count | cut -f 1,$i > tmp; nm=`head -n 1 tmp | cut -f 2`; nm=${nm%.sam}; sed -i '1d' tmp; mv tmp $nm; done . Adjust the second number depending on hw many samples you have.

DESeq2

Do this in R.

I first write a parameter file as an R script:
# Organism
library('org.Hs.eg.db')
orgDB <- org.Hs.eg.db

# Input and Output

directory<-"count"
outDir <- "deseq"
outPrefix <- "Male_LVHvsNF"

# The Stats

PCA_Group <- 'condition'
design =~ condition + batch + bmi
contrast <- c('condition','LVH','NF')


# File Names and Conditions

sampleFiles <- c('106','44', '59', '63', '9', '20','74','78','79','81','84','90', '91', '34', '40', '77', '163','17', '99', '123')
sampleCondition <- c('LVH','LVH','LVH','LVH','NF','NF','NF','NF','NF','NF','NF','NF', 'LVH','LVH','LVH','LVH','LVH','LVH','LVH','LVH')
sampleBatch <- c('B1', 'B1', 'B1', 'B1', 'B1','B1','B1','B1','B1','B1','B1','B1', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2', 'B2')
sampleBMI <- c('L', 'O', 'O', 'L', 'L', 'O', 'L', 'O', 'O', 'O', 'L', 'L',  'L',  'L',  'O', 'O',  'L', 'O', 'O', 'O')
sampleSex <-  c('M', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M',  'F',  'F',  'F', 'F',  'M', 'M', 'M', M')

sampleTable <- data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition, batch=sampleBatch, bmi=sampleBMI, sex=sampleSex)

sampleTable <- sampleTable[sampleTable$sex=='M',]
Then I go into RStudio and load up the following. You can just run the whole thing, but I ususally step through it and tweak things as I go.This is by no means perfect, and will have to change from project to project, but should give a good sense of what I do.
# A template to run DESeq2
#
# Add in the right info, then copy and paste into R

library('DESeq2')
library('dplyr')
library('AnnotationDbi')
library('gplots')
library('calibrate')
library("RColorBrewer")

# Put all the experiment specific values in here
source('MvFparams.R')

ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory, design=design)

dds<-DESeq(ddsHTSeq)
res<-results(dds, contrast=contrast)
res<-res[order(res$padj),]
head(res)

# Get gene names
res$Gene <- mapIds(orgDB, keys=row.names(res), column='SYMBOL', keytype='ENSEMBL', multivals='first')
res$ID <- row.names(res)

# Use Gene ID in place where there is no gene name
idx <- is.na(res$Gene)
res$Gene[idx] <- res$ID[idx]

outResults <- data.frame(GeneID=res$ID, Gene=res$Gene, baseMean=res$baseMean, log2FoldChange=res$log2FoldChange, pvalue=res$pvalue, padj=res$padj)

name <- paste(outDir, '/', outPrefix, '_results.txt', sep="") 
write.table(outResults, file=name, sep="\t", quote=F, row.names=F)

# Significant genes

r2 <- res[!(is.na(res$padj)),]
resSig <- r2[ r2$padj < 0.05, ]
resTable <- data.frame(GeneID=row.names(resSig), Gene=resSig$Gene, baseMean=resSig$baseMean, log2FoldChange=resSig$log2FoldChange, pvalue=resSig$pvalue, padj=resSig$padj)

write.table(resTable,file=paste(outDir, "/", outPrefix, "_significant.txt", sep=""), sep="\t", quote=F, row.names=F)


# Almost significant genes
# Close on padj, and have a fold change of 1.5

almostSig <- r2[ r2$padj < 0.1 & abs(r2$log2FoldChange) > 0.58, ]
almostSigTable <- data.frame(geneID=row.names(almostSig), Gene=almostSig$Gene,  baseMean=almostSig$baseMean, log2FoldChange=almostSig$log2FoldChange, pvalue=almostSig$pvalue, padj=almostSig$padj)

write.table(almostSigTable,file=paste(outDir, "/", outPrefix, "_kinda_significant.txt", sep=""), sep="\t", quote=F, row.names=F)



######## Transformations

rld <- rlogTransformation(dds, blind=TRUE)
vsd <- varianceStabilizingTransformation(dds, blind=TRUE)

d <- as.data.frame(assay(rld))
d$Gene <- mapIds(orgDB, keys=row.names(d), column='SYMBOL', keytype='ENSEMBL', multivals='first')
d$ID <- row.names(res)


write.table(d, file=paste(outDir, "/", outPrefix, "_TransformedCounts.txt", sep=""), sep="\t", quote=F, row.names=F)


#########  MA Plot   #########

name <- paste(outDir, '/', outPrefix, '_MAplot.png', sep="") 
png(name)
plotMA(dds)
dev.off()



#########  Heatmap   #########

select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30]
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

distsRL <- dist(t(assay(rld)))
mat <- as.matrix(distsRL)
rownames(mat) <- colnames(mat) <- with(colData(dds), paste(condition,sampleTable$sampleName , sep=" : "))

name <- paste(outDir, '/', outPrefix, '_heatmap.png', sep="") 
png(name)
heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(16, 16))
dev.off()


###########  Cluster   ###########

name <- paste(outDir, '/', outPrefix, '_cluster.png', sep="") 
png(name)
plot(hclust(dist(t(assay(rld)))), main='Dendrogram', xlab='', sub='')
dev.off()



############  PCA    ###########

name <- paste(outDir, '/', outPrefix, '_PCA.png', sep="") 
png(name)
print(plotPCA(rld, intgroup=c(PCA_Group)))
dev.off()


########### Heatmap

# This picks the top 30 genes, ranked by absolute fold change, and does a heatmap of the normalized expression

d <- assay(rld)

best <- res[order(abs(res$log2FoldChange),decreasing = T)[1:30],]

m <- d[row.names(d) %in% row.names(best),]

hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

name <- paste(outDir, '/', outPrefix, '_heat.png', sep="") 
png(name)
heatmap.2(m, col=hmcol, dendrogram='none', trace='none', margin=c(10,6), density.info='none')
dev.off()



####  Volcano

name <- paste(outDir, '/', outPrefix, '_volcano.png', sep="") 
png(name)

par(pch = 16)
with(res, plot(log2FoldChange, -log10(pvalue), main = "Volcano plot"))
with(subset(res, padj < 0.05), points(log2FoldChange, -log10(pvalue), col = "red"))
with(subset(res, abs(log2FoldChange) > 2), points(log2FoldChange, -log10(pvalue),  col = "orange"))

with(subset(res, padj < 0.05 & abs(log2FoldChange) > 2), points(log2FoldChange,  -log10(pvalue), col = "green"))

# Add legend
legend("bottomleft", legend = c("FDR<0.05", "|LFC|>2", "both"), pch = 16, col = c("red", "orange", "green"))

# Label Extra significant points
with(subset(res, padj < 0.05 & abs(log2FoldChange) > 2), textxy(log2FoldChange, -log10(pvalue), labs = Gene, cex = 1))

# Label all significant
with(subset(res, padj < 0.05), textxy(log2FoldChange, -log10(pvalue), labs = Gene, cex = 1))

dev.off()





For questions, help, or to offer a beer, get in touch with the bioinformatician, Niel Infante