This is an old revision of the document!

Next Generation Sequencing (NGS) technologies are increasingly being used for gene expression profiling as a replacement for microarrays. The expression level given by these technologies is the number of reads in the library mapping to a given feature (gene, exon, transcript, etc.), i.e., the read counts. Most of the statistical methods for assessment of differential expression using count data rely on parametric assumptions about the distribution of the counts (Poisson, Negative Binomial, …). Moreover, many of them need replicates to work and tend to have problems to evaluate differential expression in features with low counts.

NOISeq is a non-parametric approach for the identification of differentially expressed genes from count data. NOISeq empirically models the noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the features in samples within the same condition. This reference distribution is then used to assess whether the M-D values computed between two conditions for a given gene is likely to be part of the noise or represent a true differential expression.

The are two variants of the method: NOISeq-real uses replicates, when available, to compute the noise distribution and, NOISeq-sim simulates them in absence of replication. It should be noted that the NOISeq-sim simulation procedure assimilates to technical replication and does not reproduce biological variability, which is necessary for population inferential analysis.

Please, find here an outline of the NOISeq method.

NOISeq method has been implemented in R language.

Tutorial

This tutorial is intended to guide you through the use of NOISeq R functions to analyze count data coming from next generation sequencing technologies. First, we describe the input data format of the method. Then, we suggest you to explore your data using NOISeq functionalities in order to learn more about saturation, contamination or other biases in your data. Finally, we show how to compute differential expression between two experimental conditions with NOISeq.

To start, just download the R script with NOISeq functions here and save it into the directory where your data are. Launch R in that directory and load NOISeq functions with the following command:

> source("NOISeq.r")

Input data

NOISeq accepts basically two types of data:

expression data, i.e. (normalized) counts
annotation data, i.e. information about features (for example, length and biotype)

All data must be provided in a tab-delimited txt file.

Expression data

First column: Features names or IDs.
Rest of columns: Expression values for each sample.
Example file where biological features are genes and there are 2 replicates per condition:

GeneID	condA_rep1	condA_rep2	condB_rep1	condB_rep2
ENSG00000230254	25	131	556	632
ENSG00000231674	3008	2966	55	64
ENSG00000227983	944	1024	997	854
ENSG00000185272	0	2	10	22
ENSG00000252478	5	9	4	3
…	…	…	…	…

The function to read txt files and convert them into an R object to be used by NOISeq functions is readData:

> mydata <- readData(file = "DataFile.txt", cond1 = c(2:5), cond2 = c(6:9), header = TRUE)

If your txt file does not contain a header, you should change the header parameter to header = FALSE.

Length of the features

One of the biases of RNA-Seq technology is the dependence of the number of counts on the length of the biological feature. If you wish to normalize expression data by the length of the feature, you must provide NOISeq with this information in a txt file like this:

GeneID	Length
ENSG00000230254	2557
ENSG00000231674	1210
ENSG00000227983	457
ENSG00000185272	8590
ENSG00000252478	52
…	…

The function readInfo can be used to read the txt file containing the names of the features and their length:

> mylength <- readInfo(file = "LengthFile.txt", header = TRUE)

Again, change the header parameter to header = FALSE if your txt file does not contain a header.

Other biological information

The biological information will be used in the exploratory analysis and it can be any kind of classification of the features. For instance, if the features are genes, the biological information could be the biotypes of those genes, according to the Ensembl database. Then, the txt file NOISeq need would be like this:

GeneID	Biotype
ENSG00000230254	protein_coding
ENSG00000231674	processed_transcript
ENSG00000227983	pseudogene
ENSG00000185272	protein_coding
ENSG00000252478	snRNA
…	…

The function readInfo can also be used to read the biological information file:

> mybioinfo <- readInfo(file = "BioinfoFile.txt", header = TRUE)

Exploratory analysis

Coming soon!!

Differential expression

The NOISeq method computes differential expression between two experimental conditions given the expression level of the considered features. By default, the algorithm transforms the counts into counts per million reads. Other normalization techniques are implemented such as the Upper Quartile (Bullard et al. 2010), the Trimmed Mean of M values (Robinson and Oshlack 2010) or RPKM (Mortazavi et al. 2008) if the length of the features is provided. NOISeq also accepts normalized expression values instead of counts in case the user would prefer to apply another normalization procedure.

NOISeq takes the normalized counts to obtain the statistics needed to derive differential expression, which are the log-ratio (M) and the absolute value of the difference (D). Expression levels equal to 0 are replaced with a certain constant k>0, in order to avoid infinite or undetermined M-values.

A feature is considered to be differentially expressed if its corresponding M and D values are likely to be higher than noise values. Hence, to compute this probability we need to estimate M and D distribution in noise data, where “noise” means the variability observed when comparing replicates within the same experimental condition. This variability is due to technical or biological causes (depending on the nature of the compared replicates). Changes in expression between conditions with the same magnitude than changes in expression between replicates within the same condition should not be considered as differential expression. NOISeq estimates empirically the probability distribution of M and D statistics in the noise by comparing expression levels between each pair of replicates within the same experimental condition and pooling together the corresponding M and D values. To build this distribution, the absolute value of M is used, since the sign of changes is an arbitrary result and only the magnitude of the change is biologically meaningful.

Once the probability of differential expression has been computed for each feature, the odds Pr(differential expression)/Pr(non-differential expression) is used to decide whether a feature is differentially expressed between conditions or not. For instance, an odds value of 4:1 is equivalent to Pr(differential expression) = 0.8 and it means that the feature is 4 times more likely to be differentially expressed than non-differentially expressed. This probability is the input parameter q that the algorithm takes as the cutoff for declaring differential expression.

The NOISeq algorithm compares replicates within the same condition to estimate noise distribution. Two versions of NOISeq method have been developed: NOISeq-real, that uses available reaplicates, and NOISeq-sim, that simulates technical replicates from the data.

Input parameters

Depending on the characteristics of the available samples, we have to choose the appropriate values for the parameters in the corresponding NOISeq option. This is a short description for the input parameters in NOISeq:

repl: The type of replicates: “tech” for technical replicates and “bio” for biological ones.
k: When counts = 0, 0 will be changed to k. By default, k = 0.5.
norm: Normalization method. It can be one of “rpkm” (default), “uqua” (Upper Quartile), “tmm” (Trimmed Mean of M) or “n” (no normalization).
long: Vector containing the features length as given by readInfo function. If long = 1000, no correction by length is applied.
lc: Correction factor for length normalization. This correction is done by dividing the counts vector by mylength^lc. By default, lc = 1 (which implies a linear length normalization as in RPKM).
nss: Number of simulated samples. Only active when the input data does not contain replicates and NOIseq-sim will be used. By default, nss = 5.
pnr: Percentage of the total reads for each simulated sample. Only active when the input data does not contain replicates and NOIseq-sim will be used. By default, pnr = 0.2.
v: Variability in the size of the simulated samples (only for NOISeq-sim). By default, v = 0.02.
q: Threshold used to determine differentially expressed genes. By default, q = 0.9.

The following table summarizes all the possibilities and includes some recommendations for the values of the parameters:

Method	Replicates	Counts	norm	k	nss	pnr	v
NOISeq-real	Technical / Biological	Raw	rpkm,uqua,tmm	0.5	0	–	–
NOISeq-real	Technical / Biological	Normalized	n	NULL	0	–	–
NOISeq-sim	None	Raw	rpkm,uqua,tmm	0.5	5 (at least)	0.2	0.02
NOISeq-sim	None	Normalized	n	NULL	5 (at least)	0.2	0.02

NOISeq-real: using available replicates

The algorithm estimates the probability distribution for M and D in an empirical way, computing M and D values for every pair of replicates within the same experimental condition and for every feature. Then, all these values are pooled together to generate the noise distribution. Two replicates in one of the experimental conditions is sufficient to run the algorithm. If the number of possible comparisons within a certain condition is higher than 30, in order to reduce computation time, 30 pairwise comparisons are randomly chosen when estimating noise distribution.

It should be noted that biological replicates are necessary if the goal is to make any inferences about the population. Deriving differential expression from technical replicates is useful to draw conclusions about the specific samples being compared in the study but not to extend these conclusions to the whole population.

In RNA-seq or similar sequencing technologies, counts from technical replicates (e.g. lanes) can be summed up. Thus, when repl parameter in NOISeq is set to “tech”, this is the way the algorithm summarizes the information provided by the replicates to compute M and D signal values (between different conditions). However, for biological replicates, other summary statistic such us the mean or the median may be more meaningful. TO BE COMPLETED…

Here you have an example where there are technical replicates and counts data are to be normalized by RPKM:

> myresults <- noiseq(mydata[[1]], mydata[[2]], repl = "tech", k = 0.5, norm = "rpkm", long = mylength, q = 0.8, nss = 0, lc = 1)

Warning: NOISeq for biological replicates has not been tested yet. In case you want to use it, we will be very grateful to receive any feedback from you!

NOISeq-sim: no replicates

When there are no replicates for any of the experimental conditions, the algorithm can simulate them. The simulation relies on the assumption that read counts follow a multinomial distribution, where probabilities for each class (feature) in the multinomial distribution are the probability of a read to map to that feature. These mapping probabilities are approximated using counts in the only sample of the corresponding experimental condition. Counts equal to zero are replaced with k>0, to give all features some chance to appear.

Given the sequencing depth of the unique available sample, the size of the simulated samples is a percentage (parameter pnr) of this total amount of reads, allowing a small variability (parameter v). The number of replicates to be simulated is provided by nss parameter.

Example of how to apply NOISeq-sim to data that is already normalized:

> myresults <- noiseq(mydata[[1]], mydata[[2]], k = NULL, norm = "n", long = 1000, q = 0.9, pnr = 0.2, nss = 5, v = 0.02, lc = 1)

Output data

NOISeq methods returns you a list containing the following objects:

probab: Probability of differential expression for each feature. A NA value means that the probability was not computed because the feature had expression 0 in all the samples for both experimental conditions.
deg: List of differentially expressed features according to the chosen threshold q.
Ms: M-value in signal (between conditions) for each feature.
Ds: D-value in signal (between conditions) for each feature.
Mn: M-values in noise (within the same condition) for all features, conditions and comparisons.
Dn: D-values in noise (within the same condition) for all features, conditions and comparisons.

If you are interested in knowing which of the differentially expressed features are up or down regulated for the first condition (for example), you can combine deg and Ms to get this information:

> Mdeg <- myresults$Ms[myresults$deg,]

A positive value for Mdeg means that the feature is up-regulated in the first condition. If Mdeg is negative, the feature is down-regulated for the first condition.

More about normalization

To be completed…

RPKM UQUA TMM

Length

References

Bullard J.H., Purdom E., Hansen K.D. and Dudoit S. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 11(1):94+.
Mortazavi A., Williams B.A., McCue K., Schaeer L. and Wold B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5(7):621-628.
Robinson M.D. and Oshlack A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11(3):R25+.

Contact

NOISeq has been developed at the Bioinformatics and Genomics Department of the Centro de Investigación Príncipe Felipe, in collaboration with the Department of Applied Statistics, Operations Research and Quality of the Universidad Politécnica de Valencia, Spain.

Please, contact us at:

Ana Conesa, aconesa@cipf.es

Sonia Tarazona, starazona@cipf.es

How to cite us

Tarazona S., García-Alcalde F., Ferrer A., Dopazo J., and Conesa A. Differential expression in RNA-seq: a matter of depth. (submitted)

NOISeq