This is an old revision of the document!
Next Generation Sequencing (NGS) technologies are increasingly being used for gene expression profiling as a replacement for microarrays. The expression level given by these technologies is the number of reads in the library mapping to a given feature (gene, exon, transcript, etc.), i.e., the read counts. Most of the statistical methods for assessment of differential expression using count data rely on parametric assumptions about the distribution of the counts (Poisson, Negative Binomial, …). Moreover, many of them need replicates to work and tend to have problems to evaluate differential expression in features with low counts.
NOISeq is a non-parametric approach for the identification of differentially expressed genes from count data. NOISeq empirically models the noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the features in samples within the same condition. This reference distribution is then used to assess whether the M-D values computed between two conditions for a given gene is likely to be part of the noise or represent a true differential expression.
The are two variants of the method: NOISeq-real uses replicates, when available, to compute the noise distribution and, NOISeq-sim simulates them in absence of replication. It should be noted that the NOISeq-sim simulation procedure assimilates to technical replication and does not reproduce biological variability, which is necessary for population inferential analysis.
Please, find here an outline of the NOISeq method.
NOISeq method has been implemented in R language.
This tutorial is intended to guide you through the use of NOISeq R functions to analyze count data coming from next generation sequencing technologies. First, we describe the input data format of the method. Then, we suggest you to explore your data using NOISeq functionalities in order to learn more about saturation, contamination or other biases in your data. Finally, we show how to compute differential expression between two experimental conditions with NOISeq.
To start, just download the R script with NOISeq functions here and save it into the directory where your data are. Launch R in that directory and load NOISeq functions with the following command:
> source("NOISeq.r")
NOISeq accepts basically two types of data:
All data must be provided in a tab-delimited txt file.
First column: Features names or IDs.
Rest of columns: Expression values for each sample.
Example file where biological features are genes and there are 2 replicates per condition:
GeneID | condA_rep1 | condA_rep2 | condB_rep1 | condB_rep2 |
ENSG00000230254 | 25 | 131 | 556 | 632 |
ENSG00000231674 | 3008 | 2966 | 55 | 64 |
ENSG00000227983 | 944 | 1024 | 997 | 854 |
ENSG00000185272 | 0 | 2 | 10 | 22 |
ENSG00000252478 | 5 | 9 | 4 | 3 |
… | … | … | … | … |
The function to read txt files and convert them into an R object to be used by NOISeq functions is readData:
> mydata <- readData(file = "DataFile.txt", cond1 = c(2:5), cond2 = c(6:9), header = TRUE)
If your txt file does not contain a header, you should change the header parameter to header = FALSE
.
One of the biases of RNA-Seq technology is the dependence of the number of counts on the length of the biological feature. If you wish to normalize expression data by the length of the feature, you must provide NOISeq with this information in a txt file like this:
GeneID | Length |
ENSG00000230254 | 2557 |
ENSG00000231674 | 1210 |
ENSG00000227983 | 457 |
ENSG00000185272 | 8590 |
ENSG00000252478 | 52 |
… | … |
The function readInfo can be used to read the txt file containing the names of the features and their length:
> mylength <- readInfo(file = "LengthFile.txt", header = TRUE)
Again, change the header parameter to header = FALSE
if your txt file does not contain a header.
The biological information will be used in the exploratory analysis and it can be any kind of classification of the features. For instance, if the features are genes, the biological information could be the biotypes of those genes, according to the Ensembl database. Then, the txt file NOISeq need would be like this:
GeneID | Biotype |
ENSG00000230254 | protein_coding |
ENSG00000231674 | processed_transcript |
ENSG00000227983 | pseudogene |
ENSG00000185272 | protein_coding |
ENSG00000252478 | snRNA |
… | … |
The function readInfo can also be used to read the biological information file:
> mybioinfo <- readInfo(file = "BioinfoFile.txt", header = TRUE)
Coming soon!!
The NOISeq method computes differential expression between two experimental conditions given the expression level of the considered features. By default, the algorithm transforms the counts into counts per million reads. Other normalization techniques are implemented such as the Upper Quartile (Bullard et al. 2010), the Trimmed Mean of M values (Robinson and Oshlack 2010) or RPKM (Mortazavi et al. 2008) if the length of the features is provided. NOISeq also accepts normalized expression values instead of counts in case the user would prefer to apply another normalization procedure.
NOISeq takes the normalized counts to obtain the statistics needed to derive differential expression, which are the log-ratio (M) and the absolute value of the difference (D). Expression levels equal to 0 are replaced with a certain constant k>0, in order to avoid infinite or undetermined M-values.
A feature is considered to be differentially expressed if its corresponding M and D values are likely to be higher than noise values. Hence, to compute this probability we need to estimate M and D distribution in noise data, where “noise” means the variability observed when comparing replicates within the same experimental condition. This variability is due to technical or biological causes (depending on the nature of the compared replicates). Changes in expression between conditions with the same magnitude than changes in expression between replicates within the same condition should not be considered as differential expression. NOISeq estimates empirically the probability distribution of M and D statistics in the noise by comparing expression levels between each pair of replicates within the same experimental condition and pooling together the corresponding M and D values. To build this distribution, the absolute value of M is used, since the sign of changes is an arbitrary result and only the magnitude of the change is biologically meaningful.
Once the probability of differential expression has been computed for each feature, the odds Pr(differential expression)/Pr(non-differential expression) is used to decide whether a feature is differentially expressed between conditions or not. For instance, an odds value of 4:1 is equivalent to Pr(differential expression) = 0.8 and it means that the feature is 4 times more likely to be differentially expressed than non-differentially expressed. This probability is the input parameter q that the algorithm takes as the cutoff for declaring differential expression.
The NOISeq algorithm compares replicates within the same condition to estimate noise distribution. Two versions of NOISeq method have been developed: NOISeq-real, that uses available reaplicates, and NOISeq-sim, that simulates technical replicates from the data.
Depending on the characteristics of the available samples, we have to choose the appropriate values for the parameters in the corresponding NOISeq option. This is a short description for the input parameters in NOISeq:
The following table summarizes all the possibilities and includes some recommendations for the values of the parameters:
Method | Replicates | Counts | norm | k | nss | pnr | v |
---|---|---|---|---|---|---|---|
NOISeq-real | Technical / Biological | Raw | rpkm,uqua,tmm | 0.5 | 0 | – | – |
Normalized | n | NULL | |||||
NOISeq-sim | None | Raw | rpkm,uqua,tmm | 0.5 | 5 (at least) | 0.2 | 0.02 |
Normalized | n | NULL |
The algorithm estimates the probability distribution for M and D in an empirical way, computing M and D values for every pair of replicates within the same experimental condition and for every feature. Then, all these values are pooled together to generate the noise distribution. Two replicates in one of the experimental conditions is sufficient to run the algorithm. If the number of possible comparisons within a certain condition is higher than 30, in order to reduce computation time, 30 pairwise comparisons are randomly chosen when estimating noise distribution.
It should be noted that biological replicates are necessary if the goal is to make any inferences about the population. Deriving differential expression from technical replicates is useful to draw conclusions about the specific samples being compared in the study but not to extend these conclusions to the whole population.
In RNA-seq or similar sequencing technologies, counts from technical replicates (e.g. lanes) can be summed up. Thus, when repl parameter in NOISeq is set to “tech”, this is the way the algorithm summarizes the information provided by the replicates to compute M and D signal values (between different conditions). However, for biological replicates, other summary statistic such us the mean or the median may be more meaningful. TO BE COMPLETED…
Here you have an example where there are technical replicates and counts data are to be normalized by RPKM:
> myresults <- noiseq(mydata[[1]], mydata[[2]], repl = "tech", k = 0.5, norm = "rpkm", long = mylength, q = 0.8, nss = 0, lc = 1)
Warning: NOISeq for biological replicates has not been tested yet. In case you want to use it, we will be very grateful to receive any feedback from you!
When there are no replicates for any of the experimental conditions, the algorithm can simulate them. The simulation relies on the assumption that read counts follow a multinomial distribution, where probabilities for each class (feature) in the multinomial distribution are the probability of a read to map to that feature. These mapping probabilities are approximated using counts in the only sample of the corresponding experimental condition. Counts equal to zero are replaced with k>0, to give all features some chance to appear.
Given the sequencing depth of the unique available sample, the size of the simulated samples is a percentage (parameter pnr) of this total amount of reads, allowing a small variability (parameter v). The number of replicates to be simulated is provided by nss parameter.
Example of how to apply NOISeq-sim to data that is already normalized:
> myresults <- noiseq(mydata[[1]], mydata[[2]], k = NULL, norm = "n", long = 1000, q = 0.9, pnr = 0.2, nss = 5, v = 0.02, lc = 1)
NOISeq methods returns you a list containing the following objects:
If you are interested in knowing which of the differentially expressed features are up or down regulated for the first condition (for example), you can combine deg and Ms to get this information:
> Mdeg <- myresults$Ms[myresults$deg,]
A positive value for Mdeg means that the feature is up-regulated in the first condition. If Mdeg is negative, the feature is down-regulated for the first condition.
To be completed…
RPKM UQUA TMM
Length
NOISeq has been developed at the Bioinformatics and Genomics Department of the Centro de Investigación Príncipe Felipe, in collaboration with the Department of Applied Statistics, Operations Research and Quality of the Universidad Politécnica de Valencia, Spain.
Please, contact us at:
Ana Conesa, aconesa@cipf.es
Sonia Tarazona, starazona@cipf.es
Tarazona S., García-Alcalde F., Ferrer A., Dopazo J., and Conesa A. Differential expression in RNA-seq: a matter of depth. (submitted)