NOISeq

Differences

This shows you the differences between two versions of the page.

start [2011/04/15 17:57] sotacam [Input parameters]
start [2015/05/28 14:51] sotacam
Line 2: Line 2:
  
 Next Generation Sequencing (NGS) technologies are increasingly being used for gene expression profiling as a
-replacement for microarrays. The expression level given by these technologies is the number of reads in the library mapping to a given feature (gene, exon, transcript, etc.), i.e., the read counts. Most of the statistical methods for assessment of differential expression using count data rely on parametric assumptions about the distribution of the counts (Poisson, Negative Binomial, ...). Moreover, many of them need replicates to work and tend to have problems evaluating differential expression in features with low counts.
+replacement for microarrays. The expression level given by these technologies is the number of reads in the library mapping to a given feature (gene, exon, transcript, etc.), i.e., the read counts. Most of the statistical methods for assessment of differential expression using count data rely on parametric assumptions about the distribution of the counts (Poisson, Negative Binomial, ...).
  
-NOISeq is a non-parametric approach for the identification of differentially expressed genes from count data. NOISeq empirically models the noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the features in samples within the same condition.
+
+**NOISeq** is a non-parametric approach for the identification of differentially expressed genes from count data or previously normalized count data. NOISeq empirically models the noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the features in samples within the same condition.
 This reference distribution is then used to assess whether the M-D values computed between two conditions for a given gene are likely to be part of the noise or represent true differential expression.
  
-There are two variants of the method: NOISeq-real uses replicates when available to compute
-the noise distribution and NOISeq-sim simulates them in the absence of replication. It should be noted that
+
+NOISeq was tested on data sets with technical replicates. There are two variants of this method: NOISeq-real uses replicates when available to compute the noise distribution and NOISeq-sim simulates them in the absence of replication. It should be noted that
 the NOISeq-sim simulation procedure resembles technical replication and does not reproduce biological
 variability, which is necessary for population inferential analysis.
  
-Please, find {{:​posternoiseq_2010.pdf|here}} an outline of the NOISeq method. 
- 
-The NOISeq method has been implemented in the R language.
- 
- 
- 
-====== Tutorial ====== 
- 
-This tutorial is intended to guide you through the use of NOISeq R functions to analyze count data coming from next generation sequencing technologies. First, we describe the **input data** format of the method. Then, we suggest that you **explore your data** using NOISeq functionalities in order to learn more about saturation, contamination or other biases in your data. Finally, we show how to compute **differential expression** between two experimental conditions with NOISeq.
- 
-To start, just download the R script with NOISeq functions [[here]] and save it into the directory where your data are. Launch R in that directory and load NOISeq functions with the following command:\\ 
-  > source("​NOISeq.r"​) 
- 
- 
-===== Input data ===== 
-NOISeq accepts basically two types of data: 
-  * expression data, i.e. (normalized) counts 
-  * annotation data, i.e. information about features (for example, length and biotype) 
-\\ 
-All data must be provided in a tab-delimited txt file.  
- 
-=== Expression data === 
-First column: Features names or IDs.\\ 
-Rest of columns: Expression values for each sample.\\ 
-Example file where biological features are genes and there are 2 replicates per condition: ​ 
-| GeneID ​          ​| ​ condA_rep1| ​ condA_rep2| ​ condB_rep1| ​ condB_rep2| 
-| ENSG00000230254 ​ |  25|  131|  556|  632| 
-| ENSG00000231674 ​ |  3008|  2966|  55|  64| 
-| ENSG00000227983 ​ |  944|  1024|  997|  854| 
-| ENSG00000185272 ​ |  0|  2|  10|  22| 
-| ENSG00000252478 ​ |  5|  9|  4|  3| 
-|  ...  |  ...            |  ...           ​| ​ ...          |  ...           | 
- 
- 
-The function to read txt files and convert them into an R object to be used by NOISeq functions is //​readData//:​\\ 
-  > mydata <- readData(file = "​DataFile.txt",​ cond1 = c(2:5), cond2 = c(6:9), header = TRUE) 
-If your txt file does not contain a header, you should change the //header// parameter to ''​header = FALSE''​. 
- 
- 
-=== Length of the features === 
-One of the biases of RNA-Seq technology is the dependence of the number of counts on the length of the biological feature. If you wish to normalize expression data by the length of the feature, you must provide NOISeq with this information in a txt file like this:  
-| GeneID ​          ​| ​ Length|  ​ 
-| ENSG00000230254 ​ |  2557|  ​ 
-| ENSG00000231674 ​ |  1210|  ​ 
-| ENSG00000227983 ​ |  457|  ​ 
-| ENSG00000185272 ​ |  8590| 
-| ENSG00000252478 ​ |  52|  
-|  ...  |  ...    |        
- 
-The function //​readInfo//​ can be used to read the txt file containing the names of the features and their length:\\ 
-  > mylength <- readInfo(file = "​LengthFile.txt",​ header = TRUE) 
-Again, change the //header// parameter to ''​header = FALSE''​ if your txt file does not contain a header. 
- 
- 
- 
-=== Other biological information === 
-The biological information will be used in the exploratory analysis and it can be any kind of classification of the features. For instance, if the features are genes, the biological information could be the biotypes of those genes, according to the Ensembl database. In that case, the txt file that NOISeq needs would look like this:
-| GeneID ​          ​| ​ Biotype ​ |  ​ 
-| ENSG00000230254 ​ |  protein_coding ​ |  ​ 
-| ENSG00000231674 ​ |  processed_transcript ​ |  ​ 
-| ENSG00000227983 ​ |  pseudogene ​ |  ​ 
-| ENSG00000185272 ​ |  protein_coding ​ | 
-| ENSG00000252478 ​ |  snRNA  |  
-|  ...  |  ...    |  ​ 
- 
-The function //​readInfo//​ can also be used to read the biological information file:\\ 
-  > mybioinfo <- readInfo(file = "​BioinfoFile.txt",​ header = TRUE) 
- 
- 
- 
-===== Exploratory analysis ===== 
-Coming soon!! 
- 
- 
- 
-===== Differential expression ===== 
-The NOISeq method computes differential expression between two experimental conditions given the expression level of the considered features. By default, the algorithm transforms the counts into counts per million reads. Other normalization techniques are implemented such as the Upper Quartile (Bullard et al. 2010), the Trimmed Mean of M values (Robinson and Oshlack 2010) or RPKM (Mortazavi et al. 2008) if the length of the features is provided. NOISeq also accepts normalized expression values instead of counts in case the user would prefer 
-to apply another normalization procedure. 
- 
-NOISeq takes the normalized counts to obtain the statistics needed to derive differential expression, which are the log-ratio (M) and the absolute value of the difference (D). Expression levels equal to 0 are replaced with a certain constant k>0, in order to avoid infinite or undetermined M-values. 
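
The two statistics described above can be sketched in a few lines of base R. This is only an illustration, not the code of the NOISeq script: the helper name `md_stats` is hypothetical, and the base-2 logarithm is an assumption, since the text does not state which base is used for the log-ratio.

```r
# Hypothetical helper (not part of the NOISeq script): M and D statistics for
# two vectors of normalized expression values, one vector per condition.
md_stats <- function(x1, x2, k = 0.5) {
  # Replace zeros with k > 0 so that the log-ratio stays finite.
  x1[x1 == 0] <- k
  x2[x2 == 0] <- k
  # Assumption: base-2 log-ratio for M; D is the absolute difference.
  list(M = log2(x1 / x2), D = abs(x1 - x2))
}

cond_a <- c(25, 3008, 944, 0, 5)
cond_b <- c(556, 55, 997, 10, 4)
signal <- md_stats(cond_a, cond_b)
signal$M   # sign shows the direction of change, magnitude its size
signal$D
```

Without the replacement by `k`, the fourth feature (0 versus 10) would give an infinite M value.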
- 
-A feature is considered to be differentially expressed if its corresponding M and D values are likely to be higher than noise values. Hence, to compute this probability we need to estimate the M and D distribution in noise data, where "noise" means the variability observed when comparing replicates within the same experimental condition. This variability is due to technical or biological causes (depending on the nature of the compared replicates). Changes in expression between conditions of the same magnitude as changes in expression between replicates within the same condition should not be considered differential expression. NOISeq empirically estimates the probability distribution of the M and D statistics in the noise by comparing expression levels between each pair of replicates within the same experimental condition and pooling together the corresponding M and D values. To build this distribution, the absolute value of M is used, since the sign of the change is arbitrary and only its magnitude is biologically meaningful.
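
As a rough illustration of this paragraph, the sketch below pools |M| and D over every pair of replicates in one condition and then scores a signal (M, D) pair against that empirical noise. The function names `noise_md` and `prob_de` are hypothetical, the base-2 logarithm is an assumption, and the scoring rule (the share of noise points with both a smaller |M| and a smaller D) is one plausible reading of "likely to be higher than noise values", not necessarily the exact rule in the NOISeq script. Pooling across both conditions would simply concatenate the results of `noise_md` for each condition.

```r
# Hypothetical sketch: empirical noise distribution from replicate pairs
# within a single condition (rows = features, columns = replicates).
noise_md <- function(reps, k = 0.5) {
  reps[reps == 0] <- k                    # avoid infinite log-ratios
  pairs <- combn(ncol(reps), 2)           # every pair of replicate columns
  M <- unlist(lapply(seq_len(ncol(pairs)), function(j)
    abs(log2(reps[, pairs[1, j]] / reps[, pairs[2, j]]))))
  D <- unlist(lapply(seq_len(ncol(pairs)), function(j)
    abs(reps[, pairs[1, j]] - reps[, pairs[2, j]])))
  list(M = M, D = D)                      # pooled over features and pairs
}

# Assumed scoring rule: fraction of noise points that have both a smaller
# |M| and a smaller D than the signal values of the feature.
prob_de <- function(m, d, noise) {
  mean(noise$M < abs(m) & noise$D < d)
}

reps_a <- cbind(rep1 = c(25, 3008, 944, 0, 5),
                rep2 = c(131, 2966, 1024, 2, 9))
noise <- noise_md(reps_a)
p <- prob_de(m = -4.5, d = 500, noise = noise)
```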
- 
-Once the probability of differential expression has been computed for each feature, the odds Pr(differential expression)/​Pr(non-differential expression) is used to decide whether a feature is differentially expressed between conditions or not. For instance, an odds value of 4:1 is equivalent to Pr(differential expression) 
-= 0.8 and it means that the feature is 4 times more likely to be differentially expressed than non-differentially 
-expressed. This probability is the input parameter //q// that the algorithm takes as the cutoff for declaring differential expression. 
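
The relationship between the probability and the odds quoted above is easy to check in R. The two helper names below are hypothetical and exist only for this illustration.

```r
# Convert between a probability of differential expression and its odds.
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(odds) odds / (1 + odds)

odds_4_to_1 <- prob_to_odds(0.8)   # probability 0.8 corresponds to odds of 4:1
prob_back   <- odds_to_prob(4)     # and odds of 4:1 map back to 0.8
default_odds <- prob_to_odds(0.9)  # the default q = 0.9 corresponds to 9:1
```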
- 
-The NOISeq algorithm compares replicates within the same condition to
-estimate the noise distribution. Two versions of the NOISeq method have been developed: NOISeq-real, which uses available replicates, and NOISeq-sim, which simulates technical replicates from the data.
- 
- 
- 
-==== Input parameters ==== 
- 
-Depending on the characteristics of the available samples, we have to choose the appropriate values for the parameters in the corresponding NOISeq option. This is a short description for the input parameters in NOISeq: ​ 
-  * **repl**: ​ The type of replicates: "​tech"​ for technical replicates and "​bio"​ for biological ones. 
-  *  **k**: When counts = 0, 0 will be changed to //k//. By default, //k// = 0.5. 
-  * **norm**: Normalization method. It can be one of "​rpkm"​ (default), "​uqua"​ (Upper Quartile), "​tmm"​ (Trimmed Mean of M) or "​n"​ (no normalization). 
-  * **long**: Vector containing the features length as given by //​readInfo//​ function. If //long// = 1000, no correction by length is applied. 
-  * **lc**: Correction factor for length normalization. This correction is done by dividing the counts vector by //​mylength//​^//​lc//​. By default, //lc// = 1 (which implies a linear length normalization as in RPKM). 
-  * **nss**: Number of simulated samples. Only active when the input data does not contain replicates and NOISeq-sim will be used. By default, //nss// = 5.
-  * **pnr**: Size of each simulated sample, as a fraction of the total reads. Only active when the input data does not contain replicates and NOISeq-sim will be used. By default, //pnr// = 0.2.
-  * **v**: Variability in the size of the simulated samples (only for NOISeq-sim). By default, //v// = 0.02.  
-  * **q**: Threshold used to determine differentially expressed genes. By default, //q// = 0.9. 
- 
- 
-The following table summarizes all the possibilities and includes some recommendations for the values of the parameters: 
- 
- 
-^Method ​      ​^Replicates ​             ^Counts ​     ^norm           ​^k ​    ​^nss ​          ^pnr ^v  ^  ​ 
-|NOISeq-real ​ |Technical / Biological ​ |Raw         ​|rpkm,​uqua,​tmm ​ |0.5   ​|0 ​            ​|-- ​  ​|-- ​ |  
-|:::          |:::                     ​|Normalized ​ |n              |NULL  |:::           ​|::: ​ |:::  | 
-|NOISeq-sim ​  ​|None ​                   |Raw         ​|rpkm,​uqua,​tmm ​ |0.5   |5 (at least) ​ |0.2  |0.02  |  
-|:::          |:::                     ​|Normalized ​ |n              |NULL  |:::           ​|::: ​ |:::  | 
- 
- 
- 
- 
-==== NOISeq-real:​ using available replicates ==== 
-The algorithm estimates the probability distribution for M and D in an empirical way, computing M and
-D values for every pair of replicates within the same experimental condition and for every feature. Then, all
-these values are pooled together to generate the noise distribution. Two replicates in one of the experimental
-conditions are sufficient to run the algorithm. If the number of possible comparisons within a certain condition is higher than 30, 30 pairwise comparisons are randomly chosen when estimating the noise distribution, in order to reduce computation time.
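
The cap on pairwise comparisons described above could look roughly like this in base R. The function name `sample_pairs` and its argument names are hypothetical, and the sketch covers a single condition only.

```r
# Hypothetical sketch: enumerate the replicate pairs within one condition and
# keep at most max_pairs of them, chosen at random.
sample_pairs <- function(n_reps, max_pairs = 30) {
  pairs <- combn(n_reps, 2)               # 2 x choose(n_reps, 2) matrix
  if (ncol(pairs) > max_pairs) {
    keep <- sample(ncol(pairs), max_pairs)
    pairs <- pairs[, keep, drop = FALSE]
  }
  pairs
}

set.seed(1)
few  <- sample_pairs(4)    # 6 possible pairs, all kept
many <- sample_pairs(10)   # 45 possible pairs, 30 kept at random
```

With two replicates there is a single pair, which is why two replicates in one condition are enough to build a noise distribution.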
- 
-It should be noted that biological replicates are necessary if the goal is to make any inferences about the 
-population. Deriving differential expression from technical replicates is useful to draw conclusions about the 
-specific samples being compared in the study but not to extend these conclusions to the whole population. 
- 
-In RNA-seq or similar sequencing technologies, counts from technical replicates (e.g. lanes) can be summed up. Thus, when the //repl// parameter in NOISeq is set to "tech", this is how the algorithm summarizes the information provided by the replicates to compute the M and D signal values (between different conditions). However, for biological replicates, other summary statistics such as the mean or the median may be more meaningful. **TO BE COMPLETED...**
- 
-Here you have an example where there are technical replicates and counts data are to be normalized by RPKM: 
-  > myresults <- noiseq(mydata[[1]],​ mydata[[2]],​ repl = "​tech",​ k = 0.5, norm = "​rpkm",​ long = mylength, q = 0.8, nss = 0, lc = 1) 
- 
- 
-**Warning**:​ NOISeq for biological replicates has not been tested yet. In case you want to use it, we will be very grateful to receive any feedback from you! 
-==== NOISeq-sim: no replicates ==== 
-When there are no replicates for any of the experimental conditions, the algorithm can simulate them. The 
-simulation relies on the assumption that read counts follow a multinomial distribution,​ where probabilities 
-for each class (feature) in the multinomial distribution are the probability of a read to map to that feature. These 
-mapping probabilities are approximated using counts in the only sample of the corresponding experimental 
-condition. Counts equal to zero are replaced with k>0, to give all features some chance to appear. ​ 
- 
-Given the sequencing depth of the unique available sample, the size of the simulated samples is a percentage (parameter //pnr//) of this total amount of reads, allowing a small variability (parameter //v//). The number of replicates to be simulated is provided by //nss// parameter. 
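
The simulation step can be sketched with base R's `rmultinom`. This is an illustration under stated assumptions rather than the NOISeq implementation: the function name `simulate_reps` is hypothetical, and the way //v// enters (sizes drawn uniformly within plus or minus //v// of //pnr//) is an assumed reading of "allowing a small variability".

```r
# Hypothetical sketch of NOISeq-sim style replicate simulation from the
# single available sample of one condition.
simulate_reps <- function(counts, nss = 5, pnr = 0.2, v = 0.02, k = 0.5) {
  depth <- sum(counts)                     # sequencing depth of the sample
  counts[counts == 0] <- k                 # give every feature some chance
  probs <- counts / sum(counts)            # approximate mapping probabilities
  # Assumption: each simulated size is a fraction of the depth drawn
  # uniformly from [pnr - v, pnr + v].
  sizes <- round(depth * runif(nss, pnr - v, pnr + v))
  # One multinomial draw per simulated sample; columns are samples.
  sapply(sizes, function(n) rmultinom(1, size = n, prob = probs)[, 1])
}

set.seed(42)
sim <- simulate_reps(c(25, 3008, 944, 0, 5))
dim(sim)       # features in rows, simulated samples in columns
colSums(sim)   # each close to 20% of the original depth
```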
- 
-Example of how to apply NOISeq-sim to data that is already normalized: 
- 
-  > myresults <- noiseq(mydata[[1]],​ mydata[[2]],​ k = NULL, norm = "​n",​ long = 1000, q = 0.9, pnr = 0.2, nss = 5, v = 0.02, lc = 1) 
- 
- 
- 
-==== Output data ==== 
-NOISeq returns a list containing the following objects:
-  * **probab**: Probability of differential expression for each feature. An NA value means that the probability was not computed because the feature had expression 0 in all the samples for both experimental conditions.
-  * **deg**: List of differentially expressed features according to the chosen threshold //q//. 
-  * **Ms**: M-value in signal (between conditions) for each feature. 
-  * **Ds**: D-value in signal (between conditions) for each feature. 
-  * **Mn**: M-values in noise (within the same condition) for all features, conditions and comparisons. 
-  * **Dn**: D-values in noise (within the same condition) for all features, conditions and comparisons. 
- 
-If you are interested in knowing which of the differentially expressed features are up or down regulated for the first condition (for example), you can combine //deg// and //Ms// to get this information:​ 
-  > Mdeg <- myresults$Ms[myresults$deg,​] 
  
-A positive value for //Mdeg// means that the feature is up-regulated in the first condition. If //Mdeg// is negative, the feature is down-regulated for the first condition.
+Please, find {{:posternoiseq_2012.pdf|here}} an outline of the NOISeq method.
  
-==== More about normalization ====
-**To be completed...**
+**NOISeqBIO** is the adaptation of NOISeq to handle biological variability. You can find a description of the NOISeqBIO method in this {{:noiseqbio_techreport.pdf|Technical Report}} and also a summary in this {{:posternoiseqbio.pdf|poster}}.
+
  
-RPKM 
-UQUA 
-TMM 
  
-Length
+Both NOISeq and NOISeqBIO are included in the R/Bioc NOISeq package.
  
-====== References ====== 
-  * Bullard J.H., Purdom E., Hansen K.D. and Dudoit S. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. //BMC Bioinformatics// 11(1):94.
-  * Mortazavi A., Williams B.A., McCue K., Schaeffer L. and Wold B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-seq. //Nature Methods// 5(7):621-628.
-  * Robinson M.D. and Oshlack A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. //Genome Biology// 11(3):R25.
  
  
  
-====== Contact ====== 
-NOISeq has been developed at the [[http://​bioinfo.cipf.es|Bioinformatics and Genomics Department]] of the [[http://​www.cipf.es|Centro de Investigación Príncipe Felipe]], in collaboration with the Department of Applied Statistics, Operations Research and Quality of the [[http://​www.upv.es|Universidad Politécnica de Valencia]], Spain. 
  
-Please, contact us at: 
  
-**Ana Conesa**, aconesa@cipf.es 
  
-**Sonia Tarazona**, starazona@cipf.es 
  
  
-====== How to cite us ====== 
  
-Tarazona S., García-Alcalde F., Ferrer A., Dopazo J., and Conesa A. **Differential expression in RNA-seq: a matter of depth**. (submitted)