

Class prediction in acute leukemia


In this example we are going to analyse a dataset from Golub et al. (1999). That paper studied two types of leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), in order to detect differences between them. The dataset has 3051 genes and 38 arrays, 27 of them labeled as ALL and 11 as AML.

Using Class prediction we are going to build a predictor that tries to distinguish between the two classes. The train file contains 30 arrays, 21 ALL and 9 AML. The remaining 8 arrays, 6 ALL and 2 AML, are in the test file and are used for prediction.

You can find the dataset for this exercise in the following files:

A. Training

  1. Train with the KNN algorithm. Upload the data file and select the variable TUMOR. To keep the exercise fast, select 5 repeats of 5-fold cross-validation. Do not select any feature selection method in this exercise.
  2. Repeat the exercise, but select the CFS feature selection method. Which one works better? Why? How many genes were selected?
  3. Now try the SVM algorithm with no feature selection method. Which one performs better? Which do you prefer: SVM or KNN?
  4. To finish, try SVM with the CFS feature selection method. How many features were selected? Why does this number match KNN with CFS?
  5. Finally, which is the best combination? Why does SVM do better alone than with CFS?
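
For readers who want to see what the evaluation scheme in step 1 does, 5 repeats of 5-fold cross-validation with KNN can be sketched in plain Python. This is only an illustration, not the tool's implementation. The data are synthetic (30 made-up arrays with four made-up features, 21 labeled ALL and 9 AML to mirror the train file), and the function names, k=3, the Euclidean distance, and the stratified fold assignment are all assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def knn_predict(train_x, train_y, sample, k=3):
    """Majority vote among the k nearest training arrays (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_folds(labels, n_folds, rng):
    """Deal shuffled indices of each class round-robin into n_folds folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def repeated_cv_accuracy(x, y, n_folds=5, n_repeats=5, k=3, seed=0):
    """Mean accuracy over n_repeats independent n_folds-fold cross-validations."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        correct = 0
        for fold in stratified_folds(y, n_folds, rng):
            held_out = set(fold)
            train_x = [x[i] for i in range(len(x)) if i not in held_out]
            train_y = [y[i] for i in range(len(y)) if i not in held_out]
            correct += sum(knn_predict(train_x, train_y, x[i], k) == y[i]
                           for i in fold)
        accuracies.append(correct / len(y))
    return sum(accuracies) / n_repeats


# Synthetic stand-in for the train file: NOT the Golub et al. data.
data_rng = random.Random(42)
x = ([[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(21)]
     + [[data_rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(9)])
y = ["ALL"] * 21 + ["AML"] * 9

accuracy = repeated_cv_accuracy(x, y)
print(f"mean cross-validated accuracy: {accuracy:.3f}")
```

Each array is held out exactly once per repeat, so every repeat yields one accuracy over all 30 arrays. Averaging over repeats reduces the dependence on any single random fold assignment. If feature selection such as CFS were added, it would need to be run inside each training fold rather than on the full data, otherwise the accuracy estimate would be optimistic.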

B. Test

 ALL	ALL	ALL	ALL	ALL	ALL	AML	AML
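
Once the predictor has been applied to the test file, its output can be compared with the known classes. The sketch below assumes that the row above lists the true classes of the eight test arrays (6 ALL and 2 AML). The predicted labels are invented purely for illustration and are not results from the tool.

```python
from collections import Counter

# Assumption: true classes of the eight test arrays, as listed in the row above.
true_labels = ["ALL"] * 6 + ["AML"] * 2
# Hypothetical predictor output, made up for illustration only.
predicted = ["ALL", "ALL", "ALL", "ALL", "ALL", "AML", "AML", "ALL"]

# Count each (true class, predicted class) pair to form a confusion matrix.
confusion = Counter(zip(true_labels, predicted))
accuracy = sum(t == p for t, p in zip(true_labels, predicted)) / len(true_labels)

for true_class in ("ALL", "AML"):
    for predicted_class in ("ALL", "AML"):
        count = confusion[(true_class, predicted_class)]
        print(f"true {true_class}, predicted {predicted_class}: {count}")
print(f"accuracy: {accuracy:.2f}")
```

With only two AML arrays in the test set, a single misclassified AML sample changes the per-class error rate by 50 percentage points, so the confusion counts are more informative than the overall accuracy.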