Class prediction in acute leukemia


In this example we analyse a dataset from Golub et al. (1999). That paper studied two types of leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), in order to detect differences between them. The dataset has 3051 genes and 38 arrays, 27 of them labeled as ALL and 11 as AML.

Using Class prediction we are going to build a predictor that tries to distinguish between the two classes. The training file contains 30 arrays, 21 ALL and 9 AML. The remaining 8 arrays, 6 ALL and 2 AML, are in the test file and are used for prediction.

You can find the dataset for this exercise in the following files:

A. Training

  1. Train with the KNN algorithm. Upload the data file and select the variable TUMOR. To keep the exercise fast, select 5 repeats of 5-fold cross-validation. Do not select any feature selection method in this exercise.
  2. Repeat the exercise, but select the CFS feature selection method. Which one works better? Why? How many genes were selected?
  3. Now try the SVM algorithm with no feature selection method. Which one performs better? Which do you prefer, SVM or KNN?
  4. To finish, try SVM with the CFS feature selection method. How many features were selected? Why does the number match KNN with CFS?
  5. Finally, which is the best combination? Why does SVM do better alone than with CFS?
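
If you want to see what "5 repeats of 5-fold cross-validation" means in step 1, the following is a minimal sketch in plain Python. It is not the tool's implementation: the data are synthetic (21 ALL and 9 AML arrays with 50 made-up genes), and the function names, the choice of k = 3, and the stratified fold assignment are all assumptions made for illustration. No feature selection is done, and CFS and SVM are not covered.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, sample, k=3):
    """Predict the majority label among the k nearest training samples."""
    nearest = sorted(
        (math.dist(row, sample), label) for row, label in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def stratified_folds(labels, n_folds, rng):
    """Split sample indices into n_folds folds with similar class proportions."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def repeated_cv_accuracy(data, labels, k=3, n_folds=5, n_repeats=5, seed=0):
    """Mean accuracy over n_repeats runs of n_folds-fold cross-validation."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        # Each repeat reshuffles the samples into a new set of folds.
        for test_idx in stratified_folds(labels, n_folds, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(data)) if i not in held_out]
            train_x = [data[i] for i in train_idx]
            train_y = [labels[i] for i in train_idx]
            correct = sum(
                knn_predict(train_x, train_y, data[i], k) == labels[i]
                for i in test_idx
            )
            accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Synthetic stand-in for the training file (assumption: not the Golub data).
data_rng = random.Random(42)
labels = ["ALL"] * 21 + ["AML"] * 9
data = [
    [data_rng.gauss(1.0 if label == "AML" else 0.0, 1.0) for _ in range(50)]
    for label in labels
]

mean_accuracy = repeated_cv_accuracy(data, labels)
print(f"Mean accuracy over 5 x 5-fold CV: {mean_accuracy:.3f}")
```

Each of the 25 accuracy values comes from a model that never saw the 6 held-out arrays, which is why the averaged figure is a fairer estimate than the accuracy on the training data itself.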

B. Test

  • Now select the option Train and test, and choose datatraingolub and datatestgolub.
  • You can select KNN without a feature selection method to speed up the exercise.
  • To check the accuracy of the predictions, compare them with the correct labels for the test file:
 ALL	ALL	ALL	ALL	ALL	ALL	AML	AML 
  • Are the predictions right? Do you get the same results with SVM?
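
The comparison against the correct labels can also be done with a few lines of Python. This is only a sketch: the `true_labels` list is taken from the exercise above, while `predicted_labels` is a hypothetical example invented for illustration and should be replaced with the labels you actually obtain. The helper name `score_predictions` is likewise an assumption.

```python
from collections import Counter

# Correct labels for the test file, as listed in the exercise.
true_labels = ["ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "AML", "AML"]

# Hypothetical predictions (invented for illustration); paste your own here.
predicted_labels = ["ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "AML", "ALL"]


def score_predictions(true, predicted):
    """Return the accuracy and a Counter of (true, predicted) label pairs."""
    if len(true) != len(predicted):
        raise ValueError("label lists must have the same length")
    pairs = Counter(zip(true, predicted))
    correct = sum(count for (t, p), count in pairs.items() if t == p)
    return correct / len(true), pairs


accuracy, pairs = score_predictions(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.3f}")
for (t, p), count in sorted(pairs.items()):
    print(f"true={t} predicted={p}: {count}")
```

With only 8 test arrays, a single wrong prediction changes the accuracy by 12.5 percentage points, so looking at the per-class counts is more informative than the accuracy alone.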