HELP PAGE

 


 

Chapter 1: The Maximal Margin (MAMA) Linear Programming Classification Procedure
     The web version of MAMA (Antonov et al., 2004) operates in three basic modes: test set classification (normal mode), leave-one-out (LOO) cross validation, and n-fold cross validation. In test set classification mode, the user is expected to supply both training and test data. The training data are used to construct the classifier (Ideal Feature); afterwards the test set samples are classified. LOO mode requires only training data. Each sample in turn is removed from the data, a classifier is constructed from the remaining samples, and the removed sample is then classified. The n-fold cross validation mode also requires only training data; if test data are supplied as well, they are joined with the training data. The joined data are split randomly into new training and test sets. The split rate, specified on the Advanced Options page, is the fraction (a value in the open interval (0;1)) of samples assigned to the training set. The newly created training and test sets are then analyzed in normal mode.
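
The sample handling in LOO mode and the random split used in n-fold cross validation mode can be sketched as follows. This is only an illustration, not the MAMA implementation: the function names, the seed argument and the rounding of the training set size are assumptions.

```python
import random


def split_by_rate(samples, split_rate, seed=None):
    """Randomly partition samples; split_rate is the fraction sent to training."""
    if not 0.0 < split_rate < 1.0:
        raise ValueError("split_rate must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Rounding to the nearest integer is an assumption of this sketch.
    n_train = round(len(shuffled) * split_rate)
    return shuffled[:n_train], shuffled[n_train:]


def leave_one_out(samples):
    """Yield (training, held_out) pairs, removing one sample at a time."""
    for i, held_out in enumerate(samples):
        yield samples[:i] + samples[i + 1:], held_out


train, test = split_by_rate(list(range(10)), split_rate=0.7, seed=1)
loo_rounds = list(leave_one_out(["s1", "s2", "s3"]))
```

Each pair produced by either function would then be processed as in normal mode: build the classifier on the training part and classify the rest.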

Chapter 2: Entering the Data
     There are two forms for entering the data. The Basic Options form lets the user specify a minimal set of parameters to get started. The Advanced Options form offers more parameters for flexible analysis of the data.

Basic Options
     On this page the user provides the data files, in the Input format, containing the expression and classification data. Only the training data are required; the test data are optional. The user should specify two additional fields: the data delimiter and the number of genes for classification. The data delimiter is the symbol used to separate data in the input files; the default is the tab symbol. The number of genes specifies how many genes are used in the classification procedure. These genes are selected with the standard deviation filter (top genes). The default value is 1000. If the data contain fewer than 1000 genes, all of them are selected. The Basic Options page operates in LOO mode if no test set is supplied; otherwise it runs in normal mode.

Advanced Options
     The Advanced Options page lets the user specify more options, divided into data options and analysis options. In the data part the user should specify the following fields:

  • data delimiter – the symbol used to separate data in input files. Default value is tab.
  • missing value – the symbol used to indicate missing values in the input files. Default value is Na.
  • data shift – the value added to all gene expression values. Default value is 0.
  • negative value parameter – specifies how negative values in the input data set should be treated. If the user selects “yes”, negative expression values are replaced with 0. This replacement is applied after the data shift (see previous parameter). Otherwise negative values are not preprocessed. Default value is no.
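
The order of the two value-level options (shift first, then negative value replacement) can be illustrated with a short sketch. The function and argument names are hypothetical and do not come from the MAMA software.

```python
def preprocess(values, data_shift=0.0, zero_negatives=False):
    """Shift every value, then optionally replace negative results with 0."""
    shifted = [value + data_shift for value in values]
    if zero_negatives:
        # Applied after the shift, as described for the negative value parameter.
        shifted = [max(value, 0.0) for value in shifted]
    return shifted


clipped = preprocess([-20.0, 5.0, 30.0], data_shift=10.0, zero_negatives=True)
unclipped = preprocess([-20.0, 5.0, 30.0], data_shift=10.0)
```

With a shift of 10, the value -20 becomes -10 and is set to 0 only when the negative value parameter is “yes”.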

The analysis section specifies parameters for the gene selection utilities and the classification mode. The classification mode selects one of the classification procedures (see Chapter 1). If n-fold cross validation mode is selected, the user should also indicate the split rate (see Chapter 1). The user may also specify the following fields for gene selection:

  • Gene filter type selects one of three statistical criteria that can be applied to select genes. All of them are computed from the gene expression values across the samples of the training set: the standard deviation (sd) of the gene expression values, their average (avg), and the ratio of the standard deviation to the average (sd/avg).
  • Gene filter threshold sets a threshold for the gene filter. A gene is filtered out if its Gene filter value is less than the threshold.
  • Gene number limits the maximum number of genes for analysis. The genes are ordered by their Gene filter value, and the genes with the smallest values are filtered out.
  • Transform indicates whether or not a log transformation of the data should be applied.

The Gene number parameter is redundant with the Gene filter threshold, but we found it useful for fast evaluation of different Gene filter types. If the user supplies the Gene number parameter, it will be used for gene selection.
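
The gene selection options can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the sample standard deviation is used (the estimator used by MAMA is not specified here), and the threshold is applied before the Gene number limit when both are given.

```python
import statistics


def filter_value(values, filter_type):
    """Compute the filter statistic of one gene across the training samples."""
    avg = statistics.fmean(values)
    if filter_type == "avg":
        return avg
    sd = statistics.stdev(values)  # sample SD; an assumption of this sketch
    if filter_type == "sd":
        return sd
    if filter_type == "sd/avg":
        return sd / avg  # not defined when avg is 0; not handled here
    raise ValueError(f"unknown filter type: {filter_type}")


def select_genes(expression, filter_type="sd", threshold=None, gene_number=None):
    """expression maps a gene name to its list of training-set values."""
    scored = {gene: filter_value(v, filter_type) for gene, v in expression.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    if threshold is not None:
        # A gene is filtered out if its filter value is less than the threshold.
        ranked = [gene for gene in ranked if scored[gene] >= threshold]
    if gene_number is not None:
        # Keep the genes with the largest filter values.
        ranked = ranked[:gene_number]
    return ranked


expression = {
    "g1": [1.0, 1.0, 1.0, 1.0],
    "g2": [0.0, 10.0, 0.0, 10.0],
    "g3": [4.0, 6.0, 4.0, 6.0],
}
top_two = select_genes(expression, filter_type="sd", gene_number=2)
```

In this toy input g1 is constant, so it has the smallest standard deviation and is the first gene to be filtered out.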

Input format
     The training set must include two data files: the expression data file and the classification data file. The first contains data from microarray experiments. The second specifies classification information about the expression data. The test set (if available) should also include expression and classification data files. If the classification labels for the test set samples are not known, the third row of the classification data file must be empty.

Expression data file format.
     Data are organized into a matrix: rows correspond to genes, columns correspond to samples. The first n rows specify n different names of the samples, and the first m columns specify m different possible names of the genes. (Attention: only the first name of each sample and the first two names of each gene are used to generate classification reports.) Expression values are double (or integer) values. Each expression value may be followed by a text comment, such as the symbols A or P, specifying, e.g., the confidence of the measurement. These text comments are ignored. All data should be separated by the same delimiter, which is specified in the input parameter form.

Classification data file format.
     There are three rows in the file. The first row specifies four integers. The first is the number of samples. The second is the number of classes. The last two give the position (row and column) of the first expression value in the matrix. If the first n rows of the expression data file specify n different names of the samples and the first m columns specify m different possible names of genes, then the third integer is n+1 and the fourth is m+1.

The second row specifies the names of the classes. It must start with the symbol “#”. The order of the names is important. The third row specifies the labels of the samples. The order of the labels corresponds to the order of the samples in the expression data file. Labels are integer values. The minimum label corresponds to the first class name, …, and the maximum label corresponds to the last class name. If the labels are unknown (this applies only to test data), the third row of the classification file must be empty.
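
The three-row layout can be read with a few lines of code. The sketch below is not the parser used by the web service; the function name and the returned dictionary keys are hypothetical, and it assumes that the class names on the second row are separated by the same delimiter as the other rows.

```python
def parse_classification(text, delimiter="\t"):
    """Parse the three rows of a classification data file."""
    rows = text.splitlines()
    n_samples, n_classes, first_row, first_col = (
        int(field) for field in rows[0].split(delimiter)
    )
    if not rows[1].startswith("#"):
        raise ValueError("the second row must start with '#'")
    class_names = [f.strip() for f in rows[1][1:].split(delimiter) if f.strip()]
    # The third row may be empty (or absent) when test labels are unknown.
    label_row = rows[2] if len(rows) > 2 else ""
    labels = [int(f) for f in label_row.split(delimiter) if f.strip()]
    if labels and len(labels) != n_samples:
        raise ValueError("number of labels differs from number of samples")
    # Minimum label -> first class name, ..., maximum label -> last class name.
    # This mapping assumes every class occurs at least once among the labels.
    label_to_class = dict(zip(sorted(set(labels)), class_names))
    return {
        "n_samples": n_samples,
        "n_classes": n_classes,
        "first_row": first_row,
        "first_col": first_col,
        "class_names": class_names,
        "labels": labels,
        "label_to_class": label_to_class,
    }


info = parse_classification("4\t2\t2\t3\n#\tALL\tAML\n0\t0\t1\t1")
```

For a test set with unknown labels the returned list of labels is simply empty.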

This data format is the one most commonly used in studies of expression data classification (Golub et al., 1999; Ramaswamy et al., 2001). Many example data sets in this format can be downloaded from http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. (Attention: there is one difference in the data format. In the first row of the classification file, for both training and test sets, you should additionally specify the position of the first expression value in the matrix; see the description of the Classification data file format.)

Missing Value
     If the data have missing values, they should be marked with a dedicated symbol. Each missing value is replaced with the average expression of that gene over all samples.
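
A minimal sketch of this replacement for a single gene is given below. The function name is hypothetical, and the sketch assumes that the average is taken over the measured (non-missing) values of the gene.

```python
def impute_missing(raw_values, missing_symbol="Na"):
    """Replace missing entries of one gene with the mean of its measured values."""
    present = [float(v) for v in raw_values if v != missing_symbol]
    if not present:
        raise ValueError("gene has no measured values")
    mean = sum(present) / len(present)
    return [mean if v == missing_symbol else float(v) for v in raw_values]


filled = impute_missing(["10", "Na", "20", "30"])
```

Here the missing entry is filled with 20.0, the mean of the three measured values.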
     Examples of both files are shown below.
  

Expression data file:

Description                            Accession        ALL_1     ALL_2     AMM_1    AMM_2
endogenous control_1                   AFFX-BioB-5_at   13.13     135.65    200.20   300.56
MSR1 Macrophage scavenger receptor 1   D13264_at        2900.13   1350.65   2000.2   3678.56

 

Classification data file:


4   2   2   3

# ALL AML

0   0   1   1


The expression data file contains 4 samples, one row of sample names, and two name columns for each gene (Description and Accession). Expression data are available for 2 genes only.

The first line of the classification data file contains the following numbers:

4 – number of samples
2 – number of classes
2 – first row with data
3 – first column with data

The second line of this file contains the names of the classes. There are two classes {ALL, AML}, corresponding to two different types of leukemia. The third line gives the experimental class of each sample. The first two samples belong to the ALL class (0), and the last two samples belong to the AML class (1).

The data delimiter is tab, which is the default data delimiter.
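
To show how the "first row with data" and "first column with data" values locate the expression matrix, the example expression file can be read as in the sketch below. The function name is hypothetical, text comments after expression values (such as A or P) are not handled, and the sketch is not the reader used by the web service.

```python
def parse_expression(text, first_row, first_col, delimiter="\t"):
    """Return sample names and a mapping from gene names to expression values."""
    rows = [
        [field.strip() for field in line.split(delimiter)]
        for line in text.splitlines()
        if line.strip()
    ]
    # Only the first row of sample names is used for reports.
    sample_names = rows[0][first_col - 1:]
    genes = {}
    for row in rows[first_row - 1:]:
        # Only the first two gene names are used for reports.
        names = tuple(row[:first_col - 1][:2])
        genes[names] = [float(value) for value in row[first_col - 1:]]
    return sample_names, genes


example = "\n".join([
    "Description\tAccession\tALL_1\tALL_2\tAMM_1\tAMM_2",
    "endogenous control_1\tAFFX-BioB-5_at\t13.13\t135.65\t200.20\t300.56",
    "MSR1 Macrophage scavenger receptor 1\tD13264_at\t2900.13\t1350.65\t2000.2\t3678.56",
])
sample_names, genes = parse_expression(example, first_row=2, first_col=3)
```

The values 2 and 3 are the third and fourth numbers on the first line of the example classification data file.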

Possible Errors
     An error is generated and execution is terminated if any input parameter has a value outside its valid range or if a required input file is missing. Note that, in our experience, most errors are caused by problems with the input file data format!

Interpretation of Results
     The first portion of the results lists the options used for the current run. Next comes a report on the uploaded data. The information is presented in tables corresponding to the training and test sets. Each table has three columns: the class label, the class name, and the number of samples of that class in the corresponding set. The gene information includes the total number of correctly read genes and the total number of missing values in the data. The bad gene report lists the genes whose number of expression values differs from the number of samples specified in the classification data file. The last section contains the classification results. It starts with the sample report: for each classified sample, the predicted label and the actual label (if any) are reported. The overall classification statistics are presented in the next table. The last part of the classification results is the gene models report. For each class the corresponding gene model is reported. For each gene involved in the model the following information is given: the name, the coefficient reflecting the gene's contribution to the model (linear combination coefficient), and the pairwise correlation of the gene profile with the ideal marker gene profile (see footnote 1).
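
The correlation with the ideal marker gene profile can be illustrated as follows. The ideal profile is 1 for samples of the target class and 0 for all other samples. The use of the Pearson correlation coefficient and the function names are assumptions of this sketch; the exact measure reported by MAMA is described in Antonov et al. (2004).

```python
import math


def ideal_marker_profile(labels, target_label):
    """1.0 for samples of the target class, 0.0 for all other samples."""
    return [1.0 if label == target_label else 0.0 for label in labels]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


labels = [0, 0, 1, 1]
ideal = ideal_marker_profile(labels, target_label=1)
correlation = pearson([1.0, 1.0, 3.0, 3.0], ideal)
```

A gene that is uniformly higher in the target class than elsewhere, as in this toy profile, correlates perfectly with the ideal marker; a constant gene profile has zero variance, which this sketch does not handle.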

Example of Calculations
     An example of results calculated for the ALL/AML dataset of Golub et al., 1999 can be found here. The links on that page will help you to interpret the results.

Antonov, A. V., Tetko, I. V., Mader, M. T., Budczies, J. & Mewes, H. W. (2004). Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics, 20, 644-52.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531-7.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C. H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J. P., Poggio, T., Gerald, W., Loda, M., Lander, E. S. & Golub, T. R. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A, 98, 15149-54.

  1 A gene with expression value 0 in all classes except the target class, where its expression value equals 1.

 

  (c) 2002-2004 GSF - Forschungszentrum für Umwelt und Gesundheit, GmbH Ingolstädter Landstraße 1, D-85764 Neuherberg