GSF Logo GSF mips Logo mips
mips

services

mdcs

methods

 

 

Methods

 

mdcs

online software
basic options

advanced options

download software

example

help

methods

developers

mips home      mail to webmaster    print view

 

We introduce a new method for the construction of gene features for the classification of multiple tumor types using microarray data. The section is organized as follows: first we formulate in mathematical terms our concept of data transformation and ideal feature construction. Since for the gene expression data the number of response variables (i.e. samples/sample classes) is usually much smaller than the number of predictor variables (i.e. genes) it is possible to build ideal features in a number of alternative ways. Different criteria can be used depending on the procedures applied. In the second part of this section, we describe a new procedure for the feature selection that maximizes the margin of an ideal feature, i.e. the value that represents a distance of one particular tumor type from the others. Finally, we describe a classification procedure based on the ideal feature concept.

Definition of the ideal feature

Here, we introduce some conventions of notation .

counters and number of samples in the dataset, respectively

counters and number of genes in the dataset, respectively

counter and number of classes, respectively

input dataset in matrix form

signature of training sample k (the  k th row of matrix  X)

expression profile of gene  i (the  i th column of matrix  X)

some real constant or variable depending on context (index represents a gene or some related vector)

normalized feature space and the mapping from the original to the feature space

class of the k th sample

a function

a set of indices for a group of genes and counter of such sets

unity vector (all components equal to 1)

Norm

The basic idea is the following: via a nonlinear mapping   the initial input data   are mapped into feature space  F. The purpose of such transformation is to achieve that for every two samples of the same class the scalar product in the feature space is equal to one and for every two objects from the different classes equal to zero. This can be described by the following equations

     (A1)

Let us propose one of the possible approaches to construct a mapping  F(x), which satisfies equation (A1). We define the ideal feature vectors as binary vectors   with components equal to some positive scalar value zl  if the sample belongs to class  l and 0 otherwise. The number of such vectors is equal to the number of tumor classes, L. Our target is to construct the ideal feature vectors in the form

     (A2)    

where Jl  is a set of genes that form feature  l. The unity vector   is added to the model to include in the analysis features that can be transformed to the ideal by simple shift of every component. Thus, the vector   has values equal to   if the sample belongs to class  l and   otherwise.

From a biological point of view the ideal feature model assumes that expression levels of genes from the set Jl  are subjected to multiple positive and negative correlations with each other. The degree of the correlation remains constant in all classes except the target class  l, where it is changed to a different value. This can be considered as a change in the functional relation among the genes that build the feature of class  l. In other words, these genes show an interaction pattern typical for the tumor or state type.

 

Fig.1. The principle of ideal feature construction. a), b), c), d) and f) are the log transformed expression profiles of some hypothetical genes. No single gene profile has an apparent correlation with samples 6 – 10, which are the members of the distinct class. However a linear function   of these profiles provides the ideal feature as shown in g).

 

This approach leaves freedom in the selection of functions f(.) and procedures to identify the corresponding gene sets Jl . In this study the ideal features are constructed in the form (Fig. 1)

     (A3)    

Such choice of f(.) implies that multiple ratios of expression levels in the corresponding group of genes remain constant. For example, if only two classes, A and B (e.g. tumor and normal tissue), are to be discriminated (in case of multiple class classification, class B includes all classes except A) one has

 , for each sample k from class A, 
                                                                                       (A4)
 ,    for each sample n from class B.

Swapping of classes A and B corresponds to a renormalization of constants in (A3) and thus leads to the same classification result.

 

Optimization procedure for the ideal feature construction

For simplicity in this section we will consider only positive linear combinations in (A2,A3). But this case could be easily extended to generality by adding to the input dataset a negative copy of each gene in the form  .

Microarray expression data tend to have a large discrepancy between the number of predictors (i.e. genes) and responses (i.e. samples). Therefore, it is possible to select classifying gene sets Jl  in many different ways. Each such procedure requires some externally formulated criterion for selection among probable features. Since the number of species is much less then the number of genes it is possible to construct a large number of ideal features.

    A multiplication of coefficients   in (A4) on some constant   provides

 , for each sample k from class A, 
                                                                                       (A5)
 ,    for each sample n from class B.

and  . For this reason one can consider the ratio   as margin between class A and B and prefer features with minimal unity () component relative to z. If   is fixed, e.g. to 1, then this ratio is maximized by maximization of zl . This task can be implemented using linear programming (LP). LP practically has no restrictions on the size of the problem that can be solved. Consider the following optimization problem


 subject to:

                                                                                       

(A6)

The small constant  bounds deviations of solution of (A6) from the ideal feature. This constant was fixed to be 5% of 1/K, where K is the number of training set samples.

Fig.2. Geometric interpretation of the classification procedure. The procedure is looking for a subspace of the gene space (here genes 1,2,3) where samples from class A and class B are lying in different parallel hyperplanes with maximum distance, z, for fixed value of  . Points represent samples

    An example for the geometric interpretation of the ideal feature generation for the two-class separation is shown in Fig. 2. The constraints in (A6) ensure that class A and class B samples are lying in different parallel hyperplanes (and the distance from each sample to the corresponding hyperplane is within the constant ) in the gene subspace formed by the gene group Jl . The value of constant   fixes the distance between hyperplane B and the origin. The objective of (A6) is to find such a gene subspace Jl  where the margin between hyperplanes A and B is maximal. The proposed procedure is referred to as MAximal MArgin linear programming (MAMA) method.

 

Multiple tumor classification procedure

In case of multiple tumor classification the problem (A6) is solved L times and L feature vectors   are formed. The nonzero    elements in each solution correspond to genes that constitute group  Jl , l=1,…L and are used in formula (A3). Thus, each constructed feature corresponds to one particular class (in our case tumor type). For each sample k  to be classified one calculates L feature values  for each tumor class l=1,…L and votes  . The index of the maximum vote l indicates the predicted class for the sample k.

 

  (c) 2002-2004 GSF - Forschungszentrum für Umwelt und Gesundheit, GmbH Ingolstädter Landstraße 1, D-85764 Neuherberg