We introduce a new method for constructing gene features for the classification of multiple tumor types from microarray data. The section is organized as follows. First, we formulate in mathematical terms our concept of data transformation and ideal feature construction. Because in gene expression data the number of response variables (i.e. samples or sample classes) is usually much smaller than the number of predictor variables (i.e. genes), ideal features can be built in a number of alternative ways, and different criteria can be used depending on the procedure applied. Second, we describe a new feature selection procedure that maximizes the margin of an ideal feature, i.e. the value that represents the distance of one particular tumor type from the others. Finally, we describe a classification procedure based on the ideal feature concept.

Definition of the ideal feature

Here, we introduce some conventions of notation.
The basic idea is the following: via a nonlinear mapping
We propose one possible approach to constructing a mapping F(x) that satisfies equation (A1). We define the ideal feature vectors as binary vectors
where Jl is a set of genes that form feature l. The unity vector

From a biological point of view, the ideal feature model assumes that the expression levels of genes from the set Jl are subject to multiple positive and negative correlations with each other. The degree of correlation remains constant in all classes except the target class l, where it changes to a different value. This can be considered a change in the functional relation among the genes that build the feature of class l. In other words, these genes show an interaction pattern typical for the tumor or state type.
This approach leaves freedom in the choice of the functions f(.) and of the procedures used to identify the corresponding gene sets Jl. In this study, the ideal features are constructed in the form (Fig. 1)
Such a choice of f(.) implies that multiple ratios of expression levels in the corresponding group of genes remain constant. For example, if only two classes, A and B (e.g. tumor and normal tissue), are to be discriminated (in multiple-class classification, class B includes all classes except A), one has
Swapping classes A and B corresponds to a renormalization of the constants in (A3) and thus leads to the same classification result.
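The constant-ratio property can be illustrated with a small numerical sketch. Equations (A2) and (A3) are not reproduced here, so the code assumes, purely for illustration, that f(.) is a logarithm and that the feature is a weighted sum of log expression levels. The function name `log_feature`, the weights, and the expression values are all hypothetical.

```python
import math

def log_feature(sample, weights):
    """Weighted sum of log expression levels.

    ASSUMPTION: f(.) = log is used only for illustration; the exact
    form of the feature is given by equation (A2), not shown here.
    """
    return sum(w * math.log(x) for x, w in zip(sample, weights))

# Hypothetical expression levels for two genes (g1, g2).
class_b = [(2.0, 1.0), (6.0, 3.0), (10.0, 5.0)]  # g1/g2 = 2 in every sample
class_a = [(8.0, 1.0), (16.0, 2.0)]              # g1/g2 = 8 in every sample

# With weights (+1, -1) the feature equals log(g1 / g2).
weights = (1.0, -1.0)

feat_b = [log_feature(s, weights) for s in class_b]
feat_a = [log_feature(s, weights) for s in class_a]

print("class B feature values:", feat_b)
print("class A feature values:", feat_a)
```

Under these assumptions the feature takes one value for every class B sample and a different value for every class A sample, even though the absolute expression levels vary widely within each class. Only the ratio of the two genes distinguishes the classes.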
Optimization procedure for the ideal feature construction

For simplicity, in this section we consider only positive linear combinations in (A2,A3). This case can easily be generalized by adding to the input dataset a negative copy of each gene in the form

Microarray expression data tend to have a large discrepancy between the number of predictors (i.e. genes) and responses (i.e. samples). Therefore, classifying gene sets Jl can be selected in many different ways, and each such procedure requires some externally formulated criterion for choosing among candidate features. Since the number of samples is much smaller than the number of genes, it is possible to construct a large number of ideal features. A multiplication of coefficients
and
The small constant
An example of the geometric interpretation of ideal feature generation for two-class separation is shown in Fig. 2. The constraints in (A6) ensure that class A and class B samples lie in different parallel hyperplanes (and the distance from each sample to the corresponding hyperplane is within the constant
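The geometric picture can be checked numerically. The sketch below is illustrative only and does not solve problem (A6). It takes a given positive weight vector, tests whether each class stays within a tolerance of its own hyperplane, and computes the Euclidean distance between the two parallel hyperplanes w·x = c_A and w·x = c_B, which is |c_A - c_B| / ||w||. The names `within_band` and `hyperplane_gap`, the levels 1 and 0, the tolerance, and all data values are assumptions made for the example. The tolerance is applied to the feature value rather than to the Euclidean distance.

```python
import math

def feature_value(sample, weights):
    """Linear feature w . x for one sample (already transformed by f)."""
    return sum(w * x for w, x in zip(weights, sample))

def within_band(samples, weights, level, eps):
    """True if every sample's feature value is within eps of `level`."""
    return all(abs(feature_value(s, weights) - level) <= eps for s in samples)

def hyperplane_gap(weights, level_a, level_b):
    """Euclidean distance between the hyperplanes w.x = level_a and w.x = level_b."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(level_a - level_b) / norm

# Hypothetical positive weights and target levels (1 for class A, 0 for class B).
weights = (0.3, 0.4)
level_a, level_b, eps = 1.0, 0.0, 0.05

# Hypothetical transformed expression values for two genes.
class_a = [(2.0, 1.0), (1.0, 1.75)]  # feature values close to 1
class_b = [(0.0, 0.1), (0.1, 0.0)]   # feature values close to 0

print("class A on its hyperplane:", within_band(class_a, weights, level_a, eps))
print("class B on its hyperplane:", within_band(class_b, weights, level_b, eps))
print("distance between hyperplanes:", hyperplane_gap(weights, level_a, level_b))
```

For fixed levels, a smaller weight norm gives a larger distance between the hyperplanes, which is one way to read the margin of a feature geometrically.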
Multiple tumor classification procedure

In case of multiple tumor classification the problem (A6) is solved L times and L feature vectors