
Why EST3?

We recently analysed the power of the SVM (Support Vector Machine) to classify EST data from libraries containing sequences from two organisms. The SVM classifiers were trained using codon usage and n-mer frequencies. The best results were obtained using the frequencies of trinucleotides (3-mers).

Frequency of triplets

Each EST sequence is split into overlapping triplets and their frequencies are calculated. E.g. the sequence AATCAAT is split into "AAT", "ATC", "TCA", "CAA", "AAT". The frequencies are:
triplet  count  frequency
AAT      2      0.4
ATC      1      0.2
TCA      1      0.2
CAA      1      0.2
sum      5      1

If the analysed EST sequence contains disallowed symbols, e.g. the X in AATXCAAT, all triplets containing such a symbol (in this example ATX, TXC and XCA) are ignored.
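The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the EST3 server: the function name triplet_frequencies and the ALLOWED set are hypothetical, and the assumption that the allowed symbols are exactly A, C, G and T is ours.

```python
from collections import Counter

# Assumption: only the four standard nucleotides count as allowed symbols.
ALLOWED = set("ACGT")


def triplet_frequencies(sequence):
    """Count overlapping triplets and return (counts, frequencies).

    Triplets containing a symbol outside ALLOWED are skipped, so they
    contribute neither to the counts nor to the total.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - 2):
        triplet = sequence[i:i + 3]
        if set(triplet) <= ALLOWED:
            counts[triplet] += 1
    total = sum(counts.values())
    if total == 0:
        return counts, {}
    frequencies = {t: n / total for t, n in counts.items()}
    return counts, frequencies


counts, freqs = triplet_frequencies("AATCAAT")
for triplet in sorted(counts):
    print(triplet, counts[triplet], freqs[triplet])
```

For AATCAAT this reproduces the table above (AAT counted twice out of five triplets). For AATXCAAT only AAT, CAA and AAT remain, so the frequencies are computed over three triplets.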

The triplet frequencies were used to train the Support Vector Machine (SVM) and to estimate the performance of the method using a double fold cross-validation procedure. The SVM models and their statistical results were stored in a MySQL database and can be used to classify new EST data at our site. An article with a detailed description of our results has been submitted for publication.

Training mode

To develop classifiers we use the open source LibSVM package (Chang and Lin, 2005). In support vector machines, the input variables are first mapped into a higher dimensional feature space by the use of a kernel function, and then a linear model is constructed in this feature space. For the purposes of the current study we restricted our analysis to the RBF kernel, which was also used in our previous studies and demonstrated higher prediction ability than a number of other approaches studied (Friedel, et al., 2005). The parameters of the SVM were optimized for each dataset using an internal cross-validation procedure following a grid search. The grid search included the SVM parameter C = 2^-5, 2^-3, ..., 2^15 and the width of the RBF kernel γ = 2^-15, 2^-13, ..., 2^3, as recommended in the LibSVM manual. The input data for the SVM algorithm were normalized to a (0,1) interval before the data analysis. The same approach is used to develop a new classifier in the Training mode. N.B.! The optimisation of a large dataset can take hours.
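The search space and the preprocessing described above can be written out explicitly. The following Python sketch uses only the standard library and does not call LibSVM; the names scale_columns, rbf_kernel, C_GRID, GAMMA_GRID and PARAM_GRID are hypothetical, and min-max scaling per input variable is an assumption about how the normalization was done. The kernel is written in the form LibSVM documents for its RBF kernel, exp(-γ·|u - v|^2).

```python
import math


def scale_columns(rows):
    """Min-max scale every column of `rows` to the range 0..1.

    Assumption: each input variable is scaled independently; a constant
    column is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def rbf_kernel(u, v, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Exponents advance in steps of 2, as in the grid quoted in the text.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]       # 2^-5 ... 2^15
GAMMA_GRID = [2.0 ** e for e in range(-15, 4, 2)]   # 2^-15 ... 2^3

# Every (C, gamma) pair that the internal cross-validation would score.
PARAM_GRID = [(c, g) for c in C_GRID for g in GAMMA_GRID]

print(len(C_GRID), len(GAMMA_GRID), len(PARAM_GRID))
```

With 11 values of C and 10 values of γ the grid holds 110 parameter pairs, each requiring a full internal cross-validation run, which is consistent with the warning that optimisation of a large dataset can take hours.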

 

 

  (c) 2006 GSF - Forschungszentrum für Umwelt und Gesundheit, GmbH Ingolstädter Landstraße 1, D-85764 Neuherberg
















































