Why EST3?
We recently analysed how well a Support Vector Machine (SVM) can classify EST data from
libraries containing sequences from two organisms. The SVM classifiers were trained using codon usage and n-mer frequencies.
The best results were obtained using frequencies of trinucleotides (3-mers).
Frequency of triplets
Each EST sequence is split into overlapping triplets and their frequencies are calculated. E.g. the sequence
AATCAAT is split into "AAT", "ATC", "TCA", "CAA", "AAT". The frequencies are:
| triplet | count | frequency |
| AAT | 2 | 0.4 |
| ATC | 1 | 0.2 |
| TCA | 1 | 0.2 |
| CAA | 1 | 0.2 |
| sum | 5 | 1 |
If the analysed EST sequence contains a disallowed symbol, e.g. the X in AATXCAAT, all triplets containing that symbol,
i.e. in this example ATX, TXC and XCA, are ignored.
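The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used on our site: the function name `triplet_frequencies` is hypothetical, and the set of allowed symbols (A, C, G, T) is an assumption, since the allowed alphabet is not listed here.

```python
from collections import Counter

# Assumption: only the four standard nucleotide symbols are allowed.
ALLOWED = set("ACGT")


def triplet_frequencies(sequence):
    """Return {triplet: (count, frequency)} over overlapping triplets.

    Triplets containing a symbol outside ALLOWED are skipped, and the
    frequencies are computed over the remaining triplets only.
    """
    # Slide a window of width 3 over the sequence, one position at a time.
    triplets = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    valid = [t for t in triplets if set(t) <= ALLOWED]
    counts = Counter(valid)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (n, n / total) for t, n in counts.items()}


if __name__ == "__main__":
    for triplet, (count, freq) in triplet_frequencies("AATCAAT").items():
        print(triplet, count, freq)
```

For AATCAAT this reproduces the table above. For AATXCAAT only AAT (twice) and CAA are counted, so the frequencies are taken over three triplets.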
The triplet frequencies are used to train the Support Vector Machine (SVM) and to estimate the performance of the method
using a double fold cross-validation procedure. The SVM models and their statistical results were stored in a MySQL database
and can be used to classify new EST data at our site. An article with a detailed description of our results has been submitted
for publication.
Training mode
To develop classifiers we use the open source LibSVM package (Chang and Lin, 2005).
In support vector machines, the input variables are first mapped into a higher dimensional feature space by
the use of a kernel function, and then a linear model is constructed in this feature space. For the purposes of the current
study we restricted our analysis to the RBF kernel, which was also used in our previous studies and demonstrated higher
prediction ability than a number of other approaches studied (Friedel et al., 2005).
The parameters of the SVM were optimized for each dataset using an internal cross-validation procedure with a grid search.
The grid search covered the SVM parameter C = 2^-5, 2^-3, ..., 2^15 and the width of the RBF kernel γ = 2^-15, 2^-13, ..., 2^3, as recommended in the LibSVM manual.
The input data for the SVM algorithm were normalized to a (0,1) interval before the data analysis.
The same approach is used to develop a new classifier in the Training mode. N.B.! The optimisation of a large dataset can take hours.
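To make the size of the search concrete, the sketch below lists the parameter grid and shows one way to scale the inputs. It is an illustration only: the helper names are hypothetical, and per-column min-max scaling is an assumption, since the exact normalization procedure is not specified here. The SVM training itself is done with LibSVM and is not reproduced.

```python
def power_of_two_grid(first_exp, last_exp, step=2):
    """Return [2^first_exp, 2^(first_exp + step), ..., 2^last_exp]."""
    return [2.0 ** e for e in range(first_exp, last_exp + 1, step)]


C_GRID = power_of_two_grid(-5, 15)       # 2^-5, 2^-3, ..., 2^15
GAMMA_GRID = power_of_two_grid(-15, 3)   # 2^-15, 2^-13, ..., 2^3


def grid_points():
    """All (C, gamma) pairs that the grid search would evaluate."""
    return [(c, g) for c in C_GRID for g in GAMMA_GRID]


def scale_columns(rows):
    """Min-max scale each feature column to [0, 1].

    Assumption: scaling is done per column over the whole dataset;
    a constant column is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(v - lo) / span if span else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


if __name__ == "__main__":
    print(len(C_GRID), "values of C,", len(GAMMA_GRID), "values of gamma")
    print(len(grid_points()), "parameter pairs in total")
```

Each parameter pair requires a full internal cross-validation run, which is why the optimisation becomes slow for large datasets.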