Knowledge networks can be extracted from fundamentally different sources of
biological knowledge. At the core of our system we employ the MIPS Functional
Catalogue (FunCat). Gene sequence similarity and InterPro domain data are
employed as additional, independent data sources. Manually annotated PPI
databases are used as well. The system is flexible: each knowledge module can
be switched on or off depending on the purpose of the study, and an option
allows users to upload their own knowledge modules in the specified format.
Functional Catalogue Module (FunCat module)
The FunCat is an
annotation scheme for the functional description of proteins. Taking into
account the broad and highly diverse spectrum of known protein functions, the FunCat
consists of 28 main functional categories (or branches) that cover general
fields like cellular transport, metabolism and cellular communication/signal
transduction. The main branches exhibit a hierarchical, tree-like structure
with up to six levels of increasing specificity. In total, the FunCat includes
1307 functional categories.
Each functional category is assigned a unique two-digit number. The position
of a category in the hierarchy is encoded by prefixing its number with the
numbers of its ancestor nodes in the upper levels. Levels are separated by
dots: for example, 01 metabolism belongs to the highest level, whereas
01.01.03.02.01 biosynthesis of glutamate belongs to one of the most specific
levels of FunCat.
Since FunCat contains 1307 different functional categories, the same number of
networks can be extracted, one per category. The extraction procedure is
simple: if two genes are annotated with the same category, they are connected
in the corresponding network. The hierarchical, tree-like structure of FunCat
implies a hierarchical organization of the extracted networks: networks
generated by very specific categories (e.g. 01.01.03.02.01 biosynthesis of
glutamate) are subnetworks of the networks generated by the corresponding
less specific ones (e.g. 01 metabolism).
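The extraction rule can be illustrated with a minimal Python sketch. It is not
the BIOREL implementation: the function names and the gene identifiers (geneA,
geneB, geneC) and their annotations are hypothetical, and we assume that a gene
annotated with a specific category also counts as a member of every ancestor
category, which is what makes the specific networks subnetworks of the general
ones.

```python
from collections import defaultdict
from itertools import combinations


def funcat_ancestors(code):
    """Return a FunCat code together with all of its prefixes (ancestors)."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def extract_funcat_networks(annotations):
    """Build one network per category from a gene -> FunCat codes mapping.

    Assumption: annotations propagate upwards, so a gene annotated with
    01.01.03 is also treated as a member of 01.01 and 01.
    Each network is returned as a set of undirected edges (frozensets).
    """
    members = defaultdict(set)
    for gene, codes in annotations.items():
        for code in codes:
            for category in funcat_ancestors(code):
                members[category].add(gene)
    # Two genes sharing a category are connected in that category's network.
    return {
        category: {frozenset(pair) for pair in combinations(sorted(genes), 2)}
        for category, genes in members.items()
    }


# Hypothetical annotations, for illustration only.
annotations = {
    "geneA": ["01.01.03.02.01"],
    "geneB": ["01.01.03.02.01"],
    "geneC": ["01.05"],
}
networks = extract_funcat_networks(annotations)

specific = networks["01.01.03.02.01"]
general = networks["01"]
print(specific <= general)  # the specific network is contained in the general one
```

In this sketch the subnetwork relation follows directly from the upward
propagation of annotations; without that assumption, a network would only
contain genes annotated with exactly that category.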
Sequence similarity (SS) module
The base information used by this module is the pairwise similarity score
between the amino acid sequences encoded by two genes. FASTA pairwise scores
were retrieved from the SIMAP database. The input values were calculated as
-log10(E-value). Pairwise scores with E-value > 0.1 were excluded from the
analysis. The edge weight between two genes is proportional to the similarity
score.
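The scoring and filtering steps can be sketched in Python as follows. This is
an illustration under stated assumptions, not the original code: the function
names, the proportionality constant `scale`, the floor `MIN_E_VALUE` used to
avoid taking the logarithm of zero, and the example pairs are all hypothetical.

```python
import math

E_VALUE_CUTOFF = 0.1   # pairs with E-value > 0.1 are excluded (from the text)
MIN_E_VALUE = 1e-300   # assumption: floor that keeps log10 defined for E-value 0


def similarity_score(e_value):
    """Return -log10(E-value), or None if the pair fails the cutoff."""
    if e_value > E_VALUE_CUTOFF:
        return None
    return -math.log10(max(e_value, MIN_E_VALUE))


def build_ss_network(pairs, scale=1.0):
    """Map each retained gene pair to an edge weight proportional to its score.

    pairs: iterable of (gene_a, gene_b, e_value) tuples.
    scale: hypothetical proportionality constant.
    """
    network = {}
    for gene_a, gene_b, e_value in pairs:
        score = similarity_score(e_value)
        if score is not None:
            network[frozenset((gene_a, gene_b))] = scale * score
    return network


# Hypothetical E-values, for illustration only.
pairs = [
    ("geneA", "geneB", 1e-50),  # strong similarity, kept
    ("geneA", "geneC", 0.5),    # above the cutoff, dropped
    ("geneB", "geneC", 0.1),    # exactly at the cutoff, kept
]
ss_network = build_ss_network(pairs)
```

Because the cutoff is applied as E-value > 0.1, a pair with an E-value of
exactly 0.1 is retained and receives a score of about 1.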
There are several reasons to include the sequence similarity (SS) module in
the BIOREL system. First of all, it reveals any bias in the network that can be
attributed to the sequence similarity of the genes. For example, this module
may be very helpful in analyses of gene expression data for estimating
cross-hybridization effects: any systematic bias towards similar expression
profiles among genes with similar sequences will be detected. However,
estimating this effect is not as simple as it seems. Genes with similar
sequences tend to be functionally related, and thus one needs to separate two
effects: sequence similarity and similar function. These effects can be
estimated by applying the BIOREL system twice to the network extracted from the
expression data, the first time using only the FunCat module and the second
time using only the SS module. If the functional bias of the network and the
set of genes classified as relevant are similar in both cases, then most edges
in the network connect only genes that share strong sequence similarity (and
are thus functionally related), and there are no edges that connect
functionally similar genes without sequence similarity. Such a result would
indicate a strong cross-hybridization signal in the analyzed expression data.
InterPro Domain (IPD) module
The base information used is the protein domain composition provided by the
InterPro database. The number of different networks extracted by this module
corresponds to the number of domains: each domain generates a network. The
extraction procedure creates an edge between two genes if their proteins both
contain the corresponding domain. Any systematic bias in the network due to
similar domain composition of interacting genes will be estimated by this
module.
Gene Neighborhood module
The base information used is the physical distance between two genes on the
chromosome. The weight of the edge between two genes is inversely proportional
to the distance separating them on the chromosome. Two options are
implemented: the distance is measured either in number of genes or in number
of nucleotides. Any systematic bias in the gene interactions reflected in the
network due to gene neighborhood on the chromosome will be estimated by this
module.
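A minimal Python sketch of the two distance options is given below. Several
details are assumptions rather than facts from the text: genes on different
chromosomes receive no edge, the distance in genes is the difference in rank
order along the chromosome, the distance in nucleotides is the difference
between start coordinates, and `scale` is a hypothetical proportionality
constant. The gene names and coordinates are invented for illustration.

```python
from itertools import combinations


def neighborhood_network(genes, unit="genes", scale=1.0):
    """Weight each same-chromosome gene pair by scale / distance.

    genes: list of (name, chromosome, start_coordinate) tuples.
    unit:  "genes" (rank difference) or "nucleotides" (coordinate difference).
    """
    if unit not in ("genes", "nucleotides"):
        raise ValueError("unit must be 'genes' or 'nucleotides'")

    # Group genes by chromosome; pairs across chromosomes get no edge (assumption).
    by_chromosome = {}
    for name, chromosome, start in genes:
        by_chromosome.setdefault(chromosome, []).append((start, name))

    weights = {}
    for members in by_chromosome.values():
        members.sort()  # order genes along the chromosome
        for (i, (start_a, a)), (j, (start_b, b)) in combinations(enumerate(members), 2):
            distance = (j - i) if unit == "genes" else (start_b - start_a)
            if distance > 0:  # skip pairs with identical coordinates
                weights[frozenset((a, b))] = scale / distance
    return weights


# Hypothetical gene positions, for illustration only.
genes = [
    ("geneA", "chrI", 1000),
    ("geneB", "chrI", 3000),
    ("geneC", "chrI", 9000),
    ("geneD", "chrII", 500),
]
by_gene_count = neighborhood_network(genes, unit="genes")
by_nucleotides = neighborhood_network(genes, unit="nucleotides")
```

With the gene-count option, adjacent genes receive the maximum weight
regardless of how many nucleotides separate them, whereas the nucleotide
option distinguishes tightly packed regions from sparse ones.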
Protein-protein interaction (PPI) module
There are several databases of protein-protein interactions in yeast. These
include manually curated catalogues of known protein complexes and data from
high-throughput experiments, such as two-hybrid experiments, genetic
interactions, etc. Although assembled differently, they are similar in storage
format; therefore the same network extraction procedure can be used in all
cases. An edge of the binary network is created if two proteins are involved
in an interaction according to the database record. In its web configuration,
the BIOREL system employs only manually curated catalogues of known protein
complexes.
User defined knowledge modules
This option can be used for several purposes. First, biological information is
very dynamic: new sources of information that consider genes from different
biological perspectives can arise. We therefore allow users to add data to our
knowledge base. Second, this option allows the user to infer the relevance of
the target network based on associations from a set of user-supplied networks.
Such networks represent not biological knowledge but other networks extracted
by different methods or from different kinds of high-throughput data. This
kind of analysis offers insight into the differences and similarities of
networks extracted by different statistical methodologies or from different
kinds of data. It can be very useful for benchmarking network inference
procedures.