HitPick is a web server that facilitates the analysis of chemical screenings by identifying hits and predicting their molecular targets. The target prediction functionality can also be used in a stand-alone fashion.
For hit identification, the widely used B-score method (1) is applied.
For target prediction, HitPick applies a new approach that combines two 2D molecular similarity based methods, namely, simple 1-Nearest-Neighbour (1NN) similarity searching (2) and a machine learning method based on Laplacian-modified naive Bayesian models (3).
- Training and validation sets for the target prediction part
- Input description
- Output examples
- Output description
- Interaction data from STITCH and stereochemistry
Training and validation sets for the target prediction part:
For the protein target prediction method we use 145,549 direct human chemical-protein interactions collected from STITCH 3.1, covering the protein targets of 99,572 compounds. For each of the 1,375 targets, we randomly assigned 85% of its ligands to the training set and 15% to the validation set. In total, the validation set contains 22,868 positive and 20,779,507 negative compound-target relationships. When only the highest scoring target prediction for each compound is evaluated, HitPick achieves a sensitivity of 60.94% (the maximum possible sensitivity being 66.16%), a specificity of 99.99% and a precision of 92.11%. This is an improvement over naive Bayesian models (sensitivity of 52.95%, specificity of 99.98%, precision of 80.03%) and 1NN similarity searching (precision of 84.72%).
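The three measures quoted above follow the standard confusion-matrix definitions. The sketch below is only an illustration: the function name and the counts in the usage line are hypothetical and are not HitPick's validation counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and precision as fractions.

    tp/fp/tn/fn are counts of compound-target relationships.
    A measure whose denominator is zero is returned as None.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true interactions recovered
        "specificity": ratio(tn, tn + fp),  # share of non-interactions rejected
        "precision": ratio(tp, tp + fp),    # share of predictions that are correct
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=6, fp=1, tn=99, fn=4)
print(metrics)
```

Because negative relationships outnumber positive ones by roughly three orders of magnitude in the validation set, specificity stays close to 100% for almost any method; precision is the more discriminating figure here.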
We also evaluated the performance of the HitPick target prediction method at different ranges of chemical similarity (measured by the Tanimoto coefficient, Tc (4)) between the query compound and the closest training compound, independently for each of up to five of its top scoring targets. To obtain robust precision estimates, we require a minimum of 30 compound-target predictions for each target rank in a given Tc interval (Table 1). We observed that the precision increases with increasing Tc. For compounds with a Tc of 0.7 or higher to the training set, the first predicted target was nearly always correct. Furthermore, the precision reached at least 53% for a Tc in the range of 0.4 to 0.5 (Table 1). Thus, we chose 50% as the default precision threshold for the predicted targets on the web server.
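The binning described above can be sketched as follows. This is a minimal illustration, assuming Tc intervals of width 0.1 (only the 0.4 to 0.5 interval is stated in the text) and an input of (Tc, rank, is_correct) triples; the function names are hypothetical.

```python
from collections import defaultdict


def tc_bin(tc, n_bins=10):
    """Map a Tc in [0, 1] to an interval index; Tc = 1 falls in the last bin.

    Rounding before truncation avoids floating-point artefacts
    such as 0.7 * 10 evaluating to 7.000000000000001.
    """
    index = int(round(tc * n_bins, 6))
    return min(index, n_bins - 1)


def precision_table(predictions, min_predictions=30):
    """Precision (%) per (Tc interval, target rank).

    predictions: iterable of (tc, rank, is_correct) triples.
    Cells with fewer than min_predictions entries are reported
    as None, corresponding to "NA".
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for tc, rank, is_correct in predictions:
        cell = counts[(tc_bin(tc), rank)]
        cell[0] += int(bool(is_correct))
        cell[1] += 1

    table = {}
    for key, (correct, total) in counts.items():
        if total >= min_predictions:
            table[key] = 100.0 * correct / total
        else:
            table[key] = None
    return table
```

A cell such as (4, 1) then holds the precision of the first-ranked target for validation compounds whose closest training compound has a Tc between 0.4 and 0.5.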
Table 1. Precision (%) for the first five predicted targets in relation to the Tc similarity of a validation compound to the most similar molecule in the training set.
The precision for cells marked as "NA" could not be determined due to the low number of compound-target predictions (fewer than 30). Owing to the design of the fingerprint scheme, which is widely and successfully used, a Tc of 1 does not necessarily mean that two molecules are identical. For compounds that are already in the STITCH database, their known targets are assigned a target prediction precision of 100%.
- Bioassay data:
- Target prediction data
The format of the screening data in HitPick is the same as that used by ChemBank. As an example, we provide the following chemical screening from ChemBank. Screening data uploaded by the user should be a tab-delimited ".txt" file.
Note: 1111.0016 is the name of the assay. 2012 and 2021 are the plate numbers of the assay. A and B are replicates. NA means "Not Available". If a well is a control well of the plate, no id or structure is given for it.
The stand-alone version accepts up to 100 molecules as input, given as a list that contains on each line either a single SMILES string or a molecule identifier followed by the SMILES string, separated by whitespace. If you would like to predict targets for more compounds, you are welcome to contact us.
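A list in this format can be checked before submission with a few lines of Python. The sketch below is not part of HitPick: the function name, the skipping of blank lines and the generated identifiers (mol_1, mol_2, ...) for lines without an identifier are assumptions made for illustration, and the SMILES strings are not validated chemically.

```python
MAX_MOLECULES = 100  # limit stated for the stand-alone target prediction


def parse_molecule_list(text, max_molecules=MAX_MOLECULES):
    """Parse lines of 'SMILES' or 'identifier SMILES' into (id, SMILES) pairs."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) == 1:
            # assumption: generate a placeholder identifier
            identifier, smiles = "mol_%d" % (len(records) + 1), fields[0]
        elif len(fields) == 2:
            identifier, smiles = fields
        else:
            raise ValueError("line %d: expected 1 or 2 fields, got %d"
                             % (line_no, len(fields)))
        records.append((identifier, smiles))

    if len(records) > max_molecules:
        raise ValueError("%d molecules given, at most %d accepted"
                         % (len(records), max_molecules))
    return records


example = "CCO\naspirin CC(=O)Oc1ccccc1C(=O)O\n"
print(parse_molecule_list(example))
```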
- Hit identification:
- Target prediction:
To see an interactive hit identification output example, please click here.
To see an interactive target prediction output example, please click here.
- Precision cut-off
- Predicted targets:
- Diverse subset selection
Hits are determined by a p-value cut-off of 0.05. If the assay contains replicates of a compound, the compound is considered a hit only if all of its replicates are identified as hits.
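The replicate rule can be expressed compactly. The sketch below assumes that a p-value has already been computed for every replicate and that the cut-off is applied as a strict "less than" comparison; both the function name and the strictness of the comparison are assumptions, and the compound ids are hypothetical.

```python
def call_hits(p_values_by_compound, alpha=0.05):
    """Return the sorted ids of compounds whose replicates all pass the cut-off.

    p_values_by_compound: dict mapping compound id -> list of
    per-replicate p-values. Compounds without any p-value are skipped.
    """
    return sorted(
        compound_id
        for compound_id, p_values in p_values_by_compound.items()
        if p_values and all(p < alpha for p in p_values)
    )


# Hypothetical p-values for three compounds (two with replicates A and B).
screen = {"c1": [0.01, 0.03], "c2": [0.01, 0.20], "c3": [0.04]}
print(call_hits(screen))
```

In this example c2 is not reported, because only one of its two replicates passes the cut-off.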
The precision of the target prediction is calculated within intervals of chemical similarity (Tc) between query and STITCH compounds. The Tc is computed on FCFP-like (5) circular Morgan fingerprints using feature invariants, as implemented in RDKit. Different ranges of Tc are associated with different precision values according to our validation procedure (Table 1).
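For two fingerprints A and B, the Tc is the number of shared features divided by the number of features present in either fingerprint. The sketch below uses plain Python sets of feature ids as a stand-in for the RDKit fingerprints, so that it runs without RDKit; the function names, the toy feature ids and the convention of returning 0 for two empty fingerprints are assumptions.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient |A & B| / |A | B| of two feature sets."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(a & b) / union


def nearest_neighbour(query, training):
    """Return (id, Tc) of the training compound most similar to the query.

    training: non-empty dict mapping compound id -> feature set.
    """
    best_id = max(training, key=lambda cid: tanimoto(query, training[cid]))
    return best_id, tanimoto(query, training[best_id])


# Toy feature ids, for illustration only.
training_set = {"a": {1, 2, 3, 4}, "b": {7}}
print(nearest_neighbour({1, 2, 3}, training_set))
```

The Tc returned for the nearest neighbour is the value that selects the Tc interval, and hence the precision estimate, for a query compound.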
The default precision cut-off is 50%. Users can set a different precision cut-off for the target prediction results. With a lower cut-off, more chemicals will have predictions, but these predictions are less reliable.
HitPick reports only those targets per compound for which we can reliably estimate the precision. The precision depends on the similarity to the most similar compound in the set of known interactions as well as on the rank of the target's score. To ensure reliability of reported precision values we require a minimum of 30 compound-target predictions. The results are displayed sorted by precision with a threshold of 50% by default, which can be adapted by the user to meet individual requirements. The targets are reported as gene symbols and more information can be found at STITCH or GeneCards.
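The reporting step amounts to a filter followed by a sort. In the sketch below, the record layout, the function name and the example values are hypothetical; predictions without a precision estimate are represented by None and dropped, and the cut-off is assumed to be inclusive.

```python
def report_targets(predictions, min_precision=50.0):
    """Keep predictions with a precision estimate at or above the cut-off,
    sorted from highest to lowest precision.

    predictions: list of dicts with keys "compound", "target" and
    "precision" (a percentage, or None if it could not be estimated).
    """
    kept = [
        p for p in predictions
        if p["precision"] is not None and p["precision"] >= min_precision
    ]
    return sorted(kept, key=lambda p: p["precision"], reverse=True)


# Hypothetical predictions, for illustration only.
example = [
    {"compound": "c1", "target": "GENE_A", "precision": 61.0},
    {"compound": "c1", "target": "GENE_B", "precision": None},
    {"compound": "c2", "target": "GENE_C", "precision": 93.5},
    {"compound": "c3", "target": "GENE_D", "precision": 42.0},
]
for row in report_targets(example):
    print(row["compound"], row["target"], row["precision"])
```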
Fig 1. An overview of the predicted targets. The ten most frequently predicted targets are highlighted for the target prediction outcome under a given precision cut-off.
Whenever the hit identification routine returns more than 100 compounds, target prediction is carried out for a structurally diverse (meaning as dissimilar as possible) subset consisting of 100 compounds. This procedure is intended to facilitate the analysis of molecular targets putatively involved in the measured biological processes by focusing on a representative subset of hits.
For this, we employ the MaxMin algorithm (6) as provided by RDKit, which follows a simple yet efficient approach. It is initialized with a random seed compound and then iteratively adds the compound from outside the subset that is maximally dissimilar to the current subset, until the desired number of compounds is selected.
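The selection loop can be illustrated in pure Python. This is a sketch of the general MaxMin idea, not the RDKit implementation used by HitPick; the function name and signature are hypothetical. Any distance function can be supplied, for example 1 - Tc between two fingerprints.

```python
import random


def maxmin_pick(items, distance, n_pick, seed=None):
    """Pick n_pick indices of mutually dissimilar items with a MaxMin loop.

    Starts from a random item, then repeatedly adds the item whose
    distance to its closest already-picked item is largest.
    """
    n_items = len(items)
    if n_pick <= 0 or n_items == 0:
        return []
    if n_pick >= n_items:
        return list(range(n_items))

    rng = random.Random(seed)
    first = rng.randrange(n_items)
    picked = [first]
    picked_set = {first}
    # min_dist[i]: distance from item i to the closest picked item
    min_dist = [distance(items[first], item) for item in items]

    while len(picked) < n_pick:
        candidates = (i for i in range(n_items) if i not in picked_set)
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        picked_set.add(nxt)
        for i, item in enumerate(items):
            d = distance(items[nxt], item)
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy example: numbers on a line, absolute difference as distance.
points = [0.0, 0.1, 0.2, 5.0, 10.0]
print(maxmin_pick(points, lambda a, b: abs(a - b), n_pick=3, seed=0))
```

Because the first compound is chosen at random, repeated runs can return different subsets unless the seed is fixed.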
Interaction data from STITCH and stereochemistry
As we are using a 2D fingerprint, stereochemistry is not taken into account during the generation of target models or during similarity calculations. Therefore, interaction data from STITCH for sets of stereoisomers are merged into new records, identified by the concatenation of the individual identifiers of all contributing compounds.
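The merging step can be sketched as a grouping by a stereochemistry-free key. In the sketch below, the function name, the key (for example a SMILES string written without stereo information), the separator used for concatenation, the sorting of identifiers and the example ids are all assumptions; the text does not specify these details.

```python
from collections import defaultdict


def merge_stereoisomers(records, separator="_"):
    """Merge records that share the same stereo-free structure key.

    records: iterable of (compound_id, flat_key, targets) triples.
    Returns a dict mapping the concatenated compound ids to the
    union of the targets of all contributing compounds.
    """
    ids_by_key = defaultdict(list)
    targets_by_key = defaultdict(set)
    for compound_id, flat_key, targets in records:
        ids_by_key[flat_key].append(compound_id)
        targets_by_key[flat_key].update(targets)

    return {
        separator.join(sorted(ids)): targets_by_key[key]
        for key, ids in ids_by_key.items()
    }


# Hypothetical ids and targets; the first two records are stereoisomers
# that share the same stereo-free SMILES.
records = [
    ("ID1", "CC(N)C(=O)O", {"T1"}),
    ("ID2", "CC(N)C(=O)O", {"T2"}),
    ("ID3", "CCO", {"T3"}),
]
print(merge_stereoisomers(records))
```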
To preserve the privacy of user data, uploaded data and results can only be accessed by the user who submitted them (from the same IP address as used for query submission).
In addition, all data will be deleted automatically after seven days.
The processing time for hit identification depends on the size of the assay data. For bioassays with fewer than 5,000, 10,000 and 100,000 compounds, the web server returns the results in less than one, two and 30 minutes, respectively. The target prediction takes around 5-7 minutes for 10-100 compounds.
However, as calculations are carried out on a shared cluster environment, actual processing time depends on the cluster workload.
1. Malo,N., Hanley,J.A., Cerquozzi,S., Pelletier,J. and Nadon,R. (2006) Statistical practice in high-throughput screening data analysis. Nature Biotechnology, 24, pp. 167-175.
2. Schuffenhauer,A., Floersheim,P., Acklin,P. and Jacoby,E. (2003) Similarity metrics for ligands reflecting the similarity of the target proteins. Journal of Chemical Information and Computer Sciences, 43, pp. 391-405.
3. Nidhi, Glick,M., Davies,J.W. and Jenkins,J.L. (2006) Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. Journal of Chemical Information and Modeling, 46, pp. 1124-1133.
4. Willett,P., Barnard,J. and Downs,G. (1998) Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38, pp. 983-996.
5. Rogers,D. and Hahn,M. (2010) Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50, pp. 742-754.
6. Ashton,M., Barnard,J., Casset,F., Charlton,M., Downs,G., Gorse,D., Holliday,J.D., Lahana,R. and Willett,P. (2003) Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships, 21, pp. 598-604.