HitPick is a web server that facilitates the analysis of chemical screenings by identifying hits and predicting their molecular targets. The target prediction functionality can also be used in a stand-alone fashion.

For hit identification, the widely used B-score method (1) is applied.
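As an illustration of the idea behind the B-score, the following minimal sketch removes row and column effects from a single plate by two-way median polish and scales the residuals by their median absolute deviation (MAD). It is not HitPick's own code: the function names, the iteration limit, the 1.4826 scaling constant and the simulated 8 x 12 plate are assumptions made for this example.

```python
import numpy as np

def median_polish(plate, max_iter=10, tol=1e-6):
    """Return the residuals left after repeatedly removing row and column medians."""
    resid = np.asarray(plate, dtype=float).copy()
    for _ in range(max_iter):
        row_med = np.nanmedian(resid, axis=1, keepdims=True)
        resid -= row_med
        col_med = np.nanmedian(resid, axis=0, keepdims=True)
        resid -= col_med
        # Stop once neither sweep changes the residuals noticeably.
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return resid

def b_score(plate):
    """Residuals of the median polish divided by their scaled MAD."""
    resid = median_polish(plate)
    mad = np.nanmedian(np.abs(resid - np.nanmedian(resid)))
    if mad == 0:
        raise ValueError("MAD of the residuals is zero; B-score is undefined")
    # 1.4826 makes the MAD comparable to a standard deviation (assumed convention).
    return resid / (1.4826 * mad)

# Simulated plate (hypothetical data): additive row/column drift, noise, one strong hit.
rng = np.random.default_rng(0)
row_effect = np.arange(8)[:, None] * 0.5
col_effect = np.arange(12)[None, :] * 0.2
plate = 100 + row_effect + col_effect + rng.normal(0.0, 1.0, size=(8, 12))
plate[3, 5] -= 15.0  # artificial hit in row 3, column 5

scores = b_score(plate)
hit_position = np.unravel_index(np.argmin(scores), scores.shape)
```

In this simulated plate the well with the artificial signal drop receives the most negative score, even though the raw values also vary systematically across rows and columns.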

For target prediction, HitPick applies a new approach that combines two methods based on 2D molecular similarity: simple 1-Nearest-Neighbour (1NN) similarity searching (2) and a machine learning method based on Laplacian-modified naive Bayesian models (3).
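The two building blocks can be sketched on toy data as follows. This is a simplified illustration, not the HitPick implementation: fingerprints are represented as plain sets of on-bits, the Laplacian correction is written in the form commonly described for such models, and all function names, compounds and targets are hypothetical. How HitPick combines the two methods is not shown here.

```python
import math

def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def nearest_neighbour_targets(query, training):
    """1NN search: similarity and known targets of the most similar training compound."""
    best_fp, best_targets = max(training, key=lambda rec: tanimoto(query, rec[0]))
    return tanimoto(query, best_fp), best_targets

def laplacian_feature_weights(training, target):
    """Per-bit log weights of a Laplacian-corrected naive Bayes model for one target."""
    n_total = len(training)
    n_active = sum(1 for _, targets in training if target in targets)
    base_rate = n_active / n_total
    seen, active = {}, {}
    for fp, targets in training:
        for bit in fp:
            seen[bit] = seen.get(bit, 0) + 1
            if target in targets:
                active[bit] = active.get(bit, 0) + 1
    # Assumed form of the correction: log((A_bit + 1) / (N_bit * base_rate + 1)).
    return {
        bit: math.log((active.get(bit, 0) + 1) / (seen[bit] * base_rate + 1))
        for bit in seen
    }

def bayes_score(query, weights):
    """Sum of the weights of all query bits; bits unseen in training contribute 0."""
    return sum(weights.get(bit, 0.0) for bit in query)

# Toy training set (hypothetical): (fingerprint bits, known targets).
training = [
    ({1, 2, 3, 4}, {"T1"}),
    ({1, 2, 5}, {"T1"}),
    ({6, 7, 8}, {"T2"}),
    ({6, 7, 9, 10}, {"T2"}),
]
query = {1, 2, 3}

nn_similarity, nn_targets = nearest_neighbour_targets(query, training)
score_t1 = bayes_score(query, laplacian_feature_weights(training, "T1"))
score_t2 = bayes_score(query, laplacian_feature_weights(training, "T2"))
```

For this query, both methods point to the same target: the nearest neighbour is a T1 ligand, and the Bayesian score for T1 is positive while the score for T2 is negative.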





Training and validation sets for the target prediction part:

For the protein target prediction method we use 145,549 direct human chemical-protein interactions collected from STITCH 3.1, covering protein targets for 99,572 compounds. For each of the 1,375 targets, we randomly assigned 85% of its ligands to the training set and 15% to the validation set. In total, the validation set contains 22,868 positive and 20,779,507 negative compound-target relationships. When only the highest-scoring target prediction for each compound is evaluated, HitPick achieves a sensitivity of 60.94% (the maximum possible sensitivity being 66.16%), a specificity of 99.99% and a precision of 92.11%. This is an improvement over naive Bayesian models alone (sensitivity of 52.95%, specificity of 99.98%, precision of 80.03%) and over 1NN similarity searching alone (precision of 84.72%).
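For readers less familiar with these measures, the sketch below shows how sensitivity, specificity and precision follow from the counts of true and false positives and negatives. The function name and the counts in the example are hypothetical and are not taken from the HitPick validation.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and precision from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true relationships recovered
        "specificity": tn / (tn + fp),  # fraction of negatives correctly rejected
        "precision": tp / (tp + fp),    # fraction of predictions that are correct
    }

# Hypothetical counts: 100 true relationships among 10,100 candidate pairs.
metrics = classification_metrics(tp=60, fp=5, tn=9995, fn=40)
```

When negatives vastly outnumber positives, as in a compound-target matrix, specificity stays close to 100% even for a moderate number of false positives, so precision is the more informative figure for judging individual predictions.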

We also evaluated the performance of the HitPick target prediction method at different ranges of chemical similarity, measured by the Tanimoto coefficient (Tc) (4) between the query compound and its closest training compound, and independently for each of its up to five top-scoring known targets. To obtain robust precision estimates, we require a minimum of 30 compound-target predictions for each target rank in a given Tc interval (Table 1). We observed that precision increases with increasing Tc. For compounds with a Tc of 0.7 or higher to the training set, the first predicted target was nearly always correct. Furthermore, the precision reached at least 53% for a Tc in the range of 0.4 to 0.5 (Table 1). Thus, we chose 50% as the default precision threshold for the predicted targets on the web server.
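The binning procedure described above can be sketched as follows. The interval width of 0.1, the function names and the example predictions are assumptions made for illustration; only the minimum of 30 predictions per cell is taken from the text.

```python
from collections import defaultdict

MIN_PREDICTIONS = 30  # minimum number of predictions per cell, as stated in the text

def tc_bin(tc, bins_per_unit=10):
    """Index of the Tc interval (assumed width 0.1); Tc = 1.0 falls into the top bin."""
    # Rounding before truncation guards against floating-point artefacts at bin edges.
    return min(int(round(tc * bins_per_unit, 6)), bins_per_unit - 1)

def precision_table(predictions, min_count=MIN_PREDICTIONS):
    """Precision (%) per (Tc bin, target rank); None marks cells with too few predictions.

    predictions: iterable of (tc, rank, is_correct) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # [number correct, number total]
    for tc, rank, is_correct in predictions:
        cell = counts[(tc_bin(tc), rank)]
        cell[0] += bool(is_correct)
        cell[1] += 1
    return {
        key: (100.0 * correct / total if total >= min_count else None)
        for key, (correct, total) in counts.items()
    }

# Hypothetical predictions: 40 at Tc 0.45 for rank 1 (30 correct), 10 at Tc 0.75 for rank 2.
predictions = [(0.45, 1, i < 30) for i in range(40)]
predictions += [(0.75, 2, True) for _ in range(10)]
table = precision_table(predictions)
```

Cells that return None correspond to the "NA" entries: with fewer than 30 predictions, the precision estimate is considered too unreliable to report.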

Table 1. Precision (%) for the first five predicted targets in relation to the Tc similarity of a validation compound to the most similar molecule in the training set.

The precision for cells marked as "NA" could not be determined because of the low number of compound-target predictions (fewer than 30). Owing to the design of the widely and successfully used fingerprint scheme, a Tc of 1 does not necessarily mean that two molecules are identical. For compounds that are in the STITCH database, we report their known targets with a target prediction precision of 100%.



Input description:



Output examples:



Output description:



Interaction data from STITCH and stereochemistry

As we are using a 2D fingerprint, stereochemistry is not taken into account when generating target models or calculating similarities. Therefore, interaction data from STITCH for sets of stereoisomers are merged into new records, each identified by the concatenation of the individual identifiers of the contributing compounds.
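A minimal sketch of such a merge is given below. It assumes that each record already carries a stereochemistry-free structure key, so that stereoisomers share the same key. The record layout, the identifiers, the "_" separator and the sorted order of the concatenated identifiers are all hypothetical choices for this example, not the format used by HitPick.

```python
def merge_stereoisomers(records, separator="_"):
    """Merge records that share a stereo-free structure key.

    records: iterable of (compound_id, flat_key, targets) tuples.
    Returns a dict mapping the concatenated identifiers to the union of targets.
    """
    merged = {}
    for compound_id, flat_key, targets in records:
        ids, target_set = merged.setdefault(flat_key, ([], set()))
        ids.append(compound_id)
        target_set.update(targets)
    return {
        separator.join(sorted(ids)): target_set
        for ids, target_set in merged.values()
    }

# Hypothetical records: A1 and A2 are stereoisomers, B1 has no stereoisomer in the set.
records = [
    ("A1", "key_a", {"P1"}),
    ("A2", "key_a", {"P2"}),
    ("B1", "key_b", {"P1"}),
]
merged_records = merge_stereoisomers(records)
```

Here the two stereoisomers end up in a single record "A1_A2" that carries the targets of both, while the compound without stereoisomers keeps its original identifier.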



Privacy

To preserve the privacy of user data, uploaded data and results can be accessed only by the user who submitted them, and only from the IP address used for the query submission.

In addition, all data will be deleted automatically after seven days.



Processing time:

The processing time for hit identification depends on the size of the assay data. For bioassays with fewer than 5,000, 10,000 and 100,000 compounds, the web server returns the results in less than 1, 2 and 30 minutes, respectively. Target prediction takes around 5-7 minutes for 10-100 compounds.

However, as calculations are carried out in a shared cluster environment, the actual processing time depends on the cluster workload.



References:

1. Malo,N., Hanley,J.A., Cerquozzi,S., Pelletier,J. and Nadon,R. (2006) Statistical practice in high-throughput screening data analysis. Nature Biotechnology, 24, pp. 167-175.

2. Schuffenhauer,A., Floersheim,P., Acklin,P. and Jacoby,E. (2003) Similarity metrics for ligands reflecting the similarity of the target proteins. Journal of Chemical Information and Computer Sciences, 43, pp. 391-405.

3. Nidhi, Glick,M., Davies,J.W. and Jenkins,J.L. (2006) Prediction of biological targets for compounds using multiple-category Bayesian models trained on chemogenomics databases. Journal of Chemical Information and Modeling, 46, pp. 1124-1133.

4. Willett,P., Barnard,J. and Downs,G. (1998) Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38, pp. 983-996.

5. Rogers,D. and Hahn,M. (2010) Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling, 50, pp. 742-754.

6. Ashton,M., Barnard,J., Casset,F., Charlton,M., Downs,G., Gorse,D., Holliday,J.D., Lahana,R. and Willett,P. (2003) Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships, 21, pp. 598-604.