The NIH DTP program provides a set of 60 cancer cell lines, against which approximately 42,000 compounds have been screened. The screening data is available here.
This page is a client of a web service (WSDL, Usage) that returns activity predictions for all 60 cell lines for a user-specified set of compounds. Note that the data available from the NIH is real-valued and is available for three concentration parameters. The models were developed using log GI50, with a cutoff of 5.0 used to separate the active and inactive classes.
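As a rough illustration of how a real-valued measurement becomes a class label, the sketch below applies the 5.0 cutoff. The direction of the comparison is an assumption, not something stated on this page: it treats the value as being on a scale where larger means more potent, and calls a compound active when it is at or above the cutoff. The function name `label_activity` is hypothetical.

```python
def label_activity(log_gi50_value, cutoff=5.0):
    """Turn a real-valued GI50 measurement into a class label.

    ASSUMPTION: the value is on a scale where larger means more potent,
    and values at or above the cutoff are labelled active. The page
    gives the cutoff (5.0) but not the direction of the comparison.
    """
    return "active" if log_gi50_value >= cutoff else "inactive"


if __name__ == "__main__":
    # Made-up values, for illustration only.
    for value in (4.0, 5.0, 6.2):
        print(value, label_activity(value))
```

If the scale runs the other way for the data actually used, the comparison would need to be reversed.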
Predictions are obtained using a set of random forest models, one for each cell line, using 166-bit MACCS keys as the features. However, due to the nature of the problem, the active-to-inactive class ratio for a given cell line is imbalanced. As a result, the models were developed by biasing classification accuracy towards the actives (i.e., false actives are preferred to false inactives).
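One common way to bias a random forest towards the minority class is to lower the fraction of tree votes needed to call a compound active. The page does not say which mechanism was actually used, so the sketch below is only an illustration of the idea; the function name `classify_by_votes` and the threshold value of 0.3 are hypothetical.

```python
def classify_by_votes(tree_votes, active_threshold=0.3):
    """Combine per-tree votes into one label, favouring the active class.

    tree_votes: list of "active"/"inactive" strings, one per tree.
    active_threshold: fraction of "active" votes needed to predict active.
    A value below 0.5 trades false inactives for false actives.
    ASSUMPTION: this vote-threshold scheme and the 0.3 default are
    illustrative and are not taken from the models behind this page.
    """
    if not tree_votes:
        raise ValueError("tree_votes must not be empty")
    active_fraction = sum(1 for v in tree_votes if v == "active") / len(tree_votes)
    label = "active" if active_fraction >= active_threshold else "inactive"
    return label, active_fraction


if __name__ == "__main__":
    votes = ["active"] * 4 + ["inactive"] * 6
    print(classify_by_votes(votes))                        # biased threshold
    print(classify_by_votes(votes, active_threshold=0.5))  # plain majority vote
```

With four of ten trees voting active, the biased threshold reports the compound as active, while a plain majority vote would report it as inactive.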
NOTE: Due to current hardware limitations, predictions are made for the first 40 cell lines only.
Paste SMILES, one to a line
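For anyone preparing input programmatically, the one-per-line format can be produced or checked with a few lines of code. This is a minimal sketch of splitting pasted text into individual SMILES strings; the function name `parse_smiles_input` is hypothetical, and the sketch does not validate the SMILES or call the web service.

```python
def parse_smiles_input(text):
    """Split pasted text into a list of SMILES strings, one per line.

    Blank lines are skipped and surrounding whitespace is removed.
    No chemical validation is performed.
    """
    smiles = []
    for line in text.splitlines():
        entry = line.strip()
        if entry:
            smiles.append(entry)
    return smiles


if __name__ == "__main__":
    pasted = "CCO\n\n  c1ccccc1 \nCC(=O)O\n"
    print(parse_smiles_input(pasted))
```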

