CICC Quarterly Report: Rajarshi Guha

Database: The local PubChem mirror is up and running and holds a subset of the information in PubChem. 2D structure coordinates and some calculated properties are excluded. The DB includes synonyms and Compound-Substance associations, but bio-assay data is not included yet. gNova fingerprints are included for the compounds, so the database is now searchable with SMARTS. Related to this is the inclusion (in a separate table) of 'derived' properties, which are essentially contributed data. Currently this table holds FlogP and MR values contributed by Adam Lee from UMich. It will later include properties from Kevin's calculations and possibly 3D structure coordinates (but this is linked to the work on the QM DB, so it's not finalized yet).
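
To make the separate 'derived' properties table concrete, here is a minimal sketch of one possible layout. It is an illustration only, not the schema of the actual mirror: the table and column names (compound, synonym, derived_property), the contributor label, and the property values are all hypothetical placeholders, and an in-memory SQLite database stands in for the real DB.

```python
import sqlite3

# In-memory SQLite stands in for the real mirror (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical layout: core compound data, synonyms, and a separate
# table for contributed ('derived') properties keyed by compound ID.
conn.executescript("""
CREATE TABLE compound (
    cid    INTEGER PRIMARY KEY,
    smiles TEXT NOT NULL
);
CREATE TABLE synonym (
    cid  INTEGER REFERENCES compound(cid),
    name TEXT NOT NULL
);
CREATE TABLE derived_property (
    cid         INTEGER REFERENCES compound(cid),
    name        TEXT NOT NULL,
    value       REAL NOT NULL,
    contributor TEXT NOT NULL,
    PRIMARY KEY (cid, name, contributor)
);
""")

conn.execute("INSERT INTO compound VALUES (702, 'CCO')")
conn.execute("INSERT INTO compound VALUES (241, 'c1ccccc1')")
conn.execute("INSERT INTO synonym VALUES (702, 'ethanol')")

# Placeholder values, not real measurements or calculations.
conn.executemany(
    "INSERT INTO derived_property VALUES (?, ?, ?, ?)",
    [
        (702, "logp", 0.0, "contributor_a"),
        (702, "mr", 1.0, "contributor_a"),
    ],
)


def derived_properties(conn, cid):
    """Return (name, value, contributor) rows for one compound, sorted by name."""
    return conn.execute(
        "SELECT name, value, contributor FROM derived_property "
        "WHERE cid = ? ORDER BY name",
        (cid,),
    ).fetchall()


print(derived_properties(conn, 702))
```

Keeping contributed values in their own table, with the contributor recorded on each row, means new property sets can be added later without touching the core compound table.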

Web Services: The R WS infrastructure has been developed, and we are now adding specific model types as WSs. Right now we have: OLS, CNN, RF regression, k-means clustering, and XY & histogram plots. The service is up and running (WSDL from my office machine and Rserve on gf8) and will probably need to be migrated to a more permanent home. I have also been collecting WSDL links from other WS providers: currently we have links from U Cologne (Christoph Steinbeck) and VCC Lab (Igor Tetko), and we are waiting for Mark Nicklaus to send us his links. The CDK web services are being reworked to be in sync with the latest CDK; I am also updating the current offerings to include the extra ones that I had made but that weren't put up on gf8.

Tool/SW development: I am implementing more ADAPT descriptors in the CDK and merging in some descriptors provided by Todd Martin from the EPA. I am also updating the rcdk package for R, which allows use of the CDK from within R. I implemented the rpubchem package for R, which allows us to directly access compound information (SMILES and some calculated properties) and bio-assay datasets (by assay number or keyword search) from within R. An article on these two packages has been submitted to J. Stat. Soft.

Cheminformatics/Modeling: The work on ensemble descriptor selection was presented as a talk at the ACS and is being written up for submission to JCIM. The RNN-cluster work was presented as a poster at the ACS; it is in the last stages (running it on some more datasets) and will be written up at the end of Oct or in early Nov. I've also started working on the tox data from Scripps. Initial work has been the development of an RF model to predict species-specific toxicity: performance is very good within species but not very impressive across species. Further work will include analysing structural fragments, building more reliable models, ensemble models, etc.
