Progress Report to the NIH on
Grant P20 HG003894-01
Chemical Informatics Cyberinfrastructure
Geoffrey C. Fox, Principal Investigator
June 1, 2006
Progress Report Summary (Expanded)

The Chemical Informatics and Cyberinfrastructure Collaboratory

MISSION: The mission of the Chemical Informatics and Cyberinfrastructure Collaboratory is to expand the knowledge base of cheminformatics and to champion the use of cheminformatics tools and techniques in life sciences and chemistry research and teaching. We shall accomplish this mission through a comprehensive public web service infrastructure, innovative applications that use these services, full integration with NIH Molecular Libraries Initiative data and PubChem, and an effective cheminformatics education program. Our long-term goal is to build on major recent advances in cheminformatics and grid computing and to introduce these techniques more broadly to the academic world, the pharmaceutical industry, and other segments of the chemical industry.

A. SPECIFIC AIMS

The aims of the Chemical Informatics Cyberinfrastructure project are essentially unchanged from those submitted in March 2005. They are listed below in the "Studies and Results" section, with the modifications made in September 2005 as a result of negotiations with the NIH. Those negotiations led us to defer some of the educational activities to the potential full center phase. Graduate student support for the education projects was redirected to Dr. Mu-Hyun Baik to be applied to the chemistry subprojects under his direction. Likewise, the percentage effort of Dr. David Wild was reduced from 30% to 22%, and the freed funds were used to increase the percentage of Dr. Baik's time on the project from 10% to 20%. These changes were made in response to reviewers' concerns that Dr. Baik did not have enough support for all of the chemistry subprojects.

B. STUDIES AND RESULTS

1. Adapt and further develop Web Services and other grid technology to chemistry.

  a. Design a Grid-based distributed data architecture compatible with PubChem and other databases and capable of good performance with high volume data. (Geoffrey Fox and Marlon Pierce)

We are building Web/Grid services for connecting chemical data sources, applications (simulation, data mining, data assimilation, imaging, etc.), computing resources, and information services. Web services have been developed for Digital Chemistry's (formerly BCI's) clustering service methods and for OpenEye's OMEGA, FRED, and FILTER programs, as well as for several of the tools from Peter Murray-Rust's World-Wide Molecular Matrix. Examples of the last are:

The OSCAR3 chemical text mining tool, under development by Peter Murray-Rust, is being fitted with a SOAP input/output engine so that it can run as a web service. Relevant tools from the open-source Chemistry Development Kit (CDK), a Java library for structural chemo- and bioinformatics, are also being adapted to Web services (e.g., Draw2D), as is VOTables, a general-purpose service for manipulating tabular data. VOTables will be used as an intermediary for data exchange between databases. Taverna, an open-source product widely used in the life sciences, is being evaluated for workflow. Taverna comes with third-party tools for parsing, manipulating, and displaying data, and it includes import tools. Grid tools (Globus and Condor) for interacting with TeraGrid are also being investigated for their potential use in chemistry. Community Grids Laboratory personnel are building standards-based Web portal environments, an activity that will begin in earnest over the summer. Another project involves the ToxTree Service, an open-source Java application that estimates toxic hazards by applying a decision tree approach. This is being converted from a GUI application to a text-based web service.
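As a rough illustration of what a SOAP input/output layer for a text-based chemistry service involves, the sketch below builds and parses a SOAP 1.1 request envelope using only the Python standard library. The service namespace `urn:example:toxtree` and the operation and element names (`estimateHazard`, `smiles`) are hypothetical placeholders; the report does not describe the actual message formats of the OSCAR3 or ToxTree services.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:toxtree"  # hypothetical service namespace (assumption)


def build_request(smiles):
    """Wrap a SMILES string in a SOAP 1.1 envelope for a hypothetical operation."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}estimateHazard")  # hypothetical name
    ET.SubElement(call, f"{{{SERVICE_NS}}}smiles").text = smiles
    return ET.tostring(envelope, encoding="unicode")


def read_smiles(xml_text):
    """Extract the SMILES payload from a request envelope, or None if absent."""
    root = ET.fromstring(xml_text)
    path = (
        f"{{{SOAP_NS}}}Body/"
        f"{{{SERVICE_NS}}}estimateHazard/"
        f"{{{SERVICE_NS}}}smiles"
    )
    node = root.find(path)
    return node.text if node is not None else None


request_xml = build_request("c1ccccc1O")  # phenol, used only as sample input
print(request_xml)
print(read_smiles(request_xml))
```

A real deployment would publish the operation in a WSDL description and send the envelope over HTTP; both steps are omitted here.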

  b. Develop tools for HTS data analysis and virtual screening, integration of data with both parallel and distributed simulation engines, metadata and annotation generation, navigation, and visualization. (David Wild and others)

1.) Smart mining of drug discovery information

Our aim is to employ workflows of web services that expose chemoinformatics tools and databases to accomplish tasks that are highly relevant to drug discovery scientists but too complex to carry out easily with current tools. Based on the web service infrastructure we have developed, we are creating Taverna workflows directed to this purpose. The workflow highlighted below finds structures in a local PostgreSQL NIH DTP Tumor Cell Line database which are structurally similar to the ligand of a PDB protein known to be involved in cancer (in this case, HSP90). The similar structures are filtered for druggability, converted to 3D conformers, and then docked into the protein. The docking results are visualized using Jmol, and the docking scores are correlated with the tumor inhibition results in the DTP database. Any correlations found may suggest a hypothesis for the mechanism of action of the NIH compounds.


We are developing this workflow further (including automated mining of the PDB for proteins docked to similar compounds) as well as developing several other workflows. Our next step is to embed these workflows in execution environments for activation by smart clients (including therapeutic area portals and straightforward email interaction schemes for scientists).
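The filtering and correlation stages of the workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not say which druggability filter is applied, so a rule-of-five style check is used as a stand-in, and the compound identifiers, property values, docking scores, and inhibition values are invented sample data rather than DTP results.

```python
import math


def rule_of_five_pass(mol_weight, logp, h_donors, h_acceptors):
    """Rule-of-five style filter (assumed stand-in): allow at most one violation."""
    violations = sum([
        mol_weight > 500,
        logp > 5,
        h_donors > 5,
        h_acceptors > 10,
    ])
    return violations <= 1


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical records:
# (id, mol. weight, logP, H-bond donors, H-bond acceptors, docking score, inhibition)
compounds = [
    ("cmpd-1", 320.4, 2.1, 2, 5, -9.2, 6.8),
    ("cmpd-2", 410.5, 3.4, 1, 6, -8.1, 6.1),
    ("cmpd-3", 712.9, 6.3, 6, 12, -10.5, 5.0),  # violates all four criteria
    ("cmpd-4", 289.3, 1.7, 3, 4, -6.4, 5.2),
    ("cmpd-5", 365.8, 2.9, 2, 7, -7.3, 5.6),
]

kept = [c for c in compounds if rule_of_five_pass(*c[1:5])]
docking_scores = [c[5] for c in kept]
inhibition = [c[6] for c in kept]
r = pearson(docking_scores, inhibition)
print(f"{len(kept)} compounds kept, Pearson r = {r:.3f}")
```

With docking scores where more negative means better binding, a strongly negative r in this setup would mean that better-docking compounds tend to show stronger inhibition. In the workflow itself, each stage is a separate web service rather than a local function.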

2.) Data mining of the DTP tumor cell line dataset

We are collaborating with Melanie Wu, Database & Data Mining expert at the School of Informatics, on advanced data mining of the DTP dataset using our web service and workflow technologies, building on existing published data mining research on this dataset. We believe these methods will be transferable to the analysis of screening center results in PubChem. Current projects include:

Five of the key personnel on the projects paid a visit to the NIH Developmental Therapeutics Program on April 26, 2006. The meeting was very helpful in developing mutual understandings of the interests and capabilities of the two organizations.

3.) Fast clustering of PubChem using Divisive K-means & Linux clusters

Divisive K-means is a very recent hierarchical divisive clustering method that has been shown to be as accurate as Ward's (the leading algorithm in chemoinformatics), but much faster. We have applied an MPI parallel-enabled version on our AVIDD Linux clusters, and are able to cluster the entirety of PubChem Compound (5,273,852 structures at the time of the test) in around 6 hours using 40 processing units. We are investigating ways of effectively utilizing this organization of PubChem for navigating and analyzing the database. The algorithm can also cluster on the basis of numeric data (e.g. screening results).
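To make the idea of divisive clustering concrete, here is a minimal serial sketch in pure Python: all points start in one cluster, and the cluster with the largest within-cluster squared error is repeatedly split in two with 2-means. Several choices are assumptions made for illustration and are not taken from the report: the rule for choosing which cluster to split, the deterministic seeding, the Euclidean distance on small numeric vectors (chemical fingerprints would normally need a different representation), and the sample points. The MPI parallelization used on the AVIDD clusters is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def sse(points):
    """Sum of squared distances of points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, max_iter=50):
    """Split points into two groups with 2-means; returns None if no split is possible."""
    c = centroid(points)
    seed_a = max(points, key=lambda p: sq_dist(p, c))       # farthest from centroid
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))  # farthest from seed_a
    centers = [seed_a, seed_b]
    groups = None
    for _ in range(max_iter):
        new_groups = ([], [])
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_groups[nearest].append(p)
        if not new_groups[0] or not new_groups[1]:
            break  # degenerate assignment; keep the last valid split, if any
        groups = new_groups
        new_centers = [centroid(groups[0]), centroid(groups[1])]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return groups


def divisive_kmeans(points, k):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1 and sse(c) > 0]
        if not splittable:
            break
        target = max(splittable, key=sse)
        split = two_means(target)
        if split is None:
            break
        clusters = [c for c in clusters if c is not target]
        clusters.extend(split)
    return clusters


# Hypothetical 2D sample data with three visually separate groups.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
    (9.0, 1.0), (9.2, 0.9), (8.9, 1.2),
]
clusters = divisive_kmeans(points, k=3)
for cluster in clusters:
    print(sorted(cluster))
```

Because each split only touches one cluster, the sequence of splits also yields a hierarchy, which is what makes the result usable for navigating a large collection.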

4.) Distributed Drug Discovery for neglected diseases

Distributed Drug Discovery is a project run by Dr. Bill Scott at IUPUI aimed at tackling neglected diseases using distributed chemistry, while educating undergraduates about combinatorial chemistry. Each student makes 4 compounds on inexpensive equipment, and each class typically makes around 60 compounds. Several universities around the world are participating. We have created a web service-enabled PostgreSQL database that stores the reaction transformations and both the virtual and the made (synthesized) compounds. This information can then be drawn into our workflows. For example, searches for similar compounds can be run against PubChem, the Tumor Cell Line database, etc., both to suggest compounds for follow-up in this project and to suggest reaction mechanisms for compounds in other datasets.
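One plausible shape for such a database is sketched below. The report does not give the actual schema, so the table and column names are hypothetical, the reaction transform string and SMILES values are illustrative only, and SQLite (from the Python standard library) stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.executescript("""
CREATE TABLE reaction (
    reaction_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    transform   TEXT NOT NULL       -- reaction transformation string
);
CREATE TABLE compound (
    compound_id INTEGER PRIMARY KEY,
    smiles      TEXT NOT NULL UNIQUE,
    reaction_id INTEGER NOT NULL REFERENCES reaction(reaction_id),
    status      TEXT NOT NULL CHECK (status IN ('virtual', 'made')),
    made_by     TEXT                -- class or site; NULL while still virtual
);
""")

# Hypothetical sample rows.
conn.execute(
    "INSERT INTO reaction (reaction_id, name, transform) VALUES (?, ?, ?)",
    (1, "amide coupling", "[C:1](=[O:2])[OH].[N:3]>>[C:1](=[O:2])[N:3]"),
)
conn.executemany(
    "INSERT INTO compound (smiles, reaction_id, status, made_by) VALUES (?, ?, ?, ?)",
    [
        ("CC(N)=O", 1, "made", "class-A"),
        ("CC(=O)NC", 1, "made", "class-A"),
        ("CC(=O)NCC", 1, "virtual", None),
    ],
)


def compounds_by_status(connection, status):
    """Return the SMILES of all compounds with the given status, sorted."""
    rows = connection.execute(
        "SELECT smiles FROM compound WHERE status = ? ORDER BY smiles", (status,)
    )
    return [row[0] for row in rows]


print("made:", compounds_by_status(conn, "made"))
print("virtual:", compounds_by_status(conn, "virtual"))
```

Keeping virtual and made compounds in one table with a status column lets a workflow query both sets the same way, for example to pick virtual compounds for follow-up.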

5.) Visualization and end-user layer tools

We are pursuing a number of interface-level tools for analyzing HTS data and for integration with our web service environment, including a similarity-matrix approach for visualizing very large volumes of chemical information, .NET tools for querying PubChem data, tools for automatically visualizing QSAR relationships, and portlet environments tailored to particular therapeutic areas. Below is illustrated one such tool, PubChemSR, a .NET application for querying PubChem and exporting results to Excel, etc. It is available at: http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR


6.) Collaboration with Peter Murray-Rust group

We have initiated a collaboration with Peter Murray-Rust's group in Cambridge, UK to integrate their OSCAR tool into our workflows. OSCAR allows automatic extraction of chemical names from paper bodies and abstracts, and the conversion of these to machine-readable 2D chemical structures. We are building a workflow that will allow us to apply a measure of similarity between papers based on the structural similarity of compounds referenced in the papers, rather than the text of the papers. We will then investigate the effectiveness of this new method.
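One way such a structure-based measure of paper similarity could be defined is sketched below. The report does not specify the measure, so this is an assumption: each compound is represented as a set of fingerprint features, compounds are compared with the Tanimoto coefficient, and two papers are scored by the symmetric average of each compound's best match in the other paper. The fingerprint values are invented; in the workflow they would be derived from the structures that OSCAR extracts.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def paper_similarity(paper_a, paper_b):
    """Symmetric mean of best-match Tanimoto scores between two papers' compounds."""
    if not paper_a or not paper_b:
        return 0.0

    def directed(src, dst):
        # For each compound in src, take its most similar compound in dst.
        return sum(max(tanimoto(x, y) for y in dst) for x in src) / len(src)

    return (directed(paper_a, paper_b) + directed(paper_b, paper_a)) / 2


# Hypothetical fingerprints: each compound is a set of "on" bit positions.
paper_1 = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5})]
paper_2 = [frozenset({1, 2, 3, 4}), frozenset({7, 8})]
paper_3 = [frozenset({10, 11})]

print(paper_similarity(paper_1, paper_2))
print(paper_similarity(paper_1, paper_3))
```

Here paper_1 and paper_2 share one identical compound and so score well above paper_1 and paper_3, which have no features in common, regardless of how similar their text might be.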

7.) Collaboration with University of Michigan MACE center

We are in the early stages of collaborating with the University of Michigan MACE ECCR center. Areas of collaboration will likely include education (including the sharing of chemoinformatics courses), and integration of MACE tools with our grid environment.

  c. Design a novel quantum mechanical simulation database that will complement the experimental libraries.

A new database infrastructure for Quantum Mechanical reaction simulations based on the MySQL platform was developed. Data from our legacy prototype database is currently being incorporated; filters, queries and a Web interface are being developed.

2. Apply the computer techniques to real chemical research problems

  a. Novel routes to the discovery of enzymatic reaction mechanisms
  b. Mechanism-based drug design
  c. Data-inquiry-based development of new methods in natural product synthesis.

To expedite research progress on the application projects, we continue to use Varuna, the legacy prototype version of our database, which is not scalable but is rich in functionality. During the funding period covered by this progress report, we have published 4 papers, submitted 2, and have 3 in the final stages of preparation. Substantial progress was made in using the small molecule reaction database and data mining to recognize chemical reactivity patterns, and these results have been reported.

3. Enhance chemoinformatics education

  a. Establish a seminar in chemical informatics (Gary Wiggins and others)

Building on ideas generated at the October 28, 2005 meeting of our external advisory board, we created a seminar on the topic of "Molecular Informatics, the Data Grid, and an Introduction to eScience." The seminar was taught to six students this past semester, and the syllabus can be found at: http://www.indiana.edu/~cheminfo/I533/533home.html

  b. Develop training modules for chemoinformatics instruction on the Web in partnership with others (Gary Wiggins and David Wild)

We are working with Mesa Analytics to help develop the Cheminformatics Virtual Classroom. See: http://www.chemvc.com:8100/ Drs. Wild and Wiggins will join Dr. Norah MacCuish in delivering a cheminformatics workshop at the Biennial Conference on Chemical Education on July 31, 2006.

  c. Develop a web guide for essential chemoinformatics resources (Gary Wiggins)

The web guide links to resources such as academic institutions that offer cheminformatics programs or courses, chemistry databases on the web, professional societies, and many other items relevant to cheminformatics. See: http://www.chembiogrid.org/resources/resources.html

  d. Explore mechanisms for introducing graduate students to chemoinformatics research. (All participants)

A number of MS and PhD graduate students, both at IUB and at IUPUI in Indianapolis, are now involved in our research efforts. With the incoming graduate students next fall, a total of 4 PhD students and 2 MS students will have chemoinformatics as their major area of study. We have recently developed a graduate chemical informatics certificate program, with four courses to be taught by distance education (http://www.informatics.indiana.edu/academics/chem_certificate.asp).

C. SIGNIFICANCE

The techniques being developed to allow web service access to chemistry databases will have far-reaching significance in their applications. For example, the correlations that are found through data mining of the NIH Developmental Therapeutics Database and PubChem should help in formulating a hypothesis of the mechanism of action for the NIH compounds. To our knowledge, there is no public quantum simulation database that would allow the reuse of QM calculations. Ultimately, our efforts will allow individual scientists to federate locally produced databases with other databases throughout the world, integrating data at the local site that is appropriate to their own research endeavors. In the area of education, we continue to offer the nation's only graduate degree program in chemoinformatics. The new PhD in Informatics program admitted the first student on the chemoinformatics track last fall, and 3 others are scheduled to begin the program in August 2006.

D. PLANS

Our next step is to embed the workflows in execution environments for activation by smart clients (including therapeutic area portals and straightforward email interaction schemes for scientists).

Future projects and plans include:

E. PUBLICATIONS

The following relevant publications by key personnel have appeared from late 2005 to date:

1. Bailey, Brad C.; Fan, Hongjun; Baum, Erich W.; Huffman, John C.; Baik, Mu-Hyun; Mindiola, Daniel J. "Intermolecular C-H Bond Activation Promoted by a Titanium Alkylidyne." Journal of the American Chemical Society 2005, 127, 16016-16017. http://dx.doi.org/10.1021/ja0556934

2. Bailey, Brad C.; Fan, Hongjun; Huffman, John C.; Baik, Mu-Hyun; Mindiola, Daniel J. "Room Temperature Ring-Opening Metathesis of Pyridines by a Transient Ti≡C Linkage." Journal of the American Chemical Society 2006, 128(21), 6798-6799. http://dx.doi.org/10.1021/ja061590p

3. Fout, Alison R.; Basuli, Falguni; Fan, Hongjun; Tomaszewski, John; Huffman, John C.; Baik, Mu-Hyun; Mindiola, Daniel J. "A Co2N2 Diamond-Core Resting State of Cobalt(I): A Three-Coordinate CoI Synthon Invoking an Unusual Pincer-Type Rearrangement." Angewandte Chemie International Edition 2006, 45(20), 3291-3295. http://dx.doi.org/10.1002/anie.200504343

4. Wild, David J.; Wiggins, Gary D. "Videoconferencing and Other Distance Education Techniques in Chemoinformatics Teaching and Research at Indiana University." Journal of Chemical Information and Modeling 2006, 46(2), 495-502. http://dx.doi.org/10.1021/ci050297q

5. Wild, David J.; Wiggins, Gary D. "Challenges for Chemoinformatics Education in Drug Discovery." Drug Discovery Today May 2006, 11(9-10), 436-439. http://dx.doi.org/10.1016/j.drudis.2006.03.010

6. Yang, Xiaofan; Baik, Mu-Hyun. "cis,cis-[(bpy)2RuVO]2O4+ Catalyzes Water Oxidation Formally via in Situ Generation of Radicaloid RuIV-O." Journal of the American Chemical Society 2006, ASAP Web Release Date: 23-May-2006. http://dx.doi.org/10.1021/ja053710j

F. PROJECT-GENERATED RESOURCES

Web Sites:

ChemBioGrid; Indiana University Chemical Informatics and Cyberinfrastructure Collaboratory

http://www.chembiogrid.org/

ChemBioGrid Wiki

http://www.chembiogrid.org/wiki/index.php/Main_Page

I533 Seminar in Chemical Informatics: Molecular Informatics, the Data Grid, and an Introduction to eScience

http://www.indiana.edu/~cheminfo/I533/533home.html

CICC Grid at SourceForge.net

http://sourceforge.net/projects/cicc-grid

Progress Report Summary (Expanded)

http://www.chembiogrid.org/news/events/Progress_Report_Summary_6_1_2006_full.htm