David Wild Update Oct 2006
From Chemical Informatics and Cyberinfrastructure Collaboratory
This is a report on developments from the workflow level through to the 'customer'
Collaborations
We should prioritize the collaborations for the next grant and also look at other people we could collaborate with. Here are updates on current collaborations, excluding the Cambridge collaboration and the collaboration with Kevin Gilbert, which are focused on infrastructure. Most of those below can be considered 'customers' of our center.
- Scripps Florida. Rajarshi visited and demonstrated that our services can be used in Pipeline Pilot. He is now working on methods to provide toxicity flags for the HTS set.
- We are working with Faming Zhang to apply our OSCAR3, PubChem searching, and docking workflow (on Big Red) to search for potential kinase inhibitors based on query kinase protein structures provided by Faming. The scientific quality of the results will be assessed with Faming.
- Michigan. Successful IU/MACE workshop. Rajarshi has implemented a local PubChem that Gordon Crippen can use for his experiments. Various property values generated at Michigan are also stored. I have 9 Michigan students enrolled in my I571 introductory chemoinformatics course.
- Lilly. A visit to Lilly covered a number of collaboration areas, including extending MOBIUS with our workflows (Mic Lajiness) and integrating workflows with an automated laboratory (Horst Hemmerle).
- Jack Bikker at Pfizer has been recruited as an (informal) advisor on HTS follow-up chemistry.
- I met with Erik Stolterman to discuss HCI possibilities. It seems that the kinds of issues we're dealing with (ill-defined customer groups, quickly changing needs, tension between real science and computational science, and a field in the midst of a revolution) match very well with expertise at IU. We propose funding an IU HCI Ph.D. student to focus on understanding customer groups for us.
- Jim Caruthers at Purdue. Jim is developing a system for harvesting compound information from chemists' desktop machines, employing eLab notebook software and a data warehouse. We have applied for IU funding to have a student work on integrating this with our cyberinfrastructure (basically making a web service out of their database).
- Other possible collaborations?: John Cleary (Waikato University, NZ); Sheffield; Mic Lajiness (Lilly); Jake Chen and Samy at IUPUI
Customer Group Definitions
- We have begun to define customer groups. We will solicit feedback at the review meeting on these definitions and on which groups we should be tackling (and with what priority).
Research into interaction tools and interfaces
- PubChemSR has been upgraded with a number of new features
- As a first job on Big Red, IU's new supercomputer, we demonstrated the feasibility of a workflow that searches the scientific literature for compounds that dock to a protein of interest.
- Mapping of information needs to workflows. We have established a feasible plan for mapping scientists' expressions of queries to workflows. The steps in this plan are:
  - Generate a network map of all the possible interactions of web services, to define the problem space
  - Generate the subgraph of the network that is realistic / sensible
  - Create RDF sentences for the relation of each service to others in the node
  - Parse queries (natural language, etc) into RDF sentences
  - Map RDF sentences in queries to RDF sentences developed from the map
- Compare example-based vs deduction-based approaches (i.e. are workflows predefined or generated on the fly?)
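The mapping plan above can be illustrated with a small sketch. This is only a toy under stated assumptions, not our implementation: the service names, the data type labels, and the "consumes" / "produces" / "feeds" predicates are all hypothetical, plain tuples stand in for RDF sentences, and the query is assumed to have already been parsed into a (have, want) pair of data types, so the natural language parsing step is not shown.

```python
from collections import defaultdict

# Hypothetical service catalogue: name -> (input type, output type).
# None of these names or types come from the report; they are placeholders.
SERVICES = {
    "literature_search": ("protein_name", "abstracts"),
    "name_extraction": ("abstracts", "compound_names"),
    "structure_lookup": ("compound_names", "structures_2d"),
    "convert_3d": ("structures_2d", "structures_3d"),
    "docking": ("structures_3d", "docking_scores"),
}


def build_network(services):
    """Link service A to service B when A's output type is B's input type."""
    edges = defaultdict(set)
    for a, (_, out_a) in services.items():
        for b, (in_b, _) in services.items():
            if a != b and out_a == in_b:
                edges[a].add(b)
    return edges


def to_triples(services, edges):
    """Express the network as (subject, predicate, object) sentences."""
    triples = set()
    for name, (inp, out) in services.items():
        triples.add((name, "consumes", inp))
        triples.add((name, "produces", out))
    for a, targets in edges.items():
        for b in targets:
            triples.add((a, "feeds", b))
    return triples


def find_workflow(services, edges, have, want):
    """Breadth-first search for a service chain from type `have` to type `want`."""
    start = sorted(n for n, (inp, _) in services.items() if inp == have)
    queue = [[n] for n in start]
    seen = set(start)
    while queue:
        path = queue.pop(0)
        if services[path[-1]][1] == want:
            return path
        for nxt in sorted(edges[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of services connects the two types
```

On this toy catalogue, `find_workflow(SERVICES, build_network(SERVICES), "protein_name", "docking_scores")` walks the chain from the literature search through to docking. A search like this corresponds to generating workflows on the fly; a predefined approach would instead look the query up in a table of stored workflows.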
Data mining research
- There are now 2 Ph.D. students (Huijun Wang and Jon Klinginsmith) working on data mining of HTS / screening / genomic information
- We now have all the compound, screening result, and gene expression data for the DTP Tumor Cell Line set in our local PostgreSQL database, along with some other information such as MACCS keys for the compounds. The database is structure- and similarity-searchable using the gNova cartridge. So far we've been focusing on incorporating the database into our workflows and on characterizing the compounds and data, both as a precursor to some data mining experiments and to get a feel for whether it would be a good public test set for evaluating data mining methods.
- We've set up a workflow that performs a similarity search on the database, then converts the most similar compounds to 3D and docks them into a protein of interest. We're using this in the context of providing a known ligand to a tumor-related protein (from the PDB) as the 2D similarity search target, then looking at the best docked structures from the similarity search to see what cell lines they were active in. If we find well docked structures that are active in cell lines for which the protein is known or suspected to have a part in cell growth, then we might consider that binding to the protein could be responsible for the activity. We're currently trying this with a number of kinase proteins, with the help of a chemist here who is doing kinase research (see Faming Zhang collaboration above)
- We did some diversity analysis on the Tumor Cell Line set compared to PubChem subsets and the FDA MRTD set of 1,500 prescription drugs, based on property profiles (logP, PSA, HBA, etc). We found that the ~45,000 compounds in the TCL set that have activity data seem extremely drug-like, more so than the PubChem subsets. The distribution of properties almost exactly matches the MRTD profiles. We also did some structural similarity calculations and found that there was a high degree of MACCS key similarity between compounds in the MRTD and those in the TCL database. I can forward details if you are interested.
- We're currently doing some statistical analysis on the screening and gene expression data. In particular we want to categorize the screening data (maybe into Active/Moderate/Inactive) so we can apply some Association Rule Mining techniques to try to find relationships between chemical structure, activity and gene expression.
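As a rough illustration of the categorization and rule mining idea in the last item, the sketch below bins a potency-like value into Active/Moderate/Inactive and computes support and confidence for a single candidate rule. Everything specific here is an assumption made for the example: the cutoffs, the convention that larger values mean more active, the fragment and gene expression labels, and the records themselves are invented, and a real experiment would use a full association rule mining algorithm rather than scoring one hand-picked rule.

```python
def categorize(value, active_cutoff, moderate_cutoff):
    """Bin a potency-like value, assuming larger values mean more active."""
    if value >= active_cutoff:
        return "Active"
    if value >= moderate_cutoff:
        return "Moderate"
    return "Inactive"


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions) if transactions else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Invented records: (structural fragment flags, potency value, expression flag).
records = [
    ({"frag_A"}, 7.2, "gene_X_high"),
    ({"frag_A"}, 6.8, "gene_X_high"),
    ({"frag_A", "frag_B"}, 4.1, "gene_X_low"),
    ({"frag_B"}, 5.5, "gene_X_low"),
]

# One "transaction" per compound: its fragments, activity class, and expression flag.
transactions = [
    frags | {"activity=" + categorize(value, 6.0, 5.0), expr}
    for frags, value, expr in records
]

support, confidence = rule_metrics(transactions, {"frag_A"}, {"activity=Active"})
```

Treating structure, activity class, and expression level as items in the same transaction is what lets a single rule relate all three. The choice of cutoffs directly controls which rules appear, which is one reason the statistical analysis of the screening data needs to come first.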
Publications and Presentations
- American Chemical Society Meeting in Chicago
  - David Wild, Advanced HTS data mining using web service workflows (Oral)
  - Rajarshi Guha, Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions (Oral); R-NN Curves: A Method for Diversity Analysis and Cluster Identification (Poster)
- A publication on the web service and workflow infrastructure will be submitted in the next month
- A publication on the data mining will be submitted by the year end
