Indiana Cambridge Collaboration

From Chemical Informatics and Cyberinfrastructure Collaboratory

People

  • Indiana: Xiao Dong, Geoffrey Fox, Kevin Gilbert, Rajarshi Guha, Marlon Pierce, David Wild, Gary Wiggins
  • IUPUI: Kelsey Forsythe, Malika Mahoui
  • Cambridge: Sam Adams, Peter Corbett, Nick Day, Peter Murray-Rust, Joe Townsend, Mark Hayes

Access Grid Meetings

Meeting Apr 5, 2006

Present: David, Gary, Peter MR, Sam, Joe, Mark, Nick

Points discussed:

  • As part of our commitment to each other, we should have regular AG meetings, commit to sharing code, and exchange people
  • Cambridge is investigating lighter weight protocols than SOAP/WSDL including using HTML/CGI interfaces. These can always be exposed as web services
  • We discussed some of the limitations of Taverna (iteration, forking, looping). Cambridge is committed to using Taverna but is also interested in other systems
  • We discussed ways of packaging workflows to be exposed as web services themselves. On the Taverna website there is apparently a tool to allow a SCUFL file to be executed as a Java function. There doesn't seem to be a SCUFL-to-BPEL converter at present
  • We decided that once Indiana has some reasonable Oncology-related tools, we should initiate contact with Ernest for him to scientifically alpha-test them
  • Regarding OSCAR, it was noted that OSCAR2 and OSCAR1 are pretty similar, and we really should be working with OSCAR3, which is available from the WWMM CVS site. Sam and Joe can help Indiana to get OSCAR3 working and usable here.
  • We mused about the fact that despite us not getting any credit for doing infrastructure, some level of work is needed to ensure software and services do not "decay" over time. We discussed the possibilities of unit testing services and workflows, including using the Java execution of SCUFL noted above to enable JUnit testing
  • We discussed ways of documenting and describing algorithms and workflows, as per the Blue Obelisk model. The idea of "workflow pseudocode" came up, as well as persistent storage in workflows.
  • David brought up the issue of efficiency of CML for very large files. It was noted that a minimal CML file could just tag the SMILES and ID of each structure. This would be more robust than just using a SMILES file.
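
A minimal sketch of the "ID plus SMILES" idea in the last point, written in Python with only the standard library. The CML namespace is the published one; the use of an `identifier` element with `convention="daylight:smiles"` and a `value` attribute, and the function name `minimal_cml`, are assumptions made for illustration rather than anything agreed in the meeting.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def minimal_cml(records):
    """Return a CML string holding only an id and a SMILES per structure.

    `records` is an iterable of (molecule_id, smiles) pairs. The element and
    attribute layout below is an illustrative assumption, not a fixed profile.
    """
    ET.register_namespace("", CML_NS)  # serialize CML as the default namespace
    root = ET.Element(f"{{{CML_NS}}}cml")
    for mol_id, smiles in records:
        mol = ET.SubElement(root, f"{{{CML_NS}}}molecule", {"id": mol_id})
        ET.SubElement(
            mol,
            f"{{{CML_NS}}}identifier",
            {"convention": "daylight:smiles", "value": smiles},
        )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(minimal_cml([("m1", "CCO"), ("m2", "c1ccccc1")]))
```

Because each SMILES travels inside a `molecule` element that carries its own ID, a downstream step can drop or reorder records without losing track of which structure is which, which is harder to guarantee with a bare SMILES file.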

Follow-up:

  • We will have another meeting next Monday 10th April at 9.30am EDT (2.30pm BST)
  • David will look into BPEL and SCUFL packaging
  • Maybe Marlon could look into JUnit testing of workflows and services
  • David will ensure Cambridge gets access to our database services (initially NIH database)
  • Somebody at Indiana should exchange mails with Sam to ensure that we don't duplicate effort with regard to OpenBabel
  • David will create a Wiki page for the interaction between Cambridge and Indiana. Here it is!

Meeting Apr 14, 2006

Present: David, Gary, Geoffrey, Kevin Gilbert, Bob Clark (Tripos), John Huffman, Peter Corbett

  • Versions of Oscar. Oscar 1 is the original data checker. Oscar 2 was an intermediate version. Oscar-3 adds the ability to associate chemical names with chemical structures (using a name-structure converter, Optsim), plus a database of chemical names (Handbag), which currently has a few thousand entries. We established that Indiana currently has only Oscar 1 (from the RSC website).
  • Oscar 3 will take an XML or plain text file (from a paper), and will output an XML file which includes InChIs, SMILES, CML, and URL-encoded CML. There is a Java applet available (that uses CDK) in which structures can be visualized. Much of the output is related to natural language processing and can be ignored.
  • More information about the natural language project that this is part of is on the Cambridge SciBorg page. This is not just chemical - Peter Corbett is the chemistry domain expert in this project.
  • David is keen to be able to take chemical structures out of papers using an Oscar web service so these can be used in a variety of workflows.
  • David & Peter identified an interesting research project: comparing the similarity of journal articles based on chemical name similarity and/or chemical structure similarity of the compounds identified in the papers. This might be something we can work on at IU or between IU and Cambridge.
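
One simple way to make the paper-similarity idea concrete is a Tanimoto (Jaccard) coefficient over the sets of compounds found in each paper. The Python sketch below assumes each paper has already been reduced to a set of canonical structure identifiers (for example InChIs from Oscar output). The function name and the toy data are hypothetical, and the measure only scores shared identical compounds, not structural similarity between different compounds.

```python
def tanimoto(compounds_a, compounds_b):
    """Tanimoto (Jaccard) coefficient between two collections of identifiers."""
    a, b = set(compounds_a), set(compounds_b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty papers share nothing
    return len(a & b) / len(a | b)


# Toy data: SMILES strings stand in for canonical identifiers.
papers = {
    "paper_A": {"CCO", "c1ccccc1", "CC(=O)O"},
    "paper_B": {"CCO", "CC(=O)O", "CN"},
    "paper_C": {"O=C=O"},
}

if __name__ == "__main__":
    names = sorted(papers)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = tanimoto(papers[first], papers[second])
            print(first, second, round(score, 2))
```

A structure-aware variant would replace the set overlap with fingerprint-based similarity between the compounds themselves, which needs a cheminformatics toolkit and is left out of this sketch.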

Follow-up:

  • Peter Corbett will email Geoffrey mid next week with information on how to access Oscar-3
  • We will use email to set the date of the next meeting

Meeting April 24, 2006

Agenda:

  • Updates on research at both sites
  • Visit to NIH on 26th
  • Abstracts for the fall ACS meeting
  • Abstract for CompLife 06

Present: David Wild, Kevin Gilbert, Gary Wiggins, Peter Murray-Rust, Peter Corbett, Nick Day, Jim Downing

  • Geoffrey has been sent OSCAR-3
  • David discussed the developments with the local IU DTP database and data mining it. The web service now allows similarity searching and will shortly allow generic SQL queries. A PDB ligand database is also available. This data is being drawn into a variety of workflows and is being used in data mining experiments.
  • Peter Murray-Rust's group is committed to continue working in a Taverna environment, but they are also taking a wider look at available frameworks and systems. However they are not planning major infrastructure input into Taverna in the near future. Taverna-2 is under development in the Taverna community, possibly with a new workflow language. This is unlikely to be available in the near future.
  • PMR group is looking to create and maintain a smaller number of high quality web services (including OSCAR, CML conversion and validation), as opposed to large numbers as a proof of concept. This is an area that Cambridge has a speciality in, and there is no corresponding US center working in this area.
  • Potential collaboration area: extract compounds from papers using web services (OSCAR). Take the compounds and search PubChem for chemical structure information. Use this to calculate similarity between papers. Then publish the annotation of each abstract as a web service. Include a set of generic multidimensional services that will do simple cluster analysis and ranking. For instance, cluster papers based on similarity of 2D structures. This would be a good application for the R-serv web service, and maybe a project for Rajarshi. This could be a 3-way collaboration between Dan (NIH), Cambridge and IU.
  • Charles H. Davis may be a collaborator here - he has links to IUPAC
  • Kevin - we might want to include other chemical information as well as structure extracted from the abstracts.
  • San Francisco ACS in fall: maybe we can submit an abstract together about the above work. May 5 deadline for abstract submission
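
As a rough illustration of the "simple cluster analysis" step in the collaboration idea above, the Python sketch below groups papers whose pairwise similarity meets a threshold (single-linkage grouping via union-find). It is a stand-in written for these notes, not the R-serv service; the function name, the threshold, and the toy similarity values are all assumptions.

```python
from itertools import combinations


def cluster_by_threshold(items, similarity, threshold):
    """Group items so that any pair with similarity >= threshold is linked.

    `similarity` is any function taking two items and returning a number.
    Returns a sorted list of sorted clusters.
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Toy pairwise similarities between four hypothetical papers.
toy_scores = {
    frozenset(("p1", "p2")): 0.8,
    frozenset(("p3", "p4")): 0.7,
}


def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)


if __name__ == "__main__":
    print(cluster_by_threshold(["p1", "p2", "p3", "p4"], toy_similarity, 0.5))
```

Single linkage is used only because it is short to write down; it tends to chain loosely related items together, so a real service would probably offer other linkage methods as well.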

Follow-up:

  • David will send an email to Bobby Glen about possibly working together regarding IU's web services, data mining and docking
  • PMR will come up with a slide describing the collaboration area for discussion at NIH on Wednesday
  • CompLife 06 - we will look to present joint papers at this.
  • Check Dan's ability to join AG meetings. These might also be expanded to other groups.
  • Assign some names, goals, etc for the above collaborative project
  • Set next meeting for Thursday May 4th 9.30am

Meeting May 4, 2006

Agenda:

Present: Peter Murray-Rust, Peter Corbett, Gary Wiggins, Kevin Gilbert, Kelsey Forsythe, (Part of the meeting: Marlon Pierce, Geoffrey Fox)

Gary reviewed the notes that he and Marlon had taken at the April 26 meeting with Dan Zaharevitz and others at DTP in Frederick. Main points:

  • Dan emphasized that it is important to get something concrete done. His vision is to be able to sit at his laptop and plug and play data from various sources. Showing utility is more important than a top-down complete data model design. No other groups have an architecture to make tools work together. We need to be sure that both sides understand the source codes necessary to implement in the respective environments. There is a big difference between their internal database keys and the external identifiers, e.g., NSC#, CAS#, etc. Make sure that their notion of grid services is the same as ours. Dan is OK with using CML for structure and VOTables for the data.
  • DTP is experienced with CGI, Java, etc., but not Web services. DTP's current web site is servlet based (about 30 servlets), so it is suitable for human-driven point-and-click interaction. Not suited for workflows where the user wants to pick and choose the data to analyze.
  • DTP needs a service for substructure searching.
  • Mark Kunkel of DTP is working with androMDA-generated web services, which may be incompatible with Axis. This is being used to implement a new COMPARE web site. He is starting to use androMDA as a programming environment tool.
  • Dan would very much like to link compounds through growth inhibition data to genes.
  • Linking to a Screening Center: Dan said we could get a list of the Informatics Working Group members. One of their biggest problems is deciding how to pick compounds that are interesting. They don't have enough resources to closely examine all assays, so they need some guidance in filtering data. Finding interesting chemistry may include doing automated literature searches (via Google, OSCAR3) as well as visualization, data mining, and clustering. For the July meeting, we need to focus on connecting with the screening centers.

Notes from the 5/4/06 discussion:

  • For substructure searching, CDK has a whole bundle of things, but Peter Corbett has encountered some bugs in doing SSS with it. The problem with the CDK is with compatibility with other libraries - having the main CDK jar on your classpath can cause other things to fail. I've noticed this with JSPs, with the standard XSLT libraries (both unreported) and with the Java Preferences API (reported, and fixed in the CVS version). Also, when applets are run in web browsers (but NOT in the "applet viewer" that comes with Java) some CDK routines, like hydrogen adders and some of the descriptor-calculating routines also fail, often by failing to have an effect or returning zero rather than by throwing exceptions. This doesn't necessarily mean you'll encounter problems, but be aware. Peter M-R suggested that the CDK developers would work more on this if they knew we are serious about trying to apply it to DTP.
  • Peter M-R asked what can we do together that can be done reasonably quickly. He suggested that we make our contributions with what we've already got. Need something that can handle medium-sized data sets.
  • Peter M-R feels that web services are not very well defined. Therefore, their preference is to take a light-weight approach, since the vision tends to be bigger than the available manpower.
  • Looking at the DTP project, Peter M-R suggested that we try to slice the table into segments that are likely to be scientifically valuable, for example, by reducing the number of targets and the number of compounds, say 20 different subsets of molecules and a subset of the targets. The compounds could be steroids and other classes for about 1,000 compounds total. Cambridge can ID through QSAR whether something is a steroid. Cambridge could pre-run this.
  • Peter M-R suggested that on a regular basis we need to come up with architecture diagrams to help divide the work. It is easy to underestimate the work, but we can farm out parts of it, e.g., the substructure portion to CDK.
  • We need to firm up about a half dozen key web services that we need, such as:
    • Babel
    • InChI
    • DTP requests (may be multiple Web services)
    • Substructure searching
    • PubChem interactions

Our goal should be to form a communal trans-Atlantic, multi-site group that exposes some of these as web services.

  • Next meeting: Monday, May 22, 9:30 AM EDT. Rajarshi Guha will be in town at that time.

Meeting May 22, 2006

Present: Cambridge: Peter MR, Peter Corbett; Bloomington: David, Rajarshi Guha, Gary, Kevin Gilbert, Geoffrey, Marlon; IUPUI: Malika Mahoui, Kelsey Forsythe

  • OSCAR update - Cambridge will be concentrating on this over the next few months. Server software has been updated, including extended search of OSCAR results and listing co-occurrence of structures in papers. OSCAR can work with HTML and XML formats for paper input
  • Geoffrey asked where the publications used as input come from. Peter MR said we have to be careful about using non-open-access journal articles; David suggested the PLoS journals. Nick's robot can download all articles in a given journal. Peter Corbett suggested using PubMed abstracts. They can let you download the whole database of PubMed abstracts, with some conditions. There is also a University of Pennsylvania project called BioIE which includes a corpus: a set of selected PubMed abstracts, e.g. a set on Oncology, P450 oncology, etc. This has a high concentration of generic drug names, active compounds, etc. - things which are referenced in PubChem.
  • Geoffrey mentioned Microsoft Academic Live. Seems like a lot of open access material there. Maybe we can work with Microsoft on this.
  • Seems like having a web service that searches PubMed abstracts for compounds and generates SMILES would be very feasible.
  • Updates from IU: Rajarshi will be starting July 1st, and will be working on workflow generation, implementation of R and QSAR web services, and clients (including .NET). Our web services will need to use CML to allow information to be added along the workflow. Some issues if workflows split (and CMLs need to be re-merged). Requirement for stripping CML/SMILES for programs such as BCI clustering. PMR says most important thing is having a unique identifier for each compound for recovery. Marlon: Toxtree service has been implemented. Working with CONDOR. Meeting with MACE later today
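
PMR's point about unique identifiers can be sketched as follows: if every record keeps its compound ID through each branch of a split workflow, the branch outputs can be re-merged by ID even when one branch (such as a clustering program fed stripped SMILES) returns only its own results. This is a minimal Python illustration; the function name, the IDs, and the property values are made up for the example.

```python
def merge_branches(*branches):
    """Re-merge per-compound results from several workflow branches.

    Each branch is a dict mapping a unique compound ID to a dict of
    properties. Later branches overwrite earlier ones on key clashes.
    """
    merged = {}
    for branch in branches:
        for compound_id, properties in branch.items():
            merged.setdefault(compound_id, {}).update(properties)
    return merged


# Hypothetical outputs of two branches that processed the same compounds.
descriptor_branch = {
    "cmpd-1": {"smiles": "CCO", "logp": -0.1},
    "cmpd-2": {"smiles": "c1ccccc1", "logp": 2.1},
}
clustering_branch = {
    "cmpd-2": {"cluster": 7},
    "cmpd-1": {"cluster": 3},
}

if __name__ == "__main__":
    for compound_id, record in sorted(
        merge_branches(descriptor_branch, clustering_branch).items()
    ):
        print(compound_id, record)
```

The merge does not depend on record order or on both branches returning every compound, which is why the identifier matters more than the structure representation for recovery.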

Follow-up:

  • Jake Kim to connect with Peter Corbett about implementation of OSCAR at IU as a web service using PubMed Abstracts
  • PMR will investigate Microsoft Academic Live
  • David and Rajarshi will look at taking output of OSCAR and doing chemical similarity
  • David, Marlon, Rajarshi and Peter Corbett will come up with a timeline for completing this
  • Peter Corbett will send Sourceforge ID to Marlon
  • Put Peter on the CICC-DEV-L list (Marlon)
  • Next meeting June 5th