QSAR Project with automated builds

Bioclipse Project
Plugins path: net.bioclipse.qsar
Wiki page last updated: 2010-08-11
Repo URL: http://github.com/olas/bioclipse.qsar


The Overview page of the QSAR Editor in Bioclipse.

This project aims to build a flexible framework for working with QSAR in Bioclipse. It consists of a file format for molecule and descriptor selections, and a builder that automatically builds the QSAR dataset from these selections in the background.


  • Ola Spjuth (Design and Bioclipse code)
  • Rajarshi Guha (Design and CDK code)
  • Egon Willighagen (Design and Bioclipse/CDK code)
  • Martin Eklund (Design)


See git repo Bioclipse.qsar


  • XSD constructed and EMF model generated and tested. [Done]
  • Project wizard [Done]
  • QSAR multi page editor [Done]
  • CDK descriptor calculations [Done]
  • Automatic build [Done]
  • Schema updates for units, structures, linking etc [Done]
  • REST service descriptor calculations [Done]
  • XMPP service descriptor calculations [Done]
  • GUI update for new schema [Done]
  • React to deleted/renamed resources and projects in Navigator [Not started]
  • Batch import of response values from separate file [Not started]
  • External descriptor calculations [Not started]
  • Multiple qsar.xml in a project [Not started]

QSAR file format

The idea is to have a file (qsar.xml) that defines molecules and descriptors. It might look something like the example below (note: this is a demonstration only; the information and URLs are made up).

The purpose is to define which resources have been used and which software was used to calculate the descriptors. How the calculation is done (via a Web service, an XMPP service, a Java library, or a shell script) is an implementation detail and not important for QSAR-ML.

A page for the current QSAR-ML is available.


<?xml version="1.0" encoding="UTF-8"?>
<qsar:DocumentRoot xmi:version="2.0"
   xmlns:xmi="http://www.omg.org/XMI" xmlns:qsar="http://www.bioclipse.net/qsar">
       authors="Ola Spjuth"
       datasetname="olas dataset"
       description="A dataset describing a lot of things. Pretty much everything."
       license="Creative Blah License"

   <qsar:reference xsi:type="bibtexml:BibTeXML.entryType" id="article1">
       <bibtexml:author>Spjuth, Ola and Helmus, Tobias and Willighagen, Egon and Kuhn, Stefan and Eklund, Martin and Wagener, Johannes and Murray-Rust, Peter and Steinbeck, Christoph and Wikberg, Jarl</bibtexml:author>
       <bibtexml:title>Bioclipse: an open source workbench for chemo- and bioinformatics</bibtexml:title>
       <bibtexml:journal>BMC Bioinformatics</bibtexml:journal>




     <resource file="/QSAR test/molecules/polycarpol.mol" id="polycarpol.mol" name="polycarpol.mol" 
         type="text" checksum="">
     	  <structure id="polycarpol.mol" resourceindex="0" inchi="TBC"/>

     <resource URL="http://pele.farmbio.uu.se/molecules/cml/0037.cml" id="0037.cml" name="0037.cml" 
         type="xml" checksum="">
     	  <structure id="polycarpol.mol" resourceid="mol1" inchi="TBC"/>

     <resource file="/QSAR test/molecules/Fragments2.sdf" id="Fragments2.sdf" name="Fragments2.sdf" 
         type="text" checksum="">
     	<structure id="fragments2.sdf_0" resourceindex="0" inchi="TBC"/>
     	<structure id="fragments2.sdf_1" resourceindex="1" inchi="TBC"/>
     	<structure id="fragments2.sdf_2" resourceindex="2" inchi="TBC"/>
     	<structure id="fragments2.sdf_3" resourceindex="3" inchi="TBC"/>

      <resource URL="http://pele.farmbio.uu.se/molecules/smiles/smiles30.smi" id="smiles30.smi" 
          name="smiles30.smi" type="text" checksum="">
       <structure id="smiles30.smi_0" resourceindex="0" inchi="TBC"/>
	<structure id="smiles30.smi_1" resourceindex="1" inchi="TBC"/>
      	<structure id="smiles30.smi_2" resourceindex="2" inchi="TBC"/>
      	<structure id="smiles30.smi_3" resourceindex="3" inchi="TBC"/>

     <resource file="/QSAR test/molecules/mols5.cml" id="mols5.cml" name="mols5.cml" type="xml" numStructures="4" checksum="">
     	<structure id="mols5.cml_0" resourceid="wee" inchi="TBC" />
     	<structure id="mols5.cml_1" resourceid="hoow" inchi="TBC" />
     	<structure id="mols5.cml_2" resourceid="mama" inchi="TBC" />
     	<structure id="mols5.cml_3" resourceid="mia" inchi="TBC" />
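
Reading the structure ids back out of such a file can be sketched with the JDK's DOM parser. This is only an illustration, not Bioclipse code (Bioclipse uses the EMF-generated model): the class name StructureIdReader, the wrapping resources element, and the closing tags in the sample are assumptions added so that the fragment is well-formed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Bioclipse.
class StructureIdReader {

    /** Returns the id attribute of every structure element, in document order. */
    static List<String> structureIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("structure");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Trimmed, well-formed variant of the example; the wrapper element is assumed.
        String xml =
              "<resources>"
            + "<resource file=\"/QSAR test/molecules/Fragments2.sdf\" id=\"Fragments2.sdf\">"
            + "<structure id=\"fragments2.sdf_0\" resourceindex=\"0\"/>"
            + "<structure id=\"fragments2.sdf_1\" resourceindex=\"1\"/>"
            + "</resource>"
            + "</resources>";
        System.out.println(structureIds(xml));
    }
}
```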


Descriptors and DescriptorProviders

A descriptorprovider is a piece of software that accepts one or more structures and one or more descriptorIDs (from BODO), and returns one or more descriptor values.

   <descriptorprovider id="cdk" name="Chemistry Development Kit" URL="http://cdk.sourceforge.net" 
       vendor="CDK Project" version=""/>

   <descriptorprovider id="dragon" name="Dragon" URL="http://talete.mi.it"
       vendor="Talete" version="5.5"/>

   <qsar:descriptors id="descriptor1" ontologyid="http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#xlogP" provider="cdk">
     <qsar:parameter key="checkAromaticity" value="true"/>
     <qsar:parameter key="salicylFlag" value="false"/>
   <qsar:descriptors id="descriptor3" ontologyid="http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#atomCount" provider="dragon"/>
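
The provider contract described above (structures and descriptor IDs in, descriptor values out) could be sketched as follows. Everything here is an assumption made for illustration: the DescriptorProvider interface, the ToyProvider class, and the made-up descriptor ID toy:smilesLength are not part of Bioclipse or BODO, and structures are represented as plain SMILES strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a descriptor provider; the real Bioclipse interface differs.
class ProviderSketch {

    interface DescriptorProvider {
        /** Maps each structure to its descriptor values, keyed by descriptor ID. */
        Map<String, Map<String, Double>> calculate(List<String> structures,
                                                   List<String> descriptorIds);
    }

    /** Toy provider that knows one made-up descriptor: the length of the SMILES string. */
    static class ToyProvider implements DescriptorProvider {
        static final String SMILES_LENGTH = "toy:smilesLength";

        @Override
        public Map<String, Map<String, Double>> calculate(List<String> structures,
                                                          List<String> descriptorIds) {
            Map<String, Map<String, Double>> result = new LinkedHashMap<>();
            for (String smiles : structures) {
                Map<String, Double> values = new LinkedHashMap<>();
                for (String id : descriptorIds) {
                    if (!SMILES_LENGTH.equals(id)) {
                        throw new IllegalArgumentException("Unsupported descriptor: " + id);
                    }
                    values.put(id, (double) smiles.length());
                }
                result.put(smiles, values);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        DescriptorProvider provider = new ToyProvider();
        System.out.println(provider.calculate(
                Arrays.asList("CCO", "c1ccccc1"),
                Arrays.asList(ToyProvider.SMILES_LENGTH)));
    }
}
```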


Preprocessing steps are currently at an early planning stage. No implementation exists.

     <preprocessingStep id="Smi23d" name="Generate 3D coordinates with smi23d" namespace="http://www.chembiogrid.org/cheminfo/smi23d/"
     <preprocessingStep id="org.openscience.cdk.atomtype.sybyl" name="Sybyl Atom Types"
         namespace="http://cdk.sf.net" order="2"/>

Responses and Units

There are too many units for measuring responses, they depend on the environment and assay type, and there is no controlled vocabulary for them. QSAR-ML therefore supports the definition of a response unit, and leaves translation, joining, and conversion between data sets to the user.

        <responseunit id="ic50" shortname="IC50" name="half maximal inhibitory concentration (IC50)"
              description="Measure of the effectiveness of a compound in inhibiting biological or 
              biochemical function." URL="http://en.wikipedia.org/wiki/IC50">

Responses can be single-valued or arrays. The example below contains a mix of single values and arrays; this is purely for demonstration purposes. Most likely the responses will all be of the same length. The unit attribute links to the id of a responseunit (defined above).

     <response value="11.45" structureid="fragments2.sdf_0" unit="ic50"/>
     <response value="15.45" structureid="mols5.cml_2" unit="ic50"/>
     <response arrayValues="12.56,23.45,34.56" structureid="mols5.cml_0" unit="ic50"/>
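
A reader of the format has to handle both the value and the arrayValues form. A minimal sketch, assuming the array is a plain comma-separated list of decimal numbers as in the example; the class name ResponseValues is made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Bioclipse.
class ResponseValues {

    /** Parses a single value ("11.45") or a comma-separated array ("12.56,23.45,34.56"). */
    static double[] parse(String raw) {
        String[] parts = raw.trim().split("\\s*,\\s*");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("11.45")));
        System.out.println(Arrays.toString(parse("12.56,23.45,34.56")));
    }
}
```

Treating a single value as an array of length one keeps the downstream code the same for both cases.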


I have decided to include results from descriptor calculations. These are really derived numbers, but without them the file would not work as an exchange format.

   <qsar:descriptorresult descriptorid="http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#xlogP" 
     <qsar:descriptorvalue index="0" label="desc1_col1" value="19.564"/>
     <qsar:descriptorvalue index="1" label="desc1_col2" value="76.2"/>
   <qsar:descriptorresult descriptorid="http://www.blueobelisk.org/ontologies/chemoinformatics-algorithms/#atomCount" 
     <qsar:descriptorvalue index="0" label="desc2label" value="19.564"/>

Note that a descriptor can have multiple results (an array), which is why the index is needed for ordering.
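
The role of the index can be sketched as follows: values may arrive in any order and are sorted by index before being written into a row of the dataset. The classes DescriptorValueOrder and Value are assumptions for illustration and only mirror the attributes of descriptorvalue in the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; mirrors the index/label/value attributes of descriptorvalue.
class DescriptorValueOrder {

    static final class Value {
        final int index;
        final String label;
        final double value;

        Value(int index, String label, double value) {
            this.index = index;
            this.label = label;
            this.value = value;
        }
    }

    /** Returns the numeric values sorted by their index attribute. */
    static double[] orderedValues(List<Value> values) {
        List<Value> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingInt(v -> v.index));
        double[] result = new double[sorted.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = sorted.get(i).value;
        }
        return result;
    }

    public static void main(String[] args) {
        // Deliberately listed out of order.
        List<Value> values = Arrays.asList(
                new Value(1, "desc1_col2", 76.2),
                new Value(0, "desc1_col1", 19.564));
        System.out.println(Arrays.toString(orderedValues(values)));
    }
}
```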

Bioclipse plugins

Contributing descriptors via extensions

Bioclipse has an extension point, net.bioclipse.qsar.descriptorProvider in plugin net.bioclipse.qsar. The easiest way to understand it is to look at the CDK implementation in net.bioclipse.cdk.qsar.

           name="Chemistry Development Kit"
           vendor="Chemistry Development Kit"


Here the implementation net.bioclipse.cdk.qsar.impl.CDKDescriptorCalculator implements IDescriptorCalculator, which has the method:

	/**
	 * Calculates descriptors for a List of molecules.
	 * @param molecules the IMolecules as input
	 * @param descriptorsForProvider descriptors with parameters and impl
	 * @param monitor Progress monitor. Set worked++ for each molecule/descriptor
	 * done (max size must be set to mols x descs before this is called)
	 * @return Map from each molecule to its list of IDescriptorResult
	 */
	public Map<? extends IMolecule, List<IDescriptorResult>> calculateDescriptor(
			             List<? extends IMolecule> molecules, 
			             List<DescriptorType> descriptorsForProvider, 
			             IProgressMonitor monitor);

Here is an individual descriptor implementation that accepts a parameter and has listed values to be able to calculate multiple descriptor+parameter combinations out of the box. Still from the plugin net.bioclipse.cdk.qsar:

                 description="Number of atoms of a certain element type."
                       description="Element name. Wild card * means all atoms"
                       <listedvalue value="C" />
                       <listedvalue value="N" />
                       <listedvalue value="O" />
                       <listedvalue value="H" />
                       <listedvalue value="*" />

This is simply a mapping from an ID that the calculator can make use of (in this case the CDK class) to an entry in the ontology. In addition, it declares the parameters and a list of parameter values that will be calculated if the user selects "all parameters". The list is not a restriction: if the user wants to count the number of Si atoms, this is possible too.
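
How the listed values turn one descriptor into several descriptor+parameter combinations can be sketched like this. The class name, the parameter key elementName, and the string form of a combination are all assumptions for illustration; the real plugin stores this in the model rather than in strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of expanding listed values; not the Bioclipse implementation.
class ListedValueExpansion {

    /** Creates one descriptor+parameter combination per listed value. */
    static List<String> expand(String descriptorId, String parameterKey,
                               List<String> listedValues) {
        List<String> combinations = new ArrayList<>();
        for (String value : listedValues) {
            combinations.add(descriptorId + "?" + parameterKey + "=" + value);
        }
        return combinations;
    }

    public static void main(String[] args) {
        // "All parameters": the listed values from the extension.
        System.out.println(expand("atomCount", "elementName",
                Arrays.asList("C", "N", "O", "H", "*")));
        // A value outside the list, such as Si, goes through the same path.
        System.out.println(expand("atomCount", "elementName", Arrays.asList("Si")));
    }
}
```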

Project Implementation

1) Create an XSD for QSAR analysis. XSD is used rather than RelaxNG because step 2) below requires it as input.

2) Generate model code with EMF. This gives a model with full XML binding/validation according to the schema and an undo/redo framework.

3) Write GUI components that make use of the generated model. These will include:

  • A wizard to create New QSAR projects
    • Adds a folder 'molecules'
    • Creates the file qsar.xml (sort of a QSAR manifest)
  • QSAREditor that edits qsar.xml with graphical pages
    • Check molecules from the 'molecules' folder
    • Molecular preprocessing steps (e.g. generate3D)
    • Descriptor selections
    • Input and match observation data for molecules
  • A QSAR Project with automated builds
    • For all molecules checked in qsar.xml, calculate descriptors, add observation values, and create/update the derived file matrix.csv

This means I can have a chart open in Bioclipse, and when I change qsar.xml (for example, add or remove a descriptor) I immediately see the updated chart. If you add a descriptor, it is calculated in the background for all selected molecules. The build should only work on deltas (i.e. a partial build that does not recalculate molecule/descriptor pairs that have already been built).
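
The delta idea can be sketched as follows: results are cached per molecule/descriptor pair, and a rebuild only calls the calculator for pairs it has not seen before. This is a simplified illustration under stated assumptions, not the Bioclipse builder: the class DeltaBuilder, the in-memory cache, the string keys, and the CSV layout are all made up, and each descriptor is assumed to yield a single value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of a delta build; not the Bioclipse builder.
class DeltaBuilder {

    private final Map<String, Double> cache = new LinkedHashMap<>();
    private int calculations = 0;

    /** Builds the CSV matrix, calculating only molecule/descriptor pairs not seen before. */
    String build(List<String> molecules, List<String> descriptors,
                 ToDoubleBiFunction<String, String> calculator) {
        StringBuilder csv = new StringBuilder("molecule");
        for (String descriptor : descriptors) {
            csv.append(',').append(descriptor);
        }
        csv.append('\n');
        for (String molecule : molecules) {
            csv.append(molecule);
            for (String descriptor : descriptors) {
                String key = molecule + "|" + descriptor;
                Double value = cache.get(key);
                if (value == null) {
                    // Only pairs missing from the cache trigger a calculation.
                    value = calculator.applyAsDouble(molecule, descriptor);
                    cache.put(key, value);
                    calculations++;
                }
                csv.append(',').append(value);
            }
            csv.append('\n');
        }
        return csv.toString();
    }

    /** Number of calculator calls made so far, across all builds. */
    int calculations() {
        return calculations;
    }

    public static void main(String[] args) {
        DeltaBuilder builder = new DeltaBuilder();
        List<String> molecules = Arrays.asList("CCO", "CC");
        // Toy calculator: the "descriptor" is just the length of the molecule string.
        builder.build(molecules, Arrays.asList("len"), (m, d) -> m.length());
        // Adding a second descriptor calculates only the new column.
        System.out.print(builder.build(molecules, Arrays.asList("len", "len2"),
                (m, d) -> m.length()));
        System.out.println("calculations: " + builder.calculations());
    }
}
```

Removing a descriptor needs no calculation at all in this sketch, since the column is simply left out of the next matrix.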

The idea is naturally to extend this later with support for integrated data analysis.

Comments on the project's design and implementation are highly appreciated.

--Ola 17:05, 11 March 2009 (CET)