Friday, November 15, 2013

Substructure fingerprints and the cartridge

Substructure fingerprints and the PostgreSQL cartridge

The previous post about using fingerprints for substructure screening covered the basics and looked at screenout rates. A more important point is how well the fingerprints perform in a real-world application. The use I will pick here is integration with the RDKit cartridge for PostgreSQL.

The RDKit cartridge includes integration with PostgreSQL's Generalized Search Tree (GiST) index engine and allows indices to be built for molecule columns that help with substructure search.

To demonstrate this, I'll start by collecting the 50K compounds that I've been using into a database and building an index:

~/RDKit_blog/data > psql -c 'create table raw_data2 (chemblid integer, smiles text)' fptest
CREATE TABLE
~/RDKit_blog/data > zcat chembl16_50K_sss.txt.gz | sed 's/\\/\\\\/g' | psql -c "copy raw_data2 (chemblid,smiles) from stdin with delimiter ' '" fptest
~/RDKit_blog/data > psql fptest
psql (9.2.4)
Type "help" for help.

fptest=# \timing
Timing is on.
fptest=# select * into mols from (select chemblid,mol_from_smiles(smiles::cstring) m from raw_data2) tmp where m is not null;
SELECT 50000
Time: 36408.113 ms
fptest=# create index molidx on mols using gist(m);
CREATE INDEX
Time: 56578.796 ms

Add an additional query set, the Murcko scaffolds for the ZINC_leads set:

fptest=# select id,mol_murckoscaffold(m) m into zinc_leads_scaffolds from zinc_leads;
SELECT 500
Time: 108.871 ms

Now try a couple substructure queries using some random ZINC_leads scaffolds:

fptest=# select * from zinc_leads_scaffolds order by random() limit 5;
      id      |               m               
--------------+-------------------------------
 ZINC35324029 | c1ccc([C@@H]2CO2)cc1
 ZINC05920112 | S=c1ssc2c1SCS2
 ZINC16848052 | c1nnc[nH]1
 ZINC00102676 | O=c1[nH]c(=O)n(-c2ccccc2)s1
 ZINC00265495 | O=C1O[C@H](c2ccoc2)C2CCCC=C12
(5 rows)

Time: 3.029 ms
fptest=# select * from mols where m@>'c1ccc([C@@H]2CO2)cc1';
 chemblid |                   m                   
----------+---------------------------------------
  1355230 | C=CCOC1=CC(=O)c2ccccc2C12CO2
  1122083 | OC1c2ccc3c4ccccc4c4ccccc4c3c2C2OC2C1O
   238553 | NC(=O)N1c2ccccc2C2OC2c2ccccc21
   624377 | Cc1ccc(C2OC2C(=O)c2ccc3c(c2)CCCC3)cc1
(4 rows)

Time: 13.225 ms
fptest=# select * from mols where m@>'O=c1[nH]c(=O)n(-c2ccccc2)s1';
 chemblid |              m               
----------+------------------------------
   495448 | O=c1sn(-c2ccccc2)c(=O)n1CCCl
(1 row)

Time: 3.698 ms
fptest=# select count(*) from mols where m@>'c1nnc[nH]1';
 count 
-------
   986
(1 row)

Time: 80.336 ms

The individual queries using the index are nice and fast. How about if I turn the index off?

fptest=# SET enable_indexscan=off;
SET
Time: 0.424 ms
fptest=# SET enable_bitmapscan=off;
SET
Time: 0.206 ms
fptest=# SET enable_seqscan=on;
SET
Time: 0.195 ms
fptest=# select * from mols where m@>'c1ccc([C@@H]2CO2)cc1';
 chemblid |                   m                   
----------+---------------------------------------
  1355230 | C=CCOC1=CC(=O)c2ccccc2C12CO2
  1122083 | OC1c2ccc3c4ccccc4c4ccccc4c3c2C2OC2C1O
   238553 | NC(=O)N1c2ccccc2C2OC2c2ccccc21
   624377 | Cc1ccc(C2OC2C(=O)c2ccc3c(c2)CCCC3)cc1
(4 rows)

Time: 4191.875 ms

That certainly makes a difference. We can see why by turning the index back on and looking at the query analysis:

fptest=# SET enable_indexscan=on;
SET
Time: 0.447 ms
fptest=# SET enable_bitmapscan=on;
SET
Time: 0.158 ms
fptest=# explain analyze select * from mols where m@>'c1ccc([C@@H]2CO2)cc1';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on mols  (cost=12.74..192.77 rows=50 width=389) (actual time=10.848..11.482 rows=4 loops=1)
   Recheck Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
   Rows Removed by Index Recheck: 3
   ->  Bitmap Index Scan on molidx  (cost=0.00..12.72 rows=50 width=0) (actual time=10.262..10.262 rows=7 loops=1)
         Index Cond: (m @> 'c1ccc([C@@H]2CO2)cc1'::mol)
 Total runtime: 11.702 ms
(6 rows)

Time: 13.743 ms

There's lots of information there, most of which I'm going to ignore. For the purposes of this post, the important bits are the row counts from the index scan -- rows=7 -- and the final bitmap heap scan -- rows=4. For this query, seven database molecules passed the fingerprint scan and needed to have an actual substructure search done. Four of those actually contained the substructure. That's about the same screenout rate observed in the previous post.

If you're interested, you can get more information about explain output here or here.

The value of the fingerprints and the index really becomes clear when we do bulk searches using the cross join operation in SQL:

fptest=# select count(*) from mols cross join zinc_leads qt where mols.m@>qt.m;
 count 
-------
  1274
(1 row)

Time: 715.373 ms

This query does a substructure search across the molecules for each row in the ZINC_leads set. Without the index, the database would have to load all 50K molecules from the mols table and do 25000000 (50000*500) substructure comparisons. This takes a while:

fptest=# SET enable_indexscan=off;
SET
Time: 0.306 ms
fptest=# SET enable_bitmapscan=off;
SET
Time: 0.222 ms
fptest=# SET enable_seqscan=on;
SET
Time: 0.200 ms
fptest=# select count(*) from mols cross join zinc_leads qt where mols.m@>qt.m;
 count 
-------
  1274
(1 row)

Time: 1234519.855 ms

The index enables a speedup of >1700 here. Awesome. :-)

If we make the queries more generic by using their Murcko scaffolds, we get more results and the queries take longer:

fptest=# select count(*) from mols cross join zinc_leads_scaffolds qt where mols.m@>qt.m;
 count  
--------
 310719
(1 row)

Time: 19654.489 ms

Finding the 310000 matches takes just under 20 seconds.

For reference, here are the results for the other two query sets:

fptest=# select count(*) from mols cross join pubchem_pieces qt where mols.m@>qt.m;
  count  
---------
 1935315
(1 row)

Time: 214718.647 ms
fptest=# select count(*) from mols cross join zinc_frags qt where mols.m@>qt.m;
 count 
-------
  4659
(1 row)

Switching to a fingerprint that was less effective in the screenout benchhmarking, the Layered fingerprint, we see a substantial increase in search times:

fptest=# select count(*) from mols cross join pubchem_pieces qt where mols.m@>qt.m;
  count  
---------
 1935315
(1 row)

Time: 359185.914 ms
fptest=# select count(*) from mols cross join zinc_frags qt where mols.m@>qt.m;
 count 
-------
  4659
(1 row)

Time: 5478.102 ms
fptest=# select count(*) from mols cross join zinc_leads qt where mols.m@>qt.m;
 count 
-------
  1274
(1 row)

Time: 2861.711 ms
fptest=# select count(*) from mols cross join zinc_leads_scaffolds qt where mols.m@>qt.m;
 count  
--------
 310719
(1 row)

Time: 34194.532 ms

No comments: