CORD-19 Dataset antiviral compounds #26

motey · 2020-04-03T13:41:21Z

~~With version 5 of the CORD19 dataset a list of Anti-Viral Candidate Compounds was included.~~

Actually, I just realized that this data was attached by the team working on a Python tool around the CORD-19 dataset: https://github.com/josephsdavid/cord-19-tools
At the moment there is no documentation on where they got this data from.

The attached readme says the following:

CAS COVID-19 Anti-Viral Candidate Compounds Readme

* The dataset includes 49,437 anti-viral candidate compounds (known and similar) created from the  CAS REGISTRY of chemical substances;
* The dataset is being provided in SDfile format, which includes the complete Molfile representation and other information such as cas.rn, cas.index.name, molecular.formula, molecular.weight, melting.point.experimental, and other property data. 
* Dataset entities represent known anti-viral drugs and related chemical substances that are structurally similar to a known anti-viral.

We have one huge file containing a long list of compounds described in MOL format.
One entry looks like this:

Cobicistat
C40H53N7O5S2
1004316-88-4 Copyright (C) 2020 ACS
 54 58  0  0  1  0  0  0  0  0999 V2000
48827.327914512.9573    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
48827.3279 9675.3049    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
44637.796116931.7835    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
53016.859716931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
40448.266814512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386614512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737416931.7835    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
40448.2668 9675.3049    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
61395.913516931.7835    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
36258.737421769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208114512.9573    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
61395.913521769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
65585.445314512.9573    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
32069.208124188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
27879.676316931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
69774.977116931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
23690.146914512.9598    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
27879.676321769.4359    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
73964.508914512.9573    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
69774.977121769.4359    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
19500.617616931.7860    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
23690.1469 9675.3073    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
78154.035816931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 2843.498713391.2040    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  875.850117810.6207    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000 9477.4613    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.8597 7256.4786    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.8597 2418.8262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.3866 9675.3049    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.3866    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
61395.9135 7256.4811    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
61395.9135 2418.8262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386624188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386629025.9146    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.859721769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.859731444.7408    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
48827.327924188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
48827.327929025.9146    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208129025.9170    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
27879.676331444.7433    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737431444.7433    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
27879.676336282.3957    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737436282.3932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208138701.2219    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
15311.088314512.9598    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
10891.671516480.6083    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
14805.4142 9701.8075    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 7654.650912885.5324    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
10073.4772 8696.0031    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
82343.562714512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
86762.979416480.6083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
82849.2367 9701.8075    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
90000.000012885.5324    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
87581.1738 8696.0031    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  6  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  2 27  1  0  0  0  0
  3  5  1  0  0  0  0
  4  6  1  0  0  0  0
  5  7  1  0  0  0  0
  5  8  2  3  0  0  0
  6  9  1  0  0  0  0
  7 10  1  1  0  0  0
  7 11  1  0  0  0  0
  9 12  1  6  0  0  0
  9 13  1  0  0  0  0
 10 14  1  0  0  0  0
 11 15  1  0  0  0  0
 12 33  1  0  0  0  0
 13 16  1  0  0  0  0
 14 39  1  0  0  0  0
 15 17  1  0  0  0  0
 15 18  2  3  0  0  0
 16 19  1  0  0  0  0
 16 20  2  3  0  0  0
 17 21  1  0  0  0  0
 17 22  1  0  0  0  0
 19 23  1  0  0  0  0
 21 45  1  0  0  0  0
 23 50  1  0  0  0  0
 24 25  1  0  0  0  0
 24 26  1  0  0  0  0
 24 48  1  0  0  0  0
 27 28  2  0  0  0  0
 27 29  1  0  0  0  0
 28 30  1  0  0  0  0
 29 31  2  0  0  0  0
 30 32  2  0  0  0  0
 31 32  1  0  0  0  0
 33 34  2  0  0  0  0
 33 35  1  0  0  0  0
 34 36  1  0  0  0  0
 35 37  2  0  0  0  0
 36 38  2  0  0  0  0
 37 38  1  0  0  0  0
 39 40  1  0  0  0  0
 39 41  1  0  0  0  0
 40 42  1  0  0  0  0
 41 43  1  0  0  0  0
 42 44  1  0  0  0  0
 43 44  1  0  0  0  0
 45 46  1  0  0  0  0
 45 47  2  0  0  0  0
 46 48  2  0  0  0  0
 47 49  1  0  0  0  0
 48 49  1  0  0  0  0
 50 51  2  0  0  0  0
 50 52  1  0  0  0  0
 51 53  1  0  0  0  0
 52 54  1  0  0  0  0
 53 54  2  0  0  0  0
M  END
> <cas.rn>
1004316-88-4

> <cas.index.name>
2,7,10,12-Tetraazatridecanoic acid, 12-methyl-13-[2-(1-methylethyl)-4-thiazolyl]-9-[2-(4-morpholinyl)ethyl]-8,11-dioxo-3,6-bis(phenylmethyl)-, 5-thiazolylmethyl ester, (3R,6R,9S)-

> <molecular.formula>
C40H53N7O5S2

> <molecular.weight>
776.02

> <boiling.point.predicted>
974.5±65.0 °C    Press: 760 Torr

> <density.predicted>
1.228±0.06 g/cm3    Temp: 20 °C; Press: 760 Torr

> <pka.predicted>
11.86±0.46    Most Acidic Temp: 25 °C
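
As a rough illustration of how such a record could be read without a chemistry toolkit, here is a minimal Python sketch. It assumes that records are separated by the standard SDfile `$$$$` delimiter line (not shown in the excerpt above) and that each data item is a `> <name>` header followed by its value lines and a blank line. The function name `parse_sd_records` is made up for this example.

```python
from typing import Dict, Iterable, Iterator


def parse_sd_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per SDfile record.

    The Molfile part (everything up to and including "M  END") is kept
    verbatim under the key "molblock"; each "> <field>" data item becomes
    its own key. Assumes records end with a "$$$$" line; a trailing record
    without the delimiter is still yielded.
    """
    record: Dict[str, str] = {}
    mol_lines = []
    field = None        # name of the data item currently being read
    in_molblock = True  # every record starts with its Molfile part

    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "$$$$":
            if record:
                yield record
            record, mol_lines, field, in_molblock = {}, [], None, True
        elif in_molblock:
            mol_lines.append(line)
            if line.startswith("M  END"):
                record["molblock"] = "\n".join(mol_lines)
                in_molblock = False
        elif line.startswith(">") and "<" in line:
            # Data header such as "> <cas.rn>": take the text between < and >.
            field = line[line.index("<") + 1 : line.rindex(">")]
            record[field] = ""
        elif field is not None:
            if line.strip() == "":
                field = None  # a blank line closes the current data item
            else:
                # Multi-line values are joined with newlines.
                record[field] = (record[field] + "\n" + line).lstrip("\n")

    if record:
        yield record
```

Called as `parse_sd_records(open(path, encoding="utf-8"))`, it streams the file, so the whole list of compounds never has to be held in memory at once.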

Is this data interesting for our scope?
@mpreusse : "YES"

How can we connect this data to our graph?
As discussed with @mpreusse, we could search for the molecular formula ("C40H53N7O5S2" in the example above) or the compound name ("Cobicistat") in the :PatentAbstract{text} and connect the matches.
Other ideas are welcome.
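
A minimal sketch of that text-matching idea in Python, assuming the abstracts have already been exported from the graph as a plain `{abstract_id: text}` dict and each compound comes with a list of search terms. The function name `find_mentions` and the input shapes are assumptions for illustration only. Note that a formula match is coarse, since different compounds can share the same formula.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_mentions(
    terms_by_compound: Dict[str, Iterable[str]],
    abstracts: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (compound_id, abstract_id) pairs where any term occurs in the text."""
    patterns = {}
    for compound_id, terms in terms_by_compound.items():
        escaped = [re.escape(t) for t in terms if t]
        if escaped:
            # (?<!\w) and (?!\w) keep a term from matching inside a longer token.
            # Matching is case-insensitive, which suits names better than formulas.
            patterns[compound_id] = re.compile(
                r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE
            )

    links = []
    for abstract_id, text in abstracts.items():
        for compound_id, pattern in patterns.items():
            if pattern.search(text):
                links.append((compound_id, abstract_id))
    return links
```

Each returned pair could then become one relationship between a compound node and a :PatentAbstract node.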


dkrizic · 2020-04-03T14:45:19Z

I have worked with compounds in Neo4j. In one project we created a molecular substructure search as a plugin for Neo4j that supported MOL V2000 and SMILES/SMARTS. Is this what you are asking about? I already mentioned this to @mpreusse.

motey · 2020-04-03T15:05:25Z

Sounds good. Is this plugin open source?

dkrizic · 2020-04-03T15:15:24Z

No, the plugin is not open source. There are multiple versions and iterations. The first one was based on the EPAM Indigo library (https://lifescience.opensource.epam.com/indigo/), which worked fine, but we always had clashes with dependent libraries and class loader issues in Neo4j. We had to switch to another software vendor for a chemical library.

But... I suggest that we implement the following:

* We store the compounds in Neo4j:
  * one node per compound
  * the whole MOL or SMILES string in a property of that node
* I will implement a microservice that uses the Indigo library and offers a REST interface for substructure and similarity search.
* I will write a plugin for Neo4j that integrates that service.
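
To make the "one node per compound" part concrete, here is a hedged Python sketch that only builds a Cypher statement and its parameters; it does not talk to a database. The `Compound` label, the property names (`casRn`, `name`, `formula`, `molblock`) and the choice of the CAS number as merge key are all assumptions for illustration, not an agreed schema. The input is assumed to be a dict holding the raw MOL text under `"molblock"` plus the SD data fields by name.

```python
from typing import Dict, Tuple

# Hypothetical schema: label and property names are placeholders.
MERGE_COMPOUND = (
    "MERGE (c:Compound {casRn: $casRn}) "
    "SET c.name = $name, c.formula = $formula, c.molblock = $molblock"
)


def compound_statement(record: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return (cypher, parameters) that store one compound as one node."""
    molblock = record.get("molblock", "")
    # The first line of a Molfile header is the molecule name.
    name = molblock.split("\n", 1)[0].strip()
    params = {
        "casRn": record["cas.rn"],
        "name": name,
        "formula": record.get("molecular.formula", ""),
        "molblock": molblock,  # the whole MOL text is kept as a string property
    }
    return MERGE_COMPOUND, params
```

With the official Neo4j Python driver the pair could be passed on as `session.run(query, params)`; using `MERGE` keeps repeated imports from creating duplicate nodes for the same key.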

It would help me to understand what we need "in a chemical way".

motey · 2020-04-03T16:04:55Z

We do need/want a substructure search function for the application/end users, correct?
Or do we just want to fingerprint the molecules once and connect them?
If the latter, what about using the Indigo library in a preprocessing step (in Python, for example, with https://pypi.org/project/epam.indigo/)?
From my (naive) view that looks a lot less complex, less error-prone, and maybe faster, as it would process the data locally.
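
A minimal sketch of the "fingerprint once and connect" variant, in plain Python. It assumes the fingerprints have already been computed by some chemistry toolkit and are available as integers used as bit vectors; no toolkit call is shown here. The function names and the 0.7 threshold are arbitrary choices for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient of two bit-vector fingerprints stored as ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints: treat as not similar
    return bin(a & b).count("1") / union


def similarity_edges(
    fingerprints: Dict[str, int], threshold: float = 0.7
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    edges = []
    for (id_a, fp_a), (id_b, fp_b) in combinations(fingerprints.items(), 2):
        score = tanimoto(fp_a, fp_b)
        if score >= threshold:
            edges.append((id_a, id_b, score))
    return edges
```

Each edge could be written to the graph once as a similarity relationship. The loop is brute force: for the 49,437 compounds mentioned in the readme that is roughly 1.2 billion pairs, so a real run would likely need batching or a smarter index.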

seangrant82 · 2020-04-06T02:50:07Z

Would this be helpful for linking what is in CORD-19 with the dataset: https://www.cas.org/covid-19-antiviral-compounds-dataset?utm_source=hootsuite&utm_medium=linkedin&utm_term=&utm_content=9a9f1234-6bd2-4673-9436-bb49800209ca&utm_campaign=COVID-19

bramble50 · 2020-04-08T16:48:26Z

For processing molecules I would recommend RDKit (https://github.com/rdkit/rdkit) rather than Indigo; it is much better supported and has a wider variety of functionality.

I suspect it will be enough just to connect the molecules to the existing data as compound nodes, ideally via compound->publication_id. You can always add additional features such as similarity/substructure searching afterwards.

For adding and searching for molecules, I would stay away from non-unique keys such as the chemical formula (unless it is just metadata on the node). Even the CAS numbers used in the example above are not unique or persistent. If there is no unique identifier from the source database, or if you want to add multiple chemical sources in the future, the best approach is to calculate InChIs or InChIKeys from the Molfiles/SDfiles above using RDKit (ideally with a molecular standardisation step as well).
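
A hedged sketch of the InChIKey idea in Python. `Chem.MolFromMolBlock` and `Chem.MolToInchiKey` are RDKit calls (the latter needs an RDKit build with InChI support); the helper names `inchikey_from_molblock` and `group_by_key` are made up for this example, and no standardisation step is included.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


def inchikey_from_molblock(molblock: str) -> Optional[str]:
    """Compute an InChIKey with RDKit; None if the block cannot be parsed."""
    # Imported lazily so the grouping helper below works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromMolBlock(molblock)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol) or None


def group_by_key(
    molblocks: Iterable[str],
    key_fn: Callable[[str], Optional[str]],
) -> Dict[str, List[str]]:
    """Group MOL blocks by structure key; blocks without a key are skipped."""
    groups = defaultdict(list)
    for molblock in molblocks:
        key = key_fn(molblock)
        if key is not None:
            groups[key].append(molblock)
    return dict(groups)
```

Calling `group_by_key(molblocks, inchikey_from_molblock)` would put entries that describe the same structure under one key, which is what makes the key usable for merging compounds from several sources.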

Other datasets I would look at for mapping include ChEMBL, PubChem (although it's a bit noisy) and SureChEMBL (ChEMBL for patents). I can create extra issues for these if needed.

motey · 2020-04-08T18:13:01Z

Maybe interesting as well: https://github.com/rdkit/neo4j-rdkit
(I have not read it, just stumbled on it and skimmed. Reminder: have a closer look.)
@sarmbruster was involved in that too.

sarmbruster · 2020-04-08T18:22:38Z

> Maybe interesting as well: https://github.com/rdkit/neo4j-rdkit
> (I have not read it, just stumbled on it and skimmed. Reminder: have a closer look.)
> @sarmbruster was involved in that too.

There's a lightning talk recording by the author of the plugin, see https://neo4j.com/online-summit/session/rdkit-neo4j-integration. I've just acted as a mentor.

motey assigned motey and mpreusse Apr 3, 2020

Jiros changed the title ~~If/How to integrate CORD-19 Dataset antiviral compounts~~ CORD-19 Dataset antiviral compounts Jul 10, 2020

Jiros changed the title ~~CORD-19 Dataset antiviral compounts~~ CORD-19 Dataset antiviral compounds Jul 10, 2020

Jiros transferred this issue from covidgraph/documentation Dec 7, 2020

Jiros added the labels Type: Data Source, Status: Suggested, Priority: Low, Tag: Help Wanted Dec 7, 2020
