CORD-19 Dataset antiviral compounds #26
Comments
I have worked with compounds in Neo4j. In one project we created a molecular substructure search as a plugin for Neo4j that supported MOL V2000 and SMILES/SMARTS. Is this what you are asking about? I already mentioned this to @mpreusse.
Sounds good. Is this plugin open source?
No, the plugin is not open source. There are multiple versions and iterations. The first one was based on the EPAM Indigo library (https://lifescience.opensource.epam.com/indigo/), which worked fine, but we always had clashes with the dependent libraries and class loader issues in Neo4j. We had to switch to another software vendor for a chemical library. But... I suggest that we implement the following:
It would help me to understand what we need "in a chemical way".
We do need/want a substructure search function for the application/end users, correct?
Would this be helpful in linking between what is in CORD-19 and this dataset: https://www.cas.org/covid-19-antiviral-compounds-dataset?utm_source=hootsuite&utm_medium=linkedin&utm_term=&utm_content=9a9f1234-6bd2-4673-9436-bb49800209ca&utm_campaign=COVID-19
For processing molecules I would recommend RDKit (https://github.com/rdkit/rdkit) rather than Indigo; it is much better supported and has a wider variety of functionality. I suspect it will be enough just to connect the molecules to the existing data as compound nodes, ideally by compound->publication_id. You can always add additional features like similarity/substructure searching afterwards.
In terms of adding/searching for molecules, I would stay away from using non-unique keys such as the chemical formula (unless just as metadata on the node). Even the CAS numbers used in the above example are not unique or persistent. If there is no unique identifier from the database they come from, or you want to add multiple chemical sources in future, then the best way to do this is to calculate InChIs or InChIKeys from the MOL files/SDFs above using RDKit (ideally with a molecular standardisation process as well).
Other datasets I would look at for mapping include ChEMBL, PubChem (although it's a bit noisy) and SureChEMBL (ChEMBL for patents). I can create extra issues for these if needed.
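A minimal sketch of the keying idea above, in Python. It only builds a parameterized Cypher statement that merges a compound node on its InChIKey and links it to a publication; it does not talk to a database. The labels `Compound` and `Paper`, the relationship type `MENTIONED_IN`, the property names, the id `paper-123` and the helper `compound_link_statement` are all hypothetical and not taken from this project's schema. The InChIKey is assumed to have been computed beforehand (for example with RDKit).

```python
import re

# Hypothetical schema: (:Compound {inchi_key})-[:MENTIONED_IN]->(:Paper {id}).
# None of these labels or property names come from the project itself.
LINK_COMPOUND_QUERY = """
MERGE (c:Compound {inchi_key: $inchi_key})
SET c.formula = $formula
WITH c
MATCH (p:Paper {id: $publication_id})
MERGE (c)-[:MENTIONED_IN]->(p)
"""

# A standard InChIKey has the shape 14 letters, 10 letters, 1 letter.
INCHI_KEY_SHAPE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def compound_link_statement(inchi_key, formula, publication_id):
    """Return (query, parameters) for linking one compound to one publication.

    The InChIKey is the merge key. The formula is stored as plain metadata,
    because different compounds (isomers) can share the same formula.
    """
    if not INCHI_KEY_SHAPE.fullmatch(inchi_key):
        raise ValueError(f"not a standard InChIKey: {inchi_key!r}")
    parameters = {
        "inchi_key": inchi_key,
        "formula": formula,
        "publication_id": publication_id,
    }
    return LINK_COMPOUND_QUERY, parameters


# Example with ethanol (InChIKey LFQSCWFLJHTTHZ-UHFFFAOYSA-N, formula C2H6O)
# and a made-up publication id.
query, parameters = compound_link_statement(
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "C2H6O", "paper-123"
)
```

The returned pair could then be handed to whichever Neo4j client is in use. Merging on the InChIKey means that loading the same compound from a second source would reuse the existing node instead of creating a duplicate.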
Maybe interesting as well: https://github.com/rdkit/neo4j-rdkit
There's a lightning talk recording by the author of the plugin, see https://neo4j.com/online-summit/session/rdkit-neo4j-integration. I've just acted as a mentor.
With version 5 of the CORD-19 dataset, a list of antiviral candidate compounds was included. Actually, I just realized that this data was attached by the team working on a Python tool around the CORD-19 dataset: https://github.com/josephsdavid/cord-19-tools
At the moment there is no documentation on where they got this data from.
The attached readme says the following:
We have one huge file containing a big list of compounds described in MOL format.
One entry looks like this
Is this data interesting for our scope?
@mpreusse : "YES"
How can we connect this data to our graph?
As discussed with @mpreusse, we could search for the molecular formula ("C40H53N7O5S2" in the example above) in the :PatentAbstract{text} and connect them.
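A rough sketch of that text lookup in Python, assuming the abstract texts have already been pulled out of the graph into a dictionary. The helper names and the abstract ids and texts below are made up for illustration. The pattern only accepts the formula as a standalone token, so a longer formula that merely starts with the same characters is not matched.

```python
import re


def formula_pattern(formula):
    """Compile a regex that matches `formula` only as a standalone token."""
    # The lookarounds stop "C2H6O" from matching inside "C2H6O2" or "XC2H6O".
    return re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(formula) + r"(?![A-Za-z0-9])"
    )


def abstracts_mentioning(formula, abstracts):
    """Return the ids of abstracts whose text contains the formula.

    `abstracts` maps a (hypothetical) abstract id to its text.
    """
    pattern = formula_pattern(formula)
    return [
        abstract_id
        for abstract_id, text in abstracts.items()
        if pattern.search(text)
    ]


# Made-up abstracts, only to show the matching behaviour.
abstracts = {
    "pa-1": "The compound C40H53N7O5S2 was evaluated in vitro.",
    "pa-2": "A related salt, C40H53N7O5S2Cl, was also prepared.",
    "pa-3": "No formula is mentioned in this abstract.",
}
matches = abstracts_mentioning("C40H53N7O5S2", abstracts)  # -> ['pa-1']
```

A formula match is only a candidate link: several different compounds can share one formula, so the resulting relationships would probably need a second check before being trusted.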
Other ideas are welcome.