Disagreement due to stereochemical SMILES #45

Closed
dswigh opened this issue Apr 26, 2023 · 4 comments

Comments

@dswigh
Contributor

dswigh commented Apr 26, 2023

Given the molecule: (e)-2-butenenitrile
PubChem will resolve to: ['C/C=C/C#N']
CIR will resolve to: ['CC=CC#N']

These two are (almost) the same SMILES strings, but Pura says they don't agree because one specifies the stereochemistry, while the other doesn't.

Perhaps a 'drop stereochemical information' arg would be a solution?
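To make the suggestion concrete, here is a minimal sketch of how a comparison with such a flag could behave. The names `strip_stereo_marks`, `smiles_agree`, and `ignore_stereo` are hypothetical and not part of Pura. The stripping is a crude string-level stand-in that only deletes the SMILES stereo markers (`/`, `\`, `@`); it does not canonicalise, so it only works when both strings already use the same atom order.

```python
import re

# Hypothetical helper names; nothing here is Pura's actual API.
# Matches the SMILES stereo markers: '/' and '\' (double-bond geometry), '@' (chirality).
_STEREO_MARKS = re.compile(r"[/\\@]")


def strip_stereo_marks(smiles: str) -> str:
    """Crudely delete stereo markers from a SMILES string (no canonicalisation)."""
    return _STEREO_MARKS.sub("", smiles)


def smiles_agree(a: str, b: str, ignore_stereo: bool = False) -> bool:
    """Compare two SMILES strings, optionally ignoring stereo markers."""
    if ignore_stereo:
        a, b = strip_stereo_marks(a), strip_stereo_marks(b)
    return a == b


print(smiles_agree("C/C=C/C#N", "CC=CC#N"))                      # False
print(smiles_agree("C/C=C/C#N", "CC=CC#N", ignore_stereo=True))  # True
```

A real implementation would more likely canonicalise both strings with a cheminformatics toolkit before comparing, rather than editing the text directly.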

@marcosfelt
Collaborator

I think this would make sense! So just to confirm, you'd want an option in resolve_identifiers that ignores stereochemistry differences?

@dswigh
Contributor Author

dswigh commented Apr 27, 2023

Yea! I used the following in my own code:

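# Canonicalise and remove stereochemistry
import pandas as pd
from rdkit import Chem

def clean_smiles(smiles):
    # Leave missing values (NaN/None) untouched
    if pd.isna(smiles):
        return smiles
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None for a string it cannot parse; pass it through unchanged
    if mol is None:
        return smiles
    # isomericSmiles=False is what strips away the stereo info
    return Chem.MolToSmiles(mol, isomericSmiles=False)

# Apply the function to every cell of the DataFrame
# (applymap is deprecated in newer pandas in favour of DataFrame.map)
df = pura_solvents.applymap(clean_smiles)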

I haven't fully investigated which services or conditions cause a SMILES string to include stereochemistry, omit it, or mark it as 'explicitly ambiguous' (i.e. with a crossed bond).

@dswigh
Contributor Author

dswigh commented Apr 27, 2023

Didn't realise the indentation would be removed by markdown... hopefully it's self-evident how the indentation should be!

@marcosfelt
Collaborator

marcosfelt commented Apr 28, 2023

Following up on the discussion we had in person: the behavior in the original post is actually expected, since (e)-2-butenenitrile should resolve to C/C=C/C#N (i.e., CIR was wrong). We therefore want the consensus algorithm to treat these two SMILES as different and report that there is not sufficient agreement.
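As a rough illustration of that conclusion, the sketch below shows a strict consensus check in which the two results count as separate candidates. The function `consensus_smiles`, its `agreement` parameter, and the service labels are hypothetical and only meant to show the idea; this is not Pura's actual implementation.

```python
from collections import Counter
from typing import Dict, Optional


def consensus_smiles(results: Dict[str, str], agreement: int = 2) -> Optional[str]:
    """Return the SMILES that at least `agreement` services returned, else None.

    Strings are compared exactly, so a stereo-specific SMILES and its
    stereo-free counterpart count as two different answers.
    """
    if not results:
        return None
    counts = Counter(results.values())
    smiles, votes = counts.most_common(1)[0]
    return smiles if votes >= agreement else None


# The case from the original post: the two services disagree, so no consensus.
print(consensus_smiles({"pubchem": "C/C=C/C#N", "cir": "CC=CC#N"}))  # None
```

Under this strict comparison, a third service would have to return one of the two strings exactly before either is accepted.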
