Tutorial
The Basics
We begin by importing selfies.
[1]:
import selfies as sf
First, let’s try translating between SMILES and SELFIES; as an example, we will use benzaldehyde. To translate from SMILES to SELFIES, use the selfies.encoder function, and to translate from SELFIES back to SMILES, use the selfies.decoder function.
[2]:
original_smiles = "O=Cc1ccccc1" # benzaldehyde
try:
    encoded_selfies = sf.encoder(original_smiles)  # SMILES -> SELFIES
    decoded_smiles = sf.decoder(encoded_selfies)   # SELFIES -> SMILES
except sf.EncoderError as err:
    pass  # sf.encoder error...
except sf.DecoderError as err:
    pass  # sf.decoder error...
[3]:
encoded_selfies
[3]:
'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'
[4]:
decoded_smiles
[4]:
'O=CC1=CC=CC=C1'
Note that original_smiles and decoded_smiles are different strings, but they both represent benzaldehyde. Thus, when comparing the two SMILES strings, string equality should not be used. Instead, use RDKit to check whether the SMILES strings represent the same molecule.
[5]:
from rdkit import Chem
Chem.CanonSmiles(original_smiles) == Chem.CanonSmiles(decoded_smiles)
[5]:
True
Customizing SELFIES
The SELFIES grammar is derived dynamically from a set of semantic constraints, which assign bonding capacities to various atoms. Let’s customize the semantic constraints that selfies operates on. By default, the following constraints are used:
[6]:
sf.get_preset_constraints("default")
[6]:
{'H': 1,
'F': 1,
'Cl': 1,
'Br': 1,
'I': 1,
'O': 2,
'O+1': 3,
'O-1': 1,
'N': 3,
'N+1': 4,
'N-1': 2,
'C': 4,
'C+1': 5,
'C-1': 3,
'P': 5,
'P+1': 6,
'P-1': 4,
'S': 6,
'S+1': 7,
'S-1': 5,
'?': 8}
These constraints map atoms (the keys) to their bonding capacities (the values). The special ? key gives the bonding capacity for all atoms that are not explicitly listed in the constraints. For example, S and Li are constrained to a maximum of 6 and 8 bonds, respectively. Every SELFIES string can be decoded into a molecule that obeys the current constraints.
[7]:
sf.decoder("[Li][=C][C][S][=C][C][#S]")
[7]:
'[Li]=CCS=CC#S'
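The ? fallback described above amounts to a simple dictionary lookup. A minimal pure-Python sketch, illustrative only (selfies performs this lookup internally):

```python
# Subset of the default constraint dict shown above.
constraints = {"H": 1, "F": 1, "O": 2, "N": 3, "C": 4, "P": 5, "S": 6, "?": 8}

def bond_capacity(atom: str) -> int:
    # Atoms absent from the dict fall back to the '?' entry.
    return constraints.get(atom, constraints["?"])

print(bond_capacity("S"))   # 6: listed explicitly
print(bond_capacity("Li"))  # 8: not listed, so '?' applies
```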
But suppose that we instead wanted to constrain S and Li to a maximum of 2 and 1 bond(s), respectively. To do so, we create a new set of constraints, and tell selfies to operate on them using selfies.set_semantic_constraints.
[8]:
new_constraints = sf.get_preset_constraints("default")
new_constraints['Li'] = 1
new_constraints['S'] = 2
sf.set_semantic_constraints(new_constraints)
To check that the update was successful, we can use selfies.get_semantic_constraints, which returns the semantic constraints that selfies is currently operating on.
[9]:
sf.get_semantic_constraints()
[9]:
{'H': 1,
'F': 1,
'Cl': 1,
'Br': 1,
'I': 1,
'O': 2,
'O+1': 3,
'O-1': 1,
'N': 3,
'N+1': 4,
'N-1': 2,
'C': 4,
'C+1': 5,
'C-1': 3,
'P': 5,
'P+1': 6,
'P-1': 4,
'S': 2,
'S+1': 7,
'S-1': 5,
'?': 8,
'Li': 1}
Our previous SELFIES string is now decoded as follows. Notice that the new bonding capacities are respected: every S makes at most 2 bonds and every Li at most 1.
[10]:
sf.decoder("[Li][=C][C][S][=C][C][#S]")
[10]:
'[Li]CCSCC=S'
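The downgrade from [#S] to =S above illustrates the general behavior: a requested bond order is capped by the remaining bonding capacity of the atoms involved. A toy sketch of that capping step (illustrative only, not the full SELFIES derivation algorithm):

```python
def capped_bond(requested_order: int, capacity_a: int, capacity_b: int) -> int:
    # A bond can use no more capacity than either atom has left.
    return min(requested_order, capacity_a, capacity_b)

# With S now constrained to 2 bonds, the requested triple bond of '[#S]'
# is reduced to a double bond, matching the '=S' in the output above.
print(capped_bond(3, 4, 2))  # 2
```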
Finally, to revert back to the default constraints, simply call:
[11]:
sf.set_semantic_constraints()
Please refer to the API reference for more details and more preset constraints.
SELFIES in Practice
Let’s use a simple example to show how selfies can be used in practice, and to highlight some convenient utility functions from the library. We start with a toy dataset of SMILES strings. As before, we can use selfies.encoder to convert the dataset into SELFIES form.
[12]:
smiles_dataset = ["COC", "FCF", "O=O", "O=Cc1ccccc1"]
selfies_dataset = list(map(sf.encoder, smiles_dataset))
selfies_dataset
[12]:
['[C][O][C]',
'[F][C][F]',
'[O][=O]',
'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]']
The function selfies.len_selfies computes the symbol length of a SELFIES string. We can use it to find the maximum symbol length of the SELFIES strings in the dataset.
[13]:
max_len = max(sf.len_selfies(s) for s in selfies_dataset)
max_len
[13]:
10
To extract the SELFIES symbols that form the dataset, use selfies.get_alphabet_from_selfies. Here, we add [nop] to the alphabet, which is a special padding symbol that selfies recognizes.
[14]:
alphabet = sf.get_alphabet_from_selfies(selfies_dataset)
alphabet.add("[nop]")
alphabet = sorted(alphabet)
alphabet
[14]:
['[=Branch1]', '[=C]', '[=O]', '[C]', '[F]', '[O]', '[Ring1]', '[nop]']
Then, create a mapping between the alphabet SELFIES symbols and indices.
[15]:
vocab_stoi = {symbol: idx for idx, symbol in enumerate(alphabet)}
vocab_itos = {idx: symbol for symbol, idx in vocab_stoi.items()}
vocab_stoi
[15]:
{'[=Branch1]': 0,
'[=C]': 1,
'[=O]': 2,
'[C]': 3,
'[F]': 4,
'[O]': 5,
'[Ring1]': 6,
'[nop]': 7}
SELFIES provides convenience functions to convert between SELFIES strings and label (integer) or one-hot encodings. Using the first entry of the dataset (dimethyl ether) as an example:
[16]:
dimethyl_ether = selfies_dataset[0]
label, one_hot = sf.selfies_to_encoding(dimethyl_ether, vocab_stoi, pad_to_len=max_len)
[17]:
label
[17]:
[3, 5, 3, 7, 7, 7, 7, 7, 7, 7]
[18]:
one_hot
[18]:
[[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1]]
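To see how these encodings relate to the alphabet, here is a rough pure-Python sketch of the label and one-hot encodings produced above (illustrative only; sf.selfies_to_encoding handles this internally):

```python
import re

# The symbol-to-index mapping built above.
vocab_stoi = {"[=Branch1]": 0, "[=C]": 1, "[=O]": 2, "[C]": 3,
              "[F]": 4, "[O]": 5, "[Ring1]": 6, "[nop]": 7}

def encode(selfies_str: str, pad_to_len: int):
    symbols = re.findall(r"\[[^\]]*\]", selfies_str)    # tokenize
    symbols += ["[nop]"] * (pad_to_len - len(symbols))  # right-pad
    label = [vocab_stoi[s] for s in symbols]
    one_hot = [[int(i == lbl) for i in range(len(vocab_stoi))] for lbl in label]
    return label, one_hot

label, one_hot = encode("[C][O][C]", 10)
print(label)       # [3, 5, 3, 7, 7, 7, 7, 7, 7, 7]
print(one_hot[0])  # [0, 0, 0, 1, 0, 0, 0, 0]
```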
[21]:
dimethyl_ether = sf.encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot")
dimethyl_ether
[21]:
'[C][O][C][nop][nop][nop][nop][nop][nop][nop]'
[22]:
sf.decoder(dimethyl_ether) # sf.decoder ignores [nop]
[22]:
'COC'
If different encoding strategies are desired, selfies.split_selfies can be used to tokenize a SELFIES string into its individual symbols.
[24]:
list(sf.split_selfies("[C][O][C]"))
[24]:
['[C]', '[O]', '[C]']
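Roughly, the symbols are the bracketed tokens, so a simple regex can illustrate the idea (a stand-in for illustration only; use sf.split_selfies in practice):

```python
import re

def split_selfies_sketch(selfies_str: str) -> list:
    # Each SELFIES symbol is a bracketed token like [C] or [Ring1].
    return re.findall(r"\[[^\]]*\]", selfies_str)

print(split_selfies_sketch("[C][O][C]"))  # ['[C]', '[O]', '[C]']
```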
Please refer to the API reference for more details and utility functions.