Tutorial
The Basics
We begin by importing selfies.
[1]:
import selfies as sf
First, let’s try translating between SMILES and SELFIES; as an example, we will use benzaldehyde. To translate from SMILES to SELFIES, use the selfies.encoder function, and to translate from SELFIES back to SMILES, use the selfies.decoder function.
[2]:
original_smiles = "O=Cc1ccccc1" # benzaldehyde
try:
    encoded_selfies = sf.encoder(original_smiles)  # SMILES -> SELFIES
    decoded_smiles = sf.decoder(encoded_selfies)   # SELFIES -> SMILES
except sf.EncoderError as err:
    pass  # sf.encoder error...
except sf.DecoderError as err:
    pass  # sf.decoder error...
[3]:
encoded_selfies
[3]:
'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]'
[4]:
decoded_smiles
[4]:
'O=CC1=CC=CC=C1'
Note that original_smiles and decoded_smiles are different strings, but they both represent benzaldehyde. Thus, when comparing the two SMILES strings, string equality should not be used. Instead, use RDKit to check whether the SMILES strings represent the same molecule.
[5]:
from rdkit import Chem
Chem.CanonSmiles(original_smiles) == Chem.CanonSmiles(decoded_smiles)
[5]:
True
Customizing SELFIES
The SELFIES grammar is derived dynamically from a set of semantic constraints, which assign bonding capacities to various atoms. Let’s customize the semantic constraints that selfies operates on. By default, the following constraints are used:
[6]:
sf.get_preset_constraints("default")
[6]:
{'H': 1,
'F': 1,
'Cl': 1,
'Br': 1,
'I': 1,
'O': 2,
'O+1': 3,
'O-1': 1,
'N': 3,
'N+1': 4,
'N-1': 2,
'C': 4,
'C+1': 5,
'C-1': 3,
'P': 5,
'P+1': 6,
'P-1': 4,
'S': 6,
'S+1': 7,
'S-1': 5,
'?': 8}
These constraints map atoms (the keys) to their bonding capacities (the values). The special ? key gives the bonding capacity for all atoms that are not explicitly listed in the constraints. For example, S and Li are constrained to a maximum of 6 and 8 bonds, respectively. Every SELFIES string can be decoded into a molecule that obeys the current constraints.
[7]:
sf.decoder("[Li][=C][C][S][=C][C][#S]")
[7]:
'[Li]=CCS=CC#S'
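The ? fallback described above amounts to a simple dictionary lookup. A minimal pure-Python sketch, illustrative only (selfies performs this lookup internally):

```python
# Subset of the default constraint dict shown above.
constraints = {"H": 1, "F": 1, "O": 2, "N": 3, "C": 4, "P": 5, "S": 6, "?": 8}

def bond_capacity(atom: str) -> int:
    # Atoms absent from the dict fall back to the '?' entry.
    return constraints.get(atom, constraints["?"])

print(bond_capacity("S"))   # 6: listed explicitly
print(bond_capacity("Li"))  # 8: not listed, so '?' applies
```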
But suppose that we instead wanted to constrain S and Li to a maximum of 2 and 1 bond(s), respectively. To do so, we create a new set of constraints, and tell selfies to operate on them using selfies.set_semantic_constraints.
[8]:
new_constraints = sf.get_preset_constraints("default")
new_constraints['Li'] = 1
new_constraints['S'] = 2
sf.set_semantic_constraints(new_constraints)
To check that the update was successful, we can use selfies.get_semantic_constraints, which returns the semantic constraints that selfies is currently operating on.
[9]:
sf.get_semantic_constraints()
[9]:
{'H': 1,
'F': 1,
'Cl': 1,
'Br': 1,
'I': 1,
'O': 2,
'O+1': 3,
'O-1': 1,
'N': 3,
'N+1': 4,
'N-1': 2,
'C': 4,
'C+1': 5,
'C-1': 3,
'P': 5,
'P+1': 6,
'P-1': 4,
'S': 2,
'S+1': 7,
'S-1': 5,
'?': 8,
'Li': 1}
Our previous SELFIES string is now decoded as follows. Notice that the new bonding capacities are respected: every S makes at most 2 bonds and every Li at most 1.
[10]:
sf.decoder("[Li][=C][C][S][=C][C][#S]")
[10]:
'[Li]CCSCC=S'
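The downgrade from [#S] to =S above illustrates the general behavior: a requested bond order is capped by the remaining bonding capacity of the atoms involved. A toy sketch of that capping step (illustrative only, not the full SELFIES derivation algorithm):

```python
def capped_bond(requested_order: int, capacity_a: int, capacity_b: int) -> int:
    # A bond can use no more capacity than either atom has left.
    return min(requested_order, capacity_a, capacity_b)

# With S now constrained to 2 bonds, the requested triple bond of '[#S]'
# is reduced to a double bond, matching the '=S' in the output above.
print(capped_bond(3, 4, 2))  # 2
```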
Finally, to revert back to the default constraints, simply call:
[11]:
sf.set_semantic_constraints()
Please refer to the API reference for more details and more preset constraints.
SELFIES in Practice
Let’s use a simple example to show how selfies can be used in practice, and to highlight some convenient utility functions from the library. We start with a toy dataset of SMILES strings. As before, we can use selfies.encoder to convert the dataset into SELFIES form.
[12]:
smiles_dataset = ["COC", "FCF", "O=O", "O=Cc1ccccc1"]
selfies_dataset = list(map(sf.encoder, smiles_dataset))
selfies_dataset
[12]:
['[C][O][C]',
'[F][C][F]',
'[O][=O]',
'[O][=C][C][=C][C][=C][C][=C][Ring1][=Branch1]']
The function selfies.len_selfies computes the symbol length of a SELFIES string. We can use it to find the maximum symbol length of the SELFIES strings in the dataset.
[13]:
max_len = max(sf.len_selfies(s) for s in selfies_dataset)
max_len
[13]:
10
To extract the SELFIES symbols that form the dataset, use selfies.get_alphabet_from_selfies. Here, we add [nop] to the alphabet, which is a special padding symbol that selfies recognizes.
[14]:
alphabet = sf.get_alphabet_from_selfies(selfies_dataset)
alphabet.add("[nop]")
alphabet = sorted(alphabet)
alphabet
[14]:
['[=Branch1]', '[=C]', '[=O]', '[C]', '[F]', '[O]', '[Ring1]', '[nop]']
Then, create a mapping between the alphabet SELFIES symbols and indices.
[15]:
vocab_stoi = {symbol: idx for idx, symbol in enumerate(alphabet)}
vocab_itos = {idx: symbol for symbol, idx in vocab_stoi.items()}
vocab_stoi
[15]:
{'[=Branch1]': 0,
'[=C]': 1,
'[=O]': 2,
'[C]': 3,
'[F]': 4,
'[O]': 5,
'[Ring1]': 6,
'[nop]': 7}
SELFIES provides convenience functions to convert between SELFIES strings and label (integer) or one-hot encodings. Using the first entry of the dataset (dimethyl ether) as an example:
[16]:
dimethyl_ether = selfies_dataset[0]
label, one_hot = sf.selfies_to_encoding(dimethyl_ether, vocab_stoi, pad_to_len=max_len)
[17]:
label
[17]:
[3, 5, 3, 7, 7, 7, 7, 7, 7, 7]
[18]:
one_hot
[18]:
[[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1]]
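To see how these encodings relate to the alphabet, here is a rough pure-Python sketch of the label and one-hot encodings produced above (illustrative only; sf.selfies_to_encoding handles this internally):

```python
import re

# The symbol-to-index mapping built above.
vocab_stoi = {"[=Branch1]": 0, "[=C]": 1, "[=O]": 2, "[C]": 3,
              "[F]": 4, "[O]": 5, "[Ring1]": 6, "[nop]": 7}

def encode(selfies_str: str, pad_to_len: int):
    symbols = re.findall(r"\[[^\]]*\]", selfies_str)    # tokenize
    symbols += ["[nop]"] * (pad_to_len - len(symbols))  # right-pad
    label = [vocab_stoi[s] for s in symbols]
    one_hot = [[int(i == lbl) for i in range(len(vocab_stoi))] for lbl in label]
    return label, one_hot

label, one_hot = encode("[C][O][C]", 10)
print(label)       # [3, 5, 3, 7, 7, 7, 7, 7, 7, 7]
print(one_hot[0])  # [0, 0, 0, 1, 0, 0, 0, 0]
```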
[21]:
dimethyl_ether = sf.encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot")
dimethyl_ether
[21]:
'[C][O][C][nop][nop][nop][nop][nop][nop][nop]'
[22]:
sf.decoder(dimethyl_ether) # sf.decoder ignores [nop]
[22]:
'COC'
If different encoding strategies are desired, selfies.split_selfies can be used to tokenize a SELFIES string into its individual symbols.
[24]:
list(sf.split_selfies("[C][O][C]"))
[24]:
['[C]', '[O]', '[C]']
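Roughly, the symbols are the bracketed tokens, so a simple regex can illustrate the idea (a stand-in for illustration only; use sf.split_selfies in practice):

```python
import re

def split_selfies_sketch(selfies_str: str) -> list:
    # Each SELFIES symbol is a bracketed token like [C] or [Ring1].
    return re.findall(r"\[[^\]]*\]", selfies_str)

print(split_selfies_sketch("[C][O][C]"))  # ['[C]', '[O]', '[C]']
```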
Please refer to the API reference for more details and utility functions.