API Reference
Core Functions
- selfies.encoder(smiles, strict=True)
Translates a SMILES string into its corresponding SELFIES string.
This translation is deterministic and does not depend on the current semantic constraints. Additionally, it preserves the atom order of the input SMILES string; thus, one could generate randomized SELFIES strings by generating randomized SMILES strings, and then translating them.
By nature of SELFIES, it is impossible to represent molecules that violate the current semantic constraints as SELFIES strings. Thus, we provide the strict flag to guard against such cases. If strict=True, then this function will raise a selfies.EncoderError if the input SMILES string represents a molecule that violates the semantic constraints. If strict=False, then this function will not raise any error; however, calling selfies.decoder() on a SELFIES string generated this way will not be guaranteed to recover a SMILES string representing the original molecule.
- Parameters
smiles (str) – the SMILES string to be translated. It is recommended to use RDKit to check that the strings passed into this function are valid SMILES strings.
strict (bool) – if True, this function will check that the input SMILES string obeys the semantic constraints. Defaults to True.
- Return type
str
- Returns
a SELFIES string translated from the input SMILES string.
- Raises
EncoderError – if the input SMILES string is invalid, cannot be kekulized, or violates the semantic constraints with strict=True.
- Example
>>> import selfies as sf
>>> sf.encoder("C=CF")
'[C][=C][F]'
Note
This function does not currently support SMILES with:
- The wildcard symbol *.
- The quadruple bond symbol $.
- Chirality specifications other than @ and @@.
- Ring bonds across a dot symbol (e.g. c1cc([O-].[Na+])ccc1) or ring bonds between atoms that are over 4000 atoms apart.
Although SELFIES does not have aromatic symbols, this function does support aromatic SMILES strings by internally kekulizing them before translation.
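When translating many SMILES strings with strict=True, it can be convenient to collect the rejected inputs instead of stopping at the first error. The following is a minimal sketch of that pattern, not part of the library: encode_all is a hypothetical helper, and the encoder and its error type are passed in as parameters so the block runs without selfies installed. With the library available, one would presumably pass selfies.encoder and selfies.EncoderError.

```python
from typing import Callable, Iterable, List, Tuple, Type


def encode_all(
    smiles_iter: Iterable[str],
    encoder: Callable[[str], str],
    error_type: Type[Exception],
) -> Tuple[List[str], List[str]]:
    """Encode each SMILES string, collecting the ones the encoder rejects."""
    encoded, rejected = [], []
    for smiles in smiles_iter:
        try:
            encoded.append(encoder(smiles))
        except error_type:
            rejected.append(smiles)
    return encoded, rejected


def _toy_encoder(smiles: str) -> str:
    """Stand-in encoder used only to exercise the helper (NOT the real one)."""
    if "*" in smiles:  # mimic the unsupported wildcard symbol
        raise ValueError(smiles)
    return "".join(f"[{ch}]" for ch in smiles)


encoded, rejected = encode_all(["CC", "C*"], _toy_encoder, ValueError)
```

Keeping the rejected inputs makes it easy to decide afterwards whether to relax the semantic constraints or to drop those molecules.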
- selfies.decoder(selfies, compatible=False)
Translates a SELFIES string into its corresponding SMILES string.
This translation is deterministic but depends on the current semantic constraints. The output SMILES string is guaranteed to be syntactically correct and guaranteed to represent a molecule that obeys the semantic constraints.
- Parameters
selfies (str) – the SELFIES string to be translated.
compatible (bool) – if True, this function will accept SELFIES strings containing deprecated symbols from previous releases. However, this function may behave differently than in previous major releases, and should not be treated as backward compatible. Defaults to False.
- Return type
str
- Returns
a SMILES string derived from the input SELFIES string.
- Raises
DecoderError – if the input SELFIES string is malformed.
- Example
>>> import selfies as sf
>>> sf.decoder('[C][=C][F]')
'C=CF'
Customization Functions
The SELFIES grammar is derived dynamically from a set of semantic constraints,
which assign bonding capacities to various atoms.
By default, selfies operates under the following constraints:
Max Bonds | Atom(s) |
---|---|
1 | F, Cl, Br, I |
2 | O |
3 | B, N |
4 | C |
5 | P |
6 | S |
8 | All other atoms |
The +1 and -1 charged versions of O, N, C, S, and P are also constrained, where a +1 increases the bonding capacity of the neutral atom by 1, and a -1 decreases the bonding capacity of the neutral atom by 1. For example, N+1 has a bonding capacity of \(3 + 1 = 4\), and N-1 has a bonding capacity of \(3 - 1 = 2\). The charged versions B+1 and B-1 are constrained to a capacity of 2 and 4 bonds, respectively.
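The charge rule above can be written out as a small lookup. This is only an illustrative sketch built from the default table and the rule as stated here; default_capacity is a hypothetical helper, not a function of the library. Treating any other charged species as an "all other atoms" case is an assumption.

```python
# Neutral-atom capacities copied from the default constraints table.
NEUTRAL_CAPACITY = {
    "F": 1, "Cl": 1, "Br": 1, "I": 1,
    "O": 2, "B": 3, "N": 3, "C": 4, "P": 5, "S": 6,
}
FALLBACK_CAPACITY = 8                      # "All other atoms"
CHARGE_ADJUSTED = {"O", "N", "C", "S", "P"}  # +1 adds a bond, -1 removes one
BORON_CHARGED = {1: 2, -1: 4}              # B+1 and B-1 are listed explicitly


def default_capacity(element: str, charge: int = 0) -> int:
    """Bonding capacity of an atom under the default constraints (sketch)."""
    if charge == 0:
        return NEUTRAL_CAPACITY.get(element, FALLBACK_CAPACITY)
    if element == "B" and charge in BORON_CHARGED:
        return BORON_CHARGED[charge]
    if element in CHARGE_ADJUSTED and charge in (1, -1):
        return NEUTRAL_CAPACITY[element] + charge
    # Assumption: charged species not mentioned in the text use the fallback.
    return FALLBACK_CAPACITY


n_plus = default_capacity("N", 1)    # 3 + 1
n_minus = default_capacity("N", -1)  # 3 - 1
```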
However, the default constraints cannot accommodate every molecule. For example, nitrobenzene O=N(=O)C1=CC=CC=C1 has a nitrogen with 5 bonds (two double bonds and one single bond) and the chlorate anion O=Cl(=O)[O-] has a chlorine with 5 bonds; these SMILES strings cannot be represented by SELFIES strings under the default constraints. Additionally, users may want to specify their own custom constraints. Thus, we provide the following methods for configuring the semantic constraints of selfies.
Warning
SELFIES strings may be translated differently under different semantic constraints. Therefore, if custom semantic constraints are used, it is recommended to report them for reproducibility reasons.
- selfies.get_preset_constraints(name)
Returns the preset semantic constraints with the given name.
Besides the aforementioned default constraints, selfies offers other preset constraints for convenience; namely, constraints that enforce the octet rule and constraints that accommodate hypervalent molecules.
The differences between these constraints can be summarized as follows:
Preset | Cl, Br, I | N | P | P+1 | P-1 | S | S+1 | S-1 |
---|---|---|---|---|---|---|---|---|
default | 1 | 3 | 5 | 6 | 4 | 6 | 7 | 5 |
octet_rule | 1 | 3 | 3 | 4 | 2 | 2 | 3 | 1 |
hypervalent | 7 | 5 | 5 | 6 | 4 | 6 | 7 | 5 |
- Parameters
name (str) – the preset name: default or octet_rule or hypervalent.
- Return type
Dict[str, int]
- Returns
the preset constraints with the specified name, represented as a dictionary which maps atoms (the keys) to their bonding capacities (the values).
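To see which preset can accommodate a particular atom, the table values can be compared directly. The sketch below hard-codes only the entries where the presets differ, as listed in the summary table; PRESET_TABLE and presets_allowing are hypothetical names used for illustration, and the dictionaries returned by the library contain more entries than these.

```python
# Capacities copied from the preset summary table (differing atoms only).
PRESET_TABLE = {
    "default": {"Cl": 1, "Br": 1, "I": 1, "N": 3, "P": 5,
                "P+1": 6, "P-1": 4, "S": 6, "S+1": 7, "S-1": 5},
    "octet_rule": {"Cl": 1, "Br": 1, "I": 1, "N": 3, "P": 3,
                   "P+1": 4, "P-1": 2, "S": 2, "S+1": 3, "S-1": 1},
    "hypervalent": {"Cl": 7, "Br": 7, "I": 7, "N": 5, "P": 5,
                    "P+1": 6, "P-1": 4, "S": 6, "S+1": 7, "S-1": 5},
}


def presets_allowing(atom: str, bonds: int) -> list:
    """Names of the presets whose capacity for `atom` is at least `bonds`."""
    return [name for name, caps in PRESET_TABLE.items() if bonds <= caps[atom]]


# A nitrogen or chlorine that needs 5 bonds only fits the hypervalent preset.
nitrogen_presets = presets_allowing("N", 5)
chlorine_presets = presets_allowing("Cl", 5)
```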
- selfies.get_semantic_constraints()
Returns the semantic constraints that selfies is currently operating on.
- Return type
Dict[str, int]
- Returns
the current semantic constraints, represented as a dictionary which maps atoms (the keys) to their bonding capacities (the values).
- selfies.set_semantic_constraints(bond_constraints='default')
Updates the semantic constraints that selfies operates on.
If the input is a string, the new constraints are taken to be the preset named bond_constraints (see selfies.get_preset_constraints()).
Otherwise, the input is a dictionary representing the new constraints. This dictionary maps atoms (the keys) to non-negative bonding capacities (the values); the atoms are specified by strings of the form E or E+C or E-C, where E is an element symbol and C is a positive integer. For example, one may have:
bond_constraints["I-1"] = 0
bond_constraints["C"] = 4
This dictionary must also contain the special ? key, which indicates the bond capacities of all atoms that are not explicitly listed in the dictionary.
- Parameters
bond_constraints (Union[str, Dict[str, int]]) – the name of a preset, or a dictionary representing the new semantic constraints.
- Return type
None
- Returns
None.
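Before handing a custom dictionary to this function, it may help to check it against the rules stated above. The following is a rough sketch of such a check, not the validation the library itself performs: find_problems is a hypothetical helper, and the regular expression for element symbols (one uppercase letter, optionally followed by one lowercase letter) is an assumption.

```python
import re
from typing import Dict, List

# Assumed key shape: E, E+C or E-C, with C a positive integer.
_ATOM_KEY = re.compile(r"^[A-Z][a-z]?([+-][1-9][0-9]*)?$")


def find_problems(bond_constraints: Dict[str, int]) -> List[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if "?" not in bond_constraints:
        problems.append("missing the special '?' key")
    for key, capacity in bond_constraints.items():
        if key != "?" and not _ATOM_KEY.match(key):
            problems.append(f"unrecognized atom key: {key!r}")
        if not isinstance(capacity, int) or capacity < 0:
            problems.append(f"capacity for {key!r} must be a non-negative int")
    return problems


ok = find_problems({"I-1": 0, "C": 4, "?": 8})
missing_default = find_problems({"C": 4})
```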
Utility Functions
- selfies.len_selfies(selfies)
Returns the number of symbols in a given SELFIES string.
- Parameters
selfies (str) – a SELFIES string.
- Return type
int
- Returns
the symbol length of the SELFIES string.
- Example
>>> import selfies as sf
>>> sf.len_selfies("[C][=C][F].[C]")
5
- selfies.split_selfies(selfies)
Tokenizes a SELFIES string into its individual symbols.
- Parameters
selfies (str) – a SELFIES string.
- Return type
Iterator[str]
- Returns
the symbols of the SELFIES string one-by-one with order preserved.
- Example
>>> import selfies as sf
>>> list(sf.split_selfies("[C][=C][F].[C]"))
['[C]', '[=C]', '[F]', '.', '[C]']
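For intuition, symbol splitting and symbol counting can be approximated with a regular expression that matches either a bracketed symbol or a dot. This is a simplified sketch and not the library's implementation; split_symbols and count_symbols are hypothetical names, and unlike a real tokenizer this version silently skips any text that is not a bracketed symbol or a dot.

```python
import re
from typing import Iterator

# A symbol is either "[...]" (no nested brackets) or the dot separator.
_SYMBOL = re.compile(r"\[[^\[\]]*\]|\.")


def split_symbols(selfies: str) -> Iterator[str]:
    """Yield the symbols of a SELFIES-like string in order."""
    for match in _SYMBOL.finditer(selfies):
        yield match.group()


def count_symbols(selfies: str) -> int:
    """Count symbols without building an intermediate list."""
    return sum(1 for _ in split_symbols(selfies))


symbols = list(split_symbols("[C][=C][F].[C]"))
length = count_symbols("[C][=C][F].[C]")
```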
- selfies.get_alphabet_from_selfies(selfies_iter)
Constructs an alphabet from an iterable of SELFIES strings.
The returned alphabet is the set of all symbols that appear in the SELFIES strings from the input iterable, minus the dot . symbol.
- Parameters
selfies_iter (Iterable[str]) – an iterable of SELFIES strings.
- Return type
Set[str]
- Returns
an alphabet of SELFIES symbols, built from the input iterable.
- Example
>>> import selfies as sf
>>> selfies_list = ["[C][F][O]", "[C].[O]", "[F][F]"]
>>> alphabet = sf.get_alphabet_from_selfies(selfies_list)
>>> sorted(list(alphabet))
['[C]', '[F]', '[O]']
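An alphabet is typically turned into a symbol-to-index dictionary before encoding. The sketch below illustrates one way to do that in plain Python; build_alphabet and build_vocab are hypothetical helpers rather than library functions, and placing the padding symbol [nop] at index 0 is a convention chosen here, not a requirement stated by the library.

```python
import re
from typing import Dict, Iterable, Set


def build_alphabet(selfies_iter: Iterable[str]) -> Set[str]:
    """Collect every bracketed symbol; the dot is excluded by the pattern."""
    alphabet: Set[str] = set()
    for selfies in selfies_iter:
        alphabet.update(re.findall(r"\[[^\[\]]*\]", selfies))
    return alphabet


def build_vocab(alphabet: Set[str], pad_symbol: str = "[nop]") -> Dict[str, int]:
    """Map symbols to contiguous indices starting at 0, padding symbol first."""
    symbols = [pad_symbol] + sorted(alphabet - {pad_symbol})
    return {symbol: index for index, symbol in enumerate(symbols)}


alphabet = build_alphabet(["[C][F][O]", "[C].[O]", "[F][F]"])
vocab_stoi = build_vocab(alphabet)
```

Sorting the symbols keeps the index assignment reproducible across runs, since sets have no guaranteed iteration order.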
- selfies.get_semantic_robust_alphabet()
Returns a subset of all SELFIES symbols that are constrained by selfies under the current semantic constraints.
- Return type
Set[str]
- Returns
a subset of all SELFIES symbols that are semantically constrained.
- selfies.selfies_to_encoding(selfies, vocab_stoi, pad_to_len=-1, enc_type='both')
Converts a SELFIES string into its label (integer) and/or one-hot encoding.
A label encoded output will be a list of shape (L,) and a one-hot encoded output will be a 2D list of shape (L, len(vocab_stoi)), where L is the symbol length of the SELFIES string. Optionally, the SELFIES string can be padded before it is encoded.
- Parameters
selfies (str) – the SELFIES string to be encoded.
vocab_stoi (Dict[str, int]) – a dictionary that maps SELFIES symbols to indices, which must be non-negative and contiguous, starting from 0. If the SELFIES string is to be padded, then the special padding symbol [nop] must also be a key in this dictionary.
pad_to_len (int) – the length that the SELFIES string is padded to. If this value is less than or equal to the symbol length of the SELFIES string, then no padding is added. Defaults to -1.
enc_type (str) – the type of encoding of the output: label or one_hot or both. If this value is both, then a tuple of the label and one-hot encodings is returned. Defaults to both.
- Return type
Union[List[int], List[List[int]], Tuple[List[int], List[List[int]]]]
- Returns
the label encoded and/or one-hot encoded SELFIES string.
- Example
>>> import selfies as sf
>>> sf.selfies_to_encoding("[C][F]", {"[C]": 0, "[F]": 1})
([0, 1], [[1, 0], [0, 1]])
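The relationship between padding, label encoding and one-hot encoding can be sketched in a few lines of plain Python. This is an illustration of the behaviour described above under simplifying assumptions, not the library's code; encode is a hypothetical name, and it always returns both encodings.

```python
import re
from typing import Dict, List, Tuple


def encode(
    selfies: str, vocab_stoi: Dict[str, int], pad_to_len: int = -1
) -> Tuple[List[int], List[List[int]]]:
    """Return (label encoding, one-hot encoding) of a SELFIES-like string."""
    symbols = re.findall(r"\[[^\[\]]*\]|\.", selfies)
    # Pad with [nop] only if the requested length exceeds the symbol length.
    symbols += ["[nop]"] * max(0, pad_to_len - len(symbols))
    labels = [vocab_stoi[symbol] for symbol in symbols]
    one_hot = [
        [1 if index == label else 0 for index in range(len(vocab_stoi))]
        for label in labels
    ]
    return labels, one_hot


unpadded = encode("[C][F]", {"[C]": 0, "[F]": 1})
padded = encode("[C]", {"[nop]": 0, "[C]": 1}, pad_to_len=3)
```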
- selfies.encoding_to_selfies(encoding, vocab_itos, enc_type)
Converts a label (integer) or one-hot encoding into a SELFIES string.
If the input is label encoded, then a list of shape (L,) is expected; and if the input is one-hot encoded, then a 2D list of shape (L, len(vocab_itos)) is expected.
- Parameters
encoding (Union[List[int], List[List[int]]]) – a label or one-hot encoding.
vocab_itos (Dict[int, str]) – a dictionary that maps indices to SELFIES symbols. The indices of this dictionary must be non-negative and contiguous, starting from 0.
enc_type (str) – the type of encoding of the input: label or one_hot.
- Return type
str
- Returns
the SELFIES string represented by the input encoding.
- Example
>>> import selfies as sf
>>> one_hot = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
>>> vocab_itos = {0: "[nop]", 1: "[C]", 2: "[F]"}
>>> sf.encoding_to_selfies(one_hot, vocab_itos, enc_type="one_hot")
'[C][F][nop]'
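Decoding is the reverse lookup: each one-hot row is reduced to the index of its 1 entry, and each index is mapped back to a symbol. The sketch below shows this in plain Python; decode_labels and decode_one_hot are hypothetical names, and the code assumes every one-hot row contains exactly one 1.

```python
from typing import Dict, List


def decode_labels(labels: List[int], vocab_itos: Dict[int, str]) -> str:
    """Concatenate the symbols that the label indices refer to."""
    return "".join(vocab_itos[label] for label in labels)


def decode_one_hot(one_hot: List[List[int]], vocab_itos: Dict[int, str]) -> str:
    """Reduce each row to the position of its 1, then decode the labels."""
    return decode_labels([row.index(1) for row in one_hot], vocab_itos)


vocab_itos = {0: "[nop]", 1: "[C]", 2: "[F]"}
decoded = decode_one_hot([[0, 1, 0], [0, 0, 1], [1, 0, 0]], vocab_itos)
stripped = decoded.replace("[nop]", "")  # drop padding symbols if unwanted
```

As in the example above, padding symbols survive decoding, so removing them is a separate step.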
- selfies.batch_selfies_to_flat_hot(selfies_batch, vocab_stoi, pad_to_len=-1)
Converts a list of SELFIES strings into a list of flattened one-hot encodings.
Each SELFIES string in the input list is one-hot encoded (and then flattened) using selfies.selfies_to_encoding(), with vocab_stoi and pad_to_len being passed in as arguments.
- Parameters
selfies_batch (List[str]) – the list of SELFIES strings to be encoded.
vocab_stoi (Dict[str, int]) – a dictionary that maps SELFIES symbols to indices.
pad_to_len (int) – the length that each SELFIES string in the input list is padded to. Defaults to -1.
- Return type
List[List[int]]
- Returns
the flattened one-hot encodings of the input list.
- Example
>>> import selfies as sf
>>> batch = ["[C]", "[C][C]"]
>>> vocab_stoi = {"[nop]": 0, "[C]": 1}
>>> sf.batch_selfies_to_flat_hot(batch, vocab_stoi, 2)
[[0, 1, 1, 0], [0, 1, 0, 1]]
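Flattening here means concatenating the rows of the 2D one-hot encoding into a single list. As a minimal illustration (flatten is a hypothetical helper, not a library function):

```python
from typing import List


def flatten(one_hot: List[List[int]]) -> List[int]:
    """Concatenate the rows of a 2D one-hot encoding, row by row."""
    return [bit for row in one_hot for bit in row]


# "[C]" padded to length 2 with {"[nop]": 0, "[C]": 1} is [C], [nop].
flat_padded = flatten([[0, 1], [1, 0]])
# "[C][C]" needs no padding.
flat_full = flatten([[0, 1], [0, 1]])
```

Padding every string to the same length gives all flattened encodings the same size, which is what makes them usable as fixed-width input vectors.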
- selfies.batch_flat_hot_to_selfies(one_hot_batch, vocab_itos)
Converts a list of flattened one-hot encodings into a list of SELFIES strings.
Each encoding in the input list is unflattened and then decoded using selfies.encoding_to_selfies(), with vocab_itos being passed in as an argument.
- Parameters
one_hot_batch (List[List[int]]) – a list of flattened one-hot encodings. Each encoding must be a list of length divisible by len(vocab_itos).
vocab_itos (Dict[int, str]) – a dictionary that maps indices to SELFIES symbols.
- Return type
List[str]
- Returns
the list of SELFIES strings represented by the input encodings.
- Example
>>> import selfies as sf
>>> batch = [[0, 1, 1, 0], [0, 1, 0, 1]]
>>> vocab_itos = {0: "[nop]", 1: "[C]"}
>>> sf.batch_flat_hot_to_selfies(batch, vocab_itos)
['[C][nop]', '[C][C]']
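The divisibility requirement follows from how unflattening works: the flat list is cut into consecutive chunks of len(vocab_itos) entries, one chunk per symbol. A rough sketch of that step, using hypothetical helper names and assuming each chunk contains exactly one 1:

```python
from typing import Dict, List


def unflatten(flat: List[int], vocab_size: int) -> List[List[int]]:
    """Cut a flat one-hot list into rows of `vocab_size` entries."""
    if len(flat) % vocab_size != 0:
        raise ValueError("encoding length is not divisible by the vocabulary size")
    return [flat[i:i + vocab_size] for i in range(0, len(flat), vocab_size)]


def flat_hot_to_string(flat: List[int], vocab_itos: Dict[int, str]) -> str:
    """Unflatten, then map the position of each row's 1 back to its symbol."""
    rows = unflatten(flat, len(vocab_itos))
    return "".join(vocab_itos[row.index(1)] for row in rows)


vocab_itos = {0: "[nop]", 1: "[C]"}
decoded_batch = [
    flat_hot_to_string(flat, vocab_itos)
    for flat in [[0, 1, 1, 0], [0, 1, 0, 1]]
]
```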
Exceptions
- exception selfies.EncoderError
Exception raised by selfies.encoder().
- exception selfies.DecoderError
Exception raised by selfies.decoder().