Code Documentation

Core Functions

selfies.encoder(smiles, print_error=False)

Translates a SMILES into a SELFIES.

The SMILES to SELFIES translation occurs independently of the SELFIES alphabet and grammar. Thus, selfies.encoder() will work regardless of the alphabet and grammar rules that selfies is operating on, assuming the input is a valid SMILES. Additionally, selfies.encoder() preserves the atom and branch order of the input SMILES; thus, one could generate random SELFIES corresponding to the same molecule by generating random SMILES, and then translating them.

However, encoding and then decoding a SMILES does not necessarily yield the original SMILES. Reasons include:

  1. SMILES with aromatic symbols are automatically Kekulized before being translated.

  2. SMILES that violate the bond constraints specified by selfies will be successfully encoded by selfies.encoder(), but then decoded into a new molecule that satisfies the constraints.

  3. The exact ring numbering order is lost in selfies.encoder(), and cannot be reconstructed by selfies.decoder().

Finally, note that selfies.encoder() does not check if the input SMILES is valid, and should not be expected to reject invalid inputs. It is recommended to use RDKit to first verify that the SMILES are valid.

  • smiles (str) – the SMILES to be translated.

  • print_error (bool) – if True, error messages will be printed to console. Defaults to False.

Return type

Optional[str]

the SELFIES translation of smiles. If an error occurs, and smiles cannot be translated, None is returned instead.


>>> import selfies
>>> selfies.encoder('C=CF')
'[C][=C][F]'


Currently, selfies.encoder() does not support the following types of SMILES:

  • SMILES using ring numbering across a dot-bond symbol to specify bonds, e.g. C1.C2.C12 (propane) or c1cc([O-].[Na+])ccc1 (sodium phenoxide).

  • SMILES with ring numbering between atoms that are over 16 ** 3 = 4096 atoms apart.

  • SMILES using the wildcard symbol *.

  • SMILES using chiral specifications other than @ and @@.

selfies.decoder(selfies, print_error=False, constraints=None)

Translates a SELFIES into a SMILES.

The SELFIES to SMILES translation operates based on the selfies grammar rules, which can be configured using selfies.set_semantic_constraints(). Given the appropriate settings, the decoded SMILES will always be syntactically and semantically correct. That is, the output SMILES will satisfy the specified bond constraints. Additionally, selfies.decoder() will attempt to preserve the atom and branch order of the input SELFIES.

  • selfies (str) – the SELFIES to be translated.

  • print_error (bool) – if True, error messages will be printed to console. Defaults to False.

  • constraints (Optional[str]) – if 'octet_rule' or 'hypervalent', the corresponding preset bond constraints will be used instead. If None, selfies.decoder() will use the currently configured bond constraints. Defaults to None.

Return type

Optional[str]

the SMILES translation of selfies. If an error occurs, and selfies cannot be translated, None is returned instead.


>>> import selfies
>>> selfies.decoder('[C][=C][F]')
'C=CF'

See also

The “octet_rule” and “hypervalent” preset bond constraints can be viewed with selfies.get_octet_rule_constraints() and selfies.get_hypervalent_constraints(), respectively. These presets are variants of the “default” bond constraints, which can be viewed with selfies.get_default_constraints(). Their differences can be summarized as follows:

  • default: Cl, Br, I: 1, N: 3, P: 5, P+1: 6, P-1: 4, S: 6, S+1: 7, S-1: 5

  • octet_rule: Cl, Br, I: 1, N: 3, P: 3, P+1: 4, P-1: 2, S: 2, S+1: 3, S-1: 1

  • hypervalent: Cl, Br, I: 7, N: 5, P: 5, P+1: 6, P-1: 4, S: 6, S+1: 7, S-1: 5
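
To see these differences at a glance, the three rows above can be written out as plain dictionaries and compared. The sketch below uses only the values listed in this summary; the names PRESETS and changed_from_default are illustrative and are not part of selfies, and the dictionaries are not the full presets returned by the getter functions.

```python
# Values copied from the summary above. These are NOT the complete preset
# dictionaries; they only cover the atoms and ions listed in the summary.
PRESETS = {
    "default": {"Cl": 1, "Br": 1, "I": 1, "N": 3,
                "P": 5, "P+1": 6, "P-1": 4, "S": 6, "S+1": 7, "S-1": 5},
    "octet_rule": {"Cl": 1, "Br": 1, "I": 1, "N": 3,
                   "P": 3, "P+1": 4, "P-1": 2, "S": 2, "S+1": 3, "S-1": 1},
    "hypervalent": {"Cl": 7, "Br": 7, "I": 7, "N": 5,
                    "P": 5, "P+1": 6, "P-1": 4, "S": 6, "S+1": 7, "S-1": 5},
}


def changed_from_default(preset_name):
    """Map each atom whose capacity differs to (default value, preset value)."""
    default = PRESETS["default"]
    preset = PRESETS[preset_name]
    return {atom: (default[atom], preset[atom])
            for atom in default if default[atom] != preset[atom]}


octet_changes = changed_from_default("octet_rule")
hypervalent_changes = changed_from_default("hypervalent")

print(sorted(octet_changes))   # ['P', 'P+1', 'P-1', 'S', 'S+1', 'S-1']
print(hypervalent_changes)     # {'Cl': (1, 7), 'Br': (1, 7), 'I': (1, 7), 'N': (3, 5)}
```

In this summary, the octet rule preset only tightens P and S (and their ions), while the hypervalent preset only relaxes Cl, Br, I, and N.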

Utility Functions


selfies.len_selfies(selfies)

Computes the symbol length of a SELFIES.

The symbol length is the number of symbols that make up the SELFIES, not the character length of the string (i.e. not len(selfies)).


selfies (str) – a SELFIES.

Return type

int

the symbol length of selfies.


>>> import selfies
>>> selfies.len_selfies('[C][O][C]')
3
>>> selfies.len_selfies('[C][=C][F].[C]')
5

selfies.split_selfies(selfies)

Splits a SELFIES into its symbols.

Returns an iterable that yields the symbols of a SELFIES one-by-one in the order they appear in the string. Each SELFIES symbol is either enclosed in a pair of square brackets or is the '.' dot-bond symbol.


selfies (str) – the SELFIES to be read.

Return type

Iterable[str]

an iterable of the symbols of selfies in the same order they appear in the string.


>>> import selfies
>>> list(selfies.split_selfies('[C][O][C]'))
['[C]', '[O]', '[C]']
>>> list(selfies.split_selfies('[C][=C][F].[C]'))
['[C]', '[=C]', '[F]', '.', '[C]']
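
The splitting rule described above, together with the definition of symbol length, can be sketched in a few lines of plain Python. This is a minimal illustration and not the library's implementation; split_symbols and symbol_length are hypothetical names, and the sketch assumes that symbols never contain a nested ']'.

```python
from typing import Iterator


def split_symbols(selfies: str) -> Iterator[str]:
    """Yield bracketed symbols and '.' dot-bond symbols in string order."""
    i = 0
    while i < len(selfies):
        if selfies[i] == ".":
            yield "."
            i += 1
        elif selfies[i] == "[":
            close = selfies.find("]", i)
            if close == -1:
                raise ValueError(f"unclosed '[' at position {i}")
            yield selfies[i:close + 1]
            i = close + 1
        else:
            raise ValueError(f"unexpected character {selfies[i]!r} at position {i}")


def symbol_length(selfies: str) -> int:
    """Count symbols rather than characters."""
    return sum(1 for _ in split_symbols(selfies))


symbols = list(split_symbols("[C][=C][F].[C]"))
print(symbols)                          # ['[C]', '[=C]', '[F]', '.', '[C]']
print(symbol_length("[C][=C][F].[C]"))  # 5
print(len("[C][=C][F].[C]"))            # 14
```

The last two lines show why the symbol length and the character length differ: the same string has 5 symbols but 14 characters.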

selfies.get_alphabet_from_selfies(selfies_iter)

Constructs an alphabet from an iterable of SELFIES.

From an iterable of SELFIES, constructs the minimum-sized set of SELFIES symbols such that every SELFIES in the iterable can be constructed from symbols from that set. Then, the set is returned. Note that the symbol '.' will not be added as a member of the returned set, even if it appears in the input.


selfies_iter (Iterable[str]) – an iterable of SELFIES.

Return type

Set[str]

the SELFIES alphabet built from the SELFIES in selfies_iter.


>>> import selfies
>>> selfies_list = ['[C][F][O]', '[C].[O]', '[F][F]']
>>> alphabet = selfies.get_alphabet_from_selfies(selfies_list)
>>> sorted(list(alphabet))
['[C]', '[F]', '[O]']
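
The construction described above amounts to taking the union of all symbols seen and leaving out the dot-bond symbol. A minimal sketch in plain Python follows; build_alphabet and SYMBOL_PATTERN are hypothetical names, and the regular expression is an assumption about symbol syntax rather than the library's own tokenizer.

```python
import re
from typing import Iterable, Set

# A symbol is either a bracketed token or the '.' dot-bond symbol.
SYMBOL_PATTERN = re.compile(r"\[[^\]]*\]|\.")


def build_alphabet(selfies_iter: Iterable[str]) -> Set[str]:
    """Collect every distinct bracketed symbol, skipping '.'."""
    alphabet: Set[str] = set()
    for selfies in selfies_iter:
        for symbol in SYMBOL_PATTERN.findall(selfies):
            if symbol != ".":
                alphabet.add(symbol)
    return alphabet


alphabet = build_alphabet(["[C][F][O]", "[C].[O]", "[F][F]"])
print(sorted(alphabet))  # ['[C]', '[F]', '[O]']
```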

selfies.get_semantic_robust_alphabet()

Returns a subset of all symbols that are semantically constrained by selfies.

These semantic constraints can be configured with selfies.set_semantic_constraints().

Return type

Set[str]

a subset of all symbols that are semantically constrained.

selfies.selfies_to_encoding(selfies, vocab_stoi, pad_to_len=-1, enc_type='both')

Converts a SELFIES into its label (integer) and/or one-hot encoding.

A label encoded output will be a list of size (N,) and a one-hot encoded output will be a list of size (N, len(vocab_stoi)); where N is the symbol length of the (potentially padded) SELFIES. Note that SELFIES uses the special padding symbol [nop].

  • selfies (str) – the SELFIES to be encoded.

  • vocab_stoi (Dict[str, int]) – a dictionary that maps SELFIES symbols (the keys) to a non-negative index. The indices of the dictionary must be contiguous, starting from 0.

  • pad_to_len (int) – the length the SELFIES is to be padded to. If pad_to_len is less than or equal to the symbol length of the SELFIES, then no padding is added. Defaults to -1.

  • enc_type (str) – the type of encoding of the output: label or one_hot or both. If the value is both, then a tuple of the label and one-hot encodings is returned (in that order). Defaults to both.

Return type

Union[List[int], List[List[int]], Tuple[List[int], List[List[int]]]]


the label encoded and/or one-hot encoded SELFIES.


>>> import selfies as sf
>>> sf.selfies_to_encoding('[C][F]', {'[C]': 0, '[F]': 1})
([0, 1], [[1, 0], [0, 1]])
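
The relationship between the label encoding, the one-hot encoding, and [nop] padding can be sketched without the library. The function below is a minimal illustration under the assumptions stated in its comments; encode_selfies and SYMBOL_PATTERN are hypothetical names, and the sketch always returns both encodings rather than honoring enc_type.

```python
import re
from typing import Dict, List, Tuple

# Assumed symbol syntax: a bracketed token or the '.' dot-bond symbol.
SYMBOL_PATTERN = re.compile(r"\[[^\]]*\]|\.")


def encode_selfies(selfies: str, vocab_stoi: Dict[str, int],
                   pad_to_len: int = -1) -> Tuple[List[int], List[List[int]]]:
    """Return (label encoding, one-hot encoding) of a SELFIES string."""
    symbols = SYMBOL_PATTERN.findall(selfies)
    # Pad with '[nop]' only when pad_to_len exceeds the symbol length.
    symbols += ["[nop]"] * max(0, pad_to_len - len(symbols))
    labels = [vocab_stoi[symbol] for symbol in symbols]
    width = len(vocab_stoi)
    one_hot = [[1 if col == label else 0 for col in range(width)]
               for label in labels]
    return labels, one_hot


print(encode_selfies("[C][F]", {"[C]": 0, "[F]": 1}))
# ([0, 1], [[1, 0], [0, 1]])
print(encode_selfies("[C]", {"[nop]": 0, "[C]": 1}, pad_to_len=3))
# ([1, 0, 0], [[0, 1], [1, 0], [1, 0]])
```

Padding requires '[nop]' to be a key of vocab_stoi; in this sketch a missing key raises KeyError.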

selfies.encoding_to_selfies(encoded, vocab_itos, enc_type)

Converts a label (integer) or one-hot encoded list into a SELFIES string.

If the input is label encoded, then a list of size (N,) is expected; and if the input is one-hot encoded, then a 2D list of size (N, len(vocab_itos)) is expected.

  • encoded (Union[List[int], List[List[int]]]) – a label or one-hot encoded list.

  • vocab_itos (Dict[int, str]) – a dictionary that maps non-negative indices (the keys) to SELFIES symbols. The indices of the dictionary must be contiguous, starting from 0.

  • enc_type (str) – the type of encoding of the input: label or one_hot.

Return type

str

the SELFIES string represented by the encoded input.


>>> import selfies as sf
>>> one_hot = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
>>> vocab_itos = {0: '[nop]', 1: '[C]', 2: '[F]'}
>>> sf.encoding_to_selfies(one_hot, vocab_itos, enc_type='one_hot')
'[C][F][nop]'
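
The reverse mapping can be sketched the same way: each one-hot row is reduced to the index of its 1 entry, and each index is looked up in vocab_itos. This is an illustrative sketch rather than the library's code; decode_encoding is a hypothetical name, and the sketch assumes every one-hot row contains exactly one 1.

```python
from typing import Dict, List, Union


def decode_encoding(encoded: Union[List[int], List[List[int]]],
                    vocab_itos: Dict[int, str], enc_type: str) -> str:
    """Turn a label or one-hot encoded list back into a SELFIES string."""
    if enc_type == "one_hot":
        # Position of the single 1 in each row gives the label.
        labels = [row.index(1) for row in encoded]
    elif enc_type == "label":
        labels = list(encoded)
    else:
        raise ValueError("enc_type must be 'label' or 'one_hot'")
    return "".join(vocab_itos[label] for label in labels)


vocab_itos = {0: "[nop]", 1: "[C]", 2: "[F]"}
one_hot = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
print(decode_encoding(one_hot, vocab_itos, "one_hot"))  # [C][F][nop]
print(decode_encoding([1, 2, 0], vocab_itos, "label"))  # [C][F][nop]
```

Note that the padding symbol [nop] is kept in the output of this sketch; it is not stripped.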

selfies.batch_selfies_to_flat_hot(selfies_batch, vocab_stoi, pad_to_len=-1)

Converts a list of SELFIES into a list of flattened one-hot encodings.

Returned is a list of size (batch_size, N * len(vocab_stoi)); where N is the symbol length of the (potentially padded) SELFIES. Note that SELFIES uses the special padding symbol [nop].

  • selfies_batch (List[str]) – a list of SELFIES to be converted.

  • vocab_stoi (Dict[str, int]) – a dictionary that maps SELFIES symbols (the keys) to a non-negative index. The indices of the dictionary must be contiguous, starting from 0.

  • pad_to_len (int) – the length that each SELFIES is to be padded to. If pad_to_len is less than or equal to the symbol length of the SELFIES, then no padding is added. Defaults to -1.

Return type

List[List[int]]

the flattened one-hot encoded representations of the SELFIES from the batch. This is a 2D list of size (batch_size, N * len(vocab_stoi)).


>>> import selfies as sf
>>> batch = ["[C]", "[C][C]"]
>>> vocab_stoi = {'[nop]': 0, '[C]': 1}
>>> sf.batch_selfies_to_flat_hot(batch, vocab_stoi, 2)
[[0, 1, 1, 0], [0, 1, 0, 1]]
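
Flattening means concatenating the one-hot rows of each SELFIES into a single list of length N * len(vocab_stoi). A minimal sketch follows, assuming the same bracket-or-dot symbol syntax as elsewhere in this document; batch_to_flat_hot and SYMBOL_PATTERN are hypothetical names.

```python
import re
from typing import Dict, List

# Assumed symbol syntax: a bracketed token or the '.' dot-bond symbol.
SYMBOL_PATTERN = re.compile(r"\[[^\]]*\]|\.")


def batch_to_flat_hot(selfies_batch: List[str], vocab_stoi: Dict[str, int],
                      pad_to_len: int = -1) -> List[List[int]]:
    """One flattened one-hot vector per SELFIES in the batch."""
    width = len(vocab_stoi)
    flat_batch = []
    for selfies in selfies_batch:
        symbols = SYMBOL_PATTERN.findall(selfies)
        symbols += ["[nop]"] * max(0, pad_to_len - len(symbols))
        flat: List[int] = []
        for symbol in symbols:
            row = [0] * width
            row[vocab_stoi[symbol]] = 1
            flat.extend(row)  # append this symbol's one-hot row
        flat_batch.append(flat)
    return flat_batch


flat = batch_to_flat_hot(["[C]", "[C][C]"], {"[nop]": 0, "[C]": 1}, 2)
print(flat)  # [[0, 1, 1, 0], [0, 1, 0, 1]]
```

In this sketch, the rows of the result have equal length only if pad_to_len is at least the longest symbol length in the batch.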

selfies.batch_flat_hot_to_selfies(one_hot_batch, vocab_itos)

Converts a batch of flattened one-hot encodings into a list of SELFIES.

We expect one_hot_batch to be a list of size (batch_size, S), where S is divisible by the length of the vocabulary.

  • one_hot_batch (List[List[int]]) – a list of flattened one-hot encoded representations.

  • vocab_itos (Dict[int, str]) – a dictionary that maps non-negative indices (the keys) to SELFIES symbols. We expect the indices of the dictionary to be contiguous and starting from 0.

Return type

List[str]

a list of SELFIES strings.


>>> import selfies as sf
>>> batch = [[0, 1, 1, 0], [0, 1, 0, 1]]
>>> vocab_itos = {0: '[nop]', 1: '[C]'}
>>> sf.batch_flat_hot_to_selfies(batch, vocab_itos)
['[C][nop]', '[C][C]']
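
The divisibility requirement on S follows from how a flattened vector is read back: it is cut into chunks of len(vocab_itos) entries, one chunk per symbol. The sketch below illustrates this; flat_hot_to_selfies is a hypothetical name, and the sketch assumes each chunk contains exactly one 1.

```python
from typing import Dict, List


def flat_hot_to_selfies(one_hot_batch: List[List[int]],
                        vocab_itos: Dict[int, str]) -> List[str]:
    """Cut each flat vector into vocabulary-sized rows and decode them."""
    width = len(vocab_itos)
    decoded = []
    for flat in one_hot_batch:
        if len(flat) % width != 0:
            raise ValueError("vector length must be divisible by vocabulary size")
        rows = [flat[i:i + width] for i in range(0, len(flat), width)]
        decoded.append("".join(vocab_itos[row.index(1)] for row in rows))
    return decoded


result = flat_hot_to_selfies([[0, 1, 1, 0], [0, 1, 0, 1]], {0: "[nop]", 1: "[C]"})
print(result)  # ['[C][nop]', '[C][C]']
```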

Advanced Functions

By default, selfies operates under the following semantic constraints:

Atom               Max Bonds
F, Cl, Br, I       1
O                  2
N                  3
C                  4
S                  6
P                  5
All other atoms    8

The +1 and -1 charged versions of O, N, C, S, P are also constrained, where a +1 increases the maximum bond capacity of the neutral atom by 1, and a -1 decreases the maximum bond capacity of the neutral atom by 1. For example, N+1 has a bond capacity of \(3 + 1 = 4\), and N-1 has a bond capacity of \(3 - 1 = 2\).
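
The charge rule above is simple arithmetic on the neutral capacity. The sketch below applies it to N, P, and S, using the neutral capacities quoted elsewhere in this document; NEUTRAL_CAPACITY, constraint_key, and charged_capacity are hypothetical names and not part of selfies.

```python
# Neutral capacities as stated in this document for the default constraints.
NEUTRAL_CAPACITY = {"N": 3, "P": 5, "S": 6}


def constraint_key(element: str, charge: int) -> str:
    """Format an atom or ion the way the constraint tables do, e.g. 'N+1'."""
    return element if charge == 0 else f"{element}{charge:+d}"


def charged_capacity(element: str, charge: int) -> int:
    """A +1 charge adds one bond to the neutral capacity, a -1 removes one."""
    if charge not in (-1, 0, 1):
        raise ValueError("only charges of -1, 0 and +1 are covered by this rule")
    return NEUTRAL_CAPACITY[element] + charge


table = {constraint_key(element, charge): charged_capacity(element, charge)
         for element in NEUTRAL_CAPACITY for charge in (0, 1, -1)}
print(table["N+1"], table["N-1"])  # 4 2
print(table["S+1"], table["P-1"])  # 7 4
```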

However, the default constraints cannot accommodate SMILES that violate them. For example, nitrobenzene O=N(=O)C1=CC=CC=C1 has a nitrogen with 5 bonds (two double bonds and one single bond), and the chlorate anion O=Cl(=O)[O-] has a chlorine with 5 bonds; neither SMILES can be represented as SELFIES under the default constraints. Users may also want to specify their own custom constraints. selfies therefore provides the following functions for configuring its semantic constraints.


SELFIES may be translated differently under different semantic constraints. Therefore, if custom semantic constraints are used, it is recommended to report them for reproducibility reasons.


selfies.get_default_constraints()

Returns the preset “default” bond constraint settings.

Return type

Dict[str, int]


the default constraint settings.


selfies.get_octet_rule_constraints()

Returns the preset “octet rule” bond constraint settings. These constraints are a stricter version of the default constraints, so that the octet rule is obeyed. In particular, S and P are restricted to a bond capacity of 2 and 3, respectively (with S+, S-, P+, and P- adjusted accordingly).

Return type

Dict[str, int]


the octet rule constraint settings.


selfies.get_hypervalent_constraints()

Returns the preset “hypervalent” bond constraint settings. These constraints are a relaxed version of the default constraints, to allow for hypervalent molecules. In particular, Cl, Br, and I are relaxed to a bond capacity of 7, and N is relaxed to a bond capacity of 5.

Return type

Dict[str, int]


the hypervalent constraint settings.


selfies.get_semantic_constraints()

Returns the semantic bond constraints that selfies is currently operating on.

The returned dictionary is the argument of the most recent call to selfies.set_semantic_constraints(), or the default bond constraints if that function has not been called yet. The dictionary is copied before it is returned. See selfies.set_semantic_constraints() for further explanation.

Return type

Dict[str, int]


the bond constraints selfies is currently operating on.


selfies.set_semantic_constraints(bond_constraints=None)

Configures the semantic constraints of selfies.

The SELFIES grammar is enforced dynamically from a dictionary bond_constraints. The keys of the dictionary are atoms and/or ions (e.g. I, Fe+2). To denote an ion, use the format E+C or E-C, where E is an element and C is a positive integer. The corresponding value is the maximum number of bonds that atom or ion can make, between 1 and 8 inclusive. For example, one may have:

  • bond_constraints['I'] = 1

  • bond_constraints['C'] = 4

selfies.decoder() will only generate SMILES that respect the bond constraints specified by the dictionary. In the example above, '[C][=I]' and '[I][=C]' will be translated to 'CI' and 'IC', respectively, because I has been configured to make at most one bond.

If an atom or ion is not specified in bond_constraints, it will by default be constrained to 8 bonds. To change the default setting for unrecognized atoms or ions, set bond_constraints['?'] to the desired integer (between 1 and 8 inclusive).
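
The rules for bond_constraints described above (key format, the 1 to 8 range, the '?' fallback, and the demotion of a bond that exceeds an atom's capacity) can be sketched as follows. This is a simplified illustration, not the decoder's actual algorithm: it looks at a single bond in isolation and ignores bonds an atom has already made. All function names are hypothetical, and the key pattern is an assumption about element symbols.

```python
import re
from typing import Dict

# Assumed key format: an element symbol, optionally followed by +C or -C,
# or the special '?' key for unrecognized atoms and ions.
KEY_PATTERN = re.compile(r"^(?:\?|[A-Z][a-z]?(?:[+-][1-9][0-9]*)?)$")


def validate_constraints(bond_constraints: Dict[str, int]) -> None:
    """Check key format and that every capacity lies between 1 and 8."""
    for key, capacity in bond_constraints.items():
        if not KEY_PATTERN.match(key):
            raise ValueError(f"malformed atom or ion key: {key!r}")
        if not 1 <= capacity <= 8:
            raise ValueError(f"capacity for {key!r} must be between 1 and 8")


def max_bonds(atom: str, bond_constraints: Dict[str, int]) -> int:
    """Look up an atom, falling back to '?' and then to 8."""
    return bond_constraints.get(atom, bond_constraints.get("?", 8))


def allowed_bond_order(requested: int, atom_a: str, atom_b: str,
                       bond_constraints: Dict[str, int]) -> int:
    """Demote a bond so that neither atom exceeds its capacity."""
    return min(requested,
               max_bonds(atom_a, bond_constraints),
               max_bonds(atom_b, bond_constraints))


constraints = {"I": 1, "C": 4}
validate_constraints(constraints)
print(allowed_bond_order(2, "C", "I", constraints))  # 1
print(max_bonds("Fe+2", constraints))                # 8
print(max_bonds("Fe+2", {"I": 1, "?": 6}))           # 6
```

The first printed value mirrors the '[C][=I]' example: the requested double bond is demoted to a single bond because I can make at most one bond.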


bond_constraints (Optional[Dict[str, int]]) – a dictionary representing the semantic constraints the updated SELFIES will operate upon. Defaults to None; in this case, a default dictionary will be used.

Return type


