bblean.bitbirch#

BitBirch ‘Lean’ class for fast, memory-efficient O(N) clustering

Classes

BitBirch

Implements the BitBIRCH clustering algorithm, 'Lean' version

class bblean.bitbirch.BitBirch(*, threshold=0.65, branching_factor=50, merge_criterion=None, tolerance=None)[source]#

Implements the BitBIRCH clustering algorithm, ‘Lean’ version

Memory and time efficient, online-learning algorithm. It constructs a tree data structure with the cluster centroids being read off the leaf.

If you find this software useful please cite the following articles:

Parameters:
  • threshold (float = 0.65) – The similarity radius of the subcluster obtained by merging a new sample and the closest subcluster should be greater than the threshold. Otherwise a new subcluster is started. Setting this value to be very low promotes splitting and vice-versa.

  • branching_factor (int = 50) – Maximum number of ‘BitFeatures’ subclusters in each node. If a new sample enters such that the number of subclusters exceed the branching_factor then that node is split into two nodes with the subclusters redistributed in each. The parent subcluster of that node is removed and two new subclusters are added as parents of the 2 split nodes.

  • merge_criterion (str) – radius, diameter or tolerance. radius: merge subcluster based on comparison to centroid of the cluster. diameter: merge subcluster based on instant Tanimoto similarity of cluster. tolerance: applies tolerance threshold to diameter merge criteria, which will merge subcluster with stricter threshold for newly added molecules.

  • tolerance (float) – Penalty value for similarity threshold of the ‘tolerance’ merge criteria.

Notes

The tree data structure consists of nodes with each node holdint a number of subclusters (BitFeatures). The maximum number of subclusters in a node is determined by the branching factor. Each subcluster maintains a linear sum, mol_indices and the number of samples in that subcluster. In addition, each subcluster can also have a node as its child, if the subcluster is not a member of a leaf node.

Each time a new fingerprint is fitted, it is merged with the subcluster closest to it and the linear sum, mol_indices and the number of samples int the corresponding subcluster are updated. This is done recursively untils the properties of a leaf node are updated.

property is_init: bool#

Whether the tree has been initialized (True after first call to fit())

property num_fitted_fps: int#

Total number of fitted fingerprints

set_merge(criterion=None, *, tolerance=None, threshold=None, branching_factor=None)[source]#

Changes the criteria for merging subclusters in this BitBirch tree

For an explanation of the parameters see the BitBirch class docstring.

fit(X, /, reinsert_indices=None, input_is_packed=True, n_features=None, max_fps=None)[source]#

Build a BF Tree for the input data.

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Input data.

  • reinsert_indices (Iterable[int]) – if reinsert_indices is passed, X corresponds only to the molecules that will be reinserted into the tree, and reinsert_indices are the indices associated with these molecules.

  • input_is_packed (bool) – Whether the input fingerprints are packed

  • n_features (int) – Number of featurs of input fingerprints. Only required for packed inputs if it is not a multiple of 8, otherwise it is redundant.

Returns:

Fitted estimator.

Return type:

self

get_centroids_mol_ids(sort=True, packed=True)[source]#

Get a dict with centroids and mol indices of the leaves

get_centroids(sort=True, packed=True)[source]#

Get a list of arrays with the centroids’ fingerprints

get_cluster_mol_ids(sort=True)[source]#

Get the indices of the molecules in each cluster

get_assignments(n_mols=None, sort=True, check_valid=True)[source]#

Get an array with the cluster labels associated with each fingerprint idx

dump_assignments(path, smiles=(), sort=True, check_valid=True)[source]#

Dump the cluster assignments to a *.csv file

reset()[source]#

Reset the tree state

Delete all internal nodes and leafs, does not reset the merge criterion or other merge parameters.

delete_internal_nodes()[source]#

Delete all nodes in the tree that are not leaves

This function is for advanced usage only. It should be called if there is need to use the BitBirch leaf clusters, but you need to release the memory held by the internal nodes. After calling this function, no more fingerprints can be fit into the tree, unless a call to BitBirch.reset afterwards releases the whole tree, including the leaf clusters.

recluster_inplace(iterations=1, extra_threshold=0.0, shuffle=False, seed=None, verbose=False, stop_early=False)[source]#

Refine singleton clusters by re-inserting them into the tree

Parameters:
  • extra_threshold (float, default=0.0) – The amount to increase the current threshold in each iteration.

  • iterations (int, default=1) – The maximum number of refinement iterations to perform.

Returns:

self – The fitted estimator with refined clusters.

Return type:

BitBirch

Raises:

ValueError – If the model has not been fitted.

refine_inplace(X, initial_mol=0, input_is_packed=True, n_largest=1)[source]#

Refine the tree: break the largest clusters in singletons and re-fit