bblean.bitbirch#
BitBirch ‘Lean’ class for fast, memory-efficient O(N) clustering
Classes
Implements the BitBIRCH clustering algorithm, 'Lean' version |
- class bblean.bitbirch.BitBirch(*, threshold=0.65, branching_factor=50, merge_criterion=None, tolerance=None)[source]#
Implements the BitBIRCH clustering algorithm, ‘Lean’ version
Memory and time efficient, online-learning algorithm. It constructs a tree data structure with the cluster centroids being read off the leaf.
If you find this software useful please cite the following articles:
BitBIRCH: efficient clustering of large molecular libraries: https://doi.org/10.1039/D5DD00030K
BitBIRCH Clustering Refinement Strategies: https://doi.org/10.1021/acs.jcim.5c00627
BitBIRCH-Lean: TO-BE-ADDED
- Parameters:
threshold (float = 0.65) – The similarity radius of the subcluster obtained by merging a new sample and the closest subcluster should be greater than the threshold. Otherwise a new subcluster is started. Setting this value to be very low promotes splitting and vice-versa.
branching_factor (int = 50) – Maximum number of ‘BitFeatures’ subclusters in each node. If a new sample enters such that the number of subclusters exceed the branching_factor then that node is split into two nodes with the subclusters redistributed in each. The parent subcluster of that node is removed and two new subclusters are added as parents of the 2 split nodes.
merge_criterion (str) – radius, diameter or tolerance. radius: merge subcluster based on comparison to centroid of the cluster. diameter: merge subcluster based on instant Tanimoto similarity of cluster. tolerance: applies tolerance threshold to diameter merge criteria, which will merge subcluster with stricter threshold for newly added molecules.
tolerance (float) – Penalty value for similarity threshold of the ‘tolerance’ merge criteria.
Notes
The tree data structure consists of nodes with each node holdint a number of subclusters (
BitFeatures
). The maximum number of subclusters in a node is determined by the branching factor. Each subcluster maintains a linear sum, mol_indices and the number of samples in that subcluster. In addition, each subcluster can also have a node as its child, if the subcluster is not a member of a leaf node.Each time a new fingerprint is fitted, it is merged with the subcluster closest to it and the linear sum, mol_indices and the number of samples int the corresponding subcluster are updated. This is done recursively untils the properties of a leaf node are updated.
- set_merge(criterion=None, *, tolerance=None, threshold=None, branching_factor=None)[source]#
Changes the criteria for merging subclusters in this BitBirch tree
For an explanation of the parameters see the
BitBirch
class docstring.
- fit(X, /, reinsert_indices=None, input_is_packed=True, n_features=None, max_fps=None)[source]#
Build a BF Tree for the input data.
- Parameters:
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Input data.
reinsert_indices (Iterable[int]) – if
reinsert_indices
is passed,X
corresponds only to the molecules that will be reinserted into the tree, andreinsert_indices
are the indices associated with these molecules.input_is_packed (bool) – Whether the input fingerprints are packed
n_features (int) – Number of featurs of input fingerprints. Only required for packed inputs if it is not a multiple of 8, otherwise it is redundant.
- Returns:
Fitted estimator.
- Return type:
self
- get_centroids_mol_ids(sort=True, packed=True)[source]#
Get a dict with centroids and mol indices of the leaves
- get_centroids(sort=True, packed=True)[source]#
Get a list of arrays with the centroids’ fingerprints
- get_assignments(n_mols=None, sort=True, check_valid=True)[source]#
Get an array with the cluster labels associated with each fingerprint idx
- dump_assignments(path, smiles=(), sort=True, check_valid=True)[source]#
Dump the cluster assignments to a
*.csv
file
- reset()[source]#
Reset the tree state
Delete all internal nodes and leafs, does not reset the merge criterion or other merge parameters.
- delete_internal_nodes()[source]#
Delete all nodes in the tree that are not leaves
This function is for advanced usage only. It should be called if there is need to use the BitBirch leaf clusters, but you need to release the memory held by the internal nodes. After calling this function, no more fingerprints can be fit into the tree, unless a call to
BitBirch.reset
afterwards releases the whole tree, including the leaf clusters.
- recluster_inplace(iterations=1, extra_threshold=0.0, shuffle=False, seed=None, verbose=False, stop_early=False)[source]#
Refine singleton clusters by re-inserting them into the tree
- Parameters:
- Returns:
self – The fitted estimator with refined clusters.
- Return type:
- Raises:
ValueError – If the model has not been fitted.