Tuning BitBIRCH parameters

BitBIRCH has a few parameters that can be adjusted to tune the quality of the resulting clustering.

Merge criterion and tolerance

The merge_criterion determines whether two clusters can be merged inside a node of the BitBIRCH tree. Some criteria are asymmetric: they treat clusters already in the tree (old clusters) differently from clusters that are being inserted (nominee clusters). In the following, diameter-complement refers to \(1 - D_{JT}\) and radius-complement to \(1 - R_{JT}\). The following merge criteria are implemented:

radius (symmetric):
The radius-complement of the resulting cluster must be greater than or equal to the threshold value.
diameter (symmetric):
The diameter-complement (equivalently, the average similarity) of the resulting cluster must be greater than or equal to the threshold value.
tolerance-diameter (asymmetric):
The diameter criterion must be satisfied and the diameter-complement of the resulting cluster must be greater than or equal to that of the old cluster (unless the old cluster has a single fingerprint). Some slack can be provided with a value of tolerance.
tolerance-radius (asymmetric):
The radius criterion must be satisfied and the radius-complement of the resulting cluster must be greater than or equal to that of the old cluster (unless the old cluster has a single fingerprint). Some slack can be provided with a value of tolerance.
tolerance-legacy (asymmetric):
This is provided only for compatibility with old BitBIRCH versions and should generally be avoided. The diameter criterion must be satisfied and the value of \((\mathrm{isim}(X \cup \mathrm{new\_fp})\,(N + 1) - \mathrm{isim}(X)\,(N - 1)) / 2\) must be greater than or equal to that of the old cluster (unless the old cluster has a single fingerprint, or the new cluster has more than one fingerprint). Some slack can be provided with a value of tolerance.

Both tolerance-diameter and tolerance-radius reduce the tolerance slack exponentially as the cluster gets larger. This behaviour is usually desirable, but can be turned off with adaptive=False.
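
The diameter and tolerance-diameter criteria can be illustrated with a small, self-contained sketch. It is not the library's implementation: it uses a brute-force mean pairwise Tanimoto in place of the approximations BitBIRCH actually relies on, it reads the threshold as a minimum average similarity (as the Threshold section describes), it assumes the slack is subtracted from the old cluster's value, and it does not model the adaptive decay of the tolerance. All function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Jaccard-Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def avg_similarity(fps: list) -> float:
    """Brute-force mean pairwise similarity, a stand-in for the diameter-complement."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        return 1.0  # convention chosen here for a single fingerprint
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def diameter_ok(old: list, nominee: list, threshold: float) -> bool:
    """Symmetric check: the merged cluster must reach the threshold."""
    return avg_similarity(old + nominee) >= threshold


def tolerance_diameter_ok(old: list, nominee: list, threshold: float,
                          tolerance: float = 0.05) -> bool:
    """Asymmetric check: additionally, the merge must not degrade the old cluster
    by more than `tolerance` (assumed to act as a subtractive slack)."""
    if not diameter_ok(old, nominee, threshold):
        return False
    if len(old) == 1:
        return True  # single-fingerprint old clusters are exempt
    return avg_similarity(old + nominee) >= avg_similarity(old) - tolerance


old = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5})]  # average similarity 0.6
nominee = [frozenset({1, 2, 6, 7})]                        # 1/3 similar to each old member

print(diameter_ok(old, nominee, threshold=0.4))
print(tolerance_diameter_ok(old, nominee, threshold=0.4, tolerance=0.05))
print(tolerance_diameter_ok(old, nominee, threshold=0.4, tolerance=0.2))
```

In this toy case the merged cluster passes the plain diameter check but is rejected by the tolerance variant at the default slack, because the merge would lower the old cluster's average similarity from 0.6 to about 0.42.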

Currently we recommend the diameter criterion for the initial build of the tree, and the corresponding tolerance-diameter criterion for refinement and tree-combining. The default slack value for tolerance (0.05) is good for most purposes, although you may want no slack (tolerance=0) if it is important to maintain the average Tanimoto values after refinement. Using a very large value for tolerance will flatten the isim distribution.
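
As a compact summary, the recommendation above could be written down as two sets of settings, one per stage. The dictionaries below are only a sketch: the keys reuse the parameter names mentioned in this section, but how they are passed to the library (and whether it accepts them in this form) is an assumption, so check the API reference before using them.

```python
# Hypothetical settings for the initial tree build (symmetric criterion).
build_params = {"merge_criterion": "diameter"}

# Hypothetical settings for refinement and tree-combining (asymmetric criterion
# with the default slack quoted in the text).
refine_params = {"merge_criterion": "tolerance-diameter", "tolerance": 0.05}

# Variant with no slack, for when average Tanimoto values must be maintained.
strict_refine_params = {**refine_params, "tolerance": 0.0}
```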

Threshold

The threshold sets the minimum acceptable value of the similarity metric within a cluster. If adding a new molecule to a cluster would result in an average similarity lower than threshold, BitBIRCH will instead create a new cluster. High threshold values may result in many small, compact clusters. Low threshold values may result in few large, diffuse clusters.

The clustering results for a given threshold value depend on the kind of fingerprint used. Sparse fingerprints (e.g. ECFPs) typically have lower pairwise Jaccard-Tanimoto similarities, which means you will want a low threshold to recover meaningful structure. Denser fingerprints (e.g. the default RDKit fingerprints) require a higher threshold.
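
The link between bit density and baseline similarity can be seen even with purely random bit vectors. The sketch below is an illustration only: real fingerprints are not random, and the densities (2% and 30%), the vector length and the helper names are arbitrary assumptions. For independent random bits with density p, the expected Tanimoto similarity is roughly p / (2 - p), so denser vectors start from a higher floor.

```python
import random


def random_fp(n_bits: int, density: float, rng: random.Random) -> frozenset:
    """Random fingerprint as a set of on-bit indices; each bit is on with prob. `density`."""
    return frozenset(i for i in range(n_bits) if rng.random() < density)


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Jaccard-Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_random_similarity(density: float, n_bits: int = 2048,
                           n_pairs: int = 200, seed: int = 0) -> float:
    """Average similarity over `n_pairs` independent random fingerprint pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        total += tanimoto(random_fp(n_bits, density, rng),
                          random_fp(n_bits, density, rng))
    return total / n_pairs


sparse = mean_random_similarity(0.02)  # sparse vectors, illustrative density
dense = mean_random_similarity(0.30)   # dense vectors, illustrative density
print(f"sparse baseline: {sparse:.3f}, dense baseline: {dense:.3f}")
```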

A typical recommendation is to use a threshold in the range of 0.2-0.35 for ECFP4 or ECFP6, and a threshold in the range of 0.5-0.65 for RDKit fingerprints. Within these ranges the method is not very sensitive to the threshold value chosen, but choosing a threshold from the wrong range for a given fingerprint kind can severely degrade the clustering.
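
These ranges are easy to encode as a sanity check before launching a long run. The helper below is a hypothetical convenience, not part of BitBIRCH; the table only restates the ranges quoted above, and the fingerprint labels are arbitrary keys chosen for this sketch.

```python
# Recommended threshold ranges quoted in the text, keyed by hypothetical labels.
RECOMMENDED_THRESHOLDS = {
    "ecfp4": (0.2, 0.35),
    "ecfp6": (0.2, 0.35),
    "rdkit": (0.5, 0.65),
}


def threshold_in_recommended_range(fp_kind: str, threshold: float) -> bool:
    """Return True if `threshold` lies inside the range suggested for `fp_kind`."""
    low, high = RECOMMENDED_THRESHOLDS[fp_kind.lower()]
    return low <= threshold <= high


print(threshold_in_recommended_range("ECFP4", 0.3))
print(threshold_in_recommended_range("rdkit", 0.3))
```

Here the same threshold of 0.3 falls inside the suggested range for ECFP4 but outside it for RDKit fingerprints.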

Branching factor

The branching_factor determines how many clusters each node of the BitBIRCH tree can hold before splitting into new nodes. A high branching factor results in fewer nodes, so tree insertions better approximate an exhaustive search over the full fingerprint set and memory usage is lower. However, a very high branching factor may also incur a higher computational cost.

A recommended branching factor that performs well in terms of memory use and compute cost is 254. Higher branching factors may be useful to reduce memory usage when clustering hundreds of millions of molecules, at the cost of some speed (for example, you may want 1000 for 100M-200M molecules).
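
A back-of-the-envelope calculation shows why a larger branching factor means fewer nodes. The sketch below computes idealized lower bounds for a tree whose nodes are all completely full; real trees built by successive splits will have more nodes than this. The cluster count of one million and the function names are illustrative assumptions.

```python
def min_leaf_nodes(n_clusters: int, branching_factor: int) -> int:
    """Fewest leaf nodes needed if every leaf holds `branching_factor` clusters."""
    return -(-n_clusters // branching_factor)  # integer ceiling division


def min_tree_height(n_clusters: int, branching_factor: int) -> int:
    """Fewest levels needed so that a completely full tree can hold all clusters."""
    height, capacity = 1, branching_factor
    while capacity < n_clusters:
        capacity *= branching_factor
        height += 1
    return height


n_clusters = 1_000_000  # illustrative number of leaf clusters
for bf in (50, 254, 1000):
    print(bf, min_leaf_nodes(n_clusters, bf), min_tree_height(n_clusters, bf))
```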

The clustering results depend on the branching_factor, but only very weakly. Most of the effect is limited to performance and memory usage.