bblean.similarity

Optimized molecular similarity calculators

Functions

jt_isim_from_sum

iSIM Tanimoto, from sum of rows of a fingerprint array and number of rows

jt_isim

Average Tanimoto, using iSIM

jt_sim_packed

Tanimoto similarity between a matrix of packed fps and a single packed fp

jt_most_dissimilar_packed

Finds two fps in a packed fp array that are the most Tanimoto-dissimilar

jt_isim_radius_from_sum

Calculate the Tanimoto radius of a set of fingerprints

jt_isim_radius_compl_from_sum

Calculate the complement of the Tanimoto radius of a set of fingerprints

jt_isim_diameter_from_sum

Calculate the Tanimoto diameter of a set of fingerprints.

jt_isim_radius

Calculate the Tanimoto radius of a set of fingerprints

jt_isim_radius_compl

Calculate the complement of the Tanimoto radius of a set of fingerprints

jt_isim_diameter

Calculate the Tanimoto diameter of a set of fingerprints

centroid_from_sum

Calculates the majority vote centroid from a sum of fingerprint values

centroid

Calculates the majority vote centroid from a set of fingerprints

jt_isim_medoid

Calculate the (Tanimoto) medoid of a set of fingerprints, using iSIM

jt_compl_isim

Get all complementary (Tanimoto) similarities of a set of fps, using iSIM

jt_stratified_sampling

jt_sim_matrix_packed

Tanimoto similarity matrix between all pairs of packed fps in arr

bblean.similarity.jt_isim_from_sum(linear_sum, n_objects)

iSIM Tanimoto, from sum of rows of a fingerprint array and number of rows

iSIM Tanimoto was first proposed in: https://pubs.rsc.org/en/content/articlelanding/2024/dd/d4dd00041b

\(iSIM_{JT}(X)\) is an excellent \(O(N)\) approximation of the average Tanimoto similarity of a set of fingerprints.

It is also the complement of the Tanimoto diameter: \(iSIM_{JT}(X) = 1 - D_{JT}(X)\).

Parameters:
  • linear_sum (np.ndarray) – Column-wise sum of an array of fingerprints X: linear_sum = np.sum(X, axis=0)

  • n_objects (int) – Number of fingerprints (rows of X): n_objects = X.shape[0]

Returns:

isim – iSIM Jaccard-Tanimoto value

Return type:

float
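To make the column-sum formulation concrete, the following is a minimal NumPy sketch, not bblean's implementation. It assumes iSIM Tanimoto is the ratio of the total pairwise intersection to the total pairwise union, both of which can be counted per column from linear_sum and n_objects alone. The name isim_tanimoto_from_sum is hypothetical, and the sketch does not handle the all-zero case where the denominator vanishes.

```python
import numpy as np


def isim_tanimoto_from_sum(linear_sum: np.ndarray, n_objects: int) -> float:
    """Illustrative iSIM Tanimoto from column sums (hypothetical helper)."""
    c = np.asarray(linear_sum, dtype=np.float64)
    # Per column, number of row pairs where both rows have the bit on
    on_on = np.sum(c * (c - 1) / 2)
    # Per column, number of row pairs where exactly one row has the bit on
    on_off = np.sum(c * (n_objects - c))
    # Total pairwise intersection over total pairwise union
    return float(on_on / (on_on + on_off))


X = np.array(
    [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 1, 1, 0]],
    dtype=np.uint8,
)
isim = isim_tanimoto_from_sum(X.sum(axis=0), X.shape[0])
diameter = 1.0 - isim  # complement, as stated in the docstring above
```

The cost is linear in the number of fingerprints because only the column sums are needed; no pair of rows is ever compared directly.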

bblean.similarity.jt_isim(fps, input_is_packed=True, n_features=None)

Average Tanimoto, using iSIM

iSIM Tanimoto was first proposed in: https://pubs.rsc.org/en/content/articlelanding/2024/dd/d4dd00041b

\(iSIM_{JT}(X)\) is an excellent \(O(N)\) approximation of the average Tanimoto similarity of a set of fingerprints.

It is also the complement of the Tanimoto diameter: \(iSIM_{JT}(X) = 1 - D_{JT}(X)\).

Parameters:
  • fps (np.ndarray) – 2D fingerprint array

  • input_is_packed (bool) – Whether the input array has packed fingerprints

  • n_features (int | None) – Number of features when unpacking fingerprints. Only required if the number of features is not a multiple of 8

Returns:

isim – iSIM Jaccard-Tanimoto value

Return type:

float

bblean.similarity.jt_sim_packed(arr, vec)

Tanimoto similarity between a matrix of packed fps and a single packed fp
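As an illustration of what a packed Tanimoto calculation involves, here is a NumPy-only sketch that does not call bblean. It assumes fingerprints are packed with np.packbits into uint8 bytes and counts bits with a 256-entry lookup table. The helper name tanimoto_packed is hypothetical, and returning 0 when both fingerprints are empty is an assumption, not documented library behavior.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value
POPCOUNT = np.unpackbits(
    np.arange(256, dtype=np.uint8)[:, None], axis=1
).sum(axis=1)


def tanimoto_packed(arr: np.ndarray, vec: np.ndarray) -> np.ndarray:
    """Tanimoto of each packed row of arr against one packed fp (sketch)."""
    inter = POPCOUNT[arr & vec].sum(axis=1)  # |A and B| per row
    union = POPCOUNT[arr | vec].sum(axis=1)  # |A or B| per row
    # Assumption: similarity is defined as 0 when the union is empty
    return np.divide(inter, union, out=np.zeros(len(arr)), where=union > 0)


fps = np.array(
    [[1, 1, 0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0, 0, 0],
     [1, 1, 1, 0, 0, 0, 0, 0]],
    dtype=np.uint8,
)
packed = np.packbits(fps, axis=1)  # shape (3, 1), one byte per row
sims = tanimoto_packed(packed, packed[0])
```

Working on packed bytes avoids materializing the unpacked bit matrix, which is the main reason a packed variant is useful for large fingerprint sets.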

bblean.similarity.jt_most_dissimilar_packed(Y, n_features=None)

Finds two fps in a packed fp array that are the most Tanimoto-dissimilar

This is not guaranteed to find the most dissimilar pair of fps; it is a robust O(N) approximation that does not affect final cluster quality. It first finds the centroid of Y, then finds fp_1, the fingerprint most dissimilar to the centroid, and finally finds fp_2, the fingerprint most dissimilar to fp_1.

Returns:

  • fp_1 (int) – index of the first fingerprint

  • fp_2 (int) – index of the second fingerprint

  • sims_fp_1 (np.ndarray) – Tanimoto similarities of Y to fp_1

  • sims_fp_2 (np.ndarray) – Tanimoto similarities of Y to fp_2

Return type:

tuple[integer, integer, ndarray[tuple[Any, …], dtype[float64]], ndarray[tuple[Any, …], dtype[float64]]]
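The three steps described above (centroid, then fp_1, then fp_2) can be sketched as follows. This is an illustrative version on unpacked 0/1 arrays, whereas the documented function takes packed input. The helper names are hypothetical, and the tie rule for the centroid (a bit is set when at least half of the rows have it) is an assumption.

```python
import numpy as np


def tanimoto_to_vec(X: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Tanimoto of every row of a 0/1 array X against a 0/1 vector v."""
    inter = (X & v).sum(axis=1)
    union = (X | v).sum(axis=1)
    return np.divide(inter, union, out=np.zeros(len(X)), where=union > 0)


def most_dissimilar_pair(X: np.ndarray):
    """Approximate most-dissimilar pair in O(N) similarity evaluations."""
    n = X.shape[0]
    # Step 1: majority-vote centroid (ties set the bit; an assumption)
    centroid = (2 * X.sum(axis=0) >= n).astype(np.uint8)
    # Step 2: fp_1 is the row least similar to the centroid
    fp_1 = int(np.argmin(tanimoto_to_vec(X, centroid)))
    sims_fp_1 = tanimoto_to_vec(X, X[fp_1])
    # Step 3: fp_2 is the row least similar to fp_1
    fp_2 = int(np.argmin(sims_fp_1))
    sims_fp_2 = tanimoto_to_vec(X, X[fp_2])
    return fp_1, fp_2, sims_fp_1, sims_fp_2


X = np.array(
    [[1, 1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [1, 1, 1, 1, 0, 0],
     [0, 0, 0, 0, 1, 1]],
    dtype=np.uint8,
)
fp_1, fp_2, sims_fp_1, sims_fp_2 = most_dissimilar_pair(X)
```

Only three passes over the data are needed, compared with the N(N-1)/2 comparisons an exhaustive search would require; the price is that the returned pair is not guaranteed to be the global optimum.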

bblean.similarity.jt_isim_radius_from_sum(ls, n)

Calculate the Tanimoto radius of a set of fingerprints

bblean.similarity.jt_isim_radius_compl_from_sum(ls, n)

Calculate the complement of the Tanimoto radius of a set of fingerprints

bblean.similarity.jt_isim_diameter_from_sum(ls, n)

Calculate the Tanimoto diameter of a set of fingerprints.

Equivalent to 1 - jt_isim_from_sum(ls, n)

bblean.similarity.jt_isim_radius(arr, input_is_packed=True, n_features=None)

Calculate the Tanimoto radius of a set of fingerprints

bblean.similarity.jt_isim_radius_compl(arr, input_is_packed=True, n_features=None)

Calculate the complement of the Tanimoto radius of a set of fingerprints

bblean.similarity.jt_isim_diameter(arr, input_is_packed=True, n_features=None)

Calculate the Tanimoto diameter of a set of fingerprints

bblean.similarity.centroid_from_sum(linear_sum, n_samples, *, pack=True)

Calculates the majority vote centroid from a sum of fingerprint values

The majority vote centroid is a good approximation of the Tanimoto centroid.

Parameters:
  • linear_sum (np.ndarray) – Sum of the elements column-wise

  • n_samples (int) – Number of samples

  • pack (bool) – Whether to pack the resulting fingerprints

Returns:

centroid – Centroid fingerprints of the given set

Return type:

np.ndarray[np.uint8]
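A majority vote over columns can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the library's code: the helper name majority_vote_centroid is hypothetical, a bit is set when at least half of the samples have it (the tie rule is an assumption), and packing uses the np.packbits defaults.

```python
import numpy as np


def majority_vote_centroid(linear_sum, n_samples: int, pack: bool = True):
    """Set each bit that is on in at least half of the samples (sketch)."""
    counts = np.asarray(linear_sum)
    # Integer comparison avoids floating-point rounding at exactly 50%
    bits = (2 * counts >= n_samples).astype(np.uint8)
    return np.packbits(bits) if pack else bits


X = np.array(
    [[1, 1, 0, 0, 0, 0, 0, 1],
     [1, 0, 1, 0, 0, 0, 0, 1],
     [1, 1, 1, 0, 0, 0, 0, 0]],
    dtype=np.uint8,
)
linear_sum = X.sum(axis=0)
unpacked = majority_vote_centroid(linear_sum, X.shape[0], pack=False)
packed = majority_vote_centroid(linear_sum, X.shape[0], pack=True)
```

Because only linear_sum and n_samples are required, a centroid of this kind can be updated incrementally as fingerprints are added, without revisiting the original rows.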

bblean.similarity.centroid(fps, input_is_packed=True, n_features=None, *, pack=True)

Calculates the majority vote centroid from a set of fingerprints

The majority vote centroid is a good approximation of the Tanimoto centroid.

bblean.similarity.jt_isim_medoid(fps, input_is_packed=True, n_features=None, pack=True)

Calculate the (Tanimoto) medoid of a set of fingerprints, using iSIM

Returns both the index of the medoid in the input array and the medoid itself

Note

Returns the first (or only) fingerprint for arrays of size 2 and 1, respectively. Raises ValueError for arrays of size 0.

bblean.similarity.jt_compl_isim(fps, input_is_packed=True, n_features=None)

Get all complementary (Tanimoto) similarities of a set of fps, using iSIM
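A rough sketch of complementary similarities, assuming the usual iSIM reading: the complementary similarity of fingerprint i is the iSIM of the set with row i removed, obtained by subtracting that row from the column sums. Under the same reading, the row with the lowest complementary similarity is taken as the medoid and the row with the highest as the most outlying. The helper names are hypothetical, the input is an unpacked 0/1 array, and none of this is taken from bblean's source.

```python
import numpy as np


def isim_from_sum(linear_sum: np.ndarray, n_objects: int) -> float:
    """Total pairwise intersection over total pairwise union, per column."""
    c = np.asarray(linear_sum, dtype=np.float64)
    on_on = np.sum(c * (c - 1) / 2)
    on_off = np.sum(c * (n_objects - c))
    return float(on_on / (on_on + on_off))


def complementary_isim(X: np.ndarray) -> np.ndarray:
    """iSIM of the set with each row left out in turn (sketch)."""
    n = X.shape[0]
    total = X.sum(axis=0, dtype=np.int64)
    # Removing row i changes the column sums to total - X[i]
    return np.array([isim_from_sum(total - X[i], n - 1) for i in range(n)])


X = np.array(
    [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 0, 0, 1]],
    dtype=np.uint8,
)
compl = complementary_isim(X)
medoid_idx = int(np.argmin(compl))   # removing it hurts similarity the most
outlier_idx = int(np.argmax(compl))  # removing it helps similarity the most
```

Each leave-one-out value costs one pass over the features, so all N values are obtained without computing any pairwise similarity.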

bblean.similarity.jt_sim_matrix_packed(arr)

Tanimoto similarity matrix between all pairs of packed fps in arr
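For reference, an all-pairs Tanimoto matrix on packed fingerprints can be sketched with NumPy broadcasting. This is an illustrative version, not the library's implementation: the name tanimoto_matrix_packed is hypothetical, packing is assumed to follow np.packbits, and pairs with an empty union are assigned 0 by assumption. It allocates an intermediate of shape (N, N, n_bytes), so it only suits small inputs.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value
POPCOUNT = np.unpackbits(
    np.arange(256, dtype=np.uint8)[:, None], axis=1
).sum(axis=1)


def tanimoto_matrix_packed(arr: np.ndarray) -> np.ndarray:
    """All-pairs Tanimoto for packed uint8 fingerprints (sketch)."""
    a = arr[:, None, :]  # shape (N, 1, n_bytes)
    b = arr[None, :, :]  # shape (1, N, n_bytes)
    inter = POPCOUNT[a & b].sum(axis=2)
    union = POPCOUNT[a | b].sum(axis=2)
    # Assumption: similarity is defined as 0 when the union is empty
    return np.divide(inter, union, out=np.zeros(inter.shape), where=union > 0)


fps = np.array(
    [[1, 1, 0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0, 0, 0],
     [1, 1, 1, 0, 0, 0, 0, 0]],
    dtype=np.uint8,
)
sim_matrix = tanimoto_matrix_packed(np.packbits(fps, axis=1))
```

The result is symmetric with ones on the diagonal for non-empty fingerprints. Its mean over off-diagonal entries is the exact average pairwise Tanimoto that iSIM approximates in O(N).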