Finding Pairs
Matching profiles based on metadata columns
This example demonstrates how to use copairs to group profiles based on their metadata properties.
Specifically, this grouping is used when calculating mAP for profile strength and similarity assessment.
Citation:
Kalinin, A.A., Arevalo, J., Serrano, E., Vulliard, L., Tsang, H., Bornholdt, M., Muñoz, A.F., Sivagurunathan, S., Rajwa, B., Carpenter, A.E., Way, G.P. and Singh, S., 2025. A versatile information retrieval framework for evaluating profile strength and similarity. Nature Communications 16, 5181. doi:10.1038/s41467-025-60306-2
import random
import pandas as pd
from copairs import Matcher, MatcherMultilabel
Data
Assume a dataset of 20 samples drawn from 3 plates (p1, p2, p3).
Each plate has 5 wells (w1, w2, w3, w4, w5), and each well
has one or more labels (t1, t2, t3, t4) assigned.
Because the samples are drawn at random and duplicate rows are dropped, the final table may contain fewer than 20 rows.
# Fix the seed so the random draws are reproducible
random.seed(0)
n_samples = 20
dframe = pd.DataFrame(
    {
        "plate": [random.choice(["p1", "p2", "p3"]) for _ in range(n_samples)],
        "well": [
            random.choice(["w1", "w2", "w3", "w4", "w5"]) for _ in range(n_samples)
        ],
        "label": [random.choice(["t1", "t2", "t3", "t4"]) for _ in range(n_samples)],
    }
)
# Remove repeated (plate, well, label) rows, then sort and renumber the index
dframe = dframe.drop_duplicates()
dframe = dframe.sort_values(by=["plate", "well", "label"])
dframe = dframe.reset_index(drop=True)
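Getting valid pairs
To get pairs of samples that share the same label but come from different
plates and different well positions, pass the label column as sameby and the
plate and well columns as diffby. The result is a dictionary that maps each
label to a list of row-index pairs in dframe: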
matcher = Matcher(dframe, ["plate", "well", "label"], seed=0)
pairs_dict = matcher.get_all_pairs(sameby=["label"], diffby=["plate", "well"])
pairs_dict
{'t1': [(3, 11), (3, 5), (3, 6), (3, 7)],
't2': [(1, 16), (1, 10), (1, 15), (8, 16), (8, 15), (10, 16)],
't3': [(9, 4), (9, 13), (13, 4), (13, 12), (4, 12)],
't4': [(0, 17), (0, 14), (17, 2), (2, 14)]}
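The sameby/diffby rule described above can be sketched with a brute-force check over every pair of rows. This is only an illustration of the rule as stated in the text (same label, different plate and different well); it is not how copairs computes pairs internally, and the helper name `brute_force_pairs` and the toy rows are hypothetical.

```python
from itertools import combinations

# Toy records for illustration only; these are NOT the rows of dframe above.
rows = [
    {"plate": "p1", "well": "w1", "label": "t1"},  # index 0
    {"plate": "p2", "well": "w2", "label": "t1"},  # index 1
    {"plate": "p2", "well": "w1", "label": "t1"},  # index 2
    {"plate": "p3", "well": "w3", "label": "t2"},  # index 3
    {"plate": "p1", "well": "w2", "label": "t2"},  # index 4
]


def brute_force_pairs(rows, sameby, diffby):
    """Hypothetical helper: collect index pairs that match on every `sameby`
    column and differ on every `diffby` column, keyed by the shared value."""
    pairs = {}
    for i, j in combinations(range(len(rows)), 2):
        a, b = rows[i], rows[j]
        same = all(a[col] == b[col] for col in sameby)
        diff = all(a[col] != b[col] for col in diffby)
        if same and diff:
            # Use the bare value for a single column, a tuple for several
            if len(sameby) == 1:
                key = a[sameby[0]]
            else:
                key = tuple(a[col] for col in sameby)
            pairs.setdefault(key, []).append((i, j))
    return pairs


toy_pairs = brute_force_pairs(rows, sameby=["label"], diffby=["plate", "well"])
print(toy_pairs)
```

Rows 0 and 2 share a label and sit on different plates, but they are excluded because they occupy the same well position. The brute-force loop always emits pairs with the smaller index first, whereas the order inside the pairs returned by copairs may differ.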
Getting valid pairs from a multilabel column
For efficiency, you may prefer to avoid duplicated rows. You can
group all the labels of each plate and well into a single row and use
MatcherMultilabel to find the corresponding pairs:
dframe_multi = dframe.groupby(["plate", "well"])["label"].unique().reset_index()
dframe_multi
| | plate | well | label |
|---|---|---|---|
| 0 | p1 | w2 | [t4] |
| 1 | p1 | w3 | [t2, t4] |
| 2 | p1 | w4 | [t1, t3] |
| 3 | p2 | w1 | [t1] |
| 4 | p2 | w2 | [t1] |
| 5 | p2 | w3 | [t1, t2, t3] |
| 6 | p2 | w4 | [t2] |
| 7 | p2 | w5 | [t1, t3] |
| 8 | p3 | w1 | [t3, t4] |
| 9 | p3 | w4 | [t2] |
| 10 | p3 | w5 | [t2, t4] |
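For readers who want to see what the grouping step does without pandas, the collapse from one row per label to one row per (plate, well) can be sketched in plain Python. The helper name `collapse_labels` and the sample records are hypothetical and serve only to illustrate the idea.

```python
# Illustrative (plate, well, label) records, one label per record.
records = [
    ("p1", "w2", "t4"),
    ("p1", "w3", "t2"),
    ("p1", "w3", "t4"),
    ("p1", "w3", "t2"),  # repeated record; must not add a second "t2"
]


def collapse_labels(records):
    """Hypothetical helper: map each (plate, well) to its unique labels,
    kept in first-seen order."""
    grouped = {}
    for plate, well, label in records:
        labels = grouped.setdefault((plate, well), [])
        if label not in labels:
            labels.append(label)
    return grouped


collapsed = collapse_labels(records)
print(collapsed)
```

Each key now appears once, and its value is the list of labels that the multilabel matcher would treat as belonging to that row.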
matcher_multi = MatcherMultilabel(
    dframe_multi, columns=["plate", "well", "label"], multilabel_col="label", seed=0
)
pairs_multi = matcher_multi.get_all_pairs(sameby=["label"], diffby=["plate", "well"])
pairs_multi is also a dictionary that maps each label to a list of index pairs,
with the same structure as pairs_dict above. The indices now refer to rows of dframe_multi:
pairs_multi
{'t1': [(2, 7), (2, 3), (2, 4), (2, 5)],
't2': [(1, 10), (1, 6), (1, 9), (5, 10), (5, 9), (6, 10)],
't3': [(5, 2), (5, 8), (8, 2), (8, 7), (2, 7)],
't4': [(0, 10), (0, 8), (10, 1), (1, 8)]}
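With a multilabel column, two rows count as sharing a label when their label lists intersect, and the pair is recorded once under each shared label. The following is a minimal brute-force sketch of that idea, assuming the same "differ on every diffby column" rule as in the text. It is not the MatcherMultilabel implementation, and the helper name `brute_force_multilabel_pairs` and the toy rows are hypothetical.

```python
from itertools import combinations

# Toy multilabel rows for illustration only; NOT the rows of dframe_multi.
rows = [
    {"plate": "p1", "well": "w1", "label": ["t1", "t2"]},  # index 0
    {"plate": "p2", "well": "w2", "label": ["t1"]},        # index 1
    {"plate": "p2", "well": "w3", "label": ["t2", "t3"]},  # index 2
    {"plate": "p1", "well": "w4", "label": ["t3"]},        # index 3
]


def brute_force_multilabel_pairs(rows, multilabel_col, diffby):
    """Hypothetical helper: for every pair of rows that differs on all
    `diffby` columns, record the pair under each label the rows share."""
    pairs = {}
    for i, j in combinations(range(len(rows)), 2):
        a, b = rows[i], rows[j]
        if not all(a[col] != b[col] for col in diffby):
            continue
        shared = set(a[multilabel_col]) & set(b[multilabel_col])
        for label in sorted(shared):
            pairs.setdefault(label, []).append((i, j))
    return pairs


toy_multi = brute_force_multilabel_pairs(
    rows, multilabel_col="label", diffby=["plate", "well"]
)
print(toy_multi)
```

Rows 0 and 3 are skipped because they sit on the same plate, and rows 1 and 3 differ in plate and well but share no label, so neither pair appears in the result.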