Finding Pairs

Matching profiles based on metadata columns

This example demonstrates how to use copairs to group profiles based on their metadata properties.

Specifically, this grouping is used when calculating mAP for profile strength and similarity assessment.

Citation:

Kalinin, A.A., Arevalo, J., Serrano, E., Vulliard, L., Tsang, H., Bornholdt, M., Muñoz, A.F., Sivagurunathan, S., Rajwa, B., Carpenter, A.E., Way, G.P. and Singh, S., 2025. A versatile information retrieval framework for evaluating profile strength and similarity. Nature Communications 16, 5181. doi:10.1038/s41467-025-60306-2

In [1]:

import random

import pandas as pd

from copairs import Matcher, MatcherMultilabel

Data

Let's assume you have a dataset with 20 samples drawn from 3 plates (p1, p2, p3). Each plate has 5 wells (w1, w2, w3, w4, w5), and each well has one or more labels (t1, t2, t3, t4) assigned. Duplicate rows are dropped, so the final table may have fewer than 20 rows.

In [2]:

random.seed(0)
n_samples = 20
dframe = pd.DataFrame(
    {
        "plate": [random.choice(["p1", "p2", "p3"]) for _ in range(n_samples)],
        "well": [
            random.choice(["w1", "w2", "w3", "w4", "w5"]) for _ in range(n_samples)
        ],
        "label": [random.choice(["t1", "t2", "t3", "t4"]) for _ in range(n_samples)],
    }
)
dframe = dframe.drop_duplicates()
dframe = dframe.sort_values(by=["plate", "well", "label"])
dframe = dframe.reset_index(drop=True)

Getting valid pairs

To get pairs of samples that share the same label but come from different plates and different well positions:

In [3]:

matcher = Matcher(dframe, ["plate", "well", "label"], seed=0)
pairs_dict = matcher.get_all_pairs(sameby=["label"], diffby=["plate", "well"])
pairs_dict

Out[3]:

{'t1': [(3, 11), (3, 5), (3, 6), (3, 7)],
 't2': [(1, 16), (1, 10), (1, 15), (8, 16), (8, 15), (10, 16)],
 't3': [(9, 4), (9, 13), (13, 4), (13, 12), (4, 12)],
 't4': [(0, 17), (0, 14), (17, 2), (2, 14)]}
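To make the sameby/diffby constraints concrete, here is a minimal brute-force sketch in plain Python. It is not how copairs is implemented; `brute_force_pairs` and the toy `rows` are hypothetical names introduced only for illustration. It assumes, following the sentence above, that a valid pair must agree on every sameby column and differ on every diffby column.

```python
from itertools import combinations

# Hypothetical toy metadata: one dict per profile; list position is the row index.
rows = [
    {"plate": "p1", "well": "w1", "label": "t1"},  # 0
    {"plate": "p1", "well": "w2", "label": "t1"},  # 1
    {"plate": "p2", "well": "w1", "label": "t1"},  # 2
    {"plate": "p2", "well": "w3", "label": "t1"},  # 3
    {"plate": "p3", "well": "w2", "label": "t2"},  # 4
    {"plate": "p1", "well": "w4", "label": "t2"},  # 5
]


def brute_force_pairs(rows, sameby, diffby):
    """Collect index pairs that match on all `sameby` columns and
    differ on all `diffby` columns, keyed by the shared `sameby` value."""
    pairs = {}
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        same = all(a[col] == b[col] for col in sameby)
        diff = all(a[col] != b[col] for col in diffby)
        if same and diff:
            # Use the bare value as key when there is a single sameby column.
            key = a[sameby[0]] if len(sameby) == 1 else tuple(a[c] for c in sameby)
            pairs.setdefault(key, []).append((i, j))
    return pairs


toy_pairs = brute_force_pairs(rows, sameby=["label"], diffby=["plate", "well"])
print(toy_pairs)
# {'t1': [(0, 3), (1, 2), (1, 3)], 't2': [(4, 5)]}
```

Rows 0 and 2 share a label and sit on different plates, but both are in well w1, so the pair is rejected. The sketch checks every pair of rows, so it only suits small tables, and the order of pairs within each list need not match what Matcher returns.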

Getting valid pairs from a multilabel column

For efficiency reasons, you may not want to have duplicated rows. You can group all the labels of a well into a single row and use MatcherMultilabel to find the corresponding pairs:

In [4]:

dframe_multi = dframe.groupby(["plate", "well"])["label"].unique().reset_index()
dframe_multi

Out[4]:

	plate	well	label
0	p1	w2	[t4]
1	p1	w3	[t2, t4]
2	p1	w4	[t1, t3]
3	p2	w1	[t1]
4	p2	w2	[t1]
5	p2	w3	[t1, t2, t3]
6	p2	w4	[t2]
7	p2	w5	[t1, t3]
8	p3	w1	[t3, t4]
9	p3	w4	[t2]
10	p3	w5	[t2, t4]

In [5]:

matcher_multi = MatcherMultilabel(
    dframe_multi, columns=["plate", "well", "label"], multilabel_col="label", seed=0
)
pairs_multi = matcher_multi.get_all_pairs(sameby=["label"], diffby=["plate", "well"])

pairs_multi is also a dictionary that maps each label to its list of pairs, with the same structure as before. The indices now refer to rows of dframe_multi rather than dframe:

In [6]:

pairs_multi

Out[6]:

{'t1': [(2, 7), (2, 3), (2, 4), (2, 5)],
 't2': [(1, 10), (1, 6), (1, 9), (5, 10), (5, 9), (6, 10)],
 't3': [(5, 2), (5, 8), (8, 2), (8, 7), (2, 7)],
 't4': [(0, 10), (0, 8), (10, 1), (1, 8)]}
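The two results describe the same pairs in different index spaces. As a rough check, the sketch below copies the values from Out[3], Out[4] and Out[6] into plain Python and translates each dframe index into the dframe_multi row it was grouped into. The names `multi_rows`, `row_to_group` and `translated` are hypothetical, and the mapping relies on dframe being sorted by plate, well and label, as in the data cell above.

```python
# Grouped table copied from Out[4]: (plate, well, labels); position = row index.
multi_rows = [
    ("p1", "w2", ["t4"]),
    ("p1", "w3", ["t2", "t4"]),
    ("p1", "w4", ["t1", "t3"]),
    ("p2", "w1", ["t1"]),
    ("p2", "w2", ["t1"]),
    ("p2", "w3", ["t1", "t2", "t3"]),
    ("p2", "w4", ["t2"]),
    ("p2", "w5", ["t1", "t3"]),
    ("p3", "w1", ["t3", "t4"]),
    ("p3", "w4", ["t2"]),
    ("p3", "w5", ["t2", "t4"]),
]

# Pairs copied from Out[3] (indices into dframe).
pairs_dict = {
    "t1": [(3, 11), (3, 5), (3, 6), (3, 7)],
    "t2": [(1, 16), (1, 10), (1, 15), (8, 16), (8, 15), (10, 16)],
    "t3": [(9, 4), (9, 13), (13, 4), (13, 12), (4, 12)],
    "t4": [(0, 17), (0, 14), (17, 2), (2, 14)],
}

# Pairs copied from Out[6] (indices into dframe_multi).
pairs_multi = {
    "t1": [(2, 7), (2, 3), (2, 4), (2, 5)],
    "t2": [(1, 10), (1, 6), (1, 9), (5, 10), (5, 9), (6, 10)],
    "t3": [(5, 2), (5, 8), (8, 2), (8, 7), (2, 7)],
    "t4": [(0, 10), (0, 8), (10, 1), (1, 8)],
}

# Expanding each grouped row once per label, in order, recovers which
# grouped row every (sorted) dframe row belongs to.
row_to_group = [
    group_idx
    for group_idx, (_plate, _well, labels) in enumerate(multi_rows)
    for _label in labels
]

# Rewrite every dframe index pair as a dframe_multi index pair.
translated = {
    label: [(row_to_group[i], row_to_group[j]) for i, j in pairs]
    for label, pairs in pairs_dict.items()
}

print(translated == pairs_multi)
# True
```

For this dataset the translated pairs match pairs_multi exactly, so the multilabel table gives the same pairing with 11 rows instead of 18.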