Typical morphological profiling datasets have millions of cells and hundreds of features per cells. When working with this data, you must

  • clean the data

  • normalize the features so that they are comparable across experiments

  • transform the features so that their distributions are well-behaved ( i.e., bring them in line with assumptions we want to make about their disributions)

  • select features based on their quality

  • aggregate the single-cell data, if needed

The cytominer package makes these steps fast and easy.

Load data

First, load the data, which is stored in a database backend created using https://github.com/cytomining/cytominer-database.

fixture <-
  system.file("extdata", "fixture_intensities_shapes.sqlite",
              package = "cytominer")

db <- DBI::dbConnect(RSQLite::SQLite(), fixture)

Then load associated metadata if it exists, and copy it to the backend so that we can use it later.

ext_metadata <-
  readr::read_csv(system.file("extdata", "metadata.csv",
                              package = "cytominer")) %>%
  dplyr::rename(g_well = Well)
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Well = col_character(),
##   Type = col_character()
## )
ext_metadata <- dplyr::copy_to(db, ext_metadata)

Next, select a measurement table that you want to work. Here, we will pick intensities but we can easily extend to using multiple or all measurement classes if needed by creating new views.

intensities <-
  dplyr::tbl(src = db, "view_intensities") %>%

For this example, lets filter the data down to a few wells.

measurements <-
  intensities %>%
  dplyr::filter(g_well %in% c("A01", "A02", "A10", "A11"))

How many rows does this table have?

measurements %>%
  dplyr::tally() %>%

That actually 9 times the number of cells in this experiment (n = 40): each compartment and each channel gets its own row. Here, we have 3 compartments (cell, nucleus, cytoplasm) and 3 channels (CellMask, Hoechst, Alexa568). So that’s 3 x 3 x 40 = 360.

Next, do some setup stuff that we will need later

qualities <- c("q_debris")

groupings <-

variables <-
  colnames(measurements) %>%

measurements %<>%
  dplyr::select(dplyr::one_of(c(groupings, qualities, variables)))


Let’s remove cells that come from images that were marked as having debris

debris_removed <-
  measurements %>% dplyr::filter(q_debris == 0)

Then, remove cells where all the measurements are NA’s - this may happen if the identified cell mask was too small to measure any of the features.

na_rows_removed <-
    population = debris_removed,
    variables = variables
  ) %>%


We need to normalize the data so that

  • features are on the same scale

  • plate-to-plate variation is reduced

The default for doing this is standardization. Here, we take all the cells from control wells in the experiment (this is where the external metadata gets used) and compute normalizations parameters from that (in this case, just the mean and s.d.) and then apply it to the whole dataset (i.e. the population)

normalized <-
    population = na_rows_removed %>% 
    variables = variables,
    strata =  c("g_plate", "g_pattern", "g_channel"),
    sample =
      na_rows_removed %>%
        ext_metadata %>% 
          dplyr::filter(Type == "ctrl") %>% 
      ) %>% dplyr::collect()
## Joining, by = "g_well"
normalized %<>% dplyr::collect()

In some cases, we may have features that have no variance at all (e.g. Euler number). If these features have not already been removed by this stage, the standardization step will results in all values for that feature being NA ( because s.d. = 0). Lets remove them:

First, count how many cells have NA values per feature:

na_frequency <-
    population = normalized,
    variables = variables)

na_frequency %>%
  tidyr::gather(feature, na_count) %>%
feature na_count
m_intensities_first_quartile 0
m_intensities_integrated 0
m_intensities_maximum 0
m_intensities_mean 0
m_intensities_median 0
m_intensities_median_absolute_deviation 0
m_intensities_minimum 0
m_intensities_standard_deviation 0
m_intensities_third_quartile 0

As it turns out, no feature has NA in this example. But lets run this cleaning operation anyway (no features will be dropped)

cleaned <-
    population = normalized,
    variables = variables,
    operation = "drop_na_columns"

variables <-
  colnames(cleaned) %>%


Tranform the data so that assumptions we may later make about the data distribution are satisfied (e.g. Gaussianity). The default here is generalized_log.

transformed <-
    population = cleaned,
    variables = variables

Select features

Finally, we typically perform feature selection on the data. Feature selection is an expensive operation, so we usually want to train the feature selection model on a sample of the dataset. Here, we choose to aggregate the data instead of sampling it (i.e. collapse it to per-well aggregates)

aggregated <-
    population = transformed,
    variables = variables,
    strata = groupings
  ) %>%

variables <-
  colnames(aggregated) %>%

… and then apply feature selection on the per-cell data. Here correlation_threshold - a method that reduces the redundancy of features - is used.

selected <-
    population = transformed,
    variables = variables,
    sample = aggregated,
    operation = "correlation_threshold"
  ) %>%

And now lets take a glimpse at the data!

selected %>%
## Rows: 315
## Columns: 10
## $ g_plate                                 <chr> "110000106828", "110000106828"…
## $ g_well                                  <chr> "A01", "A01", "A01", "A01", "A…
## $ g_image                                 <chr> "5a822385194aceb347a07f27285db…
## $ g_pattern                               <chr> "Cells", "Cells", "Cells", "Ce…
## $ g_channel                               <chr> "Alexa568", "Alexa568", "Alexa…
## $ q_debris                                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ m_intensities_integrated                <dbl> -1.48831626, -0.06761634, -0.9…
## $ m_intensities_maximum                   <dbl> -1.25783096, -1.51483166, -1.5…
## $ m_intensities_median_absolute_deviation <dbl> -1.4847843, -0.5372476, -1.103…
## $ m_intensities_minimum                   <dbl> -2.3552231, -1.9025552, -0.109…