vignettes/cytominer-pipeline.Rmd
cytominer-pipeline.Rmd
## NULL
Typical morphological profiling datasets have millions of cells and hundreds of features per cells. When working with this data, you must
clean the data
normalize the features so that they are comparable across experiments
transform the features so that their distributions are well-behaved ( i.e., bring them in line with assumptions we want to make about their disributions)
select features based on their quality
aggregate the single-cell data, if needed
The cytominer package makes these steps fast and easy.
First, load the data, which is stored in a database backend created using https://github.com/cytomining/cytominer-database.
fixture <-
system.file("extdata", "fixture_intensities_shapes.sqlite",
package = "cytominer")
db <- DBI::dbConnect(RSQLite::SQLite(), fixture)
Then load associated metadata if it exists, and copy it to the backend so that we can use it later.
ext_metadata <-
readr::read_csv(system.file("extdata", "metadata.csv",
package = "cytominer")) %>%
dplyr::rename(g_well = Well)
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## Well = col_character(),
## Type = col_character()
## )
ext_metadata <- dplyr::copy_to(db, ext_metadata)
Next, select a measurement table that you want to work. Here, we will pick intensities
but we can easily extend to using multiple or all measurement classes if needed by creating new views.
For this example, lets filter the data down to a few wells.
How many rows does this table have?
n |
---|
360 |
That actually 9 times the number of cells in this experiment (n = 40): each compartment and each channel gets its own row. Here, we have 3 compartments (cell, nucleus, cytoplasm) and 3 channels (CellMask, Hoechst, Alexa568). So that’s 3 x 3 x 40 = 360.
Next, do some setup stuff that we will need later
Let’s remove cells that come from images that were marked as having debris
debris_removed <-
measurements %>% dplyr::filter(q_debris == 0)
Then, remove cells where all the measurements are NA’s - this may happen if the identified cell mask was too small to measure any of the features.
na_rows_removed <-
cytominer::drop_na_rows(
population = debris_removed,
variables = variables
) %>%
dplyr::compute()
We need to normalize the data so that
features are on the same scale
plate-to-plate variation is reduced
The default for doing this is standardization
. Here, we take all the cells from control wells in the experiment (this is where the external metadata gets used) and compute normalizations parameters from that (in this case, just the mean and s.d.) and then apply it to the whole dataset (i.e. the population)
normalized <-
cytominer::normalize(
population = na_rows_removed %>%
dplyr::collect(),
variables = variables,
strata = c("g_plate", "g_pattern", "g_channel"),
sample =
na_rows_removed %>%
dplyr::inner_join(
ext_metadata %>%
dplyr::filter(Type == "ctrl") %>%
dplyr::select(g_well)
) %>% dplyr::collect()
)
## Joining, by = "g_well"
normalized %<>% dplyr::collect()
In some cases, we may have features that have no variance at all (e.g. Euler number). If these features have not already been removed by this stage, the standardization step will results in all values for that feature being NA ( because s.d. = 0). Lets remove them:
First, count how many cells have NA values per feature:
na_frequency <-
cytominer::count_na_rows(
population = normalized,
variables = variables)
na_frequency %>%
tidyr::gather(feature, na_count) %>%
knitr::kable()
feature | na_count |
---|---|
m_intensities_first_quartile | 0 |
m_intensities_integrated | 0 |
m_intensities_maximum | 0 |
m_intensities_mean | 0 |
m_intensities_median | 0 |
m_intensities_median_absolute_deviation | 0 |
m_intensities_minimum | 0 |
m_intensities_standard_deviation | 0 |
m_intensities_third_quartile | 0 |
As it turns out, no feature has NA in this example. But lets run this cleaning operation anyway (no features will be dropped)
cleaned <-
cytominer::variable_select(
population = normalized,
variables = variables,
operation = "drop_na_columns"
)
variables <-
colnames(cleaned) %>%
stringr::str_subset("^m_")
Tranform the data so that assumptions we may later make about the data distribution are satisfied (e.g. Gaussianity). The default here is generalized_log
.
transformed <-
cytominer::transform(
population = cleaned,
variables = variables
)
Finally, we typically perform feature selection on the data. Feature selection is an expensive operation, so we usually want to train the feature selection model on a sample of the dataset. Here, we choose to aggregate the data instead of sampling it (i.e. collapse it to per-well aggregates)
aggregated <-
cytominer::aggregate(
population = transformed,
variables = variables,
strata = groupings
) %>%
dplyr::collect()
variables <-
colnames(aggregated) %>%
stringr::str_subset("^m_")
… and then apply feature selection on the per-cell data. Here correlation_threshold
- a method that reduces the redundancy of features - is used.
selected <-
cytominer::variable_select(
population = transformed,
variables = variables,
sample = aggregated,
operation = "correlation_threshold"
) %>%
dplyr::collect()
And now lets take a glimpse at the data!
selected %>%
dplyr::glimpse()
## Rows: 315
## Columns: 10
## $ g_plate <chr> "110000106828", "110000106828"…
## $ g_well <chr> "A01", "A01", "A01", "A01", "A…
## $ g_image <chr> "5a822385194aceb347a07f27285db…
## $ g_pattern <chr> "Cells", "Cells", "Cells", "Ce…
## $ g_channel <chr> "Alexa568", "Alexa568", "Alexa…
## $ q_debris <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ m_intensities_integrated <dbl> -1.48831626, -0.06761634, -0.9…
## $ m_intensities_maximum <dbl> -1.25783096, -1.51483166, -1.5…
## $ m_intensities_median_absolute_deviation <dbl> -1.4847843, -0.5372476, -1.103…
## $ m_intensities_minimum <dbl> -2.3552231, -1.9025552, -0.109…
DBI::dbDisconnect(db)