Overview

This page provides a brief overview of CytoTable topics. For an introduction to using CytoTable, please see the tutorial page.

Presets and Manual Overrides

Various preset configurations, defined under presets.config, are available within CytoTable and affect how data are read and produced. These presets are intended to assist with common data source expectations. By default, CytoTable uses the “cellprofiler_csv” preset. Please note that these presets may not capture all possible outcomes. Use manual overrides within convert() as needed.
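For example, a minimal sketch (with hypothetical file paths) of relying on the default “cellprofiler_csv” preset while manually overriding one of its settings:

# Minimal sketch: convert CellProfiler CSV output to Parquet.
# File paths here are hypothetical.
import cytotable

cytotable.convert(
    source_path="./cellprofiler_output",  # directory of CellProfiler CSV files
    dest_path="./converted.parquet",
    dest_datatype="parquet",
    # preset="cellprofiler_csv" is the default and may be omitted;
    # manual overrides such as chunk_size take precedence over preset values
    chunk_size=500,
)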

Data Sources

flowchart LR
    images[("Image\nfile(s)")]:::outlined --> image-tools[Image Analysis Tools]:::outlined
    image-tools --> measurements[("Measurement\nfile(s)")]:::green
    measurements --> CytoTable:::green

    classDef outlined fill:#fff,stroke:#333
    classDef green fill:#97F0B4,stroke:#333

Data sources for CytoTable are measurement data created by other cell biology image analysis tools. These measurement data are the focus of the data source content which follows.

Data Source Locations

Data sources may be provided to CytoTable using local filepaths or remote object-storage filepaths (for example, AWS S3, GCP Cloud Storage, Azure Storage). We use cloudpathlib under the hood to reference files in a unified way, whether they’re local or remote.

Cloud Data Sources

CytoTable uses cloudpathlib to access cloud-based data sources. CytoTable supports:

  • Amazon Web Services (AWS) S3

  • Google Cloud Storage (GCS)

  • Azure Blob Storage

Cloud Service Configuration and Authentication

Remote object storage paths which require authentication or other specialized configuration may be accessed by passing cloudpathlib client arguments (for S3Client, AzureBlobClient, or GSClient) through convert(..., **kwargs).

For example, remote AWS S3 paths which are public-facing and do not require authentication (like, or similar to, aws s3 ... --no-sign-request) may be used via convert(..., no_sign_request=True).

Each cloud service provider may have different requirements for authentication (there is no fully unified API for these). Please see the cloudpathlib client documentation for more information on which arguments may be used for configuration with specific cloud providers (for example, S3Client, GSClient, or AzureBlobClient).
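As a sketch (bucket and file names hypothetical), reading from a public-facing S3 bucket without credentials might look like this:

import cytotable

cytotable.convert(
    source_path="s3://example-public-bucket/analysis.sqlite",  # hypothetical public bucket
    dest_path="./analysis.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    # forwarded through convert() keyword arguments to cloudpathlib's
    # S3Client; skips AWS credential signing for public buckets
    no_sign_request=True,
)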

Cloud Service File Type Parsing Differences

Data sources retrieved from cloud services are not all treated the same due to technical constraints. See below for a description of how each file type is handled.

Comma-separated values (.csv):

CytoTable reads cloud-based CSV files directly.

SQLite Databases (.sqlite):

CytoTable downloads cloud-based SQLite databases locally before other CytoTable processing. This is necessary to account for differences in how SQLite’s virtual file system (VFS) operates in context with cloud service object storage.

Note: Large SQLite files stored in the cloud may benefit from explicit local cache specification through a special keyword argument (**kwarg) passed through CytoTable to cloudpathlib called local_cache_dir (see the cloudpathlib documentation on caching). This argument helps ensure that constraints on default OS temporary file storage locations (for example, file size limitations and periodic deletions outside of CytoTable) do not impede the ability to download or work with the data.

A quick example of how this argument is used: convert(..., local_cache_dir="non_temporary_directory", ...).
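A slightly fuller sketch (paths hypothetical) of using local_cache_dir for a large cloud-based SQLite file:

import cytotable

cytotable.convert(
    source_path="s3://example-bucket/large_database.sqlite",  # hypothetical
    dest_path="./large_database.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    # cache the downloaded SQLite file outside default OS temporary storage
    local_cache_dir="./sqlite_cache",
)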

Future work to enable direct SQLite data access from cloud locations for CytoTable will be documented within GitHub issue CytoTable/#70.

Data Source Types

Data source compatibility for CytoTable is focused on (but not explicitly limited to) the following.

CellProfiler Data Sources

  • Comma-separated values (.csv): “A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.” (reference) CellProfiler CSV data sources generally follow the format provided as output by CellProfiler ExportToSpreadsheet.

      • Manual specification: CSV data source types may be manually specified by using convert(..., source_datatype="csv", ...).

      • Preset specification: CSV data sources from CellProfiler may use the configuration preset convert(..., preset="cellprofiler_csv", ...).

  • SQLite Databases (.sqlite): “SQLite database files are commonly used as containers to transfer rich content between systems and as a long-term archival format for data.” (reference) CellProfiler SQLite database sources may follow a format provided as output by CellProfiler ExportToDatabase or cytominer-database.

      • Manual specification: SQLite data source types may be manually specified by using convert(..., source_datatype="sqlite", ...).

      • Preset specification: SQLite data sources from CellProfiler may use the configuration preset convert(..., preset="cellprofiler_sqlite", ...).

IN Carta Data Sources

  • Manual specification: CSV data source types may be manually specified by using convert(..., source_datatype="csv", ...).

  • Preset specification: CSV data sources from In Carta Image Analysis Software may use the configuration preset convert(..., preset="in-carta", ...), as shown in the sketch below.
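A sketch (paths hypothetical) combining a manual source type specification with the IN Carta preset:

import cytotable

cytotable.convert(
    source_path="./in_carta_output",  # hypothetical directory of IN Carta CSV files
    dest_path="./in_carta.parquet",
    dest_datatype="parquet",
    source_datatype="csv",  # manual source type specification
    preset="in-carta",
)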

Data Destinations

Data Destination Locations

Converted data destinations may be provided to CytoTable using only local filepaths (in contrast to data sources, which may also be remote). Specify the converted data destination using convert(..., dest_path="<a local filepath>").

Data Destination Types

  • Apache Parquet (.parquet): “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.” (reference)

Parquet data destination type may be specified by using convert(..., dest_datatype="parquet", ...).
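Because the destination is a local Parquet file, converted output may be inspected with any Parquet-capable reader. For example, a sketch (paths hypothetical) using pandas:

import cytotable
import pandas as pd

cytotable.convert(
    source_path="./measurements.csv",  # hypothetical
    dest_path="./measurements.parquet",
    dest_datatype="parquet",
)

# inspect the converted Parquet destination
print(pd.read_parquet("./measurements.parquet").head())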

Data Transformations

CytoTable performs various types of data transformations. This section helps define terminology and expectations surrounding the use of these terms. CytoTable might use one or all of these transformations depending on user configuration.

Data Chunking

Original:

“Data source”

Col_A | Col_B | Col_C
1     | a     | 0.01
2     | b     | 0.02

Changes:

“Chunk 1”

Col_A | Col_B | Col_C
1     | a     | 0.01

“Chunk 2”

Col_A | Col_B | Col_C
2     | b     | 0.02

Example of data chunking performed on a simple table of data.

Data chunking within CytoTable involves slicing data sources into “chunks” of rows which all contain the same columns and have a lower number of rows than the original data source. CytoTable uses data chunking through the chunk_size argument (convert(..., chunk_size=1000, ...)) to reduce the memory footprint of operations on subsets of data. CytoTable may be used to create chunked data output by disabling concatenation and joins, e.g. convert(..., concat=False, join=False, ...). Parquet “datasets” are an abstraction which may be used to read CytoTable output data chunks which are not concatenated or joined (for example, see the PyArrow documentation or Pandas documentation on using source paths which are directories).
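A sketch (paths hypothetical) of producing chunked output and reading the resulting files back as one logical Parquet dataset:

import cytotable
import pyarrow.parquet as pq

# disable concatenation and joins so per-chunk Parquet files remain
cytotable.convert(
    source_path="./measurements.csv",  # hypothetical
    dest_path="./chunked_output",
    dest_datatype="parquet",
    chunk_size=1000,
    concat=False,
    join=False,
)

# read the directory of chunk files as a single logical dataset
# (assumes the chunk files under this path share one schema)
table = pq.ParquetDataset("./chunked_output").read()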

Data Concatenations

Original:

“Chunk 1”

Col_A | Col_B | Col_C
1     | a     | 0.01

“Chunk 2”

Col_A | Col_B | Col_C
2     | b     | 0.02

Changes:

“Concatenated data”

Col_A | Col_B | Col_C
1     | a     | 0.01
2     | b     | 0.02

Example of data concatenation performed on simple tables of similar data “chunks”.

Data concatenation within CytoTable involves bringing two or more data “chunks” with the same columns together as a unified dataset. Just as chunking slices data apart, concatenation brings the pieces back together. Data concatenation within CytoTable typically occurs using a ParquetWriter to assist with composing a single file from many individual files.
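As an illustrative sketch of the general ParquetWriter pattern (not CytoTable's internal implementation; file names hypothetical):

import pyarrow.parquet as pq

chunk_files = ["chunk_1.parquet", "chunk_2.parquet"]  # hypothetical chunk files

# establish a shared schema from the first chunk
schema = pq.read_schema(chunk_files[0])

# stream each chunk's rows into a single concatenated file
with pq.ParquetWriter("concatenated.parquet", schema=schema) as writer:
    for chunk_file in chunk_files:
        writer.write_table(pq.read_table(chunk_file))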

Data Joins

Original:

“Table 1” (notice Col_C)

Col_A | Col_B | Col_C
1     | a     | 0.01

“Table 2” (notice Col_Z)

Col_A | Col_B | Col_Z
1     | a     | 2024-01-01

Changes:

“Joined data” (as Table 1 left-joined with Table 2)

Col_A | Col_B | Col_C | Col_Z
1     | a     | 0.01  | 2024-01-01

Join Specification in SQL
SELECT *
FROM Table_1
LEFT JOIN Table_2 ON
Table_1.Col_A = Table_2.Col_A;

Example of a data join performed on simple example tables.

Data joins within CytoTable involve bringing one or more data sources together with differing columns as a new dataset. The word “join” here is interpreted through SQL-based terminology on joins. Joins may be specified in CytoTable using DuckDB-style SQL through convert(..., joins="SELECT * FROM ... JOIN ...", ...). Also see CytoTable’s presets within presets.config (also available via the GitHub source code).
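A sketch of passing a custom join through convert(); the DuckDB-style SQL below uses hypothetical table and column names (the join statements CytoTable actually ships with may be found in presets.config):

import cytotable

# hypothetical DuckDB-style join across two compartment tables
custom_join = """
    SELECT *
    FROM read_parquet('per_image.parquet') AS per_image
    LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
        per_cells.ImageNumber = per_image.ImageNumber
"""

cytotable.convert(
    source_path="./measurements.sqlite",  # hypothetical
    dest_path="./joined.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    joins=custom_join,
)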

Note: data software outside of CytoTable sometimes makes use of the term “merge” to describe capabilities which are similar to join (for example, pandas.DataFrame.merge). Within CytoTable, we opt to describe these operations with “join” to avoid confusion and to stay aligned with the terminology of the technologies used (for example, DuckDB SQL includes no MERGE keyword).

Pagination

CytoTable uses keyset pagination to help keep memory usage reasonable for a given system when working with large datasets. Pagination, sometimes also called paging or “data chunking”, allows CytoTable to avoid loading entire datasets into memory at once while accomplishing tasks. Keyset pagination leverages existing column data as pagesets to perform data extractions which focus on only a subset of the data as defined within the pageset keys (see example usage below). We use keyset pagination to reduce the overall memory footprint during extractions, where other methods may inadvertently not scale for whole-dataset work (such as offset-based pagination, which extracts and then drops the offset data).

Keyset pagination definitions may be defined using the page_keys parameter: convert(..., page_keys={"table_name": "column_name"}, ...). The page_keys parameter expects a dictionary where the keys are table names and the values are columns to be used for the keyset pagination pages. Pagination works in conjunction with the chunk_size parameter, which indicates the size of each page. We provide preset configurations for these parameters through the preset parameter (convert(..., preset="", ...)). Customizing the chunk_size or page_keys parameters allows you to tune the process to the size of your data and the resources available on your system. For large datasets, smaller chunk sizes or specific pagination columns can help manage the workload by extracting smaller, more manageable subsets of data at a time.
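A sketch (table and column names hypothetical) of tuning pagination for a larger dataset:

import cytotable

cytotable.convert(
    source_path="./large_measurements.sqlite",  # hypothetical
    dest_path="./large_measurements.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    # paginate each named table on an ordered, ideally indexed column
    page_keys={"Per_Image": "ImageNumber", "Per_Cells": "ObjectNumber"},
    # each extracted page contains at most 5000 rows
    chunk_size=5000,
)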