Overview

This page provides a brief overview of CytoTable topics. For an introduction to using CytoTable, please see the tutorial page.

Presets and Manual Overrides

Various preset configurations, defined under presets.config, are available within CytoTable and affect how data are read and produced. These presets are intended to assist with common data source expectations. By default, CytoTable uses the “cellprofiler_csv” preset. Please note that these presets may not capture all possible outcomes. Use manual overrides within convert() as needed.
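For example, a minimal sketch (with hypothetical file paths) of relying on the default “cellprofiler_csv” preset while manually overriding one of its settings:

# Minimal sketch: convert CellProfiler CSV output to Parquet.
# File paths here are hypothetical.
import cytotable

cytotable.convert(
    source_path="./cellprofiler_output",  # directory of CellProfiler CSV files
    dest_path="./converted.parquet",
    dest_datatype="parquet",
    # preset="cellprofiler_csv" is the default and may be omitted;
    # manual overrides such as chunk_size take precedence over preset values
    chunk_size=500,
)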

Data Sources

flowchart LR
    images[("Image\nfile(s)")]:::outlined --> image-tools[Image Analysis Tools]:::outlined
    image-tools --> measurements[("Measurement\nfile(s)")]:::green
    measurements --> CytoTable:::green

    classDef outlined fill:#fff,stroke:#333
    classDef green fill:#97F0B4,stroke:#333

Data sources for CytoTable are measurement data created by other cell biology image analysis tools. These measurement data are the focus of the data source content which follows.

Data Source Locations

Data sources may be provided to CytoTable using local filepaths or remote object-storage filepaths (for example, AWS S3, GCP Cloud Storage, Azure Storage). We use cloudpathlib under the hood to reference files in a unified way, whether they’re local or remote.

Cloud Data Sources

CytoTable uses cloudpathlib to access cloud-based data sources. CytoTable supports:

  • Amazon Web Services (AWS) S3

  • Google Cloud Storage (GCS)

  • Azure Blob Storage

Cloud Service Configuration and Authentication

Remote object storage paths which require authentication or other specialized configuration may be accessed by passing cloudpathlib client arguments (for S3Client, AzureBlobClient, or GSClient) through convert(..., **kwargs).

For example, remote AWS S3 paths which are public-facing and do not require authentication (like, or similar to, aws s3 ... --no-sign-request) may be used via convert(..., no_sign_request=True).

Each cloud service provider may have different requirements for authentication (there is no fully unified API for these). Please see the cloudpathlib client documentation for more information on which arguments may be used for configuration with specific cloud providers (for example, S3Client, GSClient, or AzureBlobClient).
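As a sketch (bucket and file names hypothetical), reading from a public-facing S3 bucket without credentials might look like this:

import cytotable

cytotable.convert(
    source_path="s3://example-public-bucket/analysis.sqlite",  # hypothetical public bucket
    dest_path="./analysis.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    # forwarded through convert() keyword arguments to cloudpathlib's
    # S3Client; skips AWS credential signing for public buckets
    no_sign_request=True,
)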

Cloud Service File Type Parsing Differences

Data sources retrieved from cloud services are not all treated the same due to technical constraints. See below for a description of how each file type is handled.

Comma-separated values (.csv):

CytoTable reads cloud-based CSV files directly.

SQLite Databases (.sqlite):

CytoTable downloads cloud-based SQLite databases locally before other CytoTable processing. This is necessary to account for differences in how SQLite’s virtual file system (VFS) operates in context with cloud service object storage.

Note: Large SQLite files stored in the cloud may benefit from explicit local cache specification through a special keyword argument (**kwarg) passed through CytoTable to cloudpathlib called local_cache_dir (see the cloudpathlib documentation on caching). This argument helps ensure that constraints on default OS temporary file storage locations (for example, file size limitations and periodic deletions outside of CytoTable) do not impede the ability to download or work with the data.

A quick example of how this argument is used: convert(..., local_cache_dir="non_temporary_directory", ...).
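A slightly fuller sketch (paths hypothetical) of using local_cache_dir for a large cloud-based SQLite file:

import cytotable

cytotable.convert(
    source_path="s3://example-bucket/large_database.sqlite",  # hypothetical
    dest_path="./large_database.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    # cache the downloaded SQLite file outside default OS temporary storage
    local_cache_dir="./sqlite_cache",
)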

Future work to enable direct SQLite data access from cloud locations for CytoTable will be documented within GitHub issue CytoTable/#70.

Data Source Types

Data source compatibility for CytoTable is focused on (but not explicitly limited to) the following.

CellProfiler Data Sources

  • Comma-separated values (.csv): “A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.” (reference) CellProfiler CSV data sources generally follow the format provided as output by CellProfiler ExportToSpreadsheet.

      • Manual specification: CSV data source types may be manually specified by using convert(..., source_datatype="csv", ...).

      • Preset specification: CSV data sources from CellProfiler may use the configuration preset convert(..., preset="cellprofiler_csv", ...).

  • SQLite Databases (.sqlite): “SQLite database files are commonly used as containers to transfer rich content between systems and as a long-term archival format for data.” (reference) CellProfiler SQLite database sources may follow a format provided as output by CellProfiler ExportToDatabase or cytominer-database.

      • Manual specification: SQLite data source types may be manually specified by using convert(..., source_datatype="sqlite", ...).

      • Preset specification: SQLite data sources from CellProfiler may use the configuration preset convert(..., preset="cellprofiler_sqlite", ...).

IN Carta Data Sources

  • Manual specification: CSV data source types may be manually specified by using convert(..., source_datatype="csv", ...).

  • Preset specification: CSV data sources from In Carta Image Analysis Software may use the configuration preset convert(..., preset="in-carta", ...), as shown in the sketch below.
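A sketch (paths hypothetical) combining a manual source type specification with the IN Carta preset:

import cytotable

cytotable.convert(
    source_path="./in_carta_output",  # hypothetical directory of IN Carta CSV files
    dest_path="./in_carta.parquet",
    dest_datatype="parquet",
    source_datatype="csv",  # manual source type specification
    preset="in-carta",
)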

Data Destinations

Data Destination Locations

Converted data destinations may be provided to CytoTable using only local filepaths (in contrast to data sources, which may also be remote). Specify the converted data destination using convert(..., dest_path="<a local filepath>").

Data Destination Types

  • Apache Parquet (.parquet): “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.” (reference)

Parquet data destination type may be specified by using convert(..., dest_datatype="parquet", ...).
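Because the destination is a local Parquet file, converted output may be inspected with any Parquet-capable reader. For example, a sketch (paths hypothetical) using pandas:

import cytotable
import pandas as pd

cytotable.convert(
    source_path="./measurements.csv",  # hypothetical
    dest_path="./measurements.parquet",
    dest_datatype="parquet",
)

# inspect the converted Parquet destination
print(pd.read_parquet("./measurements.parquet").head())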

Data Transformations

CytoTable performs various types of data transformations. This section helps define terminology and expectations surrounding the use of these terms. CytoTable might use one or all of these transformations depending on user configuration.

Data Chunking

Original:

“Data source”

Col_A | Col_B | Col_C
1     | a     | 0.01
2     | b     | 0.02

Changes:

“Chunk 1”

Col_A | Col_B | Col_C
1     | a     | 0.01

“Chunk 2”

Col_A | Col_B | Col_C
2     | b     | 0.02

Example of data chunking performed on a simple table of data.

Data chunking within CytoTable involves slicing data sources into “chunks” of rows which all contain the same columns and have a lower number of rows than the original data source. CytoTable uses data chunking through the chunk_size argument (convert(..., chunk_size=1000, ...)) to reduce the memory footprint of operations on subsets of data. CytoTable may be used to create chunked data output by disabling concatenation and joins, e.g. convert(..., concat=False, join=False, ...). Parquet “datasets” are an abstraction which may be used to read CytoTable output data chunks which are not concatenated or joined (for example, see the PyArrow documentation or Pandas documentation on using source paths which are directories).
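A sketch (paths hypothetical) of producing chunked output and reading the resulting files back as one logical Parquet dataset:

import cytotable
import pyarrow.parquet as pq

# disable concatenation and joins so per-chunk Parquet files remain
cytotable.convert(
    source_path="./measurements.csv",  # hypothetical
    dest_path="./chunked_output",
    dest_datatype="parquet",
    chunk_size=1000,
    concat=False,
    join=False,
)

# read the directory of chunk files as a single logical dataset
# (assumes the chunk files under this path share one schema)
table = pq.ParquetDataset("./chunked_output").read()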

Data Concatenations

Original:

“Chunk 1”

Col_A | Col_B | Col_C
1     | a     | 0.01

“Chunk 2”

Col_A | Col_B | Col_C
2     | b     | 0.02

Changes:

“Concatenated data”

Col_A | Col_B | Col_C
1     | a     | 0.01
2     | b     | 0.02

Example of data concatenation performed on simple tables of similar data “chunks”.

Data concatenation within CytoTable involves bringing two or more data “chunks” with the same columns together as a unified dataset. Just as chunking slices data apart, concatenation brings the pieces back together. Data concatenation within CytoTable typically occurs using a ParquetWriter to assist with composing a single file from many individual files.
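As an illustrative sketch of the general ParquetWriter pattern (not CytoTable's internal implementation; file names hypothetical):

import pyarrow.parquet as pq

chunk_files = ["chunk_1.parquet", "chunk_2.parquet"]  # hypothetical chunk files

# establish a shared schema from the first chunk
schema = pq.read_schema(chunk_files[0])

# stream each chunk's rows into a single concatenated file
with pq.ParquetWriter("concatenated.parquet", schema=schema) as writer:
    for chunk_file in chunk_files:
        writer.write_table(pq.read_table(chunk_file))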

Data Joins

Original:

“Table 1” (notice Col_C)

Col_A | Col_B | Col_C
1     | a     | 0.01

“Table 2” (notice Col_Z)

Col_A | Col_B | Col_Z
1     | a     | 2024-01-01

Changes:

“Joined data” (as Table 1 left-joined with Table 2)

Col_A | Col_B | Col_C | Col_Z
1     | a     | 0.01  | 2024-01-01

Join Specification in SQL
SELECT *
FROM Table_1
LEFT JOIN Table_2 ON
Table_1.Col_A = Table_2.Col_A;

Example of a data join performed on simple example tables.

Data joins within CytoTable involve bringing one or more data sources together with differing columns as a new dataset. The word “join” here is interpreted through SQL-based terminology on joins. Joins may be specified in CytoTable using DuckDB-style SQL through convert(..., joins="SELECT * FROM ... JOIN ...", ...). Also see CytoTable’s presets within presets.config (also available via the GitHub source code).
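A sketch of passing a custom join through convert(); the DuckDB-style SQL below uses hypothetical table and column names (the join statements CytoTable actually ships with may be found in presets.config):

import cytotable

# hypothetical DuckDB-style join across two compartment tables
custom_join = """
    SELECT *
    FROM read_parquet('per_image.parquet') AS per_image
    LEFT JOIN read_parquet('per_cells.parquet') AS per_cells ON
        per_cells.ImageNumber = per_image.ImageNumber
"""

cytotable.convert(
    source_path="./measurements.sqlite",  # hypothetical
    dest_path="./joined.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    joins=custom_join,
)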

Note: data software outside of CytoTable sometimes makes use of the term “merge” to describe capabilities which are similar to join (for example, pandas.DataFrame.merge). Within CytoTable, we opt to describe these operations with “join” to avoid confusion and to stay aligned with the terminology of the technologies used (for example, DuckDB SQL includes no MERGE keyword).

Pagination

CytoTable uses keyset pagination to help keep memory usage reasonable for a given system when working with large datasets. Pagination, sometimes also called paging or “data chunking”, allows CytoTable to avoid loading entire datasets into memory at once while accomplishing tasks. Keyset pagination leverages existing column data as pagesets to perform data extractions which focus on only a subset of the data as defined within the pageset keys (see example usage below). We use keyset pagination to reduce the overall memory footprint during extractions, where other methods may inadvertently not scale for whole-dataset work (such as offset-based pagination, which extracts and then drops the offset data).

Keyset pagination definitions may be defined using the page_keys parameter: convert(..., page_keys={"table_name": "column_name"}, ...). The page_keys parameter expects a dictionary where the keys are table names and the values are columns to be used for the keyset pagination pages. Pagination works in conjunction with the chunk_size parameter, which indicates the size of each page. We provide preset configurations for these parameters through the preset parameter (convert(..., preset="", ...)). Customizing the chunk_size or page_keys parameters allows you to tune the process to the size of your data and the resources available on your system. For large datasets, smaller chunk sizes or specific pagination columns can help manage the workload by extracting smaller, more manageable subsets of data at a time.
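A sketch (table and column names hypothetical) of tuning pagination for a larger dataset:

import cytotable

cytotable.convert(
    source_path="./large_measurements.sqlite",  # hypothetical
    dest_path="./large_measurements.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
    # paginate each named table on an ordered, ideally indexed column
    page_keys={"Per_Image": "ImageNumber", "Per_Cells": "ObjectNumber"},
    # each extracted page contains at most 5000 rows
    chunk_size=5000,
)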