Overview¶
This page provides a brief overview of CytoTable topics. For a brief introduction on how to use CytoTable, please see the tutorial page.
Presets and Manual Overrides¶
Various preset configurations, defined under presets.config, are available within CytoTable and affect how data are read and produced.
These presets are intended to assist with common data source expectations.
By default, CytoTable will use the “cellprofiler_csv” preset.
Please note that these presets may not capture all possible outcomes.
Use manual overrides within convert() as needed.
Data Sources¶
Flowchart: image file(s) → image analysis tools → measurement file(s) → CytoTable.
Data sources for CytoTable are measurement data created from other cell biology image analysis tools. These measurement data are the focus of the data source content which follows.
Data Source Locations¶
Data sources may be provided to CytoTable using local filepaths or remote object-storage filepaths (for example, AWS S3, GCP Cloud Storage, Azure Storage). We use cloudpathlib under the hood to reference files in a unified way, whether they’re local or remote.
Cloud Data Sources¶
CytoTable uses cloudpathlib to access cloud-based data sources. CytoTable supports:
Amazon S3:
s3://bucket_name/object_name
Google Cloud Storage:
gs://bucket_name/object_name
Azure Blob Storage:
az://container_name/blob_name
Cloud Service Configuration and Authentication¶
Remote object storage paths which require authentication or other specialized configuration may use cloudpathlib client arguments (S3Client, AzureBlobClient, GSClient) passed through convert(..., **kwargs) (see convert()).
For example, remote AWS S3 paths which are public-facing and do not require authentication (similar to using aws s3 ... --no-sign-request) may be used via convert(..., no_sign_request=True) (see convert()).
Each cloud service provider may have different requirements for authentication (there is no fully unified API for these).
Please see the cloudpathlib client documentation for more information on which arguments may be used for configuration with specific cloud providers (for example, S3Client, GSClient, or AzureBlobClient).
Cloud Service File Type Parsing Differences¶
Data sources retrieved from cloud services are not all treated the same due to technical constraints. See below for a description of how each file type is treated for a better understanding of expectations.
Comma-separated values (.csv):
CytoTable reads cloud-based CSV files directly.
SQLite Databases (.sqlite):
CytoTable downloads cloud-based SQLite databases locally before other CytoTable processing. This is necessary to account for differences in how SQLite’s virtual file system (VFS) operates in context with cloud service object storage.
Note: Large SQLite files stored in the cloud may benefit from an explicit local cache location, specified through the keyword argument local_cache_dir, which CytoTable passes through to cloudpathlib as part of **kwargs. See the cloudpathlib documentation on caching.
This argument helps ensure constraints surrounding temporary local file storage locations do not impede the ability to download or work with the data (for example, file size limitations and periodic deletions outside of CytoTable might be encountered within default OS temporary file storage locations).
A quick example of how this argument is used: convert(..., local_cache_dir="non_temporary_directory", ...) (see convert()).
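The note above can be sketched as follows. This is a minimal illustration, not CytoTable's own code: the helper `ensure_cache_dir`, the wrapper `convert_cloud_sqlite`, and the `source_path` argument name are assumptions introduced here, while `dest_path`, `dest_datatype`, `preset`, and `local_cache_dir` are the arguments named on this page.

```python
from pathlib import Path


def ensure_cache_dir(cache_dir: str) -> str:
    """Create a persistent (non-temporary) cache directory if it is missing."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return str(path.resolve())


def convert_cloud_sqlite(source_path: str, dest_path: str, preset: str, cache_dir: str):
    """Hypothetical wrapper: convert a cloud-hosted SQLite file with an explicit cache."""
    # Imported lazily so this sketch can be loaded without CytoTable installed.
    from cytotable import convert

    return convert(
        source_path=source_path,  # assumed argument name for the data source
        dest_path=dest_path,  # must be a local filepath
        dest_datatype="parquet",
        preset=preset,  # choose a preset suited to the SQLite source
        # Forwarded by CytoTable to cloudpathlib as a keyword argument.
        local_cache_dir=ensure_cache_dir(cache_dir),
    )
```

Creating the directory up front keeps the download location under your control rather than in the operating system's default temporary storage.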
Future work to enable direct SQLite data access from cloud locations for CytoTable will be documented within GitHub issue CytoTable/#70.
Data Source Types¶
Data source compatibility for CytoTable is focused on (but not explicitly limited to) the following.
CellProfiler Data Sources¶
Comma-separated values (.csv): “A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.” (reference) CellProfiler CSV data sources generally follow the format provided as output by CellProfiler ExportToSpreadsheet.
SQLite Databases (.sqlite): “SQLite database files are commonly used as containers to transfer rich content between systems and as a long-term archival format for data.” (reference) CellProfiler SQLite database sources may follow a format provided as output by CellProfiler ExportToDatabase or cytominer-database.
IN Carta Data Sources¶
Comma-separated values (.csv): Molecular Devices IN Carta software provides output data in CSV format.
Data Destinations¶
Data Destination Locations¶
Converted data destinations may be provided to CytoTable using only local filepaths (in contrast to data sources, which may also be remote).
Specify the converted data destination using convert(..., dest_path="<a local filepath>") (see convert()).
Data Destination Types¶
Apache Parquet (.parquet): “Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.” (reference)
The Parquet data destination type may be specified using convert(..., dest_datatype="parquet", ...) (see convert()).
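As a hedged sketch of the destination options above, the wrapper below checks that the destination is local before calling convert(). The names `is_local_destination` and `convert_to_parquet`, and the `source_path` argument name, are assumptions made for illustration. The scheme check is deliberately simple and treats any path containing `://` as remote.

```python
def is_local_destination(dest_path: str) -> bool:
    """Return True when the path carries no URL-style scheme such as s3://."""
    return "://" not in dest_path


def convert_to_parquet(source_path: str, dest_path: str):
    """Hypothetical wrapper around CytoTable's convert() for Parquet output."""
    if not is_local_destination(dest_path):
        raise ValueError("dest_path must be a local filepath")
    # Imported lazily so this sketch can be loaded without CytoTable installed.
    from cytotable import convert

    return convert(
        source_path=source_path,  # assumed argument name for the data source
        dest_path=dest_path,
        dest_datatype="parquet",
        preset="cellprofiler_csv",  # the default preset named on this page
    )
```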
Data Transformations¶
CytoTable performs various types of data transformations. This section helps define terminology and expectations surrounding the use of this terminology. CytoTable might use one or all of these depending on user configuration.
Data Chunking¶
Original: “Data source”. Changes: “Chunk 1” and “Chunk 2”.
Example of data chunking performed on a simple table of data.
Data chunking within CytoTable involves slicing data sources into “chunks” of rows which all contain the same columns and have a lower number of rows than the original data source.
CytoTable uses data chunking through the chunk_size argument (convert(..., chunk_size=1000, ...); see convert()) to reduce the memory footprint of operations on subsets of data.
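The idea of chunking can be illustrated with a short standard-library sketch. It is not CytoTable's implementation. The function `chunk_rows` and the sample rows are made up for this example, and only the `chunk_size` name mirrors the argument described above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_rows(rows: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size rows, preserving the original order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up sample data: five rows which all share the same two columns.
data_source = [{"Col_A": i, "Col_B": i * 10} for i in range(1, 6)]
chunks = list(chunk_rows(data_source, chunk_size=2))
```

Each chunk keeps the same columns as the source and holds fewer rows, so only one chunk needs to be in memory at a time.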
CytoTable may be used to create chunked data output by disabling concatenation and joins, for example convert(..., concat=False, join=False, ...) (see convert()).
Parquet “datasets” are an abstraction which may be used to read CytoTable output data chunks which are not concatenated or joined (for example, see PyArrow documentation or Pandas documentation on using source paths which are directories).
Data Concatenations¶
Original: “Chunk 1” and “Chunk 2”. Changes: “Concatenated data”.
Example of data concatenation performed on simple tables of similar data “chunks”.
Data concatenation within CytoTable involves bringing two or more data “chunks” with the same columns together as a unified dataset. Just as chunking slices data apart, concatenation brings them together. Data concatenation within CytoTable typically occurs using a ParquetWriter to assist with composing a single file from many individual files.
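The single-writer pattern described above can be sketched with the standard library. This example swaps in Python's csv module in place of a ParquetWriter so that it runs without extra dependencies. The function `concatenate_chunks` and the sample chunks are assumptions for illustration only.

```python
import csv
import io

# Made-up chunks which share the same columns.
chunk_1 = [{"Col_A": 1, "Col_B": "a"}, {"Col_A": 2, "Col_B": "b"}]
chunk_2 = [{"Col_A": 3, "Col_B": "c"}]


def concatenate_chunks(chunks, output) -> int:
    """Write every chunk to one output through a single writer; return the row count."""
    fieldnames = list(chunks[0][0].keys())
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    total = 0
    for chunk in chunks:
        for row in chunk:
            # Concatenation only makes sense when the columns match.
            if list(row.keys()) != fieldnames:
                raise ValueError("all chunks must share the same columns")
        writer.writerows(chunk)
        total += len(chunk)
    return total


buffer = io.StringIO()
row_count = concatenate_chunks([chunk_1, chunk_2], buffer)
# Read the unified output back; csv returns every value as a string.
concatenated = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```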
Data Joins¶
Original: “Table 1” (notice Col_C) and “Table 2” (notice Col_Z). Changes: “Joined data” (Table 1 left-joined with Table 2).
Join specification in SQL:
SELECT *
FROM Table_1
LEFT JOIN Table_2 ON
Table_1.Col_A = Table_2.Col_A;
Example of a data join performed on simple example tables.
Data joins within CytoTable involve bringing one or more data sources together with differing columns as a new dataset.
The word “join” here is interpreted through SQL-based terminology on joins.
Joins may be specified in CytoTable using DuckDB-style SQL through convert(..., joins="SELECT * FROM ... JOIN ...", ...) (see convert()).
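The left join above can be tried with a small runnable sketch. It uses Python's built-in sqlite3 module in place of DuckDB so that it needs no extra dependencies. The table contents are made-up sample values, and only the table and column names follow the example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE Table_1 (Col_A INTEGER, Col_B TEXT, Col_C TEXT);
    CREATE TABLE Table_2 (Col_A INTEGER, Col_Z TEXT);
    INSERT INTO Table_1 VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
    INSERT INTO Table_2 VALUES (1, 'd'), (2, 'e');
    """
)

# Every row of Table_1 is kept; Col_Z is NULL where Table_2 has no match.
cursor = connection.execute(
    """
    SELECT Table_1.Col_A, Table_1.Col_B, Table_1.Col_C, Table_2.Col_Z
    FROM Table_1
    LEFT JOIN Table_2 ON Table_1.Col_A = Table_2.Col_A
    ORDER BY Table_1.Col_A;
    """
)
joined_rows = cursor.fetchall()
connection.close()
```

The row with Col_A = 3 has no partner in Table_2, so its Col_Z value comes back as None in Python.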
Also see CytoTable’s presets in presets.config, or the GitHub source code for presets.config.
Note: data software outside of CytoTable sometimes uses the term “merge” to describe capabilities which are similar to join (for example, pandas.DataFrame.merge).
Within CytoTable, we opt to describe these operations with “join” to avoid confusion with software development alongside the technologies used (for example, DuckDB SQL includes no MERGE keyword).
Pagination¶
CytoTable uses keyset pagination to keep memory usage within limits that are reasonable for a given system when working with large datasets. Pagination, sometimes also called paging or “data chunking”, allows CytoTable to avoid loading entire datasets into memory at once. Keyset pagination uses existing column data as pagesets, so each extraction focuses only on the subset of data defined by the pageset keys (see example usage below). We use keyset pagination to reduce the overall memory footprint during extractions where other methods may not scale to whole-dataset work (such as offset-based pagination, which extracts and then drops the offset data).
Keyset pagination may be defined using the page_keys parameter: convert(..., page_keys={"table_name": "column_name"}, ...) (see convert()).
The page_keys parameter expects a dictionary whose keys are table names and whose values are the columns used for keyset pagination.
Pagination is implemented in conjunction with the chunk_size parameter, which indicates the size of each page.
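To make the relationship between page keys and chunk size concrete, here is a generic keyset pagination sketch using Python's built-in sqlite3 module. It is an illustration under stated assumptions and may differ from CytoTable's internals. The helpers `build_pagesets` and `extract_page`, the table `image`, and the column `ImageNumber` are hypothetical, and the key column is assumed to hold unique values.

```python
import sqlite3


def build_pagesets(keys, chunk_size):
    """Split sorted key values into (first_key, last_key) ranges of chunk_size keys."""
    ordered = sorted(keys)
    return [
        (ordered[start], ordered[min(start + chunk_size, len(ordered)) - 1])
        for start in range(0, len(ordered), chunk_size)
    ]


def extract_page(conn, table, key_column, pageset):
    """Read only the rows whose key falls inside one pageset."""
    query = (
        f'SELECT * FROM "{table}" '
        f'WHERE "{key_column}" BETWEEN ? AND ? ORDER BY "{key_column}"'
    )
    return conn.execute(query, pageset).fetchall()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE image (ImageNumber INTEGER, Value REAL)")
connection.executemany(
    "INSERT INTO image VALUES (?, ?)", [(i, i * 0.5) for i in range(1, 11)]
)

# Same shape as the page_keys argument: table name mapped to a key column.
page_keys = {"image": "ImageNumber"}
key_column = page_keys["image"]
keys = [row[0] for row in connection.execute(f'SELECT "{key_column}" FROM image')]
pagesets = build_pagesets(keys, chunk_size=4)
pages = [extract_page(connection, "image", key_column, p) for p in pagesets]
connection.close()
```

Because each query filters on the key range directly, the database never has to read and discard earlier rows, which is the cost that offset-based pagination pays.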
We provide preset configurations for these parameters through the preset parameter: convert(..., preset="", ...).
Customizing the chunk_size or page_keys parameters allows you to tune the process to the size of your data and the resources available on your system.
For large datasets, smaller chunk sizes or specific pagination columns can help manage the workload by extracting smaller, more manageable portions of data at a time.