Python API
Convert
CytoTable: convert - transforming data for use with pycytominer.
- cytotable.convert._run_export_workflow(source_path: str, dest_path: str, source_datatype: str | None, metadata: List[str] | Tuple[str, ...] | None, compartments: List[str] | Tuple[str, ...] | None, identifying_columns: List[str] | Tuple[str, ...] | None, concat: bool, join: bool, joins: str | None, chunk_size: int | None, infer_common_schema: bool, drop_null: bool, sort_output: bool, page_keys: Dict[str, str], dest_datatype: Literal['parquet', 'anndata_h5ad', 'anndata_zarr'] = 'parquet', data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, **kwargs) Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]¶
Export data to various formats (e.g., parquet) based on configuration.
- Parameters:
source_path (str) – str reference to read source files from. Note: may be local or remote object-storage location using convention “s3://…” or similar.
dest_path (str) – Path to write files to. This path will be used for intermediary data work and must be a new file or directory path. This parameter will result in a directory on join=False. This parameter will result in a single file on join=True. Note: this may only be a local path.
source_datatype (Optional[str]) – Source datatype to focus on during conversion.
metadata (Optional[Union[List[str], Tuple[str, ...]]]) – Metadata names to use for conversion.
compartments (Optional[Union[List[str], Tuple[str, ...]]]) – Compartment names to use for conversion.
identifying_columns (Optional[Union[List[str], Tuple[str, ...]]]) – Column names which are used as IDs and, as a result, are ignored when renaming.
concat (bool) – Whether to concatenate similar files together.
join (bool) – Whether to join the compartment data together into one dataset.
joins (Optional[str]) – DuckDB-compatible SQL which will be used to perform the join operations.
chunk_size (Optional[int]) – Size of join chunks which is used to limit data size during join ops.
infer_common_schema (bool) – Whether to infer a common schema when concatenating sources.
drop_null (bool) – Whether to drop null results.
sort_output (bool) – Specifies whether to sort cytotable output or not.
page_keys (Dict[str, str]) – A dictionary which defines which column names are used for keyset pagination in order to perform data extraction.
dest_datatype (Literal["parquet", "anndata_h5ad", "anndata_zarr"]) – Output destination datatype to write to. Defaults to ‘parquet’.
data_type_cast_map (Optional[Dict[str, str]]) – A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
add_tablenumber (Optional[bool]) – Whether to add a calculated tablenumber column which helps differentiate repeated values (such as ObjectNumber) across source data.
**kwargs – Keyword args used for gathering source data, primarily relevant for Cloudpathlib cloud-based client configuration.
- Returns:
Grouped sources which include metadata about destination filepath where parquet file was written or a string filepath for the joined result.
- Return type:
Union[Dict[str, List[Dict[str, Any]]], List[Any], str]
- Raises:
CytoTableException – Raised when a source group lacks a matching page_keys entry for non-NPZ source data, since pagination requires a key column.
- cytotable.convert.convert(source_path: str, dest_path: str, dest_datatype: Literal['parquet', 'anndata_h5ad', 'anndata_zarr'] = 'parquet', dest_backend: Literal['parquet', 'iceberg'] = 'parquet', image_dir: str | None = None, include_source_images: bool = False, mask_dir: str | None = None, outline_dir: str | None = None, segmentation_file_regex: Dict[str, str] | None = None, source_datatype: str | None = None, metadata: List[str] | Tuple[str, ...] | None = None, compartments: List[str] | Tuple[str, ...] | None = None, identifying_columns: List[str] | Tuple[str, ...] | None = None, concat: bool = True, join: bool = True, joins: str | None = None, chunk_size: int | None = None, infer_common_schema: bool = True, drop_null: bool = False, data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, page_keys: Dict[str, str] | None = None, bbox_column_map: Dict[str, str] | None = None, sort_output: bool = True, preset: str | None = 'cellprofiler_csv', parsl_config: Config | None = None, **kwargs) Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]¶
Convert file-based data from various sources to Pycytominer-compatible standards.
Note: source paths may be local or remote object-storage location using convention “s3://…” or similar.
- Parameters:
source_path (str) – str reference to read source files from. Note: may be local or remote object-storage location using convention “s3://…” or similar.
dest_path (str) – Path to write files to. Setting dest_backend="parquet" will trigger CytoTable to use the provided path to perform intermediary data processing. The path must represent a new file or directory. This parameter will result in a directory on join=False. This parameter will result in a single file on join=True. Setting dest_backend="iceberg" will trigger CytoTable to use the provided path as the local warehouse root directory. CytoTable still stages parquet files internally (during write), but these intermediary files are temporary and automatically deleted following write of the final output at dest_path.
dest_datatype (Literal["parquet", "anndata_h5ad", "anndata_zarr"]) – Output destination datatype to write to. CytoTable uses this value when the selected backend is "parquet". For dest_backend="iceberg", CytoTable currently requires dest_datatype="parquet" because CytoTable uses parquet as the temporary staging format before it writes data into the Iceberg warehouse.
dest_backend (Literal["parquet", "iceberg"]) – Output backend to write to. Defaults to "parquet". Use "iceberg" to store processed CytoTable tables in a local Iceberg warehouse.
image_dir (Optional[str]) – Optional directory or cloud object-storage prefix of source images aligned with the experiment of interest. CytoTable uses this input to build OME-Arrow image crops and, when include_source_images=True, full-image rows in the iceberg table called images.source_images. Requires dest_backend="iceberg".
include_source_images (bool) – Whether to also store full source images in an Iceberg images.source_images table. Requires image_dir and dest_backend="iceberg".
mask_dir (Optional[str]) – Optional directory or cloud object-storage prefix of segmentation masks corresponding to images within image_dir. CytoTable uses these files to populate ome_arrow_label when no outline image is available. Requires dest_backend="iceberg".
outline_dir (Optional[str]) – Optional directory or cloud object-storage prefix of outline images corresponding to images within image_dir. CytoTable uses these files to populate ome_arrow_label before falling back to mask_dir. Requires dest_backend="iceberg".
segmentation_file_regex (Optional[Dict[str, str]]) – Optional regex mapping of segmentation filename patterns to source image filename patterns to link masks and/or outlines. For example, use {r".*_outline\.tiff$": r"(plateA_well_B03_site_1)\.tiff$"} when outline files and source images do not share the same basename. Requires dest_backend="iceberg".
source_datatype (Optional[str]) – Source datatype to focus on during conversion.
metadata (Optional[Union[List[str], Tuple[str, ...]]]) – Metadata names to use for conversion.
compartments (Optional[Union[List[str], Tuple[str, ...]]]) – Compartment names to use for conversion.
identifying_columns (Optional[Union[List[str], Tuple[str, ...]]]) – Column names which are used as IDs and, as a result, are ignored when renaming.
concat (bool) – Whether to concatenate similar files together.
join (bool) – Whether to join the compartment data together into one dataset.
joins (Optional[str]) – DuckDB-compatible SQL which will be used to perform the join operations.
chunk_size (Optional[int]) – Size of join chunks which is used to limit data size during join ops.
infer_common_schema (bool) – Whether to infer a common schema when concatenating sources.
drop_null (bool) – Whether to drop nan/null values from results.
data_type_cast_map (Optional[Dict[str, str]]) – A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
add_tablenumber (Optional[bool]) – Whether to add a calculated tablenumber which helps differentiate various repeated values (such as ObjectNumber) within source data. Useful for processing multiple SQLite or CSV data sources together to retain distinction from each dataset.
page_keys (Optional[Dict[str, str]]) – The table and column names to be used for key pagination. Uses the form: {"table_name": "column_name"}. Expects columns to include numeric data (ints or floats). Interacts with the chunk_size parameter to form pages of chunk_size.
bbox_column_map (Optional[Dict[str, str]]) – Optional dictionary that explicitly maps image crop bounding box columns using keys x_min, x_max, y_min, and y_max. For Iceberg profile exports, CytoTable recodes the provided bounding box value pairs as new columns in joined_profiles as Metadata_SourceBBoxXMin, Metadata_SourceBBoxXMax, Metadata_SourceBBoxYMin, and Metadata_SourceBBoxYMax.
sort_output (bool) – Specifies whether to sort cytotable output or not.
preset (Optional[str]) – An optional group of presets to use based on common configurations.
parsl_config (Optional[parsl.Config]) – Optional Parsl configuration to use for running CytoTable operations. Note: when using CytoTable multiple times in the same process, CytoTable will use the first provided configuration for all runs.
**kwargs – Additional keyword args forwarded to source-gathering and backend-specific writers (for example, Cloudpathlib client configuration or Iceberg-specific options).
- Returns:
Grouped sources which include metadata about destination filepath where parquet file was written or str of joined result filepath.
- Return type:
Union[Dict[str, List[Dict[str, Any]]], List[Any], str]
- Raises:
CytoTableException – Raised when input options are inconsistent (for example, when a source group lacks a matching page_keys entry for non-NPZ data).
DatatypeException – Raised when source_datatype is not a supported source type.
Example
    from cytotable import convert

    # using a local path with cellprofiler csv presets
    convert(
        source_path="./tests/data/cellprofiler/ExampleHuman",
        source_datatype="csv",
        dest_path="ExampleHuman.parquet",
        dest_datatype="parquet",
        preset="cellprofiler_csv",
    )

    # using an s3-compatible path with no signature for client
    # and cellprofiler csv presets
    convert(
        source_path="s3://s3path",
        source_datatype="csv",
        dest_path="s3_local_result",
        dest_datatype="parquet",
        concat=True,
        preset="cellprofiler_csv",
        no_sign_request=True,
    )

    # using local path with cellprofiler sqlite presets
    convert(
        source_path="example.sqlite",
        dest_path="example.parquet",
        dest_datatype="parquet",
        preset="cellprofiler_sqlite",
    )
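The parameter descriptions for convert() state several cross-option requirements between dest_backend, dest_datatype, and the image-related arguments. The following is a minimal sketch of those stated rules only. The helper name check_convert_options is hypothetical and is not part of CytoTable; the library performs its own validation, which may differ in detail.

```python
from typing import List, Optional


def check_convert_options(
    dest_backend: str = "parquet",
    dest_datatype: str = "parquet",
    image_dir: Optional[str] = None,
    include_source_images: bool = False,
    mask_dir: Optional[str] = None,
    outline_dir: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: list violations of the documented option rules."""
    problems = []

    # Iceberg output is staged as parquet, so dest_datatype must be "parquet".
    if dest_backend == "iceberg" and dest_datatype != "parquet":
        problems.append('dest_backend="iceberg" requires dest_datatype="parquet"')

    # Image-related directories are documented as Iceberg-only options.
    image_options = {
        "image_dir": image_dir,
        "mask_dir": mask_dir,
        "outline_dir": outline_dir,
    }
    for name, value in image_options.items():
        if value is not None and dest_backend != "iceberg":
            problems.append(f'{name} requires dest_backend="iceberg"')

    # Full source images need both an image directory and the Iceberg backend.
    if include_source_images and (image_dir is None or dest_backend != "iceberg"):
        problems.append(
            'include_source_images requires image_dir and dest_backend="iceberg"'
        )

    return problems


# A parquet run with an image directory breaks one documented rule.
problems = check_convert_options(dest_backend="parquet", image_dir="images")
```

An empty list means the combination is consistent with the descriptions above; it does not guarantee that convert() will accept the call.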
- cytotable.convert._concat_join_sources(*args, **kwargs)¶
Concatenate join sources from parquet-based chunks.
For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html
- Parameters:
sources (Dict[str, List[Dict[str, Any]]]) – Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.
dest_path (str) – Destination path to write file-based content.
join_sources (List[str]) – List of local filepath destination for join source chunks which will be concatenated.
dest_datatype (Literal["parquet", "anndata_h5ad", "anndata_zarr"]) – The datatype of the output destination file. Default is ‘parquet’.
sort_output (bool) – Specifies whether to sort cytotable output or not.
- Returns:
Path to concatenated file which is created as a result of this function.
- Return type:
str
- cytotable.convert._concat_source_group(*args, **kwargs)¶
Concatenate group of source data together as single file.
For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html
Notes: this function presumes a multi-directory, multi-file common data structure for compartments and other data. For example:
Source (file tree):
    root
    ├── subdir_1
    │   └── Cells.csv
    └── subdir_2
        └── Cells.csv
Becomes:
    # earlier data read into parquet chunks from multiple
    # data source files.
    read_data = [
        {"table": ["cells-1.parquet", "cells-2.parquet"]},
        {"table": ["cells-1.parquet", "cells-2.parquet"]},
    ]

    # focus of this function
    concatted = [{"table": ["cells.parquet"]}]
- Parameters:
source_group_name (str) – Name of the data source group (for common compartments, etc).
source_group (List[Dict[str, Any]]) – Data structure containing grouped data for concatenation.
dest_path (str) – Optional destination path for concatenated sources.
common_schema (Optional[List[Tuple[str, str]]]) – Common schema to use for concatenation amongst arrow tables which may have slightly different but compatible schema.
sort_output (bool) – Specifies whether to sort cytotable output or not.
- Returns:
Updated dictionary containing concatenated sources.
- Return type:
List[Dict[str, Any]]
- Raises:
SchemaException – Raised when source files cannot be unified under a common schema during concatenation.
OSError – Re-raised when cleaning up the source-group directory fails with an errno other than
ENOTEMPTY.
- cytotable.convert._get_table_columns_and_types(*args, **kwargs)¶
Gather column data from table through duckdb.
- Parameters:
source (Dict[str, Any]) – Contains source data details. Represents a single file or table of some kind.
sort_output (bool) – Specifies whether to sort cytotable output or not.
- Returns:
list of dictionaries which each include column level information
- Return type:
List[Optional[Dict[str, str]]]
- Raises:
duckdb.Error – Re-raised when the underlying duckdb query fails for a reason other than mixed-type errors on a sqlite source.
- cytotable.convert._get_table_keyset_pagination_sets(*args, **kwargs)¶
Get table data chunk keys for later use in capturing segments of values. This work also provides a chance to catch problematic input data which will be ignored with warnings.
- Parameters:
chunk_size (int) – The size in rowcount of the chunks to create.
page_key (str) – The column name to be used to identify pagination chunks. Expected to be of numeric type (int, float) for ordering.
source (Optional[Dict[str, Any]]) – Contains the source data to be chunked. Represents a single file or table of some kind.
sql_stmt (Optional[str]) – Optional sql statement to form the pagination set from. Default behavior extracts pagination sets from the full data source.
- Returns:
List of keys to use for reading the data later on.
- Return type:
Union[List[Optional[Tuple[Union[int, float], Union[int, float]]]], List[None], None]
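Keyset pagination splits a table into pages described by inclusive (start, end) bounds on an ordered key column, so each page can be read later with a range filter instead of an offset. The sketch below illustrates the idea in plain Python. It is not CytoTable's implementation (which, per the description above, works against the source data itself); the function name keyset_pagination_sets is hypothetical, and the sketch assumes the key values are unique per row.

```python
from typing import List, Sequence, Tuple, Union

Number = Union[int, float]


def keyset_pagination_sets(
    keys: Sequence[Number], chunk_size: int
) -> List[Tuple[Number, Number]]:
    """Hypothetical helper: inclusive (start, end) key bounds per page."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")

    # Order the keys so that each page covers a contiguous key range.
    ordered = sorted(keys)

    # Take the first and last key of every chunk_size-long slice.
    return [
        (ordered[i], ordered[min(i + chunk_size, len(ordered)) - 1])
        for i in range(0, len(ordered), chunk_size)
    ]


# Seven keys paged three at a time give two full pages and one partial page.
pages = keyset_pagination_sets([4, 1, 7, 2, 6, 3, 5], chunk_size=3)
```

Each pair can then drive a filter of the form `WHERE key BETWEEN start AND end`. If key values repeat across rows, neighbouring pages built this way could share a bound, so a real implementation has to account for duplicates.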
- cytotable.convert._infer_source_group_common_schema(*args, **kwargs)¶
Infers a common schema for a group of parquet files which may have similar but slightly different schema or data. Intended to assist with data concatenation and other operations.
- Parameters:
source_group (List[Dict[str, Any]]) – Group of one or more data sources which includes metadata about path to parquet data.
data_type_cast_map (Optional[Dict[str, str]]) – A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
- Returns:
A list of tuples which includes column name and PyArrow datatype. This data will later be used as the basis for forming a PyArrow schema.
- Return type:
List[Tuple[str, pa.DataType]]
- cytotable.convert._join_source_pageset(*args, **kwargs)¶
Join sources based on join group keys (group of specific join column values)
- Parameters:
dest_path (str) – Destination path to write file-based content.
joins (str) – DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.
page_key (str) – Column name used to filter rows for the current pageset.
pageset (Union[Tuple[int, int], None]) – Inclusive (start, end) bounds on page_key for the chunk being processed; None selects the entire joined result.
sort_output (bool) – Whether to sort the joined output by page_key.
drop_null (bool) – Whether to drop rows with null values within the resulting joined data.
- Returns:
Path to joined file which is created as a result of this function.
- Return type:
str
- cytotable.convert._prepare_join_sql(*args, **kwargs)¶
Prepare join SQL statement with actual locations of data based on the sources.
- Parameters:
sources (Dict[str, List[Dict[str, Any]]]) – Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.
joins (str) – DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.
- Returns:
String representing the SQL to be used in later join work.
- Return type:
str
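One way to picture preparing join SQL is as a text substitution: a placeholder filename in the SQL is replaced with the list of parquet files actually written for that source group. The sketch below assumes a placeholder convention of the form 'cells.parquet' inside read_parquet(...) and a "table" key holding file paths (as in the _concat_source_group example above). Both the convention and the helper name prepare_join_sql are assumptions for illustration, not a description of CytoTable's internals.

```python
from pathlib import Path
from typing import Any, Dict, List


def prepare_join_sql(sources: Dict[str, List[Dict[str, Any]]], joins: str) -> str:
    """Hypothetical helper: swap placeholder filenames for real parquet paths."""
    for group_name, group in sources.items():
        # Collect every parquet path recorded for this source group.
        table_paths = [str(path) for source in group for path in source["table"]]

        # Assumed placeholder convention: the lowercase group stem plus ".parquet".
        placeholder = f"'{Path(group_name).stem.lower()}.parquet'"

        # A Python list of strings renders as a DuckDB-style list literal.
        joins = joins.replace(placeholder, str(table_paths))
    return joins


sources = {
    "cytoplasm.csv": [{"table": ["out/cytoplasm.parquet"]}],
    "cells.csv": [{"table": ["out/cells-1.parquet", "out/cells-2.parquet"]}],
}
template = (
    "SELECT * FROM read_parquet('cytoplasm.parquet') AS cytoplasm "
    "LEFT JOIN read_parquet('cells.parquet') AS cells USING (ImageNumber)"
)
prepared = prepare_join_sql(sources, template)
```

Plain string replacement keeps the sketch short but is only safe when paths contain no quote characters.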
- cytotable.convert._prep_cast_column_data_types(*args, **kwargs)¶
Cast data types per what is received in cast_map.
Example:
- columns: [{"column_id": 0, "column_name": "colname", "column_dtype": "DOUBLE"}]
- data_type_cast_map: {"float": "float32"}
The above passed through this function will set the "column_dtype" value to a "REAL" dtype ("REAL" in duckdb is roughly equivalent to "float32").
- Parameters:
columns (List[Dict[str, str]]) – Column metadata records (each with column_id, column_name, and column_dtype) describing the columns to cast.
data_type_cast_map (Dict[str, str]) – A dictionary mapping data type groups to specific types. Intended to roughly align with DuckDB types: https://duckdb.org/docs/sql/data_types/overview
Note: includes synonym matching for common naming convention use in Pandas and/or PyArrow via cytotable.utils.DATA_TYPE_SYNONYMS
- Returns:
list of dictionaries which each include column level information
- Return type:
List[Dict[str, str]]
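The example in the description (a DOUBLE column becoming REAL under {"float": "float32"}) can be sketched as a lookup in two small tables: one that groups DuckDB type names and one that translates the requested target type into a DuckDB name. The tables and the helper name cast_column_dtypes below are abbreviated, hypothetical stand-ins; the actual function also applies synonym matching via cytotable.utils.DATA_TYPE_SYNONYMS, which this sketch does not reproduce.

```python
from typing import Any, Dict, List

# Hypothetical, abbreviated lookup tables used only for this illustration.
DUCKDB_TYPE_GROUPS = {
    "float": {"REAL", "FLOAT", "DOUBLE"},
    "integer": {"SMALLINT", "INTEGER", "BIGINT"},
}
TARGET_TO_DUCKDB = {
    "float32": "REAL",
    "float64": "DOUBLE",
    "int32": "INTEGER",
    "int64": "BIGINT",
}


def cast_column_dtypes(
    columns: List[Dict[str, Any]], data_type_cast_map: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: rewrite column_dtype according to the cast map."""
    updated = []
    for column in columns:
        new_column = dict(column)  # copy so the input records stay untouched
        for group, target in data_type_cast_map.items():
            if column["column_dtype"] in DUCKDB_TYPE_GROUPS.get(group, set()):
                # Fall back to the raw target name when no translation is known.
                new_column["column_dtype"] = TARGET_TO_DUCKDB.get(target, target)
        updated.append(new_column)
    return updated


columns = [{"column_id": 0, "column_name": "colname", "column_dtype": "DOUBLE"}]
casted = cast_column_dtypes(columns, {"float": "float32"})
```

Columns whose type falls outside every group named in the cast map pass through unchanged.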
- cytotable.convert._set_tablenumber(*args, **kwargs)¶
Gathers a "TableNumber" from the image table (if CSV) or SQLite file (if SQLite source). The TableNumber is a unique identifier intended to help differentiate between ImageNumber values so that single-cell profiles referenced across multiple source data exports remain distinct records. For example, ImageNumber column values from CellProfiler will repeat across exports, meaning we may lose distinction when combining multiple export files together through CytoTable.
Note:
- If using CSV data sources, the image.csv table is used for checksum.
- If using SQLite data sources, the entire SQLite database is used for checksum.
- Parameters:
sources (Dict[str, List[Dict[str, Any]]]) – Contains metadata about data tables and related contents.
add_tablenumber (Optional[bool]) – Whether to add a calculated tablenumber. Note: when False, adds None as the tablenumber
- Returns:
New source group with added TableNumber details.
- Return type:
Dict[str, List[Dict[str, Any]]]
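The note above says the TableNumber is derived from a checksum of the image table or SQLite file. As an illustration of why a file checksum separates exports that reuse the same ImageNumber values, here is a minimal sketch using CRC32 from the standard library. The choice of CRC32 and the helper name checksum_tablenumber are assumptions; this page does not state which checksum CytoTable uses.

```python
import tempfile
import zlib
from pathlib import Path


def checksum_tablenumber(path: Path, chunk_bytes: int = 1 << 20) -> int:
    """Hypothetical helper: CRC32 of a file's bytes, read in chunks."""
    checksum = 0
    with open(path, "rb") as handle:
        # Feed the running checksum back in so large files stream through.
        while chunk := handle.read(chunk_bytes):
            checksum = zlib.crc32(chunk, checksum)
    return checksum


with tempfile.TemporaryDirectory() as tmp:
    # Two exports that both contain ImageNumber 1 but describe different images.
    export_a = Path(tmp) / "export_a_Image.csv"
    export_b = Path(tmp) / "export_b_Image.csv"
    export_a.write_bytes(b"ImageNumber,FileName\n1,a.tiff\n")
    export_b.write_bytes(b"ImageNumber,FileName\n1,b.tiff\n")

    tablenumber_a = checksum_tablenumber(export_a)
    tablenumber_b = checksum_tablenumber(export_b)
    tablenumber_a_again = checksum_tablenumber(export_a)
```

The same file always yields the same number, while the two exports receive different numbers even though their ImageNumber values collide.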
- cytotable.convert._prepend_column_name(*args, **kwargs)¶
Rename columns using the source group name, avoiding identifying columns.
Notes:
* A source_group_name represents a filename referenced as part of what is specified within targets.
- Parameters:
table_path (str) – Path to a parquet file which will be modified.
source_group_name (str) – Name of the data source group (for common compartments, etc).
identifying_columns (List[str]) – Column names which are used as IDs and, as a result, need to be treated differently when renaming.
metadata (Union[List[str], Tuple[str, ...]]) – List of source data names which are used as metadata.
compartments (List[str]) – List of source data names which are used as compartments.
- Returns:
Path to the modified file.
- Return type:
str
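The renaming idea can be sketched on a plain list of column names: measurement columns receive a prefix derived from the source group name, while identifying columns are handled separately. The exact prefix format and the treatment of identifying columns below (left unchanged) are assumptions for illustration, and prepend_column_names is a hypothetical name; the real function operates on a parquet file and also takes metadata and compartments into account.

```python
from pathlib import Path
from typing import List, Sequence


def prepend_column_names(
    columns: Sequence[str],
    source_group_name: str,
    identifying_columns: Sequence[str],
) -> List[str]:
    """Hypothetical helper: prefix non-identifying columns with the group name."""
    # Assumed convention: "cells.csv" becomes the prefix "Cells".
    prefix = Path(source_group_name).stem.capitalize()

    renamed = []
    for name in columns:
        if name in identifying_columns or name.startswith(f"{prefix}_"):
            # Identifying or already-prefixed columns are kept as they are.
            renamed.append(name)
        else:
            renamed.append(f"{prefix}_{name}")
    return renamed


renamed = prepend_column_names(
    ["ImageNumber", "ObjectNumber", "AreaShape_Area"],
    source_group_name="cells.csv",
    identifying_columns=["ImageNumber", "ObjectNumber"],
)
```

Prefixing keeps same-named measurements from different compartments distinct once the tables are joined.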
- cytotable.convert._source_pageset_to_parquet(*args, **kwargs)¶
Export source data to chunked parquet file using chunk size and offsets.
- Parameters:
source_group_name (str) – Name of the source group (for ex. compartment or metadata table name).
source (Dict[str, Any]) – Contains the source data to be chunked. Represents a single file or table of some kind along with collected information about table.
pageset (Optional[Tuple[Union[int, float], Union[int, float]]]) – The pageset for chunking the data from source.
dest_path (str) – Path to store the output data.
sort_output (bool) – Specifies whether to sort cytotable output or not.
- Returns:
A string of the output filepath.
- Return type:
str
- Raises:
CytoTableException – Raised when pageset is None for non-NPZ source types, since a pageset range is required for table queries.
duckdb.Error – Re-raised when the duckdb-based read fails for a reason other than a mixed-type sqlite error.
Access
Generic table access helpers for Parquet files and Iceberg warehouses.
- cytotable.warehouse.access.list_tables(path: str | Path, *, include_views: bool = True) list[str][source]¶
List available table names from a Parquet path or Iceberg warehouse.
- cytotable.warehouse.access.read_table(path: str | Path, table_name: str | None = None) DataFrame[source]¶
Read a table from a Parquet path or Iceberg warehouse.
Iceberg
Utilities for reading and writing local Iceberg warehouses with CytoTable.
- class cytotable.warehouse.iceberg.TinyCatalog[source]¶
Bases: object
Placeholder catalog when pyiceberg is unavailable.
- cytotable.warehouse.iceberg.catalog(warehouse_path: str | Path, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) TinyCatalog[source]¶
Open a local Iceberg warehouse and return its tiny catalog.
- cytotable.warehouse.iceberg.describe_iceberg_warehouse(warehouse_path: str | Path, include_views: bool = True, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) DataFrame[source]¶
Summarize tables and saved views within a local Iceberg warehouse.
- cytotable.warehouse.iceberg.list_iceberg_tables(warehouse_path: str | Path, include_views: bool = True, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) list[str][source]¶
List fully qualified tables and optional views in a local Iceberg warehouse.
- cytotable.warehouse.iceberg.read_iceberg_table(warehouse_path: str | Path, table_name: str, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) DataFrame[source]¶
Read an Iceberg table or saved SQL view from a local warehouse.
- cytotable.warehouse.iceberg.write_iceberg_warehouse(source_path: str, warehouse_path: str | Path, source_datatype: str | None = None, metadata: Tuple[str, ...] | list[str] | None = None, compartments: Tuple[str, ...] | list[str] | None = None, identifying_columns: Tuple[str, ...] | list[str] | None = None, joins: str | None = None, chunk_size: int | None = None, infer_common_schema: bool = True, data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, page_keys: Dict[str, str] | None = None, sort_output: bool = True, preset: str | None = 'cellprofiler_csv', image_dir: str | None = None, mask_dir: str | None = None, outline_dir: str | None = None, bbox_column_map: Dict[str, str] | None = None, segmentation_file_regex: Dict[str, str] | None = None, include_source_images: bool = False, default_namespace: str = DEFAULT_NAMESPACE, images_namespace: str = DEFAULT_IMAGES_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE, profiles_table_name: str = DEFAULT_PROFILES_TABLE, profile_with_images_view_name: str | None = DEFAULT_PROFILE_WITH_IMAGES_VIEW, parsl_config: Config | None = None, **kwargs) str[source]¶
Write a CytoTable Iceberg warehouse from raw source data.
This helper powers convert(..., dest_backend="iceberg") and accepts the same core conversion arguments for source selection, joins, chunking, and image export. See cytotable.convert.convert() for the shared argument semantics; this function adds Iceberg-specific options such as default_namespace, images_namespace, registry_file, profiles_table_name, and profile_with_images_view_name.
- Parameters:
source_path (str) – Source path passed through to the underlying conversion. See cytotable.convert.convert().
warehouse_path (Union[str, Path]) – Filesystem path at which to create the Iceberg warehouse root. Must not already exist.
source_datatype (Optional[str]) – See cytotable.convert.convert().
metadata (Optional[Tuple[str, ...] | list[str]]) – See cytotable.convert.convert().
compartments (Optional[Tuple[str, ...] | list[str]]) – See cytotable.convert.convert().
identifying_columns (Optional[Tuple[str, ...] | list[str]]) – See cytotable.convert.convert().
joins (Optional[str]) – See cytotable.convert.convert().
chunk_size (Optional[int]) – See cytotable.convert.convert().
infer_common_schema (bool) – See cytotable.convert.convert().
data_type_cast_map (Optional[Dict[str, str]]) – See cytotable.convert.convert().
add_tablenumber (Optional[bool]) – See cytotable.convert.convert().
page_keys (Optional[Dict[str, str]]) – See cytotable.convert.convert().
sort_output (bool) – See cytotable.convert.convert().
preset (Optional[str]) – See cytotable.convert.convert().
image_dir (Optional[str]) – See cytotable.convert.convert().
mask_dir (Optional[str]) – See cytotable.convert.convert().
outline_dir (Optional[str]) – See cytotable.convert.convert().
bbox_column_map (Optional[Dict[str, str]]) – See cytotable.convert.convert().
segmentation_file_regex (Optional[Dict[str, str]]) – See cytotable.convert.convert().
include_source_images (bool) – See cytotable.convert.convert().
default_namespace (str) – Iceberg namespace under which the profiles table is registered.
images_namespace (str) – Iceberg namespace under which image-related tables are registered when image export is enabled.
registry_file (str) – Filename of the CytoTable registry file written under the warehouse root. Used to record warehouse tables and views.
profiles_table_name (str) – Name of the joined profiles table written into default_namespace.
profile_with_images_view_name (Optional[str]) – Optional view name registered when image export is enabled, joining profile rows with their corresponding image rows.
parsl_config (Optional[parsl.Config]) – See cytotable.convert.convert().
**kwargs – Additional keyword args forwarded to source-gathering. See cytotable.convert.convert().
- Returns:
Path to the created Iceberg warehouse root.
- Return type:
str
- Raises:
CytoTableException – Raised when warehouse_path already exists or when image-export options are inconsistent (for example, ancillary image options provided without image_dir, image_dir/mask_dir/outline_dir referencing a missing directory, missing join SQL, or page_keys lacking a non-empty 'join' entry while image export is requested).
ImportError – Raised when the optional pyiceberg dependency is unavailable.
ValueError – Raised when the join configuration for Iceberg export is missing: an empty joins SQL string or a page_keys mapping without a non-empty 'join' entry.
Images
Helpers for exporting image crops alongside CytoTable measurement data.
- class cytotable.warehouse.images.BBoxColumns(x_min: str, x_max: str, y_min: str, y_max: str)[source]¶
Bases: object
Bounding box column names for cropped image export.
- x_max: str¶
- x_min: str¶
- y_max: str¶
- y_min: str¶
- class cytotable.warehouse.images.FileIndex(by_relative: dict[str, Path | AnyPath], by_basename: dict[str, list[Path | AnyPath]], by_stem: dict[str, list[Path | AnyPath]])[source]¶
Bases: object
Relative-path-first index for image-like files in a directory tree.
- by_basename: dict[str, list[Path | AnyPath]]¶
- by_relative: dict[str, Path | AnyPath]¶
- by_stem: dict[str, list[Path | AnyPath]]¶
- cytotable.warehouse.images._build_file_index(file_dir: str | None, path_kwargs: Dict[str, Any] | None = None) FileIndex[source]¶
Build a relative-path-first index for image-like files in a directory tree.
- cytotable.warehouse.images._build_stable_image_crop_id(key_fields: dict[str, Any], image_column: str, image_name: str, bbox: dict[str, int] | None = None) str[source]¶
Build a deterministic identifier for one object/image crop row.
- cytotable.warehouse.images._build_stable_object_id(key_fields: dict[str, Any], bbox: dict[str, int] | None = None) str[source]¶
Build a deterministic object identifier for warehouse image rows.
- cytotable.warehouse.images._build_stable_source_image_id(key_fields: dict[str, Any], image_column: str, image_name: str) str[source]¶
Build a deterministic identifier for one source image row.
- cytotable.warehouse.images._crop_ome_arrow(image_path: Path | AnyPath, bbox: dict[str, int]) dict[str, Any][source]¶
Lazily crop a TIFF-backed image into an OME-Arrow struct.
- cytotable.warehouse.images._extract_image_key_fields(row: Series) dict[str, Any][source]¶
Extract image-level key fields to carry into source image rows.
- cytotable.warehouse.images._extract_key_fields(row: Series) dict[str, Any][source]¶
Extract practical measurement key fields to carry into the image table.
- cytotable.warehouse.images._find_matching_segmentation_path(data_value: str, pattern_map: dict[str, str] | None, file_dir: str | None, candidate_path: Path | AnyPath, file_index: FileIndex | None = None, lookup_cache: dict[str, Path | AnyPath | None] | None = None, path_kwargs: Dict[str, Any] | None = None) Path | AnyPath | None[source]¶
Resolve a matching mask/outline file path for an image value.
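The segmentation_file_regex mapping described under convert() pairs a segmentation filename pattern with a source image filename pattern. A minimal sketch of how such a mapping could link files is shown below, operating on plain filename strings. The matching order, the use of re.search, and the helper name match_segmentation_file are assumptions; the actual function also consults file indexes, caches, and candidate paths.

```python
import re
from typing import Dict, Iterable, Optional


def match_segmentation_file(
    image_name: str,
    segmentation_names: Iterable[str],
    pattern_map: Dict[str, str],
) -> Optional[str]:
    """Hypothetical helper: pick a mask/outline filename for an image filename."""
    for segmentation_pattern, image_pattern in pattern_map.items():
        # Only consider entries whose image-side pattern matches this image.
        if not re.search(image_pattern, image_name):
            continue
        # Return the first segmentation file (in sorted order) that matches.
        for candidate in sorted(segmentation_names):
            if re.search(segmentation_pattern, candidate):
                return candidate
    return None


# Mapping taken from the segmentation_file_regex example under convert().
pattern_map = {r".*_outline\.tiff$": r"(plateA_well_B03_site_1)\.tiff$"}
segmentation_files = ["cells_mask.tiff", "run7_outline.tiff"]

linked = match_segmentation_file(
    "plateA_well_B03_site_1.tiff", segmentation_files, pattern_map
)
unlinked = match_segmentation_file(
    "plateA_well_C01_site_1.tiff", segmentation_files, pattern_map
)
```

This is what allows an outline file to be linked to an image even when the two do not share a basename.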
- cytotable.warehouse.images._local_image_io_path(path: Path | AnyPath) Path[source]¶
Return a local path for image I/O, caching cloud files when needed.
- cytotable.warehouse.images._normalize_file_value(value: Any) str | None[source]¶
Normalize a file-like value to a comparable path string.
- cytotable.warehouse.images._read_ome_arrow(image_path: Path | AnyPath) dict[str, Any][source]¶
Lazily load a full TIFF-backed image into an OME-Arrow struct.
- cytotable.warehouse.images._relative_index_key(path: Path | AnyPath, root: Path | AnyPath) str[source]¶
Build a normalized relative key for a file under an index root.
- cytotable.warehouse.images._require_ome_arrow() tuple[Any, Any][source]¶
Import and return OME-Arrow objects needed for crop export.
- cytotable.warehouse.images._resolve_image_columns(data: DataFrame) list[str][source]¶
Find joined-table columns that look like image filename columns.
- cytotable.warehouse.images._resolve_indexed_path(normalized_value: str, file_index: FileIndex) Path | AnyPath | None[source]¶
Resolve a normalized path string against a relative-path-first file index.
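A relative-path-first index, as FileIndex is described above, can be pictured as three dictionaries consulted in order: exact relative path, then basename, then stem. The sketch below mirrors the three field names (by_relative, by_basename, by_stem) but is otherwise an assumption. In particular, the rule of accepting a basename or stem match only when it is unambiguous is a design choice of this sketch, not a documented behaviour, and SimpleFileIndex, build_index, and resolve are hypothetical names.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Optional


@dataclass
class SimpleFileIndex:
    """Hypothetical stand-in mirroring the FileIndex field names."""

    by_relative: Dict[str, PurePosixPath] = field(default_factory=dict)
    by_basename: Dict[str, List[PurePosixPath]] = field(default_factory=dict)
    by_stem: Dict[str, List[PurePosixPath]] = field(default_factory=dict)


def build_index(root: PurePosixPath, paths: Iterable[PurePosixPath]) -> SimpleFileIndex:
    index = SimpleFileIndex()
    for path in paths:
        index.by_relative[path.relative_to(root).as_posix()] = path
        index.by_basename.setdefault(path.name, []).append(path)
        index.by_stem.setdefault(path.stem, []).append(path)
    return index


def resolve(value: str, index: SimpleFileIndex) -> Optional[PurePosixPath]:
    # 1) An exact relative path wins outright.
    if value in index.by_relative:
        return index.by_relative[value]
    # 2) Then basename, 3) then stem, each only when exactly one file matches.
    wanted = PurePosixPath(value)
    for matches in (
        index.by_basename.get(wanted.name, []),
        index.by_stem.get(wanted.stem, []),
    ):
        if len(matches) == 1:
            return matches[0]
    return None


root = PurePosixPath("/data/images")
index = build_index(
    root,
    [root / "plate1" / "a.tiff", root / "plate2" / "a.tiff", root / "plate2" / "b.tiff"],
)
```

Trying the relative path first keeps files with the same name in different subdirectories from being confused with each other.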
- cytotable.warehouse.images._strip_null_fields_from_type(data_type: DataType) DataType[source]¶
Remove null-typed fields from nested Arrow types for Iceberg compatibility.
- cytotable.warehouse.images._strip_null_fields_from_value(value: Any, data_type: DataType) Any[source]¶
Remove values corresponding to null-typed nested Arrow fields.
- cytotable.warehouse.images._validated_bbox_values(row: Series, bbox_columns: BBoxColumns) dict[str, int] | None[source]¶
Validate and normalize row bbox values for image cropping.
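Validating bounding box values before cropping typically means checking that all four values are present, finite, and describe a non-empty region. The following sketch shows that idea on a plain mapping instead of a pandas Series. The specific checks, the truncation to int, the CellProfiler-style column names, and the helper name validated_bbox are illustrative assumptions and not CytoTable's exact rules.

```python
import math
from typing import Any, Dict, Mapping, Optional

BBOX_KEYS = ("x_min", "x_max", "y_min", "y_max")


def validated_bbox(
    row: Mapping[str, Any], bbox_columns: Mapping[str, str]
) -> Optional[Dict[str, int]]:
    """Hypothetical helper: integer bbox values, or None when unusable."""
    bbox: Dict[str, int] = {}
    for key in BBOX_KEYS:
        try:
            value = float(row[bbox_columns[key]])
        except (KeyError, TypeError, ValueError):
            return None  # missing column or non-numeric value
        if not math.isfinite(value):
            return None  # NaN or infinite coordinates cannot be cropped
        bbox[key] = int(value)

    # Reject empty or inverted boxes.
    if bbox["x_max"] <= bbox["x_min"] or bbox["y_max"] <= bbox["y_min"]:
        return None
    return bbox


# Column names below are illustrative assumptions.
bbox_columns = {
    "x_min": "AreaShape_BoundingBoxMinimum_X",
    "x_max": "AreaShape_BoundingBoxMaximum_X",
    "y_min": "AreaShape_BoundingBoxMinimum_Y",
    "y_max": "AreaShape_BoundingBoxMaximum_Y",
}
row = {
    "AreaShape_BoundingBoxMinimum_X": 10.0,
    "AreaShape_BoundingBoxMaximum_X": 42.0,
    "AreaShape_BoundingBoxMinimum_Y": 5,
    "AreaShape_BoundingBoxMaximum_Y": 30,
}
bbox = validated_bbox(row, bbox_columns)
```

Returning None for unusable rows lets a caller skip the crop for that object without aborting the whole chunk.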
- cytotable.warehouse.images.add_object_id_to_profiles_frame(joined_frame: DataFrame, bbox_column_map: Dict[str, str] | None = None) DataFrame[source]¶
Add a stable object identifier column to a joined profiles frame.
- cytotable.warehouse.images.image_crop_table_from_joined_chunk(chunk_path: str, image_dir: str, mask_dir: str | None = None, outline_dir: str | None = None, bbox_column_map: Dict[str, str] | None = None, segmentation_file_regex: Dict[str, str] | None = None, path_kwargs: Dict[str, Any] | None = None) Table[source]¶
Build an Arrow table of OME-Arrow image crops from one joined parquet chunk.
- cytotable.warehouse.images.object_id(name: str | UUID | None = None, *, prefix: str = 'obj') str[source]¶
Return a stable string identifier with a UUID-shaped payload.
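A stable identifier with a UUID-shaped payload can be produced with a name-based UUID, which maps the same input string to the same UUID every time. The sketch below uses uuid.uuid5 from the standard library. The namespace, the "prefix-uuid" layout, the random fallback when no name is given, and the helper name stable_object_id are assumptions; the exact format returned by object_id() is not specified on this page.

```python
import uuid
from typing import Optional, Union


def stable_object_id(
    name: Optional[Union[str, uuid.UUID]] = None, *, prefix: str = "obj"
) -> str:
    """Hypothetical helper: prefix plus a UUID-shaped payload."""
    if isinstance(name, uuid.UUID):
        payload = name  # reuse a UUID that was supplied directly
    elif name is None:
        payload = uuid.uuid4()  # no name given: random, so not reproducible
    else:
        # Name-based UUIDs are deterministic for a given namespace and name.
        payload = uuid.uuid5(uuid.NAMESPACE_URL, name)
    return f"{prefix}-{payload}"


first = stable_object_id("plate1|well_B03|ImageNumber=1|ObjectNumber=7")
second = stable_object_id("plate1|well_B03|ImageNumber=1|ObjectNumber=7")
```

Determinism is what allows rows written at different times, or to different tables, to be joined on the identifier later.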
- cytotable.warehouse.images.profile_with_images_frame(joined_frame: DataFrame, image_frame: DataFrame, bbox_column_map: Dict[str, str] | None = None) DataFrame[source]¶
Expand joined measurement rows into stable object/image references and merge crops.
- cytotable.warehouse.images.resolve_bbox_columns(columns: Sequence[Any], bbox_column_map: Dict[str, str] | None = None) BBoxColumns | None[source]¶
Resolve bbox columns using custom mapping, CellProfiler naming, then fallback tags.
- cytotable.warehouse.images.source_image_table_from_joined_chunk(chunk_path: str, image_dir: str, mask_dir: str | None = None, outline_dir: str | None = None, segmentation_file_regex: Dict[str, str] | None = None, path_kwargs: Dict[str, Any] | None = None) Table[source]¶
Build an Arrow table of full OME-Arrow source images from one joined chunk.
Sources¶
CytoTable: sources - tasks and flows related to source data and metadata for performing conversion work.
- cytotable.sources._build_path(path: str, **kwargs) Path | AnyPath[source]¶
Build a path client or return local path.
- Parameters:
path (str) – Path to seek filepaths within.
**kwargs – Keyword arguments to be used with Cloudpathlib.CloudPath.client.
- Returns:
A local pathlib.Path or Cloudpathlib.AnyPath type path.
- Return type:
Union[pathlib.Path, AnyPath]
- cytotable.sources._file_is_more_than_one_line(path: Path | AnyPath) bool[source]¶
Check if the file has more than one line.
- Parameters:
path (Union[pathlib.Path, AnyPath]) – The path to the file.
- Returns:
True if the file has more than one line, False otherwise. For sqlite and npz files (which are not line-oriented), always returns True.
- Return type:
bool
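A minimal sketch of the check described above, assuming a local path and only the standard library. The name `has_more_than_one_line` and the `NON_LINE_ORIENTED` suffix set are hypothetical and are not CytoTable's implementation.

```python
import tempfile
from pathlib import Path

# Hypothetical suffix set (assumption): the description above says sqlite and
# npz files are not line-oriented and always return True.
NON_LINE_ORIENTED = {".sqlite", ".npz"}


def has_more_than_one_line(path: Path) -> bool:
    """Return True if the file holds at least two lines."""
    if path.suffix.lower() in NON_LINE_ORIENTED:
        return True
    with path.open("r", encoding="utf-8") as handle:
        # Stop as soon as a second line is seen; the rest is never read.
        for line_number, _ in enumerate(handle, start=1):
            if line_number > 1:
                return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    header_only = Path(tmp) / "header_only.csv"
    header_only.write_text("ImageNumber,ObjectNumber\n", encoding="utf-8")
    with_rows = Path(tmp) / "with_rows.csv"
    with_rows.write_text("ImageNumber,ObjectNumber\n1,1\n", encoding="utf-8")
    header_only_result = has_more_than_one_line(header_only)
    with_rows_result = has_more_than_one_line(with_rows)
```

A check like this lets header-only CSV files be skipped before any conversion work is scheduled.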
- cytotable.sources._filter_source_filepaths(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str) Dict[str, List[Dict[str, Any]]][source]¶
Filter source filepaths based on provided source_datatype.
- Parameters:
sources (Dict[str, List[Dict[str, Any]]]) – Grouped datasets of files which will be used by other functions.
source_datatype (str) – Source datatype to use for filtering the dataset.
- Returns:
Data structure which groups related files based on the datatype.
- Return type:
Dict[str, List[Dict[str, Any]]]
- cytotable.sources._gather_sources(source_path: str, source_datatype: str | None = None, targets: List[str] | None = None, **kwargs) Dict[str, List[Dict[str, Any]]][source]¶
Flow for gathering data sources for conversion.
- Parameters:
source_path (str) – Where to gather file-based data from.
source_datatype (Optional[str]) – The source datatype (extension) to use for reading the tables.
targets (Optional[List[str]]) – The source file names to target within the provided path.
**kwargs – Additional keyword args forwarded to the cloudpathlib client when reading source paths from cloud object storage.
- Returns:
Data structure which groups related files based on the compartments.
- Return type:
Dict[str, List[Dict[str, Any]]]
- cytotable.sources._get_source_filepaths(path: Path | AnyPath, targets: List[str] | None = None, source_datatype: str | None = None) Dict[str, List[Dict[str, Any]]][source]¶
Gather dataset of filepaths from a provided directory path.
- Parameters:
path (Union[pathlib.Path, AnyPath]) – Either a directory path to seek filepaths within or a path directly to a file.
targets (Optional[List[str]]) – Compartment and metadata names to seek within the provided path.
source_datatype (Optional[str]) – The source datatype (extension) to use for reading the tables.
- Returns:
Data structure which groups related files based on the compartments.
- Return type:
Dict[str, List[Dict[str, Any]]]
- Raises:
DatatypeException – Raised when both targets and source_datatype are unset, since at least one is required to identify source files.
NoInputDataException – Raised when no input files are found at path.
- cytotable.sources._infer_source_datatype(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str | None = None) str[source]¶
Infers and optionally validates datatype (extension) of files.
- Parameters:
sources (Dict[str, List[Dict[str, Any]]]) – Grouped datasets of files which will be used by other functions.
source_datatype (Optional[str]) – Optional source datatype to validate within the context of detected datatypes.
- Returns:
A string of the datatype detected or validated source_datatype.
- Return type:
str
- Raises:
DatatypeException – Raised when more than one datatype is inferred without an explicit source_datatype, or when the requested source_datatype is not present among the detected file suffixes.
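The inference and validation rules can be sketched as follows. This is an illustration only: the function name `infer_source_datatype`, the `"source_path"` record key, and the use of `ValueError` in place of DatatypeException are assumptions made to keep the example self-contained.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


def infer_source_datatype(
    sources: Dict[str, List[Dict[str, Any]]],
    source_datatype: Optional[str] = None,
) -> str:
    """Infer one datatype from file suffixes, or validate a requested one."""
    # Assumption: each source record stores its file under "source_path".
    suffixes = sorted(
        {
            Path(str(record["source_path"])).suffix.lstrip(".").lower()
            for group in sources.values()
            for record in group
        }
    )
    if source_datatype is None:
        if len(suffixes) != 1:
            # Stand-in for DatatypeException in this sketch.
            raise ValueError(f"Expected one datatype, found: {suffixes}")
        return suffixes[0]
    if source_datatype not in suffixes:
        raise ValueError(f"{source_datatype!r} not among detected: {suffixes}")
    return source_datatype


csv_only = {
    "cells.csv": [{"source_path": Path("data/Cells.csv")}],
    "image.csv": [{"source_path": Path("data/Image.csv")}],
}
mixed = dict(csv_only, **{"plate.sqlite": [{"source_path": Path("data/plate.sqlite")}]})

inferred = infer_source_datatype(csv_only)
validated = infer_source_datatype(mixed, source_datatype="sqlite")
```

With mixed suffixes and no explicit `source_datatype`, the sketch raises, mirroring the first failure case listed above.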
Utils¶
Utility functions for CytoTable
- cytotable.utils.Parsl_AppBase_init_for_docs(self, func, *args, **kwargs)[source]¶
A function to extend Parsl.app.app.AppBase with docstring from decorated functions rather than the decorators from Parsl. Used for Sphinx documentation purposes.
- cytotable.utils._arrow_type_cast_if_specified(column: Dict[str, str], data_type_cast_map: Dict[str, str]) Dict[str, str][source]¶
Attempts to cast data types for a PyArrow field using a provided data_type_cast_map.
- Parameters:
column (Dict[str, str]) – Dictionary which includes a column idx, name, and dtype
data_type_cast_map (Dict[str, str]) – A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html Example: {“float”: “float32”}
- Returns:
A dictionary of column information, with the data type updated where a cast applies.
- Return type:
Dict[str, str]
- cytotable.utils._cache_cloudpath_to_local(path: AnyPath) Path[source]¶
Takes a cloudpath and uses cache to convert to a local copy for use in scenarios where remote work is not possible (sqlite).
- Parameters:
path (AnyPath) – A filepath which will be checked and potentially converted to a local filepath.
- Returns:
A local pathlib.Path to cached version of cloudpath file.
- Return type:
pathlib.Path
- cytotable.utils._column_sort(value: str)[source]¶
A custom sort for column values as a list. To be used with sorted and Pyarrow tables.
- cytotable.utils._default_parsl_config()[source]¶
Return a default Parsl configuration for use with CytoTable.
- cytotable.utils._duckdb_reader() DuckDBPyConnection[source]¶
Creates a DuckDB connection with the sqlite_scanner installed and loaded.
Note: callers are expected to close the returned DuckDB connection, either by calling close() on it or by using a context manager, for example: with _duckdb_reader() as ddb_reader:
- Returns:
A configured DuckDB connection with the sqlite_scanner and httpfs extensions installed and loaded.
- Return type:
duckdb.DuckDBPyConnection
- cytotable.utils._expand_path(path: str | Path | AnyPath) Path | AnyPath[source]¶
Expands “~” user directory references with the user’s home directory, and expands variable references with values from the environment. After user/variable expansion, the path is resolved and an absolute path is returned.
- Parameters:
path (Union[str, pathlib.Path, AnyPath]) – Path to expand.
- Returns:
A local pathlib.Path or Cloudpathlib.AnyPath type path.
- Return type:
Union[pathlib.Path, AnyPath]
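For the local-path case, the expansion steps can be sketched with the standard library alone. The name `expand_local_path` and the `CYTOTABLE_EXAMPLE_DIR` variable are hypothetical, and cloud paths are not handled here.

```python
import os
from pathlib import Path


def expand_local_path(path: str) -> Path:
    """Expand "~" and environment variables, then return an absolute path."""
    expanded = os.path.expandvars(os.path.expanduser(path))
    # resolve() makes the path absolute; the target does not need to exist.
    return Path(expanded).resolve()


# Hypothetical environment variable used only for this example.
os.environ["CYTOTABLE_EXAMPLE_DIR"] = "example_data"
resolved = expand_local_path("$CYTOTABLE_EXAMPLE_DIR/plate1")
```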
- cytotable.utils._extract_npz_to_parquet(source_path: str, dest_path: str, tablenumber: int | None = None) str[source]¶
Extract data from an .npz file created by DeepProfiler as a tabular dataset and write to parquet.
DeepProfiler creates datasets which look somewhat like this:
Keys in the .npz file: [‘features’, ‘metadata’, ‘locations’]
Variable: features Shape: (229, 6400) Data type: float32
Variable: locations Shape: (229, 2) Data type: float64
Variable: metadata Shape: () Data type: object
Whole object: {
‘Metadata_Plate’: ‘SQ00014812’,
‘Metadata_Well’: ‘A01’,
‘Metadata_Site’: 1,
‘Plate_Map_Name’: ‘C-7161-01-LM6-022’,
‘RNA’: ‘SQ00014812/r01c01f01p01-ch3sk1fk1fl1.png’,
‘ER’: ‘SQ00014812/r01c01f01p01-ch2sk1fk1fl1.png’,
‘AGP’: ‘SQ00014812/r01c01f01p01-ch4sk1fk1fl1.png’,
‘Mito’: ‘SQ00014812/r01c01f01p01-ch5sk1fk1fl1.png’,
‘DNA’: ‘SQ00014812/r01c01f01p01-ch1sk1fk1fl1.png’,
‘Treatment_ID’: 0,
‘Treatment_Replicate’: 1,
‘Treatment’: ‘DMSO@NA’,
‘Compound’: ‘DMSO’,
‘Concentration’: ‘’,
‘Split’: ‘Training’,
‘Metadata_Model’: ‘efficientnet’
}
- Parameters:
source_path (str) – Path to the .npz file.
dest_path (str) – Destination path for the parquet file.
tablenumber (Optional[int]) – Optional tablenumber to be added to the data.
- Returns:
Path to the exported parquet file.
- Return type:
str
- cytotable.utils._gather_tablenumber_checksum(pathname: str, buffer_size: int = 1048576) int[source]¶
Build and return a checksum for use as a unique identifier across datasets referenced from cytominer-database: https://github.com/cytomining/cytominer-database/blob/master/cytominer_database/ingest_variable_engine.py#L129
- Parameters:
pathname (str) – Path to the file for which to generate the checksum.
buffer_size (int) – Buffer size to use for reading data.
- Returns:
An integer representing the checksum of the file at pathname.
- Return type:
int
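A buffered file checksum of this kind might look like the sketch below. It assumes a CRC32 checksum computed with `zlib`; the description above does not name the algorithm, so treat that choice and the name `crc32_file_checksum` as assumptions.

```python
import tempfile
import zlib
from pathlib import Path


def crc32_file_checksum(pathname: str, buffer_size: int = 1048576) -> int:
    """Compute a CRC32 checksum, reading buffer_size bytes at a time."""
    result = 0
    with open(pathname, "rb") as stream:
        while True:
            buffer = stream.read(buffer_size)
            if not buffer:
                break
            # Feed the running value back in so the chunks accumulate.
            result = zlib.crc32(buffer, result)
    # Mask to an unsigned 32-bit integer.
    return result & 0xFFFFFFFF


with tempfile.TemporaryDirectory() as tmp:
    example_file = Path(tmp) / "Cells.csv"
    example_content = b"ImageNumber,ObjectNumber\n1,1\n1,2\n"
    example_file.write_bytes(example_content)
    # A tiny buffer forces several reads, exercising the accumulation step.
    chunked_checksum = crc32_file_checksum(str(example_file), buffer_size=8)
    whole_checksum = zlib.crc32(example_content) & 0xFFFFFFFF
```

Reading in fixed-size buffers keeps memory use flat regardless of file size, and the result does not depend on the buffer size chosen.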
- cytotable.utils._generate_pagesets(keys: List[int | float], chunk_size: int) List[Tuple[int | float, int | float]][source]¶
Generate a pageset (keyset pagination) from a list of keys.
- Parameters:
keys (List[Union[int, float]]) – List of keys to paginate.
chunk_size (int) – Size of each chunk/page.
- Returns:
List of (start_key, end_key) tuples representing each page.
- Return type:
List[Tuple[Union[int, float], Union[int, float]]]
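Keyset pagination over a list of keys can be sketched as below. This illustrates the technique rather than CytoTable's implementation; in particular, extending a page so that duplicate keys never straddle a boundary is an assumption of this example.

```python
from typing import List, Tuple, Union

Key = Union[int, float]


def generate_pagesets(keys: List[Key], chunk_size: int) -> List[Tuple[Key, Key]]:
    """Split sorted keys into inclusive (start_key, end_key) pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    ordered = sorted(keys)
    pages: List[Tuple[Key, Key]] = []
    start = 0
    while start < len(ordered):
        end = min(start + chunk_size, len(ordered))
        # Extend the page so equal keys never straddle a page boundary,
        # which would make an inclusive range query return rows twice.
        while end < len(ordered) and ordered[end] == ordered[end - 1]:
            end += 1
        pages.append((ordered[start], ordered[end - 1]))
        start = end
    return pages


pages = generate_pagesets([1, 2, 3, 3, 4, 5, 6], chunk_size=3)
# pages == [(1, 3), (4, 6)]
```

Each tuple can then bound a range filter on the page key (for example, a `BETWEEN start_key AND end_key` clause), so every row is read exactly once.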
- cytotable.utils._get_cytotable_version() str[source]¶
Determines the version of CytoTable currently in use, using either pkg_resources or dunamai.
- Returns:
A string representing the version of CytoTable currently being used.
- Return type:
str
- cytotable.utils._glob_follow_symlinks(start: Path, pattern: str) Iterator[Path][source]¶
Like Path.glob(pattern), but follows symlinked directories on every Python version CytoTable supports. Intended for local and network filesystems only - cloud object stores have no filesystem symlinks and must use the CloudPath branches in cloud_glob().
Path.glob only gained recurse_symlinks=True in 3.13; on 3.10-3.12 we walk the tree with os.walk(followlinks=True) and match each entry’s relative path against the full pattern, so any pattern accepted by Path.glob works across versions.
- Parameters:
start (Path) – Root directory to glob under. Must reference a local or network filesystem path.
pattern (str) – Glob pattern relative to start (e.g. "**/*.csv").
- Yields:
Path – pathlib.Path entries matching pattern, deduplicated so that two paths resolving to the same real file are only yielded once.
- cytotable.utils._glob_pattern_matches(rel_parts: Tuple[str, ...], pat_parts: List[str]) bool[source]¶
Match path components against pattern components using pathlib-glob semantics: ** matches zero or more components, * and ? are fnmatch wildcards within a single component, and matching is anchored at the left of rel_parts.
- Parameters:
rel_parts (Tuple[str, ...]) – Path components of the candidate, relative to the search root (e.g. ("analysis", "Cells.csv")).
pat_parts (List[str]) – Pattern components produced by splitting the glob on "/" (e.g. ["**", "*.csv"]).
- Returns:
True if rel_parts matches pat_parts under the semantics described above, else False.
- Return type:
bool
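The matching semantics described above can be sketched as a short recursive function. This is an illustration, not CytoTable's implementation: the name `glob_parts_match` is hypothetical, and it assumes the whole relative path must be consumed by the pattern.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple


def glob_parts_match(rel_parts: Tuple[str, ...], pat_parts: List[str]) -> bool:
    """Match path components against glob pattern components."""
    if not pat_parts:
        # Pattern exhausted: succeed only if the path is exhausted too.
        return not rel_parts
    head, rest = pat_parts[0], pat_parts[1:]
    if head == "**":
        # "**" may absorb zero or more leading components.
        return any(
            glob_parts_match(rel_parts[skip:], rest)
            for skip in range(len(rel_parts) + 1)
        )
    if not rel_parts:
        return False
    # "*" and "?" are fnmatch wildcards within a single component.
    return fnmatchcase(rel_parts[0], head) and glob_parts_match(rel_parts[1:], rest)


nested = glob_parts_match(("analysis", "Cells.csv"), ["**", "*.csv"])
top_level = glob_parts_match(("Cells.csv",), ["**", "*.csv"])
too_deep = glob_parts_match(("analysis", "Cells.csv"), ["*.csv"])
```

Here `nested` and `top_level` both match because `**` can absorb one or zero components, while `too_deep` does not because `*.csv` alone covers only a single component.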
- cytotable.utils._natural_sort(list_to_sort: List[Any]) List[Any][source]¶
Sorts the given list using a natural sort, adapted from the approach at: https://stackoverflow.com/a/4836734
- Parameters:
list_to_sort (List[Any]) – The list to sort.
- Returns:
The sorted list.
- Return type:
List[Any]
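A natural sort orders embedded numbers by value rather than character by character. The sketch below shows one common way to do this with a regular-expression split; the name `natural_sort` and the case-insensitive comparison are choices made for this example.

```python
import re
from typing import Any, List


def natural_sort(list_to_sort: List[Any]) -> List[Any]:
    """Sort items so that embedded digit runs compare as integers."""

    def sort_key(item: Any) -> List[Any]:
        # Splitting on a capturing group alternates text and digit runs,
        # so every odd index holds digits and compares as an integer.
        tokens = re.split(r"([0-9]+)", str(item))
        return [
            int(token) if index % 2 else token.lower()
            for index, token in enumerate(tokens)
        ]

    return sorted(list_to_sort, key=sort_key)


ordered_names = natural_sort(["Cells_10.csv", "Cells_2.csv", "Cells_1.csv"])
# ordered_names == ["Cells_1.csv", "Cells_2.csv", "Cells_10.csv"]
```

A plain `sorted()` call would place `Cells_10.csv` before `Cells_2.csv`, which is the ordering this approach avoids.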
- cytotable.utils._parsl_loaded() bool[source]¶
Checks whether Parsl configuration has already been loaded.
- cytotable.utils._sqlite_mixed_type_query_to_parquet(source_path: str, table_name: str, page_key: str, pageset: Tuple[int | float, int | float], sort_output: bool, tablenumber: int | None = None) Table[source]¶
Performs SQLite table data extraction where one or many columns include data values of potentially mismatched type such that the data may be exported to Arrow for later use.
- Parameters:
source_path (str) – A str which is a path to a SQLite database file.
table_name (str) – The name of the table being queried.
page_key (str) – The column name to be used to identify pagination chunks.
pageset (Tuple[Union[int, float], Union[int, float]]) – The range for values used for paginating data from source.
sort_output (bool) – Specifies whether to sort cytotable output or not.
tablenumber (Optional[int]) – An optional table number to append to the results. Defaults to None.
- Returns:
A PyArrow table containing the extracted rows, with mixed-type cells coerced to nulls where storage class disagrees with column type.
- Return type:
pa.Table
- cytotable.utils._unwrap_source(source: Dict[str, AppFuture | Any] | AppFuture | Any) Dict[str, Any] | Any[source]¶
Helper function to unwrap futures from sources.
- Parameters:
source (Union[Dict[str, Union[parsl.dataflow.futures.AppFuture, Any]], Union[parsl.dataflow.futures.AppFuture, Any]]) – A source is a portion of an internal data structure used by CytoTable for processing and organizing data results. May be a dictionary of values (some of which may be Parsl futures) or a single value or future.
- Returns:
An evaluated dictionary or other value type.
- Return type:
Union[Dict[str, Any], Any]
- cytotable.utils._unwrap_value(val: AppFuture | Any) Any[source]¶
Helper function to unwrap futures from values or return values where there are no futures.
- Parameters:
val (Union[parsl.dataflow.futures.AppFuture, Any]) – A value which may or may not be a Parsl future which needs to be evaluated.
- Returns:
Returns the value as-is if there’s no future, the future result if Parsl futures are encountered.
- Return type:
Any
- cytotable.utils._walk_and_match(start: Path, pattern: str) Iterator[Path][source]¶
Walk start with os.walk(followlinks=True) and yield entries whose path (relative to start) matches pattern under pathlib-glob semantics. Implements the 3.10-3.12 fallback used by _glob_follow_symlinks(). Subdirectories whose real path has already been entered are pruned before descent so that cyclic or aliasing symlinks neither hang the walk nor produce duplicate yields.
- Parameters:
start (Path) – Root directory of the walk. Must reference a local or network filesystem path.
pattern (str) – Glob pattern relative to start (e.g. "**/*.csv").
- Yields:
Path – pathlib.Path entries (files or directories) matching pattern.
- cytotable.utils._write_parquet_table_with_metadata(table: Table, **kwargs) None[source]¶
Adds metadata to parquet output from CytoTable. Note: this mostly wraps pyarrow.parquet.write_table https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
- Parameters:
table (pa.Table) – Pyarrow table to be serialized as parquet table.
**kwargs – Keyword arguments which roughly align with those of pyarrow.parquet.write_table, for example where (str or pyarrow.NativeFile).
- cytotable.utils.cloud_glob(start: str | CloudPath | Path, pattern: str, max_matches: int | None = None, cp_client: S3Client | None = None, boto_s3_client: Any | None = None) Iterator[CloudPath | Path][source]¶
Globs under start and yields matching paths. We provide cloud-platform specific optimizations as needed based on platform SDKs.
- Behavior by input type:
- S3 (cloudpathlib S3Path or ‘s3://…’):
Use unsigned boto3 to list keys and yield unsigned cloudpathlib.S3Path.
- Other CloudPath (e.g., GCS/Azure/local providers via cloudpathlib):
Fallback to CloudPath.glob(pattern), yielding CloudPath.
- Local or network filesystem (pathlib.Path or non-s3 string):
Walk with symlinked subdirectories followed, yielding pathlib.Path.
Symlink-following is only applied to the local-/network-filesystem branch. Object stores (S3, GCS, Azure) do not have filesystem-level symlinks, so the CloudPath branches are unaffected.
- Parameters:
start (Union[str, CloudPath, Path]) – CloudPath, pathlib.Path, or URI string.
pattern (str) – Glob pattern relative to start (supports ** for S3 branch).
max_matches (Optional[int]) – Optional cap on yielded results.
cp_client (Optional[S3Client]) – cloudpathlib S3Client (unsigned recommended).
boto_s3_client (Optional[Any]) – boto3 S3 client (unsigned recommended).
- Yields:
Union[CloudPath, Path] – cloudpathlib.S3Path for S3, CloudPath for other cloud providers, or pathlib.Path for local filesystem entries.
- Raises:
TypeError – Raised when start is not a CloudPath, pathlib.Path, or string URI.
- cytotable.utils.evaluate_futures(sources: Dict[str, List[Dict[str, Any]]] | List[Any] | str) Any[source]¶
Evaluates any Parsl futures for use within other tasks. This enables a pattern of Parsl app usage as “tasks” and delayed future result evaluation for concurrency.
- Parameters:
sources (Union[Dict[str, List[Dict[str, Any]]], List[Any], str]) – Sources are an internal data structure used by CytoTable for processing and organizing data results. They may include futures which require asynchronous processing through Parsl, so we process them through this function.
- Returns:
A data structure which includes evaluated futures where they were found.
- Return type:
Any
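The delayed-evaluation pattern can be sketched with the standard library alone. The example uses `concurrent.futures.Future` as a stand-in for a Parsl future, and the name `resolve_futures` is hypothetical; it is not CytoTable's implementation.

```python
from concurrent.futures import Future
from typing import Any


def resolve_futures(value: Any) -> Any:
    """Recursively replace futures in nested containers with their results."""
    if isinstance(value, Future):
        # result() blocks until the future completes; the result may itself
        # contain further futures, so resolve it again.
        return resolve_futures(value.result())
    if isinstance(value, dict):
        return {key: resolve_futures(item) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_futures(item) for item in value]
    if isinstance(value, tuple):
        return tuple(resolve_futures(item) for item in value)
    return value


# Stand-in future (assumption): completed immediately for the example.
pending = Future()
pending.set_result("cells.parquet")

sources = {"cells": [{"table": [pending], "rows": 3}]}
resolved = resolve_futures(sources)
# resolved == {"cells": [{"table": ["cells.parquet"], "rows": 3}]}
```

Deferring the `result()` calls to a single pass like this lets many tasks be submitted first and run concurrently before anything blocks.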
- cytotable.utils.find_anndata_metadata_field_names(source: str | Path) tuple[list[str], list[str]][source]¶
Classify the source table columns as numeric and non-numeric.
Scans the Parquet file schema and returns two lists of column names: those with numeric types (float, integer, decimal) and those with any other type. This is handy for separating AnnData metadata fields by basic numeric-ness for downstream processing.
- Parameters:
source (Union[str, pathlib.Path]) – Path to a Parquet file to inspect.
- Returns:
A 2-tuple (numeric_fields, non_numeric_fields), where each element is a list of column names.
- Return type:
tuple[list[str], list[str]]
- cytotable.utils.map_pyarrow_type(field_type: DataType, data_type_cast_map: Dict[str, str] | None) DataType[source]¶
Map PyArrow types dynamically to handle nested types and casting.
This function takes a PyArrow field_type and dynamically maps it to a valid PyArrow type, handling nested types (e.g., lists, structs) and resolving type conflicts (e.g., integer to float). It also supports custom type casting using the data_type_cast_map parameter.
- Parameters:
field_type (pa.DataType) – The PyArrow data type to be mapped. This can include simple types (e.g., int, float, string) or nested types (e.g., list, struct).
data_type_cast_map (Optional[Dict[str, str]]) – A dictionary mapping data type groups to specific types. This allows for custom type casting. For example, {“float”: “float32”} maps floating-point types to float32, and {“int”: “int64”} maps integer types to int64. If data_type_cast_map is None, default PyArrow types are used.
- Returns:
The mapped PyArrow data type. If no mapping is needed, the original field_type is returned.
- Return type:
pa.DataType
Presets¶
- cytotable.presets.config¶
Configuration presets for CytoTable
Exceptions¶
Provide hierarchy of exceptions for CytoTable
- exception cytotable.exceptions.CytoTableException[source]¶
Bases: Exception
Root exception for custom hierarchy of exceptions with CytoTable.
- exception cytotable.exceptions.DatatypeException[source]¶
Bases: CytoTableException
Exception for datatype challenges.
- exception cytotable.exceptions.NoInputDataException[source]¶
Bases: CytoTableException
Exception for no input data.
- exception cytotable.exceptions.SchemaException[source]¶
Bases: CytoTableException
Exception for schema challenges.