Python API¶
Convert¶
CytoTable: convert - transforming data for use with pycytominer.
- cytotable.convert._run_export_workflow(source_path: str, dest_path: str, source_datatype: str | None, metadata: List[str] | Tuple[str, ...] | None, compartments: List[str] | Tuple[str, ...] | None, identifying_columns: List[str] | Tuple[str, ...] | None, concat: bool, join: bool, joins: str | None, chunk_size: int | None, infer_common_schema: bool, drop_null: bool, sort_output: bool, page_keys: Dict[str, str], dest_datatype: Literal['parquet', 'anndata_h5ad', 'anndata_zarr'] = 'parquet', data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, **kwargs) Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]¶
Export data to various formats (e.g., parquet) based on configuration.
- Parameters:
source_path – str: Reference to read source files from. Note: may be a local path or a remote object-storage location using the convention “s3://…” or similar.
dest_path – str: Path to write files to. This path will be used for intermediary data work and must be a new file or directory path. With join=False this parameter results in a directory; with join=True it results in a single file. Note: this may only be a local path.
source_datatype – Optional[str]: (Default value = None) Source datatype to focus on during conversion.
metadata – Union[List[str], Tuple[str, …]]: Metadata names to use for conversion.
compartments – Union[List[str], Tuple[str, …]]: (Default value = None) Compartment names to use for conversion.
identifying_columns – Union[List[str], Tuple[str, …]]: Column names which are used as IDs and, as a result, are ignored during renaming.
concat – bool: Whether to concatenate similar files together.
join – bool: Whether to join the compartment data together into one dataset.
joins – str: DuckDB-compatible SQL which will be used to perform the join operations.
chunk_size – Optional[int]: Size of join chunks, used to limit data size during join operations.
infer_common_schema – bool: (Default value = True) Whether to infer a common schema when concatenating sources.
drop_null – bool: Whether to drop null results.
sort_output – bool: Whether to sort CytoTable output.
page_keys – Dict[str, str]: A dictionary defining which column names are used for keyset pagination when performing data extraction.
dest_datatype – Literal[“parquet”, “anndata_h5ad”, “anndata_zarr”]: Output destination datatype to write to. Defaults to ‘parquet’.
data_type_cast_map – Dict[str, str] A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
**kwargs – Any: Keyword args used for gathering source data, primarily relevant for Cloudpathlib cloud-based client configuration.
- Returns:
Grouped sources which include metadata about the destination filepath where the parquet file was written, or a string filepath for the joined result.
- Return type:
Union[Dict[str, List[Dict[str, Any]]], str]
- cytotable.convert.convert(source_path: str, dest_path: str, dest_datatype: Literal['parquet', 'anndata_h5ad', 'anndata_zarr'] = 'parquet', dest_backend: Literal['parquet', 'iceberg'] = 'parquet', image_dir: str | None = None, include_source_images: bool = False, mask_dir: str | None = None, outline_dir: str | None = None, segmentation_file_regex: Dict[str, str] | None = None, source_datatype: str | None = None, metadata: List[str] | Tuple[str, ...] | None = None, compartments: List[str] | Tuple[str, ...] | None = None, identifying_columns: List[str] | Tuple[str, ...] | None = None, concat: bool = True, join: bool = True, joins: str | None = None, chunk_size: int | None = None, infer_common_schema: bool = True, drop_null: bool = False, data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, page_keys: Dict[str, str] | None = None, bbox_column_map: Dict[str, str] | None = None, sort_output: bool = True, preset: str | None = 'cellprofiler_csv', parsl_config: Config | None = None, **kwargs) Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]¶
Convert file-based data from various sources to Pycytominer-compatible standards.
Note: source paths may be local or remote object-storage location using convention “s3://…” or similar.
- Parameters:
source_path – str: Reference to read source files from. Note: may be a local path or a remote object-storage location using the convention “s3://…” or similar.
dest_path – str: Path to write files to. Setting dest_backend=”parquet” will trigger CytoTable to use the provided path to perform intermediary data processing. The path must represent a new file or directory. This parameter will result in a directory on join=False. This parameter will result in a single file on join=True. Setting dest_backend=”iceberg” will trigger CytoTable to use the provided path as the local warehouse root directory. CytoTable still stages parquet files internally (during write), but these intermediary files are temporary and automatically deleted following write of the final output at dest_path.
dest_backend – Literal[“parquet”, “iceberg”]: Output backend to write to. Defaults to “parquet”. Use “iceberg” to store processed CytoTable tables in a local Iceberg warehouse.
dest_datatype – Literal[“parquet”, “anndata_h5ad”, “anndata_zarr”]: Output destination datatype to write to. CytoTable uses this value when the selected backend is “parquet”. For dest_backend=”iceberg”, CytoTable currently requires dest_datatype=”parquet” because CytoTable uses parquet as the temporary staging format before it writes data into the Iceberg warehouse.
image_dir – Optional[str] Optional directory or cloud object-storage prefix of source images aligned with the experiment of interest. CytoTable uses this input to build OME-Arrow image crops and, when include_source_images=True, full-image rows in the iceberg table called images.source_images. Requires dest_backend=”iceberg”.
include_source_images – bool Whether to also store full source images in an Iceberg images.source_images table. Requires image_dir and dest_backend=”iceberg”.
mask_dir – Optional[str] Optional directory or cloud object-storage prefix of segmentation masks corresponding to images within image_dir. CytoTable uses these files to populate ome_arrow_label when no outline image is available. Requires dest_backend=”iceberg”.
outline_dir – Optional[str] Optional directory or cloud object-storage prefix of outline images corresponding to images within image_dir. CytoTable uses these files to populate ome_arrow_label before falling back to mask_dir. Requires dest_backend=”iceberg”.
segmentation_file_regex – Optional[Dict[str, str]] Optional regex mapping of segmentation filename patterns to source image filename patterns to link masks and/or outlines. For example, use {r”.*_outline.tiff$”: r”(plateA_well_B03_site_1).tiff$”} when outline files and source images do not share the same basename. Requires dest_backend=”iceberg”.
source_datatype – Optional[str]: (Default value = None) Source datatype to focus on during conversion.
metadata – Union[List[str], Tuple[str, …]]: Metadata names to use for conversion.
compartments – Union[List[str], Tuple[str, …]]: (Default value = None) Compartment names to use for conversion.
identifying_columns – Union[List[str], Tuple[str, …]]: Column names which are used as IDs and, as a result, are ignored during renaming.
concat – bool: (Default value = True) Whether to concatenate similar files together.
join – bool: (Default value = True) Whether to join the compartment data together into one dataset.
joins – str: (Default value = None) DuckDB-compatible SQL which will be used to perform the join operations.
chunk_size – Optional[int]: (Default value = None) Size of join chunks, used to limit data size during join operations.
infer_common_schema – bool (Default value = True) Whether to infer a common schema when concatenating sources.
data_type_cast_map – Dict[str, str], (Default value = None) A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
add_tablenumber – Optional[bool] Whether to add a calculated tablenumber which helps differentiate various repeated values (such as ObjectNumber) within source data. Useful for processing multiple SQLite or CSV data sources together to retain distinction from each dataset.
page_keys – Dict[str, str]: The table and column names to be used for keyset pagination. Uses the form: {“table_name”: “column_name”}. Expects columns to include numeric data (ints or floats). Interacts with the chunk_size parameter to form pages of chunk_size.
bbox_column_map – Optional[Dict[str, str]] Optional dictionary that explicitly maps image crop bounding box columns using keys x_min, x_max, y_min, and y_max. For Iceberg profile exports, CytoTable recodes the provided bounding box value pairs as new columns in joined_profiles as Metadata_SourceBBoxXMin, Metadata_SourceBBoxXMax, Metadata_SourceBBoxYMin, and Metadata_SourceBBoxYMax.
sort_output – bool: (Default value = True) Whether to sort CytoTable output.
drop_null – bool: (Default value = False) Whether to drop NaN/null values from results.
preset – str: (Default value = “cellprofiler_csv”) An optional group of presets to use based on common configurations.
parsl_config – Optional[parsl.Config] (Default value = None) Optional Parsl configuration to use for running CytoTable operations. Note: when using CytoTable multiple times in the same process, CytoTable will use the first provided configuration for all runs.
- Returns:
- Union[Dict[str, List[Dict[str, Any]]], str]
Grouped sources which include metadata about destination filepath where parquet file was written or str of joined result filepath.
Example
from cytotable import convert

# using a local path with cellprofiler csv presets
convert(
    source_path="./tests/data/cellprofiler/ExampleHuman",
    source_datatype="csv",
    dest_path="ExampleHuman.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_csv",
)

# using an s3-compatible path with no signature for client
# and cellprofiler csv presets
convert(
    source_path="s3://s3path",
    source_datatype="csv",
    dest_path="s3_local_result",
    dest_datatype="parquet",
    concat=True,
    preset="cellprofiler_csv",
    no_sign_request=True,
)

# using local path with cellprofiler sqlite presets
convert(
    source_path="example.sqlite",
    dest_path="example.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
)
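The Iceberg backend described in the parameters above can be used similarly; the following is a minimal sketch, assuming a local example.sqlite source and an images/ directory (both hypothetical paths). The convert call itself is shown commented out because it requires CytoTable and source data to be present.

```python
# Hypothetical paths; dest_backend="iceberg" requires dest_datatype="parquet"
# as the staging format (see the parameter notes above).
iceberg_kwargs = dict(
    source_path="example.sqlite",
    dest_path="example_warehouse",   # becomes the Iceberg warehouse root
    dest_backend="iceberg",
    dest_datatype="parquet",
    image_dir="images/",             # enables OME-Arrow image crop export
    include_source_images=True,      # also store full source images
    preset="cellprofiler_sqlite",
)

# from cytotable import convert
# warehouse_root = convert(**iceberg_kwargs)
```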
- cytotable.convert._concat_join_sources(*args, **kwargs)¶
Concatenate join sources from parquet-based chunks.
For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html
- Parameters:
sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.
dest_path – str: Destination path to write file-based content.
join_sources – List[str]: List of local filepath destination for join source chunks which will be concatenated.
dest_datatype – Literal[“parquet”, “anndata_h5ad”, “anndata_zarr”] The datatype of the output destination file. Default is ‘parquet’.
sort_output – bool: Whether to sort CytoTable output.
- Returns:
- str
Path to concatenated file which is created as a result of this function.
- cytotable.convert._concat_source_group(*args, **kwargs)¶
Concatenate group of source data together as single file.
For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html
Notes: this function presumes a multi-directory, multi-file common data structure for compartments and other data. For example:
Source (file tree):
root
├── subdir_1
│   └── Cells.csv
└── subdir_2
    └── Cells.csv
Becomes:
# earlier data read into parquet chunks from multiple
# data source files.
read_data = [
    {"table": ["cells-1.parquet", "cells-2.parquet"]},
    {"table": ["cells-1.parquet", "cells-2.parquet"]},
]
# focus of this function
concatted = [{"table": ["cells.parquet"]}]
- Parameters:
source_group_name – str: Name of the data source group (for common compartments, etc.).
source_group – List[Dict[str, Any]]: Data structure containing grouped data for concatenation.
dest_path – Optional[str] (Default value = None) Optional destination path for concatenated sources.
common_schema – List[Tuple[str, str]] (Default value = None) Common schema to use for concatenation amongst arrow tables which may have slightly different but compatible schema.
sort_output – bool: Whether to sort CytoTable output.
- Returns:
- List[Dict[str, Any]]
Updated dictionary containing concatenated sources.
- cytotable.convert._get_table_columns_and_types(*args, **kwargs)¶
Gather column data from table through duckdb.
- Parameters:
source – Dict[str, Any] Contains source data details. Represents a single file or table of some kind.
sort_output – bool: Whether to sort CytoTable output.
- Returns:
- List[Optional[Dict[str, str]]]
list of dictionaries which each include column level information
- cytotable.convert._get_table_keyset_pagination_sets(*args, **kwargs)¶
Get table data chunk keys for later use in capturing segments of values. This work also provides a chance to catch problematic input data which will be ignored with warnings.
- Parameters:
source – Dict[str, Any] Contains the source data to be chunked. Represents a single file or table of some kind.
chunk_size – int The size in rowcount of the chunks to create.
page_key – str The column name to be used to identify pagination chunks. Expected to be of numeric type (int, float) for ordering.
sql_stmt – Optional SQL statement to form the pagination set from. The default behavior extracts pagination sets from the full data source.
- Returns:
- Union[List[Optional[Tuple[Union[int, float], Union[int, float]]]], None]
List of keys to use for reading the data later on.
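The pagination sets described above can be illustrated with a small, self-contained sketch (not CytoTable's implementation): sorting the distinct values of the page key column and slicing them into chunk_size groups yields the (start, end) key pairs used to read each page.

```python
def keyset_pages(values, chunk_size):
    """Slice the distinct, ordered page-key values into (start, end) ranges."""
    ordered = sorted(set(values))
    chunks = (ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size))
    return [(chunk[0], chunk[-1]) for chunk in chunks]

keyset_pages([5, 1, 2, 3, 4, 6], chunk_size=2)
# → [(1, 2), (3, 4), (5, 6)]
```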
- cytotable.convert._infer_source_group_common_schema(*args, **kwargs)¶
Infers a common schema for a group of parquet files which may have similar but slightly different schema or data. Intended to assist with data concatenation and other operations.
- Parameters:
source_group – List[Dict[str, Any]]: Group of one or more data sources which includes metadata about path to parquet data.
data_type_cast_map – Optional[Dict[str, str]], default None A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
- Returns:
- List[Tuple[str, pa.DataType]]
A list of tuples which includes column name and PyArrow datatype. This data will later be used as the basis for forming a PyArrow schema.
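As an illustration of the idea only (the real implementation works with PyArrow types), a common schema can be inferred by merging per-file column listings and widening types where groups disagree; the int64-to-float64 promotion below is one assumed example of such widening.

```python
def infer_common_schema(schemas):
    # schemas: list of [(column_name, dtype_str), ...] per source file.
    merged = {}
    for schema in schemas:
        for name, dtype in schema:
            prior = merged.get(name)
            if prior is None or prior == dtype:
                merged[name] = dtype
            elif {prior, dtype} == {"int64", "float64"}:
                # widen to the more general type when sources disagree
                merged[name] = "float64"
    return sorted(merged.items())

infer_common_schema([
    [("area", "int64"), ("label", "string")],
    [("area", "float64")],
])
# → [("area", "float64"), ("label", "string")]
```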
- cytotable.convert._join_source_pageset(*args, **kwargs)¶
Join sources based on join group keys (a group of specific join column values).
- Parameters:
dest_path – str: Destination path to write file-based content.
joins – str: DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.
join_group – List[Dict[str, Any]]: Group of joinable keys to be used as “chunked” filter of overall dataset.
drop_null – bool: Whether to drop rows with null values within the resulting joined data.
- Returns:
- str
Path to joined file which is created as a result of this function.
- cytotable.convert._prepare_join_sql(*args, **kwargs)¶
Prepare join SQL statement with actual locations of data based on the sources.
- Parameters:
sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.
joins – str: DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.
sort_output – bool: Whether to sort CytoTable output.
- Returns:
String representing the SQL to be used in later join work.
- Return type:
str
- cytotable.convert._prep_cast_column_data_types(*args, **kwargs)¶
Cast data types per what is received in cast_map.
Example:
- columns: [{“column_id”: 0, “column_name”: “colname”, “column_dtype”: “DOUBLE”}]
- data_type_cast_map: {“float”: “float32”}
Passed through this function, the above sets the “column_dtype” value to “REAL” (“REAL” in DuckDB is roughly equivalent to “float32”).
- Parameters:
table_path – str: Path to a parquet file which will be modified.
data_type_cast_map –
Dict[str, str] A dictionary mapping data type groups to specific types. Roughly to eventually align with DuckDB types: https://duckdb.org/docs/sql/data_types/overview
Note: includes synonym matching for common naming convention use in Pandas and/or PyArrow via cytotable.utils.DATA_TYPE_SYNONYMS
- Returns:
- List[Dict[str, str]]
list of dictionaries which each include column level information
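The DOUBLE-to-REAL example above can be sketched as a plain dictionary transform; the synonym handling here is a simplified assumption (the real synonym table lives in cytotable.utils.DATA_TYPE_SYNONYMS).

```python
# Simplified stand-in for the synonym matching described above.
FLOAT32_SYNONYMS = {"float32", "float", "REAL"}

def cast_column(column, data_type_cast_map):
    # Recode a DuckDB DOUBLE column to REAL when the cast map requests float32.
    if column["column_dtype"] == "DOUBLE" and data_type_cast_map.get("float") in FLOAT32_SYNONYMS:
        return {**column, "column_dtype": "REAL"}
    return column

cast_column(
    {"column_id": 0, "column_name": "colname", "column_dtype": "DOUBLE"},
    {"float": "float32"},
)
# → {"column_id": 0, "column_name": "colname", "column_dtype": "REAL"}
```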
- cytotable.convert._set_tablenumber(*args, **kwargs)¶
Gathers a “TableNumber” from the image table (if CSV) or the SQLite file (if SQLite source). This unique identifier helps differentiate repeated ImageNumber values, creating distinct records for single-cell profiles referenced across multiple source data exports. For example, ImageNumber column values from CellProfiler repeat across exports, meaning distinction may be lost when combining multiple export files through CytoTable.
Note:
- If using CSV data sources, the image.csv table is used for the checksum.
- If using SQLite data sources, the entire SQLite database is used for the checksum.
- Parameters:
sources – Dict[str, List[Dict[str, Any]]] Contains metadata about data tables and related contents.
add_tablenumber – Optional[bool]: Whether to add a calculated tablenumber. Note: when False, adds None as the tablenumber.
- Returns:
- List[Dict[str, Any]]
New source group with added TableNumber details.
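A minimal sketch of the checksum idea above, assuming CRC32 over the relevant source bytes (the actual checksum algorithm is not specified here): identical source contents always yield the same TableNumber, while differing exports stay distinct.

```python
import zlib

def tablenumber_for(source_bytes: bytes) -> int:
    # e.g., the bytes of image.csv (CSV sources) or the whole SQLite file.
    return zlib.crc32(source_bytes)

tablenumber_for(b"export-1") == tablenumber_for(b"export-1")  # → True
tablenumber_for(b"export-1") == tablenumber_for(b"export-2")  # → False
```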
- cytotable.convert._prepend_column_name(*args, **kwargs)¶
Rename columns using the source group name, avoiding identifying columns.
Note: a source_group_name represents a filename referenced as part of what is specified within targets.
- Parameters:
table_path – str: Path to a parquet file which will be modified.
source_group_name – str: Name of the data source group (for common compartments, etc.).
identifying_columns – List[str]: Column names which are used as ID’s and as a result need to be treated differently when renaming.
metadata – Union[List[str], Tuple[str, …]]: List of source data names which are used as metadata.
compartments – List[str]: List of source data names which are used as compartments.
- Returns:
- str
Path to the modified file.
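The renaming behavior can be sketched as follows; the capitalized-prefix convention is an assumption for illustration, not the verified output format.

```python
def prepend_names(columns, source_group_name, identifying_columns):
    # Prefix each column with the source group name, skipping ID columns.
    prefix = source_group_name.capitalize()
    return [
        col if col in identifying_columns else f"{prefix}_{col}"
        for col in columns
    ]

prepend_names(["ObjectNumber", "AreaShape_Area"], "cells", ["ObjectNumber"])
# → ["ObjectNumber", "Cells_AreaShape_Area"]
```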
- cytotable.convert._source_pageset_to_parquet(*args, **kwargs)¶
Export source data to chunked parquet file using chunk size and offsets.
- Parameters:
source_group_name – str Name of the source group (for ex. compartment or metadata table name).
source – Dict[str, Any] Contains the source data to be chunked. Represents a single file or table of some kind along with collected information about table.
pageset – Optional[Tuple[Union[int, float], Union[int, float]]] The pageset for chunking the data from source.
dest_path – str Path to store the output data.
sort_output – bool: Whether to sort CytoTable output.
- Returns:
- str
A string of the output filepath.
Access¶
Generic table access helpers for Parquet files and Iceberg warehouses.
- cytotable.warehouse.access.list_tables(path: str | Path, *, include_views: bool = True) list[str][source]¶
List available table names from a Parquet path or Iceberg warehouse.
- cytotable.warehouse.access.read_table(path: str | Path, table_name: str | None = None) DataFrame[source]¶
Read a table from a Parquet path or Iceberg warehouse.
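The Parquet-path branch of list_tables can be illustrated with a pure-Python sketch (an assumption about its behavior; the real helper also handles Iceberg warehouses, and read_table returns a pandas DataFrame):

```python
from pathlib import Path

def list_parquet_tables(path: str) -> list[str]:
    # A single file yields its own stem; a directory yields the stems of
    # the parquet files directly inside it.
    p = Path(path)
    if p.is_file():
        return [p.stem]
    return sorted(f.stem for f in p.glob("*.parquet"))
```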
Iceberg¶
Utilities for reading and writing local Iceberg warehouses with CytoTable.
- class cytotable.warehouse.iceberg.TinyCatalog[source]¶
Bases: object
Placeholder catalog when pyiceberg is unavailable.
- cytotable.warehouse.iceberg.catalog(warehouse_path: str | Path, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) TinyCatalog[source]¶
Open a local Iceberg warehouse and return its tiny catalog.
- cytotable.warehouse.iceberg.describe_iceberg_warehouse(warehouse_path: str | Path, include_views: bool = True, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) DataFrame[source]¶
Summarize tables and saved views within a local Iceberg warehouse.
- cytotable.warehouse.iceberg.list_iceberg_tables(warehouse_path: str | Path, include_views: bool = True, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) list[str][source]¶
List fully qualified tables and optional views in a local Iceberg warehouse.
- cytotable.warehouse.iceberg.read_iceberg_table(warehouse_path: str | Path, table_name: str, *, default_namespace: str = DEFAULT_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE) DataFrame[source]¶
Read an Iceberg table or saved SQL view from a local warehouse.
- cytotable.warehouse.iceberg.write_iceberg_warehouse(source_path: str, warehouse_path: str | Path, source_datatype: str | None = None, metadata: Tuple[str, ...] | list[str] | None = None, compartments: Tuple[str, ...] | list[str] | None = None, identifying_columns: Tuple[str, ...] | list[str] | None = None, joins: str | None = None, chunk_size: int | None = None, infer_common_schema: bool = True, data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, page_keys: Dict[str, str] | None = None, sort_output: bool = True, preset: str | None = 'cellprofiler_csv', image_dir: str | None = None, mask_dir: str | None = None, outline_dir: str | None = None, bbox_column_map: Dict[str, str] | None = None, segmentation_file_regex: Dict[str, str] | None = None, include_source_images: bool = False, default_namespace: str = DEFAULT_NAMESPACE, images_namespace: str = DEFAULT_IMAGES_NAMESPACE, registry_file: str = DEFAULT_REGISTRY_FILE, profiles_table_name: str = DEFAULT_PROFILES_TABLE, profile_with_images_view_name: str | None = DEFAULT_PROFILE_WITH_IMAGES_VIEW, parsl_config: Config | None = None, **kwargs) str[source]¶
Write a CytoTable Iceberg warehouse from raw source data.
This helper powers convert(…, dest_backend=”iceberg”) and accepts the same core conversion arguments for source selection, joins, chunking, and image export. See cytotable.convert.convert for the shared argument semantics; this function adds Iceberg-specific options such as default_namespace, images_namespace, registry_file, profiles_table_name, and profile_with_images_view_name.
- Returns:
Path to the created Iceberg warehouse root.
- Raises:
CytoTableException – If the warehouse path already exists or image export prerequisites are invalid.
ValueError – If required join SQL or join pagination keys are missing.
Images¶
Helpers for exporting image crops alongside CytoTable measurement data.
- class cytotable.warehouse.images.BBoxColumns(x_min: str, x_max: str, y_min: str, y_max: str)[source]¶
Bases: object
Bounding box column names for cropped image export.
- x_max: str¶
- x_min: str¶
- y_max: str¶
- y_min: str¶
- class cytotable.warehouse.images.FileIndex(by_relative: dict[str, Path | AnyPath], by_basename: dict[str, list[Path | AnyPath]], by_stem: dict[str, list[Path | AnyPath]])[source]¶
Bases: object
Relative-path-first index for image-like files in a directory tree.
- by_basename: dict[str, list[Path | AnyPath]]¶
- by_relative: dict[str, Path | AnyPath]¶
- by_stem: dict[str, list[Path | AnyPath]]¶
- cytotable.warehouse.images._build_file_index(file_dir: str | None, path_kwargs: Dict[str, Any] | None = None) FileIndex[source]¶
Build a relative-path-first index for image-like files in a directory tree.
- cytotable.warehouse.images._build_stable_image_crop_id(key_fields: dict[str, Any], image_column: str, image_name: str, bbox: dict[str, int] | None = None) str[source]¶
Build a deterministic identifier for one object/image crop row.
- cytotable.warehouse.images._build_stable_object_id(key_fields: dict[str, Any], bbox: dict[str, int] | None = None) str[source]¶
Build a deterministic object identifier for warehouse image rows.
- cytotable.warehouse.images._build_stable_source_image_id(key_fields: dict[str, Any], image_column: str, image_name: str) str[source]¶
Build a deterministic identifier for one source image row.
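The deterministic-identifier builders above can be sketched with a hash of the sorted key fields; the exact payload format here is an assumption, but it illustrates the key property: field order never changes the resulting identifier.

```python
import hashlib
import uuid

def stable_object_id(key_fields: dict, prefix: str = "obj") -> str:
    # Serialize key fields in sorted order, then hash into a UUID-shaped payload.
    payload = "|".join(f"{k}={key_fields[k]}" for k in sorted(key_fields))
    digest = hashlib.md5(payload.encode()).hexdigest()
    return f"{prefix}-{uuid.UUID(digest)}"

a = stable_object_id({"Metadata_Well": "B03", "ObjectNumber": 7})
b = stable_object_id({"ObjectNumber": 7, "Metadata_Well": "B03"})
# a == b: key order does not affect the identifier
```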
- cytotable.warehouse.images._crop_ome_arrow(image_path: Path | AnyPath, bbox: dict[str, int]) dict[str, Any][source]¶
Lazily crop a TIFF-backed image into an OME-Arrow struct.
- cytotable.warehouse.images._extract_image_key_fields(row: Series) dict[str, Any][source]¶
Extract image-level key fields to carry into source image rows.
- cytotable.warehouse.images._extract_key_fields(row: Series) dict[str, Any][source]¶
Extract practical measurement key fields to carry into the image table.
- cytotable.warehouse.images._find_matching_segmentation_path(data_value: str, pattern_map: dict[str, str] | None, file_dir: str | None, candidate_path: Path | AnyPath, file_index: FileIndex | None = None, lookup_cache: dict[str, Path | AnyPath | None] | None = None, path_kwargs: Dict[str, Any] | None = None) Path | AnyPath | None[source]¶
Resolve a matching mask/outline file path for an image value.
- cytotable.warehouse.images._local_image_io_path(path: Path | AnyPath) Path[source]¶
Return a local path for image I/O, caching cloud files when needed.
- cytotable.warehouse.images._normalize_file_value(value: Any) str | None[source]¶
Normalize a file-like value to a comparable path string.
- cytotable.warehouse.images._read_ome_arrow(image_path: Path | AnyPath) dict[str, Any][source]¶
Lazily load a full TIFF-backed image into an OME-Arrow struct.
- cytotable.warehouse.images._relative_index_key(path: Path | AnyPath, root: Path | AnyPath) str[source]¶
Build a normalized relative key for a file under an index root.
- cytotable.warehouse.images._require_ome_arrow() tuple[Any, Any][source]¶
Import and return OME-Arrow objects needed for crop export.
- cytotable.warehouse.images._resolve_image_columns(data: DataFrame) list[str][source]¶
Find joined-table columns that look like image filename columns.
- cytotable.warehouse.images._resolve_indexed_path(normalized_value: str, file_index: FileIndex) Path | AnyPath | None[source]¶
Resolve a normalized path string against a relative-path-first file index.
- cytotable.warehouse.images._strip_null_fields_from_type(data_type: DataType) DataType[source]¶
Remove null-typed fields from nested Arrow types for Iceberg compatibility.
- cytotable.warehouse.images._strip_null_fields_from_value(value: Any, data_type: DataType) Any[source]¶
Remove values corresponding to null-typed nested Arrow fields.
- cytotable.warehouse.images._validated_bbox_values(row: Series, bbox_columns: BBoxColumns) dict[str, int] | None[source]¶
Validate and normalize row bbox values for image cropping.
- cytotable.warehouse.images.add_object_id_to_profiles_frame(joined_frame: DataFrame, bbox_column_map: Dict[str, str] | None = None) DataFrame[source]¶
Add a stable object identifier column to a joined profiles frame.
- cytotable.warehouse.images.image_crop_table_from_joined_chunk(chunk_path: str, image_dir: str, mask_dir: str | None = None, outline_dir: str | None = None, bbox_column_map: Dict[str, str] | None = None, segmentation_file_regex: Dict[str, str] | None = None, path_kwargs: Dict[str, Any] | None = None) Table[source]¶
Build an Arrow table of OME-Arrow image crops from one joined parquet chunk.
- cytotable.warehouse.images.object_id(name: str | UUID | None = None, *, prefix: str = 'obj') str[source]¶
Return a stable string identifier with a UUID-shaped payload.
- cytotable.warehouse.images.profile_with_images_frame(joined_frame: DataFrame, image_frame: DataFrame, bbox_column_map: Dict[str, str] | None = None) DataFrame[source]¶
Expand joined measurement rows into stable object/image references and merge crops.
- cytotable.warehouse.images.resolve_bbox_columns(columns: Sequence[Any], bbox_column_map: Dict[str, str] | None = None) BBoxColumns | None[source]¶
Resolve bbox columns using custom mapping, CellProfiler naming, then fallback tags.
- cytotable.warehouse.images.source_image_table_from_joined_chunk(chunk_path: str, image_dir: str, mask_dir: str | None = None, outline_dir: str | None = None, segmentation_file_regex: Dict[str, str] | None = None, path_kwargs: Dict[str, Any] | None = None) Table[source]¶
Build an Arrow table of full OME-Arrow source images from one joined chunk.
Sources¶
CytoTable: sources - tasks and flows related to source data and metadata for performing conversion work.
- cytotable.sources._build_path(path: str, **kwargs) Path | AnyPath[source]¶
Build a path client or return local path.
- Parameters:
path – Union[pathlib.Path, Any]: Path to seek filepaths within.
**kwargs – Any keyword arguments to be used with Cloudpathlib.CloudPath.client.
- Returns:
- Union[pathlib.Path, Any]
A local pathlib.Path or Cloudpathlib.AnyPath type path.
- cytotable.sources._file_is_more_than_one_line(path: Path | AnyPath) bool[source]¶
Check if the file has more than one line.
- Parameters:
path (Union[pathlib.Path, AnyPath]) – The path to the file.
- Returns:
True if the file has more than one line, False otherwise.
- Return type:
bool
- Raises:
NoInputDataException – If the file has zero lines.
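A minimal sketch of this check, with ValueError standing in for CytoTable's NoInputDataException:

```python
from pathlib import Path

def file_is_more_than_one_line(path: Path) -> bool:
    with path.open() as handle:
        if not handle.readline():
            # stands in for CytoTable's NoInputDataException
            raise ValueError(f"{path} has zero lines")
        # a non-empty second read means more than one line exists
        return bool(handle.readline())
```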
- cytotable.sources._filter_source_filepaths(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str) Dict[str, List[Dict[str, Any]]][source]¶
Filter source filepaths based on provided source_datatype.
- Parameters:
sources – Dict[str, List[Dict[str, Any]]] Grouped datasets of files which will be used by other functions.
source_datatype – str Source datatype to use for filtering the dataset.
- Returns:
- Dict[str, List[Dict[str, Any]]]
Data structure which groups related files based on the datatype.
- cytotable.sources._gather_sources(source_path: str, source_datatype: str | None = None, targets: List[str] | None = None, **kwargs) Dict[str, List[Dict[str, Any]]][source]¶
Flow for gathering data sources for conversion.
- Parameters:
source_path – str: Where to gather file-based data from.
source_datatype – Optional[str]: (Default value = None) The source datatype (extension) to use for reading the tables.
targets – Optional[List[str]]: (Default value = None) The source file names to target within the provided path.
- Returns:
- Dict[str, List[Dict[str, Any]]]
Data structure which groups related files based on the compartments.
- cytotable.sources._get_source_filepaths(path: Path | AnyPath, targets: List[str] | None = None, source_datatype: str | None = None) Dict[str, List[Dict[str, Any]]][source]¶
Gather dataset of filepaths from a provided directory path.
- Parameters:
path – Union[pathlib.Path, Any]: Either a directory path to seek filepaths within or a path directly to a file.
targets – List[str]: Compartment and metadata names to seek within the provided path.
source_datatype – Optional[str]: (Default value = None) The source datatype (extension) to use for reading the tables.
- Returns:
- Dict[str, List[Dict[str, Any]]]
Data structure which groups related files based on the compartments.
- cytotable.sources._infer_source_datatype(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str | None = None) str[source]¶
Infers and optionally validates datatype (extension) of files.
- Parameters:
sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions.
source_datatype – Optional[str]: (Default value = None) Optional source datatype to validate within the context of detected datatypes.
- Returns:
- str
A string of the datatype detected or validated source_datatype.
Utils¶
Utility functions for CytoTable
- cytotable.utils.Parsl_AppBase_init_for_docs(self, func, *args, **kwargs)[source]¶
A function to extend Parsl.app.app.AppBase with docstrings from the decorated functions rather than from the Parsl decorators. Used for Sphinx documentation purposes.
- cytotable.utils._arrow_type_cast_if_specified(column: Dict[str, str], data_type_cast_map: Dict[str, str]) Dict[str, str][source]¶
Attempts to cast data types for a PyArrow field using a provided data_type_cast_map.
- Parameters:
column – Dict[str, str]: Dictionary which includes a column idx, name, and dtype
data_type_cast_map – Dict[str, str]: A dictionary mapping data type groups to specific types. Roughly follows the Arrow data type names from: https://arrow.apache.org/docs/python/api/datatypes.html Example: {“float”: “float32”}
- Returns:
- Dict[str, str]
A dictionary of column information with a potentially updated data type
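A minimal sketch of how such a cast map could be applied to a column dictionary. The dtype-to-group mapping used here is an assumption for illustration, not the library's exact logic:

```python
from typing import Dict


def cast_if_specified_sketch(
    column: Dict[str, str], data_type_cast_map: Dict[str, str]
) -> Dict[str, str]:
    """Illustrative sketch: rewrite a column's dtype via a type-group map."""
    # Map a concrete dtype (e.g. "double") to a group key (e.g. "float"),
    # then to the requested target type (e.g. "float32") if one was given.
    groups = {"double": "float", "float": "float", "bigint": "integer"}
    group = groups.get(column["column_dtype"], column["column_dtype"])
    target = data_type_cast_map.get(group)
    if target is not None:
        return {**column, "column_dtype": target}
    return column
```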
- cytotable.utils._cache_cloudpath_to_local(path: AnyPath) Path[source]¶
Takes a cloudpath and uses a cache to convert it to a local copy for use in scenarios where remote work is not possible (sqlite).
- Parameters:
path – Union[str, AnyPath] A filepath which will be checked and potentially converted to a local filepath.
- Returns:
- pathlib.Path
A local pathlib.Path to cached version of cloudpath file.
- cytotable.utils._column_sort(value: str)[source]¶
A custom sort for column values as a list. To be used with sorted and PyArrow tables.
- cytotable.utils._default_parsl_config()[source]¶
Return a default Parsl configuration for use with CytoTable.
- cytotable.utils._duckdb_reader() DuckDBPyConnection[source]¶
Creates a DuckDB connection with the sqlite_scanner installed and loaded.
Note: callers of this function are expected to close the created DuckDB connection, either via _duckdb_reader().close() or via a context manager, e.g.: with _duckdb_reader() as ddb_reader:
- Returns:
duckdb.DuckDBPyConnection
- cytotable.utils._expand_path(path: str | Path | AnyPath) Path | AnyPath[source]¶
Expands “~” user directory references with the user’s home directory, and expands variable references with values from the environment. After user/variable expansion, the path is resolved and an absolute path is returned.
- Parameters:
path – Union[str, pathlib.Path, CloudPath]: Path to expand.
- Returns:
- Union[pathlib.Path, cloudpathlib.AnyPath]
A local pathlib.Path or cloudpathlib.AnyPath type path.
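For local paths, the expansion described above corresponds roughly to the following standard-library steps. This is a sketch of the local-path case only; the actual function also handles cloud paths via cloudpathlib:

```python
import os
import pathlib


def expand_path_sketch(path: str) -> pathlib.Path:
    """Illustrative sketch: expand '~' and $VARS, then resolve to absolute."""
    # Expand user home references first, then environment variables.
    expanded = os.path.expandvars(os.path.expanduser(path))
    # Resolve into an absolute, normalized path.
    return pathlib.Path(expanded).resolve()
```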
- cytotable.utils._extract_npz_to_parquet(source_path: str, dest_path: str, tablenumber: int | None = None) str[source]¶
Extract data from an .npz file created by DeepProfiler as a tabular dataset and write to parquet.
DeepProfiler creates datasets which look somewhat like this:
Keys in the .npz file: [‘features’, ‘metadata’, ‘locations’]
- features: shape (229, 6400), data type float32
- locations: shape (229, 2), data type float64
- metadata: shape (), data type object, the whole object being: { ‘Metadata_Plate’: ‘SQ00014812’, ‘Metadata_Well’: ‘A01’, ‘Metadata_Site’: 1, ‘Plate_Map_Name’: ‘C-7161-01-LM6-022’, ‘RNA’: ‘SQ00014812/r01c01f01p01-ch3sk1fk1fl1.png’, ‘ER’: ‘SQ00014812/r01c01f01p01-ch2sk1fk1fl1.png’, ‘AGP’: ‘SQ00014812/r01c01f01p01-ch4sk1fk1fl1.png’, ‘Mito’: ‘SQ00014812/r01c01f01p01-ch5sk1fk1fl1.png’, ‘DNA’: ‘SQ00014812/r01c01f01p01-ch1sk1fk1fl1.png’, ‘Treatment_ID’: 0, ‘Treatment_Replicate’: 1, ‘Treatment’: ‘DMSO@NA’, ‘Compound’: ‘DMSO’, ‘Concentration’: ‘’, ‘Split’: ‘Training’, ‘Metadata_Model’: ‘efficientnet’ }
- Parameters:
source_path – str Path to the .npz file.
dest_path – str Destination path for the parquet file.
tablenumber – Optional[int] Optional tablenumber to be added to the data.
- Returns:
- str
Path to the exported parquet file.
- cytotable.utils._gather_tablenumber_checksum(pathname: str, buffer_size: int = 1048576) int[source]¶
Build and return a checksum for use as a unique identifier across datasets referenced from cytominer-database: https://github.com/cytomining/cytominer-database/blob/master/cytominer_database/ingest_variable_engine.py#L129
- Parameters:
pathname – str: A path to a file with which to generate the checksum on.
buffer_size – int: Buffer size to use for reading data.
- Returns:
- int
an integer representing the checksum of the pathname file.
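The referenced cytominer-database pattern reads the file in fixed-size buffers and folds each buffer into a running checksum; a hedged sketch of that pattern using CRC32 (the choice of CRC32 here is an assumption for illustration, with the buffer size matching the default above):

```python
import zlib


def file_checksum_sketch(pathname: str, buffer_size: int = 1048576) -> int:
    """Illustrative sketch: buffered CRC32 checksum over a file's bytes."""
    checksum = 0
    with open(pathname, "rb") as stream:
        # Fold each buffer into the running checksum to bound memory use.
        while chunk := stream.read(buffer_size):
            checksum = zlib.crc32(chunk, checksum)
    return checksum
```

Buffered reading keeps memory use constant regardless of file size, which matters for large SQLite or CSV inputs.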
- cytotable.utils._generate_pagesets(keys: List[int | float], chunk_size: int) List[Tuple[int | float, int | float]][source]¶
Generate a pageset (keyset pagination) from a list of keys.
- Parameters:
keys – List[Union[int, float]]: List of keys to paginate.
chunk_size – int: Size of each chunk/page.
- Returns:
List of (start_key, end_key) tuples representing each page.
- Return type:
List[Tuple[Union[int, float], Union[int, float]]]
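Keyset pagination as described above can be sketched as follows: sort the keys, split them into chunks of at most chunk_size, and emit each chunk's first and last key as an inclusive (start_key, end_key) range. This is an illustrative sketch under those assumptions, not the exact implementation:

```python
from typing import List, Tuple, Union

Number = Union[int, float]


def generate_pagesets_sketch(
    keys: List[Number], chunk_size: int
) -> List[Tuple[Number, Number]]:
    """Illustrative sketch: inclusive (start_key, end_key) pages over sorted keys."""
    ordered = sorted(keys)
    # Each page covers the first and last key of a chunk of up to chunk_size keys.
    return [
        (chunk[0], chunk[-1])
        for chunk in (
            ordered[i : i + chunk_size]
            for i in range(0, len(ordered), chunk_size)
        )
    ]
```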
- cytotable.utils._get_cytotable_version() str[source]¶
Seeks the current version of CytoTable using either pkg_resources or dunamai.
- Returns:
- str
A string representing the version of CytoTable currently being used.
- cytotable.utils._natural_sort(list_to_sort)[source]¶
Sorts the given iterable using natural sort, adapted from the approach provided at the following link: https://stackoverflow.com/a/4836734
- Parameters:
list_to_sort – List: The list to sort.
- Returns:
The sorted list.
- Return type:
List
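The linked Stack Overflow approach splits each string into digit and non-digit runs so that embedded numbers compare numerically; a sketch:

```python
import re
from typing import List


def natural_sort_sketch(list_to_sort: List[str]) -> List[str]:
    """Illustrative sketch: sort strings so embedded numbers order numerically."""

    def key(text: str):
        # Split on digit runs; compare digit runs as ints, the rest lowercased.
        return [
            int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", text)
        ]

    return sorted(list_to_sort, key=key)
```

Without this, plain lexicographic sorting would place "img10" before "img2".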
- cytotable.utils._parsl_loaded() bool[source]¶
Checks whether Parsl configuration has already been loaded.
- cytotable.utils._sqlite_mixed_type_query_to_parquet(source_path: str, table_name: str, page_key: str, pageset: Tuple[int | float, int | float], sort_output: bool, tablenumber: int | None = None) str[source]¶
Performs SQLite table data extraction where one or more columns include data values of potentially mismatched types, so that the data may be exported to Arrow for later use.
- Parameters:
source_path – str: A str which is a path to a SQLite database file.
table_name – str: The name of the table being queried.
page_key – str: The column name to be used to identify pagination chunks.
pageset – Tuple[Union[int, float], Union[int, float]]: The range for values used for paginating data from source.
sort_output – bool Specifies whether to sort cytotable output or not.
tablenumber – Optional[int], default=None: An optional table number to append to the results. Defaults to None.
- Returns:
A path to the parquet file containing the resulting data
- Return type:
str
- cytotable.utils._unwrap_source(source: Dict[str, AppFuture | Any] | AppFuture | Any) Dict[str, Any] | Any[source]¶
Helper function to unwrap futures from sources.
- Parameters:
source – Union[Dict[str, Union[parsl.dataflow.futures.AppFuture, Any]], parsl.dataflow.futures.AppFuture, Any]: A source is a portion of an internal data structure used by CytoTable for processing and organizing data results.
- Returns:
- Union[Dict[str, Any], Any]
An evaluated dictionary or other value type.
- cytotable.utils._unwrap_value(val: AppFuture | Any) Any[source]¶
Helper function to unwrap futures from values or return values where there are no futures.
- Parameters:
val – Union[parsl.dataflow.futures.AppFuture, Any] A value which may or may not be a Parsl future which needs to be evaluated.
- Returns:
- Any
Returns the value as-is if there’s no future, the future result if Parsl futures are encountered.
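The unwrap pattern generalizes beyond Parsl: check whether a value is future-like and resolve it, otherwise pass it through unchanged. A sketch using the standard library's concurrent.futures.Future as a stand-in for Parsl's AppFuture (an assumption for illustration; the real function checks for Parsl futures):

```python
from concurrent.futures import Future
from typing import Any


def unwrap_value_sketch(val: Any) -> Any:
    """Illustrative sketch: resolve future-like values, pass others through."""
    if isinstance(val, Future):
        # Block until the future completes and return its result.
        return val.result()
    return val
```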
- cytotable.utils._write_parquet_table_with_metadata(table: Table, **kwargs) None[source]¶
Adds metadata to parquet output from CytoTable. Note: this mostly wraps pyarrow.parquet.write_table https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
- Parameters:
table – pa.Table: Pyarrow table to be serialized as parquet table.
**kwargs – Any: kwargs provided to this function roughly align with pyarrow.parquet.write_table. The following might be examples of what to expect here: - where: str or pyarrow.NativeFile
- cytotable.utils.cloud_glob(start: str | CloudPath | Path, pattern: str, max_matches: int | None = None, cp_client: S3Client | None = None, boto_s3_client=None) Iterator[CloudPath | Path][source]¶
Globs under start and yields matching paths. We provide cloud-platform-specific optimizations as needed based on platform SDKs.
- Behavior by input type:
- S3 (cloudpathlib S3Path or ‘s3://…’):
Use unsigned boto3 to list keys and yield unsigned cloudpathlib.S3Path.
- Other CloudPath (e.g., GCS/Azure/local providers via cloudpathlib):
Fallback to CloudPath.glob(pattern), yielding CloudPath.
- Local filesystem (pathlib.Path or non-s3 string):
Fallback to Path.glob(pattern), yielding pathlib.Path.
- Parameters:
start – CloudPath, pathlib.Path, or URI string.
pattern – Glob pattern relative to start (supports ** for S3 branch).
max_matches – Optional cap on yielded results.
cp_client – cloudpathlib S3Client (unsigned recommended).
boto_s3_client – boto3 S3 client (unsigned recommended).
- Yields:
cloudpathlib.S3Path for S3; CloudPath for other cloud providers; pathlib.Path for local.
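The local-filesystem branch above reduces to Path.glob with an optional cap on yielded matches; a sketch of that fallback only (illustrative, since the real function adds S3-specific listing optimizations):

```python
import itertools
import pathlib
from typing import Iterator, Optional


def local_glob_sketch(
    start: pathlib.Path, pattern: str, max_matches: Optional[int] = None
) -> Iterator[pathlib.Path]:
    """Illustrative sketch: the pathlib fallback branch of a cloud-aware glob."""
    matches = start.glob(pattern)
    # islice with None yields everything; otherwise cap the result count.
    yield from itertools.islice(matches, max_matches)
```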
- cytotable.utils.evaluate_futures(sources: Dict[str, List[Dict[str, Any]]] | List[Any] | str) Any[source]¶
Evaluates any Parsl futures for use within other tasks. This enables a pattern of Parsl app usage as “tasks” and delayed future result evaluation for concurrency.
- Parameters:
sources – Union[Dict[str, List[Dict[str, Any]]], List[Any], str] Sources are an internal data structure used by CytoTable for processing and organizing data results. They may include futures which require asynchronous processing through Parsl, so we process them through this function.
- Returns:
- Union[Dict[str, List[Dict[str, Any]]], str]
A data structure which includes evaluated futures where they were found.
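Evaluating nested futures can be sketched as a recursive walk over the source structure, resolving future-like values wherever they appear (again using the standard library's Future as a stand-in for Parsl's AppFuture; an illustrative assumption, not the actual implementation):

```python
from concurrent.futures import Future
from typing import Any


def evaluate_futures_sketch(sources: Any) -> Any:
    """Illustrative sketch: recursively resolve futures in a nested structure."""
    if isinstance(sources, Future):
        # Resolve the future, then keep walking in case its result nests more.
        return evaluate_futures_sketch(sources.result())
    if isinstance(sources, dict):
        return {key: evaluate_futures_sketch(val) for key, val in sources.items()}
    if isinstance(sources, list):
        return [evaluate_futures_sketch(item) for item in sources]
    return sources
```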
- cytotable.utils.find_anndata_metadata_field_names(source: str | Path) tuple[list[str], list[str]][source]¶
Classify the source table columns as numeric and non-numeric.
Scans the Parquet file schema and returns two lists of column names: those with numeric types (float, integer, decimal) and those with any other type. This is handy for separating AnnData metadata fields by basic numeric-ness for downstream processing.
- Parameters:
source – Path to a Parquet file to inspect.
- Returns:
A 2-tuple
(numeric_fields, non_numeric_fields), where each element is a list of column names.
- cytotable.utils.map_pyarrow_type(field_type: DataType, data_type_cast_map: Dict[str, str] | None) DataType[source]¶
Map PyArrow types dynamically to handle nested types and casting.
This function takes a PyArrow field_type and dynamically maps it to a valid PyArrow type, handling nested types (e.g., lists, structs) and resolving type conflicts (e.g., integer to float). It also supports custom type casting using the data_type_cast_map parameter.
- Parameters:
field_type – pa.DataType The PyArrow data type to be mapped. This can include simple types (e.g., int, float, string) or nested types (e.g., list, struct).
data_type_cast_map – Optional[Dict[str, str]], default None A dictionary mapping data type groups to specific types. This allows for custom type casting. For example: - {“float”: “float32”} maps floating-point types to float32. - {“int”: “int64”} maps integer types to int64. If data_type_cast_map is None, default PyArrow types are used.
- Returns:
- pa.DataType
The mapped PyArrow data type. If no mapping is needed, the original field_type is returned.
Presets¶
- cytotable.presets.config¶
Configuration presets for CytoTable
Exceptions¶
Provide hierarchy of exceptions for CytoTable
- exception cytotable.exceptions.CytoTableException[source]¶
Bases:
Exception
Root exception for custom hierarchy of exceptions with CytoTable.
- exception cytotable.exceptions.DatatypeException[source]¶
Bases:
CytoTableException
Exception for datatype challenges.
- exception cytotable.exceptions.NoInputDataException[source]¶
Bases:
CytoTableException
Exception for no input data.
- exception cytotable.exceptions.SchemaException[source]¶
Bases:
CytoTableException
Exception for schema challenges.