Python API¶

Convert¶

CytoTable: convert - transforming data for use with pyctyominer.

cytotable.convert._run_export_workflow(source_path: str, dest_path: str, source_datatype: str | None, metadata: List[str] | Tuple[str, ...] | None, compartments: List[str] | Tuple[str, ...] | None, identifying_columns: List[str] | Tuple[str, ...] | None, concat: bool, join: bool, joins: str | None, chunk_size: int | None, infer_common_schema: bool, drop_null: bool, sort_output: bool, page_keys: Dict[str, str], dest_datatype: Literal['parquet', 'anndata_h5ad', 'anndata_zarr'] = 'parquet', data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, **kwargs) → Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]¶

Export data to various formats (e.g., parquet) based on configuration.

Parameters:

source_path – str: str reference to read source files from. Note: may be local or remote object-storage location using convention “s3://…” or similar.
dest_path – str: Path to write files to. This path will be used for intermediary data work and must be a new file or directory path. This parameter will result in a directory on join=False. This parameter will result in a single file on join=True. Note: this may only be a local path.
source_datatype – Optional[str]: (Default value = None) Source datatype to focus on during conversion.
metadata – Union[List[str], Tuple[str, …]]: Metadata names to use for conversion.
compartments – Union[List[str], Tuple[str, …]]: (Default value = None) Compartment names to use for conversion.
identifying_columns – Union[List[str], Tuple[str, …]]: Column names which are used as ID’s and as a result need to be ignored with regards to renaming.
concat – bool: Whether to concatenate similar files together.
join – bool: Whether to join the compartment data together into one dataset.
joins – str: DuckDB-compatible SQL which will be used to perform the join operations.
chunk_size – Optional[int], Size of join chunks which is used to limit data size during join ops.
infer_common_schema – bool: (Default value = True) Whether to infer a common schema when concatenating sources.
drop_null – bool: Whether to drop null results.
sort_output – bool Specifies whether to sort cytotable output or not.
page_keys – Dict[str, str] A dictionary which defines which column names are used for keyset pagination in order to perform data extraction.
dest_datatype – Literal[“parquet”, “anndata_h5ad”, “anndata_zarr”]: Output destination datatype to write to. Defaults to ‘parquet’.
data_type_cast_map – Dict[str, str] A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
**kwargs – Any: Keyword args used for gathering source data, primarily relevant for Cloudpathlib cloud-based client configuration.

Returns:

Grouped sources which include metadata about destination filepath where parquet file was written or a string filepath for the joined result.

Return type:

Union[Dict[str, List[Dict[str, Any]]], str]

cytotable.convert.convert(source_path: str, dest_path: str, dest_datatype: Literal['parquet', 'anndata_h5ad', 'anndata_zarr'] = 'parquet', source_datatype: str | None = None, metadata: List[str] | Tuple[str, ...] | None = None, compartments: List[str] | Tuple[str, ...] | None = None, identifying_columns: List[str] | Tuple[str, ...] | None = None, concat: bool = True, join: bool = True, joins: str | None = None, chunk_size: int | None = None, infer_common_schema: bool = True, drop_null: bool = False, data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, page_keys: Dict[str, str] | None = None, sort_output: bool = True, preset: str | None = 'cellprofiler_csv', parsl_config: Config | None = None, **kwargs) → Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]¶

Convert file-based data from various sources to Pycytominer-compatible standards.

Note: source paths may be local or remote object-storage location using convention “s3://…” or similar.

Parameters:

source_path – str: str reference to read source files from. Note: may be local or remote object-storage location using convention “s3://…” or similar.
dest_path – str: Path to write files to. This path will be used for intermediary data work and must be a new file or directory path. This parameter will result in a directory on join=False. This parameter will result in a single file on join=True. Note: this may only be a local path.
dest_datatype – Literal[“parquet”, “anndata_h5ad”, “anndata_zarr”]: Output destination datatype to write to.
source_datatype – Optional[str]: (Default value = None) Source datatype to focus on during conversion.
metadata – Union[List[str], Tuple[str, …]]: Metadata names to use for conversion.
compartments – Union[List[str], Tuple[str, str, str, str]]: (Default value = None) Compartment names to use for conversion.
identifying_columns – Union[List[str], Tuple[str, …]]: Column names which are used as ID’s and as a result need to be ignored with regards to renaming.
concat – bool: (Default value = True) Whether to concatenate similar files together.
join – bool: (Default value = True) Whether to join the compartment data together into one dataset
joins – str: (Default value = None): DuckDB-compatible SQL which will be used to perform the join operations.
chunk_size – Optional[int] (Default value = None) Size of join chunks which is used to limit data size during join ops
infer_common_schema – bool (Default value = True) Whether to infer a common schema when concatenating sources.
data_type_cast_map – Dict[str, str], (Default value = None) A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html
add_tablenumber – Optional[bool] Whether to add a calculated tablenumber which helps differentiate various repeated values (such as ObjectNumber) within source data. Useful for processing multiple SQLite or CSV data sources together to retain distinction from each dataset.
page_keys – str: The table and column names to be used for key pagination. Uses the form: {“table_name”:”column_name”}. Expects columns to include numeric data (ints or floats). Interacts with the chunk_size parameter to form pages of chunk_size.
sort_output – bool (Default value = True) Specifies whether to sort cytotable output or not.
drop_null – bool (Default value = False) Whether to drop nan/null values from results
preset – str (Default value = “cellprofiler_csv”) an optional group of presets to use based on common configurations
parsl_config – Optional[parsl.Config] (Default value = None) Optional Parsl configuration to use for running CytoTable operations. Note: when using CytoTable multiple times in the same process, CytoTable will use the first provided configuration for all runs.

Returns:

Union[Dict[str, List[Dict[str, Any]]], str]: Grouped sources which include metadata about destination filepath where parquet file was written or str of joined result filepath.

Example

from cytotable import convert

# using a local path with cellprofiler csv presets
convert(
    source_path="./tests/data/cellprofiler/ExampleHuman",
    source_datatype="csv",
    dest_path="ExampleHuman.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_csv",
)

# using an s3-compatible path with no signature for client
# and cellprofiler csv presets
convert(
    source_path="s3://s3path",
    source_datatype="csv",
    dest_path="s3_local_result",
    dest_datatype="parquet",
    concat=True,
    preset="cellprofiler_csv",
    no_sign_request=True,
)

# using local path with cellprofiler sqlite presets
convert(
    source_path="example.sqlite",
    dest_path="example.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
)

cytotable.convert._concat_join_sources(*args, **kwargs)¶

Concatenate join sources from parquet-based chunks.

For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html

Parameters:

sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.
dest_path – str: Destination path to write file-based content.
join_sources – List[str]: List of local filepath destination for join source chunks which will be concatenated.
dest_datatype – Literal[“parquet”, “anndata_h5ad”, “anndata_zarr”] The datatype of the output destination file. Default is ‘parquet’.
sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

str: Path to concatenated file which is created as a result of this function.

cytotable.convert._concat_source_group(*args, **kwargs)¶

Concatenate group of source data together as single file.

For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html

Notes: this function presumes a multi-directory, multi-file common data structure for compartments and other data. For example:

Source (file tree):

root
├── subdir_1
│  └── Cells.csv
└── subdir_2
    └── Cells.csv

Becomes:

# earlier data read into parquet chunks from multiple
# data source files.
read_data = [
    {"table": ["cells-1.parquet", "cells-2.parquet"]},
    {"table": ["cells-1.parquet", "cells-2.parquet"]},
]

# focus of this function
concatted = [{"table": ["cells.parquet"]}]

Parameters:

source_group_name – str Name of data source source group (for common compartments, etc).
source_group – List[Dict[str, Any]]: Data structure containing grouped data for concatenation.
dest_path – Optional[str] (Default value = None) Optional destination path for concatenated sources.
common_schema – List[Tuple[str, str]] (Default value = None) Common schema to use for concatenation amongst arrow tables which may have slightly different but compatible schema.
sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

List[Dict[str, Any]]: Updated dictionary containing concatenated sources.

cytotable.convert._get_table_columns_and_types(*args, **kwargs)¶

Gather column data from table through duckdb.

Parameters:

source – Dict[str, Any] Contains source data details. Represents a single file or table of some kind.
sort_output – Specifies whether to sort cytotable output or not.

Returns:

List[Optional[Dict[str, str]]]: list of dictionaries which each include column level information

cytotable.convert._get_table_keyset_pagination_sets(*args, **kwargs)¶

Get table data chunk keys for later use in capturing segments of values. This work also provides a chance to catch problematic input data which will be ignored with warnings.

Parameters:

source – Dict[str, Any] Contains the source data to be chunked. Represents a single file or table of some kind.
chunk_size – int The size in rowcount of the chunks to create.
page_key – str The column name to be used to identify pagination chunks. Expected to be of numeric type (int, float) for ordering.
sql_stmt – Optional sql statement to form the pagination set from. Default behavior extracts pagination sets from the full data source.

Returns:

Union[List[Optional[Tuple[Union[int, float], Union[int, float]]]], None]: List of keys to use for reading the data later on.

cytotable.convert._infer_source_group_common_schema(*args, **kwargs)¶

Infers a common schema for a group of parquet files which may have similar but slightly different schema or data. Intended to assist with data concatenation and other operations.

Parameters:

source_group – List[Dict[str, Any]]: Group of one or more data sources which includes metadata about path to parquet data.
data_type_cast_map – Optional[Dict[str, str]], default None A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html

Returns:

List[Tuple[str, pa.DataType]]: A list of tuples which includes column name and PyArrow datatype. This data will later be used as the basis for forming a PyArrow schema.

cytotable.convert._join_source_pageset(*args, **kwargs)¶

Join sources based on join group keys (group of specific join column values)

Parameters:

dest_path – str: Destination path to write file-based content.
joins – str: DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.
join_group – List[Dict[str, Any]]: Group of joinable keys to be used as “chunked” filter of overall dataset.
drop_null – bool: Whether to drop rows with null values within the resulting joined data.

Returns:

str: Path to joined file which is created as a result of this function.

cytotable.convert._prepare_join_sql(*args, **kwargs)¶

Prepare join SQL statement with actual locations of data based on the sources.

Parameters:

sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.
joins – str: DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.
sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

String representing the SQL to be used in later join work.

Return type:

str

cytotable.convert._prep_cast_column_data_types(*args, **kwargs)¶

Cast data types per what is received in cast_map.

Example: - columns: [{“column_id”:0, “column_name”:”colname”, “column_dtype”:”DOUBLE”}] - data_type_cast_map: {“float”: “float32”}

The above passed through this function will set the “column_dtype” value to a “REAL” dtype (“REAL” in duckdb is roughly equivalent to “float32”)

Parameters:

table_path – str: Path to a parquet file which will be modified.
data_type_cast_map –
Dict[str, str] A dictionary mapping data type groups to specific types. Roughly to eventually align with DuckDB types: https://duckdb.org/docs/sql/data_types/overview

Note: includes synonym matching for common naming convention use in Pandas and/or PyArrow via cytotable.utils.DATA_TYPE_SYNONYMS

Returns:

List[Dict[str, str]]: list of dictionaries which each include column level information

cytotable.convert._set_tablenumber(*args, **kwargs)¶

Gathers a “TableNumber” from the image table (if CSV) or SQLite file (if SQLite source) which is a unique identifier intended to help differentiate between imagenumbers to create distinct records for single-cell profiles referenced across multiple source data exports. For example, ImageNumber column values from CellProfiler will repeat across exports, meaning we may lose distinction when combining multiple export files together through CytoTable.

Note: - If using CSV data sources, the image.csv table is used for checksum. - If using SQLite data sources, the entire SQLite database is used for checksum.

Parameters:

sources – Dict[str, List[Dict[str, Any]]] Contains metadata about data tables and related contents.
add_tablenumber – Optional[bool] Whether to add a calculated tablenumber. Note: when False, adds None as the tablenumber

Returns:

List[Dict[str, Any]]: New source group with added TableNumber details.

cytotable.convert._prepend_column_name(*args, **kwargs)¶

Rename columns using the source group name, avoiding identifying columns.

Notes: * A source_group_name represents a filename referenced as part of what is specified within targets.

Parameters:

table_path – str: Path to a parquet file which will be modified.
source_group_name – str: Name of data source source group (for common compartments, etc).
identifying_columns – List[str]: Column names which are used as ID’s and as a result need to be treated differently when renaming.
metadata – Union[List[str], Tuple[str, …]]: List of source data names which are used as metadata.
compartments – List[str]: List of source data names which are used as compartments.

Returns:

str: Path to the modified file.

cytotable.convert._source_pageset_to_parquet(*args, **kwargs)¶

Export source data to chunked parquet file using chunk size and offsets.

Parameters:

source_group_name – str Name of the source group (for ex. compartment or metadata table name).
source – Dict[str, Any] Contains the source data to be chunked. Represents a single file or table of some kind along with collected information about table.
pageset – Optional[Tuple[Union[int, float], Union[int, float]]] The pageset for chunking the data from source.
dest_path – str Path to store the output data.
sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

str: A string of the output filepath.

Sources¶

CytoTable: sources - tasks and flows related to source data and metadata for performing conversion work.

cytotable.sources._build_path(path: str, **kwargs) → Path | AnyPath[source]¶

Build a path client or return local path.

Parameters:

path – Union[pathlib.Path, Any]: Path to seek filepaths within.
**kwargs – Any keyword arguments to be used with Cloudpathlib.CloudPath.client .

Returns:

Union[pathlib.Path, Any]: A local pathlib.Path or Cloudpathlib.AnyPath type path.

cytotable.sources._file_is_more_than_one_line(path: Path | AnyPath) → bool[source]¶

Check if the file has more than one line.

Parameters:: path (Union[pathlib.Path, AnyPath]) – The path to the file.
Returns:: True if the file has more than one line, False otherwise.
Return type:: bool
Raises:: NoInputDataException – If the file has zero lines.

cytotable.sources._filter_source_filepaths(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str) → Dict[str, List[Dict[str, Any]]][source]¶

Filter source filepaths based on provided source_datatype.

Parameters:

sources – Dict[str, List[Dict[str, Any]]] Grouped datasets of files which will be used by other functions.
source_datatype – str Source datatype to use for filtering the dataset.

Returns:

Dict[str, List[Dict[str, Any]]]: Data structure which groups related files based on the datatype.

cytotable.sources._gather_sources(source_path: str, source_datatype: str | None = None, targets: List[str] | None = None, **kwargs) → Dict[str, List[Dict[str, Any]]][source]¶

Flow for gathering data sources for conversion.

Parameters:

source_path – str: Where to gather file-based data from.
source_datatype – Optional[str]: (Default value = None) The source datatype (extension) to use for reading the tables.
targets – Optional[List[str]]: (Default value = None) The source file names to target within the provided path.

Returns:

Dict[str, List[Dict[str, Any]]]: Data structure which groups related files based on the compartments.

cytotable.sources._get_source_filepaths(path: Path | AnyPath, targets: List[str] | None = None, source_datatype: str | None = None) → Dict[str, List[Dict[str, Any]]][source]¶

Gather dataset of filepaths from a provided directory path.

Parameters:

path – Union[pathlib.Path, Any]: Either a directory path to seek filepaths within or a path directly to a file.
targets – List[str]: Compartment and metadata names to seek within the provided path.
source_datatype – Optional[str]: (Default value = None) The source datatype (extension) to use for reading the tables.

Returns:

Dict[str, List[Dict[str, Any]]]: Data structure which groups related files based on the compartments.

cytotable.sources._infer_source_datatype(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str | None = None) → str[source]¶

Infers and optionally validates datatype (extension) of files.

Parameters:

sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions.
source_datatype – Optional[str]: (Default value = None) Optional source datatype to validate within the context of detected datatypes.

Returns:

str: A string of the datatype detected or validated source_datatype.

Utils¶

Utility functions for CytoTable

cytotable.utils.Parsl_AppBase_init_for_docs(self, func, *args, **kwargs)[source]¶: A function to extend Parsl.app.app.AppBase with docstring from decorated functions rather than the decorators from Parsl. Used for Sphinx documentation purposes.

cytotable.utils._arrow_type_cast_if_specified(column: Dict[str, str], data_type_cast_map: Dict[str, str]) → Dict[str, str][source]¶

Attempts to cast data types for an PyArrow field using provided a data_type_cast_map.

Parameters:

column – Dict[str, str]: Dictionary which includes a column idx, name, and dtype
data_type_cast_map – Dict[str, str] A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html Example: {“float”: “float32”}

Returns:

Dict[str, str]: A potentially data type updated dictionary of column information

cytotable.utils._cache_cloudpath_to_local(path: AnyPath) → Path[source]¶

Takes a cloudpath and uses cache to convert to a local copy for use in scenarios where remote work is not possible (sqlite).

Parameters:

path – Union[str, AnyPath] A filepath which will be checked and potentially converted to a local filepath.

Returns:

pathlib.Path: A local pathlib.Path to cached version of cloudpath file.

cytotable.utils._column_sort(value: str)[source]¶: A custom sort for column values as a list. To be used with sorted and Pyarrow tables.

cytotable.utils._default_parsl_config()[source]¶: Return a default Parsl configuration for use with CytoTable.

cytotable.utils._duckdb_reader() → DuckDBPyConnection[source]¶

Creates a DuckDB connection with the sqlite_scanner installed and loaded.

Note: using this function assumes implementation will close the subsequently created DuckDB connection using _duckdb_reader().close() or using a context manager, for ex., using: with _duckdb_reader() as ddb_reader:

Returns:: duckdb.DuckDBPyConnection

cytotable.utils._expand_path(path: str | Path | AnyPath) → Path | AnyPath[source]¶

Expands “~” user directory references with the user’s home directory, and expands variable references with values from the environment. After user/variable expansion, the path is resolved and an absolute path is returned.

Parameters:

path – Union[str, pathlib.Path, CloudPath]: Path to expand.

Returns:

Union[pathlib.Path, Any]: A local pathlib.Path or Cloudpathlib.AnyPath type path.

cytotable.utils._extract_npz_to_parquet(source_path: str, dest_path: str, tablenumber: int | None = None) → str[source]¶

Extract data from an .npz file created by DeepProfiler as a tabular dataset and write to parquet.

DeepProfiler creates datasets which look somewhat like this: Keys in the .npz file: [‘features’, ‘metadata’, ‘locations’]

Variable: features Shape: (229, 6400) Data type: float32

Variable: locations Shape: (229, 2) Data type: float64

Variable: metadata Shape: () Data type: object Whole object: { ‘Metadata_Plate’: ‘SQ00014812’, ‘Metadata_Well’: ‘A01’, ‘Metadata_Site’: 1, ‘Plate_Map_Name’: ‘C-7161-01-LM6-022’, ‘RNA’: ‘SQ00014812/r01c01f01p01-ch3sk1fk1fl1.png’, ‘ER’: ‘SQ00014812/r01c01f01p01-ch2sk1fk1fl1.png’, ‘AGP’: ‘SQ00014812/r01c01f01p01-ch4sk1fk1fl1.png’, ‘Mito’: ‘SQ00014812/r01c01f01p01-ch5sk1fk1fl1.png’, ‘DNA’: ‘SQ00014812/r01c01f01p01-ch1sk1fk1fl1.png’, ‘Treatment_ID’: 0, ‘Treatment_Replicate’: 1, ‘Treatment’: ‘DMSO@NA’, ‘Compound’: ‘DMSO’, ‘Concentration’: ‘’, ‘Split’: ‘Training’, ‘Metadata_Model’: ‘efficientnet’ }

Parameters:

source_path – str Path to the .npz file.
dest_path – str Destination path for the parquet file.
tablenumber – Optional[int] Optional tablenumber to be added to the data.

Returns:

str: Path to the exported parquet file.

cytotable.utils._gather_tablenumber_checksum(pathname: str, buffer_size: int = 1048576) → int[source]¶

Build and return a checksum for use as a unique identifier across datasets referenced from cytominer-database: https://github.com/cytomining/cytominer-database/blob/master/cytominer_database/ingest_variable_engine.py#L129

Parameters:

pathname – str: A path to a file with which to generate the checksum on.
buffer_size – int: Buffer size to use for reading data.

Returns:

int: an integer representing the checksum of the pathname file.

cytotable.utils._generate_pagesets(keys: List[int | float], chunk_size: int) → List[Tuple[int | float, int | float]][source]¶

Generate a pageset (keyset pagination) from a list of keys.

Parameters:

List[Union[int (keys) – List of keys to paginate.
float]] – List of keys to paginate.
int (chunk_size) – Size of each chunk/page.

Returns:

List of (start_key, end_key) tuples representing each page.

Return type:

List[Tuple[Union[int, float], Union[int, float]]]

cytotable.utils._get_cytotable_version() → str[source]¶

Seeks the current version of CytoTable using either pkg_resources or dunamai to determine the current version being used.

Returns:

str: A string representing the version of CytoTable currently being used.

cytotable.utils._natural_sort(list_to_sort)[source]¶

Sorts the given iterable using natural sort adapted from approach provided by the following link: https://stackoverflow.com/a/4836734

Parameters:: list_to_sort – List: The list to sort.
Returns:: The sorted list.
Return type:: List

cytotable.utils._parsl_loaded() → bool[source]¶: Checks whether Parsl configuration has already been loaded.

cytotable.utils._sqlite_mixed_type_query_to_parquet(source_path: str, table_name: str, page_key: str, pageset: Tuple[int | float, int | float], sort_output: bool, tablenumber: int | None = None) → str[source]¶

Performs SQLite table data extraction where one or many columns include data values of potentially mismatched type such that the data may be exported to Arrow for later use.

Parameters:

source_path – str: A str which is a path to a SQLite database file.
table_name – str: The name of the table being queried.
page_key – str: The column name to be used to identify pagination chunks.
pageset – Tuple[Union[int, float], Union[int, float]]: The range for values used for paginating data from source.
sort_output – bool Specifies whether to sort cytotable output or not.
add_cytotable_meta – bool, default=False: Whether to add CytoTable metadata fields or not
tablenumber – Optional[int], default=None: An optional table number to append to the results. Defaults to None.

Returns:

The resulting arrow table for the data

Return type:

pyarrow.Table

cytotable.utils._unwrap_source(source: Dict[str, AppFuture | Any] | AppFuture | Any) → Dict[str, Any] | Any[source]¶

Helper function to unwrap futures from sources.

Parameters:

source – Union[ Dict[str, Union[parsl.dataflow.futures.AppFuture, Any]], Union[parsl.dataflow.futures.AppFuture, Any],
] – A source is a portion of an internal data structure used by CytoTable for processing and organizing data results.

Returns:

Union[Dict[str, Any], Any]: An evaluated dictionary or other value type.

cytotable.utils._unwrap_value(val: AppFuture | Any) → Any[source]¶

Helper function to unwrap futures from values or return values where there are no futures.

Parameters:

val – Union[parsl.dataflow.futures.AppFuture, Any] A value which may or may not be a Parsl future which needs to be evaluated.

Returns:

Any: Returns the value as-is if there’s no future, the future result if Parsl futures are encountered.

cytotable.utils._write_parquet_table_with_metadata(table: Table, **kwargs) → None[source]¶

Adds metadata to parquet output from CytoTable. Note: this mostly wraps pyarrow.parquet.write_table https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html

Parameters:

table – pa.Table: Pyarrow table to be serialized as parquet table.
**kwargs – Any: kwargs provided to this function roughly align with pyarrow.parquet.write_table. The following might be examples of what to expect here: - where: str or pyarrow.NativeFile

Globs under start and yields matching paths. We provide cloud-platform specific optimizations as needed based on platform SDK’s.

Behavior by input type:

S3 (cloudpathlib S3Path or ‘s3://…’):
Use unsigned boto3 to list keys and yield unsigned cloudpathlib.S3Path.
Other CloudPath (e.g., GCS/Azure/local providers via cloudpathlib):
Fallback to CloudPath.glob(pattern), yielding CloudPath.
Local filesystem (pathlib.Path or non-s3 string):
Fallback to Path.glob(pattern), yielding pathlib.Path.

Parameters:

start – CloudPath, pathlib.Path, or URI string.
pattern – Glob pattern relative to start (supports ** for S3 branch).
max_matches – Optional cap on yielded results.
cp_client – cloudpathlib S3Client (unsigned recommended).
boto_s3_client – boto3 S3 client (unsigned recommended).

Yields:

cloudpathlib.S3Path for S3; CloudPath for other cloud providers; pathlib.Path for local.

cytotable.utils.evaluate_futures(sources: Dict[str, List[Dict[str, Any]]] | List[Any] | str) → Any[source]¶

Evaluates any Parsl futures for use within other tasks. This enables a pattern of Parsl app usage as “tasks” and delayed future result evaluation for concurrency.

Parameters:

sources – Union[Dict[str, List[Dict[str, Any]]], List[Any], str] Sources are an internal data structure used by CytoTable for processing and organizing data results. They may include futures which require asynchronous processing through Parsl, so we process them through this function.

Returns:

Union[Dict[str, List[Dict[str, Any]]], str]: A data structure which includes evaluated futures where they were found.

cytotable.utils.find_anndata_metadata_field_names(source: str | Path) → tuple[list[str], list[str]][source]¶

Classify the source table columns as numeric and non-numeric.

Scans the Parquet file schema and returns two lists of column names: those with numeric types (float, integer, decimal) and those with any other type. This is handy for separating AnnData metadata fields by basic numeric-ness for downstream processing.

Parameters:: source – Path to a Parquet file to inspect.
Returns:: A 2-tuple (numeric_fields, non_numeric_fields), where each element is a list of column names.

cytotable.utils.map_pyarrow_type(field_type: DataType, data_type_cast_map: Dict[str, str] | None) → DataType[source]¶

Map PyArrow types dynamically to handle nested types and casting.

This function takes a PyArrow field_type and dynamically maps it to a valid PyArrow type, handling nested types (e.g., lists, structs) and resolving type conflicts (e.g., integer to float). It also supports custom type casting using the data_type_cast_map parameter.

Parameters:

field_type – pa.DataType The PyArrow data type to be mapped. This can include simple types (e.g., int, float, string) or nested types (e.g., list, struct).
data_type_cast_map – Optional[Dict[str, str]], default None A dictionary mapping data type groups to specific types. This allows for custom type casting. For example: - {“float”: “float32”} maps floating-point types to float32. - {“int”: “int64”} maps integer types to int64. If data_type_cast_map is None, default PyArrow types are used.

Returns:

pa.DataType: The mapped PyArrow data type. If no mapping is needed, the original field_type is returned.

Presets¶

cytotable.presets.config¶: Configuration presets for CytoTable

Exceptions¶

Provide hierarchy of exceptions for CytoTable

exception cytotable.exceptions.CytoTableException[source]¶

Bases: Exception

Root exception for custom hierarchy of exceptions with CytoTable.

exception cytotable.exceptions.DatatypeException[source]¶

Bases: CytoTableException

Exception for datatype challenges.

exception cytotable.exceptions.NoInputDataException[source]¶

Bases: CytoTableException

Exception for no input data.

exception cytotable.exceptions.SchemaException[source]¶

Bases: CytoTableException

Exception for schema challenges.

Python API¶

Convert¶

Sources¶

Utils¶

Presets¶

Exceptions¶

CytoTable

Navigation

Related Topics