Python API

Convert

CytoTable: convert - transforming data for use with Pycytominer.

cytotable.convert._to_parquet(source_path: str, dest_path: str, source_datatype: str | None, metadata: List[str] | Tuple[str, ...] | None, compartments: List[str] | Tuple[str, ...] | None, identifying_columns: List[str] | Tuple[str, ...] | None, concat: bool, join: bool, joins: str | None, chunk_size: int | None, infer_common_schema: bool, drop_null: bool, sort_output: bool, page_keys: Dict[str, str], data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, **kwargs) Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]

Export data to parquet.

Parameters:
  • source_path – str: Path to read source files from. Note: may be a local path or a remote object-storage location using the convention “s3://…” or similar.

  • dest_path – str: Path to write files to. This path is used for intermediary data work and must be a new file or directory path. The result is a directory when join=False and a single file when join=True. Note: this may only be a local path.

  • source_datatype – Optional[str]: (Default value = None) Source datatype to focus on during conversion.

  • metadata – Union[List[str], Tuple[str, …]]: Metadata names to use for conversion.

  • compartments – Union[List[str], Tuple[str, …]]: (Default value = None) Compartment names to use for conversion.

  • identifying_columns – Union[List[str], Tuple[str, …]]: Column names which are used as IDs and are therefore ignored during renaming.

  • concat – bool: Whether to concatenate similar files together.

  • join – bool: Whether to join the compartment data together into one dataset.

  • joins – str: DuckDB-compatible SQL which will be used to perform the join operations.

  • chunk_size – Optional[int]: Size of join chunks, used to limit data size during join operations.

  • infer_common_schema – bool: (Default value = True) Whether to infer a common schema when concatenating sources.

  • drop_null – bool: Whether to drop null results.

  • sort_output – bool: Specifies whether to sort CytoTable output.

  • page_keys – Dict[str, str]: A dictionary which defines which column names are used for keyset pagination during data extraction.

  • data_type_cast_map – Dict[str, str] A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html

  • **kwargs – Any: Keyword args used for gathering source data, primarily relevant for Cloudpathlib cloud-based client configuration.

Returns:

Grouped sources which include metadata about destination filepath where parquet file was written or a string filepath for the joined result.

Return type:

Union[Dict[str, List[Dict[str, Any]]], str]

cytotable.convert.convert(source_path: str, dest_path: str, dest_datatype: Literal['parquet'], source_datatype: str | None = None, metadata: List[str] | Tuple[str, ...] | None = None, compartments: List[str] | Tuple[str, ...] | None = None, identifying_columns: List[str] | Tuple[str, ...] | None = None, concat: bool = True, join: bool = True, joins: str | None = None, chunk_size: int | None = None, infer_common_schema: bool = True, drop_null: bool = False, data_type_cast_map: Dict[str, str] | None = None, add_tablenumber: bool | None = None, page_keys: Dict[str, str] | None = None, sort_output: bool = True, preset: str | None = 'cellprofiler_csv', parsl_config: Config | None = None, **kwargs) Dict[str, List[Dict[str, Any]]] | List[Any] | str[source]

Convert file-based data from various sources to Pycytominer-compatible standards.

Note: source paths may be local or remote object-storage location using convention “s3://…” or similar.

Parameters:
  • source_path – str: Path to read source files from. Note: may be a local path or a remote object-storage location using the convention “s3://…” or similar.

  • dest_path – str: Path to write files to. This path is used for intermediary data work and must be a new file or directory path. The result is a directory when join=False and a single file when join=True. Note: this may only be a local path.

  • dest_datatype – Literal[“parquet”]: Destination datatype to write to.

  • source_datatype – Optional[str]: (Default value = None) Source datatype to focus on during conversion.

  • metadata – Union[List[str], Tuple[str, …]]: Metadata names to use for conversion.

  • compartments – Union[List[str], Tuple[str, …]]: (Default value = None) Compartment names to use for conversion.

  • identifying_columns – Union[List[str], Tuple[str, …]]: Column names which are used as IDs and are therefore ignored during renaming.

  • concat – bool: (Default value = True) Whether to concatenate similar files together.

  • join – bool: (Default value = True) Whether to join the compartment data together into one dataset.

  • joins – Optional[str]: (Default value = None) DuckDB-compatible SQL which will be used to perform the join operations.

  • chunk_size – Optional[int]: (Default value = None) Size of join chunks, used to limit data size during join operations.

  • infer_common_schema – bool: (Default value = True) Whether to infer a common schema when concatenating sources.

  • data_type_cast_map – Optional[Dict[str, str]]: (Default value = None) A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html

  • add_tablenumber – Optional[bool]: (Default value = None) Whether to add a calculated tablenumber which helps differentiate various repeated values (such as ObjectNumber) within source data. Useful for processing multiple SQLite or CSV data sources together to retain distinction from each dataset.

  • page_keys – Optional[Dict[str, str]]: (Default value = None) The table and column names to be used for keyset pagination, in the form {"table_name": "column_name"}. Columns are expected to contain numeric data (ints or floats). Works together with the chunk_size parameter to form pages of size chunk_size.

  • sort_output – bool: (Default value = True) Specifies whether to sort CytoTable output.

  • drop_null – bool: (Default value = False) Whether to drop nan/null values from results.

  • preset – Optional[str]: (Default value = “cellprofiler_csv”) An optional group of presets to use based on common configurations.

  • parsl_config – Optional[parsl.Config]: (Default value = None) Optional Parsl configuration to use for running CytoTable operations. Note: when using CytoTable multiple times in the same process, CytoTable will use the first provided configuration for all runs.

Returns:

Union[Dict[str, List[Dict[str, Any]]], str]

Grouped sources which include metadata about destination filepath where parquet file was written or str of joined result filepath.

Example

from cytotable import convert

# using a local path with cellprofiler csv presets
convert(
    source_path="./tests/data/cellprofiler/ExampleHuman",
    source_datatype="csv",
    dest_path="ExampleHuman.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_csv",
)

# using an s3-compatible path with no signature for client
# and cellprofiler csv presets
convert(
    source_path="s3://s3path",
    source_datatype="csv",
    dest_path="s3_local_result",
    dest_datatype="parquet",
    concat=True,
    preset="cellprofiler_csv",
    no_sign_request=True,
)

# using local path with cellprofiler sqlite presets
convert(
    source_path="example.sqlite",
    dest_path="example.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_sqlite",
)

cytotable.convert._concat_join_sources(*args, **kwargs)

Concatenate join sources from parquet-based chunks.

For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html

Parameters:
  • sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.

  • dest_path – str: Destination path to write file-based content.

  • join_sources – List[str]: List of local filepath destination for join source chunks which will be concatenated.

  • sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

str

Path to concatenated file which is created as a result of this function.


cytotable.convert._concat_source_group(*args, **kwargs)

Concatenate group of source data together as single file.

For a reference to data concatenation within Arrow see the following: https://arrow.apache.org/docs/python/generated/pyarrow.concat_tables.html

Notes: this function presumes a multi-directory, multi-file common data structure for compartments and other data. For example:

Source (file tree):

root
├── subdir_1
│  └── Cells.csv
└── subdir_2
    └── Cells.csv

Becomes:

# earlier data read into parquet chunks from multiple
# data source files.
read_data = [
    {"table": ["cells-1.parquet", "cells-2.parquet"]},
    {"table": ["cells-1.parquet", "cells-2.parquet"]},
]

# focus of this function
concatted = [{"table": ["cells.parquet"]}]

Parameters:
  • source_group_name – str: Name of the data source group (for common compartments, etc).

  • source_group – List[Dict[str, Any]]: Data structure containing grouped data for concatenation.

  • dest_path – Optional[str]: (Default value = None) Optional destination path for concatenated sources.

  • common_schema – List[Tuple[str, str]]: (Default value = None) Common schema to use for concatenation amongst Arrow tables which may have slightly different but compatible schemas.

  • sort_output – bool: Specifies whether to sort CytoTable output.

Returns:

List[Dict[str, Any]]

Updated dictionary containing concatenated sources.


cytotable.convert._get_table_columns_and_types(*args, **kwargs)

Gather column data from table through duckdb.

Parameters:
  • source – Dict[str, Any] Contains source data details. Represents a single file or table of some kind.

  • sort_output – Specifies whether to sort cytotable output or not.

Returns:

List[Dict[str, str]]

List of dictionaries, each of which includes column-level information.


cytotable.convert._get_table_keyset_pagination_sets(*args, **kwargs)

Get table data chunk keys for later use in capturing segments of values. This work also provides a chance to catch problematic input data which will be ignored with warnings.

Parameters:
  • source – Dict[str, Any] Contains the source data to be chunked. Represents a single file or table of some kind.

  • chunk_size – int The size in rowcount of the chunks to create.

  • page_key – str The column name to be used to identify pagination chunks. Expected to be of numeric type (int, float) for ordering.

  • sql_stmt – Optional sql statement to form the pagination set from. Default behavior extracts pagination sets from the full data source.

Returns:

List[Any]

List of keys to use for reading the data later on.


cytotable.convert._infer_source_group_common_schema(*args, **kwargs)

Infers a common schema for group of parquet files which may have similar but slightly different schema or data. Intended to assist with data concatenation and other operations.

Parameters:
  • source_group – List[Dict[str, Any]]: Group of one or more data sources which includes metadata about path to parquet data.

  • data_type_cast_map – Optional[Dict[str, str]], default None A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html

Returns:

List[Tuple[str, str]]

A list of tuples which includes column name and PyArrow datatype. This data will later be used as the basis for forming a PyArrow schema.


cytotable.convert._join_source_pageset(*args, **kwargs)

Join sources based on join group keys (group of specific join column values).

Parameters:
  • dest_path – str: Destination path to write file-based content.

  • joins – str: DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.

  • join_group – List[Dict[str, Any]]: Group of joinable keys to be used as “chunked” filter of overall dataset.

  • drop_null – bool: Whether to drop rows with null values within the resulting joined data.

Returns:

str

Path to joined file which is created as a result of this function.


cytotable.convert._prepare_join_sql(*args, **kwargs)

Prepare join SQL statement with actual locations of data based on the sources.

Parameters:
  • sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions. Includes the metadata concerning location of actual data.

  • joins – str: DuckDB-compatible SQL which will be used to perform the join operations using the join_group keys as a reference.

  • sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

String representing the SQL to be used in later join work.

Return type:

str
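
As an illustration of the substitution step described above, the following sketch swaps placeholder parquet names inside a join statement for the chunk paths recorded in sources. The placeholder convention ('cells.parquet'), the "table" key layout, and the helper name prepare_join_sql_sketch are assumptions made for this example; it is not CytoTable's implementation.

```python
from pathlib import Path
from typing import Any, Dict, List


def prepare_join_sql_sketch(
    sources: Dict[str, List[Dict[str, Any]]], joins: str
) -> str:
    """Replace placeholder parquet names in the SQL with real file paths."""
    for source_name, source_group in sources.items():
        # e.g. "Cells.csv" -> "cells"
        table_name = Path(source_name).stem.lower()
        placeholder = f"'{table_name}.parquet'"
        # collect every chunk path recorded for this source group
        paths = [str(path) for source in source_group for path in source["table"]]
        # DuckDB's read_parquet accepts a list of files, e.g. ['a', 'b']
        joins = joins.replace(placeholder, str(paths))
    return joins


# hypothetical sources and join SQL, for illustration only
sources = {
    "Cells.csv": [{"table": ["out/cells-1.parquet", "out/cells-2.parquet"]}],
    "Image.csv": [{"table": ["out/image-1.parquet"]}],
}
joins = (
    "SELECT * FROM read_parquet('image.parquet') AS image "
    "LEFT JOIN read_parquet('cells.parquet') AS cells USING (ImageNumber)"
)
prepared = prepare_join_sql_sketch(sources, joins)
```

The prepared statement then refers to concrete files and can be passed to later join work.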


cytotable.convert._prep_cast_column_data_types(*args, **kwargs)

Cast data types per what is received in cast_map.

Example:
  • columns: [{"column_id": 0, "column_name": "colname", "column_dtype": "DOUBLE"}]
  • data_type_cast_map: {"float": "float32"}

Passing the above through this function sets the “column_dtype” value to “REAL” (“REAL” in DuckDB is roughly equivalent to “float32”).

Parameters:
  • table_path – str: Path to a parquet file which will be modified.

  • data_type_cast_map – Dict[str, str]: A dictionary mapping data type groups to specific types, intended to roughly align with DuckDB types: https://duckdb.org/docs/sql/data_types/overview Note: includes synonym matching for common naming conventions used in Pandas and/or PyArrow via cytotable.utils.DATA_TYPE_SYNONYMS

Returns:

List[Dict[str, str]]

List of dictionaries, each of which includes column-level information.
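
The DOUBLE to REAL example above can be reproduced with a small lookup-based sketch. The two lookup tables and the function name cast_column_types_sketch are hypothetical and deliberately incomplete; the real function relies on cytotable.utils.DATA_TYPE_SYNONYMS, whose contents are not shown in this reference.

```python
from typing import Any, Dict, List

# Hypothetical lookups for illustration only (not CytoTable's own tables).
DUCKDB_TYPE_GROUPS = {
    "float": {"DOUBLE", "REAL", "FLOAT"},
    "integer": {"BIGINT", "INTEGER", "SMALLINT", "TINYINT"},
}
TARGET_TO_DUCKDB = {
    "float32": "REAL",
    "float64": "DOUBLE",
    "int32": "INTEGER",
    "int64": "BIGINT",
}


def cast_column_types_sketch(
    columns: List[Dict[str, Any]], data_type_cast_map: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Return copies of the column records with dtypes updated per the cast map."""
    updated = []
    for column in columns:
        new_column = dict(column)  # avoid mutating the caller's data
        for group, target in data_type_cast_map.items():
            if column["column_dtype"] in DUCKDB_TYPE_GROUPS.get(group, set()):
                # fall back to the existing dtype when the target is unknown
                new_column["column_dtype"] = TARGET_TO_DUCKDB.get(
                    target, column["column_dtype"]
                )
        updated.append(new_column)
    return updated


columns = [{"column_id": 0, "column_name": "colname", "column_dtype": "DOUBLE"}]
casted = cast_column_types_sketch(columns, {"float": "float32"})
```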


cytotable.convert._set_tablenumber(*args, **kwargs)

Gathers a “TableNumber” from the image table (if CSV) or SQLite file (if SQLite source). The TableNumber is a unique identifier intended to help differentiate between ImageNumber values, creating distinct records for single-cell profiles referenced across multiple source data exports. For example, ImageNumber column values from CellProfiler repeat across exports, so distinction may be lost when combining multiple export files through CytoTable.

Note:
  • If using CSV data sources, the image.csv table is used for the checksum.
  • If using SQLite data sources, the entire SQLite database is used for the checksum.

Parameters:
  • sources – Dict[str, List[Dict[str, Any]]] Contains metadata about data tables and related contents.

  • add_tablenumber – Optional[bool] Whether to add a calculated tablenumber. Note: when False, adds None as the tablenumber

Returns:

List[Dict[str, Any]]

New source group with added TableNumber details.


cytotable.convert._prepend_column_name(*args, **kwargs)

Rename columns using the source group name, avoiding identifying columns.

Notes: A source_group_name represents a filename referenced as part of what is specified within targets.

Parameters:
  • table_path – str: Path to a parquet file which will be modified.

  • source_group_name – str: Name of the data source group (for common compartments, etc).

  • identifying_columns – List[str]: Column names which are used as IDs and, as a result, need to be treated differently when renaming.

  • metadata – Union[List[str], Tuple[str, …]]: List of source data names which are used as metadata.

  • compartments – List[str]: List of source data names which are used as compartments.

Returns:

str

Path to the modified file.
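
The renaming idea can be sketched on a plain list of column names. The prefix format (capitalized file stem plus an underscore) and the choice to leave identifying columns untouched are assumptions for this example; the reference only states that identifying columns are treated differently, and the real function operates on a parquet file rather than a list.

```python
from pathlib import Path
from typing import List, Sequence


def prepend_column_names_sketch(
    column_names: Sequence[str],
    source_group_name: str,
    identifying_columns: Sequence[str],
) -> List[str]:
    """Prefix columns with the source group name, skipping identifying columns."""
    # e.g. "cells.csv" -> "Cells"
    prefix = Path(source_group_name).stem.capitalize()
    renamed = []
    for name in column_names:
        if name in identifying_columns or name.startswith(f"{prefix}_"):
            # identifying or already-prefixed columns are left as they are
            renamed.append(name)
        else:
            renamed.append(f"{prefix}_{name}")
    return renamed


renamed = prepend_column_names_sketch(
    ["ImageNumber", "ObjectNumber", "AreaShape_Area"],
    "cells.csv",
    ["ImageNumber", "ObjectNumber"],
)
```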


cytotable.convert._source_pageset_to_parquet(*args, **kwargs)

Export source data to chunked parquet file using chunk size and offsets.

Parameters:
  • source_group_name – str Name of the source group (for ex. compartment or metadata table name).

  • source – Dict[str, Any] Contains the source data to be chunked. Represents a single file or table of some kind along with collected information about table.

  • pageset – Tuple[int, int] The pageset for chunking the data from source.

  • dest_path – str Path to store the output data.

  • sort_output – bool Specifies whether to sort cytotable output or not.

Returns:

str

A string of the output filepath.
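
To make the role of a pageset concrete, the sketch below reads one page of rows bounded by a (start_key, end_key) pair. It uses the standard-library sqlite3 module instead of DuckDB and returns rows instead of writing a parquet file, so it only illustrates the keyset filter; the table, column names, and helper name are hypothetical.

```python
import sqlite3
from typing import Any, List, Tuple, Union

Number = Union[int, float]


def read_pageset_sketch(
    conn: sqlite3.Connection,
    table_name: str,
    page_key: str,
    pageset: Tuple[Number, Number],
) -> List[Tuple[Any, ...]]:
    """Read the rows whose page_key falls within the inclusive pageset bounds."""
    start, end = pageset
    # identifiers cannot be bound as parameters, so they are quoted inline;
    # the key bounds themselves are passed as bound parameters
    query = (
        f'SELECT * FROM "{table_name}" '
        f'WHERE "{page_key}" BETWEEN ? AND ? '
        f'ORDER BY "{page_key}"'
    )
    return conn.execute(query, (start, end)).fetchall()


# small in-memory table for demonstration
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "Cells" ("ObjectNumber" INTEGER, "Area" REAL)')
conn.executemany(
    'INSERT INTO "Cells" VALUES (?, ?)', [(i, i * 1.5) for i in range(1, 6)]
)
page = read_pageset_sketch(conn, "Cells", "ObjectNumber", (3, 4))
```

Filtering on key bounds rather than on row offsets means each page can be read without scanning and discarding the rows that precede it.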


Sources

CytoTable: sources - tasks and flows related to source data and metadata for performing conversion work.

cytotable.sources._build_path(path: str, **kwargs) Path | AnyPath[source]

Build a path client or return local path.

Parameters:
  • path – Union[pathlib.Path, Any]: Path to seek filepaths within.

  • **kwargs – Any: Keyword arguments to be used with Cloudpathlib.CloudPath.client.

Returns:

Union[pathlib.Path, Any]

A local pathlib.Path or Cloudpathlib.AnyPath type path.

cytotable.sources._file_is_more_than_one_line(path: Path | AnyPath) bool[source]

Check if the file has more than one line.

Parameters:

path (Union[pathlib.Path, AnyPath]) – The path to the file.

Returns:

True if the file has more than one line, False otherwise.

Return type:

bool

Raises:

NoInputDataException – If the file has zero lines.
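
A minimal sketch of this check for local files follows. NoInputDataSketchError is a local stand-in for cytotable.exceptions.NoInputDataException so that the snippet runs without CytoTable installed, and cloud paths are not handled here.

```python
from pathlib import Path
from typing import Union


class NoInputDataSketchError(Exception):
    """Local stand-in for cytotable.exceptions.NoInputDataException."""


def file_is_more_than_one_line_sketch(path: Union[str, Path]) -> bool:
    """Return True when the file has at least two lines; raise when it has none."""
    with open(path, "r", encoding="utf-8") as handle:
        first_line = handle.readline()
        if not first_line:
            raise NoInputDataSketchError(f"File has zero lines: {path}")
        # only the second line is read, so large files are not loaded fully
        return bool(handle.readline())
```

Reading at most two lines keeps the check cheap, which matters when it is run against every source file, for example to detect a CSV that contains only a header row.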

cytotable.sources._filter_source_filepaths(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str) Dict[str, List[Dict[str, Any]]][source]

Filter source filepaths based on provided source_datatype.

Parameters:
  • sources – Dict[str, List[Dict[str, Any]]] Grouped datasets of files which will be used by other functions.

  • source_datatype – str Source datatype to use for filtering the dataset.

Returns:

Dict[str, List[Dict[str, Any]]]

Data structure which groups related files based on the datatype.

cytotable.sources._gather_sources(source_path: str, source_datatype: str | None = None, targets: List[str] | None = None, **kwargs) Dict[str, List[Dict[str, Any]]][source]

Flow for gathering data sources for conversion.

Parameters:
  • source_path – str: Where to gather file-based data from.

  • source_datatype – Optional[str]: (Default value = None) The source datatype (extension) to use for reading the tables.

  • targets – Optional[List[str]]: (Default value = None) The source file names to target within the provided path.

Returns:

Dict[str, List[Dict[str, Any]]]

Data structure which groups related files based on the compartments.

cytotable.sources._get_source_filepaths(path: Path | AnyPath, targets: List[str] | None = None, source_datatype: str | None = None) Dict[str, List[Dict[str, Any]]][source]

Gather dataset of filepaths from a provided directory path.

Parameters:
  • path – Union[pathlib.Path, Any]: Either a directory path to seek filepaths within or a path directly to a file.

  • targets – List[str]: Compartment and metadata names to seek within the provided path.

  • source_datatype – Optional[str]: (Default value = None) The source datatype (extension) to use for reading the tables.

Returns:

Dict[str, List[Dict[str, Any]]]

Data structure which groups related files based on the compartments.

cytotable.sources._infer_source_datatype(sources: Dict[str, List[Dict[str, Any]]], source_datatype: str | None = None) str[source]

Infers and optionally validates datatype (extension) of files.

Parameters:
  • sources – Dict[str, List[Dict[str, Any]]]: Grouped datasets of files which will be used by other functions.

  • source_datatype – Optional[str]: (Default value = None) Optional source datatype to validate within the context of detected datatypes.

Returns:

str

A string of the datatype detected or validated source_datatype.
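
The infer-or-validate behaviour can be sketched as below. The "source_path" key inside each source record and the DatatypeSketchError class (a local stand-in for cytotable.exceptions.DatatypeException) are assumptions for this example, as is the choice to raise when several datatypes are detected and none is specified.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


class DatatypeSketchError(Exception):
    """Local stand-in for cytotable.exceptions.DatatypeException."""


def infer_source_datatype_sketch(
    sources: Dict[str, List[Dict[str, Any]]],
    source_datatype: Optional[str] = None,
) -> str:
    """Infer the file extension shared by the sources, or validate a given one."""
    # gather the distinct lowercase extensions found across all source records
    suffixes = sorted(
        {
            Path(str(source["source_path"])).suffix.lstrip(".").lower()
            for group in sources.values()
            for source in group
        }
    )
    if source_datatype is not None:
        if source_datatype.lower() not in suffixes:
            raise DatatypeSketchError(
                f"{source_datatype!r} not found among detected datatypes {suffixes}"
            )
        return source_datatype.lower()
    if len(suffixes) != 1:
        raise DatatypeSketchError(
            f"Expected exactly one datatype but detected {suffixes}"
        )
    return suffixes[0]


sources = {
    "cells.csv": [{"source_path": "data/a/Cells.csv"}],
    "image.csv": [{"source_path": "data/a/Image.csv"}],
}
detected = infer_source_datatype_sketch(sources)
```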


Utils

Utility functions for CytoTable

cytotable.utils.Parsl_AppBase_init_for_docs(self, func, *args, **kwargs)[source]

A function to extend Parsl.app.app.AppBase with docstring from decorated functions rather than the decorators from Parsl. Used for Sphinx documentation purposes.

cytotable.utils._arrow_type_cast_if_specified(column: Dict[str, str], data_type_cast_map: Dict[str, str]) Dict[str, str][source]

Attempts to cast data types for a PyArrow field using a provided data_type_cast_map.

Parameters:
  • column – Dict[str, str]: Dictionary which includes a column idx, name, and dtype

  • data_type_cast_map – Dict[str, str] A dictionary mapping data type groups to specific types. Roughly includes Arrow data types language from: https://arrow.apache.org/docs/python/api/datatypes.html Example: {“float”: “float32”}

Returns:

Dict[str, str]

A dictionary of column information, with the data type updated where applicable.

cytotable.utils._cache_cloudpath_to_local(path: AnyPath) Path[source]

Takes a cloudpath and uses the cache to convert it to a local copy for use in scenarios where remote work is not possible (for example, SQLite).

Parameters:

path – Union[str, AnyPath] A filepath which will be checked and potentially converted to a local filepath.

Returns:

pathlib.Path

A local pathlib.Path to cached version of cloudpath file.

cytotable.utils._column_sort(value: str)[source]

A custom sort for column values as a list. To be used with sorted and PyArrow tables.

cytotable.utils._default_parsl_config()[source]

Return a default Parsl configuration for use with CytoTable.

cytotable.utils._duckdb_reader() DuckDBPyConnection[source]

Creates a DuckDB connection with the sqlite_scanner installed and loaded.

Note: using this function assumes the caller will close the created DuckDB connection, either by using _duckdb_reader().close() or by using a context manager, for example: with _duckdb_reader() as ddb_reader:

Returns:

duckdb.DuckDBPyConnection

cytotable.utils._expand_path(path: str | Path | AnyPath) Path | AnyPath[source]

Expands “~” user directory references with the user’s home directory, and expands variable references with values from the environment. After user/variable expansion, the path is resolved and an absolute path is returned.

Parameters:

path – Union[str, pathlib.Path, CloudPath]: Path to expand.

Returns:

Union[pathlib.Path, Any]

A local pathlib.Path or Cloudpathlib.AnyPath type path.
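
For local paths, the three steps described above (variable expansion, user expansion, resolution) map onto standard-library calls, as in this sketch. Cloud paths are left out, and the helper name expand_path_sketch is hypothetical.

```python
import os
from pathlib import Path
from typing import Union


def expand_path_sketch(path: Union[str, Path]) -> Path:
    """Expand environment variables and "~", then return an absolute path."""
    # 1. replace $VAR / ${VAR} references using the current environment
    with_vars = os.path.expandvars(str(path))
    # 2. replace a leading "~" with the user's home directory
    with_user = Path(with_vars).expanduser()
    # 3. resolve to an absolute path (also collapses ".." segments)
    return with_user.resolve()
```

For instance, with the environment variable DATA_DIR set to "data", a call such as expand_path_sketch("$DATA_DIR/../out.parquet") would yield an absolute path ending in out.parquet under the current working directory.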

cytotable.utils._gather_tablenumber_checksum(pathname: str, buffer_size: int = 1048576) int[source]

Build and return a checksum for use as a unique identifier across datasets. The approach is referenced from cytominer-database: https://github.com/cytomining/cytominer-database/blob/master/cytominer_database/ingest_variable_engine.py#L129

Parameters:
  • pathname – str: Path to the file for which to generate the checksum.

  • buffer_size – int: Buffer size to use for reading data.

Returns:

int

An integer representing the checksum of the file at pathname.
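
A buffered file checksum of this kind might look like the sketch below. It assumes a CRC32 checksum computed with zlib; this reference does not state which algorithm CytoTable uses, so treat the algorithm choice and the helper name as illustrative.

```python
import zlib


def file_checksum_sketch(pathname: str, buffer_size: int = 1048576) -> int:
    """Compute a CRC32 checksum of a file by reading it in fixed-size buffers."""
    checksum = 0
    with open(pathname, "rb") as stream:
        while True:
            buffer = stream.read(buffer_size)
            if not buffer:
                break
            # feed the running checksum back in so chunks accumulate
            checksum = zlib.crc32(buffer, checksum)
    # mask to an unsigned 32-bit integer
    return checksum & 0xFFFFFFFF
```

Reading in buffers keeps memory use bounded for large SQLite or CSV files, and the result does not depend on the buffer size chosen.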

cytotable.utils._generate_pagesets(keys: List[int | float], chunk_size: int) List[Tuple[int | float, int | float]][source]

Generate a pageset (keyset pagination) from a list of keys.

Parameters:
  • keys – List[Union[int, float]]: List of keys to paginate.

  • chunk_size – int: Size of each chunk/page.

Returns:

List of (start_key, end_key) tuples representing each page.

Return type:

List[Tuple[Union[int, float], Union[int, float]]]
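
One way to form (start_key, end_key) pages from a list of keys is sketched below. The handling of repeated key values (extending a page so that a single key value never straddles two pages) is an assumption of this example, not a documented behaviour of the library.

```python
from typing import List, Tuple, Union

Number = Union[int, float]


def generate_pagesets_sketch(
    keys: List[Number], chunk_size: int
) -> List[Tuple[Number, Number]]:
    """Split keys into inclusive (start_key, end_key) pages of about chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    ordered = sorted(keys)
    pagesets = []
    i = 0
    while i < len(ordered):
        start = ordered[i]
        end_idx = min(i + chunk_size, len(ordered)) - 1
        end = ordered[end_idx]
        # extend the page over duplicates of the end key so that one key
        # value is never split across two pages
        while end_idx + 1 < len(ordered) and ordered[end_idx + 1] == end:
            end_idx += 1
        pagesets.append((start, end))
        i = end_idx + 1
    return pagesets


pages = generate_pagesets_sketch([1, 2, 3, 4, 5], chunk_size=2)
```

Each tuple can then serve as the inclusive bounds of a filter on the page key column when reading the data.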

cytotable.utils._get_cytotable_version() str[source]

Seeks the current version of CytoTable using either pkg_resources or dunamai to determine the current version being used.

Returns:

str

A string representing the version of CytoTable currently being used.

cytotable.utils._natural_sort(list_to_sort)[source]

Sorts the given iterable using natural sort adapted from approach provided by the following link: https://stackoverflow.com/a/4836734

Parameters:

list_to_sort – List: The list to sort.

Returns:

The sorted list.

Return type:

List
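
The natural sort technique referenced above splits each string into digit and non-digit runs so that numeric parts compare as integers. The following is a sketch of that general approach (with case-insensitive text comparison as an added assumption), not a copy of the library function.

```python
import re
from typing import Iterable, List


def natural_sort_sketch(items: Iterable[str]) -> List[str]:
    """Sort strings so that embedded numbers compare numerically."""

    def sort_key(value: str):
        # re.split with a capturing group alternates text and digit runs,
        # e.g. "cells-10.parquet" -> ["cells-", "10", ".parquet"]
        return [
            int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", str(value))
        ]

    return sorted(items, key=sort_key)


ordered = natural_sort_sketch(
    ["cells-10.parquet", "cells-2.parquet", "cells-1.parquet"]
)
```

A plain sorted call would place "cells-10.parquet" before "cells-2.parquet", which is why a natural ordering is useful for chunked file names.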

cytotable.utils._parsl_loaded() bool[source]

Checks whether Parsl configuration has already been loaded.

cytotable.utils._sqlite_mixed_type_query_to_parquet(source_path: str, table_name: str, page_key: str, pageset: Tuple[int | float, int | float], sort_output: bool, tablenumber: int | None = None) str[source]

Performs SQLite table data extraction where one or many columns include data values of potentially mismatched type such that the data may be exported to Arrow for later use.

Parameters:
  • source_path – str: A str which is a path to a SQLite database file.

  • table_name – str: The name of the table being queried.

  • page_key – str: The column name to be used to identify pagination chunks.

  • pageset – Tuple[int, int]: The range for values used for paginating data from source.

  • sort_output – bool Specifies whether to sort cytotable output or not.

  • add_cytotable_meta – bool, default=False: Whether to add CytoTable metadata fields or not

  • tablenumber – Optional[int], default=None: An optional table number to append to the results. Defaults to None.

Returns:

The resulting arrow table for the data

Return type:

pyarrow.Table

cytotable.utils._unwrap_source(source: Dict[str, AppFuture | Any] | AppFuture | Any) Dict[str, Any] | Any[source]

Helper function to unwrap futures from sources.

Parameters:
  • source – Union[Dict[str, Union[parsl.dataflow.futures.AppFuture, Any]], Union[parsl.dataflow.futures.AppFuture, Any]]: A source is a portion of an internal data structure used by CytoTable for processing and organizing data results.

Returns:

Union[Dict[str, Any], Any]

An evaluated dictionary or other value type.

cytotable.utils._unwrap_value(val: AppFuture | Any) Any[source]

Helper function to unwrap futures from values or return values where there are no futures.

Parameters:

val – Union[parsl.dataflow.futures.AppFuture, Any] A value which may or may not be a Parsl future which needs to be evaluated.

Returns:

Any

The value as-is if it is not a future, or the future's result if a Parsl future is encountered.
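
The unwrapping pattern used by these helpers can be sketched with the standard library. concurrent.futures.Future stands in for Parsl's AppFuture here (AppFuture is a subclass of it), and the function names are hypothetical.

```python
from concurrent.futures import Future
from typing import Any


def unwrap_value_sketch(val: Any) -> Any:
    """Return a future's result, or the value itself when it is not a future."""
    # .result() blocks until the future has completed
    return val.result() if isinstance(val, Future) else val


def unwrap_source_sketch(source: Any) -> Any:
    """Unwrap futures held directly or as the values of a dictionary."""
    if isinstance(source, dict):
        return {key: unwrap_value_sketch(value) for key, value in source.items()}
    return unwrap_value_sketch(source)


# a completed future standing in for the output of an asynchronous task
pending = Future()
pending.set_result("cells.parquet")
unwrapped = unwrap_source_sketch({"table": pending, "row_count": 3})
```

Deferring the call to result() until this point lets many tasks be submitted first and evaluated together later.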

cytotable.utils._write_parquet_table_with_metadata(table: Table, **kwargs) None[source]

Adds metadata to parquet output from CytoTable. Note: this mostly wraps pyarrow.parquet.write_table https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html

Parameters:
  • table – pa.Table: Pyarrow table to be serialized as parquet table.

  • **kwargs – Any: Keyword arguments provided to this function roughly align with those of pyarrow.parquet.write_table, for example where (str or pyarrow.NativeFile).

cytotable.utils.evaluate_futures(sources: Dict[str, List[Dict[str, Any]]] | List[Any] | str) Any[source]

Evaluates any Parsl futures for use within other tasks. This enables a pattern of Parsl app usage as “tasks” and delayed future result evaluation for concurrency.

Parameters:

sources – Union[Dict[str, List[Dict[str, Any]]], List[Any], str] Sources are an internal data structure used by CytoTable for processing and organizing data results. They may include futures which require asynchronous processing through Parsl, so we process them through this function.

Returns:

Union[Dict[str, List[Dict[str, Any]]], str]

A data structure which includes evaluated futures where they were found.


Presets

cytotable.presets.config

Configuration presets for CytoTable


Exceptions

Provide hierarchy of exceptions for CytoTable

exception cytotable.exceptions.CytoTableException[source]

Bases: Exception

Root exception for custom hierarchy of exceptions with CytoTable.

exception cytotable.exceptions.DatatypeException[source]

Bases: CytoTableException

Exception for datatype challenges.

exception cytotable.exceptions.NoInputDataException[source]

Bases: CytoTableException

Exception for no input data.

exception cytotable.exceptions.SchemaException[source]

Bases: CytoTableException

Exception for schema challenges.
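
Because every exception above derives from CytoTableException, callers can handle one specific failure and fall back to the root class for the rest. The sketch below redefines the classes locally so that it runs without CytoTable installed; in practice they would be imported from cytotable.exceptions. The run_safely helper and its return strings are hypothetical.

```python
from typing import Any, Callable


# Local mirror of the documented hierarchy, for illustration only.
class CytoTableException(Exception):
    """Root exception."""


class DatatypeException(CytoTableException):
    """Datatype challenges."""


class NoInputDataException(CytoTableException):
    """No input data."""


class SchemaException(CytoTableException):
    """Schema challenges."""


def run_safely(task: Callable[[], Any]) -> Any:
    """Run a task, handling the most specific exception first."""
    try:
        return task()
    except NoInputDataException:
        # a specific subclass must be caught before its parent class
        return "skipped: no input data"
    except CytoTableException as error:
        return f"failed: {type(error).__name__}"


def raise_schema_problem() -> None:
    raise SchemaException("columns do not match")


outcome = run_safely(raise_schema_problem)
```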