Technical Architecture¶
Documentation covering technical architecture for CytoTable.
Workflows¶
CytoTable uses Parsl to execute collections of tasks as python_apps.
In Parsl, work may be isolated using Python functions decorated with the @python_app decorator. join_apps are collections of one or more other apps and are decorated using the @join_app decorator.
See the following documentation for more information on how apps may be used within Parsl: Parsl: Apps
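For example, a minimal sketch (not CytoTable's actual tasks) of how python_apps compose within a join_app:

import parsl
from parsl import python_app, join_app

parsl.load()  # load a default local Parsl configuration

@python_app
def double(x):
    # runs as an isolated Parsl task
    return x * 2

@python_app
def add(a, b):
    return a + b

@join_app
def double_and_add(x, y):
    # a join_app launches other apps and returns their future
    return add(double(x), double(y))

print(double_and_add(2, 3).result())  # 10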
Workflow Execution¶
Procedures within CytoTable are executed using Parsl Executors, which may be configured through a Parsl Configuration.
Parsl configurations may be passed through the parsl_config parameter of convert(): convert(..., parsl_config=parsl.Config).
By default, CytoTable assumes local task execution with LocalProvider. For greater scalability, CytoTable may be used with a HighThroughputExecutor (see Parsl's scalability documentation for more information).
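For example, a sketch (with hypothetical source and destination paths) of passing a custom Parsl configuration to convert():

import cytotable
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

cytotable.convert(
    source_path="s3://bucket-name/single-cells.sqlite",  # hypothetical path
    dest_path="test.parquet",
    dest_datatype="parquet",
    # use a HighThroughputExecutor rather than the default local setup
    parsl_config=Config(executors=[HighThroughputExecutor()]),
)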
Data Technologies¶
Data Paths¶
Data source paths handled by CytoTable may be local or cloud-based. Local data paths are handled using Python's pathlib module. Cloud-based data paths are managed by cloudpathlib. Reference the following page for how cloudpathlib client arguments may be used: Overview: Data Source Locations
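For example, a minimal sketch (with a hypothetical bucket name) of how cloudpathlib's AnyPath resolves local and cloud-based paths:

from cloudpathlib import AnyPath

# local strings resolve to pathlib.Path objects
local_path = AnyPath("./single-cells.sqlite")

# cloud-based strings resolve to cloudpathlib classes (here, an S3Path)
cloud_path = AnyPath("s3://bucket-name/single-cells.sqlite")

print(type(local_path), type(cloud_path))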
Data Paths - Cloud-based SQLite¶
SQLite data stored in cloud-based paths is downloaded locally using cloudpathlib's caching capabilities in order to perform SQL queries. Using data in this way may require an additional parameter for the cloud storage provider to set the cache directory explicitly, avoiding storage limitations (for example, some temporary directories are constrained to system memory).
For example:
import cytotable

# convert CellProfiler SQLite to Parquet
cytotable.convert(
    source_path="s3://bucket-name/single-cells.sqlite",
    dest_path="test.parquet",
    dest_datatype="parquet",
    # set the local cache dir to `./tmpdata`;
    # this will get passed to cloudpathlib's client
    local_cache_dir="./tmpdata",
)
In-process Data Format¶
In addition to using Python native data types, CytoTable manages data internally using PyArrow (Apache Arrow) Tables. Using Arrow-compatible formats is intended to assist cross-platform utility, encourage high performance, and enable advanced data integration with non-Python tools.
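For example, a minimal sketch (with hypothetical column names) of the kind of Arrow table used internally:

import pyarrow as pa

# build an Arrow table from Python-native data types
table = pa.Table.from_pydict(
    {
        "Image_Metadata_Well": ["A01", "A02"],
        "Cells_AreaShape_Area": [220.5, 198.3],
    }
)

print(table.schema)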
Arrow Memory Allocator Selection¶
PyArrow may use malloc, jemalloc, or mimalloc depending on the operating system and allocator availability.
This memory allocator selection may also be overridden by developers using CytoTable to help with performance aspects of their environments. PyArrow inherits environment configuration from the Arrow C++ implementation (see the note on this page). Use the ARROW_DEFAULT_MEMORY_POOL environment variable to statically define which memory allocator will be used when running CytoTable.
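For example, a sketch (assuming jemalloc is available on the platform) of selecting the allocator before PyArrow is imported:

import os

# must be set before pyarrow is imported for the selection to take effect;
# accepted values include "system", "jemalloc", and "mimalloc"
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "jemalloc"

import pyarrow as pa

# confirm which allocator backs the default memory pool
print(pa.default_memory_pool().backend_name)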
Arrow Memory Mapping Selection¶
PyArrow includes functionality which enables memory-mapped Parquet file reads for performance benefits (see the memory_map parameter). This functionality is enabled by default in CytoTable. You may disable it by setting the environment variable CYTOTABLE_ARROW_USE_MEMORY_MAPPING to 0 (for example: export CYTOTABLE_ARROW_USE_MEMORY_MAPPING=0).
SQL-based Data Management¶
We use the DuckDB Python API client in some areas to interface with SQL (for example, SQLite databases) and other data formats. DuckDB SQL statements organize joined datasets or tables as Arrow-format results.
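For example, a minimal sketch (with hypothetical database and table names) of querying a SQLite database through DuckDB and returning an Arrow table:

import duckdb

ddb = duckdb.connect()

# the sqlite_scanner extension enables reading SQLite databases
ddb.execute("INSTALL sqlite_scanner;")
ddb.execute("LOAD sqlite_scanner;")

arrow_table = ddb.execute(
    "SELECT * FROM sqlite_scan('single-cells.sqlite', 'Per_Image')"
).arrow()
ddb.close()

print(arrow_table.schema)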