Technical Architecture¶
Documentation covering technical architecture for CytoTable.
Workflows¶
CytoTable uses Parsl to execute collections of tasks as python_apps.
In Parsl, work may be isolated using Python functions decorated with the @python_app decorator.
join_apps are collections of one or more other apps and are decorated using the @join_app decorator.
See the following documentation for more information on how apps may be used within Parsl: Parsl: Apps
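For example, a minimal sketch of these app types (the functions below are illustrative, not CytoTable's own tasks):

import parsl
from parsl.config import Config
from parsl import join_app, python_app

@python_app
def double(x: int):
    # a python_app isolates work as an independently executed task
    return x * 2

@python_app
def add(x: int, y: int):
    return x + y

@join_app
def double_then_add(x: int, y: int):
    # a join_app composes other apps; the future it returns
    # provides this app's eventual result
    return add(double(x), double(y))

parsl.load(Config())  # load a default Parsl configuration

# calling an app returns a future; .result() blocks until complete
print(double_then_add(1, 2).result())  # 6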
Workflow Execution¶
Workflow tasks within CytoTable are executed using Parsl Executors.
Parsl Executors for CytoTable may be configured through a Parsl Config.
For example, you may use the following: convert(..., parsl_config=parsl.Config()) (see convert()).
CytoTable is implemented by default with Parsl's HighThroughputExecutor, a multiprocess executor (please see Parsl's scalability documentation for more information).
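For example, a minimal sketch which passes an explicit configuration mirroring this default to convert() (the file names here are illustrative):

import cytotable
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# pass a custom Parsl configuration to convert()
cytotable.convert(
    source_path="single-cells.sqlite",
    dest_path="single-cells.parquet",
    dest_datatype="parquet",
    parsl_config=Config(executors=[HighThroughputExecutor()]),
)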
Please note: use of Parsl's ThreadPoolExecutor may result in unfreed memory on certain systems because of Apache Arrow's memory allocators.
Unfreed memory can eventually result in a lack of available memory during single-process use.
When using the ThreadPoolExecutor, we suggest using a Linux system, leveraging the malloc/system memory allocator with Arrow (e.g. export ARROW_DEFAULT_MEMORY_POOL="system"), and/or forking subprocesses for best results when it comes to freeing memory in these use cases.
This note does not apply to the HighThroughputExecutor.
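As a minimal sketch of the suggestion above (the configuration shown is illustrative):

import os

# select Arrow's system (malloc-based) allocator before pyarrow
# is first imported, per the note above
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"

from parsl.config import Config
from parsl.executors import ThreadPoolExecutor

# this config may then be passed to convert(..., parsl_config=thread_config)
thread_config = Config(executors=[ThreadPoolExecutor()])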
Data Technologies¶
Data Paths¶
Data source paths handled by CytoTable may be local or cloud-based. Local data paths are handled using Python's pathlib module. Cloud-based data paths are managed by cloudpathlib. Reference the following page for how cloudpathlib client arguments may be used: Overview: Data Source Locations
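As a brief sketch of the distinction (the bucket and file names are illustrative):

from pathlib import Path

from cloudpathlib import CloudPath

# local paths are handled through pathlib
local_source = Path("./single-cells.sqlite")

# cloud-based paths are handled through cloudpathlib
cloud_source = CloudPath("s3://bucket-name/single-cells.sqlite")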
Data Paths - Cloud-based SQLite¶
SQLite data stored in cloud-based paths are downloaded locally using cloudpathlib’s caching capabilities to perform SQL queries. Using data in this way may require the use of an additional parameter for the cloud storage provider to set the cache directory explicitly to avoid storage limitations (some temporary directories are constrained to system memory, etc).
For example:
import cytotable

# Convert CellProfiler SQLite to parquet
cytotable.convert(
    source_path="s3://bucket-name/single-cells.sqlite",
    dest_path="test.parquet",
    dest_datatype="parquet",
    # set the local cache dir to `./tmpdata`
    # this will get passed to cloudpathlib's client
    local_cache_dir="./tmpdata",
)
In-process Data Format¶
In addition to using Python native data types, we also accomplish internal data management for CytoTable using PyArrow (Apache Arrow) Tables. Using Arrow-compatible formats is intended to assist cross-platform utility, encourage high performance, and enable advanced data integration with non-Python tools.
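For instance, a minimal sketch of an Arrow Table similar in spirit to those CytoTable manages internally (the column names are illustrative):

import pyarrow as pa

# construct an in-memory Arrow Table from Python-native data
table = pa.Table.from_pydict(
    {
        "Image_Metadata_Well": ["A1", "A1", "B2"],
        "Cells_AreaShape_Area": [120.5, 98.2, 143.7],
    }
)

print(table.schema)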
Arrow Memory Allocator Selection¶
PyArrow may use malloc, jemalloc, or mimalloc depending on the operating system and allocator availability.
This memory allocator selection may also be overridden by a developer implementing CytoTable to help with performance aspects related to user environments.
PyArrow inherits environment configuration from the Arrow C++ implementation (see the note on this page).
Use the ARROW_DEFAULT_MEMORY_POOL environment variable to statically define which memory allocator will be used when implementing CytoTable.
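A small sketch for inspecting which allocator is in use within a given environment:

import pyarrow as pa

# report the allocator backing Arrow's default memory pool,
# e.g. "jemalloc", "mimalloc", or "system"
print(pa.default_memory_pool().backend_name)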
Arrow Memory Mapping Selection¶
PyArrow includes functionality which enables memory-mapped parquet file reads for performance benefits (see the memory_map parameter).
This functionality is enabled by default in CytoTable.
You may disable this functionality by setting the environment variable CYTOTABLE_ARROW_USE_MEMORY_MAPPING to 0 (for example: export CYTOTABLE_ARROW_USE_MEMORY_MAPPING=0).
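The same may be accomplished from within Python, for example (a sketch, assuming the variable is set before a conversion runs):

import os

# disable memory-mapped parquet reads within CytoTable
os.environ["CYTOTABLE_ARROW_USE_MEMORY_MAPPING"] = "0"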
SQL-based Data Management¶
We use the DuckDB Python API client in some areas to interface with SQL (for example, SQLite databases) and other data formats. We use DuckDB SQL statements to organize joined datasets or tables as Arrow format results.
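As a minimal sketch of this pattern (the table and column names are illustrative):

import duckdb

# join two small inline datasets with DuckDB SQL and retrieve
# the result as a PyArrow Table
arrow_result = duckdb.execute(
    """
    WITH images AS (SELECT 1 AS image_id, 'A1' AS well),
         cells AS (SELECT 1 AS image_id, 120.5 AS area)
    SELECT images.well, cells.area
    FROM images
    JOIN cells USING (image_id)
    """
).arrow()

print(arrow_result)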