Tutorial: CellProfiler SQLite or CSV to Parquet¶
A start-to-finish walkthrough for image analysts who want a working Parquet export from CellProfiler outputs (SQLite or CSV), including public S3 and local data.
What you will accomplish¶
Convert CellProfiler outputs to Parquet with a preset that matches common table/column layouts.
Handle both SQLite (typical Cell Painting Gallery exports) and CSV folder outputs.
Keep a persistent local cache so downloads are reused and you avoid “file vanished” errors on temporary disks.
Verify the outputs quickly (file names and row counts) without needing to understand the internals.
If your data looks like this, change…
Local SQLite instead of S3: set source_path to the local .sqlite file; remove no_sign_request; keep local_cache_dir.
CellProfiler CSV folders: point source_path to the folder that contains Cells.csv, Cytoplasm.csv, etc.; set source_datatype="csv" and preset="cellprofiler_csv".
Only certain compartments: add targets=["cells", "nuclei"] (case-insensitive).
Memory constrained: lower chunk_size (e.g., 10000) and ensure CACHE_DIR has space; see the sketch after this list.
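For instance, a minimal sketch of the CSV-folder variant combining these adjustments (the paths match the example used later in this tutorial, and targets is the option named above; adjust both to your data):
import cytotable

result = cytotable.convert(
    source_path="./tests/data/cellprofiler/ExampleHuman",
    source_datatype="csv",
    dest_path="./outputs/examplehuman.parquet",
    dest_datatype="parquet",
    preset="cellprofiler_csv",
    local_cache_dir="./csv_cache",
    # Keep only these compartments (case-insensitive).
    targets=["cells", "nuclei"],
    # Smaller chunks for memory-constrained machines.
    chunk_size=10000,
)
print(result)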
Setup (copy-paste)¶
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install cytotable
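To confirm the install, print the installed version (this uses the standard-library importlib.metadata, so it works even if the package does not expose __version__):
python -c "import importlib.metadata as m; print(m.version('cytotable'))"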
Inputs and outputs¶
SQLite example (public S3): s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite (no credentials required; no_sign_request=True).
CSV example (local folder): ./tests/data/cellprofiler/ExampleHuman, which contains Cells.csv, Cytoplasm.csv, Nuclei.csv, etc.
Outputs: Parquet files for each compartment (Image, Cells, Cytoplasm, Nuclei) in ./outputs/....
Before you start¶
Install CytoTable: pip install cytotable.
Make sure you have enough local disk space (~1–2 GB) for the cached SQLite file and the Parquet outputs.
If you prefer to download the file first, you can also aws s3 cp the same path locally, then set source_path to the local file and drop no_sign_request; see the example below.
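For example (this assumes the AWS CLI is installed; --no-sign-request mirrors the unsigned access used elsewhere in this tutorial):
aws s3 cp --no-sign-request \
  "s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite" \
  ./BR00126114.sqlite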
Step 1: choose your input type¶
Pick one of the two setups below.
SQLite from public S3 (Cell Painting Gallery)
export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite"
export SOURCE_DATATYPE="sqlite"
export PRESET="cellprofiler_sqlite_cpg0016_jump"
export DEST_PATH="./outputs/br00126114.parquet"
export CACHE_DIR="./sqlite_s3_cache"
mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"
CellProfiler CSV folder (local or mounted storage)
export SOURCE_PATH="./tests/data/cellprofiler/ExampleHuman"
export SOURCE_DATATYPE="csv"
export PRESET="cellprofiler_csv"
export DEST_PATH="./outputs/examplehuman.parquet"
export CACHE_DIR="./csv_cache"
mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"
Step 2: run the conversion (minimal Python)¶
import os
import cytotable
# If you used the bash exports above:
SOURCE_PATH = os.environ["SOURCE_PATH"]
SOURCE_DATATYPE = os.environ["SOURCE_DATATYPE"]
DEST_PATH = os.environ["DEST_PATH"]
PRESET = os.environ["PRESET"]
CACHE_DIR = os.environ["CACHE_DIR"]
# (Alternatively, set them directly as strings in Python.)
result = cytotable.convert(
    source_path=SOURCE_PATH,
    source_datatype=SOURCE_DATATYPE,
    dest_path=DEST_PATH,
    dest_datatype="parquet",
    preset=PRESET,
    local_cache_dir=CACHE_DIR,
    # For public S3 sources (SQLite or CSV), add:
    no_sign_request=True,
    # Reasonable chunking for large tables; adjust up or down if you hit memory limits.
    chunk_size=30000,
)
print(result)
Why these flags matter (in plain language):
local_cache_dir: keeps downloaded data somewhere predictable.
preset: selects the right table names and page keys for this dataset (SQLite or CSV).
chunk_size: processes data in pieces so you don’t need excessive RAM.
no_sign_request: needed because the sample bucket is public and unsigned.
Step 3: check that the outputs look right¶
You should see Parquet files in the destination directory.
If you set join=True (handy for the SQLite example), you get a single .parquet file containing all compartments.
If you set join=False (handy for CSV folders), you get separate Parquet files for each compartment; a sketch of passing the flag explicitly follows below.
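A sketch of setting the flag explicitly, reusing the variables from the Step 2 script (join is the parameter named above):
result = cytotable.convert(
    source_path=SOURCE_PATH,
    source_datatype=SOURCE_DATATYPE,
    dest_path=DEST_PATH,
    dest_datatype="parquet",
    preset=PRESET,
    local_cache_dir=CACHE_DIR,
    no_sign_request=True,
    chunk_size=30000,
    # True: one joined Parquet file; False: one Parquet file per compartment.
    join=True,
)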
ls "$DEST_PATH"
# SQLite example: br00126114.parquet
# CSV example: examplehuman.parquet
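To verify row counts without loading the data, a sketch using PyArrow’s Parquet metadata (this assumes a single joined output file at DEST_PATH; adapt the path if you wrote per-compartment files):
import os

import pyarrow.parquet as pq

dest_path = os.environ["DEST_PATH"]
# Reads only the Parquet footer, so this is fast even for large files.
print(dest_path, "->", pq.ParquetFile(dest_path).metadata.num_rows, "rows")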
What success looks like¶
A stable local cache of the SQLite file or CSV downloads remains in CACHE_DIR (useful for repeated runs).
Parquet outputs exist in DEST_PATH and can be read by DuckDB/Pandas/PyArrow; see the read-back sketch below.
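For example, a quick read-back with DuckDB (pip install duckdb; the path assumes the SQLite example’s DEST_PATH, so adjust as needed):
import duckdb

# Peek at the first few rows straight from the Parquet file.
print(duckdb.sql("SELECT * FROM read_parquet('./outputs/br00126114.parquet') LIMIT 5"))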