Tutorial: CellProfiler SQLite or CSV to Parquet¶

A start-to-finish walkthrough for image analysts who want a working Parquet export from CellProfiler outputs (SQLite or CSV), including public S3 and local data.

What you will accomplish¶

Convert CellProfiler outputs to Parquet with a preset that matches common table/column layouts.
Handle both SQLite (typical Cell Painting Gallery exports) and CSV folder outputs.
Keep a persistent local cache so downloads are reused and avoid “file vanished” errors on temp disks.
Verify the outputs quickly (file names and row counts) without needing to understand the internals.

If your data looks like this, change…

Local SQLite instead of S3: set source_path to the local .sqlite file; remove no_sign_request; keep local_cache_dir.
CellProfiler CSV folders: point source_path to the folder that contains Cells.csv, Cytoplasm.csv, etc.; set source_datatype="csv" and preset="cellprofiler_csv".
Only certain compartments: add targets=["cells", "nuclei"] (case-insensitive).
Memory constrained: lower chunk_size (e.g., 10000) and ensure CACHE_DIR has space.

Setup (copy-paste)¶

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install cytotable

Inputs and outputs¶

SQLite example (public S3): s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite No credentials are required (no_sign_request=True).
CSV example (local folder): ./tests/data/cellprofiler/ExampleHuman which contains Cells.csv, Cytoplasm.csv, Nuclei.csv, etc.
Outputs: Parquet files for each compartment (Image, Cells, Cytoplasm, Nuclei) in ./outputs/....

Before you start¶

Install Cytotable: pip install cytotable
Make sure you have enough local disk space (~1–2 GB) for the cached SQLite and Parquet outputs.
If you prefer to download the file first, you can also aws s3 cp the same path locally, then set source_path to the local file and drop no_sign_request.

Step 1: choose your input type¶

Pick one of the two setups below.

SQLite from public S3 (Cell Painting Gallery)

export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite"
export SOURCE_DATATYPE="sqlite"
export PRESET="cellprofiler_sqlite_cpg0016_jump"
export DEST_PATH="./outputs/br00126114.parquet"
export CACHE_DIR="./sqlite_s3_cache"
mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"

CellProfiler CSV folder (local or mounted storage)

export SOURCE_PATH="./tests/data/cellprofiler/ExampleHuman"
export SOURCE_DATATYPE="csv"
export PRESET="cellprofiler_csv"
export DEST_PATH="./outputs/examplehuman.parquet"
export CACHE_DIR="./csv_cache"
mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"

Step 2: run the conversion (minimal Python)¶

import os
import cytotable

# If you used the bash exports above:
SOURCE_PATH = os.environ["SOURCE_PATH"]
SOURCE_DATATYPE = os.environ["SOURCE_DATATYPE"]
DEST_PATH = os.environ["DEST_PATH"]
PRESET = os.environ["PRESET"]
CACHE_DIR = os.environ["CACHE_DIR"]

# (Alternatively, set them directly as strings in Python.)

result = cytotable.convert(
    source_path=SOURCE_PATH,
    source_datatype=SOURCE_DATATYPE,
    dest_path=DEST_PATH,
    dest_datatype="parquet",
    preset=PRESET,
    local_cache_dir=CACHE_DIR,
    # For public S3 (SQLite or CSV) add:
    no_sign_request=True,
    # Reasonable chunking for large tables; adjust up/down if you hit memory limits
    chunk_size=30000,
)

print(result)

Why these flags matter (in plain language):

local_cache_dir: keeps downloaded data somewhere predictable.
preset: selects the right table names and page keys for this dataset (SQLite or CSV).
chunk_size: processes data in pieces so you don’t need excessive RAM.
no_sign_request: needed because the sample bucket is public and unsigned.

Step 3: check that the outputs look right¶

You should see Parquet files in the destination directory. If you set join=True (handy for the SQLite example), you get a single . parquet file containing all compartments. If you set join=False (handy for CSV folders), you get separate Parquet files for each compartment.

ls "$DEST_PATH"
# SQLite example: br00126114.parquet
# CSV example: examplehuman.parquet

What success looks like¶

A stable local cache of the SQLite file or CSV downloads remains in CACHE_DIR (useful for repeated runs).
Parquet outputs exist in DEST_PATH and can be read by DuckDB/Pandas/PyArrow.