# Tutorial: CellProfiler SQLite or CSV to Parquet

A start-to-finish walkthrough for image analysts who want a working Parquet export from CellProfiler outputs (SQLite or CSV), covering both public S3 and local data.

## What you will accomplish

- Convert CellProfiler outputs to Parquet with a preset that matches common table/column layouts.
- Handle both SQLite (typical Cell Painting Gallery exports) and CSV folder outputs.
- Keep a persistent local cache so downloads are reused, avoiding “file vanished” errors on temp disks.
- Verify the outputs quickly (file names and row counts) without needing to understand the internals.

```{admonition} If your data looks like this, change...
- Local SQLite instead of S3: set `source_path` to the local `.sqlite` file; remove `no_sign_request`; keep `local_cache_dir`.
- CellProfiler CSV folders: point `source_path` to the folder that contains `Cells.csv`, `Cytoplasm.csv`, etc.; set `source_datatype="csv"` and `preset="cellprofiler_csv"`.
- Only certain compartments: add `targets=["cells", "nuclei"]` (case-insensitive).
- Memory constrained: lower `chunk_size` (e.g., 10000) and ensure `CACHE_DIR` has space.
```

## Setup (copy-paste)

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install cytotable
```

## Inputs and outputs

- **SQLite example (public S3):** `s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite`. No credentials are required (`no_sign_request=True`).
- **CSV example (local folder):** `./tests/data/cellprofiler/ExampleHuman`, which contains `Cells.csv`, `Cytoplasm.csv`, `Nuclei.csv`, etc.
- **Outputs:** Parquet files for each compartment (Image, Cells, Cytoplasm, Nuclei) in `./outputs/...`.

## Before you start

- Install CytoTable: `pip install cytotable`
- Make sure you have enough local disk space (~1–2 GB) for the cached SQLite file and the Parquet outputs.
- If you prefer to download the file first, you can also `aws s3 cp` the same path locally, then set `source_path` to the local file and drop `no_sign_request`.

## Step 1: choose your input type

Pick one of the two setups below.

**SQLite from public S3 (Cell Painting Gallery)**

```bash
export SOURCE_PATH="s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite"
export SOURCE_DATATYPE="sqlite"
export PRESET="cellprofiler_sqlite_cpg0016_jump"
export DEST_PATH="./outputs/br00126114.parquet"
export CACHE_DIR="./sqlite_s3_cache"
mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"
```

**CellProfiler CSV folder (local or mounted storage)**

```bash
export SOURCE_PATH="./tests/data/cellprofiler/ExampleHuman"
export SOURCE_DATATYPE="csv"
export PRESET="cellprofiler_csv"
export DEST_PATH="./outputs/examplehuman.parquet"
export CACHE_DIR="./csv_cache"
mkdir -p "$(dirname "$DEST_PATH")" "$CACHE_DIR"
```

## Step 2: run the conversion (minimal Python)

```python
import os

import cytotable

# If you used the bash exports above:
SOURCE_PATH = os.environ["SOURCE_PATH"]
SOURCE_DATATYPE = os.environ["SOURCE_DATATYPE"]
DEST_PATH = os.environ["DEST_PATH"]
PRESET = os.environ["PRESET"]
CACHE_DIR = os.environ["CACHE_DIR"]
# (Alternatively, set them directly as strings in Python, as sketched below.)
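# A minimal direct-assignment alternative if you skipped the bash exports.
# These values mirror the SQLite example from Step 1; swap in the CSV values
# from Step 1 if that is your input. Uncomment and edit as needed:
# SOURCE_PATH = "s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite"
# SOURCE_DATATYPE = "sqlite"
# PRESET = "cellprofiler_sqlite_cpg0016_jump"
# DEST_PATH = "./outputs/br00126114.parquet"
# CACHE_DIR = "./sqlite_s3_cache"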

result = cytotable.convert(
    source_path=SOURCE_PATH,
    source_datatype=SOURCE_DATATYPE,
    dest_path=DEST_PATH,
    dest_datatype="parquet",
    preset=PRESET,
    local_cache_dir=CACHE_DIR,
    # Only needed for public S3 sources (SQLite or CSV); remove for local inputs
    no_sign_request=True,
    # Reasonable chunking for large tables; adjust up/down if you hit memory limits
    chunk_size=30000,
)
print(result)
```

Why these flags matter (in plain language):

- `local_cache_dir`: keeps downloaded data somewhere predictable.
- `preset`: selects the right table names and page keys for this dataset (SQLite or CSV).
- `chunk_size`: processes data in pieces so you don’t need excessive RAM.
- `no_sign_request`: required because the sample bucket is public and accepts unsigned (anonymous) requests.

## Step 3: check that the outputs look right

You should see Parquet output at the destination path. If you set `join=True` (handy for the SQLite example), you get a single `.parquet` file containing all compartments. If you set `join=False` (handy for CSV folders), you get separate Parquet files for each compartment.

```bash
ls "$DEST_PATH"
# SQLite example: br00126114.parquet
# CSV example: examplehuman.parquet
```

## What success looks like

- A stable local cache of the SQLite file or CSV downloads remains in `CACHE_DIR` (useful for repeated runs).
- Parquet outputs exist at `DEST_PATH` and can be read by DuckDB, Pandas, or PyArrow (a quick check follows below).
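To verify beyond file names, here is a minimal sketch using PyArrow, assuming `pyarrow` is installed separately (`pip install pyarrow`) and that you produced the joined single-file output from the SQLite example. It confirms the file is readable and reports row and column counts without loading the full table:

```python
import os

import pyarrow.parquet as pq

dest_path = os.environ["DEST_PATH"]  # e.g. ./outputs/br00126114.parquet

# Reading only the Parquet footer metadata is cheap, even for large files.
metadata = pq.read_metadata(dest_path)
print(f"{dest_path}: {metadata.num_rows} rows, {metadata.num_columns} columns")

# Peek at the first few column names to confirm compartment features landed.
print(pq.read_schema(dest_path).names[:10])
```

If you used `join=False`, run the same check against each per-compartment Parquet file instead.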