Tutorial: NPZ embeddings + metadata to Parquet

A start-to-finish walkthrough for turning NPZ files (for example, DeepProfiler outputs) plus metadata into Parquet. This uses a small example bundled in the repo.

What you will accomplish

Read NPZ feature files and matching metadata from disk.
Combine them into Parquet with a preset that aligns common keys.
Validate the output shape and schema.

If your data looks like this, change…

NPZ in a different folder: point source_path there; keep preset="deepprofiler".
Memory constrained: add chunk_size=10000 to the convert call.
.npy files or plain CSV feature tables: this tutorial/preset does not cover them; use the CellProfiler CSV/SQLite flows instead.

Setup (copy-paste)

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install cytotable
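
Before moving on, it can help to confirm that the package landed in the active virtual environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of CytoTable.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("cytotable")
if version:
    print(f"cytotable version: {version}")
else:
    print("cytotable is not installed in this environment")
```

If the second message appears, the most likely cause is that the virtual environment was not activated before running pip.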

Inputs and outputs

Input: Example NPZ + metadata in this repo: tests/data/deepprofiler/pycytominer_example
Output: A Parquet file under ./outputs/deepprofiler_example.parquet

Step 1: define your paths

export SOURCE_PATH="tests/data/deepprofiler/pycytominer_example"
export DEST_PATH="./outputs/deepprofiler_example.parquet"
mkdir -p "$(dirname "$DEST_PATH")"
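
A quick preflight check can catch a wrong source path before the conversion starts. This is a minimal sketch, assuming the feature files end in `.npz` and may sit in nested subfolders; `find_npz_files` is a hypothetical helper, and the folder layout in the demonstration is invented for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def find_npz_files(source_path: str) -> List[Path]:
    """Return every .npz file below source_path, sorted for stable output."""
    root = Path(source_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Source folder not found: {root}")
    return sorted(root.rglob("*.npz"))


# Demonstration on a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "plate_a" / "well_b"  # hypothetical layout
    nested.mkdir(parents=True)
    (nested / "site_1.npz").touch()
    demo_names = [p.name for p in find_npz_files(tmp)]

print(demo_names)
```

In practice you would call `find_npz_files(os.environ["SOURCE_PATH"])` and stop early if the returned list is empty.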

Step 2: run the conversion

import os
import cytotable

source_path = os.environ["SOURCE_PATH"]
dest_path = os.environ["DEST_PATH"]

result = cytotable.convert(
    source_path=source_path,
    source_datatype="npz",
    dest_path=dest_path,
    dest_datatype="parquet",
    preset="deepprofiler",
    concat=True,
    join=False,
)

print(result)

Notes (why these flags matter):

preset="deepprofiler" aligns NPZ feature arrays with metadata columns.
concat=True merges multiple NPZ shards.
join=False writes per-table Parquet files (the preset produces all_files.npz as the logical table).
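
For the memory-constrained case mentioned earlier, one way to keep the call readable is to collect the arguments in a dictionary and add `chunk_size` only when needed. The sketch below reuses the exact arguments from Step 2; `build_convert_kwargs` is a hypothetical helper, and the block itself does not import or call CytoTable.

```python
from typing import Any, Dict, Optional


def build_convert_kwargs(
    source_path: str,
    dest_path: str,
    chunk_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Collect the keyword arguments used in Step 2 (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "source_path": source_path,
        "source_datatype": "npz",
        "dest_path": dest_path,
        "dest_datatype": "parquet",
        "preset": "deepprofiler",
        "concat": True,
        "join": False,
    }
    # Pass chunk_size only when requested so the library default applies otherwise.
    if chunk_size is not None:
        kwargs["chunk_size"] = chunk_size
    return kwargs


kwargs = build_convert_kwargs(
    "tests/data/deepprofiler/pycytominer_example",
    "./outputs/deepprofiler_example.parquet",
    chunk_size=10000,
)
print(kwargs)
```

The resulting dictionary can then be passed along as `cytotable.convert(**kwargs)`.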

Step 3: validate the output

You should see deepprofiler_example.parquet in DEST_PATH. Opening it with pandas or PyArrow should show a non-zero row count and both feature columns (efficientnet_*) and metadata columns.

What success looks like

A Parquet file deepprofiler_example.parquet exists in DEST_PATH.
DuckDB/Pandas can read the file; row count is non-zero.
Feature columns (for example, efficientnet_*) and metadata columns (plate/well/site) both appear.
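
The checks above can be scripted. This is a minimal sketch, assuming the path you pass points at a single Parquet file (if your run produced a folder of per-table files, point it at one file inside) and that metadata column names contain "plate", "well", or "site" in some letter case. The helper names and the example column names are hypothetical; `check_parquet` needs PyArrow installed, while `classify_columns` does not.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def classify_columns(
    columns: Sequence[str],
    feature_prefix: str = "efficientnet_",
    metadata_hints: Sequence[str] = ("plate", "well", "site"),
) -> Dict[str, List[str]]:
    """Split column names into feature-like and metadata-like groups."""
    features = [c for c in columns if c.startswith(feature_prefix)]
    # Assumption: metadata columns contain one of the hint words,
    # in any letter case (for example "Metadata_Plate").
    metadata = [
        c
        for c in columns
        if not c.startswith(feature_prefix)
        and any(hint in c.lower() for hint in metadata_hints)
    ]
    return {"features": features, "metadata": metadata}


def check_parquet(path: str) -> Dict[str, object]:
    """Report file existence, row count, and column groups for one file."""
    if not Path(path).is_file():
        return {"exists": False, "rows": 0, "features": [], "metadata": []}
    # Imported lazily so classify_columns works without pyarrow installed.
    import pyarrow.parquet as pq

    groups = classify_columns(pq.read_schema(path).names)
    return {"exists": True, "rows": pq.read_metadata(path).num_rows, **groups}


# Hypothetical column names, used only to show the classification.
example = classify_columns(
    ["Metadata_Plate", "Metadata_Well", "Metadata_Site", "efficientnet_0", "efficientnet_1"]
)
print(example)
```

A run counts as successful when `check_parquet` reports `exists` as true, `rows` greater than zero, and non-empty `features` and `metadata` lists.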
