Tutorial: NPZ embeddings + metadata to Parquet
A start-to-finish walkthrough for turning NPZ files (for example, DeepProfiler outputs) plus metadata into Parquet. This uses a small example bundled in the repo.
What you will accomplish
Read NPZ feature files and matching metadata from disk.
Combine them into Parquet with a preset that aligns common keys.
Validate the output shape and schema.
If your data looks like this, change…
NPZ in a different folder: point source_path there; keep preset="deepprofiler".
Memory constrained: add chunk_size=10000 to the convert call.
.npy files or plain CSV feature tables: this tutorial/preset does not cover them; use the CellProfiler CSV/SQLite flows instead.
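The adjustments above can be collected in one place. The following is a minimal sketch, not part of CytoTable: build_convert_kwargs and run_conversion are hypothetical helper names, and the folder path in the usage lines is a placeholder. The keyword arguments themselves are the ones this tutorial passes to cytotable.convert, plus the optional chunk_size mentioned above.

```python
from typing import Any, Dict, Optional


def build_convert_kwargs(
    source_path: str,
    dest_path: str,
    chunk_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for cytotable.convert (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "source_path": source_path,
        "source_datatype": "npz",
        "dest_path": dest_path,
        "dest_datatype": "parquet",
        "preset": "deepprofiler",
        "concat": True,
        "join": False,
    }
    # Only pass chunk_size when requested, so the library default applies otherwise.
    if chunk_size is not None:
        kwargs["chunk_size"] = chunk_size
    return kwargs


def run_conversion(kwargs: Dict[str, Any]) -> Any:
    """Call cytotable.convert; imported lazily so the helper above works without it."""
    import cytotable

    return cytotable.convert(**kwargs)


if __name__ == "__main__":
    # "/data/my_npz_folder" is a placeholder, not a path from the example data.
    print(
        build_convert_kwargs(
            "/data/my_npz_folder",
            "./outputs/deepprofiler_example.parquet",
            chunk_size=10000,
        )
    )
```

Keeping the arguments in a dictionary makes it easy to print and review them before starting a long conversion; pass the result to run_conversion once the paths look right.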
Setup (copy-paste)
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install cytotable
Inputs and outputs
Input: Example NPZ + metadata in this repo: tests/data/deepprofiler/pycytominer_example
Output: A Parquet file under ./outputs/deepprofiler_example.parquet
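Before converting, it can help to confirm which NPZ files the input folder holds and which arrays each one stores. The sketch below relies only on the fact that an NPZ file is a zip archive with one .npy member per array, so it needs nothing beyond the standard library. The function names list_npz_arrays and summarize_npz_folder are illustrative and not part of CytoTable, and the array names you see depend entirely on how the files were written.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def list_npz_arrays(npz_path: Path) -> List[str]:
    """Return the array names stored in one NPZ file without loading the data."""
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is stored as a "<name>.npy" member of the zip archive.
        return [
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        ]


def summarize_npz_folder(source_path: str) -> Dict[str, List[str]]:
    """Map every NPZ file found under source_path to its array names."""
    root = Path(source_path)
    return {
        str(path.relative_to(root)): list_npz_arrays(path)
        for path in sorted(root.rglob("*.npz"))
    }


if __name__ == "__main__":
    import os

    # Falls back to the example folder named in this tutorial.
    folder = os.environ.get(
        "SOURCE_PATH", "tests/data/deepprofiler/pycytominer_example"
    )
    for file_name, arrays in summarize_npz_folder(folder).items():
        print(file_name, arrays)
```

If NumPy is installed, numpy.load(path).files reports the same array names for a single file.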
Step 1: define your paths
export SOURCE_PATH="tests/data/deepprofiler/pycytominer_example"
export DEST_PATH="./outputs/deepprofiler_example.parquet"
mkdir -p "$DEST_PATH"
Step 2: run the conversion
import os
import cytotable
source_path = os.environ["SOURCE_PATH"]
dest_path = os.environ["DEST_PATH"]
result = cytotable.convert(
    source_path=source_path,
    source_datatype="npz",
    dest_path=dest_path,
    dest_datatype="parquet",
    preset="deepprofiler",
    concat=True,
    join=False,
)
print(result)
Notes (why these flags matter):
preset="deepprofiler" aligns NPZ feature arrays with metadata columns.
concat=True merges multiple NPZ shards.
join=False writes per-table Parquet files (the preset produces all_files.npz as the logical table).
Step 3: validate the output
You should see deepprofiler_example.parquet in DEST_PATH.
Opening it with Pandas or PyArrow should show non-zero rows and both feature (efficientnet_*) and metadata columns.
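The checks described above can be scripted. This is a minimal sketch under a few assumptions: PyArrow is installed, the path passed in points at a single Parquet file, and feature columns share the efficientnet_ prefix mentioned in this tutorial. The helper names split_columns and validate_parquet are hypothetical; adjust the prefix if your features are named differently.

```python
from typing import Dict, Iterable, List


def split_columns(
    columns: Iterable[str], feature_prefix: str = "efficientnet_"
) -> Dict[str, List[str]]:
    """Separate feature columns (matched by prefix) from all other columns."""
    result: Dict[str, List[str]] = {"features": [], "other": []}
    for name in columns:
        key = "features" if name.startswith(feature_prefix) else "other"
        result[key].append(name)
    return result


def validate_parquet(path: str, feature_prefix: str = "efficientnet_") -> Dict[str, int]:
    """Check for non-zero rows and for both feature and non-feature columns."""
    # Imported lazily so split_columns can be used without PyArrow installed.
    import pyarrow.parquet as pq

    # Both calls read only the file footer, not the data itself.
    n_rows = pq.read_metadata(path).num_rows
    groups = split_columns(pq.read_schema(path).names, feature_prefix)

    if n_rows == 0:
        raise ValueError(f"{path} contains no rows")
    if not groups["features"]:
        raise ValueError(f"{path} has no columns starting with {feature_prefix!r}")
    if not groups["other"]:
        raise ValueError(f"{path} has no metadata (non-feature) columns")

    return {
        "rows": n_rows,
        "feature_columns": len(groups["features"]),
        "other_columns": len(groups["other"]),
    }
```

Call validate_parquet with the path to the Parquet file you want to check; it raises ValueError with a short reason when a check fails and returns simple counts otherwise.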
What success looks like
A Parquet file deepprofiler_example.parquet exists in DEST_PATH.
DuckDB/Pandas can read the file; row count is non-zero.
Feature columns (for example, efficientnet_*) and metadata columns (plate/well/site) both appear.