# Tutorial: Linking CellProfiler morphology features with single-cell image crops The Apache ecosystem tool, Iceberg, builds a warehouse data structure to manage connections between single-cell features and OME-arrow image crops. In this tutorial, you will see a start-to-finish walkthrough of using CytoTable to link CellProfiler measurements and cropped microscopy images. ## What you will accomplish - Convert CellProfiler outputs (e.g., SQLite file) to an Iceberg warehouse instead of a single Parquet file (which is default CytoTable behavior). - Using Iceberg, create a materialized `profiles.joined_profiles` table that connects single-cell profiles to cropped images. - Optionally build a separate `images.image_crops` Iceberg table containing OME-Arrow image crops. - Optionally build a separate `images.source_images` Iceberg table containing full OME-Arrow source images. - Save a `profiles.profile_with_images` warehouse "view" that displays joined profiles with image crops. - Overlay mask or outline images into this "view". ```{admonition} When to use this tutorial - Use this tutorial when you want to bundle single-cell morphology features with single-cell image crops instead of a single Parquet file. - Skip this tutorial if you only need the standard joined measurement table; use the Parquet tutorial instead. ``` ## Setup *Note:* This tutorial installs `cytotable[iceberg-images]` rather than `cytotable[iceberg]` because the image tables and views require both `pyiceberg` and `ome-arrow`. If you only need the Iceberg warehouse for profile tables and do not plan to export `images.image_crops` or `images.source_images`, then `cytotable[iceberg]` is sufficient. Image crop export requires Python 3.11 or newer because the optional `ome-arrow` dependency is only available on Python 3.11+. ```bash python -m venv .venv source .venv/bin/activate pip install --upgrade pip pip install "cytotable[iceberg-images]" ``` ## Inputs and outputs - **Measurement input:** a CellProfiler SQLite file or CSV folder - **Image input:** a local directory or cloud object-storage prefix containing source TIFF files - **Optional segmentation input:** a local directory or cloud object-storage prefix containing mask and/or outline TIFF files - **Output:** a new local Iceberg warehouse directory containing `profiles.joined_profiles` and, when requested, image tables and views such as `images.image_crops`, `images.source_images`, and `profiles.profile_with_images` (see below for more details on Iceberg warehouse outputs) ## Basic warehouse export This creates a materialized `profiles.joined_profiles` table in Iceberg. ```python from cytotable import convert warehouse_path = convert( source_path="./tests/data/cellprofiler/ExampleHuman", source_datatype="csv", dest_path="./example_warehouse", dest_backend="iceberg", dest_datatype="parquet", preset="cellprofiler_csv", ) print(warehouse_path) ``` *Note:* `dest_backend="iceberg"` means the final output is an Iceberg warehouse. CytoTable still requires `dest_datatype="parquet"` because CytoTable stages joined data as parquet internally before writing the warehouse tables. ## Add image crops with OME-Arrow When a user specifies `image_dir`, CytoTable uses temporary parquet staging and chunked single-cell joins to append cropped image payloads into a separate `images.image_crops` table inside the warehouse. `image_dir`, `mask_dir`, and `outline_dir` may reference local paths or cloud object-storage paths, following the same `s3://...`, `gs://...`, or `az://...` style supported for measurement inputs. If your cloud provider needs extra configuration, pass the relevant `cloudpathlib` client arguments through `convert(..., **kwargs)`. If `image_dir` is not provided, CytoTable writes only the profile-side Iceberg output, which is usually the right choice when you only need measurement data or want a lighter-weight warehouse. Provide `image_dir` when you want the warehouse to connect single-cell profiles to cropped images. ```python from cytotable import convert warehouse_path = convert( source_path="./tests/data/cellprofiler/ExampleHuman", source_datatype="csv", dest_path="./example_warehouse_with_images", dest_backend="iceberg", dest_datatype="parquet", preset="cellprofiler_csv", image_dir="./images", mask_dir="./masks", outline_dir="./outlines", ) print(warehouse_path) ``` Important behavior: - image export requires `dest_backend="iceberg"` - image export requires `join=True` (the default) - CytoTable writes cropped images to a separate `images.image_crops` table - full source images may also be written to `images.source_images` with `include_source_images=True` - CytoTable deterministically generates `Metadata_ObjectID` values in `images.image_crops` for object-level references, rather than assigning them randomly - CytoTable also deterministically generates `Metadata_ImageCropID` values unique to each crop row - when both `outline_dir` and `mask_dir` produce a matching overlay for the same source image, CytoTable stores both and uses the outline for the generated `ome_arrow_label` overlay field If you also want the original images stored in the warehouse: ```python convert( source_path="./tests/data/cellprofiler/ExampleHuman", source_datatype="csv", dest_path="./example_warehouse_with_source_images", dest_backend="iceberg", preset="cellprofiler_csv", image_dir="./images", include_source_images=True, ) ``` The same pattern also works with cloud image paths, for example `image_dir="s3://example-bucket/images"` plus any needed authentication or client configuration arguments passed through `convert(..., **kwargs)`. ## Bounding boxes CytoTable uses bounding box columns from the joined measurement rows to dynamically crop each image. In the materialized `joined_profiles` table, the resolved bbox columns are recoded as `Metadata_SourceBBoxXMin`, `Metadata_SourceBBoxXMax`, `Metadata_SourceBBoxYMin`, and `Metadata_SourceBBoxYMax`. CytoTable searches for bounding box columns in the following order and only moves to the next option if the earlier one does not provide all four required columns: 1. user-defined explicit setting of `bbox_column_map` in `CytoTable.convert()` 1. CellProfiler-style `AreaShape_BoundingBox...` column names 1. substring fallback using `Minimum_X`, `Maximum_X`, `Minimum_Y`, `Maximum_Y` *Note:* The substring fallback is a broad last-resort match. It is useful when your data do not follow the standard CellProfiler bbox naming conventions, but if multiple unrelated columns contain those substrings, you should prefer an explicit `bbox_column_map`. If you need to override the automatic choice: ```python convert( source_path="./tests/data/cellprofiler/ExampleHuman", source_datatype="csv", dest_path="./example_warehouse_bbox_override", dest_backend="iceberg", preset="cellprofiler_csv", image_dir="./images", bbox_column_map={ "x_min": "Cells_AreaShape_BoundingBoxMinimum_X", "x_max": "Cells_AreaShape_BoundingBoxMaximum_X", "y_min": "Cells_AreaShape_BoundingBoxMinimum_Y", "y_max": "Cells_AreaShape_BoundingBoxMaximum_Y", }, ) ``` ## Segmentation matching By default, CytoTable matches the CellProfiler columns that store mask and outline information by basename or stem. If your segmentation files follow a different naming convention, use `segmentation_file_regex` to map segmentation filename patterns to source image filename patterns. ```python convert( source_path="./tests/data/cellprofiler/ExampleHuman", source_datatype="csv", dest_path="./example_warehouse_regex_match", dest_backend="iceberg", preset="cellprofiler_csv", image_dir="./images", outline_dir="./outlines", segmentation_file_regex={ r".*_outline\\.tiff$": r"(plateA_well_B03_site_1)\\.tiff$", }, ) ``` The mapping uses: - key: regex for segmentation filenames - value: regex for the source image filename ## Reading the warehouse CytoTable exposes helper functions for local Iceberg warehouses so you can list available tables and views, inspect the warehouse contents, and read tables back into Python: ```python from cytotable import describe_iceberg_warehouse, list_tables, read_table print(list_tables("./example_warehouse_with_images")) print(describe_iceberg_warehouse("./example_warehouse_with_images")) profiles = read_table("./example_warehouse_with_images", "joined_profiles") image_crops = read_table("./example_warehouse_with_images", "image_crops") profile_with_images = read_table( "./example_warehouse_with_images", "profile_with_images" ) ``` If you want to see the rendered outputs from this workflow, including example table listings and readback results, see the notebook example `examples/cytotable_with_profiles_and_images`. Unqualified reads still work for unique table or view names. The current layout is: - `profiles.joined_profiles` - `images.image_crops` - `images.source_images` - `profiles.profile_with_images` The same helpers also work for a single Parquet file: ```python from cytotable import list_tables, read_table print(list_tables("./ExampleHuman.parquet")) profiles = read_table("./ExampleHuman.parquet") ``` ## What success looks like *Note:* In this tutorial, a "materialized table" means a table whose rows are stored directly in the warehouse. A "view" stores the logic for producing a result and re-runs that logic each time you read it, rather than storing its own rows directly. - the warehouse directory exists and contains Iceberg metadata/data files - `profiles.joined_profiles` appears as a materialized table, meaning a stored table with data rather than just a saved view definition - `images.image_crops` appears as a table only when the user specifies the `image_dir` argument and crop rows are actually written; this is where CytoTable stores single-cell image crops as separate rows in the warehouse - `images.source_images` appears as a table only when `include_source_images=True` - `profiles.profile_with_images` appears as a saved view only when `images.image_crops` exists and contains rows - `images.image_crops` rows include a deterministic `Metadata_ObjectID` derived from measurement keys and crop bounds, rather than a random ID - `images.image_crops` rows include a deterministic `Metadata_ImageCropID` derived from measurement keys, crop bounds, and the source image reference - `images.image_crops` rows include `ome_arrow_image` data and optional `ome_arrow_label` data stored as OME-Arrow objects - `images.source_images` rows include a deterministic `Metadata_ImageID` derived from image-level keys and the source image reference