4. Run each CellProfiler step#
The steps below are outlined specifically for running Distributed-CellProfiler on AWS. If you are not doing so, the steps will still be more-or-less the same:
- Make sure CellProfiler pipelines are accessible
- Make sure CellProfiler knows where the input images are, either via a CSV or a batch file
- Run each CellProfiler pipeline, in sequence, with appropriate input folder, output folder, and grouping
4.1. Upload your pipelines to S3#
In your project's workspace directory, create a batch-specific folder and upload your pipelines there. If there are previous batches from the same project or a similar one, you may find it easiest to copy the files directly. Once uploaded and/or copied, the file structure should look like the following:
└── pipelines
    └── 2016_04_01_a549_48hr_batch1
        ├── illum_without_batchfile.cppipe
        └── analysis_without_batchfile.cppipe
4.2. Configure Distributed-CellProfiler's run_batch_general script#
Note
`run_batch_general` is not required; the Distributed-CellProfiler handbook lays out a number of different ways of creating jobs. However, we find it the most efficient way to run numerous pipelines on the same data. If you do not wish to use it, you can adjust steps 3 and 4 in the "Run each CellProfiler step" list below to "Create a job file" and "Execute `python3 run.py submitJob jobFileName.json`".
`run_batch_general.py` can be configured once at the beginning of a batch of data, and then can be run for each step simply by uncommenting the name of the step to run. The following variables in the `project specific stuff` section of the script should be configured (an illustrative sketch appears at the end of this section):
- `topdirname` and `batchsuffix` should match your `PROJECT_NAME` and `BATCH_ID`, respectively
- `appname` is typically the same as `topdirname`, but if that name is long and cumbersome you can create an abbreviated version here (ie `2015_10_05_DrugRepurposing` rather than `2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad`). This will be used in your `config.py` file
- `rows`, `columns`, and `sites` should reflect the imaging conditions used
- `platelist` should contain a list of plates, comma separated, ie `['SQ00015167','SQ00015168']`
If you are using pipeline files with the LoadData module and CSVs, make sure the pipeline names in the script reflect your actual pipeline names (and adjust them if not). Otherwise, make sure the batch file names in the script reflect your batch file names.
If you are following the recommended structures and procedures, nothing in the `not project specific` section of the script should need to be adapted, but if you are making changes you may need to adjust it.
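As a rough illustration, a configured `project specific stuff` section might look something like the sketch below; the exact variable format depends on your copy of `run_batch_general.py`, and all values shown are placeholders drawn from the examples above.

```python
# Hypothetical "project specific stuff" section of run_batch_general.py.
# Variable names come from the list above; values are placeholders only.
topdirname = '2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad'  # your PROJECT_NAME
batchsuffix = '2016_04_01_a549_48hr_batch1'                                  # your BATCH_ID
appname = '2015_10_05_DrugRepurposing'   # abbreviated name; also used for APP_NAME in config.py
rows = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']   # plate rows used in your imaging conditions
columns = list(range(1, 13))                      # plate columns used
sites = list(range(1, 10))                        # imaging sites per well
platelist = ['SQ00015167', 'SQ00015168']          # plates in this batch, comma separated
```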
4.3. Configure Distributed-CellProfiler’s fleet file#
If running in a fresh clone of Distributed-CellProfiler, you will need to configure a single fleet file, which will be used in all subsequent steps. Refer to the manual for instructions.
4.4. Change required parameters in Distributed-CellProfiler’s config file#
If running in a fresh clone of Distributed-CellProfiler, you will need to set the AWS_REGION
, SSH_KEY_NAME
, AWS_BUCKET
, and SQS_DEAD_LETTER_QUEUE
settings to appropriate settings for your account. Refer to the manual for instructions.
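For reference, these account-specific settings in `config.py` might look like the following sketch, with placeholder values you would replace with your own details:

```python
# Placeholder values for the required account-specific settings in config.py.
# Replace each with the correct value for your own AWS account.
AWS_REGION = 'us-east-1'                          # region in which your resources live
SSH_KEY_NAME = 'your-ssh-key-name'                # SSH key used to access your spot fleet instances
AWS_BUCKET = 'your-project-bucket'                # S3 bucket holding your workspace, images, and output
SQS_DEAD_LETTER_QUEUE = 'your-dead-letter-queue'  # SQS queue that collects repeatedly failing jobs
```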
4.5. Run each CellProfiler step#
You may have as many as 5 or as few as 2 CellProfiler steps:
- (optional) Z projection
- (optional) QC - see also section 4.6
- illumination correction
- (optional) assay development - see also section 4.6
- analysis
For each CellProfiler step, the procedure you will run is identical:
1. Configure the `config.py` file
2. Execute `python3 run.py setup`
3. Uncomment the correct step name in your `run_batch_general.py` file (and ensure all other steps are commented out)
4. Execute `python3 run_batch_general.py`
5. Execute `python3 run.py startCluster files/yourFleetFileName.json`, where you have set the name of the fleet file previously created or located
6. Execute `python3 run.py monitor files/APP_NAMESpotFleetRequestId.json`, where APP_NAME matches the `APP_NAME` variable set in step 1.
Information on all of these steps is available in the Distributed-CellProfiler wiki.
You only absolutely need to change the variables stated above and below for Distributed-CellProfiler to function, but other variables may be useful, such as using a non-default profile, pre-downloading files, using plugins, or restarting only part of a batch of data.
In general, as long as you are running inside a tmux session and it isn't killed, the monitor should destroy any and all infrastructure created on AWS as part of running Distributed-CellProfiler, but it is the user's responsibility to check that this has completed appropriately; failure to do so may lead to spot fleets generating charges after all useful work has completed.
4.5.1. (Optional) Z projection#
- Your `APP_NAME` variable should be set to the `appname` set in `run_batch_general.py` plus `_Zproj`, ie `2015_10_05_DrugRepurposing_Zproj`
- Your number of `CLUSTER_MACHINES` should be medium-large, ie a hundred or a few hundred.
- Your `SQS_MESSAGE_VISIBILITY` should be short, such as `5*60` (5 minutes); see the sketch below
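Put together, the relevant `config.py` values for this step might look like the following sketch (values are illustrative; the optional QC and assay dev steps below follow the same pattern, with `_QC` and `_AssayDev` suffixes instead):

```python
# Illustrative config.py values for the Z-projection step,
# assuming appname in run_batch_general.py was set to 2015_10_05_DrugRepurposing.
APP_NAME = '2015_10_05_DrugRepurposing_Zproj'
CLUSTER_MACHINES = 200           # medium-large fleet: a hundred or a few hundred machines
SQS_MESSAGE_VISIBILITY = 5*60    # short visibility timeout: 5 minutes
```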
4.5.2. (Optional) QC#
- Your `APP_NAME` variable should be set to the `appname` set in `run_batch_general.py` plus `_QC`, ie `2015_10_05_DrugRepurposing_QC`
- Your number of `CLUSTER_MACHINES` should be medium-large, ie a hundred or a few hundred.
- Your `SQS_MESSAGE_VISIBILITY` should be short, such as `5*60` (5 minutes)
4.5.3. Illumination Correction#
- Your `APP_NAME` variable should be set to the `appname` set in `run_batch_general.py` plus `_Illum`, ie `2015_10_05_DrugRepurposing_Illum`
- Your number of `CLUSTER_MACHINES` should be set to the number of plates you have divided by 4 and then rounded up, ie 6 for 22 plates
- Your `SQS_MESSAGE_VISIBILITY` should be 12 hours (`720*60`); see the sketch below
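For example, a 22-plate batch might use values along these lines (illustrative only):

```python
# Illustrative config.py values for illumination correction.
# CLUSTER_MACHINES follows the "plates divided by 4, rounded up" rule: ceil(22 / 4) = 6.
APP_NAME = '2015_10_05_DrugRepurposing_Illum'
CLUSTER_MACHINES = 6             # for a 22-plate batch
SQS_MESSAGE_VISIBILITY = 720*60  # 12 hours
```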
4.5.4. (Optional) Assay Dev#
- Your `APP_NAME` variable should be set to the `appname` set in `run_batch_general.py` plus `_AssayDev`, ie `2015_10_05_DrugRepurposing_AssayDev`
- Your number of `CLUSTER_MACHINES` should be medium-large, ie a hundred or a few hundred.
- Your `SQS_MESSAGE_VISIBILITY` should be short, such as `5*60` (5 minutes)
4.5.5. Analysis#
- Your `APP_NAME` variable should be set to the `appname` set in `run_batch_general.py` plus `_Analysis`, ie `2015_10_05_DrugRepurposing_Analysis`
- Your number of `CLUSTER_MACHINES` should be as many as possible per your account limits, ideally at least a few hundred.
- Your `SQS_MESSAGE_VISIBILITY` should be 10-20 minutes for images with a binning of 2, and longer (30-120 minutes) for unbinned images and/or CellProfiler 2 or 3 runs. This value is the most variable; once you've run a single analysis workflow, you can adjust it based on your log files (see the sketch below).
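A sketch of the corresponding `config.py` values, again with illustrative numbers that you should tune to your own data:

```python
# Illustrative config.py values for the analysis step.
APP_NAME = '2015_10_05_DrugRepurposing_Analysis'
CLUSTER_MACHINES = 400            # as many as your account limits allow, ideally at least a few hundred
SQS_MESSAGE_VISIBILITY = 15*60    # 10-20 minutes for binning-of-2 images; 30-120 minutes if unbinned
```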
4.6. (Optional) Do any post-CellProfiler steps#
The optional QC and assay development steps have post-CellProfiler components. These pipelines need only be run if the user plans to do these post-CellProfiler steps.
QC may require visual inspection of images, creation of a machine learning classifier to detect poor quality images, and/or running scripts to evaluate CV. How best to evaluate quality is left to the user.
The assay development step creates images to be visually evaluated, either individually or after stitching for easier review. If your segmentation is not ideal, you may need to update your assay dev and analysis pipelines: manually tune the segmentation steps on a local set of representative data until they perform better on your images, then update the segmentation in the assay dev and analysis pipelines on your cluster accordingly.
After running the final analysis pipelines, proceed to the next step of this guide.