This repository contains all pipelines, scripts, and reference documents used for the machine learning-assisted image analysis of images from the high-throughput screening in the publication below. This includes files used for batch processing on high-performance computers. Additionally, the repository contains scripts for analyzing classification parameter importance, and files from statistical analyses in R. Scripts and templates for image preprocessing and analysis with Coli-Inspector and MicrobeJ (for analyzing DNA profile widths and creating kymograph heat maps) are available in the Zenodo repository of our previous paper RecN and RecA orchestrate an ordered DNA supercompaction response following ciprofloxacin-induced DNA damage in Escherichia coli.
Vikedal K, Berges N, Riisnæs IMM, Ræder SB, Bjørnholt JV, Bjørås M, Skarstad K, Helgesen E, and Booth JA
Exploring the genetic landscape of ciprofloxacin-induced DNA supercompaction in Escherichia coli
bioRxiv (2025). doi: 10.1101/2025.07.xx.xxxxxx
Images and data from high-throughput imaging are available in the BioImage Archive under accession number S-BIAD2152 at https://doi.org/10.6019/S-BIAD2152.
Please see the Material and methods section and the Supplementary Material of the paper for details on screening procedure and image analysis.
- CellProfiler pipelines
- HPC scripts
- CellProfiler output handling scripts
- Analyzing importance of classification parameters
- R-based statistical analyses
- License Details
- Author
- Acknowledgements
The main CellProfiler pipeline used for analysis of images from the screening, as well as pipelines used for training of the Single-Cell and Phenotype models, are located in the CellProfiler_Pipelines folder. The main CellProfiler pipeline is set up for batch processing of the screening images on a high-performance computer (cluster computer).
Batch processing workflow:
- Place raw images in an
input-imagesfolder within the plate directory. - Run the pipeline (CellProfiler_Main_KV_ScreeningPaper.cppipe) locally to determine which images to include in the analysis and generate a
Batch_data.h5batch file. - Run CellProfiler in batch mode on the high-performance computer using the generated batch file.
- Per-cell measurement tables (
.csv) with data from the CellProfiler analysis of every well will be written to anoutput-datafolder in the same location as theinput-imagesfolder.
There are three separate files outputed for each well:
*_ImageMetadata.csv: contain metadata for full images.*_ObjectMeasurements.csv: include measurements of cell and DNA features for all segmented objects prior to filtration with the Single-Cell model.*_InterestingObjectMeasurements.csv: contain measurements of cell and DNA features as well as phenotype scoring (Phenotype model) for cells remaining after filtration with the Single-Cell model.
The sorting-rules folder contains the rules of the:
- Single-Cell model (InterestingCellFilter_MainAnalysis.txt) used to exclude irrelevant objects
- Phenotype model (CompactionCategoryFilter_MainAnalysis.txt) used to classify DNA compaction phenotypes
This folder should always be included within the input-images directory to ensure filtering is handled correctly in the CellProfiler pipelines.
The CPA_properties folder contains .properties files used for training the Single-Cell model (InterestingCellFilter) and Phenotype model (CompactionCategoryFilter) in CellProfiler Analyst. The classifier_ignore_columns setting lists parameters excluded from training to avoid confounders (e.g. image metadata).
The batch processing capabilities of CellProfiler were employed on high-performance computers (HPCs) to automate the large scale analysis of images from the screening. After creating a batch file in CellProfiler locally, all files needed for image analysis were uploaded to the HPC (plate directory containing input-images (with sorting-rules) and output-data (with Batch_data.h5) folders). SLURM scripts were used to manage the large-scale batch processing:
- Environment setup: Ensure that Anaconda or Miniconda is installed on the HPC, and that CellProfiler is available. Use environment.yml to create a Conda environment with CellProfiler and necessary dependecies. Note: this environment worked for CellProfiler v4.1.3 - newer versions may require other dependencies.
- Job submission: The HPC_job-script.slurm script was used to run the analysis. Note: edit all parameters annotated with
#CHANGEbefore reusing the script. - Output sorting: Analysis will produce a separate output file (
.csv) for every well from the analyzed plate. To sort the output files into folders based on their plate row letters, we used the script sorting-script.sh.
After CellProfiler generates per-cell measurement tables for each well, a series of custom Python scripts (and functions) were used to process and summarize these data:
- Concatenate data for all wells within a given plate row:
- Function:
combineRowResults(<...>)from (CombinePlateResults.py). - Output: all three
.csvfiles outputed from CellProfiler for wells within a plate row are concatenated into three files, prefixed by<plate date>_<row>_RowRes_.
- Function:
- Compile and summarize all per-cell measurements for every plate to get per-well results:
- Main script: CombinePlateResults.py
- Supporting scripts:
- Output: All concatenated row-level
.csvfiles are analyzed by the DNACompactionAnalysis.py script to produce row-level*_RowAnalyzed.xlsxfiles with summarized results for each well and descriptive statistics. These row-level files are then combined into plate-level results-files titled<plate date>_FullPlateAnalyzed.xlsx. - Requires that the PlateDateNumberIndexList.xlsx and WellGeneIndexList.xlsx files are up-to-date in the working folder.
- Create Hit list which summarize results from all replicates and calculate enrichment scores for DNA supercompaction phenoytpes in all strains:
- Main script: CombineEntireHitList.py
- Supporting scripts:
- Output: Generates the
MainHitList_AllPlatesCombined.xlsxfile, containing enrichment scores and hit status for every strain across plates and replicates, by combining and analyzing results from a folder of all*_FullPlateAnalyzed.xlsxfiles. - Requires that the PlateDateNumberIndexList.xlsx and WellGeneIndexList.xlsx files are up-to-date in the working folder.
Note: The scripts assume the index lists (PlateDateNumberIndexList.xlsx and WellGeneIndexList.xlsx) are complete. If using custom plate layouts, update the index files accordingly.
The AnalyzeClassificationRules.py script was used to calculate the importance of parameters and features for:
- Single-Cell model (InterestingCellFilter_MainAnalysis.txt)
- Phenotype model (CompactionCategoryFilter_MainAnalysis.txt).
All scripts and data used to generate the mixed-effects model results for time- and dose-dependent survival data after CIP exposure and UV irradiation are available in the R_StatisticalAnalyses folder. The rendered .html reports document the full statistical analyses (including our rationale for using only main effects), and the .Rmd files together with the accompanying .txt data files allow full replication of the results.
This repository is released under the MIT License. See LICENSE for details.
Enrichment scoring methods in the Python scripts are derived from the source code of CellProfiler Analyst, which is licensed under the BSD 3-Clause License. This applies to parts of CombineEntireHitList.py and DNACompactionAnalysis.py. Additionally, the scripts polyafit.py, dirichletintegrate.py, and hypergeom.py were copied from the open-source code. The relevant license text is included in License_BSD3.
Krister Vikedal
Enrichment scoring methods in the Python scripts are derived from the source code of CellProfiler Analyst. CellProfiler and CellProfiler Analyst can be downloaded from cellprofiler.org and cellprofileranalyst.org, respectively.
We thank Jon Kristen Lærdahl (Oslo University Hospital) for assistance with initial CellProfiler pipeline development. Screening image analyses were performed on resources provided by Sigma2 – the National Infrastructure for High-Performance Computing and Data Storage in Norway (projects nn5014k and nn9383k), as well as resources from the high-performance computing infrastructure at the University of Oslo (project ec100). We also acknowledge the use of ChatGPT by OpenAI for code suggestions in some of the scripts included in this repository.