Exploring the genetic landscape of DNA supercompaction

This repository contains all pipelines, scripts, and reference documents used for the machine learning-assisted analysis of images from the high-throughput screening described in the publication below, including the files used for batch processing on high-performance computers. It also contains scripts for analyzing the importance of classification parameters and files from the statistical analyses in R. Scripts and templates for image preprocessing and analysis with Coli-Inspector and MicrobeJ (for analyzing DNA profile widths and creating kymograph heat maps) are available in the Zenodo repository of our previous paper, RecN and RecA orchestrate an ordered DNA supercompaction response following ciprofloxacin-induced DNA damage in Escherichia coli.

Publication

Vikedal K, Berges N, Riisnæs IMM, Ræder SB, Bjørnholt JV, Bjørås M, Skarstad K, Helgesen E, and Booth JA
Exploring the genetic landscape of ciprofloxacin-induced DNA supercompaction in Escherichia coli
bioRxiv (2025). doi: 10.1101/2025.07.xx.xxxxxx

Images and data from high-throughput imaging are available in the BioImage Archive under accession number S-BIAD2152 at https://doi.org/10.6019/S-BIAD2152.

Please see the Material and methods section and the Supplementary Material of the paper for details on the screening procedure and image analysis.

CellProfiler pipelines

The main CellProfiler pipeline used for analysis of images from the screening, as well as pipelines used for training of the Single-Cell and Phenotype models, are located in the CellProfiler_Pipelines folder. The main CellProfiler pipeline is set up for batch processing of the screening images on a high-performance computer (cluster computer).

Batch processing workflow:

1. Place raw images in an input-images folder within the plate directory.
2. Run the pipeline (CellProfiler_Main_KV_ScreeningPaper.cppipe) locally to determine which images to include in the analysis and generate a Batch_data.h5 batch file.
3. Run CellProfiler in batch mode on the high-performance computer using the generated batch file.
4. Per-cell measurement tables (.csv) with data from the CellProfiler analysis of every well will be written to an output-data folder in the same location as the input-images folder.
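
The batch-mode run can be split into chunks of image sets, with one headless CellProfiler call per chunk. The sketch below only builds the command lines and does not run them. It assumes the standard CellProfiler command-line options (-c, -r, -p, -f, -l). The function name, chunk size, and example path are illustrative and are not taken from HPC_job-script.slurm.

```python
from typing import List


def build_batch_commands(batch_file: str, n_image_sets: int, chunk_size: int) -> List[List[str]]:
    """Return one headless CellProfiler command per chunk of image sets.

    Image sets are numbered from 1, and the -f/-l bounds are inclusive.
    """
    if n_image_sets < 1 or chunk_size < 1:
        raise ValueError("n_image_sets and chunk_size must be positive")
    commands = []
    for first in range(1, n_image_sets + 1, chunk_size):
        last = min(first + chunk_size - 1, n_image_sets)
        commands.append([
            "cellprofiler",
            "-c",               # run without the GUI
            "-r",               # run the pipeline immediately
            "-p", batch_file,   # batch file generated locally
            "-f", str(first),   # first image set in this chunk
            "-l", str(last),    # last image set in this chunk
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical plate with 10 image sets, processed 4 at a time.
    for cmd in build_batch_commands("output-data/Batch_data.h5", 10, 4):
        print(" ".join(cmd))
```

Each command list can be passed to a job scheduler or to subprocess.run. How the chunks map onto SLURM tasks depends on the cluster.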

Three separate files are written for each well:

- *_ImageMetadata.csv: contains metadata for full images.
- *_ObjectMeasurements.csv: includes measurements of cell and DNA features for all segmented objects prior to filtration with the Single-Cell model.
- *_InterestingObjectMeasurements.csv: contains measurements of cell and DNA features as well as phenotype scoring (Phenotype model) for cells remaining after filtration with the Single-Cell model.
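
Before further processing, it can be useful to confirm that every well produced all three files. The following is a minimal sketch, assuming that everything before the file suffix identifies the well. The function names are hypothetical and not part of this repository.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# The three per-well output suffixes described above.
SUFFIXES = (
    "_ImageMetadata.csv",
    "_ObjectMeasurements.csv",
    "_InterestingObjectMeasurements.csv",
)


def group_outputs_by_well(output_dir: str) -> Dict[str, Dict[str, Path]]:
    """Map each file-name prefix (assumed to identify a well) to its output files."""
    wells: Dict[str, Dict[str, Path]] = defaultdict(dict)
    for path in sorted(Path(output_dir).glob("*.csv")):
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                prefix = path.name[: -len(suffix)]
                wells[prefix][suffix] = path
                break
    return dict(wells)


def incomplete_wells(wells: Dict[str, Dict[str, Path]]) -> List[str]:
    """Return prefixes for which fewer than three output files were found."""
    return sorted(p for p, files in wells.items() if len(files) < len(SUFFIXES))
```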

Sorting rules (Single-Cell and Phenotype model)

The sorting-rules folder contains the rules for the:

- Single-Cell model (InterestingCellFilter_MainAnalysis.txt), used to exclude irrelevant objects
- Phenotype model (CompactionCategoryFilter_MainAnalysis.txt), used to classify DNA compaction phenotypes

This folder should always be included within the input-images directory to ensure filtering is handled correctly in the CellProfiler pipelines.
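
A quick check before uploading a plate directory can catch a missing rules file early. This sketch assumes the layout described above (plate directory, then input-images, then sorting-rules). The function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Rule files named in this README.
RULE_FILES = (
    "InterestingCellFilter_MainAnalysis.txt",
    "CompactionCategoryFilter_MainAnalysis.txt",
)


def missing_rule_files(plate_dir: str) -> List[str]:
    """Return the rule files that are absent from input-images/sorting-rules."""
    rules_dir = Path(plate_dir) / "input-images" / "sorting-rules"
    return [name for name in RULE_FILES if not (rules_dir / name).is_file()]
```

An empty list means both rule files are in place.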

CellProfiler Analyst properties

The CPA_properties folder contains .properties files used for training the Single-Cell model (InterestingCellFilter) and Phenotype model (CompactionCategoryFilter) in CellProfiler Analyst. The classifier_ignore_columns setting lists parameters excluded from training to avoid confounders (e.g. image metadata).
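
To see which columns a given .properties file excludes, the setting can be read with a few lines of Python. This is a minimal sketch, assuming the usual "key = value" layout with comma-separated values and # comment lines. The column names in the example are placeholders, not the ones used in this repository.

```python
from typing import List


def read_ignore_columns(properties_text: str) -> List[str]:
    """Return the entries of classifier_ignore_columns, or [] if it is not set."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "classifier_ignore_columns":
            return [item.strip() for item in value.split(",") if item.strip()]
    return []


if __name__ == "__main__":
    # Placeholder content; real files contain many more settings.
    example = (
        "# Columns excluded from classifier training\n"
        "classifier_ignore_columns  =  ImageNumber, ObjectNumber, .*_Location_.*\n"
    )
    print(read_ignore_columns(example))
```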

HPC scripts

The batch processing capabilities of CellProfiler were employed on high-performance computers (HPCs) to automate the analysis of images from the screening. After creating a batch file in CellProfiler locally, all files needed for image analysis were uploaded to the HPC: the plate directory containing the input-images folder (with sorting-rules) and the output-data folder (with Batch_data.h5). SLURM scripts were used to manage the large-scale batch processing:

Environment setup: Ensure that Anaconda or Miniconda is installed on the HPC, and that CellProfiler is available. Use environment.yml to create a Conda environment with CellProfiler and the necessary dependencies. Note: this environment worked for CellProfiler v4.1.3; newer versions may require other dependencies.
Job submission: The HPC_job-script.slurm script was used to run the analysis. Note: edit all parameters annotated with #CHANGE before reusing the script.
Output sorting: Analysis will produce a separate output file (.csv) for every well from the analyzed plate. To sort the output files into folders based on their plate row letters, we used the script sorting-script.sh.
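
The idea behind the output sorting step can be illustrated in Python. This is not a port of sorting-script.sh. It is a sketch that assumes each file name contains a well ID, a row letter A to P followed by two digits and set off by non-alphanumeric characters. That naming pattern is an assumption, so adjust it to the actual file names.

```python
import re
import shutil
from pathlib import Path
from typing import Dict

# Assumed well ID: row letter A-P plus two digits, e.g. "_B03_".
WELL_PATTERN = re.compile(r"(?<![A-Za-z0-9])([A-P])(\d{2})(?![A-Za-z0-9])")


def sort_by_row(output_dir: str) -> Dict[str, str]:
    """Move each .csv file into a subfolder named after its plate row letter.

    Returns a mapping of file name to row letter for the files that were moved.
    Files without a recognizable well ID are left in place.
    """
    root = Path(output_dir)
    moved: Dict[str, str] = {}
    # sorted() materializes the listing before any file is moved.
    for path in sorted(root.glob("*.csv")):
        match = WELL_PATTERN.search(path.name)
        if match is None:
            continue
        row_dir = root / match.group(1)
        row_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(row_dir / path.name))
        moved[path.name] = match.group(1)
    return moved
```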

CellProfiler output handling scripts

After CellProfiler generates per-cell measurement tables for each well, a series of custom Python scripts (and functions) were used to process and summarize these data:

Concatenate data for all wells within a given plate row:
- Function: combineRowResults(<...>) from CombinePlateResults.py.
- Output: the three .csv files written by CellProfiler for each well within a plate row are concatenated into three row-level files, prefixed by <plate date>_<row>_RowRes_.
Compile and summarize all per-cell measurements for every plate to get per-well results:
- Main script: CombinePlateResults.py
- Supporting scripts:
- Output: All concatenated row-level .csv files are analyzed by the DNACompactionAnalysis.py script to produce row-level *_RowAnalyzed.xlsx files with summarized results for each well and descriptive statistics. These row-level files are then combined into plate-level results files named <plate date>_FullPlateAnalyzed.xlsx.
- Requires that the PlateDateNumberIndexList.xlsx and WellGeneIndexList.xlsx files are up-to-date in the working folder.
Create a hit list that summarizes results from all replicates and calculates enrichment scores for DNA supercompaction phenotypes in all strains:
- Main script: CombineEntireHitList.py
- Supporting scripts:
- Output: Generates the MainHitList_AllPlatesCombined.xlsx file, containing enrichment scores and hit status for every strain across plates and replicates, by combining and analyzing results from a folder of all *_FullPlateAnalyzed.xlsx files.
- Requires that the PlateDateNumberIndexList.xlsx and WellGeneIndexList.xlsx files are up-to-date in the working folder.

Note: The scripts assume the index lists (PlateDateNumberIndexList.xlsx and WellGeneIndexList.xlsx) are complete. If using custom plate layouts, update the index files accordingly.
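
The row-level concatenation step can be illustrated with the standard library alone. The sketch below is not the combineRowResults function from CombinePlateResults.py. It is a hypothetical helper that assumes all input tables share the same header row and that they fit the plain CSV format.

```python
import csv
from typing import Iterable


def concat_well_tables(inputs: Iterable[str], output: str) -> int:
    """Concatenate CSV files with identical headers; return the number of data rows."""
    header = None
    n_rows = 0
    with open(output, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for path in inputs:
            with open(path, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)  # write the header only once
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows
```

Raising on a header mismatch is a deliberate choice here: silently merging tables with different columns would corrupt the per-well summaries downstream.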

Analyzing importance of classification parameters

The AnalyzeClassificationRules.py script was used to calculate the importance of parameters and features for:

Single-Cell model (InterestingCellFilter_MainAnalysis.txt)
Phenotype model (CompactionCategoryFilter_MainAnalysis.txt).
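
As a rough illustration of what a rule-based importance analysis can look like, the sketch below parses rules of the form "IF (feature > threshold, [weights if true], [weights if false])", the text format exported by CellProfiler Analyst. It ranks features by how often they appear and by the summed absolute weights. This scoring is a simple proxy chosen for the example and is not necessarily the calculation performed by AnalyzeClassificationRules.py. The feature names are placeholders.

```python
import re
from collections import defaultdict
from typing import List, Tuple

RULE_PATTERN = re.compile(
    r"IF\s*\(\s*(\S+)\s*>\s*([^,]+),\s*\[([^\]]*)\],\s*\[([^\]]*)\]\s*\)"
)


def _weights(text: str) -> List[float]:
    """Parse a comma-separated list of class weights."""
    return [float(item) for item in text.split(",") if item.strip()]


def rank_features(rules_text: str) -> List[Tuple[str, int, float]]:
    """Return (feature, rule count, summed |weights|), highest score first."""
    counts = defaultdict(int)
    scores = defaultdict(float)
    for match in RULE_PATTERN.finditer(rules_text):
        feature = match.group(1)
        all_weights = _weights(match.group(3)) + _weights(match.group(4))
        counts[feature] += 1
        scores[feature] += sum(abs(w) for w in all_weights)
    return sorted(
        ((f, counts[f], scores[f]) for f in counts),
        key=lambda item: item[2],
        reverse=True,
    )


if __name__ == "__main__":
    # Placeholder rules; real rule files use the measured feature names.
    example = """
    IF (Cells_AreaShape_Area > 100.0, [0.5, -0.5], [-0.25, 0.25])
    IF (DNA_Intensity_Mean > 0.2, [0.1, -0.1], [-0.1, 0.1])
    IF (Cells_AreaShape_Area > 150.0, [0.25, -0.25], [-0.25, 0.25])
    """
    for feature, count, score in rank_features(example):
        print(feature, count, round(score, 3))
```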

R-based statistical analyses

All scripts and data used to generate the mixed-effects model results for time- and dose-dependent survival data after CIP exposure and UV irradiation are available in the R_StatisticalAnalyses folder. The rendered .html reports document the full statistical analyses (including our rationale for using only main effects), and the .Rmd files together with the accompanying .txt data files allow full replication of the results.

License Details

This repository is released under the MIT License. See LICENSE for details.

Enrichment scoring methods in the Python scripts are derived from the source code of CellProfiler Analyst, which is licensed under the BSD 3-Clause License. This applies to parts of CombineEntireHitList.py and DNACompactionAnalysis.py. Additionally, the scripts polyafit.py, dirichletintegrate.py, and hypergeom.py were copied from the open-source code. The relevant license text is included in License_BSD3.

Author

Krister Vikedal

Acknowledgements

Enrichment scoring methods in the Python scripts are derived from the source code of CellProfiler Analyst. CellProfiler and CellProfiler Analyst can be downloaded from cellprofiler.org and cellprofileranalyst.org, respectively.

We thank Jon Kristen Lærdahl (Oslo University Hospital) for assistance with initial CellProfiler pipeline development. Screening image analyses were performed on resources provided by Sigma2 – the National Infrastructure for High-Performance Computing and Data Storage in Norway (projects nn5014k and nn9383k), as well as resources from the high-performance computing infrastructure at the University of Oslo (project ec100). We also acknowledge the use of ChatGPT by OpenAI for code suggestions in some of the scripts included in this repository.
