Python code that algorithmically builds antimicrobial resistance catalogues of mutations.
This repo contains 2 approaches to build resistance catalogues:
- Definite defectives (solo-based approach)
- Interval regression
The first is used in https://doi.org/10.1101/2025.01.30.635633, and the second is a Python translation of the method used in https://doi.org/10.1038/s41467-023-44325-5, but is still under development.
This method relies on the logic that mutations that do not cause resistance can co-occur with those that do. If a mutation in isolation (solo) does not cause resistance, then it is not contributing to the phenotype when present in a mixture either.
Mutations that occur in isolation across specified genes are traversed in sequence, and if their proportion of drug-susceptibility (vs. resistance) passes the specified statistical test, they are characterized as benign (susceptible) and removed from the dataset. This step repeats iteratively until no more solo benign mutations are found.
Remaining mutations are classified based on their resistance rates and statistical test results. Those that don’t meet thresholds are labeled as U.
The classification approach supports:
- No test: assumes homogeneous susceptibility is sufficient for S
- Binomial test: against a specified background resistance rate
- Fisher's test: using a contingency background
- Seeding: You can pre-seed the catalogue with known neutral mutations.
- Overrides: You can override or supplement the final catalogue with manual or rule-based entries.
Because the method uses GARC1 grammar, rules like {rpoB@*_fs:R} can be supplied post-hoc. These rules can:
- Be additive (lower priority than specific entries)
- Replace matching mutations (requires
replace=Trueandwildcardssupplied)
The catalogue can be returned as a dictionary or a Piezo-compatible pandas.DataFrame.
Contingency tables, proportions, p-values, and Wilson confidence intervals are stored in the EVIDENCE field.
This method is under development and will be released soon with accompanying documentation.
We recommend using Conda for environment and dependency management.
conda env create -f env.yml
conda activate catomatic
pip install .You need two input DataFrames:
- Samples: one row per sample, with 'R' or 'S' phenotypes (
UNIQUEID,PHENOTYPE) - Mutations: one row per mutation per sample (
UNIQUEID,MUTATION)
If exporting to Piezo format:
- The
MUTATIONcolumn must follow GARC1 grammar (gene@mutation) - A path to a
wildcards.jsonfile (containing mutation rules) must be provided
from catomatic.BinaryCatalogue import BinaryBuilder
# Build catalogue
catalogue = BinaryBuilder(samples=samples_df, mutations=mutations_df).build()
# View dictionary version
cat_dict = catalogue.return_catalogue()
# Convert to Piezo-compatible format
catalogue_df = catalogue.build_piezo(
genbank_ref='...',
catalogue_name='...',
version='...',
drug='...',
wildcards='path/to/wildcards.json'
)
# Optionally export to CSV
catalogue.to_piezo(
genbank_ref='...',
catalogue_name='...',
version='...',
drug='...',
wildcards='path/to/wildcards.json',
outfile='path/to/output.csv'
)After installation, the simplest way to run the catomatic catalogue builder is via the command line interface using the binary subcommand. You must use either --to_piezo or --to_json to specify the output format. Additional metadata is required when using --to_piezo.
python -m catomatic binary \
--samples path/to/samples.csv \
--mutations path/to/mutations.csv \
--to_json \
--outfile path/to/output/catalogue.jsonpython -m catomatic binary \
--samples path/to/samples.csv \
--mutations path/to/mutations.csv \
--to_piezo \
--outfile path/to/output/catalogue.csv \
--genbank_ref '...' \
--catalogue_name '...' \
--version '...' \
--drug '...' \
--wildcards path/to/wildcards.json| Parameter | Type | Description |
|---|---|---|
--samples |
str |
Path to the samples (phenotypes) file. Required. |
--mutations |
str |
Path to the mutations file. Required. |
--outfile |
str |
Output file path for saving the catalogue. Required with --to_json or --to_piezo. |
--to_json |
flag |
Export the resulting catalogue in JSON format. Optional. |
--to_piezo |
flag |
Export the resulting catalogue in Piezo-compatible CSV format. Optional. |
--genbank_ref |
str |
GenBank reference string for Piezo export. Required with --to_piezo. |
--catalogue_name |
str |
Name of the catalogue. Required with --to_piezo. |
--version |
str |
Catalogue version. Required with --to_piezo. |
--drug |
str |
Name of the drug. Required with --to_piezo. |
--wildcards |
str |
Path to JSON file containing wildcard mutation definitions. Required with --to_piezo. |
--test |
str |
Type of statistical test to apply. One of: None, Binomial, or Fisher. Optional. |
--background |
float |
Background mutation rate (0–1). Required if --test Binomial is used. |
--p |
float |
P-value threshold for statistical significance. Optional. Defaults to 0.95. |
--tails |
str |
Tail type for statistical test. One of: one, two. Optional. Defaults to two. |
--strict_unlock |
flag |
If set, disables classification of susceptible (S) mutations unless statistically confident. |
- When using post-hoc rule updates via .update(), you must provide wildcards and set replace=True if you intend to override existing entries.
- For Piezo export, placeholder entries are inserted automatically if needed to satisfy parser requirements (R, S, and U must be represented).
- The EVIDENCE column includes contingency tables, proportions, confidence intervals, and p-values, and may optionally include sample IDs if
record_ids=True.
If you use catomatic in your research, please cite:
