amRdata is the first package in the amR suite for antimicrobial resistance (AMR) prediction. It takes a user‑provided species or taxon ID, downloads the corresponding genomes and AST data from BV‑BRC, constructs pangenomes, extracts features at multiple molecular scales, and prepares a unified Parquet‑backed DuckDB file for downstream ML modeling in amRml.
The workflow consists of six primary processes:
- BV‑BRC metadata (isolate metadata + AMR phenotypic labels) →
- BV-BRC genomes (sequence data) →
- Panaroo pangenome (genes, structural variants) →
- CD‑HIT protein clusters (proteins) →
- Pfam domain extraction (domains) →
- Database formatting
amRdata includes functions to:
- Query and download bacterial genome data from BV-BRC
- Acquire paired antimicrobial susceptibility testing (AST) results
- Extract molecular features across scales:
- Gene clusters (Panaroo pangenome analysis)
- Protein clusters (CD-HIT sequence similarity)
- Protein domains (Pfam annotations)
- Structural variants (Panaroo pangenome rearrangements)
- Store all data in highly efficient Parquet and DuckDB formats
See the package vignette for detailed usage.
# Install from GitHub
if (!requireNamespace("remotes", quietly = TRUE))
  install.packages("remotes")
remotes::install_github("JRaviLab/amRdata")

library(amRdata)
# Step 1: Download and prepare genomes with paired AST data from BV-BRC
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir = "data/Shigella_flexneri",
  method = "ftp", # or "cli"
  verbose = TRUE
)
# Step 2: Run full feature extraction (Panaroo → CD-HIT → InterProScan → metadata cleaning)
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads = 16,
  ref_file_path = "data_raw/"
)
# A final Parquet-backed DuckDB is created:
# data/Shigella_flexneri/Sfl_parquet.duckdb
This contains genome-by-feature matrices of presence/absence and counts across scales, as well as all available sample metadata.

### BV‑BRC data access

amRdata uses the BV‑BRC CLI (via Docker) or the FTP server to access:
- Genome metadata
- AMR phenotype data
- Genome assemblies (.fna, .faa, .gff)
Functions involved:
- .updateBVBRCdata()
- .retrieveCustomQuery()
- .retrieveQueryIDs()
- retrieveGenomes()
- .filterGenomes()
- prepareGenomes()
After the initial download, all BV-BRC metadata is cached automatically under data/bvbrc/bvbrcData.duckdb.
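If you want to inspect this cache directly, it can be opened like any DuckDB file. A minimal sketch, assuming the default cache location; the table names depend on what amRdata has downloaded, so list them first:

library(DBI)
library(duckdb)

# Open the cached BV-BRC metadata database read-only so the cache is not modified
con <- DBI::dbConnect(duckdb::duckdb(), "data/bvbrc/bvbrcData.duckdb", read_only = TRUE)

# See which metadata tables have been cached
DBI::dbListTables(con)

# Example query (hypothetical table name -- substitute one reported above):
# DBI::dbGetQuery(con, "SELECT * FROM genome_metadata LIMIT 5")

DBI::dbDisconnect(con, shutdown = TRUE)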
The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing data either using FTP or the BV-BRC CLI wrapped in a Docker container for reproducible access:
- Query isolate metadata with flexible filtering
- Download genome files (.fna, .faa, .gff)
- Retrieve AST results linking genotypes to phenotypes
- Apply quality control filters (assembly quality, metadata completeness)
Features are extracted at four complementary molecular scales:
Panaroo is executed inside a container using .runPanaroo().
Our pangenome creation approach:
- Allows end-to-end single pangenome runs
- Offers parallelized multi-batch pangenomes for large isolate sets (>5,000 genomes)
- Supports automated pangenome merging through .mergePanaroo()
- Generates gene presence/absence and count matrices per isolate
- Identifies structural variants (gene triplets indicating genome rearrangements)
Outputs are written into the per-taxon DuckDB for efficient storage and querying.
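After processing, the gene-scale outputs are also exported as Parquet files that can be read straight into R (file names as used in the full Shigella flexneri example below):

library(arrow)

# Gene cluster counts per isolate (genome x gene-cluster matrix)
gene_counts <- arrow::read_parquet("data/Shigella_flexneri/gene_count.parquet")

# Mapping from cluster IDs to annotated gene names
gene_names <- arrow::read_parquet("data/Shigella_flexneri/gene_names.parquet")

dim(gene_counts) # rows = isolates, columns = gene clusters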
CD-HIT is executed inside a container using .runCDHIT().
Our protein clustering approach:
- Clusters proteins across all isolates from BV-BRC .faa files
- Creates protein presence/absence and count matrices per isolate
- Saves cluster names and annotations
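A similar read-back works at the protein scale; note that protein_count.parquet is an assumed file name following the gene-scale pattern, so confirm against what is written to your output_path:

library(arrow)

# Hypothetical file name -- check the output directory for the exact name
protein_counts <- arrow::read_parquet("data/Shigella_flexneri/protein_count.parquet")

dim(protein_counts) # rows = isolates, columns = CD-HIT protein clusters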
InterProScan is executed inside a container using domainFromIPR().
Our Pfam domain extraction approach:
- Automatically configures InterPro’s databases for use
- Runs parallelized and containerized domain annotation
- Maps domain presence/absence and counts to genomes and proteins
- Provides another functional annotation layer
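And likewise for the domain scale; domain_count.parquet is again an assumed, not documented, file name:

library(arrow)

# Hypothetical file name -- check the output directory for the exact name
domain_counts <- arrow::read_parquet("data/Shigella_flexneri/domain_count.parquet")

dim(domain_counts) # rows = isolates, columns = Pfam domains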
Final data formatting and storage is executed using cleanData().
Our final data storage script:
- Harmonizes drug names, classes, and countries in BV-BRC metadata
- Generates temporal bins to stratify analysis across time
- Summarizes AMR information across the dataset
- Writes all data into highly compressed data structures:
  - Parquet: Binary, columnar storage for large matrices; these can be made human-readable by calling arrow::read_parquet()
  - DuckDB: SQL-queryable database for rapid filtering of linked Parquets
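Because the DuckDB file exposes the linked Parquets as SQL tables, metadata can be filtered without loading everything into memory. A sketch, with table and column names as assumptions (inspect them with dbListTables() and DESCRIBE first):

library(DBI)
library(duckdb)

con <- DBI::dbConnect(duckdb::duckdb(), "data/Shigella_flexneri/Sfl_parquet.duckdb", read_only = TRUE)

# List the Parquet-backed tables attached to the database
DBI::dbListTables(con)

# Hypothetical query: tabulate AST phenotype calls per drug
# (table/column names are illustrative -- inspect the metadata table first)
# DBI::dbGetQuery(con, "SELECT antibiotic, resistant_phenotype, COUNT(*) AS n
#                       FROM metadata
#                       GROUP BY antibiotic, resistant_phenotype")

DBI::dbDisconnect(con, shutdown = TRUE)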
Below is an example of downloading and processing all data and metadata for Shigella flexneri genomes with paired AST metadata.
library(amRdata)
# 1. Download & filter genomes
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir = "data/Shigella_flexneri",
  method = "ftp"
)
# 2. Run multi-scale feature extraction
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads = 8, # Or whatever your system supports
  ref_file_path = "data_raw/"
)
# 3. Load final data
library(DBI)
library(arrow)
# To view all attached data tables in the database
con <- DBI::dbConnect(duckdb::duckdb(), "data/Shigella_flexneri/Sfl_parquet.duckdb")
DBI::dbListTables(con)
# To load human-readable data tables into R
# e.g., Looking at gene cluster counts per isolate
Sfl_gene_counts <- arrow::read_parquet("data/Shigella_flexneri/gene_count.parquet")
# To connect gene cluster IDs to their annotated names
Sfl_gene_names <- arrow::read_parquet("data/Shigella_flexneri/gene_names.parquet")
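Before joining counts to names, check which key column the two tables share, and close the connection when you are done (a small housekeeping sketch):

# Column(s) shared between the count matrix and the annotation table (names vary by run)
intersect(names(Sfl_gene_counts), names(Sfl_gene_names))

# Close the DuckDB connection opened above
DBI::dbDisconnect(con, shutdown = TRUE)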
External dependencies (managed through Docker)
- BV‑BRC CLI
- Panaroo
- CD‑HIT
- InterProScan
- DuckDB
- Arrow (Parquet)
The user does not need to install these manually.
The package requires:
- An internet connection to access BV-BRC data and metadata
- A local Docker installation
  - Containers for internal tools are pulled automatically and do not require configuration
  - Make sure Docker is running before you start processing data!
- Sufficient storage for databases, downloaded files, and processed output (we recommend 20GB+)
- Multicore processing and sufficient RAM (16GB+) are highly recommended
  - Species with many isolates may run slowly or fail to complete on older hardware
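A quick pre-flight check that the Docker daemon is reachable (a convenience sketch, not part of the amRdata API):

# Exit status 0 means the Docker daemon responded
docker_ok <- system("docker info", ignore.stdout = TRUE, ignore.stderr = TRUE) == 0
if (!docker_ok) stop("Docker does not appear to be running; start it before processing data.")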
Feature matrix dimensions depend on the species:
- Rows: Number of isolates (typically <10,000)
- Columns: Number of features (ballpark estimates)
  - Genes: 5,000-50,000
  - Proteins: 5,000-50,000
  - Domains: 500-10,000
  - Structural variants: 1,000-10,000
The package uses established bioinformatics tools:
- Panaroo (≥1.3.0): Pangenome analysis
- CD-HIT (≥4.8.1): Protein clustering
- InterProScan (≥5.0): Domain annotation
- Docker: For BV-BRC CLI container
These are automatically managed through Docker containers.
Processing times vary by species and isolate count:
- Data download: 0-1 hours
- Pangenome construction: 0-6 hours
- Protein clustering: 0-3 hours
- Domain annotation: 0-1 hours
- Total: 1-12 hours for a complete species analysis

These numbers will vary greatly based on isolate number, genome complexity, and available hardware. Parallelization significantly reduces processing time when multiple cores are available.
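For example, a simple way to set the threads argument is to match it to the cores on your machine, leaving a couple free for the system (a convenience sketch reusing the arguments from the Quick Start above):

# Reserve two cores for the OS; never go below one thread
n_threads <- max(1, parallel::detectCores() - 2)

runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads = n_threads,
  ref_file_path = "data_raw/"
)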
amRdata is designed to work seamlessly with other amR packages:
library(amRdata)
library(amRml)
library(amRshiny)
# 1. Curate data
prepareGenomes("Shigella flexneri")
runDataProcessing("amRdata/data/Shigella_flexneri/Sfl.duckdb")
# 2. Train models
runMLmodels("amRdata/data/Shigella_flexneri/Sfl_parquet.duckdb")
# 3. Visualize
launch_dashboard()

If you use amRdata in your research, please cite:
Brenner E, Ghosh A, Wolfe E, Boyer E, Vang C, Lesiyon R, Mayer D, Ravi J. (2026).
amR: an R package suite to predict antimicrobial resistance in bacterial pathogens.
R package version 0.99.0.
https://github.com/JRaviLab/amR
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Report bugs and request features at: https://github.com/JRaviLab/amRml/issues
BSD 3-Clause License. See LICENSE for details.
Corresponding author: Janani Ravi (janani.ravi@cuanschutz.edu)
Lab website: https://jravilab.github.io
Please note that amRdata is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.