amRdata

Lifecycle: experimental

amRdata is the first package in the amR suite for antimicrobial resistance (AMR) prediction. It takes a user‑provided species or taxon ID, downloads the corresponding genomes and AST data from BV‑BRC, constructs pangenomes, extracts features at multiple molecular scales, and prepares a unified Parquet‑backed DuckDB file for downstream ML modeling in amRml.

The workflow comprises six primary processes:

  1. BV‑BRC metadata (isolate metadata + AMR phenotypic labels) →
  2. BV-BRC genomes (sequence data) →
  3. Panaroo pangenome (genes, struct) →
  4. CD‑HIT protein clusters (proteins) →
  5. Pfam domain extraction (domains) →
  6. Database formatting

Overview

amRdata includes functions to:

  • Query and download bacterial genome data from BV-BRC
  • Acquire paired antimicrobial susceptibility testing (AST) results
  • Extract molecular features across scales:
    • Gene clusters (Panaroo pangenome analysis)
    • Protein clusters (CD-HIT sequence similarity)
    • Protein domains (Pfam annotations)
    • Structural variants (Panaroo pangenome rearrangements)
  • Store all data in highly efficient Parquet and DuckDB formats

See the package vignette for detailed usage.

Installation

# Install from GitHub
if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")

remotes::install_github("JRaviLab/amRdata")

Quick start

library(amRdata)

# Step 1: Download and prepare genomes with paired AST data from BV-BRC
prepareGenomes(
  user_bacs = c("Shigella flexneri"),
  base_dir  = "data/Shigella_flexneri",
  method    = "ftp",   # or "cli"
  verbose   = TRUE
)

# Step 2: Run full feature extraction (Panaroo → CD-HIT → InterProScan → metadata cleaning)
runDataProcessing(
  duckdb_path = "data/Shigella_flexneri/Sfl.duckdb",
  output_path = "data/Shigella_flexneri",
  threads     = 16,
  ref_file_path = "data_raw/"
)

# A final Parquet-backed DuckDB is created:
#   data/Shigella_flexneri/Sfl_parquet.duckdb

The resulting database contains feature presence/absence and count data at each scale as genome-by-feature matrices, along with all available sample metadata.
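
To make the matrix layout concrete, here is a minimal base-R sketch of how a count matrix and a presence/absence matrix relate to a long table of feature assignments. The genome and cluster IDs are invented for illustration and are not taken from amRdata's actual schema.

```r
# Toy long-format table: one row per (genome, gene cluster) observation.
# All IDs below are hypothetical.
assignments <- data.frame(
  genome_id = c("g1", "g1", "g1", "g2", "g3", "g3"),
  cluster   = c("clusA", "clusA", "clusB", "clusB", "clusA", "clusC")
)

# Count matrix: genomes in rows, features in columns.
counts <- unclass(table(assignments$genome_id, assignments$cluster))

# Presence/absence matrix: 1 if the feature occurs at least once, else 0.
presence <- (counts > 0) * 1L

print(counts)
print(presence)
```

In this toy data, genome g1 carries two copies of clusA, so its count is 2 while its presence value is 1.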

Package features

Data curation

1. BV‑BRC data access

amRdata uses the BV‑BRC CLI (via Docker) or the BV‑BRC FTP server to access:
  • Genome metadata
  • AMR phenotype data
  • Genome assemblies (.fna, .faa, .gff)

Functions involved:

.updateBVBRCdata()
.retrieveCustomQuery()
.retrieveQueryIDs()
retrieveGenomes()
.filterGenomes()
prepareGenomes()

After initial download, all BV-BRC metadata is cached automatically under: data/bvbrc/bvbrcData.duckdb

The package interfaces with BV-BRC (Bacterial and Viral Bioinformatics Resource Center) to access bacterial genome sequences and antimicrobial susceptibility testing data. Access is through either FTP or the BV-BRC CLI, which is wrapped in a Docker container for reproducibility. With it, you can:

  • Query isolate metadata with flexible filtering
  • Download genome files (.fna, .faa, .gff)
  • Retrieve AST results linking genotypes to phenotypes
  • Apply quality control filters (assembly quality, metadata completeness)
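
The filtering and genotype-to-phenotype pairing described above can be pictured with a small base-R sketch. This is only an illustration under stated assumptions: the column names, the contig cutoff, and the records are hypothetical, and the package's own filtering logic may differ.

```r
# Hypothetical isolate metadata; column names are assumptions, not amRdata's schema.
genomes <- data.frame(
  genome_id      = c("562.1", "562.2", "562.3", "562.4"),
  genome_quality = c("Good", "Poor", "Good", "Good"),
  contigs        = c(85L, 900L, 120L, 60L),
  stringsAsFactors = FALSE
)

# Hypothetical AST records: one row per genome/antibiotic test.
ast <- data.frame(
  genome_id  = c("562.1", "562.1", "562.2", "562.4"),
  antibiotic = c("ampicillin", "ciprofloxacin", "ampicillin", "ampicillin"),
  phenotype  = c("Resistant", "Susceptible", "Resistant", NA),
  stringsAsFactors = FALSE
)

# Quality filter: keep good-quality assemblies below an illustrative contig cutoff.
max_contigs <- 500L
good <- subset(genomes, genome_quality == "Good" & contigs <= max_contigs)

# Completeness filter: drop AST rows without a phenotype label.
labelled <- ast[!is.na(ast$phenotype), ]

# Inner join keeps only genomes that pass QC and have at least one label.
paired <- merge(good, labelled, by = "genome_id")
print(paired)
```

Only genome 562.1 survives here: 562.2 fails the quality filter, 562.3 has no AST record, and 562.4 has no phenotype label.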

Feature extraction

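Features are extracted at four complementary molecular scales (gene clusters, protein clusters, Pfam domains, and structural variants) in the steps below, followed by a final data cleaning and storage step: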

1. Gene clusters

Panaroo is executed inside a container using .runPanaroo().

Our pangenome creation approach:

  • Allows end-to-end single pangenome runs
  • Offers parallelized multi-batch pangenomes for large isolate sets (>5,000 genomes)
    • Supports automated pangenome merging through .mergePanaroo()
  • Generates gene presence/absence and count matrices per isolate
  • Identifies structural variants (gene triplets indicating genome rearrangements)

Outputs are written into the per-taxon DuckDB for efficient storage and querying.
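
As a rough illustration of the multi-batch idea, the sketch below splits a list of genome IDs into fixed-size batches that could each be processed independently before merging. The IDs and the batch size are assumptions made for this example; how .runPanaroo() actually partitions isolates is not shown here.

```r
# Hypothetical genome IDs; in practice these would come from the filtered metadata.
genome_ids <- sprintf("genome_%05d", 1:12000)

# Illustrative batch size (an assumption, chosen to match the >5,000 note above).
batch_size <- 5000

# Assign consecutive IDs to batches of at most `batch_size` genomes.
batch_index <- ceiling(seq_along(genome_ids) / batch_size)
batches <- split(genome_ids, batch_index)

# Each batch would be processed independently, then merged.
print(lengths(batches))
```

With 12,000 IDs this yields three batches of 5,000, 5,000, and 2,000 genomes.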

2. Protein clusters

CD-HIT is executed inside a container using .runCDHIT().

Our protein clustering approach:

  • Clusters proteins across all isolates from BV-BRC .faa files
  • Creates protein presence/absence and count matrices per isolate
  • Saves cluster names and annotations
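
For readers unfamiliar with CD-HIT output, the following sketch parses a few lines in the .clstr layout into a table and then into a genome-by-cluster count matrix. The protein IDs and the "<genome>|<protein>" naming convention are invented for this example, and this is not necessarily how .runCDHIT() processes its results.

```r
# Example text in CD-HIT's .clstr layout; the protein IDs are hypothetical.
clstr_lines <- c(
  ">Cluster 0",
  "0\t350aa, >g1|prot_001... *",
  "1\t348aa, >g2|prot_107... at 98.50%",
  ">Cluster 1",
  "0\t120aa, >g1|prot_002... *"
)

# Header lines start with ">Cluster"; every other line is a member.
is_header <- grepl("^>Cluster", clstr_lines)

# Running count of headers gives each member its 0-based cluster number.
cluster_id <- cumsum(is_header) - 1L

members <- clstr_lines[!is_header]
clusters <- data.frame(
  cluster = paste0("Cluster_", cluster_id[!is_header]),
  # The protein ID sits between ">" and the trailing "...".
  protein = sub("^.*>(.+)\\.\\.\\..*$", "\\1", members),
  # The representative sequence is marked with a trailing "*".
  representative = grepl("\\*$", members),
  stringsAsFactors = FALSE
)

# Assumed ID convention for this toy data: "<genome>|<protein>".
clusters$genome_id <- sub("\\|.*$", "", clusters$protein)

# Genome-by-cluster count matrix.
print(unclass(table(clusters$genome_id, clusters$cluster)))
```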

3. Pfam domains

InterProScan is executed inside a container using domainFromIPR().

Our Pfam domain extraction approach:

  • Automatically configures InterPro’s databases for use
  • Runs parallelized and containerized domain annotation
  • Maps domain presence/absence and counts to genomes and proteins
  • Provides another functional annotation layer
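
The mapping from domain hits to genomes can be sketched in base R as follows. The rows mimic the first eleven columns of InterProScan's tab-separated output, but the protein IDs, checksums, and scores are invented, and the "<genome>|<protein>" ID convention is an assumption for this example rather than what domainFromIPR() does internally.

```r
# Three rows in InterProScan's tab-separated layout (first 11 columns);
# protein IDs, checksums, and scores are invented for illustration.
tsv_text <- paste(
  "g1|prot_001\tmd5a\t350\tPfam\tPF00005\tABC transporter\t20\t170\t1.2e-30\tT\t01-01-2025",
  "g1|prot_001\tmd5a\t350\tGene3D\tG3DSA:3.40.50.300\t-\t5\t200\t3.0e-40\tT\t01-01-2025",
  "g2|prot_107\tmd5b\t348\tPfam\tPF00005\tABC transporter\t18\t168\t2.5e-29\tT\t01-01-2025",
  sep = "\n"
)

# Read everything as text so values such as the status flag "T"
# are not silently converted to TRUE.
ipr <- read.delim(
  text = tsv_text, header = FALSE, quote = "", colClasses = "character",
  col.names = c("protein", "md5", "length", "analysis", "signature",
                "description", "start", "stop", "score", "status", "date")
)

# Keep only Pfam hits.
pfam <- ipr[ipr$analysis == "Pfam", ]

# Assumed ID convention for this toy data: "<genome>|<protein>".
pfam$genome_id <- sub("\\|.*$", "", pfam$protein)

# Genome-by-domain count matrix.
domain_counts <- unclass(table(pfam$genome_id, pfam$signature))
print(domain_counts)
```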

4. Data cleaning and storage

Final data formatting and storage is executed using cleanData().

Our final data storage script:

  • Harmonizes drug names, classes, and countries in BV-BRC metadata
  • Generates temporal bins to stratify analysis across time
  • Summarizes AMR information across the dataset
  • Writes all data into highly compressed data structures
    • Parquet: Binary, columnar storage for large matrices
      • These can be read into R as data frames with arrow::read_parquet()
    • DuckDB: SQL-queryable database for rapid filtering of linked Parquets
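
The harmonization and temporal binning steps can be pictured with a short base-R sketch. The column names, the drug-to-class lookup, and the decade break points are all assumptions chosen for illustration; cleanData() may use different rules.

```r
# Hypothetical metadata rows; column names and values are for illustration only.
meta <- data.frame(
  genome_id       = c("g1", "g2", "g3", "g4"),
  antibiotic      = c("Ampicillin ", "ampicillin", "CIPROFLOXACIN", "ciprofloxacin"),
  collection_year = c(1998L, 2007L, 2015L, 2021L),
  stringsAsFactors = FALSE
)

# Harmonize drug names: trim whitespace and lower-case.
meta$antibiotic <- tolower(trimws(meta$antibiotic))

# Map each drug to a class via a small lookup (illustrative, not exhaustive).
drug_class <- c(ampicillin = "penicillins", ciprofloxacin = "fluoroquinolones")
meta$drug_class <- unname(drug_class[meta$antibiotic])

# Temporal bins: the break points here are an arbitrary choice.
meta$year_bin <- cut(
  meta$collection_year,
  breaks = c(1990, 2000, 2010, 2020, 2030),
  right  = FALSE,  # intervals are [start, end)
  labels = c("1990-1999", "2000-2009", "2010-2019", "2020-2029")
)

print(meta)
```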

Workflow example

The following example downloads and processes all data and metadata for Shigella flexneri genomes with paired AST metadata.

library(amRdata)

# 1. Download & filter genomes
prepareGenomes(
  user_bacs  = c("Shigella flexneri"),
  base_dir   = "data/Shigella_flexneri",
  method     = "ftp"
)

# 2. Run multi-scale feature extraction

runDataProcessing(
  duckdb_path    = "data/Shigella_flexneri/Sfl.duckdb",
  output_path    = "data/Shigella_flexneri",
  threads        = 8, # Or whatever your system supports
  ref_file_path  = "data_raw/"
)

# 3. Load final data

library(DBI)
library(arrow)

# To view all attached data tables in the database

con <- DBI::dbConnect(duckdb::duckdb(), "data/Shigella_flexneri/Sfl_parquet.duckdb")
DBI::dbListTables(con)

# Close the connection when finished
DBI::dbDisconnect(con, shutdown = TRUE)

# To load data tables into R as data frames

# e.g., Looking at gene cluster counts per isolate
Sfl_gene_counts <- arrow::read_parquet("data/Shigella_flexneri/gene_count.parquet")

# To connect gene cluster IDs to their annotated names
Sfl_gene_names <- arrow::read_parquet("data/Shigella_flexneri/gene_names.parquet")
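
One way a DuckDB database can sit on top of Parquet files is through views that read from them, so large tables can be filtered with SQL before anything is loaded into R. The sketch below demonstrates the idea with a toy table and a temporary file. It assumes the DBI and duckdb packages are installed. Whether amRdata links its Parquet files in exactly this way is an assumption, and the table and column names are hypothetical.

```r
library(DBI)

# Toy count table written to a temporary Parquet file (hypothetical data).
gene_count <- data.frame(
  genome_id = c("g1", "g2", "g3"),
  clusA     = c(2L, 0L, 1L),
  clusB     = c(1L, 1L, 0L)
)
parquet_path <- normalizePath(tempfile(fileext = ".parquet"),
                              winslash = "/", mustWork = FALSE)

con <- dbConnect(duckdb::duckdb())  # in-memory database
dbWriteTable(con, "gene_count_tmp", gene_count)
dbExecute(con, sprintf(
  "COPY gene_count_tmp TO '%s' (FORMAT PARQUET)", parquet_path
))

# A view over the Parquet file: the database stores only the definition,
# while the data stay in the Parquet file.
dbExecute(con, sprintf(
  "CREATE VIEW gene_count AS SELECT * FROM read_parquet('%s')", parquet_path
))

# Filter with SQL instead of loading the whole matrix into R.
hits <- dbGetQuery(con, "SELECT genome_id, clusA FROM gene_count WHERE clusA > 0")
print(hits)

dbDisconnect(con, shutdown = TRUE)
```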

Data requirements

External dependencies (managed through Docker)

  • BV‑BRC CLI
  • Panaroo
  • CD‑HIT
  • InterProScan
  • DuckDB
  • Arrow (Parquet)

The user does not need to install these manually.

The package requires:

  • An internet connection to access BV-BRC data and metadata
  • A local Docker installation
    • Containers for internal tools are pulled automatically and do not require configuration
    • Make sure Docker is running before you start processing data!
  • Sufficient storage for databases, downloaded files, and processed output (we recommend 20GB+)
  • Multicore processing and sufficient RAM (16GB+) are highly recommended
    • Species with many isolates may run poorly or fail to complete on older hardware
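
A quick pre-flight check along these lines can confirm that Docker is reachable before a long run is started. `docker_is_running()` is a hypothetical helper written for this example and is not part of amRdata. It relies only on the standard `docker info` command, which exits with a non-zero status when the daemon cannot be reached.

```r
# Returns TRUE only if the docker executable is on the PATH and the daemon responds.
# `docker_is_running` is a hypothetical helper, not part of amRdata.
docker_is_running <- function() {
  if (!nzchar(Sys.which("docker"))) {
    return(FALSE)  # Docker CLI not found
  }
  # `docker info` exits with a non-zero status when the daemon is unreachable.
  status <- suppressWarnings(
    system2("docker", "info", stdout = FALSE, stderr = FALSE)
  )
  identical(as.integer(status), 0L)
}

if (!docker_is_running()) {
  message("Docker does not appear to be running; start it before processing data.")
}
```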

Output

Feature matrix dimensions depend on the species:

  • Rows: Number of isolates (typically <10,000)
  • Columns: Number of features (ballpark estimates)
    • Genes: 5,000-50,000
    • Proteins: 5,000-50,000
    • Domains: 500-10,000
    • Structural variants: 1,000-10,000
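
A back-of-the-envelope calculation shows why these dimensions matter for memory. The sketch below estimates the size of a single dense matrix held in R at the upper end of the ranges above. It is a rough bound under that assumption only and ignores on-disk compression.

```r
# Upper-end dimensions taken from the estimates above.
n_isolates <- 10000
n_features <- 50000

# R stores integers in 4 bytes and doubles in 8 bytes per cell.
cells <- n_isolates * n_features
dense_gib <- c(integer = cells * 4, double = cells * 8) / 1024^3

# Roughly 1.86 GiB as integers and 3.73 GiB as doubles.
print(round(dense_gib, 2))
```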

External dependencies

The package uses established bioinformatics tools:

  • Panaroo (≥1.3.0): Pangenome analysis
  • CD-HIT (≥4.8.1): Protein clustering
  • InterProScan (≥5.0): Domain annotation
  • Docker: For BV-BRC CLI container

These are automatically managed through the Docker container.

Performance

Processing times vary by species and isolate count:

  • Data download: 0-1 hours
  • Pangenome construction: 0-6 hours
  • Protein clustering: 0-3 hours
  • Domain annotation: 0-1 hours
  • Total: 1-12 hours for a complete species analysis

These numbers vary greatly with isolate count, genome complexity, and available hardware. Parallelization significantly reduces processing time when multiple cores are available.

Integration with amR suite

amRdata is designed to work seamlessly with other amR packages:

library(amRdata)
library(amRml)
library(amRshiny)

# 1. Curate data
prepareGenomes("Shigella flexneri")
runDataProcessing("amRdata/data/Shigella_flexneri/Sfl.duckdb")

# 2. Train models
runMLmodels("amRdata/data/Shigella_flexneri/Sfl_parquet.duckdb")

# 3. Visualize (to be added)
launch_dashboard()

Related packages

  • amR: Suite metapackage
  • amRml: ML for AMR prediction
  • amRshiny: Interactive dashboard

Citation

If you use amRdata in your research, please cite:

Brenner E, Ghosh A, Wolfe E, Boyer E, Vang C, Lesiyon R, Mayer D, Ravi J. (2026).
amR: an R package suite to predict antimicrobial resistance in bacterial pathogens.
R package version 0.99.0.
https://github.com/JRaviLab/amR

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Reporting issues

Report bugs and request features at: https://github.com/JRaviLab/amRdata/issues

License

BSD 3-Clause License. See LICENSE for details.

Contact

Corresponding author: Janani Ravi (janani.ravi@cuanschutz.edu)

Lab website: https://jravilab.github.io

Code of conduct

Please note that amRdata is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
