BOND: Biomedical Ontology Normalization and Disambiguation

License: MIT · Python 3.11+ · GitHub · Hugging Face Dataset · Hugging Face Model · Hugging Face Reranker

BOND is a system for mapping free-text biological terms to standardized ontology identifiers. It combines hybrid retrieval (exact matching, BM25, dense embeddings), reciprocal rank fusion, cross-encoder reranking, graph-based expansion, and LLM-powered disambiguation to achieve high-accuracy ontology normalization for biomedical metadata.

📓 Example Notebooks

  • Trained Reranker Notebook (Open in Colab): the fine-tuned BOND reranker model
  • Train Your Own Reranker (Open in Colab): complete guide to dataset structure, configuration, and training

🎯 Overview

BOND addresses the critical challenge of harmonizing diverse biological terminology across research studies and datasets. When researchers submit data to public repositories like GEO, ArrayExpress, or CELLxGENE, they often use inconsistent or non-standard terminology. BOND automatically maps these "author terms" to standardized ontology identifiers, enabling:

  • Metadata harmonization across datasets
  • Semantic interoperability for cross-study analysis
  • FAIR compliance through ontology-linked metadata
  • Scalable curation of large biomedical repositories

✨ Key Features

  • Hybrid Search Architecture: Combines exact matching, BM25 keyword search, and dense semantic search (FAISS) for comprehensive retrieval
  • Reciprocal Rank Fusion (RRF): Intelligently combines results from multiple retrieval methods
  • Cross-Encoder Reranker: Fine-tuned biomedical reranker re-ranks candidates, improving accuracy by 10-15%
  • LLM-Powered Expansion: Uses large language models to generate query expansions and context-aware synonyms
  • Graph-Based Expansion: Leverages ontology hierarchies to discover related terms
  • Context-Aware Disambiguation: LLM reasoning to select the correct ontology ID from candidate matches
  • Multi-Ontology Support: Works with Cell Ontology (CL), UBERON, MONDO, EFO, PATO, HANCESTRO, and more
  • Organism-Aware Routing: Automatically selects appropriate ontologies based on organism and field type
  • RESTful API: FastAPI-based service for easy integration
  • CLI Tool: Command-line interface for batch processing

📊 Supported Fields and Ontologies

Supported Fields

  • cell_type: Cell types and classifications (Cell Ontology)
  • tissue: Anatomical structures (UBERON)
  • disease: Disease conditions (MONDO)
  • development_stage: Developmental stages (organism-specific)
  • sex: Biological sex (PATO)
  • self_reported_ethnicity: Ethnicity/ancestry (HANCESTRO)
  • assay: Experimental methods (EFO)
  • organism: Taxonomic classification (NCBI Taxonomy)

Supported Organisms

  • Homo sapiens
  • Mus musculus
  • Danio rerio (zebrafish)
  • Drosophila melanogaster (fruit fly)
  • Caenorhabditis elegans (C. elegans)
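
All of these fields and organisms are served by the same query interface; the sketch below follows the Python API shown under Quick Start below (omitting the optional tissue context is an assumption here, and the input term is illustrative):

from bond import BondMatcher
from bond.config import BondSettings

# Field/organism routing sketch: development_stage for mouse is resolved
# against an organism-specific developmental-stage ontology.
matcher = BondMatcher(BondSettings())

result = matcher.query(
    query="embryonic day 14.5",       # illustrative author term
    field_name="development_stage",
    organism="Mus musculus",
)

print(result["chosen"]["id"], result["chosen"]["label"])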

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Aronow-Lab/BOND.git
cd BOND

# Create and activate virtual environment
python3.11 -m venv bond_venv
source bond_venv/bin/activate  # On Windows: bond_venv\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install -e .

Prerequisites

  1. Ontology Database: You need an SQLite database containing ontology terms. See Installation Guide for details.
  2. FAISS Index: Build a FAISS index for dense semantic search:
    bond-build-faiss --sqlite_path assets/ontologies.sqlite --assets_path assets
  3. Environment Variables: Configure LLM providers and embedding models (see Configuration below)

Basic Usage

CLI Example

bond-query \
  --query "T-cell" \
  --field cell_type \
  --organism "Homo sapiens" \
  --tissue "blood" \
  --verbose

Python API Example

from bond import BondMatcher
from bond.config import BondSettings

# Initialize matcher
settings = BondSettings()
matcher = BondMatcher(settings)

# Query
result = matcher.query(
    query="T-cell",
    field_name="cell_type",
    organism="Homo sapiens",
    tissue="blood"
)

print(f"Matched: {result['chosen']['label']}")
print(f"Ontology ID: {result['chosen']['id']}")
print(f"Confidence: {result['chosen']['llm_confidence']}")

API Server

# Start server (with anonymous access for development)
BOND_ALLOW_ANON=1 bond-serve

# Or with API key authentication
export BOND_API_KEY=your-secret-key
bond-serve

Then query the API:

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "query": "T-cell",
    "field_name": "cell_type",
    "organism": "Homo sapiens",
    "tissue": "blood"
  }'
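
The same request can be sent from Python; a minimal sketch with the requests library (host, port, and key mirror the examples above and should be adjusted to your deployment; the response structure is assumed to match the Python API example):

import requests

# Mirrors the curl call above; adjust host, port, and key for your deployment
response = requests.post(
    "http://localhost:8000/query",
    headers={"Authorization": "Bearer your-secret-key"},
    json={
        "query": "T-cell",
        "field_name": "cell_type",
        "organism": "Homo sapiens",
        "tissue": "blood",
    },
    timeout=120,
)
response.raise_for_status()
result = response.json()
print(result.get("chosen", result))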

🔧 Configuration

BOND uses environment variables for configuration. Create a .env file:

# Embedding Model (choose ONE of the options below)

# Option A: Use your Ollama encoder (recommended for local)
# Make sure the model is available in Ollama first:
#   ollama pull rajdeopankaj/bond-embed-v1-fp16
BOND_EMBED_MODEL=ollama:rajdeopankaj/bond-embed-v1-fp16
# If running remote Ollama, set the API base (default: http://localhost:11434)
# OLLAMA_API_BASE=http://your-ollama-host:11434

# Option B: Use a LiteLLM-compatible hosted embedding endpoint
# Example: OpenAI, Azure OpenAI, Together, Groq, etc.
# BOND_EMBED_MODEL=litellm:text-embedding-3-small

# Option C: Use Hugging Face TEI (Text Embeddings Inference)
# Deploy your HF model (pankajrajdeo/bond-embed-v1-fp16) behind a LiteLLM-compatible endpoint,
# then set the model name accordingly.
# Example if exposed as huggingface/teimodel (via LiteLLM routing):
# BOND_EMBED_MODEL=litellm:huggingface/teimodel

# LLM Providers (for expansion and disambiguation)
BOND_EXPANSION_LLM=anthropic/claude-3-5-sonnet-20241022
BOND_DISAMBIGUATION_LLM=anthropic/claude-3-5-sonnet-20241022


# Or use OpenAI
# BOND_EXPANSION_LLM=openai/gpt-4o
# BOND_DISAMBIGUATION_LLM=openai/gpt-4o

# API Keys (set as environment variables or in .env)
ANTHROPIC_API_KEY=your-key
OPENAI_API_KEY=your-key

# Paths
BOND_ASSETS_PATH=assets
BOND_SQLITE_PATH=assets/ontologies.sqlite

# Reranker Configuration
# Path to trained reranker model (cross-encoder)
# Default: Uses pre-trained model from reranker-model/ directory
# To use custom reranker: BOND_RERANKER_PATH=/path/to/your/reranker-model
# To disable reranker: BOND_ENABLE_RERANKER=0
BOND_RERANKER_PATH=./reranker-model
BOND_ENABLE_RERANKER=1

# Retrieval Parameters
BOND_TOPK_EXACT=5
BOND_TOPK_BM25=20
BOND_TOPK_DENSE=50
BOND_TOPK_FINAL=20

# Optional: Disable LLM stages for retrieval-only mode
# BOND_RETRIEVAL_ONLY=1
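
The CLI and server read these variables from the environment. If your setup does not load the .env file automatically, you can load it yourself before constructing BondSettings; a small sketch using python-dotenv (that BondSettings reads the process environment is an assumption here):

from dotenv import load_dotenv

from bond import BondMatcher
from bond.config import BondSettings

# Copy variables from .env into the process environment (no-op if already set)
load_dotenv()

settings = BondSettings()
matcher = BondMatcher(settings)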

Using Published Encoders

  • Hugging Face model: pankajrajdeo/bond-embed-v1-fp16. See the model card and direct usage: https://huggingface.co/pankajrajdeo/bond-embed-v1-fp16
  • Ollama model: rajdeopankaj/bond-embed-v1-fp16. Pull it first, then set BOND_EMBED_MODEL=ollama:rajdeopankaj/bond-embed-v1-fp16 and (optionally) OLLAMA_API_BASE.

Note: BOND's embedding provider currently supports LiteLLM-style endpoints and Ollama natively. If you prefer running the Hugging Face model locally with SentenceTransformers, use it to precompute embeddings for your own pipelines; for BOND's FAISS build and runtime embedding, route the HF model via a LiteLLM-compatible endpoint (e.g., TEI behind a gateway) or use the Ollama variant.
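
For example, a minimal sketch of precomputing embeddings locally with SentenceTransformers (illustrative only; it does not feed BOND's FAISS build directly):

from sentence_transformers import SentenceTransformer

# Load the published encoder from the Hugging Face Hub
model = SentenceTransformer("pankajrajdeo/bond-embed-v1-fp16")

terms = ["T-cell", "natural killer cell", "cardiac muscle tissue"]  # illustrative inputs
embeddings = model.encode(terms, normalize_embeddings=True)

print(embeddings.shape)  # (number of terms, embedding dimension)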

📖 Documentation

πŸ—οΈ Architecture

Pipeline Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 1: QUERY EXPANSION                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LLM generates synonyms, abbreviations,         β”‚
β”‚ and context-aware expansions                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 2: HYBRID RETRIEVAL                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β€’ Exact Match (SQLite FTS) β†’ Top-K             β”‚
β”‚ β€’ BM25 Search (SQLite FTS) β†’ Top-K             β”‚
β”‚ β€’ Dense Search (FAISS) β†’ Top-K                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 3: RECIPROCAL RANK FUSION                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Combine rankings from all methods using RRF    β”‚
β”‚ with field-aware weighting                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 4: CROSS-ENCODER RERANKER                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Re-rank top candidates using fine-tuned        β”‚
β”‚ biomedical reranker model (bioformer-16L)      β”‚
β”‚ Improves accuracy by 10-15%                    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 5: GRAPH EXPANSION (Optional)            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Expand ontology neighbors if confidence low    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ STAGE 6: LLM DISAMBIGUATION                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ LLM selects best match from candidates with    β”‚
β”‚ reasoning and confidence scoring                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Components

  • bond/pipeline.py: Core BondMatcher class implementing the full pipeline
  • bond/retrieval/: Retrieval modules (BM25, FAISS)
  • bond/fusion.py: Reciprocal rank fusion implementation
  • bond/rerank.py: Cross-encoder reranker integration (loads and applies fine-tuned reranker model)
  • bond/graph_utils.py: Ontology graph traversal
  • bond/server.py: FastAPI REST service
  • bond/cli.py: Command-line interface

📊 Benchmark Dataset

BOND includes a comprehensive, context-aware benchmark dataset derived from CELLxGENE metadata:

  • 25,416 normalization pairs across 7 field types (cell_type, tissue, disease, assay, sex, development_stage, self_reported_ethnicity)
  • 85 high-quality datasets from CELLxGENE Census (72 single-cell, 13 spatial)
  • 186 unique tissues across Homo sapiens and Mus musculus
  • Stratified 80/10/10 splits: 20,332 train / 2,542 validation / 2,542 test

Unlike traditional string-matching benchmarks, BOND-CZI requires reasoning over biological context, including tissue, disease, organism, and development stage information, to accurately map author-provided annotations to standardized ontology labels.

Dataset: AronowLab/bond-czi-benchmark

Usage

from datasets import load_dataset

dataset = load_dataset("AronowLab/bond-czi-benchmark")

# Access splits
train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]

# Example
sample = train[0]
print(f"Input: {sample['author_term']}")
print(f"Context: {sample['tissue']}, {sample['disease']}")
print(f"Target: {sample['ontology_label']} ({sample['ontology_id']})")

🔬 Evaluation

BOND has been evaluated on the BOND-CZI benchmark dataset (25,416 pairs from 85 datasets). Performance highlights:

  • Embedding Model: Custom fine-tuned encoder (bond-embed-v1-fp16) achieves 92.7% accuracy@10 on ontology evaluation
  • Hybrid Retrieval: Combines exact match, BM25, and dense semantic search with Reciprocal Rank Fusion
  • Cross-Encoder Reranker: Fine-tuned biomedical reranker (bioformers/bioformer-16L) improves Hit@10 accuracy from ~75-80% (retrieval only) to 85-90% (with reranker)
  • LLM Disambiguation: Uses large language models for context-aware term selection
  • Multi-field Support: Handles 7 biological field types (cell_type, tissue, disease, etc.) across 5 organisms

See the benchmark-metrics/README.md for evaluation results and scripts.
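
As a starting point for your own runs, a minimal sketch that scores top-1 agreement between BOND and the benchmark's gold labels (field_name and organism are hard-coded here for illustration; use the per-sample values from the dataset if your copy includes them):

from datasets import load_dataset

from bond import BondMatcher
from bond.config import BondSettings

test = load_dataset("AronowLab/bond-czi-benchmark", split="test")
matcher = BondMatcher(BondSettings())

subset = test.select(range(100))  # small slice to keep the run short
hits = 0
for sample in subset:
    result = matcher.query(
        query=sample["author_term"],
        field_name="cell_type",   # illustrative; replace with the sample's field
        organism="Homo sapiens",  # illustrative; replace with the sample's organism
        tissue=sample["tissue"],
    )
    if result["chosen"]["id"] == sample["ontology_id"]:
        hits += 1

print(f"Top-1 agreement on {len(subset)} samples: {hits / len(subset):.1%}")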

🔄 Current Work

Active development and research efforts:

  • Harmonized Knowledge Graph: Building a harmonized transcriptomic knowledge graph from BOND-normalized metadata to enable advanced querying, cross-dataset analysis, and relationship discovery in single-cell transcriptomics data.

πŸ› οΈ Development

Running Tests

make test
# or
pytest

Code Quality

make lint
# or
ruff check .
black .

Building FAISS Index

bond-build-faiss \
  --sqlite_path assets/ontologies.sqlite \
  --assets_path assets

Generating Ontology Database

bond-generate-sqlite

πŸ“ Citation

If you use BOND in your research, please cite:

BOND System:

@software{bond_2026,
  title={BOND: Biomedical Ontology Normalization and Disambiguation},
  author={Rajdeo, Pankaj and Gelal, Rupesh},
  year={2026},
  url={https://github.com/Aronow-Lab/BOND}
}

🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

