BOND is a system for mapping free-text biological terms to standardized ontology identifiers. It combines hybrid retrieval (exact matching, BM25, dense embeddings), reciprocal rank fusion, cross-encoder reranking, graph-based expansion, and LLM-powered disambiguation to achieve high-accuracy ontology normalization of biomedical metadata.
- Reranker Training Notebook: Fine-tuning notebook for the BOND reranker model
- Train Your Own Reranker: Complete guide with dataset structure, configuration, and training instructions
BOND addresses the critical challenge of harmonizing diverse biological terminology across research studies and datasets. When researchers submit data to public repositories like GEO, ArrayExpress, or CELLxGENE, they often use inconsistent or non-standard terminology. BOND automatically maps these "author terms" to standardized ontology identifiers, enabling:
- Metadata harmonization across datasets
- Semantic interoperability for cross-study analysis
- FAIR compliance through ontology-linked metadata
- Scalable curation of large biomedical repositories
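As a concrete illustration, a normalized mapping pairs a free-text term with an ontology ID, label, and confidence score. The values below are illustrative only; the actual response structure is shown in the Python usage example later in this README:

```python
# Illustrative mapping only; the field names mirror the query example shown later.
author_term = "T-cell"                # free-text term as submitted by an author
normalized = {
    "id": "CL:0000084",               # Cell Ontology identifier for "T cell"
    "label": "T cell",
    "llm_confidence": 0.95,           # hypothetical confidence value
}
```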
- Hybrid Search Architecture: Combines exact matching, BM25 keyword search, and dense semantic search (FAISS) for comprehensive retrieval
- Reciprocal Rank Fusion (RRF): Intelligently combines results from multiple retrieval methods
- Cross-Encoder Reranker: A fine-tuned biomedical cross-encoder re-scores retrieved candidates, improving accuracy by 10-15% (see the sketch after this list)
- LLM-Powered Expansion: Uses large language models to generate query expansions and context-aware synonyms
- Graph-Based Expansion: Leverages ontology hierarchies to discover related terms
- Context-Aware Disambiguation: LLM reasoning to select the correct ontology ID from candidate matches
- Multi-Ontology Support: Works with Cell Ontology (CL), UBERON, MONDO, EFO, PATO, HANCESTRO, and more
- Organism-Aware Routing: Automatically selects appropriate ontologies based on organism and field type
- RESTful API: FastAPI-based service for easy integration
- CLI Tool: Command-line interface for batch processing
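The cross-encoder reranking step referenced above can be sketched as scoring each (query, candidate) pair and sorting candidates by score. This is a minimal sketch, not BOND's internal bond/rerank.py; it assumes the reranker loads as a sentence-transformers CrossEncoder and that a local model path such as ./reranker-model (or the hosted AronowLab/BOND-reranker) is available:

```python
# Minimal cross-encoder reranking sketch (illustrative only).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("./reranker-model")  # assumed path to the fine-tuned reranker
query = "T-cell"
candidates = ["T cell", "thymocyte", "T-cell receptor complex"]

# Score every (query, candidate) pair, then sort candidates best-first.
scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```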
- cell_type: Cell types and classifications (Cell Ontology)
- tissue: Anatomical structures (UBERON)
- disease: Disease conditions (MONDO)
- development_stage: Developmental stages (organism-specific)
- sex: Biological sex (PATO)
- self_reported_ethnicity: Ethnicity/ancestry (HANCESTRO)
- assay: Experimental methods (EFO)
- organism: Taxonomic classification (NCBI Taxonomy)
- Homo sapiens
- Mus musculus
- Danio rerio (zebrafish)
- Drosophila melanogaster (fruit fly)
- Caenorhabditis elegans (C. elegans)
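Conceptually, field- and organism-aware routing amounts to a lookup from field type (and organism, where relevant) to a target ontology. The mapping below is a simplified sketch based on the field list above, not BOND's actual routing code:

```python
# Simplified routing sketch (illustrative only). development_stage resolves to an
# organism-specific ontology, e.g., HsapDv for Homo sapiens or MmusDv for Mus musculus.
FIELD_TO_ONTOLOGY = {
    "cell_type": "CL",                      # Cell Ontology
    "tissue": "UBERON",
    "disease": "MONDO",
    "assay": "EFO",
    "sex": "PATO",
    "self_reported_ethnicity": "HANCESTRO",
    "organism": "NCBITaxon",                # NCBI Taxonomy
}
print(FIELD_TO_ONTOLOGY["cell_type"])       # -> "CL"
```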
# Clone the repository
git clone https://github.com/Aronow-Lab/BOND.git
cd BOND
# Create and activate virtual environment
python3.11 -m venv bond_venv
source bond_venv/bin/activate # On Windows: bond_venv\Scripts\activate
# Install dependencies
pip install --upgrade pip
pip install -e .
- Ontology Database: You need an SQLite database containing ontology terms. See the Installation Guide for details.
- FAISS Index: Build a FAISS index for dense semantic search:
bond-build-faiss --sqlite_path assets/ontologies.sqlite --assets_path assets
- Environment Variables: Configure LLM providers and embedding models (see Configuration below)
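For intuition, a dense-index build boils down to embedding every ontology label and synonym and adding the vectors to a FAISS index. The sketch below is illustrative only (bond-build-faiss handles this for you); embed() is a hypothetical helper returning an (n, d) embedding array, and the output path is made up:

```python
# Illustrative FAISS build sketch, not the bond-build-faiss implementation.
import numpy as np
import faiss  # pip install faiss-cpu

labels = ["T cell", "B cell", "natural killer cell"]   # ontology labels/synonyms
vectors = np.asarray(embed(labels), dtype="float32")   # embed() is hypothetical, shape (n, d)
faiss.normalize_L2(vectors)                            # unit vectors: inner product == cosine

index = faiss.IndexFlatIP(vectors.shape[1])            # exact inner-product index
index.add(vectors)
faiss.write_index(index, "assets/ontology.faiss")      # hypothetical output path
```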
bond-query \
--query "T-cell" \
--field cell_type \
--organism "Homo sapiens" \
--tissue "blood" \
--verbose
from bond import BondMatcher
from bond.config import BondSettings
# Initialize matcher
settings = BondSettings()
matcher = BondMatcher(settings)
# Query
result = matcher.query(
query="T-cell",
field_name="cell_type",
organism="Homo sapiens",
tissue="blood"
)
print(f"Matched: {result['chosen']['label']}")
print(f"Ontology ID: {result['chosen']['id']}")
print(f"Confidence: {result['chosen']['llm_confidence']}")
# Start server (with anonymous access for development)
BOND_ALLOW_ANON=1 bond-serve
# Or with API key authentication
export BOND_API_KEY=your-secret-key
bond-serve
Then query the API:
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-key" \
-d '{
"query": "T-cell",
"field_name": "cell_type",
"organism": "Homo sapiens",
"tissue": "blood"
}'
BOND uses environment variables for configuration. Create a .env file:
# Embedding Model (choose ONE of the options below)
# Option A: Use your Ollama encoder (recommended for local)
# Make sure the model is available in Ollama first:
# ollama pull rajdeopankaj/bond-embed-v1-fp16
BOND_EMBED_MODEL=ollama:rajdeopankaj/bond-embed-v1-fp16
# If running remote Ollama, set the API base (default: http://localhost:11434)
# OLLAMA_API_BASE=http://your-ollama-host:11434
# Option B: Use a LiteLLM-compatible hosted embedding endpoint
# Example: OpenAI, Azure OpenAI, Together, Groq, etc.
# BOND_EMBED_MODEL=litellm:text-embedding-3-small
# Option C: Use Hugging Face TEI (Text Embeddings Inference)
# Deploy your HF model (pankajrajdeo/bond-embed-v1-fp16) behind a LiteLLM-compatible endpoint,
# then set the model name accordingly.
# Example if exposed as huggingface/teimodel (via LiteLLM routing):
# BOND_EMBED_MODEL=litellm:huggingface/teimodel
# LLM Providers (for expansion and disambiguation)
BOND_EXPANSION_LLM=anthropic/claude-3-5-sonnet-20241022
BOND_DISAMBIGUATION_LLM=anthropic/claude-3-5-sonnet-20241022
# Or use OpenAI
# BOND_EXPANSION_LLM=openai/gpt-4o
# BOND_DISAMBIGUATION_LLM=openai/gpt-4o
# API Keys (set as environment variables or in .env)
ANTHROPIC_API_KEY=your-key
OPENAI_API_KEY=your-key
# Paths
BOND_ASSETS_PATH=assets
BOND_SQLITE_PATH=assets/ontologies.sqlite
# Reranker Configuration
# Path to trained reranker model (cross-encoder)
# Default: Uses pre-trained model from reranker-model/ directory
# To use custom reranker: BOND_RERANKER_PATH=/path/to/your/reranker-model
# To disable reranker: BOND_ENABLE_RERANKER=0
BOND_RERANKER_PATH=./reranker-model
BOND_ENABLE_RERANKER=1
# Retrieval Parameters
BOND_TOPK_EXACT=5
BOND_TOPK_BM25=20
BOND_TOPK_DENSE=50
BOND_TOPK_FINAL=20
# Optional: Disable LLM stages for retrieval-only mode
# BOND_RETRIEVAL_ONLY=1
- Hugging Face model: pankajrajdeo/bond-embed-v1-fp16. See the model card and direct usage: https://huggingface.co/pankajrajdeo/bond-embed-v1-fp16
- Ollama model: rajdeopankaj/bond-embed-v1-fp16. Pull it first, then set BOND_EMBED_MODEL=ollama:rajdeopankaj/bond-embed-v1-fp16 and (optionally) OLLAMA_API_BASE.
Note: BOND's embedding provider currently supports LiteLLM-style endpoints and Ollama natively. If you prefer running the Hugging Face model locally with SentenceTransformers, use it to precompute embeddings for your own pipelines; for BOND's FAISS build and runtime embedding, route the HF model via a LiteLLM-compatible endpoint (e.g., TEI behind a gateway) or use the Ollama variant.
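If you do want local embeddings with SentenceTransformers for your own preprocessing (outside BOND's runtime), a minimal sketch:

```python
# Minimal sketch: precompute embeddings locally with SentenceTransformers.
# BOND's own FAISS build and runtime still expect an Ollama or LiteLLM-compatible endpoint.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("pankajrajdeo/bond-embed-v1-fp16")
embeddings = model.encode(
    ["T-cell", "CD8-positive, alpha-beta T cell"],
    normalize_embeddings=True,   # unit-length vectors for cosine / inner-product search
)
print(embeddings.shape)
```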
- Installation Guide - Detailed setup instructions
- Hybrid Search Guide - Advanced search features
- Reranker Training Guide - Training custom rerankers. See notebooks/ for training code and example notebooks
- Benchmark Dataset - Context-aware benchmark for biomedical entity normalization
┌──────────────────────────────────────────────────┐
│             STAGE 1: QUERY EXPANSION             │
├──────────────────────────────────────────────────┤
│  LLM generates synonyms, abbreviations,          │
│  and context-aware expansions                    │
└──────────────────────────────────────────────────┘
                          ▼
┌──────────────────────────────────────────────────┐
│            STAGE 2: HYBRID RETRIEVAL             │
├──────────────────────────────────────────────────┤
│  • Exact Match (SQLite FTS)  → Top-K             │
│  • BM25 Search (SQLite FTS)  → Top-K             │
│  • Dense Search (FAISS)      → Top-K             │
└──────────────────────────────────────────────────┘
                          ▼
┌──────────────────────────────────────────────────┐
│         STAGE 3: RECIPROCAL RANK FUSION          │
├──────────────────────────────────────────────────┤
│  Combine rankings from all methods using RRF     │
│  with field-aware weighting                      │
└──────────────────────────────────────────────────┘
                          ▼
┌──────────────────────────────────────────────────┐
│         STAGE 4: CROSS-ENCODER RERANKER          │
├──────────────────────────────────────────────────┤
│  Re-rank top candidates using fine-tuned         │
│  biomedical reranker model (bioformer-16L)       │
│  Improves accuracy by 10-15%                     │
└──────────────────────────────────────────────────┘
                          ▼
┌──────────────────────────────────────────────────┐
│       STAGE 5: GRAPH EXPANSION (Optional)        │
├──────────────────────────────────────────────────┤
│  Expand ontology neighbors if confidence low     │
└──────────────────────────────────────────────────┘
                          ▼
┌──────────────────────────────────────────────────┐
│           STAGE 6: LLM DISAMBIGUATION            │
├──────────────────────────────────────────────────┤
│  LLM selects best match from candidates with     │
│  reasoning and confidence scoring                │
└──────────────────────────────────────────────────┘
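Stage 3 (Reciprocal Rank Fusion) scores each candidate by summing 1/(k + rank) over the retrieval methods that returned it. The sketch below shows the idea with the field-aware weighting omitted; it is not BOND's bond/fusion.py implementation, and the candidate IDs are illustrative:

```python
# Minimal RRF sketch: score(d) = sum over methods of 1 / (k + rank_method(d)),
# with k a smoothing constant (60 is a common default). Field-aware weights omitted.
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked candidate-ID lists, best first."""
    scores = {}
    for ranked in rankings:
        for rank, cand in enumerate(ranked, start=1):
            scores[cand] = scores.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

exact = ["CL:0000084"]
bm25  = ["CL:0000084", "CL:0000625", "CL:0000789"]
dense = ["CL:0000625", "CL:0000084", "CL:0000798"]
print(reciprocal_rank_fusion([exact, bm25, dense]))  # "CL:0000084" comes out first
```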
- bond/pipeline.py: Core BondMatcher class implementing the full pipeline
- bond/retrieval/: Retrieval modules (BM25, FAISS)
- bond/fusion.py: Reciprocal rank fusion implementation
- bond/rerank.py: Cross-encoder reranker integration (loads and applies the fine-tuned reranker model)
- bond/graph_utils.py: Ontology graph traversal
- bond/server.py: FastAPI REST service
- bond/cli.py: Command-line interface
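The graph-expansion step (Stage 5, implemented in bond/graph_utils.py) can be pictured as pulling in ontology parents and children of low-confidence candidates as additional candidates. The adjacency data below is a hypothetical stand-in for the real ontology graph, not BOND's code:

```python
# Conceptual graph-expansion sketch (illustrative only; hypothetical adjacency data).
PARENTS = {"CL:0000625": ["CL:0000789"], "CL:0000789": ["CL:0000084"]}
CHILDREN = {"CL:0000084": ["CL:0000789"], "CL:0000789": ["CL:0000625"]}

def expand_neighbors(candidate_ids, max_extra=10):
    """Return parent/child terms of the current candidates as extra candidates."""
    expanded = set()
    for cid in candidate_ids:
        expanded.update(PARENTS.get(cid, []))
        expanded.update(CHILDREN.get(cid, []))
    return sorted(expanded - set(candidate_ids))[:max_extra]

print(expand_neighbors(["CL:0000789"]))  # -> ['CL:0000084', 'CL:0000625']
```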
BOND includes a comprehensive, context-aware benchmark dataset derived from CELLxGENE metadata:
- 25,416 normalization pairs across 7 field types (cell_type, tissue, disease, assay, sex, development_stage, self_reported_ethnicity)
- 85 high-quality datasets from CELLxGENE Census (72 single-cell, 13 spatial)
- 186 unique tissues across Homo sapiens and Mus musculus
- Stratified 80/10/10 splits: 20,332 train / 2,542 validation / 2,542 test
Unlike traditional string-matching benchmarks, BOND requires reasoning over biological contextβincluding tissue, disease, organism, and development stage informationβto accurately map author-provided annotations to standardized ontology labels.
Dataset: AronowLab/bond-czi-benchmark
from datasets import load_dataset
dataset = load_dataset("AronowLab/bond-czi-benchmark")
# Access splits
train = dataset["train"]
val = dataset["validation"]
test = dataset["test"]
# Example
sample = train[0]
print(f"Input: {sample['author_term']}")
print(f"Context: {sample['tissue']}, {sample['disease']}")
print(f"Target: {sample['ontology_label']} ({sample['ontology_id']})")
BOND has been evaluated on the BOND-CZI benchmark dataset (25,416 pairs from 85 datasets). Performance highlights:
- Embedding Model: Custom fine-tuned encoder (bond-embed-v1-fp16) achieves 92.7% accuracy@10 on ontology evaluation
- Hybrid Retrieval: Combines exact match, BM25, and dense semantic search with Reciprocal Rank Fusion
- Cross-Encoder Reranker: Fine-tuned biomedical reranker (bioformers/bioformer-16L) improves Hit@10 accuracy from ~75-80% (retrieval only) to 85-90% (with reranker)
- LLM Disambiguation: Uses large language models for context-aware term selection
- Multi-field Support: Handles 7 biological field types (cell_type, tissue, disease, etc.) across 5 organisms
See the benchmark-metrics/README.md for evaluation results and scripts.
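For reference, accuracy@K / Hit@K here is the fraction of queries whose gold ontology ID appears among the top-K returned candidates. A minimal sketch of that computation follows (not the official scoring script in benchmark-metrics/; the example IDs are illustrative):

```python
# Minimal Hit@K sketch; see benchmark-metrics/ for the actual evaluation scripts.
def hit_at_k(ranked_predictions, gold_ids, k=10):
    """ranked_predictions: per-query lists of candidate IDs, best first."""
    hits = sum(1 for preds, gold in zip(ranked_predictions, gold_ids) if gold in preds[:k])
    return hits / len(gold_ids)

preds = [["CL:0000084", "CL:0000625"], ["UBERON:0000178"]]  # illustrative candidates
gold = ["CL:0000084", "UBERON:0002371"]
print(hit_at_k(preds, gold, k=10))  # -> 0.5
```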
Active development and research efforts:
- Harmonized Knowledge Graph: Building a harmonized transcriptomic knowledge graph from BOND-normalized metadata to enable advanced querying, cross-dataset analysis, and relationship discovery in single-cell transcriptomics data.
make test
# or
pytest
make lint
# or
ruff check .
black .
bond-build-faiss \
--sqlite_path assets/ontologies.sqlite \
--assets_path assets
bond-generate-sqlite
If you use BOND in your research, please cite:
BOND System:
@software{bond_2026,
title={BOND: Biomedical Ontology Normalization and Disambiguation},
author={Rajdeo, Pankaj and Gelal, Rupesh},
year={2026},
url={https://github.com/Aronow-Lab/BOND}
}
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- Repository: github.com/Aronow-Lab/BOND
- Benchmark Dataset: huggingface.co/datasets/AronowLab/bond-czi-benchmark
- Reranker Model: huggingface.co/AronowLab/BOND-reranker
- Authors: Pankaj Rajdeo, Rupesh Gelal