⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

RAGScore

Requires Python 3.9+. Available on PyPI.

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Async & Fast • 🤖 Any LLM • 🏠 Local or Cloud

English | 中文 | 日本語


⚡ 2-Line RAG Evaluation

# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query

That's it. Get accuracy scores and incorrect QA pairs instantly.

============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

❌ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer

🚀 Quick Start

Install

pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])

Rich Object API:

  • result.accuracy - Accuracy score
  • result.df - Pandas DataFrame of all results
  • result.plot() - 3-panel visualization
  • result.corrections - List of items to fix
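
The same attributes work well for automation, e.g. gating a CI check on the audit. A minimal sketch using only the attributes listed above; the endpoint and the 0.8 threshold are placeholders:

from ragscore import quick_test

# Fail the build when the audit drops below a chosen threshold.
result = quick_test(endpoint="http://localhost:8000/query", docs="docs/", n=20)
if result.accuracy < 0.8:
    for item in result.corrections:   # items flagged as needing fixes
        print(item)
    raise SystemExit(1)               # non-zero exit fails the CI job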

Option 2: CLI (Production)

Generate QA Pairs

# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

Evaluate Your RAG

# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json
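
Your RAG service just needs an HTTP endpoint that takes a question and returns an answer. The exact request/response contract isn't documented here, so the sketch below is only illustrative: it assumes RAGScore POSTs JSON with a question field and reads back an answer field, and my_rag_pipeline is a hypothetical stand-in for your retrieval + generation stack:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

def my_rag_pipeline(question: str) -> str:
    # Stand-in for your retrieval + generation logic.
    return "..."

@app.post("/query")
def query(q: Query) -> dict:
    # Assumed contract: {"question": ...} in, {"answer": ...} out.
    return {"answer": my_rag_pipeline(q.question)}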

🏠 100% Private with Local LLMs

# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬


🔌 Supported LLMs

Provider                Setup                             Notes
----------------------  --------------------------------  --------------------
Ollama                  ollama serve                      Local, free, private
OpenAI                  export OPENAI_API_KEY="sk-..."    Best quality
Anthropic               export ANTHROPIC_API_KEY="..."    Long context
DashScope               export DASHSCOPE_API_KEY="..."    Qwen models
vLLM                    export LLM_BASE_URL="..."         Production-grade
Any OpenAI-compatible   export LLM_BASE_URL="..."         Groq, Together, etc.
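
Any server that speaks the OpenAI API can be plugged in the same way. A minimal sketch, assuming LLM_BASE_URL is read at process start; the Groq URL is illustrative, and whether the key is read from OPENAI_API_KEY for such providers is an assumption:

import os

# Point RAGScore at an OpenAI-compatible server before running the
# pipeline. URL and key below are illustrative, not real values.
os.environ["LLM_BASE_URL"] = "https://api.groq.com/openai/v1"
os.environ["OPENAI_API_KEY"] = "..."  # key variable name is an assumption

from ragscore import run_pipeline
run_pipeline(paths=["docs/"])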

📊 Output Formats

Generated QA Pairs (output/generated_qas.jsonl)

{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
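
Because the file is plain JSONL, the pairs are easy to inspect or filter before evaluation. A short sketch using only the fields shown above (it assumes difficulty takes values like "hard"):

import json

# Load generated pairs and pull out the harder questions for review.
with open("output/generated_qas.jsonl") as f:
    pairs = [json.loads(line) for line in f if line.strip()]

hard = [p for p in pairs if p["difficulty"] == "hard"]
print(f"{len(hard)} hard questions out of {len(pairs)} total")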

Evaluation Results (--output results.json)

{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
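
The same file is easy to consume programmatically, e.g. to surface failures in a report. A sketch using only the fields shown above:

import json

# Summarize an evaluation run from the JSON written by --output.
with open("results.json") as f:
    results = json.load(f)

summary = results["summary"]
print(f"Accuracy: {summary['accuracy']:.0%} "
      f"({summary['correct']}/{summary['total']})")
for pair in results["incorrect_pairs"]:
    print(f"- {pair['question']}: {pair['reason']}")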

🧪 Python API

from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
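
An agent can drive the CLI with a subprocess call and branch on the documented exit codes. A minimal sketch; only the commands and exit codes shown above are assumed, and the endpoint is a placeholder:

import json
import subprocess

# Run an evaluation and branch on the exit code (0 = success, 1 = error).
proc = subprocess.run(
    ["ragscore", "evaluate", "http://api/query", "--output", "results.json"]
)
if proc.returncode == 0:
    with open("results.json") as f:
        print(json.load(f)["summary"])
else:
    print("evaluation failed")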

CLI Reference:

Command                       Description
----------------------------  --------------------------------
ragscore generate <paths>     Generate QA pairs from documents
ragscore evaluate <endpoint>  Evaluate RAG against golden QAs
ragscore --help               Show all commands and options
ragscore generate --help      Show generate options
ragscore evaluate --help      Show evaluate options

⚙️ Configuration

Zero config required. Optional environment variables:

export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory

🔐 Privacy & Security

Data                 Cloud LLM         Local LLM
-------------------  ----------------  ---------
Documents            ✅ Local          ✅ Local
Text chunks          ⚠️ Sent to LLM    ✅ Local
Generated QAs        ✅ Local          ✅ Local
Evaluation results   ✅ Local          ✅ Local

Compliance: with a local LLM, no data ever leaves your machine, which supports GDPR, HIPAA, and SOC 2 requirements.


🧪 Development

git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest

⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community
