
Conversation

@Prajna1999
Collaborator

@Prajna1999 Prajna1999 commented Jan 29, 2026

Earlier, question_id was not included in the response of the GET evaluations/{evaluation_id} endpoint.

In app/crud/evaluations/score.py:

class TraceData(TypedDict):
    """Data for a single trace including Q&A and scores."""

    trace_id: str
    question: str
    llm_answer: str
    question_id: int | None
    ground_truth_answer: str
    scores: list[TraceScore]

This TypedDict did not have the question_id field earlier, so the fetch_trace_scores_from_langfuse function, which builds its result from this type, omitted the key.

Summary by CodeRabbit

  • New Features

    • Trace data now includes optional question identifiers to enable finer-grained tracking, correlation, and filtering of evaluation results across runs and datasets.
    • Enhanced score retrieval: supports numeric and categorical scores, aggregates summary scores, and handles mixed/empty datasets and per-trace failures robustly.
  • Tests

    • Added comprehensive tests covering score fetching, aggregation, filtering, and error scenarios.


@Prajna1999 Prajna1999 requested a review from vprashrex January 29, 2026 13:15
@coderabbitai

coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

Walkthrough

Added an optional question_id: int | None field to the TraceData TypedDict and introduced/exposed a Langfuse trace-score fetcher function with comprehensive tests covering score types, missing data, and error cases.

Changes

Type Definition (backend/app/crud/evaluations/score.py):
  Added `question_id: int | None` field to the TraceData TypedDict.
Langfuse integration & tests (backend/app/crud/evaluations/langfuse.py, backend/app/tests/crud/evaluations/test_langfuse.py):
  Exposed/added fetch_trace_scores_from_langfuse public API and added extensive tests for fetching trace scores (presence/absence of question_id, numeric/categorical scores, filtering incomplete scores, error handling, summary aggregation, per-trace failures).

Sequence Diagram(s)

(omitted — changes are small data-shape addition plus test coverage; no multi-component control-flow requiring visualization)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

bug

Suggested reviewers

  • AkhileshNegi
  • vprashrex

Poem

🐰 A tiny id hops into the trace,
tucked in data, finding its place.
Tests dusted off, the scores align,
Langfuse whispers, "All looks fine."
Hop, hop—merge time, carrot-bright delight! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title directly relates to the main objective of the PR: ensuring question_id is included in trace data that was previously missing from responses.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.


@vprashrex
Collaborator

Yes, this was there. One more thing:

In /backend/app/crud/evaluations/langfuse.py, in the create_langfuse_dataset_run function, the check should not be

if question_id:

It should be

if question_id is not None:

@codecov

codecov bot commented Jan 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/score.py (1)

20-27: Fix question_id type contract: int annotation conflicts with "" (str) runtime values.

TraceData.question_id is typed as int but initialized and retrieved as empty string "" throughout the codebase:

  • Initialized as empty string in langfuse.py line 426
  • Retrieved with empty string default in langfuse.py line 449
  • Checked explicitly for empty string in core.py line 378

Additionally, the grouping logic in core.py line 391 (sorted(groups.keys())) will fail at runtime with a TypeError when trace data contains mixed int and str values for question_id, as the dict key type is annotated as int but receives str values from fetched traces.

Align the type to match runtime behavior: change question_id: int to question_id: int | None or question_id: str | int, or enforce conversion to int in the fetcher and initialize to None instead of "".

Example fix (int | None only)
 class TraceData(TypedDict):
     """Data for a single trace including Q&A and scores."""
 
     trace_id: str
     question: str
     llm_answer: str
-    question_id: int
+    question_id: int | None
     ground_truth_answer: str
     scores: list[TraceScore]
🤖 Fix all issues with AI agents
In `@backend/app/tests/crud/evaluations/test_langfuse.py`:
- Around line 712-742: Update the tests to match the declared
TraceData.question_id contract (int | None) and the normalization performed in
fetch_trace_scores_from_langfuse: when a trace has no question_id, assert
trace["question_id"] is None (not ""), and for non-int question_id values assert
they are converted to int (or None on failure) consistent with TraceData; adjust
expectations in test_fetch_trace_scores_without_question_id and
test_fetch_trace_scores_mixed_question_id_types to reference this behavior and
the fetch_trace_scores_from_langfuse normalization logic.

Comment on lines +712 to +742
def test_fetch_trace_scores_without_question_id(self) -> None:
    """Test fetching traces without question_id (backwards compatibility)."""
    mock_langfuse = MagicMock()

    # Mock dataset run
    mock_run_item = MagicMock()
    mock_run_item.trace_id = "trace_1"
    mock_dataset_run = MagicMock()
    mock_dataset_run.dataset_run_items = [mock_run_item]
    mock_langfuse.api.datasets.get_run.return_value = mock_dataset_run

    # Mock trace without question_id in metadata
    mock_trace = MagicMock()
    mock_trace.input = {"question": "What is 2+2?"}
    mock_trace.output = {"answer": "4"}
    mock_trace.metadata = {"ground_truth": "4"}  # No question_id
    mock_trace.scores = []

    mock_langfuse.api.trace.get.return_value = mock_trace

    result = fetch_trace_scores_from_langfuse(
        langfuse=mock_langfuse,
        dataset_name="test_dataset",
        run_name="test_run",
    )

    # Verify trace has empty string for question_id
    assert len(result["traces"]) == 1
    trace = result["traces"][0]
    assert trace["question_id"] == ""
    assert trace["trace_id"] == "trace_1"

⚠️ Potential issue | 🟠 Major

Tests allow non-int question_id, conflicting with the declared contract.

test_fetch_trace_scores_without_question_id expects an empty string, and test_fetch_trace_scores_mixed_question_id_types expects a string. This conflicts with TraceData.question_id: int and the PR objective. Align tests with the final contract (e.g., int | None with normalization) or widen the type if strings are supported.

Also applies to: 979-1018


@Prajna1999 Prajna1999 merged commit 5e91aea into main Jan 29, 2026
3 checks passed
@Prajna1999 Prajna1999 deleted the fix/lf-question-id-not-found branch January 29, 2026 13:37