
Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented Jan 27, 2026

Summary

Target issue is #545
Explain the motivation for making this change. What existing problem does the pull request solve?

The frontend needs a new CSV export format that groups repeated questions horizontally, but the current API only supports row-based export where each iteration appears as a separate row.

Solution: This PR extends the existing evaluation export API to support a grouped format with the following structure:

{
  "traces": [
    {
      "question_id": 1,
      "question": "What is Python?",
      "ground_truth_answer": "Python is a programming language.",

      "llm_answers": [
        "LLM answer 1",
        "LLM answer 2"
      ],

      "trace_ids": [
        "uuid-1",
        "uuid-2"
      ],

      "scores": [
        [
          {
            "name": "cosine_similarity",
            "value": 0.78,
            "data_type": "NUMERIC"
          }
        ],
        [
          {
            "name": "cosine_similarity",
            "value": 0.72,
            "data_type": "NUMERIC"
          }
        ]
      ]
    },

    {
      "question_id": 2,
      "question": "What is a variable?",
      "ground_truth_answer": "A variable stores a value.",

      "llm_answers": [
        "LLM answer 1",
        "LLM answer 2"
      ],

      "trace_ids": [
        "uuid-3",
        "uuid-4"
      ],

      "scores": [
        [
          {
            "name": "cosine_similarity",
            "value": 0.71,
            "data_type": "NUMERIC"
          }
        ],
        [
          {
            "name": "cosine_similarity",
            "value": 0.69,
            "data_type": "NUMERIC"
          }
        ]
      ]
    }
  ]
}

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

GET /api/v1/evaluations/{id}?get_trace_info=true&export_format=grouped
Validation: export_format=grouped requires get_trace_info=true; an error is returned if the traces don't have a question_id.
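
For illustration, the route-level check could look roughly like this (a sketch only; get_evaluation_run_status and the Query signature follow the CodeRabbit walkthrough later in this thread, while the path, other parameters, and the response handling are omitted):

from fastapi import APIRouter, HTTPException, Query

router = APIRouter()

# Path and parameter names are illustrative; the real route carries more
# parameters and wraps the evaluation run in an APIResponse.
@router.get("/api/v1/evaluations/{evaluation_id}")
def get_evaluation_run_status(
    evaluation_id: int,
    get_trace_info: bool = Query(False),
    # enum-style constraint as described in the walkthrough; a
    # Literal["row", "grouped"] annotation would be an equivalent alternative.
    export_format: str = Query("row", enum=["row", "grouped"]),
):
    # Grouped export only makes sense when trace details are included.
    if export_format == "grouped" and not get_trace_info:
        raise HTTPException(
            status_code=400,
            detail="export_format=grouped requires get_trace_info=true",
        )
    ...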

PR #553 needs to be merged before this one.

Summary by CodeRabbit

  • New Features

    • Added an export_format option ("row" default, "grouped") to evaluation endpoints to return traces either as individual rows or grouped by question (grouped includes question, ground truth, LLM answers, trace IDs, and aggregated scores; grouped requires trace info).
  • Bug Fixes / Validation

    • Requests for grouped format without trace info now return a 400 error; grouping errors surface a clear failure response.
  • Tests

    • Added tests for grouped-format success and failure scenarios.
  • Documentation

    • API docs updated with row and grouped response formats and examples.


@vprashrex vprashrex self-assigned this Jan 27, 2026
@vprashrex vprashrex added enhancement New feature or request ready-for-review labels Jan 27, 2026
@vprashrex vprashrex moved this to In review in Kaapi-dev Jan 27, 2026
@vprashrex vprashrex linked an issue Jan 27, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

Adds an export_format query parameter to the evaluation run endpoint supporting grouped-by-question_id trace exports, implements group_traces_by_question_id, updates docs to describe row and grouped response shapes, and adds tests for grouped export behavior. Duplicate function and test definitions were introduced.

Changes

Cohort / File(s) — Summary

  • API Documentation — backend/app/api/docs/evaluation/get_evaluation.md
    Adds the optional export_format (row | grouped) parameter and documents the row and grouped response shapes with examples.
  • API Route Logic — backend/app/api/routes/evaluations/evaluation.py
    Adds export_format: str = Query("row", enum=["row","grouped"]) to get_evaluation_run_status; enforces that grouped requires get_trace_info=true (400 if missing); applies group_traces_by_question_id to score["traces"] when requested and surfaces grouping errors as failure responses.
  • CRUD Grouping Function — backend/app/crud/evaluations/core.py
    Adds group_traces_by_question_id(traces), which groups traces by question_id into objects with question_id, question, ground_truth_answer, llm_answers, trace_ids, and scores, sorted by question_id (a rough sketch follows this list). Note: the function is duplicated (redefinition) and should be deduplicated.
  • Tests — backend/app/tests/api/routes/test_evaluation.py
    Adds tests verifying that requesting export_format=grouped without get_trace_info fails (400) and that grouped export succeeds with get_trace_info=true, returning a grouped structure containing llm_answers and trace_ids. Note: the tests appear duplicated and should be consolidated.
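
A rough sketch of the grouping described in the CRUD row above (illustrative only; it folds in the all-traces validation suggested later in this review, and field handling and logging in the merged code may differ):

from collections import defaultdict
from typing import Any


def group_traces_by_question_id(
    traces: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Group row-based traces into one object per question_id."""
    # Grouped export is only meaningful when every trace carries a question_id.
    if any(t.get("question_id") in (None, "") for t in traces):
        raise ValueError(
            "Grouped export format is not available for this evaluation."
        )

    groups: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for trace in traces:
        groups[trace["question_id"]].append(trace)

    result = []
    for question_id in sorted(groups.keys()):
        items = groups[question_id]
        result.append(
            {
                "question_id": question_id,
                "question": items[0].get("question"),
                "ground_truth_answer": items[0].get("ground_truth_answer"),
                "llm_answers": [t.get("llm_answer") for t in items],
                "trace_ids": [t.get("trace_id") for t in items],
                "scores": [t.get("scores", []) for t in items],
            }
        )
    return result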

Sequence Diagram

sequenceDiagram
    participant Client
    participant API_Route as API Route
    participant Validator
    participant CRUD
    participant DB as EvalRun/Score
    participant Response

    Client->>API_Route: GET /api/v1/evaluations/{id}?export_format=grouped&get_trace_info=true
    API_Route->>Validator: Validate export_format & get_trace_info
    alt invalid (grouped without trace info)
        Validator-->>API_Route: validation error
        API_Route-->>Client: 400 Bad Request
    else valid
        API_Route->>DB: Load evaluation run & score (including traces)
        DB-->>API_Route: score with traces
        API_Route->>CRUD: group_traces_by_question_id(traces)
        alt grouping succeeds
            CRUD-->>API_Route: grouped traces list
            API_Route->>Response: build 200 payload (grouped format)
            Response-->>Client: 200 OK with grouped traces
        else grouping fails
            CRUD-->>API_Route: ValueError / error message
            API_Route-->>Client: 400 Bad Request with failure message
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • Prajna1999

Poem

🐰 I hop through traces, bundle each quest,
Questions in clusters, answers at rest,
Row or grouped, I tally each score,
Trace-ids and answers—grouped evermore,
Carrot in paw, I deliver the chore.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 71.43%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately summarizes the main change: adding an export_format query parameter to support grouped trace export in the evaluation API.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 341-344: The check currently only inspects traces[0] for a
question_id, which allows later traces with a missing or empty question_id to
create a None key and break sorting; update the validation to iterate over all
traces (e.g., use any(t.get("question_id") in (None, "") for t in traces) or a
for-loop) and raise the same ValueError if any trace has a missing/empty
question_id. Also fix the typo in the inline comment ("weather" → "whether");
the raised message stays "Grouped export format is not available for this
evaluation." Reference the symbols traces, question_id, and the
grouping/sorting logic (groups and sorted(groups.keys())) when making the
change.

In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 1037-1044: The inline comment next to the client.get call is
stale; in the params dict the key "get_trace_info" is already set to True, so
remove the incorrect comment ("# Missing get_trace_info=true") adjacent to the
response = client.get(...) call (the params containing "export_format" and
"get_trace_info") to avoid confusion.
🧹 Nitpick comments (7)
backend/app/crud/evaluations/core.py (3)

319-338: Incomplete docstring with placeholder text.

The docstring contains placeholder descriptions (Description) that should be properly filled in. Per coding guidelines, Python functions should have complete documentation.

📝 Suggested docstring improvement
 def group_traces_by_question_id(
     traces: list[dict[str, Any]],
 ) -> list[dict[str, Any]]:
     """
-    Docstring for group_traces_by_question_id
-    
-    :param traces: Description
-    :type traces: list[dict[str, Any]]
-    :return: Description
-    :rtype: list[dict[str, Any]]
-
+    Group traces by question_id for grouped export format.
+    
+    Args:
+        traces: List of trace dictionaries containing question_id, question,
+            ground_truth_answer, llm_answer, trace_id, and scores.
+    
     Returns:
         List of grouped traces sorted by question_id:
         [
             {
                 "question_id": 1,
                 "question": "What is Python?",
                 "ground_truth_answer": "...",
                 "llm_answers": ["Answer 1", "Answer 2"],
                 "trace_ids": ["trace-1", "trace-2"],
                 "scores": [[...], [...]]
             }
         ]
+    
+    Raises:
+        ValueError: If traces lack a valid question_id.
     """

346-352: Type hint mismatch: question_id may not always be int.

The type hint dict[int, list[dict[str, Any]]] assumes question_id is always an integer, but the code uses .get("question_id") which could return any type. If question_id values are strings or mixed types, this could lead to unexpected sorting behavior.

♻️ Suggested type hint adjustment
-    groups: dict[int, list[dict[str, Any]]] = {}
+    groups: dict[Any, list[dict[str, Any]]] = {}

367-368: Add trailing newline at end of file.

Per static analysis hint (Ruff W292), add a newline at the end of the file.

     logger.info(f"[group_traces_by_question_id] Created {len(result)} groups")
     return result
+
backend/app/api/routes/evaluations/evaluation.py (2)

141-145: Minor style inconsistency: extra space around = in status_code.

Line 143 uses status_code = 400 with spaces, while line 138 uses status_code=400 without spaces. Consider keeping consistent style.

♻️ Consistent formatting
     if export_format == "grouped" and not get_trace_info:
         raise HTTPException(
-            status_code = 400,
+            status_code=400,
             detail="export_format=grouped requires get_trace_info=true"
         )

164-171: Consider moving the import to the top of the file and avoiding in-place mutation.

  1. The import at line 166 should ideally be at the top of the file with other imports for better readability and consistency.

  2. The code mutates eval_run.score["traces"] in-place (line 169). If eval_run is a SQLModel instance that might be used elsewhere or persisted, this could have unintended side effects. Consider creating a copy for the response instead.

♻️ Proposed refactor

Move the import to the top of the file:

from app.crud.evaluations.core import group_traces_by_question_id

Then modify the grouping logic to avoid in-place mutation:

     # Formatter = grouped
     if export_format == "grouped" and eval_run.score and "traces" in eval_run.score:
-        from app.crud.evaluations.core import group_traces_by_question_id
         try:
             grouped_traces = group_traces_by_question_id(eval_run.score["traces"])
-            eval_run.score["traces"] = grouped_traces
+            # Create a copy to avoid mutating the model instance
+            eval_run.score = {**eval_run.score, "grouped_traces": grouped_traces}
         except ValueError as e:
             return APIResponse.failure_response(error=str(e), data=eval_run)

Note: This also aligns with the documentation which shows grouped_traces as a separate field.

backend/app/tests/api/routes/test_evaluation.py (2)

959-1061: Consider adding test for ValueError case (traces without question_id).

The success test is comprehensive, but there's no test coverage for the case where traces lack question_id and the grouping function raises a ValueError. This is an important error path handled in the route (lines 170-171 of evaluation.py).

📝 Suggested additional test
def test_get_evaluation_run_grouped_format_missing_question_id_fails(
    self,
    client: TestClient,
    user_api_key_header: dict[str, str],
    db: Session,
    user_api_key: TestAuthContext,
    create_test_dataset: EvaluationDataset,
) -> None:
    """Test grouped export fails when traces lack question_id."""
    eval_run = EvaluationRun(
        run_name="test_run_no_qid",
        dataset_name=create_test_dataset.name,
        dataset_id=create_test_dataset.id,
        config={"model": "gpt-4o"},
        status="completed",
        total_items=1,
        score={
            "traces": [
                {
                    "trace_id": "trace-1",
                    "question": "What is Python?",
                    "llm_answer": "A language",
                    "scores": [],
                    # No question_id
                }
            ],
            "summary_scores": [],
        },
        organization_id=user_api_key.organization_id,
        project_id=user_api_key.project_id,
    )
    db.add(eval_run)
    db.commit()
    db.refresh(eval_run)

    response = client.get(
        f"/api/v1/evaluations/{eval_run.id}",
        params={"export_format": "grouped", "get_trace_info": True},
        headers=user_api_key_header,
    )

    assert response.status_code == 200
    response_data = response.json()
    assert response_data["success"] is False
    assert "not available" in response_data["error"].lower()

1061-1062: Missing blank line before next class definition.

PEP 8 recommends two blank lines before class definitions at the module level.

         )
+

 class TestGetDataset:

Comment on lines 341 to 344
    # weather question_id exists in the traces
    if traces and (traces[0].get("question_id") is None or traces[0].get("question_id") == ""):
        raise ValueError(
            "Grouped export format is not available for this evaluation.")

⚠️ Potential issue | 🟠 Major

Validation only checks the first trace, missing question_id in other traces will cause issues.

The current validation only checks traces[0] for question_id. If subsequent traces have missing or empty question_id, they will be grouped under None as a key, and sorted(groups.keys()) on line 355 will raise a TypeError when comparing None with integers.

Also, typo: "weather" should be "whether".

🐛 Proposed fix for robust validation
-    # weather question_id exists in the traces
-    if traces and (traces[0].get("question_id") is None or traces[0].get("question_id") ==  ""):
+    # Validate that all traces have a valid question_id
+    if not traces:
+        return []
+    
+    for trace in traces:
+        question_id = trace.get("question_id")
+        if question_id is None or question_id == "":
+            raise ValueError(
+                "Grouped export format is not available for this evaluation.")
-        raise ValueError(
-            "Grouped export format is not available for this evaluation.")

Comment on lines +1037 to +1044
        response = client.get(
            f"/api/v1/evaluations/{eval_run.id}",
            params={
                "export_format": "grouped",
                "get_trace_info": True,
            },  # Missing get_trace_info=true
            headers=user_api_key_header,
        )

⚠️ Potential issue | 🟡 Minor

Stale comment: get_trace_info is actually True here.

The comment on line 1042 says # Missing get_trace_info=true but get_trace_info: True is present in the params. This appears to be copy-paste from the previous test.

🧹 Remove stale comment
         response = client.get(
             f"/api/v1/evaluations/{eval_run.id}",
             params={
                 "export_format": "grouped",
                 "get_trace_info": True,
-            },  # Missing get_trace_info=true
+            },
             headers=user_api_key_header,
         )


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 358-366: The docstring for group_traces_by_question_id has
trailing whitespace and Black formatting violations; run the project's
formatter/pre-commit (e.g., black/isort) and remove any trailing spaces in that
docstring block, reflow or re-indent lines to match Black's style (keep the
triple-quoted docstring, parameter lines and Returns section neatly wrapped),
then re-run tests/CI to ensure the formatting errors are resolved.
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

939-1045: Prefer a factory/fixture for EvaluationRun setup in these tests.

To align with the test fixture pattern and reduce duplication, consider extracting the EvaluationRun creation into a factory/fixture helper (e.g., create_test_evaluation_run).

As per coding guidelines: “backend/app/tests/**/*.py: Use factory pattern for test fixtures in backend/app/tests/”.
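
A possible shape for such a factory fixture (illustrative only; the fixture arguments mirror the test signatures quoted above, while the fixture name, import path, and defaults are assumptions):

import pytest
from sqlmodel import Session

from app.models import EvaluationRun  # import path assumed


@pytest.fixture
def create_test_evaluation_run(db: Session, user_api_key, create_test_dataset):
    """Return a factory that persists an EvaluationRun with a given score payload."""

    def _create(score: dict, status: str = "completed") -> EvaluationRun:
        eval_run = EvaluationRun(
            run_name="test_run",
            dataset_name=create_test_dataset.name,
            dataset_id=create_test_dataset.id,
            config={"model": "gpt-4o"},
            status=status,
            total_items=len(score.get("traces", [])),
            score=score,
            organization_id=user_api_key.organization_id,
            project_id=user_api_key.project_id,
        )
        db.add(eval_run)
        db.commit()
        db.refresh(eval_run)
        return eval_run

    return _create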

@codecov

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 94.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/api/routes/evaluations/evaluation.py — patch coverage 77.77%, 2 lines missing ⚠️
  • backend/app/crud/evaluations/core.py — patch coverage 94.11%, 1 line missing ⚠️


@vprashrex vprashrex requested a review from nishika26 January 29, 2026 05:59
@vprashrex vprashrex force-pushed the feat/evaulation-grouped-export-route branch from a164ee5 to e6de3d0 on January 29, 2026 09:20
@vprashrex vprashrex merged commit e0838da into main Jan 29, 2026
3 checks passed
@vprashrex vprashrex deleted the feat/evaulation-grouped-export-route branch January 29, 2026 09:51
@github-project-automation github-project-automation bot moved this from In review to Closed in Kaapi-dev Jan 29, 2026

Labels

enhancement New feature or request ready-for-review

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

Evaluation: Extend export functionality

4 participants