
Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented Jan 27, 2026

Summary

Target issue is #545
Explain the motivation for making this change. What existing problem does the pull request solve?

The frontend needs a new CSV export format that groups repeated questions horizontally, but the current API only supports row-based export where each iteration appears as a separate row.

Solution: This PR extends the existing evaluation export API to support a grouped format with the following structure:

{
  "traces": [
    {
      "question_id": 1,
      "question": "What is Python?",
      "ground_truth_answer": "Python is a programming language.",

      "llm_answers": [
        "LLM answer 1",
        "LLM answer 2"
      ],

      "trace_ids": [
        "uuid-1",
        "uuid-2"
      ],

      "scores": [
        [
          {
            "name": "cosine_similarity",
            "value": 0.78,
            "data_type": "NUMERIC"
          }
        ],
        [
          {
            "name": "cosine_similarity",
            "value": 0.72,
            "data_type": "NUMERIC"
          }
        ]
      ]
    },

    {
      "question_id": 2,
      "question": "What is a variable?",
      "ground_truth_answer": "A variable stores a value.",

      "llm_answers": [
        "LLM answer 1",
        "LLM answer 2"
      ],

      "trace_ids": [
        "uuid-3",
        "uuid-4"
      ],

      "scores": [
        [
          {
            "name": "cosine_similarity",
            "value": 0.71,
            "data_type": "NUMERIC"
          }
        ],
        [
          {
            "name": "cosine_similarity",
            "value": 0.69,
            "data_type": "NUMERIC"
          }
        ]
      ]
    }
  ]
}

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

GET /api/v1/evaluations/{id}?get_trace_info=true&export_format=grouped
Validation: export_format=grouped requires get_trace_info=true; an error is returned if the traces don't have a question_id.
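
For illustration, the route-level check could look roughly like this (a sketch only; get_evaluation_run_status and the Query signature follow the CodeRabbit walkthrough later in this thread, while the path, other parameters, and the response handling are omitted):

from fastapi import APIRouter, HTTPException, Query

router = APIRouter()

# Path and parameter names are illustrative; the real route carries more
# parameters and wraps the evaluation run in an APIResponse.
@router.get("/api/v1/evaluations/{evaluation_id}")
def get_evaluation_run_status(
    evaluation_id: int,
    get_trace_info: bool = Query(False),
    # enum-style constraint as described in the walkthrough; a
    # Literal["row", "grouped"] annotation would be an equivalent alternative.
    export_format: str = Query("row", enum=["row", "grouped"]),
):
    # Grouped export only makes sense when trace details are included.
    if export_format == "grouped" and not get_trace_info:
        raise HTTPException(
            status_code=400,
            detail="export_format=grouped requires get_trace_info=true",
        )
    ...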

PR #553 needs to be merged before this one.

Summary by CodeRabbit

  • New Features

    • Added an export_format option ("row" default, "grouped") to evaluation endpoints to return traces either as individual rows or grouped by question (grouped includes question, ground truth, LLM answers, trace IDs, and aggregated scores; grouped requires trace info).
  • Bug Fixes / Validation

    • Requests for grouped format without trace info now return a 400 error; grouping errors surface a clear failure response.
  • Tests

    • Added tests for grouped-format success and failure scenarios.
  • Documentation

    • API docs updated with row and grouped response formats and examples.


@vprashrex vprashrex self-assigned this Jan 27, 2026
@vprashrex vprashrex added enhancement New feature or request ready-for-review labels Jan 27, 2026
@vprashrex vprashrex moved this to In review in Kaapi-dev Jan 27, 2026
@vprashrex vprashrex linked an issue Jan 27, 2026 that may be closed by this pull request
@coderabbitai

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

Adds an export_format query parameter to the evaluation run endpoint supporting grouped-by-question_id trace exports, implements group_traces_by_question_id, updates docs to describe row and grouped response shapes, and adds tests for grouped export behavior. Duplicate function and test definitions were introduced.

Changes

Cohort / File(s) — Summary

  • API Documentation — backend/app/api/docs/evaluation/get_evaluation.md
    Adds the optional export_format (row | grouped) parameter and documents the row and grouped response shapes with examples.
  • API Route Logic — backend/app/api/routes/evaluations/evaluation.py
    Adds export_format: str = Query("row", enum=["row","grouped"]) to get_evaluation_run_status; enforces that grouped requires get_trace_info=true (400 if missing); applies group_traces_by_question_id to score["traces"] when requested and surfaces grouping errors as failure responses.
  • CRUD Grouping Function — backend/app/crud/evaluations/core.py
    Adds group_traces_by_question_id(traces), which groups traces by question_id into objects with question_id, question, ground_truth_answer, llm_answers, trace_ids, and scores, sorted by question_id (a rough sketch follows this list). Note: the function is duplicated (redefinition) and should be deduplicated.
  • Tests — backend/app/tests/api/routes/test_evaluation.py
    Adds tests verifying that requesting export_format=grouped without get_trace_info fails (400) and that grouped export succeeds with get_trace_info=true, returning a grouped structure containing llm_answers and trace_ids. Note: the tests appear duplicated and should be consolidated.
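
A rough sketch of the grouping described in the CRUD row above (illustrative only; it folds in the all-traces validation suggested later in this review, and field handling and logging in the merged code may differ):

from collections import defaultdict
from typing import Any


def group_traces_by_question_id(
    traces: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Group row-based traces into one object per question_id."""
    # Grouped export is only meaningful when every trace carries a question_id.
    if any(t.get("question_id") in (None, "") for t in traces):
        raise ValueError(
            "Grouped export format is not available for this evaluation."
        )

    groups: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for trace in traces:
        groups[trace["question_id"]].append(trace)

    result = []
    for question_id in sorted(groups.keys()):
        items = groups[question_id]
        result.append(
            {
                "question_id": question_id,
                "question": items[0].get("question"),
                "ground_truth_answer": items[0].get("ground_truth_answer"),
                "llm_answers": [t.get("llm_answer") for t in items],
                "trace_ids": [t.get("trace_id") for t in items],
                "scores": [t.get("scores", []) for t in items],
            }
        )
    return result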

Sequence Diagram

sequenceDiagram
    participant Client
    participant API_Route as API Route
    participant Validator
    participant CRUD
    participant DB as EvalRun/Score
    participant Response

    Client->>API_Route: GET /api/v1/evaluations/{id}?export_format=grouped&get_trace_info=true
    API_Route->>Validator: Validate export_format & get_trace_info
    alt invalid (grouped without trace info)
        Validator-->>API_Route: validation error
        API_Route-->>Client: 400 Bad Request
    else valid
        API_Route->>DB: Load evaluation run & score (including traces)
        DB-->>API_Route: score with traces
        API_Route->>CRUD: group_traces_by_question_id(traces)
        alt grouping succeeds
            CRUD-->>API_Route: grouped traces list
            API_Route->>Response: build 200 payload (grouped format)
            Response-->>Client: 200 OK with grouped traces
        else grouping fails
            CRUD-->>API_Route: ValueError / error message
            API_Route-->>Client: 400 Bad Request with failure message
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • Prajna1999

Poem

🐰 I hop through traces, bundle each quest,
Questions in clusters, answers at rest,
Row or grouped, I tally each score,
Trace-ids and answers—grouped evermore,
Carrot in paw, I deliver the chore.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 71.43%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title accurately summarizes the main change: adding an export_format query parameter to support grouped trace export in the evaluation API.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 341-344: The check currently only inspects traces[0] for a
question_id, which allows later traces with a missing or empty question_id to
create a None key and break sorting; update the validation to iterate over all
traces (e.g., use any(t.get("question_id") in (None, "") for t in traces) or a
for-loop) and raise the same ValueError if any trace has a missing/empty
question_id. Also fix the typo in the inline comment ("weather" → "whether");
the raised message stays "Grouped export format is not available for this
evaluation." Reference the symbols traces, question_id, and the
grouping/sorting logic (groups and sorted(groups.keys())) when making the
change.

In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 1037-1044: The inline comment next to the client.get call is
stale; in the params dict the key "get_trace_info" is already set to True, so
remove the incorrect comment ("# Missing get_trace_info=true") adjacent to the
response = client.get(...) call (the params containing "export_format" and
"get_trace_info") to avoid confusion.
🧹 Nitpick comments (7)
backend/app/crud/evaluations/core.py (3)

319-338: Incomplete docstring with placeholder text.

The docstring contains placeholder descriptions (Description) that should be properly filled in. Per coding guidelines, Python functions should have complete documentation.

📝 Suggested docstring improvement
 def group_traces_by_question_id(
     traces: list[dict[str, Any]],
 ) -> list[dict[str, Any]]:
     """
-    Docstring for group_traces_by_question_id
-    
-    :param traces: Description
-    :type traces: list[dict[str, Any]]
-    :return: Description
-    :rtype: list[dict[str, Any]]
-
+    Group traces by question_id for grouped export format.
+    
+    Args:
+        traces: List of trace dictionaries containing question_id, question,
+            ground_truth_answer, llm_answer, trace_id, and scores.
+    
     Returns:
         List of grouped traces sorted by question_id:
         [
             {
                 "question_id": 1,
                 "question": "What is Python?",
                 "ground_truth_answer": "...",
                 "llm_answers": ["Answer 1", "Answer 2"],
                 "trace_ids": ["trace-1", "trace-2"],
                 "scores": [[...], [...]]
             }
         ]
+    
+    Raises:
+        ValueError: If traces lack a valid question_id.
     """

346-352: Type hint mismatch: question_id may not always be int.

The type hint dict[int, list[dict[str, Any]]] assumes question_id is always an integer, but the code uses .get("question_id") which could return any type. If question_id values are strings or mixed types, this could lead to unexpected sorting behavior.

♻️ Suggested type hint adjustment
-    groups: dict[int, list[dict[str, Any]]] = {}
+    groups: dict[Any, list[dict[str, Any]]] = {}

367-368: Add trailing newline at end of file.

Per static analysis hint (Ruff W292), add a newline at the end of the file.

     logger.info(f"[group_traces_by_question_id] Created {len(result)} groups")
     return result
+
backend/app/api/routes/evaluations/evaluation.py (2)

141-145: Minor style inconsistency: extra space around = in status_code.

Line 143 uses status_code = 400 with spaces, while line 138 uses status_code=400 without spaces. Consider keeping consistent style.

♻️ Consistent formatting
     if export_format == "grouped" and not get_trace_info:
         raise HTTPException(
-            status_code = 400,
+            status_code=400,
             detail="export_format=grouped requires get_trace_info=true"
         )

164-171: Consider moving the import to the top of the file and avoiding in-place mutation.

  1. The import at line 166 should ideally be at the top of the file with other imports for better readability and consistency.

  2. The code mutates eval_run.score["traces"] in-place (line 169). If eval_run is a SQLModel instance that might be used elsewhere or persisted, this could have unintended side effects. Consider creating a copy for the response instead.

♻️ Proposed refactor

Move the import to the top of the file:

from app.crud.evaluations.core import group_traces_by_question_id

Then modify the grouping logic to avoid in-place mutation:

     # Formatter = grouped
     if export_format == "grouped" and eval_run.score and "traces" in eval_run.score:
-        from app.crud.evaluations.core import group_traces_by_question_id
         try:
             grouped_traces = group_traces_by_question_id(eval_run.score["traces"])
-            eval_run.score["traces"] = grouped_traces
+            # Create a copy to avoid mutating the model instance
+            eval_run.score = {**eval_run.score, "grouped_traces": grouped_traces}
         except ValueError as e:
             return APIResponse.failure_response(error=str(e), data=eval_run)

Note: This also aligns with the documentation which shows grouped_traces as a separate field.

backend/app/tests/api/routes/test_evaluation.py (2)

959-1061: Consider adding test for ValueError case (traces without question_id).

The success test is comprehensive, but there's no test coverage for the case where traces lack question_id and the grouping function raises a ValueError. This is an important error path handled in the route (lines 170-171 of evaluation.py).

📝 Suggested additional test
def test_get_evaluation_run_grouped_format_missing_question_id_fails(
    self,
    client: TestClient,
    user_api_key_header: dict[str, str],
    db: Session,
    user_api_key: TestAuthContext,
    create_test_dataset: EvaluationDataset,
) -> None:
    """Test grouped export fails when traces lack question_id."""
    eval_run = EvaluationRun(
        run_name="test_run_no_qid",
        dataset_name=create_test_dataset.name,
        dataset_id=create_test_dataset.id,
        config={"model": "gpt-4o"},
        status="completed",
        total_items=1,
        score={
            "traces": [
                {
                    "trace_id": "trace-1",
                    "question": "What is Python?",
                    "llm_answer": "A language",
                    "scores": [],
                    # No question_id
                }
            ],
            "summary_scores": [],
        },
        organization_id=user_api_key.organization_id,
        project_id=user_api_key.project_id,
    )
    db.add(eval_run)
    db.commit()
    db.refresh(eval_run)

    response = client.get(
        f"/api/v1/evaluations/{eval_run.id}",
        params={"export_format": "grouped", "get_trace_info": True},
        headers=user_api_key_header,
    )

    assert response.status_code == 200
    response_data = response.json()
    assert response_data["success"] is False
    assert "not available" in response_data["error"].lower()

1061-1062: Missing blank line before next class definition.

PEP 8 recommends two blank lines before class definitions at the module level.

         )
+

 class TestGetDataset:

Comment on lines 341 to 344
    # weather question_id exists in the traces
    if traces and (traces[0].get("question_id") is None or traces[0].get("question_id") == ""):
        raise ValueError(
            "Grouped export format is not available for this evaluation.")

⚠️ Potential issue | 🟠 Major

Validation only checks the first trace, missing question_id in other traces will cause issues.

The current validation only checks traces[0] for question_id. If subsequent traces have missing or empty question_id, they will be grouped under None as a key, and sorted(groups.keys()) on line 355 will raise a TypeError when comparing None with integers.

Also, typo: "weather" should be "whether".

🐛 Proposed fix for robust validation
-    # weather question_id exists in the traces
-    if traces and (traces[0].get("question_id") is None or traces[0].get("question_id") ==  ""):
+    # Validate that all traces have a valid question_id
+    if not traces:
+        return []
+    
+    for trace in traces:
+        question_id = trace.get("question_id")
+        if question_id is None or question_id == "":
+            raise ValueError(
+                "Grouped export format is not available for this evaluation.")
-        raise ValueError(
-            "Grouped export format is not available for this evaluation.")

Comment on lines +1037 to +1044
        response = client.get(
            f"/api/v1/evaluations/{eval_run.id}",
            params={
                "export_format": "grouped",
                "get_trace_info": True,
            },  # Missing get_trace_info=true
            headers=user_api_key_header,
        )

⚠️ Potential issue | 🟡 Minor

Stale comment: get_trace_info is actually True here.

The comment on line 1042 says # Missing get_trace_info=true but get_trace_info: True is present in the params. This appears to be copy-paste from the previous test.

🧹 Remove stale comment
         response = client.get(
             f"/api/v1/evaluations/{eval_run.id}",
             params={
                 "export_format": "grouped",
                 "get_trace_info": True,
-            },  # Missing get_trace_info=true
+            },
             headers=user_api_key_header,
         )


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/crud/evaluations/core.py`:
- Around line 358-366: The docstring for group_traces_by_question_id has
trailing whitespace and Black formatting violations; run the project's
formatter/pre-commit (e.g., black/isort) and remove any trailing spaces in that
docstring block, reflow or re-indent lines to match Black's style (keep the
triple-quoted docstring, parameter lines and Returns section neatly wrapped),
then re-run tests/CI to ensure the formatting errors are resolved.
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_evaluation.py (1)

939-1045: Prefer a factory/fixture for EvaluationRun setup in these tests.

To align with the test fixture pattern and reduce duplication, consider extracting the EvaluationRun creation into a factory/fixture helper (e.g., create_test_evaluation_run).

As per coding guidelines: “backend/app/tests/**/*.py: Use factory pattern for test fixtures in backend/app/tests/”.
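
A possible shape for such a factory fixture (illustrative only; the fixture arguments mirror the test signatures quoted above, while the fixture name, import path, and defaults are assumptions):

import pytest
from sqlmodel import Session

from app.models import EvaluationRun  # import path assumed


@pytest.fixture
def create_test_evaluation_run(db: Session, user_api_key, create_test_dataset):
    """Return a factory that persists an EvaluationRun with a given score payload."""

    def _create(score: dict, status: str = "completed") -> EvaluationRun:
        eval_run = EvaluationRun(
            run_name="test_run",
            dataset_name=create_test_dataset.name,
            dataset_id=create_test_dataset.id,
            config={"model": "gpt-4o"},
            status=status,
            total_items=len(score.get("traces", [])),
            score=score,
            organization_id=user_api_key.organization_id,
            project_id=user_api_key.project_id,
        )
        db.add(eval_run)
        db.commit()
        db.refresh(eval_run)
        return eval_run

    return _create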

@codecov

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 94.00000% with 3 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/api/routes/evaluations/evaluation.py — patch coverage 77.77%, 2 lines missing ⚠️
  • backend/app/crud/evaluations/core.py — patch coverage 94.11%, 1 line missing ⚠️


@vprashrex vprashrex requested a review from nishika26 January 29, 2026 05:59
@vprashrex vprashrex force-pushed the feat/evaulation-grouped-export-route branch from a164ee5 to e6de3d0 on January 29, 2026 09:20
@vprashrex vprashrex merged commit e0838da into main Jan 29, 2026
3 checks passed
@vprashrex vprashrex deleted the feat/evaulation-grouped-export-route branch January 29, 2026 09:51
@github-project-automation github-project-automation bot moved this from In review to Closed in Kaapi-dev Jan 29, 2026

Labels

enhancement New feature or request ready-for-review

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

Evaluation: Extend export functionality

4 participants