Skip to content

Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented Jan 25, 2026

Issue: #494

Summary

  • Add score_trace_url field to evaluation run model
  • Update evaluation service to populate trace URL

Test plan

  • Run migrations
  • Verify evaluation runs capture trace URLs

Summary by CodeRabbit

  • New Features

    • Evaluation runs can store and retrieve external trace files (score trace URLs) so detailed per-trace scoring is available and cached.
    • Service now loads and merges traces from external storage, DB, or live fetches, honoring resync requests.
  • Chores

    • Database migration added to support the new trace URL field.
  • Tests

    • New tests covering storage upload/load, S3 vs DB trace retrieval, resync behavior, and upload fallbacks.

@coderabbitai

coderabbitai bot commented Jan 25, 2026

📝 Walkthrough

Adds a nullable score_trace_url column to evaluation runs (migration + model), changes storage utilities to upload/load JSON arrays, updates CRUD to accept/persist score_trace_url, implements S3-first lazy trace loading with DB fallback and resync via Langfuse, and adds unit/integration tests for these flows.
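
The upload-with-fallback behavior that the new tests pin down can be summarized in a short sketch. This is a minimal reconstruction from the walkthrough and tests, not the PR's actual code: `split_score_for_storage` is a hypothetical name, and only the helper's import path and keyword arguments are taken from the PR.

from typing import Any

from app.core.storage_utils import upload_jsonl_to_object_store


def split_score_for_storage(
    storage: Any, eval_run_id: int, score: dict
) -> tuple[dict, str | None]:
    """Return the dict to persist in the DB and the score_trace_url, if any."""
    traces = score.get("traces") or []
    if not traces:
        # No traces: skip the upload entirely and store the score unchanged.
        return score, None

    url = upload_jsonl_to_object_store(
        storage=storage,
        results=traces,  # serialized as a single JSON array
        filename="traces.json",
        subdirectory=f"evaluations/score/{eval_run_id}",
    )
    if url is None:
        # Upload failed: fall back to keeping the full score (traces included) in the DB.
        return score, None

    # Upload succeeded: the DB keeps only the summary, plus the S3 URL.
    summary_only = {k: v for k, v in score.items() if k != "traces"}
    return summary_only, url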

Changes

Cohort / File(s) | Summary

DB Migration & Model
backend/app/alembic/versions/043_add_score_trace_url_to_evaluation_run.py, backend/app/models/evaluation.py
Adds migration (revision 043) to add a nullable score_trace_url column and updates the EvaluationRun model with a `score_trace_url: str | None` field.

Storage Utilities
backend/app/core/storage_utils.py
Changes the upload format to a JSON array (not JSONL) and adds `load_json_from_object_store(storage, url) -> list | None` (usage sketch after this table).

CRUD
backend/app/crud/evaluations/core.py
Extends the update_evaluation_run() signature to accept an optional score_trace_url and persist it to the DB when provided.

Evaluation Service Logic
backend/app/services/evaluations/evaluation.py
Implements S3-first lazy loading of traces using score_trace_url, merges S3/DB/Langfuse traces and summary scores, supports resync to bypass caches, and persists the merged score and score_trace_url.

Storage Tests
backend/app/tests/core/test_storage_utils.py
Adds tests for upload_jsonl_to_object_store success/failure and load_json_from_object_store success, CloudStorageError, and invalid-JSON handling.

Score Storage Tests
backend/app/tests/crud/evaluations/test_score_storage.py
Adds tests covering S3 upload of traces with summary-only DB storage, upload-failure fallback (store full score in DB), and skipping upload for empty traces.

Service Tests
backend/app/tests/services/evaluations/__init__.py, backend/app/tests/services/evaluations/test_evaluation_service_s3.py
Adds fixtures and tests for loading traces from S3, returning DB traces when no S3 URL is set, and resync behavior that fetches fresh Langfuse traces and bypasses the cache.
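
Read together, the two storage helpers form a round trip. A hedged usage sketch follows; the signatures and keyword arguments match the snippets quoted later in this review, but `round_trip` and its calling context are illustrative only.

from typing import Any

from app.core.storage_utils import (
    load_json_from_object_store,
    upload_jsonl_to_object_store,
)


def round_trip(storage: Any, results: list[dict]) -> list | None:
    # Despite the legacy "jsonl" name, the helper now uploads one JSON array.
    url = upload_jsonl_to_object_store(
        storage=storage,
        results=results,
        filename="traces.json",
        subdirectory="evaluations/score/70",
    )
    if url is None:
        # The helper catches CloudStorageError internally and returns None.
        return None
    # Returns the parsed list, or None on download failure or invalid JSON.
    return load_json_from_object_store(storage=storage, url=url)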

Sequence Diagram

sequenceDiagram
    participant Client
    participant Service as EvaluationService
    participant DB as Database
    participant S3 as Object Storage
    participant Langfuse as Langfuse API

    Client->>Service: get_evaluation_with_scores(resync_score=false)
    Service->>DB: get_evaluation_run(id)
    DB-->>Service: EvaluationRun (may include score_trace_url)

    alt score_trace_url present and not resync
        Service->>S3: load_json_from_object_store(score_trace_url)
        alt S3 returns traces
            S3-->>Service: traces list
            Service->>Service: merge traces with summary_scores
        else S3 fails
            S3-->>Service: error
            Service->>DB: read cached traces from eval_run.score
            DB-->>Service: cached traces (if any)
            Service->>Service: merge cached traces with summary_scores
        end
    else resync requested
        Service->>Langfuse: fetch_trace_scores_from_langfuse()
        Langfuse-->>Service: fresh traces & summary_scores
        Service->>Service: merge Langfuse data with any existing summaries
        Service->>S3: upload_jsonl_to_object_store(traces) (optional)
        S3-->>Service: score_trace_url (on success)
        Service->>DB: update_evaluation_run(score, score_trace_url)
    end

    Service-->>Client: evaluation with merged scores and traces

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Prajna1999
  • AkhileshNegi

Poem

🐰 I hopped through migrations, schema in tow,
I packed traces in S3 where cool breezes blow,
Merge DB, Langfuse, and cloud with a twitch of my nose,
Now scores and traces dance where the warm data flows! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title 'Add score trace URL to evaluation run' clearly and concisely summarizes the main change: introducing a score_trace_url field to the EvaluationRun model, which is evidenced across multiple files (models, migrations, CRUD, and services).
Docstring Coverage | ✅ Passed | Docstring coverage is 86.36%, which is sufficient; the required threshold is 80.00%.


@codecov

codecov bot commented Jan 25, 2026

Codecov Report

❌ Patch coverage is 95.71429% with 9 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
backend/app/crud/evaluations/core.py | 82.60% | 4 Missing ⚠️
backend/app/core/storage_utils.py | 83.33% | 3 Missing ⚠️
backend/app/services/evaluations/evaluation.py | 90.00% | 2 Missing ⚠️


@vprashrex vprashrex self-assigned this Jan 27, 2026
@vprashrex vprashrex added the enhancement New feature or request label Jan 27, 2026
@vprashrex vprashrex linked an issue Jan 27, 2026 that may be closed by this pull request
@vprashrex vprashrex marked this pull request as ready for review January 31, 2026 18:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/core/storage_utils.py (1)

86-123: ⚠️ Potential issue | 🟡 Minor

Align JSONL metadata with JSON array payload.

The payload is now a JSON array, but the docstring and content-type still say JSONL. Update the metadata to avoid confusion (or revert to JSONL formatting).

🧾 Suggested update
-    Upload JSONL (JSON Lines) content to object store.
+    Upload JSON array content to object store.
@@
-        results: List of dictionaries to be converted to JSONL
+        results: List of dictionaries to be serialized as a JSON array
@@
-        headers = Headers({"content-type": "application/jsonl"})
+        headers = Headers({"content-type": "application/json"})
backend/app/crud/evaluations/core.py (1)

184-209: ⚠️ Potential issue | 🟡 Minor

Document the new score_trace_url parameter.

The Args section doesn’t mention the newly added parameter.

✍️ Suggested update
-        object_store_url: New object store URL (optional)
-        score: New score dict (optional)
-        embedding_batch_job_id: New embedding batch job ID (optional)
+        object_store_url: New object store URL (optional)
+        score_trace_url: New per-trace score URL (optional)
+        score: New score dict (optional)
+        embedding_batch_job_id: New embedding batch job ID (optional)
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/043_add_score_trace_url_to_evaluation_run.py`:
- Around line 19-32: The migration functions upgrade and downgrade lack return
type hints; update the function signatures for upgrade() and downgrade() to
declare return type -> None so they read as upgrade() -> None: and downgrade()
-> None: (keep existing bodies unchanged) to satisfy the type-hinting rule.

In `@backend/app/core/storage_utils.py`:
- Around line 151-152: The log statement in load_json_from_object_store has a
missing closing quote and brace in the f-string; update the logger.info call
(logger.info(...)) to properly close the string and include the url variable,
e.g., fix the f-string to something like f"[load_json_from_object_store] Loading
JSON from '{url}'" so the message is well-formed and includes the trailing quote
and bracket.

In `@backend/app/services/evaluations/evaluation.py`:
- Around line 233-267: The code in get_evaluation_with_scores treats S3 and DB
as mutually exclusive (uses `elif`), so if `eval_run.score_trace_url` exists but
`load_json_from_object_store` fails or returns None you never fall back to DB
traces (`has_cached_traces_db`); change the control flow so the DB check is a
separate conditional executed after the S3 attempt/exception (i.e., replace the
`elif has_cached_traces_db` with an independent `if has_cached_traces_db` and
ensure the S3-exception handler does not return or suppress the subsequent DB
check), using the existing symbols `eval_run`, `score_trace_url`,
`has_cached_traces_db`, `load_json_from_object_store`, and
`get_evaluation_with_scores` to locate and update the logic.

In `@backend/app/tests/core/test_storage_utils.py`:
- Around line 57-98: The test methods in TestLoadJsonFromObjectStore lack
parameter type hints; update each method signature (test_load_success,
test_load_returns_none_on_error, test_load_returns_none_on_invalid_json) to
include the implicit self parameter type (self: "TestLoadJsonFromObjectStore")
and an explicit return type -> None; ensure any local test helper parameters
would also be annotated (e.g., mock_storage: MagicMock) if present, and keep the
body unchanged so assertions against load_json_from_object_store and
mock_storage.stream continue to work.
- Around line 16-55: The test methods in class TestUploadJsonlToObjectStore lack
type annotations on the implicit self parameter; update the method signatures
for test_upload_success and test_upload_returns_none_on_error to annotate self
(e.g., self: TestUploadJsonlToObjectStore) while keeping the existing return
type hints, and run tests to ensure no behavioral changes; locate these methods
and the class name TestUploadJsonlToObjectStore and add the self parameter type
annotation accordingly.

In `@backend/app/tests/crud/evaluations/test_score_storage.py`:
- Around line 14-110: The fixture mock_eval_run should be a factory (callable)
and the test functions need parameter type hints; change the fixture name to
mock_eval_run_factory (or make mock_eval_run return a Callable[[], MagicMock])
that when called creates/returns a MagicMock(spec=EvaluationRun) with id=100,
then inside tests call mock_eval_run_factory() to get the eval_run; also add
type annotations to each test parameter (e.g., mock_engine: MagicMock,
mock_get_storage: MagicMock, mock_upload: MagicMock, mock_get_eval: MagicMock,
mock_update: MagicMock, mock_eval_run_factory: Callable[[], EvaluationRun] or
MagicMock) and return type -> None for
test_uploads_traces_to_s3_and_stores_summary_only,
test_fallback_to_db_when_s3_fails, and test_no_s3_upload_when_no_traces; keep
references to save_score and update_evaluation_run/get_evaluation_run_by_id
unchanged.

In `@backend/app/tests/services/evaluations/test_evaluation_service_s3.py`:
- Around line 42-120: The tests lack type hints: add annotations to the test
methods' parameters and fixtures referenced in this file—annotate mock
parameters (mock_get_storage, mock_load, mock_get_eval, mock_get_langfuse,
mock_fetch_langfuse, mock_save_score) as MagicMock, annotate fixture parameters
completed_eval_run_with_s3 and completed_eval_run_with_db_traces as
EvaluationRun, and annotate self as "TestGetEvaluationWithScoresS3" on the
methods test_loads_traces_from_s3, test_returns_db_traces_when_no_s3_url, and
test_resync_bypasses_cache_and_fetches_langfuse; also add return type
annotations for the fixtures (completed_eval_run_with_s3 -> EvaluationRun,
completed_eval_run_with_db_traces -> EvaluationRun) and import MagicMock and
EvaluationRun types where needed.
- Around line 14-37: Replace the two near-duplicate fixtures
completed_eval_run_with_s3 and completed_eval_run_with_db_traces with a single
factory fixture (e.g., eval_run_factory) that accepts parameters like id: int,
status: str, score: dict, score_trace_url: Optional[str], dataset_name:
Optional[str], run_name: Optional[str] and returns a
MagicMock(spec=EvaluationRun) populated accordingly; update tests to call
eval_run_factory(…) to create the two scenarios. Add explicit type hints to the
fixture and returned value (eval_run_factory: Callable[..., MagicMock] or ->
MagicMock) and to all test function parameters (e.g., mock_get_storage:
MagicMock, eval_run_factory: Callable[..., MagicMock]) so every fixture/test
signature is typed. Ensure existing symbols EvaluationRun,
completed_eval_run_with_s3 and completed_eval_run_with_db_traces are
removed/replaced and all tests updated to use the new eval_run_factory and typed
parameters.
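
Illustrating the evaluation-service fix above: a minimal sketch of the corrected control flow, assuming the symbol names from the review. `resolve_traces` is a hypothetical extraction for illustration, not the service's actual function.

from typing import Any

from app.core.storage_utils import load_json_from_object_store


def resolve_traces(eval_run: Any, storage: Any, resync_score: bool) -> list | None:
    """Hypothetical extraction of the trace-loading step."""
    traces: list | None = None

    if not resync_score and eval_run.score_trace_url:
        # S3-first: the helper returns None on CloudStorageError or invalid JSON.
        traces = load_json_from_object_store(
            storage=storage, url=eval_run.score_trace_url
        )

    # Independent `if` (not `elif`): runs even when the S3 attempt returned
    # None, so traces cached in eval_run.score still act as a fallback.
    has_cached_traces_db = bool(eval_run.score and eval_run.score.get("traces"))
    if traces is None and has_cached_traces_db:
        traces = eval_run.score["traces"]

    return traces
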
🧹 Nitpick comments (1)
backend/app/crud/evaluations/core.py (1)

386-392: Resolve the TODO about DB storage fallback.

This TODO leaves behavior undecided and includes a typo (“weather” → “whether”). Please either track it in an issue or implement the intended behavior.

If you want, I can draft the follow-up change or open an issue template entry.

Comment on lines +19 to +32
def upgrade():
    op.add_column(
        "evaluation_run",
        sa.Column(
            "score_trace_url",
            sqlmodel.sql.sqltypes.AutoString(),
            nullable=True,
            comment="S3 URL where per-trace evaluation scores are stored",
        ),
    )


def downgrade():
    op.drop_column("evaluation_run", "score_trace_url")

⚠️ Potential issue | 🟡 Minor

Add return type hints to migration functions.

Both upgrade() and downgrade() should declare -> None to comply with the type-hinting rule.

✍️ Suggested update
-def upgrade():
+def upgrade() -> None:
     op.add_column(
         "evaluation_run",
         sa.Column(
             "score_trace_url",
             sqlmodel.sql.sqltypes.AutoString(),
             nullable=True,
             comment="S3 URL where per-trace evaluation scores are stored",
         ),
     )

-def downgrade():
+def downgrade() -> None:
     op.drop_column("evaluation_run", "score_trace_url")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def upgrade():
    op.add_column(
        "evaluation_run",
        sa.Column(
            "score_trace_url",
            sqlmodel.sql.sqltypes.AutoString(),
            nullable=True,
            comment="S3 URL where per-trace evaluation scores are stored",
        ),
    )


def downgrade():
    op.drop_column("evaluation_run", "score_trace_url")

def upgrade() -> None:
    op.add_column(
        "evaluation_run",
        sa.Column(
            "score_trace_url",
            sqlmodel.sql.sqltypes.AutoString(),
            nullable=True,
            comment="S3 URL where per-trace evaluation scores are stored",
        ),
    )


def downgrade() -> None:
    op.drop_column("evaluation_run", "score_trace_url")
🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/043_add_score_trace_url_to_evaluation_run.py`
around lines 19 - 32, The migration functions upgrade and downgrade lack return
type hints; update the function signatures for upgrade() and downgrade() to
declare return type -> None so they read as upgrade() -> None: and downgrade()
-> None: (keep existing bodies unchanged) to satisfy the type-hinting rule.

Comment on lines +151 to +152
def load_json_from_object_store(storage: CloudStorage, url: str) -> list | None:
    logger.info(f"[load_json_from_object_store] Loading JSON from '{url}")

⚠️ Potential issue | 🟡 Minor

Fix log message formatting.

The opening quote isn’t closed, which makes the log line confusing.

✍️ Suggested update
-    logger.info(f"[load_json_from_object_store] Loading JSON from '{url}")
+    logger.info(f"[load_json_from_object_store] Loading JSON from '{url}'")
🤖 Prompt for AI Agents
In `@backend/app/core/storage_utils.py` around lines 151 - 152, The log statement
in load_json_from_object_store has a missing closing quote and brace in the
f-string; update the logger.info call (logger.info(...)) to properly close the
string and include the url variable, e.g., fix the f-string to something like
f"[load_json_from_object_store] Loading JSON from '{url}'" so the message is
well-formed and includes the trailing quote and bracket.

Comment on lines +16 to +55
class TestUploadJsonlToObjectStore:
    """Test uploading JSON content to object store."""

    def test_upload_success(self) -> None:
        """Verify successful upload returns URL and content is correct."""
        mock_storage = MagicMock()
        mock_storage.put.return_value = "s3://bucket/path/traces.json"

        results = [{"trace_id": "t1", "score": 0.9}]

        url = upload_jsonl_to_object_store(
            storage=mock_storage,
            results=results,
            filename="traces.json",
            subdirectory="evaluations/score/70",
        )

        assert url == "s3://bucket/path/traces.json"
        mock_storage.put.assert_called_once()

        # Verify content is valid JSON
        call_args = mock_storage.put.call_args
        upload_file = call_args.kwargs.get("source")
        content = upload_file.file.read().decode("utf-8")
        assert json.loads(content) == results

    def test_upload_returns_none_on_error(self) -> None:
        """Verify returns None on CloudStorageError."""
        mock_storage = MagicMock()
        mock_storage.put.side_effect = CloudStorageError("Upload failed")

        url = upload_jsonl_to_object_store(
            storage=mock_storage,
            results=[{"id": 1}],
            filename="test.json",
            subdirectory="test",
        )

        assert url is None


⚠️ Potential issue | 🟡 Minor

Add parameter type hints to test methods.

Annotate self to satisfy the type-hint requirement for function parameters.

✍️ Suggested update
-    def test_upload_success(self) -> None:
+    def test_upload_success(self: "TestUploadJsonlToObjectStore") -> None:
         """Verify successful upload returns URL and content is correct."""
         mock_storage = MagicMock()
         mock_storage.put.return_value = "s3://bucket/path/traces.json"
@@
-    def test_upload_returns_none_on_error(self) -> None:
+    def test_upload_returns_none_on_error(
+        self: "TestUploadJsonlToObjectStore",
+    ) -> None:
         """Verify returns None on CloudStorageError."""
         mock_storage = MagicMock()
         mock_storage.put.side_effect = CloudStorageError("Upload failed")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
In `@backend/app/tests/core/test_storage_utils.py` around lines 16 - 55, The test
methods in class TestUploadJsonlToObjectStore lack type annotations on the
implicit self parameter; update the method signatures for test_upload_success
and test_upload_returns_none_on_error to annotate self (e.g., self:
TestUploadJsonlToObjectStore) while keeping the existing return type hints, and
run tests to ensure no behavioral changes; locate these methods and the class
name TestUploadJsonlToObjectStore and add the self parameter type annotation
accordingly.

Comment on lines +57 to +98
class TestLoadJsonFromObjectStore:
    """Test loading JSON from object store."""

    def test_load_success(self) -> None:
        """Verify successful load returns parsed JSON."""
        mock_storage = MagicMock()
        test_data = [{"id": 1, "value": "test"}]
        mock_storage.stream.return_value = BytesIO(
            json.dumps(test_data).encode("utf-8")
        )

        result = load_json_from_object_store(
            storage=mock_storage,
            url="s3://bucket/path/file.json",
        )

        assert result == test_data
        mock_storage.stream.assert_called_once_with("s3://bucket/path/file.json")

    def test_load_returns_none_on_error(self) -> None:
        """Verify returns None on CloudStorageError."""
        mock_storage = MagicMock()
        mock_storage.stream.side_effect = CloudStorageError("Download failed")

        result = load_json_from_object_store(
            storage=mock_storage,
            url="s3://bucket/file.json",
        )

        assert result is None

    def test_load_returns_none_on_invalid_json(self) -> None:
        """Verify returns None on invalid JSON content."""
        mock_storage = MagicMock()
        mock_storage.stream.return_value = BytesIO(b"not valid json")

        result = load_json_from_object_store(
            storage=mock_storage,
            url="s3://bucket/file.json",
        )

        assert result is None

⚠️ Potential issue | 🟡 Minor

Add parameter type hints to test methods.

Same type-hint requirement applies to this class’ methods.

✍️ Suggested update
-    def test_load_success(self) -> None:
+    def test_load_success(self: "TestLoadJsonFromObjectStore") -> None:
         """Verify successful load returns parsed JSON."""
         mock_storage = MagicMock()
         test_data = [{"id": 1, "value": "test"}]
@@
-    def test_load_returns_none_on_error(self) -> None:
+    def test_load_returns_none_on_error(
+        self: "TestLoadJsonFromObjectStore",
+    ) -> None:
         """Verify returns None on CloudStorageError."""
         mock_storage = MagicMock()
         mock_storage.stream.side_effect = CloudStorageError("Download failed")
@@
-    def test_load_returns_none_on_invalid_json(self) -> None:
+    def test_load_returns_none_on_invalid_json(
+        self: "TestLoadJsonFromObjectStore",
+    ) -> None:
         """Verify returns None on invalid JSON content."""
         mock_storage = MagicMock()
         mock_storage.stream.return_value = BytesIO(b"not valid json")

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
In `@backend/app/tests/core/test_storage_utils.py` around lines 57 - 98, The test
methods in TestLoadJsonFromObjectStore lack parameter type hints; update each
method signature (test_load_success, test_load_returns_none_on_error,
test_load_returns_none_on_invalid_json) to include the implicit self parameter
type (self: "TestLoadJsonFromObjectStore") and an explicit return type -> None;
ensure any local test helper parameters would also be annotated (e.g.,
mock_storage: MagicMock) if present, and keep the body unchanged so assertions
against load_json_from_object_store and mock_storage.stream continue to work.

Comment on lines +14 to +110
    @pytest.fixture
    def mock_eval_run(self):
        """Create a mock EvaluationRun."""
        eval_run = MagicMock(spec=EvaluationRun)
        eval_run.id = 100
        return eval_run

    @patch("app.crud.evaluations.core.update_evaluation_run")
    @patch("app.crud.evaluations.core.get_evaluation_run_by_id")
    @patch("app.core.storage_utils.upload_jsonl_to_object_store")
    @patch("app.core.cloud.storage.get_cloud_storage")
    @patch("app.core.db.engine")
    def test_uploads_traces_to_s3_and_stores_summary_only(
        self,
        mock_engine,
        mock_get_storage,
        mock_upload,
        mock_get_eval,
        mock_update,
        mock_eval_run,
    ) -> None:
        """Verify traces uploaded to S3, only summary_scores stored in DB."""
        mock_get_eval.return_value = mock_eval_run
        mock_get_storage.return_value = MagicMock()
        mock_upload.return_value = "s3://bucket/traces.json"

        score = {
            "summary_scores": [{"name": "accuracy", "avg": 0.9}],
            "traces": [{"trace_id": "t1"}],
        }

        save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

        # Verify upload was called with traces
        mock_upload.assert_called_once()
        assert mock_upload.call_args.kwargs["results"] == [{"trace_id": "t1"}]

        # Verify DB gets summary only, not traces
        call_kwargs = mock_update.call_args.kwargs
        assert call_kwargs["score"] == {
            "summary_scores": [{"name": "accuracy", "avg": 0.9}]
        }
        assert call_kwargs["score_trace_url"] == "s3://bucket/traces.json"

    @patch("app.crud.evaluations.core.update_evaluation_run")
    @patch("app.crud.evaluations.core.get_evaluation_run_by_id")
    @patch("app.core.storage_utils.upload_jsonl_to_object_store")
    @patch("app.core.cloud.storage.get_cloud_storage")
    @patch("app.core.db.engine")
    def test_fallback_to_db_when_s3_fails(
        self,
        mock_engine,
        mock_get_storage,
        mock_upload,
        mock_get_eval,
        mock_update,
        mock_eval_run,
    ) -> None:
        """Verify full score stored in DB when S3 upload fails."""
        mock_get_eval.return_value = mock_eval_run
        mock_get_storage.return_value = MagicMock()
        mock_upload.return_value = None  # S3 failed

        score = {
            "summary_scores": [{"name": "accuracy", "avg": 0.9}],
            "traces": [{"trace_id": "t1"}],
        }

        save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

        # Full score stored in DB as fallback
        call_kwargs = mock_update.call_args.kwargs
        assert call_kwargs["score"] == score
        assert call_kwargs["score_trace_url"] is None

    @patch("app.crud.evaluations.core.update_evaluation_run")
    @patch("app.crud.evaluations.core.get_evaluation_run_by_id")
    @patch("app.core.storage_utils.upload_jsonl_to_object_store")
    @patch("app.core.cloud.storage.get_cloud_storage")
    @patch("app.core.db.engine")
    def test_no_s3_upload_when_no_traces(
        self,
        mock_engine,
        mock_get_storage,
        mock_upload,
        mock_get_eval,
        mock_update,
        mock_eval_run,
    ) -> None:
        """Verify S3 upload skipped when traces is empty."""
        mock_get_eval.return_value = mock_eval_run

        score = {"summary_scores": [{"name": "accuracy", "avg": 0.9}], "traces": []}

        save_score(eval_run_id=100, organization_id=1, project_id=1, score=score)

        mock_upload.assert_not_called()

⚠️ Potential issue | 🟡 Minor

Use a factory fixture and add parameter type hints.

Guidelines call for factory-style fixtures and typed parameters. Consider converting the fixture into a callable factory and annotating the mocked parameters.

✍️ Suggested update (apply pattern to other tests too)
+from typing import Callable
 from unittest.mock import MagicMock, patch
@@
 class TestSaveScoreS3Upload:
@@
     @pytest.fixture
-    def mock_eval_run(self):
+    def eval_run_factory(
+        self: "TestSaveScoreS3Upload",
+    ) -> Callable[[], EvaluationRun]:
         """Create a mock EvaluationRun."""
-        eval_run = MagicMock(spec=EvaluationRun)
-        eval_run.id = 100
-        return eval_run
+        def _factory() -> EvaluationRun:
+            eval_run = MagicMock(spec=EvaluationRun)
+            eval_run.id = 100
+            return eval_run
+
+        return _factory
@@
     def test_uploads_traces_to_s3_and_stores_summary_only(
-        self,
-        mock_engine,
-        mock_get_storage,
-        mock_upload,
-        mock_get_eval,
-        mock_update,
-        mock_eval_run,
+        self: "TestSaveScoreS3Upload",
+        mock_engine: MagicMock,
+        mock_get_storage: MagicMock,
+        mock_upload: MagicMock,
+        mock_get_eval: MagicMock,
+        mock_update: MagicMock,
+        eval_run_factory: Callable[[], EvaluationRun],
     ) -> None:
@@
-        mock_get_eval.return_value = mock_eval_run
+        mock_get_eval.return_value = eval_run_factory()

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code; backend/app/tests/**/*.py: Use factory pattern for test fixtures in backend/app/tests/.

🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_score_storage.py` around lines 14 -
110, The fixture mock_eval_run should be a factory (callable) and the test
functions need parameter type hints; change the fixture name to
mock_eval_run_factory (or make mock_eval_run return a Callable[[], MagicMock])
that when called creates/returns a MagicMock(spec=EvaluationRun) with id=100,
then inside tests call mock_eval_run_factory() to get the eval_run; also add
type annotations to each test parameter (e.g., mock_engine: MagicMock,
mock_get_storage: MagicMock, mock_upload: MagicMock, mock_get_eval: MagicMock,
mock_update: MagicMock, mock_eval_run_factory: Callable[[], EvaluationRun] or
MagicMock) and return type -> None for
test_uploads_traces_to_s3_and_stores_summary_only,
test_fallback_to_db_when_s3_fails, and test_no_s3_upload_when_no_traces; keep
references to save_score and update_evaluation_run/get_evaluation_run_by_id
unchanged.

Comment on lines +14 to +37
    @pytest.fixture
    def completed_eval_run_with_s3(self):
        """Completed eval run with S3 URL."""
        eval_run = MagicMock(spec=EvaluationRun)
        eval_run.id = 100
        eval_run.status = "completed"
        eval_run.score = {"summary_scores": [{"name": "accuracy", "avg": 0.9}]}
        eval_run.score_trace_url = "s3://bucket/traces.json"
        eval_run.dataset_name = "test_dataset"
        eval_run.run_name = "test_run"
        return eval_run

    @pytest.fixture
    def completed_eval_run_with_db_traces(self):
        """Completed eval run with traces in DB."""
        eval_run = MagicMock(spec=EvaluationRun)
        eval_run.id = 101
        eval_run.status = "completed"
        eval_run.score = {
            "summary_scores": [{"name": "accuracy", "avg": 0.85}],
            "traces": [{"trace_id": "db_trace"}],
        }
        eval_run.score_trace_url = None
        return eval_run

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the test file
find . -type f -name "test_evaluation_service_s3.py" 2>/dev/null

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 145


🏁 Script executed:

cat -n ./backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 5429


🏁 Script executed:

# Verify the guidelines apply and check if EvaluationRun type is available
rg "from app.models import EvaluationRun" ./backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 111


Use a factory fixture and add type hints to fixtures and test methods.

The fixtures duplicate setup logic and lack required type hints, violating the mandatory coding guidelines. A factory fixture reduces duplication and enables proper type annotations for all parameters and return values.

♻️ Suggested refactor
-from unittest.mock import MagicMock, patch
+from typing import Callable
+from unittest.mock import MagicMock, patch
 ...
 class TestGetEvaluationWithScoresS3:
     """Test get_evaluation_with_scores() S3 retrieval."""

     @pytest.fixture
-    def completed_eval_run_with_s3(self):
-        """Completed eval run with S3 URL."""
-        eval_run = MagicMock(spec=EvaluationRun)
-        eval_run.id = 100
-        eval_run.status = "completed"
-        eval_run.score = {"summary_scores": [{"name": "accuracy", "avg": 0.9}]}
-        eval_run.score_trace_url = "s3://bucket/traces.json"
-        eval_run.dataset_name = "test_dataset"
-        eval_run.run_name = "test_run"
-        return eval_run
+    def evaluation_run_factory(
+        self: "TestGetEvaluationWithScoresS3",
+    ) -> Callable[..., EvaluationRun]:
+        def _factory(
+            *,
+            eval_run_id: int,
+            status: str = "completed",
+            score: dict | None = None,
+            score_trace_url: str | None = None,
+            dataset_name: str = "test_dataset",
+            run_name: str = "test_run",
+        ) -> EvaluationRun:
+            eval_run = MagicMock(spec=EvaluationRun)
+            eval_run.id = eval_run_id
+            eval_run.status = status
+            eval_run.score = score
+            eval_run.score_trace_url = score_trace_url
+            eval_run.dataset_name = dataset_name
+            eval_run.run_name = run_name
+            return eval_run
+
+        return _factory
+
+    @pytest.fixture
+    def completed_eval_run_with_s3(
+        self: "TestGetEvaluationWithScoresS3",
+        evaluation_run_factory: Callable[..., EvaluationRun],
+    ) -> EvaluationRun:
-        """Completed eval run with S3 URL."""
+        """Completed eval run with S3 URL."""
+        return evaluation_run_factory(
+            eval_run_id=100,
+            score={"summary_scores": [{"name": "accuracy", "avg": 0.9}]},
+            score_trace_url="s3://bucket/traces.json",
+        )

     @pytest.fixture
-    def completed_eval_run_with_db_traces(self):
-        """Completed eval run with traces in DB."""
-        eval_run = MagicMock(spec=EvaluationRun)
-        eval_run.id = 101
-        eval_run.status = "completed"
-        eval_run.score = {
-            "summary_scores": [{"name": "accuracy", "avg": 0.85}],
-            "traces": [{"trace_id": "db_trace"}],
-        }
-        eval_run.score_trace_url = None
-        return eval_run
+    def completed_eval_run_with_db_traces(
+        self: "TestGetEvaluationWithScoresS3",
+        evaluation_run_factory: Callable[..., EvaluationRun],
+    ) -> EvaluationRun:
+        """Completed eval run with traces in DB."""
+        return evaluation_run_factory(
+            eval_run_id=101,
+            score={
+                "summary_scores": [{"name": "accuracy", "avg": 0.85}],
+                "traces": [{"trace_id": "db_trace"}],
+            },
+            score_trace_url=None,
+        )

Also add type hints to test method parameters (e.g., mock_get_storage: MagicMock instead of bare mock_get_storage).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    @pytest.fixture
    def completed_eval_run_with_s3(self):
        """Completed eval run with S3 URL."""
        eval_run = MagicMock(spec=EvaluationRun)
        eval_run.id = 100
        eval_run.status = "completed"
        eval_run.score = {"summary_scores": [{"name": "accuracy", "avg": 0.9}]}
        eval_run.score_trace_url = "s3://bucket/traces.json"
        eval_run.dataset_name = "test_dataset"
        eval_run.run_name = "test_run"
        return eval_run

    @pytest.fixture
    def completed_eval_run_with_db_traces(self):
        """Completed eval run with traces in DB."""
        eval_run = MagicMock(spec=EvaluationRun)
        eval_run.id = 101
        eval_run.status = "completed"
        eval_run.score = {
            "summary_scores": [{"name": "accuracy", "avg": 0.85}],
            "traces": [{"trace_id": "db_trace"}],
        }
        eval_run.score_trace_url = None
        return eval_run

    @pytest.fixture
    def evaluation_run_factory(
        self: "TestGetEvaluationWithScoresS3",
    ) -> Callable[..., EvaluationRun]:
        """Factory for creating evaluation run fixtures."""

        def _factory(
            *,
            eval_run_id: int,
            status: str = "completed",
            score: dict | None = None,
            score_trace_url: str | None = None,
            dataset_name: str = "test_dataset",
            run_name: str = "test_run",
        ) -> EvaluationRun:
            eval_run = MagicMock(spec=EvaluationRun)
            eval_run.id = eval_run_id
            eval_run.status = status
            eval_run.score = score
            eval_run.score_trace_url = score_trace_url
            eval_run.dataset_name = dataset_name
            eval_run.run_name = run_name
            return eval_run

        return _factory

    @pytest.fixture
    def completed_eval_run_with_s3(
        self: "TestGetEvaluationWithScoresS3",
        evaluation_run_factory: Callable[..., EvaluationRun],
    ) -> EvaluationRun:
        """Completed eval run with S3 URL."""
        return evaluation_run_factory(
            eval_run_id=100,
            score={"summary_scores": [{"name": "accuracy", "avg": 0.9}]},
            score_trace_url="s3://bucket/traces.json",
        )

    @pytest.fixture
    def completed_eval_run_with_db_traces(
        self: "TestGetEvaluationWithScoresS3",
        evaluation_run_factory: Callable[..., EvaluationRun],
    ) -> EvaluationRun:
        """Completed eval run with traces in DB."""
        return evaluation_run_factory(
            eval_run_id=101,
            score={
                "summary_scores": [{"name": "accuracy", "avg": 0.85}],
                "traces": [{"trace_id": "db_trace"}],
            },
            score_trace_url=None,
        )
🤖 Prompt for AI Agents
In `@backend/app/tests/services/evaluations/test_evaluation_service_s3.py` around
lines 14 - 37, Replace the two near-duplicate fixtures
completed_eval_run_with_s3 and completed_eval_run_with_db_traces with a single
factory fixture (e.g., eval_run_factory) that accepts parameters like id: int,
status: str, score: dict, score_trace_url: Optional[str], dataset_name:
Optional[str], run_name: Optional[str] and returns a
MagicMock(spec=EvaluationRun) populated accordingly; update tests to call
eval_run_factory(…) to create the two scenarios. Add explicit type hints to the
fixture and returned value (eval_run_factory: Callable[..., MagicMock] or ->
MagicMock) and to all test function parameters (e.g., mock_get_storage:
MagicMock, eval_run_factory: Callable[..., MagicMock]) so every fixture/test
signature is typed. Ensure existing symbols EvaluationRun,
completed_eval_run_with_s3 and completed_eval_run_with_db_traces are
removed/replaced and all tests updated to use the new eval_run_factory and typed
parameters.

Comment on lines +42 to +120
    def test_loads_traces_from_s3(
        self, mock_get_storage, mock_load, mock_get_eval, completed_eval_run_with_s3
    ) -> None:
        """Verify traces loaded from S3 and score reconstructed."""
        mock_get_eval.return_value = completed_eval_run_with_s3
        mock_get_storage.return_value = MagicMock()
        mock_load.return_value = [{"trace_id": "s3_trace"}]

        result, error = get_evaluation_with_scores(
            session=MagicMock(),
            evaluation_id=100,
            organization_id=1,
            project_id=1,
            get_trace_info=True,
            resync_score=False,
        )

        assert error is None
        mock_load.assert_called_once()
        assert result.score["traces"] == [{"trace_id": "s3_trace"}]
        assert result.score["summary_scores"] == [{"name": "accuracy", "avg": 0.9}]

    @patch("app.services.evaluations.evaluation.get_evaluation_run_by_id")
    @patch("app.core.cloud.storage.get_cloud_storage")
    def test_returns_db_traces_when_no_s3_url(
        self, mock_get_storage, mock_get_eval, completed_eval_run_with_db_traces
    ) -> None:
        """Verify DB traces returned when no S3 URL."""
        mock_get_eval.return_value = completed_eval_run_with_db_traces

        result, error = get_evaluation_with_scores(
            session=MagicMock(),
            evaluation_id=101,
            organization_id=1,
            project_id=1,
            get_trace_info=True,
            resync_score=False,
        )

        assert error is None
        mock_get_storage.assert_not_called()
        assert result.score["traces"] == [{"trace_id": "db_trace"}]

    @patch("app.services.evaluations.evaluation.save_score")
    @patch("app.services.evaluations.evaluation.fetch_trace_scores_from_langfuse")
    @patch("app.services.evaluations.evaluation.get_langfuse_client")
    @patch("app.services.evaluations.evaluation.get_evaluation_run_by_id")
    @patch("app.core.storage_utils.load_json_from_object_store")
    @patch("app.core.cloud.storage.get_cloud_storage")
    def test_resync_bypasses_cache_and_fetches_langfuse(
        self,
        mock_get_storage,
        mock_load,
        mock_get_eval,
        mock_get_langfuse,
        mock_fetch_langfuse,
        mock_save_score,
        completed_eval_run_with_s3,
    ) -> None:
        """Verify resync=True skips S3/DB and fetches from Langfuse."""
        mock_get_eval.return_value = completed_eval_run_with_s3
        mock_get_langfuse.return_value = MagicMock()
        mock_fetch_langfuse.return_value = {
            "summary_scores": [],
            "traces": [{"trace_id": "new"}],
        }
        mock_save_score.return_value = completed_eval_run_with_s3

        get_evaluation_with_scores(
            session=MagicMock(),
            evaluation_id=100,
            organization_id=1,
            project_id=1,
            get_trace_info=True,
            resync_score=True,
        )

        mock_load.assert_not_called()  # S3 skipped
        mock_fetch_langfuse.assert_called_once()  # Langfuse called

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd backend && find . -name "test_evaluation_service_s3.py" -type f

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 137


🏁 Script executed:

# Get file info
wc -l backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 147


🏁 Script executed:

# Read lines 42-120 to verify content
sed -n '42,120p' backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3139


🏁 Script executed:

# Check imports and fixture definitions at the top of the file
head -50 backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1956


🏁 Script executed:

# Search for the test class and fixture definitions
rg -n "class Test|@pytest.fixture|def test_" backend/app/tests/services/evaluations/test_evaluation_service_s3.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 308


Add type hints to test method parameters and fixture return types.

Test methods and fixtures are missing type hint annotations. Per coding guidelines for **/*.py and backend/app/tests/**/*.py, all parameters and return values must have type hints.

Type hints to add

Test method parameters require type hints:

  • Mock parameters (e.g., mock_get_storage, mock_load) should be annotated as MagicMock
  • Fixture parameters (e.g., completed_eval_run_with_s3) should be annotated with their actual type (EvaluationRun)
  • self parameter should be annotated as "TestGetEvaluationWithScoresS3"

Fixtures also need return type annotations:

@pytest.fixture
def completed_eval_run_with_s3(self) -> EvaluationRun:
    ...

@pytest.fixture
def completed_eval_run_with_db_traces(self) -> EvaluationRun:
    ...
🤖 Prompt for AI Agents
In `@backend/app/tests/services/evaluations/test_evaluation_service_s3.py` around
lines 42 - 120, The tests lack type hints: add annotations to the test methods'
parameters and fixtures referenced in this file—annotate mock parameters
(mock_get_storage, mock_load, mock_get_eval, mock_get_langfuse,
mock_fetch_langfuse, mock_save_score) as MagicMock, annotate fixture parameters
completed_eval_run_with_s3 and completed_eval_run_with_db_traces as
EvaluationRun, and annotate self as "TestGetEvaluationWithScoresS3" on the
methods test_loads_traces_from_s3, test_returns_db_traces_when_no_s3_url, and
test_resync_bypasses_cache_and_fetches_langfuse; also add return type
annotations for the fixtures (completed_eval_run_with_s3 -> EvaluationRun,
completed_eval_run_with_db_traces -> EvaluationRun) and import MagicMock and
EvaluationRun types where needed.


Labels

enhancement (New feature or request), ready-for-review


Development

Successfully merging this pull request may close these issues.

Evaluation: Optimize fetch results
