feat: add Evaluations resource with read-only evaluation API support #183
Mirajul-Mohin wants to merge 5 commits into
Conversation
- Add Evaluations class with methods: list, get, run, preview, metrics, summary_stats, unique_sources, delete, optimize_skill, rerun_skill
- Add Pydantic models for all evaluation request/response types
- Wire Evaluations resource into main VLMRun client class
- Support dual base URLs (api for reads, agent for writes; see the sketch below)

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
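The dual-base-URL point is worth illustrating. Below is a minimal sketch, not the PR's actual code, of how a resource could route read calls to the api host and write calls to the agent host; the class name, attribute names, and URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvaluationsSketch:
    """Hypothetical resource that reads from the api host and writes via the agent host."""

    api_base_url: str = "https://api.example.com/v1"      # assumed read base URL
    agent_base_url: str = "https://agent.example.com/v1"  # assumed write base URL

    def _url(self, path: str, write: bool = False) -> str:
        # Pick the base URL per operation: writes (run, delete, ...) go to the agent host.
        base = self.agent_base_url if write else self.api_base_url
        return f"{base}/{path.lstrip('/')}"


evals = EvaluationsSketch()
print(evals._url("evaluations"))                  # read  -> api host
print(evals._url("evaluations/run", write=True))  # write -> agent host
```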
Code Review
This pull request introduces the Evaluations resource to the VLM Run SDK, adding a new Evaluations client class and corresponding Pydantic models to support listing, triggering, and analyzing evaluation runs. The feedback focuses on improving type safety and consistency across the SDK by utilizing defined Literals for source and evaluator types and adopting datetime objects for timestamp handling in both method arguments and response models.
```python
from __future__ import annotations

from typing import Any, Dict, List, Optional
```
```python
    EvaluationSummaryStatsResponse,
    EvaluationUniqueSourcesResponse,
    OptimizeSkillResponse,
    RerunSkillResponse,
```
```python
created_at_gte: Optional[str] = None,
created_at_lte: Optional[str] = None,
```
Method arguments for timestamps should ideally accept datetime objects in addition to strings. This allows the SDK to handle ISO formatting consistently.
Suggested change:
```diff
- created_at_gte: Optional[str] = None,
- created_at_lte: Optional[str] = None,
+ created_at_gte: Optional[Union[str, datetime]] = None,
+ created_at_lte: Optional[Union[str, datetime]] = None,
```
```python
if created_at_gte:
    params["created_at__gte"] = created_at_gte
if created_at_lte:
    params["created_at__lte"] = created_at_lte
```
Ensure datetime objects are serialized to ISO format before being sent as query parameters.
Suggested change:
```diff
- if created_at_gte:
-     params["created_at__gte"] = created_at_gte
- if created_at_lte:
-     params["created_at__lte"] = created_at_lte
+ if created_at_gte:
+     params["created_at__gte"] = created_at_gte.isoformat() if isinstance(created_at_gte, datetime) else created_at_gte
+ if created_at_lte:
+     params["created_at__lte"] = created_at_lte.isoformat() if isinstance(created_at_lte, datetime) else created_at_lte
```
```python
def run(
    self,
    source_type: str,
    skill_ids: Optional[List[str]] = None,
    data_from: Optional[str] = None,
    data_to: Optional[str] = None,
    evaluators: Optional[List[str]] = None,
```
| """Response model for an evaluation run.""" | ||
|
|
||
| id: str | ||
| source_type: str |
| source_id: Optional[str] = None | ||
| source_label: str | ||
| source_version: Optional[str] = None | ||
| status: str = "pending" |
| data_from: Optional[str] = None | ||
| data_to: Optional[str] = None | ||
| results: Dict[str, Any] = {} | ||
| created_at: Optional[str] = None |
Timestamp fields should use the datetime type instead of str to allow Pydantic to automatically parse ISO strings into Python datetime objects, matching the pattern used in other models like FileResponse and PredictionResponse.
Suggested change:
```diff
- data_from: Optional[str] = None
- data_to: Optional[str] = None
- results: Dict[str, Any] = {}
- created_at: Optional[str] = None
+ data_from: Optional[datetime] = None
+ data_to: Optional[datetime] = None
+ results: Dict[str, Any] = {}
+ created_at: Optional[datetime] = None
```
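For context on why the datetime annotation matters, here is a minimal standalone model, not the PR's actual class, showing that Pydantic parses incoming ISO strings into datetime objects when the field is typed that way.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class TimestampedSketch(BaseModel):
    """Illustrative model only; mirrors the typing pattern suggested above."""

    created_at: Optional[datetime] = None


item = TimestampedSketch(created_at="2024-05-01T12:30:00+00:00")
print(type(item.created_at))  # <class 'datetime.datetime'>
print(item.created_at.year)   # 2024
```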
```python
    exact_match_rate: Optional[float] = None
    source_type: Optional[str] = None
    source_label: Optional[str] = None
    created_at: str
```
- Replace Optional[X] with X | None, List[X] with list[X], Dict[K,V] with dict[K,V]
- Add Evaluations mock class to conftest.py
- Add tests/test_evaluations.py with 10 tests covering all methods

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
…wer feedback Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
…ompatibility Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
```python
    data_from: Optional[str] = None
    data_to: Optional[str] = None
    results: Dict[str, Any] = {}
    created_at: Optional[str] = None
```
🔴 created_at typed as str instead of datetime, breaking SDK-wide convention
EvaluationRunResponse.created_at is typed as Optional[str] (line 600), while every other model in types.py that has a created_at field uses datetime or Optional[datetime] (12 instances at vlmrun/client/types.py:50, :67, :83, :136, :144, :158, :178, :195, :217, :299, :311, :342). This means users get a raw ISO string instead of a parsed datetime object, which is inconsistent with the established API contract. The same issue affects AccuracyTrendPoint.created_at (line 629) and LatencyTrendPoint.created_at (line 645).
Prompt for agents
The created_at field in EvaluationRunResponse (line 600), AccuracyTrendPoint (line 629), and LatencyTrendPoint (line 645) are typed as str instead of datetime. Every other model in types.py uses datetime for created_at fields. These should be changed to Optional[datetime] (or datetime | None per AGENTS.md) for EvaluationRunResponse, and datetime for AccuracyTrendPoint and LatencyTrendPoint, to match the SDK convention. Similarly, latest_item_at in EvaluationPreviewResponse (line 616) should likely be Optional[datetime] as well, since it represents a timestamp. The datetime import is already available in the file (line 10).
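As a quick reference, a sketch of what the corrected fields might look like; the class and field names are taken from the comment above, and all other fields are omitted, so these are not the full definitions.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class EvaluationRunResponse(BaseModel):
    created_at: Optional[datetime] = None  # was Optional[str]


class AccuracyTrendPoint(BaseModel):
    created_at: datetime  # was str


class LatencyTrendPoint(BaseModel):
    created_at: datetime  # was str


# EvaluationPreviewResponse.latest_item_at would likewise become Optional[datetime].
```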
| """Response model for an evaluation run.""" | ||
|
|
||
| id: str | ||
| source_type: str |
🟡 EvaluationSourceType type alias defined but not applied to model field
EvaluationSourceType = Literal["agent", "request_domain", "skill"] is defined at line 578 but never used. The EvaluationRunResponse.source_type field (line 588) is typed as str instead of EvaluationSourceType, so Pydantic won't validate the value against the expected set. The same applies to EvaluatorType (lines 579-581) which is also unused. These type aliases were clearly created for use in the evaluation models but were never applied, losing the validation benefit that Pydantic would otherwise provide.
Suggested change:
```diff
- source_type: str
+ source_type: EvaluationSourceType
```
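A self-contained illustration of the validation this buys; the alias values are copied from the comment above, and the model is trimmed to a single field.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

EvaluationSourceType = Literal["agent", "request_domain", "skill"]


class RunSketch(BaseModel):
    source_type: EvaluationSourceType


print(RunSketch(source_type="skill"))  # accepted

try:
    RunSketch(source_type="pipeline")  # rejected: not one of the allowed literals
except ValidationError:
    print("invalid source_type rejected")
```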
```
@@ -0,0 +1,69 @@
"""Tests for evaluations operations."""
```
🔴 AGENTS.md violation: new test file missing from __future__ import annotations
The AGENTS.md mandates: "Use from __future__ import annotations at the top of every file." The new file tests/test_evaluations.py does not include this import.
| """Tests for evaluations operations.""" | |
| """Tests for evaluations operations.""" | |
| from __future__ import annotations | |
Summary
Adds a new `Evaluations` resource to the Python SDK, exposing read-only evaluation API endpoints. Changes across 5 files:

- `vlmrun/client/evaluations.py` (new): `Evaluations` class with 7 methods — `list`, `get`, `preview`, `metrics`, `summary_stats`, `unique_sources`, `delete`
- `vlmrun/client/types.py`: 9 new Pydantic models for evaluation response types (`EvaluationRunResponse`, `EvaluationMetricsResponse`, `AccuracyTrendPoint`, etc.) plus `EvaluationSourceType` and `EvaluatorType` Literal aliases
- `vlmrun/client/client.py`: wires `Evaluations` into the `VLMRun` client as `self.evaluations`
- `tests/test_evaluations.py` (new): 7 unit tests covering all public methods
- `tests/conftest.py`: `Evaluations` mock class added to `MockVLMRun`

Follows existing resource patterns (Feedback, Artifacts, etc.) using `APIRequestor` for HTTP calls. Pydantic model fields use `Optional[X]` / `List[T]` / `Dict[K, V]` from `typing` for Python 3.9 compatibility.

Updates since last revision

- Removed the write methods (`run`, `optimize_skill`, `rerun_skill`) and their associated request/response types per reviewer feedback — only endpoints available in the vlm-lab backend are included.
- Reverted to `Optional[X]`, `List[T]`, `Dict[K, V]` from `typing` instead of `str | None` / `list[T]` / `dict[K, V]` syntax (Pydantic evaluates field annotations at runtime, so `from __future__ import annotations` does not help).

Review & Testing Checklist for Human

- Model field shapes (`results: Dict[str, Any]`, nullable fields) may not match the real API contract. Deserializing a live response is the only way to confirm.
- `list()` sends `created_at__gte` / `created_at__lte` (Django double-underscore style). Confirm the backend expects this exact format.
- `EvaluationSourceType` and `EvaluatorType` Literals are defined but unused in method signatures — `preview()` and `metrics()` accept plain `str` for `source_type`. Consider whether these should use the Literal types for compile-time safety, or be removed.

Suggested manual test plan: Instantiate `VLMRun` with a valid API key, call `client.evaluations.list()` and `client.evaluations.summary_stats()`, and confirm the responses deserialize without `ValidationError` (a rough script is sketched below).
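Here is a rough version of that manual check as a script; the import path and constructor arguments are assumptions about the SDK's entry point, not confirmed by this PR.

```python
# Assumptions: VLMRun is importable from vlmrun.client.client and accepts api_key;
# adjust both to the SDK's actual public entry point.
from vlmrun.client.client import VLMRun

client = VLMRun(api_key="sk-...")  # hypothetical constructor signature

runs = client.evaluations.list(limit=5)
stats = client.evaluations.summary_stats()

# If both calls return without raising pydantic.ValidationError,
# the new response models line up with the live API payloads.
print(runs)
print(stats)
```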
Notes

- `list()` defaults: `limit=30`, `offset=0`, `order_by="created_at"`, `descending=True`. Verify these are appropriate for the backend.

Link to Devin session: https://app.devin.ai/sessions/e8157dfb67844254bf3d6c358c57ebe5
Requested by: @Mirajul-Mohin