
feat: add Evaluations resource with read-only evaluation API support #183

Open

Mirajul-Mohin wants to merge 5 commits into main from devin/1775664104-add-evaluations-resource

feat: add Evaluations resource with read-only evaluation API support#183
Mirajul-Mohin wants to merge 5 commits into
mainfrom
devin/1775664104-add-evaluations-resource

Conversation

@Mirajul-Mohin (Contributor) commented Apr 8, 2026

Summary

Adds a new Evaluations resource to the Python SDK, exposing read-only evaluation API endpoints. Changes across 5 files:

  • vlmrun/client/evaluations.py (new): Evaluations class with 7 methods — list, get, preview, metrics, summary_stats, unique_sources, delete
  • vlmrun/client/types.py: 9 new Pydantic models for evaluation response types (EvaluationRunResponse, EvaluationMetricsResponse, AccuracyTrendPoint, etc.) plus EvaluationSourceType and EvaluatorType Literal aliases
  • vlmrun/client/client.py: Wires Evaluations into the VLMRun client as self.evaluations
  • tests/test_evaluations.py (new): 7 unit tests covering all public methods
  • tests/conftest.py: Evaluations mock class added to MockVLMRun

Follows existing resource patterns (Feedback, Artifacts, etc.) using APIRequestor for HTTP calls. Pydantic model fields use Optional[X] / List[T] / Dict[K, V] from typing for Python 3.9 compatibility.
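For readers unfamiliar with that pattern, here is a minimal sketch of the resource shape described above; the requestor interface, method signature, and endpoint path are illustrative assumptions, not the SDK's actual API:

```python
from typing import Any, Dict, Optional

class StubRequestor:
    """Stand-in for the SDK's APIRequestor; its real signature is an assumption here."""

    def request(self, method: str, path: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # A real requestor would perform the HTTP call; this stub echoes its inputs.
        return {"method": method, "path": path, "params": params or {}}

class Evaluations:
    """Read-only evaluations resource following the Feedback/Artifacts pattern."""

    def __init__(self, requestor: StubRequestor) -> None:
        self._requestor = requestor

    def list(self, skip: int = 0, limit: int = 10) -> Dict[str, Any]:
        return self._requestor.request("GET", "/evaluations", params={"skip": skip, "limit": limit})

resp = Evaluations(StubRequestor()).list(limit=5)
print(resp["params"])  # {'skip': 0, 'limit': 5}
```

The key design choice is that the resource class holds no HTTP logic of its own; each public method is a thin wrapper that builds parameters and delegates to the shared requestor.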

Updates since last revision

  • Removed write operations (run, optimize_skill, rerun_skill) and their associated request/response types per reviewer feedback — only endpoints available in the vlm-lab backend are included.
  • Fixed Python 3.9 compatibility: Pydantic model field annotations now use Optional[X], List[T], Dict[K, V] from typing instead of str | None / list[T] / dict[K, V] syntax (Pydantic evaluates field annotations at runtime, so from __future__ import annotations does not help).
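The distinction can be demonstrated with the stdlib alone: get_type_hints evaluates annotations at runtime the same way Pydantic does, so typing-module generics work on Python 3.9 where PEP 604 unions would fail.

```python
from typing import Dict, List, Optional, get_type_hints

class EvalFilters:
    # typing-module generics resolve at runtime on Python 3.9+.
    # Writing `skill_id: str | None` instead would raise TypeError on 3.9
    # when the annotations are evaluated, even under
    # `from __future__ import annotations`, because Pydantic (like
    # get_type_hints here) resolves annotation strings back to real objects.
    skill_id: Optional[str]
    evaluators: Optional[List[str]]
    results: Dict[str, object]

hints = get_type_hints(EvalFilters)
print(sorted(hints))  # ['evaluators', 'results', 'skill_id']
```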

Review & Testing Checklist for Human

  • Validate type definitions against actual backend API schema. All Pydantic models were inferred from vlm-cloud frontend usage, not the backend OpenAPI spec. Field names, types, and optionality (e.g., results: Dict[str, Any], nullable fields) may not match the real API contract. Deserializing a live response is the only way to confirm.
  • Verify query parameter naming conventions. list() sends created_at__gte / created_at__lte (Django double-underscore style). Confirm the backend expects this exact format.
  • EvaluationSourceType and EvaluatorType Literals are defined but unused in method signatures; preview() and metrics() accept plain str for source_type. Consider whether these should use the Literal types for compile-time safety, or be removed.
  • Tests are mock-based only. 7 unit tests exercise mock responses, not real API serialization. Recommend a manual smoke test against a live endpoint.
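On the Literal question in the checklist above, annotating with the alias gives static checkers the constraint, and a small runtime guard can back it up. The alias values below are the ones defined in types.py per the review comments; the guard itself is only a sketch, not code from the PR:

```python
from typing import Literal, get_args

# Alias as defined in vlmrun/client/types.py.
EvaluationSourceType = Literal["agent", "request_domain", "skill"]

def preview(source_type: EvaluationSourceType) -> str:
    # Literal constrains static checkers only, so validate at runtime too.
    if source_type not in get_args(EvaluationSourceType):
        raise ValueError(f"unsupported source_type: {source_type!r}")
    return f"GET /evaluations/preview?source_type={source_type}"

print(preview("skill"))  # GET /evaluations/preview?source_type=skill
```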

Suggested manual test plan: Instantiate VLMRun with a valid API key, call client.evaluations.list() and client.evaluations.summary_stats(), and confirm the responses deserialize without ValidationError.

Notes

Link to Devin session: https://app.devin.ai/sessions/e8157dfb67844254bf3d6c358c57ebe5
Requested by: @Mirajul-Mohin



- Add Evaluations class with methods: list, get, run, preview, metrics, summary_stats, unique_sources, delete, optimize_skill, rerun_skill
- Add Pydantic models for all evaluation request/response types
- Wire Evaluations resource into main VLMRun client class
- Support dual base URLs (api for reads, agent for writes)

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
@devin-ai-integration (Contributor)

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.


devin-ai-integration[bot]

This comment was marked as resolved.


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces the Evaluations resource to the VLM Run SDK, adding a new Evaluations client class and corresponding Pydantic models to support listing, triggering, and analyzing evaluation runs. The feedback focuses on improving type safety and consistency across the SDK by utilizing defined Literals for source and evaluator types and adopting datetime objects for timestamp handling in both method arguments and response models.

Comment thread vlmrun/client/evaluations.py Outdated

from __future__ import annotations

from typing import Any, Dict, List, Optional

medium

The datetime module should be imported to support datetime objects in method arguments for better user experience and consistency with other SDK resources.

Suggested change
- from typing import Any, Dict, List, Optional
+ from datetime import datetime
+ from typing import Any, Dict, List, Optional, Union

Comment thread vlmrun/client/evaluations.py Outdated
EvaluationSummaryStatsResponse,
EvaluationUniqueSourcesResponse,
OptimizeSkillResponse,
RerunSkillResponse,

medium

The EvaluationSourceType and EvaluatorType literals should be imported and used in method signatures to provide better type safety and IDE support.

Suggested change
- RerunSkillResponse,
+ OptimizeSkillResponse,
+ RerunSkillResponse,
+ EvaluationSourceType,
+ EvaluatorType,
+ )

Comment thread vlmrun/client/evaluations.py Outdated
Comment on lines +39 to +40
created_at_gte: Optional[str] = None,
created_at_lte: Optional[str] = None,

medium

Method arguments for timestamps should ideally accept datetime objects in addition to strings. This allows the SDK to handle ISO formatting consistently.

Suggested change
- created_at_gte: Optional[str] = None,
- created_at_lte: Optional[str] = None,
+ created_at_gte: Optional[Union[str, datetime]] = None,
+ created_at_lte: Optional[Union[str, datetime]] = None,

Comment on lines +61 to +64
if created_at_gte:
params["created_at__gte"] = created_at_gte
if created_at_lte:
params["created_at__lte"] = created_at_lte

medium

Ensure datetime objects are serialized to ISO format before being sent as query parameters.

Suggested change
- if created_at_gte:
-     params["created_at__gte"] = created_at_gte
- if created_at_lte:
-     params["created_at__lte"] = created_at_lte
+ if created_at_gte:
+     params["created_at__gte"] = created_at_gte.isoformat() if isinstance(created_at_gte, datetime) else created_at_gte
+ if created_at_lte:
+     params["created_at__lte"] = created_at_lte.isoformat() if isinstance(created_at_lte, datetime) else created_at_lte

Comment thread vlmrun/client/evaluations.py Outdated

def run(
self,
source_type: str,

medium

Use the EvaluationSourceType literal for the source_type parameter to improve type safety.

Suggested change
- source_type: str,
+ source_type: EvaluationSourceType,

Comment thread vlmrun/client/evaluations.py Outdated
skill_ids: Optional[List[str]] = None,
data_from: Optional[str] = None,
data_to: Optional[str] = None,
evaluators: Optional[List[str]] = None,

medium

Use the EvaluatorType literal for the evaluators parameter to improve type safety.

Suggested change
- evaluators: Optional[List[str]] = None,
+ evaluators: Optional[List[EvaluatorType]] = None,

Comment thread vlmrun/client/types.py
"""Response model for an evaluation run."""

id: str
source_type: str

medium

The source_type field should use the EvaluationSourceType literal for better validation and consistency.

Suggested change
- source_type: str
+ source_type: EvaluationSourceType

Comment thread vlmrun/client/types.py
source_id: Optional[str] = None
source_label: str
source_version: Optional[str] = None
status: str = "pending"

medium

The status field should use the existing JobStatus literal to maintain consistency across the SDK.

Suggested change
- status: str = "pending"
+ status: JobStatus = "pending"

Comment thread vlmrun/client/types.py
Comment on lines +597 to +600
data_from: Optional[str] = None
data_to: Optional[str] = None
results: Dict[str, Any] = {}
created_at: Optional[str] = None

medium

Timestamp fields should use the datetime type instead of str to allow Pydantic to automatically parse ISO strings into Python datetime objects, matching the pattern used in other models like FileResponse and PredictionResponse.

Suggested change
- data_from: Optional[str] = None
- data_to: Optional[str] = None
- results: Dict[str, Any] = {}
- created_at: Optional[str] = None
+ data_from: Optional[datetime] = None
+ data_to: Optional[datetime] = None
+ results: Dict[str, Any] = {}
+ created_at: Optional[datetime] = None
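What that datetime typing buys can be sketched with the stdlib: an ISO-8601 string from the API becomes a real datetime object, so callers can compare and format without manual string handling. This mirrors, rather than reproduces, Pydantic's coercion, and the raw value is an assumed example:

```python
from datetime import datetime, timezone

raw = "2026-04-08T12:30:00+00:00"  # assumed shape of a created_at value from the API
created_at = datetime.fromisoformat(raw)

# With a parsed datetime, comparison and formatting need no string parsing.
print(created_at < datetime(2027, 1, 1, tzinfo=timezone.utc))  # True
print(created_at.strftime("%Y-%m-%d"))                         # 2026-04-08
```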

Comment thread vlmrun/client/types.py
exact_match_rate: Optional[float] = None
source_type: Optional[str] = None
source_label: Optional[str] = None
created_at: str

medium

The created_at field should be typed as datetime for consistency with other SDK models.

Suggested change
- created_at: str
+ created_at: datetime

- Replace Optional[X] with X | None, List[X] with list[X], Dict[K,V] with dict[K,V]
- Add Evaluations mock class to conftest.py
- Add tests/test_evaluations.py with 10 tests covering all methods

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
…wer feedback

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
…ompatibility

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
@devin-ai-integration devin-ai-integration Bot changed the title feat: add Evaluations resource with full evaluation API support feat: add Evaluations resource with read-only evaluation API support Apr 8, 2026
@devin-ai-integration (Bot) left a comment

Devin Review found 3 new potential issues.

View 7 additional findings in Devin Review.


Comment thread vlmrun/client/types.py
data_from: Optional[str] = None
data_to: Optional[str] = None
results: Dict[str, Any] = {}
created_at: Optional[str] = None

🔴 created_at typed as str instead of datetime, breaking SDK-wide convention

EvaluationRunResponse.created_at is typed as Optional[str] (line 600), while every other model in types.py that has a created_at field uses datetime or Optional[datetime] (12 instances at vlmrun/client/types.py:50, :67, :83, :136, :144, :158, :178, :195, :217, :299, :311, :342). This means users get a raw ISO string instead of a parsed datetime object, which is inconsistent with the established API contract. The same issue affects AccuracyTrendPoint.created_at (line 629) and LatencyTrendPoint.created_at (line 645).

Prompt for agents
The created_at field in EvaluationRunResponse (line 600), AccuracyTrendPoint (line 629), and LatencyTrendPoint (line 645) are typed as str instead of datetime. Every other model in types.py uses datetime for created_at fields. These should be changed to Optional[datetime] (or datetime | None per AGENTS.md) for EvaluationRunResponse, and datetime for AccuracyTrendPoint and LatencyTrendPoint, to match the SDK convention. Similarly, latest_item_at in EvaluationPreviewResponse (line 616) should likely be Optional[datetime] as well, since it represents a timestamp. The datetime import is already available in the file (line 10).


Comment thread vlmrun/client/types.py
"""Response model for an evaluation run."""

id: str
source_type: str

🟡 EvaluationSourceType type alias defined but not applied to model field

EvaluationSourceType = Literal["agent", "request_domain", "skill"] is defined at line 578 but never used. The EvaluationRunResponse.source_type field (line 588) is typed as str instead of EvaluationSourceType, so Pydantic won't validate the value against the expected set. The same applies to EvaluatorType (lines 579-581) which is also unused. These type aliases were clearly created for use in the evaluation models but were never applied, losing the validation benefit that Pydantic would otherwise provide.

Suggested change
- source_type: str
+ source_type: EvaluationSourceType


Comment thread tests/test_evaluations.py
@@ -0,0 +1,69 @@
"""Tests for evaluations operations."""

🔴 AGENTS.md violation: new test file missing from __future__ import annotations

The AGENTS.md mandates: "Use from __future__ import annotations at the top of every file." The new file tests/test_evaluations.py does not include this import.

Suggested change
- """Tests for evaluations operations."""
+ """Tests for evaluations operations."""
+ from __future__ import annotations

