feat: add Evaluations resource with read-only evaluation API support #183
Mirajul-Mohin wants to merge 5 commits into
Conversation
- Add Evaluations class with methods: list, get, run, preview, metrics, summary_stats, unique_sources, delete, optimize_skill, rerun_skill
- Add Pydantic models for all evaluation request/response types
- Wire Evaluations resource into main VLMRun client class
- Support dual base URLs (api for reads, agent for writes; see the sketch below)

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
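The dual-base-URL point is worth illustrating. Below is a minimal sketch, not the PR's actual code, of how a resource could route read calls to the api host and write calls to the agent host; the class name, attribute names, and URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvaluationsSketch:
    """Hypothetical resource that reads from the api host and writes via the agent host."""

    api_base_url: str = "https://api.example.com/v1"      # assumed read base URL
    agent_base_url: str = "https://agent.example.com/v1"  # assumed write base URL

    def _url(self, path: str, write: bool = False) -> str:
        # Pick the base URL per operation: writes (run, delete, ...) go to the agent host.
        base = self.agent_base_url if write else self.api_base_url
        return f"{base}/{path.lstrip('/')}"


evals = EvaluationsSketch()
print(evals._url("evaluations"))                  # read  -> api host
print(evals._url("evaluations/run", write=True))  # write -> agent host
```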
Code Review
This pull request introduces the Evaluations resource to the VLM Run SDK, adding a new Evaluations client class and corresponding Pydantic models to support listing, triggering, and analyzing evaluation runs. The feedback focuses on improving type safety and consistency across the SDK by utilizing defined Literals for source and evaluator types and adopting datetime objects for timestamp handling in both method arguments and response models.
```python
from __future__ import annotations

from typing import Any, Dict, List, Optional
```
```python
    EvaluationSummaryStatsResponse,
    EvaluationUniqueSourcesResponse,
    OptimizeSkillResponse,
    RerunSkillResponse,
```
```python
created_at_gte: Optional[str] = None,
created_at_lte: Optional[str] = None,
```
Method arguments for timestamps should ideally accept datetime objects in addition to strings. This allows the SDK to handle ISO formatting consistently.
Suggested change:
```diff
- created_at_gte: Optional[str] = None,
- created_at_lte: Optional[str] = None,
+ created_at_gte: Optional[Union[str, datetime]] = None,
+ created_at_lte: Optional[Union[str, datetime]] = None,
```
```python
if created_at_gte:
    params["created_at__gte"] = created_at_gte
if created_at_lte:
    params["created_at__lte"] = created_at_lte
```
Ensure datetime objects are serialized to ISO format before being sent as query parameters.
Suggested change:
```diff
- if created_at_gte:
-     params["created_at__gte"] = created_at_gte
- if created_at_lte:
-     params["created_at__lte"] = created_at_lte
+ if created_at_gte:
+     params["created_at__gte"] = created_at_gte.isoformat() if isinstance(created_at_gte, datetime) else created_at_gte
+ if created_at_lte:
+     params["created_at__lte"] = created_at_lte.isoformat() if isinstance(created_at_lte, datetime) else created_at_lte
```
```python
def run(
    self,
    source_type: str,
    skill_ids: Optional[List[str]] = None,
    data_from: Optional[str] = None,
    data_to: Optional[str] = None,
    evaluators: Optional[List[str]] = None,
```
| """Response model for an evaluation run.""" | ||
|
|
||
| id: str | ||
| source_type: str |
| source_id: Optional[str] = None | ||
| source_label: str | ||
| source_version: Optional[str] = None | ||
| status: str = "pending" |
| data_from: Optional[str] = None | ||
| data_to: Optional[str] = None | ||
| results: Dict[str, Any] = {} | ||
| created_at: Optional[str] = None |
Timestamp fields should use the datetime type instead of str to allow Pydantic to automatically parse ISO strings into Python datetime objects, matching the pattern used in other models like FileResponse and PredictionResponse.
Suggested change:
```diff
- data_from: Optional[str] = None
- data_to: Optional[str] = None
- results: Dict[str, Any] = {}
- created_at: Optional[str] = None
+ data_from: Optional[datetime] = None
+ data_to: Optional[datetime] = None
+ results: Dict[str, Any] = {}
+ created_at: Optional[datetime] = None
```
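For context on why the datetime annotation matters, here is a minimal standalone model, not the PR's actual class, showing that Pydantic parses incoming ISO strings into datetime objects when the field is typed that way.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class TimestampedSketch(BaseModel):
    """Illustrative model only; mirrors the typing pattern suggested above."""

    created_at: Optional[datetime] = None


item = TimestampedSketch(created_at="2024-05-01T12:30:00+00:00")
print(type(item.created_at))  # <class 'datetime.datetime'>
print(item.created_at.year)   # 2024
```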
```python
    exact_match_rate: Optional[float] = None
    source_type: Optional[str] = None
    source_label: Optional[str] = None
    created_at: str
```
- Replace Optional[X] with X | None, List[X] with list[X], Dict[K,V] with dict[K,V]
- Add Evaluations mock class to conftest.py
- Add tests/test_evaluations.py with 10 tests covering all methods

Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
…wer feedback Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
…ompatibility Co-Authored-By: mohin@vlm.run <mohinmirajul@gmail.com>
```python
    data_from: Optional[str] = None
    data_to: Optional[str] = None
    results: Dict[str, Any] = {}
    created_at: Optional[str] = None
```
🔴 created_at typed as str instead of datetime, breaking SDK-wide convention
EvaluationRunResponse.created_at is typed as Optional[str] (line 600), while every other model in types.py that has a created_at field uses datetime or Optional[datetime] (12 instances at vlmrun/client/types.py:50, :67, :83, :136, :144, :158, :178, :195, :217, :299, :311, :342). This means users get a raw ISO string instead of a parsed datetime object, which is inconsistent with the established API contract. The same issue affects AccuracyTrendPoint.created_at (line 629) and LatencyTrendPoint.created_at (line 645).
Prompt for agents
The created_at field in EvaluationRunResponse (line 600), AccuracyTrendPoint (line 629), and LatencyTrendPoint (line 645) are typed as str instead of datetime. Every other model in types.py uses datetime for created_at fields. These should be changed to Optional[datetime] (or datetime | None per AGENTS.md) for EvaluationRunResponse, and datetime for AccuracyTrendPoint and LatencyTrendPoint, to match the SDK convention. Similarly, latest_item_at in EvaluationPreviewResponse (line 616) should likely be Optional[datetime] as well, since it represents a timestamp. The datetime import is already available in the file (line 10).
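As a quick reference, a sketch of what the corrected fields might look like; the class and field names are taken from the comment above, and all other fields are omitted, so these are not the full definitions.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class EvaluationRunResponse(BaseModel):
    created_at: Optional[datetime] = None  # was Optional[str]


class AccuracyTrendPoint(BaseModel):
    created_at: datetime  # was str


class LatencyTrendPoint(BaseModel):
    created_at: datetime  # was str


# EvaluationPreviewResponse.latest_item_at would likewise become Optional[datetime].
```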
| """Response model for an evaluation run.""" | ||
|
|
||
| id: str | ||
| source_type: str |
🟡 EvaluationSourceType type alias defined but not applied to model field
EvaluationSourceType = Literal["agent", "request_domain", "skill"] is defined at line 578 but never used. The EvaluationRunResponse.source_type field (line 588) is typed as str instead of EvaluationSourceType, so Pydantic won't validate the value against the expected set. The same applies to EvaluatorType (lines 579-581) which is also unused. These type aliases were clearly created for use in the evaluation models but were never applied, losing the validation benefit that Pydantic would otherwise provide.
Suggested change:
```diff
- source_type: str
+ source_type: EvaluationSourceType
```
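A self-contained illustration of the validation this buys; the alias values are copied from the comment above, and the model is trimmed to a single field.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

EvaluationSourceType = Literal["agent", "request_domain", "skill"]


class RunSketch(BaseModel):
    source_type: EvaluationSourceType


print(RunSketch(source_type="skill"))  # accepted

try:
    RunSketch(source_type="pipeline")  # rejected: not one of the allowed literals
except ValidationError:
    print("invalid source_type rejected")
```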
```
@@ -0,0 +1,69 @@
"""Tests for evaluations operations."""
```
🔴 AGENTS.md violation: new test file missing from __future__ import annotations
The AGENTS.md mandates: "Use from __future__ import annotations at the top of every file." The new file tests/test_evaluations.py does not include this import.
| """Tests for evaluations operations.""" | |
| """Tests for evaluations operations.""" | |
| from __future__ import annotations | |
Summary
Adds a new `Evaluations` resource to the Python SDK, exposing read-only evaluation API endpoints. Changes across 5 files:

- `vlmrun/client/evaluations.py` (new): `Evaluations` class with 7 methods — `list`, `get`, `preview`, `metrics`, `summary_stats`, `unique_sources`, `delete`
- `vlmrun/client/types.py`: 9 new Pydantic models for evaluation response types (`EvaluationRunResponse`, `EvaluationMetricsResponse`, `AccuracyTrendPoint`, etc.) plus `EvaluationSourceType` and `EvaluatorType` Literal aliases
- `vlmrun/client/client.py`: wires `Evaluations` into the `VLMRun` client as `self.evaluations`
- `tests/test_evaluations.py` (new): 7 unit tests covering all public methods
- `tests/conftest.py`: `Evaluations` mock class added to `MockVLMRun`

Follows existing resource patterns (Feedback, Artifacts, etc.) using `APIRequestor` for HTTP calls. Pydantic model fields use `Optional[X]` / `List[T]` / `Dict[K, V]` from `typing` for Python 3.9 compatibility.

Updates since last revision

- Removed the write methods (`run`, `optimize_skill`, `rerun_skill`) and their associated request/response types per reviewer feedback — only endpoints available in the vlm-lab backend are included.
- Reverted to `Optional[X]`, `List[T]`, `Dict[K, V]` from `typing` instead of `str | None` / `list[T]` / `dict[K, V]` syntax (Pydantic evaluates field annotations at runtime, so `from __future__ import annotations` does not help).

Review & Testing Checklist for Human

- Model field shapes (`results: Dict[str, Any]`, nullable fields) may not match the real API contract. Deserializing a live response is the only way to confirm.
- `list()` sends `created_at__gte` / `created_at__lte` (Django double-underscore style). Confirm the backend expects this exact format.
- `EvaluationSourceType` and `EvaluatorType` Literals are defined but unused in method signatures — `preview()` and `metrics()` accept plain `str` for `source_type`. Consider whether these should use the Literal types for compile-time safety, or be removed.

Suggested manual test plan: Instantiate `VLMRun` with a valid API key, call `client.evaluations.list()` and `client.evaluations.summary_stats()`, and confirm the responses deserialize without `ValidationError` (a rough script is sketched below).
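Here is a rough version of that manual check as a script; the import path and constructor arguments are assumptions about the SDK's entry point, not confirmed by this PR.

```python
# Assumptions: VLMRun is importable from vlmrun.client.client and accepts api_key;
# adjust both to the SDK's actual public entry point.
from vlmrun.client.client import VLMRun

client = VLMRun(api_key="sk-...")  # hypothetical constructor signature

runs = client.evaluations.list(limit=5)
stats = client.evaluations.summary_stats()

# If both calls return without raising pydantic.ValidationError,
# the new response models line up with the live API payloads.
print(runs)
print(stats)
```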
Notes

- `list()` defaults: `limit=30`, `offset=0`, `order_by="created_at"`, `descending=True`. Verify these are appropriate for the backend.

Link to Devin session: https://app.devin.ai/sessions/e8157dfb67844254bf3d6c358c57ebe5
Requested by: @Mirajul-Mohin