FEAT: Score partial content from content-filtered responses #1689
jsong468 wants to merge 9 commits into microsoft:main
Conversation
Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me.
    scores = await self._score_async(
        message,
        objective=objective,
        score_blocked_content=score_blocked_content,
Just calling out that this addition is not fully backward compatible: unconditionally forwarding score_blocked_content to _score_async breaks subclasses that override it with the previous signature ((self, message, *, objective=None)). You can verify this by creating a Scorer subclass based on the old signature and the default score_async; the call raises a TypeError for the unexpected keyword argument score_blocked_content. Since backward compatibility is called out in the PR description, you may want to either note this there or change this to a conditional forward?
Good point. The logic for substituting partial content can be moved to the public-facing score_async to preserve backward compatibility.
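The compatibility concern raised above can be reproduced in isolation. The sketch below is not PyRIT code: `BaseScorer`, `LegacyScorer`, `score_unconditional`, and `score_conditional` are hypothetical names used only to show why an unconditional keyword forward fails against an old-signature override, and what a conditional forward based on `inspect.signature` might look like.

```python
import asyncio
import inspect


class BaseScorer:
    """Hypothetical stand-in for a scorer base class (not PyRIT's implementation)."""

    async def score_unconditional(self, message, *, objective=None, score_blocked_content=False):
        # Always forwards the new keyword, so overrides written against the
        # old signature fail with TypeError.
        return await self._score_async(
            message, objective=objective, score_blocked_content=score_blocked_content
        )

    async def score_conditional(self, message, *, objective=None, score_blocked_content=False):
        # Forward the new keyword only when the override declares it
        # (or accepts arbitrary keyword arguments).
        params = inspect.signature(self._score_async).parameters
        accepts = "score_blocked_content" in params or any(
            p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
        )
        extra = {"score_blocked_content": score_blocked_content} if accepts else {}
        return await self._score_async(message, objective=objective, **extra)

    async def _score_async(self, message, *, objective=None, score_blocked_content=False):
        raise NotImplementedError


class LegacyScorer(BaseScorer):
    # External subclass written against the previous signature.
    async def _score_async(self, message, *, objective=None):
        return [f"scored:{message}"]


def demo():
    scorer = LegacyScorer()
    try:
        asyncio.run(scorer.score_unconditional("hi"))
        unconditional_ok = True
    except TypeError:
        unconditional_ok = False
    conditional_result = asyncio.run(scorer.score_conditional("hi"))
    return unconditional_ok, conditional_result
```

Moving the substitution into the public `score_async`, as the reply suggests, avoids the signature inspection altogether because `_score_async` never sees a new keyword.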
    # When True, blocked responses that contain partial model output (e.g., from Azure Content Safety
    # triggering mid-generation) will be evaluated by scorers instead of being skipped or
    # auto-classified as failures/refusals.
    score_blocked_content: bool = False
If this were an attribute on the scorer, it wouldn't change this
    role_filter: Optional[ChatMessageRole] = None,
    skip_on_error_result: bool = False,
    infer_objective_from_request: bool = False,
    score_blocked_content: bool = False,
I think default True is better
I think default False is preferable for now because having a response filtered is still a meaningful result from the model that should be captured by default rather than silently going past and continuing an attack.
    if use_partial_content and piece.is_blocked() and "partial_content" in piece.prompt_metadata:
        text = str(piece.prompt_metadata["partial_content"])
    else:
        text = piece.converted_value
    conversation_text += f"{role_display}: {text}\n"
We should perhaps tell the scorer that this is partial content because the full response was blocked? Isn't that relevant information?
I think it's relevant, but it kind of defeats the point of scoring blocked content. Telling the scorer could bias it in an unknown way; if the user does not want the partial content to be scored, they can set the flag to False. Otherwise, partial content is scored as if it were normal content (which is what the flag is for); anything else would leave us with a blurry middle ground.
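To make the behavior discussed in this thread concrete, here is a minimal, self-contained sketch of the conversation-text construction from the diff above. `Piece` and `build_conversation_text` are hypothetical stand-ins, not PyRIT classes; only the branch on `is_blocked()` and `prompt_metadata["partial_content"]` mirrors the quoted code. Note that the partial text is emitted without any marker, which is the design choice defended in the reply.

```python
from dataclasses import dataclass, field


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece; field names follow the diff."""

    role: str
    converted_value: str
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)

    def is_blocked(self) -> bool:
        return self.response_error == "blocked"


def build_conversation_text(pieces, use_partial_content: bool) -> str:
    """Join pieces into 'role: text' lines, optionally using partial content."""
    conversation_text = ""
    for piece in pieces:
        if use_partial_content and piece.is_blocked() and "partial_content" in piece.prompt_metadata:
            # Blocked piece with preserved partial output: score it like normal text.
            text = str(piece.prompt_metadata["partial_content"])
        else:
            # Default path: use the stored value (error JSON for blocked pieces).
            text = piece.converted_value
        conversation_text += f"{piece.role}: {text}\n"
    return conversation_text
```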
Description
When OpenAI/Azure Chat Completions content filters trigger mid-generation (HTTP 200 with `finish_reason=content_filter`), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.

This PR introduces a `score_blocked_content` attribute on the `Scorer` base class that allows scorers to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.

Target layer: partial content extraction
- [...] `_extract_partial_content(response)` template method to `OpenAITarget` base class, returning `None` by default
- `OpenAIChatTarget` overrides it to extract `response.choices[0].message.content`
- `_handle_content_filter_response` calls the hook and attaches the result to `prompt_metadata["partial_content"]` on the blocked `MessagePiece` before it is persisted to the DB
- [...] `response_error="blocked"`: full backward compatibility
- [...] `_extract_partial_content` hook can be overridden if this changes in the future.

Scorer layer: instance attribute on `Scorer`
- [...] `score_blocked_content: bool = False` class attribute on `Scorer`
- [...] `True`, `score_async` creates a modified message where blocked pieces with `partial_content` metadata are replaced with text-type substitutes (`response_error="none"`, `converted_value_data_type="text"`, `converted_value=<partial text>`) before passing to `_score_async`
- [...] `response_error="none"` so scorer short-circuits (e.g., refusal scorer's `if response_error == "blocked"` check) do not fire, and the LLM evaluates the actual content
- [...] `skip_on_error_result=True` and `self.score_blocked_content=True`, blocked messages with partial content are not skipped.
- [...] `score_async` (not `_score_async`) so that `_score_async`'s signature remains `(self, message, *, objective=None)`, preserving full backward compatibility for external subclasses
- [...] `Score` references the original blocked piece's ID
- `ConversationScorer` reads `self.score_blocked_content` directly when building conversation text from DB history, using `partial_content` from metadata instead of `converted_value` for blocked pieces

Design decision: instance attribute, not call-site parameter or `AttackScoringConfig` field
Previous design considerations:
[...] call-site parameter on `score_async`? Earlier iterations added `score_blocked_content` as a parameter on `score_async`, `score_response_async`, and `score_response_multiple_scorers_async`, threaded through from `AttackScoringConfig` via each attack's scoring calls. This had two problems:
- [...] `score_async` must remember to forward the parameter.
- [...] `score_blocked_content` parameter to `_score_async` broke external subclasses that override it with the original `(self, message, *, objective=None)` signature. Moving the substitution before `_score_async` solved this, but the forwarding burden through public APIs remained.

Why not on `AttackScoringConfig`? Putting the flag on the config centralizes it but still requires attacks to thread it through scoring calls. Adding it at the config level introduces complexity without significant benefit to the user.

Why an instance attribute?
`score_blocked_content` is a scorer behavioral policy ("should this scorer evaluate partial content?"), similar to `_validator` or `_score_aggregator`. Setting it on the instance:
- [...] `score_async` reads `self.score_blocked_content`. No parameters to thread through `score_response_async`, attacks, or scorer wrappers.
- [...] `score_async`, `_score_async`, `score_response_async`, or `score_response_multiple_scorers_async` signatures.
- [...] `ConversationScorer` handling: reads `self.score_blocked_content` directly instead of inferring substitution state from message piece metadata.

[...] `score_blocked_content` is mutable instance state, a shared scorer could theoretically be mutated between calls. In practice, this is not a concern because:
- [...] `True`) and external code (wants `False`), the mutation would be explicit and visible (`scorer.score_blocked_content = True`), unlike a call-site parameter where the absence of the flag silently defaults to `False`.
- [...] `score_blocked_content` policy does not apply anymore. This is reasonable since if you want different scoring policies, you need different scorers.

The alternative, a call-site parameter, would allow the same scorer instance to behave differently per call. However, this theoretical benefit doesn't materialize in practice (scorers aren't often shared) and comes at the cost of the forwarding burden described above.
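The instance-attribute design can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `Piece`, `SketchScorer`, and `_substitute` are hypothetical names, a list of pieces stands in for a message, and the `_score_async` body is a toy stand-in for a refusal-style scorer that short-circuits on `response_error == "blocked"`. The point it shows is that the substitution happens in the public `score_async`, so `_score_async` keeps its `(self, message, *, objective=None)` shape.

```python
import asyncio
from dataclasses import dataclass, field, replace


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece."""

    id: str
    converted_value: str
    converted_value_data_type: str = "text"
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


class SketchScorer:
    # Behavioral policy read by score_async; no parameter threading required.
    score_blocked_content: bool = False

    async def score_async(self, pieces, *, objective=None):
        if self.score_blocked_content:
            # Swap blocked pieces for text substitutes before the override runs.
            pieces = [self._substitute(p) for p in pieces]
        return await self._score_async(pieces, objective=objective)

    @staticmethod
    def _substitute(piece):
        partial = piece.prompt_metadata.get("partial_content")
        if piece.response_error != "blocked" or partial is None:
            return piece  # nothing to substitute
        # Copy the piece; the original blocked piece is left untouched.
        return replace(
            piece,
            response_error="none",
            converted_value_data_type="text",
            converted_value=str(partial),
        )

    async def _score_async(self, pieces, *, objective=None):
        # Toy refusal-style logic: blocked pieces short-circuit, others are "evaluated".
        return [
            (p.id, "refusal" if p.response_error == "blocked" else f"evaluated:{p.converted_value}")
            for p in pieces
        ]
```

Opting in is then a single explicit assignment on the scorer instance (`scorer.score_blocked_content = True`), which is the visibility argument made above.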
TAP `error_score_map` interaction
TAP's `error_score_map` (default `{"blocked": 0.0}`) runs before the scorer and assigns fixed scores for error types. When `"blocked"` is in the map, the scorer is never invoked, so `score_blocked_content` has no effect. To evaluate partial content with TAP, pass `error_score_map={}` to disable the early-return. This interaction is documented in both the node and `TreeOfAttacksWithPruningAttack` `error_score_map` docstrings.

Interaction between `skip_on_error_result` and `score_blocked_content`
These flags serve different purposes:
- `skip_on_error_result` is a performance optimization: don't waste LLM calls on broken responses (processing errors, empty responses)
- `score_blocked_content` is a red teaming policy: content-filtered responses may contain useful partial content

They overlap because `message.is_error()` returns `True` for blocked responses. The implementation handles this: when both are active and the blocked message has partial content, scoring is not skipped.

| `skip_on_error` | `score_blocked_content` | |
|---|---|---|
| `True` | `False` | [...] `PromptSendingAttack` default) |
| `False` | `False` | [...] `CrescendoAttack` default) |
| `True` | `True` | [...] |
| `False` | `True` | [...] |

Tests
- `TestCreateTextPieceFromBlocked` (6 tests in `test_scorer.py`): substitute piece creation, field preservation, `response_error="none"`, `None` when no partial content
- `TestScoreAsyncWithBlockedContent` (7 tests in `test_scorer.py`): text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffected
- `TestSkipOnErrorWithBlockedContent` (3 tests in `test_scorer.py`): interaction between `skip_on_error_result=True` and `score_blocked_content=True`
- `TestScoreResponseAsyncBlockedContent` (3 tests in `test_scorer.py`): flag flows through from scorer instance via `score_response_async` and `score_response_multiple_scorers_async`
- `TestExtractPartialContentChatTarget` (4 tests in `test_openai_chat_target.py`): extraction from Chat Completions responses, `None` edge cases
- `TestContentFilterPreservesPartialContent` (2 tests in `test_openai_chat_target.py`): end-to-end: 200 + content_filter preserves metadata, no metadata when no content
- `test_conversation_scorer_uses_partial_content_when_score_blocked_content_enabled` (in `test_conversation_history_scorer.py`): verifies partial content is used in conversation text when flag is on
- `test_conversation_scorer_uses_error_json_when_score_blocked_content_disabled` (in `test_conversation_history_scorer.py`): verifies default behavior uses error JSON
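As a rough illustration of the kind of decision the `TestSkipOnErrorWithBlockedContent` cases exercise, the interaction between `skip_on_error_result` and `score_blocked_content` can be reduced to a small predicate. `should_skip_scoring` and its parameters are hypothetical and not taken from the PR. The only rule stated in the description is that scoring is not skipped when both flags are active and the blocked message has partial content; the remaining branches are assumptions about the surrounding behavior.

```python
def should_skip_scoring(
    *,
    is_error: bool,
    is_blocked: bool,
    has_partial_content: bool,
    skip_on_error_result: bool,
    score_blocked_content: bool,
) -> bool:
    """Hypothetical predicate: should the scorer skip this message entirely?"""
    if not (skip_on_error_result and is_error):
        # Skipping only ever applies to error responses, and only when enabled.
        return False
    if score_blocked_content and is_blocked and has_partial_content:
        # Stated in the description: blocked messages with partial content
        # are still scored when both flags are active.
        return False
    # Assumption: any other error response is skipped.
    return True
```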