FEAT: Score partial content from content-filtered responses #1689

Open
jsong468 wants to merge 9 commits into microsoft:main from jsong468:partial_blocked_design2

jsong468 commented May 4, 2026

Description

When OpenAI/Azure Chat Completions content filters trigger mid-generation (HTTP 200 with finish_reason=content_filter), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.

This PR introduces a score_blocked_content attribute on the Scorer base class that allows scorers to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.

Target layer — partial content extraction

  • Added _extract_partial_content(response) template method to OpenAITarget base class, returning None by default
  • OpenAIChatTarget overrides it to extract response.choices[0].message.content
  • _handle_content_filter_response calls the hook and attaches the result to prompt_metadata["partial_content"] on the blocked MessagePiece before it is persisted to the DB
  • No changes to response_error="blocked" — full backward compatibility
  • Partial content extraction is currently only supported for the Chat Completions API. The Responses API and Realtime APIs do not document partial content availability when a content filter fires; the _extract_partial_content hook can be overridden if this changes in the future.
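
The hook pattern in the list above can be sketched roughly as follows. This is a simplified, self-contained illustration rather than the PR's actual code: `BaseTarget`, `ChatTarget`, the `handle_content_filter` helper, the dict used as a blocked piece, and the `SimpleNamespace` fake response are all stand-ins assumed for the example. Only the names `_extract_partial_content` and `partial_content` come from the description.

```python
from types import SimpleNamespace
from typing import Any, Optional


class BaseTarget:
    """Hypothetical stand-in for the OpenAITarget base class."""

    def _extract_partial_content(self, response: Any) -> Optional[str]:
        # Default hook: this API is assumed to expose no partial content.
        return None

    def handle_content_filter(self, response: Any) -> dict:
        # The blocked piece keeps response_error="blocked" in every case.
        piece = {"response_error": "blocked", "prompt_metadata": {}}
        partial = self._extract_partial_content(response)
        if partial:
            # Attach partial output as metadata before the piece is stored.
            piece["prompt_metadata"]["partial_content"] = partial
        return piece


class ChatTarget(BaseTarget):
    """Hypothetical stand-in for OpenAIChatTarget."""

    def _extract_partial_content(self, response: Any) -> Optional[str]:
        try:
            return response.choices[0].message.content or None
        except (AttributeError, IndexError, TypeError):
            return None


# Fake Chat Completions response: HTTP 200 cut off by the content filter.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="content_filter",
            message=SimpleNamespace(content="Step 1: ..."),
        )
    ]
)
blocked_piece = ChatTarget().handle_content_filter(fake_response)
base_piece = BaseTarget().handle_content_filter(fake_response)
```

In this sketch only the chat target records `partial_content`; a target that does not override the hook produces a blocked piece with empty metadata, which matches the default described above.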

Scorer layer — instance attribute on Scorer

  • Added score_blocked_content: bool = False class attribute on Scorer
  • When True, score_async creates a modified message where blocked pieces with partial_content metadata are replaced with text-type substitutes (response_error="none", converted_value_data_type="text", converted_value=<partial text>) before passing to _score_async
  • The substitute has response_error="none" so scorer short-circuits (e.g., refusal scorer's if response_error == "blocked" check) do not fire, and the LLM evaluates the actual content
  • When skip_on_error_result=True and self.score_blocked_content=True, blocked messages with partial content are not skipped.
  • The substitution happens in score_async (not _score_async) so that _score_async's signature remains (self, message, *, objective=None) — preserving full backward compatibility for external subclasses
  • The substitute is never persisted to the DB. The resulting Score references the original blocked piece's ID
  • Unique case: ConversationScorer reads self.score_blocked_content directly when building conversation text from DB history, using partial_content from metadata instead of converted_value for blocked pieces
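
A minimal sketch of the substitution step, assuming a reduced `Piece` dataclass in place of `MessagePiece` and a toy `SketchScorer` in place of `Scorer`. The helper name `text_piece_from_blocked` and the keyword-matching "scoring" logic are hypothetical; the point is only to show the substitute being built in `score_async` while `_score_async` keeps its original signature.

```python
import asyncio
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class Piece:
    """Hypothetical, reduced stand-in for MessagePiece."""

    id: str
    converted_value: str
    converted_value_data_type: str = "text"
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


def text_piece_from_blocked(piece: Piece) -> Optional[Piece]:
    """Return a text substitute for a blocked piece, or None if not applicable."""
    partial = piece.prompt_metadata.get("partial_content")
    if piece.response_error != "blocked" or not partial:
        return None
    # replace() copies every other field, so the original ID is preserved.
    return replace(
        piece,
        response_error="none",
        converted_value_data_type="text",
        converted_value=str(partial),
    )


class SketchScorer:
    score_blocked_content: bool = False

    async def score_async(self, pieces, *, objective=None):
        if self.score_blocked_content:
            # Substitute in memory only; nothing here is persisted.
            pieces = [text_piece_from_blocked(p) or p for p in pieces]
        return await self._score_async(pieces, objective=objective)

    async def _score_async(self, pieces, *, objective=None):
        # Toy scoring: blocked pieces short-circuit to False.
        return [
            {
                "piece_id": p.id,
                "value": p.response_error != "blocked"
                and "step" in p.converted_value.lower(),
            }
            for p in pieces
        ]


blocked = Piece(
    id="p1",
    converted_value='{"status_code": 200}',
    converted_value_data_type="error",
    response_error="blocked",
    prompt_metadata={"partial_content": "Step 1: mix"},
)
scorer = SketchScorer()
default_scores = asyncio.run(scorer.score_async([blocked]))
scorer.score_blocked_content = True
opt_in_scores = asyncio.run(scorer.score_async([blocked]))
```

With the flag off, the blocked piece hits the short-circuit; with it on, the same scorer evaluates the partial text and the score still points at the original piece ID.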

Design decision — instance attribute, not call-site parameter or AttackScoringConfig field

Previous design considerations:

Why not a call-site parameter on score_async? Earlier iterations added score_blocked_content as a parameter on score_async, score_response_async, and score_response_multiple_scorers_async, threaded through from AttackScoringConfig via each attack's scoring calls. This had two problems:

  • Forwarding burden: Every new attack, scorer wrapper, or static method that calls score_async must remember to forward the parameter.
  • Backward compatibility: Adding a score_blocked_content parameter to _score_async broke external subclasses that override it with the original (self, message, *, objective=None) signature. Moving the substitution before _score_async solved this, but the forwarding burden through public APIs remained.
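
The backward-compatibility break can be reproduced with a small, self-contained sketch. `BaseScorer` and `ExternalScorer` are hypothetical stand-ins that model only the earlier iteration's unconditional forward; they are not PyRIT classes.

```python
import asyncio


class BaseScorer:
    """Models the earlier iteration: the flag is always forwarded."""

    async def score_async(self, message, *, objective=None, score_blocked_content=False):
        return await self._score_async(
            message,
            objective=objective,
            score_blocked_content=score_blocked_content,
        )

    async def _score_async(self, message, *, objective=None, score_blocked_content=False):
        return []


class ExternalScorer(BaseScorer):
    """An external subclass written against the original signature."""

    async def _score_async(self, message, *, objective=None):
        return ["scored"]


def old_subclass_breaks() -> bool:
    try:
        asyncio.run(ExternalScorer().score_async("hello"))
    except TypeError:
        # _score_async() got an unexpected keyword argument.
        return True
    return False
```

Performing the substitution in `score_async` before the call, and leaving the `_score_async` call unchanged, avoids this failure for subclasses like `ExternalScorer`.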

Why not on AttackScoringConfig? Putting the flag on the config centralizes it but still requires attacks to thread it through scoring calls. Adding it at the config level introduces complexity without significant benefit to the user.

Why an instance attribute? score_blocked_content is a scorer behavioral policy — "should this scorer evaluate partial content?" — similar to _validator or _score_aggregator. Setting it on the instance:

  • Zero forwarding: score_async reads self.score_blocked_content. No parameters to thread through score_response_async, attacks, or scorer wrappers.
  • Zero forgotten forwarding: New attacks, scorer wrappers, and static methods automatically inherit the behavior without code changes.
  • Full backward compatibility: No changes to score_async, _score_async, score_response_async, or score_response_multiple_scorers_async signatures.
  • Clean ConversationScorer handling: Reads self.score_blocked_content directly instead of inferring substitution state from message piece metadata.
  • Mutation concern: Since score_blocked_content is mutable instance state, a shared scorer could theoretically be mutated between calls. In practice, this is not a concern because:
      • The attribute is set once before the attack runs and never changed during execution
      • If a user did share a scorer between an attack (wants True) and external code (wants False), the mutation would be explicit and visible (scorer.score_blocked_content = True), unlike a call-site parameter where the absence of the flag silently defaults to False.
      • Between two attacks, the user can instantiate new scorers if the previous score_blocked_content policy does not apply anymore. This is reasonable since if you want different scoring policies, you need different scorers.

The alternative — a call-site parameter — would allow the same scorer instance to behave differently per call. However, this theoretical benefit doesn't materialize in practice (scorers aren't often shared) and comes at the cost of the forwarding burden described above.

TAP error_score_map interaction

TAP's error_score_map (default {"blocked": 0.0}) runs before the scorer and assigns fixed scores for error types. When "blocked" is in the map, the scorer is never invoked, so score_blocked_content has no effect. To evaluate partial content with TAP, pass error_score_map={} to disable the early-return. This interaction is documented in both the node and TreeOfAttacksWithPruningAttack error_score_map docstrings.
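
The ordering described here can be modeled with a short sketch. The function `score_node` and its parameters are hypothetical and only mirror the described behavior (the map is consulted first, the scorer second); treating `None` as "use the default map" is an assumption made for this example.

```python
from typing import Callable, Dict, Optional


def score_node(
    response_error: str,
    run_scorer: Callable[[], float],
    error_score_map: Optional[Dict[str, float]] = None,
) -> float:
    if error_score_map is None:
        # Default taken from the description: blocked responses score 0.0.
        error_score_map = {"blocked": 0.0}
    if response_error in error_score_map:
        # Early return: the scorer never runs, so score_blocked_content is moot.
        return error_score_map[response_error]
    return run_scorer()


scorer_calls = []


def fake_scorer() -> float:
    scorer_calls.append("called")
    return 0.8


default_result = score_node("blocked", fake_scorer)
calls_after_default = len(scorer_calls)
disabled_result = score_node("blocked", fake_scorer, error_score_map={})
```

With the default map the scorer is never reached; passing an empty map lets the blocked response fall through to the scorer, where the partial-content policy can take effect.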

Interaction between skip_on_error_result and score_blocked_content

These flags serve different purposes:

  • skip_on_error_result is a performance optimization — don't waste LLM calls on broken responses (processing errors, empty responses)
  • score_blocked_content is a red teaming policy — content-filtered responses may contain useful partial content

They overlap because message.is_error() returns True for blocked responses. The implementation handles this: when both are active and the blocked message has partial content, scoring is not skipped.

| skip_on_error_result | score_blocked_content | Behavior |
|---|---|---|
| True | False | Skip all errors including blocks (PromptSendingAttack default) |
| False | False | Score everything; blocked pieces get hardcoded False (CrescendoAttack default) |
| True | True | Skip real errors, but evaluate partial content from blocks |
| False | True | Score everything; evaluate partial content from blocks |
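
The skip decision behind these four combinations can be written as a small predicate. This is a sketch under stated assumptions: `should_skip` and its boolean inputs are hypothetical names, and the real implementation may derive them differently from the message.

```python
def should_skip(
    *,
    is_error: bool,
    is_blocked: bool,
    has_partial_content: bool,
    skip_on_error_result: bool,
    score_blocked_content: bool,
) -> bool:
    """Return True when the message should be skipped rather than scored."""
    if not (skip_on_error_result and is_error):
        # Skipping only ever applies to error results when the flag is set.
        return False
    # A blocked message with partial content is exempt if the scorer opted in.
    scorable_block = score_blocked_content and is_blocked and has_partial_content
    return not scorable_block


# A blocked response that carries partial content (is_error is True for blocks).
blocked_with_partial = dict(is_error=True, is_blocked=True, has_partial_content=True)

skip_default = should_skip(
    **blocked_with_partial, skip_on_error_result=True, score_blocked_content=False
)
skip_opt_in = should_skip(
    **blocked_with_partial, skip_on_error_result=True, score_blocked_content=True
)
```

Non-block errors, and blocks without partial content, are still skipped whenever `skip_on_error_result` is True, regardless of `score_blocked_content`.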

Tests

  • TestCreateTextPieceFromBlocked (6 tests in test_scorer.py) — substitute piece creation, field preservation, response_error="none", None when no partial content
  • TestScoreAsyncWithBlockedContent (7 tests in test_scorer.py) — text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffected
  • TestSkipOnErrorWithBlockedContent (3 tests in test_scorer.py) — interaction between skip_on_error_result=True and score_blocked_content=True
  • TestScoreResponseAsyncBlockedContent (3 tests in test_scorer.py) — flag flows through from scorer instance via score_response_async and score_response_multiple_scorers_async
  • TestExtractPartialContentChatTarget (4 tests in test_openai_chat_target.py) — extraction from Chat Completions responses, None edge cases
  • TestContentFilterPreservesPartialContent (2 tests in test_openai_chat_target.py) — end-to-end: 200 + content_filter preserves metadata, no metadata when no content
  • test_conversation_scorer_uses_partial_content_when_score_blocked_content_enabled (in test_conversation_history_scorer.py) — verifies partial content is used in conversation text when flag is on
  • test_conversation_scorer_uses_error_json_when_score_blocked_content_disabled (in test_conversation_history_scorer.py) — verifies default behavior uses error JSON

jsong468 changed the title from "score blocked content" to "FEAT: Score partial content from content-filtered responses" on May 4, 2026

fdubut commented May 5, 2026

Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me.

Comment thread pyrit/score/scorer.py Outdated
scores = await self._score_async(
    message,
    objective=objective,
    score_blocked_content=score_blocked_content,
Contributor

Just calling out that this addition is not fully backward compatible: forwarding score_blocked_content unconditionally to _score_async breaks subclasses that override it with the previous signature (self, message, *, objective=None). This can be verified by creating a Scorer subclass based on the old signature with the default score_async, which raises a TypeError for the unexpected keyword argument score_blocked_content. Since backward compatibility is called out in the PR description, you may want to either note this or change it to a conditional forward.

Contributor Author

Good point. The logic for substituting partial content can be moved to the public-facing score_async to preserve backward compatibility.

Comment thread pyrit/score/true_false/true_false_scorer.py Outdated
Comment thread pyrit/score/conversation_scorer.py
# When True, blocked responses that contain partial model output (e.g., from Azure Content Safety
# triggering mid-generation) will be evaluated by scorers instead of being skipped or
# auto-classified as failures/refusals.
score_blocked_content: bool = False
Contributor

If this were an attribute on the scorer, it wouldn't change this

Comment thread pyrit/score/scorer.py Outdated
role_filter: Optional[ChatMessageRole] = None,
skip_on_error_result: bool = False,
infer_objective_from_request: bool = False,
score_blocked_content: bool = False,
Contributor

I think default True is better

Contributor Author

I think default False is preferable for now, because a filtered response is still a meaningful result from the model. It should be captured by default rather than silently passed over while the attack continues.

Comment thread pyrit/prompt_target/openai/openai_target.py
Comment thread pyrit/executor/attack/multi_turn/tree_of_attacks.py
Comment thread pyrit/prompt_target/openai/openai_response_target.py Outdated
Comment thread pyrit/score/conversation_scorer.py Outdated
Comment on lines +84 to +88
if use_partial_content and piece.is_blocked() and "partial_content" in piece.prompt_metadata:
    text = str(piece.prompt_metadata["partial_content"])
else:
    text = piece.converted_value
conversation_text += f"{role_display}: {text}\n"
Contributor

We should perhaps tell the scorer that this is partial content because the full response was blocked? Isn't that relevant information?

Contributor Author

I think it's relevant, but it somewhat defeats the point of scoring blocked content. Telling the scorer may bias it in an unknown way; if the user does not want the partial content to be scored, they can set the flag to False. Otherwise, partial content is scored as if it were normal content (which is what the flag is for); anything else would leave us in a blurry middle ground.

Comment thread tests/unit/prompt_target/target/test_openai_chat_target.py

5 participants