FEAT: Score partial content from content-filtered responses by jsong468 · Pull Request #1689 · microsoft/PyRIT

jsong468 · 2026-05-04T22:53:57Z

Description

When OpenAI/Azure Chat Completions content filters trigger mid-generation (HTTP 200 with finish_reason=content_filter), the model may have already produced partial content before being cut off. Currently, PyRIT discards this partial content and treats the response identically to a full block (HTTP 400), so scorers return hardcoded failures and attacks backtrack. From an adversary's perspective, partial harmful content may constitute a successful attack even though the output was eventually filtered.

This PR introduces a score_blocked_content attribute on the Scorer base class that allows scorers to opt in to evaluating partial content from blocked responses instead of automatically treating them as failures.

Target layer — partial content extraction

- Added an `_extract_partial_content(response)` template method to the `OpenAITarget` base class, returning `None` by default.
- `OpenAIChatTarget` overrides it to extract `response.choices[0].message.content`.
- `_handle_content_filter_response` calls the hook and attaches the result to `prompt_metadata["partial_content"]` on the blocked `MessagePiece` before it is persisted to the DB.
- `response_error="blocked"` is unchanged, so the target layer stays fully backward compatible.
- Partial content extraction is currently only supported for the Chat Completions API. The Responses API and Realtime APIs do not document partial content availability when a content filter fires; the `_extract_partial_content` hook can be overridden if this changes in the future.
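
The hook pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `BlockedPiece`, `BaseTarget`, `ChatTarget`, and `fake_response` are hypothetical stand-ins for `MessagePiece`, `OpenAITarget`, `OpenAIChatTarget`, and a Chat Completions response object, and the error handling in the override is an assumption.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class BlockedPiece:
    """Hypothetical stand-in for a blocked MessagePiece."""
    response_error: str = "blocked"
    prompt_metadata: dict = field(default_factory=dict)


class BaseTarget:
    """Hypothetical stand-in for OpenAITarget."""

    def _extract_partial_content(self, response: Any) -> Optional[str]:
        # Default hook: this API exposes no partial content.
        return None

    def _handle_content_filter_response(self, response: Any) -> BlockedPiece:
        piece = BlockedPiece()
        partial = self._extract_partial_content(response)
        if partial:
            # Attach the text before the piece would be persisted.
            piece.prompt_metadata["partial_content"] = partial
        return piece


class ChatTarget(BaseTarget):
    """Hypothetical stand-in for OpenAIChatTarget."""

    def _extract_partial_content(self, response: Any) -> Optional[str]:
        try:
            return response.choices[0].message.content or None
        except (AttributeError, IndexError, TypeError):
            # Malformed or empty response: treat as "no partial content".
            return None


def fake_response(content: Optional[str]) -> SimpleNamespace:
    """Build an object shaped like a filtered Chat Completions response."""
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(message=message, finish_reason="content_filter")
    return SimpleNamespace(choices=[choice])
```

In this sketch the piece stays marked as `"blocked"` regardless of whether partial text was found; only the metadata differs, which is what keeps existing consumers unaffected.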

Scorer layer — instance attribute on `Scorer`

- Added a `score_blocked_content: bool = False` class attribute on `Scorer`; setting it to `True` on a scorer instance opts that scorer in.
- When `True`, `score_async` creates a modified message in which blocked pieces with `partial_content` metadata are replaced with text-type substitutes (`response_error="none"`, `converted_value_data_type="text"`, `converted_value=<partial text>`) before passing it to `_score_async`.
- Because the substitute has `response_error="none"`, scorer short-circuits (e.g., the refusal scorer's `if response_error == "blocked"` check) do not fire, and the LLM evaluates the actual content.
- When `skip_on_error_result=True` and `self.score_blocked_content=True`, blocked messages with partial content are not skipped.
- The substitution happens in `score_async` (not `_score_async`) so that the signature of `_score_async` remains `(self, message, *, objective=None)`, preserving full backward compatibility for external subclasses.
- The substitute is never persisted to the DB. The resulting `Score` references the original blocked piece's ID.
- Special case: `ConversationScorer` reads `self.score_blocked_content` directly when building conversation text from DB history, using `partial_content` from metadata instead of `converted_value` for blocked pieces.
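
A rough sketch of the substitution step, under stated assumptions: `Piece`, `text_piece_from_blocked`, and `SketchScorer` are hypothetical names, the scorer works on a plain list of pieces rather than a PyRIT message, and the toy scoring rule stands in for a real LLM-backed scorer. The `"error"` data type on the blocked piece is also an assumption made for illustration.

```python
import asyncio
from dataclasses import dataclass, field, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for MessagePiece."""
    id: str
    converted_value: str
    converted_value_data_type: str = "text"
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


def text_piece_from_blocked(piece: Piece) -> Optional[Piece]:
    """Return an in-memory text substitute, or None if nothing can be substituted."""
    partial = piece.prompt_metadata.get("partial_content")
    if piece.response_error != "blocked" or not partial:
        return None
    # replace() copies the piece, so the original (and its ID) is untouched.
    return replace(
        piece,
        response_error="none",
        converted_value_data_type="text",
        converted_value=str(partial),
    )


class SketchScorer:
    score_blocked_content: bool = False

    async def score_async(self, pieces: List[Piece], *, objective=None):
        if self.score_blocked_content:
            # Substitute here, in the public method, so the hook below
            # keeps its original signature.
            pieces = [text_piece_from_blocked(p) or p for p in pieces]
        return await self._score_async(pieces, objective=objective)

    async def _score_async(self, pieces: List[Piece], *, objective=None) -> List[Tuple[str, bool]]:
        # Toy rule: blocked pieces short-circuit to False; any other
        # non-empty text counts as True.
        return [
            (p.id, p.response_error != "blocked" and bool(p.converted_value.strip()))
            for p in pieces
        ]
```

Because the substitute keeps the original `id`, a score produced from it can still point back at the blocked piece, which matches the persistence note above.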

Design decision — instance attribute, not call-site parameter or `AttackScoringConfig` field

Previous design considerations:

Why not a call-site parameter on `score_async`? Earlier iterations added `score_blocked_content` as a parameter on `score_async`, `score_response_async`, and `score_response_multiple_scorers_async`, threaded through from `AttackScoringConfig` via each attack's scoring calls. This had two problems:

- Forwarding burden: every new attack, scorer wrapper, or static method that calls `score_async` must remember to forward the parameter.
- Backward compatibility: adding a `score_blocked_content` parameter to `_score_async` broke external subclasses that override it with the original `(self, message, *, objective=None)` signature. Moving the substitution before `_score_async` solved this, but the forwarding burden through public APIs remained.

Why not on AttackScoringConfig? Putting the flag on the config centralizes it but still requires attacks to thread it through scoring calls. Adding it at the config level introduces complexity without significant benefit to the user.

Why an instance attribute? score_blocked_content is a scorer behavioral policy — "should this scorer evaluate partial content?" — similar to _validator or _score_aggregator. Setting it on the instance:

- Zero forwarding: `score_async` reads `self.score_blocked_content`. No parameters to thread through `score_response_async`, attacks, or scorer wrappers.
- Zero forgotten forwarding: new attacks, scorer wrappers, and static methods automatically inherit the behavior without code changes.
- Full backward compatibility: no changes to the `score_async`, `_score_async`, `score_response_async`, or `score_response_multiple_scorers_async` signatures.
- Clean `ConversationScorer` handling: it reads `self.score_blocked_content` directly instead of inferring substitution state from message piece metadata.
- Mutation concern: since `score_blocked_content` is mutable instance state, a shared scorer could theoretically be mutated between calls. In practice, this is not a concern because:
  - The attribute is set once before the attack runs and never changed during execution.
  - If a user did share a scorer between an attack (wants `True`) and external code (wants `False`), the mutation would be explicit and visible (`scorer.score_blocked_content = True`), unlike a call-site parameter where the absence of the flag silently defaults to `False`.
  - Between two attacks, the user can instantiate new scorers if the previous `score_blocked_content` policy no longer applies. This is reasonable: different scoring policies call for different scorers.

The alternative — a call-site parameter — would allow the same scorer instance to behave differently per call. However, this theoretical benefit doesn't materialize in practice (scorers aren't often shared) and comes at the cost of the forwarding burden described above.
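
The compatibility argument can be illustrated with a toy comparison. Everything here is hypothetical (`ForwardingScorer`, `AttributeScorer`, `LegacyHookMixin`, and the string-based "substitution" are illustrative only); the point is just that a subclass written against the original hook signature fails when the flag is forwarded unconditionally, and keeps working when the flag lives on the instance.

```python
import asyncio


class ForwardingScorer:
    """Earlier design (sketch): the public method forwards the flag to the hook."""

    async def score_async(self, message, *, objective=None, score_blocked_content=False):
        return await self._score_async(
            message, objective=objective, score_blocked_content=score_blocked_content
        )


class AttributeScorer:
    """Final design (sketch): the flag is instance state; the hook is unchanged."""

    score_blocked_content: bool = False

    async def score_async(self, message, *, objective=None):
        if self.score_blocked_content:
            # Toy stand-in for the partial-content substitution.
            message = message.replace("[blocked] ", "", 1)
        return await self._score_async(message, objective=objective)


class LegacyHookMixin:
    """An external subclass body written against the original hook signature."""

    async def _score_async(self, message, *, objective=None):
        return message.upper()


class LegacyOnForwarding(LegacyHookMixin, ForwardingScorer):
    pass


class LegacyOnAttribute(LegacyHookMixin, AttributeScorer):
    pass


def forwarding_breaks() -> bool:
    """True if the legacy hook rejects the forwarded keyword argument."""
    try:
        asyncio.run(LegacyOnForwarding().score_async("hi"))
    except TypeError:
        return True
    return False
```

With the attribute-based variant, opting in is a single assignment on the scorer (`scorer.score_blocked_content = True`) and no caller needs to change.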

TAP `error_score_map` interaction

TAP's error_score_map (default {"blocked": 0.0}) runs before the scorer and assigns fixed scores for error types. When "blocked" is in the map, the scorer is never invoked, so score_blocked_content has no effect. To evaluate partial content with TAP, pass error_score_map={} to disable the early-return. This interaction is documented in both the node and TreeOfAttacksWithPruningAttack error_score_map docstrings.
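
The ordering described above can be sketched with a small function. `score_node` and its parameters are hypothetical and are not TAP's real API; only the default map `{"blocked": 0.0}` and the `error_score_map={}` workaround come from the description.

```python
from typing import Callable, Dict, Optional


def score_node(
    response_error: str,
    partial_content: Optional[str],
    scorer: Callable[[str], float],
    error_score_map: Optional[Dict[str, float]] = None,
) -> float:
    """Apply the error map first; only fall through to the scorer if no entry matches."""
    if error_score_map is None:
        # Default stated in the PR description.
        error_score_map = {"blocked": 0.0}
    if response_error in error_score_map:
        # Early return: the scorer is never invoked for this error type.
        return error_score_map[response_error]
    return scorer(partial_content or "")
```

Passing an empty map removes the early return, so a scorer that opts in to partial content gets a chance to run on blocked responses.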

Interaction between `skip_on_error_result` and `score_blocked_content`

These flags serve different purposes:

- `skip_on_error_result` is a performance optimization: don't waste LLM calls on broken responses (processing errors, empty responses).
- `score_blocked_content` is a red teaming policy: content-filtered responses may contain useful partial content.

They overlap because message.is_error() returns True for blocked responses. The implementation handles this: when both are active and the blocked message has partial content, scoring is not skipped.

| `skip_on_error_result` | `score_blocked_content` | Behavior |
|---|---|---|
| `True` | `False` | Skip all errors including blocks (`PromptSendingAttack` default) |
| `False` | `False` | Score everything; blocked pieces get hardcoded False (`CrescendoAttack` default) |
| `True` | `True` | Skip real errors, but evaluate partial content from blocks |
| `False` | `True` | Score everything; evaluate partial content from blocks |
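
One way to read the combinations above is as a single skip decision. The function below is a hypothetical sketch, not the PR's implementation; in particular, treating a blocked message without partial content as skippable when `skip_on_error_result=True` is an assumption inferred from the description.

```python
def should_score(
    *,
    is_error: bool,
    is_blocked: bool,
    has_partial_content: bool,
    skip_on_error_result: bool,
    score_blocked_content: bool,
) -> bool:
    """Return True if the message is passed on to the scorer (sketch)."""
    if not skip_on_error_result:
        # Nothing is skipped; the scorer sees every message.
        return True
    if not is_error:
        return True
    # Error responses are skipped unless this is a block carrying partial
    # content and the scorer has opted in.
    return score_blocked_content and is_blocked and has_partial_content
```

Whether a message that reaches the scorer is then evaluated on its partial content or given a hardcoded failure depends on `score_blocked_content` alone.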

Tests

- `TestCreateTextPieceFromBlocked` (6 tests in `test_scorer.py`): substitute piece creation, field preservation, `response_error="none"`, `None` when no partial content
- `TestScoreAsyncWithBlockedContent` (7 tests in `test_scorer.py`): text-only scorer filtering, substitute scoring, refusal scorer short-circuit bypass, no-partial-content handling, mixed pieces, normal piece unaffected
- `TestSkipOnErrorWithBlockedContent` (3 tests in `test_scorer.py`): interaction between `skip_on_error_result=True` and `score_blocked_content=True`
- `TestScoreResponseAsyncBlockedContent` (3 tests in `test_scorer.py`): flag flows through from the scorer instance via `score_response_async` and `score_response_multiple_scorers_async`
- `TestExtractPartialContentChatTarget` (4 tests in `test_openai_chat_target.py`): extraction from Chat Completions responses, `None` edge cases
- `TestContentFilterPreservesPartialContent` (2 tests in `test_openai_chat_target.py`): end-to-end; 200 + `content_filter` preserves metadata, no metadata when no content
- `test_conversation_scorer_uses_partial_content_when_score_blocked_content_enabled` (in `test_conversation_history_scorer.py`): verifies partial content is used in conversation text when the flag is on
- `test_conversation_scorer_uses_error_json_when_score_blocked_content_disabled` (in `test_conversation_history_scorer.py`): verifies default behavior uses error JSON

fdubut · 2026-05-05T00:02:10Z

Thanks for implementing this feature! I did only a cursory review of the code but read the thorough PR description and the overall design looks good to me.

adrian-gavrila · 2026-05-05T21:22:12Z

            scores = await self._score_async(
                message,
                objective=objective,
+                score_blocked_content=score_blocked_content,


Just calling out that this addition is not fully backward compatible: unconditionally forwarding the new argument to _score_async breaks subclasses that use the previous signature, (self, message, *, objective=None). This can be verified by creating a Scorer subclass based on the old signature and calling the default score_async, which raises a TypeError for the unexpected keyword argument score_blocked_content. Since backward compatibility is called out in the PR description, you may want to either note this or change it to a conditional forward.

jsong468 · May 5, 2026 (reply)

Good point. The logic for substituting partial content can be moved to the public-facing score_async to preserve backward compatibility.

rlundeen2 · 2026-05-05T21:46:50Z

+    # When True, blocked responses that contain partial model output (e.g., from Azure Content Safety
+    # triggering mid-generation) will be evaluated by scorers instead of being skipped or
+    # auto-classified as failures/refusals.
+    score_blocked_content: bool = False


If this were an attribute on the scorer, it wouldn't change this

rlundeen2 · 2026-05-06T00:08:21Z

        role_filter: Optional[ChatMessageRole] = None,
        skip_on_error_result: bool = False,
        infer_objective_from_request: bool = False,
+        score_blocked_content: bool = False,


I think default True is better

jsong468 · May 6, 2026 (reply)

I think default False is preferable for now, because a filtered response is still a meaningful result from the model that should be captured by default rather than silently passed over while the attack continues.

romanlutz · 2026-05-06T11:50:46Z

+                    if use_partial_content and piece.is_blocked() and "partial_content" in piece.prompt_metadata:
+                        text = str(piece.prompt_metadata["partial_content"])
+                    else:
+                        text = piece.converted_value
+                    conversation_text += f"{role_display}: {text}\n"


We should perhaps tell the scorer that this is partial content because the full response was blocked? Isn't that relevant information?

jsong468 · May 8, 2026 (reply)

I think it's relevant, but it somewhat defeats the point of scoring blocked content. Telling the scorer may bias it in an unknown way; if the user does not want the partial content to be scored, they can set the flag to False. Otherwise, partial content is scored as if it were normal content (which is what the flag is for), or else we would end up with a blurry middle ground.

jsong468 added 3 commits May 4, 2026 15:24

score blocked content

668512d

docstring

e6fae92

merge conflicts

3beff4a

jsong468 changed the title from "score blocked content" to "FEAT: Score partial content from content-filtered responses" on May 4, 2026

fix unit tests

9a4505a

adrian-gavrila reviewed May 5, 2026

rlundeen2 reviewed May 5, 2026

Comment thread pyrit/score/true_false/true_false_scorer.py Outdated

rlundeen2 reviewed May 5, 2026

Comment thread pyrit/score/conversation_scorer.py

rlundeen2 reviewed May 5, 2026

jsong468 added 2 commits May 5, 2026 16:53

fix conversation_scorer bug and score_async

fc6c7e7

minor truthiness change

c49debc

rlundeen2 reviewed May 6, 2026

romanlutz reviewed May 6, 2026

Comment thread pyrit/prompt_target/openai/openai_target.py

romanlutz reviewed May 6, 2026

Comment thread pyrit/executor/attack/multi_turn/tree_of_attacks.py

romanlutz reviewed May 6, 2026

Comment thread pyrit/prompt_target/openai/openai_response_target.py Outdated

romanlutz reviewed May 6, 2026

Comment thread tests/unit/prompt_target/target/test_openai_chat_target.py

jsong468 added 3 commits May 7, 2026 23:30

refactor score_blocked_content as attribute and remove responses targ…

aa3bef6

…et changes

conversation scorer fix and tests

573dee1

Merge branch 'main' into partial_blocked_design2

6c1ee8c

jsong468 wants to merge 9 commits into microsoft:main from jsong468:partial_blocked_design2

5 participants
