MAINT unify how default adversarial and scorer targets are set in scenarios#1695
MAINT unify how default adversarial and scorer targets are set in scenarios#1695behnam-o wants to merge 12 commits intomicrosoft:mainfrom
Conversation
| When overridden to return a non-empty sequence, the default objective scorer becomes | ||
| one ``SelfAskTrueFalseScorer`` per path AND-ed together with ``NOT(SelfAskRefusalScorer)`` | ||
| instead of the scenario-level default. |
There was a problem hiding this comment.
nit: I feel like this part is confusing here because the logic for this is not in this specific function and this should be moved to the doc string for _get_default_objective_scorer
| @classmethod | ||
| def get_override_composite_scorer_questions_path(cls) -> Sequence[Path]: | ||
| """ | ||
| Override to provide true/false question prompt paths for objective scoring. |
There was a problem hiding this comment.
| Override to provide true/false question prompt paths for objective scoring. | |
| Get the true/false question prompt paths for objective scoring. |
I think the override part is redundant, if a class implements this, it is overridden inherently
| Returns: | ||
| Sequence[Path]: Paths to true/false question prompts, or an empty sequence to use the default scorer. | ||
| """ | ||
| return [] |
There was a problem hiding this comment.
so this function subclasses one input — a list of paths — but silently determines:
- The per-path scorer type (always SelfAskTrueFalseScorer)
- The aggregator (always AND)
- That a NOT(SelfAskRefusalScorer) backstop is appended
- The chat target (via a 3-step fallback inside _get_default_objective_scorer
From the subclass author's perspective in scam.py:88-95, leakage.py:104, cyber.py:64, they "return paths" — and have no idea (without reading the base class) what scorer actually gets built. The method name suggests the mechanism ("composite") while hiding the behavior. so why not have something like :
def _get_default_objective_scorer(self) -> TrueFalseScorer:
return self._build_composite_with_refusal_backstop(
chat_target=self._resolve_default_scorer_chat_target(),
true_false_question_paths=[SCORER_SEED_PROMPT_PATH / "true_false_question" / "scams.yaml"],
)
@staticmethod
def _build_composite_with_refusal_backstop(
*,
chat_target: PromptChatTarget,
true_false_question_paths: Sequence[Path],
) -> TrueFalseCompositeScorer:
path_scorers = [
SelfAskTrueFalseScorer(chat_target=chat_target, true_false_question_path=p)
for p in true_false_question_paths
]
backstop = TrueFalseInverterScorer(scorer=SelfAskRefusalScorer(chat_target=chat_target))
return TrueFalseCompositeScorer(
aggregator=TrueFalseScoreAggregator.AND,
scorers=[*path_scorers, backstop],
)
There was a problem hiding this comment.
My reasoning behind this is ... as an author of a new scenario: "I don't want to think about what kind of scorer is used by default in my scenario ... just give me whatever all scenarios use, and btw, here's a list of questions I care about ... I'm happy with whatever other scenarios are doing with it"
and if a scenario author truely cares about the specific default scorer used for their scenario, they can still override _get_default_objective_scorer
(and if a user, as opposed to author, of a scenario wants a specific scorer, they can create one themselves and pass it as an init param)
that's why I take out the "construction of the scorers" out of individual scenarios, and only leave a way for them to specify what questions they want to be considered for scoring ... and if the maintainers decide that using a different type of scorer is more appropriate as default, scenario authors don't care and don't need to update their default scorer construction ...
I think maybe the method name could actually just be "additional_scoring_questions" or something like that, to indicate these are something the base scenario will "consider", but the author doesn't need to know how exactly.
Am I making sense? :D
There was a problem hiding this comment.
"the subclass authors have no idea what scorer actually gets built." is the goal and a good thing in my opinion ...
There was a problem hiding this comment.
gotcha yeah i guess i forgot that a user could pass in their own non-default scorer, so this is just the case that the user doesn't pass in any scorer and then we create a custom scorer using the paths. I do think the function should be renamed to default_true_false_question_paths (we have a weird inconsistency with using get_* and default_* in our class functions) and leave the scorer portion out of it but i won't die on that hill 😅
| OBJECTIVE_SCORER_CHAT_ENDPOINT="https://xxxxx.openai.azure.com/openai/v1" | ||
| OBJECTIVE_SCORER_CHAT_KEY="xxxxx" | ||
| OBJECTIVE_SCORER_CHAT_MODEL="deployment-name" |
There was a problem hiding this comment.
just making sure this is updated in our global env and probably should be whichever endpoint our best performing scorer uses
| TARGET_REQUIREMENTS: ClassVar[TargetRequirements] = TargetRequirements() | ||
|
|
||
| @classmethod | ||
| def get_override_composite_scorer_questions_path(cls) -> Sequence[Path]: |
There was a problem hiding this comment.
| def get_override_composite_scorer_questions_path(cls) -> Sequence[Path]: | |
| def get_override_composite_scorer_questions_paths(cls) -> Sequence[Path]: |
There was a problem hiding this comment.
also NIT: would it be more clear to have this be get_override_default_scorer_questions_paths (to make it more obvious it's for the default scorer) or maybe get_override_true_false_scorer_questions_paths since it applies to the true_false scorer portion of the composite.
| """ | ||
| return _get_default_chat_target( | ||
| preferred_target_key="adversarial_chat", | ||
| required_capabilities={CapabilityName.MULTI_TURN}, |
There was a problem hiding this comment.
does SYSTEM_PROMPT also need to be a require_capability? It seems like most of our multi-turn attacks require it.
| return registry_default_scorer | ||
|
|
||
| scorer = TrueFalseInverterScorer(scorer=SelfAskRefusalScorer(chat_target=chat_target)) | ||
| logger.warning(f"Using fallback default objective scorer: {type(scorer).__name__}") |
There was a problem hiding this comment.
in all the logging for this method, would it also be helpful to output target information?
| MAIN: str = "main" | ||
| FALLBACK: str = "fallback" |
There was a problem hiding this comment.
should these names be more descriptive and perhaps with docstring? maybe just me but it's not obvious that MAIN refers to objective_scorer_chat endpoint which is used for scenarios as a default target when the default scorer is not in the registry and FALLBACK refers to the default openai_chat endpoint.
| # Core scorer registration | ||
| # --------------------------------------------------------------------------- | ||
|
|
||
| def _register_fallback_scorers(self) -> None: |
There was a problem hiding this comment.
maybe dumb question, but why do we need to register these? These are both technically just fallbacks for our scenario which are not retrieved from the scorer registry directly, and I'm a bit confused on what the metrics for these would tell us. For evaluating TrueFalseInverterScorer with Refusal, we already use the refusal scorer tagged as "best" and the defaults here would differ per user (i.e., not tied to a model name and my default objective_scorer model could be different from yours).
We already register and evaluate all the different refusal scorers (with different models) and choose the best one for the inverter scoring; adding these just seems like moving closer to what we used to do by trying all possible combinations without the intermediate _tag_best_per_category step. What do you think?
cc: @rlundeen2
| from pyrit.prompt_target import PromptTarget | ||
| from pyrit.prompt_target.common.target_requirements import TargetRequirements | ||
| from pyrit.registry import ScorerRegistry | ||
| from pyrit.registry.object_registries.scorer_registry import ScorerRegistry |
There was a problem hiding this comment.
That's a VERY long import from another module. Can we make it work with pyrit.registry? There aren't that many, are there?
| from pyrit.score.true_false.self_ask_refusal_scorer import SelfAskRefusalScorer | ||
| from pyrit.score.true_false.self_ask_true_false_scorer import SelfAskTrueFalseScorer | ||
| from pyrit.score.true_false.true_false_composite_scorer import TrueFalseCompositeScorer | ||
| from pyrit.score.true_false.true_false_inverter_scorer import TrueFalseInverterScorer | ||
| from pyrit.score.true_false.true_false_score_aggregator import TrueFalseScoreAggregator |
There was a problem hiding this comment.
same here about short imports.
Changes:
1- Use a common logic for resolving default adversarial and scorer targets in scenarios
2- Have a fallback mechanism that uses the target identified by OPENAI_CHAT_*** env variables. This allows someone to only define 1 endpoint and get a runnable scenario