
Integration of Soft-ELO #42

Open
kargibora wants to merge 9 commits into main from feat/soft-elo

Conversation

@kargibora
Collaborator

Implements the Soft-Elo pipeline: feed the judge's calibrated score-difference into the Bradley–Terry fit as a soft preference $\tilde y = \sigma(\beta s)$ instead of discretising to win/loss/tie. Optionally MLE-fit $\beta$ on human-labeled arena battles before the main run.
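
For concreteness, a minimal sketch of that conversion (a hypothetical helper, not the module's actual API; it assumes s is the judge's score for B minus the score for A, matching the convention 0 = A wins, 1 = B wins):

```python
# Minimal sketch: map a judge score-difference s (assumed B minus A)
# to a soft Bradley-Terry preference y_tilde = sigmoid(beta * s).
# y_tilde -> 1 as B dominates, -> 0 as A dominates, 0.5 on equal scores.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def soft_preference(score_diff: np.ndarray, beta: float = 0.3) -> np.ndarray:
    return expit(beta * np.asarray(score_diff, dtype=float))
```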

What changed

  • fit_bradley_terry (estimate_elo_ratings.py) replaces compute_bradley_terry. Takes a soft target pref_col ∈ [0, 1] (0 = A wins, 1 = B wins, 0.5 = tie) and uses the standard soft-CE → weighted-LR decomposition (first sketch after this list). Hard labels ({0, 0.5, 1}) reduce to the previous fit.
  • Temperature calibration (evaluate.calibrate_temperature): concave MLE for $\beta^\star$, maximizing $\sum\log\sigma(\beta(2y-1)\Delta s)$ with scipy.optimize.minimize_scalar on the negated objective (second sketch after this list). Driven from estimate_elo_ratings.main: samples human battles, reruns the judge on them, parses raw scores with PairScore(temperature=1.0), fits $\beta^\star$, then re-parses all cached judge completions with the calibrated temperature (handles swap_mode="both" reconstruction).
  • Reporting: human-only BT ratings are computed as a ground-truth reference, and the MAE vs. Human-Elo is printed on overlapping models; the return dict gains human_elo, mae_vs_human, method, and calibrated_temperature.
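
The soft-CE → weighted-LR decomposition in one self-contained sketch (fit_bt_soft is a hypothetical stand-in for fit_bradley_terry, with an assumed signature): a battle with soft target $\tilde y$ becomes two weighted hard examples, weight $\tilde y$ with label 1 and weight $1-\tilde y$ with label 0, so a weighted logistic regression minimizes exactly the soft cross-entropy.

```python
# Illustrative sketch of the soft-CE -> weighted-LR decomposition
# (fit_bt_soft is a hypothetical name, not the PR's exact API).
import numpy as np
from sklearn.linear_model import LogisticRegression  # sklearn >= 1.2 for penalty=None

def fit_bt_soft(model_a_idx, model_b_idx, y_soft, n_models,
                scale=400.0, base=1000.0):
    # Design matrix: +1 for model B, -1 for model A, so the logit of
    # "B wins" is rating_B - rating_A (pref convention: 0 = A wins, 1 = B wins).
    y_soft = np.asarray(y_soft, dtype=float)
    n = len(y_soft)
    X = np.zeros((n, n_models))
    X[np.arange(n), model_b_idx] = 1.0
    X[np.arange(n), model_a_idx] = -1.0

    # Duplicate every battle with hard labels and soft-target weights:
    # (label=1, weight=y) and (label=0, weight=1-y).
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(n), np.zeros(n)])
    w2 = np.concatenate([y_soft, 1.0 - y_soft])

    lr = LogisticRegression(fit_intercept=False, penalty=None, max_iter=1000)
    lr.fit(X2, y2, sample_weight=w2)

    # Natural-log coefficients -> Elo scale (400 / ln 10 per logit unit).
    return base + scale / np.log(10.0) * lr.coef_[0]
```

With hard labels the duplicated rows collapse back to the standard win/loss/tie fit (a tie becomes a half-weight win plus a half-weight loss).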
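
And a standalone sketch of the $\beta^\star$ MLE (hypothetical; the PR's evaluate.calibrate_temperature additionally does the battle sampling and cache re-parsing described above). Since $\log\sigma$ composed with a function linear in $\beta$ is concave, the objective is concave, and ties ($y = 0.5$) contribute only a constant:

```python
# Standalone sketch of the concave MLE for beta*.
# y: human labels (0 = A wins, 1 = B wins, 0.5 = tie);
# delta_s: judge score-differences on the same battles (assumed B minus A).
import numpy as np
from scipy.optimize import minimize_scalar

def calibrate_beta(y, delta_s, bounds=(1e-3, 10.0)) -> float:
    z = (2.0 * np.asarray(y, dtype=float) - 1.0) * np.asarray(delta_s, dtype=float)

    def neg_log_lik(beta):
        # -sum log sigmoid(beta * z), written stably as logaddexp(0, -beta * z).
        return np.logaddexp(0.0, -beta * z).sum()

    res = minimize_scalar(neg_log_lik, bounds=bounds, method="bounded")
    return float(res.x)
```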

New flags

| Flag | Default | Effect |
| --- | --- | --- |
| --soft-elo | off | Use soft BT targets instead of hard {0, 0.5, 1} labels. |
| --soft-elo-temperature | 0.3 | Initial $\beta$; overridden if calibration runs. Empirical range across judges in the paper: [0.36, 0.60]. |
| --calibrate-temperature | off | MLE-fit $\beta^\star$ on human-labeled arena battles before the run. Requires --soft-elo; warns and skips otherwise. |
| --calibration-size | all human battles | Number of human battles to sample for calibration. Requires --calibrate-temperature. |

How to run

Hard-Elo (unchanged behavior):

```bash
judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200
```

Soft-Elo with calibration (recommended):

```bash
judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200 \
  --soft-elo --calibrate-temperature --calibration-size 300
```

How to test

```bash
uv run pytest tests/test_cli.py tests/test_estimate_elo_ratings.py
```

  • test_cli.py covers routing of the new flags through the unified entrypoint;
  • test_estimate_elo_ratings.py covers fit_bradley_terry and the main pipeline (one invariant of the soft fit is sketched below).
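
The PR's actual tests aren't reproduced here; as a hypothetical illustration of an invariant the soft fit should satisfy (reusing the fit_bt_soft sketch above): swapping A and B while flipping $y \to 1 - y$ describes the same battles and must yield the same ratings.

```python
# Hypothetical sanity check, not one of the PR's tests.
import numpy as np

def test_swap_symmetry():
    rng = np.random.default_rng(0)
    a = rng.integers(0, 4, size=200)
    b = (a + rng.integers(1, 4, size=200)) % 4   # guarantees A != B
    y = rng.uniform(0.05, 0.95, size=200)        # strictly soft targets
    r_ab = fit_bt_soft(a, b, y, n_models=4)
    r_ba = fit_bt_soft(b, a, 1.0 - y, n_models=4)
    # Loose tolerance to absorb solver noise on the Elo scale.
    np.testing.assert_allclose(r_ab, r_ba, atol=1e-3)
```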
