
Integration of Soft-ELO #42

Open
kargibora wants to merge 9 commits into main from feat/soft-elo

Conversation

@kargibora
Collaborator

Implements the Soft-Elo pipeline: feed the judge's calibrated score-difference into the Bradley–Terry fit as a soft preference $\tilde y = \sigma(\beta s)$ instead of discretising to win/loss/tie. Optionally MLE-fit $\beta$ on human-labeled arena battles before the main run.
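
For concreteness, a minimal sketch of that conversion (a hypothetical helper, not the module's actual API; it assumes s is the judge's score for B minus the score for A, matching the convention 0 = A wins, 1 = B wins):

```python
# Minimal sketch: map a judge score-difference s (assumed B minus A)
# to a soft Bradley-Terry preference y_tilde = sigmoid(beta * s).
# y_tilde -> 1 as B dominates, -> 0 as A dominates, 0.5 on equal scores.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def soft_preference(score_diff: np.ndarray, beta: float = 0.3) -> np.ndarray:
    return expit(beta * np.asarray(score_diff, dtype=float))
```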

What changed

  • fit_bradley_terry (estimate_elo_ratings.py) replaces compute_bradley_terry. Takes a soft target pref_col ∈ [0, 1] (0 = A wins, 1 = B wins, 0.5 = tie) and uses the standard soft-CE → weighted-LR decomposition (first sketch after this list). Hard labels ({0, 0.5, 1}) reduce to the previous fit.
  • Temperature calibration (evaluate.calibrate_temperature): concave MLE for $\beta^\star$, maximizing $\sum\log\sigma(\beta(2y-1)\Delta s)$ with scipy.optimize.minimize_scalar on the negated objective (second sketch after this list). Driven from estimate_elo_ratings.main: samples human battles, reruns the judge on them, parses raw scores with PairScore(temperature=1.0), fits $\beta^\star$, then re-parses all cached judge completions with the calibrated temperature (handles swap_mode="both" reconstruction).
  • Reporting: human-only BT ratings are computed as a ground-truth reference, and the MAE vs. Human-Elo is printed on overlapping models; the return dict gains human_elo, mae_vs_human, method, and calibrated_temperature.
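
The soft-CE → weighted-LR decomposition in one self-contained sketch (fit_bt_soft is a hypothetical stand-in for fit_bradley_terry, with an assumed signature): a battle with soft target $\tilde y$ becomes two weighted hard examples, weight $\tilde y$ with label 1 and weight $1-\tilde y$ with label 0, so a weighted logistic regression minimizes exactly the soft cross-entropy.

```python
# Illustrative sketch of the soft-CE -> weighted-LR decomposition
# (fit_bt_soft is a hypothetical name, not the PR's exact API).
import numpy as np
from sklearn.linear_model import LogisticRegression  # sklearn >= 1.2 for penalty=None

def fit_bt_soft(model_a_idx, model_b_idx, y_soft, n_models,
                scale=400.0, base=1000.0):
    # Design matrix: +1 for model B, -1 for model A, so the logit of
    # "B wins" is rating_B - rating_A (pref convention: 0 = A wins, 1 = B wins).
    y_soft = np.asarray(y_soft, dtype=float)
    n = len(y_soft)
    X = np.zeros((n, n_models))
    X[np.arange(n), model_b_idx] = 1.0
    X[np.arange(n), model_a_idx] = -1.0

    # Duplicate every battle with hard labels and soft-target weights:
    # (label=1, weight=y) and (label=0, weight=1-y).
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(n), np.zeros(n)])
    w2 = np.concatenate([y_soft, 1.0 - y_soft])

    lr = LogisticRegression(fit_intercept=False, penalty=None, max_iter=1000)
    lr.fit(X2, y2, sample_weight=w2)

    # Natural-log coefficients -> Elo scale (400 / ln 10 per logit unit).
    return base + scale / np.log(10.0) * lr.coef_[0]
```

With hard labels the duplicated rows collapse back to the standard win/loss/tie fit (a tie becomes a half-weight win plus a half-weight loss).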
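
And a standalone sketch of the $\beta^\star$ MLE (hypothetical; the PR's evaluate.calibrate_temperature additionally does the battle sampling and cache re-parsing described above). Since $\log\sigma$ composed with a function linear in $\beta$ is concave, the objective is concave, and ties ($y = 0.5$) contribute only a constant:

```python
# Standalone sketch of the concave MLE for beta*.
# y: human labels (0 = A wins, 1 = B wins, 0.5 = tie);
# delta_s: judge score-differences on the same battles (assumed B minus A).
import numpy as np
from scipy.optimize import minimize_scalar

def calibrate_beta(y, delta_s, bounds=(1e-3, 10.0)) -> float:
    z = (2.0 * np.asarray(y, dtype=float) - 1.0) * np.asarray(delta_s, dtype=float)

    def neg_log_lik(beta):
        # -sum log sigmoid(beta * z), written stably as logaddexp(0, -beta * z).
        return np.logaddexp(0.0, -beta * z).sum()

    res = minimize_scalar(neg_log_lik, bounds=bounds, method="bounded")
    return float(res.x)
```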

New flags

| Flag | Default | Effect |
| --- | --- | --- |
| --soft-elo | off | Use soft BT targets instead of hard {0, 0.5, 1} labels. |
| --soft-elo-temperature | 0.3 | Initial $\beta$; overridden if calibration runs. Empirical range across judges in the paper: [0.36, 0.60]. |
| --calibrate-temperature | off | MLE-fit $\beta^\star$ on human-labeled arena battles before the run. Requires --soft-elo; warns and skips otherwise. |
| --calibration-size | all human battles | Number of human battles to sample for calibration. Requires --calibrate-temperature. |

How to run

Hard-Elo (unchanged behavior):

```bash
judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200
```

Soft-Elo with calibration (recommended):

```bash
judgearena --task elo-lmarena-100k \
  --model_A Together/meta-llama/Llama-3.3-70B-Instruct-Turbo \
  --judge_model OpenRouter/deepseek/deepseek-chat-v3.1 \
  --n_instructions 200 \
  --soft-elo --calibrate-temperature --calibration-size 300
```

How to test

```bash
uv run pytest tests/test_cli.py tests/test_estimate_elo_ratings.py
```

  • test_cli.py covers routing of the new flags through the unified entrypoint;
  • test_estimate_elo_ratings.py covers fit_bradley_terry and the main pipeline (one invariant of the soft fit is sketched below).
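
The PR's actual tests aren't reproduced here; as a hypothetical illustration of an invariant the soft fit should satisfy (reusing the fit_bt_soft sketch above): swapping A and B while flipping $y \to 1 - y$ describes the same battles and must yield the same ratings.

```python
# Hypothetical sanity check, not one of the PR's tests.
import numpy as np

def test_swap_symmetry():
    rng = np.random.default_rng(0)
    a = rng.integers(0, 4, size=200)
    b = (a + rng.integers(1, 4, size=200)) % 4   # guarantees A != B
    y = rng.uniform(0.05, 0.95, size=200)        # strictly soft targets
    r_ab = fit_bt_soft(a, b, y, n_models=4)
    r_ba = fit_bt_soft(b, a, 1.0 - y, n_models=4)
    # Loose tolerance to absorb solver noise on the Elo scale.
    np.testing.assert_allclose(r_ab, r_ba, atol=1e-3)
```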
