diff --git a/bioscancast/stages/eval_stage/README.md b/bioscancast/stages/eval_stage/README.md index 0a88616..5eceae2 100644 --- a/bioscancast/stages/eval_stage/README.md +++ b/bioscancast/stages/eval_stage/README.md @@ -1,43 +1,62 @@ # BioScanCast Evaluation Stage -This folder evaluates probabilistic forecasts and compares forecasting sources such as human forecasts, BioScanCast, and other LLM baselines. +This folder evaluates probabilistic forecasts and tracks how they change across repeated forecast versions. + +## Purpose + +The eval stage is designed for forecasting systems that improve over time. In the demo data, `forecast_version` is the time axis: + +- `1` = earliest forecast +- `2` = updated forecast +- `3` = latest forecast + +The plots and tables are built to show whether a source is improving as versions advance. ## What it measures -The evaluation stage scores each question/source pair using: +The evaluation stage scores each question/source/version triple. The first four metrics are lower-is-better; the last three are diagnostics: + +- **Brier score**: squared error of the forecast probabilities. +- **Log score**: negative log probability assigned to the true answer. +- **Accuracy error**: `1 - accuracy` so the direction matches the other metrics. +- **RPS**: Ranked Probability Score for ordered buckets. +- **Top probability**: the largest probability in the forecast; higher means sharper forecasts. +- **Normalized entropy**: forecast uncertainty; lower means more concentrated predictions. +- **True probability**: probability assigned to the actual resolved outcome; higher is better. + +## Input format -- **Brier score**: probability error; lower is better. -- **Log score**: how much probability was assigned to the true answer; lower is better. -- **Accuracy**: whether the top predicted bucket matches the true bucket; higher is better. -- **RPS**: Ranked Probability Score for ordered buckets; lower is better. -- **Top probability**: the largest probability in the forecast; higher means sharper forecasts. -- **Normalized entropy**: forecast uncertainty; lower means more concentrated predictions.
-- **True probability**: probability assigned to the actual resolved outcome; higher is better. +Forecast files must contain at least: -## Comparing sources +- `question_id` +- `forecast_source` +- `forecast_version` +- `option` +- `probability` -If you pass more than one forecast CSV, the pipeline compares them question-by-question and produces: +`forecast_version` is the versioned update number, not a random run id. Use it to place repeated forecasts on the timeline. -- paired scatter plots -- per-question difference plots -- win-rate plots -- significance tests (paired t-test, Wilcoxon signed-rank, McNemar) +Example: + +```csv +question_id;forecast_source;forecast_version;option;probability +q1;bioscancast;1;70-100;0.70 +q1;bioscancast;2;70-100;0.45 +q1;bioscancast;3;70-100;0.15 +``` ## How to run -Single source: +From the project root: ```bash -python -m bioscancast.stages.eval_stage.main \ - --forecasts bioscancast/stages/eval_stage/bioscancast_forecasts.csv +python -m bioscancast.stages.eval_stage.main ``` -Human vs BioScanCast: +To run custom forecasts, pass one or more CSV files: ```bash -python -m bioscancast.stages.eval_stage.main \ - --forecasts bioscancast/stages/eval_stage/mock_forecasts/human_forecasts.csv \ - bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv +python -m bioscancast.stages.eval_stage.main --forecasts bioscancast/stages/eval_stage/mock_forecasts/human_forecasts.csv bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv bioscancast/stages/eval_stage/mock_forecasts/llm_baseline_forecasts.csv ``` ## Outputs @@ -46,24 +65,45 @@ The pipeline writes results to `bioscancast/stages/eval_stage/outputs/`. 
Key files: +- `question_level_metrics.csv` - `summary_metrics.csv` - `summary_metrics_by_question_type.csv` -- `question_level_metrics.csv` -- `pairwise_comparison.csv` -- `significance_tests.csv` -- `source_comparison.png` -- `brier_boxplot.png` -- `log_boxplot.png` -- `rps_boxplot.png` -- `calibration_.png` -- `scatter_.png` -- `differences_.png` -- `win_rate_.png` - -## How to explain it quickly - -The stage does three things: - -1. **Scores each forecast** against the resolved answer. -2. **Summarizes performance** by source. -3. **Compares sources directly** on the same questions and checks whether the differences are likely real or just random noise. +- `summary_metrics_over_time.csv` +- `source_ranking_over_time.csv` +- `metric_improvement_over_time.csv` +- `score_timeline_boxplots.png` +- `source_timeline_summary.png` +- `improvement_vs_v1.png` +- `question_heatmap.png` +- `source_ranking_over_time.png` + +### Main plots + +- `score_timeline_boxplots.png` shows the distribution of question scores for each source and version. +- `source_timeline_summary.png` shows the median trajectory with interquartile-range bands. +- `improvement_vs_v1.png` shows how much each version improves over version 1, where positive values mean better performance. +- `question_heatmap.png` shows question-level evolution across versions. +- `source_ranking_over_time.png` shows the source ranking by median Brier score at each version. 
+ +## Perplexity forecasts + +To generate a new CSV from Perplexity, use the forecast generator and point it at the question file and a template forecast file so it can reuse the same option sets: + +```bash +python -m bioscancast.stages.eval_stage.generate_perplexity_forecasts --questions bioscancast/stages/eval_stage/bioscancast_questions.csv --template-forecasts bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv --output bioscancast/stages/eval_stage/mock_forecasts/perplexity_forecasts.csv +``` + +The script expects `PERPLEXITY_API_KEY` to be set in the environment and requires the `openai` package. It writes the model name (for example `sonar-reasoning-pro`) into `forecast_version`, so its rows sort after the numeric versions. + +## Interpretation + +If BioScanCast learns over time, we expect: + +- Brier score down +- Log score down +- RPS down +- Accuracy error down + +from version 1 to version 3. + +The versioned demo mocks in this folder are set up to make those trends visible. diff --git a/bioscancast/stages/eval_stage/compare.py b/bioscancast/stages/eval_stage/compare.py index 3e347e9..c052cca 100644 --- a/bioscancast/stages/eval_stage/compare.py +++ b/bioscancast/stages/eval_stage/compare.py @@ -1,71 +1,209 @@ from __future__ import annotations -from typing import Iterable +from typing import Iterable, Sequence import pandas as pd -REQUIRED_RESULT_COLUMNS = { - "question_id", - "forecast_source", - "brier_score", - "log_score", - "accuracy", - "rps", - "top_probability", - "normalized_entropy", - "true_probability", -} +METRIC_COLUMNS = [ + 'brier_score', + 'log_score', + 'accuracy', + 'accuracy_error', + 'rps', + 'top_probability', + 'normalized_entropy', + 'true_probability', +] + + +def _version_sort_key(value): + text = str(value) + try: + return (0, float(text)) + except Exception: + return (1, text) + + +def _ordered_unique(values: Sequence[object]) -> list[str]: + seen: list[str] = [] + for value in values: + text = str(value) + if text not in seen: + seen.append(text) + return seen def _require_columns(df: pd.DataFrame, required: Iterable[str]) -> None: missing = [col for col in required
if col not in df.columns] if missing: - raise ValueError("results_df is missing required columns: " + ", ".join(missing)) + raise ValueError('results_df is missing required columns: ' + ', '.join(missing)) + + +def _clean_numeric(series: pd.Series) -> pd.Series: + return pd.to_numeric(series, errors='coerce').dropna().astype(float) + + +def _summarise_group(group: pd.DataFrame, key_columns: Sequence[str]) -> dict: + row: dict = {col: group.iloc[0][col] for col in key_columns} + row['n_questions'] = int(group['question_id'].nunique()) if 'question_id' in group.columns else int(len(group)) + if 'forecast_version' in group.columns: + row['n_versions'] = int(group['forecast_version'].nunique()) + + for metric in METRIC_COLUMNS: + if metric not in group.columns: + continue + values = _clean_numeric(group[metric]) + if values.empty: + row[f'mean_{metric}'] = float('nan') + row[f'median_{metric}'] = float('nan') + row[f'std_{metric}'] = float('nan') + row[f'q1_{metric}'] = float('nan') + row[f'q3_{metric}'] = float('nan') + continue + row[f'mean_{metric}'] = float(values.mean()) + row[f'median_{metric}'] = float(values.median()) + row[f'std_{metric}'] = float(values.std(ddof=1)) if len(values) > 1 else 0.0 + row[f'q1_{metric}'] = float(values.quantile(0.25)) + row[f'q3_{metric}'] = float(values.quantile(0.75)) + return row + + +def _summarise_by_group(results_df: pd.DataFrame, group_cols: Sequence[str]) -> pd.DataFrame: + _require_columns(results_df, {'question_id', *group_cols}) + rows = [_summarise_group(group, group_cols) for _, group in results_df.groupby(list(group_cols), dropna=False)] + summary = pd.DataFrame(rows) + if summary.empty: + return summary + + sort_cols = [col for col in group_cols if col in summary.columns] + if 'forecast_version' in summary.columns: + summary = summary.sort_values( + ['forecast_version'] + [c for c in sort_cols if c != 'forecast_version'], + key=lambda s: s.map(_version_sort_key) if s.name == 'forecast_version' else s.astype(str), + ) 
+ else: + summary = summary.sort_values(sort_cols) + return summary.reset_index(drop=True) def compare_sources(results_df: pd.DataFrame) -> pd.DataFrame: """Aggregate scoring metrics by forecast source.""" - _require_columns(results_df, REQUIRED_RESULT_COLUMNS) - - summary = ( - results_df.groupby("forecast_source", dropna=False) - .agg( - n_questions=("question_id", "nunique"), - mean_brier_score=("brier_score", "mean"), - mean_log_score=("log_score", "mean"), - mean_accuracy=("accuracy", "mean"), - mean_rps=("rps", "mean"), - mean_top_probability=("top_probability", "mean"), - mean_normalized_entropy=("normalized_entropy", "mean"), - mean_true_probability=("true_probability", "mean"), - ) - .reset_index() - .sort_values("mean_brier_score", ascending=True) - ) + _require_columns(results_df, {'question_id', 'forecast_source'}) + summary = _summarise_by_group(results_df, ['forecast_source']) + if 'median_brier_score' in summary.columns: + summary = summary.sort_values(['median_brier_score', 'forecast_source'], ascending=[True, True]).reset_index(drop=True) + return summary + + +def compare_sources_over_time(results_df: pd.DataFrame) -> pd.DataFrame: + """Aggregate scoring metrics by forecast version and source.""" + _require_columns(results_df, {'question_id', 'forecast_source', 'forecast_version'}) + summary = _summarise_by_group(results_df, ['forecast_version', 'forecast_source']) + if not summary.empty: + summary = summary.sort_values( + ['forecast_version', 'forecast_source'], + key=lambda s: s.map(_version_sort_key) if s.name == 'forecast_version' else s.astype(str), + ).reset_index(drop=True) return summary def compare_sources_by_question_type(results_df: pd.DataFrame) -> pd.DataFrame: """Aggregate scoring metrics by forecast source and question type.""" - _require_columns(results_df, REQUIRED_RESULT_COLUMNS) - - if "question_type" not in results_df.columns: + _require_columns(results_df, {'question_id', 'forecast_source'}) + if 'question_type' not in 
results_df.columns: raise ValueError("results_df must contain a 'question_type' column to compare by question type.") - summary = ( - results_df.groupby(["forecast_source", "question_type"], dropna=False) - .agg( - n_questions=("question_id", "nunique"), - mean_brier_score=("brier_score", "mean"), - mean_log_score=("log_score", "mean"), - mean_accuracy=("accuracy", "mean"), - mean_rps=("rps", "mean"), - mean_top_probability=("top_probability", "mean"), - mean_normalized_entropy=("normalized_entropy", "mean"), - mean_true_probability=("true_probability", "mean"), - ) - .reset_index() - .sort_values(["forecast_source", "question_type"]) - ) + summary = _summarise_by_group(results_df, ['forecast_source', 'question_type']) + if not summary.empty: + summary = summary.sort_values(['forecast_source', 'question_type']).reset_index(drop=True) return summary + + +def rank_sources_over_time(summary_df: pd.DataFrame, *, metric_column: str = 'median_brier_score', ascending: bool = True) -> pd.DataFrame: + """Rank sources within each version using the selected metric.""" + required = {'forecast_version', 'forecast_source', metric_column} + _require_columns(summary_df, required) + + rows = [] + for version, group in summary_df.groupby('forecast_version', dropna=False): + ordered = group.sort_values([metric_column, 'forecast_source'], ascending=[ascending, True]).reset_index(drop=True) + ranks = ordered[metric_column].rank(method='dense', ascending=ascending).astype(int) + for idx, (_, row) in enumerate(ordered.iterrows()): + rows.append( + { + 'forecast_version': version, + 'forecast_source': row['forecast_source'], + 'metric_column': metric_column, + 'metric_value': float(row[metric_column]), + 'rank': int(ranks.iloc[idx]), + } + ) + + ranked = pd.DataFrame(rows) + if ranked.empty: + return ranked + ranked = ranked.sort_values( + ['forecast_version', 'rank', 'forecast_source'], + key=lambda s: s.map(_version_sort_key) if s.name == 'forecast_version' else s.astype(str), + ) + 
return ranked.reset_index(drop=True) + + +def relative_improvement_over_time(results_df: pd.DataFrame, baseline_version: str | None = None) -> pd.DataFrame: + """Measure change versus the baseline forecast version for each source and metric. + + Deltas are baseline minus current, so positive values mean improvement for + lower-is-better metrics and a decline for accuracy and true_probability. + """ + required = {'question_id', 'forecast_source', 'forecast_version'} | set(METRIC_COLUMNS) + _require_columns(results_df, required) + + versions = sorted(_ordered_unique(results_df['forecast_version'].tolist()), key=_version_sort_key) + if not versions: + return pd.DataFrame() + if baseline_version is None: + baseline_version = versions[0] + + rows = [] + for source, source_df in results_df.groupby('forecast_source', dropna=False): + baseline_df = source_df[source_df['forecast_version'].astype(str) == str(baseline_version)] + if baseline_df.empty: + continue + for version in versions: + current_df = source_df[source_df['forecast_version'].astype(str) == str(version)] + merged = baseline_df[['question_id', *METRIC_COLUMNS]].merge( + current_df[['question_id', *METRIC_COLUMNS]], + on='question_id', + suffixes=('_baseline', '_current'), + ) + if merged.empty: + continue + for metric in METRIC_COLUMNS: + pair = merged[[f'{metric}_baseline', f'{metric}_current']].dropna() + if pair.empty: + continue + deltas = pair[f'{metric}_baseline'] - pair[f'{metric}_current'] + rows.append( + { + 'forecast_source': source, + 'forecast_version': version, + 'baseline_version': baseline_version, + 'metric': metric, + 'n_questions': int(len(deltas)), + 'mean_improvement': float(deltas.mean()), + 'median_improvement': float(deltas.median()), + 'std_improvement': float(deltas.std(ddof=1)) if len(deltas) > 1 else 0.0, + 'q1_improvement': float(deltas.quantile(0.25)), + 'q3_improvement': float(deltas.quantile(0.75)), + } + ) + + improvement = pd.DataFrame(rows) + if improvement.empty: + return improvement + improvement =
improvement.sort_values( + ['metric', 'forecast_version', 'forecast_source'], + key=lambda s: s.map(_version_sort_key) if s.name == 'forecast_version' else s.astype(str), + ) + return improvement.reset_index(drop=True) diff --git a/bioscancast/stages/eval_stage/generate_perplexity_forecasts.py b/bioscancast/stages/eval_stage/generate_perplexity_forecasts.py new file mode 100644 index 0000000..7184788 --- /dev/null +++ b/bioscancast/stages/eval_stage/generate_perplexity_forecasts.py @@ -0,0 +1,209 @@ +from __future__ import annotations + +import argparse +import json +import os +import re +from pathlib import Path +from typing import Any, Dict, List + +import pandas as pd +from openai import OpenAI + +from bioscancast.stages.eval_stage.loaders import load_forecasts, load_questions + + +BASE_DIR = Path(__file__).resolve().parent +DEFAULT_QUESTIONS = BASE_DIR / "bioscancast_questions.csv" +DEFAULT_TEMPLATE_FORECASTS = BASE_DIR / "mock_forecasts" / "bioscancast_forecasts.csv" +DEFAULT_OUTPUT = BASE_DIR / "mock_forecasts" / "perplexity_forecasts.csv" +MODEL_CHOICES = ("sonar-pro", "sonar-reasoning-pro", "sonar-deep-research") + + +def _extract_json_object(text: str) -> Dict[str, Any]: + cleaned = text.strip() + if cleaned.startswith("```"): + cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned) + cleaned = re.sub(r"\s*```$", "", cleaned) + + try: + parsed = json.loads(cleaned) + if isinstance(parsed, dict): + return parsed + except Exception: + pass + + start = cleaned.find("{") + end = cleaned.rfind("}") + if start < 0 or end < 0 or end <= start: + raise ValueError(f"Could not locate a JSON object in the response: {text[:500]}") + + parsed = json.loads(cleaned[start : end + 1]) + if not isinstance(parsed, dict): + raise ValueError("Expected a JSON object from Perplexity.") + return parsed + + +def _options_from_template(template_group: pd.DataFrame) -> List[str]: + options: List[str] = [] + seen = set() + for raw in template_group["option"].tolist(): + option = 
str(raw).strip() + if option and option not in seen: + options.append(option) + seen.add(option) + if not options: + raise ValueError("Template forecast group has no options.") + return options + + +def _build_prompt(question_row: pd.Series, options: List[str]) -> str: + options_block = "\n".join(f"- {option}" for option in options) + return f"""You are writing a probabilistic forecast for a benchmark dataset. + +Question ID: {question_row['question_id']} +Topic: {question_row.get('topic', '')} +Question text: {question_row.get('question_text', '')} +Question type: {question_row.get('question_type', '')} +Resolution criteria: {question_row.get('resolution_criteria', '')} + +Choose probabilities for exactly these options: +{options_block} + +Return ONLY valid JSON with this exact schema: +{{ + "probabilities": {{ + "option text 1": 0.0, + "option text 2": 0.0 + }} +}} + +Rules: +- include every option exactly once +- probabilities must be numbers between 0 and 1 +- probabilities should sum to 1.0 +- do not include markdown, explanations, or code fences +""".strip() + + +def _parse_probability_map(payload: Dict[str, Any], options: List[str]) -> Dict[str, float]: + if isinstance(payload.get("probabilities"), dict): + raw_map = payload["probabilities"] + elif all(option in payload for option in options): + raw_map = payload + elif isinstance(payload.get("options"), list): + raw_map = {} + for item in payload["options"]: + if isinstance(item, dict) and "option" in item and "probability" in item: + raw_map[str(item["option"]).strip()] = item["probability"] + else: + raise ValueError("Could not find a probability mapping in the response JSON.") + + result: Dict[str, float] = {} + for option in options: + value = raw_map.get(option, 0.0) + try: + result[option] = max(float(value), 0.0) + except Exception as exc: + raise ValueError(f"Invalid probability for option {option!r}: {value!r}") from exc + + total = sum(result.values()) + if total <= 0: + uniform = 1.0 / 
len(options) + return {option: uniform for option in options} + + return {option: value / total for option, value in result.items()} + + +def _get_client() -> OpenAI: + api_key = os.getenv("PERPLEXITY_API_KEY") + if not api_key: + raise RuntimeError("PERPLEXITY_API_KEY is not set.") + return OpenAI(api_key=api_key, base_url="https://api.perplexity.ai") + + +def _generate_rows_for_question(client: OpenAI, model: str, question_row: pd.Series, options: List[str]) -> List[Dict[str, Any]]: + prompt = _build_prompt(question_row, options) + response = client.chat.completions.create( + model=model, + messages=[{"role": "user", "content": prompt}], + temperature=0, + ) + + content = response.choices[0].message.content or "" + parsed = _extract_json_object(content) + probability_map = _parse_probability_map(parsed, options) + + return [ + { + "question_id": question_row["question_id"], + "forecast_source": "perplexity", + "forecast_version": model, + "option": option, + "probability": probability_map[option], + } + for option in options + ] + + +def build_forecasts(questions_path: str | Path, template_forecasts_path: str | Path, model: str) -> List[Dict[str, Any]]: + questions_df = load_questions(questions_path) + template_df = load_forecasts(template_forecasts_path) + + if "question_id" not in questions_df.columns: + raise ValueError("Questions file must contain question_id.") + if "question_id" not in template_df.columns or "option" not in template_df.columns: + raise ValueError("Template forecasts must contain question_id and option columns.") + + questions_df = questions_df.copy() + questions_df["question_id"] = questions_df["question_id"].astype(str).str.strip() + question_lookup = questions_df.set_index("question_id", drop=False) + + client = _get_client() + rows: List[Dict[str, Any]] = [] + + for question_id, group in template_df.groupby("question_id", sort=False): + question_id = str(question_id).strip() + if question_id not in question_lookup.index: + raise 
KeyError( + f"Question {question_id!r} is present in the template forecasts but missing from {questions_path}." + ) + + question_row = question_lookup.loc[question_id] + options = _options_from_template(group) + rows.extend(_generate_rows_for_question(client, model, question_row, options)) + + return rows + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser(description="Generate a Perplexity forecast CSV for BioScanCast.") + parser.add_argument("--questions", default=str(DEFAULT_QUESTIONS), help="Path to bioscancast_questions.csv.") + parser.add_argument( + "--template-forecasts", + default=str(DEFAULT_TEMPLATE_FORECASTS), + help="Forecast CSV used to reuse the option sets.", + ) + parser.add_argument("--output", default=str(DEFAULT_OUTPUT), help="Where to write the generated forecast CSV.") + parser.add_argument( + "--model", + choices=MODEL_CHOICES, + default="sonar-reasoning-pro", + help="Perplexity model to query.", + ) + return parser.parse_args() + + +def main() -> None: + args = parse_args() + rows = build_forecasts(args.questions, args.template_forecasts, args.model) + + output_path = Path(args.output) + output_path.parent.mkdir(parents=True, exist_ok=True) + df = pd.DataFrame(rows) + df.to_csv(output_path, sep=";", index=False, decimal=",", encoding="cp1252") + print(f"Wrote {len(df)} rows to {output_path}") + + +if __name__ == "__main__": + main() diff --git a/bioscancast/stages/eval_stage/loaders.py b/bioscancast/stages/eval_stage/loaders.py index 024d670..9a87d98 100644 --- a/bioscancast/stages/eval_stage/loaders.py +++ b/bioscancast/stages/eval_stage/loaders.py @@ -100,7 +100,7 @@ def load_forecasts(path: PathLike) -> pd.DataFrame: ) if "forecast_version" in df.columns: - df["forecast_version"] = df["forecast_version"].str.strip() + df["forecast_version"] = df["forecast_version"].astype(str).str.strip() if "question_id" in df.columns: df["question_id"] = df["question_id"].str.strip() diff --git 
a/bioscancast/stages/eval_stage/main.py b/bioscancast/stages/eval_stage/main.py index 4a16094..560f5b0 100644 --- a/bioscancast/stages/eval_stage/main.py +++ b/bioscancast/stages/eval_stage/main.py @@ -7,20 +7,25 @@ BASE_DIR = Path(__file__).resolve().parent +DEFAULT_FORECASTS = [ + str(BASE_DIR / 'mock_forecasts' / 'human_forecasts.csv'), + str(BASE_DIR / 'mock_forecasts' / 'bioscancast_forecasts.csv'), + str(BASE_DIR / 'mock_forecasts' / 'llm_baseline_forecasts.csv'), +] def parse_args() -> argparse.Namespace: - parser = argparse.ArgumentParser(description="Run the BioScanCast evaluation pipeline.") + parser = argparse.ArgumentParser(description='Run the BioScanCast evaluation pipeline.') parser.add_argument( - "--forecasts", - nargs="+", - default=[str(BASE_DIR / "bioscancast_forecasts.csv")], - help="One or more forecast CSV files. Pass multiple files to compare sources.", + '--forecasts', + nargs='+', + default=DEFAULT_FORECASTS, + help='One or more forecast CSV files. The forecast_version column orders repeated forecasts on the timeline.', ) parser.add_argument( - "--questions", - default=str(BASE_DIR / "bioscancast_questions.csv"), - help="Path to the questions CSV file.", + '--questions', + default=str(BASE_DIR / 'bioscancast_questions.csv'), + help='Path to the questions CSV file.', ) return parser.parse_args() @@ -30,5 +35,5 @@ def main() -> None: run_evaluation(forecasts_path=args.forecasts, questions_path=args.questions) -if __name__ == "__main__": +if __name__ == '__main__': main() diff --git a/bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv b/bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv index b8c15f2..6845a2a 100644 --- a/bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv +++ b/bioscancast/stages/eval_stage/mock_forecasts/bioscancast_forecasts.csv @@ -1,46 +1,130 @@ question_id;forecast_source;forecast_version;option;probability -q1;bioscancast;final;70-100;0,977
-q1;bioscancast;final;100-150;0,017 -q1;bioscancast;final;150-200;0,002 -q1;bioscancast;final;200+;0,003 -q2;bioscancast;final;70-100;0,066 -q2;bioscancast;final;100-200;0,422 -q2;bioscancast;final;200-300;0,337 -q2;bioscancast;final;300+;0,067 -q3;bioscancast;final;970-1000;0,902 -q3;bioscancast;final;1000-1100;0,044 -q3;bioscancast;final;1100-1200;0,025 -q3;bioscancast;final;1200+;0,004 -q4;bioscancast;final;10-15;0,665 -q4;bioscancast;final;15-20;0,125 -q4;bioscancast;final;20-25;0,005 -q4;bioscancast;final;25+;0,003 -q5;bioscancast;final;YES;0,077 -q5;bioscancast;final;NO;0,923 -q6;bioscancast;before_new_info;YES;0,17 -q6;bioscancast;before_new_info;NO;0,83 -q6;bioscancast;after_new_info;YES;0,139 -q6;bioscancast;after_new_info;NO;0,861 -q7;bioscancast;final;124,831-126,000;0,35 -q7;bioscancast;final;126,001-128,500;0,373 -q7;bioscancast;final;128,501-131,000;0,006 -q7;bioscancast;final;131,001+;0,001 -q8;bioscancast;final;124,831-126,000;0,019 -q8;bioscancast;final;126,001-128,500;0,071 -q8;bioscancast;final;128,501-131,000;0,227 -q8;bioscancast;final;131,001+;0,276 -q9;bioscancast;final;0-1;0,002 -q9;bioscancast;final;2-5;0,006 -q9;bioscancast;final;6-10;0,017 -q9;bioscancast;final;11+;0,968 -q10;bioscancast;final;9-15;0,098 -q10;bioscancast;final;16-20;0,089 -q10;bioscancast;final;21-25;0,131 -q10;bioscancast;final;26-30;0,04 -q10;bioscancast;final;30+;0,077 -q11;bioscancast;final;Vector-borne;0,299 -q11;bioscancast;final;Viral;0,096 -q11;bioscancast;final;Water contamination;0,088 -q11;bioscancast;final;Combination of causes;0,341 -q11;bioscancast;final;Uncertain;0,132 -q11;bioscancast;final;Other;0,003 +q1;bioscancast;3;100-150;0,017 +q1;bioscancast;3;150-200;0,002 +q1;bioscancast;3;200+;0,003 +q1;bioscancast;3;70-100;0,977 +q1;bioscancast;2;100-150;0,1 +q1;bioscancast;2;150-200;0,05 +q1;bioscancast;2;200+;0,05 +q1;bioscancast;2;70-100;0,8 +q1;bioscancast;1;100-150;0,6 +q1;bioscancast;1;150-200;0,1 +q1;bioscancast;1;200+;0,1 +q1;bioscancast;1;70-100;0,2 
+q10;bioscancast;3;16-20;0,089 +q10;bioscancast;3;21-25;0,131 +q10;bioscancast;3;26-30;0,04 +q10;bioscancast;3;30+;0,077 +q10;bioscancast;3;9-15;0,098 +q10;bioscancast;2;16-20;0,12785 +q10;bioscancast;2;21-25;0,15515 +q10;bioscancast;2;26-30;0,096 +q10;bioscancast;2;30+;0,12005 +q10;bioscancast;2;9-15;0,1337 +q10;bioscancast;1;16-20;0,189989 +q10;bioscancast;1;21-25;0,396007 +q10;bioscancast;1;26-30;0,041572 +q10;bioscancast;1;30+;0,144284 +q10;bioscancast;1;9-15;0,228148 +q11;bioscancast;3;Combination of causes;0,341 +q11;bioscancast;3;Other;0,003 +q11;bioscancast;3;Uncertain;0,132 +q11;bioscancast;3;Vector-borne;0,299 +q11;bioscancast;3;Viral;0,096 +q11;bioscancast;3;Water contamination;0,088 +q11;bioscancast;2;Combination of causes;0,279983 +q11;bioscancast;2;Other;0,060283 +q11;bioscancast;2;Uncertain;0,144133 +q11;bioscancast;2;Vector-borne;0,252683 +q11;bioscancast;2;Viral;0,120733 +q11;bioscancast;2;Water contamination;0,115533 +q11;bioscancast;1;Combination of causes;0,473907 +q11;bioscancast;1;Other;0,000059 +q11;bioscancast;1;Uncertain;0,078082 +q11;bioscancast;1;Vector-borne;0,369177 +q11;bioscancast;1;Viral;0,042636 +q11;bioscancast;1;Water contamination;0,036139 +q2;bioscancast;3;100-200;0,422 +q2;bioscancast;3;200-300;0,337 +q2;bioscancast;3;300+;0,067 +q2;bioscancast;3;70-100;0,066 +q2;bioscancast;2;100-200;0,3618 +q2;bioscancast;2;200-300;0,30655 +q2;bioscancast;2;300+;0,13105 +q2;bioscancast;2;70-100;0,1304 +q2;bioscancast;1;100-200;0,584119 +q2;bioscancast;1;200-300;0,380982 +q2;bioscancast;1;300+;0,017699 +q2;bioscancast;1;70-100;0,0172 +q3;bioscancast;3;1000-1100;0,044 +q3;bioscancast;3;1100-1200;0,025 +q3;bioscancast;3;1200+;0,004 +q3;bioscancast;3;970-1000;0,902 +q3;bioscancast;2;1000-1100;0,55 +q3;bioscancast;2;1100-1200;0,1 +q3;bioscancast;2;1200+;0,1 +q3;bioscancast;2;970-1000;0,25 +q3;bioscancast;1;1000-1100;0,15 +q3;bioscancast;1;1100-1200;0,1 +q3;bioscancast;1;1200+;0,1 +q3;bioscancast;1;970-1000;0,65 +q4;bioscancast;3;10-15;0,665 
+q4;bioscancast;3;15-20;0,125 +q4;bioscancast;3;20-25;0,005 +q4;bioscancast;3;25+;0,003 +q4;bioscancast;2;10-15;0,51975 +q4;bioscancast;2;15-20;0,16875 +q4;bioscancast;2;20-25;0,09075 +q4;bioscancast;2;25+;0,08945 +q4;bioscancast;1;10-15;0,959796 +q4;bioscancast;1;15-20;0,040082 +q4;bioscancast;1;20-25;0,000088 +q4;bioscancast;1;25+;0,000034 +q5;bioscancast;3;NO;0,923 +q5;bioscancast;3;YES;0,077 +q5;bioscancast;2;NO;0,77495 +q5;bioscancast;2;YES;0,22505 +q5;bioscancast;1;NO;0,991157 +q5;bioscancast;1;YES;0,008843 +q6;bioscancast;3;NO;0,8455 +q6;bioscancast;3;YES;0,1545 +q6;bioscancast;2;NO;0,724575 +q6;bioscancast;2;YES;0,275425 +q6;bioscancast;1;NO;0,961929 +q6;bioscancast;1;YES;0,038071 +q7;bioscancast;3;124,831-126,000;0,35 +q7;bioscancast;3;126,001-128,500;0,373 +q7;bioscancast;3;128,501-131,000;0,006 +q7;bioscancast;3;131,001+;0,001 +q7;bioscancast;2;124,831-126,000;0,2 +q7;bioscancast;2;126,001-128,500;0,55 +q7;bioscancast;2;128,501-131,000;0,15 +q7;bioscancast;2;131,001+;0,1 +q7;bioscancast;1;124,831-126,000;0,15 +q7;bioscancast;1;126,001-128,500;0,2 +q7;bioscancast;1;128,501-131,000;0,15 +q7;bioscancast;1;131,001+;0,5 +q8;bioscancast;3;124,831-126,000;0,019 +q8;bioscancast;3;126,001-128,500;0,071 +q8;bioscancast;3;128,501-131,000;0,227 +q8;bioscancast;3;131,001+;0,276 +q8;bioscancast;2;124,831-126,000;0,09985 +q8;bioscancast;2;126,001-128,500;0,13365 +q8;bioscancast;2;128,501-131,000;0,23505 +q8;bioscancast;2;131,001+;0,2669 +q8;bioscancast;1;124,831-126,000;0,003495 +q8;bioscancast;1;126,001-128,500;0,042781 +q8;bioscancast;1;128,501-131,000;0,389323 +q8;bioscancast;1;131,001+;0,564401 +q9;bioscancast;3;0-1;0,002 +q9;bioscancast;3;11+;0,968 +q9;bioscancast;3;2-5;0,006 +q9;bioscancast;3;6-10;0,017 +q9;bioscancast;2;0-1;0,15 +q9;bioscancast;2;11+;0,25 +q9;bioscancast;2;2-5;0,15 +q9;bioscancast;2;6-10;0,45 +q9;bioscancast;1;0-1;0,2 +q9;bioscancast;1;11+;0,4 +q9;bioscancast;1;2-5;0,25 +q9;bioscancast;1;6-10;0,15 diff --git 
a/bioscancast/stages/eval_stage/mock_forecasts/human_forecasts.csv b/bioscancast/stages/eval_stage/mock_forecasts/human_forecasts.csv index ba00873..35990d5 100644 --- a/bioscancast/stages/eval_stage/mock_forecasts/human_forecasts.csv +++ b/bioscancast/stages/eval_stage/mock_forecasts/human_forecasts.csv @@ -1,44 +1,130 @@ question_id;forecast_source;forecast_version;option;probability -q1;human;mock;70-100;0,961 -q1;human;mock;100-150;0,027 -q1;human;mock;150-200;0,01 -q1;human;mock;200+;0,002 -q2;human;mock;70-100;0,1058941058941059 -q2;human;mock;100-200;0,4375624375624376 -q2;human;mock;200-300;0,3506493506493506 -q2;human;mock;300+;0,1058941058941058 -q3;human;mock;970-1000;0,893 -q3;human;mock;1000-1100;0,044 -q3;human;mock;1100-1200;0,05 -q3;human;mock;1200+;0,0129999999999999 -q4;human;mock;10-15;0,849 -q4;human;mock;15-20;0,091 -q4;human;mock;20-25;0,042 -q4;human;mock;25+;0,018 -q5;human;mock;YES;0,073 -q5;human;mock;NO;0,927 -q6;human;mock;YES;0,098 -q6;human;mock;NO;0,902 -q7;human;mock;124,831-126,000;0,372 -q7;human;mock;126,001-128,500;0,585 -q7;human;mock;128,501-131,000;0,003 -q7;human;mock;131,001+;0,04 -q8;human;mock;124,831-126,000;0,0359640359640359 -q8;human;mock;126,001-128,500;0,1718281718281718 -q8;human;mock;128,501-131,000;0,4085914085914085 -q8;human;mock;131,001+;0,3836163836163837 -q9;human;mock;0-1;0,005 -q9;human;mock;2-5;0,024 -q9;human;mock;6-10;0,036 -q9;human;mock;11+;0,935 -q10;human;mock;9-15;0,259 -q10;human;mock;16-20;0,204 -q10;human;mock;21-25;0,33 -q10;human;mock;26-30;0,068 -q10;human;mock;30+;0,139 -q11;human;mock;Vector-borne;0,347 -q11;human;mock;Viral;0,078 -q11;human;mock;Water contamination;0,088 -q11;human;mock;Combination of causes;0,346 -q11;human;mock;Uncertain;0,136 -q11;human;mock;Other;0,005 +q1;human;1;100-150;0,027 +q1;human;1;150-200;0,01 +q1;human;1;200+;0,002 +q1;human;1;70-100;0,961 +q1;human;2;100-150;0,1 +q1;human;2;150-200;0,1 +q1;human;2;200+;0,1 +q1;human;2;70-100;0,7 +q1;human;3;100-150;0,03 
+q1;human;3;150-200;0,04 +q1;human;3;200+;0,03 +q1;human;3;70-100;0,9 +q10;human;1;16-20;0,204 +q10;human;1;21-25;0,33 +q10;human;1;26-30;0,068 +q10;human;1;30+;0,139 +q10;human;1;9-15;0,259 +q10;human;2;16-20;0,2022 +q10;human;2;21-25;0,2715 +q10;human;2;26-30;0,1274 +q10;human;2;30+;0,16645 +q10;human;2;9-15;0,23245 +q10;human;3;16-20;0,187349 +q10;human;3;21-25;0,404451 +q10;human;3;26-30;0,032304 +q10;human;3;30+;0,101407 +q10;human;3;9-15;0,274488 +q11;human;1;Combination of causes;0,346 +q11;human;1;Other;0,005 +q11;human;1;Uncertain;0,136 +q11;human;1;Vector-borne;0,347 +q11;human;1;Viral;0,078 +q11;human;1;Water contamination;0,088 +q11;human;2;Combination of causes;0,2653 +q11;human;2;Other;0,07775 +q11;human;2;Uncertain;0,1498 +q11;human;2;Vector-borne;0,26585 +q11;human;2;Viral;0,1179 +q11;human;2;Water contamination;0,1234 +q11;human;3;Combination of causes;0,410796 +q11;human;3;Other;0,000467 +q11;human;3;Uncertain;0,092207 +q11;human;3;Vector-borne;0,412697 +q11;human;3;Viral;0,037884 +q11;human;3;Water contamination;0,045949 +q2;human;1;100-200;0,437562 +q2;human;1;200-300;0,350649 +q2;human;1;300+;0,105894 +q2;human;1;70-100;0,105894 +q2;human;2;100-200;0,353159 +q2;human;2;200-300;0,305357 +q2;human;2;300+;0,170742 +q2;human;2;70-100;0,170742 +q2;human;3;100-200;0,524031 +q2;human;3;200-300;0,367697 +q2;human;3;300+;0,054136 +q2;human;3;70-100;0,054136 +q3;human;1;1000-1100;0,044 +q3;human;1;1100-1200;0,05 +q3;human;1;1200+;0,013 +q3;human;1;970-1000;0,893 +q3;human;2;1000-1100;0,1 +q3;human;2;1100-1200;0,5 +q3;human;2;1200+;0,1 +q3;human;2;970-1000;0,3 +q3;human;3;1000-1100;0,1 +q3;human;3;1100-1200;0,15 +q3;human;3;1200+;0,05 +q3;human;3;970-1000;0,7 +q4;human;1;10-15;0,849 +q4;human;1;15-20;0,091 +q4;human;1;20-25;0,042 +q4;human;1;25+;0,018 +q4;human;2;10-15;0,57945 +q4;human;2;15-20;0,16255 +q4;human;2;20-25;0,1356 +q4;human;2;25+;0,1224 +q4;human;3;10-15;0,9631 +q4;human;3;15-20;0,027032 +q4;human;3;20-25;0,007845 +q4;human;3;25+;0,002022 
+q5;human;1;NO;0,927 +q5;human;1;YES;0,073 +q5;human;2;NO;0,73485 +q5;human;2;YES;0,26515 +q5;human;3;NO;0,98315 +q5;human;3;YES;0,01685 +q6;human;1;NO;0,902 +q6;human;1;YES;0,098 +q6;human;2;NO;0,7211 +q6;human;2;YES;0,2789 +q6;human;3;NO;0,972116 +q6;human;3;YES;0,027884 +q7;human;1;124,831-126,000;0,372 +q7;human;1;126,001-128,500;0,585 +q7;human;1;128,501-131,000;0,003 +q7;human;1;131,001+;0,04 +q7;human;2;124,831-126,000;0,45 +q7;human;2;126,001-128,500;0,25 +q7;human;2;128,501-131,000;0,2 +q7;human;2;131,001+;0,1 +q7;human;3;124,831-126,000;0,15 +q7;human;3;126,001-128,500;0,3 +q7;human;3;128,501-131,000;0,15 +q7;human;3;131,001+;0,4 +q8;human;1;124,831-126,000;0,035964 +q8;human;1;126,001-128,500;0,171828 +q8;human;1;128,501-131,000;0,408591 +q8;human;1;131,001+;0,383616 +q8;human;2;124,831-126,000;0,13228 +q8;human;2;126,001-128,500;0,207005 +q8;human;2;128,501-131,000;0,337225 +q8;human;2;131,001+;0,323489 +q8;human;3;124,831-126,000;0,009418 +q8;human;3;126,001-128,500;0,115004 +q8;human;3;128,501-131,000;0,45986 +q8;human;3;131,001+;0,415718 +q9;human;1;0-1;0,005 +q9;human;1;11+;0,935 +q9;human;1;2-5;0,024 +q9;human;1;6-10;0,036 +q9;human;2;0-1;0,1 +q9;human;2;11+;0,3 +q9;human;2;2-5;0,2 +q9;human;2;6-10;0,4 +q9;human;3;0-1;0,05 +q9;human;3;11+;0,8 +q9;human;3;2-5;0,05 +q9;human;3;6-10;0,1 diff --git a/bioscancast/stages/eval_stage/mock_forecasts/llm_baseline_forecasts.csv b/bioscancast/stages/eval_stage/mock_forecasts/llm_baseline_forecasts.csv index 469d579..97d89ae 100644 --- a/bioscancast/stages/eval_stage/mock_forecasts/llm_baseline_forecasts.csv +++ b/bioscancast/stages/eval_stage/mock_forecasts/llm_baseline_forecasts.csv @@ -1,45 +1,130 @@ question_id;forecast_source;forecast_version;option;probability -q1;llm_baseline;mock;70-100;0.15 -q1;llm_baseline;mock;100-150;0.55 -q1;llm_baseline;mock;150-200;0.20 -q1;llm_baseline;mock;200+;0.10 -q2;llm_baseline;mock;70-100;0.60 -q2;llm_baseline;mock;100-200;0.25 -q2;llm_baseline;mock;200-300;0.10 
-q2;llm_baseline;mock;300+;0.05 -q3;llm_baseline;mock;970-1000;0.10 -q3;llm_baseline;mock;1000-1100;0.20 -q3;llm_baseline;mock;1100-1200;0.25 -q3;llm_baseline;mock;1200+;0.45 -q4;llm_baseline;mock;10-15;0.60 -q4;llm_baseline;mock;15-20;0.25 -q4;llm_baseline;mock;20-25;0.10 -q4;llm_baseline;mock;25+;0.05 -q5;llm_baseline;mock;YES;0.08 -q5;llm_baseline;mock;NO;0.92 -q6;llm_baseline;mock;YES;0.12 -q6;llm_baseline;mock;NO;0.88 -q7;llm_baseline;mock;124,831-126,000;0.05 -q7;llm_baseline;mock;126,001-128,500;0.15 -q7;llm_baseline;mock;128,501-131,000;0.25 -q7;llm_baseline;mock;131,001+;0.55 -q8;llm_baseline;mock;124,831-126,000;0.03 -q8;llm_baseline;mock;126,001-128,500;0.07 -q8;llm_baseline;mock;128,501-131,000;0.20 -q8;llm_baseline;mock;131,001+;0.70 -q9;llm_baseline;mock;0-1;0.02 -q9;llm_baseline;mock;2-5;0.05 -q9;llm_baseline;mock;6-10;0.08 -q9;llm_baseline;mock;11+;0.85 -q10;llm_baseline;mock;9-15;0.20 -q10;llm_baseline;mock;16-20;0.50 -q10;llm_baseline;mock;21-25;0.20 -q10;llm_baseline;mock;26-30;0.07 -q10;llm_baseline;mock;30+;0.03 -q11;llm_baseline;mock;Vector-borne;0.18 -q11;llm_baseline;mock;Viral;0.12 -q11;llm_baseline;mock;Water contamination;0.15 -q11;llm_baseline;mock;Combination of causes;0.20 -q11;llm_baseline;mock;Uncertain;0.30 -q11;llm_baseline;mock;Other;0.05 - +q1;llm_baseline;3;100-150;0,55 +q1;llm_baseline;3;150-200;0,2 +q1;llm_baseline;3;200+;0,1 +q1;llm_baseline;3;70-100;0,15 +q1;llm_baseline;2;100-150;0,45 +q1;llm_baseline;2;150-200;0,15 +q1;llm_baseline;2;200+;0,15 +q1;llm_baseline;2;70-100;0,25 +q1;llm_baseline;1;100-150;0,1 +q1;llm_baseline;1;150-200;0,1 +q1;llm_baseline;1;200+;0,1 +q1;llm_baseline;1;70-100;0,7 +q10;llm_baseline;3;16-20;0,5 +q10;llm_baseline;3;21-25;0,2 +q10;llm_baseline;3;26-30;0,07 +q10;llm_baseline;3;30+;0,03 +q10;llm_baseline;3;9-15;0,2 +q10;llm_baseline;2;16-20;0,335 +q10;llm_baseline;2;21-25;0,2 +q10;llm_baseline;2;26-30;0,1415 +q10;llm_baseline;2;30+;0,1235 +q10;llm_baseline;2;9-15;0,2 
+q10;llm_baseline;1;16-20;0,762892 +q10;llm_baseline;1;21-25;0,111375 +q10;llm_baseline;1;26-30;0,012284 +q10;llm_baseline;1;30+;0,002073 +q10;llm_baseline;1;9-15;0,111375 +q11;llm_baseline;3;Combination of causes;0,2 +q11;llm_baseline;3;Other;0,05 +q11;llm_baseline;3;Uncertain;0,3 +q11;llm_baseline;3;Vector-borne;0,18 +q11;llm_baseline;3;Viral;0,12 +q11;llm_baseline;3;Water contamination;0,15 +q11;llm_baseline;2;Combination of causes;0,181667 +q11;llm_baseline;2;Other;0,114167 +q11;llm_baseline;2;Uncertain;0,226667 +q11;llm_baseline;2;Vector-borne;0,172667 +q11;llm_baseline;2;Viral;0,145667 +q11;llm_baseline;2;Water contamination;0,159167 +q11;llm_baseline;1;Combination of causes;0,196555 +q11;llm_baseline;1;Other;0,010694 +q11;llm_baseline;1;Uncertain;0,460548 +q11;llm_baseline;1;Vector-borne;0,157541 +q11;llm_baseline;1;Viral;0,067236 +q11;llm_baseline;1;Water contamination;0,107427 +q2;llm_baseline;3;100-200;0,25 +q2;llm_baseline;3;200-300;0,1 +q2;llm_baseline;3;300+;0,05 +q2;llm_baseline;3;70-100;0,6 +q2;llm_baseline;2;100-200;0,25 +q2;llm_baseline;2;200-300;0,1825 +q2;llm_baseline;2;300+;0,16 +q2;llm_baseline;2;70-100;0,4075 +q2;llm_baseline;1;100-200;0,133922 +q2;llm_baseline;1;200-300;0,019551 +q2;llm_baseline;1;300+;0,004561 +q2;llm_baseline;1;70-100;0,841966 +q3;llm_baseline;3;1000-1100;0,2 +q3;llm_baseline;3;1100-1200;0,25 +q3;llm_baseline;3;1200+;0,45 +q3;llm_baseline;3;970-1000;0,1 +q3;llm_baseline;2;1000-1100;0,2 +q3;llm_baseline;2;1100-1200;0,25 +q3;llm_baseline;2;1200+;0,35 +q3;llm_baseline;2;970-1000;0,2 +q3;llm_baseline;1;1000-1100;0,15 +q3;llm_baseline;1;1100-1200;0,2 +q3;llm_baseline;1;1200+;0,1 +q3;llm_baseline;1;970-1000;0,55 +q4;llm_baseline;3;10-15;0,6 +q4;llm_baseline;3;15-20;0,25 +q4;llm_baseline;3;20-25;0,1 +q4;llm_baseline;3;25+;0,05 +q4;llm_baseline;2;10-15;0,4075 +q4;llm_baseline;2;15-20;0,25 +q4;llm_baseline;2;20-25;0,1825 +q4;llm_baseline;2;25+;0,16 +q4;llm_baseline;1;10-15;0,841966 +q4;llm_baseline;1;15-20;0,133922 
+q4;llm_baseline;1;20-25;0,019551 +q4;llm_baseline;1;25+;0,004561 +q5;llm_baseline;3;NO;0,92 +q5;llm_baseline;3;YES;0,08 +q5;llm_baseline;2;NO;0,689 +q5;llm_baseline;2;YES;0,311 +q5;llm_baseline;1;NO;0,994112 +q5;llm_baseline;1;YES;0,005888 +q6;llm_baseline;3;NO;0,88 +q6;llm_baseline;3;YES;0,12 +q6;llm_baseline;2;NO;0,671 +q6;llm_baseline;2;YES;0,329 +q6;llm_baseline;1;NO;0,984993 +q6;llm_baseline;1;YES;0,015007 +q7;llm_baseline;3;124,831-126,000;0,05 +q7;llm_baseline;3;126,001-128,500;0,15 +q7;llm_baseline;3;128,501-131,000;0,25 +q7;llm_baseline;3;131,001+;0,55 +q7;llm_baseline;2;124,831-126,000;0,55 +q7;llm_baseline;2;126,001-128,500;0,1 +q7;llm_baseline;2;128,501-131,000;0,15 +q7;llm_baseline;2;131,001+;0,2 +q7;llm_baseline;1;124,831-126,000;0,15 +q7;llm_baseline;1;126,001-128,500;0,55 +q7;llm_baseline;1;128,501-131,000;0,1 +q7;llm_baseline;1;131,001+;0,2 +q8;llm_baseline;3;124,831-126,000;0,03 +q8;llm_baseline;3;126,001-128,500;0,07 +q8;llm_baseline;3;128,501-131,000;0,2 +q8;llm_baseline;3;131,001+;0,7 +q8;llm_baseline;2;124,831-126,000;0,151 +q8;llm_baseline;2;126,001-128,500;0,169 +q8;llm_baseline;2;128,501-131,000;0,2275 +q8;llm_baseline;2;131,001+;0,4525 +q8;llm_baseline;1;124,831-126,000;0,00124 +q8;llm_baseline;1;126,001-128,500;0,007346 +q8;llm_baseline;1;128,501-131,000;0,066605 +q8;llm_baseline;1;131,001+;0,924809 +q9;llm_baseline;3;0-1;0,02 +q9;llm_baseline;3;11+;0,85 +q9;llm_baseline;3;2-5;0,05 +q9;llm_baseline;3;6-10;0,08 +q9;llm_baseline;2;0-1;0,1 +q9;llm_baseline;2;11+;0,15 +q9;llm_baseline;2;2-5;0,2 +q9;llm_baseline;2;6-10;0,55 +q9;llm_baseline;1;0-1;0,05 +q9;llm_baseline;1;11+;0,8 +q9;llm_baseline;1;2-5;0,05 +q9;llm_baseline;1;6-10;0,1 diff --git a/bioscancast/stages/eval_stage/outputs/accuracy_by_source.png b/bioscancast/stages/eval_stage/outputs/accuracy_by_source.png deleted file mode 100644 index 1bf320d..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/accuracy_by_source.png and /dev/null differ diff --git 
a/bioscancast/stages/eval_stage/outputs/brier_boxplot.png b/bioscancast/stages/eval_stage/outputs/brier_boxplot.png
deleted file mode 100644
index 9840eba..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/brier_boxplot.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/brier_distribution.png b/bioscancast/stages/eval_stage/outputs/brier_distribution.png
deleted file mode 100644
index 2396f4f..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/brier_distribution.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/calibration_bioscancast.png b/bioscancast/stages/eval_stage/outputs/calibration_bioscancast.png
deleted file mode 100644
index b19d7db..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/calibration_bioscancast.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/calibration_human.png b/bioscancast/stages/eval_stage/outputs/calibration_human.png
deleted file mode 100644
index dafe7bf..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/calibration_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/calibration_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/calibration_llm_baseline.png
deleted file mode 100644
index 1e1d977..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/calibration_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/calibration_table_bioscancast.csv b/bioscancast/stages/eval_stage/outputs/calibration_table_bioscancast.csv
deleted file mode 100644
index 5054569..0000000
--- a/bioscancast/stages/eval_stage/outputs/calibration_table_bioscancast.csv
+++ /dev/null
@@ -1,6 +0,0 @@
-bin,mean_probability,actual_frequency,count
-"(-0.001, 0.2]",,,0
-"(0.2, 0.4]",,,0
-"(0.4, 0.6]",0.5109589041095891,1.0,1
-"(0.6, 0.8]",,,0
-"(0.8, 1.0]",0.9593099831569116,1.0,3
diff --git a/bioscancast/stages/eval_stage/outputs/calibration_table_human.csv b/bioscancast/stages/eval_stage/outputs/calibration_table_human.csv
deleted file mode 100644
index b6f6155..0000000
--- a/bioscancast/stages/eval_stage/outputs/calibration_table_human.csv
+++ /dev/null
@@ -1,6 +0,0 @@
-bin,mean_probability,actual_frequency,count
-"(-0.001, 0.2]",,,0
-"(0.2, 0.4]",,,0
-"(0.4, 0.6]",0.585,1.0,1
-"(0.6, 0.8]",,,0
-"(0.8, 1.0]",0.9296666666666665,1.0,3
diff --git a/bioscancast/stages/eval_stage/outputs/calibration_table_llm_baseline.csv b/bioscancast/stages/eval_stage/outputs/calibration_table_llm_baseline.csv
deleted file mode 100644
index 913b19e..0000000
--- a/bioscancast/stages/eval_stage/outputs/calibration_table_llm_baseline.csv
+++ /dev/null
@@ -1,6 +0,0 @@
-bin,mean_probability,actual_frequency,count
-"(-0.001, 0.2]",,,0
-"(0.2, 0.4]",,,0
-"(0.4, 0.6]",0.5166666666666667,0.0,3
-"(0.6, 0.8]",,,0
-"(0.8, 1.0]",0.85,1.0,1
diff --git a/bioscancast/stages/eval_stage/outputs/differences_accuracy.png b/bioscancast/stages/eval_stage/outputs/differences_accuracy.png
deleted file mode 100644
index 5ebb172..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_accuracy.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_accuracy_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/differences_accuracy_bioscancast_vs_human.png
deleted file mode 100644
index 69c38a0..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_accuracy_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_accuracy_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_accuracy_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index e84b3ea..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_accuracy_bioscancast_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_accuracy_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_accuracy_human_vs_llm_baseline.png
deleted file mode 100644
index a88df65..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_accuracy_human_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_brier_score.png b/bioscancast/stages/eval_stage/outputs/differences_brier_score.png
deleted file mode 100644
index 060b483..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_brier_score.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_brier_score_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/differences_brier_score_bioscancast_vs_human.png
deleted file mode 100644
index b627dc6..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_brier_score_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_brier_score_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_brier_score_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index ef306ea..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_brier_score_bioscancast_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_brier_score_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_brier_score_human_vs_llm_baseline.png
deleted file mode 100644
index 7030151..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_brier_score_human_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_log_score.png b/bioscancast/stages/eval_stage/outputs/differences_log_score.png
deleted file mode 100644
index a7fdb38..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_log_score.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_log_score_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/differences_log_score_bioscancast_vs_human.png
deleted file mode 100644
index ce88d4e..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_log_score_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_log_score_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_log_score_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index 2f90aec..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_log_score_bioscancast_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_log_score_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_log_score_human_vs_llm_baseline.png
deleted file mode 100644
index 023ca9d..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_log_score_human_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_rps.png b/bioscancast/stages/eval_stage/outputs/differences_rps.png
deleted file mode 100644
index 3a63820..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_rps.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_rps_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/differences_rps_bioscancast_vs_human.png
deleted file mode 100644
index da1cb4a..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/differences_rps_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/differences_rps_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_rps_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index fda630f..0000000
Binary files
a/bioscancast/stages/eval_stage/outputs/differences_rps_bioscancast_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/differences_rps_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/differences_rps_human_vs_llm_baseline.png deleted file mode 100644 index 4857e83..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/differences_rps_human_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/improvement_vs_v1.png b/bioscancast/stages/eval_stage/outputs/improvement_vs_v1.png new file mode 100644 index 0000000..a226ba1 Binary files /dev/null and b/bioscancast/stages/eval_stage/outputs/improvement_vs_v1.png differ diff --git a/bioscancast/stages/eval_stage/outputs/log_boxplot.png b/bioscancast/stages/eval_stage/outputs/log_boxplot.png deleted file mode 100644 index e2cfc45..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/log_boxplot.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/log_score_distribution.png b/bioscancast/stages/eval_stage/outputs/log_score_distribution.png deleted file mode 100644 index c9d0d6a..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/log_score_distribution.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/metric_improvement_over_time.csv b/bioscancast/stages/eval_stage/outputs/metric_improvement_over_time.csv new file mode 100644 index 0000000..6c95340 --- /dev/null +++ b/bioscancast/stages/eval_stage/outputs/metric_improvement_over_time.csv @@ -0,0 +1,73 @@ +forecast_source,forecast_version,baseline_version,metric,n_questions,mean_improvement,median_improvement,std_improvement,q1_improvement,q3_improvement +bioscancast,1,1,accuracy,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,accuracy,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,accuracy,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,accuracy,4,0.0,0.0,1.1547005383792515,-1.0,1.0 +human,2,1,accuracy,4,0.75,1.0,0.5,0.75,1.0 
+llm_baseline,2,1,accuracy,4,1.0,1.0,0.0,1.0,1.0 +bioscancast,3,1,accuracy,4,-0.5,-0.5,0.5773502691896257,-1.0,0.0 +human,3,1,accuracy,4,0.25,0.0,0.5,0.0,0.25 +llm_baseline,3,1,accuracy,4,0.75,1.0,0.5,0.75,1.0 +bioscancast,1,1,accuracy_error,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,accuracy_error,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,accuracy_error,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,accuracy_error,4,0.0,0.0,1.1547005383792515,-1.0,1.0 +human,2,1,accuracy_error,4,-0.75,-1.0,0.5,-1.0,-0.75 +llm_baseline,2,1,accuracy_error,4,-1.0,-1.0,0.0,-1.0,-1.0 +bioscancast,3,1,accuracy_error,4,0.5,0.5,0.5773502691896257,0.0,1.0 +human,3,1,accuracy_error,4,-0.25,0.0,0.5,-0.25,0.0 +llm_baseline,3,1,accuracy_error,4,-0.75,-1.0,0.5,-1.0,-0.75 +bioscancast,1,1,brier_score,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,brier_score,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,brier_score,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,brier_score,4,0.14500000000000007,0.16750000000000012,0.797234804391613,-0.4237499999999999,0.7362500000000001 +human,2,1,brier_score,4,-0.5145630000000001,-0.59833,0.28428408723434845,-0.706395,-0.40649800000000014 +llm_baseline,2,1,brier_score,4,-0.8,-0.7950000000000002,0.19544820285692055,-0.93,-0.665 +bioscancast,3,1,brier_score,4,0.5314558491393631,0.4749638023560786,0.358193652104658,0.388592168254557,0.6178274832408848 +human,3,1,brier_score,4,-0.137913,-0.07891200000000005,0.16814912148050803,-0.177405,-0.03942000000000005 +llm_baseline,3,1,brier_score,4,-0.6467,-0.8275000000000001,0.45074186256289395,-0.8687500000000002,-0.6054499999999999 +bioscancast,1,1,log_score,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,log_score,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,log_score,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,log_score,4,0.2430950496312996,0.27079864121637215,1.1317645780878554,-0.5913805831911607,1.1052742740388326 +human,2,1,log_score,4,-0.8486532910373206,-0.970477517794954,0.3761271670129863,-1.1022940933233452,-0.7168367155089292 
+llm_baseline,2,1,log_score,4,-1.3549862136674338,-1.3517979253764147,0.3863785492552821,-1.6816693482383598,-1.0251147908054885 +bioscancast,3,1,log_score,4,0.9422234260066498,0.9143819765205193,0.5052558709232472,0.7563341076184773,1.1002712949086917 +human,3,1,log_score,4,-0.2832125164188729,-0.19972052372692722,0.2665022808462984,-0.3495870275187347,-0.1333460126270654 +llm_baseline,3,1,log_score,4,-1.1209628738748498,-1.4198640125387048,0.8051330203486012,-1.581520803769968,-0.9593060826435869 +bioscancast,1,1,normalized_entropy,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,normalized_entropy,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,normalized_entropy,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,normalized_entropy,4,0.06966891438695627,0.04204739836396826,0.14839577751877664,0.00409638741801413,0.1076199253329104 +human,2,1,normalized_entropy,4,-0.519633708891973,-0.5301033901297312,0.16284177092122534,-0.5806819924793327,-0.46905510654237137 +llm_baseline,2,1,normalized_entropy,4,-0.17745689680138657,-0.19002173494244495,0.1417517039598126,-0.2634628121500746,-0.10401581959375691 +bioscancast,3,1,normalized_entropy,4,0.6023173360470978,0.6004189369570059,0.2178173094762693,0.4659799452820751,0.7367563277220284 +human,3,1,normalized_entropy,4,-0.2864250219986947,-0.3170289086928482,0.07831986173198353,-0.33849771332976486,-0.2649562173617781 +llm_baseline,3,1,normalized_entropy,4,-0.021945942721693604,-0.013250366816491388,0.11651001306566901,-0.0906458941306317,0.05544958459244671 +bioscancast,1,1,rps,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,rps,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,rps,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,rps,4,0.046458333333333296,0.03666666666666667,0.3635584006018127,-0.18062500000000004,0.26375 +human,2,1,rps,4,-0.1383255,-0.11429099999999998,0.11108443384017666,-0.20108466666666663,-0.05153183333333333 +llm_baseline,2,1,rps,4,-0.2227083333333334,-0.22916666666666674,0.10076415211696989,-0.29208333333333336,-0.15979166666666678 
+bioscancast,3,1,rps,4,0.1975342812460111,0.11444971182462688,0.20324766941107703,0.07165404045797553,0.24032995261266243 +human,3,1,rps,4,-0.04380883333333335,-0.028393000000000015,0.0506999931050474,-0.06366700000000002,-0.00853483333333335 +llm_baseline,3,1,rps,4,-0.244475,-0.24958333333333335,0.19896177443213858,-0.31875,-0.17530833333333334 +bioscancast,1,1,top_probability,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,top_probability,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,top_probability,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,top_probability,4,-0.05000000000000003,-0.05000000000000002,0.12247448713915893,-0.08750000000000005,-0.012499999999999997 +human,2,1,top_probability,4,0.33099999999999996,0.327,0.17202325424197745,0.22949999999999998,0.4285 +llm_baseline,2,1,top_probability,4,0.17499999999999993,0.22499999999999992,0.11902380714238078,0.15000000000000005,0.24999999999999983 +bioscancast,3,1,top_probability,4,-0.309722213395081,-0.32655309155309153,0.23480058430421916,-0.42718942507462143,-0.20908587987355112 +human,3,1,top_probability,4,0.14350000000000004,0.1600000000000001,0.06069321763316449,0.11650000000000016,0.18699999999999997 +llm_baseline,3,1,top_probability,4,0.04999999999999995,0.05000000000000002,0.09128709291752772,-0.012500000000000039,0.1125 +bioscancast,1,1,true_probability,4,0.0,0.0,0.0,0.0,0.0 +human,1,1,true_probability,4,0.0,0.0,0.0,0.0,0.0 +llm_baseline,1,1,true_probability,4,0.0,0.0,0.0,0.0,0.0 +bioscancast,2,1,true_probability,4,-0.1,-0.1,0.45643546458763845,-0.41250000000000003,0.21250000000000002 +human,2,1,true_probability,4,0.45599999999999996,0.46399999999999997,0.18572021968541821,0.31649999999999995,0.6034999999999999 +llm_baseline,2,1,true_probability,4,0.475,0.45,0.12583057392117905,0.425,0.5 +bioscancast,3,1,true_probability,4,-0.484722213395081,-0.44289133523707047,0.23681256249323043,-0.6256123192679084,-0.3020012293642431 
+human,3,1,true_probability,4,0.16850000000000004,0.16400000000000015,0.09460620134712805,0.11650000000000016,0.21600000000000003 +llm_baseline,3,1,true_probability,4,0.3374999999999999,0.42500000000000004,0.2657536453183663,0.2875,0.47500000000000003 diff --git a/bioscancast/stages/eval_stage/outputs/pairwise_comparison.csv b/bioscancast/stages/eval_stage/outputs/pairwise_comparison.csv deleted file mode 100644 index 09d95c8..0000000 --- a/bioscancast/stages/eval_stage/outputs/pairwise_comparison.csv +++ /dev/null @@ -1,5 +0,0 @@ -question_id,brier_score_human,brier_score_bioscancast,log_score_human,log_score_bioscancast,accuracy_human,accuracy_bioscancast,rps_human,rps_bioscancast,top_probability_human,top_probability_bioscancast,normalized_entropy_human,normalized_entropy_bioscancast,true_probability_human,true_probability_bioscancast -q1,0.0023540000000000024,0.0007875743611479325,0.0397808700118446,0.022268126605770724,1.0,1.0,0.0005563333333333344,0.00017301251869153035,0.961,0.977977977977978,0.14010886629561287,0.08726774697101813,0.961,0.977977977977978 -q3,0.01605400000000002,0.008316633793556866,0.11316869810563811,0.0778229509352235,1.0,1.0,0.005195666666666676,0.0021690992767915826,0.8929999999999999,0.9251282051282051,0.3208118267821115,0.23681862783604404,0.8929999999999999,0.9251282051282051 -q7,0.312218,0.4691048977294051,0.5361434317502807,0.6714661144986211,1.0,1.0,0.04727766666666666,0.07665603302683431,0.585,0.5109589041095891,0.5970469495834214,0.536708838458453,0.585,0.5109589041095891 -q9,0.006121999999999992,0.0009674975584377677,0.06720874969344999,0.025498576768595627,1.0,1.0,0.001697,0.00023426827672864132,0.935,0.9748237663645518,0.21533457770189754,0.09944818790950914,0.935,0.9748237663645518 diff --git a/bioscancast/stages/eval_stage/outputs/pairwise_comparison_bioscancast_vs_human.csv b/bioscancast/stages/eval_stage/outputs/pairwise_comparison_bioscancast_vs_human.csv deleted file mode 100644 index bd091e6..0000000 --- 
a/bioscancast/stages/eval_stage/outputs/pairwise_comparison_bioscancast_vs_human.csv +++ /dev/null @@ -1,5 +0,0 @@ -question_id,brier_score_bioscancast,brier_score_human,log_score_bioscancast,log_score_human,accuracy_bioscancast,accuracy_human,rps_bioscancast,rps_human,top_probability_bioscancast,top_probability_human,normalized_entropy_bioscancast,normalized_entropy_human,true_probability_bioscancast,true_probability_human -q1,0.0007875743611479325,0.0023540000000000024,0.022268126605770724,0.0397808700118446,1.0,1.0,0.00017301251869153035,0.0005563333333333344,0.977977977977978,0.961,0.08726774697101813,0.14010886629561287,0.977977977977978,0.961 -q3,0.008316633793556866,0.01605400000000002,0.0778229509352235,0.11316869810563811,1.0,1.0,0.0021690992767915826,0.005195666666666676,0.9251282051282051,0.8929999999999999,0.23681862783604404,0.3208118267821115,0.9251282051282051,0.8929999999999999 -q7,0.4691048977294051,0.312218,0.6714661144986211,0.5361434317502807,1.0,1.0,0.07665603302683431,0.04727766666666666,0.5109589041095891,0.585,0.536708838458453,0.5970469495834214,0.5109589041095891,0.585 -q9,0.0009674975584377677,0.006121999999999992,0.025498576768595627,0.06720874969344999,1.0,1.0,0.00023426827672864132,0.001697,0.9748237663645518,0.935,0.09944818790950914,0.21533457770189754,0.9748237663645518,0.935 diff --git a/bioscancast/stages/eval_stage/outputs/pairwise_comparison_bioscancast_vs_llm_baseline.csv b/bioscancast/stages/eval_stage/outputs/pairwise_comparison_bioscancast_vs_llm_baseline.csv deleted file mode 100644 index 423bb5b..0000000 --- a/bioscancast/stages/eval_stage/outputs/pairwise_comparison_bioscancast_vs_llm_baseline.csv +++ /dev/null @@ -1,5 +0,0 @@ 
-question_id,brier_score_bioscancast,brier_score_llm_baseline,log_score_bioscancast,log_score_llm_baseline,accuracy_bioscancast,accuracy_llm_baseline,rps_bioscancast,rps_llm_baseline,top_probability_bioscancast,top_probability_llm_baseline,normalized_entropy_bioscancast,normalized_entropy_llm_baseline,true_probability_bioscancast,true_probability_llm_baseline -q1,0.0007875743611479325,1.0750000000000002,0.022268126605770724,1.8971199848858815,1.0,0.0,0.00017301251869153035,0.2741666666666667,0.977977977977978,0.5499999999999999,0.08726774697101813,0.8407481647643377,0.977977977977978,0.14999999999999997 -q3,0.008316633793556866,1.1150000000000002,0.0778229509352235,2.3025850929940455,1.0,0.0,0.0021690992767915826,0.5008333333333334,0.9251282051282051,0.45,0.23681862783604404,0.9074899102582407,0.9251282051282051,0.1 -q7,0.4691048977294051,1.0899999999999999,0.6714661144986211,1.8971199848858813,1.0,0.0,0.07665603302683431,0.315,0.5109589041095891,0.55,0.536708838458453,0.8005071529034175,0.5109589041095891,0.15 -q9,0.0009674975584377677,0.03180000000000001,0.025498576768595627,0.16251892949777494,1.0,1.0,0.00023426827672864132,0.009266666666666668,0.9748237663645518,0.85,0.09944818790950914,0.4098887446566551,0.9748237663645518,0.85 diff --git a/bioscancast/stages/eval_stage/outputs/pairwise_comparison_human_vs_llm_baseline.csv b/bioscancast/stages/eval_stage/outputs/pairwise_comparison_human_vs_llm_baseline.csv deleted file mode 100644 index bca4d19..0000000 --- a/bioscancast/stages/eval_stage/outputs/pairwise_comparison_human_vs_llm_baseline.csv +++ /dev/null @@ -1,5 +0,0 @@ -question_id,brier_score_human,brier_score_llm_baseline,log_score_human,log_score_llm_baseline,accuracy_human,accuracy_llm_baseline,rps_human,rps_llm_baseline,top_probability_human,top_probability_llm_baseline,normalized_entropy_human,normalized_entropy_llm_baseline,true_probability_human,true_probability_llm_baseline 
-q1,0.0023540000000000024,1.0750000000000002,0.0397808700118446,1.8971199848858815,1.0,0.0,0.0005563333333333344,0.2741666666666667,0.961,0.5499999999999999,0.14010886629561287,0.8407481647643377,0.961,0.14999999999999997 -q3,0.01605400000000002,1.1150000000000002,0.11316869810563811,2.3025850929940455,1.0,0.0,0.005195666666666676,0.5008333333333334,0.8929999999999999,0.45,0.3208118267821115,0.9074899102582407,0.8929999999999999,0.1 -q7,0.312218,1.0899999999999999,0.5361434317502807,1.8971199848858813,1.0,0.0,0.04727766666666666,0.315,0.585,0.55,0.5970469495834214,0.8005071529034175,0.585,0.15 -q9,0.006121999999999992,0.03180000000000001,0.06720874969344999,0.16251892949777494,1.0,1.0,0.001697,0.009266666666666668,0.935,0.85,0.21533457770189754,0.4098887446566551,0.935,0.85 diff --git a/bioscancast/stages/eval_stage/outputs/question_heatmap.png b/bioscancast/stages/eval_stage/outputs/question_heatmap.png new file mode 100644 index 0000000..c8ce02e Binary files /dev/null and b/bioscancast/stages/eval_stage/outputs/question_heatmap.png differ diff --git a/bioscancast/stages/eval_stage/outputs/question_level_metrics.csv b/bioscancast/stages/eval_stage/outputs/question_level_metrics.csv index f243fa3..4926484 100644 --- a/bioscancast/stages/eval_stage/outputs/question_level_metrics.csv +++ b/bioscancast/stages/eval_stage/outputs/question_level_metrics.csv @@ -1,13 +1,37 @@ -question_id,topic,question_type,forecast_source,resolved_option,brier_score,log_score,accuracy,rps,top_probability,normalized_entropy,true_probability -q1,H5N1 (US),range,bioscancast,70-100,0.0007875743611479325,0.022268126605770724,1,0.00017301251869153035,0.977977977977978,0.08726774697101813,0.977977977977978 -q1,H5N1 (US),range,human,70-100,0.0023540000000000024,0.0397808700118446,1,0.0005563333333333344,0.961,0.14010886629561287,0.961 -q1,H5N1 (US),range,llm_baseline,70-100,1.0750000000000002,1.8971199848858815,0,0.2741666666666667,0.5499999999999999,0.8407481647643377,0.14999999999999997 
-q3,H5N1 (US),range,bioscancast,970-1000,0.008316633793556866,0.0778229509352235,1,0.0021690992767915826,0.9251282051282051,0.23681862783604404,0.9251282051282051 -q3,H5N1 (US),range,human,970-1000,0.01605400000000002,0.11316869810563811,1,0.005195666666666676,0.8929999999999999,0.3208118267821115,0.8929999999999999 -q3,H5N1 (US),range,llm_baseline,970-1000,1.1150000000000002,2.3025850929940455,0,0.5008333333333334,0.45,0.9074899102582407,0.1 -q7,Mpox (World),range,bioscancast,"126,001-128,500",0.4691048977294051,0.6714661144986211,1,0.07665603302683431,0.5109589041095891,0.536708838458453,0.5109589041095891 -q7,Mpox (World),range,human,"126,001-128,500",0.312218,0.5361434317502807,1,0.04727766666666666,0.585,0.5970469495834214,0.585 -q7,Mpox (World),range,llm_baseline,"126,001-128,500",1.0899999999999999,1.8971199848858813,0,0.315,0.55,0.8005071529034175,0.15 -q9,Ebola,range,bioscancast,11+,0.0009674975584377677,0.025498576768595627,1,0.00023426827672864132,0.9748237663645518,0.09944818790950914,0.9748237663645518 -q9,Ebola,range,human,11+,0.006121999999999992,0.06720874969344999,1,0.001697,0.935,0.21533457770189754,0.935 -q9,Ebola,range,llm_baseline,11+,0.03180000000000001,0.16251892949777494,1,0.009266666666666668,0.85,0.4098887446566551,0.85 +question_id,topic,question_type,forecast_source,forecast_version,resolved_option,brier_score,log_score,accuracy,accuracy_error,rps,top_probability,normalized_entropy,true_probability +q1,H5N1 (US),range,bioscancast,1,70-100,1.02,1.6094379124341003,0,1,0.4966666666666666,0.6,0.7854752972273344,0.2 +q1,H5N1 (US),range,bioscancast,2,70-100,0.05499999999999998,0.2231435513142097,1,0,0.02416666666666667,0.8,0.5109640474436812,0.8 +q1,H5N1 (US),range,bioscancast,3,70-100,0.0007875743611479374,0.02226812660577084,1,0,0.0003787571355138924,0.9779779779779779,0.08726774697101822,0.9779779779779779 +q1,H5N1 (US),range,human,1,70-100,0.002354000000000003,0.0397808700118446,1,0,0.0012063333333333333,0.961,0.14010886629561287,0.961 
+q1,H5N1 (US),range,human,2,70-100,0.12000000000000002,0.35667494393873245,1,0,0.04666666666666668,0.7,0.6783898247235198,0.7 +q1,H5N1 (US),range,human,3,70-100,0.013399999999999995,0.10536051565782628,1,0,0.005266666666666667,0.9,0.31304532651737405,0.9 +q1,H5N1 (US),range,llm_baseline,1,70-100,0.12000000000000002,0.35667494393873245,1,0,0.04666666666666668,0.7,0.6783898247235198,0.7 +q1,H5N1 (US),range,llm_baseline,2,70-100,0.81,1.3862943611198906,0,1,0.375,0.45,0.9197455351500672,0.25 +q1,H5N1 (US),range,llm_baseline,3,70-100,1.075,1.8971199848858813,0,1,0.5291666666666667,0.55,0.8407481647643377,0.15 +q3,H5N1 (US),range,bioscancast,1,970-1000,0.16499999999999998,0.4307829160924542,1,0,0.06916666666666665,0.65,0.7394489514937395,0.65 +q3,H5N1 (US),range,bioscancast,2,970-1000,0.885,1.3862943611198906,0,1,0.4291666666666667,0.55,0.8193793404575042,0.25 +q3,H5N1 (US),range,bioscancast,3,970-1000,0.008316633793556866,0.0778229509352235,1,0,0.004216874863028709,0.9251282051282051,0.23681862783604404,0.9251282051282051 +q3,H5N1 (US),range,human,1,970-1000,0.016054,0.11316869810563798,1,0,0.0074069999999999995,0.893,0.32081182678211173,0.893 +q3,H5N1 (US),range,human,2,970-1000,0.76,1.2039728043259361,0,1,0.2866666666666666,0.5,0.8427376486136672,0.3 +q3,H5N1 (US),range,human,3,970-1000,0.12500000000000003,0.35667494393873245,1,0,0.05416666666666667,0.7,0.659517637169433,0.7 +q3,H5N1 (US),range,llm_baseline,1,970-1000,0.27499999999999997,0.5978370007556204,1,0,0.1158333333333333,0.55,0.8407481647643377,0.55 +q3,H5N1 (US),range,llm_baseline,2,970-1000,0.8650000000000001,1.6094379124341003,0,1,0.29416666666666674,0.35,0.9794359242226802,0.2 +q3,H5N1 (US),range,llm_baseline,3,970-1000,1.1150000000000002,2.3025850929940455,0,1,0.35083333333333333,0.45,0.9074899102582407,0.1 +q7,Mpox (World),range,bioscancast,1,"126,001-128,500",0.935,1.6094379124341003,0,1,0.2316666666666667,0.5,0.8927376486136671,0.2 +q7,Mpox 
(World),range,bioscancast,2,"126,001-128,500",0.27499999999999997,0.5978370007556204,1,0,0.0375,0.55,0.8407481647643377,0.55 +q7,Mpox (World),range,bioscancast,3,"126,001-128,500",0.4691048977294051,0.6714661144986211,1,0,0.07665603302683431,0.5109589041095891,0.536708838458453,0.5109589041095891 +q7,Mpox (World),range,human,1,"126,001-128,500",0.312218,0.5361434317502807,1,0,0.04727766666666666,0.585,0.5970469495834214,0.585 +q7,Mpox (World),range,human,2,"126,001-128,500",0.8150000000000002,1.3862943611198904,0,1,0.10083333333333332,0.45000000000000007,0.9074899102582407,0.25000000000000006 +q7,Mpox (World),range,human,3,"126,001-128,500",0.695,1.2039728043259361,0,1,0.1616666666666667,0.4,0.9354752972273341,0.3 +q7,Mpox (World),range,llm_baseline,1,"126,001-128,500",0.27499999999999997,0.5978370007556204,1,0,0.050833333333333314,0.55,0.8407481647643377,0.55 +q7,Mpox (World),range,llm_baseline,2,"126,001-128,500",1.175,2.3025850929940455,0,1,0.155,0.55,0.8407481647643377,0.1 +q7,Mpox (World),range,llm_baseline,3,"126,001-128,500",1.0899999999999999,1.8971199848858813,0,1,0.315,0.55,0.8005071529034175,0.15 +q9,Ebola,range,bioscancast,1,11+,0.48500000000000004,0.916290731874155,1,0,0.07416666666666663,0.4,0.9518508480286741,0.4 +q9,Ebola,range,bioscancast,2,11+,0.8099999999999999,1.3862943611198906,0,1,0.19499999999999998,0.45,0.9197455351500671,0.25 +q9,Ebola,range,bioscancast,3,11+,0.0009674975584377624,0.025498576768595512,1,0,0.00027787665724522854,0.9748237663645519,0.09944818790950909,0.9748237663645519 +q9,Ebola,range,human,1,11+,0.006121999999999992,0.06720874969344999,1,0,0.001640333333333329,0.935,0.21533457770189754,0.935 +q9,Ebola,range,human,2,11+,0.7,1.2039728043259361,0,1,0.17666666666666664,0.4,0.9232196723355078,0.3 +q9,Ebola,range,human,3,11+,0.05500000000000006,0.22314355131421,1,0,0.011666666666666686,0.7999999999999998,0.5109640474436813,0.7999999999999998 
+q9,Ebola,range,llm_baseline,1,11+,0.05500000000000006,0.22314355131421,1,0,0.011666666666666686,0.7999999999999998,0.5109640474436813,0.7999999999999998 +q9,Ebola,range,llm_baseline,2,11+,1.075,1.8971199848858813,0,1,0.2916666666666667,0.55,0.8407481647643377,0.15 +q9,Ebola,range,llm_baseline,3,11+,0.03180000000000001,0.16251892949777494,1,0,0.007899999999999999,0.85,0.4098887446566551,0.85 diff --git a/bioscancast/stages/eval_stage/outputs/rps_boxplot.png b/bioscancast/stages/eval_stage/outputs/rps_boxplot.png deleted file mode 100644 index bfae1b0..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/rps_boxplot.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/rps_distribution.png b/bioscancast/stages/eval_stage/outputs/rps_distribution.png deleted file mode 100644 index d8ab801..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/rps_distribution.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_accuracy.png b/bioscancast/stages/eval_stage/outputs/scatter_accuracy.png deleted file mode 100644 index e3c9cfa..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_accuracy.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_accuracy_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/scatter_accuracy_bioscancast_vs_human.png deleted file mode 100644 index 0708538..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_accuracy_bioscancast_vs_human.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_accuracy_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_accuracy_bioscancast_vs_llm_baseline.png deleted file mode 100644 index 6776b08..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_accuracy_bioscancast_vs_llm_baseline.png and /dev/null differ diff --git 
a/bioscancast/stages/eval_stage/outputs/scatter_accuracy_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_accuracy_human_vs_llm_baseline.png deleted file mode 100644 index 813c665..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_accuracy_human_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_brier_score.png b/bioscancast/stages/eval_stage/outputs/scatter_brier_score.png deleted file mode 100644 index 98ea9ba..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_brier_score.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_brier_score_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/scatter_brier_score_bioscancast_vs_human.png deleted file mode 100644 index 385e2c7..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_brier_score_bioscancast_vs_human.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_brier_score_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_brier_score_bioscancast_vs_llm_baseline.png deleted file mode 100644 index fa9491b..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_brier_score_bioscancast_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_brier_score_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_brier_score_human_vs_llm_baseline.png deleted file mode 100644 index 7548af4..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_brier_score_human_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_log_score.png b/bioscancast/stages/eval_stage/outputs/scatter_log_score.png deleted file mode 100644 index 4bc0959..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_log_score.png and /dev/null differ diff --git 
a/bioscancast/stages/eval_stage/outputs/scatter_log_score_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/scatter_log_score_bioscancast_vs_human.png deleted file mode 100644 index 7e3acba..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_log_score_bioscancast_vs_human.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_log_score_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_log_score_bioscancast_vs_llm_baseline.png deleted file mode 100644 index 30c9e6f..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_log_score_bioscancast_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_log_score_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_log_score_human_vs_llm_baseline.png deleted file mode 100644 index 35323e0..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_log_score_human_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_rps.png b/bioscancast/stages/eval_stage/outputs/scatter_rps.png deleted file mode 100644 index b122b90..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_rps.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_rps_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/scatter_rps_bioscancast_vs_human.png deleted file mode 100644 index 5ad21f0..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_rps_bioscancast_vs_human.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/scatter_rps_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_rps_bioscancast_vs_llm_baseline.png deleted file mode 100644 index fc929a7..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_rps_bioscancast_vs_llm_baseline.png and /dev/null differ diff --git 
a/bioscancast/stages/eval_stage/outputs/scatter_rps_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/scatter_rps_human_vs_llm_baseline.png deleted file mode 100644 index 932ccfb..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/scatter_rps_human_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/score_timeline_boxplots.png b/bioscancast/stages/eval_stage/outputs/score_timeline_boxplots.png new file mode 100644 index 0000000..4123c2f Binary files /dev/null and b/bioscancast/stages/eval_stage/outputs/score_timeline_boxplots.png differ diff --git a/bioscancast/stages/eval_stage/outputs/significance_tests.csv b/bioscancast/stages/eval_stage/outputs/significance_tests.csv deleted file mode 100644 index 309998d..0000000 --- a/bioscancast/stages/eval_stage/outputs/significance_tests.csv +++ /dev/null @@ -1,8 +0,0 @@ -metric,test,statistic,p_value,n -brier_score,paired_t_test,-0.8803545575668019,0.4434880231575905,4 -brier_score,wilcoxon_signed_rank,4.0,0.875,4 -log_score,paired_t_test,-0.24244168423869597,0.8240674327803076,4 -log_score,wilcoxon_signed_rank,4.0,0.875,4 -rps,paired_t_test,-0.7885127238223784,0.4879638838228039,4 -rps,wilcoxon_signed_rank,4.0,0.875,4 -accuracy,mcnemar_exact,0.0,1.0,4 diff --git a/bioscancast/stages/eval_stage/outputs/significance_tests_all_pairs.csv b/bioscancast/stages/eval_stage/outputs/significance_tests_all_pairs.csv deleted file mode 100644 index e88719d..0000000 --- a/bioscancast/stages/eval_stage/outputs/significance_tests_all_pairs.csv +++ /dev/null @@ -1,22 +0,0 @@ -source_a,source_b,metric,test,statistic,p_value,n -bioscancast,human,brier_score,paired_t_test,0.8803545575668019,0.4434880231575905,4 -bioscancast,human,brier_score,wilcoxon_signed_rank,4.0,0.875,4 -bioscancast,human,log_score,paired_t_test,0.24244168423869597,0.8240674327803076,4 -bioscancast,human,log_score,wilcoxon_signed_rank,4.0,0.875,4 
-bioscancast,human,rps,paired_t_test,0.7885127238223784,0.4879638838228039,4 -bioscancast,human,rps,wilcoxon_signed_rank,4.0,0.875,4 -bioscancast,human,accuracy,mcnemar_exact,0.0,1.0,4 -bioscancast,llm_baseline,brier_score,paired_t_test,-2.8154027166366298,0.06699253608840353,4 -bioscancast,llm_baseline,brier_score,wilcoxon_signed_rank,0.0,0.125,4 -bioscancast,llm_baseline,log_score,paired_t_test,-2.9760927390077083,0.05878053545269533,4 -bioscancast,llm_baseline,log_score,wilcoxon_signed_rank,0.0,0.125,4 -bioscancast,llm_baseline,rps,paired_t_test,-2.544678613324008,0.08433390004149705,4 -bioscancast,llm_baseline,rps,wilcoxon_signed_rank,0.0,0.125,4 -bioscancast,llm_baseline,accuracy,mcnemar_exact,1.3333333333333333,0.25,4 -human,llm_baseline,brier_score,paired_t_test,-2.972828460689968,0.058934397724842796,4 -human,llm_baseline,brier_score,wilcoxon_signed_rank,0.0,0.125,4 -human,llm_baseline,log_score,paired_t_test,-2.9940131268202297,0.057944772269952566,4 -human,llm_baseline,log_score,wilcoxon_signed_rank,0.0,0.125,4 -human,llm_baseline,rps,paired_t_test,-2.6169617030707726,0.07920835139119317,4 -human,llm_baseline,rps,wilcoxon_signed_rank,0.0,0.125,4 -human,llm_baseline,accuracy,mcnemar_exact,1.3333333333333333,0.25,4 diff --git a/bioscancast/stages/eval_stage/outputs/significance_tests_bioscancast_vs_human.csv b/bioscancast/stages/eval_stage/outputs/significance_tests_bioscancast_vs_human.csv deleted file mode 100644 index d55bc22..0000000 --- a/bioscancast/stages/eval_stage/outputs/significance_tests_bioscancast_vs_human.csv +++ /dev/null @@ -1,8 +0,0 @@ -source_a,source_b,metric,test,statistic,p_value,n -bioscancast,human,brier_score,paired_t_test,0.8803545575668019,0.4434880231575905,4 -bioscancast,human,brier_score,wilcoxon_signed_rank,4.0,0.875,4 -bioscancast,human,log_score,paired_t_test,0.24244168423869597,0.8240674327803076,4 -bioscancast,human,log_score,wilcoxon_signed_rank,4.0,0.875,4 
-bioscancast,human,rps,paired_t_test,0.7885127238223784,0.4879638838228039,4 -bioscancast,human,rps,wilcoxon_signed_rank,4.0,0.875,4 -bioscancast,human,accuracy,mcnemar_exact,0.0,1.0,4 diff --git a/bioscancast/stages/eval_stage/outputs/significance_tests_bioscancast_vs_llm_baseline.csv b/bioscancast/stages/eval_stage/outputs/significance_tests_bioscancast_vs_llm_baseline.csv deleted file mode 100644 index 0b4e9c7..0000000 --- a/bioscancast/stages/eval_stage/outputs/significance_tests_bioscancast_vs_llm_baseline.csv +++ /dev/null @@ -1,8 +0,0 @@ -source_a,source_b,metric,test,statistic,p_value,n -bioscancast,llm_baseline,brier_score,paired_t_test,-2.8154027166366298,0.06699253608840353,4 -bioscancast,llm_baseline,brier_score,wilcoxon_signed_rank,0.0,0.125,4 -bioscancast,llm_baseline,log_score,paired_t_test,-2.9760927390077083,0.05878053545269533,4 -bioscancast,llm_baseline,log_score,wilcoxon_signed_rank,0.0,0.125,4 -bioscancast,llm_baseline,rps,paired_t_test,-2.544678613324008,0.08433390004149705,4 -bioscancast,llm_baseline,rps,wilcoxon_signed_rank,0.0,0.125,4 -bioscancast,llm_baseline,accuracy,mcnemar_exact,1.3333333333333333,0.25,4 diff --git a/bioscancast/stages/eval_stage/outputs/significance_tests_human_vs_llm_baseline.csv b/bioscancast/stages/eval_stage/outputs/significance_tests_human_vs_llm_baseline.csv deleted file mode 100644 index 4b0a35c..0000000 --- a/bioscancast/stages/eval_stage/outputs/significance_tests_human_vs_llm_baseline.csv +++ /dev/null @@ -1,8 +0,0 @@ -source_a,source_b,metric,test,statistic,p_value,n -human,llm_baseline,brier_score,paired_t_test,-2.972828460689968,0.058934397724842796,4 -human,llm_baseline,brier_score,wilcoxon_signed_rank,0.0,0.125,4 -human,llm_baseline,log_score,paired_t_test,-2.9940131268202297,0.057944772269952566,4 -human,llm_baseline,log_score,wilcoxon_signed_rank,0.0,0.125,4 -human,llm_baseline,rps,paired_t_test,-2.6169617030707726,0.07920835139119317,4 -human,llm_baseline,rps,wilcoxon_signed_rank,0.0,0.125,4 
-human,llm_baseline,accuracy,mcnemar_exact,1.3333333333333333,0.25,4 diff --git a/bioscancast/stages/eval_stage/outputs/source_comparison.png b/bioscancast/stages/eval_stage/outputs/source_comparison.png deleted file mode 100644 index 682a40b..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/source_comparison.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/source_ranking_over_time.csv b/bioscancast/stages/eval_stage/outputs/source_ranking_over_time.csv new file mode 100644 index 0000000..d2c55bb --- /dev/null +++ b/bioscancast/stages/eval_stage/outputs/source_ranking_over_time.csv @@ -0,0 +1,10 @@ +forecast_version,forecast_source,metric_column,metric_value,rank +1,human,median_brier_score,0.011087999999999995,1 +1,llm_baseline,median_brier_score,0.1975,2 +1,bioscancast,median_brier_score,0.7100000000000001,3 +2,bioscancast,median_brier_score,0.5425,1 +2,human,median_brier_score,0.73,2 +2,llm_baseline,median_brier_score,0.97,3 +3,bioscancast,median_brier_score,0.0046420656759973145,1 +3,human,median_brier_score,0.09000000000000005,2 +3,llm_baseline,median_brier_score,1.0825,3 diff --git a/bioscancast/stages/eval_stage/outputs/source_ranking_over_time.png b/bioscancast/stages/eval_stage/outputs/source_ranking_over_time.png new file mode 100644 index 0000000..91c7f0b Binary files /dev/null and b/bioscancast/stages/eval_stage/outputs/source_ranking_over_time.png differ diff --git a/bioscancast/stages/eval_stage/outputs/source_timeline_summary.png b/bioscancast/stages/eval_stage/outputs/source_timeline_summary.png new file mode 100644 index 0000000..706556d Binary files /dev/null and b/bioscancast/stages/eval_stage/outputs/source_timeline_summary.png differ diff --git a/bioscancast/stages/eval_stage/outputs/summary_metrics.csv b/bioscancast/stages/eval_stage/outputs/summary_metrics.csv index cfdf8db..97b698d 100644 --- a/bioscancast/stages/eval_stage/outputs/summary_metrics.csv +++ 
b/bioscancast/stages/eval_stage/outputs/summary_metrics.csv @@ -1,4 +1,4 @@ -forecast_source,n_questions,mean_brier_score,mean_log_score,mean_accuracy,mean_rps,mean_top_probability,mean_normalized_entropy,mean_true_probability -human,4,0.08418700000000001,0.18907543739030336,1.0,0.013681666666666667,0.8434999999999999,0.3183255550907608,0.8434999999999999 -bioscancast,4,0.11979415086063692,0.19926394220205274,1.0,0.019808103274761516,0.8472222133950811,0.24006085029375607,0.8472222133950811 -llm_baseline,4,0.8279500000000001,1.5648359980658957,0.25,0.27481666666666665,0.6,0.7396584931456627,0.3125 +forecast_source,n_questions,n_versions,mean_brier_score,median_brier_score,std_brier_score,q1_brier_score,q3_brier_score,mean_log_score,median_log_score,std_log_score,q1_log_score,q3_log_score,mean_accuracy,median_accuracy,std_accuracy,q1_accuracy,q3_accuracy,mean_accuracy_error,median_accuracy_error,std_accuracy_error,q1_accuracy_error,q3_accuracy_error,mean_rps,median_rps,std_rps,q1_rps,q3_rps,mean_top_probability,median_top_probability,std_top_probability,q1_top_probability,q3_top_probability,mean_normalized_entropy,median_normalized_entropy,std_normalized_entropy,q1_normalized_entropy,q3_normalized_entropy,mean_true_probability,median_true_probability,std_true_probability,q1_true_probability,q3_true_probability +human,4,3,0.301679,0.12250000000000003,0.33760355477286186,0.015390499999999998,0.6962499999999999,0.566364039875701,0.35667494393873245,0.5259562088308243,0.11121665249368506,1.2039728043259361,0.6666666666666666,1.0,0.4923659639173309,0.0,1.0,0.3333333333333333,0.0,0.49236596391733095,0.0,1.0,0.07509427777777779,0.046972166666666676,0.09024834233882482,0.006871916666666667,0.11604166666666667,0.6853333333333333,0.7,0.21399631829175061,0.48750000000000004,0.89475,0.5870117987209834,0.6282822933764272,0.28696558550535123,0.3188702017159273,0.8589257140248105,0.6353333333333332,0.7,0.2789816981072197,0.3,0.89475 
+bioscancast,4,3,0.4257647169535457,0.37205244886470257,0.3977062640791657,0.0433291584483892,0.82875,0.7463812096627193,0.6346515576271208,0.6200894311767495,0.18681340121946316,1.3862943611198906,0.6666666666666666,1.0,0.4923659639173309,0.0,1.0,0.3333333333333333,0.0,0.49236596391733095,0.0,1.0,0.1365857951402185,0.07166666666666664,0.16972095999996234,0.01917921871575718,0.20416666666666666,0.6574074044650271,0.575,0.2083374782210013,0.5082191780821919,0.8312820512820513,0.6183827695295024,0.762462124360537,0.3197855636053254,0.4424276925417719,0.85374553572667,0.5574074044650271,0.5304794520547946,0.30529081870814934,0.25,0.8312820512820513 +llm_baseline,4,3,0.6634833333333333,0.8375000000000001,0.46852666914165053,0.23625,1.0787499999999999,1.2691894867051403,1.4978661367769954,0.8257904513711053,0.5375464865513984,1.8971199848858813,0.4166666666666667,0.0,0.5149286505444373,0.0,1.0,0.5833333333333334,1.0,0.5149286505444373,0.0,1.0,0.21197777777777774,0.22333333333333333,0.169957911803222,0.04979166666666666,0.32395833333333335,0.575,0.55,0.14381174563233062,0.525,0.5875,0.7841801635983291,0.8407481647643377,0.16892899213491266,0.769977820858443,0.8574336011378134,0.37916666666666665,0.225,0.2895594877908458,0.15,0.5875 diff --git a/bioscancast/stages/eval_stage/outputs/summary_metrics_by_question_type.csv b/bioscancast/stages/eval_stage/outputs/summary_metrics_by_question_type.csv index d9d1227..76fb5d7 100644 --- a/bioscancast/stages/eval_stage/outputs/summary_metrics_by_question_type.csv +++ b/bioscancast/stages/eval_stage/outputs/summary_metrics_by_question_type.csv @@ -1,4 +1,4 @@ -forecast_source,question_type,n_questions,mean_brier_score,mean_log_score,mean_accuracy,mean_rps,mean_top_probability,mean_normalized_entropy,mean_true_probability -bioscancast,range,4,0.11979415086063692,0.19926394220205274,1.0,0.019808103274761516,0.8472222133950811,0.24006085029375607,0.8472222133950811 
-human,range,4,0.08418700000000001,0.18907543739030336,1.0,0.013681666666666667,0.8434999999999999,0.3183255550907608,0.8434999999999999 -llm_baseline,range,4,0.8279500000000001,1.5648359980658957,0.25,0.27481666666666665,0.6,0.7396584931456627,0.3125 +forecast_source,question_type,n_questions,n_versions,mean_brier_score,median_brier_score,std_brier_score,q1_brier_score,q3_brier_score,mean_log_score,median_log_score,std_log_score,q1_log_score,q3_log_score,mean_accuracy,median_accuracy,std_accuracy,q1_accuracy,q3_accuracy,mean_accuracy_error,median_accuracy_error,std_accuracy_error,q1_accuracy_error,q3_accuracy_error,mean_rps,median_rps,std_rps,q1_rps,q3_rps,mean_top_probability,median_top_probability,std_top_probability,q1_top_probability,q3_top_probability,mean_normalized_entropy,median_normalized_entropy,std_normalized_entropy,q1_normalized_entropy,q3_normalized_entropy,mean_true_probability,median_true_probability,std_true_probability,q1_true_probability,q3_true_probability +bioscancast,range,4,3,0.4257647169535457,0.37205244886470257,0.3977062640791657,0.0433291584483892,0.82875,0.7463812096627193,0.6346515576271208,0.6200894311767495,0.18681340121946316,1.3862943611198906,0.6666666666666666,1.0,0.4923659639173309,0.0,1.0,0.3333333333333333,0.0,0.49236596391733095,0.0,1.0,0.1365857951402185,0.07166666666666664,0.16972095999996234,0.01917921871575718,0.20416666666666666,0.6574074044650271,0.575,0.2083374782210013,0.5082191780821919,0.8312820512820513,0.6183827695295024,0.762462124360537,0.3197855636053254,0.4424276925417719,0.85374553572667,0.5574074044650271,0.5304794520547946,0.30529081870814934,0.25,0.8312820512820513 
+human,range,4,3,0.301679,0.12250000000000003,0.33760355477286186,0.015390499999999998,0.6962499999999999,0.566364039875701,0.35667494393873245,0.5259562088308243,0.11121665249368506,1.2039728043259361,0.6666666666666666,1.0,0.4923659639173309,0.0,1.0,0.3333333333333333,0.0,0.49236596391733095,0.0,1.0,0.07509427777777779,0.046972166666666676,0.09024834233882482,0.006871916666666667,0.11604166666666667,0.6853333333333333,0.7,0.21399631829175061,0.48750000000000004,0.89475,0.5870117987209834,0.6282822933764272,0.28696558550535123,0.3188702017159273,0.8589257140248105,0.6353333333333332,0.7,0.2789816981072197,0.3,0.89475 +llm_baseline,range,4,3,0.6634833333333333,0.8375000000000001,0.46852666914165053,0.23625,1.0787499999999999,1.2691894867051403,1.4978661367769954,0.8257904513711053,0.5375464865513984,1.8971199848858813,0.4166666666666667,0.0,0.5149286505444373,0.0,1.0,0.5833333333333334,1.0,0.5149286505444373,0.0,1.0,0.21197777777777774,0.22333333333333333,0.169957911803222,0.04979166666666666,0.32395833333333335,0.575,0.55,0.14381174563233062,0.525,0.5875,0.7841801635983291,0.8407481647643377,0.16892899213491266,0.769977820858443,0.8574336011378134,0.37916666666666665,0.225,0.2895594877908458,0.15,0.5875 diff --git a/bioscancast/stages/eval_stage/outputs/summary_metrics_over_time.csv b/bioscancast/stages/eval_stage/outputs/summary_metrics_over_time.csv new file mode 100644 index 0000000..2e96d22 --- /dev/null +++ b/bioscancast/stages/eval_stage/outputs/summary_metrics_over_time.csv @@ -0,0 +1,10 @@ 
+forecast_version,forecast_source,n_questions,n_versions,mean_brier_score,median_brier_score,std_brier_score,q1_brier_score,q3_brier_score,mean_log_score,median_log_score,std_log_score,q1_log_score,q3_log_score,mean_accuracy,median_accuracy,std_accuracy,q1_accuracy,q3_accuracy,mean_accuracy_error,median_accuracy_error,std_accuracy_error,q1_accuracy_error,q3_accuracy_error,mean_rps,median_rps,std_rps,q1_rps,q3_rps,mean_top_probability,median_top_probability,std_top_probability,q1_top_probability,q3_top_probability,mean_normalized_entropy,median_normalized_entropy,std_normalized_entropy,q1_normalized_entropy,q3_normalized_entropy,mean_true_probability,median_true_probability,std_true_probability,q1_true_probability,q3_true_probability +1,bioscancast,4,1,0.65125,0.7100000000000001,0.40023690900931835,0.405,0.95625,1.1414873682087026,1.2628643221541276,0.5755489457772591,0.7949137779287299,1.6094379124341003,0.5,0.5,0.5773502691896257,0.0,1.0,0.5,0.5,0.5773502691896257,0.0,1.0,0.21791666666666662,0.15291666666666665,0.20056690488047454,0.07291666666666663,0.29791666666666666,0.5375,0.55,0.11086778913041724,0.475,0.6125,0.8423781863408538,0.8391064729205008,0.09721574850660904,0.7739687107939357,0.9075159484674189,0.36250000000000004,0.30000000000000004,0.21360009363293828,0.2,0.4625 +1,human,4,1,0.084187,0.011087999999999995,0.1521304558199968,0.0051799999999999945,0.090095,0.18907543739030333,0.090188723899544,0.2333512254069753,0.06035177977304864,0.21891238151679865,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.01438283333333333,0.0045236666666666646,0.022111262358220845,0.00153183333333333,0.017374666666666667,0.8435,0.914,0.1745957235062379,0.8160000000000001,0.9415,0.3183255550907609,0.26807320224200465,0.20005009946134036,0.19652814985032638,0.3898706074824392,0.8435,0.914,0.1745957235062379,0.8160000000000001,0.9415 
+1,llm_baseline,4,1,0.18125,0.1975,0.11145813862911337,0.10375000000000004,0.27499999999999997,0.4438731241910458,0.4772559723471764,0.18595233937337136,0.3232920957826018,0.5978370007556204,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.056249999999999994,0.04875,0.043432045687402186,0.03791666666666668,0.06708333333333331,0.6499999999999999,0.625,0.12247448713915882,0.55,0.7249999999999999,0.7177125504239692,0.7595689947439288,0.15765653071717503,0.6365333804035602,0.8407481647643377,0.6499999999999999,0.625,0.12247448713915882,0.55,0.7249999999999999 +2,bioscancast,4,1,0.50625,0.5425,0.4053059543933035,0.21999999999999997,0.82875,0.8983923185774028,0.9920656809377555,0.5837782692643037,0.5041636383952677,1.3862943611198906,0.5,0.5,0.5773502691896257,0.0,1.0,0.5,0.5,0.5773502691896257,0.0,1.0,0.17145833333333332,0.11624999999999999,0.18850947392441386,0.034166666666666665,0.25354166666666667,0.5875,0.55,0.14930394055974097,0.525,0.6125,0.7727092719538976,0.8300637526109209,0.17975688348210636,0.7422755172040484,0.86049750736077,0.4625,0.4,0.26575364531836626,0.25,0.6125 +2,human,4,1,0.5987500000000001,0.73,0.3226033426154582,0.5549999999999999,0.77375,1.0377287284276238,1.2039728043259361,0.46209899487857536,0.9921483392291353,1.2495531935244246,0.25,0.0,0.5,0.0,0.25,0.75,1.0,0.5,0.75,1.0,0.15270833333333333,0.13874999999999998,0.10401071659529444,0.08729166666666666,0.20416666666666664,0.5125,0.47500000000000003,0.13149778198382914,0.43750000000000006,0.55,0.8379592639827339,0.8751137794359539,0.11193612301837233,0.8016506926411303,0.9114223507775574,0.3875,0.3,0.20966242709015204,0.2875,0.39999999999999997 
+2,llm_baseline,4,1,0.9812500000000002,0.97,0.17240335456906475,0.8512500000000001,1.1,1.7988593378584794,1.753278948659991,0.3955945430961082,1.5536520246055479,1.9984862619129222,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.27895833333333336,0.2929166666666667,0.09125507342567825,0.2575,0.31437500000000007,0.47500000000000003,0.5,0.09574271077563384,0.425,0.55,0.8951694472253557,0.8802468499572025,0.06739974088137032,0.8407481647643377,0.9346681324182204,0.17500000000000002,0.175,0.06454972243679027,0.1375,0.21250000000000002 +3,bioscancast,4,1,0.11979415086063692,0.0046420656759973145,0.2329002457953387,0.0009225167591153062,0.12351369977751893,0.19926394220205274,0.05166076385190951,0.3158294512572063,0.024690964227889342,0.22623374182607292,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.020382385420655534,0.0022978159992713006,0.037560544740819435,0.00035353701594672643,0.02232666440398011,0.8472222133950811,0.9499759857463785,0.22547844169781928,0.8215858798735511,0.9756123192679085,0.24006085029375607,0.16813340787277656,0.20906792686780576,0.09640307767488637,0.3117911804916463,0.8472222133950811,0.9499759857463785,0.22547844169781928,0.8215858798735511,0.9756123192679085 +3,human,4,1,0.2221,0.09000000000000005,0.31861205250272623,0.04460000000000004,0.2675,0.4722879538091762,0.2899092476264712,0.4984769358388688,0.19369779240011406,0.5684994090355333,0.75,1.0,0.5,0.75,1.0,0.25,0.0,0.5,0.0,0.25,0.05819166666666668,0.03291666666666668,0.07231622109411785,0.010066666666666682,0.08104166666666668,0.7,0.7499999999999999,0.21602468994692864,0.625,0.8249999999999998,0.6047505770894557,0.5852408423065572,0.2622124696592444,0.4614843672121045,0.7285070521839083,0.675,0.7499999999999999,0.26299556396765833,0.6,0.8249999999999998 
+3,llm_baseline,4,1,0.8279500000000001,1.0825,0.531023047208562,0.8142,1.09625,1.5648359980658957,1.8971199848858813,0.9542173378671409,1.4634697210388548,1.9984862619129222,0.25,0.0,0.5,0.0,0.25,0.75,1.0,0.5,0.75,1.0,0.300725,0.33291666666666664,0.21652305218989878,0.23822500000000002,0.39541666666666664,0.6,0.55,0.1732050807568877,0.525,0.625,0.7396584931456627,0.8206276588338776,0.2242299063393091,0.7028525508417268,0.8574336011378134,0.3125,0.15,0.35910769044025403,0.1375,0.32499999999999996 diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy.png b/bioscancast/stages/eval_stage/outputs/win_rate_accuracy.png deleted file mode 100644 index f3b8e77..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_bioscancast_vs_human.png deleted file mode 100644 index 5d1dab9..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_bioscancast_vs_human.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_bioscancast_vs_llm_baseline.png deleted file mode 100644 index 3532293..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_bioscancast_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_human_vs_llm_baseline.png deleted file mode 100644 index 0480a03..0000000 Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_accuracy_human_vs_llm_baseline.png and /dev/null differ diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score.png b/bioscancast/stages/eval_stage/outputs/win_rate_brier_score.png deleted file mode 100644 
index d1e134f..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_bioscancast_vs_human.png
deleted file mode 100644
index b1e36ba..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index e5490f4..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_bioscancast_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_human_vs_llm_baseline.png
deleted file mode 100644
index 65ba052..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_brier_score_human_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_log_score.png b/bioscancast/stages/eval_stage/outputs/win_rate_log_score.png
deleted file mode 100644
index 594a6c3..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_log_score.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_log_score_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/win_rate_log_score_bioscancast_vs_human.png
deleted file mode 100644
index 063feff..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_log_score_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_log_score_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_log_score_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index f8fff20..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_log_score_bioscancast_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_log_score_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_log_score_human_vs_llm_baseline.png
deleted file mode 100644
index 71cd7e0..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_log_score_human_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_rps.png b/bioscancast/stages/eval_stage/outputs/win_rate_rps.png
deleted file mode 100644
index 1811713..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_rps.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_rps_bioscancast_vs_human.png b/bioscancast/stages/eval_stage/outputs/win_rate_rps_bioscancast_vs_human.png
deleted file mode 100644
index 0b99fa0..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_rps_bioscancast_vs_human.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_rps_bioscancast_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_rps_bioscancast_vs_llm_baseline.png
deleted file mode 100644
index 2c538a1..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_rps_bioscancast_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/outputs/win_rate_rps_human_vs_llm_baseline.png b/bioscancast/stages/eval_stage/outputs/win_rate_rps_human_vs_llm_baseline.png
deleted file mode 100644
index 89408b5..0000000
Binary files a/bioscancast/stages/eval_stage/outputs/win_rate_rps_human_vs_llm_baseline.png and /dev/null differ
diff --git a/bioscancast/stages/eval_stage/pipeline.py b/bioscancast/stages/eval_stage/pipeline.py
index
81b165b..0774586 100644 --- a/bioscancast/stages/eval_stage/pipeline.py +++ b/bioscancast/stages/eval_stage/pipeline.py @@ -1,15 +1,16 @@ from __future__ import annotations -from itertools import combinations from pathlib import Path -from typing import Iterable, List, Sequence, Tuple +from typing import Sequence, Tuple import pandas as pd -from bioscancast.stages.eval_stage.calibration import calibration_table from bioscancast.stages.eval_stage.compare import ( compare_sources, compare_sources_by_question_type, + compare_sources_over_time, + rank_sources_over_time, + relative_improvement_over_time, ) from bioscancast.stages.eval_stage.loaders import load_forecasts, load_questions from bioscancast.stages.eval_stage.scoring import ( @@ -21,24 +22,21 @@ top_probability, true_probability, ) -from bioscancast.stages.eval_stage.statistics import ( - exact_mcnemar_test, - paired_t_test, - wilcoxon_signed_rank_test, -) from bioscancast.stages.eval_stage.visualisation import ( - plot_accuracy_by_source, - plot_confidence_calibration, - plot_metric_boxplot, - plot_metric_distribution, - plot_question_level_differences, - plot_question_level_scatter, - plot_reliability_overview, - plot_source_comparison, - plot_win_rate, + plot_question_heatmap, + plot_relative_improvement, + plot_score_timeline_boxplots, + plot_source_ranking_over_time, + plot_source_timeline, ) -OUTPUT_DIR = Path(__file__).resolve().parent / "outputs" +OUTPUT_DIR = Path(__file__).resolve().parent / 'outputs' +BASE_DIR = Path(__file__).resolve().parent +DEFAULT_FORECASTS = [ + str(BASE_DIR / 'mock_forecasts' / 'human_forecasts.csv'), + str(BASE_DIR / 'mock_forecasts' / 'bioscancast_forecasts.csv'), + str(BASE_DIR / 'mock_forecasts' / 'llm_baseline_forecasts.csv'), +] def _ensure_output_dir() -> None: @@ -47,107 +45,110 @@ def _ensure_output_dir() -> None: def _canonicalize_text(value) -> str: if pd.isna(value): - return "" + return '' text = str(value) - text = text.replace("\u2013", "-") - text = 
text.replace("\u2014", "-") + text = text.replace('–', '-') + text = text.replace('—', '-') return text.strip() def _prepare_questions(df: pd.DataFrame) -> pd.DataFrame: df = df.copy() - required_cols = {"question_id", "question_status", "resolved_option"} + required_cols = {'question_id', 'question_status', 'resolved_option'} missing = [col for col in required_cols if col not in df.columns] if missing: - raise ValueError("questions dataframe is missing required columns: " + ", ".join(missing)) + raise ValueError('questions dataframe is missing required columns: ' + ', '.join(missing)) - df["question_id"] = df["question_id"].astype(str).str.strip() - df["question_status"] = df["question_status"].astype(str).str.lower().str.strip() - df["resolved_option"] = df["resolved_option"].apply(_canonicalize_text) + df['question_id'] = df['question_id'].astype(str).str.strip() + df['question_status'] = df['question_status'].astype(str).str.lower().str.strip() + df['resolved_option'] = df['resolved_option'].apply(_canonicalize_text) - if "question_type" in df.columns: - df["question_type"] = df["question_type"].astype(str).str.lower().str.strip() + if 'question_type' in df.columns: + df['question_type'] = df['question_type'].astype(str).str.lower().str.strip() else: - df["question_type"] = "unknown" + df['question_type'] = 'unknown' - if "topic" not in df.columns: - df["topic"] = "" + if 'topic' not in df.columns: + df['topic'] = '' return df def _infer_source_name(path: str | Path) -> str: stem = Path(path).stem.lower().strip() - for suffix in ("_forecasts", "_forecast", "_mock", "_data"): + for suffix in ('_forecasts', '_forecast', '_mock', '_data'): if stem.endswith(suffix): stem = stem[: -len(suffix)] break - return stem or "forecast" + return stem or 'forecast' def _prepare_forecasts(df: pd.DataFrame, source_name: str | None = None) -> pd.DataFrame: df = df.copy() - required_cols = {"question_id", "option", "probability"} + required_cols = {'question_id', 'option', 
'probability'} missing = [col for col in required_cols if col not in df.columns] if missing: - raise ValueError("forecasts dataframe is missing required columns: " + ", ".join(missing)) + raise ValueError('forecasts dataframe is missing required columns: ' + ', '.join(missing)) - df["question_id"] = df["question_id"].astype(str).str.strip() - df["option"] = df["option"].apply(_canonicalize_text) - df["probability"] = pd.to_numeric(df["probability"], errors="coerce") + df['question_id'] = df['question_id'].astype(str).str.strip() + df['option'] = df['option'].apply(_canonicalize_text) + df['probability'] = pd.to_numeric(df['probability'], errors='coerce') - if df["probability"].isna().any(): - bad_rows = df[df["probability"].isna()] + if df['probability'].isna().any(): + bad_rows = df[df['probability'].isna()] raise ValueError( - "Some forecast probabilities could not be parsed as numeric values. " - "Problematic rows: " + str(bad_rows.index.tolist()) + 'Some forecast probabilities could not be parsed as numeric values. 
' + 'Problematic rows: ' + str(bad_rows.index.tolist()) ) - if df["probability"].max() > 1.0: - df["probability"] = df["probability"] # keep as-is if already scaled + if df['probability'].max() > 1.0: + df['probability'] = df['probability'] # keep as-is if already scaled - if "forecast_source" not in df.columns: - df["forecast_source"] = source_name or "forecast" + if 'forecast_source' not in df.columns: + df['forecast_source'] = source_name or 'forecast' else: - df["forecast_source"] = df["forecast_source"].astype(str).str.strip() - if source_name and (df["forecast_source"].nunique() == 1) and ( - df["forecast_source"].iloc[0] in {"", "nan", "none"} + df['forecast_source'] = df['forecast_source'].astype(str).str.strip() + if source_name and (df['forecast_source'].nunique() == 1) and ( + df['forecast_source'].iloc[0] in {'', 'nan', 'none'} ): - df["forecast_source"] = source_name + df['forecast_source'] = source_name - if "forecast_version" in df.columns: - df["forecast_version"] = df["forecast_version"].astype(str).str.strip() + if 'forecast_version' not in df.columns: + df['forecast_version'] = '1' + else: + df['forecast_version'] = df['forecast_version'].astype(str).str.strip() + df.loc[df['forecast_version'].isin({'', 'nan', 'none'}), 'forecast_version'] = '1' return df def _get_resolved_option_for_group(group: pd.DataFrame, question_row: pd.Series) -> Tuple[str, str]: - resolved_option = _canonicalize_text(question_row["resolved_option"]) - options = [_canonicalize_text(opt) for opt in group["option"].tolist()] + resolved_option = _canonicalize_text(question_row['resolved_option']) + options = [_canonicalize_text(opt) for opt in group['option'].tolist()] if resolved_option in options: - return resolved_option, "" + return resolved_option, '' lowered = resolved_option.lower() - if lowered in {"", "tbd", "na", "n/a", "ambiguous"}: - return "", f"unscorable_status:{lowered or 'empty'}" - if lowered.startswith("resolved on "): - return "", 
"placeholder_resolution_text" + if lowered in {'', 'tbd', 'na', 'n/a', 'ambiguous'}: + return '', f"unscorable_status:{lowered or 'empty'}" + if lowered.startswith('resolved on '): + return '', 'placeholder_resolution_text' - return "", "resolved_option_not_in_forecast_options" + return '', 'resolved_option_not_in_forecast_options' -def build_distribution(group: pd.DataFrame) -> Tuple[List[str], List[float]]: +def build_distribution(group: pd.DataFrame) -> Tuple[list[str], list[float]]: group = group.copy() - if "option_order" in group.columns: - group = group.sort_values("option_order") + if 'option_order' in group.columns: + group = group.sort_values('option_order') - options = [_canonicalize_text(opt) for opt in group["option"].tolist()] - probabilities = group["probability"].astype(float).tolist() + options = [_canonicalize_text(opt) for opt in group['option'].tolist()] + probabilities = group['probability'].astype(float).tolist() total = sum(probabilities) if total <= 0: - raise ValueError("Forecast probabilities must sum to a positive value.") + raise ValueError('Forecast probabilities must sum to a positive value.') probabilities = [p / total for p in probabilities] return options, probabilities @@ -156,16 +157,17 @@ def score_all_forecasts(merged_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataF results = [] skipped = [] - grouped = merged_df.groupby(["question_id", "forecast_source"], dropna=False) - for (question_id, source), group in grouped: + grouped = merged_df.groupby(['question_id', 'forecast_source', 'forecast_version'], dropna=False) + for (question_id, source, version), group in grouped: question_row = group.iloc[0] - if str(question_row["question_status"]).lower() != "resolved": + if str(question_row['question_status']).lower() != 'resolved': skipped.append( { - "question_id": question_id, - "forecast_source": source, - "skip_reason": f"question_status={question_row['question_status']}", + 'question_id': question_id, + 'forecast_source': source, 
+ 'forecast_version': version, + 'skip_reason': f"question_status={question_row['question_status']}", } ) continue @@ -173,15 +175,23 @@ def score_all_forecasts(merged_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataF options, probabilities = build_distribution(group) resolved_option, skip_reason = _get_resolved_option_for_group(group=group, question_row=question_row) if skip_reason: - skipped.append({"question_id": question_id, "forecast_source": source, "skip_reason": skip_reason}) + skipped.append( + { + 'question_id': question_id, + 'forecast_source': source, + 'forecast_version': version, + 'skip_reason': skip_reason, + } + ) continue if resolved_option not in options: skipped.append( { - "question_id": question_id, - "forecast_source": source, - "skip_reason": "resolved_option_missing_after_normalization", + 'question_id': question_id, + 'forecast_source': source, + 'forecast_version': version, + 'skip_reason': 'resolved_option_missing_after_normalization', } ) continue @@ -197,103 +207,49 @@ def score_all_forecasts(merged_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataF results.append( { - "question_id": question_id, - "topic": question_row.get("topic", ""), - "question_type": question_row.get("question_type", "unknown"), - "forecast_source": source, - "resolved_option": resolved_option, - "brier_score": brier, - "log_score": logscore, - "accuracy": acc, - "rps": rps, - "top_probability": top_prob, - "normalized_entropy": norm_entropy, - "true_probability": true_prob, + 'question_id': question_id, + 'topic': question_row.get('topic', ''), + 'question_type': question_row.get('question_type', 'unknown'), + 'forecast_source': source, + 'forecast_version': version, + 'resolved_option': resolved_option, + 'brier_score': brier, + 'log_score': logscore, + 'accuracy': acc, + 'accuracy_error': 1 - acc, + 'rps': rps, + 'top_probability': top_prob, + 'normalized_entropy': norm_entropy, + 'true_probability': true_prob, } ) return pd.DataFrame(results), 
pd.DataFrame(skipped) -def _build_pairwise_comparison( - results_df: pd.DataFrame, - source_a: str, - source_b: str, -) -> pd.DataFrame: - metrics = [ - "brier_score", - "log_score", - "accuracy", - "rps", - "top_probability", - "normalized_entropy", - "true_probability", - ] - - a = results_df[results_df["forecast_source"] == source_a].set_index("question_id") - b = results_df[results_df["forecast_source"] == source_b].set_index("question_id") - common = sorted(set(a.index) & set(b.index)) - - rows = [] - for qid in common: - row = {"question_id": qid} - for metric in metrics: - row[f"{metric}_{source_a}"] = float(a.loc[qid, metric]) - row[f"{metric}_{source_b}"] = float(b.loc[qid, metric]) - rows.append(row) - - return pd.DataFrame(rows) - - -def _choose_pairs(results_df: pd.DataFrame) -> List[Tuple[str, str]]: - sources = list(dict.fromkeys(results_df["forecast_source"].tolist())) - if len(sources) < 2: - return [] - return list(combinations(sources, 2)) - - -def _significance_table(comparison_df: pd.DataFrame, source_a: str, source_b: str) -> pd.DataFrame: - rows = [] - for metric in ["brier_score", "log_score", "rps"]: - x = comparison_df[f"{metric}_{source_a}"].astype(float).tolist() - y = comparison_df[f"{metric}_{source_b}"].astype(float).tolist() - rows.append({"metric": metric, **paired_t_test(x, y).__dict__}) - rows.append({"metric": metric, **wilcoxon_signed_rank_test(x, y).__dict__}) - - accuracy_a = comparison_df[f"accuracy_{source_a}"].astype(int).tolist() - accuracy_b = comparison_df[f"accuracy_{source_b}"].astype(int).tolist() - rows.append({"metric": "accuracy", **exact_mcnemar_test(accuracy_a, accuracy_b).__dict__}) - return pd.DataFrame(rows) - - -def _confidence_calibration_overview(results_df: pd.DataFrame) -> None: - for source, group in results_df.groupby("forecast_source"): - table = calibration_table(group["top_probability"].tolist(), group["accuracy"].tolist(), bins=5) - table.to_csv(OUTPUT_DIR / f"calibration_table_{source}.csv", 
index=False) - plot_confidence_calibration(group, OUTPUT_DIR / f"calibration_{source}.png") - - def run_evaluation( - forecasts_path: str | Sequence[str] = "bioscancast_forecasts.csv", - questions_path: str = "bioscancast_questions.csv", + forecasts_path: str | Sequence[str] | None = None, + questions_path: str = str(BASE_DIR / 'bioscancast_questions.csv'), ) -> None: """End-to-end evaluation entry point.""" _ensure_output_dir() - if isinstance(forecasts_path, (str, Path)): + if forecasts_path is None: + forecast_paths = DEFAULT_FORECASTS + elif isinstance(forecasts_path, (str, Path)): forecast_paths = [str(forecasts_path)] else: forecast_paths = [str(p) for p in forecasts_path] questions = _prepare_questions(load_questions(questions_path)) - resolved_questions = questions[questions["question_status"] == "resolved"].copy() - ambiguous_questions = questions[questions["question_status"] == "ambiguous"].copy() - unresolved_questions = questions[questions["question_status"] == "unresolved"].copy() + resolved_questions = questions[questions['question_status'] == 'resolved'].copy() + ambiguous_questions = questions[questions['question_status'] == 'ambiguous'].copy() + unresolved_questions = questions[questions['question_status'] == 'unresolved'].copy() if not ambiguous_questions.empty: - ambiguous_questions.to_csv(OUTPUT_DIR / "ambiguous_questions.csv", index=False) + ambiguous_questions.to_csv(OUTPUT_DIR / 'ambiguous_questions.csv', index=False) if not unresolved_questions.empty: - unresolved_questions.to_csv(OUTPUT_DIR / "unresolved_questions.csv", index=False) + unresolved_questions.to_csv(OUTPUT_DIR / 'unresolved_questions.csv', index=False) forecast_frames = [] for forecast_path in forecast_paths: @@ -304,23 +260,26 @@ def run_evaluation( forecasts = pd.concat(forecast_frames, ignore_index=True) merged = forecasts.merge( resolved_questions, - on="question_id", - how="inner", - suffixes=("_forecast", "_question"), + on='question_id', + how='inner', + 
suffixes=('_forecast', '_question'), ) results_df, skipped_df = score_all_forecasts(merged) - results_path = OUTPUT_DIR / "question_level_metrics.csv" - skipped_path = OUTPUT_DIR / "skipped_questions.csv" - summary_path = OUTPUT_DIR / "summary_metrics.csv" - by_type_path = OUTPUT_DIR / "summary_metrics_by_question_type.csv" + results_path = OUTPUT_DIR / 'question_level_metrics.csv' + skipped_path = OUTPUT_DIR / 'skipped_questions.csv' + summary_path = OUTPUT_DIR / 'summary_metrics.csv' + by_type_path = OUTPUT_DIR / 'summary_metrics_by_question_type.csv' + timeline_summary_path = OUTPUT_DIR / 'summary_metrics_over_time.csv' + ranking_path = OUTPUT_DIR / 'source_ranking_over_time.csv' + improvement_path = OUTPUT_DIR / 'metric_improvement_over_time.csv' results_df.to_csv(results_path, index=False) skipped_df.to_csv(skipped_path, index=False) if results_df.empty: - print("No questions could be scored. Check the resolved_option values.") + print('No questions could be scored. Check the resolved_option values.') return summary_df = compare_sources(results_df) @@ -329,103 +288,32 @@ def run_evaluation( by_type_df = compare_sources_by_question_type(results_df) by_type_df.to_csv(by_type_path, index=False) - plot_source_comparison(summary_df, OUTPUT_DIR / "source_comparison.png") - plot_accuracy_by_source(summary_df, OUTPUT_DIR / "accuracy_by_source.png") - - plot_metric_distribution(results_df, "brier_score", OUTPUT_DIR / "brier_distribution.png") - plot_metric_distribution(results_df, "log_score", OUTPUT_DIR / "log_score_distribution.png") - plot_metric_distribution(results_df, "rps", OUTPUT_DIR / "rps_distribution.png") - - plot_metric_boxplot( - results_df, - "brier_score", - OUTPUT_DIR / "brier_boxplot.png", - ylabel="Brier score (lower is better)", - ) - plot_metric_boxplot( - results_df, - "log_score", - OUTPUT_DIR / "log_boxplot.png", - ylabel="Log score (lower is better)", - ) - plot_metric_boxplot( - results_df, - "rps", - OUTPUT_DIR / "rps_boxplot.png", - 
ylabel="RPS (lower is better)", - ) - - _confidence_calibration_overview(results_df) - plot_reliability_overview(results_df, OUTPUT_DIR) - - pairs = _choose_pairs(results_df) + timeline_summary_df = compare_sources_over_time(results_df) + timeline_summary_df.to_csv(timeline_summary_path, index=False) - all_stats = [] - if not pairs: - print("\nNo pairwise comparisons could be built.") - else: - for source_a, source_b in pairs: - comparison_df = _build_pairwise_comparison(results_df, source_a, source_b) - - if comparison_df.empty: - continue + ranking_df = rank_sources_over_time(timeline_summary_df, metric_column='median_brier_score', ascending=True) + ranking_df.to_csv(ranking_path, index=False) - pair_tag = f"{source_a}_vs_{source_b}" + improvement_df = relative_improvement_over_time(results_df) + improvement_df.to_csv(improvement_path, index=False) - comparison_df.to_csv( - OUTPUT_DIR / f"pairwise_comparison_{pair_tag}.csv", - index=False, - ) - - stats_df = _significance_table(comparison_df, source_a, source_b) - stats_df.insert(0, "source_a", source_a) - stats_df.insert(1, "source_b", source_b) - stats_df.to_csv( - OUTPUT_DIR / f"significance_tests_{pair_tag}.csv", - index=False, - ) + plot_score_timeline_boxplots(results_df, OUTPUT_DIR / 'score_timeline_boxplots.png') + plot_source_timeline(timeline_summary_df, OUTPUT_DIR / 'source_timeline_summary.png') + plot_relative_improvement(results_df, OUTPUT_DIR / 'improvement_vs_v1.png') + plot_question_heatmap(results_df, OUTPUT_DIR / 'question_heatmap.png') + plot_source_ranking_over_time(timeline_summary_df, OUTPUT_DIR / 'source_ranking_over_time.png') - all_stats.append(stats_df) - - for metric in ["brier_score", "log_score", "rps", "accuracy"]: - plot_question_level_scatter( - comparison_df, - metric, - source_a, - source_b, - OUTPUT_DIR / f"scatter_{metric}_{pair_tag}.png", - ) - plot_question_level_differences( - comparison_df, - metric, - source_a, - source_b, - OUTPUT_DIR / 
f"differences_{metric}_{pair_tag}.png", - ) - plot_win_rate( - comparison_df, - metric, - source_a, - source_b, - OUTPUT_DIR / f"win_rate_{metric}_{pair_tag}.png", - lower_is_better=(metric != "accuracy"), - ) - - if all_stats: - combined_stats_df = pd.concat(all_stats, ignore_index=True) - combined_stats_df.to_csv(OUTPUT_DIR / "significance_tests_all_pairs.csv", index=False) - - print("\nEvaluation complete.") + print('\nEvaluation complete.') print(summary_df.to_string(index=False)) - if all_stats: - print("\nPairwise statistical tests saved to:") - print(OUTPUT_DIR / "significance_tests_all_pairs.csv") + if not timeline_summary_df.empty: + print('\nTimeline summary saved to:') + print(timeline_summary_path) if not skipped_df.empty: - print("\nSkipped questions:") + print('\nSkipped questions:') print(skipped_df.to_string(index=False)) -if __name__ == "__main__": - run_evaluation() \ No newline at end of file +if __name__ == '__main__': + run_evaluation() diff --git a/bioscancast/stages/eval_stage/visualisation.py b/bioscancast/stages/eval_stage/visualisation.py index 5d79d8d..8591b12 100644 --- a/bioscancast/stages/eval_stage/visualisation.py +++ b/bioscancast/stages/eval_stage/visualisation.py @@ -6,8 +6,17 @@ import matplotlib.pyplot as plt import numpy as np import pandas as pd +from matplotlib.patches import Patch -from bioscancast.stages.eval_stage.calibration import calibration_table, plot_calibration_curve +from bioscancast.stages.eval_stage.compare import _version_sort_key, rank_sources_over_time, relative_improvement_over_time + + +METRICS = [ + ('brier_score', 'Brier score'), + ('log_score', 'Log score'), + ('accuracy_error', 'Accuracy error (1 - accuracy)'), + ('rps', 'RPS'), +] def _ensure_parent_dir(output_path: str | Path) -> Path: @@ -16,178 +25,256 @@ def _ensure_parent_dir(output_path: str | Path) -> Path: return output_path -def _title_metric(metric: str) -> str: - return metric.replace("_", " ").title() +def _ordered_unique(values: 
Sequence[object]) -> list[str]: + seen: list[str] = [] + for value in values: + text = str(value) + if text not in seen: + seen.append(text) + return seen -def plot_source_comparison(summary_df: pd.DataFrame, output_path: str | Path) -> None: - required = {"forecast_source", "mean_brier_score", "mean_log_score", "mean_accuracy", "mean_rps"} - missing = [c for c in required if c not in summary_df.columns] - if missing: - raise ValueError("summary_df is missing required columns: " + ", ".join(missing)) +def _color_cycle(n: int) -> list[str]: + colors = plt.rcParams['axes.prop_cycle'].by_key().get('color', ['C0', 'C1', 'C2', 'C3', 'C4']) + return [colors[i % len(colors)] for i in range(n)] - output_path = _ensure_parent_dir(output_path) - plot_df = summary_df.set_index("forecast_source")[["mean_brier_score", "mean_log_score", "mean_accuracy", "mean_rps"]] - ax = plot_df.plot(kind="bar", figsize=(10, 5), rot=0) - ax.set_xlabel("Forecast source") - ax.set_ylabel("Metric value") - ax.set_title("Forecast performance by source") - ax.legend(title="Metric") - plt.tight_layout() - plt.savefig(output_path, dpi=220) - plt.close() - - -def plot_metric_boxplot(results_df: pd.DataFrame, metric: str, output_path: str | Path, ylabel: str | None = None) -> None: - required = {"forecast_source", metric} - missing = [c for c in required if c not in results_df.columns] - if missing: - raise ValueError("results_df is missing required columns: " + ", ".join(missing)) - output_path = _ensure_parent_dir(output_path) - sources = list(dict.fromkeys(results_df["forecast_source"].tolist())) - data = [results_df.loc[results_df["forecast_source"] == s, metric].dropna().tolist() for s in sources] - - plt.figure(figsize=(8.5, 5)) - plt.boxplot( - data, - labels=sources, - showmeans=True, - whis=(0, 100) -) - for i, vals in enumerate(data, start=1): - if vals: - xvals = np.random.normal(i, 0.04, size=len(vals)) - plt.plot(xvals, vals, "o", alpha=0.75) - plt.ylabel(ylabel or _title_metric(metric)) - 
plt.title(f"Distribution of {metric.replace('_', ' ')} by source") - plt.figtext(0.5, -0.03, "Dots = questions, orange line = median, green triangle = mean", ha="center", fontsize=9) - plt.tight_layout() - plt.savefig(output_path, dpi=220, bbox_inches="tight") - plt.close() - - - - -def plot_accuracy_by_source(summary_df: pd.DataFrame, output_path: str | Path) -> None: - required = {"forecast_source", "mean_accuracy"} - missing = [c for c in required if c not in summary_df.columns] +def _require_columns(df: pd.DataFrame, required: set[str]) -> None: + missing = [c for c in required if c not in df.columns] if missing: - raise ValueError("summary_df is missing required columns: " + ", ".join(missing)) + raise ValueError('results_df is missing required columns: ' + ', '.join(missing)) - output_path = _ensure_parent_dir(output_path) - plot_df = summary_df.set_index("forecast_source")[["mean_accuracy"]] - ax = plot_df.plot(kind="bar", figsize=(7.5, 5), legend=False, rot=0) - ax.set_xlabel("Forecast source") - ax.set_ylabel("Accuracy") - ax.set_title("Mean accuracy by source") - plt.tight_layout() - plt.savefig(output_path, dpi=220) - plt.close() +def _version_axis(values: Sequence[object]) -> list[str]: + return sorted(_ordered_unique(values), key=_version_sort_key) -def plot_metric_distribution(results_df: pd.DataFrame, metric: str, output_path: str | Path) -> None: - if metric not in results_df.columns: - raise ValueError(f"results_df must contain a '{metric}' column.") + +def plot_score_timeline_boxplots(results_df: pd.DataFrame, output_path: str | Path) -> None: + required = {'forecast_source', 'forecast_version', 'brier_score', 'log_score', 'accuracy_error', 'rps'} + _require_columns(results_df, required) output_path = _ensure_parent_dir(output_path) - series = results_df[metric].dropna().astype(float) - plt.figure(figsize=(8, 5)) - bins = min(10, max(3, len(series))) - plt.hist(series, bins=bins) - plt.xlabel(_title_metric(metric)) - plt.ylabel("Count") - 
     plt.title(f"Distribution of question-level {_title_metric(metric).lower()}")
-    if series.nunique() == 1:
-        val = float(series.iloc[0])
-        plt.xlim(val - 0.05, val + 0.05)
-    plt.tight_layout()
-    plt.savefig(output_path, dpi=220)
-    plt.close()
-
-
-def plot_question_level_scatter(comparison_df: pd.DataFrame, metric: str, source_a: str, source_b: str, output_path: str | Path) -> None:
-    col_a = f"{metric}_{source_a}"
-    col_b = f"{metric}_{source_b}"
-    required = {"question_id", col_a, col_b}
-    missing = [c for c in required if c not in comparison_df.columns]
-    if missing:
-        raise ValueError("comparison_df is missing required columns: " + ", ".join(missing))
+    versions = _version_axis(results_df['forecast_version'].tolist())
+    sources = _ordered_unique(results_df['forecast_source'].tolist())
+    colors = _color_cycle(len(sources))
+    offsets = np.linspace(-0.25, 0.25, max(1, len(sources))) if len(sources) > 1 else np.array([0.0])
+
+    fig, axes = plt.subplots(2, 2, figsize=(14, 10), sharex=True)
+    axes = axes.ravel()
+
+    for ax, (metric, label) in zip(axes, METRICS):
+        base_positions = np.arange(len(versions), dtype=float)
+        for source_idx, source in enumerate(sources):
+            data = []
+            positions = []
+            for version_idx, version in enumerate(versions):
+                mask = (results_df['forecast_source'].astype(str) == source) & (results_df['forecast_version'].astype(str) == version)
+                vals = results_df.loc[mask, metric].dropna().astype(float).tolist()
+                if vals:
+                    data.append(vals)
+                    positions.append(base_positions[version_idx] + offsets[source_idx])
+
+            if data:
+                bp = ax.boxplot(
+                    data,
+                    positions=positions,
+                    widths=0.18 if len(sources) > 1 else 0.35,
+                    patch_artist=True,
+                    showmeans=True,
+                    whis=(0, 100),
+                )
+                color = colors[source_idx]
+                for patch in bp['boxes']:
+                    patch.set_facecolor(color)
+                    patch.set_alpha(0.28)
+                    patch.set_edgecolor(color)
+                for element in ['whiskers', 'caps', 'medians', 'means']:
+                    for item in bp[element]:
+                        item.set_color(color)
+
+        ax.set_title(label)
+        ax.set_ylabel(label)
+        ax.grid(True, axis='y', alpha=0.2)
+        ax.set_xticks(np.arange(len(versions)))
+        ax.set_xticklabels([str(v) for v in versions])
+        ax.set_xlabel('Forecast version')
+
+    legend_handles = [Patch(facecolor=colors[i], edgecolor=colors[i], label=sources[i], alpha=0.28) for i in range(len(sources))]
+    if legend_handles:
+        fig.legend(handles=legend_handles, title='Forecast source', loc='upper center', bbox_to_anchor=(0.5, 0.955), ncol=min(4, len(legend_handles)), frameon=False)
+    fig.suptitle('Score distributions by source and version', y=0.972, fontsize=16)
+    fig.tight_layout(rect=(0, 0, 1, 0.96))
+    fig.savefig(output_path, dpi=220, bbox_inches='tight')
+    plt.close(fig)
+
+
+def plot_source_timeline(summary_df: pd.DataFrame, output_path: str | Path) -> None:
+    required = {'forecast_version', 'forecast_source'} | {f'median_{metric}' for metric, _ in METRICS} | {f'q1_{metric}' for metric, _ in METRICS} | {f'q3_{metric}' for metric, _ in METRICS}
+    _require_columns(summary_df, required)
     output_path = _ensure_parent_dir(output_path)
-    x = comparison_df[col_a].astype(float)
-    y = comparison_df[col_b].astype(float)
-
-    plt.figure(figsize=(6.5, 6.5))
-    plt.scatter(x, y)
-    mn = float(min(x.min(), y.min()))
-    mx = float(max(x.max(), y.max()))
-    plt.plot([mn, mx], [mn, mx], linestyle="--")
-    plt.xlabel(f"{source_a.title()} {metric.replace('_', ' ')}")
-    plt.ylabel(f"{source_b.title()} {metric.replace('_', ' ')}")
-    plt.title(f"Question-level {metric.replace('_', ' ')} comparison")
-    plt.tight_layout()
-    plt.savefig(output_path, dpi=220)
-    plt.close()
-
-
-def plot_question_level_differences(comparison_df: pd.DataFrame, metric: str, source_a: str, source_b: str, output_path: str | Path) -> None:
-    col_a = f"{metric}_{source_a}"
-    col_b = f"{metric}_{source_b}"
-    required = {"question_id", col_a, col_b}
-    missing = [c for c in required if c not in comparison_df.columns]
-    if missing:
-        raise ValueError("comparison_df is missing required columns: " + ", ".join(missing))
+    versions = _version_axis(summary_df['forecast_version'].tolist())
+    sources = _ordered_unique(summary_df['forecast_source'].tolist())
+    colors = _color_cycle(len(sources))
+    x = np.arange(len(versions), dtype=float)
+
+    fig, axes = plt.subplots(2, 2, figsize=(14, 10), sharex=True)
+    axes = axes.ravel()
+
+    for ax, (metric, label) in zip(axes, METRICS):
+        for source_idx, source in enumerate(sources):
+            source_df = summary_df[summary_df['forecast_source'].astype(str) == source].copy()
+            source_df['forecast_version'] = source_df['forecast_version'].astype(str)
+            source_df = source_df.set_index('forecast_version').reindex(versions)
+            medians = source_df[f'median_{metric}'].astype(float).to_numpy()
+            q1 = source_df[f'q1_{metric}'].astype(float).to_numpy()
+            q3 = source_df[f'q3_{metric}'].astype(float).to_numpy()
+            color = colors[source_idx]
+            ax.plot(x, medians, marker='o', linewidth=2, label=source, color=color)
+            ax.fill_between(x, q1, q3, alpha=0.18, color=color)
+
+        ax.set_title(label)
+        ax.set_ylabel(label)
+        ax.grid(True, axis='y', alpha=0.2)
+        ax.set_xticks(x)
+        ax.set_xticklabels([str(v) for v in versions])
+        ax.set_xlabel('Forecast version')
+
+    legend_handles = [Patch(facecolor=colors[i], edgecolor=colors[i], label=sources[i], alpha=0.18) for i in range(len(sources))]
+    if legend_handles:
+        fig.legend(handles=legend_handles, title='Forecast source', loc='upper center', bbox_to_anchor=(0.5, 0.955), ncol=min(4, len(legend_handles)), frameon=False)
+    fig.suptitle('Median score timelines with IQR bands', y=0.972, fontsize=16)
+    fig.tight_layout(rect=(0, 0, 1, 0.96))
+    fig.savefig(output_path, dpi=220, bbox_inches='tight')
+    plt.close(fig)
+
+
+def plot_relative_improvement(results_df: pd.DataFrame, output_path: str | Path) -> None:
+    output_path = _ensure_parent_dir(output_path)
+    improvement_df = relative_improvement_over_time(results_df)
+    if improvement_df.empty:
+        raise ValueError('No improvement data available to plot.')
+
+    versions = _version_axis(improvement_df['forecast_version'].tolist())
+    sources = _ordered_unique(improvement_df['forecast_source'].tolist())
+    colors = _color_cycle(len(sources))
+    x = np.arange(len(versions), dtype=float)
+
+    fig, axes = plt.subplots(2, 2, figsize=(14, 10), sharex=True)
+    axes = axes.ravel()
+
+    for ax, (metric, label) in zip(axes, METRICS):
+        metric_df = improvement_df[improvement_df['metric'] == metric].copy()
+        for source_idx, source in enumerate(sources):
+            source_df = metric_df[metric_df['forecast_source'].astype(str) == source].copy()
+            source_df['forecast_version'] = source_df['forecast_version'].astype(str)
+            source_df = source_df.set_index('forecast_version').reindex(versions)
+            medians = source_df['median_improvement'].astype(float).to_numpy()
+            q1 = source_df['q1_improvement'].astype(float).to_numpy()
+            q3 = source_df['q3_improvement'].astype(float).to_numpy()
+            color = colors[source_idx]
+            ax.plot(x, medians, marker='o', linewidth=2, label=source, color=color)
+            ax.fill_between(x, q1, q3, alpha=0.18, color=color)
+
+        ax.axhline(0.0, color='black', linewidth=1, linestyle='--', alpha=0.6)
+        ax.set_title(f'{label} improvement vs version 1')
+        ax.set_ylabel('Improvement (positive = better)')
+        ax.grid(True, axis='y', alpha=0.2)
+        ax.set_xticks(x)
+        ax.set_xticklabels([str(v) for v in versions])
+        ax.set_xlabel('Forecast version')
+
+    legend_handles = [Patch(facecolor=colors[i], edgecolor=colors[i], label=sources[i], alpha=0.18) for i in range(len(sources))]
+    if legend_handles:
+        fig.legend(handles=legend_handles, title='Forecast source', loc='upper center', bbox_to_anchor=(0.5, 0.955), ncol=min(4, len(legend_handles)), frameon=False)
+    fig.suptitle('Change relative to the first forecast version', y=0.972, fontsize=16)
+    fig.tight_layout(rect=(0, 0, 1, 0.96))
+    fig.savefig(output_path, dpi=220, bbox_inches='tight')
+    plt.close(fig)
+
+
+def plot_question_heatmap(results_df: pd.DataFrame, output_path: str | Path, metric: str = 'brier_score') -> None:
+    required = {'question_id', 'forecast_source', 'forecast_version', metric}
+    _require_columns(results_df, required)
     output_path = _ensure_parent_dir(output_path)
-    diffs = comparison_df[col_a].astype(float) - comparison_df[col_b].astype(float)
-    plt.figure(figsize=(8, 4.5))
-    plt.axhline(0, linestyle="--")
-    plt.scatter(range(len(diffs)), diffs)
-    plt.xticks(range(len(diffs)), comparison_df["question_id"].tolist(), rotation=45)
-    plt.ylabel(f"{source_a.title()} - {source_b.title()} {metric.replace('_', ' ')}")
-    plt.title(f"Per-question difference in {metric.replace('_', ' ')}")
-    plt.tight_layout()
-    plt.savefig(output_path, dpi=220)
-    plt.close()
-
-
-def plot_win_rate(comparison_df: pd.DataFrame, metric: str, source_a: str, source_b: str, output_path: str | Path, lower_is_better: bool = True) -> None:
-    col_a = f"{metric}_{source_a}"
-    col_b = f"{metric}_{source_b}"
-    diffs = comparison_df[col_a].astype(float) - comparison_df[col_b].astype(float)
-
-    if lower_is_better:
-        a_wins = int((diffs < 0).sum())
-        b_wins = int((diffs > 0).sum())
+    versions = _version_axis(results_df['forecast_version'].tolist())
+    sources = _ordered_unique(results_df['forecast_source'].tolist())
+    metric_label = dict(METRICS).get(metric, metric)
+
+    fig, axes = plt.subplots(len(sources), 1, figsize=(12, max(4, 3.2 * len(sources))), sharex=True)
+    if len(sources) == 1:
+        axes = [axes]
+
+    all_values = []
+    ordered_frames = []
+    for source in sources:
+        source_df = results_df[results_df['forecast_source'].astype(str) == source].copy()
+        pivot = source_df.pivot_table(index='question_id', columns='forecast_version', values=metric, aggfunc='mean')
+        pivot = pivot.reindex(columns=versions)
+        baseline = pivot[versions[0]] if versions else pd.Series(dtype=float)
+        order = baseline.sort_values(ascending=False).index.tolist() if not baseline.empty else pivot.index.tolist()
+        pivot = pivot.reindex(order)
+        ordered_frames.append((source, pivot))
+        if not pivot.empty:
+            all_values.append(pivot.to_numpy(dtype=float))
+
+    if all_values:
+        stacked = np.concatenate([arr[np.isfinite(arr)] for arr in all_values if arr.size])
+        vmin = float(np.nanmin(stacked))
+        vmax = float(np.nanmax(stacked))
     else:
-        a_wins = int((diffs > 0).sum())
-        b_wins = int((diffs < 0).sum())
-    ties = int((diffs == 0).sum())
-
+        vmin, vmax = 0.0, 1.0
+
+    for ax, (source, pivot) in zip(axes, ordered_frames):
+        values = pivot.to_numpy(dtype=float)
+        im = ax.imshow(values, aspect='auto', interpolation='nearest', cmap='viridis_r', vmin=vmin, vmax=vmax)
+        ax.set_title(source)
+        ax.set_yticks(np.arange(len(pivot.index)))
+        ax.set_yticklabels([str(q) for q in pivot.index])
+        ax.set_xticks(np.arange(len(versions)))
+        ax.set_xticklabels([str(v) for v in versions])
+        ax.set_ylabel('Question')
+        ax.grid(False)
+
+    axes[-1].set_xlabel('Forecast version')
+    fig.colorbar(im, ax=axes, fraction=0.03, pad=0.02, label=metric_label)
+    fig.suptitle(f'Question-level {metric_label} across versions', y=0.99, fontsize=16)
+    fig.subplots_adjust(top=0.92, right=0.88)
+    fig.savefig(output_path, dpi=220, bbox_inches='tight')
+    plt.close(fig)
+
+
+def plot_source_ranking_over_time(summary_df: pd.DataFrame, output_path: str | Path, metric_column: str = 'median_brier_score') -> None:
     output_path = _ensure_parent_dir(output_path)
-    plt.figure(figsize=(6.5, 4.5))
-    plt.bar([source_a.title(), source_b.title(), "Ties"], [a_wins, b_wins, ties])
-    plt.ylabel("Questions")
-    plt.title(f"Win rate by question for {metric.replace('_', ' ')}")
-    plt.tight_layout()
-    plt.savefig(output_path, dpi=220)
-    plt.close()
-
-
-def plot_confidence_calibration(results_df: pd.DataFrame, output_path: str | Path, source: str | None = None) -> None:
-    if not {"top_probability", "accuracy"}.issubset(results_df.columns):
-        raise ValueError("results_df must contain 'top_probability' and 'accuracy'.")
-
-    if source is not None and "forecast_source" in results_df.columns:
-        results_df = results_df[results_df["forecast_source"] == source]
-
-    table = calibration_table(results_df["top_probability"].tolist(), results_df["accuracy"].tolist(), bins=5)
-    plot_calibration_curve(table, output_path)
-
-
-def plot_reliability_overview(results_df: pd.DataFrame, output_dir: str | Path) -> None:
-    output_dir = Path(output_dir)
-    output_dir.mkdir(parents=True, exist_ok=True)
-    for source, group in results_df.groupby("forecast_source"):
-        plot_confidence_calibration(group, output_dir / f"calibration_{source}.png")
+    ranked = rank_sources_over_time(summary_df, metric_column=metric_column, ascending=True)
+    if ranked.empty:
+        raise ValueError('No ranking data available to plot.')
+
+    versions = _version_axis(ranked['forecast_version'].tolist())
+    sources = _ordered_unique(ranked['forecast_source'].tolist())
+    matrix = pd.DataFrame(index=sources, columns=versions, dtype=float)
+    for _, row in ranked.iterrows():
+        matrix.loc[str(row['forecast_source']), str(row['forecast_version'])] = float(row['rank'])
+
+    data = matrix.to_numpy(dtype=float)
+    fig, ax = plt.subplots(figsize=(10, 4.5))
+    im = ax.imshow(data, aspect='auto', interpolation='nearest', cmap='YlGn_r', vmin=1, vmax=max(3, int(np.nanmax(data))))
+    ax.set_xticks(np.arange(len(versions)))
+    ax.set_xticklabels([str(v) for v in versions])
+    ax.set_yticks(np.arange(len(sources)))
+    ax.set_yticklabels(sources)
+    ax.set_xlabel('Forecast version')
+    ax.set_ylabel('Forecast source')
+    ax.set_title('Source ranking over time (lower Brier score = better rank)')
+
+    for i in range(len(sources)):
+        for j in range(len(versions)):
+            value = data[i, j]
+            if np.isfinite(value):
+                ax.text(j, i, f'{int(value)}', ha='center', va='center', color='black', fontsize=11, fontweight='bold')
+
+    fig.colorbar(im, ax=ax, fraction=0.04, pad=0.03, label='Rank')
+    fig.tight_layout()
+    fig.savefig(output_path, dpi=220, bbox_inches='tight')
+    plt.close(fig)