Skip to content

Make microimpute.Imputer the canonical regime-gated sequential imputer#196

Draft
MaxGhenis wants to merge 2 commits into
mainfrom
claude/imputer-canonical
Draft

Make microimpute.Imputer the canonical regime-gated sequential imputer#196
MaxGhenis wants to merge 2 commits into
mainfrom
claude/imputer-canonical

Conversation

@MaxGhenis

Copy link
Copy Markdown
Contributor

Summary

Makes microimpute.Imputer the canonical, opinionated imputer — sign-regime gating + QRF base + sequential chained-equations imputation, all on by default — and renames the abstract base to BaseImputer. Follow-up to #193 (chaining) and built on top of #195 (QRF feature-order), both of which are preserved.

  • Sequential chaining is always on (no sequential flag): imputing a list of targets conditions each on the previously-imputed ones, so the imputed vector preserves cross-variable joint structure. The old per-variable-independent path (which was accidentally comonotonic — identical base seeds) is gone.
  • signregime: bool = True — the {neg,0,pos} gate (which fixes the "QRF on y>0 drops the negative tail" bug, e.g. capital/business losses); signregime=False imputes with the base model directly.
  • base_imputer_class=QRF (default) — the model knob, for experiments.
  • Rename: ZeroInflatedImputerImputer (it was a misnomer — regime-gated/hurdle, not zero-inflation); abstract base ImputerBaseImputer (still exported). Breaking — see migration.
  • sklearn-shaped fitted state instead of a bespoke lineage object: the result exposes regimes_, predictors_ (the chained predictor set per target), models_ ({var: {role: estimator}}), and QRFResults carries standard feature_importances_ (keyed by the forest's fitted columns — self-consistent) / feature_names_in_. Full per-variable lineage is assembled by callers (microplex), not microimpute.

Migration

  • from microimpute.models.zero_inflated import ZeroInflatedImputerfrom microimpute import Imputer
  • old base class Imputer (subclass/isinstance) → BaseImputer

Testing

  • tests/test_models/: 136 passed, 3 skipped (skips = optional rpy2 Matching).
  • ruff format --check clean.
  • Preserve QRF feature order during prediction #195 preserved (verified: _align_features byte-identical; not_numeric_categorical threading intact). Reviewed via an independent review-fix cycle that caught and fixed an earlier branch's accidental Preserve QRF feature order during prediction #195 revert + a feature_importances_/feature_names_in_ misalignment bug; round-2 review found no actionable issues.

🤖 Generated with Claude Code

…style fitted state

Rename ZeroInflatedImputer to the canonical microimpute.Imputer (regime-gated,
QRF-based, sequentially-chained), make it the opinionated default, and rename
the abstract base class Imputer -> BaseImputer. The module moves
models/zero_inflated.py -> models/regime_gated.py and the result class becomes
the internal RegimeGatedImputerResults.

- Chaining is always on: each numeric target conditions on the original
  predictors plus the previously-imputed numeric targets (the `sequential`
  constructor flag is removed).
- New `signregime: bool = True` arg; `signregime=False` skips regime detection
  and imputes each numeric target with a single base imputer over the full
  training set (the REGIME_NO_GATE path).
- The fitted result exposes sklearn-style state: regimes_ ({var: regime}),
  predictors_ ({var: chained predictor list}), models_ ({var: {role: estimator}}
  with roles single/gate/positive/negative). No lineage()/VariableLineage.
- QRFResults gains feature_names_in_ (original input predictor names, ndarray)
  and feature_importances_ ({fitted_feature: importance} dict keyed by the
  forest's ACTUAL fitted columns; {var: {feature: importance}} for multiple
  variables; AttributeError for non-QRF bases). Keying by the fitted columns
  fixes the prior name/value misalignment under categorical (dummy-expanded)
  predictors.

Preserves PR #195 in full: regime_gated.py still threads not_numeric_categorical
into every nested base fit, and qrf.py keeps the _align_features/feature_columns
prediction-order tracking byte-for-byte.

Tests: test_zero_inflated.py -> test_regime_gated.py (ported, classes renamed,
#195 not_numeric_categorical tests kept); test_qrf.py keeps the #195 reorder
test and adds feature_importances_/feature_names_in_ self-consistency regression
tests (including the categorical-predictor case that was broken before).
test_regime_gated_chaining.py replaces test_zero_inflated_chaining.py and adds
chaining-recovery, fitted-attributes, and signregime=False coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 6, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
microimpute-dashboard Ready Ready Preview, Comment Jun 10, 2026 3:20am

Imputer.fit(weight_col=...) only forwarded weights to the auxiliary
base imputer for non-numeric targets. The per-variable numeric chain
called _fit_single_numeric, which neither accepted nor forwarded
weights, so regime gate classifiers and per-regime base imputers all
fit unweighted: weighted and unweighted fits produced identical
imputations for every numeric target.

Two layers needed fixing:

- regime_gated.py: resolve weights once in fit() (column name, array,
  or Series — extracted BaseImputer's inline resolution into a
  reusable _resolve_sample_weights helper) and thread the per-row
  vector through _fit_single_numeric into every nested fit: gate
  classifiers receive it as sample_weight, and _fit_base_single
  forwards it as weight_col to each per-regime base imputer, sliced
  with the same row mask as the training slice
  (positive/negative/nonzero parts).

- qrf.py: passing sample_weight natively to
  RandomForestQuantileRegressor / RandomForestClassifier is a no-op
  for the predictive distribution — quantile_forest only uses it as a
  zero-weight filter when assembling leaf membership, and fully-grown
  forest leaves hold one training sample each, so weighted impurity
  moves nothing either. The QRF model classes now materialize weights
  by weighted bootstrap resampling of their training rows
  (_weighted_resample) before fitting.

_detect_regime stays unweighted deliberately: it is a structural
check of which sign classes appear in the donor data (presence-based
support detection), not an estimate, so sampling weights do not
apply.

Verified with a controlled repro (donor mixing a heavily-downweighted
y~1e6 block into a weight-1 y~1e4 block): before, weight_col=None and
weight_col="weight" both imputed mean ~202,329; after, the weighted
fit imputes ~11,090 (true weighted mean 10,755) while the unweighted
fit is unchanged. Regression tests cover both the no-gate numeric
path and the ZI_POSITIVE gate classifier path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant