Make microimpute.Imputer the canonical regime-gated sequential imputer#196
Draft
MaxGhenis wants to merge 2 commits into
Draft
Make microimpute.Imputer the canonical regime-gated sequential imputer#196MaxGhenis wants to merge 2 commits into
MaxGhenis wants to merge 2 commits into
Conversation
…style fitted state
Rename ZeroInflatedImputer to the canonical microimpute.Imputer (regime-gated,
QRF-based, sequentially-chained), make it the opinionated default, and rename
the abstract base class Imputer -> BaseImputer. The module moves
models/zero_inflated.py -> models/regime_gated.py and the result class becomes
the internal RegimeGatedImputerResults.
- Chaining is always on: each numeric target conditions on the original
predictors plus the previously-imputed numeric targets (the `sequential`
constructor flag is removed).
- New `signregime: bool = True` arg; `signregime=False` skips regime detection
and imputes each numeric target with a single base imputer over the full
training set (the REGIME_NO_GATE path).
- The fitted result exposes sklearn-style state: regimes_ ({var: regime}),
predictors_ ({var: chained predictor list}), models_ ({var: {role: estimator}}
with roles single/gate/positive/negative). No lineage()/VariableLineage.
- QRFResults gains feature_names_in_ (original input predictor names, ndarray)
and feature_importances_ ({fitted_feature: importance} dict keyed by the
forest's ACTUAL fitted columns; {var: {feature: importance}} for multiple
variables; AttributeError for non-QRF bases). Keying by the fitted columns
fixes the prior name/value misalignment under categorical (dummy-expanded)
predictors.
Preserves PR #195 in full: regime_gated.py still threads not_numeric_categorical
into every nested base fit, and qrf.py keeps the _align_features/feature_columns
prediction-order tracking byte-for-byte.
Tests: test_zero_inflated.py -> test_regime_gated.py (ported, classes renamed,
#195 not_numeric_categorical tests kept); test_qrf.py keeps the #195 reorder
test and adds feature_importances_/feature_names_in_ self-consistency regression
tests (including the categorical-predictor case that was broken before).
test_regime_gated_chaining.py replaces test_zero_inflated_chaining.py and adds
chaining-recovery, fitted-attributes, and signregime=False coverage.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Imputer.fit(weight_col=...) only forwarded weights to the auxiliary base imputer for non-numeric targets. The per-variable numeric chain called _fit_single_numeric, which neither accepted nor forwarded weights, so regime gate classifiers and per-regime base imputers all fit unweighted: weighted and unweighted fits produced identical imputations for every numeric target. Two layers needed fixing: - regime_gated.py: resolve weights once in fit() (column name, array, or Series — extracted BaseImputer's inline resolution into a reusable _resolve_sample_weights helper) and thread the per-row vector through _fit_single_numeric into every nested fit: gate classifiers receive it as sample_weight, and _fit_base_single forwards it as weight_col to each per-regime base imputer, sliced with the same row mask as the training slice (positive/negative/nonzero parts). - qrf.py: passing sample_weight natively to RandomForestQuantileRegressor / RandomForestClassifier is a no-op for the predictive distribution — quantile_forest only uses it as a zero-weight filter when assembling leaf membership, and fully-grown forest leaves hold one training sample each, so weighted impurity moves nothing either. The QRF model classes now materialize weights by weighted bootstrap resampling of their training rows (_weighted_resample) before fitting. _detect_regime stays unweighted deliberately: it is a structural check of which sign classes appear in the donor data (presence-based support detection), not an estimate, so sampling weights do not apply. Verified with a controlled repro (donor mixing a heavily-downweighted y~1e6 block into a weight-1 y~1e4 block): before, weight_col=None and weight_col="weight" both imputed mean ~202,329; after, the weighted fit imputes ~11,090 (true weighted mean 10,755) while the unweighted fit is unchanged. Regression tests cover both the no-gate numeric path and the ZI_POSITIVE gate classifier path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes
microimpute.Imputerthe canonical, opinionated imputer — sign-regime gating + QRF base + sequential chained-equations imputation, all on by default — and renames the abstract base toBaseImputer. Follow-up to #193 (chaining) and built on top of #195 (QRF feature-order), both of which are preserved.sequentialflag): imputing a list of targets conditions each on the previously-imputed ones, so the imputed vector preserves cross-variable joint structure. The old per-variable-independent path (which was accidentally comonotonic — identical base seeds) is gone.signregime: bool = True— the{neg,0,pos}gate (which fixes the "QRF ony>0drops the negative tail" bug, e.g. capital/business losses);signregime=Falseimputes with the base model directly.base_imputer_class=QRF(default) — the model knob, for experiments.ZeroInflatedImputer→Imputer(it was a misnomer — regime-gated/hurdle, not zero-inflation); abstract baseImputer→BaseImputer(still exported). Breaking — see migration.regimes_,predictors_(the chained predictor set per target),models_({var: {role: estimator}}), andQRFResultscarries standardfeature_importances_(keyed by the forest's fitted columns — self-consistent) /feature_names_in_. Full per-variable lineage is assembled by callers (microplex), not microimpute.Migration
from microimpute.models.zero_inflated import ZeroInflatedImputer→from microimpute import ImputerImputer(subclass/isinstance) →BaseImputerTesting
tests/test_models/: 136 passed, 3 skipped (skips = optional rpy2Matching).ruff format --checkclean._align_featuresbyte-identical;not_numeric_categoricalthreading intact). Reviewed via an independent review-fix cycle that caught and fixed an earlier branch's accidental Preserve QRF feature order during prediction #195 revert + afeature_importances_/feature_names_in_misalignment bug; round-2 review found no actionable issues.🤖 Generated with Claude Code