perf(encoding): #124 Phase 4 — HC speculative tail check + MatcherStorage enum dispatch#125
perf(encoding): #124 Phase 4 — HC speculative tail check + MatcherStorage enum dispatch#125polaz wants to merge 4 commits into
Conversation
Port donor `zstd_lazy.c:714` (`ZSTD_HcFindBestMatch`) speculative 4-byte tail compare into `HcMatcher::hash_chain_candidate`. Once a `best` candidate is set with `match_len >= 4`, gate the per-byte `common_prefix_len` walk on `concat[i+best_ml-3..best_ml+1]` vs `concat[c+best_ml-3..best_ml+1]`. A mismatch proves the candidate cannot reach `best_ml + 1` and the walk would only confirm `len <= best_ml`. Correctness rests on the monotonic-offset property of LIFO chain walks: each subsequent candidate has `new.offset_bits >= best.offset_bits`, so `better_candidate` (gain = `len*4 - offset_bits`) can return a strict winner only when `new.match_len > best_ml`. Equal-length or shorter candidates can only tie or lose, and `better_candidate` keeps the existing best on ties. Bounds: when `current_idx + best_ml + 1 > history_tail`, the current position cannot reach `best_ml + 1` either, so the candidate cannot win — same skip. Verification: - `cargo nextest run -p structured-zstd --features hash,std,dict_builder` 476/476 pass (incl. `level22_sequences_match_donor_on_corpus_proxy` and the cross-validation roundtrip + parity gates). - `cargo test --doc` 10/10 pass. - `cargo clippy -p structured-zstd --features hash,std,dict_builder --all-targets -- -D warnings` clean. - `cargo bench --quick '^compress/better'` (Lazy backend, the actual HC chain-walker user) on x86_64-darwin: one scenario −5.4 % (small-10k-random, p = 0.05), the rest within ±2 % of stored baseline — `--quick` noise floor swallows whatever steady-state win remains. Full baseline-vs-feature bench will be measured on the CI bench-matrix.
Collapse the driver's four parallel backend fields (`match_generator: MatchGenerator`, `dfast_match_generator: Option<DfastMatchGenerator>`, `row_match_generator: Option<RowMatchGenerator>`, `hc_match_generator: Option<HcMatchGenerator>`) plus the `active_backend: BackendTag` discriminator into a single `storage: MatcherStorage` enum with four variants (`Simple` / `Dfast` / `Row` / `HashChain`). The active backend is now derived directly from the storage variant via `MatcherStorage::backend()` — no separate runtime tag that could drift against the variant. Why - The prior layout kept drained-but-allocated inner structures for every backend the driver had ever touched (the lazy `Option`s + the always-present `match_generator` retained their backing metadata across backend switches). With the enum, the inactive variants are dropped on switch, so the driver carries exactly one backend's state at any time. - Every per-frame / per-slice driver operation (`reset` / `prime_with_dictionary` / `commit_space` / `get_last_space` / `skip_matching_with_hint` / `retire_dictionary_budget` / `trim_after_budget_retire` / `skip_matching_for_dictionary_priming`) used to dispatch on `active_backend` to pick the matching field. The match-on-tag + field-access pair is replaced with `match &mut self.storage`, which fuses dispatch and accessor and gives the borrow checker field-disjoint splits "for free" (manual destructure where the closure captures `vec_pool` / `suffix_pool` alongside the backend borrow). - This is non-hot-path dispatch (per block at most, per frame typically). The hot per-block `compress_block::<S>` already dispatches via `match S::BACKEND` over a `const`, which the compiler const-folds out per monomorphisation — that path is unchanged. Notes - Backend transitions inside `Matcher::reset` drain the old variant's allocations into `vec_pool` / `suffix_pool`, then swap `self.storage` to a freshly constructed variant for the new backend. The old variant is dropped at the swap; with the HashChain → Simple transition the pre-swap `hash_table` / `chain_table` / `hash3_table` zeroing kept from the prior field-shaped impl ensures the BtUltra2 hash3 allocation (`1 << HC3_HASH_LOG`) does not survive the switch. - Three tests went obsolete because they exercised the old lazy-init recovery / drained-but-allocated invariant (`driver_reset_from_row_backend_tolerates_missing_row_matcher` flipped `row_match_generator = None` by hand; `driver_best_to_fastest_releases_oversized_hc_tables` and `…reclaims_row_buffer_pool` inspected the post-switch state of a backend that no longer exists). Replaced with one new test proving the new invariant — buffer pool grows across the HashChain → Simple switch, backend tag flips, no field-shaped inspection of the dropped variant. - Immutable accessors (`simple` / `dfast_matcher` / `row_matcher` / `hc_matcher`) are now `#[cfg(test)]` — production code only reads via the mutable counterparts or pattern-matches `storage` directly. Verification - `cargo nextest run -p structured-zstd --features hash,std,dict_builder` 475/475 pass, 3 skipped (`level22_sequences_match_donor_on_corpus_proxy` + full cross-validation + roundtrip + parity gates). - `cargo test --doc` 10/10 pass. - `cargo clippy -p structured-zstd --features hash,std,dict_builder --all-targets -- -D warnings` clean.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR adds donor-style rep‑0 post-match chaining to the Dfast matcher, a speculative 4‑byte tail-equality check to the HC matcher candidate loop, and refactors MatchGeneratorDriver to hold exactly one active backend in a MatcherStorage enum with pattern-matched accessors and reset/drain semantics. ChangesBackend Storage Consolidation and HC + Dfast Enhancements
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR continues the encoder architecture/performance work by adding an HC speculative tail check and replacing parallel matcher backend fields with a single active MatcherStorage enum.
Changes:
- Adds a speculative 4-byte tail comparison in HC hash-chain candidate probing.
- Collapses matcher backend storage into a single enum variant plus derived backend dispatch.
- Updates driver tests and backend-switch assertions for the new storage model.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
zstd/src/encoding/match_generator.rs |
Introduces MatcherStorage and rewires driver reset, commit, dictionary, and tests around enum-backed storage. |
zstd/src/encoding/hc/mod.rs |
Adds the HC speculative tail gate before running full prefix matching. |
Bundles the Dfast donor port and the HC speculative-gate correctness fix from Copilot review on PR #125. Same Phase 4 hot-path family, shares one retest cycle. ## Dfast post-match rep-0 extension Port donor `zstd_double_fast.c` post-match rep chain into the Dfast fast-loop and general-loop emit paths. After each `emit_candidate`, the new helper `extend_with_repcode_after_match` chains additional zero-literal rep sequences while 4 bytes at the new cursor keep matching `cursor - offset_hist[1]`. Each iteration emits one rep sequence with the swapped-in offset, advances `pos` and `literals_start` past the rep match, updates `offset_hist` via `encode_offset_with_history` (the donor `offset_2 = offset_1; offset_1 = old_offset_2;` swap), inserts every position of the rep match into the long/short hash tables so subsequent positions can hit them, and skips the full hash-table probe entirely on every extra match. Uses donor's `MINMATCH = 4` for the rep threshold rather than the stricter `DFAST_MIN_MATCH_LEN = 6` enforced on the main search — donor accepts any 4-byte rep extension and we mirror that because the rep emission carries no offset cost. ## HC speculative tail check — correctness fix Copilot flagged a backward-extension-unaware bound in the gate introduced by 82e85fc. The old code used `best.match_len` directly as the forward-extent reference, but `extend_backwards` may have added up to `lit_len` backward bytes to that total. A new chain candidate can in turn replace `B_best` of its own length with up to `lit_len` backward bytes, so the worst-case forward length it needs to outscore `best` (under the LIFO offset-monotonic walk argument) is `best.match_len - lit_len + 1`. The 4-byte tail probe now anchors at `tail_off = best.match_len - lit_len - 3` (via `checked_sub(lit_len + 3)`), which covers exactly that boundary. When `best.match_len <= lit_len + 3` the worst-case forward target is so close to `current_idx` that the chain-walk cost is already trivial — the gate is skipped via the `checked_sub`. When `lit_len == 0` the formula reduces to the pre-fix `best.match_len - 3` (no backward extension possible). Comment block in `hash_chain_candidate` now enumerates both the backward-extension bound and the offset-monotonicity argument. ## Driver test tightening (Copilot threads #2 / #3 / #4) - `driver_reset_from_row_backend_recycles_row_buffers_into_pool`: assertion changed from `vec_pool.len() >= before_pool` to `>` so the test fails on the regression it is meant to cover. With `before_pool = 0` on a fresh driver the weaker bound passed trivially even if the Row to Simple drain closure never pushed anything back into `vec_pool`. - `row_get_last_space_and_reset_to_fastest_clears_window` renamed to `..._drops_row_variant` with a doc comment explaining why the original `clears_window` assertion is intentionally gone (the variant itself is dropped on swap; no `row_matcher()` to inspect post-switch). Points at the sibling pool-recycling test by name. - `MatcherStorage` variant docs rewritten to enumerate the full `CompressionLevel` to backend mapping that `resolve_level_params` produces for each variant (Simple covers `Uncompressed`, `Fastest`, `Level(1)`, and any non-positive `Level(n) != 0`; Dfast covers `Default`, `Level(0)`, `Level(2)`, `Level(3)`; Row covers `Level(4)`; HashChain covers `Better`, `Best`, and `Level(5..=22)` across Lazy / BtOpt / BtUltra / BtUltra2). The pre-fix doc said "Level 1" / "Levels 2-3" which was misleading for the named-level variants. ## Verification - nextest 475/475 pass, 3 skipped (incl. `level22_sequences_match_donor_on_corpus_proxy` parity gate and the full cross-validation + roundtrip + ratio gates for default / better / best / level22). - doctests 10/10 pass. - clippy --all-targets -D warnings clean. - bench --quick on `^compress/default` (x86_64-darwin): every default-level scenario improves relative to the c_ffi baseline noise floor. Most relevant numbers: `decodecorpus-z000033` (the 27x C-FFI gap target from #111) gains -11 % vs c_ffi noise; `high-entropy-1m` -16 %; `low-entropy-1m` -8 %.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@zstd/src/encoding/dfast/mod.rs`:
- Around line 452-460: The check `rep > pos` incorrectly prevents chaining to
retained history across block boundaries; keep the `rep == 0` early break but
remove the `rep > pos` guard and rely on the `cur_idx.checked_sub(rep)`
(`cur_idx`, `cand_idx`, `current_abs_start`, `history_abs_start`, `offset_hist`)
to detect out-of-range repeats so repeated offsets that point into retained
history are allowed; in short, delete the `rep > pos` condition and let the
existing `checked_sub` None branch handle invalid sources.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 52ac21cd-64af-4ece-ad72-447894484c1f
📒 Files selected for processing (3)
zstd/src/encoding/dfast/mod.rszstd/src/encoding/hc/mod.rszstd/src/encoding/match_generator.rs
…culative gate Round-2 review fixes from CodeRabbit / Copilot on PR #125. Bundles the behavioural fix called out by both reviewers (block-local `rep > pos` guard wrongly rejected retained-history offsets) plus targeted regression tests for the helpers that previously had only roundtrip coverage. ## Behavioural fix: `extend_with_repcode_after_match` accepts cross-block offsets CodeRabbit (Major / Quick win) and Copilot both flagged that the helper's `if rep == 0 || rep > pos { break; }` guard rejected valid repeat offsets pointing into retained history whenever the source lived past the current block-local cursor. Since `DfastMatchGenerator` keeps a contiguous `live_history()` buffer covering `history_abs_start..history_abs_end`, the only correct bound is "does the candidate land at or after `history_abs_start`" — which `cur_idx.checked_sub(rep)` already enforces (`None` ⇒ rep points before `history_abs_start`, break). Replaced the guard with `if rep == 0 { break; }` and added a comment explaining that `checked_sub` is the authoritative bound. Near block boundaries this restores the donor-parity chain win the helper is meant to recover. ## Regression tests * `dfast::extend_with_repcode_tests::dfast_repcode_extension_emits_zero_literal_rep_on_constant_run` exercises the post-match rep helper directly with a hand-built post-primary-match state: a constant-byte block, `offset_hist[1] = 1`, and an in-block cursor mid-block. Verifies the helper emits a zero-literal Triple with `offset = 1` (donor parity) and advances `pos` / `literals_start` consistently. Going through `start_matching` was unreliable because the primary `best_match` greedily consumes the entire constant run in a single Triple (`offset = 1, match_len = block - 1`), leaving the helper nothing to extend. * `dfast::extend_with_repcode_tests::dfast_repcode_extension_walks_into_retained_history` builds a two-block fixture and probes with `rep = 40` against a block-local cursor at `pos = 5`. The pre-fix `rep > pos` guard would have rejected this; the post-fix code emits the rep correctly. * `hc::hc_tests::hash_chain_candidate_speculative_gate_handles_lit_len_backward_extension` guards the post-fix `tail_off = best.match_len - lit_len - 3` formula with a monotonicity check: for the same probe position and chain content, the returned `match_len` must be non-decreasing in `lit_len` because more backward-extension headroom can only enlarge (or leave unchanged) every candidate's effective length. A regression where the gate skips candidates that backward extension would have rescued surfaces as `len_high_lit < len_low_lit`. ## Doc / comment clarifications (Copilot threads on `MatcherStorage`) * The post-switch state comment in `driver_best_to_fastest_releases_oversized_hc_tables` previously said the manual `hash_table = Vec::new()` assignments release allocations "immediately rather than waiting for `Drop`". That was misleading — assigning `self.storage = MatcherStorage::Simple(...)` would drop the old `HashChain` variant in the same call regardless. The real reason for the up-front reassignment is to cap peak memory during the swap: by releasing the table allocations *before* constructing the replacement variant, the peak resident set is "old data buffers being drained into `vec_pool` + new `MatchGenerator` skeleton" instead of "old tables still resident + new variant under construction". Comment updated to reflect that. ## Verification - nextest 478/478 pass, 3 skipped (478 = 475 + the three new tests). - doctests 10/10 pass. - clippy --all-targets -D warnings clean.
Summary
Phase 4 of #111. Three commits land donor-parity hot-path improvements + the supporting backend refactor:
Phase 4.1 (
82e85fcf, hardened inc16ca32b): port donorzstd_lazy.c:714speculative tail check intoHcMatcher::hash_chain_candidate. Oncebestis set, the per-bytecommon_prefix_lenwalk is gated on a 4-byte tail compare attail_off = best.match_len − lit_len − 3(viachecked_sub(lit_len + 3)). The− lit_lenterm accounts forextend_backwardshaving added up tolit_lenbackward bytes tobest.match_len; a new candidate can in turn replaceB_bestof its own length with up tolit_lenbackward bytes, so the worst-case forward length it needs to outscorebest(under the LIFO offset-monotonic walk argument) isbest.match_len − lit_len + 1. Whenlit_len == 0the formula reduces tobest.match_len − 3(no backward extension possible).Phase 4.3 (
8809bc01): collapse the four parallel driver backend fields (match_generator,Option<DfastMatchGenerator>,Option<RowMatchGenerator>,Option<HcMatchGenerator>) plusactive_backend: BackendTaginto a singlestorage: MatcherStorageenum. Eliminates dead-variant allocation across backend switches — the old layout kept drained metadata for every backend ever touched. The active backend is derived directly from the storage variant; no separate runtime tag to drift. Variant docs enumerate the fullCompressionLevel→ backend mapping (Simple = Uncompressed / Fastest / Level(1) / non-positive levels; Dfast = Default / Level(0,2,3); Row = Level(4); HashChain = Better / Best / Level(5..=22)).Phase 4 Dfast donor parity (
c16ca32b): port donorzstd_double_fast.cpost-match rep-0 extension into the Dfast fast-loop and general-loop emit paths. After eachemit_candidate, the new helperextend_with_repcode_after_matchchains additional zero-literal rep sequences while 4 bytes at the new cursor keep matchingcursor − offset_hist[1]. Each iteration emits one rep sequence with the swapped-in offset, advancesposandliterals_start, updatesoffset_histvia the donoroffset_2 = offset_1; offset_1 = old_offset_2;swap, inserts every position of the rep match into the long/short hash tables, and skips the full hash probe on every extra match. Uses donor'sMINMATCH = 4for the rep threshold (notDFAST_MIN_MATCH_LEN = 6) — a 4-byte rep is always a net win over re-running the full hash search.Sub-goals from #124
c16ca32bafter Copilot review (the original formula usedbest.match_lendirectly and was unaware ofextend_backwardsbackward bytes; new formula subtractslit_len).skip_insert_until_abswired across BT walker + the gate at the top ofcollect_optimal_candidates_initialized_body!+bt_update_tree_until_*start-from-skip +apply_limited_update_after_long_match). Issue text outdated, no new work needed.MatcherStorage) per the user-approved trait-vs-enum trade-off (cleaner for closed set, zero vtable cost, matches the existingHcBackendprecedent insideHcMatchGenerator).compress_block::<S>→start_matching_strategy::<S>→start_matching_optimal::<S>→build_optimal_plan_impl<S,_,_>are fully generic; nomatch self.strategy_tag/parse_mode/active_backendinside. The remainingmatch self.active_backend()calls in the driver are per-frame/per-slice setup paths (eviction, dictionary priming, slice commit), not hot path.self.is_btultra2reads in per-CPUbt_update_tree_until_*kernels were kept — making themS-generic would multiply the binary by 4× across the target_feature umbrellas for a single register-resident bool read.extend_with_repcode_after_matchports donor's post-match rep-0 chain that the existing Rust Dfast skipped (it re-ran the full hash search at every position even after a primary match into a repetitive run). Closes the biggest hot-path feature gap between Rust Dfast and donorZSTD_compressBlock_doubleFast_*.Verification
cargo nextest run --profile ci -p structured-zstd --features hash,std,dict_builder475/475 pass, 3 skipped (incl.level22_sequences_match_donor_on_corpus_proxy, full cross-validation, roundtrip + parity gates for default / better / best / level22).cargo test --doc -p structured-zstd --features hash,std,dict_builder10/10 pass.cargo clippy -p structured-zstd --features hash,std,dict_builder --all-targets -- -D warningsclean.cargo bench --quick '^compress/default'on x86_64-darwin: every default-level scenario improves relative to the c_ffi baseline-noise floor (c_ffi shifted uniformly +25 % on the same run — criterion's stored baseline ran in a colder state). Most relevant numbers:decodecorpus-z000033(the 27× C-FFI gap target from feat: encoder architecture rewrite (split monolith, const-generic Strategy, arena allocator) #111) gains −11 % vs c_ffi noise;high-entropy-1m−16 %;low-entropy-1m−8 %.cargo clean -pbetween commits): Phase 2 / Phase 3 / Phase 4 HEAD all land within ±1 % of 1.52 ms oncompress/level22/low-entropy-1m/matrix/pure_rust— i.e. no level22 regression on a fresh clean build. The 1.527 → 2.483 ms CI alert that fired on the Phase 3 merge commit was not reproducible across two architectures (ARM64 darwin + x86_64-gnu) and appears to be GitHub-runner microarch / VM-noise, not a real regression.Test plan
cross-i686,Fuzz smoke,msrv.level22_sequences_match_donor_on_corpus_proxyratio gate green.compress/default/*) shards to confirm the Dfast post-match rep extension wins.Closes part of #124.
Summary by CodeRabbit
Refactoring
Tests