Per-entry seed in XNNPACK weights cache for XNNPACK-upgrade invalidation#20170
Per-entry seed in XNNPACK weights cache for XNNPACK-upgrade invalidation#20170doggeral wants to merge 2 commits into
Conversation
Summary: Add cross-load reuse + multi-PTE safety to the file-backed packed weight cache (D106673663). The first PTE in a session calls `save_packed_index()` to append a trailer; subsequent process launches mmap the file and pre-populate `name_to_packed_data_metadata_` so `look_up()` hits for every saved weight and `xnn_create_runtime` skips packing entirely. ## Cache file format ``` [packed data regions] (written by reserve_space) [index entries] (written by save_packed_index) each: name_len(4B) | name(N) | file_offset(8B) | data_size(8B) [footer: 20 bytes] index_start(8B) | entry_count(4B) | magic "XPWC"(4B) | version(4B) ``` ## Lifecycle invariants - `cache_loaded_` gate: `load_packed_cache()` runs at most once per process per path. Subsequent PTE inits for the same path reopen the write fd without re-reading the trailer. - `from_load` flag: persistent entries (loaded from trailer or promoted on save) skip `delete_packed_data` cleanup. This keeps the mmap region and metadata alive across PTE unload/reload, so the next init hits the cache instead of repacking. Without this, every PTE destroy/recreate cycle appended a fresh copy to the file (~450 MB per cycle). - No-op save short-circuit: `save_packed_index` returns early when no new `reserve_space` happened since the last save, avoiding the mtime churn that previously made the cache file look modified on every model load. ## Multi-PTE behavior - Multiple PTEs (or methods that don't share weights) in the same model load share one cache file. Each PTE's `reserve_space` extends the file; `finalize_for_runtime` msyncs only newly added regions; `save_packed_index` writes one trailer covering all PTEs at the end of the load. - Sibling PTEs that opt out of the mmap path (caller passes empty `packed_cache_path`) early-return from `initialize_for_runtime` and fall through to heap allocation, without touching the singleton's PLLM state. - Cross-model coexistence relies on caller-side discipline: only models that opt in set a non-empty cache path. Setting different non-empty paths concurrently is not supported by this singleton design. ## Caller change `XNNPACKBackend::init` always calls `set_packed_cache_path` (with empty string for non-opted-in PTEs). This keeps the singleton path in sync with the current PTE instead of inheriting a sibling's path. ## Test Plan ``` buck2 test fbcode//executorch/backends/xnnpack/test:test_xnn_weights_cache # 5 pass buck2 build fbsource//xplat/executorch/backends/xnnpack:xnnpack_backendApple buck2 build fbsource//xplat/executorch/backends/xnnpack:xnnpack_backend buck2 build fbcode//executorch/backends/xnnpack:xnnpack_backend ``` On device (iOS Stella build, PLLM + Llama3 runner): - Cold start: load `(1184 entries)` from cache, `reserve_mmap=0` for cached weights - Cache file size stable at ~593 MB across PLLM unload/reload cycles - `app_peak ~700 MB` (vs ~2.5 GB pre-fix) - `compressed ~100 MB` (vs ~1.7 GB pre-fix) Differential Revision: D106717093
Summary:
XNNPACK exposes `xnn_weights_cache_look_up_key.seed` — a per-ukernel value that XNNPACK guarantees is consistent across runs of the same ukernel and changes whenever a ukernel implementation changes. Store this seed per cache entry so a stale cached packing produced by an old XNNPACK ukernel is rejected after upgrade, instead of being handed back to a newer ukernel that expects a different layout.
Changes:
- `PackedDataMeta` gains `uint32_t seed{0}`.
- `look_up` rejects (returns `SIZE_MAX`) when a name hit has a stored seed that doesn't match `cache_key->seed`. This forces `look_up_or_insert` to re-pack with the current ukernel and avoids the slow `memcmp` path catching it later.
- `look_up_or_insert` records `cache_key->seed` on insert.
- On-disk index entry layout extended to `[name_len:u32][name][file_offset:u64][data_size:u64][seed:u32]` (was 16 bytes after the name, now 20).
- `load_packed_cache` reads the per-entry seed and bumps the trailing bytes bound check accordingly.
- `kCacheVersion` bumped 1 → 2 so existing v1 files (which carry no seed) are rejected at load instead of being loaded with `seed=0` and mismatching every fresh `look_up`.
Cleanup of orphaned in-memory and on-disk entries left by an invalidated look-up is a follow-up — this diff only adds the detection.
Differential Revision: D108082431
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20170
Note: Links to docs will display an error until the docs builds have been completed. ❌ 1 New Failure, 1 Unrelated FailureAs of commit 7739ca2 with merge base 26b4be8 ( NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@doggeral has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108082431. |
This PR needs a
|
Summary:
XNNPACK exposes
xnn_weights_cache_look_up_key.seed— a per-ukernel value that XNNPACK guarantees is consistent across runs of the same ukernel and changes whenever a ukernel implementation changes. Store this seed per cache entry so a stale cached packing produced by an old XNNPACK ukernel is rejected after upgrade, instead of being handed back to a newer ukernel that expects a different layout.Changes:
PackedDataMetagainsuint32_t seed{0}.look_uprejects (returnsSIZE_MAX) when a name hit has a stored seed that doesn't matchcache_key->seed. This forceslook_up_or_insertto re-pack with the current ukernel and avoids the slowmemcmppath catching it later.look_up_or_insertrecordscache_key->seedon insert.[name_len:u32][name][file_offset:u64][data_size:u64][seed:u32](was 16 bytes after the name, now 20).load_packed_cachereads the per-entry seed and bumps the trailing bytes bound check accordingly.kCacheVersionbumped 1 → 2 so existing v1 files (which carry no seed) are rejected at load instead of being loaded withseed=0and mismatching every freshlook_up.Cleanup of orphaned in-memory and on-disk entries left by an invalidated look-up is a follow-up — this diff only adds the detection.
Differential Revision: D108082431