feat(adaptive): replication channel + adaptive store timeout for slow uplinks #80
Open
jacderida wants to merge 1 commit into WithAutonomi:main from
Conversation
feat(adaptive): replication channel + adaptive store timeout for slow uplinks

On slow residential uplinks the static 30 s store timeout combined with parallel CLOSE_GROUP_MAJORITY peer fan-out per chunk causes correlated timeouts (all peers in a fan-out time out simultaneously because the uplink saturates). The AIMD controller couldn't react in time on small files because observations were chunk-level and warm-start floored at cold defaults, so every fresh process re-paid the cost.

This change makes both the per-chunk peer fan-out and the single-payment store timeout adaptive, with no new user flags:

- Per-peer observations: `observe_op` moved into `spawn_chunk_put` so each peer PUT is one sample. A 3-chunk file with majority-3 fan-out now yields 9 samples per attempt (crosses `min_window_ops = 8` within one attempt instead of needing 4).
- New `replication` channel on the AIMD controller, bounded [1, CLOSE_GROUP_MAJORITY]. `chunk_put_to_close_group` reads it for the per-chunk parallelism and uses a top-up seeding pattern, so `parallelism = 1` means strict sequential per-peer replication.
- Eager saturation classifier: when at least ⅔ of attempted peers in a chunk's fan-out time out, `force_decrease()` bypasses the `min_window_ops` gate. The signature is unambiguous on its own.
- Adaptive single-payment store timeout: derived as clamp(p95 × latency_inflation_factor, [config_floor, MAX]). Cold-start preserves the historic 30 s; `--store-timeout` now acts as a floor on this path (was previously merkle-only).
- Snapshot-as-truth warm-start: `clamp(snapshot, [min, max])` replaces `max(snapshot, cold_start)`. A previously-saturated uplink with persisted `replication = 1` boots that way, instead of re-paying the saturation cost in every process. AIMD additive increase still ramps back up if the connection improves.
- Snapshot schema bumped 1 → 2 (`replication` field added). Schema-1 snapshots are silently ignored on load.

BREAKING CHANGE: `ChannelStart` and `ChannelMax` gained a required `replication` field. External crates building these structs via struct literal must add `replication: <value>`. The on-disk snapshot schema bumped 1 → 2; schema-1 snapshots are silently ignored at load (no data loss, just one fresh-start cycle on upgrade). User-facing CLI behavior, flags, and defaults are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
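For downstream crates hit by the breaking change, a minimal migration sketch. The struct shapes and values here are hypothetical stand-ins; the only thing confirmed by this PR is the new required `replication` field and its `[1, CLOSE_GROUP_MAJORITY]` bound:

```rust
// Hypothetical stand-ins for the real ant-core structs; field names other
// than `replication` are illustrative, not the actual definitions.
#[allow(dead_code)]
struct ChannelStart { parallelism: u32, replication: u32 }
#[allow(dead_code)]
struct ChannelMax { parallelism: u32, replication: u32 }

const CLOSE_GROUP_MAJORITY: u32 = 3; // illustrative value

fn main() {
    // Pre-PR struct literals without `replication` no longer compile;
    // each literal must now supply a value within the channel bounds.
    let start = ChannelStart { parallelism: 4, replication: 1 };
    let max = ChannelMax { parallelism: 16, replication: CLOSE_GROUP_MAJORITY };
    let _ = (start, max);
}
```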
Summary
Makes per-chunk peer fan-out and the single-payment store timeout adaptive on slow uplinks, with no new user flags.
The motivating regression: a residential uplink uploading 4 MB chunks to `CLOSE_GROUP_MAJORITY` peers in parallel saturates outbound bandwidth → all peers in the fan-out time out simultaneously at the static 30 s mark → batch retries fail the same way → the file fails. The AIMD controller couldn't react in time on small files because observations were chunk-level (one sample per chunk, well below `min_window_ops = 8`) and warm-start floored at cold-start defaults, so every fresh process re-paid the saturation cost.

Changes
- `observe_op` moved from chunk-level (batch.rs / merkle.rs / file.rs) into `spawn_chunk_put` (chunk.rs), so each peer PUT is one sample. A 3-chunk file with majority-3 fan-out now generates 9 samples per attempt, crossing `min_window_ops` within a single attempt instead of needing four.
- New `replication` channel on the AIMD controller, bounds `[1, CLOSE_GROUP_MAJORITY]`. `chunk_put_to_close_group` reads it for in-flight parallelism and uses a top-up seeding pattern, so `parallelism = 1` is strict sequential per-peer replication on slow uplinks (see the first sketch after this list).
- Eager saturation classifier: when at least ⅔ of the attempted peers in a chunk's fan-out time out, `force_decrease()` halves the replication channel immediately, bypassing the `min_window_ops` decrease gate. The signature is unambiguous on its own; no need for a window of evidence.
- Adaptive single-payment store timeout: `clamp(p95 × latency_inflation_factor, [config_floor, MAX])`. Cold-start preserves the historic 30 s when no successful samples exist. `--store-timeout` now acts as a floor on this path (was previously merkle-only). Both this rule and the classifier are sketched below.
- Snapshot-as-truth warm-start: `clamp(snapshot, [min, max])` replaces `max(snapshot, cold_start)`. A previously-saturated uplink that persisted `replication = 1` boots that way next session instead of re-paying the saturation cost. AIMD additive-increase still ramps it back up if the connection improves.

Test results
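A minimal sketch of the top-up seeding pattern, assuming a tokio runtime; `put_to_peer` and the peer type are hypothetical, and only the pattern itself (seed `parallelism` PUTs, admit one more as each completes) is from this PR:

```rust
use tokio::task::JoinSet;

// Hypothetical per-peer PUT; stands in for the real chunk upload future.
async fn put_to_peer(_peer: u64) {}

// Seed `parallelism` PUTs, then start one more as each finishes; with
// parallelism = 1 this degenerates to strict sequential per-peer replication.
async fn put_with_topup(peers: Vec<u64>, parallelism: usize) {
    let mut in_flight = JoinSet::new();
    let mut remaining = peers.into_iter();
    for peer in remaining.by_ref().take(parallelism.max(1)) {
        in_flight.spawn(put_to_peer(peer)); // initial window
    }
    while in_flight.join_next().await.is_some() {
        if let Some(peer) = remaining.next() {
            in_flight.spawn(put_to_peer(peer)); // top up the window
        }
    }
}
```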
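And a sketch of the saturation classifier plus the two clamp rules, under assumed names: `MAX_STORE_TIMEOUT` and the integer ⅔ arithmetic are illustrative, while the formulas themselves come from the description above:

```rust
use std::time::Duration;

const HISTORIC_STORE_TIMEOUT: Duration = Duration::from_secs(30); // cold-start value
const MAX_STORE_TIMEOUT: Duration = Duration::from_secs(300);     // illustrative ceiling

/// Eager saturation classifier: at least 2/3 of attempted peers timing out
/// in one chunk's fan-out is treated as uplink saturation and forces a
/// decrease without waiting for min_window_ops samples.
fn fanout_saturated(timed_out: usize, attempted: usize) -> bool {
    attempted > 0 && 3 * timed_out >= 2 * attempted
}

/// Adaptive single-payment store timeout:
/// clamp(p95 * latency_inflation_factor, [config_floor, MAX]), falling back
/// to the historic 30 s (still floored by --store-timeout) with no samples.
fn adaptive_store_timeout(
    p95: Option<Duration>,
    latency_inflation_factor: f64,
    config_floor: Duration,
) -> Duration {
    match p95 {
        None => HISTORIC_STORE_TIMEOUT.max(config_floor),
        Some(p) => p
            .mul_f64(latency_inflation_factor)
            .clamp(config_floor, MAX_STORE_TIMEOUT),
    }
}

/// Snapshot-as-truth warm-start: clamp(snapshot, [min, max]) rather than
/// max(snapshot, cold_start), so a persisted replication = 1 boots at 1.
fn warm_start(snapshot: u32, min: u32, max: u32) -> u32 {
    snapshot.clamp(min, max)
}
```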
Slow residential connection (before vs. after)
Same file set, same machine, same upstream network. Before these changes, on `PROD-LOCAL-UL-01`: 21 chunks failed across the partial files. Worst case: clip29 (19.9 MB / 8 chunks), 5 failed chunks, 12m 51s.
After these changes, on `PROD-LOCAL-UL-02`: 184 files, 100 % success, zero failed chunks. Spot checks against the previously-failing files passed, and larger files in the same set (clip5 23.7 MB, clip76 23.3 MB, clip33 19.7 MB) also uploaded cleanly.
Fast cloud connection — regression check
Cloud VM with a high-bandwidth uplink, `PROD-UL-01-ant-client-upload-lon1-1`: 100 % success on multi-GB files, no regression for fast connections. The replication channel cold-starts at the ceiling, so fast paths are unchanged from prior behavior.
Test plan
- `cargo clippy --all-targets --all-features -- -D warnings`: clean
- `cargo fmt --all -- --check`: clean
- `cargo test -p ant-core --lib data::client::adaptive`: 73 passed
- `cargo test -p ant-core --lib data::client::chunk`: 7 passed (new tests for adaptive timeout, saturation classifier, config-floor honoring, max ceiling)

🤖 Generated with Claude Code