
feat(adaptive): replication channel + adaptive store timeout for slow uplinks#80

Open
jacderida wants to merge 1 commit into WithAutonomi:main from jacderida:adaptive-replication-and-timeout

Conversation

@jacderida
Contributor

Summary

Makes per-chunk peer fan-out and the single-payment store timeout adaptive on slow uplinks, with no new user flags.

The motivating regression: a residential uplink uploading 4 MB chunks to CLOSE_GROUP_MAJORITY peers in parallel saturates outbound bandwidth → all peers in the fan-out time out simultaneously at the static 30 s mark → batch retries fail the same way → file fails. The AIMD controller couldn't react in time on small files because observations were chunk-level (one sample per chunk, well below min_window_ops=8) and warm-start floored at cold-start defaults so every fresh process re-paid the saturation cost.

Changes

  • Per-peer observations. observe_op moved from chunk-level (batch.rs / merkle.rs / file.rs) into spawn_chunk_put (chunk.rs). A 3-chunk file with majority-3 fan-out now generates 9 samples per attempt, crossing min_window_ops within a single attempt instead of needing four attempts.
  • New replication channel on the AIMD controller, bounded to [1, CLOSE_GROUP_MAJORITY]. chunk_put_to_close_group reads it for in-flight parallelism and uses a top-up seeding pattern, so parallelism = 1 means strict sequential per-peer replication on slow uplinks.
  • Eager saturation classifier. When ≥ ⅔ of attempted peers in a chunk's fan-out time out, force_decrease() halves the replication channel immediately, bypassing the min_window_ops decrease gate. The signature is unambiguous on its own — no need for a window of evidence.
  • Adaptive single-payment store timeout. Replaces the hardcoded 30 s with clamp(p95 × latency_inflation_factor, [config_floor, MAX]); see the sketch after this list. Cold-start preserves the historic 30 s when no successful samples exist. --store-timeout now acts as the floor on this path (was previously merkle-only).
  • Snapshot-as-truth warm-start. clamp(snapshot, [min, max]) replaces max(snapshot, cold_start). A previously-saturated uplink that persisted replication = 1 boots that way next session instead of re-paying saturation cost. AIMD additive-increase still ramps it back up if the connection improves.
  • Snapshot schema bumped 1 → 2 (replication field added). Schema-1 snapshots are silently ignored on load.
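
A minimal sketch of the timeout formula and the saturation classifier described above. Apart from `force_decrease()`, `CLOSE_GROUP_MAJORITY`, `min_window_ops`, `latency_inflation_factor`, and the 30 s cold-start default, every name and value below (including the 120 s ceiling) is a hypothetical chosen for illustration, not the crate's actual API:

```rust
use std::time::Duration;

/// Historic default, kept for cold start when no successful samples exist.
const COLD_START_STORE_TIMEOUT: Duration = Duration::from_secs(30);
/// Hypothetical hard ceiling on the adaptive timeout.
const MAX_STORE_TIMEOUT: Duration = Duration::from_secs(120);

/// Single-payment store timeout:
/// clamp(p95 * latency_inflation_factor, [config_floor, MAX]).
/// `config_floor` is the value `--store-timeout` supplies on this path.
fn adaptive_store_timeout(
    p95: Option<Duration>,
    latency_inflation_factor: f64,
    config_floor: Duration,
) -> Duration {
    match p95 {
        // Cold start: no successful samples yet, preserve the historic 30 s.
        None => COLD_START_STORE_TIMEOUT,
        Some(p95) => p95
            .mul_f64(latency_inflation_factor)
            .clamp(config_floor, MAX_STORE_TIMEOUT),
    }
}

/// Eager saturation classifier: true when at least two thirds of the
/// attempted peers in a chunk's fan-out timed out. That signature is treated
/// as unambiguous on its own, so it halves the replication channel via
/// force_decrease() immediately, bypassing the min_window_ops decrease gate.
fn uplink_saturated(attempted_peers: usize, timed_out: usize) -> bool {
    attempted_peers > 0 && 3 * timed_out >= 2 * attempted_peers
}
```

A true result from the classifier drives force_decrease() on the replication channel inside chunk_put_to_close_group, and the channel stays clamped to [1, CLOSE_GROUP_MAJORITY] regardless.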

Test results

Slow residential connection (before vs. after)

Same file set, same machine, same upstream network. Before these changes — PROD-LOCAL-UL-01:

| Status  | Count    |
| ------- | -------- |
| ok      | 14 of 23 |
| partial | 9 of 23  |

Chunks failed across the partial files: 21. Worst case: clip29 (19.9 MB / 8 chunks), 5 failed chunks, 12m 51s.

After these changes — PROD-LOCAL-UL-02, 184 files, 100 % success, zero failed chunks. Spot checks against the previously-failing files:

| File       | Size    | Before               | After          |
| ---------- | ------- | -------------------- | -------------- |
| clip15.avi | 10.2 MB | partial 0/3, 7m 12s  | ok 4/4, 1m 44s |
| clip17.avi | 9.0 MB  | partial 0/3, 6m 11s  | ok 4/4, 1m 35s |
| clip24.avi | 10.8 MB | partial 0/3, 7m 43s  | ok 4/4, 1m 28s |
| clip27.avi | 8.1 MB  | partial 0/3, 5m 59s  | ok 4/4, 1m 17s |
| clip28.avi | 11.3 MB | partial 0/3, 5m 22s  | ok 4/4, 1m 21s |
| clip29.avi | 19.9 MB | partial 3/8, 12m 51s | ok 9/9, 1m 25s |
| clip3.avi  | 12.2 MB | partial 4/7, 8m 5s   | ok 8/8, 1m 33s |

Larger files in the same set (clip5 23.7 MB, clip76 23.3 MB, clip33 19.7 MB) also uploaded cleanly.

Fast cloud connection — regression check

Cloud VM with high-bandwidth uplink — PROD-UL-01-ant-client-upload-lon1-1:

| File                   | Size     | Status | Chunks  | Duration |
| ---------------------- | -------- | ------ | ------- | -------- |
| pinkman.5.mp4          | 1.16 GB  | ok     | 302/302 | 11m 2s   |
| interference-david.mp4 | 858.4 MB | ok     | 219/219 | 15m 22s  |
| oddbeat.6.mp4          | 617.7 MB | ok     | 159/159 | 20m 47s  |
| seer.3.mp4             | 3.52 GB  | ok     | 907/907 | 40m 29s  |
| pinkman.6.mp4          | 964.2 MB | ok     | 246/246 | 40m 4s   |

100 % success on multi-GB files and no regression for fast connections — the replication channel cold-starts at the ceiling, so fast paths are unchanged from prior behavior.

Test plan

  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo fmt --all -- --check — clean
  • cargo test -p ant-core --lib data::client::adaptive — 73 passed
  • cargo test -p ant-core --lib data::client::chunk — 7 passed (new tests for adaptive timeout, saturation classifier, config-floor honoring, max ceiling)
  • Real upload run on slow residential uplink — 184 files / 100 % success
  • Real upload run on fast cloud connection — 5 files (3.5 GB largest) / 100 % success
  • Reviewer to verify snapshot schema-1 → schema-2 migration (old snapshot silently ignored, falls back to cold-start, writes schema-2 at exit); see the sketch below
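
For reference, a sketch of the load behaviour that last item asks the reviewer to confirm. The type, field, and function names (`Snapshot`, `load_snapshot`) and the use of serde/JSON are illustrative assumptions; only the schema numbers and the ignore-and-fall-back behaviour come from this PR:

```rust
use serde::Deserialize;

/// Current on-disk snapshot schema (bumped from 1 when `replication` was added).
const SNAPSHOT_SCHEMA: u32 = 2;

#[derive(Deserialize)]
struct Snapshot {
    schema: u32,
    /// Added in schema 2; warm-starts the replication channel, which is then
    /// clamped to [min, max] rather than floored at cold-start defaults.
    replication: u32,
    // ...other persisted channel values elided...
}

/// Schema-1 (or unreadable) snapshots yield None, so the controller falls
/// back to cold-start defaults and writes a schema-2 snapshot at exit.
fn load_snapshot(raw: &str) -> Option<Snapshot> {
    let snapshot: Snapshot = serde_json::from_str(raw).ok()?;
    (snapshot.schema == SNAPSHOT_SCHEMA).then_some(snapshot)
}
```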

🤖 Generated with Claude Code

feat(adaptive): replication channel + adaptive store timeout for slow uplinks

On slow residential uplinks the static 30 s store timeout combined
with parallel CLOSE_GROUP_MAJORITY peer fan-out per chunk causes
correlated timeouts (all peers in a fan-out time out simultaneously
because the uplink saturates). The AIMD controller couldn't react
in time on small files because observations were chunk-level and
warm-start floored at cold defaults — every fresh process re-paid
the cost.

This change makes both the per-chunk peer fan-out and the
single-payment store timeout adaptive, with no new user flags:

- Per-peer observations: `observe_op` moved into `spawn_chunk_put`
  so each peer PUT is one sample. A 3-chunk file with majority-3
  fan-out now yields 9 samples per attempt (crosses min_window_ops=8
  within one attempt instead of needing 4).
- New `replication` channel on the AIMD controller, bounded
  [1, CLOSE_GROUP_MAJORITY]. `chunk_put_to_close_group` reads it for
  the per-chunk parallelism and uses a top-up seeding pattern so
  parallelism=1 means strict sequential per-peer replication.
- Eager saturation classifier: when ≥ ⅔ of attempted peers in a
  chunk's fan-out time out, force_decrease() bypasses the
  min_window_ops gate. The signature is unambiguous on its own.
- Adaptive single-payment store timeout: derived as
  clamp(p95 × latency_inflation_factor, [config_floor, MAX]).
  Cold-start preserves the historic 30 s; `--store-timeout` now
  acts as floor on this path (was previously merkle-only).
- Snapshot-as-truth warm-start: `clamp(snapshot, [min, max])`
  replaces `max(snapshot, cold_start)`. A previously-saturated
  uplink with persisted replication=1 boots that way, instead of
  re-paying the saturation cost every process. AIMD additive
  increase still ramps back up if the connection improves.
- Snapshot schema bumped 1→2 (replication field added). Schema-1
  snapshots are silently ignored on load.

BREAKING CHANGE: `ChannelStart` and `ChannelMax` gained a required
`replication` field. External crates building these structs via
struct literal must add `replication: <value>`. The on-disk snapshot
schema bumped 1 → 2; schema-1 snapshots are silently ignored at load
(no data loss, just one fresh-start cycle on upgrade). User-facing
CLI behavior, flags, and defaults are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jacderida force-pushed the adaptive-replication-and-timeout branch from cfc4e3b to 78ca55a on May 10, 2026 15:18