
feat(adaptive): replication channel + adaptive store timeout for slow uplinks#80

Open
jacderida wants to merge 1 commit into WithAutonomi:main from jacderida:adaptive-replication-and-timeout

Conversation

@jacderida
Contributor

Summary

Makes per-chunk peer fan-out and the single-payment store timeout adaptive on slow uplinks, with no new user flags.

The motivating regression: a residential uplink uploading 4 MB chunks to CLOSE_GROUP_MAJORITY peers in parallel saturates outbound bandwidth → all peers in the fan-out time out simultaneously at the static 30 s mark → batch retries fail the same way → file fails. The AIMD controller couldn't react in time on small files because observations were chunk-level (one sample per chunk, well below min_window_ops=8) and warm-start floored at cold-start defaults so every fresh process re-paid the saturation cost.

Changes

  • Per-peer observations. observe_op moved from chunk-level (batch.rs / merkle.rs / file.rs) into spawn_chunk_put (chunk.rs). A 3-chunk file with majority-3 fan-out now generates 9 samples per attempt, crossing min_window_ops within a single attempt instead of needing four attempts.
  • New replication channel on the AIMD controller, bounded to [1, CLOSE_GROUP_MAJORITY]. chunk_put_to_close_group reads it for in-flight parallelism and uses a top-up seeding pattern, so parallelism = 1 means strict sequential per-peer replication on slow uplinks.
  • Eager saturation classifier. When ≥ ⅔ of attempted peers in a chunk's fan-out time out, force_decrease() halves the replication channel immediately, bypassing the min_window_ops decrease gate. The signature is unambiguous on its own — no need for a window of evidence.
  • Adaptive single-payment store timeout. Replaces the hardcoded 30 s with clamp(p95 × latency_inflation_factor, [config_floor, MAX]); see the sketch after this list. Cold-start preserves the historic 30 s when no successful samples exist. --store-timeout now acts as the floor on this path (was previously merkle-only).
  • Snapshot-as-truth warm-start. clamp(snapshot, [min, max]) replaces max(snapshot, cold_start). A previously-saturated uplink that persisted replication = 1 boots that way next session instead of re-paying saturation cost. AIMD additive-increase still ramps it back up if the connection improves.
  • Snapshot schema bumped 1 → 2 (replication field added). Schema-1 snapshots are silently ignored on load.
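
A minimal sketch of the timeout formula and the saturation classifier described above. Apart from `force_decrease()`, `CLOSE_GROUP_MAJORITY`, `min_window_ops`, `latency_inflation_factor`, and the 30 s cold-start default, every name and value below (including the 120 s ceiling) is a hypothetical chosen for illustration, not the crate's actual API:

```rust
use std::time::Duration;

/// Historic default, kept for cold start when no successful samples exist.
const COLD_START_STORE_TIMEOUT: Duration = Duration::from_secs(30);
/// Hypothetical hard ceiling on the adaptive timeout.
const MAX_STORE_TIMEOUT: Duration = Duration::from_secs(120);

/// Single-payment store timeout:
/// clamp(p95 * latency_inflation_factor, [config_floor, MAX]).
/// `config_floor` is the value `--store-timeout` supplies on this path.
fn adaptive_store_timeout(
    p95: Option<Duration>,
    latency_inflation_factor: f64,
    config_floor: Duration,
) -> Duration {
    match p95 {
        // Cold start: no successful samples yet, preserve the historic 30 s.
        None => COLD_START_STORE_TIMEOUT,
        Some(p95) => p95
            .mul_f64(latency_inflation_factor)
            .clamp(config_floor, MAX_STORE_TIMEOUT),
    }
}

/// Eager saturation classifier: true when at least two thirds of the
/// attempted peers in a chunk's fan-out timed out. That signature is treated
/// as unambiguous on its own, so it halves the replication channel via
/// force_decrease() immediately, bypassing the min_window_ops decrease gate.
fn uplink_saturated(attempted_peers: usize, timed_out: usize) -> bool {
    attempted_peers > 0 && 3 * timed_out >= 2 * attempted_peers
}
```

A true result from the classifier drives force_decrease() on the replication channel inside chunk_put_to_close_group, and the channel stays clamped to [1, CLOSE_GROUP_MAJORITY] regardless.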

Test results

Slow residential connection (before vs. after)

Same file set, same machine, same upstream network. Before these changes — PROD-LOCAL-UL-01:

| Status  | Count    |
| ------- | -------- |
| ok      | 14 of 23 |
| partial | 9 of 23  |

Chunks failed across the partial files: 21. Worst case: clip29 (19.9 MB / 8 chunks), 5 failed chunks, 12m 51s.

After these changes — PROD-LOCAL-UL-02, 184 files, 100 % success, zero failed chunks. Spot checks against the previously-failing files:

| File       | Size    | Before               | After          |
| ---------- | ------- | -------------------- | -------------- |
| clip15.avi | 10.2 MB | partial 0/3, 7m 12s  | ok 4/4, 1m 44s |
| clip17.avi | 9.0 MB  | partial 0/3, 6m 11s  | ok 4/4, 1m 35s |
| clip24.avi | 10.8 MB | partial 0/3, 7m 43s  | ok 4/4, 1m 28s |
| clip27.avi | 8.1 MB  | partial 0/3, 5m 59s  | ok 4/4, 1m 17s |
| clip28.avi | 11.3 MB | partial 0/3, 5m 22s  | ok 4/4, 1m 21s |
| clip29.avi | 19.9 MB | partial 3/8, 12m 51s | ok 9/9, 1m 25s |
| clip3.avi  | 12.2 MB | partial 4/7, 8m 5s   | ok 8/8, 1m 33s |

Larger files in the same set (clip5 23.7 MB, clip76 23.3 MB, clip33 19.7 MB) also uploaded cleanly.

Fast cloud connection — regression check

Cloud VM with high-bandwidth uplink — PROD-UL-01-ant-client-upload-lon1-1:

| File                   | Size     | Status | Chunks  | Duration |
| ---------------------- | -------- | ------ | ------- | -------- |
| pinkman.5.mp4          | 1.16 GB  | ok     | 302/302 | 11m 2s   |
| interference-david.mp4 | 858.4 MB | ok     | 219/219 | 15m 22s  |
| oddbeat.6.mp4          | 617.7 MB | ok     | 159/159 | 20m 47s  |
| seer.3.mp4             | 3.52 GB  | ok     | 907/907 | 40m 29s  |
| pinkman.6.mp4          | 964.2 MB | ok     | 246/246 | 40m 4s   |

100 % success on multi-GB files and no regression for fast connections — the replication channel cold-starts at the ceiling, so fast paths are unchanged from prior behavior.

Test plan

  • cargo clippy --all-targets --all-features -- -D warnings — clean
  • cargo fmt --all -- --check — clean
  • cargo test -p ant-core --lib data::client::adaptive — 73 passed
  • cargo test -p ant-core --lib data::client::chunk — 7 passed (new tests for adaptive timeout, saturation classifier, config-floor honoring, max ceiling)
  • Real upload run on slow residential uplink — 184 files / 100 % success
  • Real upload run on fast cloud connection — 5 files (3.5 GB largest) / 100 % success
  • Reviewer to verify snapshot schema-1 → schema-2 migration (old snapshot silently ignored, falls back to cold-start, writes schema-2 at exit); see the sketch below
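
For reference, a sketch of the load behaviour that last item asks the reviewer to confirm. The type, field, and function names (`Snapshot`, `load_snapshot`) and the use of serde/JSON are illustrative assumptions; only the schema numbers and the ignore-and-fall-back behaviour come from this PR:

```rust
use serde::Deserialize;

/// Current on-disk snapshot schema (bumped from 1 when `replication` was added).
const SNAPSHOT_SCHEMA: u32 = 2;

#[derive(Deserialize)]
struct Snapshot {
    schema: u32,
    /// Added in schema 2; warm-starts the replication channel, which is then
    /// clamped to [min, max] rather than floored at cold-start defaults.
    replication: u32,
    // ...other persisted channel values elided...
}

/// Schema-1 (or unreadable) snapshots yield None, so the controller falls
/// back to cold-start defaults and writes a schema-2 snapshot at exit.
fn load_snapshot(raw: &str) -> Option<Snapshot> {
    let snapshot: Snapshot = serde_json::from_str(raw).ok()?;
    (snapshot.schema == SNAPSHOT_SCHEMA).then_some(snapshot)
}
```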

🤖 Generated with Claude Code

feat(adaptive): replication channel + adaptive store timeout for slow uplinks

On slow residential uplinks the static 30 s store timeout combined
with parallel CLOSE_GROUP_MAJORITY peer fan-out per chunk causes
correlated timeouts (all peers in a fan-out time out simultaneously
because the uplink saturates). The AIMD controller couldn't react
in time on small files because observations were chunk-level and
warm-start floored at cold defaults — every fresh process re-paid
the cost.

This change makes both the per-chunk peer fan-out and the
single-payment store timeout adaptive, with no new user flags:

- Per-peer observations: `observe_op` moved into `spawn_chunk_put`
  so each peer PUT is one sample. A 3-chunk file with majority-3
  fan-out now yields 9 samples per attempt (crosses min_window_ops=8
  within one attempt instead of needing 4).
- New `replication` channel on the AIMD controller, bounded
  [1, CLOSE_GROUP_MAJORITY]. `chunk_put_to_close_group` reads it for
  the per-chunk parallelism and uses a top-up seeding pattern so
  parallelism=1 means strict sequential per-peer replication.
- Eager saturation classifier: when ≥ ⅔ of attempted peers in a
  chunk's fan-out time out, force_decrease() bypasses the
  min_window_ops gate. The signature is unambiguous on its own.
- Adaptive single-payment store timeout: derived as
  clamp(p95 × latency_inflation_factor, [config_floor, MAX]).
  Cold-start preserves the historic 30 s; `--store-timeout` now
  acts as floor on this path (was previously merkle-only).
- Snapshot-as-truth warm-start: `clamp(snapshot, [min, max])`
  replaces `max(snapshot, cold_start)`. A previously-saturated
  uplink with persisted replication=1 boots that way, instead of
  re-paying the saturation cost every process. AIMD additive
  increase still ramps back up if the connection improves.
- Snapshot schema bumped 1→2 (replication field added). Schema-1
  snapshots are silently ignored on load.

BREAKING CHANGE: `ChannelStart` and `ChannelMax` gained a required
`replication` field. External crates building these structs via
struct literal must add `replication: <value>`. The on-disk snapshot
schema bumped 1 → 2; schema-1 snapshots are silently ignored at load
(no data loss, just one fresh-start cycle on upgrade). User-facing
CLI behavior, flags, and defaults are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jacderida force-pushed the adaptive-replication-and-timeout branch from cfc4e3b to 78ca55a on May 10, 2026 15:18