Split chunk GET timeout from store timeout #78

Open
mickvandijke wants to merge 5 commits into main from fix/stability-improvements


@mickvandijke
Contributor

Summary

  • Add an independent chunk_get_timeout_secs client config value and hidden --chunk-get-timeout-secs CLI flag.
  • Keep store_timeout_secs scoped to chunk store/PUT operations while preserving default config behavior unless a CLI override is provided.
  • Reduce the non-Merkle chunk PUT store-response timeout from 30 seconds to 10 seconds.
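
A minimal sketch of how the two timeouts could sit side by side, assuming a plain config struct and an optional-override pattern for the CLI. Only `ClientConfig`, `store_timeout_secs`, `chunk_get_timeout_secs`, and the 10-second GET default come from this PR; `CliOverrides`, `with_overrides`, the accessor methods, and the `store_timeout_secs` default shown here are hypothetical placeholders and may not match the real code.

```rust
use std::time::Duration;

/// Illustrative config shape. Only the two field names are taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct ClientConfig {
    /// Timeout for chunk store/PUT operations.
    pub store_timeout_secs: u64,
    /// Timeout for chunk GET operations, independent of the store timeout.
    pub chunk_get_timeout_secs: u64,
}

impl Default for ClientConfig {
    fn default() -> Self {
        Self {
            // Placeholder value for illustration; the PR does not state this default.
            store_timeout_secs: 10,
            // Default of 10s as described in the commit message.
            chunk_get_timeout_secs: 10,
        }
    }
}

/// Hypothetical holder for values parsed from the CLI.
/// `None` means the flag was not passed, so the config value is kept.
#[derive(Debug, Default)]
pub struct CliOverrides {
    pub store_timeout_secs: Option<u64>,
    pub chunk_get_timeout_secs: Option<u64>,
}

impl ClientConfig {
    /// Apply CLI overrides; fields without an override keep their config value.
    pub fn with_overrides(mut self, cli: &CliOverrides) -> Self {
        if let Some(secs) = cli.store_timeout_secs {
            self.store_timeout_secs = secs;
        }
        if let Some(secs) = cli.chunk_get_timeout_secs {
            self.chunk_get_timeout_secs = secs;
        }
        self
    }

    /// Timeout to use on the PUT path.
    pub fn store_timeout(&self) -> Duration {
        Duration::from_secs(self.store_timeout_secs)
    }

    /// Timeout to use on the GET path.
    pub fn chunk_get_timeout(&self) -> Duration {
        Duration::from_secs(self.chunk_get_timeout_secs)
    }
}
```

Under these assumptions, passing only a GET override changes `chunk_get_timeout()` and leaves `store_timeout()` at its configured value, which is the separation the summary describes.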

Testing

  • Not run; PR opened from the existing committed branch state.

mickvandijke and others added 2 commits May 9, 2026 23:40
PUTs and GETs have different payload directions and performance
profiles, so a single shared timeout was a poor fit. This adds
`chunk_get_timeout_secs` to `ClientConfig` (default 10s) and a
matching `--chunk-get-timeout-secs` CLI flag, while keeping
`store_timeout_secs` for the PUT path. Also bumps the non-Merkle
store-response timeout from 5s to 10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mickvandijke mickvandijke force-pushed the fix/stability-improvements branch 2 times, most recently from 0d760b8 to 8bcdc8b, May 10, 2026 16:41
@mickvandijke mickvandijke force-pushed the fix/stability-improvements branch from 8bcdc8b to cdb0945, May 10, 2026 20:35
Previously `cleanup_stale` skipped any spill dir younger than 24h, even
if its lockfile was already releasable. The lockfile is the actual
correctness gate: a releasable lock means the owning `ChunkSpill` is
dropped or the owning process is gone. The age guard only ever needed
to cover the sub-millisecond TOCTOU window between `create_dir` and
`try_lock_exclusive` inside `ChunkSpill::new`.

The 24h policy was hiding a real leak on hosts where `ant` exits
non-gracefully (SIGKILL, kernel OOM, panic abort). `Drop` does not run
on those paths, so the dir is left in `~/.local/share/ant/spill/` with
its lock released. The next upload would not reap it. Under a systemd
restart loop, hundreds of `spill_*` dirs accumulate per hour, each
holding the encrypted chunks of one upload (roughly the size of the
uploaded file), and fill the disk well before the 24h grace expires.

Reduce the guard to 30 seconds (TOCTOU only) and gate primarily on the
lockfile. No other behaviour changes; the lockfile + symlink guard
already covered the safety surface.
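
The reaping rule described above can be sketched as a pure decision function. This is an illustration under stated assumptions, not the actual `cleanup_stale` code: `SpillDirInfo`, `LockState`, `should_reap`, and `TOCTOU_GRACE` are hypothetical names, and the symlink guard is assumed to mean "never reap a symlinked entry". Only the 30-second guard and the lockfile-as-gate rule come from the commit message.

```rust
use std::time::Duration;

/// Age guard covering only the window between `create_dir` and
/// `try_lock_exclusive`; 30 seconds per the commit message.
const TOCTOU_GRACE: Duration = Duration::from_secs(30);

/// Hypothetical result of probing a spill dir's lockfile.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LockState {
    /// Another live owner still holds the lock.
    Held,
    /// The lock could be acquired: the owner was dropped or its process is gone.
    Releasable,
}

/// Hypothetical snapshot of one `spill_*` directory.
#[derive(Debug, Clone, Copy)]
pub struct SpillDirInfo {
    pub age: Duration,
    pub lock: LockState,
    pub is_symlink: bool,
}

/// Decide whether a spill dir may be removed.
pub fn should_reap(info: &SpillDirInfo) -> bool {
    // Assumed symlink guard: never follow or delete through a symlink.
    if info.is_symlink {
        return false;
    }
    // Very young dirs may not have taken their lock yet.
    if info.age < TOCTOU_GRACE {
        return false;
    }
    // The lockfile is the correctness gate.
    info.lock == LockState::Releasable
}
```

With the previous 24h guard, a dir left behind by a SIGKILLed process would have been skipped despite its releasable lock; in this sketch it becomes eligible once it is older than 30 seconds.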

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants