manage: extract _fetch_image_info to reduce duplication #2217

Draft
ideaship wants to merge 2 commits into main from manage-image-info-refactor

@ideaship

ImageClusterapi, ImageClusterapiGardener, and ImageOctavia all share
the same marker-fetch and checksum-fetch sequence: fetch a marker
file, parse the date and image filename, construct the image URL,
fetch the .CHECKSUM file, and log each step. This identical block
was repeated verbatim in all three take_action() implementations.

Extract this sequence into a private _fetch_image_info(base_url,
marker_url) helper that returns (date, image_filename, url, checksum).
Callers that need the image filename for version extraction
(ImageClusterapi, ImageClusterapiGardener) unpack it; ImageOctavia
discards it with _.
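
A minimal sketch of the helper's shape, not the code from this PR. It assumes the marker body looks like `YYYY-MM-DD <name>.qcow2`, that the image sits directly under `base_url`, and that the `.CHECKSUM` file starts with the hex digest. The `fetch` parameter, the `_sketch` suffix, and the `example.invalid` URLs are hypothetical and exist only so the example runs without network access.

```python
from typing import Callable, Tuple


def fetch_image_info_sketch(
    base_url: str,
    marker_url: str,
    fetch: Callable[[str], str],
) -> Tuple[str, str, str, str]:
    """Return (date, image_filename, url, checksum) for one image."""
    # Assumption: marker body is "YYYY-MM-DD <name>.qcow2".
    marker = fetch(marker_url)
    date, image_filename = marker.strip().split()[:2]

    # Assumption: the image lives directly under base_url.
    url = f"{base_url.rstrip('/')}/{image_filename}"

    # Assumption: the .CHECKSUM body starts with the hex digest.
    checksum = fetch(f"{url}.CHECKSUM").strip().split()[0]
    return date, image_filename, url, checksum


# Hypothetical in-memory responses standing in for the object store.
BASE = "https://example.invalid/images"
MARKER = f"{BASE}/last-image"
responses = {
    MARKER: "2026-04-27 demo-image.qcow2\n",
    f"{BASE}/demo-image.qcow2.CHECKSUM": "ab" * 32 + "  demo-image.qcow2\n",
}

# A caller that needs the filename unpacks all four values ...
date, image_filename, url, checksum = fetch_image_info_sketch(
    BASE, MARKER, responses.__getitem__
)
# ... while a caller that does not need it discards it with _.
date2, _, url2, checksum2 = fetch_image_info_sketch(
    BASE, MARKER, responses.__getitem__
)
```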

ImageGardenlinux is deliberately excluded: it constructs the image
URL directly from a known pattern rather than fetching a marker file,
so it shares only the checksum-fetch half of the pattern and does not
fit this helper without contortion.

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>


Four `osism manage image` commands (octavia, clusterapi,
clusterapi-gardener, gardenlinux) fetch marker and checksum
files from nbg1.your-objectstorage.com using bare
requests.get() with no error handling and no retry. When the
Ceph RGW backend transiently returns an XML S3 error document,
the code parses <?xml as the checksum and
openstack-image-manager rejects it with
'sha256:<?xml' is not a valid checksum.
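
One way this failure can arise, assuming the original code took the first whitespace-separated token of the response body and prefixed it with `sha256:`. The XML body below is an illustrative stand-in, not a captured RGW response.

```python
# Illustrative S3-style error document; the real RGW body may differ.
body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<Error><Code>ServiceUnavailable</Code></Error>"
)

# Without validation, the first token of the XML prolog is taken
# as if it were the hex digest.
first_token = body.strip().split()[0]
checksum = f"sha256:{first_token}"
print(checksum)  # sha256:<?xml
```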

Analysis of 295 .CHECKSUM fetch events logged across testbed
builds in the window 2025-12-14 – 2026-04-27 shows 84 XML
failures (28.5 % of fetches). 94 % of those failures returned
in ≤ 2 s (fast canned RGW 503 response); the remaining 6 %
returned in 8–60 s. Zero failures were connection-level errors;
every failure returned an HTTP response. Successful fetches
span 0–53 s (p99 = 9 s).

New module osism/utils/http.py exports fetch_text, which wraps
requests.get with:

- Retry on {408, 429} ∪ 5xx (covers the observed RGW 503)
- Retry on non-HTTPError RequestException (connection / DNS / TLS)
- Retry when an optional validate callback rejects the body
  (guards against HTTP 200 with unexpected content)
- Immediate HTTPError propagation on non-retryable 4xx (404, 403)
- Structured INFO log lines per attempt for observability in Zuul

Default schedule: 3 retries, 2 s / 4 s / 8 s sleeps (14 s budget).
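
The retry policy above can be sketched as follows. This is not the code in osism/utils/http.py: to stay runnable without network access it takes a hypothetical `get` callable returning `(status, body)` instead of calling requests.get, uses `OSError` as a stand-in for connection-level errors and a local `HTTPStatusError` as a stand-in for HTTPError, and omits the per-attempt logging.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class HTTPStatusError(Exception):
    """Hypothetical stand-in for requests.HTTPError in this sketch."""

    def __init__(self, status: int, url: str) -> None:
        super().__init__(f"HTTP {status} for {url}")
        self.status = status


def is_retryable(status: int) -> bool:
    # {408, 429} plus every 5xx status.
    return status in (408, 429) or 500 <= status <= 599


def fetch_text_sketch(
    url: str,
    get: Callable[[str], Tuple[int, str]],
    validate: Optional[Callable[[str], bool]] = None,
    delays: Sequence[float] = (2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    last_error: Exception = RuntimeError("no attempt made")
    # One initial attempt plus one retry per entry in delays.
    for attempt in range(len(delays) + 1):
        try:
            status, body = get(url)
        except OSError as exc:  # connection-level failure: retry
            last_error = exc
        else:
            if status >= 400:
                error = HTTPStatusError(status, url)
                if not is_retryable(status):
                    raise error  # e.g. 403 / 404: fail immediately
                last_error = error
            elif validate is None or validate(body):
                return body
            else:
                # Success status but unexpected content: retry.
                last_error = ValueError(f"unexpected body from {url}")
        if attempt < len(delays):
            sleep(delays[attempt])
    raise last_error
```

With the default delays, a fetch that fails on every attempt sleeps 2 + 4 + 8 = 14 s in total before the last error is raised.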

Two validators are added to manage.py:

- _validate_marker: generic YYYY-MM-DD <name>.qcow2 contract;
  rejects XML bodies without hard-coding any image-name prefix,
  so production deployments with unfamiliar names pass through
  to downstream validation rather than burning the retry budget.
- _is_sha256: requires a 64-char lowercase hex first token,
  matching sha256sum(1) output; accepting uppercase would mask a
  downstream mismatch rather than surface it.
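
A rough sketch of what the two validators might look like, written only from the description above. The `_sketch` names and the exact regular expressions are assumptions, not the code in manage.py.

```python
import re

_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
_SHA256_RE = re.compile(r"[0-9a-f]{64}")


def validate_marker_sketch(body: str) -> bool:
    """Accept 'YYYY-MM-DD <name>.qcow2' regardless of the image name."""
    fields = body.split()
    return (
        len(fields) >= 2
        and _DATE_RE.fullmatch(fields[0]) is not None
        and fields[1].endswith(".qcow2")
    )


def is_sha256_sketch(body: str) -> bool:
    """Require a 64-char lowercase hex first token, as sha256sum prints."""
    fields = body.split()
    return bool(fields) and _SHA256_RE.fullmatch(fields[0]) is not None
```

Both functions return a plain bool, so either could serve as the optional validate callback that decides whether a response body is retried.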

All seven requests.get call sites in manage.py are replaced:
  clusterapi:          marker + .CHECKSUM  (take_action lines 110, 125)
  clusterapi-gardener: marker + .CHECKSUM  (take_action lines 229, 245)
  gardenlinux:         .sha256             (take_action line 354)
  octavia:             marker + .CHECKSUM  (take_action lines 440, 451)

The checksum_url_status log line added to octavia in ce844a0 is
removed; fetch_text emits the status code on every attempt.

No per-attempt timeout is added. The distributions of slow
failures (8–60 s) and slow successes (9–53 s) overlap — a 41 s
duration appears as both a failure and a success in the data.
No timeout value cleanly separates the two populations without
introducing false positives on legitimate slow responses.

34 unit tests across three new files cover the retry helper
(test_http.py, 15 tests), the validators
(test_manage_validators.py, 15 tests), and the call-site wiring
(test_manage_wiring.py, 4 tests).

AI-assisted: Claude Code
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship ideaship marked this pull request as draft April 27, 2026 14:06

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • In _fetch_image_info, consider validating the number of whitespace-separated fields in the marker body before doing strip().split()[:2] so that a malformed marker produces a clear, custom error instead of an unhandled ValueError from tuple unpacking.
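
One possible shape for that check, as a sketch only. `MarkerFormatError` and `parse_marker_sketch` are hypothetical names, not identifiers from the PR.

```python
from typing import Tuple


class MarkerFormatError(ValueError):
    """Hypothetical error type for a marker body with too few fields."""


def parse_marker_sketch(body: str) -> Tuple[str, str]:
    fields = body.strip().split()
    if len(fields) < 2:
        # Report the field count and a short excerpt instead of letting
        # tuple unpacking raise a bare ValueError.
        raise MarkerFormatError(
            "expected '<date> <image filename>', got "
            f"{len(fields)} field(s): {body[:40]!r}"
        )
    return fields[0], fields[1]
```

A field-count check alone does not reject an XML error document, since its prolog already splits into several tokens; content validation is still needed for that case.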


Base automatically changed from checksum-fetch-retry to main May 1, 2026 07:51