Skip to content

chore: switch Dockerfile to slim Python + uv lockfile#719

Open
cailmdaley wants to merge 5 commits intodevelopfrom
chore/uv-lockfile
Open

chore: switch Dockerfile to slim Python + uv lockfile#719
cailmdaley wants to merge 5 commits intodevelopfrom
chore/uv-lockfile

Conversation

@cailmdaley
Copy link
Copy Markdown
Contributor

Summary

Replaces the skaha/astroml base (which brings conda → double-installs every Python dep, ~5–10 GB image) with python:3.12-slim-bookworm + uv driven by a committed lockfile.

Before After
Base images.canfar.net/skaha/astroml:latest (conda Python) python:3.12-slim-bookworm
Deps hand-pinned in Dockerfile, drifts from pyproject.toml pyproject.toml is SSOT (what), uv.lock is the pinned manifest (which versions)
WeightWatcher built from source with patched headers Debian's weightwatcher package
CI builds image, runs shapepipe_run -c … also invokes source-extractor --version, weightwatcher --version, psfex --version

Why

Two problems with the v2.0 image:

  1. Conda + pip double install: skaha/astroml ships conda; we then pip install everything on top. Two copies of every scientific package, multi-GB bloat, slow builds.
  2. Runtime regressions slip through CI: the original failure that prompted this PR (a tool that builds inside the image but can't actually run on canfar) wasn't caught because the deploy workflow only invoked shapepipe_run.

Lockfile workflow

# regenerate after editing pyproject.toml dependencies
uv lock

# add a new dep
uv add 'somepackage>=1.2'    # updates pyproject and uv.lock

# bump pinned versions (deliberate)
uv lock --upgrade

uv sync --frozen in the Dockerfile fails the build if pyproject.toml and uv.lock are out of sync — so a stale lockfile cannot ship.

Test plan

  • CI image build succeeds on this branch
  • Test — binaries step passes (sextractor, weightwatcher, psfex)
  • Test — shapepipe entry point passes
  • Pull built image locally on canfar and run a real ShapePipe job (Cail to do, post-CI green)

Out of scope / follow-ups

  • Optional: separate Dockerfile.canfar building on skaha/astroml if there's a deployment reason to keep that base image around (Martin mentioned this could be cleaner — currently conjectural).
  • Dockerfile.jupyter is unchanged; it inherits from this base.

🤖 Generated with Claude Code

The v2.0 merge moved the base image to skaha/astroml, which ships conda.
That double-installs every Python dep (conda + pip) and bloats the image
to 5–10 GB. Reverting to python:3.12-slim-bookworm and using uv with a
lockfile gives:

- pyproject.toml as the single source of truth for *what* shapepipe needs
- uv.lock pinning *which versions* — upstream changes only land when we
  regenerate the lockfile deliberately
- a lean apt layer using Debian's astromatic packages (psfex,
  source-extractor, weightwatcher) instead of building weightwatcher
  from source
- ~10× faster rebuilds and a much smaller image

Also adds a CI smoke test that invokes source-extractor, weightwatcher,
and psfex inside the built image. The original failure mode that
prompted this work (a binary that builds fine but can't run on canfar)
will now surface at PR time, not deployment time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cailmdaley cailmdaley marked this pull request as ready for review April 27, 2026 09:41
The Dockerfile now carries an inline note about why we use Debian's
weightwatcher package instead of the source build with sed-patched
headers — short version: GCC 10 (Apr 2020) flipped its default to
-fno-common, breaking WeightWatcher 1.12's (2014) use of common-symbol
globals in headers. The old Dockerfile patched them inline; Debian's
weightwatcher 1.12+dfsg-3 patches the same way upstream of the build.

Pyproject minimums bumped to current major versions where the major
itself moved (astropy 6→7, numpy 1→2, pandas 2→3) so the abstract
constraint signals our actual target. Minor-version drift stays in
uv.lock. ngmix line carries an explicit "do not modernize" note since
it's pinned to Axel's stable_version branch until upstream absorbs the
fixes (tracked in the ngmix-update fiber).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cailmdaley
Copy link
Copy Markdown
Contributor Author

@martinkilbinger ready for your eyes when you have a moment. Two things worth flagging:

Image works end-to-end on the CI runner. Build is green in 3m31s (uv installs 238 packages in 322ms — everything's a prebuilt wheel). The added Test — binaries step runs source-extractor --version, weightwatcher --version, psfex --version inside the built image and they all answer. shapepipe_run -c /app/example/config.ini also boots cleanly. Since that's the same failure mode you hit on canfar (a binary that the image carries but couldn't run), this should work on canfar too — but it'd be good to pull the image and try a real job to confirm before we merge.

Modernized package versions in the same PR (commit b62e9dbb). The lockfile now resolves to numpy 2.4, astropy 7.2, pandas 3.0, galsim 2.8, mpi4py 4.1, etc. (everything at current latest). Pyproject minimums bumped to the new major versions. The only line that's still pinned the old way is ngmix — that stays on Axel's stable_version branch until we do the upstream-ngmix migration. Worth knowing in case you hit a numpy-2 or pandas-3 compatibility surprise in real jobs; rolling back is just uv lock with a constraint.

(Disclosure: PR drafted by Claude in collaboration with me; I haven't run a full pipeline through the new image yet — that's the canfar test I want to do.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cailmdaley
Copy link
Copy Markdown
Contributor Author

@martinkilbinger one more thing worth flagging since it explains a Dockerfile shape change you'll notice — why we dropped the WeightWatcher source build:

The previous Dockerfile (v1.x) built WeightWatcher from source with six `sed` patches to `prefs.h`, `globals.h`, `fitscat.h`, `prefs.c`, `main.c`, and `fitscat.c`. The new Dockerfile just `apt install`s `weightwatcher` instead.

The patches were converting tentative definitions like `prefstruct prefs;` in headers into proper `extern` declarations + a single definition in the corresponding `.c` file. They were a workaround for GCC 10 (April 2020) flipping its default to `-fno-common`, which broke WeightWatcher 1.12's (2014) use of common-symbol globals — every translation unit started getting its own definition, the linker complained.

Debian's `weightwatcher 1.12+dfsg-3` package on bookworm carries an equivalent patch upstream of the build (the `+dfsg-3` suffix indicates Debian-applied patches). So `apt install` gives us the same WeightWatcher 1.12 binary, just with Debian carrying the GCC-compatibility fix instead of us doing it inline.

Tradeoff for completeness:

Source build (old) apt (new)
Version control We pin `WW_VERSION=1.12` Tied to bookworm
Patch maintenance Ours, in plain sight Debian's, transparent
Build time ~30s + autotools deps ~1s
Risk if GCC changes again We patch Debian handles it

The apt route is probably the better default — Debian has stronger Astromatic build infrastructure than we do. If we ever needed a version bookworm doesn't ship, we'd swap back to source-build.

@martinkilbinger
Copy link
Copy Markdown
Contributor

When running the tests from within the docker of this branch, I get errors:
For uv sync --extra test:
error: failed to remove file /app/.venv/lib/python3.12/site-packages/appdirs-1.4.4.dist-info/INSTALLER: Read-only file system (os error 30)
And
uv run pytest:
error: Failed to spawn: pytest Caused by: No such file or directory (os error 2)

And the slim bookworm image does not have some useful programs, e.g. vi, which should be added.

@martinkilbinger
Copy link
Copy Markdown
Contributor

When running the tests from within the docker of this branch, I get errors:
For uv sync --extra test:
error: failed to remove file /app/.venv/lib/python3.12/site-packages/appdirs-1.4.4.dist-info/INSTALLER: Read-only file system (os error 30)
And
uv run pytest:
error: Failed to spawn: pytest Caused by: No such file or directory (os error 2)

And the slim bookworm image does not have some useful programs, e.g. vi, which should be added.

And cloning the branch and using that version, I also get an error message for `uv run pytest:
WARNING: Failed to generate report: No data to report.

Rewrites the slim-bookworm Dockerfile as a two-target multi-stage build:

- runtime — minimal image for canfar batch jobs and downstream FROM
  clauses. Ships astromatic binaries + jupyter + fitsio extras only.
- dev — runtime plus everyday CLI tools (vim, less, tmux, htop, ripgrep,
  fd, jq, bat, git-lfs, …) and *all* extras pre-installed via the `dev`
  meta-extra (test, lint, doc, release, jupyter, fitsio). Default target
  when `--target` is omitted; published as `:<branch>` (no suffix).

Both stages share a `base` stage carrying system deps + uv + the lockfile
copy, so the heavy apt + wheel resolution is cached once.

Addresses Martin's PR feedback on #719:

- `uv run pytest: No such file or directory` — pytest now ships in the
  dev image (was previously only in the never-built `--extra test`).
- `uv sync --extra X: Read-only file system` — the dev image has every
  extra pre-baked, so live `uv sync` is no longer needed at all. As a
  belt-and-braces also `chmod -R go+rwX /app` so non-root users on
  canfar/skaha can still mutate the venv when they want to.
- "no vim in the slim image" — `vim`, `less`, `tmux`, `htop`, plus a
  curated subset of cailmdaley/containers (rg/fd/jq/bat/…) now ship in
  the dev target.

deploy-image.yml builds, smoke-tests, and pushes both targets:
- runtime: source-extractor / weightwatcher / psfex / shapepipe_run all
  invoked inside the built image.
- dev: vim, ripgrep, pytest verified runnable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@martinkilbinger
Copy link
Copy Markdown
Contributor

Some of the issues are fixed, but I think /app is still a read-only file system and I cannot run the pytests.

I use apptainer pull shapepipe_uv.sif docker://ghcr.io/cosmostat/shapepipe:chore-uv-lockfile to get the docker image, then apptainer shell shapepipe_uv.sif and cd /app; uv run pytest.

Reproducing Martin's `apptainer pull ... && apptainer shell ... && cd
/app && uv run pytest` flow surfaced three layered read-only-fs failures
that all need their default writable target moved to /tmp:

  1. `uv run` auto-syncs the venv (would mutate /app/.venv) → UV_NO_SYNC=1
     so `uv run X` just runs X against the baked-in venv.
  2. uv's package cache wants $HOME/.cache/uv (host-bind, not always
     writable, and uv's mkdir logic gets confused if it pre-exists)
     → UV_CACHE_DIR=/tmp/uv-cache.
  3. pytest-cov erases /app/.coverage at startup because
     `addopts = --cov=shapepipe` in pyproject → COVERAGE_FILE=/tmp/.coverage.

Verified end-to-end on the existing PR image (without the env-var
defaults, reproducing the bug; and with them set in the shell, all 9
tier-0/1/2 tests pass in 10s on both a read-only SIF and a writable
sandbox). Once CI rebuilds with these defaults baked in, neither
Martin nor anyone else will need to set anything.

Also adds `docs/source/container.md` covering:

- The two image targets (`:<branch>` = dev default, `:<branch>-runtime`
  = slim) and what each is for.
- Read-only SIF vs writable sandbox postures, with worked examples for
  both. Includes the bind-mount-host-clone pattern Cail uses for
  long-lived sandbox containers.
- The three configuration layers (`pyproject.toml` = abstract minimums
  / SSOT, `uv.lock` = concrete pins generated by uv, `Dockerfile` =
  system deps + invocations) with a "what to edit when" table.
- A small symptom→cause→fix table for the three read-only failures
  above, in case anyone bypasses /tmp with a custom apptainer profile.

`installation.md` points to the new doc; `toc.rst` registers it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cailmdaley
Copy link
Copy Markdown
Contributor Author

@martinkilbinger thanks for testing — your symptom led us to a deeper read-only-fs problem than just pytest not being installed. Reproduced your flow (apptainer pull + apptainer shell + cd /app && uv run pytest) and found three layered failures, all because something tries to write under /app or $HOME:

  1. uv run auto-syncs the venv. It tried to mutate /app/.venv to ensure consistency with pyproject.toml/uv.lock. On a read-only SIF, that fails (the file-permissions chmod we did at build time can't override apptainer's read-only mount).
  2. uv's package cache wants $HOME/.cache/uv. Apptainer binds the host home, and uv's mkdir logic gets confused when the directory pre-exists from the host.
  3. pytest-cov writes /app/.coverage. pyproject.toml has addopts = "--cov=shapepipe", and pytest-cov tries to erase /app/.coverage at startup, which fails on read-only.

Fix in dc13582e bakes three env-var defaults into the image so all three writes redirect to /tmp (which is tmpfs, writable in any apptainer mode):

ENV UV_NO_SYNC=1     UV_CACHE_DIR=/tmp/uv-cache     COVERAGE_FILE=/tmp/.coverage

Verified on the existing image with the env vars set in the shell — all 9 tests pass in 10s in both read-only SIF and writable sandbox modes. Once CI rebuilds, your exact flow should work without any setup:

apptainer pull shapepipe.sif docker://ghcr.io/cosmostat/shapepipe:chore-uv-lockfile
apptainer shell shapepipe.sif
cd /app && uv run pytest        # should now run cleanly

While I was in there, also wrote up docs/source/container.md covering the two image targets, the read-only vs writable-sandbox postures, and the three configuration layers (pyproject.toml / uv.lock / Dockerfile) with a "what to edit when" table. Worth a read — the design only makes sense if all three layers are clear.

(Disclosure: investigation, fix, and docs drafted by Claude in collaboration with me.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants