Modernize s3proxy: Go 1.26, slog, metrics, traces, hardening #41
Open
Conversation
Update go.mod and go.work directives from 1.25.3 to 1.26. Installed toolchain and latest upstream are both 1.26.2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AGENTS.md captures Tier C rules for AI-assisted development on this repo (applicable sections only: code quality, vulnerability scanning, vendoring, testing/architecture, project layout, DI, error handling, logging, metrics, alerting, dashboards, tracing, Vault secrets, SBOM, Renovate policy). docs/DEVELOPMENT.md is the accompanying developer workflow guide, adapted for GitHub (gh CLI) and the s3proxy Go backend with no database. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Package rename
--------------
internal/crypto → internal/cryptoutil to avoid shadowing the Go stdlib
crypto package (revive var-naming, resolves the remaining golangci-lint
finding on main).
cmd/main.go
-----------
- Add ReadHeaderTimeout (10s) on http.Server to mitigate Slowloris (G112).
- Wrap ListenAndServe{TLS} return values with fmt.Errorf for wrapcheck.
- Replace panic(err) on startup with log.WithError(err).Fatal(...) for
consistent structured JSON failure output.
- Extract log level switch into setLogLevel helper (reduces main's
cyclomatic complexity and drops stale TODO blocks).
- Remove unused --kms flag and kmsEndpoint field (never wired up).
internal/router
---------------
- handleHealthEndpoints: drop superfluous second WriteHeader(500) that ran
after Write had already committed the status line.
- handleForwards: replace http.DefaultClient with a named HTTP client with
Timeout and tuned Transport; stream the upstream response via io.Copy
instead of buffering it entirely; preserve multi-value response headers
(Set-Cookie, Vary, …) by copying slices rather than using Get/Set which
only keeps the first value.
- PutObject logging: fix incorrect logrus key/value usage that was treated
as positional formatting args.
- Rename the get/put method-allow wrappers to requireGET/requirePUT and
correct the wrong doc comment on the former put wrapper.
- Drop the commented-out containsBucket helper.
- parseErrorCode: use errors.As against *awshttp.ResponseError.HTTPStatusCode
instead of regex-scraping err.Error(); no more hot-path regex compilation.
internal/s3
-----------
- ErrorRawResponse.Error() now returns the wrapped error message instead of
the raw HTTP body, preventing accidental leaks of upstream response bodies
through log.Error(err) / error-chain dumps. Callers that specifically need
the raw body (handleGetObjectError) now read the RawResponse field
explicitly.
Verified: golangci-lint clean, go test -race ./... passes, govulncheck
reports no vulnerabilities.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness + efficiency fixes from the branch review.
Router now constructs the S3 client once in New() and stores it
------------------------------------------------------------------
Previously Router.Serve called s3.NewClient for every incoming request.
The AWS SDK client is safe for concurrent use and maintains its own
credentials cache and HTTP connection pool; rebuilding it per request
defeated both, re-registered middleware N times, and amplified
credentials-refresh traffic under load. New now takes a context.Context
used only during client construction, and the built client is reused by
every request.
s3.NewClient takes a context.Context
------------------------------------
The SDK LoadDefaultConfig call is now tied to the caller-supplied
context so startup cancellation / timeout can flow through. main passes
context.Background().
Upstream S3 operations get a bounded timeout
--------------------------------------------
object.get and object.put still use context.WithoutCancel to avoid
aborting an S3 call mid-flight when the client disconnects, but the
detached context is now wrapped with a 2-minute timeout
(s3OperationTimeout constant). This prevents a hung upstream from
producing zombie goroutines and memory pressure.
PutObject no longer computes its own Content-MD5
------------------------------------------------
The AWS SDK is configured with RequestChecksumCalculationWhenRequired,
so it signs a CRC/SHA checksum when S3 demands integrity. The explicit
md5.Sum(body) was redundant, forced a second full hash pass over the
ciphertext buffer, and was the non-obvious second MD5 pointed out in
review (the first being client Content-MD5 validation of the plaintext
body).
Removed the package-level dekTag eager init
-------------------------------------------
dekTag was initialised at package-load time via config.GetDekTagName,
which ran before main called config.LoadConfig — so a non-default
S3PROXY_DEKTAG_NAME env var was silently ignored. Usages now call
config.GetDekTagName() at request time, which resolves after config is
loaded. Tests updated accordingly.
Verified: golangci-lint clean, go test -race ./... passes, govulncheck
reports no vulnerabilities.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related security hardening changes from the review.
PutObject body size cap
-----------------------
PutObject currently buffers the full request body in memory so it can
encrypt it before forwarding to S3. Previously ContentLength was only
validated against the 5 GiB S3 hard ceiling, which let a single concurrent
request allocate up to 5 GiB and trivially OOM the process.
- New config knob S3PROXY_PUTBODY_MAX (bytes), default 256 MiB, capped at
MaxObjectSize. Exposed via config.GetMaxPutBodySize.
- New config.ValidatePutBodySize checks the declared Content-Length against
that cap and returns 413 before we allocate anything.
- readBody now takes an explicit maxBytes argument and wraps the reader
with io.LimitReader so a missing/lying Content-Length cannot trick us
into reading past the cap; exceeded bodies return a sentinel
errBodyTooLarge which handlePutObject translates to 413.
- GetObject read path passes MaxObjectSize as the cap (we still trust
upstream S3 but refuse bodies larger than the S3 hard ceiling).
KEK derivation via HKDF-SHA256 with versioning
----------------------------------------------
generateKEKFromString derived the 32-byte KEK as plain sha256(seed). That
is vulnerable to rainbow-style precomputation against low-entropy seeds
and is below current best practice even with a high-entropy seed.
- New internal/cryptoutil/kek.go introduces KEKProvider, which holds every
supported derivation so that already-stored objects remain readable.
* Legacy (KEKVersionLegacy, ""): sha256(seed) — prior behaviour.
* v1 (KEKVersionV1, "1"): HKDF-SHA256 with a fixed salt and the
info label "s3proxy/kek/v1".
KEKVersionCurrent selects the version used for new writes (v1).
- Each encrypted object now records its KEK version in object metadata
under a new tag ("<dektag>-kek-ver" by default, configurable via
S3PROXY_DEKTAG_KEKVER / config.GetKEKVersionTagName). Absence of the tag
is interpreted as legacy for forward-compatible reads.
- Router stores a cryptoutil.KEKProvider instead of a single [32]byte KEK;
handlers receive the provider and pick the right derivation per object
on read and use Current() on write.
- Tests exercise v1 round-trip, version-tagged reads, mismatched seeds,
and a legacy-object read (no version tag → SHA-256 fallback).
go.mod now promotes golang.org/x/crypto to a direct dependency (already
transitively present) for golang.org/x/crypto/hkdf.
Verified: golangci-lint clean, go test -race ./... passes, govulncheck
reports no vulnerabilities.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related structural changes from the review.
/readyz distinguishes readiness from liveness
---------------------------------------------
Previously /healthz and /readyz both returned 200 unconditionally. The
route now splits:
- /healthz: unchanged — attests that the process is alive.
- /readyz : issues a ListBuckets probe against the upstream S3 endpoint
with a 2-second timeout. Returns 200 on success and 503 with
a short textual reason (no upstream client / upstream
unreachable) otherwise.
New s3.Client.Ping wraps s3client.ListBuckets so Router can exercise both
connectivity and credentials without needing a specific bucket.
Config struct for dependency injection (M3)
-------------------------------------------
config used a package-level koanf instance with free-floating getter
functions, which violates the AGENTS.md rule against package-scope
mutable state and made the package hard to test with alternate config
sources.
Introduce config.Config holding its own *koanf.Koanf, with methods for
every getter (Host, DekTagName, KEKVersionTagName, EncryptKey,
ThrottlingRequestsMax, MaxPutBodySize) plus Validate. config.Load
returns a fresh *Config; new code threads it explicitly. main() now
calls config.Load() + cfg.Validate() and hands *Config to runServer.
The pre-existing package-level GetX helpers and ValidateConfiguration()
are kept as thin shims over an internal Default() Config so that call
sites not yet migrated (router internals) continue to compile and run.
Default() returns an empty Config when LoadConfig has not been run,
which keeps tests that construct handlers directly working.
Verified: golangci-lint clean, go test -race ./... passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unit tests
----------
- config/validation_test.go: table-driven coverage of ValidateBucketName
  (valid / length bounds / leading/trailing dash / uppercase / double
  dots / IPv4-like / underscore), ValidateObjectKey (empty, at-limit,
  over-limit, unicode), ValidateContentLength (zero, positive, S3
  ceiling, negative), and ValidatePutBodySize.
- router/helpers_test.go: covers match path handling, getMetadataHeaders
  (case-insensitive prefix + multi-value join), parseRetentionTime
  (empty/RFC3339/invalid), sha256sum, and the is-unwanted endpoint
  predicates. These helpers had no direct coverage before.
The config package went from 0% to 41% coverage; router from 40.9% to
43.7%. Targeted fills behind the handler surface that previously
required integration tests to exercise.
CI workflow
-----------
.github/workflows/ci.yml complements the existing golangci-lint
workflow with three parallel jobs:
- test: go test -race with coverage artifact upload.
- vuln: go install + govulncheck ./... .
- build: go build -v ./... .
All three share go-version-file: go.mod so a toolchain bump in go.mod
automatically propagates to CI.
Renovate
--------
renovate.json configures:
- config:best-practices baseline + weekly schedule in Europe/Paris.
- Security/CVE alerts auto-merge squash.
- Go patch updates grouped and auto-merged; minor/major held for review.
- AWS SDK v2 modules grouped in a single PR.
- GitHub Actions and container images grouped; actions auto-merge on
  minor/patch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run go work vendor to materialise the module graph under vendor/. Tier C requires a vendored tree so CI builds are reproducible without network access to the Go module proxy, and so supply-chain review (govulncheck -mod=vendor, SBOM via syft/grype) can be run against an exact frozen copy of the build inputs. vendor/ is intentionally not in .gitignore and is committed alongside the code that depends on it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace sirupsen/logrus with the standard library log/slog across the codebase. Loggers now emit JSON to stdout with a service=s3proxy attribute and a shared LevelVar so -level still flips between debug/info/warn/error. Per-request loggers bind request_id via log.With, keeping the field on every emitted record without threading it through call sites. Drops the logrus direct dependency; go work vendor re-sync removes the vendored tree. Existing unit tests continue to pass with a discard slog.Handler.
Add an internal/monitoring package that owns a dedicated Registry and
publishes:
- http_requests_total / http_request_duration_seconds for every served
  request (method, normalised path, status)
- errors_total keyed by a validation class
- service_crashes_total driven by a panic-recover wrapper in the HTTP
  middleware
- s3proxy_encrypt_duration_seconds / s3proxy_decrypt_duration_seconds
  histograms around the crypto hot paths
- s3proxy_upstream_errors_total around S3 GetObject/PutObject/forward
  calls
- s3proxy_throttled_total when the throttler rejects a request
Metrics is threaded through Router, handlers and the throttling
middleware as a nullable pointer so existing tests (which build routers
directly) are unaffected. The proxy HTTP server now registers /metrics
alongside the S3 routes; health probes and metrics scrapes bypass the
throttler to stay observable under load.
Wire OpenTelemetry into the proxy so requests can be correlated across
the ingress and the upstream S3 call:
- internal/tracing sets up a global TracerProvider backed by the
  OTLP/HTTP exporter and a composite TraceContext + Baggage propagator.
  When OTEL_EXPORTER_OTLP_ENDPOINT (or the traces-specific variant) is
  unset the setup is a no-op, so unconfigured environments stay cheap.
- The proxy's top-level HTTP handler is wrapped in otelhttp.NewHandler,
  which extracts the inbound W3C TraceContext, opens a server span, and
  propagates the span through r.Context() to downstream handlers.
- forwardHTTPClient now uses otelhttp.NewTransport so signed requests
  to S3 automatically inject trace headers and produce a client span
  under the inbound request.
- tracing.SpanContextHandler wraps the slog JSON handler and injects
  trace_id/span_id whenever a log record is emitted with a context
  carrying an active span.
Adds unit tests for the handler wrapper's behaviour with and without an
active span.
Add monitoring/ assets so deployments can consume the new metrics
without each environment re-authoring them:
- monitoring/alerts/s3proxy.yaml defines four Prometheus alerts:
  S3ProxyHighErrorRate (5xx share over 5%), S3ProxyHighLatency (p95
  above 2s), S3ProxyServiceDown (scrape target missing) and
  S3ProxyHighCrashRate (handler panics recovered in the last 5m).
- monitoring/dashboards/s3proxy.json defines a Grafana dashboard
  covering RPS per operation, 5xx error ratio, HTTP p50/p95/p99
  latency, crash rate, encrypt/decrypt p95 duration and throttling
  rejection rate. Panels filter on a per-instance template variable.
Add s3proxy/internal/e2e/proxy_e2e_test.go guarded by //go:build e2e.
The test spins up an ephemeral MinIO container via testcontainers-go,
creates a bucket with a direct SDK client, then wires a real Router +
/metrics stack behind an httptest.Server and exercises it from an AWS
SDK client that points at the proxy:
- PutObject round-trip: plaintext flows through the proxy and the
stored object inspected directly on MinIO is opaque ciphertext with
the DEK metadata tag set.
- GetObject round-trip: the proxy decrypts and returns the original
plaintext byte-for-byte.
- /healthz, /readyz and /metrics stay reachable while traffic flows.
Supporting changes:
- Add an S3PROXY_INSECURE toggle (Config.Insecure / GetInsecure) so the
upstream client and request-repackaging use http:// instead of https://
when talking to plaintext MinIO. Default remains https.
- testcontainers-go modules/minio and testcontainers-go are added as
e2e-only dependencies; the main binary does not import them.
The test is explicitly not vendored for the default module graph, so it
must be run with:
GOWORK=off GOFLAGS=-mod=mod go test -tags=e2e ./s3proxy/internal/e2e/...
When Docker is unavailable the test skips cleanly.
Emit a WARN-level slog record at boot when the insecure upstream scheme toggle is active, making it clear the flag is intended for local/test/demo use and must not be enabled in production.
Strengthen the MinIO e2e test so the encrypted-at-rest claim is visible
from -v output and enforced by assertions:
- Log plaintext, raw ciphertext (hex), DEK metadata tag and KEK version
  tag after reading the object directly from MinIO, so `go test -v`
  shows the opaque bytes that land on disk.
- Assert the plaintext does not appear verbatim anywhere in the stored
  payload.
- Assert the stored payload is strictly longer than the plaintext to
  reflect the AES-GCM nonce + authentication tag overhead.
- Assert both the DEK and KEK-version metadata tags are populated.
… bodies
Switch the proxy's upstream S3 client to
aws.RequestChecksumCalculationWhenSupported /
ResponseChecksumValidationWhenSupported so every PutObject persists a
CRC32 on MinIO/S3 and every GetObject can validate the payload
end-to-end. This aligns with the AWS SDK v2 post-2025 defaults and
silences "Response has no supported checksum. Not validating response
payload." log lines in tooling that scrapes s3proxy traffic.
The proxy, however, transforms bodies: PutObject uploads ciphertext and
GetObject returns decrypted plaintext, so the checksum the upstream
computes (over the ciphertext at rest) does not match what the
downstream client sees. Forwarding x-amz-checksum-* in
PutObject/GetObject responses would therefore cause the client SDK to
reject the body. Remove the forwarded
x-amz-checksum-crc32/crc32c/sha1/sha256 headers in both
setGetObjectHeaders and setPutObjectHeaders and explain the reason in
place.
Also bump the e2e MinIO image to RELEASE.2025-09-07T16-13-09Z (the
2025-10 tag does not exist on Docker Hub) and disable response checksum
validation on the test-only SDK clients used to inspect at-rest
ciphertext and to drive the proxy, since neither payload carries a
usable checksum by design.
The module requires Go 1.26 since commit c0e45f5, but the Dockerfile
was still pinned to golang:1.25.3, which caused the ghcr
build-push-action step to fail with:
    go: go.mod requires go >= 1.26 (running go 1.25.3; GOTOOLCHAIN=local)
Pin the build stage to golang:1.26.2 (latest 1.26 patch at time of
writing) so the container image builds again.
Wire the new observability surface into the s3proxy Helm chart so
clusters that install via the chart can opt into metrics, tracing and
alerting without bespoke plumbing.
- deployment.yaml now forwards OTEL_EXPORTER_OTLP_ENDPOINT,
  OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
  from .Values.config, and accepts a free-form .Values.extraEnv list.
- templates/servicemonitor.yaml ships an opt-in ServiceMonitor pointed
  at the existing service/port with a configurable path, scheme, tls
  config and relabeling rules.
- templates/prometheusrule.yaml ships the four s3proxy alerts
  (HighErrorRate, HighLatency, ServiceDown, HighCrashRate) behind a
  single toggle, with per-alert thresholds and "for" durations exposed
  in values.yaml.
- templates/grafana-dashboard.yaml packages dashboards/s3proxy.json as
  a ConfigMap carrying the grafana_dashboard sidecar label when
  .Values.grafanaDashboard.enabled is true.
- charts/s3proxy/dashboards/s3proxy.json mirrors the root
  monitoring/dashboards/s3proxy.json so the ConfigMap can Files.Get it.
All three observability templates default to disabled. Bumped chart +
app version to 1.8.0.
Summary
- log/slog, add CI + Renovate + AGENTS.md/DEVELOPMENT.md.
- /readyz upstream probe, bounded S3 timeouts, single reused S3 client,
  drop redundant MD5 hash pass, quick crypto-pkg rename + fixes.
- /metrics (HTTP, errors, crashes, encrypt/decrypt, upstream errors,
  throttling), OpenTelemetry tracing (OTLP/HTTP, W3C propagation,
  trace_id in slog records), ready-to-use Grafana dashboard + alert
  rules under monitoring/.
- S3PROXY_INSECURE toggle (with a loud startup warn) and a //go:build
  e2e MinIO roundtrip test that proves at-rest encryption.
- Remove x-amz-checksum-* from upstream responses since the proxy
  transforms bodies; flip the upstream SDK to SDK-v2 defaults
  (WhenSupported checksum modes).
Scope
16 commits on chore/modernize-and-review.
Test plan
- golangci-lint run ./... — 0 issues
- go test -race -cover ./... — all packages pass
- govulncheck ./... — no vulnerabilities
- GOWORK=off GOFLAGS=-mod=mod go test -tags=e2e -count=1 -v -run
  TestProxyRoundtripAgainstMinio ./s3proxy/internal/e2e/... — passes
  end-to-end against a MinIO container; ciphertext at rest verified
- S3PROXY_INSECURE toggle and monitoring/ assets before merge; deploy
  the alert rules + dashboard in the target environment
Operator notes
- OTEL_EXPORTER_OTLP_ENDPOINT (or _TRACES_ENDPOINT) enables trace
  export; unset = noop.
- /metrics is served on the same listener as the S3 routes and bypasses
  the throttling middleware along with /healthz and /readyz.
- S3PROXY_INSECURE=1 switches upstream S3 traffic to plaintext HTTP and
  is intended for local/demo/e2e only — logged loudly at startup.
🤖 Generated with Claude Code