Skip to content

Modernize s3proxy: Go 1.26, slog, metrics, traces, hardening#41

Open
ynsta wants to merge 18 commits intomainfrom
chore/modernize-and-review
Open

Modernize s3proxy: Go 1.26, slog, metrics, traces, hardening#41
ynsta wants to merge 18 commits intomainfrom
chore/modernize-and-review

Conversation

@ynsta
Copy link
Copy Markdown
Contributor

@ynsta ynsta commented Apr 22, 2026

Summary

  • Platform & tooling: bump to Go 1.26, drop logrus for stdlib log/slog, add CI + Renovate + AGENTS.md/DEVELOPMENT.md.
  • Hardening: HKDF-versioned KEK, PutObject body cap, /readyz upstream probe, bounded S3 timeouts, single reused S3 client, drop redundant MD5 hash pass, quick crypto-pkg rename + fixes.
  • Observability: Prometheus /metrics (HTTP, errors, crashes, encrypt/decrypt, upstream errors, throttling), OpenTelemetry tracing (OTLP/HTTP, W3C propagation, trace_id in slog records), ready-to-use Grafana dashboard + alert rules under monitoring/.
  • Dev experience: S3PROXY_INSECURE toggle (with a loud startup warn) and a //go:build e2e MinIO roundtrip test that proves at-rest encryption.
  • Fix: stop forwarding x-amz-checksum-* from upstream responses since the proxy transforms bodies; flip the upstream SDK to SDK-v2 defaults (WhenSupported checksum modes).

Scope

16 commits on chore/modernize-and-review:

Commit Summary
c0e45f5 chore: bump Go to 1.26
c2668bc docs: add AGENTS.md and development guide
a91fcd9 refactor: rename crypto pkg and apply quick hardening fixes
34d5afe refactor: reuse S3 client, bound upstream timeouts, drop redundant MD5
401c7ad feat: cap PutObject bodies and switch KEK derivation to HKDF-SHA256
8353ea3 feat: /readyz probes upstream S3; introduce Config struct for DI
aa552c2 test(ci): expand unit coverage and add CI + Renovate
5d79371 chore: vendor dependencies
d261330 refactor: migrate logrus to log/slog
bee9277 feat: expose Prometheus metrics on /metrics
2dbf244 feat: emit OpenTelemetry traces and correlate with slog
afcb316 chore(monitoring): ship alert rules and Grafana dashboard
b084b93 test(e2e): add MinIO roundtrip test behind an e2e build tag
e89bbd6 feat: warn loudly on startup when S3PROXY_INSECURE is enabled
dbb7925 test(e2e): demonstrate at-rest encryption in MinIO roundtrip test
55a8cc7 fix(router): stop forwarding upstream checksum headers on transformed bodies

Test plan

  • golangci-lint run ./... — 0 issues
  • go test -race -cover ./... — all packages pass
  • govulncheck ./... — no vulnerabilities
  • GOWORK=off GOFLAGS=-mod=mod go test -tags=e2e -count=1 -v -run TestProxyRoundtripAgainstMinio ./s3proxy/internal/e2e/... — passes end-to-end against a MinIO container; ciphertext at rest verified
  • Review the new S3PROXY_INSECURE toggle and monitoring/ assets before merge; deploy the alert rules + dashboard in the target environment

Operator notes

  • New optional envs: OTEL_EXPORTER_OTLP_ENDPOINT (or _TRACES_ENDPOINT) enables trace export; unset = noop.
  • /metrics is served on the same listener as the S3 routes and bypasses the throttling middleware along with /healthz//readyz.
  • S3PROXY_INSECURE=1 switches upstream S3 traffic to plaintext HTTP and is intended for local/demo/e2e only — logged loudly at startup.

🤖 Generated with Claude Code

ynsta and others added 18 commits April 22, 2026 18:28
Update go.mod and go.work directives from 1.25.3 to 1.26.
Installed toolchain and latest upstream are both 1.26.2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AGENTS.md captures Tier C rules for AI-assisted development on this repo
(applicable sections only: code quality, vulnerability scanning, vendoring,
testing/architecture, project layout, DI, error handling, logging, metrics,
alerting, dashboards, tracing, Vault secrets, SBOM, Renovate policy).

docs/DEVELOPMENT.md is the accompanying developer workflow guide, adapted
for GitHub (gh CLI) and the s3proxy Go backend with no database.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Package rename
--------------
internal/crypto → internal/cryptoutil to avoid shadowing the Go stdlib
crypto package (revive var-naming, resolves the remaining golangci-lint
finding on main).

cmd/main.go
-----------
- Add ReadHeaderTimeout (10s) on http.Server to mitigate Slowloris (G112).
- Wrap ListenAndServe{TLS} return values with fmt.Errorf for wrapcheck.
- Replace panic(err) on startup with log.WithError(err).Fatal(...) for
  consistent structured JSON failure output.
- Extract log level switch into setLogLevel helper (reduces main's
  cyclomatic complexity and drops stale TODO blocks).
- Remove unused --kms flag and kmsEndpoint field (never wired up).

internal/router
---------------
- handleHealthEndpoints: drop superfluous second WriteHeader(500) that ran
  after Write had already committed the status line.
- handleForwards: replace http.DefaultClient with a named HTTP client with
  Timeout and tuned Transport; stream the upstream response via io.Copy
  instead of buffering it entirely; preserve multi-value response headers
  (Set-Cookie, Vary, …) by copying slices rather than using Get/Set which
  only keeps the first value.
- PutObject logging: fix incorrect logrus key/value usage that was treated
  as positional formatting args.
- Rename the get/put method-allow wrappers to requireGET/requirePUT and
  correct the wrong doc comment on the former put wrapper.
- Drop the commented-out containsBucket helper.
- parseErrorCode: use errors.As against *awshttp.ResponseError.HTTPStatusCode
  instead of regex-scraping err.Error(); no more hot-path regex compilation.

internal/s3
-----------
- ErrorRawResponse.Error() now returns the wrapped error message instead of
  the raw HTTP body, preventing accidental leaks of upstream response bodies
  through log.Error(err) / error-chain dumps. Callers that specifically need
  the raw body (handleGetObjectError) now read the RawResponse field
  explicitly.

Verified: golangci-lint clean, go test -race ./... passes, govulncheck
reports no vulnerabilities.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Correctness + efficiency fixes from the branch review.

Router now constructs the S3 client once in New() and stores it
------------------------------------------------------------------
Previously Router.Serve called s3.NewClient for every incoming request.
The AWS SDK client is safe for concurrent use and maintains its own
credentials cache and HTTP connection pool; rebuilding it per request
defeated both, re-registered middleware N times, and amplified
credentials-refresh traffic under load. New now takes a context.Context
used only during client construction, and the built client is reused by
every request.

s3.NewClient takes a context.Context
------------------------------------
The SDK LoadDefaultConfig call is now tied to the caller-supplied
context so startup cancellation / timeout can flow through. main passes
context.Background().

Upstream S3 operations get a bounded timeout
--------------------------------------------
object.get and object.put still use context.WithoutCancel to avoid
aborting an S3 call mid-flight when the client disconnects, but the
detached context is now wrapped with a 2-minute timeout
(s3OperationTimeout constant). This prevents a hung upstream from
producing zombie goroutines and memory pressure.

PutObject no longer computes its own Content-MD5
------------------------------------------------
The AWS SDK is configured with RequestChecksumCalculationWhenRequired,
so it signs a CRC/SHA checksum when S3 demands integrity. The explicit
md5.Sum(body) was redundant, forced a second full hash pass over the
ciphertext buffer, and was the non-obvious second MD5 pointed out in
review (the first being client Content-MD5 validation of the plaintext
body).

Removed the package-level dekTag eager init
-------------------------------------------
dekTag was initialised at package-load time via config.GetDekTagName,
which ran before main called config.LoadConfig — so a non-default
S3PROXY_DEKTAG_NAME env var was silently ignored. Usages now call
config.GetDekTagName() at request time, which resolves after config is
loaded. Tests updated accordingly.

Verified: golangci-lint clean, go test -race ./... passes, govulncheck
reports no vulnerabilities.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related security hardening changes from the review.

PutObject body size cap
-----------------------
PutObject currently buffers the full request body in memory so it can
encrypt it before forwarding to S3. Previously ContentLength was only
validated against the 5 GiB S3 hard ceiling, which let a single concurrent
request allocate up to 5 GiB and trivially OOM the process.

- New config knob S3PROXY_PUTBODY_MAX (bytes), default 256 MiB, capped at
  MaxObjectSize. Exposed via config.GetMaxPutBodySize.
- New config.ValidatePutBodySize checks the declared Content-Length against
  that cap and returns 413 before we allocate anything.
- readBody now takes an explicit maxBytes argument and wraps the reader
  with io.LimitReader so a missing/lying Content-Length cannot trick us
  into reading past the cap; exceeded bodies return a sentinel
  errBodyTooLarge which handlePutObject translates to 413.
- GetObject read path passes MaxObjectSize as the cap (we still trust
  upstream S3 but refuse bodies larger than the S3 hard ceiling).

KEK derivation via HKDF-SHA256 with versioning
----------------------------------------------
generateKEKFromString derived the 32-byte KEK as plain sha256(seed). That
is vulnerable to rainbow-style precomputation against low-entropy seeds
and is below current best practice even with a high-entropy seed.

- New internal/cryptoutil/kek.go introduces KEKProvider, which holds every
  supported derivation so that already-stored objects remain readable.
  * Legacy (KEKVersionLegacy, ""): sha256(seed) — prior behaviour.
  * v1 (KEKVersionV1, "1"):        HKDF-SHA256 with a fixed salt and the
                                   info label "s3proxy/kek/v1".
  KEKVersionCurrent selects the version used for new writes (v1).
- Each encrypted object now records its KEK version in object metadata
  under a new tag ("<dektag>-kek-ver" by default, configurable via
  S3PROXY_DEKTAG_KEKVER / config.GetKEKVersionTagName). Absence of the tag
  is interpreted as legacy for forward-compatible reads.
- Router stores a cryptoutil.KEKProvider instead of a single [32]byte KEK;
  handlers receive the provider and pick the right derivation per object
  on read and use Current() on write.
- Tests exercise v1 round-trip, version-tagged reads, mismatched seeds,
  and a legacy-object read (no version tag → SHA-256 fallback).

go.mod now promotes golang.org/x/crypto to a direct dependency (already
transitively present) for golang.org/x/crypto/hkdf.

Verified: golangci-lint clean, go test -race ./... passes, govulncheck
reports no vulnerabilities.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related structural changes.

/readyz distinguishes readiness from liveness
---------------------------------------------
Previously /healthz and /readyz both returned 200 unconditionally. The
route now splits:
- /healthz: unchanged — attests that the process is alive.
- /readyz : issues a ListBuckets probe against the upstream S3 endpoint
           with a 2-second timeout. Returns 200 on success and 503 with
           a short textual reason (no upstream client / upstream
           unreachable) otherwise.

New s3.Client.Ping wraps s3client.ListBuckets so Router can exercise both
connectivity and credentials without needing a specific bucket.

Config struct for dependency injection (M3)
-------------------------------------------
config used a package-level koanf instance with free-floating getter
functions, which violates the AGENTS.md rule against package-scope
mutable state and made the package hard to test with alternate config
sources.

Introduce config.Config holding its own *koanf.Koanf, with methods for
every getter (Host, DekTagName, KEKVersionTagName, EncryptKey,
ThrottlingRequestsMax, MaxPutBodySize) plus Validate. config.Load
returns a fresh *Config; new code threads it explicitly. main() now
calls config.Load() + cfg.Validate() and hands *Config to runServer.

The pre-existing package-level GetX helpers and ValidateConfiguration()
are kept as thin shims over an internal Default() Config so that call
sites not yet migrated (router internals) continue to compile and run.
Default() returns an empty Config when LoadConfig has not been run,
which keeps tests that construct handlers directly working.

Verified: golangci-lint clean, go test -race ./... passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unit tests
----------
- config/validation_test.go: table-driven coverage of ValidateBucketName
  (valid / length bounds / leading/trailing dash / uppercase / double
  dots / IPv4-like / underscore), ValidateObjectKey (empty, at-limit,
  over-limit, unicode), ValidateContentLength (zero, positive, S3
  ceiling, negative), and ValidatePutBodySize.
- router/helpers_test.go: covers match path handling, getMetadataHeaders
  (case-insensitive prefix + multi-value join), parseRetentionTime
  (empty/RFC3339/invalid), sha256sum, and the is-unwanted endpoint
  predicates. These helpers had no direct coverage before.

The config package went from 0% to 41% coverage; router from 40.9% to
43.7%. Targeted fills behind the handler surface that previously
required integration tests to exercise.

CI workflow
-----------
.github/workflows/ci.yml complements the existing golangci-lint
workflow with three parallel jobs:
- test: go test -race with coverage artifact upload.
- vuln: go install + govulncheck ./... .
- build: go build -v ./... .
All three share go-version-file: go.mod so a toolchain bump in go.mod
automatically propagates to CI.

Renovate
--------
renovate.json configures:
- config:best-practices baseline + weekly schedule in Europe/Paris.
- Security/CVE alerts auto-merge squash.
- Go patch updates grouped and auto-merged; minor/major held for review.
- AWS SDK v2 modules grouped in a single PR.
- GitHub Actions and container images grouped; actions auto-merge on
  minor/patch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run go work vendor to materialise the module graph under vendor/.
Tier C requires a vendored tree so CI builds are reproducible without
network access to the Go module proxy, and so supply-chain review
(govulncheck -mod=vendor, SBOM via syft/grype) can be run against an
exact frozen copy of the build inputs.

vendor/ is intentionally not in .gitignore and is committed alongside
the code that depends on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace sirupsen/logrus with the standard library log/slog across the
codebase. Loggers now emit JSON to stdout with a service=s3proxy attribute
and a shared LevelVar so -level still flips between debug/info/warn/error.
Per-request loggers bind request_id via log.With, keeping the field on
every emitted record without threading it through call sites.

Drops the logrus direct dependency; go work vendor re-sync removes the
vendored tree. Existing unit tests continue to pass with a discard
slog.Handler.
Add an internal/monitoring package that owns a dedicated Registry and
publishes:
- http_requests_total / http_request_duration_seconds for every served
  request (method, normalised path, status)
- errors_total keyed by a validation class
- service_crashes_total driven by a panic-recover wrapper in the HTTP
  middleware
- s3proxy_encrypt_duration_seconds / s3proxy_decrypt_duration_seconds
  histograms around the crypto hot paths
- s3proxy_upstream_errors_total around S3 GetObject/PutObject/forward
  calls
- s3proxy_throttled_total when the throttler rejects a request

Metrics is threaded through Router, handlers and the throttling
middleware as a nullable pointer so existing tests (which build routers
directly) are unaffected. The proxy HTTP server now registers /metrics
alongside the S3 routes; health probes and metrics scrapes bypass the
throttler to stay observable under load.
Wire OpenTelemetry into the proxy so requests can be correlated across
the ingress and the upstream S3 call:

- internal/tracing sets up a global TracerProvider backed by the
  OTLP/HTTP exporter and a composite TraceContext + Baggage propagator.
  When OTEL_EXPORTER_OTLP_ENDPOINT (or the traces-specific variant) is
  unset the setup is a no-op, so unconfigured environments stay cheap.
- The proxy's top-level HTTP handler is wrapped in otelhttp.NewHandler,
  which extracts the inbound W3C TraceContext, opens a server span, and
  propagates the span through r.Context() to downstream handlers.
- forwardHTTPClient now uses otelhttp.NewTransport so signed requests to
  S3 automatically inject trace headers and produce a client span under
  the inbound request.
- tracing.SpanContextHandler wraps the slog JSON handler and injects
  trace_id/span_id whenever a log record is emitted with a context
  carrying an active span.

Adds unit tests for the handler wrapper's behaviour with and without an
active span.
Add monitoring/ assets so deployments can consume the new metrics
without each environment re-authoring them:

- monitoring/alerts/s3proxy.yaml defines four Prometheus alerts:
  S3ProxyHighErrorRate (5xx share over 5%), S3ProxyHighLatency (p95
  above 2s), S3ProxyServiceDown (scrape target missing) and
  S3ProxyHighCrashRate (handler panics recovered in the last 5m).
- monitoring/dashboards/s3proxy.json defines a Grafana dashboard
  covering RPS per operation, 5xx error ratio, HTTP p50/p95/p99
  latency, crash rate, encrypt/decrypt p95 duration and throttling
  rejection rate. Panels filter on a per-instance template variable.
Add s3proxy/internal/e2e/proxy_e2e_test.go guarded by //go:build e2e.
The test spins up an ephemeral MinIO container via testcontainers-go,
creates a bucket with a direct SDK client, then wires a real Router +
/metrics stack behind an httptest.Server and exercises it from an AWS
SDK client that points at the proxy:

- PutObject round-trip: plaintext flows through the proxy and the
  stored object inspected directly on MinIO is opaque ciphertext with
  the DEK metadata tag set.
- GetObject round-trip: the proxy decrypts and returns the original
  plaintext byte-for-byte.
- /healthz, /readyz and /metrics stay reachable while traffic flows.

Supporting changes:
- Add an S3PROXY_INSECURE toggle (Config.Insecure / GetInsecure) so the
  upstream client and request-repackaging use http:// instead of https://
  when talking to plaintext MinIO. Default remains https.
- testcontainers-go modules/minio and testcontainers-go are added as
  e2e-only dependencies; the main binary does not import them.

The test is explicitly not vendored for the default module graph, so it
must be run with:

    GOWORK=off GOFLAGS=-mod=mod go test -tags=e2e ./s3proxy/internal/e2e/...

When Docker is unavailable the test skips cleanly.
Emit a WARN-level slog record at boot when the insecure upstream
scheme toggle is active, making it clear the flag is intended for
local/test/demo use and must not be enabled in production.
Strengthen the MinIO e2e test so the encrypted-at-rest claim is visible
from -v output and enforced by assertions:

- Log plaintext, raw ciphertext (hex), DEK metadata tag and KEK version
  tag after reading the object directly from MinIO, so `go test -v`
  shows the opaque bytes that land on disk.
- Assert the plaintext does not appear verbatim anywhere in the stored
  payload.
- Assert the stored payload is strictly longer than the plaintext to
  reflect the AES-GCM nonce + authentication tag overhead.
- Assert both the DEK and KEK-version metadata tags are populated.
… bodies

Switch the proxy's upstream S3 client to
aws.RequestChecksumCalculationWhenSupported / ResponseChecksumValidationWhenSupported
so every PutObject persists a CRC32 on MinIO/S3 and every GetObject can
validate the payload end-to-end. This aligns with the AWS SDK v2 post-2025
defaults and silences "Response has no supported checksum. Not validating
response payload." log lines in tooling that scrapes s3proxy traffic.

The proxy, however, transforms bodies: PutObject uploads ciphertext and
GetObject returns decrypted plaintext, so the checksum the upstream
computes (over the ciphertext at rest) does not match what the downstream
client sees. Forwarding x-amz-checksum-* in PutObject/GetObject responses
would therefore cause the client SDK to reject the body.

Remove the forwarded x-amz-checksum-crc32/crc32c/sha1/sha256 headers in
both setGetObjectHeaders and setPutObjectHeaders and explain the reason in
place.

Also bump the e2e MinIO image to RELEASE.2025-09-07T16-13-09Z (the 2025-10
tag does not exist on Docker Hub) and disable response checksum validation
on the test-only SDK clients used to inspect at-rest ciphertext and to
drive the proxy, since neither payload carries a usable checksum by design.
The module requires Go 1.26 since commit c0e45f5, but the Dockerfile was
still pinned to golang:1.25.3, which caused the ghcr build-push-action
step to fail with:

    go: go.mod requires go >= 1.26 (running go 1.25.3; GOTOOLCHAIN=local)

Pin the build stage to golang:1.26.2 (latest 1.26 patch at time of
writing) so the container image builds again.
Wire the new observability surface into the s3proxy Helm chart so
clusters that install via the chart can opt into metrics, tracing and
alerting without bespoke plumbing.

- deployment.yaml now forwards OTEL_EXPORTER_OTLP_ENDPOINT,
  OTEL_EXPORTER_OTLP_TRACES_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS
  from .Values.config, and accepts a free-form .Values.extraEnv list.
- templates/servicemonitor.yaml ships an opt-in ServiceMonitor pointed
  at the existing service/port with a configurable path, scheme, tls
  config and relabeling rules.
- templates/prometheusrule.yaml ships the four s3proxy alerts
  (HighErrorRate, HighLatency, ServiceDown, HighCrashRate) behind a
  single toggle, with per-alert thresholds and "for" durations exposed
  in values.yaml.
- templates/grafana-dashboard.yaml packages dashboards/s3proxy.json as
  a ConfigMap carrying the grafana_dashboard sidecar label when
  .Values.grafanaDashboard.enabled is true.
- charts/s3proxy/dashboards/s3proxy.json mirrors the root
  monitoring/dashboards/s3proxy.json so the ConfigMap can Files.Get it.

All three observability templates default to disabled. Bumped chart +
app version to 1.8.0.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant