Skip to content

feat(podman): make container health check interval configurable#1833

Open
sshnaidm wants to merge 2 commits into
NVIDIA:mainfrom
sshnaidm:1832-configurable-healthcheck-interval/sshnaidm
Open

feat(podman): make container health check interval configurable#1833
sshnaidm wants to merge 2 commits into
NVIDIA:mainfrom
sshnaidm:1832-configurable-healthcheck-interval/sshnaidm

Conversation

@sshnaidm

@sshnaidm sshnaidm commented Jun 9, 2026

Copy link
Copy Markdown

Summary

Make the Podman driver's container health check interval configurable instead of hardcoded at 3 seconds. The short default spawns a conmon subprocess every tick per container, creating sustained process churn and unnecessary CPU overhead on systems running multiple sandboxes.

Related Issue

Closes #1832

Changes

  • Add health_check_interval_secs field to PodmanComputeConfig (default: 10s)
  • Wire the config value into the health check spec in build_container_spec()
  • Backfill the new field via ..PodmanComputeConfig::default() in the standalone driver binary
  • Document the new field in docs/reference/gateway-config.mdx

Testing

  • mise run pre-commit passes (excluding pre-existing test:python grpc version mismatch)
  • Unit tests added: default_config_sets_health_check_interval, container_spec_healthcheck_interval_from_config
  • E2E tests added/updated (if applicable)
  • Deployed patched binary locally, verified podman inspect shows configured interval

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

sshnaidm added 2 commits June 9, 2026 11:52
The Podman driver hardcoded a 3-second health check interval, which
spawns a conmon subprocess on every tick. On systems running multiple
sandboxes this creates sustained process churn and unnecessary CPU
overhead.

Add a `health_check_interval_secs` field to `PodmanComputeConfig`
(default: 10s) and wire it into the container health check spec.
Operators can tune it further via `[openshell.drivers.podman]` in
gateway.toml.

Signed-off-by: Sagi Shnaidman <sshnaidm@redhat.com>
Signed-off-by: Sagi Shnaidman <sshnaidm@redhat.com>
@sshnaidm sshnaidm requested a review from a team as a code owner June 9, 2026 09:04
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Thank you for your interest in contributing to OpenShell, @sshnaidm.

This project uses a vouch system for first-time contributors. Before submitting a pull request, you need to be vouched by a maintainer.

To get vouched:

  1. Open a Vouch Request discussion.
  2. Describe what you want to change and why.
  3. Write in your own words — do not have an AI generate the request.
  4. A maintainer will comment /vouch if approved.
  5. Once vouched, open a new PR (preferred) or reopen this one after a few minutes.

See CONTRIBUTING.md for details.

@github-actions github-actions Bot closed this Jun 9, 2026
@sshnaidm

sshnaidm commented Jun 9, 2026

Copy link
Copy Markdown
Author

I have read the DCO document and I hereby sign the DCO.

@TaylorMutch TaylorMutch reopened this Jun 9, 2026
@TaylorMutch TaylorMutch requested a review from mrunalp as a code owner June 9, 2026 13:42
@TaylorMutch

Copy link
Copy Markdown
Collaborator

/ok to test 1d4d7ae

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Label test:e2e applied for 1d4d7ae. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@TaylorMutch TaylorMutch left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we validate the case for when the interval is set to 0?

Also this affects how quickly we check that the container is up and ready from the driver perspective; increasing to 10s doesn't seem like a lot but could be noticeable, I am okay with this change tho if we can validate the behavior when it's set to 0.

Thanks for the submission! If we can address my question above this LGTM

@TaylorMutch TaylorMutch self-assigned this Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: hardcoded 3-second health check interval causes excessive process churn

2 participants