
libnvme: fix NULL handle dereference in discovery and identify paths (#3341)

Open
martin-belanger wants to merge 1 commit into linux-nvme:master from martin-belanger:fix-libnvme-null-hdl

Conversation

@martin-belanger

Problem

libnvme_ctrl_get_transport_handle() uses lazy open semantics: it opens /dev/nvmeX on first use and returns NULL if the open fails. None of the callers validated the returned handle before passing it down the call chain, leading to a segmentation fault.

The crash was observed in an nvmf-connect@.service instance triggered by udevd. The service fires when a discovery controller emits an AEN, but udevd can process the event after the controller has already begun teardown. By that point the kernel returns EAGAIN from open() because the controller is no longer in LIVE state, libnvme_open() fails, and the NULL handle is eventually dereferenced in libnvme_get_log().

Fix

  • nvme_discovery_log() — guard after libnvme_ctrl_get_transport_handle() with a LIBNVME_LOG_DEBUG message to aid tracing
  • nvmf_dim() — same guard, placed after the existing pre-condition checks where ctx is known valid, with a LIBNVME_LOG_ERR message consistent with the surrounding style
  • libnvme_ctrl_identify() — same guard with a LIBNVME_LOG_DEBUG message
  • libnvme_get_log() — silent safety net with a comment as a last line of defence for any future caller that omits the check

Testing

Verified by running nvme-stas tests that rapidly create and delete NVMe devices, which reliably triggered the crash before this fix. No crash observed after applying the patch.

libnvme_ctrl_get_transport_handle() opens the device lazily and returns
NULL if the open fails. This can happen when a udev-triggered
nvmf-connect@.service fires for a discovery controller that is already
in the process of being removed: the device is still visible in sysfs
at scan time but the kernel returns EAGAIN on open because the
controller is no longer in LIVE state.

None of the callers checked the returned handle before passing it to
libnvme_get_log() or libnvme_submit_admin_passthru(), causing a
segmentation fault.

Add NULL handle guards with appropriate log messages in
nvme_discovery_log(), nvmf_dim(), and libnvme_ctrl_identify(). Add a
silent safety net with a comment in libnvme_get_log() itself as a last
line of defence against future callers that omit the check.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
@martin-belanger force-pushed the fix-libnvme-null-hdl branch from 0aabdf9 to ca62b42 on May 5, 2026 at 22:09
@igaw
Collaborator

igaw commented May 7, 2026

Ah, that's how you were able to get a NULL hdl. I think we have many more places where a check would be necessary. Maybe instead of testing for a NULL hdl everywhere, we could just do it where we actually use it, in the passthru code? Where did it crash — in libnvme_submit_admin_passthru?

@martin-belanger
Author

The crash was in libnvme_get_log() at nvme-cmds.c:88, not in libnvme_submit_admin_passthru. The hdl was passed through directly from nvme_discovery_log() without a check.

I agree with your suggestion. Centralizing the guard in the passthrough layer is cleaner and more future-proof. The current patch already adds the safety net in libnvme_get_log(). If you also want a guard in libnvme_submit_admin_passthru(), I can add it there and drop the individual call-site checks — though I'd prefer to keep the per-site LIBNVME_LOG_DEBUG messages since they make it easy to trace which code path triggered the NULL (especially useful during rapid device teardown).

Let me know if you'd prefer: (a) keep as-is, (b) safety nets only in the passthrough functions with no per-site checks, or (c) both.

@igaw
Collaborator

igaw commented May 7, 2026

I think we should have the check only in ioctl.c to keep it inside the 'ioctl' layer. I think we could push

		if (hdl->uring_enabled)
			ret = libnvme_submit_admin_passthru_async(hdl, cmd);
		else
			ret = libnvme_submit_admin_passthru(hdl, cmd);

and

	if (hdl->uring_enabled) {
		ret = libnvme_wait_complete_passthru(hdl);
		if (ret)
			return ret;
	}

down into the ioctl layer, so callers only ever call libnvme_submit_admin_passthru and always libnvme_submit_admin_complete (or whatever the name ends up being). That way we don't have random checks for a NULL hdl scattered around.

@martin-belanger
Author

Understood — that makes sense architecturally. A few clarifying questions before I rework this:

  1. For the uring refactor: would the unified call remain named libnvme_submit_admin_passthru() (routing uring vs. ioctl internally), or do you have a different name in mind for the new combined entry point?
  2. Is this refactoring in scope for this PR, or should we merge the current minimal fix now and follow up with the uring unification separately?

The current patch fixes the immediate crash. I'm happy to do the larger refactor here or in a follow-up — just want to make sure we agree on scope before I start.

@igaw
Collaborator

igaw commented May 8, 2026

I'd say we refactor the code first and then add the NULL pointer checks. That should avoid a bit of code churn.

Yes, I was thinking of having just libnvme_submit_admin_passthru and libnvme_wait_complete_passthru as the API. Come to think of it, that means we should also touch every call site of libnvme_submit_admin_passthru to pair it with libnvme_wait_complete_passthru. Only then will the io_uring API be used correctly throughout the code base.
