libnvme: fix NULL handle dereference in discovery and identify paths#3341
libnvme: fix NULL handle dereference in discovery and identify paths#3341martin-belanger wants to merge 1 commit intolinux-nvme:masterfrom
Conversation
libnvme_ctrl_get_transport_handle() opens the device lazily and returns NULL if the open fails. This can happen when a udev-triggered nvmf-connect@.service fires for a discovery controller that is already in the process of being removed: the device is still visible in sysfs at scan time but the kernel returns EAGAIN on open because the controller is no longer in LIVE state. None of the callers checked the returned handle before passing it to libnvme_get_log() or libnvme_submit_admin_passthru(), causing a segmentation fault. Add NULL handle guards with appropriate log messages in nvme_discovery_log(), nvmf_dim(), and libnvme_ctrl_identify(). Add a silent safety net with a comment in libnvme_get_log() itself as a last line of defence against future callers that omit the check. Signed-off-by: Martin Belanger <martin.belanger@dell.com> Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
0aabdf9 to
ca62b42
Compare
|
Ah that's how you were able to get a NULL hdl. I think we have way more places where a check would be necessary. Maybe instead testing everywhere for a NULL hdl we could just do it where we actually are using it, in the pass through code? Where did it crash, in libnvme_submit_admin_passthru? |
|
The crash was in I agree with your suggestion. Centralizing the guard in the passthrough layer is cleaner and more future-proof. The current patch already adds the safety net in Let me know if you'd prefer: (a) keep as-is, (b) safety nets only in the passthrough functions with no per-site checks, or (c) both. |
|
I think we should have the check only in ioctl.c to keep it inside the 'ioctl' layer. I think we could push if (hdl->uring_enabled)
ret = libnvme_submit_admin_passthru_async(hdl, cmd);
else
ret = libnvme_submit_admin_passthru(hdl, cmd);and if (hdl->uring_enabled) {
ret = libnvme_wait_complete_passthru(hdl);
if (ret)
return ret;
}down and only call |
|
Understood — that makes sense architecturally. A few clarifying questions before I rework this:
The current patch fixes the immediate crash. I'm happy to do the larger refactor here or in a follow-up — just want to make sure we agree on scope before I start. |
|
I'd say we refactor first the code and then add the null pointer checks. That should avoid a bit of code churn. Yes, I was thinking of just having |
Problem
libnvme_ctrl_get_transport_handle()uses lazy open semantics: it opens/dev/nvmeXon first use and returnsNULLif the open fails. None of the callers validated the returned handle before passing it down the call chain, leading to a segmentation fault.The crash was observed in a
nvmf-connect@.serviceinstance triggered by udevd. The service fires when a discovery controller emits an AEN, but udevd can process the event after the controller has already begun teardown. By that point the kernel returnsEAGAINfromopen()because the controller is no longer inLIVEstate,libnvme_open()fails, and the NULL handle is eventually dereferenced inlibnvme_get_log().Fix
nvme_discovery_log()— guard afterlibnvme_ctrl_get_transport_handle()with aLIBNVME_LOG_DEBUGmessage to aid tracingnvmf_dim()— same guard, placed after the existing pre-condition checks wherectxis known valid, with aLIBNVME_LOG_ERRmessage consistent with the surrounding stylelibnvme_ctrl_identify()— same guard with aLIBNVME_LOG_DEBUGmessagelibnvme_get_log()— silent safety net with a comment as a last line of defence for any future caller that omits the checkTesting
Verified by running nvme-stas tests that rapidly create and delete NVMe devices, which reliably triggered the crash before this fix. No crash observed after applying the patch.