task: tighten AUTO read-ahead wait-clear to match MDI (#3650) #3971
grandixximo wants to merge 1 commit into LinuxCNC:master from
Conversation
Auto mode cleared the read-ahead wait on `execState==DONE` alone, so a G38 issued just before an `emcMotionUpdate` snapshot could let the next read fire while `motion.traj.queue` was still non-zero, tripping `NCE_QUEUE_IS_NOT_EMPTY_AFTER_PROBING`. Add the `queue==0` and `io.status==DONE` conjuncts already used by `mdi_execute_hook`. Closes LinuxCNC#3650, LinuxCNC#662, LinuxCNC#263.
Not sure I really qualify to comment, but the explanation makes sense.

Need testers; I will try to post in the forum as well.

I'll send a 2.9 PR just so I can have debs built. Users on the forum want 2.9 to test the fix.
I'm slightly confused: does this issue occur when the probe move starts, or when it ends? The error message implies it is at the end; that is, currently it is possible that a state sync check at the end of the probe move passes when it should not. This makes sense to me as an explanation that also matches the error message (but of course it might not be correct).

But when you say that:

Are you saying that the error actually occurs during a state sync BEFORE the probe move takes place, instead of AFTER? Or are you talking about the END of the probe move, with the following move being the one in question? Either way, are you saying that the race condition is that the move is ADDED prematurely, not that the condition passed prematurely because the previous move had not been fully REMOVED? Surely the move wouldn't be added prematurely, because the addition of the move to the interpreter queue would itself be gated behind the state sync check.
Or are you saying that is the race condition: that upon completion of the probe move there is a brief period where moves can be added to the queue before the sync is commanded? This could allow the next move to be added to the queue before the sync takes place, causing the current sync logic to fail because the next move is already in the queue despite the machine otherwise being in a state where the sync should pass. In that case, while tightening up this logic might be a fix, we'd ideally also want to fix the issue closer to its origin by preventing moves which require a sync beforehand from being added to the queue before a sync is confirmed to have occurred.
Just to show where the error message originates: the "Queue is not empty after probing" message is defined in linuxcnc/src/emc/rs274ngc/rs274ngc_return.hh (line 133 at 2378edb), and it is raised in linuxcnc/src/emc/rs274ngc/rs274ngc_pre.cc (line 1448 at 2378edb). The message is triggered when the interpreter is commanded to read an input: if `probe_flag` is true and the queue is not empty, the error occurs. So the error definitely seems to occur AFTER a probe move.
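For readers who haven't traced the interpreter source, the guard described above can be modeled as a minimal sketch. This is not the real `Interp` class; the struct and field names here are illustrative stand-ins for the state the `CHKS(...)` macro inspects in `read_inputs`:

```cpp
#include <stdexcept>

// Minimal model (not the real Interp class) of the guard in
// rs274ngc_pre.cc that emits NCE_QUEUE_IS_NOT_EMPTY_AFTER_PROBING.
// Field names are simplified stand-ins.
struct InterpModel {
    bool probe_flag;    // set when a G38 probe has just completed
    int  queue_depth;   // pending interpreter/motion queue entries

    // Called when the interpreter is commanded to read an input.
    void read_inputs() const {
        if (probe_flag && queue_depth != 0)
            throw std::runtime_error("Queue is not empty after probing");
    }
};
```

The point the comment makes follows directly: the check only fires with `probe_flag` set, which can only be true after a probe move, so the error is necessarily raised AFTER probing.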
Fixes the long-standing "Queue is not empty after probing" error.
`mdi_execute_hook` has, since 2012, gated wait-clear on `execState==DONE && motion.traj.queue==0 && io.status==DONE`. `readahead_reading` (AUTO mode) checks only `execState==DONE`, on the assumption that the prior `EMC_TASK_PLAN_SYNCH` precondition (WAITING_FOR_MOTION_AND_IO) had already drained motion. That assumption breaks across cycles:
`emcTaskIssueCommand` writes `EMC_TRAJ_PROBE` to motion via shared memory; then the same task cycle's `emcMotionUpdate` may snapshot `emcmotStatus` before the servo loop has added the new segment. `motion.status` reads `DONE` (stale), the SYNCH precondition transitions immediately, and the next `Interp::_read` runs `read_inputs` against a queue that is now non-empty. CHKS fires.

Patch: add the same two conjuncts to the AUTO check.
Closes #3650, #662, #263.
Complementary to #3537 (already merged): that fix handled probe re-trip during deceleration. This one handles the inter-command snapshot-lag race that lets the same family of bugs leak through.
cc @andypugh, @DauntlessAq, @rmu75, @BsAtHome: would value review on the wait-clear semantics.
cc @tiket18, @Cromaglious: could you test against your reproducing workload? Built artifacts can be used for testing.
Test plan