WIP: Rtapi cleanup v2#3919
Conversation
|
Done:
Still open:
And
Hmm, I don't understand makefiles in depth. I just copied what was already there a few lines below and edited it to match my lib. As much as I understand this, it just adds this flags to the global EXTRAFLAGS in Makefile for one compile command only. I did not see any duplicated flags. |
|
Works here on Gentoo, rip, system install, clang and gcc builds, whole shebang! edit: Have not yet tested with RTAI. |
|
Thanks for testing! I renamed the library's to: A bit long names, but I think it is fine. But i'm open for other suggestions. Additionally, I reduced the globals. I tested all 5 configurations in a VM and they all work. There are two issues but I don't think they are due to this PR:
|
|
@NTULINUX: Do you have a test setup you can share or is it all just manually setup? A series of docker files would be nice, so different OS can be tested if something starts. I messed my VM up slightly by using |
|
I have VM images up right now but they're in flux. I'm going to post new VMs with all the right fixes in a bit. Currently tracking this PR here: My ebuilds at the moment are broken but we're in the middle of sorting it all out for good. Will share link to new VM soon with everything tied together, cleanly. |
| break; | ||
| if (i == 0) | ||
| srand48(t0.tv_sec ^ t0.tv_usec); | ||
| usleep(lrand48() % 100000); |
There was a problem hiding this comment.
Can there be unforeseen interaction with SIGALRM here? I don't think you want to involve signals. Also, wouldn't there be a minimum wait wanted here? A random wait can return close to zero all the time and you only loop three times.
There was a problem hiding this comment.
That one looks funny. Signal will break it, the for loop is only 0.3s (100000 != 1000000) and gettimeofday can fail if you change the system time. Improved.
I did not find a function for diff of timespec, so I added one above main.
There was a problem hiding this comment.
BTW: That one could also be simplified by just using (now->tv_sec - start->tv_sec) < 3 but this would be something between 2 and 4s.
|
Some discussion in the Sunday video meet-up has suggested that we should look at incorporating this into the rtapi cleanup: |
|
I commented and corrected the things I have changed. I always prefer to have refactoring (moving code around) and bug fixing/functional changes as separated as possible. That makes review and testing easier, you just check that the moved code arrived as it was and so avoids having merge conflicts due to the branch staying open for to long. But I have to admit, I also did some changes that I just was not able to leave it as it was and which simplified moving code due to removed dependency's and of course created a bug doing so. |
|
See: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html for the callyou are allowed to make in signal handlers. |
Well, that is a good question. You are doing "cleanup" and that would imply refactor and fixes, IMO.
That is a possibility too, but there are soooo (key got stuck) many problems with this code that it is a good question why not hit those
And I appreciate the changes! They are very necessary. Before we're done it needs to be tested,of course. But when the code gets better structured, then that should also become easier, I hope. But, if you want to split it, then I think we need to have a very clear split where you move without actual changes and then refactor. My opinion is that moving code also classifies as refactoring and therefore it would be a missed opportunity if not fixed in one go. |
|
BTW, I've been wanting to vacuum this code for a long time but have not gotten arround to that point yet. You know, INI-file reader just done, HAL types/access revamp in the pipeline, tcl9 stuff that still needs fixing, build system cleanup, and so on. (and also got CI's -Werror merged) |
Might be I did not specify my (initial) intentions well enough. I just found the rtapi_uspace hard to manage while implementing xenomai and wanted to split it up without changing the existing code if not needed. But that diverged anyway already a bit. With the new structure, it should also be easier to implement rtapi_udp_sendto() or something similar needed for RTNet.
I have to look into it a bit more next week if I find time. But from my point of view it would make sense to split the more involved changes in a new PR or also two. But there is also always some testing effort behind which I don't know how much you have to do from your side. Also there is: #3925 I was now wondering if there is not a better solution for this. Often I use the following pattern to define the function and the matching type in the same header and then from then on only FUN_T* for function pointers. So you naturally change both. By some define magic, it should be possible do do both in one, but then it gets hard to read. You know of a good way to test if both definitions match? It would also be nicer to check at compile time if a .so has the needed main/exit functions. |
44ab357 to
f165b58
Compare
|
I'll need to make a new look over it after all these changes to see if I missed stuff. |
Yes, I changed again a lot based on your review... I rewrote the socket serialize/deserialize anyway. After looking for to long at this read_strings, I could not let it be that way. It's still a bit handwavy but I also did not want to import a new library just for this. Still open from your review:
I would like to postpone this to separate PR's:
Of course except you see something else/new. |
|
I quickly tested the signal handler by doing a segfault on purpose in the load command handler: Of course there is no core dump what so ever, also not on master. |
|
So lacking better options, I went for: cd41d50 @@ -661,7 +662,10 @@ static void signal_handler(int sig, siginfo_t * /*si*/, void * /*uctx*/) {
WRITE_STDERR_STR("rtapi_app: UNKNOWN - shutting down\n");
break;
}
-
+ rtapi_msg_queue.consume_all([](const message_t &m) {
+ WRITE_STDERR_STR(m.msg);
+ });
+
_exit(-1);
}so the last messages are shown. But I am not sure if this is signal signal-safety conform. However, it works. I blocked the normal consumer thread and then did |
9cf3880 to
cd41d50
Compare
If you run on a system with systemd (most likely) you may want to use |
That should work too. As long as the string is a constant and strlen is replaced by the compiler builtin, then the length expands as a constant too. Just a minor detail, you should put the expansion of a macro in parentheses.
You are now catching more signals than the original. That is wrong. You actually want to coredump on "bad" signals (anything not with explicit meaning for the running program). The system should behave predictable and if that does not happen you want to be able to backtrack why it didn't behave. That is why it codedumps, so you can backtrack. |
Ah now I get a coredump. It did work on master and I broke it. But I also told you I am not sure what I am doing. A fix will follow. If not setid: If setuid, References: |
|
So, next try. If you want to take over for the signal part, feel free. The last time I used coredumps is like 20 years ago and for signals like 10. I now just call abort() inside the handler which works and is allowed. But I don't know if it is good practice. I added the message queue write out also. As long as there is no malloc/free in consume_all(), that is fine. And I don't think there is due to the queue is fixed size but I won't bother reviewing the boost library. I can remove it again if you think it's to dangerous. For testing, I just run Coredump is there. However, latency-test not. No clue why but it's the same on master. : No coredump. |
| } | ||
|
|
||
| //Send | ||
| ssize_t res = send(fd, buf.data(), buf.size(), 0); |
There was a problem hiding this comment.
This may return -1 and errno == EAGAIN. In that case you need to retry.
There was a problem hiding this comment.
Is this needed? https://www.man7.org/linux/man-pages/man2/send.2.html says that happens only on non blocking sockets. And the socket used in here is blocking.
|
|
||
| //Get data | ||
| std::vector<char> buf(buff_size); | ||
| res = recv(fd, buf.data(), buff_size, 0); |
There was a problem hiding this comment.
May return -1 and errno == EAGAIN. You need to retry in that case.
There was a problem hiding this comment.
Is this needed? https://www.man7.org/linux/man-pages/man2/recv.2.html says that happens only on non blocking sockets. And the socket used in here is blocking.
There was a problem hiding this comment.
You are right, my mind was probably asleep or walkabout when I wrote that...
The error you need to check is EINTR (a signal interrupted the operation before it was executed).
There was a problem hiding this comment.
Hmm, I think interruption by a signal should be handled in a specific way if signals are used in a meaningful way, like setting a flag what to do and only retry in specific cases. This has to be done right trough the whole HAL like in hm2_eth.c or not at all.
rtapi_app just catches a few signals to exit in all cases, just with a different message and explicit coredump. Even ALRM just exits the process.
SIGTSTP is ignored according to the code: https://github.com/hdiethelm/linuxcnc-fork/blob/rtapi_cleanup_v2/src/rtapi/uspace_rtapi_main.cc#L816
So I think it only makes sense to add EINTR error handling if the signal handler does something usefull and in that case, it is specific to what you want to do. Just adding a generic while(EINTR) error handling could even make it more difficult in the future if someone wants to add a specific signal for example to nicely exit the process.
I tried killall -SIGTSTP rtapi_app to get accept https://github.com/hdiethelm/linuxcnc-fork/blob/rtapi_cleanup_v2/src/rtapi/uspace_rtapi_main.cc#L461 to return EINTR but this doesn't happen due to the signal is ignored.
Context:
https://www.man7.org/linux/man-pages/man7/signal.7.html
https://unix.stackexchange.com/questions/253349/what-is-the-rationale-behind-eintr
There was a problem hiding this comment.
A standard way to handle EINTR is:
retry_recv:
ssize_t err = recv(...); // Also for send(), read(), write(),...
if(err < 0) {
if(EINTR == errno)
goto retry_recv;
// else handle the error
}The question is whether it happens in practice. What parts in the code allow, handle or depend on signals that can cause an interrupted syscall.
Another problem that can occur with reads and writes is that there are only partial reads/writes. But this is usually ignored and in practice does not happen on Linux [pipes do have a PIPE_BUF limit for atomicity]. As long as you write/read under a page size, then it should never occur afaik, so lets ignore it ;-)
There was a problem hiding this comment.
The question is whether it happens in practice. What parts in the code allow, handle or depend on signals that can cause an interrupted syscall.
It seams to not happen in practice. I did not manage to achieve an EINTR to test it and I also did not find anything in the LinuxCNC issues. So I think it is save to just not check for EINTR, it was never checked in the 20 years the code uses sockets in a similar way in rtapi_usespace: 50e259c
Otherwise, EINTR should be handled also for connect/accept and might be more. So for consistency, I prefer to not check this at all as long as it never happens. Of course, if someone starts to use signals for specific actions, it should be checked all over the place and specific actions should be taken. This then would look like:
retry_recv:
ssize_t err = recv(...); // Also for send(), read(), write(),...
if(err < 0 && !exit_flag) {
if(EINTR == errno)
goto retry_recv;
// else handle the error
}There was a problem hiding this comment.
BTW: My only change was to replace write by send and read by recv for readability. read/write on sockets just feels wrong even though it is equal. The man page says:
The only difference between send() and write(2) is the presence of flags. With a zero flags argument, send() is equivalent to write(2).
and
The only difference between recv() and read(2) is the presence of flags. With a zero flags argument, recv() is generally equivalent to read(2) (but see NOTES)`
There was a problem hiding this comment.
Now I manage to get EINTR for accept():
Comment out https://github.com/hdiethelm/linuxcnc-fork/blob/rtapi_cleanup_v2/src/rtapi/uspace_rtapi_main.cc#L712 so the signal handler doesn't stop the program.
Terminal 1: ../bin/rtapi_app
Terminal 2: killall rtapi_app
Result:
rtapi_app: SIGTERM - shutting down
rtapi_app: failed to accept connection from slave: Interrupted system call
But I still think this is a use-case that should be properly implemented implemented if needed in the future.
Never the less, while looking trough the code, I found the run_threads() function to be pretty useless and removed it all together. See commit f52ef21.
Most of it is now fine :-)
Well, you are allowed to call abort() in the signal handler. The existing solution (using kill()) re-raises the caught signal with the handler to default. Using abort() seems the better choice. |
3998392 to
f52ef21
Compare
|
I tested the actual state of this PR both with latency-histogram and on my machine. No issues so far except #3946. |
|
@BsAtHome Thanks already for the detailed review. Still open points:
Anything else? |
Changed size check to -2 due to 1 byte is needed at the beginning and 1 for \0 at the end
Core dump did not work any way
This concept uses abort to always create a coredump, indepednent if the original signal has core as a default action. Write out remaining messages: This should be signal safe. Unshure: consume_all() could use free internaly. But due to the queue is fixed size, it would be strange.
There is nothing usefull you can do in run_threads due to callback() blocks on accept and only returns on a slave command. Return -1 did not break the while loop. It looks like a mistake. Never the less, this behavour is kept due to exiting all threads just because a socket error happened is most probably not desired.
Errno was always printed, even if there was an other issue
vector.insert increases size, so to much data was sent. Change to reserve() and push_back() to avoid allocation.
Otherwhise, master will hang forever if client doesn't send enough data
No references where found in this file
Struct for C++ instance is unusual
a579097 to
ab79dee
Compare
|
Just rebased to master again. |
|
I have converted this to draft, until it isn't WIP any more. |
|
I guess the just added comments are the last I have to say for this review. There might be other things, but that should not matter. The final question is who has been testing this. Are there any things caught in running this? |
do_comp_args() got the name in the args and then used args starting from 1. Instead, remove command and name before call and start from 0.
|
If other people can confirm proper function... I'm fine with it. |
|
I don't have a CNC to test on but I have LinuxCNC's sims working in Fedora and two VMs (Gentoo and Debian, both with PREEMPT_RT and RTAI available) |
|
Just one more thing... Can you run from the src directory: |
It shows all green. One (second) last commit from my side: cherry-picked 00d534c from 2.9 I will do some final testing in 1-2 days and it is also additionally tested in #3964 which bases of this PR. What I can test in Debian Trixie:
|
This sounds good, so together we have already good coverage. |
Continuation of #3908 reverted in #3918
Target: Move some classes out of the huge uspace_rtapi_app.cc
uspace_rtapi_main: Contains the main function and helpers
uspace_rtapi_app: Contains the RtapiApp class
uspace_posix: Contains the PosixApp class
Other fixes: