-
-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gui-agent: die when xorg fails to start #176
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several bugs related to signal handling that need to be fixed.
gui-agent/vmside.c
Outdated
static void handle_sigchld() | ||
{ | ||
fprintf(stderr, "Xorg died unexpectedly, exiting!\n"); | ||
exit(1); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fprintf
and exit
are not async-signal-safe, so one cannot call them from a signal handler that could interrupt another async-signal-unsafe function. If one uses sigprocmask
or pthread_sigmask
to block SIGCHLD except during ppoll()
, then the problem goes away, as ppoll()
is async-signal-safe. For this to work, SIGCHLD must be blocked before the handler is registered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't see the ppoll
call you're talking about. I do see one to select
, though. Considering it's in gui-common/txrx-vchan.c
, though, I'd gather that doing that there would be quite painful. Perhaps it'd be better to set a flag that the main loop would then check? Not sure whether we'd get stuck in wait_for_vchan_or_argfd
, though. If we'll exit that function with a dead Xorg, I could check the flag in the main loop quite comfortably, but otherwise this seems like a painful thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use pselect()
or the self-pipe trick.
gui-agent/vmside.c
Outdated
@@ -2188,6 +2188,12 @@ static void handle_sigterm() | |||
exit(0); | |||
} | |||
|
|||
static void handle_sigchld() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static void handle_sigchld() | |
static void handle_sigchld(void) |
Otherwise recent clang will (rightly) complain. That said, one should really use the siginfo_t
argument to check that it was in fact Xorg that exited, as opposed to e.g. a signal sent from another process. Also, Xorg’s exit code should be logged.
gui-agent/vmside.c
Outdated
@@ -2255,6 +2261,7 @@ int main(int argc, char **argv) | |||
int wait_fds[2]; | |||
|
|||
parse_args(&g, argc, argv); | |||
signal(SIGCHLD, handle_sigchld); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using signal
to set a signal handler is deprecated. Use sigaction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do anything about the use of signal
for SIGTERM
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be fixed too.
There is also another issue: https://openqa.qubes-os.org/tests/67828#step/guivm_startup/21
I'm not yet sure if that just reported a crash elsewhere, or if that script is supposed to exit in case of sys-gui. I'll try to extract more logs from there. |
This also makes stopping |
Ok, this was stupid: #177 |
OpenQA test summaryComplete test suite and dependencies: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.2&build=2023030619-4.2&flavor=pull-requests New failures, excluding unstableCompared to: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.2&build=2023021823-4.2&flavor=update
Failed tests73 failures
Fixed failuresCompared to: https://openqa.qubes-os.org/tests/60652#dependencies 7 fixed
Unstable tests
|
gui-agent/vmside.c
Outdated
static void handle_sigchld() | ||
{ | ||
fprintf(stderr, "Xorg died unexpectedly, exiting!\n"); | ||
exit(1); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use pselect()
or the self-pipe trick.
gui-agent/vmside.c
Outdated
@@ -2255,6 +2261,7 @@ int main(int argc, char **argv) | |||
int wait_fds[2]; | |||
|
|||
parse_args(&g, argc, argv); | |||
signal(SIGCHLD, handle_sigchld); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be fixed too.
Hmm, the approach of the signal handler just setting some kind of flag won't work, as in the motivating case, the signal gets delivered while the gui agent is stuck in I considered rewriting the signal handler for Another option would be to have the And then there's the option of starting up another thread, just in case Xorg dies on us, and handling this there? I don't think there is an elegant solution here. Unix was a mistake. |
Xorg exiting should cause I/O to fail, so if Xlib hangs that is an Xlib bug. Might be better to port the whole thing to XCB, which should at least be somewhat predictable.
It gets worse: due to race conditions, it isn’t safe to exit until all
Self-pipe is safe, alarm isn’t.
Does the agent call
Yes, it was. |
Could you expand on this? I don't really understand why.
What happens when we exit too early? I don't think we can rely on this either way, as this is happening across a security boundary.
I'm glad we agree. |
See below. The tl;dr is that we really do not want to exit uncleanly.
See this GUI daemon commit, which will (hopefully) be merged eventually. In short, while there is no security problem, it could cause a guest-wide hang or loss of network connectivity. Ideally, the agent would go even further, and wait for all windows to be unmapped on the agent side.
Unix got some things right, but a lot of things wrong. |
Citation needed. Anyway, that's one of the thing such application must take care of. More complex application (which this isn't really) have APIs to register callbacks around fork (before, after in parent, after in child etc). Anyway, I don't think any of this applies here, because we aren't talking about interacting with things after fork() (but - if we would, xenstore may need some special care, as it may use threads).
But you do realize we are talking about handling premature Xorg exit here, right? At this point all grants are unmapped already. So, there is no point in complicating things in this case. |
Say Xorg just randomly segfaults. It's a pile of C code so it's not that farfetched. What part of the system unmaps the grants in this case? |
Whoops! I forgot that the grants are mapped by Xorg, not the agent process. |
Kernel, on FD release. |
f640072
to
328dada
Compare
I pushed a much larger diff, that should handle all this properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check for the return code of
if (atomic_load(&terminating)) { | ||
exit(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the exit status really be zero here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens if we get SIGTERM. Not sure if exit(0) is what we should do in this case, but it is what the previous code did. What would you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we would want to reset the SIGTERM sigaction and send SIGTERM to ourselves, so that the parent gets told we were killed by SIGTERM? Seems like unnecessary complexity though, if I'm being honest.
Uhm, I blinked and suddenly gui-agent got another thread. I must admit I don't really like it... I'd much prefer the approach with a flag and checking it where necessary. In case of waiting for Xorg startup, I guess that would be mostly |
See also QubesOS/qubes-issues#8060