gui-agent: die when xorg fails to start #176

Open
meithecatte wants to merge 2 commits into base: main

@meithecatte meithecatte commented Feb 28, 2023

Contributor

@DemiMarie DemiMarie left a comment

There are several bugs related to signal handling that need to be fixed.

Comment on lines 2191 to 2212
static void handle_sigchld()
{
fprintf(stderr, "Xorg died unexpectedly, exiting!\n");
exit(1);
}

Contributor

fprintf and exit are not async-signal-safe, so one cannot call them from a signal handler that could interrupt another async-signal-unsafe function. If one uses sigprocmask or pthread_sigmask to block SIGCHLD except during ppoll(), then the problem goes away, as ppoll() is async-signal-safe. For this to work, SIGCHLD must be blocked before the handler is registered.
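A minimal sketch of the blocking approach described in this comment. The flag `child_exited` and the helpers `setup_sigchld` and `wait_for_event` are hypothetical names, not taken from the PR, and `ppoll()` is assumed to be available via `_GNU_SOURCE` on glibc. SIGCHLD is blocked before the handler is installed, so the handler can only run while `ppoll()` has temporarily swapped in the wait mask.

```c
#define _GNU_SOURCE              /* for ppoll() on glibc */
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <string.h>

/* Hypothetical flag, set by the handler and read by the main loop. */
static volatile sig_atomic_t child_exited;

static void on_sigchld(int sig)
{
    (void)sig;
    child_exited = 1;            /* a plain store is async-signal-safe */
}

/* Block SIGCHLD first, then install the handler. On success *wait_mask
 * holds the previous signal mask with SIGCHLD removed, for use by ppoll(). */
static int setup_sigchld(sigset_t *wait_mask)
{
    sigset_t block;
    struct sigaction sa;

    assert(wait_mask != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, wait_mask) != 0)
        return -1;
    sigdelset(wait_mask, SIGCHLD);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Returns 1 if SIGCHLD was seen, 0 if ppoll() returned for another
 * reason (fds ready or an unrelated signal), -1 on error. */
static int wait_for_event(struct pollfd *fds, nfds_t nfds,
                          const sigset_t *wait_mask)
{
    int ret = ppoll(fds, nfds, NULL, wait_mask);

    if (child_exited)
        return 1;
    if (ret < 0 && errno != EINTR)
        return -1;
    return 0;
}
```

Because the signal stays blocked everywhere else, a SIGCHLD that arrives between two waits is held pending and delivered on the next `ppoll()` call rather than lost.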

Author

I can't see the ppoll call you're talking about. I do see one to select, though. Considering it's in gui-common/txrx-vchan.c, though, I'd gather that doing that there would be quite painful. Perhaps it'd be better to set a flag that the main loop would then check? Not sure whether we'd get stuck in wait_for_vchan_or_argfd, though. If we'll exit that function with a dead Xorg, I could check the flag in the main loop quite comfortably, but otherwise this seems like a painful thing.

Contributor

You can use pselect() or the self-pipe trick.
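A rough sketch of the self-pipe trick combined with `select()`, which the agent already uses. All names here (`sigchld_pipe`, `setup_self_pipe`, `wait_for_fd_or_sigchld`) are hypothetical and not from the PR. The handler only calls `write()`, which is async-signal-safe, and the main loop learns about the signal because the read end of the pipe becomes readable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static int sigchld_pipe[2] = { -1, -1 };   /* hypothetical name */

static void on_sigchld(int sig)
{
    int saved_errno = errno;               /* write() may clobber errno */
    unsigned char byte = (unsigned char)sig;
    /* A full pipe (EAGAIN) is harmless: a wakeup is already queued. */
    ssize_t n = write(sigchld_pipe[1], &byte, 1);
    (void)n;
    errno = saved_errno;
}

static int set_nonblock_cloexec(int fd)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0)
        return -1;
    fl = fcntl(fd, F_GETFD);
    if (fl < 0 || fcntl(fd, F_SETFD, fl | FD_CLOEXEC) < 0)
        return -1;
    return 0;
}

static int setup_self_pipe(void)
{
    struct sigaction sa;

    if (pipe(sigchld_pipe) < 0)
        return -1;
    if (set_nonblock_cloexec(sigchld_pipe[0]) < 0 ||
        set_nonblock_cloexec(sigchld_pipe[1]) < 0)
        return -1;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Waits until other_fd (ignored if negative) or the self-pipe is readable.
 * Returns 1 if a SIGCHLD wakeup was consumed, 0 otherwise, -1 on error. */
static int wait_for_fd_or_sigchld(int other_fd)
{
    fd_set rfds;
    int maxfd, ret;

    assert(sigchld_pipe[0] >= 0);
    maxfd = sigchld_pipe[0] > other_fd ? sigchld_pipe[0] : other_fd;
    do {
        FD_ZERO(&rfds);
        FD_SET(sigchld_pipe[0], &rfds);
        if (other_fd >= 0)
            FD_SET(other_fd, &rfds);
        ret = select(maxfd + 1, &rfds, NULL, NULL, NULL);
    } while (ret < 0 && errno == EINTR);   /* select() is never auto-restarted */
    if (ret < 0)
        return -1;
    if (FD_ISSET(sigchld_pipe[0], &rfds)) {
        unsigned char buf[16];
        while (read(sigchld_pipe[0], buf, sizeof buf) > 0)
            ;                              /* drain all queued wakeups */
        return 1;
    }
    return 0;
}
```

Both pipe ends are non-blocking so that neither the handler nor the drain loop can stall the process.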

@@ -2188,6 +2188,12 @@ static void handle_sigterm()
exit(0);
}

static void handle_sigchld()
Contributor

Suggested change
-static void handle_sigchld()
+static void handle_sigchld(void)

Otherwise recent clang will (rightly) complain. That said, one should really use the siginfo_t argument to check that it was in fact Xorg that exited, as opposed to e.g. a signal sent from another process. Also, Xorg’s exit code should be logged.
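One possible shape for the `siginfo_t` check, as a sketch only. The names `xorg_pid`, `xorg_exited`, `fork_tracked` and `report_xorg_exit` are hypothetical and not from the PR. The handler uses the three-argument `SA_SIGINFO` signature and only sets a flag; logging of the exit status happens later in normal context, where `fprintf` is safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t xorg_pid;     /* hypothetical: pid of the Xorg child */
static volatile sig_atomic_t xorg_exited;

static void on_sigchld(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    /* Skip other children and SIGCHLD sent with kill() (si_code SI_USER). */
    if (info->si_pid != xorg_pid)
        return;
    if (info->si_code == CLD_EXITED || info->si_code == CLD_KILLED ||
        info->si_code == CLD_DUMPED)
        xorg_exited = 1;
}

static int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_NOCLDSTOP | SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* fork() with SIGCHLD blocked, so xorg_pid is recorded before the
 * handler can run for this child. Returns as fork() does. */
static pid_t fork_tracked(void)
{
    sigset_t block, old;
    pid_t pid;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);
    pid = fork();
    if (pid > 0)
        xorg_pid = pid;
    sigprocmask(SIG_SETMASK, &old, NULL);
    return pid;
}

/* Normal context: reap the child and log how it ended. Returns the exit
 * status, 128 + signal number if it was killed, or -1 on error. */
static int report_xorg_exit(void)
{
    int status;

    assert(xorg_pid > 0);
    if (waitpid((pid_t)xorg_pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    if (WIFEXITED(status)) {
        fprintf(stderr, "Xorg exited with status %d\n", WEXITSTATUS(status));
        return WEXITSTATUS(status);
    }
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "Xorg killed by signal %d\n", WTERMSIG(status));
        return 128 + WTERMSIG(status);
    }
    return -1;
}
```

One caveat: standard signals are not queued, so if several children exit at nearly the same time only one SIGCHLD (and one `siginfo_t`) may be delivered. A process with more than one child would probably want to poll `waitpid(..., WNOHANG)` as well rather than rely on `si_pid` alone.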

@@ -2255,6 +2261,7 @@ int main(int argc, char **argv)
int wait_fds[2];

parse_args(&g, argc, argv);
signal(SIGCHLD, handle_sigchld);
Contributor

Using signal to set a signal handler is deprecated. Use sigaction.

Author

Should we do anything about the use of signal for SIGTERM?

Contributor

That should be fixed too.

@marmarek
Member

There is also another issue: https://openqa.qubes-os.org/tests/67828#step/guivm_startup/21

Feb 28 20:05:22 sys-gui qubes-gui[476]: Xorg died unexpectedly, exiting!

I'm not yet sure if that just reported a crash elsewhere, or if that script is supposed to exit in case of sys-gui. I'll try to extract more logs from there.

@marmarek
Member

marmarek commented Mar 1, 2023

This also makes stopping qubes-gui-agent always unclean (as it hits this signal handler when waiting for Xorg to terminate).

@marmarek
Member

marmarek commented Mar 1, 2023

> I'm not yet sure if that just reported a crash elsewhere, or if that script is supposed to exit in case of sys-gui. I'll try to extract more logs from there.

Ok, this was stupid: #177

@qubesos-bot

qubesos-bot commented Mar 1, 2023

OpenQA test summary

Complete test suite and dependencies: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.2&build=2023030619-4.2&flavor=pull-requests

New failures, excluding unstable

Compared to: https://openqa.qubes-os.org/tests/overview?distri=qubesos&version=4.2&build=2023021823-4.2&flavor=update

  • system_tests_whonix

    • whonix_torbrowser: unnamed test (unknown)

    • whonix_torbrowser: Failed (test died)
      # Test died: no candidate needle with tag(s) 'anon-whonix-tor-brows...

    • whonix_torbrowser: unnamed test (unknown)

  • system_tests_basic_vm_qrexec_gui

    • TC_03_QvmRevertTemplateChanges: test_000_revert_linux (error)
      qubes.exc.QubesMemoryError: Not enough memory to start domain 'test...

    • TC_30_Gui_daemon: test_000_clipboard (error)
      qubes.exc.QubesMemoryError: Not enough memory to start domain 'test...

    • TC_00_AppVM_debian-11: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_network

  • system_tests_pvgrub_salt_storage

    • TC_41_HVMGrub_debian-11: test_010_template_based_vm (error)
      qubes.exc.QubesMemoryError: Not enough memory to start domain 'test...
  • system_tests_splitgpg

    • TC_10_Thunderbird_debian-11: test_000_send_receive_default (failure + cleanup)
      dogtail.tree.SearchError: child of [desktop frame | main]: "Thunder...

    • TC_10_Thunderbird_fedora-37: test_000_send_receive_default (failure + cleanup)
      dogtail.tree.SearchError: child of [desktop frame | main]: "Thunder...

  • system_tests_guivm_gui_interactive

    • update_guivm: Failed (test died)
      # Test died: command '(set -o pipefail; qubesctl --all --show-outpu...
  • system_tests_usbproxy

    • TC_20_USBProxy_core3_whonix-ws-16: test_030_detach (failure)
      AssertionError: <AppVM at 0x706a3cbdce90 name='test-inst-frontend' ...
  • system_tests_network_ipv6

  • system_tests_network_updates

  • system_tests_dispvm

    • TC_20_DispVM_fedora-37: test_100_open_in_dispvm (failure)
      self.assertEqual(test_txt_content.s... AssertionError: b'' != b'test1'
  • system_tests_qwt_win7@hw1

    • windows_install: Failed (test died)
      # Test died: command './install.sh' failed at /usr/lib/os-autoinst/...
  • system_tests_basic_vm_qrexec_gui_zfs

    • TC_00_Basic: test_202_udev_block_exclude_default (failure)
      AssertionError: '7d60680b-393b-4e2e-858a-8e58f358ffbb' unexpectedly...

    • TC_03_QvmRevertTemplateChanges: test_000_revert_linux (failure)
      AssertionError: '583166f7a65890adbad26952ed8782b595cb3b8c' != 'd332...

    • TC_05_StandaloneVM_debian-11-pool: test_101_resize_root_img_online (failure)
      AssertionError: libvirt event impl drain timeout

    • TC_00_AppVM_debian-11-pool: test_130_qrexec_filemove_disk_full (failure)
      AssertionError: libvirt event impl drain timeout

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-gw-16-pool: test_105_qrexec_filemove (error)
      qubes.exc.QubesVMError: Cannot connect to qrexec agent for 90 secon...

    • TC_00_AppVM_whonix-gw-16-pool: test_300_bug_1028_gui_memory_pinning (failure)
      AssertionError: Dom0 window doesn't match VM window content

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui_btrfs

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui_ext4

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui_xfs

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui@hw1

    • TC_00_AppVM_debian-11: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_gui_tools@hw1

    • qubesmanager_backuprestore: unnamed test (unknown)
    • qubesmanager_backuprestore: Failed (test died)
      # Test died: no candidate needle with tag(s) 'qubes-backup' matched...
  • system_tests_gui_tools

    • qubesmanager_backuprestore: unnamed test (unknown)
    • qubesmanager_backuprestore: Failed (test died)
      # Test died: no candidate needle with tag(s) 'restore-success' matc...

Failed tests

73 failures
  • system_tests_whonix

    • whonix_torbrowser: unnamed test (unknown)

    • whonix_torbrowser: Failed (test died)
      # Test died: no candidate needle with tag(s) 'anon-whonix-tor-brows...

    • whonix_torbrowser: unnamed test (unknown)

  • system_tests_basic_vm_qrexec_gui

    • TC_03_QvmRevertTemplateChanges: test_000_revert_linux (error)
      qubes.exc.QubesMemoryError: Not enough memory to start domain 'test...

    • TC_30_Gui_daemon: test_000_clipboard (error)
      qubes.exc.QubesMemoryError: Not enough memory to start domain 'test...

    • TC_00_AppVM_debian-11: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_network

  • system_tests_pvgrub_salt_storage

    • TC_41_HVMGrub_debian-11: test_010_template_based_vm (error)
      qubes.exc.QubesMemoryError: Not enough memory to start domain 'test...
  • system_tests_splitgpg

  • system_tests_guivm_gui_interactive

    • update_guivm: Failed (test died)
      # Test died: command '(set -o pipefail; qubesctl --all --show-outpu...
  • system_tests_usbproxy

    • TC_20_USBProxy_core3_whonix-ws-16: test_030_detach (failure)
      AssertionError: <AppVM at 0x706a3cbdce90 name='test-inst-frontend' ...
  • system_tests_network_ipv6

  • system_tests_network_updates

  • system_tests_dispvm

    • TC_20_DispVM_fedora-37: test_100_open_in_dispvm (failure)
      self.assertEqual(test_txt_content.s... AssertionError: b'' != b'test1'

    • TC_20_DispVM_whonix-ws-16: test_100_open_in_dispvm (failure)
      AssertionError: libvirt event impl drain timeout

  • system_tests_qwt_win10@hw1

    • windows_install: Failed (test died)
      # Test died: command './install.sh' failed at /usr/lib/os-autoinst/...
  • system_tests_qwt_win7@hw1

    • windows_install: Failed (test died)
      # Test died: command './install.sh' failed at /usr/lib/os-autoinst/...
  • system_tests_basic_vm_qrexec_gui_zfs

    • TC_00_Basic: test_202_udev_block_exclude_default (failure)
      AssertionError: '7d60680b-393b-4e2e-858a-8e58f358ffbb' unexpectedly...

    • TC_03_QvmRevertTemplateChanges: test_000_revert_linux (failure)
      AssertionError: '583166f7a65890adbad26952ed8782b595cb3b8c' != 'd332...

    • TC_05_StandaloneVM_debian-11-pool: test_101_resize_root_img_online (failure)
      AssertionError: libvirt event impl drain timeout

    • TC_00_AppVM_debian-11-pool: test_130_qrexec_filemove_disk_full (failure)
      AssertionError: libvirt event impl drain timeout

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-gw-16-pool: test_105_qrexec_filemove (error)
      qubes.exc.QubesVMError: Cannot connect to qrexec agent for 90 secon...

    • TC_00_AppVM_whonix-gw-16-pool: test_300_bug_1028_gui_memory_pinning (failure)
      AssertionError: Dom0 window doesn't match VM window content

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui_btrfs

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui_ext4

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui_xfs

    • TC_00_AppVM_debian-11-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16-pool: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_basic_vm_qrexec_gui@hw1

    • TC_00_AppVM_debian-11: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_debian-11: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_fedora-37: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_220_audio_play (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

    • TC_00_AppVM_whonix-ws-16: test_223_audio_play_hvm (error)
      subprocess.CalledProcessError: Command '['pkill', 'parecord']' retu...

  • system_tests_gui_tools@hw1

    • qubesmanager_backuprestore: unnamed test (unknown)
    • qubesmanager_backuprestore: Failed (test died)
      # Test died: no candidate needle with tag(s) 'qubes-backup' matched...
  • system_tests_gui_tools

    • qubesmanager_backuprestore: unnamed test (unknown)
    • qubesmanager_backuprestore: Failed (test died)
      # Test died: no candidate needle with tag(s) 'restore-success' matc...

Fixed failures

Compared to: https://openqa.qubes-os.org/tests/60652#dependencies

7 fixed
  • system_tests_network

  • system_tests_pvgrub_salt_storage

    • StorageFile: test_001_non_volatile (error)
      subprocess.CalledProcessError: Command '/usr/lib/qubes/destroy-snap...
  • system_tests_network_ipv6

  • system_tests_network_updates

    • TC_11_QvmTemplateMgmtVM_whonix-gw-16: test_000_template_list (failure)
      qvm-template: error: No matching templates to list
  • system_tests_qwt_win10@hw1

    • windows_install: wait_serial (wait serial expected)
      # wait_serial expected: qr/Rt7qO-\d+-/...
  • system_tests_basic_vm_qrexec_gui@hw1

Unstable tests

  • system_tests_update

    update/Failed (1/5 times with errors)
    • job 55329 # Test died: command '(set -o pipefail; qubesctl --show-output stat...
  • system_tests_update@hw1

    update/Failed (1/5 times with errors)
    • job 55329 # Test died: command '(set -o pipefail; qubesctl --show-output stat...
  • system_tests_gui_tools@hw1

    qubesmanager_vmsettings/ (1/2 times with errors)
    qubesmanager_vmsettings/Failed (1/2 times with errors)
    • job 60669 # Test died: no candidate needle with tag(s) 'vm-settings-devices-s...
  • system_tests_gui_tools

    qubesmanager_vmsettings/ (1/2 times with errors)
    qubesmanager_vmsettings/Failed (1/2 times with errors)
    • job 60669 # Test died: no candidate needle with tag(s) 'vm-settings-devices-s...

@meithecatte
Author

Hmm, the approach of the signal handler just setting some kind of flag won't work, as in the motivating case, the signal gets delivered while the gui agent is stuck in mkghandles. And to be honest, I don't know how much I'd trust it to not get stuck in Xlib when Xorg dies, between the calls to select at which we could handle this.

I considered rewriting the signal handler for SIGCHLD to use signal-safe functions, but snprintf is not one of them, and using _exit isn't very nice either.

Another option would be to have the SIGCHLD handler attempt notifying us via a self-pipe, but also set an alarm in case that fails? And in that case, I guess we don't have to feel as bad about using _exit...

And then there's the option of starting up another thread, just in case Xorg dies on us, and handling this there?

I don't think there is an elegant solution here. Unix was a mistake.

@DemiMarie
Contributor

> Hmm, the approach of the signal handler just setting some kind of flag won't work, as in the motivating case, the signal gets delivered while the gui agent is stuck in mkghandles. And to be honest, I don't know how much I'd trust it to not get stuck in Xlib when Xorg dies, between the calls to select at which we could handle this.

Xorg exiting should cause I/O to fail, so if Xlib hangs that is an Xlib bug. Might be better to port the whole thing to XCB, which should at least be somewhat predictable.

> I considered rewriting the signal handler for SIGCHLD to use signal-safe functions, but snprintf is not one of them, and using _exit isn't very nice either.

It gets worse: due to race conditions, it isn’t safe to exit until all MSG_WINDOW_DUMP messages have been acknowledged. So the event loop needs to keep running a bit longer.

> Another option would be to have the SIGCHLD handler attempt notifying us via a self-pipe, but also set an alarm in case that fails? And in that case, I guess we don't have to feel as bad about using _exit...

Self-pipe is safe, alarm isn’t.

> And then there's the option of starting up another thread, just in case Xorg dies on us, and handling this there?

Does the agent call fork() and then do anything that isn’t async-signal-safe before execve()? If so, that will need to be dealt with first.

> I don't think there is an elegant solution here. Unix was a mistake.

Yes, it was.

@meithecatte
Author

> Self-pipe is safe, alarm isn’t.

alarm(2) is listed in signal-safety(7), though?

> Does the agent call fork() and then do anything that isn’t async-signal-safe before execve()? If so, that will need to be dealt with first.

Could you expand on this? I don't really understand why.

> It gets worse: due to race conditions, it isn’t safe to exit until all MSG_WINDOW_DUMP messages have been acknowledged. So the event loop needs to keep running a bit longer.

What happens when we exit too early? I don't think we can rely on this either way, as this is happening across a security boundary.

> > I don't think there is an elegant solution here. Unix was a mistake.
>
> Yes, it was.

I'm glad we agree.

@DemiMarie
Contributor

> > Self-pipe is safe, alarm isn’t.
>
> alarm(2) is listed in signal-safety(7), though?

See below. The tl;dr is that we really do not want to exit uncleanly.

> > Does the agent call fork() and then do anything that isn’t async-signal-safe before execve()? If so, that will need to be dealt with first.
>
> Could you expand on this? I don't really understand why.

fork() interacts badly with locks. Any locks that were held in the parent process will remain held in the child process, but the threads that would unlock them do not exist in the child. Since e.g. malloc() uses locks internally, this could lead to a deadlock in the child process. Therefore, POSIX states that if the parent process is multi-threaded, the child process after fork() is only allowed to use async-signal-safe interfaces until execve(). Under glibc, one can generally get away with malloc(), but it’s best not to rely on that.
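To illustrate the constraint, here is a small sketch of a child that restricts itself to async-signal-safe calls between `fork()` and `execv()`. The helpers `spawn` and `wait_for_exit` are hypothetical names and do not reflect how the agent actually starts Xorg.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] in a child process. Returns the child's pid, or -1 if
 * fork() failed. */
static pid_t spawn(char *const argv[])
{
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);
    pid = fork();
    if (pid == 0) {
        /* Child: only async-signal-safe calls from here on. */
        execv(argv[0], argv);
        /* exec failed. fprintf() and exit() are avoided: they may take
         * locks or run atexit handlers inherited from the parent. */
        static const char msg[] = "execv failed\n";
        ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
        (void)n;
        _exit(127);
    }
    return pid;
}

/* Reap pid. Returns its exit status, or -1 if it did not exit normally. */
static int wait_for_exit(pid_t pid)
{
    int status;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, `spawn((char *[]){ "/bin/sh", "-c", "exit 7", NULL })` followed by `wait_for_exit()` on the returned pid yields 7 on a system where `/bin/sh` exists, while a path that cannot be executed yields 127.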

> > It gets worse: due to race conditions, it isn’t safe to exit until all MSG_WINDOW_DUMP messages have been acknowledged. So the event loop needs to keep running a bit longer.
>
> What happens when we exit too early? I don't think we can rely on this either way, as this is happening across a security boundary.

See this GUI daemon commit, which will (hopefully) be merged eventually. In short, while there is no security problem, it could cause a guest-wide hang or loss of network connectivity. Ideally, the agent would go even further, and wait for all windows to be unmapped on the agent side.

> > > I don't think there is an elegant solution here. Unix was a mistake.
> >
> > Yes, it was.
>
> I'm glad we agree.

Unix got some things right, but a lot of things wrong.

@marmarek
Member

marmarek commented Mar 5, 2023

> Therefore, POSIX states that if the parent process is multi-threaded, the child process after fork() is only allowed to use async-signal-safe interfaces until execve().

Citation needed.

Anyway, that's one of the things such an application must take care of. More complex applications (which this isn't really) have APIs to register callbacks around fork (before, after in parent, after in child, etc.). Anyway, I don't think any of this applies here, because we aren't talking about interacting with things after fork() (but if we were, xenstore may need some special care, as it may use threads).

> See this GUI daemon commit, which will (hopefully) be merged eventually. In short, while there is no security problem, it could cause a guest-wide hang or loss of network connectivity. Ideally, the agent would go even further, and wait for all windows to be unmapped on the agent side.

But you do realize we are talking about handling premature Xorg exit here, right? At this point all grants are unmapped already. So, there is no point in complicating things in this case.

@meithecatte
Author

> But you do realize we are talking about handling premature Xorg exit here, right? At this point all grants are unmapped already. So, there is no point in complicating things in this case.

Say Xorg just randomly segfaults. It's a pile of C code so it's not that farfetched. What part of the system unmaps the grants in this case?

@DemiMarie
Contributor

> > See this GUI daemon commit, which will (hopefully) be merged eventually. In short, while there is no security problem, it could cause a guest-wide hang or loss of network connectivity. Ideally, the agent would go even further, and wait for all windows to be unmapped on the agent side.
>
> But you do realize we are talking about handling premature Xorg exit here, right? At this point all grants are unmapped already. So, there is no point in complicating things in this case.

Whoops! I forgot that the grants are mapped by Xorg, not the agent process.

@marmarek
Member

marmarek commented Mar 6, 2023

> What part of the system unmaps the grants in this case?

Kernel, on FD release.

@meithecatte
Author

I pushed a much larger diff, that should handle all this properly.

Contributor

@DemiMarie DemiMarie left a comment

The check for the return code of

gui-agent/vmside.c (outdated, resolved)
gui-agent/vmside.c (outdated, resolved)
Comment on lines +1987 to +1988
if (atomic_load(&terminating)) {
exit(0);
Contributor

Should the exit status really be zero here?

Author

This happens if we get SIGTERM. Not sure if exit(0) is what we should do in this case, but it is what the previous code did. What would you suggest?

Contributor

Not sure tbh.

Author

Maybe we would want to reset the SIGTERM sigaction and send SIGTERM to ourselves, so that the parent gets told we were killed by SIGTERM? Seems like unnecessary complexity though, if I'm being honest.
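For reference, the re-raise idea could look roughly like the sketch below. The names `die_by_sigterm` and `terminating_signal` are hypothetical and nothing here is taken from the PR. The first function restores the default SIGTERM disposition and re-sends the signal, and the second shows what a supervising parent would then observe through `waitpid()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Terminate the calling process so that its parent sees "killed by
 * SIGTERM" instead of a normal exit status. Intended to be called from
 * normal context after cleanup has finished. */
static void die_by_sigterm(void)
{
    struct sigaction sa;
    sigset_t unblock;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_DFL;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    /* Make sure the signal is not blocked, or raise() would only leave
     * it pending. */
    sigemptyset(&unblock);
    sigaddset(&unblock, SIGTERM);
    sigprocmask(SIG_UNBLOCK, &unblock, NULL);

    raise(SIGTERM);
    _exit(1);                    /* reached only if raise() failed */
}

/* Parent side: returns the signal that killed pid, 0 if it exited
 * normally, or -1 on error. */
static int terminating_signal(pid_t pid)
{
    int status;

    assert(pid > 0);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Whether the extra code is worth it depends on whether whatever supervises the agent treats "killed by SIGTERM" differently from exit status 0, which this thread does not establish.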

@marmarek
Member

marmarek commented Mar 8, 2023

Uhm, I blinked and suddenly gui-agent got another thread. I must admit I don't really like it... I'd much prefer the approach with a flag and checking it where necessary. In case of waiting for Xorg startup, I guess that would be mostly wait_for_unix_socket() (currently it will exit(1) if accept fails, including EINTR case).

4 participants