fix(cuda): install NVML development library #4621

Merged
merged 1 commit into from
Apr 16, 2024

Conversation


@ito-san ito-san commented Apr 12, 2024

Description

Recently, running setup-dev-env.sh and installing the NVIDIA libraries no longer installs part of NVML (nvml.h). This affects the gpu_monitor node in system_monitor, which uses NVML. gpu_monitor detects that NVML is not available and publishes errors because it cannot access the GPU.


See also autowarefoundation/autoware.universe#6787.

I'd like to explicitly install NVML as a workaround for this issue.

Tests performed

  1. Completely remove NVIDIA drivers and libraries.

    sudo apt purge cuda-*
    sudo apt purge nvidia-*
    sudo apt purge libcudnn*
    sudo apt purge libnv*
  2. Confirm that only hwloc/nvml.h exists.

    find /usr -type f -name nvml.h
    /usr/include/hwloc/nvml.h
  3. Run setup-dev-env.sh and install the NVIDIA libraries.

    ./setup-dev-env.sh
    ...
    [Warning] Some Autoware components depend on the CUDA, cuDNN and TensorRT NVIDIA libraries which have end-user license agreements that should be reviewed before installation.
    Install NVIDIA libraries? [y/N]: y
  4. Confirm that NVIDIA's nvml.h is installed.

    find /usr -type f -name nvml.h
    /usr/local/cuda-12.3/targets/x86_64-linux/include/nvml.h
    /usr/include/hwloc/nvml.h
  5. Delete the build and install directories for system_monitor.

    rm -rf install/system_monitor/ build/system_monitor/
  6. Build system_monitor and confirm that the build uses NVML (GPU PLATFORM: nvml) and completes successfully.

    colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-w" -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache --event-handlers console_direct+ --packages-select system_monitor
    ...
    -- SYSTEM_PROCESSOR: x86_64
    -- CPU PLATFORM: intel
    -- GPU PLATFORM: nvml
    ...
    Finished <<< system_monitor [3min 2s]
    
    Summary: 1 package finished [3min 4s]
      1 package had stderr output: system_monitor
  7. Run Autoware.

    ros2 launch autoware_launch planning_simulator.launch.xml map_path:=/data/sample-map-planning vehicle_model:=sample_vehicle sensor_model:=sample_sensor_kit launch_system_monitor:=true
  8. Run rqt_runtime_monitor and confirm that gpu_monitor does not report an error.

    ros2 run rqt_runtime_monitor rqt_runtime_monitor
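
The header check in steps 2 and 4 and the build-log check in step 6 could be scripted roughly as below. This is only a sketch: the function names `find_nvidia_nvml_header` and `gpu_platform_from_log` are hypothetical helpers, not part of Autoware, and the second one assumes the colcon output has been saved to a file.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of Autoware or this PR.

# Print every nvml.h under the given root (default: /usr), skipping hwloc's
# unrelated header of the same name.
# Returns non-zero when no other nvml.h is found.
find_nvidia_nvml_header() {
    local root="${1:-/usr}"
    local found
    found="$(find "$root" -type f -name nvml.h ! -path '*/hwloc/*' 2>/dev/null)"
    [ -n "$found" ] || return 1
    printf '%s\n' "$found"
}

# Print the value of the first "-- GPU PLATFORM: ..." line in a saved build log.
# Prints nothing if the log contains no such line.
gpu_platform_from_log() {
    local log="$1"
    sed -n 's/^.*-- GPU PLATFORM: *\([^[:space:]]*\).*$/\1/p' "$log" | head -n 1
}
```

For example, `find_nvidia_nvml_header /usr || echo "NVIDIA nvml.h is missing"` would flag the state seen in step 2, and `gpu_platform_from_log build.log` should print `nvml` for the log shown in step 6 (here `build.log` is an assumed file name).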


Effects on system behavior

Not applicable.

Pre-review checklist for the PR author

The PR author must check the checkboxes below when creating the PR.

In-review checklist for the PR reviewers

The PR reviewers must check the checkboxes below before approval.

Post-review checklist for the PR author

The PR author must check the checkboxes below before merging.

  • There are no open discussions or they are tracked via tickets.

After all checkboxes are checked, anyone who has write access can merge the PR.

@ito-san ito-san marked this pull request as draft April 12, 2024 02:24
@ito-san ito-san self-assigned this Apr 12, 2024
@ito-san ito-san marked this pull request as ready for review April 12, 2024 08:28

@shmpwk shmpwk left a comment


LGTM
I checked with the internal evaluator and the tests passed.

@mitsudome-r mitsudome-r merged commit 038f8a6 into autowarefoundation:main Apr 16, 2024
15 of 16 checks passed