[Issue]: Crash while compiling rocSPARSE #123

Open
AphidGit opened this issue Jul 23, 2024 · 6 comments
Labels
generic Build error, or some other issue not caused by an LLVM bug Under Investigation

Comments

@AphidGit

Problem Description

rocSPARSE compilation crashes, rather than producing an error or succeeding.
fail.txt

Operating System

Arch Linux, kernel 6.9.7-arch1-1

CPU

AMD Threadripper 1950X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.1.0

ROCm Component

rocSPARSE

Steps to Reproduce

After compiling all prerequisites, try doing the following (or something like it):

cd $BASEDIR 
[[ -n "${BASEDIR}" ]] &&  rm -rf "$BASEDIR/14_sparse"
mkdir -p 14_sparse
cd 14_sparse 

mkdir -p build 
DEST="$BASEDIR/14_sparse/build"

git clone https://github.com/ROCmSoftwarePlatform/rocSPARSE
cd rocSPARSE


cmake \
    -Wno-dev \
    -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_CXX_COMPILER=${ROCM_INSTALL_DIR}/bin/hipcc \
    -D CMAKE_CXX_FLAGS="${CXXFLAGS} -fcf-protection=none" \
    -D CMAKE_INSTALL_PREFIX=${ROCM_INSTALL_DIR} \
    -G Ninja \
    $BASEDIR/14_sparse/rocSPARSE
    
"${NINJA:=ninja}" $NUMJOBS
DESTDIR=$DEST "$NINJA" install

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@jsandham

Thanks for raising this issue.

I have a couple of questions that might help track down the problem:

  1. It looks like you are cloning rocSPARSE and using the latest develop branch. Can you tell me which specific commit ID you are using?
  2. From the log, it looks like you are hitting an issue when compiling a rocSPARSE routine that uses rocPRIM. I assume that you are using the rocPRIM that came with 6.1, rather than cloning rocPRIM and installing it from the latest source before compiling rocSPARSE. Is that correct?
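
For reference, the commit ID asked about in question 1 can be read straight from the clone. The helper below is only a sketch: `describe_checkout` is a hypothetical name, not part of any ROCm tooling, and it assumes `git` is on the `PATH`.

```shell
# Print the commit ID and branch of a git checkout.
# describe_checkout is a hypothetical helper name used for illustration.
describe_checkout() {
    dir="${1:-.}"    # checkout directory, defaults to the current one
    printf 'commit: %s\n' "$(git -C "$dir" rev-parse HEAD)"
    # Prints "HEAD" instead of a branch name when the checkout is detached.
    printf 'branch: %s\n' "$(git -C "$dir" rev-parse --abbrev-ref HEAD)"
}
```

With the layout from the reproduction steps, this would be called as `describe_checkout "$BASEDIR/14_sparse/rocSPARSE"`.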

@jsandham

The issue appears to be caused by using rocPRIM from 6.1 while compiling the latest rocSPARSE. The specific offending commit ID in rocSPARSE is 81e4f9527b825195f53c8e3b660f6a699af829b7. Investigating a solution now. As a temporary workaround, compiling the latest rocPRIM first and then compiling the latest rocSPARSE should work.

@AphidGit
Author

A dependency diagram would be helpful... I don't know what depends on what, and there are 40 of these repositories before I arrive at my goal.

But no, I am installing all of it by cloning and compiling, because I want to see if I can debug problems. However, because of how long it takes, patches or new versions might be released during the compilation process. (It takes several days to compile it all, and I'm still debugging the whole process, writing some patches to overcome Ubuntu assumptions, and so on...)

@jsandham

A PR with the fix is up now. I'll comment here once it is merged.

Regarding dependencies: rocSPARSE currently depends on rocPRIM and (optionally) rocBLAS. The rocPRIM dependency is mentioned in the docs (see https://rocm.docs.amd.com/projects/rocSPARSE/en/latest/install/Linux_Install_Guide.html#linux-install), but I agree we should present this information better. Currently, I don't think we are clear on how rocSPARSE should work when, say, using the latest rocSPARSE together with an older version of rocPRIM (within the same major version). I'll look into improving that.

@jsandham

Correcting something I said in my previous comment that is wrong:

I identified the cause of the compilation failures you are seeing: you are using rocPRIM 3.1.0 (the version that came with your installation of ROCm 6.1) while compiling rocSPARSE from the latest develop branch. Specifically, rocSPARSE develop (which is much further ahead of what was packaged with your ROCm 6.1 installation) uses a change introduced in rocPRIM 3.2.0. Compilation therefore fails with rocPRIM 3.1.0, since that version does not have those changes. All of this is correct.

The part where I made an incorrect statement was about how rocSPARSE should work with older versions of rocPRIM. It actually works the opposite way from what I stated. Given a ROCm release with, say, rocPRIM version 3.1.0 and rocSPARSE version 3.1.0, it should be possible to rebuild rocSPARSE 3.1.0 with any future rocPRIM 3.Y.Z version where 3.Y.Z >= 3.1.0, up to the next major version change.

This explains the failure: building rocSPARSE 3.2.0 with rocPRIM 3.1.0 is not supported.
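
The compatibility rule above can be sketched as a small shell check. This is only an illustration of the rule as stated in this comment, not something rocSPARSE's build system provides: `rocprim_ok` is a hypothetical helper, and it assumes plain `MAJOR.MINOR.PATCH` version strings.

```shell
# rocprim_ok BASE CANDIDATE
# Succeeds when rocPRIM CANDIDATE may be used to rebuild a rocSPARSE that was
# released against rocPRIM BASE: same major version and CANDIDATE >= BASE.
# Hypothetical helper; expects MAJOR.MINOR.PATCH strings. Uses plain global
# variables to stay POSIX sh compatible.
rocprim_ok() {
    base="$1"
    cand="$2"
    old_ifs="$IFS"
    IFS=.
    set -- $base; b_major=$1; b_minor=$2; b_patch=$3
    set -- $cand; c_major=$1; c_minor=$2; c_patch=$3
    IFS="$old_ifs"
    # A different major version is never accepted.
    [ "$c_major" -eq "$b_major" ] || return 1
    # A newer minor version is accepted regardless of patch level.
    [ "$c_minor" -gt "$b_minor" ] && return 0
    # Same minor version: the patch level must not be older.
    [ "$c_minor" -eq "$b_minor" ] && [ "$c_patch" -ge "$b_patch" ]
}

# The case from this issue: rocSPARSE needing rocPRIM 3.2.0, built with 3.1.0.
if rocprim_ok 3.2.0 3.1.0; then echo "supported"; else echo "not supported"; fi
```

The last line prints `not supported`, while `rocprim_ok 3.1.0 3.2.0` succeeds, matching the direction of compatibility described above.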

Recommendations:
I think these are good rules to follow:

  1. If you are building rocSPARSE for active development (you plan to create PRs into the develop branch, adding new functionality, etc.), clone and build rocPRIM from the latest develop branch before building rocSPARSE from the latest develop branch.

  2. If you are building rocSPARSE from source to, say, target a different architecture that is not included by default, it may be better to clone and build one of the release branches (for example release/rocm-rel-6.2) instead of develop, as these branches are more stable. As before, first clone and build rocPRIM from release/rocm-rel-6.2, then rocSPARSE from release/rocm-rel-6.2.
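
Both recommendations come down to the same ordering, which could be scripted roughly as follows. This is a sketch under assumptions, not an official build procedure: it assumes rocPRIM lives in the same GitHub organization as the rocSPARSE URL used in the report, that CMake is 3.15 or newer, and that Ninja and hipcc are installed. `BRANCH`, `WORKDIR`, `DRY_RUN`, `run`, `build_one` and `build_all` are illustrative names. The CMake options are the ones from the reproduction steps.

```shell
# Sketch: build rocPRIM before rocSPARSE, both from the same branch.
# Assumption: rocPRIM is hosted next to rocSPARSE under ROCmSoftwarePlatform.
BRANCH="${BRANCH:-release/rocm-rel-6.2}"    # or "develop" for active development
WORKDIR="${WORKDIR:-$PWD/rocm-src}"         # illustrative checkout location
ROCM_INSTALL_DIR="${ROCM_INSTALL_DIR:-/opt/rocm}"

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Clone, configure, build and install one repository; stop at the first failure.
build_one() {
    repo="$1"
    src="$WORKDIR/$repo"
    run git clone --branch "$BRANCH" \
        "https://github.com/ROCmSoftwarePlatform/$repo" "$src" &&
    run cmake -S "$src" -B "$src/build" -G Ninja \
        -D CMAKE_BUILD_TYPE=Release \
        -D CMAKE_CXX_COMPILER="$ROCM_INSTALL_DIR/bin/hipcc" \
        -D CMAKE_INSTALL_PREFIX="$ROCM_INSTALL_DIR" &&
    run cmake --build "$src/build" &&
    run cmake --install "$src/build"
}

# rocPRIM goes first because rocSPARSE compiles against the installed rocPRIM.
build_all() {
    build_one rocPRIM && build_one rocSPARSE
}
```

Calling `build_all` only prints the planned commands; `DRY_RUN=0 build_all` would actually run them, and `BRANCH=develop` switches to the development setup from recommendation 1.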

@ppanchad-amd ppanchad-amd added generic Build error, or some other issue not caused by an LLVM bug Under Investigation labels Jan 10, 2025
@sohaibnd

sohaibnd commented Jan 10, 2025

> Given a rocm release with say rocPRIM version 3.1.0 and rocSPARSE version 3.1.0, it should be possible to re-build rocSPARSE 3.1.0 with any future rocPRIM 3.Y.Z version where 3.Y.Z >= 3.1.0 up to the next major version change.

Hi @AphidGit, were you able to compile rocSPARSE successfully after compiling an appropriate version of rocPRIM, as @jsandham suggested above? Note that the rocSPARSE docs have been updated to better explain rocSPARSE's dependency on rocPRIM; the change should show up in an upcoming release. If you have any follow-up questions or concerns, let us know; otherwise we can close this issue.


4 participants