-
Dear cuQuantum team, I am trying to run distributed cuQuantum-Python on the Perlmutter supercomputer (specifically, https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py), and I am facing two different errors depending on how I run it. The first one seems to be Slurm-dependent (details below), but the second one I think is related to my setup, and I was wondering if you could kindly advise me on what could be wrong (I have also contacted the Perlmutter service desk about this). I install cuQuantum into a local Conda environment as follows:
This is in order to use the Perlmutter-native MPICH library, which according to their documentation is CUDA-aware. I made sure that the $CUTENSORNET_COMM_LIB variable points to the libcutensornet_distributed_interface_mpi.so located in my Conda environment.
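As a quick sanity check on that environment variable, something like the following can be run before launching the sample (a minimal sketch; the `comm_lib_ok` helper is mine, not part of cuQuantum):

```python
import os

# Sketch: verify that CUTENSORNET_COMM_LIB points at the MPI interface shim.
# The helper below is illustrative, not part of the cuQuantum API.
def comm_lib_ok(path: str) -> bool:
    """True if path is absolute and ends in the expected library name."""
    return (os.path.isabs(path)
            and os.path.basename(path) == "libcutensornet_distributed_interface_mpi.so")

lib = os.environ.get("CUTENSORNET_COMM_LIB", "")
if not comm_lib_ok(lib):
    print(f"CUTENSORNET_COMM_LIB looks wrong or unset: {lib!r}")
elif not os.path.isfile(lib):
    print(f"CUTENSORNET_COMM_LIB is set but the file does not exist: {lib}")
```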
If I then try to execute example22_mpi_auto.py within an interactive Slurm session using srun, I get errors of the following type:
This however can be bypassed if I use the
(I am trying to run on 4 GPUs, apologies for the bloated output.) Would you have any suggestions as to what in my setup could be causing this error? One hypothesis is that there is a mismatch between the cudatoolkit libraries Perlmutter's MPICH is trying to talk to (I have the default cudatoolkit/11.7 module loaded, but cudatoolkit/11.8 seems to be installed locally by Conda, and that is probably what cuQuantum is using). Many thanks!
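To make the version-mismatch hypothesis concrete, comparing the major.minor components of the two toolkit versions is one way to flag it (a minimal sketch; the helper and the hard-coded version strings are illustrative examples, not read from the system):

```python
# Sketch: compare the CUDA toolkit loaded as a Perlmutter module with the one
# installed by Conda. The helper is hypothetical; versions are example strings.
def cuda_versions_match(a: str, b: str) -> bool:
    """Compare the major.minor components of two CUDA toolkit version strings."""
    return a.split(".")[:2] == b.split(".")[:2]

module_version = "11.7"  # e.g. from `module list`
conda_version = "11.8"   # e.g. from `conda list cudatoolkit`
if not cuda_versions_match(module_version, conda_version):
    print("Potential mismatch between the module and Conda CUDA toolkits")
```

On the actual system, `cupy.cuda.runtime.runtimeGetVersion()` can additionally report which CUDA runtime CuPy (and hence cuQuantum-Python) actually linked against.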
-
A little correction: the last error says it originates from line 45 of the example; however, I deleted some docstrings at the top of the file in my local version, so it actually corresponds to line 60 in the original file, specifically this line:
-
Two more updates:
-
Tracked this issue in #31.