-
Dear cuQuantum team, I am trying to run distributed cuQuantum-Python on the Perlmutter supercomputer (specifically, https://github.com/NVIDIA/cuQuantum/blob/main/python/samples/cutensornet/coarse/example22_mpi_auto.py), and I am facing two different errors depending on how I run it. The first one seems to be Slurm-dependent (details below), but the second one I think is related to my setup, and I was wondering if you could kindly advise me on what could be wrong (I have also contacted the Perlmutter service desk about this). I install cuQuantum into a local Conda environment as follows:
This is in order to use the Perlmutter-native MPICH library, which according to their documentation is CUDA-aware. I made sure that the $CUTENSORNET_COMM_LIB variable points to the libcutensornet_distributed_interface_mpi.so located in my Conda environment.
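As a quick sanity check on that environment variable, something like the following can be run before launching the sample (a minimal sketch; the `comm_lib_ok` helper is mine, not part of cuQuantum):

```python
import os

# Sketch: verify that CUTENSORNET_COMM_LIB points at the MPI interface shim.
# The helper below is illustrative, not part of the cuQuantum API.
def comm_lib_ok(path: str) -> bool:
    """True if path is absolute and ends in the expected library name."""
    return (os.path.isabs(path)
            and os.path.basename(path) == "libcutensornet_distributed_interface_mpi.so")

lib = os.environ.get("CUTENSORNET_COMM_LIB", "")
if not comm_lib_ok(lib):
    print(f"CUTENSORNET_COMM_LIB looks wrong or unset: {lib!r}")
elif not os.path.isfile(lib):
    print(f"CUTENSORNET_COMM_LIB is set but the file does not exist: {lib}")
```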
If I then try to execute example22_mpi_auto.py within an interactive Slurm session using srun, I get errors of the following type:
This however can be bypassed if I use the
(I am trying to run on 4 GPUs, apologies for the bloated output.) Would you have any suggestions as to what in my setup could be causing this error? One hypothesis is that there is a mismatch between the cudatoolkit libraries Perlmutter's MPICH is trying to talk to (I have the default cudatoolkit/11.7 module loaded, but cudatoolkit/11.8 seems to be installed locally by Conda, and that is probably what cuQuantum is using). Many thanks!
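To make the version-mismatch hypothesis concrete, comparing the major.minor components of the two toolkit versions is one way to flag it (a minimal sketch; the helper and the hard-coded version strings are illustrative examples, not read from the system):

```python
# Sketch: compare the CUDA toolkit loaded as a Perlmutter module with the one
# installed by Conda. The helper is hypothetical; versions are example strings.
def cuda_versions_match(a: str, b: str) -> bool:
    """Compare the major.minor components of two CUDA toolkit version strings."""
    return a.split(".")[:2] == b.split(".")[:2]

module_version = "11.7"  # e.g. from `module list`
conda_version = "11.8"   # e.g. from `conda list cudatoolkit`
if not cuda_versions_match(module_version, conda_version):
    print("Potential mismatch between the module and Conda CUDA toolkits")
```

On the actual system, `cupy.cuda.runtime.runtimeGetVersion()` can additionally report which CUDA runtime CuPy (and hence cuQuantum-Python) actually linked against.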
-
A little correction: the last error says it originates from line 45 of the example; however, I deleted some docstrings at the top of the file in my local version, so it actually corresponds to line 60 in the original file, specifically this line:
-
Two more updates:
-
Tracked this issue in #31.