How many blocks or threads per block are used by the cuQuantum API? #20
-
As I understand it, the cuStateVec API is implemented with CUDA kernel functions under the hood. However, none of the cuQuantum APIs let me set the block count and threads-per-block parameters the way an explicit kernel launch does, e.g. <<<blocks_count, threads_per_block>>>. How can I find out exactly how many threads a cuQuantum API call uses?
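For context, the contrast being drawn looks roughly like this; the kernel and launch configuration below are purely illustrative and are not cuStateVec internals:

```cuda
#include <cuComplex.h>

// Illustrative only: with a hand-written kernel, the caller chooses the
// launch configuration via <<<blocks_count, threads_per_block>>>.
__global__ void scaleStateVector(cuDoubleComplex* sv, double factor, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sv[i].x *= factor;
        sv[i].y *= factor;
    }
}

void launchExample(cuDoubleComplex* d_sv, size_t n)
{
    int threads_per_block = 256;  // chosen by the caller
    int blocks_count = (int)((n + threads_per_block - 1) / threads_per_block);
    scaleStateVector<<<blocks_count, threads_per_block>>>(d_sv, 0.5, n);
}

// A cuStateVec call, by contrast, exposes no such parameters; the library
// decides the configuration of whatever kernels it launches internally:
//   custatevecApplyMatrix(handle, d_sv, /* remaining arguments elided */ ...);
```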
Replies: 1 comment 6 replies
-
As in any CUDA program (and CUDA libraries in particular), the grid/block/shared-memory sizes depend on many factors: algorithm, implementation, hardware, driver, and so on. On top of that, a single API call may launch multiple kernels, so there is no single answer to this question. And even if you had this information, there is not much you could do with it.

If you're interested, one way to check is to run your workload under nsys (the Nsight Systems command-line profiler) and then open the generated report in the Nsight Systems GUI. In the GPU timeline you can inspect each kernel's launch configuration.
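If it helps, here is a minimal sketch of that workflow; the NVTX range name and application/report names are assumptions for illustration, not part of the cuStateVec API:

```cuda
// Profile with:  nsys profile -o my_report ./my_app
// Then open the generated report in the Nsight Systems GUI and inspect the
// kernel rows in the GPU timeline for each kernel's grid and block dimensions.

#include <nvtx3/nvToolsExt.h>   // NVTX markers (header-only in NVTX3)
#include <custatevec.h>

void profiledSection(custatevecHandle_t handle /* , state vector args ... */)
{
    // Wrap the API call of interest in an NVTX range so the kernels it
    // launches line up under a named region in the Nsight Systems timeline.
    nvtxRangePushA("cuStateVec gate application");

    // ... cuStateVec call(s) under inspection, e.g. custatevecApplyMatrix(...)

    nvtxRangePop();
}
```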