In this benchmark, we compare the performance of the Taichi built-in SVD solver against a CUDA implementation. The Taichi built-in SVD solver implements the method described in [1], which is accompanied by a carefully optimized CUDA code repository, 3x3_SVD_CUDA. We slightly modified the CUDA benchmark code to fit our benchmark suite, but left all the SVD kernel code unchanged to keep the comparison fair.
We evaluate performance on the following device.
Device | Nvidia RTX 3080 (10GB) |
---|---|
FP32 performance | 29700 GFLOPS |
Memory bandwidth | 760 GB/s |
L2 cache capacity | 5 MB |
Driver version | 470.57.02 |
CUDA version | 11.4 |
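
As a rough aid to reading these figures, the sketch below derives the device's machine balance (peak FLOP per byte of memory traffic) and a bandwidth-only lower bound on kernel time. The traffic per SVD is an assumption, not something taken from the benchmark code: each 3x3 FP32 matrix is read once, and U, the three singular values, and V are each written once. The helper names are hypothetical.

```python
# Device figures copied from the table above.
PEAK_FP32_GFLOPS = 29700.0   # GFLOP/s
MEM_BANDWIDTH_GBS = 760.0    # GB/s

# ASSUMPTION: per 3x3 FP32 SVD we read A (9 floats) and write
# U (9 floats), sigma (3 floats) and V (9 floats), with no reuse or caching.
BYTES_PER_FLOAT = 4
FLOATS_PER_SVD = 9 + 9 + 3 + 9
BYTES_PER_SVD = FLOATS_PER_SVD * BYTES_PER_FLOAT


def machine_balance():
    """Peak FLOPs the device can issue per byte moved to or from memory."""
    return PEAK_FP32_GFLOPS / MEM_BANDWIDTH_GBS


def bandwidth_bound_ms(num_matrices):
    """Lower bound on kernel time (ms) if memory traffic were the only cost."""
    total_bytes = num_matrices * BYTES_PER_SVD
    return total_bytes / (MEM_BANDWIDTH_GBS * 1e9) * 1e3


if __name__ == "__main__":
    print(f"machine balance: {machine_balance():.1f} FLOP/byte")
    print(f"1M matrices, bandwidth bound: {bandwidth_bound_ms(1_000_000):.3f} ms")
```

A kernel whose arithmetic intensity falls below the machine balance cannot reach peak FP32 throughput on this device, so its measured time is best compared against the bandwidth bound rather than the FLOPS figure.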
Performance is measured as kernel compute time, recorded with the cudaEvent APIs; lower is better. The unit is milliseconds (ms). In each experiment, we first conduct a warm-up run and then time 10 repeated invocations.
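
The warm-up-then-repeat protocol can be sketched as follows. This is only a host-side illustration: it uses Python's wall clock (`time.perf_counter`) as a stand-in for the cudaEvent APIs used by the actual benchmark, and `time_kernel` and `kernel` are hypothetical names. It reports the total time over all repeats; whether the benchmark reports a total or a per-invocation figure is not stated here.

```python
import time


def time_kernel(kernel, repeats=10):
    """Run one untimed warm-up call, then time `repeats` calls; returns ms."""
    kernel()  # warm-up: excludes one-off costs such as compilation or caching

    start = time.perf_counter()
    for _ in range(repeats):
        kernel()
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1e3  # seconds -> milliseconds


if __name__ == "__main__":
    # Stand-in workload; a real GPU kernel would also need a device
    # synchronization before reading the clock.
    total_ms = time_kernel(lambda: sum(i * i for i in range(100_000)))
    print(f"10 invocations: {total_ms:.3f} ms")
```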
The figure shows that Taichi slightly outperforms the CUDA implementation in the AOS (array-of-structures) layout, while performance in the SOA (structure-of-arrays) layout is neck and neck. We also observed that overall performance is generally bound by memory access efficiency. The results indicate that the compute kernels implemented in Taichi and CUDA are both highly efficient.
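
To make the two layouts concrete, the sketch below shows where entry `entry` of matrix `matrix_id` would sit in a flat float buffer, assuming AOS and SOA carry their usual meanings (array of structures and structure of arrays) and that each 3x3 matrix is stored as 9 consecutive row-major entries in AOS. The function names and the exact ordering are illustrative assumptions, not taken from the benchmark code.

```python
ENTRIES_PER_MATRIX = 9  # a 3x3 matrix


def aos_index(matrix_id, entry):
    """AOS: the 9 entries of one matrix are contiguous."""
    return matrix_id * ENTRIES_PER_MATRIX + entry


def soa_index(matrix_id, entry, num_matrices):
    """SOA: the same entry of all matrices is contiguous."""
    return entry * num_matrices + matrix_id


if __name__ == "__main__":
    n = 1000
    # Distance in floats between the same entry of two neighbouring matrices.
    print("AOS stride:", aos_index(1, 0) - aos_index(0, 0))
    print("SOA stride:", soa_index(1, 0, n) - soa_index(0, 0, n))
```

When neighbouring GPU threads each process one matrix, this stride determines how far apart their memory accesses land, which is one reason the two layouts are benchmarked separately.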
- Pre-requisites
python3 -m pip install --upgrade taichi
python3 -m pip install matplotlib
If you want to compare with CUDA, make sure nvcc is properly installed.
- Run the benchmark and draw the plots
python3 plot_benchmark.py
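
Before running the CUDA comparison, one quick way to check that nvcc is reachable is to look it up on `PATH`. The sketch below uses only the Python standard library; `find_nvcc` is a hypothetical helper and is not part of the benchmark scripts. It checks only that the executable can be found, not that the CUDA toolkit is correctly configured.

```python
import shutil


def find_nvcc():
    """Return the full path to nvcc if it is on PATH, otherwise None."""
    return shutil.which("nvcc")


if __name__ == "__main__":
    path = find_nvcc()
    if path is None:
        print("nvcc not found on PATH; the CUDA comparison cannot be built.")
    else:
        print(f"nvcc found at {path}")
```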
[1] Gao, M., Wang, X., Wu, K., Pradhana, A., Sifakis, E., Yuksel, C., & Jiang, C. (2018). GPU optimization of material point methods. ACM Transactions on Graphics (TOG), 37(6), 1-12.