KunminghuV2Config doesn't seem to dual issue vector instructions #4190
Comments
Thank you for your interest in XiangShan and for raising this issue. Regarding RISC-V vector instructions, XiangShan splits them into several micro-operations, called uops, during decoding. At most one vector instruction can be decoded per cycle. Therefore, when LMUL is 1, even though there are two vector execution units capable of executing [...]
When LMUL is 2, only one vector instruction can be decoded per cycle, and each vector instruction is split into two uops. These two uops can then be dispatched to the two vector execution units supporting [...]
However, the test results indicate that it takes over 3000 cycles. We have identified this performance issue and determined that it occurs because dependencies are still treated as present even in cases where the [...]
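The decode limit described above can be put into a rough back-of-the-envelope model. This is only a sketch under stated assumptions, not XiangShan's actual pipeline: the function name `cycles_needed` and its parameters are hypothetical, an instruction is assumed to split into exactly `lmul` uops (the reply confirms two uops for LMUL=2; one uop for LMUL=1 is an assumption), each execution unit is assumed to accept one uop per cycle, and pipeline fill, dependencies, and loop overhead are ignored.

```python
def cycles_needed(num_instructions, lmul, decode_per_cycle=1, exec_units=2):
    """Estimate cycles for a dependency-free stream of vector instructions.

    Hypothetical bottleneck model: the slower of decode and execute
    determines how many uops retire per cycle.
    """
    total_uops = num_instructions * lmul
    # Decode delivers decode_per_cycle * lmul uops per cycle at most;
    # the execution units consume exec_units uops per cycle at most.
    uops_per_cycle = min(decode_per_cycle * lmul, exec_units)
    # Ceiling division so a partially filled last cycle still counts.
    return -(-total_uops // uops_per_cycle)


# Instruction counts taken from the issue body (4096 at LMUL=1, 2048 at LMUL=2).
print(cycles_needed(4096, lmul=1))  # decode-bound: 4096
print(cycles_needed(2048, lmul=2))  # both units busy: 2048
```

Under this model, LMUL=1 is capped by the decoder at one uop per cycle, while LMUL=2 should reach about 2048 cycles, which is why a measurement of over 3000 cycles pointed to a separate problem.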
The above issue has been fixed by #4198
Thanks for the reply. I tested the new branch and the LMUL=2 code works as expected. Decoding one vector instruction per cycle seems like a weird design decision for a core with such a wide backend (5 vector execution units + 2 vector load/store units).
Decoding one vector instruction per cycle was chosen for timing reasons.
This issue needs to be addressed in the future, but there are no plans for it in the near term.
You’re right. With LMUL=1, the code cannot use more than half of the vector execution units, so performance at LMUL=1 would indeed be similar even if only half the vector execution units were available.
Before start
Describe the bug
I wasn't able to measure dual issue of LMUL=1 vector instructions on KunminghuV2Config.
In an unrolled loop with no or minimal dependency chain, e.g.:
with different instructions (`vadd.vv`, `vid.v`, interleaved `vadd.vv` & `vfadd.vv`), the measured average throughput (cycles per "LMUL=1 instruction") was always 1 at LMUL=1, and about 0.75 at LMUL=2.
The results were similar with different register dependency chains.
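To make the throughput figures easier to compare with an issue width, the reciprocal can be taken. The following is a small illustrative sketch: the helper name `effective_issue_width` is hypothetical, and the measured "about 0.75" is treated as exactly 3/4 for the arithmetic.

```python
from fractions import Fraction


def effective_issue_width(cycles_per_instruction):
    """Instructions completed per cycle, given cycles per LMUL=1 instruction."""
    return 1 / Fraction(cycles_per_instruction)


# Measured values reported above; 0.75 is assumed to be exactly 3/4.
measured = {"LMUL=1": Fraction(1), "LMUL=2": Fraction(3, 4)}
for name, cpi in measured.items():
    print(name, effective_issue_width(cpi))

# A throughput of 1/2 cycle per instruction would correspond to dual issue.
print("dual issue", effective_issue_width(Fraction(1, 2)))
```

So a throughput of 1 corresponds to single issue, 0.75 to an effective width of 4/3, and 0.5 would be true dual issue.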
Expected behavior
The expected behavior would be a measured throughput of 0.5, since KunminghuV2Config has two vector execution units that support `vadd.vv`.
To Reproduce
I used this buildscript and the following files to create the measurements:
Executing this results in the following output:
Notice that the code above executes `add/mul/LMUL=1 vadd.vv` 4096 times, and `LMUL=2 vadd.vv` 2048 times.
`add` is quad issue, so about 1024 cycles are expected; this roughly matches.
`mul` is dual issue, so about 2048 cycles are expected; this roughly matches.
`vadd.vv` is supposed to be dual issue, so about 2048 cycles are expected, but we get the cycle count of a single-issue instruction at LMUL=1 and of a 4/3-issue instruction at LMUL=2.
Environment
Additional context
No response