KunminghuV2Config doesn't seem to dual issue vector instructions #4190

Closed
4 tasks done
camel-cdr opened this issue Jan 16, 2025 · 4 comments
Labels
bug report Bugs to be confirmed

Comments

@camel-cdr
Contributor

camel-cdr commented Jan 16, 2025

Before start

  • I have read the RISC-V ISA Manual and this is not a RISC-V ISA question.
  • I have read the XiangShan Documents.
  • I have searched the previous issues and did not find anything relevant.
  • I have reviewed the commit messages from the relevant commit history.

Describe the bug

I wasn't able to measure dual issue of LMUL=1 vector instructions on KunminghuV2Config.

In an unrolled loop with no or only minimal dependency chains, e.g.:

vadd.vv v8,v16,v24
vadd.vv v9,v17,v25
vadd.vv v10,v18,v26
vadd.vv v11,v19,v27
...

with different instructions (vadd.vv, vid.v, interleaved vadd.vv and vfadd.vv), the measured average throughput (cycles per "LMUL=1 instruction", where an LMUL=2 instruction counts as two) was always about 1 at LMUL=1 and about 0.75 at LMUL=2.
The results were similar with different register dependency chains.

Expected behavior

I expected a measured throughput of 0.5 cycles per LMUL=1 instruction, since KunminghuV2Config has two vector execution units that support vadd.vv.

To Reproduce

I used this buildscript and the following files to create the measurements:

// asm.S
.text
.balign 8

#define UNROLL 8
#define LOOP 64

.global bench_add
bench_add:
        li a0, LOOP
        csrr a1, cycle
1:
.rept UNROLL
add t0,t1,t2
add t3,t4,t5
add t6,a2,a3
add a4,a5,a6
add t0,t1,t2
add t3,t4,t5
add t6,a2,a3
add a4,a5,a6
.endr
        addi a0, a0, -1
        bnez a0, 1b
        fence.i
        csrr a0, cycle
        sub a0, a0, a1
ret


.global bench_mul
bench_mul:
        li a0, LOOP
        csrr a1, cycle
1:
.rept UNROLL
mul t0,t1,t2
mul t3,t4,t5
mul t6,a2,a3
mul a4,a5,a6
mul t0,t1,t2
mul t3,t4,t5
mul t6,a2,a3
mul a4,a5,a6
.endr
        addi a0, a0, -1
        bnez a0, 1b
        fence.i
        csrr a0, cycle
        sub a0, a0, a1
ret

.global bench_vaddvv_m1
bench_vaddvv_m1:
        vsetvli t0, x0, e32, m1, ta, ma
        li a0, LOOP
        csrr a1, cycle
1:
.rept UNROLL
vadd.vv v8,v16,v24
vadd.vv v9,v17,v25
vadd.vv v10,v18,v26
vadd.vv v11,v19,v27
vadd.vv v12,v20,v28
vadd.vv v13,v21,v29
vadd.vv v14,v22,v30
vadd.vv v15,v23,v31
.endr
        addi a0, a0, -1
        bnez a0, 1b
        fence.i
        csrr a0, cycle
        sub a0, a0, a1
ret

.global bench_vaddvv_m2
bench_vaddvv_m2:
        vsetvli t0, x0, e32, m2, ta, ma
        li a0, LOOP
        csrr a1, cycle
1:
.rept UNROLL
vadd.vv v8,v16,v24
vadd.vv v10,v18,v26
vadd.vv v12,v20,v28
vadd.vv v14,v22,v30
.endr
        addi a0, a0, -1
        bnez a0, 1b
        fence.i
        csrr a0, cycle
        sub a0, a0, a1
ret

// main.c
#include <klib.h>

size_t bench_add(void);
size_t bench_mul(void);
size_t bench_vaddvv_m1(void);
size_t bench_vaddvv_m2(void);

int main(void) {
	for (size_t i = 0; i < 10; ++i) {
		/* Cast to unsigned so the argument type matches %u. */
		printf("add:         %u\n", (unsigned)bench_add());
		printf("mul:         %u\n", (unsigned)bench_mul());
		printf("LMUL=1 vadd: %u\n", (unsigned)bench_vaddvv_m1());
		printf("LMUL=2 vadd: %u\n", (unsigned)bench_vaddvv_m2());
	}
	return 0;
}


// Makefile
SRCS = asm.S main.c
include $(AM_HOME)/Makefile.app

Executing this results in the following output:

$ make ARCH=riscv64-xs && /xs-env/XiangShan/build/emu --no-diff -i ./build/-riscv64-xs.bin 2>/dev/null
add:         1294
mul:         2125
LMUL=1 vadd: 4244
LMUL=2 vadd: 3136

Notice that the code above executes add/mul/LMUL=1 vadd.vv 4096 times, and LMUL=2 vadd.vv 2048 times.

  • add is quad issue, so about 1024 cycles are expected; the measurement roughly matches.
  • mul is dual issue, so about 2048 cycles are expected; the measurement roughly matches.
  • vadd.vv is supposed to be dual issue, so about 2048 cycles are expected in both cases, but the measured cycle counts correspond to single issue at LMUL=1 and roughly 4/3 issue at LMUL=2.

Environment

  • XiangShan branch: master
  • XiangShan commit id: ebd53cd

Additional context

No response

@camel-cdr camel-cdr added the bug report Bugs to be confirmed label Jan 16, 2025
@NewPaulWalker
Contributor

Thank you for your interest in XiangShan and for raising this issue.

Regarding RISC-V vector instructions, XiangShan splits them into several micro-operations, called uops, during decoding. Each cycle can decode at most one vector instruction.

Therefore, when LMUL is 1, even though there are two vector execution units capable of executing vadd.vv, only one vector instruction can be decoded per cycle due to decoding limitations. As a result, executing 4096 vadd.vv instructions with LMUL = 1 would still require at least 4096 cycles.

When LMUL is 2, only one vector instruction can be decoded per cycle, and each vector instruction is split into two uops. These two uops can then be dispatched to the two vector execution units supporting vadd.vv for execution. Thus, executing 2048 vadd.vv instructions with LMUL = 2 would require at least 2048 cycles.

However, the test results indicate that it takes over 3000 cycles. We have identified this performance issue and determined that it occurs because dependencies are still treated as present even in cases where the vd operand is not required. For example, vector registers such as v8, v10, v12, and v14 in the current instruction must wait for the corresponding write operations from the previous instruction, resulting in execution time exceeding 2048 cycles.

@NewPaulWalker
Contributor

> However, the test results indicate that it takes over 3000 cycles. We have identified this performance issue and determined that it occurs because dependencies are still treated as present even in cases where the vd operand is not required. For example, vector registers such as v8, v10, v12, and v14 in the current instruction must wait for the corresponding write operations from the previous instruction, resulting in execution time exceeding 2048 cycles.

The above issue has been fixed by #4198

@camel-cdr
Contributor Author

@NewPaulWalker

Thanks for the reply.

I tested the new branch and the LMUL=2 code works as expected.

Decoding one vector instruction per cycle seems like a weird design decision for a core with such a wide backend (5 vector execution units + 2 vector load/store units).
Doesn't this mean LMUL=1 code will never use half of the execution units, as the core can execute two basic integer and two regular floating-point vector instructions at a time? So you'd essentially get the same performance characteristics at LMUL=1 with half of the vector execution units?
Are there plans to increase the vector decode width?

@NewPaulWalker
Contributor

> Decoding one vector instruction per cycle seems like a weird design decision for a core with such a wide backend (5 vector execution units + 2 vector load/store units).

Vector decode is limited to one instruction per cycle for timing reasons.

> Are there plans to increase the vector decode width?

This issue needs to be addressed in the future, but there are no plans for it in the near term.

> Doesn't this mean LMUL=1 code will never use half of the execution units, as the core can execute two basic integer and two regular floating-point vector instructions at a time? So you'd essentially get the same performance characteristics at LMUL=1 with half of the vector execution units?

You’re right. With LMUL=1, the code cannot keep more than half of the execution units busy, so it would show similar performance characteristics even with only half of the vector execution units.

2 participants