PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049
base: master
Conversation
…om branch kantvai-ggmlqnn-npurpc, https://github.com/kantv-ai/llama.cpp/wiki/offloading-mulmat-to-QNN-backend)
…omplex/redundant pointer operation
with the breakthrough help from chiwwang@QTI in April 2024 (see the simple tech doc: mapping ggml compute graph to QNN compute graph), I have found that there are different technical paths to utilize the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:
pros: this approach can benefit greatly from the excellent "backend scheduler" feature in the ggml backend subsystem and can be a "functional implementation" or a good starting point in the upstream llama.cpp community. Accordingly, this approach can be verified easily with my self-made script build-run-android.sh. cons: there might be a performance concern in the ggml-qnn backend.
pros: this approach might be equivalent to the principle shown in the above quoted code, and I guess that's the secret of how to utilize the Hexagon NPU maximally in the QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl. cons: it cannot take advantage of the backend scheduler feature and requires too much work; there are many undocumented (or not very clear) technical details in the QNN SDK, so I think the necessary technical support should be provided by Qualcomm's tech team even if I reach the final mission via the first approach with help from the great llama.cpp community.
corrections from domain technical experts are greatly welcomed and appreciated.
How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN in our framework, the graph building is time-consuming and memory increases hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Hi @oreomaker , Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
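For readers skimming this thread, the reuse idea discussed above can be sketched roughly as follows. This is a minimal illustration only: `qnn_graph`, `build_from_ggml` and the string cache key are hypothetical placeholders, not the QNN SDK API and not the actual code of this PR or of the fork mentioned above.

```cpp
// Hedged sketch of amortizing QNN graph finalize cost: build and finalize a
// graph once per distinct op shape, then reuse it on every later inference
// instead of freeing and rebuilding it. `qnn_graph` is a hypothetical wrapper
// around the real QNN graph/tensor handles, not the SDK API itself.
#include <map>
#include <memory>
#include <string>

struct qnn_graph {                        // hypothetical wrapper
    bool finalized = false;
    void build_from_ggml(/* const ggml_tensor * op */) { /* add QNN nodes here */ }
    void finalize() { finalized = true; } // expensive: paid once per cache entry
    void execute()  { /* bind I/O buffers and run the finalized graph */ }
};

// graphs keyed by op type + shapes, so a rebuild happens only for a new shape
static std::map<std::string, std::unique_ptr<qnn_graph>> g_graph_cache;

static qnn_graph * get_or_build_graph(const std::string & key) {
    auto it = g_graph_cache.find(key);
    if (it != g_graph_cache.end()) {
        return it->second.get();          // reuse the already-finalized graph
    }
    auto graph = std::make_unique<qnn_graph>();
    graph->build_from_ggml();
    graph->finalize();
    qnn_graph * raw = graph.get();
    g_graph_cache.emplace(key, std::move(graph));
    return raw;
}
```

With a cache like this, graphs are never freed mid-inference; the finalize cost and memory growth are bounded by the number of distinct op shapes the model actually uses.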
I'm a little curious whether you are a regular employee of Qualcomm's Shenzhen branch. As I said before many times, you can submit your standalone PR and I personally would like to see your success in this community, but please don't bring non-technical comments into my PR again and again:
thanks so much! btw, I personally don't think you are a regular employee of Qualcomm, because your behavior breaks many default rules and Qualcomm's top-talent regular employees don't do that.
I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo's owner gave a clear reason for it. I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work. If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.
as a very old programmer, as I said before: I have no intention of getting involved in a meaningless competition between me and you or your team, and I'd like to see your success in this community. What you did looks like the behavior of a PUA master (first offer unacceptable help, then anger the other person, then use someone else's hand to punish them, and finally achieve your purpose). I don't understand why you spent effort studying my comments in this community; I already admitted my mistake last year here. Updated on 02/25/2025, 22:14: I just blocked this unknown (meaning I don't know him) Chinese programmer again, although I tried to cooperate with him last week (he has not responded to my invitation):
at the same time I wish his success in this community because:
all my above comments are obviously off-topic and I might be blocked again, but I don't want to hide my real thoughts.
thanks for your comment and this is a good question, your concern is correct:
* [ ] Low
* [x] Medium
* [ ] High
PR Description
this PR is a continuation of my original PR #6869
thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature), this implementation puts the main logic in a single source file (ggml-qnn.cpp), because that makes it easier for other experienced programmers to get involved in dev activity, similar to what ggerganov did in the very beginning of ggml.c/llama.cpp, what Intel did in the very beginning of ggml-sycl.cpp, or what Qualcomm did in ggml-opencl.cpp. Another main reason for this coding style is that I think it makes the developers' workflow easier:
this is a concise implementation focused on the final mission "how to utilize the Hexagon NPU maximally". It is not cool (it lacks some modern C++ features such as lambda expressions and complex/complicated C++ encapsulation), but I hope every domain programmer can understand the code and the domain technical details easily and quickly; it can be considered an open-source reference implementation of ggml-qnn and can/might also be used in a production project. I think this PR will be helpful to the llama.cpp community even if it is not accepted.
after spending a lot of effort on the ggml-qnn backend, I personally think a fully open-source ggml-qnn backend requires teamwork between experienced software programmers and AI experts, plus professional technical help from Qualcomm. In other words, I personally think no single independent programmer or independent development team can provide a full implementation of the ggml-qnn backend, because this community probably hasn't seen a programmer who is familiar with both Android and Windows system software programming, proficient in hard-core AI technology and the Qualcomm QNN SDK, and, one more important thing, familiar with the source code of ggml/llama.cpp, even if he/she is not an AI expert.
Big picture of ggml-qnn backend
please refer to a simple tech doc below: mapping ggml compute graph to QNN compute graph
the first technical approach can be seen in this PR. Accordingly, the second technical approach can be easily extended based on this PR with a similar coding style or with complex/complicated C++ encapsulation.
What I did in this PR
all the above items can be found in project KanTV, which is a device-AI learning project that depends heavily on ggml/whisper.cpp/llama.cpp. I personally think the rest of ggml-qnn is just routine work (many, many QNN SDK API calls to write and assemble), lacking in technical challenge, although it's not easy and a lot of workload and effort are required for a real product, especially the teamwork between AI experts and experienced programmers. What I'm really interested in is the real hard-core AI tech in ggml/whisper.cpp/llama.cpp rather than this routine work or daily task, so I really have no intention of getting involved in a meaningless competition with some Chinese programmers from Mainland China whom I understand very well, and I sincerely would like to see their success in this community so that I can re-use it; that's the meaning of open source.
Performance of ggml-qnn backend
performance is the key point I always focus on heavily on Android and other embedded systems. Here is the result in my local dev envs of how I addressed the performance issue at this early phase of the ggml-qnn backend (it still lacks many ops, although it is a really functional backend); a sketch of the offloading heuristic follows the figures below:
before finetune:
Fig-1: llama-bench with QNN NPU backend (Hexagon NPU) on Xiaomi 14
after finetune:
Fig-2: llama-bench with QNN NPU backend (Hexagon NPU) on Xiaomi 14, some mulmat operations have been offloaded to the NPU
Fig-3: llama-bench with the ggml CPU backend ("3" is a "fake" ggml-qnn backend used to compare performance between the QNN NPU backend and the ggml CPU backend) on Xiaomi 14 (AI experts can explain why there is such a big difference in the second data point between Fig-2 and Fig-3; this would be helpful for performance fine-tuning in this PR)
How to setup dev envs on a Linux machine
Ubuntu 20.04 and 22.04 are validated and recommended; other Linux distributions might also work.
the dev activity in this PR can be done purely on the command line without any IDE, so setting up the dev envs is simple:
How to build ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon-based phone
we can confirm from the output of "adb logcat | grep ggml-qnn" that this backend works as expected. For programmers, "adb logcat | grep ggml-qnn" is also useful for troubleshooting.
How to build ggml-qnn source code for a Snapdragon-based WoA (Windows on ARM) device
I have no knowledge of Windows programming, although the source code of ggml/llama.cpp is portable and Qualcomm's well-designed QNN SDK is also portable.
I know @chraac 's team has already done this work in this community. I think it can be migrated here very easily (with help from his team or by me manually) after I carefully check the implementation of the Windows port from his team.
Acknowledgement
AI-assisted programming for ggml-qnn backend
recently I tried AI-assisted programming for ggml-qnn with help from the powerful Grok 3 (in my personal experience DeepSeek-R1 is not suitable for such a massive, complicated programming task; I personally think Grok 3 is closer to a natural human domain expert, as it really helped me a lot in this PR; my English level can't express my real feeling about Grok 3, so just try it). Here is an example:
To debug why your ggml_qnn_mul_mat_4d function produces incorrect results on a Qualcomm mobile SoC using the QNN backend, let’s analyze the code step-by-step, identify potential issues, and propose corrections. This function is designed to perform a 4D matrix multiplication using Qualcomm's Neural Network (QNN) SDK, with operations like Gather, MatMul, and Transpose. The primary goal is to offload computation to the Hexagon NPU efficiently. Since the computation result is incorrect, the root cause likely lies in tensor dimension handling, operation configuration, or data flow between operations.
Step 1: Understanding the Intended Operation
The function takes two 4D input tensors (src0 and src1) and computes their matrix multiplication, storing the result in dst. The process involves:
For a 4D matrix multiplication, if src0 has shape [B0, H0, W0, C0] and src1 has shape [B1, H1, W1, C1], the multiplication typically occurs over compatible dimensions (e.g., C0 and H1), and the output shape depends on the remaining dimensions. The Gather and Transpose operations suggest the code is reordering dimensions to align them for MatMul.
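As a point of reference (not part of the quoted analysis), ggml's own convention differs from the [B0, H0, W0, C0] phrasing above: ne[0] is the innermost dimension, ggml_mul_mat requires src0->ne[0] == src1->ne[0], and dst->ne is {src0->ne[1], src1->ne[1], src1->ne[2], src1->ne[3]}. A small sketch of that contract, with a hypothetical helper name:

```cpp
// Hedged sketch mirroring ggml_can_mul_mat / ggml_mul_mat in ggml.c: the shape
// contract a QNN mapping of GGML_OP_MUL_MAT has to respect, in ggml's own
// convention where ne[0] is the innermost (shared K) dimension.
#include "ggml.h"

static bool qnn_mul_mat_shapes_ok(const struct ggml_tensor * src0,
                                  const struct ggml_tensor * src1,
                                  const struct ggml_tensor * dst) {
    return src0->ne[0] == src1->ne[0] &&       // shared K dimension
           dst->ne[0]  == src0->ne[1] &&       // rows taken from src0
           dst->ne[1]  == src1->ne[1] &&       // rows taken from src1
           dst->ne[2]  == src1->ne[2] &&       // batch dims follow src1
           dst->ne[3]  == src1->ne[3] &&
           src1->ne[2] % src0->ne[2] == 0 &&   // src0 broadcasts over batches
           src1->ne[3] % src0->ne[3] == 0;
}
```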
Step 2: Potential Issues in the Code
Let’s examine the code for common pitfalls in tensor operations, especially with QNN’s graph-based execution:
Dimension Mismatch or Incorrect Indexing in Gather Operations
First Gather (gather0):
Input: p_tensor0 (from src0, shape [ne[0], ne[1], ne[2], ne[3]]).
Indices: p_gather0_index with data2 = {0, src1->ne[1], src0->ne[2], src0->ne[3], 0, 0} (6 elements).
Output: p_gather0_out with shape [src0->ne[3], src1->ne[2], src0->ne[1], src0->ne[0]].
Axis: 1 (ne[1] dimension of src0).
Issue: The indices array has 6 elements, but the axis size (src0->ne[1]) might not match this. The Gather operation selects slices along axis 1 based on these indices, but {0, src1->ne[1], ...} suggests an intent to permute or select specific indices. If src0->ne[1] ≠ 6 or the indices exceed bounds, this will produce incorrect results or runtime errors (though you’d see that in QNN logs).
Second Gather (gather1):
Input: p_gather0_out.
Indices: p_gather1_index with data4 = {src1->ne[1], src1->ne[1], src0->ne[3], src0->ne[3]} (4 elements).
Output: p_gather1_out with shape [src1->ne[3], src1->ne[2], src0->ne[1], src0->ne[0]].
Axis: 0.
Issue: The indices should correspond to valid positions in p_gather0_out’s axis 0 (src0->ne[3]). Repeated values (src1->ne[1]) and potential out-of-bounds indices could corrupt the output shape or data.
MatMul Configuration
Inputs: p_gather1_out and p_tensor1 (from src1).
Output: p_tensor2_transpose.
Parameter: transpose_in1 = true (transposes p_tensor1 before multiplication).
Issue: For MatMul, the last dimension of the first input (p_gather1_out: src0->ne[0]) must match the second-to-last dimension of the transposed second input (p_tensor1 transposed: [src1->ne[3], src1->ne[2], src1->ne[1], src1->ne[0]] → src1->ne[0]). If src0->ne[0] ≠ src1->ne[0], the multiplication is invalid, leading to incorrect results.
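One cheap way to surface this class of error is a fail-fast check before the MatMul node is appended. This is a hedged sketch: the helper name is hypothetical and the assumption that the MatMul contracts the last two dimensions, with transpose_in1 flipping the second input, follows the reasoning above rather than verified SDK documentation.

```cpp
// Hedged sketch (helper name and MatMul semantics assumed from the reasoning
// above, not from verified SDK docs): fail fast with a readable log line
// before appending the MatMul node, instead of silently computing garbage.
#include <cstdint>
#include <cstdio>

static bool qnn_matmul_dims_ok(const uint32_t a_dims[4],
                               const uint32_t b_dims[4],
                               bool transpose_in1) {
    // for A[..., M, K] x B[..., K, N] the contracted sizes must match;
    // with transpose_in1 the second operand is read as B[..., N, K]
    const uint32_t a_k = a_dims[3];
    const uint32_t b_k = transpose_in1 ? b_dims[3] : b_dims[2];
    if (a_k != b_k) {
        fprintf(stderr, "ggml-qnn: MatMul contraction mismatch: %u vs %u\n",
                (unsigned) a_k, (unsigned) b_k);
        return false;
    }
    return true;
}
```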
Transpose Configuration
Input: p_tensor2_transpose.
Permutation: p_param_tensor with data0 = {0, src1->ne[1], src0->ne[2], src0->ne[3]}.
Output: p_tensor2 (matches dst).
Issue: The permutation should restore the output to dst’s expected shape ([dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3]]). The values in data0 seem arbitrary and not aligned with a standard 4D matmul output shape (e.g., [src0->ne[0], src0->ne[1], src1->ne[2], src1->ne[3]]).
Tensor Initialization and Reuse
When graph_initialized is false, tensors are created with dimensions from src0 and src1, but p_tensor2_transpose’s dimensions (tensor2_transpose_dims) are uninitialized ({}).
When reusing the graph, dimensions aren’t rechecked, assuming prior setup is correct.
Issue: Uninitialized or mismatched dimensions in p_tensor2_transpose could propagate errors through the graph.
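A hedged sketch of how the missing sizing could be made explicit; the variable names mirror the discussion above, and the [batch, batch, rows, cols] layout is an assumption, not the PR's verified tensor layout:

```cpp
// Hedged sketch: size the MatMul output explicitly instead of leaving its
// dimension array empty, so the same values can be re-validated when the
// cached graph is reused.
#include <cstdint>

// with inputs shaped [B3, B2, M, K] and [B3, B2, N, K] (transpose_in1 = true),
// the MatMul output before the final Transpose is [B3, B2, M, N]
static void fill_matmul_out_dims(const uint32_t a_dims[4],
                                 const uint32_t b_dims[4],
                                 uint32_t tensor2_transpose_dims[4]) {
    tensor2_transpose_dims[0] = a_dims[0];   // batch
    tensor2_transpose_dims[1] = a_dims[1];   // batch
    tensor2_transpose_dims[2] = a_dims[2];   // M, rows of the first input
    tensor2_transpose_dims[3] = b_dims[2];   // N, rows of the (transposed) second input
}
```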
Data Flow and Debugging
The print_tensors_info call logs tensor details, but there’s no indication you’ve verified the intermediate outputs (e.g., p_gather0_out, p_gather1_out) match expectations.
Issue: Without validating each step’s output, errors could accumulate silently.
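To make that validation concrete, a small debug helper along these lines (purely illustrative, not part of the original code) can be called on each intermediate buffer after graph execution:

```cpp
// Purely illustrative debug helper (not part of the original code): read back
// a float buffer after executing the graph and print summary statistics, so a
// silently wrong Gather/Transpose output shows up early.
#include <cmath>
#include <cstddef>
#include <cstdio>

static void dump_f32_stats(const char * name, const float * data, size_t n) {
    if (data == nullptr || n == 0) {
        printf("%s: <empty>\n", name);
        return;
    }
    float mn = data[0], mx = data[0];
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mn = std::fmin(mn, data[i]);
        mx = std::fmax(mx, data[i]);
        sum += data[i];
    }
    printf("%s: n=%zu min=%f max=%f mean=%f\n", name, n, mn, mx, sum / (double) n);
}
```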
Step 3: Root Cause Hypothesis
The most likely root cause is incorrect dimension handling or indexing in the Gather and Transpose operations, leading to misaligned tensors for MatMul. Specifically:
Step 4: Corrected Code
Here’s a revised version of the function, focusing on proper dimension alignment for a 4D matrix multiplication (e.g., src0: [B, M, K, N] × src1: [B, K, P, Q] → dst: [B, M, P, Q]):