PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049
base: master
Conversation
…om branch kantvai-ggmlqnn-npurpc, https://github.com/kantv-ai/llama.cpp/wiki/offloading-mulmat-to-QNN-backend)
…omplex/redundant pointer operation
with the breakthrough help from chiwwang@QTI in April 2024 (see the simple tech doc: mapping ggml compute graph to QNN compute graph), I have found that there are different technical paths to utilize the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:
pros: this approach can benefit greatly from the excellent "backend scheduler" feature in the ggml backend subsystem and can be a "functional implementation" or a good starting point in the upstream llama.cpp community. Accordingly, this approach can be verified easily with my self-made script build-run-android.sh. cons: there might be a performance concern in the ggml-qnn backend.
pros: this approach might be equivalent to the principle shown in the above quoted code, and I guess that's the secret of how to utilize the Hexagon NPU maximally in the QNN backend. I don't know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl. cons: it cannot take advantage of the backend scheduler feature and requires too much work; there are many undocumented (or not very clear) technical details in the QNN SDK, so I think the necessary technical support should be provided by Qualcomm's tech team even if I reach the final mission via the first approach with help from the great llama.cpp community.
corrections from domain technical experts are greatly welcomed and appreciated.
How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN in our framework, the graph building is time-consuming and memory increases hugely when finalizing the QNN graph. It seems that QNN has no easy way to free a graph during execution.
Hi @oreomaker , Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
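For readers skimming this thread, the reuse idea discussed above can be sketched roughly as follows. This is a minimal illustration only: `qnn_graph`, `build_from_ggml` and the string cache key are hypothetical placeholders, not the QNN SDK API and not the actual code of this PR or of the fork mentioned above.

```cpp
// Hedged sketch of amortizing QNN graph finalize cost: build and finalize a
// graph once per distinct op shape, then reuse it on every later inference
// instead of freeing and rebuilding it. `qnn_graph` is a hypothetical wrapper
// around the real QNN graph/tensor handles, not the SDK API itself.
#include <map>
#include <memory>
#include <string>

struct qnn_graph {                        // hypothetical wrapper
    bool finalized = false;
    void build_from_ggml(/* const ggml_tensor * op */) { /* add QNN nodes here */ }
    void finalize() { finalized = true; } // expensive: paid once per cache entry
    void execute()  { /* bind I/O buffers and run the finalized graph */ }
};

// graphs keyed by op type + shapes, so a rebuild happens only for a new shape
static std::map<std::string, std::unique_ptr<qnn_graph>> g_graph_cache;

static qnn_graph * get_or_build_graph(const std::string & key) {
    auto it = g_graph_cache.find(key);
    if (it != g_graph_cache.end()) {
        return it->second.get();          // reuse the already-finalized graph
    }
    auto graph = std::make_unique<qnn_graph>();
    graph->build_from_ggml();
    graph->finalize();
    qnn_graph * raw = graph.get();
    g_graph_cache.emplace(key, std::move(graph));
    return raw;
}
```

With a cache like this, graphs are never freed mid-inference; the finalize cost and memory growth are bounded by the number of distinct op shapes the model actually uses.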
I'm a little curious whether you are a regular employee of Qualcomm's Shenzhen branch. As I said before many times, you can submit your standalone PR and I personally would like to see your success in this community, but please don't bring non-technical comments into my PR again and again:
thanks so much! btw, I personally don't think you are a regular employee of Qualcomm, because your behavior breaks many default rules and Qualcomm's top-talent regular employees don't do that.
I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo's owner gave a clear reason for it. I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work. If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.
as a very old programmer, as I said before: I have no intention of getting involved in a meaningless competition between me and you or your team, and I'd like to see your success in this community. What you did looks like the behavior of a PUA master (first offer unacceptable help, then anger the other person, then use someone else's hand to punish them, and finally achieve your purpose). I don't understand why you spent effort studying my comments in this community; I already admitted my mistake last year here. Updated on 02/25/2025, 22:14: I just blocked this unknown (meaning I don't know him) Chinese programmer again, although I tried to cooperate with him last week (he has not responded to my invitation):
at the same time I wish his success in this community because:
all my above comments are obviously off-topic and I might be blocked again, but I don't want to hide my real thoughts.
thanks for your comment and this is a good question, your concern is correct:
* [ ] Low
* [x] Medium
* [ ] High
PR Description
this PR is a continuation of my original PR #6869
thanks to the huge changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature), this implementation puts the main logic in a single source file (ggml-qnn.cpp), because that makes it easier for other experienced programmers to get involved in dev activity, similar to what ggerganov did in the very beginning of ggml.c/llama.cpp, what Intel did in the very beginning of ggml-sycl.cpp, or what Qualcomm did in ggml-opencl.cpp. Another main reason for this coding style is that I think it makes the developers' workflow easier:
this is a concise implementation focused on the final mission "how to utilize the Hexagon NPU maximally". It is not cool (it lacks some modern C++ features such as lambda expressions and complex/complicated C++ encapsulation), but I hope every domain programmer can understand the code and the domain technical details easily and quickly; it can be considered an open-source reference implementation of ggml-qnn and can/might also be used in a production project. I think this PR will be helpful to the llama.cpp community even if it is not accepted.
after spending a lot of effort on the ggml-qnn backend, I personally think a fully open-source ggml-qnn backend requires teamwork between experienced software programmers and AI experts, plus professional technical help from Qualcomm. In other words, I personally think no single independent programmer or independent development team can provide a full implementation of the ggml-qnn backend, because this community probably hasn't seen a programmer who is familiar with both Android and Windows system software programming, proficient in hard-core AI technology and the Qualcomm QNN SDK, and, one more important thing, familiar with the source code of ggml/llama.cpp, even if he/she is not an AI expert.
Big picture of ggml-qnn backend
please refer to a simple tech doc below: mapping ggml compute graph to QNN compute graph
the first technical approach can be seen in this PR. Accordingly, the second technical approach can be easily extended based on this PR with a similar coding style or with complex/complicated C++ encapsulation.
What I did in this PR
all the above items can be found in project KanTV, which is a device-AI learning project that depends heavily on ggml/whisper.cpp/llama.cpp. I personally think the rest of ggml-qnn is just routine work (many, many QNN SDK API calls to write and assemble), lacking in technical challenge, although it's not easy and a lot of workload and effort are required for a real product, especially the teamwork between AI experts and experienced programmers. What I'm really interested in is the real hard-core AI tech in ggml/whisper.cpp/llama.cpp rather than this routine work or daily task, so I really have no intention of getting involved in a meaningless competition with some Chinese programmers from Mainland China whom I understand very well, and I sincerely would like to see their success in this community so that I can re-use it; that's the meaning of open source.
Performance of ggml-qnn backend
performance is the key point I always focus on heavily on Android and other embedded systems. Here is the result in my local dev envs of how I addressed the performance issue at this early phase of the ggml-qnn backend (it still lacks many ops, although it is a really functional backend); a sketch of the offloading heuristic follows the figures below:
before finetune:
Fig-1: llama-bench with QNN NPU backend (Hexagon NPU) on Xiaomi 14
after finetune:
Fig-2: llama-bench with QNN NPU backend (Hexagon NPU) on Xiaomi 14, some mulmat operations have been offloaded to the NPU
Fig-3: llama-bench with the ggml CPU backend ("3" is a "fake" ggml-qnn backend used to compare performance between the QNN NPU backend and the ggml CPU backend) on Xiaomi 14 (AI experts can explain why there is such a big difference in the second data point between Fig-2 and Fig-3; this would be helpful for performance fine-tuning in this PR)
How to setup dev envs on a Linux machine
Ubuntu 20.04 and 22.04 are validated and recommended; other Linux distributions might also work.
the dev activity in this PR can be done purely on the command line without any IDE, so setting up the dev envs is simple:
How to build ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon-based phone
we can confirm from the output of "adb logcat | grep ggml-qnn" that this backend works as expected. For programmers, "adb logcat | grep ggml-qnn" is also useful for troubleshooting.
How to build ggml-qnn source code for a Snapdragon-based WoA (Windows on ARM) device
I have no knowledge of Windows programming, although the source code of ggml/llama.cpp is portable and Qualcomm's well-designed QNN SDK is also portable.
I know @chraac 's team has already done this work in this community. I think it can be migrated here very easily (with help from his team or by me manually) after I carefully check the implementation of the Windows port from his team.
Acknowledgement
AI-assisted programming for ggml-qnn backend
recently I tried AI-assisted programming for ggml-qnn with help from the powerful Grok 3 (in my personal experience DeepSeek-R1 is not suitable for such a massive, complicated programming task; I personally think Grok 3 is closer to a natural human domain expert, as it really helped me a lot in this PR; my English level can't express my real feeling about Grok 3, so just try it). Here is an example:
To debug why your ggml_qnn_mul_mat_4d function produces incorrect results on a Qualcomm mobile SoC using the QNN backend, let’s analyze the code step-by-step, identify potential issues, and propose corrections. This function is designed to perform a 4D matrix multiplication using Qualcomm's Neural Network (QNN) SDK, with operations like Gather, MatMul, and Transpose. The primary goal is to offload computation to the Hexagon NPU efficiently. Since the computation result is incorrect, the root cause likely lies in tensor dimension handling, operation configuration, or data flow between operations.
Step 1: Understanding the Intended Operation
The function takes two 4D input tensors (src0 and src1) and computes their matrix multiplication, storing the result in dst. The process involves:
For a 4D matrix multiplication, if src0 has shape [B0, H0, W0, C0] and src1 has shape [B1, H1, W1, C1], the multiplication typically occurs over compatible dimensions (e.g., C0 and H1), and the output shape depends on the remaining dimensions. The Gather and Transpose operations suggest the code is reordering dimensions to align them for MatMul.
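As a point of reference (not part of the quoted analysis), ggml's own convention differs from the [B0, H0, W0, C0] phrasing above: ne[0] is the innermost dimension, ggml_mul_mat requires src0->ne[0] == src1->ne[0], and dst->ne is {src0->ne[1], src1->ne[1], src1->ne[2], src1->ne[3]}. A small sketch of that contract, with a hypothetical helper name:

```cpp
// Hedged sketch mirroring ggml_can_mul_mat / ggml_mul_mat in ggml.c: the shape
// contract a QNN mapping of GGML_OP_MUL_MAT has to respect, in ggml's own
// convention where ne[0] is the innermost (shared K) dimension.
#include "ggml.h"

static bool qnn_mul_mat_shapes_ok(const struct ggml_tensor * src0,
                                  const struct ggml_tensor * src1,
                                  const struct ggml_tensor * dst) {
    return src0->ne[0] == src1->ne[0] &&       // shared K dimension
           dst->ne[0]  == src0->ne[1] &&       // rows taken from src0
           dst->ne[1]  == src1->ne[1] &&       // rows taken from src1
           dst->ne[2]  == src1->ne[2] &&       // batch dims follow src1
           dst->ne[3]  == src1->ne[3] &&
           src1->ne[2] % src0->ne[2] == 0 &&   // src0 broadcasts over batches
           src1->ne[3] % src0->ne[3] == 0;
}
```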
Step 2: Potential Issues in the Code
Let’s examine the code for common pitfalls in tensor operations, especially with QNN’s graph-based execution:
Dimension Mismatch or Incorrect Indexing in Gather Operations
First Gather (gather0):
Input: p_tensor0 (from src0, shape [ne[0], ne[1], ne[2], ne[3]]).
Indices: p_gather0_index with data2 = {0, src1->ne[1], src0->ne[2], src0->ne[3], 0, 0} (6 elements).
Output: p_gather0_out with shape [src0->ne[3], src1->ne[2], src0->ne[1], src0->ne[0]].
Axis: 1 (ne[1] dimension of src0).
Issue: The indices array has 6 elements, but the axis size (src0->ne[1]) might not match this. The Gather operation selects slices along axis 1 based on these indices, but {0, src1->ne[1], ...} suggests an intent to permute or select specific indices. If src0->ne[1] ≠ 6 or the indices exceed bounds, this will produce incorrect results or runtime errors (though you’d see that in QNN logs).
Second Gather (gather1):
Input: p_gather0_out.
Indices: p_gather1_index with data4 = {src1->ne[1], src1->ne[1], src0->ne[3], src0->ne[3]} (4 elements).
Output: p_gather1_out with shape [src1->ne[3], src1->ne[2], src0->ne[1], src0->ne[0]].
Axis: 0.
Issue: The indices should correspond to valid positions in p_gather0_out’s axis 0 (src0->ne[3]). Repeated values (src1->ne[1]) and potential out-of-bounds indices could corrupt the output shape or data.
MatMul Configuration
Inputs: p_gather1_out and p_tensor1 (from src1).
Output: p_tensor2_transpose.
Parameter: transpose_in1 = true (transposes p_tensor1 before multiplication).
Issue: For MatMul, the last dimension of the first input (p_gather1_out: src0->ne[0]) must match the second-to-last dimension of the transposed second input (p_tensor1 transposed: [src1->ne[3], src1->ne[2], src1->ne[1], src1->ne[0]] → src1->ne[0]). If src0->ne[0] ≠ src1->ne[0], the multiplication is invalid, leading to incorrect results.
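One cheap way to surface this class of error is a fail-fast check before the MatMul node is appended. This is a hedged sketch: the helper name is hypothetical and the assumption that the MatMul contracts the last two dimensions, with transpose_in1 flipping the second input, follows the reasoning above rather than verified SDK documentation.

```cpp
// Hedged sketch (helper name and MatMul semantics assumed from the reasoning
// above, not from verified SDK docs): fail fast with a readable log line
// before appending the MatMul node, instead of silently computing garbage.
#include <cstdint>
#include <cstdio>

static bool qnn_matmul_dims_ok(const uint32_t a_dims[4],
                               const uint32_t b_dims[4],
                               bool transpose_in1) {
    // for A[..., M, K] x B[..., K, N] the contracted sizes must match;
    // with transpose_in1 the second operand is read as B[..., N, K]
    const uint32_t a_k = a_dims[3];
    const uint32_t b_k = transpose_in1 ? b_dims[3] : b_dims[2];
    if (a_k != b_k) {
        fprintf(stderr, "ggml-qnn: MatMul contraction mismatch: %u vs %u\n",
                (unsigned) a_k, (unsigned) b_k);
        return false;
    }
    return true;
}
```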
Transpose Configuration
Input: p_tensor2_transpose.
Permutation: p_param_tensor with data0 = {0, src1->ne[1], src0->ne[2], src0->ne[3]}.
Output: p_tensor2 (matches dst).
Issue: The permutation should restore the output to dst’s expected shape ([dst->ne[0], dst->ne[1], dst->ne[2], dst->ne[3]]). The values in data0 seem arbitrary and not aligned with a standard 4D matmul output shape (e.g., [src0->ne[0], src0->ne[1], src1->ne[2], src1->ne[3]]).
Tensor Initialization and Reuse
When graph_initialized is false, tensors are created with dimensions from src0 and src1, but p_tensor2_transpose’s dimensions (tensor2_transpose_dims) are uninitialized ({}).
When reusing the graph, dimensions aren’t rechecked, assuming prior setup is correct.
Issue: Uninitialized or mismatched dimensions in p_tensor2_transpose could propagate errors through the graph.
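A hedged sketch of how the missing sizing could be made explicit; the variable names mirror the discussion above, and the [batch, batch, rows, cols] layout is an assumption, not the PR's verified tensor layout:

```cpp
// Hedged sketch: size the MatMul output explicitly instead of leaving its
// dimension array empty, so the same values can be re-validated when the
// cached graph is reused.
#include <cstdint>

// with inputs shaped [B3, B2, M, K] and [B3, B2, N, K] (transpose_in1 = true),
// the MatMul output before the final Transpose is [B3, B2, M, N]
static void fill_matmul_out_dims(const uint32_t a_dims[4],
                                 const uint32_t b_dims[4],
                                 uint32_t tensor2_transpose_dims[4]) {
    tensor2_transpose_dims[0] = a_dims[0];   // batch
    tensor2_transpose_dims[1] = a_dims[1];   // batch
    tensor2_transpose_dims[2] = a_dims[2];   // M, rows of the first input
    tensor2_transpose_dims[3] = b_dims[2];   // N, rows of the (transposed) second input
}
```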
Data Flow and Debugging
The print_tensors_info call logs tensor details, but there’s no indication you’ve verified the intermediate outputs (e.g., p_gather0_out, p_gather1_out) match expectations.
Issue: Without validating each step’s output, errors could accumulate silently.
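To make that validation concrete, a small debug helper along these lines (purely illustrative, not part of the original code) can be called on each intermediate buffer after graph execution:

```cpp
// Purely illustrative debug helper (not part of the original code): read back
// a float buffer after executing the graph and print summary statistics, so a
// silently wrong Gather/Transpose output shows up early.
#include <cmath>
#include <cstddef>
#include <cstdio>

static void dump_f32_stats(const char * name, const float * data, size_t n) {
    if (data == nullptr || n == 0) {
        printf("%s: <empty>\n", name);
        return;
    }
    float mn = data[0], mx = data[0];
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        mn = std::fmin(mn, data[i]);
        mx = std::fmax(mx, data[i]);
        sum += data[i];
    }
    printf("%s: n=%zu min=%f max=%f mean=%f\n", name, n, mn, mx, sum / (double) n);
}
```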
Step 3: Root Cause Hypothesis
The most likely root cause is incorrect dimension handling or indexing in the Gather and Transpose operations, leading to misaligned tensors for MatMul. Specifically:
Step 4: Corrected Code
Here’s a revised version of the function, focusing on proper dimension alignment for a 4D matrix multiplication (e.g., src0: [B, M, K, N] × src1: [B, K, P, Q] → dst: [B, M, P, Q]):