Intel NPU operation related #1081

Open
Oneul-hyeon opened this issue Dec 19, 2024 · 2 comments

Oneul-hyeon commented Dec 19, 2024

Hello

I want to run an on-device sLM on the NPU built into my "Intel(R) Core(TM) Ultra 5" processor.

The code below works on the CPU and the iGPU, but when the device is set to NPU no answer is produced.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer
import time

def make_template(context) :
    instruction=f"""You are an assistant who translates meeting contents.
Translate the meeting contents given after #Context into English.

#Context:{context}

#Translation:"""
    
    messages=[{"role": "user", "content": f"{instruction}"}]

    input_ids = tokenizer.apply_chat_template(messages,
                                              add_generation_prompt=True,
                                              return_tensors="pt")

    return input_ids

def translate(context) : 
    input_ids=make_template(context=context)
    outputs = model.generate(input_ids,
                             max_new_tokens=max_new_tokens,
                             do_sample=do_sample,
                             temperature=temperature,
                             top_p=top_p)
    
    answer=tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
    
    return answer.rstrip()

if __name__ == "__main__" :
    model_id = "AIFunOver/gemma-2-2b-it-openvino-8bit"
    model = OVModelForCausalLM.from_pretrained(model_id, device="npu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    print(f"Model Device : {model.device}")

    max_new_tokens=1024
    do_sample=False
    temperature=0.1
    top_p=0.9

    context = '''A: Hello.
B: Oh, yes, hello. I'm contacting you because I have a question. They're doing water pipe construction in my neighborhood, and I'm curious as to how long it will take.
A: Where is your area?
B: Daejeon Byeundae-dong.
A: The construction will continue until tomorrow, sir.
B: Oh really? Oh, but won't there be muddy water after the construction is over?
A: It's better to let out enough water before using it after the construction is over, sir.
B: How much water should I drain?
A: Let out for 2~3 minutes.
B: Okay, I understand. Then, can there be another problem?
A: The water pressure may temporarily drop slightly.
B: Temporarily?
A: Yes, it's a temporary phenomenon and will return to normal pressure right away.
B: What should I do if it lasts a long time?
A: In that case, you can report it to the Waterworks Headquarters.
B: Yes, I understand.
B: But they say it's going to rain tomorrow, so can the construction be finished tomorrow? I think they usually don't do construction on rainy days?
A: In case of rain, construction may be slightly delayed. If it doesn't rain too much, construction will proceed as scheduled. Customer, please don't worry too much.
B: Oh, yes, I understand. Thank you.
A: Yes, thank you.'''

    start_time = time.time()
    generated_text = translate(context)
    end_time = time.time()

    print("generated_text:", generated_text)

    num_generated_tokens = len(tokenizer.tokenize(generated_text))
    total_time = end_time - start_time
    avg_token_speed = total_time / num_generated_tokens if num_generated_tokens > 0 else float('inf')

    print(f"Total Inference Time : {total_time} s")
    print(f"Average token generation speed: {avg_token_speed:.4f} seconds/token")

However, the NPU does appear in the list of devices available to OpenVINO.

image
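
For reference, the device list shown in the screenshot can be reproduced with a short script. This is a minimal sketch, assuming the `openvino` Python package is installed; `pick_device` is a hypothetical helper written for illustration and is not part of OpenVINO. Seeing `NPU` in the list only means the driver is detected, not that a given model will run on it.

```python
# Sketch: list the devices OpenVINO can see and pick one by preference.
# pick_device() is a hypothetical helper, not an OpenVINO API.

def pick_device(available, preferred=("NPU", "GPU", "CPU")):
    """Return the first available device that matches the preference order.

    Device names may carry an index suffix such as "GPU.0", so only the
    part before the dot is compared.
    """
    for want in preferred:
        for name in available:
            if name.split(".")[0] == want:
                return name
    return None


if __name__ == "__main__":
    try:
        import openvino as ov
    except ImportError:
        print("openvino is not installed; skipping device query")
    else:
        core = ov.Core()
        for name in core.available_devices:
            # FULL_DEVICE_NAME is a standard read-only device property.
            print(name, "->", core.get_property(name, "FULL_DEVICE_NAME"))
        print("Selected:", pick_device(core.available_devices))
```

The helper falls back to the iGPU or CPU when no NPU is reported, which matches the devices already confirmed to work with the script above.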

Is there a way to use the NPU? If so, could you tell me how?

Thank you.

Collaborator

eaidova commented Dec 19, 2024

@Oneul-hyeon currently optimum-intel does not support sLM inference on NPU, but there is another solution that allows it and works with the same optimum-intel converted models. Please check this guide: https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html
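
A rough sketch of what the OpenVINO GenAI route could look like for the translation prompt from the original post. It assumes the `openvino-genai` package is installed and that the converted model has already been downloaded to a local directory; the path `gemma-2-2b-it-openvino-8bit` and the helpers `build_prompt` and `generate_on_npu` are illustrative assumptions. It has not been verified on NPU hardware, and the linked guide may require different export settings than the 8-bit model used in the issue.

```python
import os


def build_prompt(context):
    """Rebuild the translation instruction used in the issue's script."""
    return (
        "You are an assistant who translates meeting contents.\n"
        "Translate the meeting contents given after #Context into English.\n\n"
        f"#Context:{context}\n\n"
        "#Translation:"
    )


def generate_on_npu(model_dir, prompt, max_new_tokens=1024):
    # Imported lazily so build_prompt() works without openvino-genai.
    import openvino_genai as ov_genai

    # LLMPipeline expects a local directory holding the exported model.
    pipe = ov_genai.LLMPipeline(model_dir, "NPU")
    return pipe.generate(prompt, max_new_tokens=max_new_tokens)


if __name__ == "__main__":
    model_dir = "gemma-2-2b-it-openvino-8bit"  # assumed local path
    prompt = build_prompt("A: Hello.\nB: Oh, yes, hello.")
    if not os.path.isdir(model_dir):
        print(f"Model directory {model_dir!r} not found; nothing to run")
    else:
        try:
            print(generate_on_npu(model_dir, prompt))
        except ImportError:
            print("openvino-genai is not installed")
```

Changing the device string to `"CPU"` or `"GPU"` is a quick way to check that the model directory itself loads before trying the NPU.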

@endomorphosis

@eaidova I ran experiments with this on Windows and concluded that it does not work there: when you follow that guide, the process exits without producing any error output. See e.g. openvinotoolkit/openvino.genai#1358
