Best way to deploy an SLM/LLM model from Hugging Face with minimal latency #12657
Unanswered
AakashNakarmi asked this question in Q&A
I want to deploy an SLM/LLM model from Hugging Face with the lowest possible response time.
Please recommend the best library and approach for inference. If there is a template for production-level inference, that would be helpful.
I want to use or build a container image that can serve multiple requests at a time through a single port.
Library: Transformers with pipeline (see the sketch after these details)
Current inference time: ~4-5 seconds for 2000 tokens
Model to deploy: Phi-3 mini instruct, ~7.5 GB
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
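For reference, here is a minimal sketch of what the baseline described above presumably looks like, built from the details in this post (Transformers pipeline, Phi-3-mini-4k-instruct). The dtype, device placement, and prompt are assumptions, not taken from the original setup:

```python
# Minimal sketch of the baseline described above.
# Assumptions (not in the original post): bfloat16 weights, CUDA device,
# greedy decoding. Older transformers versions may also need
# trust_remote_code=True for this model.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,  # ~7.5 GB of weights fits in 24 GB VRAM
    device_map="cuda",
)

prompt = "Summarize the benefits of small language models."
out = pipe(prompt, max_new_tokens=2000, do_sample=False)
print(out[0]["generated_text"])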
System Configuration:
NVIDIA GPU with 24 GB VRAM
400 GB RAM
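To make the goal concrete, here is a minimal sketch of the kind of single-port, multi-request service described above, assuming a FastAPI wrapper around the same pipeline. The endpoint name, request schema, and port are illustrative and not from this post; note that without request batching the GPU still runs generations one at a time even though the server accepts concurrent connections:

```python
# Hypothetical single-port serving wrapper (names and port are illustrative).
# FastAPI runs sync endpoints in a thread pool, so the server accepts
# concurrent requests; the pipeline call itself still executes one
# generation at a time on the GPU unless a batching inference server is used.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    out = pipe(req.prompt, max_new_tokens=req.max_new_tokens, do_sample=False)
    return {"text": out[0]["generated_text"]}

# Run inside the container with, e.g.:
#   uvicorn server:app --host 0.0.0.0 --port 8000
```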