From 1518a6fa6abcd636065b371800b4bc474a4487a2 Mon Sep 17 00:00:00 2001
From: Valentin
Date: Sun, 7 Apr 2024 22:21:39 +0200
Subject: [PATCH] extend explanation and add links

---
 README.md | 53 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index de19d34..ee64bba 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,10 @@

awesome-cheap-llms

- :yellow_heart: Costs of RAG based applications
+ :yellow_heart: Costs of RAG based applications
:blue_heart: Follow Joanna on LinkedIn :heavy_plus_sign: Follow Magdalena on LinkedIn
- :white_heart: Sign up to DataTalksClub LLM Zoomcamp
+ :white_heart: Sign up to DataTalksClub LLM Zoomcamp
+ :star: Give this repository a star to support the initiative!


@@ -13,7 +14,8 @@
:point_right: Let’s make sure that your LLM application doesn’t burn a hole in your pocket.
-:point_right: Let’s instead make sure your LLM application generates a positive ROI for you, your company and your users.
+:point_right: Let’s instead make sure your LLM application generates a positive ROI for you, your company and your users.
+:point_right: A nice side effect of choosing cheaper models over expensive models: the response time is shorter!
# Techniques to reduce costs
@@ -22,7 +24,7 @@

## 1) :blue_book: Choose model family and type
-Selecting a suitable model or combination of models builds the foundation of building const-sensible LLM applications.
+Selecting a suitable model or combination of models based on factors such as speciality, size and benchmark results builds the foundation for developing cost-sensible LLM applications.
### Papers
* Naveed, Humza, et al. ["A comprehensive overview of large language models."](https://arxiv.org/abs/2307.06435?utm) arXiv preprint arXiv:2307.06435 (2023).
@@ -31,12 +33,12 @@ Selecting a suitable model or combination of models builds the foundation of bui
* [MTEB (Massive Text Embedding Benchmark) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) by Huggingface
* [Models](https://huggingface.co/models) by Huggingface
### Blog posts & courses
-* [How to Evaluate, Compare, and Optimize LLM Systems](https://wandb.ai/ayush-thakur/llm-eval-sweep/reports/How-to-Evaluate-Compare-and-Optimize-LLM-Systems--Vmlldzo0NzgyMTQz?utm)
-* :speaking_head: call-for-contributions :speaking_head:
+* [Which LLM to choose for your use case?](https://ubiops.com/which-llm-to-choose-for-your-use-case/#:~:text=While%20choosing%20an%20LLM%20that,procedures%20and%20biases%2C%20and%20licensing.)
+* [6x Key factors to consider in choosing an LLM](https://www.solitontech.com/key-factors-to-consider-in-llm/)
## 2) :blue_book: Reducing model size
After chosing the suitable model family, you should consider models with fewer parameters and other techniques that reduce model size.
-* Model parameter size
+* Model parameter size (e.g. 7B, 13B ... 175B)
* Quantization of models
* Higher degree of model customization (i.e. through RAG or fine-tuning) can achieve the same performance as a bigger model
* Distillation
@@ -49,7 +51,8 @@ After chosing the suitable model family, you should consider models with fewer p
* :speaking_head: call-for-contributions :speaking_head:
### Blog posts & courses
-* :speaking_head: call-for-contributions :speaking_head:
+* [Basics of quantization in ML](https://iq.opengenus.org/basics-of-quantization-in-ml/)
+* [How LLM quantization impacts model quality](https://deci.ai/blog/how-llm-quantization-impacts-model-quality/#:~:text=LLM%20Quantization%20Methods,-Quantization%20reduces%20the&text=Common%20quantization%20approaches%20include%20Bitsnbytes,and%20potentially%20speeding%20up%20computation.)
## 3) :blue_book: Use open source models
Consider self-hosting models instead of using proprietary models if you have capable developers in house. Still, have an oversight of Total Cost of Ownership, when benchmarking managed LLMs vs. setting up everything on your own.
@@ -60,50 +63,50 @@ Consider self-hosting models instead of using proprietary models if you have cap
* [Ollama ](https://github.com/ollama/ollama)
* [vLLM](https://github.com/vllm-project/vllm)
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
-* :speaking_head: call-for-contributions :speaking_head:
### Blog posts & courses
-* :speaking_head: call-for-contributions :speaking_head:
+* [5x easy ways to run an llm locally](https://www.infoworld.com/article/3705035/5-easy-ways-to-run-an-llm-locally.html)
+* [Ollama - Deploy and run llms locally](https://blog.gopenai.com/ollama-deploy-and-run-llms-locally-d20e41dd9a2d)
## 4) :blue_book: Reduce input/output tokens
-A key cost driver is the amount of input tokens (user prompt + context) and output tokens you allow for your LLM. Different techniques to reduce the amount of tokens help in saving costs.
-* Compression
-* Summarization
+A key cost driver is the number of input tokens (user prompt + context) and output tokens that you allow for your LLM. Different techniques to reduce the number of tokens help in saving costs.
+* Chunking of input documents
+* Compression of input tokens
+* Summarization of input tokens
+* Prompting to instruct the LLM how many output tokens are desired
### Papers
* :speaking_head: call-for-contributions :speaking_head:
### Tools & frameworks
-* [LLMLingua](https://github.com/microsoft/LLMLingua) by Microsoft for input prompt compression
+* [LLMLingua](https://github.com/microsoft/LLMLingua) by Microsoft to compress input prompts
* :speaking_head: call-for-contributions :speaking_head:
### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
## 5) :blue_book: Prompt and model routing
-Add automatic checks to route incoming user prompts to a suitable model. Follow Least-Model-Principle, which means to by default use the simplest possible logic or LM to answer a users question and only route to more complex LMs if necessary (aka. "LLM Cascading").
+Send your incoming user prompts to a model router (= Python logic + SLM) that automatically chooses a suitable model for actually answering the question. Follow the Least-Model Principle: by default, use the simplest possible logic or LM to answer a user's question and only route to more complex LMs if necessary (aka "LLM Cascading"); see the sketch at the end of this section.
### Tools & frameworks
-* Native implementation in Python of your custom logic
+* [LLamaIndex Routers and LLMSingleSelector](https://docs.llamaindex.ai/en/latest/module_guides/querying/router/#using-selector-as-a-standalone-module) to select the best fitting model from a range of potential models
* [Nemo guardrails](https://github.com/NVIDIA/NeMo-Guardrails) to detect and route based on intent
-* :speaking_head: call-for-contributions :speaking_head:
### Blog posts & courses
-* :speaking_head: call-for-contributions :speaking_head:
+* [Dynamically route logic based on input](https://python.langchain.com/docs/expression_language/how_to/routing/) with LangChain, prompting and output parsing
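+A minimal sketch of the cascading idea in plain Python, assuming the `openai` v1 client and the placeholder model pair `gpt-3.5-turbo`/`gpt-4-turbo` (swap in any cheap/expensive combination); the self-check heuristic is only illustrative, not a recommendation:
+```python
+# LLM cascading sketch: try a cheap model first and only escalate to the
+# expensive model if the cheap answer looks insufficient.
+from openai import OpenAI
+
+client = OpenAI()  # expects OPENAI_API_KEY in the environment
+
+CHEAP_MODEL = "gpt-3.5-turbo"    # placeholder: any low-cost model
+EXPENSIVE_MODEL = "gpt-4-turbo"  # placeholder: any premium fallback model
+
+def ask(model: str, prompt: str) -> str:
+    response = client.chat.completions.create(
+        model=model,
+        messages=[{"role": "user", "content": prompt}],
+        max_tokens=300,
+    )
+    return response.choices[0].message.content
+
+def needs_escalation(question: str, answer: str) -> bool:
+    # Cheap self-check; replace with your own evaluation or routing logic.
+    verdict = ask(
+        CHEAP_MODEL,
+        f"Question: {question}\nAnswer: {answer}\n"
+        "Does the answer fully address the question? Reply YES or NO.",
+    )
+    return verdict.strip().upper().startswith("NO")
+
+def route(question: str) -> str:
+    answer = ask(CHEAP_MODEL, question)
+    if needs_escalation(question, answer):
+        answer = ask(EXPENSIVE_MODEL, question)  # escalate only when needed
+    return answer
+
+print(route("Summarize the main cost drivers of a RAG application."))
+```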
## 6) :blue_book: Caching
-If your users tend to send very similar prompts to your LLM system, you can reduce costs by using different cachin techniques.
-* :speaking_head: call-for-contributions :speaking_head:
### Papers
+If your users tend to send semantically similar or repetitive prompts to your LLM system, you can reduce costs by using different caching techniques. The key lies in developing a caching strategy that does not only look for exact matches but also for semantic overlap, so that you reach a decent cache hit ratio; see the sketch at the end of this section.
* :speaking_head: call-for-contributions :speaking_head:
### Tools & frameworks
-* :speaking_head: call-for-contributions :speaking_head:
+* [GPTCache](https://github.com/zilliztech/GPTCache) for semantic caching
### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
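+A rough illustration of the semantic-overlap idea in plain Python (not GPTCache itself), assuming the `openai` v1 client and the `text-embedding-3-small` embedding model purely as placeholders; the 0.9 similarity threshold is arbitrary and needs tuning on your own traffic:
+```python
+# Semantic cache sketch: reuse an earlier answer when a new prompt is close
+# enough (cosine similarity) to a prompt that has already been paid for.
+import math
+from openai import OpenAI
+
+client = OpenAI()
+_cache: list[tuple[list[float], str]] = []  # (prompt embedding, cached answer)
+
+def _embed(text: str) -> list[float]:
+    # Placeholder embedding model; any embedding endpoint works here.
+    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
+
+def _cosine(a: list[float], b: list[float]) -> float:
+    dot = sum(x * y for x, y in zip(a, b))
+    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
+    return dot / norm
+
+def lookup(prompt: str, threshold: float = 0.9) -> str | None:
+    emb = _embed(prompt)
+    best = max(_cache, key=lambda entry: _cosine(emb, entry[0]), default=None)
+    if best is not None and _cosine(emb, best[0]) >= threshold:
+        return best[1]  # cache hit: no LLM call, no cost
+    return None  # cache miss: call the LLM, then store the result via remember()
+
+def remember(prompt: str, answer: str) -> None:
+    _cache.append((_embed(prompt), answer))
+```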
## 7) :blue_book: Rate limiting
Make sure one single customer is not able to penetrate your LLM and skyrocket your bill. Track amount of prompts per month per user and either hard limit to max amount of prompts or reduce response time when a user is hitting the limit. In addition, detect unnatural/sudden spikes in user requests (similar to DDOS attacks, users/competitors can harm your business by sending tons of requests to your model).
### Tools & frameworks
-* Simple tracking logic can be implemented in native Python
+* Simple tracking and rate limiting logic can be implemented in native Python
* :speaking_head: call-for-contributions :speaking_head:
### Blog posts & courses
-* :speaking_head: call-for-contributions :speaking_head:
+* [How to design a scalable rate limiting algorithm](https://konghq.com/blog/engineering/how-to-design-a-scalable-rate-limiting-algorithm)
## 8) :blue_book: Cost tracking
"You can't improve what you don't measure" --> Make sure to know where your costs are coming from. Is it super active users? Is it a premium model? etc.
### Tools & frameworks
-* Simple tracking logic can be implemented in native Python
+* Simple tracking and cost attribution logic can be implemented in native Python (see the sketch below)
* :speaking_head: call-for-contributions :speaking_head:
### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
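+A minimal cost attribution sketch in plain Python; the per-1k-token prices and model names below are illustrative placeholders only, so always plug in your provider's current price sheet:
+```python
+# Cost tracking sketch: attribute spend per user and per model from the token
+# usage reported with each completion.
+from collections import defaultdict
+
+PRICE_PER_1K_TOKENS = {  # placeholder USD prices, not current list prices
+    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
+    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
+}
+
+costs_per_user = defaultdict(float)
+
+def track(user_id: str, model: str, input_tokens: int, output_tokens: int) -> float:
+    price = PRICE_PER_1K_TOKENS[model]
+    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
+    costs_per_user[user_id] += cost
+    return cost
+
+# Usage: read the token counts from the API response object, e.g.
+# track("user-42", "gpt-3.5-turbo", response.usage.prompt_tokens, response.usage.completion_tokens)
+print(track("user-42", "gpt-3.5-turbo", input_tokens=1200, output_tokens=300))
+print(dict(costs_per_user))
+```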