
Commit: adjust branch trigger
Magdalena Kuhn committed Apr 4, 2024
1 parent 1ca5634 commit c9d4557
Showing 2 changed files with 52 additions and 45 deletions.
18 changes: 12 additions & 6 deletions .github/workflows/fetch_contributors.yaml
@@ -1,9 +1,8 @@
name: Update Contributors List

on:
-  push:
-    branches:
-      - NOREF_adjust_structure
  workflow_dispatch:
+  push: main

jobs:
  update-contributors:
@@ -21,11 +20,18 @@ jobs:
      - name: Update README with contributors
        run: python src/fetch_contributors.py
      - name: Commit and push if changed
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          git config --global user.email "[email protected]"
          git config --global user.name "GitHub Action"
          PR_BRANCH="auto-pr-${GITHUB_SHA}"
          git checkout -b $PR_BRANCH
          git add README.md
          git commit -m "Update contributors list" || exit 0
-         git push
+         git push origin $PR_BRANCH
          echo "PR_BRANCH=${PR_BRANCH}" >> $GITHUB_ENV
      - name: Create Pull Request
        id: create_pr
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh pr create --title "NOREF_update_list_of_contributors" --body "Fetch latest contributors"
79 changes: 40 additions & 39 deletions README.md
@@ -8,100 +8,101 @@
</div>
<br/>
<p align="center">
-Cost reduction tools and techniques for LLM based systems <br> <br>
-<img src="images/Screenshot%202024-04-04%20at%2007.41.00.png" alt="Alt text" title="Expectation vs. Reality">
+<img src="images/Screenshot%202024-04-04%20at%2007.41.00.png" alt="Alt text" title="Expectation vs. Reality"> <br>
+Cost reduction tools and techniques for LLM-based systems
</p>


:point_right: Let’s make sure that your LLM application doesn’t burn a hole in your pocket. <br>
:point_right: Let’s instead make sure your LLM application generates a positive ROI for you, your company and your users.

-# Tools & frameworks to reduce costs
+# Techniques to reduce costs

-## 1) :blue_book: Model family and type
+## 1) :blue_book: Choose model family and type
Selecting a suitable model, or a combination of models, lays the foundation for building cost-sensible LLM applications.
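
As a quick sanity check before committing to a model family, you can run the same small eval set against every candidate and compare accuracy against cost. A minimal sketch in Python, where `query_model`, the model names, and the eval set are placeholder assumptions:

```python
# Minimal sketch: score candidate models on a tiny eval set before committing.
# query_model, the model names, and the eval questions are placeholders.
def query_model(model: str, prompt: str) -> str:
    """Placeholder for whatever client each candidate model needs."""
    return "4"  # stubbed answer so the sketch runs as-is

CANDIDATES = ["small-model", "large-model"]
EVAL_SET = [("What is 2+2?", "4"), ("Capital of France?", "Paris")]

for model in CANDIDATES:
    hits = sum(expected in query_model(model, q) for q, expected in EVAL_SET)
    print(f"{model}: {hits}/{len(EVAL_SET)} correct")
```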

-### In-depth papers that explain underlying concepts
+### Papers
* Naveed, Humza, et al. ["A comprehensive overview of large language models."](https://arxiv.org/abs/2307.06435?utm) arXiv preprint arXiv:2307.06435 (2023).
* Minaee, Shervin, et al. ["Large Language Models: A Survey."](https://arxiv.org/abs/2402.06196) arXiv preprint arXiv:2402.06196 (2024).
* :speaking_head: call-for-contributions :speaking_head:
-### Tools & frameworks that help with selecting the correct model
-* Hugging face open leaderboard
-### Hands-on blog posts & courses with step by step guide
+### Tools & frameworks
+* [MTEB (Massive Text Embedding Benchmark) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) by Huggingface
+* [Models](https://huggingface.co/models) by Huggingface
+### Blog posts & courses
* [How to Evaluate, Compare, and Optimize LLM Systems](https://wandb.ai/ayush-thakur/llm-eval-sweep/reports/How-to-Evaluate-Compare-and-Optimize-LLM-Systems--Vmlldzo0NzgyMTQz?utm)
* :speaking_head: call-for-contributions :speaking_head:

-## 2) :blue_book: Model size
+## 2) :blue_book: Reducing model size
After choosing a suitable model family, you should consider models with fewer parameters and other techniques that reduce model size.
-* Selection of model parameter size
+* Model parameter size
* Quantization of models (see the sketch after this list)
* A higher degree of model customization (e.g. through RAG or fine-tuning) can achieve the same performance as a bigger model
* Distillation
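
For illustration, a minimal sketch of loading a model in 4-bit quantization with Hugging Face transformers and bitsandbytes; the model name is only an example, and exact arguments may vary across library versions:

```python
# Minimal sketch: load a causal LM with 4-bit quantized weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # example model, swap in your own

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)
```

Quantized weights cut memory, and therefore hosting cost, substantially, usually at a small quality penalty worth benchmarking for your use case.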

-### In-depth papers that explain underlying concepts
+### Papers
* :speaking_head: call-for-contributions :speaking_head:

-### Tools & frameworks that help reducing model size
+### Tools & frameworks
* [LoRA](https://huggingface.co/docs/diffusers/training/lora#lora) and [QLoRA](https://medium.com/@dillipprasad60/qlora-explained-a-deep-dive-into-parametric-efficient-fine-tuning-in-large-language-models-llms-c1a4794b1766) make training large models more efficient
* :speaking_head: call-for-contributions :speaking_head:

-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
-## 3) :blue_book: Open source vs. proprietary models
+## 3) :blue_book: Use open source models
Consider self-hosting models instead of using proprietary models if you have capable developers in house. Still, keep an eye on the Total Cost of Ownership when benchmarking managed LLMs against setting everything up on your own.
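
As one example of the self-hosted route, a minimal sketch that calls a locally running Ollama server over its HTTP API; it assumes Ollama is installed, serving on its default port, and that the named model has been pulled:

```python
# Minimal sketch: query a self-hosted model via the local Ollama HTTP API.
import json
import urllib.request

def generate_local(prompt: str, model: str = "llama3") -> str:
    """Send a prompt to a local Ollama server and return the generated text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(generate_local("In one sentence: why can self-hosting reduce LLM costs?"))
```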

-### In-depth papers that explain underlying concepts
+### Papers
* :speaking_head: call-for-contributions :speaking_head:
-### Tools & frameworks that help with self-hosting
-* Huggingface
-* LocalAI
-* Ollama
-* vLLM
+### Tools & frameworks
+* [LocalAI](https://github.com/mudler/LocalAI)
+* [Ollama](https://github.com/ollama/ollama)
+* [vLLM](https://github.com/vllm-project/vllm)
+* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* :speaking_head: call-for-contributions :speaking_head:
-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
-## 4) :blue_book: Input/Output tokens
-A key cost driver is the amount of input token (user prompt + context) and output token you allow for your LLM. Different techniques to reduce the amount of tokens help in saving costs.
+## 4) :blue_book: Reduce input/output tokens
+A key cost driver is the number of input tokens (user prompt + context) and output tokens you allow for your LLM. Different techniques for reducing the token count help save costs (a minimal token-budget sketch follows the list below).
* Compression
* Summarization
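
A minimal token-budget sketch using tiktoken; the encoding name and budget are illustrative assumptions, and real prompt compression (e.g. LLMLingua, listed below) is smarter than plain truncation:

```python
# Minimal sketch: cap the token count of retrieved context before sending it.
import tiktoken

def truncate_to_budget(text: str, max_tokens: int = 500) -> str:
    """Encode, cut to the budget, decode; crude, but it caps input cost."""
    enc = tiktoken.get_encoding("cl100k_base")  # example encoding
    tokens = enc.encode(text)
    return enc.decode(tokens[:max_tokens])

long_context = "Some very long retrieved document... " * 200
print(len(truncate_to_budget(long_context).split()))
```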

-### In-depth papers that explain underlying concepts
+### Papers
* :speaking_head: call-for-contributions :speaking_head:
-### Tools & frameworks that help with reducing tokens
+### Tools & frameworks
* [LLMLingua](https://github.com/microsoft/LLMLingua) by Microsoft for input prompt compression
* :speaking_head: call-for-contributions :speaking_head:
-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
## 5) :blue_book: Prompt and model routing
-Add automatic checks to route all incoming user prompts to a suitable model. Follow Least-Model-Principle, which means to by default use the simplest possible logic or LM to answer a users question and only route to more complex LMs if necessary (aka. "LLM Cascading"). This can result to answering certain questions with a predefined response, using SLMs for simple questions and LLMs for complex questions.
+Add automatic checks to route incoming user prompts to a suitable model. Follow the Least-Model Principle: by default, use the simplest possible logic or LM to answer a user's question, and only route to more complex LMs if necessary (a.k.a. "LLM cascading").
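
A minimal sketch of such a cascade in native Python; the tier names and the word-count heuristic are placeholder assumptions, and in practice an intent classifier routes more reliably:

```python
# Minimal sketch: route each prompt to the cheapest tier that can handle it.
CANNED_ANSWERS = {"what are your opening hours?": "We are open 9-5, Mon-Fri."}

def route(prompt: str) -> str:
    """Return the cheapest tier that can plausibly answer the prompt."""
    normalized = prompt.strip().lower()
    if normalized in CANNED_ANSWERS:
        return "canned"          # free: predefined response
    if len(prompt.split()) < 20:
        return "small-model"     # cheap SLM for short, simple questions
    return "large-model"         # expensive LLM only when needed

print(route("What are your opening hours?"))  # -> canned
print(route("Summarize the attached contract and list all liability clauses " * 3))  # -> large-model
```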

-### Tools & frameworks that help with routing
+### Tools & frameworks
* Native implementation in Python of your custom logic
-* **Nemo guardrails** to detect and Route based on intent
+* [Nemo guardrails](https://github.com/NVIDIA/NeMo-Guardrails) to detect and route based on intent
* :speaking_head: call-for-contributions :speaking_head:
-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
## 6) :blue_book: Caching
-If your users tend to send very similar prompts to your LLM system, you can reduce costs by using different cachin techniques:
+If your users tend to send very similar prompts to your LLM system, you can reduce costs by using different caching techniques (see the sketch below).
* :speaking_head: call-for-contributions :speaking_head:
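
A minimal sketch of exact-match caching in native Python; `llm_call` is a placeholder for your actual model client, and semantic caching (matching similar rather than identical prompts) goes further:

```python
# Minimal sketch: answer repeated prompts from a cache instead of the API.
import hashlib

_cache: dict[str, str] = {}

def llm_call(prompt: str) -> str:
    """Placeholder for a real, billed model call."""
    return f"response to: {prompt}"

def cached_llm_call(prompt: str) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:          # only pay for prompts we haven't seen
        _cache[key] = llm_call(prompt)
    return _cache[key]

print(cached_llm_call("What is RAG?"))
print(cached_llm_call("what is RAG? "))  # served from cache, no API cost
```
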
-### In-depth papers that explain underlying concepts
+### Papers
* :speaking_head: call-for-contributions :speaking_head:
-### Tools & frameworks that help with caching
+### Tools & frameworks
* :speaking_head: call-for-contributions :speaking_head:
-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
## 7) :blue_book: Rate limiting
Make sure a single customer cannot hammer your LLM and send your bill skyrocketing. Track the number of prompts per user per month and either enforce a hard limit on the maximum number of prompts or slow down responses once a user hits the limit. In addition, detect unnatural or sudden spikes in user requests (similar to DDoS attacks, users or competitors can harm your business by sending tons of requests to your model).
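
A minimal sliding-window limiter sketch in native Python; the window size and request limit are illustrative assumptions:

```python
# Minimal sketch: allow at most MAX_REQUESTS per user per WINDOW_SECONDS.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 20

_requests: dict[str, deque] = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    """Reject calls once a user exhausts their budget for the current window."""
    now = time.time()
    q = _requests[user_id]
    while q and now - q[0] > WINDOW_SECONDS:  # drop timestamps outside window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False                          # hard limit hit
    q.append(now)
    return True

print(allow_request("alice"))  # True until alice hits 20 requests/minute
```
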
-### Tools & frameworks that help with rate limiting:
+### Tools & frameworks
* Simple tracking logic can be implemented in native Python
* :speaking_head: call-for-contributions :speaking_head:
-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
## 8) :blue_book: Cost tracking
"You can't improve what you don't measure" --> Make sure to know where your costs are coming from. Is it super active users? Is it a premium model? etc.
-### Tools & frameworks that help with cost tracking
+### Tools & frameworks
* Simple tracking logic can be implemented in native Python
* :speaking_head: call-for-contributions :speaking_head:
-### Hands-on blog posts & courses with step by step guide
+### Blog posts & courses
* :speaking_head: call-for-contributions :speaking_head:
## 9) :blue_book: During development time
* Make sure not to send endless API calls to your LLM during development and manual testing (see the sketch below).
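
A minimal sketch of stubbing model calls during development; the environment flag and function names are assumptions, not a fixed convention:

```python
# Minimal sketch: return free canned answers in dev, only hit the paid API in prod.
import os

def real_llm_call(prompt: str) -> str:
    raise NotImplementedError("wire up your real provider client here")

def llm_call(prompt: str) -> str:
    """Dev mode short-circuits the call so manual testing costs nothing."""
    if os.getenv("APP_ENV", "dev") == "dev":
        return f"[stubbed response for: {prompt[:40]}]"
    return real_llm_call(prompt)

print(llm_call("test prompt"))  # free and instant while developing
```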
