
Benchmarking #32

Closed · Keyrxng opened this issue Nov 2, 2024 · 12 comments

@Keyrxng (Contributor) commented Nov 2, 2024

I enjoyed working on the first few AI features when I started contributing, but I quickly realized how hard it is to obtain the desired outcomes with our intricate use cases and extended context windows.

These features are also hard to unit test, or to test in any conventional sense, so validation became difficult and PRs took a long time to merge. That is why I shifted focus to other areas of the DAO where I could add value.

I suggest we finally implement an absolutely unshakeable, beyond-reproach benchmarking strategy for our AI features; this spec focuses on /ask in particular.


Objective: Establish a benchmark to evaluate the LLM's factual accuracy and relevance in responses using two primary data sources:

  1. Static Context: Fixed linked issues in a benchmark issue, which remain unchanged across tests.
  2. Dynamic Embeddings: Embeddings updated with each new comment in the org, affecting similarity-based data retrieval.

Core Components

1. Static Context (Linked Issues)

  • Purpose: Ensure consistent accuracy in the LLM’s responses using static linked issues within a single benchmark issue.
  • Method: Programmatically collect and freeze the linked issues for the benchmark issue. Track how effectively these fixed issues contribute to answering queries.
  • Evaluation Metric: Measure response consistency by repeating tests. Responses should remain factually stable, given the fixed nature of the linked issues.
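
A minimal sketch (TypeScript) of what this consistency check could look like, assuming a hypothetical `askWithContext` helper and a hand-frozen fixture of linked issues; none of these names are the real plugin API:

```typescript
// Hypothetical types and helper; the real /ask plugin exposes different internals.
interface FrozenIssue {
  number: number;
  title: string;
  body: string;
}

// Stand-in for invoking /ask with a fixed set of linked issues as its only context.
declare function askWithContext(question: string, linkedIssues: FrozenIssue[]): Promise<string>;

// Run the same question repeatedly against the frozen fixture and report how often
// the answer matches the first run. Exact-match is deliberately naive; a fuzzy or
// embedding-based comparison would be more forgiving in practice.
async function measureConsistency(question: string, fixture: FrozenIssue[], runs = 5): Promise<number> {
  const answers: string[] = [];
  for (let i = 0; i < runs; i++) {
    answers.push(await askWithContext(question, fixture));
  }
  const matches = answers.filter((answer) => answer === answers[0]).length;
  return matches / runs;
}
```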

2. Dynamic Embeddings Search

  • Purpose: Evaluate the LLM's use of dynamically updated embeddings to retrieve relevant information and factor it into responses.
  • Aspects to Measure:
    • Relevance: Assess how closely the returned embeddings match the query intent, using cosine similarity thresholds or human evaluation.
    • Impact on Output: Track how embeddings modify or enrich the response, especially as new, unrelated comments add potential "noise."
  • Evaluation Metric: Compare responses over time to note if relevance or factual accuracy decreases with embedding dilution.
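
As a rough illustration, the relevance check could be as simple as scoring retrieved vectors against the query embedding with cosine similarity (a sketch; the 0.8 threshold is arbitrary and the vectors are assumed to be plain `number[]`):

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Summarize how relevant a batch of retrieved embeddings is to the query.
function relevanceReport(queryEmbedding: number[], retrieved: number[][], threshold = 0.8) {
  const scores = retrieved.map((vec) => cosineSimilarity(queryEmbedding, vec));
  return {
    scores,
    aboveThreshold: scores.filter((s) => s >= threshold).length,
    mean: scores.reduce((sum, s) => sum + s, 0) / (scores.length || 1),
  };
}
```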

3. Output Effectiveness with Combined Context

  • Purpose: Determine the effectiveness of the LLM’s response when combining static context (linked issues) and dynamic embeddings.
  • Method: Establish target answers for each test query and evaluate responses against these targets. Track deviations in relevance and factual accuracy.
  • Evaluation Metric: Use precision, recall, and human evaluators to score output accuracy and relevance based on combined context.
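
One cheap way to approximate the precision/recall scoring before humans get involved is to author a list of expected "facts" per test query and check the combined-context answer against it (a sketch; the heuristics are crude and do not replace human evaluation):

```typescript
// Recall: share of expected facts that appear in the response.
// Precision proxy: share of response sentences that touch at least one expected fact.
function scoreAgainstTarget(response: string, expectedFacts: string[]) {
  const lower = response.toLowerCase();
  const facts = expectedFacts.map((f) => f.toLowerCase());

  const foundFacts = facts.filter((fact) => lower.includes(fact));
  const recall = facts.length ? foundFacts.length / facts.length : 1;

  const sentences = lower.split(/[.!?]\s+/).filter((s) => s.trim().length > 0);
  const onTopic = sentences.filter((s) => facts.some((fact) => s.includes(fact)));
  const precision = sentences.length ? onTopic.length / sentences.length : 0;

  return { precision, recall, missingFacts: facts.filter((fact) => !lower.includes(fact)) };
}
```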

Test Workflow

  1. Initial Baseline Run: Execute the benchmark test against the chosen issue, storing the initial output and measuring linked issues' contribution, embedding relevance, and combined context effectiveness.
  2. Recurrent Testing: With each change (e.g., prompt adjustments, logic changes), re-run the benchmark test, noting any shifts in relevance, accuracy, or response consistency.
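
For the baseline/recurrent comparison, something like the following could store the first run's metrics and diff every later run against it (a sketch; the metric names and file path are placeholders):

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Placeholder metric shape; real runs would include whatever the benchmark measures.
interface BenchmarkMetrics {
  consistency: number;
  embeddingRelevance: number;
  recall: number;
}

const BASELINE_PATH = "benchmarks/ask-baseline.json";

// First run: persist the baseline. Later runs: print deltas against it.
function recordRun(metrics: BenchmarkMetrics): void {
  if (!existsSync(BASELINE_PATH)) {
    writeFileSync(BASELINE_PATH, JSON.stringify(metrics, null, 2));
    console.log("Baseline stored.");
    return;
  }
  const baseline: BenchmarkMetrics = JSON.parse(readFileSync(BASELINE_PATH, "utf8"));
  for (const key of Object.keys(metrics) as (keyof BenchmarkMetrics)[]) {
    const delta = metrics[key] - baseline[key];
    console.log(`${key}: ${metrics[key].toFixed(3)} (baseline ${baseline[key].toFixed(3)}, delta ${delta.toFixed(3)})`);
  }
}
```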

Deliverables

  1. Baseline Output and Metrics: Document the initial output with relevance and accuracy metrics for linked issues, embeddings, and combined context.
  2. Automated Test Script: Create a script to repeat the test scenario on the benchmark issue and log metrics.
  3. Evaluation Report: Provide a concise report after each test run to track trends and deviations in LLM performance over time, especially as embeddings evolve.

This benchmark framework will allow reliable measurement of the LLM's accuracy and adaptability with respect to static context and dynamic embeddings, enabling consistent future optimizations.

https://chatgpt.com/share/67265190-bec8-8000-ae58-a973a006e3aa

@Keyrxng (Contributor, Author) commented Nov 2, 2024

This needs to be mapped out further: standardizing question sets, building issues/PRs to benchmark against, identifying the data points we should target, and deciding how to improve things over time.

The same goes for the actual implementation details; I'm hoping @sshivaditya2019 can contribute to the spec in that regard. Otherwise I'd be happy to share my thoughts on how I'd go about it.

@0x4007 (Member) commented Nov 2, 2024

This is how we do it

https://platform.openai.com/docs/guides/prompt-engineering#tactic-evaluate-model-outputs-with-reference-to-gold-standard-answers

@0x4007 (Member) commented Nov 2, 2024

Not sure how sophisticated this testing will be, but I expect somewhere between one day and one week of work to wrangle the infra.

@shiv810 (Collaborator) commented Nov 2, 2024

I think this should be part of #18. It should be handled as a performance metric within the RL framework, as an extension of the metric we already use there to quantify the responses.

@Keyrxng (Contributor, Author) commented Nov 2, 2024

> This is how we do it
>
> https://platform.openai.com/docs/guides/prompt-engineering#tactic-evaluate-model-outputs-with-reference-to-gold-standard-answers

Nice, I've never seen that before. Through it I just found https://github.com/openai/evals, which seems to be a testing framework of sorts, but it's Python-based so I'm ruling myself out if that's the way to go.

To automate things we could write a workflow that only collaborators can run during a PR, which would comment on the various benchmarking issues with the exact phrase that previously caused a problem, or with predefined test questions. It would require logic to accept a bot comment only if it contains a marker such as `<!--benchmarking-->`, for example; see the sketch below.
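
A sketch of that marker check, assuming @octokit/rest; the owner/repo/issue values and the marker itself are placeholders:

```typescript
import { Octokit } from "@octokit/rest";

const BENCHMARK_MARKER = "<!--benchmarking-->";

// Return only the bot-authored comments on the benchmark issue that carry the marker.
async function findBenchmarkComments(octokit: Octokit, owner: string, repo: string, issueNumber: number) {
  const { data: comments } = await octokit.rest.issues.listComments({
    owner,
    repo,
    issue_number: issueNumber,
  });
  return comments.filter(
    (comment) => comment.user?.type === "Bot" && (comment.body ?? "").includes(BENCHMARK_MARKER)
  );
}
```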

The analysis would need to be parsed from the response on the issue, but that's fine: that issue would become our store for easily visualizing how things are progressing over time.

Effectively this would be our real unit test for the /ask plugin, as the tests currently authored stub the LLM responses and inject fake context simply to validate formatting and basic GitHub API fetching. They test nothing in regards to embedding data retrieval or response accuracy, and they provide no "boolean" outcome for whether things have improved or worsened, so they are misleading in terms of plugin effectiveness.

Manual QA would not be required from contributors, as this workflow would be the QA itself.

@shiv810 (Collaborator) commented Nov 2, 2024

For testing, we can use metrics-based evaluation techniques or some form of functional testing. In terms of metrics, we can use the following to compare the gold standard responses with the responses generated by the model:

  • QAG Score
  • GEval
  • BLEU

These are some metrics that could be used to compare against the gold standard responses. However, this approach is not a solution for benchmarking; it would instead serve as a functional testing case.
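
To make the comparison concrete, here is a toy BLEU-1-style unigram precision against a gold-standard answer (a sketch only; it has no brevity penalty or higher-order n-grams, and real benchmarking would use a proper library or LLM-based scorers such as QAG or G-Eval):

```typescript
// Clipped unigram precision of a candidate answer against a gold-standard reference.
function bleu1(candidate: string, reference: string): number {
  const tokenize = (s: string) => s.toLowerCase().split(/\W+/).filter(Boolean);
  const candidateTokens = tokenize(candidate);

  // Count reference tokens so each occurrence can only be credited once.
  const referenceCounts = new Map<string, number>();
  for (const token of tokenize(reference)) {
    referenceCounts.set(token, (referenceCounts.get(token) ?? 0) + 1);
  }

  let matches = 0;
  for (const token of candidateTokens) {
    const remaining = referenceCounts.get(token) ?? 0;
    if (remaining > 0) {
      matches++;
      referenceCounts.set(token, remaining - 1);
    }
  }
  return candidateTokens.length ? matches / candidateTokens.length : 0;
}
```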

But manual QA should still be required; it gives reviewers more insight into the PR and what it actually accomplishes.

@Keyrxng (Contributor, Author) commented Nov 2, 2024

My biggest concern is accurately evaluating the influence that the ever-changing embeddings have on responses. In my mind, to actually validate this we'd need to see what was returned and then, either manually or programmatically (without an LLM), check whether more relevant information could have been returned from the DB, why it wasn't, and how we can optimize retrieval to return it.

Can you share with me your embeddings database CSV on telegram please bud?

This is not something I have experience with, although I'd be happy to work on it after my other PRs are reviewed and closed out. But if this is a task you'd feel right at home with, @sshivaditya2019, let me know if you'd prefer to take it (between us, but obviously anyone is welcome to pick it up).

I read this article covering those metrics; perhaps an LLM-based approach is not so bad after all. But it feels kinda wrong to me lmao

@0x4007 (Member) commented Nov 2, 2024

I would be happy to react to the responses to RLHF the results. I know what's accurate for most of the queries, especially anything Ubiquity-specific.

@0x4007 closed this as not planned on Nov 2, 2024.
@Keyrxng (Contributor, Author) commented Nov 2, 2024

> I would be happy to react to the responses to RLHF the results. I know what's accurate for most of the queries, especially anything Ubiquity-specific.

This works perfectly for the final output, but it's everything before that which we need to benchmark.

The precursor to the response is what needs benchmarking, folks, not the final output. If you could RLHF the returned embeddings or the fetched context, I'd agree to build it into the RLHF feature, but how do we plan to do that and validate what makes up our context window?

@shiv810 (Collaborator) commented Nov 2, 2024

> This works perfectly for the final output, but it's everything before that which we need to benchmark.

And for that, we need functional metrics, as I mentioned before, which are essentially what is required for the RL design as of now.

> The precursor to the response is what needs benchmarking, folks, not the final output. If you could RLHF the returned embeddings or the fetched context, I'd agree to build it into the RLHF feature, but how do we plan to do that and validate what makes up our context window?

This should not be a task on its own, but an extension of the RLHF task. Once we are done with the metrics implementation, this should just be comparisons against gold-standard responses.

@Keyrxng (Contributor, Author) commented Nov 2, 2024

Okay, let's rename this task to "Data retrieval validation and accuracy".

Perhaps you are misunderstanding me when I say "benchmark". I don't mean it in the traditional LLM sense; I mean our own internal benchmarks of the core features that make this plugin "work". I am not talking about adjusting the LLM responses or RLHF at all; I want to set benchmarks against which we can compare our large changes.

RLHF, by your previous explanation, is going to be part of the global prompt and global goals of the various applications, all of which are subject to minor or major changes that affect the core baseline performance of this plugin. It is not the ongoing adjustment of the plugin's supposedly perfect underlying logic, which in fact is not perfect.

@Keyrxng (Contributor, Author) commented Nov 2, 2024

Going to leave it there, I suppose; I've said all I can on it. Thanks for the back and forth, I do appreciate it, but obviously I don't want to hold anything back, so I look forward to seeing it implemented.
