Benchmarking #32
It needs to be mapped out some more: standardizing question sets, building issues/PRs to benchmark with, the data points we should target, and how to improve things over time. As for the actual implementation details, I'm hoping @sshivaditya2019 can contribute to the spec in that regard if they can. Otherwise I'd be happy to share my thoughts on how I'd go about it.
Not sure how sophisticated this testing will be, but somewhere between 1 day and 1 week of work is what I expect it will take to wrangle the infra.
I think this should be a part of #18. It should be handled as a performance metric within the RL framework, and could be an extension of the metric we use within the framework to quantify responses.
Nice, I've never seen that before. I just found https://github.com/openai/evals through it, which seems to be a testing framework of sorts, but it's Python based so I'm ruling myself out if that's the way to go. To automate things we could write a workflow that only collaborators can run during a PR, which would comment on the various benchmarking issues with the exact same phrase that caused the issue, or with predefined test questions. It would require logic to accept a bot comment only if it contains a specific marker. The analysis would need to be parsed from the response on the issue, but that's fine, as that issue would become our store for visualizing how things are progressing over time very easily. Effectively this would be our unit test for the plugin; manual QA would not be required by contributors as this workflow would be the QA itself.
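As a rough sketch of that automation (not a working implementation): a collaborator-triggered script could post the predefined test questions to a dedicated benchmarking issue and tag each comment so the bot's replies can be matched up later. The org/repo names, the benchmark issue number, the questions file path, and the HTML-comment marker below are all assumptions for illustration.

```ts
// Hypothetical benchmark poster: submits predefined /ask questions to a
// benchmarking issue so the bot's replies can be collected and scored later.
import { Octokit } from "@octokit/rest";
import { readFileSync } from "fs";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Assumptions for illustration only.
const owner = "ubiquity-os-marketplace";
const repo = "command-ask";
const BENCHMARK_ISSUE = 32; // the issue that acts as the result store
const QUESTIONS_FILE = "benchmark/questions.json";

interface TestQuestion {
  id: string;
  question: string;
}

async function postBenchmarkQuestions(): Promise<void> {
  const questions: TestQuestion[] = JSON.parse(readFileSync(QUESTIONS_FILE, "utf8"));

  for (const q of questions) {
    // The hidden HTML comment is the marker used to pair a bot reply with the
    // question that triggered it when parsing results later.
    await octokit.rest.issues.createComment({
      owner,
      repo,
      issue_number: BENCHMARK_ISSUE,
      body: `/ask ${q.question}\n\n<!-- benchmark-id: ${q.id} -->`,
    });
  }
}

postBenchmarkQuestions().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

A later step of the same workflow would read the bot's replies back (e.g. via octokit.rest.issues.listComments), match them by marker, and score them.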
For testing, we can use metrics-based evaluation techniques or some form of functional testing. In terms of metrics, we can use the following to compare the gold standard responses with the responses generated by the model:
These are some metrics that could be used to compare against the gold standard responses. However, this approach is not a solution for benchmarking; it would instead serve as a functional testing case. Manual QA should still be required: it gives reviewers more insight into the PR and what it actually accomplishes.
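For illustration only, and not a claim about which metrics the project settles on: one lightweight way to compare a generated response against a gold standard response is embedding cosine similarity. The model name and function names here are assumptions.

```ts
// Sketch: score a generated answer against a gold standard answer by
// embedding both and taking the cosine similarity of the vectors.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const normA = Math.sqrt(a.reduce((s, v) => s + v * v, 0));
  const normB = Math.sqrt(b.reduce((s, v) => s + v * v, 0));
  return dot / (normA * normB);
}

export async function scoreAgainstGold(generated: string, gold: string): Promise<number> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumption, any embedding model works
    input: [generated, gold],
  });
  return cosineSimilarity(data[0].embedding, data[1].embedding);
}
```

A threshold per question, or simply tracking the raw score over time, would turn this into a pass/fail check or a trend line.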
My biggest concern is accurately evaluating the influence the ever-changing embeddings have on responses. In my mind, to actually validate things we'd need to see what was returned and then, either manually or programmatically (without an LLM), check whether there was more relevant information that could have been returned from the DB, why it wasn't, and how we can optimize to return it. Can you share your embeddings database CSV with me on Telegram please bud? This is not something I have experience with, although I'd be happy to work on it after my other PRs are reviewed and closed out. But if this is a task you'd feel right at home with @sshivaditya2019, let me know if you'd prefer to take it (between us, but obviously anyone is welcome to take it). I read this article covering those metrics, so perhaps an LLM-based approach is not so bad after all. But it feels kinda wrong to me lmao
I would be happy to react to the responses to RLHF the results. I know what's accurate for most of the queries, especially anything Ubiquity-specific.
This works perfectly for the final output, but it's everything before that we need to benchmark. The precursor to the response is what needs to be benchmarked, folks, not the final output. If you could RLHF the returned embeddings or the fetched context, I'd agree to build it into the RLHF feature, but how do we plan to do that and validate what makes up our context window?
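To make "benchmark the precursor" concrete, here is a minimal sketch, assuming a labeled set of relevant chunk IDs per query and a searchEmbeddings stand-in for however the plugin actually queries its vector store; it scores the retrieval step with recall@k and involves no LLM at all.

```ts
// Sketch, not the plugin's actual code: scoring the retrieval step itself.
interface RetrievalCase {
  query: string;
  expectedIds: string[]; // ids a human marked as relevant in the embeddings DB
}

function recallAtK(retrievedIds: string[], expectedIds: string[], k: number): number {
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = expectedIds.filter((id) => topK.has(id)).length;
  return expectedIds.length === 0 ? 1 : hits / expectedIds.length;
}

// searchEmbeddings is a placeholder for the real embeddings search call.
export async function benchmarkRetrieval(
  cases: RetrievalCase[],
  searchEmbeddings: (query: string) => Promise<string[]>,
  k = 10
): Promise<number> {
  let total = 0;
  for (const c of cases) {
    const retrieved = await searchEmbeddings(c.query);
    total += recallAtK(retrieved, c.expectedIds, k);
  }
  return total / cases.length; // mean recall@k across the question set
}
```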
And for that, we need functional metrics, as I mentioned before, which are essentially what is required for the RL design as of now.
This should not be a task on its own but an extension of the RLHF task. Once we are done with the metrics implementation, this should just be comparisons against gold standard responses.
Okay, let's rename this task to "Data retrieval validation and accuracy"; perhaps you are misunderstanding me when I say "benchmark". I don't mean it in the traditional LLM sense, I mean our own internal benchmarks of the core features that make this plugin "work". I am not talking about adjusting the LLM responses or RLHF at all; I want to set benchmarks that we can compare our large changes against. RLHF, by your previous explanation, is going to be part of the global prompt and global goals of the various applications, all of which are subject to minor or major changes which affect the core baseline performance of this plugin. I'm not talking about ongoing adjustments to its assumed-perfect underlying logic, which is not perfect.
Going to leave it there I suppose, I've said all I can on it. Thanks for the back and forth, I do appreciate it, but I don't want to hold anything back obviously, so I look forward to seeing it implemented.
I enjoyed working on the first few AI features when I started contributing but I quickly realized the struggle in obtaining desired outcomes with our intricate use-cases and extended context windows.
It's even harder to unit test, or test in the normal sense, so validating became pretty hard and PRs took a long time to merge, which is why I focused on other areas of the DAO to add value.
I suggest we finally implement an absolutely unshakeable and beyond-reproach benchmarking strategy for our AI features, but this spec focuses on /ask in particular.

Objective: Establish a benchmark to evaluate the LLM's factual accuracy and relevance in responses using two primary data sources: static context (linked issues) and dynamic embeddings search.
Core Components
1. Static Context (Linked Issues)
2. Dynamic Embeddings Search
3. Output Effectiveness with Combined Context
Test Workflow
Deliverables
This benchmark framework will allow reliable measurement of the LLM's accuracy and adaptability with respect to static context and dynamic embeddings, enabling consistent future optimizations.
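A minimal sketch of how the test workflow could tie these pieces together, assuming hypothetical runAsk and scoreAnswer entry points (neither name comes from the plugin): each benchmark case carries a question, a gold answer, and the context IDs a human expects the embeddings search to surface, so retrieval recall and answer quality are reported separately and static-context vs. dynamic-embeddings changes can be tracked independently.

```ts
// Sketch of the benchmark harness under assumed names; not the actual spec.
interface BenchmarkCase {
  question: string;
  goldAnswer: string;
  expectedContextIds: string[];
}

interface BenchmarkResult {
  question: string;
  retrievalRecall: number; // did the embeddings search surface the right context?
  answerScore: number;     // how close is the final answer to the gold standard?
}

export async function runBenchmarkSuite(
  cases: BenchmarkCase[],
  runAsk: (q: string) => Promise<{ answer: string; contextIds: string[] }>,
  scoreAnswer: (generated: string, gold: string) => Promise<number>
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const c of cases) {
    const { answer, contextIds } = await runAsk(c.question);
    const hits = c.expectedContextIds.filter((id) => contextIds.includes(id)).length;
    results.push({
      question: c.question,
      retrievalRecall: c.expectedContextIds.length ? hits / c.expectedContextIds.length : 1,
      answerScore: await scoreAnswer(answer, c.goldAnswer),
    });
  }
  return results;
}
```

Results like these could be posted back onto the benchmarking issue by the workflow, turning that issue into the longitudinal store for seeing how large changes move the baseline.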
https://chatgpt.com/share/67265190-bec8-8000-ae58-a973a006e3aa