Documentation: Evaluation run lifecycle #2506

Open
wants to merge 2 commits into
base: main
yifanmai
Collaborator

No description provided.

@yifanmai yifanmai requested review from percyliang and farzaank March 25, 2024 21:54
@@ -0,0 +1,46 @@
# Evaluation Run Lifecycle

Each invocation of `helm-run` will run a number of **evaluation runs**. Each evaluation run uses a single scenario and a single model, and is performed independently form other evaluation runs. Evaluation runs are usually executed serially / one at a time by the default runner, though some alternate runners (e.g. `SlurmRunner`) may execute evaluation runs in parallel.
Contributor

form -> from

4. Send the requests to the models and receives the request responses.
5. Compute the per-instance stats and aggregate them to per-run stats.

The following code and data objects are responsible involved in an evaluation run:
Contributor

Very helpful that the 5 points here map back to the previous 5


When a user runs `helm-run`, the evaluation runner will perform a number of evaluation runs, each specified by a `RunSpec`. However, the user typically does not provide the `RunSpec`s directly. Instead, the `RunSpec`s are produced by **run spec functions**. The user instead passes one or more **run entries** to `helm-run`, which are short strings (e.g. `mmlu:subject=anatomy,model=openai/gpt2`) that specify how to invoke the run spec functions to get the actual `RunSpec`s.

The run entry format is explained further on its own documentation.
Contributor

Would be useful to link to that here

Member

Yes, and then we can just say "further here"
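
As an illustration of the run entry paragraph quoted above, the sketch below shows one way a run entry string could be split into a run spec function name and keyword arguments. It is a minimal sketch under stated assumptions: `parse_run_entry`, `mmlu_run_spec`, and `RUN_SPEC_FUNCTIONS` are hypothetical names, the returned dictionary only stands in for a `RunSpec`, and HELM's actual parser may handle cases that this one does not.

```python
from typing import Callable, Dict, Tuple


def parse_run_entry(entry: str) -> Tuple[str, Dict[str, str]]:
    """Split 'name:key=value,key=value' into (name, {key: value})."""
    name, _, arg_text = entry.partition(":")
    args: Dict[str, str] = {}
    for pair in filter(None, arg_text.split(",")):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {pair!r}")
        args[key] = value
    return name, args


def mmlu_run_spec(subject: str, model: str) -> dict:
    # Hypothetical stand-in for a run spec function.
    # It returns a plain dict, not HELM's RunSpec class.
    return {"scenario": "mmlu", "scenario_args": {"subject": subject}, "model": model}


# Hypothetical registry mapping run entry names to run spec functions.
RUN_SPEC_FUNCTIONS: Dict[str, Callable[..., dict]] = {"mmlu": mmlu_run_spec}

name, args = parse_run_entry("mmlu:subject=anatomy,model=openai/gpt2")
run_spec = RUN_SPEC_FUNCTIONS[name](**args)
```

The point of the indirection is that the short string only selects and parameterizes a function; the full configuration lives in whatever that function returns.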

3. An `AdapterSpec` specifies an `Adapter` instance.
4. `MetricSpec`s specifies `Metric` instances.

Note: The `RunSpec` does not contain a `ClientSpec` specifies the `Client` instance. Instead, the `RunSpec` specifies the name of the model deployment inside `AdapterSpec`. During the evaluation run, the model deployment name is used to retreive the `ClientSpec` from built-in or user-provided model deployment configurations, which is then used to construct the `Client`. This late binding allows the HELM user to perform user-specific configuration of clients, such as changing the type or location of the model inference platform for the model.
Member

retreive -> retrieve
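
The late binding described in the note above can be sketched as a lookup from a model deployment name to a `ClientSpec` at run time. This is a minimal sketch, not HELM's implementation: the `ClientSpec` fields, the `resolve_client_spec` helper, the placeholder class names and URLs, and the rule that user-provided configurations take precedence over built-in ones are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ClientSpec:
    # Assumed shape: a class name plus constructor arguments.
    class_name: str
    args: Dict[str, Any] = field(default_factory=dict)


# Hypothetical built-in deployment configuration.
BUILT_IN_DEPLOYMENTS: Dict[str, ClientSpec] = {
    "openai/gpt2": ClientSpec("HostedClient", {"base_url": "https://example.invalid"}),
}


def resolve_client_spec(
    deployment_name: str,
    user_deployments: Optional[Dict[str, ClientSpec]] = None,
) -> ClientSpec:
    """Look up the ClientSpec by deployment name when the run executes."""
    # Assumption: user-provided entries override built-in ones.
    merged = {**BUILT_IN_DEPLOYMENTS, **(user_deployments or {})}
    if deployment_name not in merged:
        raise ValueError(f"Unknown model deployment: {deployment_name!r}")
    return merged[deployment_name]


default_spec = resolve_client_spec("openai/gpt2")
local_spec = resolve_client_spec(
    "openai/gpt2",
    {"openai/gpt2": ClientSpec("LocalClient", {"device": "cpu"})},
)
```

Because the run configuration stores only the deployment name, the same configuration can resolve to different clients depending on the user's deployment settings.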


The following code and data objects are responsible involved in an evaluation run:

1. A `Scenario` provides the in context learning and evaluation `Instance`s.
Member

nit: in-context

2. A `DataAugmenter` takes in base `Instance` and generates perturbed `Instance`.
3. A `Adapter` transforms in-context learning instances and evaluation instances into model inference `Request`s.
4. A `Client` sends the `Requests` to the models and receives `RequestResponse`s.
5. `Metrics`s take in `RequestState`s (which each contain a `Instance`, `Request`,`RequestResponse`, and additional instance context) and compute aggregated adn per-instanace `Stat`s.
Member

typo: "and" I think (towards the end of the sentence).

1. Get in-context learning and evaluation instances from scenario. Each instance has an input (e.g. question) and a set of reference outputs (e.g. multiple choice options).
2. (Advanced) Run _data augmenters / perturbations_ on the base instances to generate perturbed instances.
3. Perform _adaptation_ to transform the in-context learning instances and evaluation instances into model inference requests, which contain prompts and other request parameters such as request temperature and stop sequences.
4. Send the requests to the models and receives the request responses.
Member

"receives" should be "receive"

4. Send the requests to the models and receives the request responses.
5. Compute the per-instance stats and aggregate them to per-run stats.

The following code and data objects are responsible involved in an evaluation run:
Member

"responsible involved in" should be either "responsible for" or just "involved in"
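
The five lifecycle steps quoted in the hunks above can be sketched end to end with toy stand-ins. This is a minimal sketch under stated assumptions, not HELM code: the `Instance` fields, the lowercasing perturbation, the prompt layout, the always-"4" `fake_model`, and the exact-match statistic are all hypothetical and chosen only to make each step visible.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Instance:
    input: str
    references: List[str]  # candidate outputs, e.g. multiple-choice options
    correct: str


def load_instances() -> Dict[str, List[Instance]]:
    """Step 1: a toy scenario with in-context (train) and evaluation instances."""
    return {
        "train": [Instance("What is 1 + 1?", ["2", "3"], "2")],
        "eval": [
            Instance("What is 2 + 2?", ["3", "4"], "4"),
            Instance("What is 3 + 3?", ["6", "7"], "6"),
        ],
    }


def perturb(instance: Instance) -> Instance:
    """Step 2: a toy perturbation that lowercases the input."""
    return Instance(instance.input.lower(), instance.references, instance.correct)


def adapt(train: List[Instance], instance: Instance) -> dict:
    """Step 3: build a request with a few-shot prompt and request parameters."""
    shots = "".join(f"Q: {t.input}\nA: {t.correct}\n\n" for t in train)
    return {
        "prompt": f"{shots}Q: {instance.input}\nA:",
        "temperature": 0.0,
        "stop_sequences": ["\n"],
    }


def fake_model(request: dict) -> str:
    """Step 4: stand-in for a client call; this fake model always answers "4"."""
    return "4"


def run() -> Dict[str, object]:
    data = load_instances()
    eval_instances = data["eval"] + [perturb(i) for i in data["eval"]]
    requests = [adapt(data["train"], i) for i in eval_instances]
    responses = [fake_model(r) for r in requests]
    # Step 5: per-instance exact match, then aggregate to a per-run mean.
    per_instance = [
        float(resp == inst.correct) for resp, inst in zip(responses, eval_instances)
    ]
    return {"per_instance": per_instance, "exact_match": mean(per_instance)}


result = run()
```

Each function corresponds to one numbered step, so swapping in a different perturbation or prompt layout changes only that function.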


1. Specifications (`RunSpec`, `ScenarioSpec`, `DataAugmenterSpec`, `AdapterSpec`, `ClientSpec`, and `MetricsSpec`) are serializable. They may be written to evaluation run output files, to provide a record of how the evaluation run was configured and how to reproduce it.
2. Code objects (`Scenario`, `DataAugmenter`, `Adapter`, `Client`, `Metric`) are _not_ serializable. These contain program logic used for by the evlauation run. Users can implement custom subclasses of these objects if needed.
3. Data objects (`Instance`, `Request`, `Response`, `Stat`) are serializable. These are typcically produced as outputs of code objects and written to the evaluation run output files.
Member

typo: typically


When a user runs `helm-run`, the evaluation runner will perform a number of evaluation runs, each specified by a `RunSpec`. However, the user typically does not provide the `RunSpec`s directly. Instead, the `RunSpec`s are produced by **run spec functions**. The user instead passes one or more **run entries** to `helm-run`, which are short strings (e.g. `mmlu:subject=anatomy,model=openai/gpt2`) that specify how to invoke the run spec functions to get the actual `RunSpec`s.

The run entry format is explained further on its own documentation.
Member

Yes, and then we can just say "further here"


The objects above can be grouped into three categories:

1. Specifications (`RunSpec`, `ScenarioSpec`, `DataAugmenterSpec`, `AdapterSpec`, `ClientSpec`, and `MetricsSpec`) are serializable. They may be written to evaluation run output files, to provide a record of how the evaluation run was configured and how to reproduce it.
Member

Don't need the comma after run output files
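
The distinction between serializable specifications and non-serializable code objects, as described in the hunk above, can be sketched as follows. This is a minimal sketch, assuming a spec is a class name plus constructor arguments; `ObjectSpec`, `ExactMatchMetric`, `REGISTRY`, and `create_object` are hypothetical names used for illustration, and HELM's own classes and lookup mechanism may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class ObjectSpec:
    # Serializable description of how to construct a code object.
    class_name: str
    args: Dict[str, Any] = field(default_factory=dict)


class ExactMatchMetric:
    """A code object: holds program logic and is not meant to be serialized."""

    def __init__(self, ignore_case: bool = False):
        self.ignore_case = ignore_case

    def score(self, prediction: str, reference: str) -> float:
        if self.ignore_case:
            prediction, reference = prediction.lower(), reference.lower()
        return float(prediction == reference)


# Hypothetical registry; a real system might resolve dotted import paths instead.
REGISTRY = {"ExactMatchMetric": ExactMatchMetric}


def create_object(spec: ObjectSpec) -> Any:
    """Instantiate the code object that a spec describes."""
    return REGISTRY[spec.class_name](**spec.args)


spec = ObjectSpec("ExactMatchMetric", {"ignore_case": True})
serialized = json.dumps(asdict(spec), sort_keys=True)  # what could be written to an output file
restored = ObjectSpec(**json.loads(serialized))
metric = create_object(restored)
```

Writing the spec rather than the object to the output files is what lets a later reader reconstruct the same configuration without pickling program logic.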

3 participants