update README to include replication steps
ConorOBrien-Foxx committed Jul 25, 2024
1 parent 7956054 commit d5365b4
Showing 10 changed files with 86 additions and 60 deletions.
50 changes: 44 additions & 6 deletions README.md
@@ -1,13 +1,50 @@
# emergent-capabilities
# Emergent Capabilities of LLMs for Software Engineering
A growing area of interest in Large Language Models (LLMs) is how increasing their size might change their behavior in ways not predictable from smaller-scale models. Analyzing these emergent capabilities is therefore crucial to understanding and developing LLMs. Yet whether LLMs exhibit emergence, or possess emergent capabilities, remains a contested question. Furthermore, most research into LLM emergence has focused on natural language processing tasks and the models suited to them.

## Primary Files
We focus on investigating emergence in the context of software engineering, and recontextualize the discussion of emergence in the context of prior research. We propose a multifaceted pipeline for evaluating and reasoning about emergent capabilities of LLMs in any context and instantiate this pipeline to analyze the emergent capabilities of the CodeGen1-multi model across four scales ranging from 350M parameters to 16.1B parameters. We examine the model's performance on the software engineering tasks of automatic bug fixing, code translation, and commit message generation. We find no evidence of emergent growth at this scale on these tasks and consequently discuss the future investigation of emergent capabilities.

## How to Replicate

### Installing the remote tests
Our `pull-tests.ipynb` includes many candidate avenues of research, but the only one to install is under the header `CoDiSum's data4CopynetV3.zip`. Otherwise, `cd data && git clone https://github.com/microsoft/CodeXGLUE.git` instantiates the other tasks' data.

Each software engineering task has a corresponding `.ipynb` file responsible for loading the models (defined in `model_wrapper.py`), the tasks (defined in `run_battery.py`), and the metrics (defined in `metric.py`). We include `bugs2fix.ipynb`, `bugs2fix-checklist.ipynb`, `code2code-trans.ipynb`, and `commit-message.ipynb` to run the models on the tasks and grade them according to our metrics, generating the results graphs. We generate the additional graphs (i.e. bootstrapping) with `metric-progress.ipynb`, and our tables with `tabulate-results.ipynb`.

In general, our software engineering tasks follow this code pattern:

```py
from run_battery import BatteryRunner, BatteryConfigs

runner = BatteryRunner.of(BatteryConfigs.TaskName) # replace with correct config name
runner.load_cases()

# Generate results
from model_wrapper import ModelFamily
runner.run_battery(
    family=ModelFamily.CodeGen1.multi,  # e.g.
    patch=False,  # change to True if you want to fill in blank lines, shouldn't be necessary
)

# Render graphs
import metric
runner.init_render(family=ModelFamily.CodeGen1.multi)  # e.g.
runner.render_metric_multi(
    [metric.ExactMatch, metric.BLEU, metric.CodeBLEUJava],
    save="./figs/OUTPATH-path-all.png",
)
```


## Repository Structure

### Primary Files
- `bugs2fix.ipynb` generates the graphs for the Bugs2Fix code repair task.
- `bugs2fix-checklist.ipynb` generates the graphs for the Bugs2Fix (Checklist) code repair task.
- `code2code-trans.ipynb` generates the graphs for the Code2Code code translation task.
- `commit-message.ipynb` generates the graphs for the commit message generation task.
- `tabulate-results.ipynb` generates the tables and pulls together the information.

- `pull-tests.ipynb` installs the datasets from BIG and other various places. (I'm pretty sure CodeXGLUE was not installed this way - the repository was simply cloned to `data/CodeXGLUE`.)
- `trim-tokens.ipynb` (***TODO***) is to uniformly trim output lines to ensure all lines are at most 500 tokens long (useful because various configurations were used during the testing process).
- `pull-tests.ipynb` installs the datasets from BIG and other various places. (Note: CodeXGLUE was not installed this way - the repository was simply cloned to `data/CodeXGLUE`.)

- `bleu.py` is code adapted from CodeXGLUE which calculates the BLEU metric.
- `metric.py` is a wrapper around the various metrics we used in this project.
@@ -16,7 +53,7 @@
- `run_battery.py` is helper code which streamlines the testcase running process.
- `timehelp.py` is helper code which is responsible for timing operations and formatting them.

## Scaffolding Files
### Scaffolding Files

- `accelerate-test.ipynb` is scaffolding code which became the basis for `model_wrapper.py`, testing GPU loading & caching of the codegen models.
- `codexglue-test.ipynb` is a scratchpad for initial testing of various prompts.
@@ -26,4 +63,5 @@
- `testing.ipynb` is used for miscellaneous testing, but primarily the parsing of multiple choice questions.
- `verify-result.ipynb` is debugging code used to examine questionable input/output pairs and assess what caused them (in this case, a bug/false assumption in the generation code).
- `wrapper-test.ipynb` is a simple testing file for making sure the model wrapper works correctly.
- `test.py` is an old testing file.
- `trim-tokens.ipynb` was a planned experiment to normalize token lengths across experiments.
18 changes: 5 additions & 13 deletions bugs2fix-checklist.ipynb


14 changes: 2 additions & 12 deletions bugs2fix.ipynb


20 changes: 7 additions & 13 deletions code2code-trans.ipynb


Binary file modified figs/b2f-all.png
Binary file modified figs/b2f-cl-all.png
Binary file modified figs/b2f-cl-bootstrap-all.png
Binary file modified figs/c2c-all.png
38 changes: 24 additions & 14 deletions metric-progress.ipynb


6 changes: 4 additions & 2 deletions metric.py
@@ -65,8 +65,10 @@ def from_shortname(shortname):
return Metric.Directory[shortname]

ExactMatch = Metric(
    #name="Accuracy% (Exact Match)",
    #simplename="Accuracy%",
    name="Exact Match",
    simplename="Exact Match",
    shortname="em",
    latex_name="EM",
    grade_single = lambda truth, answer: truth == answer,
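The `grade_single` field above is the core of the exact-match metric: it compares one model answer against the ground truth, and a battery's score is then the mean of the per-case grades. A minimal sketch of that aggregation, assuming the rest of the `Metric` machinery is stripped away (the helper `exact_match_score` is illustrative, not part of the repository's API):

```python
# Illustrative sketch: how a grade_single-style comparison aggregates
# into an exact-match score over a set of test cases.

grade_single = lambda truth, answer: truth == answer

def exact_match_score(truths, answers):
    """Fraction of cases where the model output exactly matches the reference."""
    grades = [grade_single(t, a) for t, a in zip(truths, answers)]
    return sum(grades) / len(grades)

truths  = ["return x + y;", "int i = 0;", "System.out.println(s);"]
answers = ["return x + y;", "int i=0;",   "System.out.println(s);"]
print(exact_match_score(truths, answers))  # → 0.6666666666666666
```

Note that exact match is an intentionally strict grade: the whitespace difference in the second case counts as a miss even though the code is semantically identical, which is why the README pairs it with BLEU and CodeBLEU.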
