Commit

update correct scripts and readme.md
lingjzhu committed May 12, 2023
1 parent a39c4c0 commit af99753
Showing 4 changed files with 88 additions and 546 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -21,4 +21,5 @@ necessary used for model training.
- Filters for GitHub Issues
- Filters for Git Commits
- Script to convert Jupyter notebooks to scripts
- Scripts to convert Jupyter notebooks to structured markdown-code-output triplets
- `decontamination`: script to remove files that match test-samples from code generation benchmarks.
19 changes: 19 additions & 0 deletions preprocessing/README.md
@@ -40,3 +40,22 @@ We release the Jupyter scripts dataset as part of [StarCoderData](https://huggin
python jupyter_script_conversion.py
```

# Creating Jupyter-structured dataset

## Step 1
Parse Jupyter notebooks from `the Stack`.
```
python jupyter-structured/jupyter-segment-notebooks.py
```

## Step 2
Generate markdown-code-output triplets.
```
python jupyter-structured/jupyter-generate-triplets.py
```

## Step 3
Create notebook-level structured dataset using `jupyter-structured/jupyter-structured.ipynb`.
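Steps 1 and 2 can also be chained from a small Python driver. The following is a minimal sketch, not part of the repository: the function names `build_commands` and `run_pipeline` are hypothetical, and it assumes that both scripts are run from the `preprocessing/` directory and take no command-line arguments, as the commands above suggest. Step 3 is a notebook and is not covered here.

```python
import subprocess
import sys
from pathlib import Path

# Script paths exactly as listed in Steps 1 and 2 (assumed relative to preprocessing/).
STEP_SCRIPTS = [
    "jupyter-structured/jupyter-segment-notebooks.py",
    "jupyter-structured/jupyter-generate-triplets.py",
]


def build_commands(python=sys.executable):
    """Return one argv list per step, in execution order (hypothetical helper)."""
    return [[python, script] for script in STEP_SCRIPTS]


def run_pipeline(workdir=".", runner=subprocess.run):
    """Run each step in order; check=True raises on the first failing step,
    so Step 2 never starts if Step 1 exits with a non-zero status."""
    for cmd in build_commands():
        runner(cmd, cwd=Path(workdir), check=True)
```

Calling `run_pipeline("preprocessing")` from the repository root would run both scripts in sequence. The `runner` parameter exists only so the command order can be inspected without executing anything.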



17 changes: 0 additions & 17 deletions preprocessing/jupyter-structured/README.md

This file was deleted.
