reformat readme
loubnabnl authored May 12, 2023
1 parent 6c269c7 commit 0b3c1ba
Showing 1 changed file with 4 additions and 6 deletions.
10 changes: 4 additions & 6 deletions README.md
@@ -7,11 +7,10 @@ necessary used for model training.

  - `language_selection`: notebooks and file with language to file extensions mapping used to build the Stack v1.1.
  - `pii`: code for running PII detection and anonymization on code datasets.
+ - `decontamination`: script to remove files that match test-samples from code generation benchmarks.
  - `preprocessing`: code for filtering code datasets based on:
  - line length and percentage of alphanumeric characters (basic filter)
- - number of stars.
- - comments to code ratio.
- - tokenizer fertility
+ - number of stars, comments to code ratio, tokenizer fertility
  - Additional filters used for StarCoder Training:
  - basic-filter with parameters that depend on the file's extension.
  - filter to remove XML files
@@ -20,6 +19,5 @@
  - code to generate full-content with meta (repo-name, filename, num stars) for training
  - Filters for GitHub Issues
  - Filters for Git Commits
- - Script to convert Jupyter notebooks to scripts
- - Scripts to convert Jupyter notebooks to structured markdown-code-output triplets
- - `decontamination`: script to remove files that match test-samples from code generation benchmarks.
+ - Code to convert Jupyter notebooks to scripts
+ - Code to convert Jupyter notebooks to structured markdown-code-output triplets
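
As an illustration of the basic filter listed under `preprocessing` (line length and percentage of alphanumeric characters), a minimal Python sketch might look like the following. The function name `passes_basic_filter` and all threshold values are assumptions chosen for this example; they are not taken from the repository.

```python
def passes_basic_filter(
    text,
    max_line_len=1000,      # hypothetical threshold on the longest line
    max_avg_line_len=100,   # hypothetical threshold on the mean line length
    min_alnum_frac=0.25,    # hypothetical minimum share of alphanumeric characters
):
    """Return True if `text` passes a simple line-length / alphanumeric filter."""
    lines = text.splitlines()
    if not lines:
        # Empty files carry no training signal, so reject them up front.
        return False

    line_lengths = [len(line) for line in lines]
    longest = max(line_lengths)
    average = sum(line_lengths) / len(line_lengths)

    # Fraction of all characters (including whitespace) that are alphanumeric.
    alnum_frac = sum(ch.isalnum() for ch in text) / len(text)

    return (
        longest <= max_line_len
        and average <= max_avg_line_len
        and alnum_frac >= min_alnum_frac
    )
```

A file would be kept when `passes_basic_filter(content)` returns `True`. Per-extension parameters, as mentioned for the StarCoder filters, could be modelled by passing different threshold values for each file extension.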
