diff --git a/README.md b/README.md
index ecdbd1d..9d6462a 100644
--- a/README.md
+++ b/README.md
@@ -7,11 +7,10 @@ necessary used for model training.
 - `language_selection`: notebooks and file with language to file extensions mapping used to build the Stack v1.1.
 - `pii`: code for running PII detection and anonymization on code datasets.
+- `decontamination`: script to remove files that match test-samples from code generation benchmarks.
 - `preprocessing`: code for filtering code datasets based on:
   - line length and percentage of alphanumeric characters (basic filter)
-  - number of stars.
-  - comments to code ratio.
-  - tokenizer fertility
+  - number of stars, comments to code ratio, tokenizer fertility
   - Additionnal filters used for StarCoder Training:
     - basic-filter with parameters that depend on the file's extension.
     - filter to remove XML files
@@ -20,6 +19,5 @@ necessary used for model training.
     - code to generate full-content with meta (repo-name, filename, num stars) for training
   - Filters for GitHub Issues
   - Filters for Git Commits
-  - Script to convert Jupyter notebooks to scripts
-  - Scripts to convert Jupyter notebooks to structured markdown-code-output triplets
-- `decontamination`: script to remove files that match test-samples from code generation benchmarks.
+  - Code to convert Jupyter notebooks to scripts
+  - Code to convert Jupyter notebooks to structured markdown-code-output triplets