
Commit

Merge pull request #63 from bigcode-project/v2-dedup
Add v2 near dedup scripts
ChenghaoMou authored Nov 28, 2023
2 parents 876f8ca + d8c47d4 commit a0f8041
Showing 5 changed files with 693 additions and 0 deletions.
6 changes: 6 additions & 0 deletions near_deduplication/README.md
@@ -2,6 +2,12 @@

This is our implementation of near deduplication for the BigCode dataset. It largely evolved from the [original repo](https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis/near-deduplication).

## V2

V2 runs the deduplication on Google Dataproc and Cloud Storage. The entry-point script is `bigcode-v2/run.sh`; update the parameters in that script to run it on your own dataset.
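
For orientation, here is a minimal sketch of how a PySpark job is typically submitted to a Dataproc cluster with the `gcloud` CLI. It is not the content of `bigcode-v2/run.sh`, which is not shown here. The helper name `build_dataproc_submit_command`, the cluster name, the region, the bucket paths, and the `--input`/`--output` job arguments are all hypothetical placeholders. The helper only assembles the command line, so you can inspect it before running anything.

```python
from typing import List


def build_dataproc_submit_command(
    cluster: str,
    region: str,
    script_uri: str,
    job_args: List[str],
) -> List[str]:
    """Assemble the argv for `gcloud dataproc jobs submit pyspark`.

    Hypothetical helper for illustration; it does not run the command.
    """
    command = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        script_uri,               # main PySpark file, e.g. a gs:// URI
        f"--cluster={cluster}",   # existing Dataproc cluster to run on
        f"--region={region}",     # region the cluster lives in
    ]
    if job_args:
        # Everything after "--" is handed to the PySpark script itself.
        command += ["--", *job_args]
    return command


if __name__ == "__main__":
    # All values below are placeholders, not the project's real settings.
    argv = build_dataproc_submit_command(
        cluster="my-dedup-cluster",
        region="us-central1",
        script_uri="gs://my-bucket/scripts/dedup.py",
        job_args=["--input", "gs://my-bucket/input",
                  "--output", "gs://my-bucket/output"],
    )
    print(" ".join(argv))
```

To run the job for real, pass the returned list to `subprocess.run(argv, check=True)` on a machine where `gcloud` is installed and authenticated.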

## V1.*

### Setup

````
