diff --git a/near_deduplication/README.md b/near_deduplication/README.md
index f0f7a2b..c61f17e 100644
--- a/near_deduplication/README.md
+++ b/near_deduplication/README.md
@@ -27,8 +27,6 @@ python minhash_deduplication.py --dataset codeparrot/codeparrot-clean-valid \
     --column content \
     --cache-dir .cache \
     --min-ngram-size 5
-# For details on the arguments, see the help message
-python minhash_deduplication.py --help
 ```
 
 Spark Script
@@ -140,4 +138,4 @@ Total Reduction : 4974628 (19.80%)
 Total Time : 37881.83 seconds (10.5 hours)
 ```
 
-More details can be found on https://zippy-anise-556.notion.site/Deduplication-Log-d75d1b3f2e684e96a12b069c5aff68cb. We ignore physical size change because it is less relevant to the deduplication process and varies a lot depending on the data format.
\ No newline at end of file
+More details can be found on https://zippy-anise-556.notion.site/Deduplication-Log-d75d1b3f2e684e96a12b069c5aff68cb. We ignore physical size change because it is less relevant to the deduplication process and varies a lot depending on the data format.