From 9ba85d0a1e7a264ace192e51993287dcee2e2435 Mon Sep 17 00:00:00 2001
From: Aleksey Korshuk <48794610+AlekseyKorshuk@users.noreply.github.com>
Date: Tue, 6 Jun 2023 22:32:36 -0700
Subject: [PATCH] Update README.md

---
 near_deduplication/README.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/near_deduplication/README.md b/near_deduplication/README.md
index f0f7a2b..c61f17e 100644
--- a/near_deduplication/README.md
+++ b/near_deduplication/README.md
@@ -27,8 +27,6 @@ python minhash_deduplication.py --dataset codeparrot/codeparrot-clean-valid \
 --column content \
 --cache-dir .cache \
 --min-ngram-size 5
-# For details on the arguments, see the help message
-python minhash_deduplication.py --help
 ```
 
 Spark Script
@@ -140,4 +138,4 @@ Total Reduction : 4974628 (19.80%)
 Total Time : 37881.83 seconds (10.5 hours)
 ```
 
-More details can be found on https://zippy-anise-556.notion.site/Deduplication-Log-d75d1b3f2e684e96a12b069c5aff68cb. We ignore physical size change because it is less relevant to the deduplication process and varies a lot depending on the data format.
\ No newline at end of file
+More details can be found on https://zippy-anise-556.notion.site/Deduplication-Log-d75d1b3f2e684e96a12b069c5aff68cb. We ignore physical size change because it is less relevant to the deduplication process and varies a lot depending on the data format.