Merge pull request #65 from hynky1999/hynky1999-patch-1
Update README.md
hynky1999 authored May 12, 2023
2 parents e510ab7 + 5e6ed30 commit cbbe900
Showing 1 changed file with 10 additions and 12 deletions.
README.md: 22 changes (10 additions, 12 deletions)
@@ -14,9 +14,9 @@ To create them you need example html files you want to extract.
You can use the following command to get html files from the CommonCrawl dataset:

```bash
-$ cmondownload --limit=100 --output_type=html yoursite.com output_dir
+$ cmon download --match_type=domain --limit=1000 example.com html_output html
```
-This will download a first 100 html files from yoursite.com and save them in output_dir.
+This will download the first 1000 html files from example.com and save them in html_output.

#### Extractor creation
Once you have the files to extract, you can create your extractor.
@@ -36,9 +36,9 @@ class MyExtractor(BaseExtractor):
# You can also override the following methods to drop the files you don't want to extract
# Return True to keep the file, False to drop it
def filter_raw(self, response: str, metadata: PipeMetadata) -> bool:
-    pass
+    return True
def filter_soup(self, soup: BeautifulSoup, metadata: PipeMetadata) -> bool:
-    pass
+    return True

# Make sure to instantiate your extractor into extractor variable
# The name must match so that the framework can find it
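
Pieced together, the fragments above describe a full extractor module. The sketch below shows one plausible shape for such a file; the import paths and the extract_soup hook are assumptions not shown in this diff, so verify the exact names against the repository before relying on it.

```python
# A minimal, hypothetical extractor module assembled from the fragments above.
# The import paths and the extract_soup method name are assumptions; only
# BaseExtractor, PipeMetadata and the filter_* signatures appear in this diff.
from bs4 import BeautifulSoup
from cmoncrawl.common.types import PipeMetadata
from cmoncrawl.processor.pipeline.extractor import BaseExtractor


class MyExtractor(BaseExtractor):
    def extract_soup(self, soup: BeautifulSoup, metadata: PipeMetadata):
        # Pull the fields you care about out of the parsed HTML and return
        # them as a dict; returning None drops the page from the output.
        title = soup.select_one("title")
        if title is None:
            return None
        return {"title": title.get_text(strip=True)}

    # Return True to keep the file, False to drop it
    def filter_raw(self, response: str, metadata: PipeMetadata) -> bool:
        return True

    def filter_soup(self, soup: BeautifulSoup, metadata: PipeMetadata) -> bool:
        return True


# The framework looks the module up through this variable name
extractor = MyExtractor()
```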
@@ -74,7 +74,7 @@ In our case the config would look like this:
To test the extraction, you can use the following command:

```bash
-$ cmonextract --mode=html html_file1 html_file2 ... html_fileN extraction_output_dir config_file
+$ cmon extract config.json extracted_output html_output/*/*.html html
```

### Crawl the sites
@@ -85,25 +85,23 @@ To do this you will proceed in two steps:
To do this, you can use the following command:

```bash
-$ cmondownload --limit=100000 --output_type=record yoursite.com output_dir
+$ cmon download --match_type=domain --limit=100000 example.com dr_output record
```

-This will download the first 100000 records from yoursite.com and save them in output_dir. By default it saves 100_000 records per file, you can change this with the --max_crawl_per_file option.
+This will download the first 100000 records from example.com and save them in dr_output. By default it saves 100_000 records per file; you can change this with the --max_crawls_per_file option.

#### 2. Extract the records
Once you have the records, you can use the following command to extract them:

```bash
-$ cmonextract --nproc=4 --mode=record record_file1 record_file2 ... record_fileN extraction_output_dir config_file
+$ cmon extract --n_proc=4 config.json extracted_output dr_output/*/*.jsonl record
```

-Note that you can use the --nproc option to specify the number of processes to use for the extraction. Multiprocessing is done on file level, so if you have just one file it will not be used.
+Note that you can use the --n_proc option to specify the number of processes to use for the extraction. Multiprocessing is done at the file level, so if you have just one file it will not be used.


### Advanced usage
The whole project was written with modularity in mind. That means that you
-can adjust the framework to your needs.
-
-#TODO add more info about pipeline
+can adjust the framework to your needs. To learn more, see the [documentation](https://hynky1999.github.io/CmonCrawl/build/html/index.html).

Instead of first getting the records and then extracting them, you can do both in a distributed setting. For more info, look at the [CZE-NEC](https://github.com/hynky1999/Czech-News-Classification-dataset) project.
