Benchmark Guide
Sergei Alonichau edited this page May 27, 2020 · 3 revisions
This wiki page describes how Bling Fire was benchmarked against other tokenizers and how to reproduce the measurements.
The benchmark was performed in December 2018.
Passages (total running time in seconds, 100 repeats) | Bling Fire Min | Bling Fire Max | Bling Fire Avg | SpaCy Min | SpaCy Max | SpaCy Avg | NLTK Min | NLTK Max | NLTK Avg |
---|---|---|---|---|---|---|---|---|---|
1K | 0.074 | 0.125 | 0.079 | 1.470 | 1.581 | 1.529 | 1.681 | 1.793 | 1.733 |
10K | 0.805 | 0.865 | 0.823 | 8.500 | 9.370 | 8.653 | 17.739 | 18.213 | 17.821 |
100K | 7.941 | 8.161 | 8.018 | 86.577 | 93.095 | 87.700 | 181.032 | 185.407 | 182.079 |
- OS: Linux Ubuntu
- Machine: Azure VM, 6 vCPUs (Intel Xeon CPU E5-2690 v3 @ 2.60GHz), 56GB memory.
- Python version: 3.5.6
- SpaCy version: 2.0.17
- NLTK version: 3.4
- Corpus: English Gigawords
- Warm-up time is subtracted: the first 10% of passages are used as warm-up and excluded from the benchmark calculation.
- Each data size was run 100 times; the table reports the minimum, maximum, and average over those runs.
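The warm-up exclusion and the 100-run repetition described above can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: the function name `time_tokenizer` and its parameters are hypothetical, `str.split` stands in for a real tokenizer, and the warm-up pass is simply left untimed, which may differ from how the actual script subtracts warm-up time.

```python
import statistics
import time


def time_tokenizer(tokenize, passages, warmup_fraction=0.1, repeats=100):
    """Return (min, max, avg) total running time in seconds over `repeats` runs.

    The first `warmup_fraction` of the passages is tokenized before the clock
    starts, so it does not count toward the measured time.
    """
    n_warmup = int(len(passages) * warmup_fraction)
    warmup, measured = passages[:n_warmup], passages[n_warmup:]

    totals = []
    for _ in range(repeats):
        for text in warmup:      # warm-up pass, not timed
            tokenize(text)
        start = time.perf_counter()
        for text in measured:    # timed pass
            tokenize(text)
        totals.append(time.perf_counter() - start)

    return min(totals), max(totals), statistics.mean(totals)


if __name__ == "__main__":
    # Placeholder data and tokenizer, for illustration only.
    sample = ["This is a short passage."] * 1000
    lo, hi, avg = time_tokenizer(str.split, sample, repeats=5)
    print("min={:.4f}s max={:.4f}s avg={:.4f}s".format(lo, hi, avg))
```

To time a real library, pass its tokenizing function in place of `str.split`.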
The benchmark script currently supports three types of corpus.
- English Gigawords. You can get a sample and run with it.
- MS-MARCO
- Plain text. Any English text file in which documents are separated by "\n".
Go to the **/scripts** folder, where you should see benchmark.py. Running it with the desired parameters produces the benchmark result.
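For the plain text format, a loader along the following lines would turn a file into a list of passages. This is only a sketch under assumptions: `load_plaintext_passages` is a hypothetical helper, not a function from benchmark.py, and skipping empty lines is a choice made here rather than documented behavior.

```python
import io


def load_plaintext_passages(stream, limit=None):
    """Return up to `limit` non-empty passages, one per line of `stream`."""
    passages = []
    for line in stream:
        text = line.rstrip("\n")
        if not text:             # assumption: blank lines are not passages
            continue
        passages.append(text)
        if limit is not None and len(passages) >= limit:
            break
    return passages


if __name__ == "__main__":
    # An in-memory stream stands in for open("corpus.txt", encoding="utf-8").
    demo = io.StringIO("First document.\nSecond document.\n\nThird document.\n")
    print(load_plaintext_passages(demo, limit=2))
```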
Args | Comment | Example |
---|---|---|
-d | Specify the data set | python3 benchmark.py -d englishgigawords.txt |
-n | Number of passages | python3 benchmark.py -n 1000 |
-o | Output the result. No output if this arg is not specified | python3 benchmark.py -o |
-s | Specify the type of data set. Default is plain text. Options: marco, plaintext, englishgigawords | python3 benchmark.py -s englishgigawords |
-w | Warm-up size: the number of passages used as warm-up. Default is 100. Use this together with '-n'. For example, with '-n 1000 -w 100' the reported result is the processing time of 900 passages | python3 benchmark.py -n 1100 -w 100 |
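The relationship between '-n' and '-w' in the table can be illustrated with a small `argparse` sketch. It mirrors the flags listed above but is not the parser used by benchmark.py: the `dest` names are hypothetical, and treating '-o' as a boolean switch is an assumption based on the example command.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Tokenizer benchmark (sketch)")
    parser.add_argument("-d", dest="dataset", help="path to the data set")
    parser.add_argument("-n", dest="num_passages", type=int,
                        help="number of passages")
    # Assumption: -o is a switch that takes no value.
    parser.add_argument("-o", dest="output", action="store_true",
                        help="output the result")
    parser.add_argument("-s", dest="source_type", default="plaintext",
                        choices=["marco", "plaintext", "englishgigawords"],
                        help="type of data set")
    parser.add_argument("-w", dest="warmup", type=int, default=100,
                        help="number of warm-up passages")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-n", "1100", "-w", "100"])
    # Passages that count toward the reported time: total minus warm-up.
    print(args.num_passages - args.warmup)
```

With '-n 1100 -w 100', 1,000 passages remain after the warm-up set is removed, which matches the 1K row size used in the results.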
Compared with other popular NLP libraries, Bling Fire is roughly 10x faster than SpaCy and roughly 20x faster than NLTK on the tokenization task.
System | Avg Run Time (Seconds per 10,000 Passages) |
---|---|
Bling Fire | 0.823 |
SpaCy | 8.653 |
NLTK | 17.821 |
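The speed-up factors implied by the averages in the table can be checked with a few lines of arithmetic. The helper name `speedups` is hypothetical; the numbers are taken directly from the table.

```python
# Average run time in seconds per 10,000 passages, copied from the table.
avg_seconds = {"Bling Fire": 0.823, "SpaCy": 8.653, "NLTK": 17.821}


def speedups(times, baseline="Bling Fire"):
    """Return how many times slower each system is than the baseline."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


if __name__ == "__main__":
    for name, ratio in sorted(speedups(avg_seconds).items()):
        print("{}: {:.1f}x".format(name, ratio))
```

The ratios come out at about 10.5 for SpaCy and about 21.7 for NLTK.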