More results (unlabelled data)
greenw0lf committed Jul 8, 2024
1 parent f961471 commit 5db0a9a
Showing 3 changed files with 84 additions and 28 deletions.
26 changes: 22 additions & 4 deletions NISV/res_labelled.md
@@ -12,13 +12,18 @@ For more details about the corpus, click [here](https://citeseerx.ist.psu.edu/do

**The subset used in this benchmark is `bn_nl` (Broadcast News programmes in the Netherlands).**

This data does not reflect the type of content to which the ASR will be applied in terms of audio length, but it offers some rough estimates of the model's WER performance, particularly when it comes to the time alignment of the word-level timestamps with the reference files.

<br>

For each Whisper implementation, the following variables have been modified:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32`
- **For Huggingface and WhisperX:** `batch_size`
- `2` for HF
- `64` for `WhisperX float16`, `16` for `WhisperX float32`

The compute type refers to the data type used to represent real numbers such as the weights of the Whisper model. In our case, `float16`, also known as **half-precision**, uses 16 bits to store a single floating-point number, whereas `float32`, known as **single-precision**, uses 32 bits. Across deep learning applications, `float16` is known to use less memory and run faster, with the trade-off of some loss in accuracy. However, in the case of Whisper, it has been reported that `float16` leads to only a 0.1% increase in WER while significantly reducing the time and memory required to run the model.
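
To make the parameter mapping concrete, here is a minimal sketch of where the model version, compute type, and `batch_size` plug into each implementation's Python API. Model names and values follow the tables on this page; the benchmark's actual scripts may wire things differently, and `example.wav` is a placeholder path.

```python
import torch
import whisper  # openai-whisper baseline
from faster_whisper import WhisperModel
from transformers import pipeline

# OpenAI baseline: precision is toggled per transcription call
oa_model = whisper.load_model("large-v2")
oa_result = oa_model.transcribe("example.wav", fp16=True)  # fp16=False -> float32

# Huggingface `transformers`: dtype is fixed at load time, batch size per call
hf_pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,  # torch.float32 for single precision
    device="cuda:0",
)
hf_result = hf_pipe("example.wav", chunk_length_s=30, batch_size=2,
                    return_timestamps=True)

# faster-whisper: compute type is fixed at load time
fw_model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = fw_model.transcribe("example.wav", word_timestamps=True)
```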

<br>

@@ -35,17 +40,30 @@ Here is a matrix with **WER** results of the baseline implementation from OpenAI

And a matrix with the **time** spent in total by each implementation **to load and transcribe** the dataset:

|Load+transcribe|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|36m:06s|32m:41s|42m:08s|30m:25s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|21m:48s|19m:13s|23m:22s|22m:02s|
|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**1m:51s**|**2m:08s**|**1m:50s**|**2m:12s**|
|[WhisperX](https://github.com/m-bain/whisperX/)**\***|11m:17s|15m:54s|11m:29s|15m:05s|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in [this section](./whisperx.md).
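
For orientation, here is a minimal sketch of that three-stage WhisperX pipeline, following the public API from the WhisperX README; the Hugging Face token and audio path are placeholders, and the benchmark's actual scripts may differ.

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("example.wav")  # placeholder path

# 1. Transcribe with the batched Whisper backend
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=64)

# 2. Align word-level timestamps with a wav2vec 2.0 model
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Speaker diarization (pyannote-based; requires a Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN", device=device
)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```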

<br>

As well as the **time** spent in total by **faster-whisper** and **WhisperX** to **load, transcribe + save the output to files\***:

|Load+transcribe+save output|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|**11m:40s**|22m:27s|**11m:18s**|21m:56s|
|[WhisperX](https://github.com/m-bain/whisperX/)\**|15m:45s|**20m:26s**|16m:01s|**19m:36s**|

\* After benchmarking, it was noticed that saving the output takes unusually long for these two implementations.

\** For WhisperX, this includes the entire pipeline (loading -> transcription -> alignment -> speaker diarization -> saving to file).
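
This page does not show how these totals were instrumented; a simple way to time the load, transcribe, and save stages separately is to wrap each in a timer, as in the hedged sketch below (`write_json` is a hypothetical helper, not part of any of these libraries).

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock duration, and return its result."""
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return out

# Hypothetical usage with faster-whisper (names as in the sketch further up):
# model = timed("load", WhisperModel, "large-v2",
#               device="cuda", compute_type="float16")
# segments, info = model.transcribe("example.wav")
# segments = timed("transcribe", list, segments)  # the segment generator is lazy
# timed("save", write_json, segments, "out.json")
```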

<br>

Finally, a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):

|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
41 changes: 27 additions & 14 deletions NISV/res_unlabelled.md
@@ -2,50 +2,63 @@

<h2>Computational performance on the unlabelled audio of Broadcast News in the Netherlands</h2>

The unlabelled data is long-form (a single audio file lasting a longer period), which more closely reflects the type of data found in audiovisual/oral history archives. Thus, even though the WER is not calculated (due to the lack of complete labelling for this subset), the computational performance information gives us a better estimate of each implementation's behaviour when applied to longer individual audio files.

More details about the parameters and the dataset can be found [here](./res_labelled.md).

<br>

For each Whisper implementation, the following variables have been modified:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32` (check [here](./res_labelled.md) for more details about this parameter)
- For Huggingface (HF) and WhisperX: `batch_size`
- `4` for `HF float16`, `2` for `HF float32`
- `64/48` for `WhisperX float16 labelled/unlabelled`, `16` for `WhisperX float32`

<br>

**TODO: Add results**

<!-- <br>
Here's a matrix with the **time** spent in total by each implementation **to load and transcribe** the data:

|Model\Parameters|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|1h:43m:47s|1h:20m:29s|1h:57m:06s|1h:28m:50s|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)|43m:05s|1h:05m:17s|41m:39s|1h:01m:45s|
|**[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)**|**4m:14s**|**4m:26s**|**3m:36s**|**5m:07s**|
|[WhisperX](https://github.com/m-bain/whisperX/)*|26m:57s|31m:57s|27m:00s|31m:43s|

\* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied in order to obtain word-level timestamps. Therefore, the time measured contains the time to load the model, time to transcribe, and time to align to generate timestamps. Speaker diarization has also been applied for WhisperX, which is measured separately and covered in a different section.

<br>

As well as the **time** spent in total by **faster-whisper** and **WhisperX** to **load, transcribe + save the output to files\***:

|Load+transcribe+save output|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|39m:59s|1h:18m:34s|40m:07s|1h:23m:07s|
|**[WhisperX](https://github.com/m-bain/whisperX/)\*\***|**39m:25s**|**44m:01s**|**39m:21s**|**43m:52s**|

\* After benchmarking, it was noticed that saving the output takes unusually long for these two implementations.

\** For WhisperX, this includes the entire pipeline (loading -> transcription -> alignment -> speaker diarization -> saving to file).

<br>

And also a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each implementation (**on average**):

|Max. memory / Max. power|large-v2 with `float16`|large-v2 with `float32`|large-v3 with `float16`|large-v3 with `float32`|
|---|---|---|---|---|
|[OpenAI](https://github.com/openai/whisper)|10943 MiB / 274 W|10955 MiB / 293 W|11094 MiB / 279 W|11164 MiB / 291 W|
|[Huggingface (`transformers`)](https://huggingface.co/openai/whisper-large-v2#long-form-transcription)*|16629 MiB / 269 W|18563 MiB / 287 W|12106 MiB / 259 W|15061 MiB / 288 W|
|[faster-whisper](https://github.com/SYSTRAN/faster-whisper/)|3789 MiB / 156 W|6813 MiB / 191 W|3750 MiB / 160 W|6953 MiB / 200 W|
|[WhisperX](https://github.com/m-bain/whisperX/)*|22273 MiB / 268 W|21375 MiB / 281 W|22142 MiB / 267 W|21298 MiB / 279 W|

\* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.

## Detailed results per pipeline component for WhisperX
See [the per-component results](./whisperx.md).

## Hardware setup

A high-performance computing cluster was used. The cluster's hardware consists of 2 x Nvidia Quadro RTX 6000 with 24 GiB VRAM each, using CUDA version 12.4, with an Intel(R) Xeon(R) Gold 5220 CPU @ 2.20GHz and 256 GB of RAM available.

The OS installed on the cluster is [RHEL 9.3](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html-single/9.3_release_notes/index). -->

45 changes: 35 additions & 10 deletions NISV/whisperx.md
@@ -13,18 +13,19 @@ Due to the wav2vec 2.0 based aligner which doesn't support aligning digits and c
Two variables have been experimented with:
- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32` (check [here](./res_labelled.md) for more details about this parameter)

<br>

## Labelled data

The batch size used is `64` for `float16` and `16` for `float32`.

Here's a matrix with the **time** spent by each component of WhisperX, using the various parameter configurations mentioned in the previous page, on the labelled data:

|Configuration\Component|Loading|Transcriber|Aligner|Diarizer|Total|Total+Saving to JSON|
|---|---|---|---|---|---|---|
|large-v2 with `float16`|7.73s|4m:17s|6m:53s|3m:58s|15m:16s|15m:45s|
|large-v2 with `float32`|10.51s|8m:07s|7m:36s|4m:01s|19m:54s|20m:26s|
|large-v3 with `float16`|3.21s|4m:14s|7m:11s|4m:01s|15m:29s|16m:01s|
|large-v3 with `float32`|6.12s|8m:00s|6m:59s|4m:00s|19m:05s|19m:36s|

<br>

@@ -35,4 +36,28 @@ And also a matrix with the **maximum GPU memory consumption + maximum GPU power
|large-v2 with `float16`|9419 MiB / 246 W|11916 MiB / 227 W|13578 MiB / 229 W|
|large-v2 with `float32`|13548 MiB / 249 W|14749 MiB / 234 W|16480 MiB / 234 W|
|large-v3 with `float16`|9417 MiB / 243 W|11918 MiB / 235 W|13605 MiB / 231 W|
|large-v3 with `float32`|13539 MiB / 247 W|14715 MiB / 232 W|16411 MiB / 228 W|

## Unlabelled data

The batch size used is `48` for `float16` and `16` for `float32`.

Here's a matrix with the **time** spent by each component of WhisperX, using the various parameter configurations mentioned in the previous page, on the unlabelled data:

|Configuration\Component|Loading|Transcriber|Aligner|Diarizer|Total|Total+Saving to JSON|
|---|---|---|---|---|---|---|
|large-v2 with `float16`|2.35s|15m:23s|11m:34s|12m:18s|39m:17s|39m:25s|
|large-v2 with `float32`|6.48s|20m:35s|11m:21s|11m:51s|43m:53s|44m:01s|
|large-v3 with `float16`|4.79s|15m:23s|11m:37s|12m:10s|39m:15s|39m:21s|
|large-v3 with `float32`|7.27s|20m:22s|11m:21s|11m:55s|43m:45s|43m:52s|

<br>

And also a matrix with the **maximum GPU memory consumption + maximum GPU power usage** of each configuration (**on average**):

|Max. memory / Max. power|Transcriber|Aligner|Diarizer|
|---|---|---|---|
|large-v2 with `float16`|22273 MiB / 268 W|12649 MiB / 298 W|14502 MiB / 277 W|
|large-v2 with `float32`|21375 MiB / 281 W|15381 MiB / 294 W|17316 MiB / 278 W|
|large-v3 with `float16`|22142 MiB / 267 W|12977 MiB / 296 W|14502 MiB / 278 W|
|large-v3 with `float32`|21298 MiB / 279 W|15381 MiB / 297 W|17316 MiB / 276 W|
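
These pages do not show how the memory and power maxima were sampled; one way to collect such figures is to poll `nvidia-smi` from a side thread or process while the pipeline runs, as in the hedged sketch below (it assumes the run uses GPU 0; the benchmark's actual tooling may differ).

```python
import subprocess
import time

def poll_gpu_max(duration_s=60.0, interval_s=1.0):
    """Poll nvidia-smi; return (max memory used in MiB, max power draw in W)."""
    max_mem, max_power = 0, 0.0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=memory.used,power.draw",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        # One output line per GPU; this sketch assumes the run uses GPU 0
        mem, power = out.strip().splitlines()[0].split(", ")
        max_mem = max(max_mem, int(mem))
        max_power = max(max_power, float(power))
        time.sleep(interval_s)
    return max_mem, max_power
```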
