diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..dfe0770 --- /dev/null +++ b/.gitattributes @@ -0,0 +1,2 @@ +# Auto detect text files and perform LF normalization +* text=auto diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..d5ac763 --- /dev/null +++ b/.gitignore @@ -0,0 +1,2 @@ +.aider* +.env diff --git a/029_201.wav b/029_201.wav new file mode 100644 index 0000000..ded7082 Binary files /dev/null and b/029_201.wav differ diff --git a/074_222.wav b/074_222.wav new file mode 100644 index 0000000..a794432 Binary files /dev/null and b/074_222.wav differ diff --git a/121_097.wav b/121_097.wav new file mode 100644 index 0000000..5fba628 Binary files /dev/null and b/121_097.wav differ diff --git a/KuMincho-R.otf b/KuMincho-R.otf new file mode 100644 index 0000000..03ef1ca Binary files /dev/null and b/KuMincho-R.otf differ diff --git a/KuMincho-R.woff2 b/KuMincho-R.woff2 new file mode 100644 index 0000000..65ae298 Binary files /dev/null and b/KuMincho-R.woff2 differ diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..20776b8 --- /dev/null +++ b/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2024 laubonghaudoi + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..e5af5a8 --- /dev/null +++ b/README.md @@ -0,0 +1,2 @@ +# zoengjyutgaai + diff --git a/index.html b/index.html new file mode 100644 index 0000000..4356429 --- /dev/null +++ b/index.html @@ -0,0 +1,308 @@ + + + + + + + 張悦楷講古語音數據集 + + + + + +
+ +
+

+ 張悦楷講古語音數據集 + The Zoeng Jyut Gaai Storytelling Voice Dataset +

+

+ 開源粵語語音數據集 Open-source Cantonese Voice Dataset

+
+ + +
+ +
+
+

+ 授權許可
+ License +

+

+ CC0 公共領域
+ Public Domain +

+
+
+

+ 語言
+ Language +

+

+ 粵語
+ Cantonese
+ ISO 639-3: yue +

+
+
+

+ 總時長
+ Total Duration +

+

+ 65 個鐘
+ 65 hours +

+
+
+

+ 總字數
+ Total Characters +

+

123456

+
+
+

+ 發音人
+ Voice Actor +

+

張悦楷

+
+
+ +
+

介紹 Introduction

+

+ 本數據集由廣州最出名嘅話劇演員、説書藝人(講古佬)張悦楷講《三國演義》錄音製成。所有錄音均錄於 + 1980 + 年代。數據集所有文本均由人工轉寫,並根據《三國演義》原文校對嚟確保準確性。 +

+

+ This dataset is built from recordings of Zoeng Jyut Gaai, the most + famous stage actor and storyteller in Canton, narrating + Romance of the Three Kingdoms. All recordings were made + in the 1980s. All transcripts were produced manually and + proofread against the original text of + Romance of the Three Kingdoms to ensure accuracy.

+

+ 本數據集可用於各種用途,例如語音合成(TTS)、語音識別(ASR)、語言模型(LLM)、語言學分析等等。 + 張悦楷語音合成 就係一個用本數據集訓練出嚟嘅 TTS 系統。 +

+

+ This dataset is multi-purpose. It can be used for text-to-speech + (TTS), automatic speech recognition (ASR), language modeling, + linguistic analysis, and more. For example, + + 張悦楷語音合成 + + is a TTS system trained on this dataset.

+

數據樣例 Data samples

+
+
+ +

+ 當今天下嘅英雄,就係使君你,同我喇。 +

+
+
+ +

+ 唉!既生瑜,何生亮!既生瑜,何生亮!既生瑜,何生亮啊! +

+
+
+ +

+ 王朗講完,孔明喺架車上哈哈大笑佢話:哈哈哈哈哈哈哈哈,我仲以為堂堂漢朝嘅大老元臣,所講嘅道理必定十分高明嘅,點估到竟然如此卑鄙啊! +

+
+
+

下載 Download

+ +

+ 如果你想單純克隆所有 wav 文件,可以用下面嘅命令嚟凈係克隆個 + wav/ 路徑,避免 clone 晒成個 repo: +

+

+ If you only need the wav files, use the following commands to + fetch just the + wav/ directory instead of cloning the entire repo:

+
git clone --filter=blob:none --sparse https://huggingface.co/datasets/laubonghaudoi/zoengjyutgaai_saamgwokjinji
+
+cd zoengjyutgaai_saamgwokjinji
+
+git sparse-checkout init --cone
+git sparse-checkout set wav
+git checkout
+

數據統計 Data statistics

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ 總時長 Total Duration +
+ 平均音頻時長 Average Clip Duration +
+ 中位音頻時長 Median Clip Duration +
+ 最短音頻時長 Min Clip Duration +
+ 最長音頻時長 Max Clip Duration +
+ 平均每句字數(含標點) Average Characters Per Clip (including + punctuation) +
+ 文本總字數(含標點) Total Characters # (including + punctuation) +
+ 覆蓋漢字數 Unique Chinese Characters # +
+ 採樣率 Sampling Rate + + 44100 Hz +
+ 音頻文件格式 Audio file format + + .wav +
+
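The duration rows in the table above (total, average, median, shortest, longest) could be recomputed locally from the checked-out clips. The following is a minimal sketch, not the script used to build this page: it assumes the clips are PCM-encoded `.wav` files somewhere under a local `wav/` directory, and the function names `wav_duration` and `duration_stats` are hypothetical.

```python
import statistics
import wave
from pathlib import Path


def wav_duration(path):
    """Return the duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def duration_stats(wav_dir):
    """Summarise clip durations for every .wav file under wav_dir.

    Assumption: every file is readable by the standard-library
    `wave` module (uncompressed PCM).
    """
    durations = sorted(wav_duration(p) for p in Path(wav_dir).rglob("*.wav"))
    if not durations:
        raise ValueError(f"no .wav files found under {wav_dir}")
    return {
        "clips": len(durations),
        "total_hours": sum(durations) / 3600,
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "min_s": durations[0],
        "max_s": durations[-1],
    }
```

After the sparse checkout, calling `duration_stats("wav")` from the repository root would return these figures as a dictionary. The directory name is an assumption taken from the download commands.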

+ 所有源字幕 SRT 文件都存放喺 Hugging Face + 倉庫嘅 srt/ 路徑下。所有源音頻都以 .webm 格式放喺 + .webm/ 路徑下。

+

+ All source subtitle SRT files are stored in the + srt/ directory of the Hugging Face repository. All + source audio files are stored in .webm format in the + .webm/ directory.
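For readers who want to work with the SRT files directly, here is a minimal parsing sketch. It assumes the files follow the usual SubRip layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the transcript) and that each cue corresponds to one clip. Neither assumption is confirmed by this page. The names `parse_srt`, `is_han` and `text_stats` are hypothetical, and `is_han` covers only the two main CJK Unified Ideographs blocks.

```python
import re

# Timestamp line of a SubRip cue, e.g. "00:00:01,000 --> 00:00:03,500".
TIMESTAMP = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Parse SRT text into a list of (start_s, end_s, transcript) tuples."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMESTAMP.search(line)
            if match:
                g = match.groups()
                # Join wrapped transcript lines without spaces (Chinese text).
                transcript = "".join(part.strip() for part in lines[i + 1:])
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), transcript))
                break
    return cues


def is_han(ch):
    """True for characters in the main CJK Unified Ideographs blocks."""
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"


def text_stats(cues):
    """Character counts (punctuation included) over parsed cues."""
    texts = [t for _, _, t in cues]
    total = sum(len(t) for t in texts)
    return {
        "total_chars": total,
        "avg_chars_per_clip": total / len(texts) if texts else 0.0,
        "unique_han": len({ch for t in texts for ch in t if is_han(ch)}),
    }
```

Under those assumptions, passing the cues from `parse_srt` to `text_stats` gives counts comparable to the character rows of the statistics table.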

+ +

引用 Citation

+

+ 本數據集屬公共領域,遵循 + CC0 + 許可聲明。即係話你可以無需授權免費任用本數據集,亦都唔需要註明出處。不過如果你用咗本數據集,我哋都希望你可以引用本頁面,作為對楷叔嘅懷念同致敬: +

+

+ This dataset is in the public domain under the + CC0 + dedication. You can use it for free, without permission or + attribution. If you do use it, however, we hope you will + cite this page as a tribute to Kai Suk:

+
+@misc{zoengjyutgaai2025,
+    author={Mingfei Lau},
+    title={張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset},
+    affiliation={粵語計算語言學基礎建設組 Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
+    howpublished = {\url{https://canclid.github.io/zoengjyutgaai/}},
+    year={2025}
+}
+
+
+ + + +
+ + diff --git a/styles.css b/styles.css new file mode 100644 index 0000000..e15b9de --- /dev/null +++ b/styles.css @@ -0,0 +1,19 @@ +@font-face { + font-family: "KuMincho"; + src: url("KuMincho-R.woff2") format("woff2"), + url("KuMincho-R.woff") format("woff"), + url("KuMincho-R.otf") format("opentype"); + font-weight: normal; + font-style: normal; + font-display: swap; +} + +/* Add any custom styles here */ +pre { + white-space: pre-wrap; + word-wrap: break-word; +} + +body { + font-family: "KuMincho", serif; +} diff --git a/zoengjyutgaai.webp b/zoengjyutgaai.webp new file mode 100644 index 0000000..2c15445 Binary files /dev/null and b/zoengjyutgaai.webp differ