授權許可 License: CC0 公共領域 Public Domain
語言 Language: 粵語 Cantonese (ISO 639-3: yue)
總時長 Total Duration: 65 個鐘 65 hours
總字數 Total Characters: 123456
發音人 Voice Actor: 張悦楷
介紹 Introduction
本數據集由廣州最出名嘅話劇演員、説書藝人(講古佬)張悦楷講《三國演義》錄音製成。所有錄音均錄於 1980 年代。數據集所有文本均由人工轉寫,並根據《三國演義》原文校對嚟確保準確性。
This dataset is built from recordings of Zoeng Jyut Gaai, the most famous drama actor and storyteller in Canton, narrating Romance of the Three Kingdoms. All recordings were made in the 1980s. All texts in the dataset were transcribed manually and proofread against the original text of Romance of the Three Kingdoms to ensure accuracy.
本數據集可用於各種用途,例如語音合成(TTS)、語音識別(ASR)、語言模型(LLM)、語言學分析等等。張悦楷語音合成 就係一個用本數據集訓練出嚟嘅 TTS 系統。
This dataset can serve many purposes, including text-to-speech (TTS), automatic speech recognition (ASR), language modeling, and linguistic analysis. For example, 張悦楷語音合成 is a TTS system trained on this dataset.
數據樣例 Data Samples
當今天下嘅英雄,就係使君你,同我喇。
唉!既生瑜,何生亮!既生瑜,何生亮!既生瑜,何生亮啊!
王朗講完,孔明喺架車上哈哈大笑佢話:哈哈哈哈哈哈哈哈,我仲以為堂堂漢朝嘅大老元臣,所講嘅道理必定十分高明嘅,點估到竟然如此卑鄙啊!
下載 Download
如果你想單純克隆所有 wav 文件,可以用下面嘅命令嚟凈係克隆個 wav/ 路徑,避免 clone 晒成個 repo:
If you only need the wav files, use the following commands to check out just the wav/ directory instead of cloning the entire repo:
git clone --filter=blob:none --sparse https://huggingface.co/datasets/laubonghaudoi/zoengjyutgaai_saamgwokjinji
cd zoengjyutgaai_saamgwokjinji
git sparse-checkout init --cone
git sparse-checkout set wav
git checkout
數據統計 Statistics
總時長 Total Duration |
平均音頻時長 Average Clip Duration |
中位音頻時長 Median Clip Duration |
最短音頻時長 Min Clip Duration |
最長音頻時長 Max Clip Duration |
平均每句字數(含標點) Average Characters Per Clip (including punctuation) |
文本總字數(含標點) Total Characters (including punctuation) |
覆蓋漢字數 Unique Chinese Characters |
採樣率 Sampling Rate | 44100 Hz
音頻文件格式 Audio File Format | .wav
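The duration rows above can be recomputed from a local copy of the wav/ directory. The following is a minimal sketch, not the maintainers' own script: the function name `clip_duration_stats` and the returned key names are hypothetical, and it assumes the clips are plain PCM .wav files that Python's standard `wave` module can open.

```python
import statistics
import wave
from pathlib import Path


def clip_duration_stats(wav_dir):
    """Return duration statistics for every .wav file found under wav_dir.

    Hypothetical helper: assumes PCM-encoded .wav files readable by `wave`.
    """
    durations = []
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            # Clip length in seconds = frame count / frames per second.
            durations.append(wav_file.getnframes() / wav_file.getframerate())
    if not durations:
        raise ValueError(f"no .wav files found under {wav_dir}")
    return {
        "clips": len(durations),
        "total_hours": sum(durations) / 3600,
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

Calling `clip_duration_stats("wav")` from the root of the sparse checkout would return the total, mean, median, minimum and maximum clip durations for whatever files are present locally.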
所有源字幕 SRT 文件都存放喺 Hugging Face 倉庫嘅 srt/ 路徑下。所有源音頻都以 .webm 格式放喺 .webm/ 路徑下。
All source subtitle SRT files are stored in the srt/ directory of the Hugging Face repository. All source audio files are stored in .webm format in the .webm/ directory.
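To work with the source subtitles, the SRT files can be parsed into timed cues and then used to count characters. The sketch below assumes the files follow the common SubRip layout (an optional index line, a `start --> end` timing line, then text, with blank lines between cues). The names `Cue`, `parse_srt` and `character_stats` are hypothetical, and the Han-character ranges are an approximation that may not match how the page's own "Unique Chinese Characters" figure is defined.

```python
import re
from dataclasses import dataclass

TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")
# CJK Unified Ideographs, Extension A, and the supplementary-plane extensions
# (an approximation of "Chinese characters", assumed for illustration).
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002ebef]")


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def _to_seconds(stamp):
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content):
    """Parse SubRip text into a list of Cue objects."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                start_raw, end_raw = line.split("-->", 1)
                # Ignore optional position data after the end timestamp.
                end_raw = end_raw.split()[0]
                text = "\n".join(lines[i + 1:]).strip()
                cues.append(Cue(_to_seconds(start_raw), _to_seconds(end_raw), text))
                break
    return cues


def character_stats(cues):
    """Count all characters (including punctuation) and distinct Han characters."""
    text = "".join(cue.text.replace("\n", "") for cue in cues)
    return {
        "total_characters": len(text),
        "unique_han_characters": len(set(HAN.findall(text))),
    }
```

Reading a file with `Path(...).read_text(encoding="utf-8")` and passing the result to `parse_srt` yields cues whose `start` and `end` values can be used to align each transcript line with its audio segment.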
引用 Citation
本數據集屬公共領域,遵循 CC0 許可聲明。即係話你可以無需授權免費任用本數據集,亦都唔需要註明出處。不過如果你用咗本數據集,我哋都希望你可以引用本頁面,作為對楷叔嘅懷念同致敬:
This dataset is in the public domain and is released under CC0. You can use it freely without permission or attribution. However, if you do use this dataset, we hope you will cite this page as a tribute to Kai Suk:
@misc{zoengjyutgaai2025,
  author = {Mingfei Lau},
  title = {張悦楷講古語音數據集 The Zoeng Jyut Gaai Storytelling Voice Dataset},
  affiliation = {粵語計算語言學基礎建設組 Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID)},
  howpublished = {\url{https://canclid.github.io/zoengjyutgaai/}},
  year = {2025}
}