![ParroT](https://private-user-images.githubusercontent.com/31032829/245473896-3a19944b-d42d-45da-919b-320f1410a3a6.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk1NzYzNzMsIm5iZiI6MTczOTU3NjA3MywicGF0aCI6Ii8zMTAzMjgyOS8yNDU0NzM4OTYtM2ExOTk0NGItZDQyZC00NWRhLTkxOWItMzIwZjE0MTBhM2E2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE0VDIzMzQzM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWZiNDgzODU0MzA0ZTY5ZmIxMTI2N2QwMDViYzk4MDJkNDA2ZDkzODhhZDM4MmE4MDI4MGZmNDBiZWFjMTU1ODQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.t-jk63Pj0mikaxHuSrD_yPp8-kXVMVvi55pLGqPCGLc)
InstructMT Data & Scripts @ParroT
A collection of instruction data and scripts for machine translation.
The resulting files mainly follow the ParroT format and partially the Stanford-Alpaca format.
The resources below provide high-quality translation data for instruction tuning and can be accessed through the links. Earlier versions of the links were broken; they have now been fixed.
Data | Source | Zh-En | En-Zh | De-En | En-De | Format |
---|---|---|---|---|---|---|
Translation | newstest17-20 | 12.2k | 12.2k | 13.3k | 13.3k | `TXT` |
MQM-Score | newstest20 | 20.0k | n/a | n/a | 14.1k | `JSON` |
MQM-Error | newstest20 | 124.3k | n/a | n/a | 79.0k | `TXT` |
COMET-Score | newstest20 | n/a | 19.8k | 9.4k | n/a | `JSON` |
Translation | wmt20 | 475.0k | 475.0k | n/a | n/a | `TXT`: Filtered from 26M |
parrot
├── alpaca
│ └── convert_alpaca_to_hf.py
├── contrastive-instruction
│ ├── convert_cometscore_to_csi_alpaca.py
│ ├── convert_mqmscore_to_csi_alpaca.py
│ └── instruct_t2t.txt
├── error-guided-instruction
│ ├── convert_cometscore_to_egi_alpaca.py
│ ├── convert_mqmerror_to_egi_alpaca.py
│ └── instruct_e2t.txt
└── translation-instruction
  ├── convert_pair_to_alpaca.py
  └── instruct_follow.txt
1. Translation Instruction
Example usage and output:
cd ./parrot/translation-instruction
# Download the Translation data into the folder
python3 convert_pair_to_alpaca.py \
-s zh -t en \
-if instruct_follow.txt \
-sf newstest17-20.en-zh.zh \
-tf newstest17-20.en-zh.en \
-of data_ti_alp.zh-en.json
[
{
"instruction": "I'd appreciate it if you could present the English translation for these sentences.",
"input": "28岁厨师被发现死于旧金山一家商场",
"output": "28-Year-Old Chef Found Dead at San Francisco Mall"
},
...
]
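The pairing step above can be sketched as follows. This is a minimal illustration, not the code of `convert_pair_to_alpaca.py`: the function name `build_translation_records`, the seeded random choice of instruction, and the in-memory lists standing in for the three input files are all assumptions.

```python
import json
import random


def build_translation_records(src_lines, tgt_lines, instructions, seed=0):
    """Pair each source sentence with its reference and one instruction.

    Hypothetical helper: the real script may select instructions differently.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    rng = random.Random(seed)  # seeded so repeated runs give the same pairing
    records = []
    for src, tgt in zip(src_lines, tgt_lines):
        records.append({
            "instruction": rng.choice(instructions),
            "input": src.strip(),
            "output": tgt.strip(),
        })
    return records


# In-memory stand-ins for the instruction, source and target files (assumed layout:
# one instruction or one sentence per line).
instructions = [
    "I'd appreciate it if you could present the English translation for these sentences."
]
src = ["28岁厨师被发现死于旧金山一家商场\n"]
tgt = ["28-Year-Old Chef Found Dead at San Francisco Mall\n"]

records = build_translation_records(src, tgt, instructions)
print(json.dumps(records, ensure_ascii=False, indent=4))
```

Passing `ensure_ascii=False` keeps the Chinese text readable in the JSON file instead of escaping it to `\uXXXX` sequences.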
2. Contrastive Instruction
Example usage and output for MQM Zh-En:
cd ./parrot/contrastive-instruction
# Download the MQM-Score data into the folder
python3 convert_mqmscore_to_csi_alpaca.py \
-s zh -t en \
-if instruct_t2t.txt \
-i sys_rating_mqm.zh-en.json \
-o data_csi_alp.zh-en.json
[
{
"instruction": "Could you supply the English translation for the upcoming sentences?",
"input": "国有企业和优势民营企业走进赣南革命老区。\n\n### Hint: A superior translation would be",
"output": "<p>State-owned enterprises and advantageous private enterprises entered the old revolutionary area of Gannan.</p> rather than <p>State-owned enterprises and dominant private enterprises entered the old revolutionary area of southern Jiangxi.</p>"
},
...
]
Example usage and output for COMET En-Zh:
cd ./parrot/contrastive-instruction
# Download the COMET-Score data into the folder
python3 convert_cometscore_to_csi_alpaca.py \
-s en -t zh \
-if instruct_t2t.txt \
-i sys_rating_comet.en-zh.json \
-o data_csi_alp.en-zh.json
[
{
"instruction": "Could you supply the Chinese translation for the upcoming sentences?",
"input": "Michael Jackson wore tape on his nose to get front pages, former bodyguard claims\n\n### Hint: A superior translation would be",
"output": "<p>前保镖声称迈克尔·杰克逊为登上头条新闻在鼻子上贴上胶带</p> rather than <p>前保镖称迈克尔·杰克逊为上头版在鼻子上贴胶带</p>"
},
...
]
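To make the record layout in the two examples above concrete, here is a rough sketch of how a contrastive record could be assembled from scored candidates. The helpers `pick_pair` and `build_contrastive_record`, the `(translation, score)` tuple layout, and the example scores are hypothetical; the actual scripts and the structure of the score files may differ. Whether a higher score means a better translation depends on the metric, so it is left as a parameter.

```python
HINT = "### Hint: A superior translation would be"


def pick_pair(candidates, higher_is_better=True):
    """Return (better, worse) translations from (translation, score) tuples.

    Returns None when the best and worst scores tie, since no contrast exists.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=higher_is_better)
    best, worst = ranked[0], ranked[-1]
    if best[1] == worst[1]:
        return None
    return best[0], worst[0]


def build_contrastive_record(instruction, source, better, worse):
    """Format one record with the hint and the <p>...</p> rather than <p>...</p> output."""
    return {
        "instruction": instruction,
        "input": f"{source}\n\n{HINT}",
        "output": f"<p>{better}</p> rather than <p>{worse}</p>",
    }


# Scores below are made up for illustration only.
candidates = [
    ("前保镖称迈克尔·杰克逊为上头版在鼻子上贴胶带", 0.5),
    ("前保镖声称迈克尔·杰克逊为登上头条新闻在鼻子上贴上胶带", 0.8),
]
better, worse = pick_pair(candidates, higher_is_better=True)
record = build_contrastive_record(
    "Could you supply the Chinese translation for the upcoming sentences?",
    "Michael Jackson wore tape on his nose to get front pages, former bodyguard claims",
    better,
    worse,
)
```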
3. Error-Guided Instruction
Example usage and output for MQM Zh-En:
cd ./parrot/error-guided-instruction
# Download the MQM-Error data into the folder
python3 convert_mqmerror_to_egi_alpaca.py \
-s zh -t en \
-if instruct_e2t.txt \
-i mqm_newstest2020_zhen.txt \
-o data_egi_alp.zh-en.json
[
{
"instruction": "Could you supply the English translation for the upcoming sentences?",
"input": "国有企业和优势民营企业走进赣南革命老区。\n\n### Hint: A rendition having minor fluency/grammar errors is possible",
"output": "State-owned enterprises and dominant private enterprises entered the old revolutionary area of southern Jiangxi <v>State-owned enterprises and dominant private enterprises entered the old revolutionary area of southern Jiangxi.</v> "
},
...
]
Example usage and output for COMET En-Zh:
cd ./parrot/error-guided-instruction
# Download the COMET-Score data into the folder
python3 convert_cometscore_to_egi_alpaca.py \
-s en -t zh \
-if instruct_e2t.txt \
-i sys_rating_comet.en-zh.json \
-o data_egi_alp.en-zh.json
[
{
"instruction": "Could you supply the Chinese translation for the upcoming sentences?",
"input": "Michael Jackson wore tape on his nose to get front pages, former bodyguard claims\n\n### Hint: A rendition having no errors is possible",
"output": "前保镖声称迈克尔·杰克逊为登上头条新闻在鼻子上贴上胶带"
},
...
]
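The hint wording in the two examples above follows a simple pattern, which the sketch below reproduces. It is an illustration only: `build_error_hint` and `build_error_guided_record` are hypothetical names, the severity and category arguments are assumed to come from the annotation data, and the `<v>...</v>` span marking seen in the MQM output is not covered.

```python
def build_error_hint(severity=None, category=None):
    """Return the hint line; without an annotated error, use the 'no errors' wording."""
    if not severity or not category:
        return "### Hint: A rendition having no errors is possible"
    return (
        f"### Hint: A rendition having {severity.lower()} "
        f"{category.lower()} errors is possible"
    )


def build_error_guided_record(instruction, source, translation,
                              severity=None, category=None):
    """Append the hint to the source sentence and keep the translation as output."""
    return {
        "instruction": instruction,
        "input": f"{source}\n\n{build_error_hint(severity, category)}",
        "output": translation,
    }


record = build_error_guided_record(
    "Could you supply the Chinese translation for the upcoming sentences?",
    "Michael Jackson wore tape on his nose to get front pages, former bodyguard claims",
    "前保镖声称迈克尔·杰克逊为登上头条新闻在鼻子上贴上胶带",
)
minor_hint = build_error_hint("Minor", "Fluency/Grammar")
```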
* Alpaca Format
All three instruction types above can be used with Stanford-Alpaca directly.
Alternatively, you can convert them to the ParroT format as follows:
cd ./parrot/translation-instruction
python3 ../alpaca/convert_alpaca_to_hf.py \
-i data_ti_alp.zh-en.json \
-o data_ti_hf.zh-en.json
# Each dict is saved on a single line; it is shown across multiple lines here for readability
{
"text": "28-Year-Old Chef Found Dead at San Francisco Mall</s>",
"prefix": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nI'd appreciate it if you could present the English translation for these sentences.\n\n### Input:\n28岁厨师被发现死于旧金山一家商场\n\n### Response:"
}
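The mapping from an Alpaca record to the `text`/`prefix` pair shown above can be sketched like this. The prompt template is copied from the example output; the function name `alpaca_to_hf` and the `eos` parameter are assumptions, and only records with a non-empty `input` are handled, since that is the only case the example shows.

```python
import json

# Template reconstructed from the example "prefix" field above.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def alpaca_to_hf(record, eos="</s>"):
    """Convert one Alpaca-style dict into the text/prefix layout (hypothetical helper)."""
    return {
        "text": record["output"] + eos,
        "prefix": PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        ),
    }


alpaca_records = [{
    "instruction": "I'd appreciate it if you could present the English translation for these sentences.",
    "input": "28岁厨师被发现死于旧金山一家商场",
    "output": "28-Year-Old Chef Found Dead at San Francisco Mall",
}]

# One JSON object per line, matching the "one dict per line" note above.
lines = [json.dumps(alpaca_to_hf(r), ensure_ascii=False) for r in alpaca_records]
print("\n".join(lines))
```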
Please cite our paper if you find the data resources here helpful:
@inproceedings{jiao2023parrot,
title={ParroT: Translating During Chat Using Large Language Models},
author={Wenxiang Jiao and Jen-tse Huang and Wenxuan Wang and Xing Wang and Shuming Shi and Zhaopeng Tu},
booktitle = {ArXiv},
year = {2023}
}