Skip to content

Latest commit

 

History

History
147 lines (105 loc) · 10.2 KB

README.en.md

File metadata and controls

147 lines (105 loc) · 10.2 KB

MuCGEC: A Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction & SOTA Models

English | 简体中文

News

  • Our new work SynGEC has been accepted by EMNLP2022. In this paper, we propose a SynGEC model that can utilize GEC-oriented tailored syntax and get SOTA performance (e.g., 45.32/46.51 F0.5 scores on NLPCC-18 and MuCGEC-Test), respectively. Welcome to test it! Link: [Link]

  • We have uploaded the full MuCGEC dataset to this link and opened a leaderboard at Tianchi Platform, welcome to submit your results!

  • We have released a part of MuCGEC dataset at Tianchi Platform, welcome to download and test!

Citation

If you find this work is useful for your research, please cite our paper:

MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction (Accepted by NAACL2022 main conference) [PDF]

@inproceedings{zhang-etal-2022-mucgec,
    title = "{M}u{CGEC}: a Multi-Reference Multi-Source Evaluation Dataset for {C}hinese Grammatical Error Correction",
    author = "Zhang, Yue  and
      Li, Zhenghua  and
      Bao, Zuyi  and
      Li, Jiacheng  and
      Zhang, Bo  and
      Li, Chen  and
      Huang, Fei  and
      Zhang, Min",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.227",
    pages = "3118--3130",
    }

Introduction

Chinese Grammatical Error Correction (CGEC) technology aims to automatically correct spelling, grammatical, semantical, and other errors in a Chinese sentence. This technology is very useful in various scenarios.

Current CGEC evaluation datasets have some flaws, such as small amounts of data, single reference, single text source, etc. In order to better evaluate CGEC models, this repository provide a new multi-reference multi-source CGEC evaluation dataset named MuCGEC. Meanwhile, in order to promote the development of CGEC, we also provide the following additional resources:

  • CGEC data annotation guidelines./guidelines

  • CGEC evaluation tools./scorers:

    • ChERRANT:We extend the English GEC evaluation tool ERRANT to accomondate Chinese, and name it ChERRANT(Chinese ERRANT). ChERRANT supports the CGEC model evaluation at both word and character granularity.
  • CGEC benchmark models./models:

    • Seq2Edit model./models/seq2edit-based-CGEC: This kind of models treats GEC as a sequence labeling task and performs error corrections via a sequence of token-level edits, including insertion, deletion, and substitution.
      • With minor modifications to accommodate Chinese, we adopt GECToR, which achieves the SOTA performance on English GEC datasets.
    • Seq2Seq model./models/seq2seq-based-CGEC:This kind of models straightforwardly treats GEC as a monolingual translation task
      • We fine-tune the recently-proposed Seq2Seq pretrained model Chinese-BART and use it in CGEC.
    • Ensemble model./scorers/ChERRANT/emsemble.sh:We adopt a simple edit-wise vote mechanism, which can support the ensemble of heterogeneous models (such as Seq2Seq and Seq2Edit) and lead to significant performance boosts.
  • CGEC tools./tools

    • Tokenization tools
    • Data augmentation tools (Todo)
    • Data cleaning tools (Todo)

MuCGEC Dataset

Our dataset consists of texts written by Chinese learners. We select data from the following three sources: NLPCC18 corpus, CGED corpus, and Chinese Lang8 corpus. Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references per sentence. The detailed statistics are shown below:

Subset #Sents %Errors Chars/sent Edits/sent Refs/sent
MuCGEC-NLPCC18 1996 1904(95.4%) 29.7 2.5 2.5
MuCGEC-CGED 3125 2988(95.6%) 44.8 4.0 2.3
MuCGEC-Lang8 1942 1652(85.1%) 37.5 2.8 2.1
MuCGEC-ALL 7063 6544(92.7%) 38.5 3.2 2.3

Compared with previous CGEC evaluation sets (such as NLPCC18-orig and CGED-orig), MuCGEC has richer references and data sources. In addition, during the annotation procedure, we also found that 74 sentences could not be annotated since their meanings are unclear.

For more details about MuCGEC, please kindly refer to our paper.

Download link

We have released MuCGEC at https://tianchi.aliyun.com/dataset?spm=5176.12281949.J_3941670930.15.493e2448TJV71D.

CGEC benchmark models

Experimental enviroment

We use Python 3.8 to experiment, and the necessary dependencies can be installed through the following code. Considering that there are some conflicts between the environments of Seq2Seq and Seq2Edit models, two environments need to be installed separately:

# Seq2Edit environment
pip install -r requirements_seq2edit.txt

# Seq2Seq environment
pip install -r requirements_seq2seq.txt

Training data

The training data used in our experiment is composed of: 1) Chinese Lang8 corpus; 2)HSK corpus. We upsampling HSK corpus 5 times. We only use the erroneous part of the training data.

Download Link: Google Drive

Usage

We provide pipeline scripts to use our model, including the process of preprocessing->training->inference. Please refer to ./models/seq2edit-based-CGEC/pipeline.sh and ./models/seq2seq-based-CGEC/pipeline.sh

Besides, we also provide converged checkpoints for testing (the following metrics are precision/recall/F0.5):

Model NLPCC18-Official(m2socrer) MuCGEC(ChERRANT)
seq2seq_lang8[Link] 37.78/29.91/35.89 40.44/26.71/36.67
seq2seq_lang8+hsk[Link] 41.50/32.87/39.43 44.02/28.51/39.70
seq2edit_lang8[Link] 37.43/26.29/34.50 38.08/22.90/33.62
seq2edit_lang8+hsk[Link] 43.12/30.18/39.72 44.65/27.32/39.62

The ensemble strategy used in our paper can be found in ./scorers/ChERRANT/ensemble.sh. Please kindly note that above seq2edit checkpoints are based on Chinese-StructBERT-large and seq2seq checkpoints are based on Chinese-BART-Large.

Tips

  • We found that some useful tricks in English are still effective in Chinese, such as the confidence bias trick in GECToR and R2L reranking trick in Seq2Seq models. You can try them yourself.
  • We found that the decomposition of the training procedure (firstly train on HSL+Lang8, then fine-tune on HSK) can lead to further improvement. You can re-train our models following this two-stage training strategy.
  • Our Seq2Seq models based on Chinese-BART can be improved: 1) the original vocabulary of Chinese-BART lacks some common Chinese punctuation / characters; 2)The training and generation speed of transformers library is relatively slow. We recently re-implement our Seq2Seq model based on fairseq and use some additional tricks, which greatly improved its performance (4-5 F0.5) and accelerated its speed. We will also release the improved version in the future.
  • None of the baselines we provided used pseudo data. For data augmentation, please refer to our paper in CTC-2021 CGEC competition [Link].

Evaluation

To get the evaluation results on the Official NLPCC18 dataset, you can make predictions by using our benchmark models, and evaluate the results through M2Scorer. It is important to note that the predicted results must be segmented using the PKUSEG tool.

To get the evaluation results on MuCGEC, you can evaluate the results through our proposed ChERRANT. Please refer to ./scorers/ChERRANT/demo.sh for the use of ChERRANT.

Error types in ChERRANT

  • Operation tier (word/char granularity):

    • M(missing): Missing error, which means tokens need to be inserted;
    • R(redundant): Redundant error, which means tokens need to be deleted;
    • S(substitute): Substitued error , which means tokens need to be replaced;
    • W(word-order): Word-error error, which means tokens need to be re-ordered.
  • Linguistic tier (word granularity):

    • We define 14 main types of linguistic errors (basically based on part-of-speech tags):

      error types

Related works

  • We have used some technologies from this repo in the CTC-2021 evaluation task, and obtained the top-1 score. Please see: CTC-report.
  • Online demonstration platform of the baseline models: GEC demo

Contact

If you have any problems, feel free to contact me at [email protected].