AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models
This package contains the dataset AStitchInLanguageModels and associated task information.
This dataset and associated tasks were introduced in our (findings of) EMNLP 2021 paper "AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models". Please cite this paper if you use any of the data or methods from this package.
The paper will be made available by the 9th of September.
This is a novel dataset consisting of:
- Naturally occurring sentences (and two surrounding sentences) containing potentially idiomatic MWEs, annotated with a fine-grained set of meanings: compositional meaning, idiomatic meaning(s), proper noun and "meta usage". See Tasks (Task 1, Task 2) for details and the Raw Data section for the complete data.
- Data in both Portuguese and English
- Paraphrases for each meaning of each MWE (see the Extended Noun Compound Senses Dataset)
In addition, we use this dataset to define two tasks:
- These tasks are aimed at evaluating i) a model’s ability to detect idiomatic use (Task 1), and ii) the effectiveness of sentence embeddings in representing idiomaticity (Task 2).
- These tasks are presented in multilingual, zero-shot, one-shot and few-shot settings.
- We provide strong baselines using state-of-the-art models, including experiments with one-shot and few-shot setups for idiomaticity detection and the use of the idiom principle for detecting and representing MWEs in contextual embeddings. Our results highlight the significant scope for improvement.
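The "idiom principle" mentioned above treats a potentially idiomatic MWE as a single lexical unit rather than as its component words. As a toy illustration only (not the actual pipeline used in the paper), one can merge known MWE spans into single tokens before whitespace tokenisation, so a model can associate one embedding with the whole expression:

```python
def apply_idiom_principle(sentence, mwes):
    """Merge each known MWE into a single token before whitespace tokenisation.

    A toy sketch of the idiom principle: "big fish" becomes the single
    token "big_fish", so downstream models see the MWE as one unit.
    """
    for mwe in mwes:
        sentence = sentence.replace(mwe, mwe.replace(" ", "_"))
    return sentence.split()


tokens = apply_idiom_principle("he is a big fish in this town", ["big fish"])
# "big fish" is now the single token "big_fish"
```

In practice the paper adds such single-token representations to a pre-trained tokenizer's vocabulary; see Task 2 for details.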
- Prerequisites
- Task 1: Idiomaticity Detection
- Task 2: Idiomaticity Representation
- Extended Noun Compound Senses Dataset
- Task Independent Data
- Citation
The scripts in this package have been tested using Python 3.8.6 and PyTorch 1.7.1. They additionally require the following packages. Please note that installation will overwrite any existing versions of these packages, so we suggest using a virtual environment.
While we use Sentence Transformers to generate sentence embeddings that can be compared using cosine similarity, we make some changes to ensure that it can use custom tokenizers. Please install the local version available at dependencies/sentence-transformers.
cd AStitchInLanguageModels/dependencies/sentence-transformers
pip3 install -e .
Download 🤗 Transformers version 4.7.0 from here.
cd transformers-4.7.0
pip3 install -e .
pip3 install datasets==1.6.1
pip3 install tqdm==4.49.0
pip3 install nltk==3.6.2
And from your Python prompt:
>>> import nltk
>>> nltk.download('punkt')
The first task we propose is designed to evaluate the extent to which models can identify idiomaticity in text and consists of two Subtasks: a coarse-grained classification task (Subtask A) and a fine-grained classification task (Subtask B). The evaluation metric for this task is F1.
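For reference, the sketch below shows a minimal pure-Python macro-averaged F1 computation (the exact averaging used in the paper may differ; the evaluation script linked below is authoritative):

```python
def f1_for_label(gold, pred, label):
    """F1 for a single class label from parallel gold/predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)
```

For example, `macro_f1([0, 0, 1, 1], [0, 1, 1, 1])` averages an F1 of 2/3 for class 0 and 4/5 for class 1.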
The data associated with this Task can be found in this folder. Data is split into zero-shot, one-shot and few-shot data in both Portuguese and English. Please see the paper for a detailed description of the task and methods.
We used 🤗 Transformers (this script, local copy with F1 evaluation available here) for training with the following hyperparameters. Further details are available in the paper.
python run_glue.py \
--model_name_or_path $model \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 9 \
--evaluation_strategy "epoch" \
--output_dir $output_dir \
--seed $seed \
--train_file $train_file \
--validation_file $dev_file \
--save_strategy "epoch" \
--load_best_model_at_end \
--metric_for_best_model "f1" \
--save_total_limit 3
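The command above assumes several shell variables are set beforehand. For example (the model name and file paths here are purely illustrative, not the exact ones from this repository):

```shell
# Illustrative values only -- substitute your own checkpoint and data paths.
model="bert-base-multilingual-cased"        # any 🤗 Transformers checkpoint
seed=42
output_dir="output/task1_seed${seed}"
train_file="path/to/train.csv"              # Task 1 data from the dataset folder
dev_file="path/to/dev.csv"
```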
Task 2 is the more challenging task of creating sentence embeddings that accurately represent sentences regardless of whether or not they contain idiomatic expressions. This is tested using Semantic Text Similarity (STS): the metric is the Spearman rank correlation between a model's STS scores for sentences containing idiomatic expressions and its scores for the same sentences with the idiomatic expressions replaced by non-idiomatic paraphrases (which capture the correct meaning of the MWEs). Please see the paper for more details on the task.
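As a concrete sketch of this metric (a pure-Python reimplementation for illustration; the actual experiments use standard library implementations), one computes cosine-similarity STS scores for sentence pairs in both conditions and then the Spearman rank correlation between the two score lists:

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def ranks(xs):
    """Fractional ranks (ties receive the average of their positions)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))
```

A model whose STS scores are unaffected by replacing idioms with their paraphrases scores a correlation of 1.0.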
Complete details of this task, including the data and models, are available in the task folder: AStitchInLanguageModels/Dataset/Task2/. This includes details on the following:
- Adding Idiom Tokens to 🤗 Transformers Models
- Creating Sentence Transformers models
- Creating the Evaluation Data
- Generating Pre-Training Data
- Task 2 Subtask A - Pre-Training for Idiom Representation
- Task 2 Subtask B - Fine-Tuning for Idiom Representation
- Pre-Trained and Fine-Tuned Models for Task 2
We also provide an Extended Noun Compound Senses dataset (ExNC dataset) that is highly granular. This data differs from previous sense datasets in that:
- it provides all possible senses,
- we ensure that the meanings provided are as close to the original phrase as possible, making this an adversarial dataset,
- we highlight purely compositional noun compounds.
Please see the associated data folder for more details.
You can download the Task independent annotated data from this folder. The data format is described in the README available in the same folder.
Where possible, please use the training, development and test splits provided so results can remain comparable.
If you make use of this work, please cite us:
@inproceedings{tayyar-madabushi-etal-2021-astitchinlanguagemodels-dataset,
title = "{AS}titch{I}n{L}anguage{M}odels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models",
author = "Tayyar Madabushi, Harish and
Gow-Smith, Edward and
Scarton, Carolina and
Villavicencio, Aline",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.294",
pages = "3464--3477",
}