Corrupt an input text to test NLP models' robustness.
For details refer to https://nlp-demo.readthedocs.io
pip install wild-nlp
Altogether, we defined and implemented 11 aspects of text corruption.
-
Articles
Randomly removes articles or replaces them with incorrect ones.
-
Digits2Words
Converts numbers into words. Handles floating-point numbers as well.
-
Misspellings
Misspells words that appear in the Wikipedia lists of:
- commonly misspelled English words
- homophones
-
Punctuation
Randomly adds or removes specified punctuation marks.
-
QWERTY
Simulates errors made while writing on a QWERTY-type keyboard.
-
RemoveChar
Randomly removes:
- characters from words or
- white spaces from sentences
-
SentimentMasking
Replaces a random single character with, for example, an asterisk in:
- negative or
- positive words from Opinion Lexicon:
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
-
Swap
Randomly swaps two characters within a word, excluding punctuation.
-
Change char
Randomly changes characters according to a chosen dictionary. The default is 'ocr', which simulates simple OCR errors.
-
White spaces
Randomly adds or removes white spaces (listed as a parameter).
-
Sub string
Randomly adds a substring to simulate more complex signs.
All aspects can be chained together with the wildnlp.aspects.utils.compose function.
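The chaining idea can be illustrated without the library. The sketch below is only an assumption about how left-to-right composition of corruptors might behave; it is not the implementation of wildnlp.aspects.utils.compose. The names compose_sketch, swap_two_chars, and drop_one_char are hypothetical and exist only for this example, and the two toy corruptors are simplified stand-ins for the Swap and RemoveChar aspects (the swap here is restricted to adjacent characters).

```python
import random
from functools import reduce
from typing import Callable

# A corruptor is any function that takes text and returns corrupted text.
Corruptor = Callable[[str], str]


def compose_sketch(*functions: Corruptor) -> Corruptor:
    """Chain corruptors so they run left to right (hypothetical helper)."""
    def composed(text: str) -> str:
        return reduce(lambda acc, fn: fn(acc), functions, text)
    return composed


def swap_two_chars(seed: int = 0) -> Corruptor:
    """Return a corruptor that swaps two adjacent characters in each
    purely alphabetic word, leaving words with punctuation untouched."""
    rng = random.Random(seed)

    def corrupt(text: str) -> str:
        words = []
        for word in text.split(" "):
            if len(word) > 1 and word.isalpha():
                i = rng.randrange(len(word) - 1)
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            words.append(word)
        return " ".join(words)
    return corrupt


def drop_one_char(seed: int = 0) -> Corruptor:
    """Return a corruptor that removes one random character
    (possibly a white space) from the text."""
    rng = random.Random(seed)

    def corrupt(text: str) -> str:
        if not text:
            return text
        i = rng.randrange(len(text))
        return text[:i] + text[i + 1:]
    return corrupt


# Swapping is applied first, then the character removal.
corrupt = compose_sketch(swap_two_chars(seed=1), drop_one_char(seed=2))
result = corrupt("robust models handle noisy text")
print(result)
```

Because every corruptor takes a string and returns a string, the order of the arguments is the order of application, and seeding each corruptor keeps a corrupted dataset reproducible between runs.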
Aspects can be applied to any text. Below is the list of datasets for which we already implemented processing pipelines.
-
CoNLL
The CoNLL-2003 shared task data for language-independent named entity recognition.
-
IMDB
The IMDB dataset containing movie reviews for sentiment analysis. The dataset consists of 50,000 reviews in two classes, negative and positive.
-
SNLI
The SNLI dataset supporting the task of natural language inference.
-
SQuAD
The SQuAD dataset for the Machine Comprehension problem.
from wildnlp.aspects.dummy import Reverser, PigLatin
from wildnlp.aspects.utils import compose
from wildnlp.datasets import SampleDataset
# Create a dataset object and load the dataset
dataset = SampleDataset()
dataset.load()
# Create a composed corruptor function.
# Functions will be applied in the same order they appear.
composed = compose(Reverser(), PigLatin())
# Apply the function to the dataset
modified = dataset.apply(composed)