NLP Data Augmentation
Augmentation Techniques for Textual Data Using NLP Augmentation Libraries
“Augmentation” means enlarging something in size or amount. Neural models need a lot of data, so let’s expand ours!
How? Let’s check out the data augmentation techniques available for textual data.
As described in “A Survey of Data Augmentation Approaches for NLP” [b], data augmentation techniques include:
- Rule-Based Techniques: Easy Data Augmentation (EDA)
- Example Interpolation Techniques: MIXUP, SEQ2MIXUP
- Model-Based Techniques: Seq2seq, language model, back translation, fine-tuning GPT-2, paraphrasing.
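Example interpolation is easiest to see with numbers. The sketch below follows the usual mixup formulation, a convex combination of two inputs and of their one-hot labels with a mixing weight drawn from a Beta distribution. The toy vectors, the `alpha` value, and the function name are assumptions made for illustration; with text, the interpolation is typically applied to embeddings or hidden states rather than to raw tokens.

```python
import random


def mixup(x_i, y_i, x_j, y_j, lam):
    """Blend two (features, one-hot label) pairs with weight lam in [0, 1]."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


rng = random.Random(0)
alpha = 0.4  # assumed hyperparameter, chosen only for this example
lam = rng.betavariate(alpha, alpha)

# Two toy "sentence embeddings" with one-hot labels for two classes.
mixed_x, mixed_y = mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                         [0.0, 1.0, 4.0], [0.0, 1.0], lam)
print("lambda:", lam)
print("mixed features:", mixed_x)
print("mixed label:", mixed_y)
```

The mixed label is soft rather than one-hot, so a model trained on such examples needs a loss that accepts label distributions.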
Under the rule-based category, the most basic and commonly used technique is EDA (Easy Data Augmentation), which consists of four operations:
1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
2. Random Deletion: Randomly remove each word in the sentence with probability p.
3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
4. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
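The four operations above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any of the libraries below: the tiny `STOP_WORDS` set and `SYNONYMS` table are hypothetical stand-ins for a real stop-word list and a thesaurus such as WordNet, and the function signatures are my own.

```python
import random

# Hypothetical toy resources; a real pipeline would use a full stop-word
# list and a thesaurus such as WordNet.
STOP_WORDS = {"the", "a", "an", "over", "is", "of"}
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "lazy": ["idle", "sluggish"],
    "jumps": ["leaps", "hops"],
}


def _candidates(words):
    """Indices of words that are not stop words and have a known synonym."""
    return [i for i, w in enumerate(words)
            if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]


def synonym_replacement(words, n, rng):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    words = list(words)
    idxs = _candidates(words)
    for i in rng.sample(idxs, min(n, len(idxs))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_insertion(words, n, rng):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    words = list(words)
    for _ in range(n):
        idxs = _candidates(words)
        if not idxs:
            break
        synonym = rng.choice(SYNONYMS[words[rng.choice(idxs)].lower()])
        words.insert(rng.randint(0, len(words)), synonym)
    return words


rng = random.Random(0)
sentence = "The quick brown fox jumps over the lazy dog".split()
print(" ".join(synonym_replacement(sentence, 2, rng)))
print(" ".join(random_deletion(sentence, 0.2, rng)))
print(" ".join(random_swap(sentence, 1, rng)))
print(" ".join(random_insertion(sentence, 1, rng)))
```

Passing an explicit `random.Random` instance keeps the augmentations reproducible, which helps when comparing training runs.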
Data augmentation has been applied to a variety of NLP tasks:
- Summarization
- Question Answering
- Sequence Tagging
- Parsing
- Grammatical Error Correction
- Neural Machine Translation
- Data to Text
- Dialogue
Available Libraries:
- TextAugment
- AugLy
- NLPAug
- Parrot paraphrase
- Pegasus paraphrase
Getting Started!
1. TextAugment
TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.
pip install numpy nltk gensim textblob googletrans textaugment
Synonym Replacement
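# If NLTK reports missing resources, download them first,
# e.g. nltk.download('stopwords') and nltk.download('wordnet')
from textaugment import EDA
eda_augment = EDA()
TEST_TEXT = "The quick brown fox jumps over the lazy dog."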
syn_augmented = eda_augment.synonym_replacement(TEST_TEXT)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", syn_augmented)
Random Deletion
from textaugment import EDA
eda_augment = EDA()
TEST_TEXT = "The quick brown fox jumps over the lazy dog."
random_del_augmented = eda_augment.random_deletion(TEST_TEXT, p=0.2)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", random_del_augmented)
Random Swap
from textaugment import EDA
eda_augment = EDA()
TEST_TEXT = "The quick brown fox jumps over the lazy dog."
random_swap_augmented = eda_augment.random_swap(TEST_TEXT)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", random_swap_augmented)
Random Insertion
from textaugment import EDA
eda_augment = EDA()
TEST_TEXT = "The quick brown fox jumps over the lazy dog."
rnd_insert_augmented = eda_augment.random_insertion(TEST_TEXT)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", rnd_insert_augmented)
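Calls like the ones above augment one sentence at a time. To grow a labeled dataset, each augmented sentence is usually stored with the label of the sentence it came from. Below is a minimal sketch of that bookkeeping; `expand_dataset` and the two stand-in augmenters are hypothetical helpers written for this example so that it runs without any extra packages.

```python
def expand_dataset(samples, augmenters, copies_per_augmenter=1):
    """Return the original (text, label) pairs plus augmented copies.

    Each augmenter is a callable str -> str. Labels are carried over
    unchanged and exact duplicates are dropped.
    """
    seen = set()
    expanded = []

    def add(text, label):
        if (text, label) not in seen:
            seen.add((text, label))
            expanded.append((text, label))

    for text, label in samples:
        add(text, label)
        for augment in augmenters:
            for _ in range(copies_per_augmenter):
                add(augment(text), label)
    return expanded


# Deterministic stand-ins for real augmenters.
def drop_last_word(text):
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


def swap_first_two(text):
    words = text.split()
    if len(words) > 1:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)


samples = [("the fox jumps", "animal"), ("stocks fell today", "finance")]
for text, label in expand_dataset(samples, [drop_last_word, swap_first_two]):
    print(label, "|", text)
```

With TextAugment installed, the stand-ins could be swapped for methods such as eda_augment.synonym_replacement or eda_augment.random_swap, which take a sentence and return a sentence. Reusing the label assumes the augmentation does not change the meaning that the label depends on, which is worth spot-checking by hand.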
2. AugLy
Facebook recently open-sourced the AugLy package. The library is divided into four sub-libraries, one for each data modality (audio, image, video, and text).
pip install -U augly
Replace Similar Characters
import augly.text as textaugs
TEST_TEXT = "The quick brown fox jumps over the lazy dog."
textaugs.replace_similar_chars(TEST_TEXT)
OUTPUT: Th3 quick brown fox jumps over the lazy do9.
Insert Punctuations
import augly.text as textaugs
TEST_TEXT = "The quick brown fox jumps over the lazy dog."
textaugs.insert_punctuation_chars(TEST_TEXT)
OUTPUT: ['T.h.e. .q.u.i.c.k. .b.r.o.w.n. .f.o.x. .j.u.m.p.s. .o.v.e.r. .t.h.e. .l.a.z.y. .d.o.g..']
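Character-level augmentations like these imitate the obfuscated text people write to slip past filters. To make the idea concrete, here is a small from-scratch sketch of look-alike character replacement. It is not AugLy's implementation: the `SIMILAR_CHARS` table, the function name, and the probability parameter `p` are assumptions made for illustration.

```python
import random

# Assumed, illustrative mapping; a real library ships a much larger table.
SIMILAR_CHARS = {"a": "@", "e": "3", "g": "9", "i": "1", "o": "0", "s": "$"}


def replace_similar_chars_sketch(text, p, rng):
    """Replace each mappable character with its look-alike with probability p."""
    return "".join(
        SIMILAR_CHARS[ch] if ch in SIMILAR_CHARS and rng.random() < p else ch
        for ch in text
    )


rng = random.Random(0)
TEST_TEXT = "The quick brown fox jumps over the lazy dog."
print(replace_similar_chars_sketch(TEST_TEXT, 0.3, rng))
```

A low `p` keeps the text readable while still exposing a model to the kind of noise it may meet in user-generated content.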
3. NLPAug
NLPAug is a library for textual augmentation in machine learning experiments. The goal is to improve deep learning model performance by generating textual data.
pip install nlpaug
Synonym Augmentation
import nlpaug.augmenter.word as naw
text = 'The quick brown fox jumps over the lazy dog .'
syn_aug = naw.SynonymAug(aug_src='wordnet',aug_max=2)
syn_aug_text = syn_aug.augment(text,n=4)
print("ORIGINAL TEXT: ", text)
print("AUGMENTED TEXT: ",syn_aug_text)
Back Translation
Back translation translates a text into another language and then back into the original language, with the return step working only from the translated version and never seeing the original wording. The round trip tends to come back as a paraphrase of the input. For augmentation, machine translation models handle both directions.
import nlpaug.augmenter.word as naw
aug = naw.BackTranslationAug()
text = 'The quick brown fox jumps over the lazy dog .'
print("ORIGINAL TEXT: ", text)
print("AUGMENTED TEXT: ",aug.augment(text))
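The round trip itself is just two translation steps chained together. Here is a minimal sketch in which toy dictionary lookups stand in for real translation models; the `EN_TO_DE` and `DE_TO_EN` tables and the helper names are hypothetical and exist only to make the example runnable.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate into a pivot language and back; both arguments are callables."""
    return from_pivot(to_pivot(text))


# Toy word-level "translators". Real systems use neural translation models.
EN_TO_DE = {"quick": "schnell", "lazy": "faul", "dog": "hund"}
DE_TO_EN = {"schnell": "fast", "faul": "idle", "hund": "dog"}


def translate(text, table):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return " ".join(table.get(word, word) for word in text.split())


original = "the quick brown fox jumps over the lazy dog"
augmented = back_translate(
    original,
    to_pivot=lambda t: translate(t, EN_TO_DE),
    from_pivot=lambda t: translate(t, DE_TO_EN),
)
print("ORIGINAL TEXT: ", original)
print("AUGMENTED TEXT: ", augmented)
```

The paraphrase appears because the return trip does not have to pick the same words the original used. Here "schnell" comes back as "fast" rather than "quick".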
4. Parrot Paraphraser
Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.
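One way to read "more than just a paraphrasing model": candidates are generated first and then filtered, so that only useful variations reach the training set. The sketch below shows a simple filter of that kind. It is not Parrot's code; `filter_paraphrases`, the `max_similarity` threshold, and the candidate strings are assumptions, and `difflib` from the standard library stands in for a learned diversity score.

```python
from difflib import SequenceMatcher


def filter_paraphrases(source, candidates, max_similarity=0.9):
    """Drop duplicates and candidates that are near-copies of the source."""
    kept = []
    seen = {source.strip().lower()}
    for cand in candidates:
        key = cand.strip().lower()
        if key in seen:
            continue  # exact duplicate of the source or of an earlier candidate
        seen.add(key)
        similarity = SequenceMatcher(None, source.lower(), key).ratio()
        if similarity <= max_similarity:
            kept.append(cand)
    return kept


PHRASE = "The quick brown fox jumps over the lazy dog."
candidates = [
    "The quick brown fox jumps over the lazy dog.",  # identical to the input
    "A fast brown fox leaps over a sleepy dog.",
    "A fast brown fox leaps over a sleepy dog.",     # repeated candidate
    "The quick brown fox jumps over the lazy dog!",  # trivial variation
]
for para_phrase in filter_paraphrases(PHRASE, candidates):
    print(para_phrase)
```

A character-level ratio is a crude measure of diversity and says nothing about whether the meaning was preserved, so it complements a check on adequacy rather than replacing one.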
pip install git+
import torch
from parrot import Parrot
PARROT_PRETRAINED_MODEL = "prithivida/parrot_paraphraser_on_T5"
parrot_model = Parrot(model_tag=PARROT_PRETRAINED_MODEL)
PHRASE = "The quick brown fox jumps over the lazy dog."
para_phrases = parrot_model.augment(input_phrase=PHRASE, use_gpu=False)
for para_phrase in para_phrases:
    print(para_phrase)
5. Pegasus Paraphraser
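PEGASUS is a standard Transformer encoder-decoder. It is pre-trained on large document corpora with gap-sentence generation (GSG): important sentences are removed from a document and the model learns to generate them from the remaining text [j]. The checkpoint used here, tuner007/pegasus_paraphrase, is a PEGASUS model fine-tuned for paraphrasing.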
pip install transformers
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
BEAM_NUM = 10
# Assumed value (the definition is missing from the original snippet); must not exceed BEAM_NUM
RETURN_SEQ_NUM = 5
PEGASUS_PRETRAINED_MODEL = 'tuner007/pegasus_paraphrase'
# Use the GPU when one is available, otherwise fall back to the CPU
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
pegasus_tokenizer = PegasusTokenizer.from_pretrained(PEGASUS_PRETRAINED_MODEL)
pegasus_model = PegasusForConditionalGeneration.from_pretrained(PEGASUS_PRETRAINED_MODEL).to(torch_device)
input_text = "Can you recommend some upscale restaurants in New York?"
batch = pegasus_tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)
translated = pegasus_model.generate(**batch, max_length=60, num_beams=BEAM_NUM, num_return_sequences=RETURN_SEQ_NUM, temperature=1.5)
tgt_text = pegasus_tokenizer.batch_decode(translated, skip_special_tokens=True)
print("ACTUAL TEXT: ",input_text)
for each_text in tgt_text:
    print(each_text)
Using these libraries and techniques, we can enlarge our datasets. That is a huge help when generating chatbot training data, or when paraphrasing an extractive summary to turn it into an abstractive one. Let me know if you come up with anything interesting after trying these out 😃
[a] Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, Rishemjit Kaur. 2021. Neural Machine Translation for Low-Resource Languages: A Survey.
[b] Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP.
[c] EDA from scratch:
[e] AugLy
[f] NLPAug
[g] Parrot Paraphraser
[h] Pegasus Paraphraser
[i] Improving short text classification through global augmentation methods.
[j] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization.