NLP Data Augmentation

Augmentation Techniques for Textual Data Using NLP Augmentation Libraries

Feb 28, 2022

“Augmentation” means increasing something in size or amount. Since neural models depend heavily on data, let’s expand our data!
How? Let’s check out the data augmentation techniques for textual data.

As mentioned in “A Survey of Data Augmentation Approaches for NLP”[b], some of the Data Augmentation Techniques are:

Rule-Based: Easy Data Augmentation (EDA)
Example Interpolation Techniques: MIXUP, SEQ2MIXUP
Model-Based Techniques: Seq2seq, language model, back translation, fine-tuning GPT-2, paraphrasing.
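Unlike the rule-based methods, MIXUP does not edit words. It blends two training examples and their labels with a random weight. For text this is typically done on embeddings or hidden states, since raw tokens cannot be averaged. The sketch below is only an illustration under assumptions: the vectors are made-up "sentence embeddings", the function name is hypothetical, and alpha=0.2 is an arbitrary choice for the Beta distribution.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Interpolate two examples (feature vectors and one-hot labels)."""
    # Mixing weight drawn from Beta(alpha, alpha); always lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

# Toy sentence embeddings (made-up numbers) with one-hot labels.
x_mix, y_mix, lam = mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                          [0.0, 4.0, 2.0], [0.0, 1.0])
print("lambda:", lam)
print("mixed features:", x_mix)
print("mixed label:", y_mix)
```

The mixed label is a soft label, so the model is trained against a blend of both classes rather than a single one.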

Under Rule-Based, the most basic and commonly used technique is EDA (Easy Data Augmentation), which consists of four operations:

1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

2. Random Deletion: Randomly remove each word in the sentence with probability p.

3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.

4. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
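To make the four operations concrete, here is a minimal from-scratch sketch. It is not the TextAugment implementation: the stop-word set and synonym table are tiny made-up stand-ins (a real setup would use something like WordNet), and the function signatures are assumptions chosen for this example.

```python
import random

# Toy stop-word list and synonym table, made up for this example.
STOP_WORDS = {"the", "over", "a", "an"}
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "lazy": ["idle", "sluggish"],
    "jumps": ["leaps", "hops"],
}

def _replaceable(word):
    return word.lower() not in STOP_WORDS and word.lower() in SYNONYMS

def synonym_replacement(words, n, rng):
    out = list(words)
    candidates = [i for i, w in enumerate(out) if _replaceable(w)]
    # Replace up to n distinct non-stop words with a random synonym.
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i].lower()])
    return out

def random_deletion(words, p, rng):
    # Drop each word independently with probability p.
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]  # never return an empty sentence

def random_swap(words, n, rng):
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insertion(words, n, rng):
    out = list(words)
    candidates = [w.lower() for w in words if _replaceable(w)]
    if not candidates:
        return out
    for _ in range(n):
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), synonym)
    return out

rng = random.Random(0)
words = "The quick brown fox jumps over the lazy dog".split()
print("SR:", " ".join(synonym_replacement(words, 2, rng)))
print("RD:", " ".join(random_deletion(words, 0.2, rng)))
print("RS:", " ".join(random_swap(words, 1, rng)))
print("RI:", " ".join(random_insertion(words, 1, rng)))
```

Passing an explicit random.Random instance keeps the augmentations reproducible, which helps when comparing training runs.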

Various Data Augmentation Tasks:

Summarization
Question Answering
Sequence Tagging
Parsing
Grammatical Error Correction
Neural Machine Translation
Data to Text
Dialogue

Available Libraries:

TextAugment
AugLy
NLPAug
Parrot Paraphraser
Pegasus Paraphraser

Getting Started!

1. TextAugment

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim, and TextBlob and plays nicely with them.

pip install numpy nltk gensim textblob googletrans textaugment

Synonym Replacement

from textaugment import EDA

eda_augment = EDA()

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
syn_augmented = eda_augment.synonym_replacement(TEST_TEXT)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", syn_augmented)

The output of Synonym Replacement

Random Deletion

from textaugment import EDA

eda_augment = EDA()

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
random_del_augmented = eda_augment.random_deletion(TEST_TEXT, p=0.2)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", random_del_augmented)

The output of Random Deletion

Random Swap

from textaugment import EDA

eda_augment = EDA()

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
random_swap_augmented = eda_augment.random_swap(TEST_TEXT)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", random_swap_augmented)

The output of Random Swap

Random Insertion

from textaugment import EDA

eda_augment = EDA()

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
rnd_insert_augmented = eda_augment.random_insertion(TEST_TEXT)
print("ORIGINAL Text :",TEST_TEXT)
print("AUGMENTED TEXT:", rnd_insert_augmented)

The output of Random Insertion

2. AugLy

Facebook recently open-sourced the AugLy package. The AugLy library is divided into four sub-libraries, one for each data modality (audio, images, videos, and text).

pip install -U augly

Replace Similar Characters

import augly.text as textaugs

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
textaugs.replace_similar_chars(TEST_TEXT)

OUTPUT: Th3 quick brown fox jumps over the lazy do9.
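The idea behind this kind of augmentation can be sketched in a few lines: walk over the characters and, with some probability, swap a character for a visually similar one. This is only an illustration, not AugLy's implementation; the look-alike table, the function name swap_lookalikes, and the char_p parameter are all made up for this example.

```python
import random

# Toy look-alike table, made up for this example (not AugLy's actual mapping).
SIMILAR_CHARS = {"e": "3", "g": "9", "o": "0", "a": "@", "i": "1", "s": "$"}

def swap_lookalikes(text, char_p=0.3, seed=0):
    """Replace each mappable character with its look-alike with probability char_p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = SIMILAR_CHARS.get(ch.lower())
        if sub is not None and rng.random() < char_p:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
print(swap_lookalikes(TEST_TEXT))
```

A low char_p keeps the text readable while still adding the kind of noise found in user-generated content.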

Insert Punctuations

import augly.text as textaugs

TEST_TEXT = "The quick brown fox jumps over the lazy dog."
textaugs.insert_punctuation_chars(TEST_TEXT)

OUTPUT: ['T.h.e. .q.u.i.c.k. .b.r.o.w.n. .f.o.x. .j.u.m.p.s. .o.v.e.r. .t.h.e. .l.a.z.y. .d.o.g..']

3. NLPAug

NLPAug is a library for textual augmentation in machine learning experiments. The goal is to improve deep learning model performance by generating textual data.

pip install nlpaug

Synonym Augmentation

import nlpaug.augmenter.word as naw
text = 'The quick brown fox jumps over the lazy dog .'
syn_aug = naw.SynonymAug(aug_src='wordnet',aug_max=2)
syn_aug_text = syn_aug.augment(text,n=4)
print("ORIGINAL TEXT: ", text)
print("AUGMENTED TEXT: ",syn_aug_text)

The output of Synonym Augmentation

BackTranslation

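Back translation takes a text, translates it into another language, and then translates the result back into the original language. In professional translation the second step is done by an independent translator who has no knowledge of or contact with the original text. For augmentation, machine translation models do both steps, and the round trip usually comes back as a paraphrase of the input.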

import nlpaug.augmenter.word as naw
aug = naw.BackTranslationAug()
text = 'The quick brown fox jumps over the lazy dog .'

print("ORIGINAL TEXT: ", text)
print("AUGMENTED TEXT: ",aug.augment(text))

The output of BackTranslation
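The round trip behind back translation can be sketched independently of any library. In the sketch below the two "translators" are toy word-for-word lookup tables standing in for real translation models; the tables, toy_translator, and back_translate are all hypothetical names made up for this example.

```python
from typing import Callable, Dict

def back_translate(text: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate into a pivot language, then back into the source language."""
    return from_pivot(to_pivot(text))

# Toy word-for-word tables standing in for real translation models.
EN_TO_DE = {"quick": "schnell", "lazy": "faul", "dog": "hund"}
DE_TO_EN = {"schnell": "fast", "faul": "idle", "hund": "dog"}

def toy_translator(table: Dict[str, str]) -> Callable[[str], str]:
    # Words missing from the table are passed through unchanged.
    return lambda s: " ".join(table.get(w, w) for w in s.split())

text = "the quick brown fox jumps over the lazy dog"
augmented = back_translate(text, toy_translator(EN_TO_DE), toy_translator(DE_TO_EN))
print("ORIGINAL TEXT: ", text)
print("AUGMENTED TEXT:", augmented)
```

The augmentation comes from the return trip not mapping every word back to where it started ("quick" comes back as "fast"), which is the same effect real translation models produce at the sentence level.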

4. Parrot Paraphraser

Parrot is a paraphrase-based utterance augmentation framework purpose-built to accelerate training NLU models. A paraphrase framework is more than just a paraphrasing model.

pip install git+https://github.com/PrithivirajDamodaran/Parrot_Paraphraser.git

import torch
from parrot import Parrot

PARROT_PRETRAINED_MODEL = "prithivida/parrot_paraphraser_on_T5"
parrot_model = Parrot(model_tag=PARROT_PRETRAINED_MODEL)

PHRASE = "The quick brown fox jumps over the lazy dog."
para_phrases = parrot_model.augment(input_phrase=PHRASE, use_gpu=False)
# augment() can return None when no paraphrase is produced, so guard the loop
for para_phrase in para_phrases or []:
    print(para_phrase)

5. Pegasus Paraphraser

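PEGASUS is a standard Transformer encoder-decoder pre-trained on large document corpora with Gap Sentences Generation (GSG): important sentences are removed from a document and the model learns to generate them from the remaining text [j]. The tuner007/pegasus_paraphrase checkpoint used below is a PEGASUS model fine-tuned for paraphrasing [h].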

pip install transformers sentencepiece torch

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

BEAM_NUM = 10
RETURN_SEQ_NUM = 10
PEGASUS_PRETRAINED_MODEL = 'tuner007/pegasus_paraphrase'
# Use the GPU when one is available, otherwise fall back to the CPU
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'

pegasus_tokenizer = PegasusTokenizer.from_pretrained(PEGASUS_PRETRAINED_MODEL)
pegasus_model = PegasusForConditionalGeneration.from_pretrained(PEGASUS_PRETRAINED_MODEL).to(torch_device)

input_text = "Can you recommed some upscale restaurants in Newyork?"

batch = pegasus_tokenizer([input_text], truncation=True, padding='longest', max_length=60, return_tensors="pt").to(torch_device)
translated = pegasus_model.generate(**batch, max_length=60, num_beams=BEAM_NUM, num_return_sequences=RETURN_SEQ_NUM, temperature=1.5)
tgt_text = pegasus_tokenizer.batch_decode(translated, skip_special_tokens=True)

print("ACTUAL TEXT: ",input_text)
for each_text in tgt_text:
    print(each_text)
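When several sequences are requested at once, some candidates can differ only in case or punctuation, or simply repeat the input. A small post-processing step helps before adding them to a dataset. The sketch below is one possible filter, not part of transformers; normalize and unique_paraphrases are hypothetical helper names, and the candidate list is made up for illustration.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", "", text.lower()).split())

def unique_paraphrases(original, candidates):
    """Keep candidates that differ from the original and from each other."""
    seen = {normalize(original)}
    kept = []
    for cand in candidates:
        key = normalize(cand)
        if key and key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept

original = "Can you recommed some upscale restaurants in Newyork?"
# Made-up candidates standing in for model output.
candidates = [
    "Can you recommend some upscale restaurants in New York?",
    "can you recommend some upscale restaurants in new york",
    "Can you recommed some upscale restaurants in Newyork?",
    "Which upscale restaurants in New York do you recommend?",
]
for text in unique_paraphrases(original, candidates):
    print(text)
```

In the Pegasus example above, this would be called as unique_paraphrases(input_text, tgt_text).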

Conclusion

With these libraries and techniques we can enlarge our data, which is a huge help when generating chatbot training data or when paraphrasing an extractive summary into a more abstractive one. Let me know if you come up with anything interesting after trying these out 😃


REF:

[a] Surangika Ranathunga, En-Shiun Annie Lee, Marjana Prifti Skenduli, Ravi Shekhar, Mehreen Alam, Rishemjit Kaur. 2021. Neural Machine Translation for Low-Resource Languages: A Survey.

[b] Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy. 2021. A Survey of Data Augmentation Approaches for NLP.

[c] EDA from scratch: https://jovian.ai/abdulmajee/eda-data-augmentation-techniques-for-text-nlp

[d] TextAugment https://github.com/dsfsi/textaugment

[e] AugLy https://analyticsarora.com/how-to-use-augly-on-image-video-audio-and-text/

[f] nlpaug https://github.com/makcedward/nlpaug

[g] Parrot Paraphraser https://github.com/PrithivirajDamodaran/Parrot_Paraphraser

[h] Pegasus Paraphraser https://huggingface.co/tuner007/pegasus_paraphrase

[i] Improving short text classification through global augmentation methods.

[j] PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization https://arxiv.org/abs/1912.08777