NLP using Stanza
Simple Natural Language Processing tutorial using Stanza package in Python
Stanza is an NLP library created by the Stanford NLP Group that provides neural NLP pipelines. Before we move on to Stanza, you can also look into NLP using NLTK and NLP using spaCy.
Stanza’s neural network NLP pipeline
Raw text in any of Stanza's supported languages can be passed into the pipeline, where we can tokenize the text, lemmatize it, and get part-of-speech tags or named entities easily. So let's get started!
Install:
pip install stanza
Now, let’s import it so that we can use its features.
import stanza
Before we do any task, we have to download a pretrained model for the language we will be working with. So let's download the English language model:
import stanza
stanza.download('en')
Load the pipeline
nlp = stanza.Pipeline('en')
WORD TOKENIZE
Tokenize words to get the tokens of the text, i.e. breaking the sentences into words.
import stanza

nlp = stanza.Pipeline('en')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
word_tokens = [token.text for sent in doc.sentences for token in sent.tokens]
print(word_tokens)
[OUTPUT]
['Chris', 'Manning', 'teaches', 'at', 'Stanford', 'University', '.', 'He', 'lives', 'in', 'the', 'Bay', 'Area', '.']
SENTENCE TOKENIZE
Tokenize sentences if there is more than one sentence, i.e. breaking the text into a list of sentences.
import stanza

nlp = stanza.Pipeline('en')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
sent_list = [sent.text for sent in doc.sentences]
print(sent_list)
[OUTPUT]
['Chris Manning teaches at Stanford University.', 'He lives in the Bay Area.']
Lemma
Lemmatize the text to get the root form of each word, e.g. functions → function.
import stanza

nlp = stanza.Pipeline('en', processors='tokenize,lemma')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print(*[f'{word.text}_{word.lemma}' for sent in doc.sentences for word in sent.words])
[OUTPUT]
Chris_Chris Manning_manning teaches_teach at_at Stanford_Stanford University_University ._. He_he lives_life in_in the_the Bay_Bay Area_Area ._.
Here, note that we need to pass the modules we will be using to the pipeline. For lemmas, we tokenize the text and fetch the lemma of each word, so we pass 'tokenize,lemma' to the pipeline's processors argument.
POS tags
POS tagging tells us the tag of each word, i.e. whether a word is a noun, an adjective, etc.
import stanza

nlp = stanza.Pipeline('en', processors='tokenize,pos')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print([f'{word.text}_{word.xpos}' for sent in doc.sentences for word in sent.words])
[OUTPUT]
['Chris_NNP', 'Manning_NNP', 'teaches_VBZ', 'at_IN', 'Stanford_NNP', 'University_NNP', '._.', 'He_PRP', 'lives_VBZ', 'in_IN', 'the_DT', 'Bay_NNP', 'Area_NNP', '._.']
Note that for POS, we need to tokenize the text and tag it, therefore we pass 'tokenize,pos' to the pipeline's processors argument.
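For reference, the xpos values above are Penn Treebank tags. A small lookup table (my own, covering only the tags that appear in this output, not part of Stanza) makes them readable:

```python
# A few Penn Treebank tags seen in the output above (xpos uses this tagset).
PTB_TAGS = {
    'NNP': 'proper noun, singular',
    'VBZ': 'verb, 3rd person singular present',
    'IN':  'preposition or subordinating conjunction',
    'PRP': 'personal pronoun',
    'DT':  'determiner',
    '.':   'sentence-final punctuation',
}

# Decode a few word_tag pairs from the output.
for word_tag in ['Chris_NNP', 'teaches_VBZ', 'the_DT']:
    word, tag = word_tag.rsplit('_', 1)
    print(f'{word}: {PTB_TAGS[tag]}')
```

Stanza also exposes `word.upos`, the coarser Universal POS tag (PROPN, VERB, DET, ...), if you prefer a language-independent tagset.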
NER
NER (Named Entity Recognition) is the process of extracting entity names from text.
import stanza

nlp = stanza.Pipeline('en', processors='tokenize,ner')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print(*[f'{token.text}_{token.ner}' for sent in doc.sentences for token in sent.tokens])
[OUTPUT]
Chris_B-PERSON Manning_E-PERSON teaches_O at_O Stanford_B-ORG University_E-ORG ._O He_O lives_O in_O the_B-LOC Bay_I-LOC Area_E-LOC ._O
Note again that for NER, we need to tokenize the text and use the NER model to get the NER tags, so we pass 'tokenize,ner' to the pipeline's processors argument.
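The tags in the output above follow the BIOES scheme: B (begin), I (inside), E (end), S (single) mark a token's position in an entity, and O marks tokens outside any entity. As a plain-Python sketch (independent of Stanza; the helper name `collect_entities` is mine, not a Stanza API), here is how such token/tag pairs can be grouped back into whole entities:

```python
# Group (token, BIOES-tag) pairs into (entity text, entity type) pairs.
# "collect_entities" is an illustrative helper, not part of Stanza.
def collect_entities(tagged_tokens):
    entities = []
    current_words, current_type = [], None
    for text, tag in tagged_tokens:
        if tag == 'O':                      # outside any entity
            current_words, current_type = [], None
            continue
        prefix, ent_type = tag.split('-', 1)
        if prefix in ('B', 'S'):            # an entity starts here
            current_words, current_type = [text], ent_type
        else:                               # I or E: continue the entity
            current_words.append(text)
        if prefix in ('E', 'S'):            # the entity ends here
            entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    return entities

tagged = [('Chris', 'B-PERSON'), ('Manning', 'E-PERSON'), ('teaches', 'O'),
          ('at', 'O'), ('Stanford', 'B-ORG'), ('University', 'E-ORG'), ('.', 'O')]
print(collect_entities(tagged))
# [('Chris Manning', 'PERSON'), ('Stanford University', 'ORG')]
```

In practice you rarely need to do this by hand, since Stanza can give you grouped entities directly via the document's entities property.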
WORD FEATURES
One nice thing I found in the Stanza library was the "word features", which tell us whether a word is singular or plural, its gender, case, etc. The features come from the 'pos' processor; we also include 'mwt' (multi-word token expansion) in the processors so that the tagger runs on properly expanded words.
import stanza

nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print(*[f'WORD: {word.text}\tPOS: {word.upos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
[OUTPUT]
WORD: Chris POS: PROPN feats: Number=Sing
WORD: Manning POS: PROPN feats: Number=Sing
WORD: teaches POS: VERB feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
WORD: at POS: ADP feats: _
WORD: Stanford POS: PROPN feats: Number=Sing
WORD: University POS: PROPN feats: Number=Sing
WORD: . POS: PUNCT feats: _
WORD: He POS: PRON feats: Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs
WORD: lives POS: VERB feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
WORD: in POS: ADP feats: _
WORD: the POS: DET feats: Definite=Def|PronType=Art
WORD: Bay POS: PROPN feats: Number=Sing
WORD: Area POS: PROPN feats: Number=Sing
WORD: . POS: PUNCT feats: _
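The feats string is a pipe-separated list of Key=Value pairs (Universal Dependencies morphological features). If you would rather work with them as a dictionary, a small plain-Python helper does the job (the name `parse_feats` is mine, not a Stanza API):

```python
# Turn a Universal Dependencies feats string such as
# "Case=Nom|Gender=Masc|Number=Sing" into a dict.
# "parse_feats" is an illustrative helper, not part of Stanza.
def parse_feats(feats):
    if not feats or feats == '_':   # words with no features ("_" or None)
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))

print(parse_feats('Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs'))
# {'Case': 'Nom', 'Gender': 'Masc', 'Number': 'Sing', 'Person': '3', 'PronType': 'Prs'}
```

With this you can, for example, check `parse_feats(word.feats).get('Number') == 'Plur'` to find plural words.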
Overall, I could see that the modules are similar to spaCy's. So if you are familiar with spaCy, working with Stanza will be super easy!
You can play with the Stanza library more by going through its website.