NLP using Stanza

A simple Natural Language Processing tutorial using the Stanza package in Python

Pema Gurung
Aug 3, 2021

Stanza is an NLP library created by the Stanford NLP Group that provides neural network NLP pipelines. Before we move on to Stanza, you can also look into NLP using NLTK and NLP using spaCy.

Stanza’s neural network NLP pipeline

source: https://stanfordnlp.github.io/stanza/

Raw text in any language Stanza supports can be passed into the pipeline, where we can easily tokenize the text, lemmatize it, or get part-of-speech tags and named entities. So let’s get started!

Install:

pip install stanza

Now, let’s import it so that we can use its features.

import stanza

Before we do any task, we need the pretrained model for the language we will be working with. Nothing is trained at this step; the model is only downloaded. So let’s download the English language model:

import stanza
stanza.download('en')

Load the pipeline

nlp = stanza.Pipeline('en')

WORD TOKENIZE

Tokenize the text into words, i.e. break the sentences into individual tokens.

import stanza
nlp = stanza.Pipeline('en')
en_test_= "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
word_tokens = [token.text for sent in doc.sentences for token in sent.tokens]
print(word_tokens)

[OUTPUT]
['Chris', 'Manning', 'teaches', 'at', 'Stanford', 'University', '.', 'He', 'lives', 'in', 'the', 'Bay', 'Area', '.']
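
To see what the tokenizer adds over a plain whitespace split, here is a small comparison sketch. It does not call Stanza: `stanza_tokens` is simply the list copied from the [OUTPUT] above, and the names `naive_tokens` and `glued` are only illustrative.

```python
# Same text as in the example above.
text = "Chris Manning teaches at Stanford University. He lives in the Bay Area."

# Naive approach: split on whitespace only.
naive_tokens = text.split()

# Stanza's tokens, copied from the [OUTPUT] above (not recomputed here).
stanza_tokens = ['Chris', 'Manning', 'teaches', 'at', 'Stanford', 'University', '.',
                 'He', 'lives', 'in', 'the', 'Bay', 'Area', '.']

# Items that only the whitespace split produces:
# the punctuation stays glued to the preceding word.
glued = [tok for tok in naive_tokens if tok not in stanza_tokens]
print(glued)
```

The whitespace split leaves the full stops attached to "University" and "Area", while the tokenizer returns them as separate tokens.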

SENTENCE TOKENIZE

Tokenize into sentences if there is more than one sentence, i.e. break the text into a list of sentences.

import stanza
nlp = stanza.Pipeline('en')
en_test_= "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
sent_list = [sent.text for sent in doc.sentences]
print(sent_list)

[OUTPUT]
['Chris Manning teaches at Stanford University.', 'He lives in the Bay Area.']

Lemma

Lemmatize the text to get the base (dictionary) form of each word, e.g. ‘teaches’ becomes ‘teach’.

import stanza
nlp = stanza.Pipeline('en', processors='tokenize,lemma')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print(*[f'{word.text}_{word.lemma}' for sent in doc.sentences for word in sent.words])

[OUTPUT]
Chris_Chris Manning_manning teaches_teach at_at Stanford_Stanford University_University ._. He_he lives_life in_in the_the Bay_Bay Area_Area ._.

Here, note that we need to pass the processors we will be using to the pipeline. For lemmas, we tokenize the text and then fetch the lemma, so we pass ‘tokenize,lemma’ as the pipeline’s processors argument.
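
As a quick way to read the lemmatizer output, the sketch below splits the printed `word_lemma` items back into pairs and keeps only the words whose lemma differs from the surface form. It works on the output string copied from above rather than on a live Stanza document, and the names `pairs` and `changed` are just illustrative.

```python
# Lemmatizer output copied from the [OUTPUT] above (not recomputed here).
output = ("Chris_Chris Manning_manning teaches_teach at_at Stanford_Stanford "
          "University_University ._. He_he lives_life in_in the_the Bay_Bay "
          "Area_Area ._.")

# Split each "word_lemma" item on its last underscore.
pairs = [tuple(item.rsplit("_", 1)) for item in output.split()]

# Keep only the words whose lemma differs from the word itself.
changed = [(word, lemma) for word, lemma in pairs if word != lemma]
print(changed)
```

Note the last pair: the lemmatizer picked the noun reading of “lives” and returned “life”, which may be because this pipeline runs without a POS tagger.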

POS tags

POS (part-of-speech) tags tell us the grammatical category of each word, such as whether it is a noun, an adjective, etc.

import stanza
nlp = stanza.Pipeline('en', processors='tokenize,pos')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print([f'{word.text}_{word.xpos}' for sent in doc.sentences for word in sent.words])

[OUTPUT]
['Chris_NNP', 'Manning_NNP', 'teaches_VBZ', 'at_IN', 'Stanford_NNP', 'University_NNP', '._.', 'He_PRP', 'lives_VBZ', 'in_IN', 'the_DT', 'Bay_NNP', 'Area_NNP', '._.']

Note that for POS tagging, we need to tokenize the text and then tag it, so we pass ‘tokenize,pos’ as the pipeline’s processors argument.
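
The tags in the output above (NNP, VBZ, ...) are Penn Treebank style tags. The sketch below spells out the ones that appear in this example. The lookup table is hand-written and partial, and `XPOS_DESCRIPTIONS` and `describe` are illustrative names, not part of Stanza.

```python
# Descriptions for the Penn Treebank tags seen in the output above.
# Hand-written and partial: this table is not exported by Stanza.
XPOS_DESCRIPTIONS = {
    "NNP": "proper noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
    "PRP": "personal pronoun",
    ".": "sentence-final punctuation",
}

def describe(item):
    """Turn a 'word_TAG' string into (word, human-readable description)."""
    word, tag = item.rsplit("_", 1)
    return word, XPOS_DESCRIPTIONS.get(tag, "unknown tag")

# A few items copied from the [OUTPUT] above.
for item in ['Chris_NNP', 'teaches_VBZ', 'at_IN', 'He_PRP', 'the_DT', '._.']:
    print(describe(item))
```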

NER

NER (Named Entity Recognition) is the process of finding the named entities in a text, such as people, organizations and locations.

import stanza
nlp = stanza.Pipeline('en', processors='tokenize,ner')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print(*[f'{token.text}_{token.ner}' for sent in doc.sentences for token in sent.tokens])

[OUTPUT]
Chris_B-PERSON Manning_E-PERSON teaches_O at_O Stanford_B-ORG University_E-ORG ._O He_O lives_O in_O the_B-LOC Bay_I-LOC Area_E-LOC ._O

Note again that for NER, we need to tokenize the text and use the NER model to get the NER tags, so we pass ‘tokenize,ner’ as the pipeline’s processors argument.
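
The per-token tags above follow the BIOES scheme: B begins an entity, I is inside one, E ends it, S marks a single-token entity, and O is outside any entity. A minimal sketch of merging these tags into whole entities is shown below. It works on the printed output copied from above, and `group_entities` is a hypothetical helper, not a Stanza function. Stanza also exposes the recognized spans directly as `doc.ents`, which is usually the simpler route.

```python
# NER output copied from the [OUTPUT] above (not recomputed here).
tagged = ("Chris_B-PERSON Manning_E-PERSON teaches_O at_O Stanford_B-ORG "
          "University_E-ORG ._O He_O lives_O in_O the_B-LOC Bay_I-LOC "
          "Area_E-LOC ._O")

def group_entities(tagged_text):
    """Merge BIOES-tagged 'token_TAG' items into (entity text, label) pairs."""
    entities = []
    current = []  # tokens of the entity currently being built
    for item in tagged_text.split():
        token, tag = item.rsplit("_", 1)
        if tag == "O":
            current = []
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "S"):
            current = [token]      # start a new entity
        else:
            current.append(token)  # I or E: continue the entity
        if prefix in ("E", "S"):
            entities.append((" ".join(current), label))
            current = []
    return entities

entities = group_entities(tagged)
print(entities)
```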

WORD FEATURES

One nice thing I found in the Stanza library is the “word features”: for each word it tells us properties such as number (singular or plural), gender, case, etc. The features are filled in by the “pos” processor; the pipeline below also includes “mwt” (Multi-Word Token expansion) in the processors.

import stanza
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos')
en_test_ = "Chris Manning teaches at Stanford University. He lives in the Bay Area."
doc = nlp(en_test_)
print(*[f'WORD: {word.text}\tPOS: {word.upos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')

[OUTPUT]
WORD: Chris POS: PROPN feats: Number=Sing
WORD: Manning POS: PROPN feats: Number=Sing
WORD: teaches POS: VERB feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
WORD: at POS: ADP feats: _
WORD: Stanford POS: PROPN feats: Number=Sing
WORD: University POS: PROPN feats: Number=Sing
WORD: . POS: PUNCT feats: _
WORD: He POS: PRON feats: Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs
WORD: lives POS: VERB feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
WORD: in POS: ADP feats: _
WORD: the POS: DET feats: Definite=Def|PronType=Art
WORD: Bay POS: PROPN feats: Number=Sing
WORD: Area POS: PROPN feats: Number=Sing
WORD: . POS: PUNCT feats: _
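
Each feats value is a single string of `Name=Value` pairs joined by `|`. If you want to look up one feature, a small parser like the sketch below can turn it into a dictionary. `parse_feats` is a hypothetical helper written for this post, not part of Stanza.

```python
def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into {'Case': 'Nom', 'Number': 'Sing'}."""
    # Words without features have feats set to None (printed as "_" above).
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Feature string for "He", copied from the [OUTPUT] above.
he_feats = parse_feats("Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs")
print(he_feats["Gender"], he_feats["Number"])

# "at" has no features in the output above.
print(parse_feats("_"))
```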

Overall, I could see that the modules are similar to those in spaCy. So if you are familiar with spaCy, working with Stanza will be super easy!
You can play with the Stanza library more by going through their website.

Written by Pema Gurung

MSc CL at University of Stuttgart, Germany. Linkedin: https://www.linkedin.com/in/pemagrg/
