Natural Language Toolkit (NLTK) Essentials

Introduction

This is a set of examples showing some of the most basic text parsing tasks that can be accomplished with the NLTK library in Python. It builds on my previous blog post, which explores the intersection between traditional machine learning and natural language processing (NLP).

Base Imports and Boilerplate

Since some imports result in extra memory being allocated, some of them are performed only when required.

import nltk
import requests
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import Markdown

Obtaining Text

Natural language processing is essentially about parsing text, so before we can explore NLTK's key text parsing capabilities, we need to obtain some text.

Built-In Text Corpora

NLTK allows downloading various corpora (the plural of corpus). A corpus is a body of text. Using nltk.download('book') we get the assortment of texts referenced in the O'Reilly book. This is a useful selection given that it covers different writing styles: fiction, religion, news reporting, etc.

nltk.download('book',quiet=True)
True

The bodies of text found in book can be accessed using the nltk.book package. We can bring the corpus variables text1, text2, etc., into scope as follows:

from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

We can list the texts programmatically as follows:

nltk.book.texts()
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1

<Text: Moby Dick by Herman Melville 1851>

Local Files

In this case we just use Python’s regular file functions.

text_dream = ""
with open('dream.txt') as f:
    text_dream = f.read()
text_dream[0:52] # Print first 52 characters
'And so even though we face the difficulties of today'

The Web

Here we need to use an external library, such as requests, to fetch the Web content:

html_agile = requests.get('https://agilemanifesto.org/').text
# Display first 56 characters
display(Markdown("`%s`" % html_agile[0:56]))

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN ">

Once we obtain an HTML document, we can use BeautifulSoup to get rid of the tags and obtain the text:

text_agile = BeautifulSoup(html_agile, 'html.parser').get_text()
text_agile[0:45] # First 45 characters
'\n\n\nManifesto for Agile Software Development\n\n'

Basic Parsing

Sentences

The nltk.sent_tokenize() function may be used to obtain the sentences contained in a text as follows.

sentences = nltk.sent_tokenize(text_dream)
s = ""
s += "Sentences: %d\n" % len(sentences)
s += "First 5 sentences:\n"
for i in range(0,5):
    s += "    %d - %s\n" % (i,sentences[i])
display(Markdown("```\n%s```\n" % s))
Sentences: 29
First 5 sentences:
    0 - And so even though we face the difficulties of today and tomorrow, I still have a dream.
    1 - It is a dream deeply rooted in the American dream.
    2 - I have a dream that one day this nation will rise up and live out the true meaning of its creed:

We hold these truths to be self-evident, that all men are created equal.
    3 - I have a dream that one day on the red hills of Georgia, the sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood.
    4 - I have a dream that one day even the state of Mississippi, a state sweltering with the heat of injustice, sweltering with the heat of oppression, will be transformed into an oasis of freedom and justice.

Gotcha: The sent_tokenize() function is often smart enough to recognise periods used in acronyms but not always. See the example below:

nltk.sent_tokenize("The U.S. is in the American continent. The R.O.I. is in Europe.")
['The U.S. is in the American continent.', 'The R.O.I.', 'is in Europe.']
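If this matters for your text, Punkt's abbreviation list can be extended. The sketch below is a minimal illustration, assuming that registering the lowercase abbreviation (without its trailing period) in abbrev_types is enough for the tokenizer to stop splitting after R.O.I.; results may vary with other inputs.

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Register known abbreviations (lowercase, without the final period) so
# Punkt does not treat their trailing periods as sentence boundaries.
punkt_params = PunktParameters()
punkt_params.abbrev_types = {'u.s', 'r.o.i'}

tokenizer = PunktSentenceTokenizer(punkt_params)
tokenizer.tokenize("The U.S. is in the American continent. The R.O.I. is in Europe.")
# Expected (not verified): ['The U.S. is in the American continent.',
#                           'The R.O.I. is in Europe.']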

Words

The nltk.word_tokenize() function may be used to obtain the words contained in a text as follows.

words = nltk.word_tokenize(text_dream)
print("Words: %d" % len(words))
print("First 5 words:")
for i in range(0,5):
    print("    %d - %s" % (i,words[i]))
Words: 688
First 5 words:
    0 - And
    1 - so
    2 - even
    3 - though
    4 - we

Gotcha: Contractions such as shouldn’t are split into separate tokens, as they represent two words (should and not). Note also that amounts such as £1.50 are kept as a single token rather than being split at the decimal point.

nltk.word_tokenize("You shouldn't ignore that it is £1.50 off. It's important.")
['You',
 'should',
 "n't",
 'ignore',
 'that',
 'it',
 'is',
 '£1.50',
 'off',
 '.',
 'It',
 "'s",
 'important',
 '.']
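If we prefer to keep contractions together, one option (a minimal sketch, not the only approach) is NLTK's MWETokenizer, which re-joins chosen multi-word expressions after tokenisation; the expression list and the empty separator below are our own illustrative choices.

from nltk.tokenize import MWETokenizer

# Re-join the pieces of known contractions after word tokenisation
mwe = MWETokenizer([('should', "n't"), ('It', "'s")], separator='')
mwe.tokenize(nltk.word_tokenize("You shouldn't ignore that it is £1.50 off. It's important."))
# Expected to contain "shouldn't" and "It's" as single tokens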

Word Frequency

Word frequency is about finding out how often a given word is used in a given text. For example, the word ‘dream’ appears 11 times:

freqDist = FreqDist(nltk.word_tokenize(text_dream))
freqDist['dream']
11

We may display the frequency of every word as follows. There are also dedicated methods such as keys() and values() in case we don’t want to iterate over each key-value pair.

# Show the first 9 items (not sorted by frequency)
counter = 0
for k,v in freqDist.items():
    counter += 1
    if counter > 9:
        break
    print("%s (%d)" % (k,v))
And (5)
so (2)
even (2)
though (1)
we (9)
face (1)
the (31)
difficulties (1)
of (34)

It is often more helpful to obtain the most frequent words using most_common(), so that we don’t need to do the sorting ourselves.

# 10 most common words
for k,v in freqDist.most_common(10):
    print("%s (%d)" % (k,v))
, (39)
of (34)
the (31)
. (23)
and (21)
be (20)
will (17)
to (16)
a (15)
freedom (13)

More complex queries may be expressed by combining other properties in addition to word frequency. For example:

# Words consisting of at least 4 letters,
# which appear at least 5 times,
# and that start with letters a,b,c,d, or e.
[ w for w in freqDist.keys() if len(w) >= 4 and freqDist[w] >= 5 and w[0].lower() in "abcde"]
['dream', 'able', 'every']
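FreqDist also offers a few other convenience methods; for instance, N() returns the total number of tokens counted and hapaxes() returns the words that occur only once. A quick sketch:

# Total tokens counted and a sample of words that appear exactly once
print(freqDist.N())
print(freqDist.hapaxes()[0:10])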

Stemming

A stem is the form of a word before any inflectional affixes are appended. In English, some stems are also valid words in their own right. Stems are obtained using rules that ‘chop off’ known inflectional affixes.

porter = nltk.PorterStemmer() # Porter is a famous stemming algorithm. There's also LancasterStemmer()
[porter.stem(t) for t in ["magically","information","painful","administration","universities"]]
['magic', 'inform', 'pain', 'administr', 'univers']
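For comparison, here is a minimal sketch using the LancasterStemmer mentioned in the comment above; Lancaster is generally more aggressive than Porter, so expect shorter (and sometimes less recognisable) stems for the same words.

# Lancaster is a more aggressive stemming algorithm than Porter
lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in ["magically","information","painful","administration","universities"]]
# Expect more heavily truncated stems than Porter's output above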

Lemmatisation

This process is based on looking up words in an actual dictionary, as opposed to chopping off affixes. NLTK’s implementation below uses the WordNet lexical database and returns the input word unchanged if it cannot be found in WordNet.

nltk.download('omw-1.4',quiet=True) # Words are looked up in an actual dictionary
True
lemmatizer = nltk.WordNetLemmatizer()
[lemmatizer.lemmatize(t) for t in ["magically","information","painful","administration","universities"]]
['magically', 'information', 'painful', 'administration', 'university']
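One detail worth knowing: lemmatize() assumes the word is a noun unless told otherwise via the pos argument. The sketch below illustrates this, assuming the usual WordNet part-of-speech codes ('n', 'v', 'a', 'r'); the outputs in the comments are the expected results.

# By default lemmatize() treats the word as a noun ('n')
print(lemmatizer.lemmatize("running"))           # expected: 'running' (a valid noun)
print(lemmatizer.lemmatize("running", pos="v"))  # expected: 'run' (treated as a verb)
print(lemmatizer.lemmatize("better", pos="a"))   # expected: 'good' (adjective lookup)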

Advanced Parsing

Part-of-Speech (POS) Tagging

POS Tagging is the process of classifying words into their parts of speech and labelling—tagging—them depending on their function: conjunction, adverb, verb, etc.

nltk.pos_tag(nltk.word_tokenize(text_dream)[0:19])
[('And', 'CC'),
 ('so', 'RB'),
 ('even', 'RB'),
 ('though', 'IN'),
 ('we', 'PRP'),
 ('face', 'VBP'),
 ('the', 'DT'),
 ('difficulties', 'NNS'),
 ('of', 'IN'),
 ('today', 'NN'),
 ('and', 'CC'),
 ('tomorrow', 'NN'),
 (',', ','),
 ('I', 'PRP'),
 ('still', 'RB'),
 ('have', 'VBP'),
 ('a', 'DT'),
 ('dream', 'NN'),
 ('.', '.')]

We can look up the description of a specific tag using nltk.help.upenn_tagset(TAG), or nltk.help.upenn_tagset() (no arguments) if we want to display the meaning of all valid tags.

## From the first tuple: ('And', 'CC')
nltk.help.upenn_tagset('CC')
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet

Gotcha: NLTK’s default tagger cannot produce multiple interpretations of the same word whenever there are ambiguities. For example, the sentence “we saw her duck” can be interpreted either as “we saw the duck (animal) that belongs to her” or “we saw her quickly lowering her head”. NLTK assumes the first interpretation given that it tags “duck” as a noun.

print(nltk.pos_tag(nltk.word_tokenize("we saw her duck")))
nltk.help.upenn_tagset('NN')
[('we', 'PRP'), ('saw', 'VBD'), ('her', 'PRP'), ('duck', 'NN')]
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...

Sentence Structure

A sentence structure describes the relationship between one or more words in a sentence. For example:

               Mary ate pizza with her hands (Sentence)
               ---- ------------------------
                 |             |
           Noun Phrase         |
                           Verb Phrase
                       ------------------------
                       ate pizza with her hands
                       --- ----- --------------
                      Verb Noun  Prepositional Phrase
                                 --------------------
                                    with her hands
                                    ---- --- -----
                             Preposition Det Noun

Symbols

Symbol  Meaning               Example
S       Sentence              Mary ate pizza with her hands
NP      Noun Phrase           Mary
VP      Verb Phrase           ate pizza with her hands
PP      Prepositional Phrase  with her hands
Det     Determiner            her
N       Noun                  hands
V       Verb                  ate
P       Preposition           with

Using the above symbols, we can encode the grammar and allow for variations such as “Bob ingested broccoli with his cutlery”.

# Context Free Grammar
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'Mary' | 'Bob'
VP -> V N PP
V -> 'ate' | 'ingested'
N -> 'pizza' | 'broccoli'
PP -> P Det N
P -> 'with' | 'using'
Det -> 'her' | 'his'
N -> 'hands' | 'cutlery'
""")

# Parsing
parser = nltk.ChartParser(grammar)
for sentence in ["Mary ate pizza with her hands",
                 "Mary ate broccoli with her cutlery",
                 "Bob ate broccoli with his cutlery"]:
    for tree in parser.parse_all(nltk.word_tokenize(sentence)):
        print(tree)
(S
  (NP Mary)
  (VP (V ate) (N pizza) (PP (P with) (Det her) (N hands))))
(S
  (NP Mary)
  (VP (V ate) (N broccoli) (PP (P with) (Det her) (N cutlery))))
(S
  (NP Bob)
  (VP (V ate) (N broccoli) (PP (P with) (Det his) (N cutlery))))
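Since the grammar above also includes the alternatives 'ingested' and 'using', we can check the variation mentioned earlier with a quick sketch:

# The grammar also covers 'ingested' and 'using', so this variation parses too
for tree in parser.parse_all(nltk.word_tokenize("Bob ingested broccoli using his cutlery")):
    print(tree)
# Expected parse:
# (S
#   (NP Bob)
#   (VP (V ingested) (N broccoli) (PP (P using) (Det his) (N cutlery))))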

Using WordNet

WordNet is a lexical database of English which groups nouns, verbs, adjectives, and adverbs into sets of cognitive synonyms called synsets, for which NLTK provides an interface.

Each synset expresses a distinct concept; synsets are interlinked by conceptual-semantic and lexical relations.

from nltk.corpus import wordnet as wn

Synonyms and Word Definitions

By default, wn.synsets() returns all the synsets for a given word, covering every possible part of speech (noun, verb, etc.):

wn.synsets('alternate')
[Synset('surrogate.n.01'),
 Synset('alternate.v.01'),
 Synset('alternate.v.02'),
 Synset('understudy.v.01'),
 Synset('interchange.v.04'),
 Synset('alternate.v.05'),
 Synset('alternate.s.01'),
 Synset('alternate.s.02'),
 Synset('alternate.s.03'),
 Synset('alternate.a.04')]

We can restrict the lookup to a specific part of speech. For example, below we obtain the synsets for alternate as a verb:

wn.synsets("alternate",pos=wn.VERB)
[Synset('alternate.v.01'),
 Synset('alternate.v.02'),
 Synset('understudy.v.01'),
 Synset('interchange.v.04'),
 Synset('alternate.v.05')]

We can then either select one of the Synset instances or instantiate a single one directly by name.

if wn.synsets("alternate",pos=wn.VERB)[0] == wn.synset('alternate.v.01'):
    print("equal objects")
equal objects

A word may have multiple definitions:

print("1. %s" % wn.synset('alternate.v.01').definition())
print("2. %s" % wn.synset('alternate.v.02').definition())
1. go back and forth; swing back and forth between two states or conditions
2. exchange people temporarily to fulfill certain jobs and functions
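Beyond definitions, each Synset also exposes its member lemmas and (where available) usage examples. A small sketch:

# Lemmas (synonymous word forms) grouped under the synset
print(wn.synset('alternate.v.01').lemma_names())

# Example sentences, where WordNet provides them (the list may be empty)
print(wn.synset('alternate.v.01').examples())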

Path Similarity and Lin Similarity

Path similarity is a measure of how related two word senses are in terms of how many ‘hops’ separate them in the WordNet hypernym/hyponym hierarchy. The metric is calculated as 1 / (1 + distance).

car = wn.synset("car.n.01")     # As in motor vehicle
bike = wn.synset("bike.n.01")   # As in motor vehicle
plane = wn.synset("plane.n.01") # As in aircraft

print(car.path_similarity(bike))
print(car.path_similarity(plane))
0.3333333333333333
0.1111111111111111

The path similarity for car and bike is 0.3333333333333333 because we need to go up to car’s hypernym which is motor_vehicle and then come down to bike which is one of its hyponyms, resulting in two hops:

         motor_vehicle
           | 1     | 2
          car     bike

Plugging 2 into the path similarity formula, we obtain:

1 / (1 + 2) = 1 / 3 = 0.3333333333333333

# Lowest Common Hypernym (Common ancestor)
print(car.lowest_common_hypernyms(bike))

# 1 hop to reach common ancestor
print(car.hypernyms())

# 1 hop to reach common ancestor
print(bike.hypernyms())
[Synset('motor_vehicle.n.01')]
[Synset('motor_vehicle.n.01')]
[Synset('motor_vehicle.n.01')]

The path similarity between car and plane is lower since they are more distant in the hierarchy. It takes 8 hops to reach car from plane and vice versa.

1 / (1 + 8) = 1 / 9 = 0.1111111111111111

# Lowest Common Hypernym (Common ancestor)
print(car.lowest_common_hypernyms(plane))

print("car's hypernyms")
# 4 hops to reach common ancestor
print(car.hypernyms())
print(car.hypernyms()[0].hypernyms())
print(car.hypernyms()[0].hypernyms()[0].hypernyms())
print(car.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms())

# 4 hops to reach common ancestor
print("plane's hypernyms")
print(plane.hypernyms())
print(plane.hypernyms()[0].hypernyms())
print(plane.hypernyms()[0].hypernyms()[0].hypernyms())
print(plane.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms())
[Synset('vehicle.n.01')]
car's hypernyms
[Synset('motor_vehicle.n.01')]
[Synset('self-propelled_vehicle.n.01')]
[Synset('wheeled_vehicle.n.01')]
[Synset('container.n.01'), Synset('vehicle.n.01')]
plane's hypernyms
[Synset('heavier-than-air_craft.n.01')]
[Synset('aircraft.n.01')]
[Synset('craft.n.02')]
[Synset('vehicle.n.01')]
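We can confirm the hop counts used in the formula above with shortest_path_distance(), which returns the number of edges separating two synsets in the hierarchy. A minimal sketch:

# Number of hops used by path_similarity: 1 / (1 + distance)
print(car.shortest_path_distance(bike))   # expected: 2
print(car.shortest_path_distance(plane))  # expected: 8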

Lin Similarity

The method lin_similarity()—according to NLTK’s documentation—returns a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

print(car.lin_similarity(bike, brown_ic))
print(car.lin_similarity(plane, brown_ic))
1.4546835708698286e-299
0.7194339072268571

Collocations and Distributional Similarity

If two words appear frequently in similar contexts, they are more likely to be semantically related.

We can find related (i.e., close) words as follows:

from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()

# Please note that stop words are unfiltered
finder = BigramCollocationFinder.from_words(nltk.word_tokenize(text_dream))

finder.apply_freq_filter(3) # Minimum frequency
for (p1,p2) in finder.nbest(bigram_measures.pmi,10):
    print(p1,p2)
at last
last !
With this
one day
this faith
I have
Let freedom
ring from
freedom ring
have a
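NLTK's Text class exposes distributional similarity directly through its similar() method, which lists words that occur in contexts similar to those of a given word. A minimal sketch using text1 (Moby Dick), which we brought into scope earlier with from nltk.book import *; the exact word list depends on the corpus, so it is not reproduced here.

# Words that occur in contexts similar to 'monstrous' in Moby Dick.
# The result is printed rather than returned.
text1.similar("monstrous")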

Case Study: Discovering News Topics

A topic is a group of related words that tend to appear together. Topic discovery is not about inferring topic names (e.g., sports, world affairs, etc.) per se, but about finding the keywords that are most representative of a given, numbered topic as opposed to another. Here we will use the news headlines from my previous blog post and turn each of them into a list of tokens.

import re
import feedparser
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def tokens(sentence):
    return [ w for w in nltk.word_tokenize(sentence)
             if w.lower() not in stop_words and
             re.match(r"[a-zA-Z]",w)]

def get_news(files):

  results = []

  for file in files:
    feed_uk    = feedparser.parse("../news_analysis/%s_uk.rss" % file)
    feed_world = feedparser.parse("../news_analysis/%s_world.rss" % file)
    results += ([tokens(e.title) for e in feed_uk.entries ] +
                [tokens(e.title) for e in feed_world.entries ])
  return results

doc_set = get_news(["guardian","bbc","daily","sky"])
doc_set[0:2]
[['Dover',
  'firebomb',
  'attack',
  'motivated',
  'terrorist',
  'ideology',
  'police',
  'say'],
 ['Sunak', 'vows', 'protect', 'mortgage', 'holders', 'says', 'everything']]

The second step is the creation of a dictionary (i.e., a map of integers to words) of unique tokens. Below, we create a dictionary for our set of documents, doc_set, and print the first five tokens:

import gensim
from gensim import corpora, models

dictionary = corpora.Dictionary(doc_set)
print("Tokens in dictionary: %d" % len(dictionary))
print("First five tokens:")
for (i,v) in list(dictionary.items())[0:5]:
    print(i,v)
Tokens in dictionary: 1095
First five tokens:
0 Dover
1 attack
2 firebomb
3 ideology
4 motivated
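The reverse lookup (word to integer) is available through the dictionary's token2id attribute. A quick sketch, consistent with the listing above:

# Map a token back to its integer id (the inverse of dictionary[i])
print(dictionary.token2id['attack'])  # expected: 1, as listed above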

The third step is the creation of a corpus in which tokens are represented by their dictionary indices rather than by strings, which is accomplished using the doc2bow() function:

corpus = [dictionary.doc2bow(doc) for doc in doc_set]
print("Documents in corpus: %d" % len(corpus))
Documents in corpus: 203

Let’s compare the first document in doc_set, and its shape after being converted to a bag of words:

print(doc_set[0])
print(corpus[0])
['Dover', 'firebomb', 'attack', 'motivated', 'terrorist', 'ideology', 'police', 'say']
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]

Let’s now look at a document in the middle of the set, say, document #100

print(doc_set[100])
print(corpus[100])
['Microsoft', 'co-founder', 'collection', 'poised', 'raise', 'largest', 'art', 'auction', 'history']
[(590, 1), (629, 1), (630, 1), (631, 1), (632, 1), (633, 1), (634, 1), (635, 1), (636, 1)]

Note that the token indices are no longer necessarily sequential. This is because some of the words were already ‘seen’ in earlier documents. Take token 590:

print(dictionary[590])
history

Token 590 already appeared in a previously seen document, document #87 to be precise:

for (i,doc) in enumerate(corpus):
    for (token_id,_) in doc:
        if token_id == 590:
            print(i, doc_set[i])
87 ['Imran', 'Khan', 'shooting', 'another', 'violent', 'moment', 'Pakistan', 'political', 'history']
100 ['Microsoft', 'co-founder', 'collection', 'poised', 'raise', 'largest', 'art', 'auction', 'history']

Number of Topics: A Chicken-and-Egg Problem

We now have everything we need to create a topic model. There are a number of arguments, which are described here, but we will use the defaults except for corpus (which is essentially the model’s training set), id2word (which allows mapping token identifiers back to their original words when we print results), and num_topics.

But what about num_topics? Who says that our corpus consists of exactly 10 topics?

ldamodel = gensim.models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=10, passes=50)

The truth is that we don’t know in advance. The argument num_topics is in reality a hyperparameter (among others); something we need to tune and experiment with to find the optimal value. In the example below, we try 1 to 10 topics and see which value results in the highest coherence score.

from gensim.models import CoherenceModel

print("Topics Coherence")
for num_topics in range(1,11):
    ldamodel = gensim.models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=num_topics, passes=50, random_state=0)
    top_topics = ldamodel.top_topics(corpus)
    coherencemodel = CoherenceModel(model=ldamodel, texts=doc_set, dictionary=dictionary, coherence='c_v')
    topic_coherence = coherencemodel.get_coherence()
    print("%d      %.4f" % (num_topics,topic_coherence))
Topics Coherence
1      0.6721
2      0.5846
3      0.6002
4      0.6056
5      0.5939
6      0.5630
7      0.5052
8      0.5801
9      0.5263
10      0.5485

A higher coherence score means that the keywords identified for each topic are more likely to appear together in documents related to that topic. Our results suggest that our corpus gravitates toward a single topic (which is true in the sense that all the headlines are news published in the UK), but the second-best value is four topics.

num_topics = 4

ldamodel = gensim.models.ldamodel.LdaModel(corpus, id2word = dictionary, num_topics=num_topics, passes=50, random_state=0)

Now we display the top five tokens (keywords) associated with each of the discovered topics, together with their weights. The weights indicate how influential each keyword is in determining its topic.

top_topics = ldamodel.top_topics(corpus)

for i,v in enumerate(top_topics):
    print("Topic %d:" % i)
    for (weight,keyword) in v[0][0:5]:
        print("%.9f %s" % (weight,keyword))
Topic 0:
0.008946272 accused
0.007531412 death
0.006100893 Sunak
0.006083373 Kyrgios
0.006083373 Nick
Topic 1:
0.008040487 war
0.006470365 crush
0.004992961 Kherson
0.004979674 Putin
0.004970739 Korea
Topic 2:
0.006141314 attack
0.006121390 UK
0.006106316 Russian
0.006099785 strikes
0.004675262 fire
Topic 3:
0.007932898 immigration
0.007920699 centre
0.006424925 US
0.006423592 UK
0.006422467 PM

As we can see, our corpus is perhaps too small to derive well-separated topics. For example, topics 1 and 2 seem to relate to the same subject (the war in Ukraine).

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
