Using NLP to Categorise News Headlines

Introduction
Sentiment analysis (e.g., whether people’s feelings are positive or negative toward a given product, policy, etc.) is a popular natural language processing (NLP) use case, which is why I decided to take a different approach and write an NLP tutorial based on news headline categorisation. The motivation is that the use case is as simple as sentiment analysis, but ‘real world’ data is much easier to get, since most news sites offer easy-to-parse RSS feeds.
In a nutshell, this tutorial presents a classification problem whereby there is a single feature consisting of a news headline, and a label which is True if the headline concerns local (UK) news, or False if it concerns world news.
Our data set consists of just over 200 instances, but it serves to demonstrate how easy the whole process is. The goal is not perfect scores, but gaining intuition about the key concepts and tools that help address a use case like this.
Imports and Boilerplate
External resources are always referenced using their module prefix, so we can import everything we need up front. Feel free to skip ahead.
import feedparser
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from IPython.display import Markdown, display  # display() is used later to render tables
from sklearn.preprocessing import MinMaxScaler
News Sources
While there are multiple news outlets in the UK, I’ve selected a few that offer segregated UK and World news RSS feeds (among other categories), so that I didn’t have to do the labelling manually.
Although the RSS URLs can be fetched live, I saved the feeds to local files on 2022-11-04 so that the lab is reproducible. I saved each UK and World feed using wget URL -O outlet_name_uk.rss and wget URL -O outlet_name_world.rss respectively. After that, I created a script to consolidate the results from all outlets into a single, uniform, labelled data set:
def get_news(files):
    """Build a labelled list of headlines from the saved <outlet>_uk.rss and <outlet>_world.rss files."""
    results = []

    for file in files:
        feed_uk = feedparser.parse("%s_uk.rss" % file)
        feed_world = feedparser.parse("%s_world.rss" % file)
        # UK feed entries are labelled True, World feed entries False
        results += ([{"title": e.title, "UK": True} for e in feed_uk.entries] +
                    [{"title": e.title, "UK": False} for e in feed_world.entries])
    return results

df = pd.DataFrame(
    data=get_news(["guardian", "bbc", "daily", "sky"]))
Given that all of the outlets show pretty much the same number of ‘most recent’ stories for each category, we have the advantage of a balanced data set, where neither label crowds out the other.
df["UK"].value_counts()
False 103
True 100
Name: UK, dtype: int64
Now, let’s look at some examples of UK stories:
df[df['UK']].head()
index | title | UK |
---|---|---|
0 | Dover firebomb attack motivated by terrorist i... | True |
1 | Sunak vows to protect mortgage holders but say... | True |
2 | Detainees protest during power cut at Harmonds... | True |
3 | Rail passengers in Britain face disruption des... | True |
4 | Reputation of UK politicians is at ‘low point’... | True |
And some examples of world (non-UK) stories:
df[~df['UK']].head()
index | title | UK |
---|---|---|
49 | Brazil, Indonesia and DRC in talks to form ‘Op... | False |
50 | Who’s who at Cop27: the leaders who hold the w... | False |
51 | Race against time for sick patients after Ethi... | False |
52 | Cop27 host accuses countries of making empty p... | False |
53 | Stop Eritrea’s ‘war-funding diaspora tax’, say... | False |
Training and Test Sets
Using the default split, we will train our model on 152 instances, and test it on 51.
X_train, X_test, y_train, y_test = train_test_split(df['title'], df['UK'], random_state=2)

print(len(X_train), len(X_test))
152 51
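The 152/51 figures follow from the default test_size of 0.25 applied to the 203 headlines. The arithmetic below is a sketch of how I understand the rounding to work (the test share is rounded up and the remainder is used for training); split_sizes is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Hypothetical helper: sizes of the train and test partitions."""
    n_test = math.ceil(test_size * n_samples)   # test share is rounded up
    n_train = n_samples - n_test                # the rest goes to training
    return n_train, n_test

# 103 world headlines + 100 UK headlines = 203 instances
print(split_sizes(203))
```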
Feature Engineering Approaches
In a text classification problem like this, the key problem is the conversion of text into effective features. The choice of a specific model is a second-order concern. For this tutorial, we are simply using Logistic Regression, but other models, such as Support Vector Machines (SVM) or kNN, may be more effective on larger data sets. As said before, we are not looking for perfect scores, but trying to understand how the process works.
Count Vectorizer
This feature engineering approach consists of extracting the unique words from a corpus, creating a column for each of them, and storing in each column the number of times the word appears in a given text. Since headlines are short, that number is almost always zero or one. Let us break up this process into bite-sized steps.
First, we feed our raw features (X_train) to the vectorizer instance. As a result of invoking fit_transform(), we obtain a new, transformed feature set, which we save in the X_train_vectorized variable.
vectorizer = CountVectorizer(ngram_range=(1,1)) # default but explicit for clarity
X_train_vectorized = vectorizer.fit_transform(X_train)
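To make the idea concrete, here is a pure-Python sketch of what a count-based vectorizer conceptually produces: a sorted vocabulary and one row of counts per text. It is a toy stand-in, not scikit-learn's implementation; count_vectorize and the two headlines are made up for this illustration, and words are split on whitespace only.

```python
from collections import Counter

def count_vectorize(docs):
    """Toy stand-in for a count vectorizer (illustration only)."""
    # Lower-case and split on whitespace; the real tokenizer is more careful
    tokenised = [doc.lower().split() for doc in docs]
    # One column per unique word, in alphabetical order
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        # A word missing from this text gets a count of zero
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = count_vectorize([
    "Rain rain and more rain in London",   # made-up headlines
    "More talks in Kyiv",
])
print(vocabulary)
for row in rows:
    print(row)
```

Note how a repeated word yields a count above one, which is why the columns hold counts rather than plain presence flags.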
CountVectorizer applies a number of rules when it comes to keyword extraction. Consider the keywords extracted from the following two sentences:
CountVectorizer().fit(["Dover firebomb attack motivated by terrorist ideology, police say",
                       "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"]
                      ).get_feature_names_out()
array(['attack', 'but', 'by', 'can', 'do', 'dover', 'everything',
'firebomb', 'he', 'holders', 'ideology', 'mortgage', 'motivated',
'police', 'protect', 'say', 'says', 'sunak', 'terrorist', 'to',
'vows'], dtype=object)
Notice:
- mortgage holders is not a single keyword. This can be altered using the ngram_range parameter (more on this later)
- neither can’t nor any punctuation is included; only can survives, because by default single-character tokens such as the trailing t are dropped
- sunak and dover are in lower case
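These rules can be reproduced with a few lines of standard-library Python. The sketch below assumes the regular expression shown is the vectorizer's default token pattern (tokens of two or more word characters) combined with lower-casing; tokenize is a hypothetical helper, not part of scikit-learn.

```python
import re

# Assumption: this mirrors the default token pattern of the vectorizer
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lower-case the text and keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

headline = "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"
tokens = tokenize(headline)
print(tokens)
```

The apostrophe splits can’t into can and t, and the single-character t then fails the two-character minimum.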
Let’s now look at the effective number of keywords extracted from X_train:
len(vectorizer.get_feature_names_out())
930
There are 930 keywords in total. Let’s look at the first and last 15:
print(vectorizer.get_feature_names_out()[0:15]) # first 15
print(vectorizer.get_feature_names_out()[-15:]) # last 15
['000' '10' '11' '127' '13' '15' '2022' '2024' '25' '30' '30bn' '55' '70'
'800' 'aboriginal']
['wonderwall' 'work' 'workers' 'world' 'worst' 'would' 'wounded' 'wrong'
'ww2' 'yair' 'year' 'young' 'youngsters' 'your' 'youth']
But what is inside X_train_vectorized exactly? It is a so-called sparse matrix. A sparse matrix contains mainly zero values, and can be encoded in a more efficient fashion than, say, regular Data Frames. But don’t despair, let’s look at a concrete example.
Take the first item in X_train:
print(X_train.array[0])
The four-year-old boy with incredible maths skills
The corresponding ‘vectorized’ version of this first row consists of 930 columns, each representing a keyword, in which the value is the number of times the keyword is used in the sentence (never more than 1 here). Below, I print the total number of columns, and the values of each column from 900 onward (just because there are two ones in there!). The slice asks for columns 900…950, but since there are only 930 columns, it returns 30 values.
print(len(X_train_vectorized.toarray()[0]))
print(X_train_vectorized.toarray()[0][900:950])
930
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Now, how do those zeros and ones relate to actual keywords? When we call vectorizer.get_feature_names_out(), we get an array with 930 keywords. The presence of a 1 in a ‘column’ means that the keyword in that column’s index is active:
# Original text
print(X_train.array[0])
# Vectorized encoded words
for i, word_on in enumerate(X_train_vectorized.toarray()[0]):
    if word_on == 1:
        print(i, vectorizer.get_feature_names_out()[i])
The four-year-old boy with incredible maths skills
126 boy
337 four
411 incredible
509 maths
576 old
754 skills
817 the
912 with
925 year
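As an aside on the sparse representation mentioned earlier, the listing above already hints at why it is efficient: only 9 of the 930 columns hold a value. The sketch below stores just those entries in a dictionary and rebuilds the dense row on demand. The dictionary layout and the to_dense helper are simplifications for illustration; the actual sparse matrix uses a more compact array-based format.

```python
N_COLUMNS = 930

# Column index -> count, copied from the listing above
sparse_row = {126: 1, 337: 1, 411: 1, 509: 1, 576: 1,
              754: 1, 817: 1, 912: 1, 925: 1}

def to_dense(sparse, n_columns):
    """Expand a {column: value} mapping into a full list of values."""
    dense = [0] * n_columns
    for column, value in sparse.items():
        dense[column] = value
    return dense

dense_row = to_dense(sparse_row, N_COLUMNS)
print(len(sparse_row), "stored values instead of", len(dense_row))
print(dense_row[900:950])  # slicing past the end simply stops at column 929
```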
I said that the whole text analysis affair revolves around feature engineering. Now that we have ‘proper’ numerical features, we can easily apply a model. We should not forget that we always need to transform the ‘raw’ features using vectorizer.transform() before passing them down to our model of choice, LogisticRegression in this case:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test score: %2.8f' % roc_auc_score(y_test, y_test_pred))
AUC Train score: 0.96098491
AUC Test score: 0.81018519
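One caveat about these scores: they are computed from the hard True/False labels returned by predict(). AUC is more commonly computed from continuous scores, such as the second column of model.predict_proba(), and with hard labels it reduces to the average of the true positive and true negative rates. The pure-Python sketch below illustrates the difference with made-up labels and probabilities; the auc function is a small pairwise implementation written for this example, not the library routine.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.55, 0.2, 0.1]
hard_labels = [int(p >= 0.5) for p in probabilities]  # what predict() would return

print(auc(y_true, probabilities))  # uses the full ranking
print(auc(y_true, hard_labels))    # only two distinct scores, so many ties
```

In this toy case the thresholded labels give a lower score than the probabilities, because thresholding discards the ranking information within each predicted class.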
A test AUC of about 0.81 is not impressive, but we also need to understand that ‘UK’ versus ‘World’ leaves plenty of room for ambiguity. Let us look at some of the false negatives and false positives:
test_df = pd.DataFrame(data = {"title" : X_test,
                               "UK (predicted)" : y_test_pred,
                               "UK (actual)" : y_test})
test_df[test_df["UK (predicted)"] != test_df["UK (actual)"]].sort_values(["UK (predicted)","UK (actual)"])
index | title | UK (predicted) | UK (actual) |
---|---|---|---|
130 | Skomer: Manx shearwaters feed chicks plastic a... | False | True |
6 | Labour plans expansion of state nursery sector... | False | True |
127 | Driver and cyclist near misses caught on camera | False | True |
2 | Detainees protest during power cut at Harmonds... | False | True |
3 | Rail passengers in Britain face disruption des... | False | True |
35 | ‘It’s not just childcare’: focus on early year... | False | True |
45 | Rugby fans and a Churchill cake: Friday’s best... | False | True |
172 | Riot police race to immigration removal centre... | False | True |
64 | Malaysia’s 97-year-old former PM Mahathir Moha... | True | False |
54 | Tory-linked lobbying firm agreed to help swing... | True | False |
Note the last story, which was labelled as World (non-UK) but was predicted as UK:
test_df["title"].loc[54]
'Tory-linked lobbying firm agreed to help swing DRC election, leak suggests'
The Guardian had originally included this story in its World feed, but the model had learned that ‘Tory’ is a word more strongly associated with UK news. Is the story a world one because it concerns the Democratic Republic of the Congo (DRC), or is it a UK story because the Conservatives are involved? Good question.
It follows that it is useful to understand the degree to which each keyword influences the predictions. Below, we list the keywords with the smallest and largest weights:
def print_coefficients(is_tfidf, model):
    # Relies on the notebook globals vectorizer and X_train_vectorized
    coefs = model.coef_[0]
    if is_tfidf:
        # Note: this replaces the model coefficients with the largest
        # TF-IDF value that each term reaches in the training set
        coefs = X_train_vectorized.max(0).toarray()[0]
    sorted_index = coefs.argsort()
    feature_names = vectorizer.get_feature_names_out()

    # Keep the 11 smallest and the 11 largest values
    r = [(coefs[sorted_index[i]], feature_names[sorted_index[i]])
         for i in list(range(0, 11)) + list(range(len(coefs) - 11, len(coefs)))]
    s = "Weight | Weight (Normalised) | Word\n"
    s += "-------|---------------------|----- \n"
    scaler = MinMaxScaler().fit([[coefs.max()], [coefs.min()]])
    for (weight, word) in r:
        normalised = scaler.transform([[weight]])[0][0]
        s += " %s | %s | `%s`\n" % (weight, normalised, word)

    display(Markdown(s))

print_coefficients(False, model)
Weight | Weight (Normalised) | Word |
---|---|---|
-0.783382295591252 | 0.0 | ukraine |
-0.6218376107055257 | 0.10124040508945176 | russian |
-0.5753895381130733 | 0.1303495131248572 | cop27 |
-0.5185752286201282 | 0.16595516435387803 | could |
-0.5145059715943034 | 0.1685053765753387 | midterms |
-0.47514424583918957 | 0.19317345564761862 | deal |
-0.46031815676277643 | 0.20246499789210742 | netanyahu |
-0.45987590077722 | 0.20274216067937362 | ukrainian |
-0.45669830875852796 | 0.20473356450308317 | crush |
-0.44776993824795497 | 0.21032899370534097 | war |
-0.43308719369945214 | 0.2195307016226697 | no |
0.4074628232588574 | 0.7463052239475684 | living |
0.4204622442647869 | 0.75445198934254 | cost |
0.4228764224786277 | 0.7559649600756859 | peru |
0.42959285906651146 | 0.7601741654815831 | pictures |
0.456594321698718 | 0.7770960407932097 | tory |
0.4804256649756676 | 0.7920311954155415 | back |
0.4839314989481192 | 0.7942283091850129 | 10 |
0.4864743789572403 | 0.7958219376129578 | uk |
0.5895532939833902 | 0.8604217176726399 | british |
0.6267672635665634 | 0.8837437932844255 | will |
0.8122719764653672 | 1.0 | london |
It does make perfect sense that a story involving ukraine is most likely a world story, whereas one involving london is a UK one. We also see tory in the top 10. But what about pictures, or cost?
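To get a feel for how these weights turn into a prediction, recall that logistic regression sums the weights of the words present and squashes the total through a sigmoid. The sketch below uses a handful of weights copied (rounded) from the table above. It makes two loud assumptions: the intercept is taken to be zero and every word not listed is given a weight of zero, so the numbers are illustrative only and will not match the trained model's actual probabilities.

```python
import math

# Weights copied (rounded) from the table above; all other words are treated as 0
WEIGHTS = {"ukraine": -0.7834, "war": -0.4478, "tory": 0.4566,
           "uk": 0.4865, "london": 0.8123}

def uk_probability(words, intercept=0.0):
    """Sigmoid of the summed word weights; intercept=0.0 is an assumption."""
    z = intercept + sum(WEIGHTS.get(word, 0.0) for word in words)
    return 1.0 / (1.0 + math.exp(-z))

# 'tory' pushes the first example toward UK even though 'drc' and 'election' are unknown here
print(round(uk_probability(["tory", "drc", "election"]), 3))
print(round(uk_probability(["ukraine", "war"]), 3))
```

Positive weights push the estimate above 0.5 (UK) and negative weights push it below, which is one way to read the misclassified DRC story discussed earlier.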
TF-IDF Vectorizer (Term Frequency–Inverse Document Frequency)
TF-IDF is a feature engineering approach for text analysis that, unlike Count Vectorizer, goes beyond merely counting keywords. It scales each count by how rare the word is across the corpus, so words that appear in many headlines are down-weighted relative to words that appear in only a few. Let’s swap CountVectorizer() with TfidfVectorizer() and see the results:
vectorizer = TfidfVectorizer().fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test score: %2.8f' % roc_auc_score(y_test, y_test_pred))
print_coefficients(True,model)
AUC Train score: 0.96098491
AUC Test score: 0.81481481
Weight | Weight (Normalised) | Word |
---|---|---|
0.23789805765235938 | 0.0 | and |
0.245360893393626 | 0.018537038936173866 | boulton |
0.245360893393626 | 0.018537038936173866 | rays |
0.245360893393626 | 0.018537038936173866 | chance |
0.245360893393626 | 0.018537038936173866 | adam |
0.245360893393626 | 0.018537038936173866 | king |
0.245360893393626 | 0.018537038936173866 | hope |
0.245360893393626 | 0.018537038936173866 | greta |
0.24581980838395912 | 0.01967694410615628 | playing |
0.24581980838395912 | 0.01967694410615628 | re |
0.24581980838395912 | 0.01967694410615628 | if |
0.4330861148403412 | 0.4848302631079585 | disturbance |
0.4330861148403412 | 0.4848302631079585 | cause |
0.4354998371085913 | 0.49082574059384887 | hsbc |
0.4354998371085913 | 0.49082574059384887 | banking |
0.45730116079328686 | 0.5449783442440456 | riverboat |
0.5531552306331836 | 0.7830715686450066 | are |
0.5550873888010767 | 0.7878708825292756 | the |
0.5648881962241075 | 0.8122152410419593 | miss |
0.5685765002509814 | 0.8213766694486638 | air |
0.6384204638911735 | 0.9948630382153999 | who |
0.6404885558954915 | 0.9999999999999999 | no |
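We get a marginal AUC score improvement (0.81481481 for TF-IDF versus 0.81018519 for the Count Vectorizer), but what is most noticeable is that keywords such as ukraine and london, which sat at the bottom and the top of the Count Vectorizer table a few moments ago, are gone.
The table needs to be read differently, though. When is_tfidf is True, print_coefficients() replaces the model’s coefficients with the largest TF-IDF value that each term reaches in the training set, so the weights above rank terms by TF-IDF score rather than by how strongly they predict either label. That is why every weight is positive: and sits at the low end, presumably because it appears in many headlines, and no sits at the high end, but neither position says anything about UK versus world news.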
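The sketch below shows how a TF-IDF value can be computed by hand. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by scaling each row to unit length, which is my understanding of the TfidfVectorizer defaults. The tfidf function and the three headlines are made up for illustration.

```python
import math

def tfidf(docs):
    """Toy TF-IDF: smoothed idf, then each row scaled to unit (L2) length."""
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    n = len(tokenised)
    # Document frequency: in how many texts does each word appear?
    df = {w: sum(w in toks for toks in tokenised) for w in vocabulary}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for toks in tokenised:
        raw = [toks.count(w) * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocabulary, idf, rows

vocabulary, idf, rows = tfidf([
    "rain and wind in london",       # made-up headlines
    "talks and protests in kyiv",
    "rain and floods",
])
for word in ("and", "rain", "kyiv"):
    print(word, round(idf[word], 4))
```

A word present in every text, such as and, gets the minimum idf of 1, while a word seen in a single text gets the largest.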
N-Grams
N-grams are not a distinct feature engineering approach per se, but a parameter applicable to both the Count and TF-IDF vectorisers. Extracting single keywords like we have done so far is good enough for ‘better than chance’ scores on small data sets, but it may be problematic for larger corpora.
For example, ‘Imran Khan’ is the former prime minister of Pakistan, whereas ‘Sadiq Khan’ is the mayor of London. The former corresponds with World stories, whereas the latter corresponds with UK ones. Using ‘Khan’ as a single feature may bias the model in either the UK or the non-UK direction, depending on the frequency of stories in either category found in the training set.
In the example below, we set ngram_range=(1,2) so that we also consider combinations of two consecutive words. This raises the number of features to 2336.
vectorizer = TfidfVectorizer(ngram_range=(1,2)).fit(X_train)
print("Features: %s" % len(vectorizer.get_feature_names_out()))
X_train_vectorized = vectorizer.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test score: %2.8f' % roc_auc_score(y_test, y_test_pred))
print_coefficients(True,model)
Features: 2336
AUC Train score: 0.96046471
AUC Test score: 0.70370370
Weight | Weight (Normalised) | Word |
---|---|---|
0.16185835719176975 | 0.0 | and |
0.1680669310270251 | 0.019957857513223987 | their anger |
0.1680669310270251 | 0.019957857513223987 | at grassroots |
0.1680669310270251 | 0.019957857513223987 | we re |
0.1680669310270251 | 0.019957857513223987 | out racism |
0.1680669310270251 | 0.019957857513223987 | still ruining |
0.1680669310270251 | 0.019957857513223987 | playing |
0.1680669310270251 | 0.019957857513223987 | winning |
0.1680669310270251 | 0.019957857513223987 | winning good |
0.1680669310270251 | 0.019957857513223987 | grassroots level |
0.1680669310270251 | 0.019957857513223987 | grassroots |
0.3197346004992556 | 0.5075032772843385 | riverboat |
0.3197346004992556 | 0.5075032772843385 | tourists released |
0.3197346004992556 | 0.5075032772843385 | released from |
0.3197346004992556 | 0.5075032772843385 | peru tourists |
0.3197346004992556 | 0.5075032772843385 | detained riverboat |
0.3915741799375625 | 0.7384361981587014 | are |
0.39493326102252757 | 0.7492341790554434 | the |
0.41225891631573935 | 0.8049286056404393 | miss |
0.4178117011463031 | 0.8227783874728921 | air |
0.46164664360078156 | 0.9636886124007192 | who |
0.4729425420117875 | 0.9999999999999999 | no |
Note that for our particular data set, the score does not improve when setting ngram_range to (1,2): the test AUC drops from 0.81481481 to 0.70370370.
To gain some more intuition into what n-grams do, see in the example below how terrorist ideology is extracted as a keyword, in addition to terrorist and ideology.
TfidfVectorizer(
    ngram_range=(1,2)
    ).fit(["Dover firebomb attack motivated by terrorist ideology, police say",
           "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"]
    ).get_feature_names_out()
array(['attack', 'attack motivated', 'but', 'but says', 'by',
'by terrorist', 'can', 'can do', 'do', 'do everything', 'dover',
'dover firebomb', 'everything', 'firebomb', 'firebomb attack',
'he', 'he can', 'holders', 'holders but', 'ideology',
'ideology police', 'mortgage', 'mortgage holders', 'motivated',
'motivated by', 'police', 'police say', 'protect',
'protect mortgage', 'say', 'says', 'says he', 'sunak',
'sunak vows', 'terrorist', 'terrorist ideology', 'to',
'to protect', 'vows', 'vows to'], dtype=object)
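The mechanics behind ngram_range=(1,2) fit in a few lines. The sketch below generates every run of one or two consecutive tokens and applies it to the Khan example from earlier; ngrams is a hypothetical helper and the two token lists are made-up headlines, already lower-cased and tokenised.

```python
def ngrams(tokens, min_n=1, max_n=2):
    """All runs of min_n..max_n consecutive tokens, joined by a space."""
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

imran = ngrams(["imran", "khan", "visits", "lahore"])    # made-up headlines
sadiq = ngrams(["sadiq", "khan", "visits", "croydon"])

print(sorted(set(imran) & set(sadiq)))  # features shared by both headlines
print(sorted(set(imran) - set(sadiq)))  # features unique to the first one
```

The single word khan is shared by both headlines, whereas the two-word features imran khan and sadiq khan keep them apart.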
Conclusion
In this tutorial we saw how to analyse text-based data to create predictions using two simple scikit-learn feature engineering tools: CountVectorizer and TfidfVectorizer.
TfidfVectorizer, together with n-grams longer than one word, is generally expected to provide better results, but in our selected example, based on about 200 news stories, the default single-word setting performed better: CountVectorizer and TfidfVectorizer both reached a test AUC of about 0.81 with single words, whereas adding two-word combinations to TfidfVectorizer lowered it to 0.70.
Last, it is worth mentioning that, when considering a large corpus of news headlines, further feature extraction (in addition to words) may help improve AUC scores. These are some hypotheses worth testing:
- Length: Do world news headlines tend to be longer, perhaps because more context needs to be provided for the headline to make sense?
- Numbers: Is the use of numbers (0,1,2,..) correlated with the likelihood of a headline being either UK or World-related, because such numbers tend to refer to dates or events that are more prevalent in either type of headline?
- Non-English Symbols: Does the use of non-English characters (ñ,ü,...) tilt the prediction toward world news?
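As a starting point for testing these hypotheses, the three candidate features could be extracted along the following lines. This is only a sketch: extra_features and its field names are hypothetical, length is measured in whitespace-separated words, and ‘non-English’ is approximated as any non-ASCII letter, so that typographic quotes such as ’ are not counted.

```python
def extra_features(headline):
    """Hypothetical extra features for one headline."""
    return {
        # Length in whitespace-separated words
        "length": len(headline.split()),
        # True if any digit appears anywhere in the headline
        "has_digit": any(ch.isdigit() for ch in headline),
        # True for letters outside ASCII (ñ, ü, ã, ...), ignoring punctuation
        "has_non_ascii_letter": any(ch.isalpha() and not ch.isascii()
                                    for ch in headline),
    }

print(extra_features("The four-year-old boy with incredible maths skills"))
print(extra_features("São Paulo floods leave 12 dead"))  # made-up headline
```

The resulting columns could then be placed alongside the vectorized words before fitting the model.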