Using NLP to Categorise News Headlines

Nov 13, 2022 data coding machine learning python

Introduction

Sentiment analysis (e.g., whether people’s feelings are positive or negative toward a given product, policy, etc.) is such a popular natural language processing (NLP) use case that I decided to take a different approach and write an NLP tutorial based on categorising news headlines instead. The motivation is that the use case is as simple as sentiment analysis, but ‘real world’ data is much easier to get, since most news sites offer easy-to-parse RSS feeds.

In a nutshell, this tutorial presents a classification problem in which there is a single feature consisting of a news headline, and a label which is True if the headline concerns local (UK) news, or False if it concerns world news.

Our data set consists of just over 200 instances, but it serves to demonstrate how easy the whole process is. The goal is not perfect scores, but gaining intuition about the key concepts and tools that help address a use case like this.

Imports and Boilerplate

External resources are always referred to using their respective module prefix, so we can import everything we need up front. Feel free to skip ahead.

import feedparser
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from IPython.display import Markdown, display
from sklearn.preprocessing import MinMaxScaler

News Sources

While there are multiple news outlets in the UK, I’ve selected a few that offer segregated UK and World news RSS feeds (among other categories), so that I didn’t have to do the labelling manually.

Although the RSS feeds may be fetched live, I saved them to local files in order to create a reproducible lab. For reference, these are the news outlets and respective RSS feeds used on 2022-11-04:

BBC: UK | World
The Guardian: UK | World
Sky: UK | World
Daily Express: UK | World

I saved each result using wget URL -O outlet_name_uk.rss and wget URL -O outlet_name_world.rss respectively. After that, I created a script to consolidate the results from all outlets into a single, uniform, labelled data set:

def get_news(files):
    """Build one labelled record per headline from the saved RSS files."""
    results = []
    for file in files:
        feed_uk    = feedparser.parse("%s_uk.rss" % file)
        feed_world = feedparser.parse("%s_world.rss" % file)
        # Entries from the UK feed are labelled True, World entries False
        results += ([{"title": e.title, "UK": True}  for e in feed_uk.entries] +
                    [{"title": e.title, "UK": False} for e in feed_world.entries])
    return results

df = pd.DataFrame(
    data=get_news(["guardian", "bbc", "daily", "sky"]))

Given that all of the outlets show pretty much the same number of ‘most recent’ stories for each category, we have the advantage of a balanced data set, where neither label crowds out the other.

df["UK"].value_counts()

False    103
True     100
Name: UK, dtype: int64

Now, let’s look at some examples of UK stories:

df[df['UK']].head()

	title	UK
0	Dover firebomb attack motivated by terrorist i...	True
1	Sunak vows to protect mortgage holders but say...	True
2	Detainees protest during power cut at Harmonds...	True
3	Rail passengers in Britain face disruption des...	True
4	Reputation of UK politicians is at ‘low point’...	True

And some examples of world (non-UK) stories:

df[~df['UK']].head()

	title	UK
49	Brazil, Indonesia and DRC in talks to form ‘Op...	False
50	Who’s who at Cop27: the leaders who hold the w...	False
51	Race against time for sick patients after Ethi...	False
52	Cop27 host accuses countries of making empty p...	False
53	Stop Eritrea’s ‘war-funding diaspora tax’, say...	False

Training and Test Sets

Using the default split (75% for training, 25% for testing), we will train our model on 152 instances, and test it on 51.

X_train, X_test, y_train, y_test = train_test_split(df['title'], df['UK'], random_state=2)

print(len(X_train), len(X_test))

152 51
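Our data set happens to be balanced, so a plain random split works well. As a hedged aside, the sketch below shows how the stratify parameter of train_test_split keeps the label ratio intact when the classes are uneven. The toy data frame is made up for illustration and is not part of the tutorial's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy data: 8 UK headlines and 4 world headlines
toy = pd.DataFrame({
    "title": ["headline %d" % i for i in range(12)],
    "UK": [True] * 8 + [False] * 4,
})

# stratify keeps the 2:1 label ratio in both the training and the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["title"], toy["UK"], test_size=0.25, stratify=toy["UK"], random_state=2)

print(len(X_tr), len(X_te))    # 9 3
print(y_tr.sum(), y_te.sum())  # 6 2 (number of UK headlines in each set)
```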

Feature Engineering Approaches

In a text classification problem like this, the key problem is the conversion of text into effective features. The choice of a specific model is a second-order concern. For this tutorial, we are simply using Logistic Regression, but Support Vector Machines (SVM) or kNN may be more effective with larger data sets. As said before, we are not looking for perfect scores, but trying to understand how the process works.

Count Vectorizer

This feature engineering approach consists of extracting the unique words from a corpus, creating a column for each of them, and storing in each column the number of times the word occurs in a given text. For short headlines, where words rarely repeat, this amounts to a zero or a one depending on whether the word is present or not. Let us break up this process into bite-sized steps.

First, we feed our raw features (X_train) to the vectorizer instance. As a result of invoking fit_transform() we obtain a new, transformed feature set, which we save in the X_train_vectorized variable.

vectorizer = CountVectorizer(ngram_range=(1,1)) # default but explicit for clarity
X_train_vectorized = vectorizer.fit_transform(X_train)
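Strictly speaking, CountVectorizer stores occurrence counts rather than presence flags; the two only differ when a word repeats within the same text. A minimal sketch of the difference, using a made-up two-sentence corpus and the binary=True option:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus in which one word repeats within a single text
corpus = ["no no no", "yes no"]

counts = CountVectorizer().fit_transform(corpus).toarray()
flags = CountVectorizer(binary=True).fit_transform(corpus).toarray()

# Columns follow the alphabetical vocabulary: 'no', 'yes'
print(counts.tolist())  # [[3, 0], [1, 1]]
print(flags.tolist())   # [[1, 0], [1, 1]]
```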

CountVectorizer applies a number of rules when it comes to keyword extraction. Consider the keywords extracted from the following two sentences:

CountVectorizer().fit(["Dover firebomb attack motivated by terrorist ideology, police say",
                       "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"]
                     ).get_feature_names_out()

array(['attack', 'but', 'by', 'can', 'do', 'dover', 'everything',
       'firebomb', 'he', 'holders', 'ideology', 'mortgage', 'motivated',
       'police', 'protect', 'say', 'says', 'sunak', 'terrorist', 'to',
       'vows'], dtype=object)

Notice:

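mortgage holders is not a single keyword. This can be altered using the ngram_range parameter (more on this later)
can’t is split at the apostrophe: can is kept, whereas the lone t is dropped because, by default, keywords need at least two characters. Punctuation is not included either
sunak and dover are in lower case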
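To see these rules in isolation, the sketch below calls build_analyzer(), which returns the function the vectorizer uses to turn a string into keywords. The sample sentence is made up, and the custom token_pattern in the second call is only there to show that the two-character minimum comes from the default pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer: lowercases, then keeps runs of 2+ word characters
analyze = CountVectorizer().build_analyzer()
tokens = analyze("He can’t do it, Sunak says")
print(tokens)      # ['he', 'can', 'do', 'it', 'sunak', 'says']

# A looser pattern that also accepts single-character tokens
analyze_all = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
tokens_all = analyze_all("He can’t do it, Sunak says")
print(tokens_all)  # ['he', 'can', 't', 'do', 'it', 'sunak', 'says']
```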

Let’s now look at the effective number of keywords extracted from X_train:

len(vectorizer.get_feature_names_out())

There are 930 keywords in total. Let’s look at the first and the last 15:

print(vectorizer.get_feature_names_out()[0:15]) # first 15
print(vectorizer.get_feature_names_out()[-15:]) # last 15

['000' '10' '11' '127' '13' '15' '2022' '2024' '25' '30' '30bn' '55' '70'
 '800' 'aboriginal']
['wonderwall' 'work' 'workers' 'world' 'worst' 'would' 'wounded' 'wrong'
 'ww2' 'yair' 'year' 'young' 'youngsters' 'your' 'youth']

But what is inside X_train_vectorized exactly? It is a so-called sparse matrix: a matrix that contains mainly zero values and stores only the non-zero ones, which is more efficient than, say, a regular DataFrame. But don’t despair, let’s look at a concrete example.
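As a quick aside on sparsity, the sketch below vectorizes a made-up three-headline corpus (not the tutorial data) and prints how many cells the matrix has versus how many are actually stored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up three-headline corpus, not the tutorial data
corpus = ["rail strike hits london",
          "ukraine war latest",
          "london mayor backs rail plan"]

X = CountVectorizer().fit_transform(corpus)

rows, cols = X.shape
print(type(X).__name__)       # a SciPy sparse matrix type
print(rows, cols)             # 3 10 (headlines x distinct words)
print(X.nnz)                  # 12 non-zero cells are actually stored
print(X.nnz / (rows * cols))  # density: 0.4
```

With 930 columns and headlines of around ten words each, the density of the real training matrix is far lower, which is where the sparse encoding pays off.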

Take the first item in X_train:

print(X_train.array[0])

The four-year-old boy with incredible maths skills

The corresponding ‘vectorized’ version of this first row consists of 930 columns, each representing a keyword, in which the value is 1 if the keyword is used in the sentence. Below, I print the total number of columns, and the values of the columns from index 900 onwards (just because there are two ones in there!). The slice asks for 900 to 950 but, since there are only 930 columns, 30 values are printed.

print(len(X_train_vectorized.toarray()[0]))
print(X_train_vectorized.toarray()[0][900:950])

930
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Now, how do those zeros and ones relate to actual keywords? When we call vectorizer.get_feature_names_out(), we get an array with 930 keywords. The presence of a 1 in a ‘column’ means that the keyword at that column’s index is active:

# Original text
print(X_train.array[0])
# Vectorized encoded words
for i,word_on in enumerate(X_train_vectorized.toarray()[0]):
    if word_on == 1:
        print(i,vectorizer.get_feature_names_out()[i])

The four-year-old boy with incredible maths skills
126 boy
337 four
411 incredible
509 maths
576 old
754 skills
817 the
912 with
925 year
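The same lookup can be done without a manual loop. The sketch below is one possible shortcut, using the vectorizer's inverse_transform() method on a made-up two-headline corpus that reuses the headline above; the variable names are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up two-headline corpus; the first headline is the one shown above
corpus = ["The four-year-old boy with incredible maths skills",
          "Rail passengers face disruption"]

vec = CountVectorizer()
X = vec.fit_transform(corpus)

# inverse_transform returns, for each row, the keywords with a non-zero entry
words = vec.inverse_transform(X[0])[0]
print(sorted(words))
# ['boy', 'four', 'incredible', 'maths', 'old', 'skills', 'the', 'with', 'year']
```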

I said that the whole text analysis affair revolves around feature engineering. Now that we have ‘proper’ numerical features, we can easily apply a model. We should not forget that we always need to transform the ‘raw’ features using vectorizer.transform() before passing them down to our model of choice, LogisticRegression in this case:

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test  score: %2.8f' % roc_auc_score(y_test, y_test_pred))

AUC Train score: 0.96098491
AUC Test  score: 0.81018519
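A side note on the metric: the scores above are computed from hard True/False predictions, which summarises the ROC curve at a single threshold. AUC is more commonly computed from predicted probabilities, which takes the full ranking into account. The sketch below illustrates the difference on made-up labels and probabilities; it does not use the tutorial's model.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for six headlines
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.45, 0.4, 0.6, 0.9]

# Hard labels obtained by thresholding the probabilities at 0.5,
# roughly what predict() does for a binary LogisticRegression
y_hard = [int(p >= 0.5) for p in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_hard)
print(round(auc_from_scores, 3))  # 0.889, uses the full ranking
print(round(auc_from_labels, 3))  # 0.833, uses a single threshold
```

With the objects defined in this tutorial, the probability of the True (UK) class would be model.predict_proba(vectorizer.transform(X_test))[:, 1]. I have kept the hard predictions in the rest of the post so that the printed scores stay as they were.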

A test score of about 0.81 is not impressive, but we also need to understand that ‘UK’ versus ‘World’ leaves plenty of room for ambiguity. Let us look at the false negatives and false positives:

test_df = pd.DataFrame(data = {"title" : X_test,
                               "UK (predicted)" : y_test_pred,
                               "UK (actual)" : y_test})
test_df[test_df["UK (predicted)"] != test_df["UK (actual)"]].sort_values(["UK (predicted)","UK (actual)"])

	title	UK (predicted)	UK (actual)
130	Skomer: Manx shearwaters feed chicks plastic a...	False	True
6	Labour plans expansion of state nursery sector...	False	True
127	Driver and cyclist near misses caught on camera	False	True
2	Detainees protest during power cut at Harmonds...	False	True
3	Rail passengers in Britain face disruption des...	False	True
35	‘It’s not just childcare’: focus on early year...	False	True
45	Rugby fans and a Churchill cake: Friday’s best...	False	True
172	Riot police race to immigration removal centre...	False	True
64	Malaysia’s 97-year-old former PM Mahathir Moha...	True	False
54	Tory-linked lobbying firm agreed to help swing...	True	False

Note the last story, which was labelled as World (non-UK) but was predicted as UK:

test_df["title"].loc[54]

'Tory-linked lobbying firm agreed to help swing DRC election, leak suggests'

The Guardian had originally included this story in its World feed, but the model had learned that ‘Tory’ is a word more strongly associated with UK news. Is the story a world one because it concerns the Democratic Republic of the Congo (DRC), or is it a UK story because the Conservatives are involved? Good question.

It follows that it is useful to understand the degree to which each keyword influences the predictions. Below, we list the keywords with the smallest and largest weights:

def print_coefficients(is_tfidf, model):
    # By default, rank keywords by the coefficient the model learned for them
    coefs = model.coef_[0]
    if is_tfidf:
        # Caution: this branch swaps the coefficients for each keyword's
        # maximum TF-IDF value in the training set, so the table then ranks
        # keywords by TF-IDF weight rather than by model coefficient
        coefs = X_train_vectorized.max(0).toarray()[0]
    sorted_index = coefs.argsort()
    names = vectorizer.get_feature_names_out()

    # Keep the 11 smallest and the 11 largest entries
    r = [(coefs[sorted_index[i]], names[sorted_index[i]])
         for i in list(range(0, 11)) + list(range(len(coefs) - 11, len(coefs)))]

    s =  "Weight | Weight (Normalised) | Word\n"
    s += "-------|---------------------|----- \n"
    # Rescale the weights to the 0..1 range for easier comparison
    scaler = MinMaxScaler().fit([[coefs.max()], [coefs.min()]])
    for (weight, word) in r:
        normalised = scaler.transform([[weight]])[0][0]
        s += " %s | %s | `%s`\n" % (weight, normalised, word)

    display(Markdown(s))

print_coefficients(False, model)

Weight	Weight (Normalised)	Word
-0.783382295591252	0.0	`ukraine`
-0.6218376107055257	0.10124040508945176	`russian`
-0.5753895381130733	0.1303495131248572	`cop27`
-0.5185752286201282	0.16595516435387803	`could`
-0.5145059715943034	0.1685053765753387	`midterms`
-0.47514424583918957	0.19317345564761862	`deal`
-0.46031815676277643	0.20246499789210742	`netanyahu`
-0.45987590077722	0.20274216067937362	`ukrainian`
-0.45669830875852796	0.20473356450308317	`crush`
-0.44776993824795497	0.21032899370534097	`war`
-0.43308719369945214	0.2195307016226697	`no`
0.4074628232588574	0.7463052239475684	`living`
0.4204622442647869	0.75445198934254	`cost`
0.4228764224786277	0.7559649600756859	`peru`
0.42959285906651146	0.7601741654815831	`pictures`
0.456594321698718	0.7770960407932097	`tory`
0.4804256649756676	0.7920311954155415	`back`
0.4839314989481192	0.7942283091850129	`10`
0.4864743789572403	0.7958219376129578	`uk`
0.5895532939833902	0.8604217176726399	`british`
0.6267672635665634	0.8837437932844255	`will`
0.8122719764653672	1.0	`london`

It does make perfect sense that a story involving ukraine is most likely a world story, whereas one involving london is a UK one. We also see tory in the top 10. But what about pictures, or cost?

TF-IDF Vectorizer (Term Frequency–Inverse Document Frequency)

TF-IDF is a feature engineering approach for text analysis that, unlike the Count Vectorizer, goes beyond merely counting keywords. Each keyword is weighted by how often it occurs in a headline, scaled down by the number of headlines that contain it, so words that appear almost everywhere (such as the or and) weigh less than rarer, more distinctive ones. Very rare or very common words can also be dropped altogether through the min_df and max_df parameters. Let’s swap CountVectorizer() with TfidfVectorizer() and see the results:

vectorizer = TfidfVectorizer().fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test  score: %2.8f' % roc_auc_score(y_test, y_test_pred))
print_coefficients(True,model)

AUC Train score: 0.96098491
AUC Test  score: 0.81481481

Weight	Weight (Normalised)	Word
0.23789805765235938	0.0	`and`
0.245360893393626	0.018537038936173866	`boulton`
0.245360893393626	0.018537038936173866	`rays`
0.245360893393626	0.018537038936173866	`chance`
0.245360893393626	0.018537038936173866	`adam`
0.245360893393626	0.018537038936173866	`king`
0.245360893393626	0.018537038936173866	`hope`
0.245360893393626	0.018537038936173866	`greta`
0.24581980838395912	0.01967694410615628	`playing`
0.24581980838395912	0.01967694410615628	`re`
0.24581980838395912	0.01967694410615628	`if`
0.4330861148403412	0.4848302631079585	`disturbance`
0.4330861148403412	0.4848302631079585	`cause`
0.4354998371085913	0.49082574059384887	`hsbc`
0.4354998371085913	0.49082574059384887	`banking`
0.45730116079328686	0.5449783442440456	`riverboat`
0.5531552306331836	0.7830715686450066	`are`
0.5550873888010767	0.7878708825292756	`the`
0.5648881962241075	0.8122152410419593	`miss`
0.5685765002509814	0.8213766694486638	`air`
0.6384204638911735	0.9948630382153999	`who`
0.6404885558954915	0.9999999999999999	`no`

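We get a marginal AUC score improvement (0.81481481 for TF-IDF versus 0.81018519 for the Count Vectorizer), but what is most noticeable is that keywords such as ukraine and london, which the Count Vectorizer table placed at the bottom and at the top, are gone from both ends.

Part of the reason is that the two tables do not measure the same thing. With is_tfidf set to True, print_coefficients() ranks each keyword by its highest TF-IDF value in the training set rather than by the model’s coefficient, so this table shows which keywords carry the least and the most TF-IDF weight, not which class they predict. That is why the extremes make less sense now: and sits at the bottom and no at the top, yet neither position says anything about UK versus world news.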

N-Grams

N-grams are not a distinct feature engineering approach per se, but a parameter applicable to both the Count and TF-IDF vectorizers. Extracting single keywords, as we have done so far, is enough to get ‘better than chance’ scores on small data sets, but it may be problematic for larger corpora.

For example, ‘Imran Khan’ is a former prime minister of Pakistan, whereas ‘Sadiq Khan’ is the mayor of London. The former tends to appear in World stories, whereas the latter tends to appear in UK ones. Using ‘Khan’ as a single feature may bias the model in either the UK or the non-UK direction, depending on how often stories in each category mention the name in the training set.

In the example below, we set ngram_range=(1,2) so that we can contemplate combinations of two words. This raises the number of features to 2336.

vectorizer = TfidfVectorizer(ngram_range=(1,2)).fit(X_train)
print("Features: %s" % len(vectorizer.get_feature_names_out()))
X_train_vectorized = vectorizer.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test  score: %2.8f' % roc_auc_score(y_test, y_test_pred))
print_coefficients(True,model)

Features: 2336
AUC Train score: 0.96046471
AUC Test  score: 0.70370370

Weight	Weight (Normalised)	Word
0.16185835719176975	0.0	`and`
0.1680669310270251	0.019957857513223987	`their anger`
0.1680669310270251	0.019957857513223987	`at grassroots`
0.1680669310270251	0.019957857513223987	`we re`
0.1680669310270251	0.019957857513223987	`out racism`
0.1680669310270251	0.019957857513223987	`still ruining`
0.1680669310270251	0.019957857513223987	`playing`
0.1680669310270251	0.019957857513223987	`winning`
0.1680669310270251	0.019957857513223987	`winning good`
0.1680669310270251	0.019957857513223987	`grassroots level`
0.1680669310270251	0.019957857513223987	`grassroots`
0.3197346004992556	0.5075032772843385	`riverboat`
0.3197346004992556	0.5075032772843385	`tourists released`
0.3197346004992556	0.5075032772843385	`released from`
0.3197346004992556	0.5075032772843385	`peru tourists`
0.3197346004992556	0.5075032772843385	`detained riverboat`
0.3915741799375625	0.7384361981587014	`are`
0.39493326102252757	0.7492341790554434	`the`
0.41225891631573935	0.8049286056404393	`miss`
0.4178117011463031	0.8227783874728921	`air`
0.46164664360078156	0.9636886124007192	`who`
0.4729425420117875	0.9999999999999999	`no`

Note that for our particular data set, the test score does not improve when setting ngram_range to (1,2); it actually drops from about 0.81 to 0.70.

To gain some more intuition into what n-grams do, see in the example below how terrorist ideology is extracted as a keyword, in addition to terrorist and ideology.

TfidfVectorizer(
  ngram_range=(1,2)
  ).fit(["Dover firebomb attack motivated by terrorist ideology, police say",
         "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"]
       ).get_feature_names_out()

array(['attack', 'attack motivated', 'but', 'but says', 'by',
       'by terrorist', 'can', 'can do', 'do', 'do everything', 'dover',
       'dover firebomb', 'everything', 'firebomb', 'firebomb attack',
       'he', 'he can', 'holders', 'holders but', 'ideology',
       'ideology police', 'mortgage', 'mortgage holders', 'motivated',
       'motivated by', 'police', 'police say', 'protect',
       'protect mortgage', 'say', 'says', 'says he', 'sunak',
       'sunak vows', 'terrorist', 'terrorist ideology', 'to',
       'to protect', 'vows', 'vows to'], dtype=object)
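Coming back to the two Khans, the sketch below shows how two-word combinations give each of them a feature of his own. The two headlines are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up headlines, for illustration only
corpus = ["Imran Khan addresses rally", "Sadiq Khan announces policy"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)

# With single words only, 'khan' is one feature shared by both headlines
print("khan" in unigrams.vocabulary_)        # True
print("imran khan" in unigrams.vocabulary_)  # False

# With two-word combinations, each Khan gets a feature of his own
print("imran khan" in bigrams.vocabulary_)   # True
print("sadiq khan" in bigrams.vocabulary_)   # True
```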

Conclusion

In this tutorial we saw how to analyse text-based data to create predictions using two simple feature engineering tools from scikit-learn: CountVectorizer and TfidfVectorizer.

TfidfVectorizer, together with n-grams longer than one, often provides better results on larger corpora, but in our example, based on about 200 news stories, single-word features performed better: both vectorizers scored about 0.81 on the test set with the default ngram_range, whereas adding two-word combinations lowered the TF-IDF score to 0.70.

Last, it is worth mentioning that, when considering a large corpus of news headlines, further feature extraction (in addition to words) may help improve AUC scores. These are some hypotheses worth testing:

Length: Do world news headlines tend to be longer, perhaps because more context needs to be provided for the headline to make sense?
Numbers: Is the use of numbers (0,1,2,..) correlated with the likelihood of a headline being either UK or World-related, because such numbers tend to refer to dates or events that are more prevalent in either type of headline?
Non-English Symbols: Does the use of non-English characters (ñ,ü,...) tilt the prediction toward world news?
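As a starting point for testing these hypotheses, the three features could be derived along these lines. This is only a sketch: extra_features is a hypothetical helper, the column names are my own, and the two headlines are made up.

```python
import pandas as pd

def extra_features(titles):
    """Hypothetical helper: derive three simple features per headline."""
    titles = pd.Series(titles)
    return pd.DataFrame({
        # Length: number of whitespace-separated words
        "n_words": titles.str.split().str.len(),
        # Numbers: does the headline contain at least one digit?
        "has_digit": titles.str.contains(r"\d"),
        # Non-English symbols: any character outside the ASCII range?
        "has_non_ascii": titles.map(lambda t: not t.isascii()),
    })

# Made-up headlines, for illustration only
features = extra_features(["Cop27 host accuses countries",
                           "El Niño returns"])
print(features)
```

One caveat with the ASCII check: typographic quotes such as ’ are also non-ASCII, so real headlines would need them normalised first for this feature to be meaningful.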
