Using NLP to Categorise News Headlines

Introduction

Sentiment analysis (e.g., whether people’s feelings are positive or negative toward a given product, policy, etc.) is a popular natural language processing (NLP) use case, which is why I decided to take a different approach and write an NLP tutorial based on news headline categorisation. The motivation is that the use case is as simple as sentiment analysis, but ‘real world’ data is much easier to get, since most news sites offer easy-to-parse RSS feeds.

In a nutshell, this tutorial presents a classification problem whereby there is a single feature consisting of a news headline, and a label which is True if the headline concerns local, UK news, or False if it concerns world news.

Our data set consists of just over 200 instances (203, to be exact), but it serves to demonstrate how easy the whole process is. The goal is not perfect scores, but gaining intuition about the key concepts and tools that help address a use case like this.

Imports and Boilerplate

External resources are always referred to using their respective module prefix, so we can import everything we need first. Feel free to skip ahead.

import feedparser
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from IPython.display import Markdown, display

News Sources

While there are multiple news outlets in the UK, I’ve selected a few that offer segregated UK and World news RSS feeds (among other categories), so that I didn’t have to do the labelling manually.

Although the RSS feeds may be fetched live, in real time, I saved them to local files in order to create a reproducible lab. For reference, the UK and World feeds from each outlet were fetched on 2022-11-04.

I saved each result using wget URL -O outlet_name_uk.rss and wget URL -O outlet_name_world.rss respectively. After that, I created a script to consolidate the results from all outlets into a single, uniform, labelled data set:

def get_news(files):
    results = []
    for file in files:
        feed_uk    = feedparser.parse("%s_uk.rss" % file)
        feed_world = feedparser.parse("%s_world.rss" % file)
        results += ([{"title": e.title, "UK": True}
                     for e in feed_uk.entries] +
                    [{"title": e.title, "UK": False}
                     for e in feed_world.entries])
    return results

df = pd.DataFrame(data=get_news(["guardian", "bbc", "daily", "sky"]))

Given that all of the outlets show pretty much the same number of ‘most recent’ stories for each category, we have the advantage of a balanced data set, where one label doesn’t crowd out the other.

1df["UK"].value_counts()
False    103
True     100
Name: UK, dtype: int64

Now, let’s look at some examples of UK stories:

df[df['UK']].head()

title UK
0 Dover firebomb attack motivated by terrorist i... True
1 Sunak vows to protect mortgage holders but say... True
2 Detainees protest during power cut at Harmonds... True
3 Rail passengers in Britain face disruption des... True
4 Reputation of UK politicians is at ‘low point’... True

And some examples of world (non-UK) stories:

df[~df['UK']].head()

title UK
49 Brazil, Indonesia and DRC in talks to form ‘Op... False
50 Who’s who at Cop27: the leaders who hold the w... False
51 Race against time for sick patients after Ethi... False
52 Cop27 host accuses countries of making empty p... False
53 Stop Eritrea’s ‘war-funding diaspora tax’, say... False

Training and Test Sets

Using the default split (75% train, 25% test), we will train our model on 152 instances and test it on 51.

X_train, X_test, y_train, y_test = train_test_split(
    df['title'], df['UK'], random_state=2)

print(len(X_train), len(X_test))
152 51
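
As an aside, since our labels are only roughly balanced (103 versus 100), a stratified split (a variation not used in the rest of this tutorial) would guarantee the same True/False proportions in both sets:

# Optional: a stratified split keeps the True/False ratio identical
# in the training and test sets (not used in the rest of the tutorial).
X_train, X_test, y_train, y_test = train_test_split(
    df['title'], df['UK'], random_state=2, stratify=df['UK'])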

Feature Engineering Approaches

In a text classification problem like this, the key challenge is the conversion of text into effective features; the choice of a specific model is a second-order concern. For this tutorial we simply use Logistic Regression, although Support Vector Machines (SVMs) or kNN are often more effective on larger data sets. As said before, we are not looking for perfect scores, but trying to understand how the process works.
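
To illustrate how little the surrounding code changes, here is a minimal sketch that swaps Logistic Regression for a linear SVM using a Scikit-Learn pipeline (a variation, not the approach followed in the rest of this tutorial):

from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Same bag-of-words features, different classifier; the rest of the
# workflow stays the same.
svm_pipeline = make_pipeline(CountVectorizer(), LinearSVC())
svm_pipeline.fit(X_train, y_train)
print(svm_pipeline.score(X_test, y_test))  # mean accuracy on the test set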

Count Vectorizer

This feature engineering approach consists of extracting the unique words from a corpus, creating a column for each of them, and placing in each column the number of times the word occurs (for short texts like headlines, this is almost always zero or one). Let us break this process into bite-sized steps.

First, we feed our raw features (X_train), to the vectorizer instance. As a result of invoking fit_transform() we obtain a new, transformed feature set, which we save in the X_train_vectorized variable.

vectorizer = CountVectorizer(ngram_range=(1,1)) # default but explicit for clarity
X_train_vectorized = vectorizer.fit_transform(X_train)

CountVectorizer applies a number of rules when extracting keywords. Consider the keywords extracted from the following two sentences:

CountVectorizer().fit(["Dover firebomb attack motivated by terrorist ideology, police say",
                       "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"]
                     ).get_feature_names_out()
array(['attack', 'but', 'by', 'can', 'do', 'dover', 'everything',
       'firebomb', 'he', 'holders', 'ideology', 'mortgage', 'motivated',
       'police', 'protect', 'say', 'says', 'sunak', 'terrorist', 'to',
       'vows'], dtype=object)

Notice:

  • mortgage holders is not extracted as a single term. This can be altered using the ngram_range parameter (more on this later)
  • can’t, like any other token containing punctuation, is split on the non-alphanumeric character: can becomes a keyword, and the single-character leftover t is dropped
  • sunak and dover are lower-cased
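
Each of these behaviours can be tuned via the vectoriser’s parameters. Here is a minimal sketch (parameter values chosen purely for illustration):

CountVectorizer(
    lowercase=False,               # keep 'Sunak' and 'Dover' capitalised
    token_pattern=r"[A-Za-z'’]+",  # keep apostrophes, so can’t survives as one token
    ngram_range=(1, 2),            # also extract two-word phrases such as 'mortgage holders'
)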

Let’s now look at the effective number of keywords extracted from X_train:

len(vectorizer.get_feature_names_out())
930

There are 930 keywords in total. Let’s look at the first and the last 15:

print(vectorizer.get_feature_names_out()[0:15]) # first 15
print(vectorizer.get_feature_names_out()[-15:]) # last 15
['000' '10' '11' '127' '13' '15' '2022' '2024' '25' '30' '30bn' '55' '70'
 '800' 'aboriginal']
['wonderwall' 'work' 'workers' 'world' 'worst' 'would' 'wounded' 'wrong'
 'ww2' 'yair' 'year' 'young' 'youngsters' 'your' 'youth']

But what is inside X_train_vectorized exactly? It is a so-called sparse matrix. A sparse matrix contains mainly zero values and can be encoded far more compactly than, say, a regular DataFrame. But don’t despair, let’s look at a concrete example.

Take the first item in X_train:

print(X_train.array[0])
The four-year-old boy with incredible maths skills

The corresponding ‘vectorized’ version of this first row consists of 930 columns, each representing a keyword, with the value 1 if the keyword appears in the sentence. Below, I print the total number of columns, and the values of the columns in the range 900…950 (the slice stops at 930, the array’s length; I picked this range just because there are two ones in there!)

print(len(X_train_vectorized.toarray()[0]))
print(X_train_vectorized.toarray()[0][900:950])
930
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
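
As the printout shows, the row is almost entirely zeros, which is exactly what the sparse encoding exploits. A quick way to quantify this (the exact density depends on the feed snapshot):

print(type(X_train_vectorized))   # a SciPy CSR sparse matrix
print(X_train_vectorized.shape)   # (152, 930): one row per headline, one column per keyword
density = X_train_vectorized.nnz / np.prod(X_train_vectorized.shape)
print("fraction of non-zero cells: %.4f" % density)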

Now, how do those zeros and ones relate to actual keywords? When we call vectorizer.get_feature_names_out(), we get an array with 930 keywords. The presence of a 1 in a ‘column’ means that the keyword at that column’s index is active:

# Original text
print(X_train.array[0])
# Vectorized encoded words
for i, word_on in enumerate(X_train_vectorized.toarray()[0]):
    if word_on == 1:
        print(i, vectorizer.get_feature_names_out()[i])
The four-year-old boy with incredible maths skills
126 boy
337 four
411 incredible
509 maths
576 old
754 skills
817 the
912 with
925 year

I said earlier that the whole text-analysis affair revolves around feature engineering. Now that we have ‘proper’ numerical features, we can easily apply a model. We should not forget that we always need to transform the ‘raw’ features using vectorizer.transform() before passing them down to our model of choice (LogisticRegression in this case):

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test  score: %2.8f' % roc_auc_score(y_test, y_test_pred))
AUC Train score: 0.96098491
AUC Test  score: 0.81018519

A test score of 0.81 is not impressive, but we also need to appreciate that ‘UK’ versus ‘World’ leaves plenty of room for ambiguity. Let us look at some of the false negatives and false positives:

test_df = pd.DataFrame(data={"title": X_test,
                             "UK (predicted)": y_test_pred,
                             "UK (actual)": y_test})
test_df[test_df["UK (predicted)"] != test_df["UK (actual)"]].sort_values(
    ["UK (predicted)", "UK (actual)"])

title UK (predicted) UK (actual)
130 Skomer: Manx shearwaters feed chicks plastic a... False True
6 Labour plans expansion of state nursery sector... False True
127 Driver and cyclist near misses caught on camera False True
2 Detainees protest during power cut at Harmonds... False True
3 Rail passengers in Britain face disruption des... False True
35 ‘It’s not just childcare’: focus on early year... False True
45 Rugby fans and a Churchill cake: Friday’s best... False True
172 Riot police race to immigration removal centre... False True
64 Malaysia’s 97-year-old former PM Mahathir Moha... True False
54 Tory-linked lobbying firm agreed to help swing... True False

Note the last story, which was labelled as World (non-UK) but was predicted as UK:

1test_df["title"].loc[54]
'Tory-linked lobbying firm agreed to help swing DRC election, leak suggests'

The Guardian had originally included this story in its World feed, but the model had learned that ‘Tory’ is a word more strongly associated with UK news. Is the story a world one because it concerns the Democratic Republic of Congo (DRC), or is it a UK story because the Conservatives are involved? Good question.
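
For borderline stories like this one, the model’s uncertainty can be inspected directly via predict_proba; a quick sketch (model.classes_ is ordered [False, True]):

headline = test_df["title"].loc[54]
proba = model.predict_proba(vectorizer.transform([headline]))[0]
print(dict(zip(model.classes_, proba)))  # probability of World (False) vs UK (True)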

It follows that it is useful to understand the degree to which each keyword influences the predictions. Below, we list the keywords with the smallest and largest weights:

def print_coefficients(is_tfidf, model):
    # Note: relies on the global `vectorizer` and `X_train_vectorized`.
    coefs = model.coef_[0]
    if is_tfidf:
        # For TF-IDF, rank terms by their maximum TF-IDF value across
        # the training set rather than by model coefficient.
        coefs = X_train_vectorized.max(0).toarray()[0]
    sorted_index = coefs.argsort()

    # Take the 11 lowest- and the 11 highest-ranked terms.
    r = [(coefs[sorted_index[i]],
          vectorizer.get_feature_names_out()[sorted_index[i]])
         for i in list(range(0, 11)) + list(range(len(coefs) - 11, len(coefs)))]

    s  = "Weight | Weight (Normalised) | Word\n"
    s += "-------|---------------------|----- \n"
    # Scale all weights to the 0..1 range for easier comparison.
    scaler = MinMaxScaler().fit([[coefs.max()], [coefs.min()]])
    for (weight, word) in r:
        normalised = scaler.transform([[weight]])[0][0]
        s += " %s | %s | `%s`\n" % (weight, normalised, word)

    display(Markdown(s))

print_coefficients(False, model)
| Weight | Weight (Normalised) | Word |
|--------|---------------------|------|
| -0.783382295591252 | 0.0 | ukraine |
| -0.6218376107055257 | 0.10124040508945176 | russian |
| -0.5753895381130733 | 0.1303495131248572 | cop27 |
| -0.5185752286201282 | 0.16595516435387803 | could |
| -0.5145059715943034 | 0.1685053765753387 | midterms |
| -0.47514424583918957 | 0.19317345564761862 | deal |
| -0.46031815676277643 | 0.20246499789210742 | netanyahu |
| -0.45987590077722 | 0.20274216067937362 | ukrainian |
| -0.45669830875852796 | 0.20473356450308317 | crush |
| -0.44776993824795497 | 0.21032899370534097 | war |
| -0.43308719369945214 | 0.2195307016226697 | no |
| 0.4074628232588574 | 0.7463052239475684 | living |
| 0.4204622442647869 | 0.75445198934254 | cost |
| 0.4228764224786277 | 0.7559649600756859 | peru |
| 0.42959285906651146 | 0.7601741654815831 | pictures |
| 0.456594321698718 | 0.7770960407932097 | tory |
| 0.4804256649756676 | 0.7920311954155415 | back |
| 0.4839314989481192 | 0.7942283091850129 | 10 |
| 0.4864743789572403 | 0.7958219376129578 | uk |
| 0.5895532939833902 | 0.8604217176726399 | british |
| 0.6267672635665634 | 0.8837437932844255 | will |
| 0.8122719764653672 | 1.0 | london |

It does make perfect sense that a story involving ukraine is most likely a world story, whereas one involving london is a UK one. We also see tory in the top 10. But what about pictures, or cost?

TF-IDF Vectorizer (Term Frequency–Inverse Document Frequency)

TF-IDF is a feature engineering approach for text analysis that, unlike the Count Vectorizer, goes beyond merely recording the absence or presence of keywords: each term is weighted by its frequency within a document (TF), scaled down by how common the term is across the whole corpus (IDF), so ubiquitous words contribute less than distinctive ones.
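
Concretely, with its defaults (smooth_idf=True), Scikit-Learn computes the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t; each term frequency is multiplied by this factor, and each row is then L2-normalised. A toy sketch of the effect:

def sklearn_idf(df_t, n_docs):
    # Scikit-Learn's smoothed idf: common terms score low, rare terms high.
    return np.log((1 + n_docs) / (1 + df_t)) + 1

print(sklearn_idf(df_t=150, n_docs=152))  # ~1.01 for a term in almost every headline
print(sklearn_idf(df_t=1, n_docs=152))    # ~5.34 for a term in a single headline

Let’s now swap CountVectorizer() with TfidfVectorizer() and see the results: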

vectorizer = TfidfVectorizer().fit(X_train)
X_train_vectorized = vectorizer.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test  score: %2.8f' % roc_auc_score(y_test, y_test_pred))
print_coefficients(True, model)
AUC Train score: 0.96098491
AUC Test  score: 0.81481481
| Weight | Weight (Normalised) | Word |
|--------|---------------------|------|
| 0.23789805765235938 | 0.0 | and |
| 0.245360893393626 | 0.018537038936173866 | boulton |
| 0.245360893393626 | 0.018537038936173866 | rays |
| 0.245360893393626 | 0.018537038936173866 | chance |
| 0.245360893393626 | 0.018537038936173866 | adam |
| 0.245360893393626 | 0.018537038936173866 | king |
| 0.245360893393626 | 0.018537038936173866 | hope |
| 0.245360893393626 | 0.018537038936173866 | greta |
| 0.24581980838395912 | 0.01967694410615628 | playing |
| 0.24581980838395912 | 0.01967694410615628 | re |
| 0.24581980838395912 | 0.01967694410615628 | if |
| 0.4330861148403412 | 0.4848302631079585 | disturbance |
| 0.4330861148403412 | 0.4848302631079585 | cause |
| 0.4354998371085913 | 0.49082574059384887 | hsbc |
| 0.4354998371085913 | 0.49082574059384887 | banking |
| 0.45730116079328686 | 0.5449783442440456 | riverboat |
| 0.5531552306331836 | 0.7830715686450066 | are |
| 0.5550873888010767 | 0.7878708825292756 | the |
| 0.5648881962241075 | 0.8122152410419593 | miss |
| 0.5685765002509814 | 0.8213766694486638 | air |
| 0.6384204638911735 | 0.9948630382153999 | who |
| 0.6404885558954915 | 0.9999999999999999 | no |

We get a marginal AUC score improvement, but what is most noticeable is the absence of the strong keywords, such as ukraine and london, that the Count Vectorizer surfaced at the bottom and the top of the table a few moments ago.

The least and most influential words make less sense now: and predicts world news, and no predicts UK news. In spite of this, the AUC score is slightly higher (0.81481481 for TF-IDF versus 0.81018519 for the Count Vectorizer).

N-Grams

N-grams are not a distinct feature engineering approach per se, but a parameter applicable to both the Count and TF-IDF vectorisers. Extracting single keywords as we have done so far is effective enough for ‘better than chance’ scores on small data sets, but it may be problematic for larger corpora.

For example, ‘Imran Khan’ is the former prime minister of Pakistan, whereas ‘Sadiq Khan’ is the mayor of London. The former correlates with World stories, the latter with UK ones. Using ‘Khan’ as a single feature may bias the model in either the UK or non-UK direction, depending on the frequency of stories in either category found in the training set.
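
A toy sketch of how bigrams keep the two Khans apart (both headlines invented for illustration):

CountVectorizer(ngram_range=(1, 2)).fit(
    ["Sadiq Khan unveils new transport plan",
     "Imran Khan addresses rally in Lahore"]
).get_feature_names_out()
# includes 'imran khan' and 'sadiq khan' as separate features,
# alongside the shared unigram 'khan'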

In the example below, we set ngram_range=(1,2) so that combinations of two adjacent words are considered alongside single words. This raises the number of features to 2336.

vectorizer = TfidfVectorizer(ngram_range=(1,2)).fit(X_train)
print("Features: %s" % len(vectorizer.get_feature_names_out()))
X_train_vectorized = vectorizer.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

y_train_pred = model.predict(vectorizer.transform(X_train))
y_test_pred = model.predict(vectorizer.transform(X_test))

print('AUC Train score: %2.8f' % roc_auc_score(y_train, y_train_pred))
print('AUC Test  score: %2.8f' % roc_auc_score(y_test, y_test_pred))
print_coefficients(True, model)
Features: 2336
AUC Train score: 0.96046471
AUC Test  score: 0.70370370
| Weight | Weight (Normalised) | Word |
|--------|---------------------|------|
| 0.16185835719176975 | 0.0 | and |
| 0.1680669310270251 | 0.019957857513223987 | their anger |
| 0.1680669310270251 | 0.019957857513223987 | at grassroots |
| 0.1680669310270251 | 0.019957857513223987 | we re |
| 0.1680669310270251 | 0.019957857513223987 | out racism |
| 0.1680669310270251 | 0.019957857513223987 | still ruining |
| 0.1680669310270251 | 0.019957857513223987 | playing |
| 0.1680669310270251 | 0.019957857513223987 | winning |
| 0.1680669310270251 | 0.019957857513223987 | winning good |
| 0.1680669310270251 | 0.019957857513223987 | grassroots level |
| 0.1680669310270251 | 0.019957857513223987 | grassroots |
| 0.3197346004992556 | 0.5075032772843385 | riverboat |
| 0.3197346004992556 | 0.5075032772843385 | tourists released |
| 0.3197346004992556 | 0.5075032772843385 | released from |
| 0.3197346004992556 | 0.5075032772843385 | peru tourists |
| 0.3197346004992556 | 0.5075032772843385 | detained riverboat |
| 0.3915741799375625 | 0.7384361981587014 | are |
| 0.39493326102252757 | 0.7492341790554434 | the |
| 0.41225891631573935 | 0.8049286056404393 | miss |
| 0.4178117011463031 | 0.8227783874728921 | air |
| 0.46164664360078156 | 0.9636886124007192 | who |
| 0.4729425420117875 | 0.9999999999999999 | no |

Note that for our particular data set the score does not improve when allowing bigrams; in fact, the test AUC drops from 0.81481481 to 0.70370370.
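
With only 152 training headlines, the additional bigram features appear to add more noise than signal. Rather than picking ngram_range by hand, one could let cross-validation choose; a sketch using a Pipeline and GridSearchCV:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Try unigrams, bigrams and trigrams, scoring each candidate by
# cross-validated AUC on the training set.
pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                     ("clf", LogisticRegression())])
grid = GridSearchCV(pipeline,
                    {"tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)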

To gain some more intuition into what n-grams do, see in the example below how terrorist ideology is extracted as a keyword, in addition to terrorist and ideology.

TfidfVectorizer(
  ngram_range=(1,2)
  ).fit(["Dover firebomb attack motivated by terrorist ideology, police say",
         "Sunak vows to protect mortgage holders but says he can’t ‘do everything’"]
       ).get_feature_names_out()
array(['attack', 'attack motivated', 'but', 'but says', 'by',
       'by terrorist', 'can', 'can do', 'do', 'do everything', 'dover',
       'dover firebomb', 'everything', 'firebomb', 'firebomb attack',
       'he', 'he can', 'holders', 'holders but', 'ideology',
       'ideology police', 'mortgage', 'mortgage holders', 'motivated',
       'motivated by', 'police', 'police say', 'protect',
       'protect mortgage', 'say', 'says', 'says he', 'sunak',
       'sunak vows', 'terrorist', 'terrorist ideology', 'to',
       'to protect', 'vows', 'vows to'], dtype=object)

Conclusion

In this tutorial we saw how to analyse text-based data to create predictions using two simple Scikit-Learn feature engineering tools: CountVectorizer and TfidfVectorizer.

TfidfVectorizer, together with n-grams higher than one, is known to provide better results in general, but in our selected example, based on 203 news stories, the improvement from TF-IDF was marginal, and raising ngram_range above the default (1,1) actually lowered the test score.

Last, it is worth mentioning that, when considering a large corpus of news headlines, further feature extraction (in addition to words) may help improve AUC scores. These are some hypotheses worth testing (a sketch of how such features could be computed follows the list):

  • Length: Do world headlines tend to be longer, perhaps because more context needs to be provided for the headline to make sense?
  • Numbers: Is the use of numbers (0, 1, 2, …) correlated with a headline being UK- or World-related, because numbers tend to refer to dates or events that are more prevalent in one type of headline?
  • Non-English Symbols: Does the use of non-English characters (ñ, ü, …) tilt the prediction toward world news?
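
Here is the promised sketch (extra_features is a hypothetical helper, not part of the tutorial’s pipeline):

import re

def extra_features(title):
    # Hypothetical auxiliary features to test the hypotheses above.
    return {
        "length":    len(title),                        # headline length in characters
        "has_digit": bool(re.search(r"\d", title)),     # contains a number
        "non_ascii": any(ord(c) > 127 for c in title),  # rough proxy for non-English symbols
    }

print(extra_features("Cop27 host accuses countries of making empty promises"))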

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
