Practical Graph Theory using Networkx


Introduction

Graphs are everywhere. Do you use Skyscanner, Trainline, Facebook, or LinkedIn? These are all, at heart, graph-browsing apps. Graphs are incredibly simple data structures; all you need is a bunch of connections—called edges in graph theory. For example, Bond Street connects to Oxford Circus, and Oxford Circus connects to Piccadilly Circus, therefore I can go from Bond Street to Piccadilly Circus.

But how do you work out that it is possible to reach Piccadilly Circus from Bond Street? Enter Networkx. Networkx is Python’s flagship graph manipulation library. It implements dozens of algorithms, from Dijkstra’s shortest path—this one would answer the aforementioned question—to the Adamic–Adar index for link prediction (i.e., whom should Facebook recommend to you as a friend).
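
To make this concrete, here is a minimal sketch (a toy three-station graph, not the real network) that answers the reachability question with Networkx:

import networkx as nx

G = nx.Graph()
G.add_edge('Bond Street', 'Oxford Circus')
G.add_edge('Oxford Circus', 'Piccadilly Circus')
print(nx.has_path(G, 'Bond Street', 'Piccadilly Circus')) # True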

In this by-example tutorial, I introduce the various types of graph structures, and the most popular algorithms for reasoning about graphs and solving practical problems with them.

I have also created a custom visualisation function based on Graphviz so that most examples actually display the graph under scrutiny. I believe that this approach results in a much better pedagogic experience than just describing graphs as mathematical objects. Furthermore, many tutorials rely on Matplotlib rather than Graphviz for visualisation, which produces less-than-stellar results.

Last, I based quite a few sample data structures on the lectures by Daniel Romero which, while abstract, help demonstrate the nuanced differences in the usefulness of certain algorithms and measurement approaches.

Base Imports and Boilerplate

These are all the imports used throughout the tutorial. It is best to introduce them first and then forget about them.

import networkx as nx
import pygraphviz as pgv
import matplotlib.pyplot as plt
from networkx.algorithms import bipartite
import pandas as pd
import warnings
from IPython.display import Image, Markdown, display
import sys
import os

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"

This is my custom graph visualisation function based on Graphviz. I will be calling this function in nearly all examples. Feel free to include it in your own projects.

def visualise(graph, hide=False, clusters=None, circular=False):
    layout = "circo" if circular else "dot"
    oneblock = True if circular else False
    G = pgv.AGraph(strict=False, directed=graph.is_directed(),
                   rankdir="LR", newrank="True", layout=layout, oneblock=oneblock)

    # Maps each node to the (sub)graph it should be drawn in
    targetGraph = {}

    if clusters is not None:
        for (label, cluster) in clusters:
            name = "cluster_%s" % label.lower().replace(" ", "_")
            subgraph = G.add_subgraph(name=name, label=label, labelloc="b", rank="same")
            for node in cluster:
                targetGraph[node] = subgraph
    else:
        for n in graph.nodes():
            targetGraph[n] = G

    # Render each node, appending its attributes to the label
    for (n, data) in graph.nodes(data=True):
        label = "%s " % n
        for attribute, value in data.items():
            if not hide:
                label += "\n%s = " % attribute
            label += "\n%s," % value
        if len(label) > 1:
            label = label[0:-1]
        targetGraph[n].add_node(n, label=label)

    # Render each edge, appending its attributes to the label
    for (u, v, data) in graph.edges(data=True):
        dir = 'forward' if graph.is_directed() else 'none'
        label = ""
        for attribute, value in data.items():
            if not hide:
                label += " %s = " % attribute
            label += "%s," % value
        if len(label) > 1:
            label = label[0:-1]
        G.add_edge(u, v, dir=dir, label=label)

    G.layout(prog="dot")
    display(Image(G.draw(format='png')))
    return

Fundamental Graph Structures

In this section we’ll cover the main types of graph data structures. Before we begin, we just need to master two words:

  • Edges: This is what a connection is called in graph theory. An edge is the same as a link, or a relationship. You may also call edges arrows, but that notion is rather specific to directed graphs, because arrows “point” somewhere.
  • Nodes: Nodes are simply the subjects in a connection. A node is the same as an object. If you have an edge, you have at least one node (if the edge points back to itself), or two nodes, if you are connecting two different subjects.

That’s all the terminology we need. Let’s begin.

Undirected Graph

In an undirected graph, the edges connect the nodes equally, without distinction in terms of from/to. There is no sense of direction. For example, when you say “Joe is married to Mary”, it would be a contradiction to say “but Mary is not married to Joe”. In an undirected graph, all relationships are mutual or two-way.

G = nx.Graph()
G.add_edge('A','B')
G.add_edge('A','C')
visualise(G)

Directed Graph

When people think about graphs, they are likely to think about a directed one. In a directed graph, the edges connect the nodes in an explicit direction, such as from node A to node B, but not from node B to node A—unless this connection is added as a new edge. In the visualisation, we typically use an arrow to denote the direction of each edge.

G = nx.DiGraph()
G.add_edge('A','B')
G.add_edge('A','C')
visualise(G)

Multigraph

A multigraph is a type of graph in which a pair of nodes can have multiple relationships simultaneously.

# There is also nx.MultiDiGraph() for directed graphs
G = nx.MultiGraph()
G.add_edge('Sun','Earth',type='orbiting_body')
G.add_edge('Sun','Earth',distance=1.0)
visualise(G)

Multigraphs introduce a problem: the identity of a given edge becomes ambiguous. Networkx implicitly assigns a key number to each new edge:

G = nx.MultiGraph()
G.add_edge('Sun','Earth',type='orbiting_body') # Key 0
G.add_edge('Sun','Earth',distance=1.0)         # Key 1
print(G['Sun']['Earth'][0])                    # Get edge with Key 0
print(G['Sun']['Earth'][1])                    # Get edge with Key 1
{'type': 'orbiting_body'}
{'distance': 1.0}

We may, however, specify that key number explicitly, so that we can consistently discriminate between different sets of edge types. Here below, for instance, we specify that the edge for type will have key=1, and the edge for distance will have key=0—effectively reversing the order seen in the last example.

G = nx.MultiGraph()
G.add_edge('Sun','Earth',type='orbiting_body', key=1) 
G.add_edge('Sun','Earth',distance=1.0        , key=0)
print(G['Sun']['Earth'][0])                    
print(G['Sun']['Earth'][1])                    
{'distance': 1.0}
{'type': 'orbiting_body'}

Specific attributes may be queried using the dictionary syntax or the underlying Networkx method get_edge_data():

print(G['Sun']['Earth'][1]['type'])
print(G.get_edge_data('Sun', 'Earth',key=1)['type'])
orbiting_body
orbiting_body

Disconnected Graphs

If some of the nodes contained in a graph cannot be reached from the others by navigating the defined edges, then the graph is said to be disconnected.

G = nx.Graph()
G.add_edges_from([(1,2),
                  (1,3),
                  (4,5),
                  (5,6)
                ])
visualise(G)
print("Connected: %s" % nx.is_connected(G)) 

Connected: False

We can also obtain the so-called connected components, which are the subsets of the input graph that are themselves connected graphs:

1print("Connected components #%d " % nx.number_connected_components(G))
2print("Components: %s" % list(nx.connected_components(G)))
3print("Component in which node 2 is found: %s" % list(nx.node_connected_component(G,2)))
Connected components #2 
Components: [{1, 2, 3}, {4, 5, 6}]
Component in which node 2 is found: [1, 2, 3]

Strongly Connected Directed Graphs

A directed graph is strongly connected if and only if every node in the graph is reachable from every other node. This is not true for the example below. This graph is weakly connected, though, given that if the graph were an undirected one, it would be fully connected.

Note that the nodes {3, 4, 6} are strongly connected. We can obtain the set of strongly connected nodes using the strongly_connected_components() method.

G = nx.DiGraph()
G.add_edges_from([(1,2),
                  (1,3),
                  (3,4),
                  (4,5),
                  (4,6),
                  (6,3)
                ])
visualise(G)
print("Strongly connected: %s" % nx.is_strongly_connected(G))
print("Weakly connected: %s" % nx.is_weakly_connected(G))
print("Connected components: %s" % list(nx.strongly_connected_components(G))) 

Strongly connected: False
Weakly connected: True
Connected components: [{2}, {5}, {3, 4, 6}, {1}]

Let’s now fix it by ensuring that every node can be reached from any other node:

G = nx.DiGraph()
G.add_edges_from([(1,2),
                  (2,4),
                  (1,3),
                  (3,4),
                  (4,5),
                  (5,6),
                  (4,6),
                  (6,1)
                ])
visualise(G)
print("Strongly connected: %s" % nx.is_strongly_connected(G))

Strongly connected: True

Use of Edge and Node Attributes

Weighted Graph

A weighted graph is not a fundamentally different graph type; it is one whose edges are associated with a weight attribute, which may be used to treat edges differently—e.g., as more or less costly—depending on their weight.

G = nx.Graph()
G.add_edge('A','B',weight=6)
G.add_edge('A','C',weight=12)
visualise(G,hide=True)
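
As a hint of how the weight attribute can be used (an illustrative addition, not part of the original example), we can read a weight back and compute a weight-aware path length with Dijkstra's algorithm:

print(G['A']['B']['weight'])                # 6
print(nx.dijkstra_path_length(G, 'B', 'C')) # 6 + 12 = 18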

Signed Graph

A signed graph is one in which the edges signal either a positive or negative relationship.

G = nx.Graph()
G.add_edge('A','B', sign='+')
G.add_edge('A','C', sign='-')
visualise(G,hide=True)

Multiple Edge Attributes

Edges may have one or more attributes associated with them.

# Distance is in astronomical units (AU)
G = nx.DiGraph()
G.add_edge('Sun','Earth', relation='planet',distance=1.0)
G.add_edge('Sun','Mercury', relation='planet',distance=0.387)
G.add_edge('Earth','Moon', relation='satellite',distance=0.002569)
visualise(G)

Node Attributes

Nodes, like edges, may also have attributes. In the example below, we add a mass attribute to each node.

G = nx.Graph()
G.add_edge('Sun','Mercury', distance=0.387)
G.add_edge('Sun','Venus'  , distance=0.722)
G.add_edge('Sun','Earth'  , distance=1.0)
# Node-wise attributes (in addition to edge-wise ones)
G.add_node('Sun',     mass=333000)
G.add_node('Mercury', mass=0.0553)
G.add_node('Venus',   mass=0.815)
G.add_node('Earth',   mass=1.0)
visualise(G)

Querying nodes’ attributes involves obtaining the nodes dictionary via nodes():

print(G.nodes())              # Show all nodes
print(G.nodes(data=True))     # Show all nodes with their attributes
print(G.nodes['Sun'])         # Show node 
print(G.nodes['Sun']['mass']) # Show node's attribute 
['Sun', 'Mercury', 'Venus', 'Earth']
[('Sun', {'mass': 333000}), ('Mercury', {'mass': 0.0553}), ('Venus', {'mass': 0.815}), ('Earth', {'mass': 1.0})]
{'mass': 333000}
333000

Bipartite Graphs

Bipartite graphs—also called bigraphs—are graphs whose nodes can be unambiguously divided into two disjoint sets U and V, also called bottom and top. Bipartite graphs are characterised by lacking odd cycles. Trees are an example of bipartite graphs.

Inductive Examples

1 Edge, 1 Node (No)

We require at least two distinct nodes to create two disjoint sets.

G = nx.Graph()
G.add_edges_from([(1,1)])
visualise(G)
display(Markdown("Bipartite: %s" % bipartite.is_bipartite(G)))

Bipartite: False

1 Edge, 2 Nodes (Yes)

This is the smallest possible bipartite graph.

G = nx.Graph()
G.add_edges_from([(1,2)])
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])
display(Markdown("Bipartite: %s" % bipartite.is_bipartite(G)))
display(Markdown("U = %s, V = %s" % (U, V)))

Bipartite: True

U = {1}, V = {2}

2 Edges, 3 Nodes as a Sequence (Yes)

We have not created an odd cycle.

G = nx.Graph()
G.add_edges_from([(1,2)])
G.add_edges_from([(2,3)])
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])
display(Markdown("Bipartite: %s" % bipartite.is_bipartite(G)))
display(Markdown("U = %s, V = %s" % (U, V)))

Bipartite: True

U = {1, 3}, V = {2}

2 Edges, 3 Nodes as a Tree (Yes)

We have not created an odd cycle.

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (1,3)
                ])
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])
display(Markdown("Bipartite: %s" % bipartite.is_bipartite(G)))
display(Markdown("U = %s, V = %s" % (U, V)))

Bipartite: True

U = {1}, V = {2, 3}

3 Edges, 3 Nodes as a Triad (No)

We have formed an odd cycle.

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (1,3),
                    (2,3)
                ])
visualise(G)
display(Markdown("Bipartite: %s" % bipartite.is_bipartite(G)))

Bipartite: False

4 Edges, 4 Nodes as a Square (Yes)

Here, the cycle between 1, 2, 3, and 4 forming a square may seem a strange candidate for a valid bipartite graph, but we can see that the odd and even numbers never connect directly, forming two separate disjoint sets.

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (2,3),
                    (3,4),
                    (4,1)
                ])

visualise(G)                
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])
display(Markdown("Bipartite: %s" % bipartite.is_bipartite(G)))
display(Markdown("U = %s, V = %s" % (U, V)))

Bipartite: True

U = {1, 3}, V = {2, 4}

Bipartite Node Set Membership

Given a set of nodes, we can check whether those nodes belong to one of the two disjoint sets found in a bipartite graph:

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (2,3),
                    (3,4),
                    (4,1)
                ])
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])

print(bipartite.is_bipartite_node_set(G,[1,3])) # True, nodes in set U
print(bipartite.is_bipartite_node_set(G,[2,4])) # True, nodes in set V
print(bipartite.is_bipartite_node_set(G,[2,1])) # False
print(bipartite.is_bipartite_node_set(G,[2,3])) # False

True
True
False
False

Projected Graph

A regular projected graph connects each input node to the nodes in the same set that are connected to it through a common node. In the below example, we try to match people who like the same food.

G = nx.Graph()
G.add_edges_from([  ('Mary','Pizza'),
                    ('Mary','Pasta'),
                    ('Mary','Chips'),
                    ('Pizza','Mike'),
                    ('Pasta','Joe'),
                    ('Joe','Chips'),
                ])
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])

display(Markdown("Those who like to eat what Joe likes"))
visualise(bipartite.projected_graph(G,['Joe']))

display(Markdown("Those who like to eat what Mary likes"))
visualise(bipartite.projected_graph(G,['Mary']))

display(Markdown("Those who like to eat what Mike and Mary like"))
visualise(bipartite.projected_graph(G,['Mike','Mary']))

Those who like to eat what Joe likes

Those who like to eat what Mary likes

Those who like to eat what Mike and Mary like

Weighted Projection

The weighted projection, relative to the regular one, adds an additional weight attribute to every edge, which reflects how many common neighbours each pair of projected nodes has.

In the below example, the edge ('Mary','Mike') has weight = 1 because they only have Pizza in common as a favourite food, whereas the edge ('Mary','Joe') has weight = 2 because they have both Pasta and Chips in common as favourite foods.

G = nx.Graph()
G.add_edges_from([  ('Mary','Pizza'),
                    ('Mary','Pasta'),
                    ('Mary','Chips'),
                    ('Pizza','Mike'),
                    ('Pasta','Joe'),
                    ('Joe','Chips'),
                ])
U,V = bipartite.sets(G)
visualise(G,clusters=[("U",U),("V",V)])
display(Markdown("Those who like to eat what Mary likes"))
visualise(bipartite.weighted_projected_graph(G,['Mary']))

Those who like to eat what Mary likes

Graph-Level Functions

Here we will look at some general facts that can be obtained about a graph.

General Level

These are queries that can be run on the graph as a whole, without specifying any particular edges or nodes.

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (1,3)
                ])
visualise(G)

display(Markdown("**Adjacency** - Each node's directly connected neighbours: `adjacency = %s`" %
                    list(G.adjacency())
                ))
display(Markdown("**Degree** - Number of edges adjacent to each node: `degree = %s`" % 
                    G.degree()
                ))
display(Markdown("**Order** - Number of nodes in the graph: `order = %s`" % 
                    G.order()
                ))
display(Markdown("**Size** - Number of edges in the graph: `size = %s`" % 
                    G.size()
                ))

Adjacency - Each node's directly connected neighbours: adjacency = [(1, {2: {}, 3: {}}), (2, {1: {}}), (3, {1: {}})]

Degree - Number of edges adjacent to each node: degree = [(1, 2), (2, 1), (3, 1)]

Order - Number of nodes in the graph: order = 3

Size - Number of edges in the graph: size = 2

Node Level

Queries on the graph relative to input node(s).

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (1,3)
                ])
visualise(G)

display(Markdown("**Neighbours** - Nodes connected to a given node. `neighbours(1) = %s`" 
                    % list(G.neighbors(1))
                ))
display(Markdown("**Nbunch** - Nodes that match those in the input set `nbunch_iter([1,2,5]) = %s`" 
                    % list(G.nbunch_iter([1,2,5]))
                ))

Neighbours - Nodes connected to a given node. neighbours(1) = [2, 3]

Nbunch - Nodes that match those in the input set nbunch_iter([1,2,5]) = [1, 2]

Clustering Coefficient

Local Clustering Coefficient

The local clustering coefficient of a node in a graph quantifies how close its neighbours are to being a complete graph.

A more intuitive way of seeing it is that your clustering coefficient determines the degree to which your friends are friends among themselves. This is calculated by dividing the number of friendships among your friends by the potential total of friendships there could be if they were all friends among themselves:

# Current friendships among your friends / # Total potential friendships

Let’s look at a concrete example.

Next we have a graph in which your direct friends are: Joe, Mary, Susan, and Daisy. In turn, Joe is friends with Mary, and Susan is friends with Daisy.

G = nx.Graph()
G.add_edges_from([('You','Joe'),
                  ('You','Mary'),
                  ('You','Susan'),
                  ('You','Daisy'),
                  ('Joe', 'Mary'),
                  ('Susan','Daisy'),
                ])
visualise(G)

Therefore, there are 2 friendships among your friends, but potentially, there could be a total of 6 friendships:

  • Joe and Mary (already friends)
  • Joe and Susan
  • Joe and Daisy
  • Susan and Mary
  • Susan and Daisy (already friends)
  • Mary and Daisy

This results in a coefficient of 2/6 = 1/3 = 0.333333…

nx.clustering(G,'You')
0.3333333333333333

Global Clustering Coefficient

A global clustering coefficient measures the tendency for all the nodes in a graph to cluster together. This is typically measured by calculating either their average clustering, or their transitivity:

  • Average clustering: The sum of all local clustering coefficients divided by the number of nodes.
  • Transitivity: The number of triangles multiplied by three, divided by the total number of triplets (paths of two edges).

It is easier to see these approaches in action.

Inductive Examples

2 Nodes, 1 Edge (Not Meaningful)

The denominator for transitivity (the number of triplets) is zero here, so Networkx simply returns 0.

G = nx.Graph()
G.add_edges_from([(1,2),
                ])
visualise(G)
display(Markdown("Average Clustering: %f" % nx.average_clustering(G)))
display(Markdown("Transitivity: %f" % nx.transitivity(G)))

Average Clustering: 0.000000

Transitivity: 0.000000

3 Nodes, 2 Edges (Complete Graph-1)

This results in zero values for both approaches.

G = nx.Graph()
G.add_edges_from([(1,2),
                  (1,3),
                ])
visualise(G)
display(Markdown("Average Clustering: %f" % nx.average_clustering(G)))
display(Markdown("Transitivity: %f" % nx.transitivity(G)))

Average Clustering: 0.000000

Transitivity: 0.000000

The calculations work as follows.

For average clustering:

(clustering(G,1) + clustering(G,2) + clustering(G,3)) / nodes(G)

= (0 + 0 + 0) / 3

= 0 / 3

= 0

For transitivity:

3 x triangles / triplets (the only triplet is 2-1-3)

= 3 x 0 / 1

= 0 / 1

= 0

3 Nodes, 3 Edges (Complete Graph)

This is a perfect triangle, and the smallest possible example of 100% clustering using both average clustering and transitivity.

G = nx.Graph()
G.add_edges_from([(1,2),
                  (1,3),
                  (2,3),
                ])
visualise(G)
display(Markdown("Average Clustering: %f" % nx.average_clustering(G)))
display(Markdown("Transitivity: %f" % nx.transitivity(G)))

Average Clustering: 1.000000

Transitivity: 1.000000

The calculations work as follows.

For average clustering:

(clustering(G,1) + clustering(G,2) + clustering(G,3)) / nodes(G)

= (1 + 1 + 1) / 3

= 3 / 3

= 1

For transitivity:

3 x triangles / triplets (the triplets are 2-1-3, 1-2-3, and 1-3-2)

= 3 x 1 / 3

= 3 / 3

= 1

4 Nodes, 4 Edges (Complete Graph-2)

Here we are missing two edges to complete the graph and we have one triangle.

G = nx.Graph()
G.add_edges_from([(1,2),
                  (1,3),
                  (1,4),
                  (2,3)
                ])
visualise(G)
display(Markdown("Average Clustering: %f" % nx.average_clustering(G)))
display(Markdown("Transitivity: %f" % nx.transitivity(G)))

Average Clustering: 0.583333

Transitivity: 0.600000

The calculations work as follows.

For average clustering:

(clustering(G,1) + clustering(G,2) + clustering(G,3) + clustering(G,4)) / nodes(G)

= (0.333333 + 1 + 1 + 0) / 4

= 2.333333 / 4

= 0.583333

For transitivity:

3 x triangles / triplets (the triplets are 2-1-3, 2-1-4, 3-1-4, 1-2-3, and 1-3-2)

= 3 x 1 / 5

= 3 / 5

= 0.6

4 Nodes, 5 Edges (Complete Graph-1)

Here we have all nodes connected except for (2,4). This configuration forms two triangles.

G = nx.Graph()
G.add_edges_from([(1,2),
                  (1,3),
                  (1,4),
                  (2,3),
                  (3,4)
                ])
visualise(G)
display(Markdown("Average Clustering: %f" % nx.average_clustering(G)))
display(Markdown("Transitivity: %f" % nx.transitivity(G)))

Average Clustering: 0.833333

Transitivity: 0.750000

The calculations work as follows.

For average clustering:

(clustering(G,1) + clustering(G,2) + clustering(G,3) + clustering(G,4)) / nodes(G)

= (0.666666 + 1 + 0.666666 + 1) / 4

= 3.333333 / 4

= 0.833333

For transitivity:

3 x triangles / triplets (the triplets are 2-1-3, 2-1-4, 3-1-4, 1-2-3, 1-3-2, 1-3-4, 2-3-4, and 1-4-3)

= 3 x 2 / 8

= 6 / 8

= 0.75

4 Nodes, 6 Edges (Complete Graph)

Four triangles emerge when connecting all the nodes in the graph.

G = nx.Graph()
G.add_edges_from([(1,2),
                  (1,3),
                  (1,4),
                  (2,3),
                  (3,4),
                  (2,4)
                ])
visualise(G)
display(Markdown("Average Clustering: %f" % nx.average_clustering(G)))
display(Markdown("Transitivity: %f" % nx.transitivity(G)))

Average Clustering: 1.000000

Transitivity: 1.000000

The calculations work as follows.

For average clustering:

(clustering(G,1) + clustering(G,2) + clustering(G,3) + clustering(G,4)) / nodes(G)

= (1 + 1 + 1 + 1) / 4

= 4 / 4

= 1.0

For transitivity:

3 x triangles / triplets (there are four triangles, and three triplets centred on each of the four nodes, giving 12 in total)

= 3 x 4 / 12

= 12 / 12

= 1.0

Note: There are more triplets than edges because an edge may be counted multiple times in the context of different triplets. For example, edge (2,3) contributes to triplets within the triangle (1,2,3) as well as within the triangle (2,3,4).
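
As a sanity check, we can recompute transitivity by hand for the complete graph above; this snippet is an illustrative addition:

from math import comb

triangles = sum(nx.triangles(G).values()) // 3      # each triangle is reported once per member node
triplets  = sum(comb(d, 2) for _, d in G.degree())  # paths of length two, centred on each node
print(triangles, triplets)                          # 4 12
print(3 * triangles / triplets)                     # 1.0, matching nx.transitivity(G)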

Distance

Here we will look at various operations concerning the absolute, relative, and average distance of node(s) in a graph.

We will use a subsection of the London Underground’s network to illustrate distance-wise operations.

G = nx.Graph()
G.add_edges_from([  ('Marble Arch','Bond Street',{"line" : "Central"}),
                    ('Bond Street','Oxford Circus',{"line" : "Central"}),
                    ('Bond Street','Green Park',{"line" : "Jubilee"}),
                    ('Oxford Circus','Green Park',{"line" : "Victoria"}),
                    ('Oxford Circus','Regent\'s Park',{"line" : "Bakerloo"}),
                    ('Oxford Circus','Piccadilly Circus',{"line" : "Bakerloo"}),
                    ('Oxford Circus','Tottenham Court Road',{"line" : "Central"}),
                    ('Green Park','Hyde Park Corner',{"line" : "Piccadilly"}),
                    ('Green Park','Piccadilly Circus',{"line" : "Piccadilly"}),
                    ('Green Park','Victoria',{"line" : "Victoria"}),
                    ('Piccadilly Circus','Charing Cross',{"line" : "Bakerloo"})
            ])
visualise(G,hide=True)

Shortest Path

This is the most intuitive graph operation. It provides the sequence of nodes—and hence the number of edges—that sit between two nodes.

In the example below we pose the following questions:

  • Stations in the shortest path: What is the minimum set of tube stations that I need to go through to get from Marble Arch to Charing Cross?
  • Number of stations: What is the total number of stations that I need to go through?
  • Line changes: How many underground line changes are required?
visualise(G,hide=True)

# Stations in the shortest path
stations = nx.shortest_path(G,'Marble Arch','Charing Cross')
print("Stations: %s" % stations)

# Number of stations
distance = nx.shortest_path_length(G,'Marble Arch','Charing Cross') + 1
print("Total: %d" % distance)

# Line changes
changes = set([ G[stations[i]][stations[i+1]]['line'] for i in range(0,len(stations)-1) ])
print("Line changes: %d %s" % (len(changes), changes))

Stations: ['Marble Arch', 'Bond Street', 'Oxford Circus', 'Piccadilly Circus', 'Charing Cross']
Total: 5
Line changes: 2 {'Bakerloo', 'Central'}

Breadth-first Search (BFS) vs Depth-first Search (DFS)

The BFS algorithm systematically visits all the nodes surrounding the initial node before moving on to the next node, whereas the DFS algorithm follows each path as far as it can. Here below we see the differences in the algorithms’ search strategies, starting from Oxford Circus.

visualise(G,hide=True)
station = "Oxford Circus"
bfsEdges = list(nx.bfs_tree(G,station).edges())
dfsEdges = list(nx.dfs_tree(G,station).edges())
s  = "BFS | DFS\n"
s += "----|----\n"
for i in range(0,len(bfsEdges)):
    s += "%s | %s \n" % (bfsEdges[i],dfsEdges[i])
display(Markdown(s))

BFS | DFS
----|----
('Oxford Circus', 'Bond Street') | ('Oxford Circus', 'Bond Street')
('Oxford Circus', 'Green Park') | ('Oxford Circus', "Regent's Park")
('Oxford Circus', "Regent's Park") | ('Oxford Circus', 'Tottenham Court Road')
('Oxford Circus', 'Piccadilly Circus') | ('Bond Street', 'Marble Arch')
('Oxford Circus', 'Tottenham Court Road') | ('Bond Street', 'Green Park')
('Bond Street', 'Marble Arch') | ('Green Park', 'Hyde Park Corner')
('Green Park', 'Hyde Park Corner') | ('Green Park', 'Piccadilly Circus')
('Green Park', 'Victoria') | ('Green Park', 'Victoria')
('Piccadilly Circus', 'Charing Cross') | ('Piccadilly Circus', 'Charing Cross')

Please note that due to Networkx’s DFS implementation, and the shape of our data set, the DFS search does not quite match the intuition whereby we would expect ('Oxford Circus', 'Bond Street') to be followed by ('Bond Street', 'Marble Arch').

Please also note that weighted models—ours is unweighted—may benefit from a more advanced algorithm like Dijkstra’s.

Graph-level Distance Stats and Queries

visualise(G,hide=True)
s = "Distance stats: \n\n"
s += ("* **Average distance** between nodes: %2.2f\n" % 
        nx.average_shortest_path_length(G)
    )
s += ("* **Eccentricity** is the largest distance between node n and all other nodes: %s\n" % 
        nx.eccentricity(G)
    )    
s += ("* **Diameter** is the maximum distance between any pair of nodes: %2.2f\n" % 
        nx.diameter(G)
    )
s += ("* **Radius** is the minimum eccentricity: %2.2f\n" % 
        nx.radius(G)
    )
s += ("* **Periphery** is the set of nodes that have eccentricity equal to the diameter: %s\n" % 
        nx.periphery(G)
    )
s += ("* **Center** is the set of nodes that have eccentricity equal to the radius: %s\n" % 
        nx.center(G)
    )
display(Markdown(s))

Distance stats:

  • Average distance between nodes: 2.11
  • Eccentricity is the largest distance between node n and all other nodes: {'Marble Arch': 4, 'Bond Street': 3, 'Oxford Circus': 2, 'Green Park': 2, "Regent's Park": 3, 'Piccadilly Circus': 3, 'Tottenham Court Road': 3, 'Hyde Park Corner': 3, 'Victoria': 3, 'Charing Cross': 4}
  • Diameter is the maximum distance between any pair of nodes: 4.00
  • Radius is the minimum eccentricity: 2.00
  • Periphery is the set of nodes that have eccentricity equal to the diameter: ['Marble Arch', 'Charing Cross']
  • Center is the set of nodes that have eccentricity equal to the radius: ['Oxford Circus', 'Green Park']

Robustness

In the previous section we looked at the distance property, both relative to nodes and at the graph level. Graph robustness concerns the implications of network ‘failure’—in other words, the disconnection or removal of nodes.

We will use the same graph as before, which is a subsection of the London underground network.

G = nx.Graph()
G.add_edges_from([  ('Marble Arch','Bond Street',{"line" : "Central"}),
                    ('Bond Street','Oxford Circus',{"line" : "Central"}),
                    ('Bond Street','Green Park',{"line" : "Jubilee"}),
                    ('Oxford Circus','Green Park',{"line" : "Victoria"}),
                    ('Oxford Circus','Regent\'s Park',{"line" : "Bakerloo"}),
                    ('Oxford Circus','Piccadilly Circus',{"line" : "Bakerloo"}),
                    ('Oxford Circus','Tottenham Court Road',{"line" : "Central"}),
                    ('Green Park','Hyde Park Corner',{"line" : "Piccadilly"}),
                    ('Green Park','Piccadilly Circus',{"line" : "Piccadilly"}),
                    ('Green Park','Victoria',{"line" : "Victoria"}),
                    ('Piccadilly Circus','Charing Cross',{"line" : "Bakerloo"})
            ])
visualise(G,hide=True)

Simple Paths

Let’s first understand how commuters use the network. Say that a commuter wants to go from Marble Arch to Charing Cross, what options do they have? We can use all_simple_paths() to answer this question.

Here below, we also sort the list by the number of stations required to reach the destination.

sorted([ (len(path),path) for path in nx.all_simple_paths(G,'Marble Arch','Charing Cross') ])
[(5,
  ['Marble Arch',
   'Bond Street',
   'Green Park',
   'Piccadilly Circus',
   'Charing Cross']),
 (5,
  ['Marble Arch',
   'Bond Street',
   'Oxford Circus',
   'Piccadilly Circus',
   'Charing Cross']),
 (6,
  ['Marble Arch',
   'Bond Street',
   'Green Park',
   'Oxford Circus',
   'Piccadilly Circus',
   'Charing Cross']),
 (6,
  ['Marble Arch',
   'Bond Street',
   'Oxford Circus',
   'Green Park',
   'Piccadilly Circus',
   'Charing Cross'])]

Graph Disconnection

Let’s stay that your are TFL (Transport for London) administrator, and striking personnel are planning to leave some stations unattended—i.e., they must be closed during the strike.

Which station(s) would cause the most disruption, if closed?

For our particular subsection of the London Underground network, closing only one station is all that is required to prevent some commuters from reaching their destinations:

nx.node_connectivity(G)
1

We can find a single station whose closure disconnects the network—though this is not necessarily the only one.

nx.minimum_node_cut(G)
{'Green Park'}

And what stations would be disconnected as a result?

G_strike = G.copy()
G_strike.remove_nodes_from(nx.minimum_node_cut(G))
visualise(G_strike)

Now, what if we wanted to reason about the disruption caused to a commuter who travels from Marble Arch to Charing Cross? To get the answer, we just need to provide the nodes that represent these two stations to minimum_node_cut():

nx.minimum_node_cut(G,'Marble Arch','Charing Cross')
{'Piccadilly Circus'}

Let’s now consider a scenario in which engineering works are about to take place in one of the network’s tunnels (represented by edges).

What is the minimum number of segments required to disconnect the network?

nx.edge_connectivity(G)
1

And which are the segment(s)?

nx.minimum_edge_cut(G)
{('Green Park', 'Hyde Park Corner')}

What about the edges that would disrupt a commuter who needs to go from Marble Arch to Charing Cross?

nx.minimum_edge_cut(G,'Marble Arch','Charing Cross')
{('Piccadilly Circus', 'Charing Cross')}

As we can see, the above functions return minimal sets of nodes and edges, respectively, whose removal disconnects the graph, rather than those that would provoke the most disruption overall.
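
If we instead wanted the most disruptive station overall, a brute-force sketch (an illustrative addition, not one of the functions above) is to close each station in turn and count how many stations end up cut off from the largest remaining component:

def disruption(G, station):
    H = G.copy()
    H.remove_node(station)
    # Stations left outside the largest remaining component
    largest = max(nx.connected_components(H), key=len)
    return H.number_of_nodes() - len(largest)

print(sorted(((disruption(G, s), s) for s in G.nodes()), reverse=True)[:3])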

Centrality

Centrality concerns the ranking of nodes in a graph depending on their position. We will use the following graph to discuss various centrality metrics.

G = nx.Graph()
G.add_edges_from([('A','D'),
                  ('B','D'),
                  ('C','D'),
                  ('D','E'),
                  ('E','F'),
                  ('E','G')])
visualise(G)

Degree Centrality (Undirected Graphs)

Degree centrality scores each node by its degree—i.e., its number of connections—divided by the number of nodes minus one (the node in hand is excluded from the count).

The formula, applied to E, works as follows:

Degree Centrality(G,N) = Degree(N) / (#Nodes(G) - 1)

= 3 / (7 - 1)

= 3 / 6

= 0.5

nx.degree_centrality(G)
{'A': 0.16666666666666666,
 'D': 0.6666666666666666,
 'B': 0.16666666666666666,
 'C': 0.16666666666666666,
 'E': 0.5,
 'F': 0.16666666666666666,
 'G': 0.16666666666666666}

Degree Centrality (Directed Graphs)

In the case of a directed graph, the direction of the edges—let’s call them arrows in this section—is meaningful, so we need to decide, for each node, whether to count the inward arrows (those that point to the node in hand) or the outward arrows (those that originate from the node in hand).

The formula is like before: we divide the degree by the number of nodes minus one (excluding the node in hand from the count). What changes is that the degree is calculated either for the inward or the outward arrows. Let’s see examples for both cases.

G = nx.DiGraph()
G.add_edges_from([('A','D'),
                  ('B','D'),
                  ('C','D'),
                  ('D','E'),
                  ('E','F'),
                  ('E','G')])
visualise(G)

In-Degree

If we are counting the inward arrows, then we are calculating the in-degree. In the case of D we have 3 inward arrows, and the total number of nodes is 7. So, its score is: 3 / (7-1) = 3 / 6 = 0.5.

nx.in_degree_centrality(G)
{'A': 0.0,
 'D': 0.5,
 'B': 0.0,
 'C': 0.0,
 'E': 0.16666666666666666,
 'F': 0.16666666666666666,
 'G': 0.16666666666666666}

Out-Degree

If we are counting the outward arrows, then we are calculating the out-degree. In the case of D we have 1 outward arrow, and the total number of nodes is 7. So, its score is: 1 / (7-1) = 1 / 6 = 0.166666….

nx.out_degree_centrality(G)
{'A': 0.16666666666666666,
 'D': 0.16666666666666666,
 'B': 0.16666666666666666,
 'C': 0.16666666666666666,
 'E': 0.3333333333333333,
 'F': 0.0,
 'G': 0.0}

Closeness Centrality (Undirected Graphs)

Closeness centrality—normalised unless otherwise stated—of a node is the reciprocal of the average length of the shortest paths between the node and all other nodes in the graph. Whereas degree only considers the direct edges connected to the node in hand, closeness considers the distance to all the other nodes in the graph. Let’s first come back to our undirected graph:

G = nx.Graph()
G.add_edges_from([('A','D'),
                  ('B','D'),
                  ('C','D'),
                  ('D','E'),
                  ('E','F'),
                  ('E','G')])
visualise(G)

Let’s take the example of node D. Here we see the distance to each other node in the graph:

set(nx.shortest_path_length(G,'D').items())
{('A', 1), ('B', 1), ('C', 1), ('D', 0), ('E', 1), ('F', 2), ('G', 2)}

We now sum up the distances and use the result as the denominator—the numerator is the number of nodes minus one.

total_distance = sum(nx.shortest_path_length(G,'D').values())
number_of_nodes = G.number_of_nodes()

print(" total distance: %d" % total_distance)
print("number of nodes: %d" % number_of_nodes)
print("      closeness: %2.2f" % ((number_of_nodes-1) / total_distance))
 total distance: 8
number of nodes: 7
      closeness: 0.75

These are the results for all the nodes:

nx.closeness_centrality(G)
{'A': 0.46153846153846156,
 'D': 0.75,
 'B': 0.46153846153846156,
 'C': 0.46153846153846156,
 'E': 0.6666666666666666,
 'F': 0.42857142857142855,
 'G': 0.42857142857142855}

Closeness Centrality (Directed Graphs)

Closeness centrality in the case of a directed graph is calculated, by default, using the inward distance to each node. For the outward version, the graph should be reversed using G.reverse().

G = nx.DiGraph()
G.add_edges_from([('A','D'),
                  ('B','D'),
                  ('C','D'),
                  ('D','E'),
                  ('E','F'),
                  ('E','G')])
visualise(G)
display(nx.closeness_centrality(G))
display(nx.closeness_centrality(G.reverse()))

{'A': 0.0,
 'D': 0.5,
 'B': 0.0,
 'C': 0.0,
 'E': 0.38095238095238093,
 'F': 0.34722222222222227,
 'G': 0.34722222222222227}



{'A': 0.2962962962962963,
 'D': 0.3,
 'B': 0.2962962962962963,
 'C': 0.2962962962962963,
 'E': 0.3333333333333333,
 'F': 0.0,
 'G': 0.0}

Betweenness

Betweenness measures the degree to which each node lies on the shortest paths between other nodes in a graph. Let’s start with our sample undirected graph:

G = nx.Graph()
G.add_edges_from([('A','D'),
                  ('B','D'),
                  ('C','D'),
                  ('D','E'),
                  ('E','F'),
                  ('E','G')])
visualise(G)
nx.betweenness_centrality(G,normalized=False)

{'A': 0.0, 'D': 12.0, 'B': 0.0, 'C': 0.0, 'E': 9.0, 'F': 0.0, 'G': 0.0}

Let’s consider betweenness for node D. The denormalized score is 12, which is calculated by summing up, for each combination of nodes, excluding D, the difference between all the possible shortest paths, and those paths that go through D:

From | To | Paths Through D | Total Paths | Ratio
-----|----|-----------------|-------------|------
A | B | 1 | 1 | 1
A | C | 1 | 1 | 1
A | E | 1 | 1 | 1
A | F | 1 | 1 | 1
A | G | 1 | 1 | 1
B | C | 1 | 1 | 1
B | E | 1 | 1 | 1
B | F | 1 | 1 | 1
B | G | 1 | 1 | 1
C | E | 1 | 1 | 1
C | F | 1 | 1 | 1
C | G | 1 | 1 | 1
E | F | 0 | 1 | 0
E | G | 0 | 1 | 0
F | G | 0 | 1 | 0
- | - | - | - | 12 (Sum)

Scores are presented in a normalised fashion by default, which consists of dividing the score by the number of pairs of nodes, excluding the node in hand:

For example, for D, this results in:

Betweenness(D) / (1/2 * (N-2) * (N-1))

= 12 / (0.5 * 5 * 6)

= 12 / 15

= 0.8

We can obtain the normalised scores just by removing the normalized=False argument—normalisation is the default.

nx.betweenness_centrality(G)
{'A': 0.0, 'D': 0.8, 'B': 0.0, 'C': 0.0, 'E': 0.6, 'F': 0.0, 'G': 0.0}

Edge Betweenness

Betweenness centrality can also be calculated relative to edges, rather than nodes.

nx.edge_betweenness_centrality(G)
{('A', 'D'): 0.2857142857142857,
 ('D', 'B'): 0.2857142857142857,
 ('D', 'C'): 0.2857142857142857,
 ('D', 'E'): 0.5714285714285714,
 ('E', 'F'): 0.2857142857142857,
 ('E', 'G'): 0.2857142857142857}

Performance Notes

  • Calculating betweenness has a complexity of $O(n^3)$: it gets rapidly slower with a larger number of nodes.
  • One approach is to use sampling, by passing the argument k to betweenness_centrality(), so that the paths are calculated only for k nodes rather than all of them.
  • Another approach is to specify a subset of nodes for the calculation, for which an alternative function called betweenness_centrality_subset() is provided. Both approaches are sketched below.
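
A minimal sketch of both options, using the sample graph from this section (the values of k, sources, and targets are arbitrary illustrations):

approx = nx.betweenness_centrality(G, k=4, seed=1)                         # sample only 4 nodes
subset = nx.betweenness_centrality_subset(G, sources=['A'], targets=['G']) # restrict the pairs considered
print(approx)
print(subset)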

PageRank

PageRank (PR) is Larry Page’s original patented algorithm, used by Google Search to rank web pages. It works by counting the number and quality of links to a page, to determine how important the website is. It is based on the assumption that more important websites are likely to receive more links from other websites.

PageRank uses a directed graph as a data structure whereby the nodes are the websites, and the edges are the links.

Basic Algorithm

The basic algorithm gives each node an initial importance equal to 1 / # Nodes. Then, with each iteration of the algorithm, the nodes with inward arrows are given the importance of the node from which each inward arrow originates, divided by the number of outward arrows that node has. The idea is that the importance conferred by a node is diluted by the number of outward arrows it has.

Let us see the PageRank algorithm in action.

Let us get started with a sample graph in which the initial importance is 0.2—that is, 1 / 5, given that we have 5 nodes in total:

G = nx.DiGraph()
# Sample graph by Daniel Romero
G.add_edges_from([('A','B'),
                  ('B','C'),
                  ('B','D'),
                  ('C','B'),
                  ('D','C'),
                  ('D','A'),
                  ('D','E'),
                  ('E','A'),
                ])
def initial_nodes():
    importance = 1 / G.number_of_nodes()
    return { n : { "importance" : importance } for n in G.nodes() }
nx.set_node_attributes(G,values=initial_nodes())
visualise(G,hide=True)

Now, let us do a first iteration of the algorithm and see what new values we get for each node. Note in the table below how the value of each node is calculated by summing up the importance contributed by each inward node, which in turn is diluted by the out-degree (the number of outgoing arrows) of that node.

def simplePageRank(max_iter=1,show=False):
    nodes = initial_nodes()
    s = "\n"
    s +=  "i | Node |  Inward Node  | Importance | Out Degree| Influence\n"
    s +=  "--|------|---------------|------------|-----------|----------\n"
    for i in range(1,max_iter+1):
        new_nodes = {}
        for n in nodes.keys():
            sum = 0.0
            for in_node,_ in G.in_edges(nbunch=n):
                s += "%d " % i # Iteration
                s += "| %s %2.2f (Old) " % (n,nodes[n]['importance'])
                s += "| %s " % in_node
                in_node_importance = nodes[in_node]['importance']
                in_node_outward_degree = G.out_degree(nbunch=in_node)
                in_node_influence = in_node_importance / in_node_outward_degree
                sum = sum + in_node_influence
                s += "| %2.2f " % in_node_importance
                s += "| %2.2f " % in_node_outward_degree
                s += "| %2.2f/%2.2f = %2.2f " % (in_node_importance,in_node_outward_degree,in_node_influence)
                s += "\n"

            new_nodes[n] = { 'importance' : sum if sum > 0.0 else nodes[n]['importance']}

            in_nodes = "+".join([ n[0] for n in G.in_edges(nbunch=n) ]) if sum > 0.0 else "No Inward Nodes"
            s += "%i | %s %2.2f (%s) |-|- |- |-\n" % (i,n,new_nodes[n]['importance'],in_nodes)
        nodes = new_nodes
    if show == True:
        display(Markdown(s))
    return {  k : { "importance" : "%2.2f" % v['importance']} for k,v in nodes.items() }

nx.set_node_attributes(G,values=simplePageRank(max_iter=1,show=True))
visualise(G,hide=True)
i | Node | Inward Node | Importance | Out Degree | Influence
--|------|-------------|------------|------------|----------
1 | A 0.20 (Old) | D | 0.20 | 3.00 | 0.20/3.00 = 0.07
1 | A 0.20 (Old) | E | 0.20 | 1.00 | 0.20/1.00 = 0.20
1 | A 0.27 (D+E) | - | - | - | -
1 | B 0.20 (Old) | A | 0.20 | 1.00 | 0.20/1.00 = 0.20
1 | B 0.20 (Old) | C | 0.20 | 1.00 | 0.20/1.00 = 0.20
1 | B 0.40 (A+C) | - | - | - | -
1 | C 0.20 (Old) | B | 0.20 | 2.00 | 0.20/2.00 = 0.10
1 | C 0.20 (Old) | D | 0.20 | 3.00 | 0.20/3.00 = 0.07
1 | C 0.17 (B+D) | - | - | - | -
1 | D 0.20 (Old) | B | 0.20 | 2.00 | 0.20/2.00 = 0.10
1 | D 0.10 (B) | - | - | - | -
1 | E 0.20 (Old) | D | 0.20 | 3.00 | 0.20/3.00 = 0.07
1 | E 0.07 (D) | - | - | - | -

Let us now run the same algorithm for two iterations. What is interesting is that in the second iteration, the importance contributed by each inward node is no longer the initial value of 0.2, but the value calculated in the first iteration.

nx.set_node_attributes(G,values=simplePageRank(max_iter=2,show=True))
visualise(G,hide=True)
i | Node | Inward Node | Importance | Out Degree | Influence
--|------|-------------|------------|------------|----------
1 | A 0.20 (Old) | D | 0.20 | 3.00 | 0.20/3.00 = 0.07
1 | A 0.20 (Old) | E | 0.20 | 1.00 | 0.20/1.00 = 0.20
1 | A 0.27 (D+E) | - | - | - | -
1 | B 0.20 (Old) | A | 0.20 | 1.00 | 0.20/1.00 = 0.20
1 | B 0.20 (Old) | C | 0.20 | 1.00 | 0.20/1.00 = 0.20
1 | B 0.40 (A+C) | - | - | - | -
1 | C 0.20 (Old) | B | 0.20 | 2.00 | 0.20/2.00 = 0.10
1 | C 0.20 (Old) | D | 0.20 | 3.00 | 0.20/3.00 = 0.07
1 | C 0.17 (B+D) | - | - | - | -
1 | D 0.20 (Old) | B | 0.20 | 2.00 | 0.20/2.00 = 0.10
1 | D 0.10 (B) | - | - | - | -
1 | E 0.20 (Old) | D | 0.20 | 3.00 | 0.20/3.00 = 0.07
1 | E 0.07 (D) | - | - | - | -
2 | A 0.27 (Old) | D | 0.10 | 3.00 | 0.10/3.00 = 0.03
2 | A 0.27 (Old) | E | 0.07 | 1.00 | 0.07/1.00 = 0.07
2 | A 0.10 (D+E) | - | - | - | -
2 | B 0.40 (Old) | A | 0.27 | 1.00 | 0.27/1.00 = 0.27
2 | B 0.40 (Old) | C | 0.17 | 1.00 | 0.17/1.00 = 0.17
2 | B 0.43 (A+C) | - | - | - | -
2 | C 0.17 (Old) | B | 0.40 | 2.00 | 0.40/2.00 = 0.20
2 | C 0.17 (Old) | D | 0.10 | 3.00 | 0.10/3.00 = 0.03
2 | C 0.23 (B+D) | - | - | - | -
2 | D 0.10 (Old) | B | 0.40 | 2.00 | 0.40/2.00 = 0.20
2 | D 0.20 (B) | - | - | - | -
2 | E 0.07 (Old) | D | 0.10 | 3.00 | 0.10/3.00 = 0.03
2 | E 0.03 (D) | - | - | - | -

If we ran infinite iterations, the importance of each node would eventually converge. We don’t need to run that many. Let us do just 100:

nx.set_node_attributes(G,values=simplePageRank(max_iter=100,show=False))
visualise(G,hide=True)

We can see that Neworkx’s own pagerank() function provides similar results:

nx.pagerank(G,alpha=1.0)
{'A': 0.12499985859863202,
 'B': 0.37500069212248505,
 'C': 0.2499997246394413,
 'D': 0.18749944927888273,
 'E': 0.06250027536055858}

Scaling (Alpha Parameter)

In the last section we saw that we passed the argument alpha=1.0 to pagerank()—why? It turns out that the basic algorithm has a problem. Consider this graph:

G = nx.DiGraph()
G.add_edges_from([('A','C'),
                  ('B','C'),
                  ('C','D'),
                  ('C','E'),
                  ('D','E'),
                  ('E','D'),
                ])
nx.set_node_attributes(G,values=initial_nodes())
visualise(G,hide=True)

If we run the basic PageRank algorithm many times over—100000 times in the example below—we see that D and E have a reinforcing effect whereby their importance ends up reducing, comparatively, that of {A,B,C}:

nx.set_node_attributes(G,values=simplePageRank(max_iter=100000,show=False))
visualise(G,hide=True)

The results are equivalent to Networkx’s pagerank() when using alpha=1.0:

nx.set_node_attributes(G,
    values={ k : { "importance" : "%2.2f" % v} 
                for k,v in nx.pagerank(G,alpha=1.0).items()}
    )
visualise(G)

The parameter alpha works as a damping parameter, in the sense that it adds some randomness to the otherwise static relationship between nodes. It therefore prevents cycles like {D,E} from disproportionally reducing the importance of nodes that are not involved in cyclical relationships. The closer alpha is to zero, the more randomness is added. By default, pagerank() assumes alpha=0.85, so a 15% damping effect is added. Let us see the difference:

nx.set_node_attributes(G,
    values={ k : { "importance" : "%2.2f" % v} 
                for k,v in nx.pagerank(G).items()} # alpha=0.85 by default
    )
visualise(G)

Here we see that the importance of {A,B,C} hasn’t been watered down towards zero, as it was when alpha was 1.0. But what happens if we set alpha to 0.0? In this case, the edges/arrows become pretty much meaningless—all nodes have the same importance:

nx.set_node_attributes(G,
    values={ k : { "importance" : "%2.2f" % v} 
                for k,v in nx.pagerank(G,alpha=0.0).items()} 
    )
visualise(G)

HITS

Hyperlink-Induced Topic Search (HITS), also informally known as Hubs and Authorities, is an algorithm developed by Jon Kleinberg that serves a similar purpose as PageRank.

HITS is a dual-rank system whereby each node has both a hub score and an authority score:

  • Hub score: The degree to which the node serves as a directory
    • Characterised by a larger number of outgoing edges—out-degree
    • Positively influenced by outgoing edges that point to nodes with a high authority score
  • Authority score: The degree to which the node is referenced by multiple sources
    • Characterised by a larger number of incoming edges—in-degree
    • Positively influenced by incoming edges that originate from nodes with a high hub score

The algorithm works similarly to PageRank in the sense that each node has an initial score (1) and that each iteration of the algorithm results in the calculation of new values from old ones. The difference is that two scores (hub and authority) are calculated rather than one, and the calculations take the “opposite” score rather than their own:

HUB_SCORE_A  = SUM(OUTGOING_B_AUTH_SCORE, OUTGOING_C_AUTH_SCORE, ...)
AUTH_SCORE_A = SUM(INCOMING_D_HUB_SCORE, INCOMING_E_HUB_SCORE, ...)

Results are also normalised (in each iteration) by dividing them by the sum of all values:

HUB_SCORE_A_NORMALISED = HUB_SCORE_A / SUM(HUB_SCORE_A, HUB_SCORE_B, ...)
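
For reference, here is a minimal sketch of one such iteration; it is an illustrative addition, not Networkx's actual implementation, and it assumes a directed graph G with current hub and auth score dictionaries:

def hits_step(G, hub, auth):
    # New authority scores come from the hub scores of incoming nodes
    new_auth = { n : sum(hub[u] for u,_ in G.in_edges(n)) for n in G }
    # New hub scores come from the authority scores of outgoing nodes
    new_hub  = { n : sum(new_auth[v] for _,v in G.out_edges(n)) for n in G }
    h_total, a_total = sum(new_hub.values()), sum(new_auth.values())
    return ({ n : h/h_total for n,h in new_hub.items() },
            { n : a/a_total for n,a in new_auth.items() })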

Given the similarities in mechanics to PageRank, we’ll proceed to showing Networkx’s hits() in action:

G = nx.DiGraph()
G.add_edges_from([('A','C'),
                  ('B','C'),
                  ('C','D'),
                  ('C','E'),
                  ('D','E'),
                  ('E','D'),
                ])
values = {}
for k in nx.hits(G)[0].keys():
  values[k] = { "hub_score"  : ("%2.2f" % nx.hits(G)[0][k]),
                "auth_score" : ("%2.2f" % nx.hits(G)[1][k])}
nx.set_node_attributes(G, values=values)
visualise(G)
nx.hits(G)

({'A': 1.8702628814109454e-16,
  'C': 0.49999999999999994,
  'B': 1.8702628814109454e-16,
  'D': 0.24999999999999997,
  'E': 0.24999999999999992},
 {'A': -0.0,
  'C': 3.7405257628218904e-16,
  'B': -0.0,
  'D': 0.4999999999999998,
  'E': 0.49999999999999983})

Preferential Attachment

This is a process that is also informally known as “the rich get richer”, in which those who have more units of something (say, wealth, friends, etc.) are likely to have even more.

In a social network graph, those who have a lot of friends are likely to keep adding even more friends. The probability of a new node (say, a potential new friend) attaching itself to an existing node is given by that node’s degree divided by the combined degree of all nodes in the network. Let us look at three inductive examples.

Inductive Examples

2 Nodes, 1 Edge

Here there are two nodes, so the probability of node 3 attaching itself to either node 1 or 2 is 50/50.

G = nx.Graph()
G.add_edges_from([ (1,2)
                ])
d = dict(nx.degree(G))
t = sum(d.values())
nx.set_node_attributes(G,values={ k : {"proba:" : v/t} for k,v in d.items()})
visualise(G,hide=True)

3 Nodes, 2 Edges

Now that 3 has joined 2, the probabilities change.

The calculation works out as follows:

Node | Degree | Probability
-----|--------|------------
1 | 1 | 1/T = 1/4 = 0.25
2 | 2 | 2/T = 2/4 = 0.50
3 | 1 | 1/T = 1/4 = 0.25
- | 4 (T) | -

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (3,2)
                ])
d = dict(nx.degree(G))
t = sum(d.values())
nx.set_node_attributes(G,values={ k : {"proba:" : v/t} for k,v in d.items()})
visualise(G,hide=True)

5 Nodes, 4 Edges

We now attach nodes 4 and 5 to node 2. See how this further increases the probability of any new node joining 2, relative to the other nodes.

G = nx.Graph()
G.add_edges_from([  (1,2),
                    (3,2),
                    (4,2),
                    (5,2),
                    (4,5)
                ])
d = dict(nx.degree(G))
t = sum(d.values())
nx.set_node_attributes(G,values={ k : {"proba:" : v/t} for k,v in d.items()})
visualise(G,hide=True)

Barabasi Albert Graph

Based on the notion of preferential attachment, we can mechanise the addition of new nodes. In Networkx, barabasi_albert_graph(n,m,initial_graph=G) will grow the graph to n nodes in total, attaching each new node to m existing nodes chosen randomly with probability proportional to their degree.

In the example below, we grow our graph from 5 to 50 nodes.

G2 = nx.barabasi_albert_graph(50,1,initial_graph=G,seed=3)
d = dict(nx.degree(G2))
t = sum(d.values())
nx.set_node_attributes(G2,
    values={ k : {"proba:" : "%2.2f" % (v/t)} for k,v in d.items()}
    )
visualise(G2,hide=True)

We don’t necessarily need a base graph to use Network’s barabasi_albert_graph(). We can generate a synthetic graph, from the ground up, based on the attachment model.

In the example below, only 1 edge is added per node, to produce a gentler visualisation.

visualise(nx.barabasi_albert_graph(25,1,seed=4))

Please note that it is not guaranteed that the first node, 0, will be the one with the highest degree. While in the first iteration there is a 100% chance that node 1 will attach to node 0, this chance then becomes 50% for nodes 0 and 1.

There is, naturally, a higher chance that the earliest nodes will end up having the highest degree, because they have the highest probability of attachment while the first nodes are being added. Let us just change the random seed to show an instance in which a different node, in this case 4, ends up with the highest degree:

visualise(nx.barabasi_albert_graph(25,1,seed=5))

For very large graphs, we see that the nodes’ degrees follow a power-law pattern. If we render a simple plot, we see the concentration of values in the bottom-left corner.

G = nx.barabasi_albert_graph(100000,1,seed=1)
d = dict(G.degree())
ns = nx.number_of_nodes(G)
degree = sorted(list(d.values()), reverse=False)
proportion = [ degree.count(v)/ns for v in degree]
plt.scatter(degree,proportion,s=0.5)
plt.xlabel('Degree')
plt.ylabel("Proportion of all Nodes")
plt.show()

Instead, if we use a logarithmic scale, we see a straight line, which is the telltale sign of a power law at play:

plt.scatter(degree,proportion,s=0.5)
plt.xlabel('Degree')
plt.ylabel("Proportion of all Nodes")
plt.xscale('log')
plt.yscale('log')
plt.show()

Watts–Strogatz Model

The Watts–Strogatz model is a random graph generation model that is capable of producing graphs with small-world properties, including short average path lengths and high clustering, when provided with the right input parameters:

  • n: the number of nodes in the model
  • k: the number of nearest neighbours to which each node is joined
  • p: the probability of rewiring an edge (u,v) to a new node w, so that it becomes (u,w)

For example, here we produce a graph consisting of 50 nodes, each attached to two neighbours, with 0 probability of rewiring:

visualise(nx.watts_strogatz_graph(50,2,0),circular=True)

Here below we keep the same number of nodes, but change the probability of rewiring to 0.5:

visualise(nx.watts_strogatz_graph(50,2,0.5,seed=4),circular=True)

Please note that seed=4 guarantees the formation of a connected graph—it is a cheat! If we wanted to produce a random connected graph without fixing the seed, we can use the connected_watts_strogatz_graph() function instead. Alternatively, newman_watts_strogatz_graph() can be used, which adds new edges, rather than rewiring existing ones:

visualise(nx.newman_watts_strogatz_graph(50,2,0.5,seed=4),circular=True)
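
For completeness, here is a quick sketch of the connected variant mentioned above; the tries parameter bounds the number of attempts at generating a connected graph before giving up:

visualise(nx.connected_watts_strogatz_graph(50,2,0.5,tries=100),circular=True)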

Link Prediction

Link—or edge—prediction is the ability to predict the likelihood that an existing pair of nodes will form a new connection. It is used, for instance, by Facebook to suggest new friends.

Here below we will look at different link prediction approaches. First, let us use a common graph for all of them:

G = nx.Graph()
G.add_edges_from([
    ('A','B'),
    ('A','D'),
    ('A','E'),
    ('B','C'),
    ('B','D'),
    ('C','D'),
    ('C','F'),
    ('E','F'),
    ('E','G'),
    ('F','G'),
    ('G','H'),
    ('G','I')
])
visualise(G)

Common Neighbour

This approach consists of measuring the number of common neighbours between two nodes. For example, nodes A and C have two common neighbours: B and D:

1list(nx.common_neighbors(G,'A','C'))
['B', 'D']

With this notion in mind, we can construct a probability table as follows:

 1neighbours = [ (u,v,len(list(nx.common_neighbors(G,u,v)))) 
 2                            for u,v in nx.non_edges(G) ]
 3                            
 4def show_probas(edges):
 5    s  = ""
 6    s += "Probability | Edges      \n"
 7    s += "------------|------------\n"
 8    for p in sorted(set([ p for _,_,p in edges ]),reverse=True):
 9        es = ", ".join([ "(%s,%s)" % (u,v) for u,v,p2 in edges if p == p2])
10        s += "%f | %s \n" % (p,es)
11    display(Markdown(s))
12
13show_probas(neighbours)
Probability | Edges
------------|------
2.000000 | (A,C)
1.000000 | (I,E), (I,F), (I,H), (G,A), (G,C), (E,D), (E,H), (E,B), (E,C), (A,F), (B,F), (F,H), (F,D)
0.000000 | (I,A), (I,D), (I,B), (I,C), (G,D), (G,B), (A,H), (B,H), (H,D), (H,C)

Jaccard Coefficient

This is a normalised version of the simple “count the neighbours” approach, in which the number of common neighbours is divided by the size of the union of both nodes’ neighbour sets. Let us take A and C again and do the math:

 1visualise(G)
 2common = set(nx.common_neighbors(G,'A','C'))
 3a = set(nx.neighbors(G,'A'))
 4c = set(nx.neighbors(G,'C'))
 5union_a_c = a.union(c)
 6print("(A,C) common neighbours: %s (%d)" % (common, len(common)))
 7print("A neighbours: %s" % a)
 8print("C neighbours: %s" % c)
 9print("Union(A,C) = %s (%d)" % (union_a_c, len(union_a_c)))
10print("Jaccard coefficient = %d/%d = %2.2f" % (len(common),len(union_a_c),(len(common)/len(union_a_c))))

(A,C) common neighbours: {'D', 'B'} (2)
A neighbours: {'E', 'D', 'B'}
C neighbours: {'F', 'D', 'B'}
Union(A,C) = {'B', 'F', 'E', 'D'} (4)
Jaccard coefficient = 2/4 = 0.50

We can calculate the Jaccard coefficient for all candidate edges (i.e., all non-edges) as follows:

1jaccard = list(nx.jaccard_coefficient(G))
2show_probas(jaccard)
Probability | Edges
------------|------
1.000000 | (I,H)
0.500000 | (A,C)
0.333333 | (I,E), (I,F), (E,H), (F,H)
0.200000 | (E,D), (E,B), (E,C), (A,F), (B,F), (F,D)
0.166667 | (G,A), (G,C)
0.000000 | (I,A), (I,D), (I,B), (I,C), (G,D), (G,B), (A,H), (B,H), (H,D), (H,C)

Resource Allocation

This measurement penalises common neighbours that have lots of neighbours of their own: each common neighbour contributes 1 divided by its degree, and the contributions are summed. For example, for A and C, whose common neighbours are B and D, the calculation works as follows:

 1visualise(G)
 2common   = set(nx.common_neighbors(G,'A','C'))
 3common_l = len(common) 
 4b = nx.degree(G,'B')
 5d = nx.degree(G,'D')
 6print("(A,C) common neighbours: %s (%d)" % (common, len(common)))
 7print("B degree: %s" % b)
 8print("D degree: %s" % d)
 9print("Resource Allocation Index = 1/%d + 1/%d = %2.2f" 
10        % (b,d,(1/b+1/d)))

(A,C) common neighbours: {'D', 'B'} (2)
B degree: 3
D degree: 3
Resource Allocation Index = 1/3 + 1/3 = 0.67

We can calculate the resource allocation index for all candidate edges in the graph as follows:

1resource = list(nx.resource_allocation_index(G))
2show_probas(resource)
Probability | Edges
------------|------
0.666667 | (A,C)
0.333333 | (G,A), (G,C), (E,D), (E,B), (E,C), (A,F), (B,F), (F,D)
0.250000 | (I,E), (I,F), (I,H), (E,H), (F,H)
0.000000 | (I,A), (I,D), (I,B), (I,C), (G,D), (G,B), (A,H), (B,H), (H,D), (H,C)

Adamic-Adar Index

This is similar to the resource allocation index, but each common neighbour’s degree is passed through the (natural) log function before dividing. So, for nodes A and C, the calculation is 1/log(degree(B)) + 1/log(degree(D)), given that B and D are their common neighbours.
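We can verify this by hand, mirroring the earlier worked examples (note that Networkx uses the natural logarithm here):

1import math
2# Worked example for (A,C): common neighbours B and D both have degree 3
3score = 1/math.log(nx.degree(G,'B')) + 1/math.log(nx.degree(G,'D'))
4print("Adamic-Adar(A,C) = 1/ln(3) + 1/ln(3) = %.6f" % score)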

The complete results are as follows:

1visualise(G)
2adamic_adar = list(nx.adamic_adar_index(G))
3show_probas(adamic_adar)

Probability | Edges
------------|------
1.820478 | (A,C)
0.910239 | (G,A), (G,C), (E,D), (E,B), (E,C), (A,F), (B,F), (F,D)
0.721348 | (I,E), (I,F), (I,H), (E,H), (F,H)
0.000000 | (I,A), (I,D), (I,B), (I,C), (G,D), (G,B), (A,H), (B,H), (H,D), (H,C)

Preferential Attachment

We’ve seen this measurement already in a previous section. Preferential attachment gives a higher score to candidate edges between high-degree nodes. For a candidate edge (u,v), the score is the product of the endpoints’ degrees: degree(u) * degree(v).

In the results below, node G has a degree of 4, the highest in the graph, so the highest-ranked candidate edges are those that connect G to nodes with the second-highest degree, 3, giving a score of 4 × 3 = 12.

1visualise(G)
2pref = list(nx.preferential_attachment(G))
3show_probas(pref)

Probability | Edges
------------|------
12.000000 | (G,A), (G,D), (G,B), (G,C)
9.000000 | (E,D), (E,B), (E,C), (A,F), (A,C), (B,F), (F,D)
3.000000 | (I,E), (I,A), (I,F), (I,D), (I,B), (I,C), (E,H), (A,H), (B,H), (F,H), (H,D), (H,C)
1.000000 | (I,H)
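We can confirm the top score manually:

1# degree(G) * degree(A) = 4 * 3 = 12
2print(nx.degree(G,'G') * nx.degree(G,'A'))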

Community Structure

This is an extension of common neighbours whereby each common neighbour that belongs to the same community as both endpoints contributes an extra ‘bonus’ point to the score.

The “community” is not inferred from the graph’s structure; it is an additional property. Let’s say that we want to classify some nodes into red and blue communities:

1red  = ['A','B','C','D']
2blue = ['E','F','G','H','I']
3visualise(G,clusters=[("red",red),("blue",blue)])

We then need to tag the relevant nodes accordingly. For example:

1for n in G.nodes():
2    G.nodes[n]['community'] = 0 if n in red else 1

We can now calculate the connection score for every candidate edge as follows:

1visualise(G)
2community = list(nx.cn_soundarajan_hopcroft(G))
3show_probas(community)

Probability | Edges
------------|------
4.000000 | (A,C)
2.000000 | (I,E), (I,F), (I,H), (E,H), (F,H)
1.000000 | (G,A), (G,C), (E,D), (E,B), (E,C), (A,F), (B,F), (F,D)
0.000000 | (I,A), (I,D), (I,B), (I,C), (G,D), (G,B), (A,H), (B,H), (H,D), (H,C)

For example, (A,C) scores 4: two common neighbours, B and D, each earning a bonus point because A, C, B, and D all belong to the red community.

Community Resource Allocation

This is a combination of the resource allocation measure plus a “bonus” for those nodes that are in the same community.

1visualise(G)
2community = list(nx.ra_index_soundarajan_hopcroft(G))
3show_probas(community)

Probability | Edges
------------|------
0.666667 | (A,C)
0.250000 | (I,E), (I,F), (I,H), (E,H), (F,H)
0.000000 | (I,A), (I,D), (I,B), (I,C), (G,A), (G,D), (G,B), (G,C), (E,D), (E,B), (E,C), (A,H), (A,F), (B,H), (B,F), (F,D), (H,D), (H,C)

Serialising Graphs (Reading and Writing)

Adjacency List

In an adjacency list, the first column represents each node found in the graph, while the second, third, and subsequent columns represent the nodes to which the node in the first column is connected—i.e., each pair forms an edge.

1# Serialising a Graph using the adjacency list format
2G = nx.Graph()
3G.add_edges_from([  (1,2),
4                    (1,3),
5                    (2,3)
6                ])
7visualise(G)
8nx.write_adjlist(G, "test.adjlist", delimiter="|")
9!cat test.adjlist

#/home/codespace/.local/lib/python3.10/site-packages/ipykernel_launcher.py -f /tmp/tmpzw03bg2a.json --HistoryManager.hist_file=:memory:
# GMT Thu Dec 29 15:03:24 2022
# 
1|2|3
2|3
3
1# Deserialising a file using the adjacency list format
2G2 = nx.read_adjlist("test.adjlist",delimiter="|",nodetype=int)
3print(G2.edges(data=True))
4print(G.edges() == G2.edges())
[(1, 2, {}), (1, 3, {}), (2, 3, {})]
True

Adjacency Matrix

An adjacency matrix is one in which the X and Y axes (or columns and rows) represent the nodes, and an intersection is marked with 1 (or logically True) when its coordinates denote an edge. This is more often used as an efficient in-memory format than as a file-based format.

 1# Serialising a graph using the adjacency matrix format
 2G = nx.Graph()
 3G.add_edges_from([  (1,2),
 4                    (1,3),
 5                    (2,3)
 6                ])
 7visualise(G)
 8adj_matrix = None
 9with warnings.catch_warnings():
10    warnings.simplefilter("ignore")
11    adj_matrix = nx.adjacency_matrix(G)
12
13display(Markdown("In raw dense/non-sparse format:"))
14print(adj_matrix.todense())
15
16display(Markdown("As a DataFrame, to show the X and Y axes"))
17df = pd.DataFrame(adj_matrix.todense(),columns=[1,2,3],index=[1,2,3])
18df

In raw dense/non-sparse format:

[[0 1 1]
 [1 0 1]
 [1 1 0]]

As a DataFrame, to show the X and Y axes

1 2 3
1 0 1 1
2 1 0 1
3 1 1 0

Edge List

The edge list is the most common format: it is the easiest to manipulate in Excel and to export from other tools. In this format, the first two values represent the edge, whereas the third and subsequent values represent the edge’s attributes. When serialising from Python directly, the attributes are encoded as a Python dictionary, but files generated by other tools (to be then deserialised by Networkx) usually encode them as plain whitespace-separated values, as in the solar system example below.

1# Serialising a graph using the edge list format
2G = nx.Graph()
3G.add_edges_from([  (1,2,{"name" : "edge1"}),
4                    (1,3,{"name" : "edge2"}),
5                    (2,3,{"name" : "edge3"})
6                ])
7visualise(G)
8nx.write_edgelist(G,'test.edgelist')
9!cat test.edgelist

1 2 {'name': 'edge1'}
1 3 {'name': 'edge2'}
2 3 {'name': 'edge3'}
1# Deserialising external file using the edge list format
2!cat solar.tsb
3solar = nx.read_edgelist('solar.tsb', data=[('distance', float)], 
4                         create_using=nx.MultiDiGraph())
5visualise(solar)
Sun Mercury 0.387
Sun Venus   0.722
Sun Earth   1
Sun Mars    1.52
Sun Jupiter 5.20
Sun Saturn  9.58
Sun Uranus  19.2
Sun Neptune 30.1
Earth   Moon    0.002569

Translating a Graph to a Pandas DataFrame

Here we just need to make sure to extract each attribute from data individually when constructing the DataFrame.

1import pandas as pd
2
3solar = nx.read_edgelist('solar.tsb', data=[('distance', float)], 
4                         create_using=nx.MultiDiGraph())
5df = pd.DataFrame([ (u,v,data['distance']) for u,v,data in solar.edges(data=True) ], 
6                    columns=['from', 'to', 'distance'])
7df

from to distance
0 Sun Mercury 0.387000
1 Sun Venus 0.722000
2 Sun Earth 1.000000
3 Sun Mars 1.520000
4 Sun Jupiter 5.200000
5 Sun Saturn 9.580000
6 Sun Uranus 19.200000
7 Sun Neptune 30.100000
8 Earth Moon 0.002569

Conclusion

In this long by-example tutorial, we saw that Networkx abstracts away the complexity of the most useful graph algorithms available today, while preserving their power.

Bearing in mind that Networkx runs in memory (it is not an “at rest” graph database like Amazon Neptune), it provides all the logical functionality required to build an application like Sky Scanner or Facebook.

If you find any errors, please do get in touch with me. I’ll give you credit for helping me identify bugs and issues in general.

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
