I started on Coursera's Social Network Analysis course and was looking around for some network data to start analyzing. I'd seen a talk by Matt Biddulph at a Big Data London meetup (blog post) on analyzing Wikipedia data and wondered if something similar could easily be done with news data.
It was fairly easy to grab some newspaper articles using the Guardian Open Platform. I then used the Python-based Natural Language Toolkit to extract named entities (in particular the names of people) from the articles. A network could then be constructed with names as the nodes, linking two nodes whenever at least two articles mentioned both names.
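Fetching the articles might look something like the sketch below. The endpoint and field names follow the Guardian Content API's search endpoint, but the exact parameters, paging details, and the API key are assumptions here, not a copy of the code actually used:

```python
import json
import urllib.parse
import urllib.request

# Guardian Content API search endpoint (assumed; check the Open Platform docs)
API_URL = "https://content.guardianapis.com/search"

def build_search_url(api_key, from_date, to_date, page=1):
    """Construct a query asking for full article bodies in a date range."""
    params = {
        "api-key": api_key,
        "from-date": from_date,     # e.g. "2012-10-01"
        "to-date": to_date,         # e.g. "2012-10-13"
        "show-fields": "body",      # include full article text in the response
        "page-size": 50,
        "page": page,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def fetch_article_bodies(api_key, from_date, to_date, page=1):
    """Fetch one page of results and yield article body text (needs network)."""
    url = build_search_url(api_key, from_date, to_date, page)
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read())
    for item in data["response"]["results"]:
        yield item["fields"]["body"]
```

In practice you would loop `page` up to the `pages` count reported in the response to cover the whole date range.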
The resulting network could then be loaded into Gephi, an excellent tool for visualizing and analyzing networks.
In this analysis, I downloaded the full texts of articles published between October 1st and October 13th, 2012. Named entity extraction with NLTK and Python was straightforward using the code snippet in Tim McNamara's blog post.
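The extraction step can be sketched with the standard NLTK pipeline (sentence tokenization, POS tagging, then `ne_chunk`). This is my own sketch rather than the snippet from the linked post, and it assumes the relevant NLTK data packages (the punkt tokenizer, the POS tagger, and the named entity chunker) have already been downloaded:

```python
def extract_people(text):
    """Yield names that NLTK's chunker tags as PERSON entities."""
    import nltk  # imported lazily so keep_full_names below works without NLTK
    for sentence in nltk.sent_tokenize(text):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        for subtree in tree.subtrees():
            if subtree.label() == "PERSON":
                yield " ".join(word for word, tag in subtree.leaves())

def keep_full_names(names):
    """Keep only two-word names (first and last), as in the analysis above."""
    return {name for name in names if len(name.split()) == 2}
```

The two-word filter is a crude but effective way to drop partial mentions like bare surnames, at the cost of merging or losing some people.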
The named entities were filtered to those NLTK classifies as PERSON and that contain exactly two words (a first and last name). These names were used to construct the graph, adding an edge between two nodes when the corresponding names appear together in at least two separate articles.
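The edge-building step amounts to counting, for every pair of names, how many articles mention both, then keeping pairs above the threshold. A minimal sketch (function and variable names are my own):

```python
from collections import Counter
from itertools import combinations

def build_edges(articles_people, min_articles=2):
    """Count name co-occurrences across articles.

    articles_people: a list with one set of person names per article.
    Returns a dict mapping (name1, name2) pairs to the number of articles
    mentioning both, keeping only pairs seen in at least min_articles articles.
    """
    pair_counts = Counter()
    for people in articles_people:
        # sorted() gives a canonical order so (a, b) and (b, a) are one key
        for a, b in combinations(sorted(people), 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_articles}
```

For example, two articles mentioning both "Barack Obama" and "Mitt Romney" produce an edge with weight 2, while a pairing that occurs in only one article is dropped.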
The graph was written by a Python program to a graph description format (GDF) file and imported into Gephi. Gephi was used to detect and color the different communities within the graph and to lay out the nodes; the communities correspond to distinct groups of people (politicians, sports people, etc.). Finally, Gephi exported the network in SVG format for visualization.
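A minimal GDF writer might look like the sketch below. It uses the basic `nodedef`/`edgedef` column layout Gephi accepts and assumes names contain no commas (GDF requires quoting otherwise); the function name is my own:

```python
def write_gdf(path, edges):
    """Write a weighted, undirected co-occurrence graph as a GDF file.

    edges: dict mapping (name1, name2) pairs to co-occurrence counts.
    Assumes names contain no commas, which would need quoting in GDF.
    """
    nodes = sorted({name for pair in edges for name in pair})
    with open(path, "w") as f:
        f.write("nodedef>name VARCHAR,label VARCHAR\n")
        for name in nodes:
            f.write(f"{name},{name}\n")
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE\n")
        for (a, b), weight in sorted(edges.items()):
            f.write(f"{a},{b},{weight}\n")
```

The co-occurrence count is written as the edge weight, so Gephi can size or shade edges by how often two people appear together.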
The network shows some interesting features. Communities of US and UK politicians are prominent and quite interlinked. Several interesting questions can be asked:
If you are interested in looking further at the data you can take a look at:
- Open the network in a new window (SVG file)
- Download the .gephi file for the network
- Download the .gdf file for the network