Niall McCarroll's Professional Home Page

Exploring UK Gender Pay Gap Data with Seaborn

python

6th April 2018

A quick post using Python 3 and the Seaborn statistical visualization package to start making sense of the UK gender pay gap data released this week. All UK companies with more than 250 employees are required to report how differently their female and male employees are paid. I decided to drill down into how, according to the data self-reported by companies, pay varies by gender in the electricity sector.

I've provided my workings in a Jupyter notebook. If you want to run the examples and don't have Jupyter and Seaborn installed, I'd recommend installing both via Anaconda.

UK general election 2017 provisional results

javascript, svg

8th July 2017

So things didn't turn out quite as anyone expected in the snap UK general election...

I wanted to create a visualisation of the results which contrasts the seats won with the % of the popular vote, and came up with this infographic. The nice thing about the two semi-circular charts I generated is that they can be nested within each other.
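
As a rough illustration of the geometry behind a semi-circular chart (in Python rather than the JavaScript used for the infographic itself), the sketch below splits a 180 degree arc into one slice per party, in proportion to each party's share. The function name `hemicycle_arcs` and the party labels and numbers are made up for the example; they are not the election results. Calling it once for seats and once for vote share would give the angles for two nested rings.

```python
def hemicycle_arcs(values):
    """Return (label, start_degrees, end_degrees) slices covering 0..180 degrees.

    `values` maps a label to a count or percentage; each slice's sweep is
    proportional to its share of the total.
    """
    total = sum(values.values())
    arcs = []
    start = 0.0
    for label, value in values.items():
        end = start + 180.0 * value / total
        arcs.append((label, start, end))
        start = end
    return arcs


# Hypothetical figures, for illustration only.
seats = {"Party A": 300, "Party B": 250, "Party C": 100}
vote_share = {"Party A": 42.0, "Party B": 40.0, "Party C": 18.0}

outer_ring = hemicycle_arcs(seats)       # drawn at a larger radius
inner_ring = hemicycle_arcs(vote_share)  # drawn at a smaller radius
```

Because both rings span the same 180 degrees, any mismatch between a party's outer and inner slice shows directly how its seat share differs from its vote share.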

Mo Farah's Olympic (Rio 2016) 5000m final victory, in tweets

javascript, raphaeljs

21st August 2016

Four years on from the London Olympics he's only gone and done it again - the double double 5000m/10000m.

Once again, I tracked the tweets using the Twitter streaming API (search terms #gomo, #motime, @mo_farah, #mofarah) before, during and after the race.

The interesting thing is, well, that the distribution of tweets over time is pretty similar to last time. Even the absolute rates in tweets per second are similar, despite the fact that the race started at 1:37am British Summer Time. You can compare them yourselves by looking at my original post from 2012.

Getting started with PySpark - Part 2

pyspark, python, data science

5th May 2014

In Part 1 we looked at installing the data processing engine Apache Spark and started to explore some features of its Python API, PySpark. In this article, we look in more detail at using PySpark.

Getting started with PySpark - Part 1

pyspark, python, data science

2nd March 2014

Apache Spark is a relatively new data processing engine implemented in Scala and Java that can run on a cluster to process and analyze large amounts of data. Spark performs particularly well if the cluster has sufficient main memory to hold the data being analyzed. Several sub-projects run on top of Spark and provide graph analysis (GraphX), a Hive-based SQL engine (Shark), machine learning algorithms (MLlib) and realtime streaming (Spark Streaming). Spark has also recently been promoted from incubator status to a top-level project.

In this series of blog posts, we'll look at installing Spark on a cluster and explore using its Python API bindings, PySpark, for a number of practical data science tasks. This first post focuses on installation and getting started.

twitstreamer

python

1st December 2012

This snippet, twitstreamer, is a simple command line tool, written in Python 3, for retrieving tweets via the Twitter streaming API, v1.1. The tweets are written to standard output as CSV or JSON formatted lines.

The tool can read from either of two Twitter streaming APIs.
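
To give a feel for the output side only, here is a minimal sketch of turning one tweet into a single CSV or JSON line. It is not twitstreamer's actual code: the function name `format_tweet` and the choice of fields (`id`, `created_at`, `text`) are assumptions made for the example, and no Twitter API calls are shown.

```python
import csv
import io
import json


def format_tweet(tweet, fmt="json", fields=("id", "created_at", "text")):
    """Format a tweet (a dict) as one line of JSON or CSV, without a trailing newline."""
    if fmt == "json":
        return json.dumps({name: tweet.get(name) for name in fields})
    if fmt == "csv":
        buffer = io.StringIO()
        # csv.writer takes care of quoting commas, quotes and newlines in the text.
        csv.writer(buffer).writerow([tweet.get(name, "") for name in fields])
        return buffer.getvalue().rstrip("\r\n")
    raise ValueError("fmt must be 'json' or 'csv'")


# Hypothetical tweet, for illustration only.
example = {"id": 1, "created_at": "2012-12-01T12:00:00", "text": "hello, world"}
print(format_tweet(example, "csv"))
print(format_tweet(example, "json"))
```

Writing one record per line keeps the output easy to pipe into other command line tools.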

twitfetcher

python

21st November 2012

This snippet, twitfetcher, is a simple command line tool, written in Python 3, for retrieving tweets via the Twitter search API, v1.1. The tweets can be stored in CSV or JSON formatted files.

Twitter makes only a sample of the tweets sent over the previous week searchable, but it is still a very useful free source of data for data science experiments.

Analyzing co-occurrence networks with Gephi

python, sna, gephi, nltk

12th October 2012

I started on Coursera's Social Network Analysis course and was looking around for some network data to start analyzing. I'd seen a talk by Matt Biddulph at a Big Data London meetup (blog post) on analyzing Wikipedia data and wondered if something similar could easily be done with news data.

It was fairly easy to grab some newspaper articles using the Guardian Open Platform. I then used the Python-based Natural Language Toolkit to extract named entities (in particular the names of people) from the articles. A network could then be constructed using names as the nodes, and connecting two nodes with a link if at least two articles included both names.
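
The linking rule can be sketched in a few lines of plain Python. This is only an illustration of the rule as described, assuming the names have already been extracted from each article; the function name `cooccurrence_edges` and the placeholder names are made up for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(articles, min_articles=2):
    """Map each pair of names to the number of articles mentioning both.

    `articles` is an iterable of name lists, one list per article. Only pairs
    appearing together in at least `min_articles` articles are kept.
    """
    pair_counts = Counter()
    for names in articles:
        # De-duplicate and sort so each pair is counted once per article
        # and always appears in the same order.
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_articles}


# Placeholder names standing in for the people extracted from three articles.
articles = [
    ["Person A", "Person B"],
    ["Person B", "Person A", "Person C"],
    ["Person C", "Person A"],
]
edges = cooccurrence_edges(articles)
```

Here `edges` links Person A to Person B and Person A to Person C, each with a weight of 2, while Person B and Person C share only one article and stay unconnected.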

The resulting network could then be loaded into Gephi, an excellent tool for visualizing and analyzing networks.

Mo Farah's Olympic 5000m final victory, in tweets

javascript, raphaeljs

12th August 2012

Another sports-related post, this time inspired by Mo Farah's amazing double gold medals (in the 5000m and 10000m) over the last couple of weeks at the London Olympics.

I used the gRaphael charting library and the Twitter search API to show how the rate of tweets containing the hashtag #gomo varied before, during and just after the 5000m London Olympics final. Hover over the chart to display the text for selected tweets.

The main features of the chart are a small peak just before the race starts followed by the huge peak after Mo wins. And I thought it was a long way to jog to the bus stop when running late in the morning!
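
For anyone wanting to reproduce this kind of chart from their own data, a tweet rate can be computed by counting tweets in fixed-width time bins. The sketch below is in Python rather than the JavaScript used for the chart, and it assumes the tweet times are already available as seconds (for example Unix timestamps); the function name `tweets_per_bin` and the 10 second bin width are choices made for the example.

```python
from collections import Counter


def tweets_per_bin(timestamps, bin_seconds=10):
    """Return sorted (bin_start, tweets_per_second) pairs.

    `timestamps` are tweet times in seconds. Bins containing no tweets are
    omitted from the result.
    """
    counts = Counter(int(t // bin_seconds) * bin_seconds for t in timestamps)
    return [(start, counts[start] / bin_seconds) for start in sorted(counts)]


# Made-up timestamps: three tweets in the first 10 seconds, then one per bin.
rates = tweets_per_bin([0, 1, 2, 10, 25], bin_seconds=10)
```

A narrower bin shows sharper peaks but a noisier curve; a wider bin smooths the curve at the cost of blurring exactly when each peak happens.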

Visualizing Reading FC's Winning 2011/2012 Season

javascript, d3

8th August 2012

I used the D3 JavaScript visualization library and modified one of the code examples from Mike Dewar's excellent book Getting Started with D3 to build a modest visualization which tells the story of Reading Football Club's championship-winning 2011-2012 season.

The visualization plots matches played (x-axis) against points accumulated (y-axis). Click the "Add club" button to compare Reading's progress against that of the other clubs playing in the England and Wales FA Championship.
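
The y-axis values are simply a running total of league points. As a small sketch (in Python rather than the JavaScript used for the visualization), assuming the usual three points for a win and one for a draw, and with a made-up sequence of results rather than Reading's actual ones:

```python
from itertools import accumulate

# League points awarded per result: win, draw, loss.
POINTS = {"W": 3, "D": 1, "L": 0}


def cumulative_points(results):
    """Return the running points total after each match.

    `results` is a sequence of "W", "D" or "L", in the order the matches
    were played.
    """
    return list(accumulate(POINTS[result] for result in results))


# Hypothetical run of four results, for illustration only.
totals = cumulative_points("WDLW")
```

Plotting `totals` against the match number (1, 2, 3, ...) gives one club's line on the chart.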
