Dexy for Data Scientists

February 3, 2011

Dexy is a tool for writing about code and data, so it’s well suited to writing about data science, which involves lots of both. I’m very excited to be at the Strata conference, meeting lots of fellow R and Python users working in statistics and data analysis. Hopefully, for some of them, Dexy will be a new productivity and communication tool to add to their open source portfolios.

I want to show an example using some of the code presented at the conference, in particular some of the scripts from the data bootcamp tutorial. The original source code and slides are on Drew Conway’s GitHub, and hopefully videos will be available.

To run these examples, you need access to the utility functions for accessing Gmail via IMAP. Either they need to be installed as a Python package, or the gmail directory needs to be in the same location as the script you wish to run. Here’s a setup.py file you can use to create a package:

#!/usr/bin/env python

from setuptools import setup, find_packages

setup(name='gmail',
      description='gmail utils',
      packages=find_packages()
      )

The examples involve harvesting and analyzing information from a fake email account. The first step is a simple listing of which email folders are present and how many messages each of them contains. Here is the script:

"""
email_stats.py

Created by Hilary Mason on 2011-01-31.
Copyright (c) 2011 Hilary Mason. All rights reserved.
"""

# Modified by Ana Nelson

import sys, os
from gmail import Gmail

g = Gmail("ann9enigma@gmail.com", "stratar0x")

folder_stats = {}
folder_stats['inbox'] = len(g.get_message_ids())

for folder_name in g.list_folders():
    folder_stats[folder_name] = len(g.get_message_ids(folder_name))

for folder_name, count in folder_stats.items():
    print "Folder %s, # messages: %s" % (folder_name, count)

This generates output like the following:


Folder [Gmail]/Starred, # messages: 4
Folder INBOX, # messages: 16
Folder [Gmail]/All Mail, # messages: 37
Folder bacn, # messages: 4
Folder Personal, # messages: 0
Folder Travel, # messages: 0
Folder Work, # messages: 0
Folder commercial, # messages: 10
Folder [Gmail]/Drafts, # messages: 0
Folder waiting, # messages: 1
Folder [Gmail]/Spam, # messages: 51
Folder inbox, # messages: 16
Folder [Gmail]/Trash, # messages: 4
Folder [Gmail]/Sent Mail, # messages: 6
Folder [Gmail], # messages: 0
Folder friends, # messages: 13
Folder Receipts, # messages: 0

Next we look at the graph structure of the network represented by the emails. Each email creates an “edge” from the sender to the recipient, so the first task is to harvest these edges.

We create a simple class whose only method is an __init__ method, which does all the work and saves the edges to a data file:

class email_edges(object):
    def __init__(self, username, password):

First we create an instance of the Gmail class for interacting with gmail:

        g = Gmail(username, password)

Next we create a file object and a CSV writer:

        csv_file = open('dexy--email-graph.csv', 'wb')
        graph_out = csv.writer(csv_file)

Then we start iterating over the messages in all the folders:

        viewed_messages = []
        # iterate through all folders in the account
        for folder in g.list_folders():
            # iterate through message IDs
            for message_id in g.get_message_ids(folder):
                # ...but don't repeat messages
                if message_id not in viewed_messages:
                    msg = g.get_message(message_id)

and parse each message to extract the from and to information:

                    # grab the from and to lines
                    for line in msg.split('\n'):
                        line = line.strip()
                        if line[0:5] == "From:":
                            msg_from = line[5:].strip()
                        elif line[0:3] == "To:":
                            msg_to = line[3:].strip()

and we write this to the CSV file we created earlier:

                    try:
                        # output the from and to
                        graph_out.writerow([msg_from, msg_to])
                    # ignore if we can't read the headers
                    except UnboundLocalError:
                        pass

The script is run by:

e = email_edges(username, password)
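The header-scanning loop above can also be tried without an IMAP connection. What follows is only a sketch, not the tutorial's code: it assumes Python 3, uses the standard library's email parser in place of the line-by-line scan, and the names extract_edge and write_edges are made up for illustration. One difference worth noting is that the parser only looks at the header section, so a body line that happens to start with "From:" is not mistaken for a header.

```python
import csv
import io
from email import message_from_string


def extract_edge(raw_message):
    """Return (sender, recipient) from a raw message, or None if either header is missing."""
    msg = message_from_string(raw_message)
    sender, recipient = msg["From"], msg["To"]
    if sender is None or recipient is None:
        return None
    return sender.strip(), recipient.strip()


def write_edges(raw_messages, out):
    """Write one CSV row per unique message that has both headers; return the row count."""
    writer = csv.writer(out)
    seen = set()  # message IDs already handled (hypothetical stand-in for viewed_messages)
    rows = 0
    for message_id, raw in raw_messages:
        if message_id in seen:
            continue
        seen.add(message_id)
        edge = extract_edge(raw)
        if edge is not None:
            writer.writerow(edge)
            rows += 1
    return rows


# Made-up messages: a duplicate ID, and one message with no address headers.
sample = [
    (1, "From: Jake Hofman <jhofman@gmail.com>\nTo: Ann Enigma <ann9enigma@gmail.com>\n\nHi"),
    (1, "From: Jake Hofman <jhofman@gmail.com>\nTo: Ann Enigma <ann9enigma@gmail.com>\n\nHi"),
    (2, "Subject: no addresses\n\nFrom: this line is body text"),
]
buffer = io.StringIO()
count = write_edges(sample, buffer)
print(count)
print(buffer.getvalue())
```

Returning None for unreadable headers plays the same role as catching UnboundLocalError in the original: messages without both a sender and a recipient are skipped.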

The data collected looks like:


Gmail Team <mail-noreply@google.com>,Ann Enigma <ann9enigma@gmail.com>
Gmail Team <mail-noreply@google.com>,Ann Enigma <ann9enigma@gmail.com>
Gmail Team <mail-noreply@google.com>,Ann Enigma <ann9enigma@gmail.com>
Atul Sharma <asharma@sw-at.com>,ann9enigma@gmail.com
Vidas Vitkauskas <invite+kjdm-5jm5-jd@facebookmail.com>,ann9enigma@gmail.com
Jake Hofman <jhofman@gmail.com>,Ann Enigma <ann9enigma@gmail.com>
Jake Hofman <jake@jakehofman.com>,Ann Enigma <ann9enigma@gmail.com>
"""BananaRepublic.com"" <custserv@bananarepublic.delivery.net>",ann9enigma@gmail.com

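In the rows above, the same recipient appears both as "Ann Enigma <ann9enigma@gmail.com>" and as a bare address, and a graph built from the raw strings would treat those as two different nodes. One possible way to collapse them, sketched here in Python 3 under the assumption that each header holds a single address, is the standard library's email.utils.parseaddr. The normalize function is a hypothetical helper, not part of the tutorial code.

```python
from email.utils import parseaddr


def normalize(address_header):
    """Reduce a header such as 'Name <user@host>' to a lowercase 'user@host'."""
    _name, address = parseaddr(address_header)
    return address.lower()


# Two rows in the same shape as the CSV data shown above.
rows = [
    ("Jake Hofman <jhofman@gmail.com>", "Ann Enigma <ann9enigma@gmail.com>"),
    ("Atul Sharma <asharma@sw-at.com>", "ann9enigma@gmail.com"),
]
edges = [(normalize(sender), normalize(recipient)) for sender, recipient in rows]
recipients = {recipient for _sender, recipient in edges}
print(edges)
print(recipients)
```

For headers that list several recipients, email.utils.getaddresses would be the function to look at instead.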
Now we will do a visualization based on this edge data, using the NetworkX package. Here is the code which draws the full network graph from the CSV file generated above:

gmail_graph=graphFromCSV("dexy--email-graph.csv")

full_graph=plt.figure(figsize=(6,6))
nx.draw_spring(gmail_graph, arrows=False, node_size=50, with_labels=False)
plt.savefig("dexy--full-graph.png")

Here is what the full network graph looks like.
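The graphFromCSV helper is not shown in this post. As a rough idea of what reading the edge file involves, here is a standard-library-only sketch in Python 3 that turns CSV rows into an adjacency mapping. The function names edges_from_csv and adjacency are assumptions for illustration and say nothing about how the tutorial's helper is actually written.

```python
import csv
import io
from collections import defaultdict


def edges_from_csv(lines):
    """Yield (sender, recipient) pairs from CSV rows, skipping rows without two fields."""
    for row in csv.reader(lines):
        if len(row) == 2:
            yield row[0], row[1]


def adjacency(edges):
    """Map each sender to the set of recipients they wrote to."""
    graph = defaultdict(set)
    for sender, recipient in edges:
        graph[sender].add(recipient)
    return dict(graph)


# A few rows copied from the collected data, including one with CSV-quoted quotes.
data = io.StringIO(
    'Jake Hofman <jhofman@gmail.com>,Ann Enigma <ann9enigma@gmail.com>\n'
    '"""BananaRepublic.com"" <custserv@bananarepublic.delivery.net>",ann9enigma@gmail.com\n'
    'Jake Hofman <jhofman@gmail.com>,Ann Enigma <ann9enigma@gmail.com>\n'
)
graph = adjacency(edges_from_csv(data))
for sender, recipients in graph.items():
    print(sender, "->", sorted(recipients))
```

With NetworkX, the same pairs could instead be handed to a DiGraph via add_edges_from before drawing.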

In a recent blog post on dataists, Mike Dewar said:

“The basic data science pipeline is on its way to becoming an open one. From Open Data, through an open source analysis, and ending up in results released as part of the Creative Commons, every step of data science can be performed openly. … The central part of this pipeline – Open Analysis – has a basic problem: what’s the use of sharing analysis nobody can read or understand? It’s great that people put their analysis online for the world to see, but what’s the point if that analysis is written in dense code that can barely be read by its author?”

By making it easy to document analysis scripts, Dexy can help with this sharing of code. Perhaps more importantly, Dexy can help with the sharing of data science itself: not only documentation that explains analysis code, but also other kinds of documents that use code and data in a dynamic and automated way, so that results can be reverse-engineered. This means you can easily verify someone else’s work, learn from it, and fork not just their code but an entire workflow and its explanation in words.