I'm fascinated by networks and data visualization. I've always wanted to try my hand at making some of the inspiring images I see on blogs like FlowingData. This network diagram is my first amateur attempt.
I started by writing a rather simple web crawler in Python. The logic for the bot was:
1. Open a page
2. Create a list of all the links on that page (capture the total number of links)
3. For each link, create a new bot to follow the link and start the whole process again.
This was a great chance to use Python's threading module. I am no expert in threading or multiprocessing, but threading allowed me to create a new bot for each link I wanted to follow.
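In case the pattern isn't obvious from the full class below, the idea is just one thread per link: hand a worker function a URL and let it run. A minimal sketch (the function name and URLs here are made up for illustration):

from threading import Thread

def visit(url):
    # Placeholder worker - the real bot below fetches and parses the page.
    print 'visiting %s' % url

for link in ['http://example.com/a', 'http://example.com/b']:
    # One thread per link; start() returns immediately,
    # so all the workers run concurrently.
    Thread(target=visit, args=(link,)).start()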
Here is the code for my spider class:
'''
Created on Jun 13, 2012

@author: Alex Baker
'''
import urllib2
import BeautifulSoup
from threading import Thread


class spider1(object):

    def scan(self, url, mem, f):
        try:
            # Get the url
            usock = urllib2.urlopen(url)
            # Your current URL is now your "old" url and
            # all the new ones come from the page
            old_url = url
            # Read the data to a variable
            data = usock.read()
            # Create a Beautiful Soup object to parse the contents
            soup = BeautifulSoup.BeautifulSoup(data)
            # Get the title
            title = soup.title.string
            # Get the total number of links
            count = len(soup.findAll('a'))
            # For each link, create a new bot and follow it.
            for link in soup.findAll('a'):
                # Cleaning up the url
                url = (link.get('href') or '').strip()
                # Skip anchors, relative paths, queries and javascript: links
                if url[:1] in ['#', '/', '', '?', 'j']:
                    continue
                # Also, avoid following the same link
                elif url == old_url:
                    continue
                # Get the domain - not interested in other links
                url_domain = url.split('/')[2]
                # Build a domain link for our bot to follow
                url = "http://%s/" % (url_domain)
                # Make sure that you have not gone to this domain already
                if self.check_mem(url, mem) == 0:
                    # Create your string to write to file
                    text = "%s,%s,%s\n" % (old_url, url, count)
                    # Write to your file object
                    f.write(text)
                    # Add the domain to the "memory" to avoid it going forward
                    mem.append(url)
                    # Spawn a new bot to follow the link
                    spawn = spider1()
                    # Set it loose!
                    Thread(target=spawn.scan, args=(url, mem, f)).start()
        except Exception, errtxt:
            # Report the url and the error, then just keep going -
            # avoids allowing the thread to end in error.
            print 'error with url %s: %s' % (url, errtxt)

    def check_mem(self, url, mem):
        # Quick check of the "memory" to see if the domain has already been visited.
        if url in mem:
            return 1
        return 0
As you can see, the code is simplistic - it only considers the domain/sub-domain rather than each individual link. Also, because it checks to make sure that no domain is visited twice, each domain appears as a destination only once, so the output is really a map of which page first led the crawl to each domain rather than a complete link graph.
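If you wanted to be stricter about pulling the domain out of each link, the standard library's urlparse would be the natural replacement for splitting on slashes - this is a sketch of that swap, not what the class above actually does:

from urlparse import urlparse

def domain_link(url):
    # Pull the host out of an absolute URL and rebuild the bare domain link.
    netloc = urlparse(url).netloc
    return "http://%s/" % netloc if netloc else None

print domain_link('http://justanasterisk.com/2012/06/some-post?x=1')
# -> http://justanasterisk.com/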
To run the class, I used something like this:
mem = []
f = open('output.txt', 'w')
url = 'http://justanasterisk.com'  # write the url here
s = spider1()
s.scan(url, mem, f)
Once started, it doesn't stop - so kill it after a while (or build that in). Running this on my MacBook, I recorded 27,000 links in about 10 minutes.
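If you'd rather build the stop in than reach for Ctrl-C, one crude option (my own addition, not part of the class above) is to run the first bot as a daemon thread and let the main thread act as a timer - the daemon flag is inherited by the threads each bot spawns, so everything dies when the main thread exits:

import time
from threading import Thread

mem = []
f = open('output.txt', 'w')
s = spider1()

t = Thread(target=s.scan, args=('http://justanasterisk.com', mem, f))
t.daemon = True   # spawned bots inherit this flag from their parent thread
t.start()

time.sleep(600)   # crawl for ~10 minutes, then let the program exit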
The number of data points is small in comparison to some of the sets I've explored using BigQuery or Amazon SimpleDB. However, I wanted to make a visualization, and I realized that the number of pixels would really define how many data points were useful. I figured that 10 minutes of crawling would give me the structure I wanted. I used my blog justanasterisk.com as the starting point. I won't attach the data (you can create that yourself), but suffice it to say that each line was:
source, destination, # of links on source page
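Each line of output.txt follows that format, so a few lines of standard-library Python are enough to sanity-check the crawl before firing up a visualization tool. Since the bot never revisits a domain, every destination appears only once, which makes out-degree the more interesting count - for example, how many new domains each page introduced:

import csv
from collections import Counter

out_degree = Counter()
with open('output.txt') as edges:
    for source, destination, link_count in csv.reader(edges):
        # Count how many new domains each source page led the crawl to.
        out_degree[source] += 1

# The ten sources that introduced the most new domains.
for page, new_domains in out_degree.most_common(10):
    print page, new_domains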
Here is where I was out of my element. I browsed a few different tools, and the best (read: easiest) solution for my needs was Cytoscape. It is simple to use and has several presets included to make you feel like you've done some serious analysis. For the image above, I used one of the built-in layouts (modified slightly) and a custom visual style.
I won't underwhelm you with further details, but shoot me an email if you want more. I'll probably add a few more images to this post when I get them rendered.