Another post in my series of Python scripting articles. This time I’m sharing a script that rips through RSS feeds, devours their content, and stuffs it into a database, in a way that scales up to thousands of feeds. To accomplish this, the script is multi-threaded.

The big problem with scaling up a web script like this is latency: between bandwidth limits and remote processing time, a single request can take a couple of seconds to return anything. Requesting one feed after another in series wastes a lot of time, and that makes this type of script a prime candidate for threading.

I borrowed parts of this script from this post: Threaded data collection with Python, including examples

What could you do with all this content? Just off the top of my head I can think of many interesting things to do:

  • Create histograms of post publish times to find the most and least popular days and times for publishing
  • Plot trends of certain words or phrases over time
  • Create your own aggregation website
  • Get trending topics by counting the occurrence of words per day
  • Try writing some natural language processing algorithms
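As a quick sketch of the first idea, here’s how you might count posts by weekday once the database is populated. This is Python 3 (unlike the script below), and it assumes the RSSEntries schema and date format that the script stores:

```python
import sqlite3
from collections import Counter
from datetime import datetime

def posts_by_weekday(db_path):
    """Return a Counter mapping weekday name -> number of posts in RSSEntries."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute('SELECT date FROM RSSEntries').fetchall()
    conn.close()

    counts = Counter()
    for (date_str,) in rows:
        # dates are stored as "YYYY-MM-DD HH:MM:SS" by the scraper script
        dt = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
        counts[dt.strftime("%A")] += 1
    return counts
```

From there a bar chart of `posts_by_weekday("rss.sqlite")` gives the histogram.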

This script is set to 20 threads, but that number really needs to be tuned for best performance. Depending on your bandwidth and the sites you want to grab, you may want to tweak the THREAD_LIMIT value.

import sqlite3
import threading
import Queue
from time import strftime
 
import feedparser     # available at http://feedparser.org
 
 
THREAD_LIMIT = 20
jobs = Queue.Queue(0)
rss_to_process = Queue.Queue(THREAD_LIMIT)
 
DATABASE = "rss.sqlite"
 
conn = sqlite3.connect(DATABASE)
conn.row_factory = sqlite3.Row
c = conn.cursor()
 
# insert initial values into feed database; the UNIQUE constraint plus
# INSERT OR IGNORE keeps repeated runs from duplicating the seed feed
c.execute('CREATE TABLE IF NOT EXISTS RSSFeeds (id INTEGER PRIMARY KEY AUTOINCREMENT, url VARCHAR(1000) UNIQUE);')
c.execute('CREATE TABLE IF NOT EXISTS RSSEntries (entry_id INTEGER PRIMARY KEY AUTOINCREMENT, id, url, title, content, date);')
c.execute("INSERT OR IGNORE INTO RSSFeeds(url) VALUES('http://www.halotis.com/feed/');")
 
feeds = c.execute('SELECT id, url FROM RSSFeeds').fetchall()
 
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stored them in the DB """
    for entry in items:
        c.execute('SELECT entry_id from RSSEntries WHERE url=?', (entry.link,))
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, date) VALUES (?,?,?,?,?)',
                      (id, entry.link, entry.title, entry.summary,
                       strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed)))
 
def thread():
    while True:
        try:
            id, feed_url = jobs.get(False) # False = Don't wait
        except Queue.Empty:
            return
 
        entries = feedparser.parse(feed_url).entries
        rss_to_process.put((id, entries), True) # This will block if full
 
for info in feeds: # Queue them up
    jobs.put([info['id'], info['url']])
 
for n in xrange(THREAD_LIMIT):
    t = threading.Thread(target=thread)
    t.start()
 
while threading.activeCount() > 1 or not rss_to_process.empty():
    # That condition means we want to do this loop if there are threads
    # running OR there's stuff to process
    try:
        id, entries = rss_to_process.get(True, 1) # Block for up to a second
    except Queue.Empty:
        continue
 
    store_feed_items(id, entries)
 
conn.commit()
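Once the database is filling up, the trending-topics idea from the list above is just per-day word counting. A minimal Python 3 sketch against the same RSSEntries table; the naive regex tokenizer here is a stand-in for real text processing:

```python
import re
import sqlite3
from collections import Counter, defaultdict

def daily_word_counts(conn):
    """Map each day ("YYYY-MM-DD") to a Counter of words in that day's titles."""
    per_day = defaultdict(Counter)
    for title, date in conn.execute('SELECT title, date FROM RSSEntries'):
        day = date.split(' ')[0]                    # strip the time component
        words = re.findall(r'[a-z]+', title.lower())  # crude tokenization
        per_day[day].update(words)
    return per_day
```

Sorting each day’s Counter with `most_common()` then gives you that day’s trending words.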

Bit.ly offers a very simple API for creating short URLs. The service can also provide you with some basic click statistics. Unfortunately there are a few missing pieces in the API, so to get around them you’ll have to keep your own list of the bit.ly links you want to track. Depending on your situation, you may need to keep some of that information updated regularly and stored locally to do a deeper analysis of your links.

There are a couple of advanced tricks you can use to get more out of your tracking.

  1. Add GET arguments to the end of the URL to split test. If you want to track clicks from different sources that land on the same page, you need to use different links. The easiest way to create two links to the same page is to append a GET argument. So if you wanted to promote my site http://halotis.com and compare Twitter to AdWords, you could create bit.ly links to http://halotis.com?from=twitter and http://halotis.com?from=adwords. You can add more information with more arguments, such as http://halotis.com/?from=adwords&adgroup=group1. If you control the landing page, you will see those arguments in Google Analytics and have even more information about who clicked your links.
  2. Look at stats for any bit.ly link, including referring sites, real-time click time-lines, and locations, by adding a + to the end of it: http://bit.ly/10HYCo+
  3. Find out which other bit.ly users have shortened a link using the API – google.com bitly info
  4. Use the JavaScript library to grab stats and embed them into a webpage – see the code below
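The split-testing trick in the first item is easy to automate before you shorten anything. A small Python 3 sketch, standard library only (the function name and parameter names are my own, not part of any bit.ly API), that tags a landing-page URL with tracking arguments:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def tag_url(url, params):
    """Append tracking GET arguments to a URL, preserving any existing ones."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))  # keep arguments already on the URL
    query.update(params)
    return urlunparse(parts._replace(query=urlencode(query)))
```

Each tagged variant, e.g. `tag_url('http://halotis.com', {'from': 'twitter'})`, then gets its own bit.ly link so the click counts stay separate.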

Get click count stats inserted with this JavaScript example code. Just update the login and apiKey values and put this in the head section of your webpage:

<script type="text/javascript" charset="utf-8" src="http://bit.ly/javascript-api.js?version=latest&login=YOURBITLYLOGIN&apiKey=YOURAPIKEYGOESHERE"></script>
<script type="text/javascript" charset="utf-8">
	BitlyCB.myStatsCallback = function(data) {
		var results = data.results;
 
		var links = document.getElementsByTagName('a');
		for (var i=0; i < links.length; i++) {
			var a = links[i];
			if (a.href && a.href.match(/^http\:\/\/bit\.ly/)) {
				var hash = BitlyClient.extractBitlyHash(a.href);
				if (results.hash == hash || results.userHash == hash) {
					if (results.userClicks) {
						var uc = results.userClicks + " clicks on this bit.ly URL. ";
					} else {
						var uc = "";
					}
 
					if (results.clicks) {
						var c = results.clicks;
					} else {
						var c = "0";
					}
					c += " clicks on all shortened URLS for this source. ";
 
					var sp = BitlyClient.createElement('span', {'text': " [ " + uc + c + " ] "});
					a.parentNode.insertBefore(sp, a.nextSibling);
				}
			}
 
		}
 
	}
 
	// wait until page is loaded to call API
	BitlyClient.addPageLoadEvent(function(){
		var links = document.getElementsByTagName('a');
		var fetched = {};
		var hashes = [];
		for (var i=0; i < links.length; i++) {
			var a = links[i];
			if (a.href && a.href.match(/^http\:\/\/bit\.ly/)) {
				if (!fetched[a.href]) {
					BitlyClient.stats(BitlyClient.extractBitlyHash(a.href), 'BitlyCB.myStatsCallback');
					fetched[a.href] = true;
				}
			}
		}
 
	});
	</script>

If you want a small command-line script that fetches this data from bit.ly and prints it, check out this Python script, which uses the bitly library to make it very easy:

import bitly       #http://code.google.com/p/python-bitly/
BITLY_LOGIN = "YOUR_BITLY_LOGIN"
BITLY_API_KEY = "YOUR_BITLY_API_KEY"
 
short_url='http://bit.ly/31IqMl'
 
b = bitly.Api(login=BITLY_LOGIN,apikey=BITLY_API_KEY)
stats = b.stats(short_url)
print "%s - User clicks %s, total clicks: %s" % (short_url, stats.user_clicks, stats.total_clicks)