Scrape Technorati Search Results in Python

Today’s script runs a search on Technorati and scrapes out the search results. It is useful because Technorati stays up to date with what is happening in the blogosphere, which gives you a way to tune in to everything going on there.

The scope of the blogosphere, matched with Technorati’s ability to sort the results by most recent, is what makes this so powerful. This script will help you find up-to-the-minute content, which you can then data-mine for whatever purpose you want.

Possible uses:

  • Create a tag cloud of what is happening today within your niche
  • Aggregate the content into your own site
  • Post it to Twitter
  • Convert the search results into an RSS feed

And here’s the Python code:

import urllib2
from BeautifulSoup import BeautifulSoup
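def get_technorati_results(query, page_limit=10):
    """Return a list of (url, summary, title) tuples for a Technorati search."""
    page = 1
    links = []
    # Fetch at most page_limit pages of results.
    while page <= page_limit:
        # The base search URL is blank in this listing; fill in the empty string before running.
        url = '' + '+'.join(query.split()) + '?language=n&page=' + str(page)
        req = urllib2.Request(url)
        html = urllib2.urlopen(req).read()
        soup = BeautifulSoup(html)
        results = soup.find('div', id='results')
        if results is None:
            break
        # links is a list of (url, summary, title) tuples
        for entry in results.findAll('li', attrs={'class': 'hentry'}):
            quote = entry.find('blockquote')
            title = entry.find('h3')
            links.append((quote['cite'],
                          ''.join(quote.findAll(text=True)),
                          ''.join(title.findAll(text=True))))
        # Stop when there is no "next" link, i.e. this was the last page.
        next_item = soup.find('li', attrs={'class': 'next'})
        if next_item is None or next_item.find('a') is None:
            break
        page += 1
    return links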
if __name__=='__main__':
    links = get_technorati_results('halotis marketing')
    print links
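
As a rough illustration of the first use listed above, the (url, summary, title) tuples can be boiled down to word counts for a tag cloud. This is only a sketch: `tag_counts`, the `STOPWORDS` set, and the sample data are made up for the example and are not part of the script above. It uses only the standard library.

```python
import re
from collections import Counter

# Hypothetical stop-word list; extend it for your own niche.
STOPWORDS = set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'for', 'on'])


def tag_counts(links, top=10):
    """Count words in the summaries and titles of (url, summary, title) tuples."""
    counts = Counter()
    for url, summary, title in links:
        # Lowercase the text and pull out runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", (summary + ' ' + title).lower())
        # Skip stop words and very short words.
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top)


# Made-up sample data in the same shape the scraper returns.
sample = [
    ('http://example.com/1', 'Python scraping tips for marketing', 'Scraping with Python'),
    ('http://example.com/2', 'Marketing on Twitter', 'Twitter marketing basics'),
]
top_tags = tag_counts(sample, top=3)
```

With the sample data, the most frequent word is "marketing", which appears three times. To render an actual cloud, pass in the real `links` list and scale each word's font size by its count.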