Today’s script performs a search on Technorati and scrapes the search results. It is useful because Technorati stays up to date with what is happening in the blogosphere, which gives you a way to tune into everything going on there.

The scope of the blogosphere, combined with Technorati’s ability to sort results by most recent, is what makes this so powerful. This script will help you find up-to-the-moment content, which you can then data-mine for whatever purpose you want.

Possible uses:

  • Create a tag cloud of what is happening today within your niche
  • Aggregate the content into your own site
  • Post it to Twitter
  • Convert the search results into an RSS feed
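
To make the first idea a bit more concrete: a tag cloud boils down to counting words across the result titles and summaries. Here is a minimal sketch, assuming the results arrive as (url, summary, title) tuples like the ones the scraper further down returns. The function name `tag_cloud_counts`, the stop-word list, and the sample data are all made up for illustration, and it uses only the standard library.

```python
import re
from collections import Counter

# Hypothetical stop-word list; extend it for your own niche.
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'for', 'on', 'with'}

def tag_cloud_counts(links, top_n=20):
    """Return the top_n most frequent words as (word, count) pairs.

    links is assumed to be a list of (url, summary, title) tuples.
    """
    counts = Counter()
    for _url, summary, title in links:
        # Lowercase everything and pull out runs of letters, digits and apostrophes
        words = re.findall(r"[a-z0-9']+", (title + ' ' + summary).lower())
        # Skip stop words and very short tokens
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)

# Made-up sample data standing in for real search results
sample = [
    ('http://example.com/1', 'Marketing tips for bloggers', 'Blog marketing'),
    ('http://example.com/2', 'More marketing ideas', 'Marketing today'),
]

top = tag_cloud_counts(sample, top_n=1)  # [('marketing', 4)]
```

The counts can then be mapped to font sizes however you like; that part depends on how your site renders the cloud.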

And here’s the Python code for the scraper (Python 2 with BeautifulSoup 3):

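import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

def get_technorati_results(query, page_limit=10):
    """Return a list of (url, summary, title) tuples from up to page_limit pages."""
    page = 1
    links = []

    # <= so that page_limit pages are fetched (the original '<' stopped one short)
    while page <= page_limit:
        # collapse whitespace, then escape the query; spaces become '+'
        terms = urllib.quote_plus(' '.join(query.split()))
        url = 'http://technorati.com/search/' + terms + '?language=n&page=' + str(page)
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)

        results = soup.find('div', id='results')
        if results is None:
            # no results container on this page, so nothing more to scrape
            break

        # links is a list of (url, summary, title) tuples
        for entry in results.findAll('li', attrs={'class': 'hentry'}):
            quote = entry.find('blockquote')
            heading = entry.find('h3')
            if quote is None or heading is None:
                continue
            links.append((quote['cite'],
                          ''.join(quote.findAll(text=True)),
                          ''.join(heading.findAll(text=True))))

        # stop when the page has no "next" link (find() returns None in that case)
        next_item = soup.find('li', attrs={'class': 'next'})
        if next_item is None or next_item.find('a') is None:
            break
        page += 1

    return links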
 
if __name__=='__main__':
    links = get_technorati_results('halotis marketing')
    print links
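
One of the uses listed at the top was turning the search results into an RSS feed. The sketch below shows one way that might look, assuming the (url, summary, title) tuples returned by get_technorati_results. Unlike the script above it is written for Python 3, and it uses only the standard library's xml.etree.ElementTree. The function name `links_to_rss` and its parameters are hypothetical, and the feed contains just the required channel fields plus one item per result.

```python
import xml.etree.ElementTree as ET

def links_to_rss(links, feed_title, feed_link, feed_description=''):
    """Build a minimal RSS 2.0 document from (url, summary, title) tuples."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    ET.SubElement(channel, 'title').text = feed_title
    ET.SubElement(channel, 'link').text = feed_link
    ET.SubElement(channel, 'description').text = feed_description

    for url, summary, title in links:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = title.strip()
        ET.SubElement(item, 'link').text = url
        ET.SubElement(item, 'description').text = summary.strip()

    # ElementTree escapes &, < and > in element text when serializing
    return ET.tostring(rss, encoding='unicode')

# Made-up sample standing in for real scraper output
feed = links_to_rss(
    [('http://example.com/post', ' Tips & tricks ', ' A sample post ')],
    feed_title='Technorati search: halotis marketing',
    feed_link='http://technorati.com/search/halotis+marketing',
)
```

The returned string can be written to a file and served as-is. A fuller feed would probably also want a pubDate and guid per item, but the tuples above do not carry that information.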