Today’s script performs a search on Technorati and then scrapes out the search results. It is useful because Technorati stays up to date with what is happening in the blogosphere, which gives you a way to tune into everything going on there.
The scope of the blogosphere, combined with Technorati’s ability to sort results by most recent, is what makes this so powerful. This script will help you find up-to-the-moment content that you can then data-mine for whatever purpose you want.
Possible uses:
- Create a tag cloud of what is happening today within your niche
- Aggregate the content into your own site
- Post it to Twitter
- Convert the search results into an RSS feed
And here’s the Python code:
    import urllib2
    from BeautifulSoup import BeautifulSoup

    def get_technorati_results(query, page_limit=10):
        """Return a list of (url, summary, title) tuples for a Technorati search."""
        page = 1
        links = []
        while page <= page_limit:
            url = ('http://technorati.com/search/' + '+'.join(query.split())
                   + '?language=n&page=' + str(page))
            html = urllib2.urlopen(urllib2.Request(url)).read()
            soup = BeautifulSoup(html)

            # Each result is an <li class="hentry"> inside <div id="results">.
            results = soup.find('div', id='results')
            if results is None:
                break
            for item in results.findAll('li', attrs={'class': 'hentry'}):
                quote = item.find('blockquote')
                heading = item.find('h3')
                if quote is None or heading is None:
                    continue  # skip entries without the expected markup
                links.append((quote['cite'],
                              ''.join(quote.findAll(text=True)),
                              ''.join(heading.findAll(text=True))))

            # Stop when the page has no "next" link.
            next_item = soup.find('li', attrs={'class': 'next'})
            if next_item is None or next_item.find('a') is None:
                break
            page += 1
        return links

    if __name__ == '__main__':
        links = get_technorati_results('halotis marketing')
        print links
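As a starting point for the tag cloud idea from the list of possible uses, here is a minimal sketch of counting the most common words in the results. It is not part of the original script: `tag_cloud_counts`, `STOP_WORDS`, and the sample data are hypothetical names made up for illustration, and it assumes `links` is a list of (url, summary, title) tuples like the one the search function returns.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; extend it for real use).
STOP_WORDS = frozenset(['a', 'an', 'and', 'the', 'of', 'to', 'in',
                        'is', 'it', 'for', 'on', 'with'])

def tag_cloud_counts(links, top_n=20):
    """Return the top_n (word, count) pairs across titles and summaries.

    links is assumed to be a list of (url, summary, title) tuples.
    """
    counts = Counter()
    for _url, summary, title in links:
        # Lowercase the text and split it into simple word tokens.
        words = re.findall(r"[a-z0-9']+", (title + ' ' + summary).lower())
        # Ignore stop words and very short tokens.
        counts.update(w for w in words
                      if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)

# Hypothetical sample data standing in for real search results.
sample = [
    ('http://example.com/1',
     'Python scraping tips for marketing blogs',
     'Scraping with Python'),
    ('http://example.com/2',
     'More marketing automation ideas',
     'Marketing automation'),
]

print(tag_cloud_counts(sample, top_n=3))
```

The counts can then be mapped to font sizes to draw the cloud. For real results you would probably want a much larger stop-word list.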