Tag Archives: yahoo

OK, even though Yahoo Search is on the way out and will be replaced by the engine behind Bing, that transition won't happen until sometime in 2010. Until then, Yahoo still holds about 20% of the search engine market share, so it remains an important source of traffic for your websites.

This script is similar to the Google and Bing SERP scrapers I posted earlier on this site, but Yahoo's pages were slightly more complicated to parse. That's because the result links go through Yahoo's redirect service, so some regular expression matching is needed to recover the real destination URLs.
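To illustrate the redirect parsing on its own, here is a minimal sketch. The href value is a made-up example of the general shape of Yahoo's redirect links (the exact prefix varies); the real destination follows the `/**` marker, URL-encoded. The import fallback is just so the snippet also runs on Python 3, where `unquote_plus` moved into `urllib.parse`.

```python
import re

try:
    from urllib import unquote_plus        # Python 2, as used in the script below
except ImportError:
    from urllib.parse import unquote_plus  # Python 3

# Hypothetical example of a Yahoo redirect-style href; the destination
# URL comes after the '/**' marker and is URL-encoded.
href = 'http://rds.yahoo.com/_ylt=A0abc/SIG=11abc/**http%3a//www.halotis.com/'

url_pattern = re.compile(r'/\*\*(.*)')
destination = unquote_plus(url_pattern.findall(href)[0])
print(destination)  # -> http://www.halotis.com/
```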

I will be putting all these little components together into a larger program later.

Example Usage:

$ python yahooScrape.py
http://www.halotis.com/
http://www.halotis.com/2007/08/27/automation-is-key-automate-the-web/
http://twitter.com/halotis
http://www.scribd.com/halotis
http://www.topless-sandal.com/product_info.php/products_id/743?tsSid=71491a7bb080238335f7224573598606
http://feeds.feedburner.com/HalotisBlog
http://www.planet-tonga.com/sports/haloti_ngata.shtml
http://blog.oregonlive.com/ducks/2007/08/kellens_getting_it_done.html
http://friendfeed.com/mfwarren
http://friendfeed.com/mfwarren?start=30

Here’s the Script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
def yahoo_grab(query):
 
    # Fetch the Yahoo results page, spoofing a browser User-Agent
    address = "http://search.yahoo.com/search?p=%s" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    urlfile.close()
 
    # Each result link is wrapped in Yahoo's redirect service; the real
    # destination URL follows a '/**' marker and is URL-encoded.
    soup = BeautifulSoup(page)
    url_pattern = re.compile(r'/\*\*(.*)')
    links = [urllib.unquote_plus(url_pattern.findall(h3.find('a')['href'])[0])
             for h3 in soup.find('div', id='web').findAll('h3')]
 
    return links
 
if __name__=='__main__':
    # Example: print the result URLs for a query
    links = yahoo_grab('halotis')
    print '\n'.join(links)