Tag Archives: search

Digg is by far the most popular social news site on the internet. With it’s simple “thumbs up” system the users of the site promote the most interesting and high quality stores and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.

It’s a great site and one that I visit every day.

I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.

The short python script I came up with will return the search results from Digg in a standard python data structure so it’s simple to use. It parses out the title, destination, comment count, digg link, digg count, and summary for the top 100 search results.

You can perform advanced searches on digg by using a number of different flags:

  • +b Add to see buried stories
  • +p Add to see only promoted stories
  • +np Add to see only unpromoted stories
  • +u Add to see only upcoming stories
  • Put terms in “quotes” for an exact search
  • -d Remove the domain from the search
  • Add -term to exclude a term from your query (e.g. apple -iphone)
  • Begin your query with site: to only display stories from that URL.

This script also allows the search results to be sorted:

from DiggSearch import digg_search
digg_search('twitter', sort='newest')  #sort by newest first
digg_search('twitter', sort='digg')  # sort by number of diggs
digg_search('twitter -d')  # sort by best match

Here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
 
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
 
def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
 
    digg_results = []
    for page in range (1,pages):
 
        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)
 
        #GET the page
        request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )
        urlfile = urllib2.urlopen(request)
        page = urlfile.read(200000)
        urlfile.close()
 
        #scrape it
        soup = BeautifulSoup(page)
        links = soup.findAll('h3', id=re.compile("title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile("diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
        for i in range(0,len(links)):
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(), 
                    'destination':links[i].find('a')['href'],
                    'comment_count':int(comments[i].string.split()[0]),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':body[i].find(text=True)
                    }
            digg_results.append(item)
 
        #last page early exit
        if len(links) < 10:
            break
 
    return digg_results
 
if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r

You can grab the source code from the bitbucket repository.