
Digg is by far the most popular social news site on the internet. With its simple “thumbs up” system, the users of the site promote the most interesting, highest-quality stories, and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.

It’s a great site and one that I visit every day.

I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.

The short Python script I came up with returns the search results from Digg in a standard Python data structure, so it’s simple to use. It parses out the title, destination, comment count, Digg link, Digg count, and summary for the top 100 search results.
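
Each result comes back as a plain dictionary. A single entry looks roughly like this (the values are made up just to show the shape; the keys match the code below):

{'title': 'Example story title',
 'destination': 'http://example.com/story',
 'comment_count': 42,
 'digg_link': 'http://digg.com/example_story',
 'digg_count': '156',
 'summary': 'A short excerpt from the story...'}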

You can perform advanced searches on Digg by using a number of different flags:

  • +b Add to see buried stories
  • +p Add to see only promoted stories
  • +np Add to see only unpromoted stories
  • +u Add to see only upcoming stories
  • Put terms in “quotes” for an exact search
  • -d Remove the domain from the search
  • Add -term to exclude a term from your query (e.g. apple -iphone)
  • Begin your query with site: to only display stories from that URL.
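
For example, combining several of these flags in a single query might look like this (the query itself is just an illustration):

from DiggSearch import digg_search
digg_search('site:techcrunch.com "social media" -facebook +p')  # promoted stories from one site, exact phrase, excluding a term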

This script also allows the search results to be sorted:

from DiggSearch import digg_search
digg_search('twitter', sort='newest')  #sort by newest first
digg_search('twitter', sort='digg')  # sort by number of diggs
digg_search('twitter -d')  # sort by best match

Here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
 
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
 
def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
 
    digg_results = []
    for page in range(1, pages + 1):
 
        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)
 
        #GET the page
        request = urllib2.Request(address, None, {'User-Agent': USER_AGENT})
        urlfile = urllib2.urlopen(request)
        html = urlfile.read(200000)  # read up to 200,000 bytes of the page
        urlfile.close()
 
        #scrape it
        soup = BeautifulSoup(html)
        links = soup.findAll('h3', id=re.compile("title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile("diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
        for i in range(0,len(links)):
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(), 
                    'destination':links[i].find('a')['href'],
                    'comment_count':int(comments[i].string.split()[0]),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':body[i].find(text=True)
                    }
            digg_results.append(item)
 
        #last page early exit
        if len(links) < 10:
            break
 
    return digg_results
 
if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r
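
Since the whole point is to re-purpose the results elsewhere, here is one simple way to use the output: dump the results to a CSV file. This sketch assumes the script above has been saved as DiggSearch.py (matching the import shown earlier):

import csv
 
from DiggSearch import digg_search
 
results = digg_search('twitter', sort='digg', pages=2)
 
f = open('digg_results.csv', 'wb')
writer = csv.writer(f)
writer.writerow(['title', 'destination', 'digg_count', 'comment_count'])
for r in results:
    writer.writerow([r['title'].encode('utf-8'), r['destination'],
                     r['digg_count'], r['comment_count']])
f.close()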

You can grab the source code from the bitbucket repository.

Have you ever wanted to track and assess your SEO efforts by seeing how they change your position in Google’s organic SERP? With this script you can now track and chart your position for any number of search queries and find the position of the site/page you are trying to rank.

This will allow you to visually identify any target keyword phrases that are doing well, and which ones may need some more SEO work.

This Python project has a number of different components:

  • SEOCheckConfig.py is used to add new target search queries to the database.
  • SEOCheck.py searches Google and saves the best position (within the top 100 results).
  • SEOCheckCharting.py graphs all the results.

The charts produced look like this:

[Chart: SERP position over time for each tracked search query]

The main part of the script is SEOCheck.py. This script should be scheduled to run regularly (I have mine running 3 times per day on my webfaction hosting account).
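
The real SEOCheck.py (linked below) also stores each result in the database with SQLAlchemy, but the core of the check boils down to fetching a Google results page and looking for the target domain. Here’s a rough sketch of that idea — not the actual script, and keep in mind that Google’s result markup changes often, so the h3 class="r" selector here is an assumption that may need updating:

import urllib, urllib2
 
from BeautifulSoup import BeautifulSoup
 
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
 
def google_position(query, domain, num=100):
    """Return the 1-based position of domain in the top num results, or None."""
    address = 'http://www.google.com/search?q=%s&num=%d' % (urllib.quote_plus(query), num)
    request = urllib2.Request(address, None, {'User-Agent': USER_AGENT})
    html = urllib2.urlopen(request).read()
    soup = BeautifulSoup(html)
 
    #organic result links sit inside h3 elements with class "r" (markup assumption)
    links = [h3.find('a')['href'] for h3 in soup.findAll('h3', attrs={'class': 'r'}) if h3.find('a')]
    for i, href in enumerate(links):
        if domain in href:
            return i + 1
    return None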

For a small SEO consultancy, this type of application generates the feedback and reports you should be using to communicate with your clients. It identifies where your efforts should go and how successful you have been.

To use this set of scripts, you first need to edit and run the SEOCheckConfig.py file. Add the queries and domains you’d like to check to the SETTINGS variable, then run the script to load them into the database.
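
I won’t reproduce SEOCheckConfig.py here, but the SETTINGS entries are essentially query/domain pairs — something along these lines (the exact structure is an assumption, and the queries are just examples):

SETTINGS = [
    {'query': 'internet marketing scripts', 'domain': 'halotis.com'},
    {'query': 'python seo tools',           'domain': 'halotis.com'},
]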

Then schedule SEOCheck.py to run periodically. On Windows you can do that using Scheduled Tasks:
[Screenshot: Windows Scheduled Task dialog]

On either Mac OS X or Linux you can use crontab to schedule it.
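
For example, a crontab entry that runs the check three times a day might look like this (adjust the path to wherever you saved the script):

0 */8 * * * python /path/to/SEOCheck.py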

To generate the chart, simply run the SEOCheckCharting.py script. It will plot all the results on one graph.

You can find and download all the source code for this in the HalOtis-Collection on bitbucket. It requires the BeautifulSoup, matplotlib, and SQLAlchemy libraries to be installed.

This is a simple Twitter Python script that checks your friends’ timeline and prints out any links that have been posted. In addition, it visits each of the URLs, finds the actual title of the destination page, and prints that alongside. This simple script demonstrates an easy way to gather some of the hottest trends on the internet the moment they happen.

If you set up a Twitter account within a niche and find a few of the players in that niche to follow, then you can simply find any links posted, check them to see if they are on topic (using some keyword heuristics), and then either notify yourself of the interesting content or automatically scrape it for use on one of your related websites. That gives you perhaps the most up-to-date content possible, before it even hits Google Trends. It also gives you a chance to promote it before the social news sites find it (or to be the first to submit it to them).

With a bit more work you could parse out some of the meta tag keywords/description, crawl the website, or find and cut out the content from the page. If it’s a blog you could post a comment.
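
As a small example of the keyword-heuristic idea mentioned above, a filter built on the getTitle() function from the script below could look something like this — the keyword list is purely illustrative:

NICHE_KEYWORDS = ['food', 'recipe', 'cooking']   # illustrative niche keywords
 
def is_on_topic(url):
    """Crude heuristic: keep a link only if its page title mentions a niche keyword."""
    title = getTitle(url)
    if not title:
        return False
    title = title.lower()
    return any(keyword in title for keyword in NICHE_KEYWORDS)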

Example Usage:

$ python TwitterLinks.py
http://bit.ly/s8rQX - Twitter Status - Tweets from users you follow may be missing from your timeline
http://bit.ly/26hiT - Why Link Exchanges Are a Terrible, No-Good Idea - Food Blog Alliance
http://FrankAndTrey.com - Frank and Trey
http://bit.ly/yPRHp - Gallery: Cute animals in the news this week
...

And here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
try:
   import json
except ImportError:
   import simplejson as json # http://undefined.org/python/#simplejson
import twitter     #http://code.google.com/p/python-twitter/
 
from urllib2 import urlopen
import re
 
SETTINGS = {'user':'twitter user name', 'password':'your password here'}
 
def listFriendsURLs(user, password):
    re_pattern='.*?((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'	# HTTP URL
    rg = re.compile(re_pattern,re.IGNORECASE|re.DOTALL)
 
    api = twitter.Api(user, password)
    timeline = api.GetFriendsTimeline(user)
 
    for status in timeline:
        m = rg.search(status.text)
        if m:
            httpurl=m.group(1)
            title = getTitle(httpurl)
            print httpurl, '-', title
 
def getTitle(url):
    req = urlopen(url)
    html = req.read()
 
    re_pattern='<title>(.*?)</title>'
    rg = re.compile(re_pattern,re.IGNORECASE|re.DOTALL)
 
    m = rg.search(html)
    if m:
        title = m.group(1)
        return title.strip()
    return None
 
if __name__ == '__main__':
    listFriendsURLs(SETTINGS['user'], SETTINGS['password'])

Flickr has an amazing library of images and a stellar API for accessing and easily downloading them. I wanted to make use of their API to start downloading a collection of images to use on a future website project and so I started looking for a nice Python script to help me do it.

I found the awesome flickrpy Python API wrapper, which makes all the hard work very easy, and wrote a short script that will search for images with a given tag (or tags) and download them to the current directory.

Expanding on this, you could easily use PIL to modify the images and re-purpose them for use on another website, such as a WordPress photoblog.
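
As a quick sketch of that idea, PIL can scale each downloaded image down to something blog-friendly. The 500-pixel limit and the 'blog_' filename prefix here are just examples:

import Image    # PIL
 
def make_blog_size(filename, max_size=500):
    """Shrink an image so it fits within a max_size x max_size box
    (preserving aspect ratio) and save a copy with a 'blog_' prefix."""
    img = Image.open(filename)
    img.thumbnail((max_size, max_size), Image.ANTIALIAS)
    img.save('blog_' + filename)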

To use the script you’ll have to download flickrpy and get a Flickr API key.

Here’s the Python script that will download 20 images from Flickr:

#!/usr/bin/env python
"""Usage: python flickrDownload.py TAGS
TAGS is a space delimited list of tags
 
Created by Matt Warren on 2009-09-08.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
import sys
import shutil
import urllib
 
import flickr
 
NUMBER_OF_IMAGES = 20
 
#this is slow
def get_urls_for_tags(tags, number):
    photos = flickr.photos_search(tags=tags, tag_mode='all', per_page=number)
    urls = []
    for photo in photos:
        try:
            urls.append(photo.getURL(size='Large', urlType='source'))
        except:
            # skip photos that don't have a 'Large' size available
            continue
    return urls
 
def download_images(urls):
    for url in urls:
        path, headers = urllib.urlretrieve(url)
        name = url.split('/')[-1]
        print name
        shutil.copy(path, './' + name)
 
def main(*argv):
    args = argv[1:]
    if len(args) == 0:
        print "You must specify at least one tag"
        return 1
 
    tags = [item for item in args]
 
    urls = get_urls_for_tags(tags, NUMBER_OF_IMAGES)
    download_images(urls)
 
if __name__ == '__main__':
    sys.exit(main(*sys.argv))
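
Usage follows the docstring at the top of the script. For example, this would download 20 images tagged with both “sunset” and “beach” (since tag_mode='all' requires every tag to match):

$ python flickrDownload.py sunset beach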