Tag Archives: script

Digg is by far the most popular social news site on the internet. With its simple “thumbs up” system, the users of the site promote the most interesting and highest-quality stories, and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.

It’s a great site and one that I visit every day.

I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.

The short Python script I came up with returns the search results from Digg in a standard Python data structure (a list of dictionaries), so it’s simple to use. It parses out the title, destination, comment count, digg link, digg count, and summary for the top 100 search results.

You can perform advanced searches on Digg by using a number of different flags:

  • +b Add to see buried stories
  • +p Add to see only promoted stories
  • +np Add to see only unpromoted stories
  • +u Add to see only upcoming stories
  • Put terms in “quotes” for an exact search
  • -d Remove the domain from the search
  • Add -term to exclude a term from your query (e.g. apple -iphone)
  • Begin your query with site: to only display stories from that URL.
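
As a rough illustration of how these flags combine into a single query string, here is a minimal sketch in Python 3. The helper `build_digg_query` and its parameters are hypothetical names made up for this example; they are not part of the DiggSearch script, and only the flag syntax itself comes from the list above.

```python
def build_digg_query(terms, exact=False, exclude=(), site=None, flags=()):
    """Compose a Digg search string from the flags listed above.

    Hypothetical helper for illustration only.
    """
    parts = []
    if site:
        # "Begin your query with site:", so it goes first
        parts.append('site:' + site)
    # Quotes request an exact match
    parts.append('"%s"' % terms if exact else terms)
    # -term excludes a term from the results
    parts.extend('-' + term for term in exclude)
    # Remaining flags such as +p, +u or -d are appended as-is
    parts.extend(flags)
    return ' '.join(parts)


print(build_digg_query('apple', exclude=['iphone']))    # → apple -iphone
print(build_digg_query('twitter', flags=['+p', '-d']))  # → twitter +p -d
```

The resulting string is what you would pass as the first argument to digg_search.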

This script also allows the search results to be sorted:

from DiggSearch import digg_search
digg_search('twitter', sort='newest')  #sort by newest first
digg_search('twitter', sort='digg')  # sort by number of diggs
digg_search('twitter -d')  # sort by best match

Here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
import urllib,urllib2
import re
from BeautifulSoup import BeautifulSoup
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
    digg_results = []
    # pages 1..pages inclusive (10 pages of 10 results gives the top 100)
    for page in range(1, pages + 1):
        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)
        #GET the page
        request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )
        urlfile = urllib2.urlopen(request)
        page = urlfile.read(200000)
        #scrape it
        soup = BeautifulSoup(page)
        links = soup.findAll('h3', id=re.compile("title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile("diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
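        for i in range(0,len(links)):
            # Only the 'title' entry survived in this listing. The other keys are
            # reconstructed from the field list in the text and may differ from
            # the original script.
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(),
                    'destination':links[i].find('a')['href'],
                    'comment_count':remove_extra_spaces(' '.join(comments[i].findAll(text=True))).strip(),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':remove_extra_spaces(' '.join(body[i].findAll(text=True))).strip(),
                    }
            digg_results.append(item)
        #last page early exit
        if len(links) < 10:
            break
    return digg_results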
if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r

You can grab the source code from the bitbucket repository.

Ok, so this isn’t my script, but it’s a much nicer version of the one I wrote, which scrapes the actual Google Translate website to do the same thing. I’d like to thank Ashish Yadav for writing and sharing this.

Translating text is an easy way to create variations of content that is recognized as unique by the search engines. As part of a bigger SEO strategy this can make a big impact on your traffic. Or it could be used to provide an automated way to translate your website to another language.

# -*- coding: utf-8 -*-
import re
import sys
import urllib
import simplejson
baseUrl = "http://ajax.googleapis.com/ajax/services/language/translate"
def getSplits(text,splitLength=4500):
    """
    Translate Api has a limit on length of text(4500 characters) that can be translated at once,
    so the text is handed back in slices of at most splitLength characters.
    """
    return (text[index:index+splitLength] for index in xrange(0,len(text),splitLength))
def translate(text,src='', to='en'):
    """
    A Python Wrapper for Google AJAX Language API:
    * Uses Google Language Detection, in cases source language is not provided with the source text
    * Splits up text if it's longer than 4500 characters, as a limit put up by the API
    """
    params = ({'langpair': '%s|%s' % (src, to),
             'v': '1.0'
             })
    # accumulate the translated chunks here
    retText = ''
    for text in getSplits(text):
        params['q'] = text
        resp = simplejson.load(urllib.urlopen('%s' % (baseUrl), data = urllib.urlencode(params)))
        retText += resp['responseData']['translatedText']
    return retText
def test():
    msg = "      Write something You want to be translated to English,\n"\
        "      Enter ctrl+c to exit"
    print msg
    while True:
        text = raw_input('#>  ')
        retText = translate(text)
        print retText
if __name__=='__main__':
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)
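
To see the chunking step in isolation, here is a small Python 3 sketch of the same slicing idea used by getSplits. The name `split_text` is made up for this example, and the 4500-character limit is simply the figure quoted in the script’s docstring; the Google AJAX Language API itself is not called.

```python
def split_text(text, split_length=4500):
    """Yield consecutive slices of text, each at most split_length characters."""
    for index in range(0, len(text), split_length):
        yield text[index:index + split_length]


sample = 'x' * 10000
chunks = list(split_text(sample))
print([len(chunk) for chunk in chunks])   # → [4500, 4500, 1000]
# Joining the chunks gives back the original text unchanged
print(''.join(chunks) == sample)          # → True
```

Fixed-width slicing can cut a word or sentence in half, so the translation may read oddly at the seams between chunks.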

Flickr has an amazing library of images and a stellar API for accessing and easily downloading them. I wanted to make use of their API to start downloading a collection of images to use on a future website project and so I started looking for a nice Python script to help me do it.

I found the awesome flickrpy Python API wrapper, which makes all the hard work very easy, and wrote a short script that will search for images with a given tag (or tags) and download them to the current directory.

Expanding on this, you could easily use PIL to modify the images and re-purpose them for use on another website, such as a WordPress photoblog.

To use the script you’ll have to download flickrpy, and get a Flickr API key.

Here’s the Python script that will download 20 images from Flickr:

#!/usr/bin/env python
"""Usage: python flickrDownload.py TAGS
TAGS is a space delimited list of tags
Created by Matt Warren on 2009-09-08.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
import sys
import shutil
import urllib
import flickr
# Placeholder: put your own key here (assumes flickrpy reads it from flickr.API_KEY)
flickr.API_KEY = 'your Flickr API key'
NUMBER_OF_IMAGES = 20
#this is slow
def get_urls_for_tags(tags, number):
    photos = flickr.photos_search(tags=tags, tag_mode='all', per_page=number)
    urls = []
    for photo in photos:
        try:
            urls.append(photo.getURL(size='Large', urlType='source'))
        except Exception:
            # getURL can fail for a photo (for example if that size is unavailable); skip it
            continue
    return urls
def download_images(urls):
    for url in urls:
        # urlretrieve saves to a temporary file and returns its path
        file, mime = urllib.urlretrieve(url)
        name = url.split('/')[-1]
        print name
        shutil.copy(file, './'+name)
def main(*argv):
    args = argv[1:]
    if len(args) == 0:
        print "You must specify at least one tag"
        return 1
    tags = [item for item in args]
    urls = get_urls_for_tags(tags, NUMBER_OF_IMAGES)
    download_images(urls)
    return 0
if __name__ == '__main__':
    sys.exit(main(*sys.argv))
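
The download step names each file after the last path segment of its URL. A minimal Python 3 sketch of that step is below; `filename_from_url` is a hypothetical helper, the sample URL is invented, and unlike the script it also drops any query string before taking the last segment.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(path)


# Invented example URL, shaped like a static image link
print(filename_from_url('http://example.com/photos/1234_abcd_b.jpg?download=1'))
# → 1234_abcd_b.jpg
```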

I schedule all my scripts to run on a WebFaction server. That gives me excellent uptime, and the ability to install my own programs and Python libraries. So far they have proved to be a kick-ass web hosting service.

I needed a way to push code updates to my scripts up to the server and so I quickly put together a simple script that uses FTP to transfer the files I need uploaded.

It’s not very sophisticated, and there are probably better ways to deploy code, such as using Mercurial or rsync to push out updates without stepping on remote code changes. But the FTP approach will work just fine.

This script uses a hard-coded list of files that I want to always push to the server.

Here it is:

#!/usr/bin/env python
import ftplib
# Placeholders: fill in your own FTP login
USERNAME = 'username'
PASSWORD = 'password'
HOST = 'ftp.server.com'
REMOTE_DIR = './bin/'
info = (USERNAME, PASSWORD)
files = ('file1.py', 'file2.py')
def connect(site, dir, user=(), verbose=True):
    if verbose:
        print 'Connecting', site
    remote = ftplib.FTP(site)
    # user is a (username, password) tuple
    remote.login(*user)
    if verbose:
        print 'Changing directory', dir
    remote.cwd(dir)
    return remote
def putfile(remote, file, verbose=True):
    if verbose:
        print 'Uploading', file
    local = open(file, 'rb')
    remote.storbinary('STOR ' + file, local, 1024)
    local.close()
    if verbose:
        print 'Upload done.'
def disconnect(remote):
    remote.quit()
if __name__ == '__main__':
    remote = connect(HOST, REMOTE_DIR, info)
    for file in files:
        putfile(remote, file)
    disconnect(remote)
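
For anyone on Python 3, the same upload flow can be written with context managers so the local files and the FTP session get closed even if a transfer fails. This is only a sketch of the idea: `upload_files` and `deploy` are names invented here, and nothing in it has been run against a real server.

```python
import ftplib


def upload_files(remote, filenames, blocksize=1024):
    """Upload each local file over an already logged-in FTP connection.

    remote is expected to behave like ftplib.FTP (only storbinary is used).
    """
    uploaded = []
    for name in filenames:
        # 'with' closes the local file even if the transfer raises
        with open(name, 'rb') as local:
            remote.storbinary('STOR ' + name, local, blocksize)
        uploaded.append(name)
    return uploaded


def deploy(host, user, password, remote_dir, filenames):
    """Connect, change directory, upload, and always close the session."""
    # ftplib.FTP is a context manager in Python 3 and closes the session on exit
    with ftplib.FTP(host) as remote:
        remote.login(user, password)
        remote.cwd(remote_dir)
        return upload_files(remote, filenames)
```

Keeping the upload loop separate from the connection setup also makes it easy to try out with a stand-in object instead of a live FTP server.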

Neverblue is a CPA network that I have found to be one of the better ones out there. If you are not familiar with the CPA side of internet marketing, it’s where you get paid for each person you refer who performs a certain action (CPA = Cost Per Action). The action could be anything from providing a zip code or email address to purchasing a sample. The marketer who promotes the offer can get quite a good payout: anything from $0.05 to $50+.

Marketers find offers to promote using services like the one provided by Neverblue. Neverblue acts as the middle man by finding and paying the marketers, and by finding businesses with offers for them to promote.

Neverblue is unique in that they program their own platform and have developed some nice APIs and interfaces for getting your performance and tracking statistics programmatically. I promote a bunch of their offers and make a decent amount of money through them, so I thought I should write a script that can download my statistics and keep them stored somewhere, so that I can mesh them with my PPC data to calculate return on investment numbers per keyword.

Getting data from Neverblue is a 3 step process:

  1. Request a report to be generated
  2. Wait for that report to finish
  3. Request the results of the report

This is a bit more complex than most of the processes that download information, but it is a pretty flexible way to request bigger datasets without timing out on the HTTP request.
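
The request, wait, fetch pattern described above can be sketched independently of Neverblue. The Python 3 example below is only an illustration: `poll_until_complete`, its parameters, and the fake status function are hypothetical, and the 'completed' status string is the one the script further down checks for.

```python
import time


def poll_until_complete(check_status, retries=10, delay=0.01):
    """Call check_status() until it returns 'completed' or retries run out.

    Returns the last status seen, so the caller can tell success from timeout.
    """
    status = None
    for _ in range(retries):
        status = check_status()
        if status == 'completed':
            break
        # Give the server time to finish before asking again
        time.sleep(delay)
    return status


# Stand-in for a real status request: reports completion on the third call
calls = {'count': 0}

def fake_status():
    calls['count'] += 1
    return 'completed' if calls['count'] >= 3 else 'running'


print(poll_until_complete(fake_status))   # → completed
print(calls['count'])                     # → 3
```

Capping the number of retries means a report that never finishes cannot hang the script forever.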

So here’s a short Python script I wrote based on Neverblue’s sample PHP script. It just prints out the payout information for yesterday.

Example Usage:

$ python NeverblueCheck.py
2009-08-20 $40.00

Here’s the Python code that gets the statistics from Neverblue:

#!/usr/bin/env python
# encoding: utf-8
"""
Created by Matt Warren on 2009-08-12.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
import urllib2
import time
import csv
import os
from urllib import urlencode
try:
    from xml.etree import ElementTree
except ImportError:
    from elementtree import ElementTree
username='Your Neverblue login (email)'
password='Your Neverblue password'
url = 'https://secure.neverblue.com/service/aff/v1/rest/'
schedule_url = url + 'reportSchedule/'
status_url   = url + 'reportStatus/'
download_url = url + 'reportDownloadUrl/'
REALM = 'secure.neverblue.com'
# How many times to poll for a finished report (placeholder value; tune as needed)
SERVER_RETRIES = 100
def install_opener():
    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # Add the username and password.
    password_mgr.add_password(REALM, url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)
    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)
def request_report():
    params={'type':'date', 'relativeDate':'yesterday', 'campaign':0}
    req = urllib2.Request(schedule_url + '?' + urlencode(params))
    handle = urllib2.urlopen(req)
    xml = handle.read()
    tree = ElementTree.fromstring(xml)
    # parse the reportJob code from the XML
    reportJob = tree.find('reportJob').text
    return reportJob
def check_status(reportJob):
    params = {'reportJob':reportJob}
    reportStatus = None
    for i in range(0, SERVER_RETRIES):
        req = urllib2.Request(status_url + '?' + urlencode(params))
        handle = urllib2.urlopen(req)
        xml = handle.read()
        tree = ElementTree.fromstring(xml)
        reportStatus = tree.find('reportStatus').text
        if reportStatus == 'completed':
            break
        # wait before polling again (the 2 second delay is an arbitrary choice)
        time.sleep(2)
    return reportStatus
def get_results(reportJob):
    params = {'reportJob':reportJob, 'format':'csv'}
    req = urllib2.Request(download_url + '?' + urlencode(params))
    handle = urllib2.urlopen(req)
    xml = handle.read()
    tree = ElementTree.fromstring(xml)
    downloadURL = tree.find('downloadUrl').text
    report = urllib2.urlopen(downloadURL).read()
    reader = csv.DictReader( report.split( '\n' ) )
    for row in reader:
        print row['Date'], row['Payout']
if __name__=='__main__':
    install_opener()
    reportJob = request_report()
    reportStatus = check_status(reportJob)
    if reportStatus == 'completed':
        get_results(reportJob)
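
The last step reads the downloaded report with csv.DictReader. Here is a small Python 3 sketch of that parsing on its own, using an invented two-row report; the 'Date' and 'Payout' column names are taken from the script above, while the sample rows and the `total_payout` helper are made up for illustration.

```python
import csv
import io


def total_payout(report_text):
    """Sum the Payout column of a CSV report, tolerating a leading '$'."""
    reader = csv.DictReader(io.StringIO(report_text))
    total = 0.0
    for row in reader:
        total += float(row['Payout'].lstrip('$'))
    return total


# Invented sample data in the shape the script expects
sample_report = 'Date,Payout\n2009-08-19,$12.50\n2009-08-20,$40.00\n'
print('%.2f' % total_payout(sample_report))   # → 52.50
```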

If you’re interested in trying to make money with CPA offers, I highly recommend Neverblue: it has some really profitable offers and probably the most advanced platform for doing international offers out there right now.

Amazon has a very comprehensive associate program that allows you to promote just about anything imaginable for any niche and earn commission for anything you refer. The size of the catalog is what makes Amazon such a great program. People make some good money promoting Amazon products.

There is a great Python library called boto for accessing the other Amazon web services, such as S3 and EC2. However, it doesn’t support the Product Advertising API.

With the Product Advertising API you have access to everything that you can read on the Amazon site about each product. This includes the product description, images, editor reviews, customer reviews and ratings. This is a lot of great information that you could easily find a good use for with your websites.

So how do you get at this information from within a Python program? Well the complicated part is dealing with the authentication that Amazon has put in place. To make that a bit easier I used the connection component from boto.
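
For the curious, here is roughly what that authentication step involves. The sketch below is a standalone Python 3 take on AWS Signature Version 2 request signing (sorted, percent-encoded parameters signed with HMAC-SHA256); it is not boto’s code, the function name `sign_request` is invented, and the host, parameter values and secret are placeholders chosen for the example.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_request(params, secret_key, host, path='/onca/xml', verb='GET'):
    """Return (query_string, signature) for a Signature Version 2 style request."""
    # Parameters are sorted by name and percent-encoded (RFC 3986 unreserved set)
    canonical = '&'.join(
        '%s=%s' % (quote(str(k), safe='-_.~'), quote(str(v), safe='-_.~'))
        for k, v in sorted(params.items())
    )
    # The string to sign is verb, host, path and query joined by newlines
    string_to_sign = '\n'.join([verb, host.lower(), path, canonical])
    digest = hmac.new(secret_key.encode('utf-8'),
                      string_to_sign.encode('utf-8'),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode('ascii')
    return canonical, signature


# All values below are placeholders, not working credentials
query, signature = sign_request(
    {'Service': 'AWSECommerceService', 'Operation': 'ItemSearch',
     'Timestamp': '2009-08-17T12:00:00Z'},
    secret_key='not-a-real-secret',
    host='ecs.amazonaws.com',
)
print(query)
print(len(signature))   # a base64-encoded SHA-256 digest is 44 characters
```

The signature still has to be URL-quoted before it is appended to the query string as the Signature parameter, which is what the script does with urllib.quote.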

Here’s a demonstration snippet of code that will print out the top 10 best selling books on Amazon right now.

Example Usage:

$ python AmazonSample.py
Glenn Becks Common Sense: The Case Against an Out-of-Control Government, Inspired by Thomas Paine by Glenn Beck
Culture of Corruption: Obama and His Team of Tax Cheats, Crooks, and Cronies by Michelle Malkin
The Angel Experiment (Maximum Ride, Book 1) by James Patterson
The Time Travelers Wife by Audrey Niffenegger
The Help by Kathryn Stockett
South of Broad by Pat Conroy
Paranoia by Joseph Finder
The Girl Who Played with Fire by Stieg Larsson
The Shack [With Headphones] (Playaway Adult Nonfiction) by William P. Young
The Girl with the Dragon Tattoo by Stieg Larsson

To use this code you’ll need an Amazon associate account, and you’ll need to fill in the keys and tag needed for authentication.

Product Advertising API Python code:

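#!/usr/bin/env python
# encoding: utf-8
"""
Created by Matt Warren on 2009-08-17.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
import urllib
import time
try:
    from xml.etree import ElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET
from boto.connection import AWSQueryConnection
# Fill in your own credentials and associate tag
AWS_ACCESS_KEY_ID = 'YOUR ACCESS KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR SECRET KEY'
AWS_ASSOCIATE_TAG = 'YOUR TAG'
def amazon_top_for_category(browseNodeId):
    # NOTE: only the secret key, is_secure and Timestamp arguments survived in
    # this listing. The host and the other request parameters below are a
    # reconstruction; check them against the Product Advertising API docs.
    aws_conn = AWSQueryConnection(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY, is_secure=False,
        host='ecs.amazonaws.com')
    aws_conn.SignatureVersion = '2'
    params = dict(
        Service='AWSECommerceService',
        Version='2009-07-01',
        SignatureVersion=aws_conn.SignatureVersion,
        AWSAccessKeyId=AWS_ACCESS_KEY_ID,
        AssociateTag=AWS_ASSOCIATE_TAG,
        Operation='ItemSearch',
        BrowseNode=browseNodeId,
        SearchIndex='Books',
        ResponseGroup='ItemAttributes',
        Sort='salesrank',
        Timestamp=time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()))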
    verb = 'GET'
    path = '/onca/xml'
    qs, signature = aws_conn.get_signature(params, verb, path)
    qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
    response = aws_conn._mexe(verb, qs, None, headers={})
    tree = ET.fromstring(response.read())
    NS = tree.tag.split('}')[0][1:]
    for item in tree.find('{%s}Items'%NS).findall('{%s}Item'%NS):
        title = item.find('{%s}ItemAttributes'%NS).find('{%s}Title'%NS).text
        author = item.find('{%s}ItemAttributes'%NS).find('{%s}Author'%NS).text
        print title, 'by', author
if __name__ == '__main__':
    amazon_top_for_category(1000) #Amazon category number for US Books
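
The parsing code above pulls the XML namespace out of the root tag and then prefixes every element name with it. Here is a self-contained Python 3 sketch of that technique on a tiny invented document; the namespace URI, the sample XML and the `list_titles` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """<Response xmlns="http://example.com/ns/2009">
  <Items>
    <Item><ItemAttributes><Title>First Book</Title><Author>A. Writer</Author></ItemAttributes></Item>
    <Item><ItemAttributes><Title>Second Book</Title><Author>B. Writer</Author></ItemAttributes></Item>
  </Items>
</Response>"""


def list_titles(xml_text):
    """Return [(title, author), ...] from a namespaced response document."""
    tree = ET.fromstring(xml_text)
    # A namespaced tag looks like '{uri}Response'; keep just the uri part
    ns = tree.tag.split('}')[0][1:]
    results = []
    for item in tree.find('{%s}Items' % ns).findall('{%s}Item' % ns):
        attrs = item.find('{%s}ItemAttributes' % ns)
        title = attrs.find('{%s}Title' % ns).text
        author = attrs.find('{%s}Author' % ns).text
        results.append((title, author))
    return results


for title, author in list_titles(SAMPLE_XML):
    print(title, 'by', author)
```

Reading the namespace from the response, rather than hard-coding it, keeps the parsing working if the namespace URI changes between API versions.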

Inlinks (aka backlinks) are an important aspect of your SEO strategy. They are the ways that people will find your website, and they are an indicator to search engines that your website is important and should rank well. So it is important to keep an eye on this statistic for your website. There is a saying that applies here: “you can’t manage what you can’t measure”. If you want your website to rank well you need to manage your inlinks, and so you need to measure them.

This script requires a Yahoo! AppID, which you can get from the Yahoo! Developer Network, because it uses the REST API for Yahoo! Site Explorer rather than scraping pages.

The script simply returns the total number of results but you could easily extend this to print out all your inlinks. I will be using this to track my inlink count over time by running it every day and storing the result in a database.
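
As a sketch of the daily tracking idea, the Python 3 snippet below stores one count per site per day in SQLite. The table layout, the function names and the sample counts are all invented for illustration; the count itself would come from a function like yahoo_inlinks_count.

```python
import sqlite3
from datetime import date


def record_inlink_count(conn, site, count, day=None):
    """Store one inlink count per site per day, replacing any earlier value."""
    day = day or date.today().isoformat()
    conn.execute(
        'CREATE TABLE IF NOT EXISTS inlinks ('
        'site TEXT, day TEXT, inlink_count INTEGER, PRIMARY KEY (site, day))'
    )
    conn.execute(
        'INSERT OR REPLACE INTO inlinks (site, day, inlink_count) VALUES (?, ?, ?)',
        (site, day, count),
    )
    conn.commit()


def history(conn, site):
    """Return [(day, count), ...] for a site, oldest first."""
    rows = conn.execute(
        'SELECT day, inlink_count FROM inlinks WHERE site = ? ORDER BY day',
        (site,),
    )
    return rows.fetchall()


# In-memory database and invented counts, just to show the round trip
conn = sqlite3.connect(':memory:')
record_inlink_count(conn, 'http://www.halotis.com', 120, day='2009-08-01')
record_inlink_count(conn, 'http://www.halotis.com', 125, day='2009-08-02')
print(history(conn, 'http://www.halotis.com'))
```

Using (site, day) as the primary key means that running the script twice on the same day overwrites that day's value instead of adding a duplicate row.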

Example Usage:

$ python yahooInlinks.py http://www.halotis.com
checking http://www.halotis.com

Here’s the Python Code:

#!/usr/bin/env python 
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
import urllib2, sys, urllib
try:
    import json
except ImportError:
    import simplejson as json # http://undefined.org/python/#simplejson
# Placeholder: your AppID from the Yahoo! Developer Network
YAHOO_APP_ID = 'YOUR APP ID'
def yahoo_inlinks_count(query):
    if not query.startswith('http://'):
        raise Exception('site must start with "http://"')
    request = 'http://search.yahooapis.com/SiteExplorerService/V1/inlinkData?appid=' + YAHOO_APP_ID + '&query=' + urllib.quote_plus(query) + '&output=json&results=0'
    try:
        results = json.load(urllib2.urlopen(request))
    except Exception:
        raise Exception("Web services request failed")
    return results['ResultSet']['totalResultsAvailable']
if __name__=='__main__':
    print 'checking', sys.argv[1]
    print yahoo_inlinks_count(sys.argv[1])

Ok, even though Yahoo search is on the way out and will be replaced by the search engine behind Bing, that transition won’t happen until sometime in 2010. Until then, Yahoo still has 20% of the search engine market share, so it’s important to consider it as a source of traffic for your websites.

This script is similar to the Google and Bing SERP scrapers that I posted earlier on this site but Yahoo’s pages were slightly more complicated to parse. This was because they use a re-direct service in their URLs which required some regular expression matching.
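
To show what that regular expression matching does, here is a small Python 3 sketch. The `/**` marker is the one the script’s pattern looks for; the sample link and the `extract_destination` helper are invented, so treat the exact shape of real Yahoo redirect URLs as an assumption.

```python
import re
from urllib.parse import unquote_plus

# Everything after the '/**' marker is the percent-encoded destination URL
URL_PATTERN = re.compile(r'/\*\*(.*)')


def extract_destination(href):
    """Return the decoded target URL, or the href unchanged if no marker is found."""
    match = URL_PATTERN.search(href)
    if match is None:
        return href
    return unquote_plus(match.group(1))


# Invented redirect-style link for illustration
sample = 'http://rds.example.com/click/id=123/**http%3a//www.halotis.com/'
print(extract_destination(sample))   # → http://www.halotis.com/
```

Unlike the list comprehension in the script, this version falls back to the original href when the marker is missing instead of raising an IndexError.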

I will be putting all these little components together into a larger program later.

Example Usage:

$ python yahooScrape.py

Here’s the Script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
import urllib,urllib2
import re
from BeautifulSoup import BeautifulSoup
def yahoo_grab(query):
    address = "http://search.yahoo.com/search?p=%s" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    soup = BeautifulSoup(page)
    url_pattern = re.compile('/\*\*(.*)')
    links =   [urllib.unquote_plus(url_pattern.findall(x.find('a')['href'])[0]) for x in soup.find('div', id='web').findAll('h3')]
    return links
if __name__=='__main__':
    # Example: print the result URLs, one per line
    links = yahoo_grab('halotis')
    print '\n'.join(links)

Here’s a short script that will scrape the first 100 listings in the Google organic results.

You might want to use this to find the position of your sites and track their position for certain target keyword phrases over time. That could be a very good way to determine, for example, if your SEO efforts are working. Or you could use the list of URLs as a starting point for some other web crawling activity.
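
Finding your position in the scraped list is a short step once you have the URLs. A minimal Python 3 sketch, assuming a list of result URLs like the one the script produces; `find_position` and the sample list are made up for this example.

```python
from urllib.parse import urlsplit


def find_position(links, domain):
    """Return the 1-based position of the first link on domain, or None."""
    domain = domain.lower()
    for position, link in enumerate(links, start=1):
        host = (urlsplit(link).hostname or '').lower()
        # Match the domain itself and any subdomain such as www.
        if host == domain or host.endswith('.' + domain):
            return position
    return None


# Invented result list standing in for the output of the scraper
links = [
    'http://example.com/page',
    'http://www.halotis.com/',
    'http://halotis.com/about',
]
print(find_position(links, 'halotis.com'))      # → 2
print(find_position(links, 'not-listed.org'))   # → None
```

Comparing host names rather than searching the whole URL avoids false matches when your domain only appears in someone else's path or query string.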

As the script is written it will just dump the list of URLs to a txt file.

It uses the BeautifulSoup library to help with parsing the HTML page.

Example Usage:

$ python GoogleScrape.py
$ cat links.txt

Here’s the script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
import urllib,urllib2
from BeautifulSoup import BeautifulSoup
def google_grab(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    soup = BeautifulSoup(page)
    links =   [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
    return links
if __name__=='__main__':
    # Example: Search written to file
    links = google_grab('halotis')
    f = open('links.txt', 'w')
    f.write('\n'.join(links))
    f.close()

This isn’t my script, but I thought it would appeal to the readers of this blog. It’s a script that will look up the Google PageRank for any website, and it uses the same interface as the Google Toolbar to do it. I’d like to thank Fred Cirera for writing it, and you can check out his blog about this script here.

I’m not exactly sure what I would use this for, but it might have applications for anyone who wants to do some really advanced SEO work and find a real way to accomplish PageRank sculpting, or perhaps to find the best websites to put links on.

The reason it is such an involved bit of math is that it needs to compute a checksum in order to work. It should be pretty reliable since it doesn’t involve any scraping.

Example usage:

$ python pagerank.py http://www.google.com/
PageRank: 10	URL: http://www.google.com/
$ python pagerank.py http://www.mozilla.org/
PageRank: 9	URL: http://www.mozilla.org/
$ python pagerank.py http://halotis.com
PageRank: 3	URL: http://www.halotis.com/

And the script:

#!/usr/bin/env python
#  Script for getting Google Page Rank of page
#  Google Toolbar 3.0.x/4.0.x Pagerank Checksum Algorithm
#  original from http://pagerank.gamesaga.net/
#  this version was adapted from http://www.djangosnippets.org/snippets/221/
#  by Corey Goldberg - 2010
#  Licensed under the MIT license: http://www.opensource.org/licenses/mit-license.php
import sys
import urllib
def get_pagerank(url):
    hsh = check_hash(hash_url(url))
    gurl = 'http://www.google.com/search?client=navclient-auto&features=Rank:&q=info:%s&ch=%s' % (urllib.quote(url), hsh)
    try:
        f = urllib.urlopen(gurl)
        # the response looks like 'Rank_1:1:N'; drop the 9-character prefix
        rank = f.read().strip()[9:]
    except Exception:
        rank = 'N/A'
    if rank == '':
        rank = '0'
    return rank
def  int_str(string, integer, factor):
    for i in range(len(string)) :
        integer *= factor
        integer &= 0xFFFFFFFF
        integer += ord(string[i])
    return integer
def hash_url(string):
    c1 = int_str(string, 0x1505, 0x21)
    c2 = int_str(string, 0, 0x1003F)
    c1 >>= 2
    c1 = ((c1 >> 4) & 0x3FFFFC0) | (c1 & 0x3F)
    c1 = ((c1 >> 4) & 0x3FFC00) | (c1 & 0x3FF)
    c1 = ((c1 >> 4) & 0x3C000) | (c1 & 0x3FFF)
    t1 = (c1 & 0x3C0) << 4
    t1 |= c1 & 0x3C
    t1 = (t1 << 2) | (c2 & 0xF0F)
    t2 = (c1 & 0xFFFFC000) << 4
    t2 |= c1 & 0x3C00
    t2 = (t2 << 0xA) | (c2 & 0xF0F0000)
    return (t1 | t2)
def check_hash(hash_int):
    hash_str = '%u' % (hash_int)
    flag = 0
    check_byte = 0
    i = len(hash_str) - 1
    while i >= 0:
        byte = int(hash_str[i])
        if 1 == (flag % 2):
            byte *= 2;
            byte = byte / 10 + byte % 10
        check_byte += byte
        flag += 1
        i -= 1
    check_byte %= 10
    if 0 != check_byte:
        check_byte = 10 - check_byte
        if 1 == flag % 2:
            if 1 == check_byte % 2:
                check_byte += 9
            check_byte >>= 1
    return '7' + str(check_byte) + hash_str
if __name__ == '__main__':
    if len(sys.argv) != 2:
        # no URL given on the command line, so fall back to a default
        url = 'http://www.google.com/'
    else:
        url = sys.argv[1]
    print get_pagerank(url)
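
One thing to watch if you port this to Python 3: the `/` in check_hash becomes true division there, so it has to be written as `//`. The sketch below is a Python 3 port of just that checksum function, lightly restructured, for illustration; it has not been checked against Google’s service, and the sample inputs are arbitrary numbers rather than real URL hashes.

```python
def check_hash(hash_int):
    """Python 3 port of the check-digit step: returns '7' + check digit + hash."""
    hash_str = '%u' % hash_int
    flag = 0
    check_byte = 0
    # Walk the decimal digits from right to left
    for char in reversed(hash_str):
        byte = int(char)
        if flag % 2 == 1:
            # Every second digit is doubled and its own digits are summed
            byte *= 2
            byte = byte // 10 + byte % 10
        check_byte += byte
        flag += 1
    check_byte %= 10
    if check_byte != 0:
        check_byte = 10 - check_byte
        if flag % 2 == 1:
            if check_byte % 2 == 1:
                check_byte += 9
            check_byte >>= 1
    return '7' + str(check_byte) + hash_str


print(check_hash(0))    # → 700
print(check_hash(12))   # → 7612
```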