Category Archives: Python

There are many Twitter API libraries available for Python. I wanted to find out which one was the best and what the strengths and weaknesses of each are. However, there are too many out there to find and review all of them. Instead, here's a rundown of the most popular Python Twitter API wrappers, with a short review of each one and some sample code showing off the syntax.

Python Twitter

This is the library that I personally use in most of my Twitter scripts. It's very simple to use, but it is not up to date with the latest developments at Twitter. It was written by DeWitt Clinton and is available on Google Code. If you just want the basic API functionality, it does a pretty decent job.

import twitter
api = twitter.Api('username', 'password')
statuses = api.GetPublicTimeline()
print [s.user.name for s in statuses]
users = api.GetFriends()
print [u.name for u in users]
statuses = api.GetUserTimeline('username')
print [s.text for s in statuses]
api.PostUpdate('I love python-twitter!')

twyt

Twyt is a fairly comprehensive library that seems solid and well organized. In some cases there is added complexity because you have to parse the JSON objects returned by the Twitter API yourself. It is written and maintained by Andrew Price.

from twyt import twitter, data
t = twitter.Twitter()
t.set_auth("username", "password")
print t.status_friends_timeline()
print t.user_friends()
return_val = t.status_update("Testing 123")
s = data.Status()
s.load_json(return_val)
print s
t.status_destroy(s.id)

Twython

An up-to-date Python wrapper for the Twitter API. It supports Twitter's main API and search API, with OAuth and streaming API support on the way. It is based on the Python Twitter library and is actively maintained by Ryan McGrath.

import twython
twitter = twython.setup(authtype="Basic", username="example", password="example")
twitter.updateStatus("See how easy this was?")
friends_timeline = twitter.getFriendsTimeline(count="150", page="3")
print [tweet["text"] for tweet in friends_timeline]

Tweepy

Tweepy is a pretty compelling Python Twitter API library. It's up to date with the latest features of Twitter and is actively being developed by Joshua Roesslein. It features OAuth support, Python 3 support, streaming API support and its own cache system. Retweet streaming was recently added. If you want to use the most up-to-date features of the Twitter API from Python, or use Python 3, then you should definitely check out Tweepy.

import tweepy
api = tweepy.API.new('basic', 'username', 'password')
public_timeline = api.public_timeline()
print [tweet.text for tweet in public_timeline]
friends_timeline = api.friends_timeline()
print [tweet.text for tweet in friends_timeline]
u = tweepy.api.get_user('username')
friends = u.friends()
print [f.screen_name for f in friends]
api.update_status('tweeting with tweepy')
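
Since OAuth is one of Tweepy's headline features, here's a minimal sketch of what OAuth authentication looks like, assuming you've already registered an application with Twitter and obtained consumer and access tokens (the key strings below are placeholders):

import tweepy

# Placeholder credentials -- substitute the keys for your own registered app.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

api = tweepy.API(auth)
api.update_status('tweeting over OAuth with tweepy')

Once the handler is set up, the rest of the Tweepy calls look just like the Basic Auth examples above.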

It's pretty clear from the syntax samples that there's not much difference between any of these Python Twitter API libraries when just getting the basics done. The differences only start to show up when you get into the latest features: OAuth, streaming, and the retweet functions are what really differentiate these libraries. I hope this overview helps you find and choose the library that's right for your project.

This is a simple Python Twitter script that checks your friends timeline and prints out any links that have been posted. In addition, it visits each of the URLs, finds the actual title of the destination page, and prints that alongside the link. This simple script demonstrates an easy way to catch some of the hottest trends on the internet the moment they happen.

If you set up a Twitter account within a niche and find a few of the players in that niche to follow, then you can simply find any links posted, check them to see if they are on topic (using some keywords or heuristics; a rough sketch of such a filter follows the script below), and then either notify yourself of the interesting content or automatically scrape it for use on one of your related websites. That gives you perhaps the most up-to-date content possible before it hits Google Trends. It also gives you a chance to promote it before the social news sites find it (or be the first to submit it to them).

With a bit more work you could parse out the meta tag keywords and description, crawl the website, or extract the main content from the page. If it's a blog, you could post a comment.

Example Usage:

$ python TwitterLinks.py
http://bit.ly/s8rQX - Twitter Status - Tweets from users you follow may be missing from your timeline
http://bit.ly/26hiT - Why Link Exchanges Are a Terrible, No-Good Idea - Food Blog Alliance
http://FrankAndTrey.com - Frank and Trey
http://bit.ly/yPRHp - Gallery: Cute animals in the news this week
...

And here's the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
try:
   import json
except ImportError:
   import simplejson as json # http://undefined.org/python/#simplejson
import twitter     #http://code.google.com/p/python-twitter/
 
from urllib2 import urlopen
import re
 
SETTINGS = {'user':'twitter user name', 'password':'your password here'}
 
def listFriendsURLs(user, password):
    re_pattern='.*?((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'	# HTTP URL
    rg = re.compile(re_pattern,re.IGNORECASE|re.DOTALL)
 
    api = twitter.Api(user, password)
    timeline = api.GetFriendsTimeline(user)
 
    for status in timeline:
        m = rg.search(status.text)
        if m:
            httpurl=m.group(1)
            title = getTitle(httpurl)
            print httpurl, '-', title
 
def getTitle(url):
    req = urlopen(url)
    html = req.read()
 
    re_pattern='<title>(.*?)</title>'
    rg = re.compile(re_pattern,re.IGNORECASE|re.DOTALL)
 
    m = rg.search(html)
    if m:
        title = m.group(1)
        return title.strip()
    return None
 
if __name__ == '__main__':
    listFriendsURLs(SETTINGS['user'], SETTINGS['password'])
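
Building on the script above, here's a rough sketch of the on-topic keyword check mentioned earlier. The keyword list and the matching rule are just placeholder assumptions; tune them for your niche:

KEYWORDS = ['python', 'programming', 'code']   # hypothetical niche keywords

def is_on_topic(title, keywords=KEYWORDS):
    """Crude heuristic: consider a link on-topic if its page title contains any keyword."""
    if not title:
        return False
    title = title.lower()
    return any(keyword in title for keyword in keywords)

You could then filter inside listFriendsURLs(), for example only printing (or saving) a link when is_on_topic(title) returns True.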

Here's a simple web crawling script that will go from one URL and find all the pages it links to, up to a pre-defined depth. Web crawling is of course the lowest-level tool used by Google to create its multi-billion dollar business. You may not be able to compete with Google's search technology, but being able to crawl your own sites, or those of your competitors, can be very valuable.

You could, for instance, routinely check your websites to make sure they are live and all the links are working; the script could notify you of any 404 errors (a sketch of such a check follows the crawler below). By adding in a PageRank check you could identify better linking strategies to boost your rankings. And you could identify possible leaks: paths a user could take that lead them away from where you want them to go.

Here’s the script:

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
from urllib2 import urlopen
 
class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)
        self.url = starting_url
        self.db = {self.url: 1}
        self.node = [self.url]
 
        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0
 
    def handle_starttag(self, tag, attrs):
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = dict(attrs).get('href')  # look up href rather than assuming it's the first attribute
            if not link:
                return
            if link[:4] != "http":
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')
 
            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)
            self.db[link] = (self.db.get(link) or 0) + 1
 
    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]
            self.node = []
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res = req.read()
                    self.feed(res)
                except Exception:
                    # skip pages that fail to download or parse
                    self.reset()
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)
        return zorted
 
if __name__ == "__main__":
    spidey = Spider(starting_url = 'http://www.7cerebros.com.ar', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." %(link,n, "s" if n is not 1 else "")

OK, so this isn't my script, but it's a much nicer version of the one I wrote, which scraped the actual Google Translate website to do the same thing. I'd like to thank Ashish Yadav for writing and sharing this.

Translating text is an easy way to create variations of content that the search engines recognize as unique. As part of a bigger SEO strategy this can make a big impact on your traffic (a round-trip translation sketch follows the script below). It could also be used to provide an automated way to translate your website into another language.

# -*- coding: utf-8 -*-
 
import re
import sys
import urllib
import simplejson
 
baseUrl = "http://ajax.googleapis.com/ajax/services/language/translate"
 
def getSplits(text,splitLength=4500):
    '''
    The Translate API has a limit on the length of text (4500 characters) that can be translated at once.
    '''
    return (text[index:index+splitLength] for index in xrange(0,len(text),splitLength))
 
 
def translate(text,src='', to='en'):
    '''
    A Python wrapper for the Google AJAX Language API:
    * Uses Google language detection when the source language is not provided with the source text
    * Splits up text if it's longer than 4500 characters, a limit imposed by the API
    '''
 
    params = ({'langpair': '%s|%s' % (src, to),
             'v': '1.0'
             })
    retText=''
    for chunk in getSplits(text):
        params['q'] = chunk
        resp = simplejson.load(urllib.urlopen(baseUrl, data=urllib.urlencode(params)))
        retText += resp['responseData']['translatedText']
    return retText
 
 
def test():
    msg = "      Write something You want to be translated to English,\n"\
        "      Enter ctrl+c to exit"
    print msg
    while True:
        text = raw_input('#>  ')
        retText = translate(text)
        print retText
 
 
if __name__=='__main__':
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)
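
One way to use this for the content-variation idea mentioned above is a simple round-trip translation. Here's a hedged sketch built on the translate() function defined above; the choice of intermediate language is arbitrary:

def spin_via_translation(text, via='fr'):
    """Translate English text to another language and back to produce a reworded variation."""
    intermediate = translate(text, src='en', to=via)
    return translate(intermediate, src=via, to='en')

# Example:
# print spin_via_translation("This is a simple sentence to rephrase.")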

I'm working on a project that requires a way to iterate through Amazon BrowseNodes. To do that I wanted to do a breadth-first search through the tree, and I came up with a rather nice way to do it in Python.

There are a few resources that can be useful for finding BrowseNodes. The BrowseNodes.com website provides a crude way to browse through them, but it doesn't offer the kind of control you might need in your own application.

Here's a Python script that will print out BrowseNodes breadth first, starting from the root node for books.

#!/usr/bin/env python
"""
Created by Matt Warren on 2009-09-08.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
import time
import urllib
 
try:
    from xml.etree import ElementTree
except ImportError:
    from elementtree import ElementTree
 
from boto.connection import AWSQueryConnection
 
AWS_ACCESS_KEY_ID = 'YOUR ID'
AWS_ASSOCIATE_TAG = 'YOUR TAG'
AWS_SECRET_ACCESS_KEY = 'YOUR KEY'
 
BROWSENODES = {}
 
def bfs(root, children=iter):
    """Yield nodes breadth first starting from root, using children() to expand each node."""
    queue = [root]
    visited = set([root])

    while queue:
        node = queue.pop(0)
        yield node

        for child in children(node):
            if child not in visited:
                visited.add(child)   # mark when enqueued so a node isn't queued twice
                queue.append(child)
 
 
def amazon_browsenodelookup_children(nodeId, searchIndex='Books'):
    aws_conn = AWSQueryConnection(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY, is_secure=False,
        host='ecs.amazonaws.com')
    aws_conn.SignatureVersion = '2'
    params = dict(
        Service='AWSECommerceService',
        Version='2009-07-01',
        SignatureVersion=aws_conn.SignatureVersion,
        AWSAccessKeyId=AWS_ACCESS_KEY_ID,
        AssociateTag=AWS_ASSOCIATE_TAG,
        Operation='BrowseNodeLookup',
        SearchIndex=searchIndex,
        BrowseNodeId=nodeId,
        Timestamp=time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()))
    verb = 'GET'
    path = '/onca/xml'
    qs, signature = aws_conn.get_signature(params, verb, path)
    qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
    response = aws_conn._mexe(verb, qs, None, headers={})
    content = response.read()
    tree = ElementTree.fromstring(content)
    NS = tree.tag.split('}')[0][1:]
 
    children = []
    try:
        for node in tree.find('{%s}BrowseNodes'%NS).find('{%s}BrowseNode'%NS).find('{%s}Children'%NS).findall('{%s}BrowseNode'%NS):
            name = node.find('{%s}Name'%NS).text
            id = node.find('{%s}BrowseNodeId'%NS).text
            children.append( id )
            BROWSENODES[id] = name
    except AttributeError:
        # this BrowseNode has no Children element
        return []
    return children
 
 
if __name__ == '__main__':
    BROWSENODES['1000'] = 'Books'
    count = 0
    LIMIT = 25
    for node in bfs('1000', amazon_browsenodelookup_children):
        count = count + 1
        if count > LIMIT:
            break
        print BROWSENODES[node], '-', node
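
The bfs() helper isn't Amazon-specific. Here's a small illustration of how it walks any tree-like structure, using a toy dictionary in place of the BrowseNode lookup:

# A toy tree: each node maps to a list of child nodes.
TREE = {
    'root': ['a', 'b'],
    'a': ['a1', 'a2'],
    'b': ['b1'],
}

def toy_children(node):
    return TREE.get(node, [])

# Prints: root a b a1 a2 b1 (breadth first)
# for node in bfs('root', toy_children):
#     print node,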

Flickr has an amazing library of images and a stellar API for accessing and easily downloading them. I wanted to make use of their API to start downloading a collection of images to use on a future website project and so I started looking for a nice Python script to help me do it.

I found the awesome flickrpy Python API wrapper, which makes all the hard work very easy, and wrote a short script that will search for images with a given tag (or tags) and download them to the current directory.

Expanding on this, you could easily use PIL to modify the images and re-purpose them for use on another website, such as a WordPress photoblog (see the sketch after the script below).

To use the script you’ll have to download flickrpy, and get a Flickr API key.

Here’s the Python script that will download 20 images from Flickr:

#!/usr/bin/env python
"""Usage: python flickrDownload.py TAGS
TAGS is a space delimited list of tags
 
Created by Matt Warren on 2009-09-08.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
import sys
import shutil
import urllib
 
import flickr
 
NUMBER_OF_IMAGES = 20
 
#this is slow
def get_urls_for_tags(tags, number):
    photos = flickr.photos_search(tags=tags, tag_mode='all', per_page=number)
    urls = []
    for photo in photos:
        try:
            urls.append(photo.getURL(size='Large', urlType='source'))
        except:
            continue
    return urls
 
def download_images(urls):
    for url in urls:
        file, mime = urllib.urlretrieve(url)
        name = url.split('/')[-1]
        print name
        shutil.copy(file, './'+name)
 
def main(*argv):
    args = argv[1:]
    if len(args) == 0:
        print "You must specify at least one tag"
        return 1
 
    tags = [item for item in args]
 
    urls = get_urls_for_tags(tags, NUMBER_OF_IMAGES)
    download_images(urls)
 
if __name__ == '__main__':
    sys.exit(main(*sys.argv))
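
As a starting point for the PIL idea mentioned above, here's a hedged sketch that shrinks one of the downloaded images into a thumbnail. It assumes PIL is installed and that the file name you pass in is one the script just saved:

import Image   # PIL; on newer installs this may be: from PIL import Image

def make_thumbnail(filename, size=(300, 300)):
    """Save a bounded-size copy of an image alongside the original."""
    im = Image.open(filename)
    im.thumbnail(size)          # resizes in place, preserving the aspect ratio
    im.save('thumb_' + filename)

# Example:
# make_thumbnail('some_flickr_photo.jpg')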

I schedule all my scripts to run on a WebFaction server. That gives me excellent uptime and the ability to install my own programs and Python libraries. So far they have proved to be a kick-ass web hosting service.

I needed a way to push code updates for my scripts up to the server, so I quickly put together a simple script that uses FTP to transfer the files I need uploaded.

It's not very sophisticated, and there are probably better ways to deploy code, such as using Mercurial or rsync to push out updates without stepping on remote code changes. But the FTP approach works just fine.

This script uses a hard-coded list of files that I always want to push to the server (a sketch of a directory-scan alternative follows the script).

Here it is:

#!/usr/bin/env python
import ftplib
 
USERNAME='username'
PASSWORD='password'
HOST = 'ftp.server.com'
REMOTE_DIR = './bin/'
info= (USERNAME, PASSWORD)
 
files = ('file1.py', 'file2.py')
 
def connect(site, dir, user=(), verbose=True):
    if verbose:
        print 'Connecting', site
    remote = ftplib.FTP(site)   
    remote.login(*user)
    if verbose:
        print 'Changing directory', dir
    remote.cwd(dir)
    return remote
 
def putfile(remote, file, verbose=True):
    if verbose: 
        print 'Uploading', file
    local = open(file, 'rb')    
    remote.storbinary('STOR ' + file, local, 1024)
    local.close()
    if verbose: 
        print 'Upload done.'
 
def disconnect(remote):
    remote.quit()
 
if __name__ == '__main__':
    remote = connect(HOST, REMOTE_DIR, info)
    for file in files:
        putfile(remote, file)
    disconnect(remote)
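
The hard-coded file list could also be replaced by a simple directory scan. Here's a hedged sketch that collects every .py file in the current directory instead:

import os

def python_files(directory='.'):
    """Return the .py files in a directory (non-recursive)."""
    return [f for f in os.listdir(directory) if f.endswith('.py')]

# Example: upload everything instead of the hard-coded tuple.
# for file in python_files():
#     putfile(remote, file)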

Neverblue is a CPA network that I have found to be one of the better ones out there. If you are not familiar with the CPA side of internet marketing, it's where you get paid for each person you refer who performs a certain action (CPA = Cost Per Action). The action could be anything from providing a zip code or email address to purchasing a sample. The marketer who promotes the offer can get quite a good payout, anything from $0.05 to $50+.

Marketers find offers to promote using services like those provided by Neverblue, which acts as the middleman by finding and paying the marketers and finding businesses with offers for them to promote.

Neverblue is unique in that they program their own platform and have developed some nice APIs and interfaces for getting your performance and tracking statistics programmatically. I promote a bunch of their offers and make a decent amount of money through them, so I thought I should write a script that can download my statistics, keep them stored somewhere, and mesh them with my PPC data to calculate return-on-investment numbers per keyword.

Getting data from Neverblue is a 3 step process:

  1. Request a report to be generated
  2. Wait for that report to finish
  3. Request the results of the report

This is a bit more complex than most of the processes that download information, but it is a pretty flexible way to request bigger datasets without timing out on the HTTP request.

So here's a short Python script I wrote based on Neverblue's sample PHP script. It just prints out the payout information for yesterday.

Example Usage:

$ python NeverblueCheck.py
2009-08-20 $40.00

Here’s the Python code that gets the statistics from neverblue:

#!/usr/bin/env python
# encoding: utf-8
"""
NeverblueCheck.py
 
Created by Matt Warren on 2009-08-12.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
 
import urllib2
import time
import csv
import os
from urllib import urlencode
try:
    from xml.etree import ElementTree
except ImportError:
    from elementtree import ElementTree
 
username='Your Neverblue login (email)'
password='password'
 
url = 'https://secure.neverblue.com/service/aff/v1/rest/'
schedule_url = url + 'reportSchedule/'
status_url   = url + 'reportStatus/'
download_url = url + 'reportDownloadUrl/'
REALM = 'secure.neverblue.com'
 
SERVER_RETRIES = 100
SERVER_DELAY = 2
 
def install_opener():
    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
 
    # Add the username and password.
    password_mgr.add_password(REALM, url, username, password)
 
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
 
    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)
 
    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)
 
def request_report():
    params={'type':'date', 'relativeDate':'yesterday', 'campaign':0}
    req = urllib2.Request(schedule_url + '?' + urlencode(params))
 
    handle = urllib2.urlopen(req)
    xml = handle.read()
    tree = ElementTree.fromstring(xml)
 
    # parse the reportJob code from the XML
    reportJob = tree.find('reportJob').text
    return reportJob
 
def check_status(reportJob):
    params = {'reportJob': reportJob}
 
    for i in range(0, SERVER_RETRIES):
        req = urllib2.Request(status_url + '?' + urlencode(params))
        handle = urllib2.urlopen(req)
        xml = handle.read()
        tree = ElementTree.fromstring(xml)
        reportStatus = tree.find('reportStatus').text
        if reportStatus == 'completed':
            break
        time.sleep(SERVER_DELAY)
    return reportStatus
 
def get_results(reportJob):
    params = {'reportJob':reportJob, 'format':'csv'}
    req = urllib2.Request(download_url + '?' + urlencode(params))
    handle = urllib2.urlopen(req)
    xml = handle.read()
    tree = ElementTree.fromstring(xml)
    downloadURL = tree.find('downloadUrl').text
    report = urllib2.urlopen(downloadURL).read()
    reader = csv.DictReader( report.split( '\n' ) )
    for row in reader:
        print row['Date'], row['Payout']
 
if __name__=='__main__':
    install_opener()
    reportJob = request_report()
    reportStatus = check_status(reportJob)
    if reportStatus == 'completed':
        get_results(reportJob)

If you're interested in trying to make money with CPA offers, I highly recommend Neverblue; they have some really profitable offers and probably the most advanced platform out there right now for doing international offers.

I found this really neat bit of .bat file magic that will let you save your Python script code in a .bat file and run it in Windows just like any other script. The nice thing about this is that you don't have to create a separate "launch.bat" file with one "start python script.py" line in it.

This makes running Python scripts on Windows more like it is on Linux or Mac, where you can easily add a #!/usr/bin/env python line to the script and run it directly.

Here’s the bit of tricky batch file magic that does it:

@setlocal enabledelayedexpansion && python -x "%~f0" %* & exit /b !ERRORLEVEL!
#start python code here
print "hello world"

The way it works is that the first line of the file does two different things:

  1. It starts the Python interpreter, passing in the file itself ("%~f0") along with the command line arguments (%*); the -x option tells Python to skip the first line (the one containing the .bat file code).
  2. When Python finishes, the batch script exits, passing along Python's exit code (!ERRORLEVEL!).

This nifty trick makes it much nicer to write admin scripts with Python on Windows.

Update: fixed to properly pass command line arguments (the %* argument passes the .bat file's command line arguments through to Python).
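
To see the argument pass-through in action, save something like the following as a .bat file (the file name and arguments below are just examples) and run it with a few arguments:

@setlocal enabledelayedexpansion && python -x "%~f0" %* & exit /b !ERRORLEVEL!
# Python code starts here
import sys
print "Arguments from the .bat command line:", sys.argv[1:]

Running it as, say, args.bat hello world should print ['hello', 'world'].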

Amazon has a very comprehensive associates program that allows you to promote just about anything imaginable for any niche and earn a commission on anything you refer. The size of the catalog is what makes Amazon such a great program. People make some good money promoting Amazon products.

There is a great Python library called boto for accessing the other Amazon web services, such as S3 and EC2. However, it doesn't support the Product Advertising API.

With the Product Advertising API you have access to everything that you can read on the Amazon site about each product. This includes the product description, images, editor reviews, customer reviews and ratings. This is a lot of great information that you could easily find a good use for with your websites.

So how do you get at this information from within a Python program? Well the complicated part is dealing with the authentication that Amazon has put in place. To make that a bit easier I used the connection component from boto.

Here’s a demonstration snippet of code that will print out the top 10 best selling books on Amazon right now.

Example Usage:

$ python AmazonExample.py
Glenn Beck's Common Sense: The Case Against an Out-of-Control Government, Inspired by Thomas Paine by Glenn Beck
Culture of Corruption: Obama and His Team of Tax Cheats, Crooks, and Cronies by Michelle Malkin
The Angel Experiment (Maximum Ride, Book 1) by James Patterson
The Time Traveler's Wife by Audrey Niffenegger
The Help by Kathryn Stockett
South of Broad by Pat Conroy
Paranoia by Joseph Finder
The Girl Who Played with Fire by Stieg Larsson
The Shack [With Headphones] (Playaway Adult Nonfiction) by William P. Young
The Girl with the Dragon Tattoo by Stieg Larsson

To use this code you'll need an Amazon associates account, and you'll have to fill in the keys and tag needed for authentication.

Product Advertising API Python code:

#!/usr/bin/env python
# encoding: utf-8
"""
AmazonExample.py
 
Created by Matt Warren on 2009-08-17.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
 
import time
import urllib
try:
    from xml.etree import ElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET
 
from boto.connection import AWSQueryConnection
 
AWS_ACCESS_KEY_ID = 'YOUR ACCESS KEY'
AWS_ASSOCIATE_TAG = 'YOUR TAG'
AWS_SECRET_ACCESS_KEY = 'YOUR SECRET KEY'
 
def amazon_top_for_category(browseNodeId):
    aws_conn = AWSQueryConnection(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY, is_secure=False,
        host='ecs.amazonaws.com')
    aws_conn.SignatureVersion = '2'
    params = dict(
        Service='AWSECommerceService',
        Version='2009-07-01',
        SignatureVersion=aws_conn.SignatureVersion,
        AWSAccessKeyId=AWS_ACCESS_KEY_ID,
        AssociateTag=AWS_ASSOCIATE_TAG,
        Operation='ItemSearch',
        BrowseNode=browseNodeId,
        SearchIndex='Books',
        ResponseGroup='ItemAttributes,EditorialReview',
        Order='salesrank',
        Timestamp=time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()))
    verb = 'GET'
    path = '/onca/xml'
    qs, signature = aws_conn.get_signature(params, verb, path)
    qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
    response = aws_conn._mexe(verb, qs, None, headers={})
    tree = ET.fromstring(response.read())
 
    NS = tree.tag.split('}')[0][1:]
 
    for item in tree.find('{%s}Items'%NS).findall('{%s}Item'%NS):
        title = item.find('{%s}ItemAttributes'%NS).find('{%s}Title'%NS).text
        author = item.find('{%s}ItemAttributes'%NS).find('{%s}Author'%NS).text
        print title, 'by', author
 
if __name__ == '__main__':
    amazon_top_for_category(1000) #Amazon category number for US Books