Tag Archives: programming

There are many Twitter API libraries available for Python. I wanted to find out which one is the best and what the strengths and weaknesses of each are. However, there are far too many out there to find and review all of them. Instead, here's a selection of the most popular Python Twitter API wrappers, with a short review of each one and some sample code showing off the syntax.

Python Twitter

This is the library that I personally use in most of my Twitter scripts. It's very simple to use, but it is not up to date with the latest developments at Twitter. It was written by DeWitt Clinton and is available on Google Code. If you just want the basic API functionality, it does a pretty decent job.

import twitter

# Authenticate with basic credentials
api = twitter.Api('username', 'password')

# Public timeline and your friends list
statuses = api.GetPublicTimeline()
print [s.user.name for s in statuses]
users = api.GetFriends()
print [u.name for u in users]

# A specific user's timeline
statuses = api.GetUserTimeline('username')
print [s.text for s in statuses]

# Post a tweet (the Api object already holds the credentials)
api.PostUpdate('I love python-twitter!')

twyt

Twyt is a pretty comprehensive library that seems solid and well organized. In some cases there is added complexity because you have to parse the JSON objects returned from the Twitter API yourself. It is written and maintained by Andrew Price.

from twyt import twitter, data

t = twitter.Twitter()
t.set_auth("username", "password")

# Timelines and friends
print t.status_friends_timeline()
print t.user_friends()

# Post a tweet, load the returned JSON into a Status object, then delete the tweet
return_val = t.status_update("Testing 123")
s = data.Status()
s.load_json(return_val)
print s
t.status_destroy(s.id)

Twython

An up-to-date Python wrapper for the Twitter API. It supports Twitter's main API and search API, and will soon support OAuth and the Streaming API. It is based on the Python Twitter library and is actively maintained by Ryan McGrath.

import twython

# Basic authentication
twitter = twython.setup(authtype="Basic", username="example", password="example")
twitter.updateStatus("See how easy this was?")

# Paged friends timeline
friends_timeline = twitter.getFriendsTimeline(count="150", page="3")
print [tweet["text"] for tweet in friends_timeline]

Tweepy

Tweepy is a pretty compelling Python Twitter API library. It's up to date with the latest features of Twitter and is actively developed by Joshua Roesslein. It features OAuth support, Python 3 support, Streaming API support and its own cache system. Retweet streaming was recently added. If you want to use the most up to date features of the Twitter API from Python, or use Python 3, then you should definitely check out Tweepy.

import tweepy

# Basic authentication
api = tweepy.API.new('basic', 'username', 'password')

# Public and friends timelines
public_timeline = api.public_timeline()
print [tweet.text for tweet in public_timeline]
friends_timeline = api.friends_timeline()
print [tweet.text for tweet in friends_timeline]

# Look up a user and list their friends
u = api.get_user('username')
friends = u.friends()
print [f.screen_name for f in friends]

# Post a tweet
api.update_status('tweeting with tweepy')

It's pretty clear from the syntax samples that there's not much difference between these Python Twitter API libraries when you're just getting the basics done. The differences only start to show up when you get into the latest features: OAuth, streaming, and the retweet functions really differentiate these libraries. I hope this overview helps you find and choose the library that's right for your project.
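
Since OAuth is one of the big differentiators, here's a rough sketch of what OAuth setup looks like in Tweepy, based on my reading of its docs; the consumer key/secret and access token shown are placeholders you get from Twitter when you register an application:

import tweepy

# Placeholders -- you get these values by registering your app with Twitter
CONSUMER_KEY = 'consumer key'
CONSUMER_SECRET = 'consumer secret'
ACCESS_TOKEN = 'access token'
ACCESS_TOKEN_SECRET = 'access token secret'

# Build an OAuth-authenticated API object instead of passing a username/password
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

api.update_status('tweeting over OAuth with tweepy')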

Here's a simple web crawling script that will go from one URL and find all the pages it links to, up to a pre-defined depth. Web crawling is of course the lowest level tool used by Google to create its multi-billion dollar business. You may not be able to compete with Google's search technology, but being able to crawl your own sites, or those of your competitors, can be very valuable.

You could, for instance, routinely check your websites to make sure they are live and all the links are working, and have the crawler notify you of any 404 errors (there's a sketch of that after the script). By adding in a PageRank check you could identify better linking strategies to boost your scores. And you could identify possible leaks: paths a user could take that lead them away from where you want them to go.

Here’s the script:

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
from urllib2 import urlopen
 
class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)
        self.url = starting_url
        self.db = {self.url: 1}
        self.node = [self.url]
 
        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0
 
    def handle_starttag(self, tag, attrs):
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = dict(attrs).get('href') # look up the href attribute explicitly
            if not link:
                return
            if link[:4] != "http": # make relative links absolute
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')
 
            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)
            self.db[link] = (self.db.get(link) or 0) + 1
 
    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]
            self.node = []
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res = req.read()
                    self.feed(res)
                except:
                    self.reset() # skip pages that fail to load or parse
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)
        return zorted
 
if __name__ == "__main__":
    spidey = Spider(starting_url = 'http://www.7cerebros.com.ar', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." %(link,n, "s" if n is not 1 else "")

OK, so this isn't my script, but it's a much nicer version of the one I wrote that scrapes the actual Google Translate website to do the same thing. I'd like to thank Ashish Yadav for writing and sharing this.

Translating text is an easy way to create variations of content that are recognized as unique by the search engines. As part of a bigger SEO strategy this can make a big impact on your traffic. It could also be used to provide an automated way to translate your website into another language.

# -*- coding: utf-8 -*-
 
import re
import sys
import urllib
import simplejson
 
baseUrl = "http://ajax.googleapis.com/ajax/services/language/translate"
 
def getSplits(text, splitLength=4500):
    '''
    The Translate API has a limit on the length of text (4500 characters) that can be
    translated at once, so split the text into chunks of at most splitLength characters.
    '''
    return (text[index:index+splitLength] for index in xrange(0, len(text), splitLength))
 
 
def translate(text, src='', to='en'):
    '''
    A Python wrapper for the Google AJAX Language API:
    * Uses Google's language detection when the source language is not provided with the source text
    * Splits up the text if it's longer than 4500 characters, a limit imposed by the API
    '''
 
    params = {'langpair': '%s|%s' % (src, to),
              'v': '1.0'}
    retText = ''
    for chunk in getSplits(text):
        params['q'] = chunk
        resp = simplejson.load(urllib.urlopen(baseUrl, data=urllib.urlencode(params)))
        retText += resp['responseData']['translatedText']
    return retText
 
 
def test():
    msg = "      Write something You want to be translated to English,\n"\
        "      Enter ctrl+c to exit"
    print msg
    while True:
        text = raw_input('#>  ')
        retText = translate(text)
        print retText
 
 
if __name__=='__main__':
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)
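
Since translate() accepts optional source and target language codes, you can also import it into your own scripts. Here's a quick sketch (assuming you saved the file as translate.py, which is my own naming):

from translate import translate

# Let the API detect the source language and translate to English (the default)
print translate('Hola mundo')

# Or name the source and target languages explicitly
print translate('Bonjour tout le monde', src='fr', to='de')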

I schedule all my scripts to run on a WebFaction server. That gives me excellent uptime, and the ability to install my own programs and Python libraries. So far they have proved to be a kick-ass web hosting service.

I needed a way to push code updates for my scripts up to the server, so I quickly put together a simple script that uses FTP to transfer the files I need uploaded.

It's not very sophisticated, and there are probably better ways to deploy code, such as using Mercurial or rsync to push out updates without stepping on remote code changes. But the FTP approach works just fine.

This script uses a hard-coded list of files that I want to always push to the server.

Here it is:

#!/usr/bin/env python
import ftplib
 
USERNAME='username'
PASSWORD='password'
HOST = 'ftp.server.com'
REMOTE_DIR = './bin/'
info= (USERNAME, PASSWORD)
 
files = ('file1.py', 'file2.py')
 
def connect(site, dir, user=(), verbose=True):
    if verbose:
        print 'Connecting', site
    remote = ftplib.FTP(site)   
    remote.login(*user)
    if verbose:
        print 'Changing directory', dir
    remote.cwd(dir)
    return remote
 
def putfile(remote, file, verbose=True):
    if verbose: 
        print 'Uploading', file
    local = open(file, 'rb')    
    remote.storbinary('STOR ' + file, local, 1024)
    local.close()
    if verbose: 
        print 'Upload done.'
 
def disconnect(remote):
    remote.quit()
 
if __name__ == '__main__':
    remote = connect(HOST, REMOTE_DIR, info)
    for file in files:
        putfile(remote, file)
    disconnect(remote)
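
If you'd rather not maintain the hard-coded files tuple, one small variation (my own tweak, not part of the original script) is to push every .py file in the working directory instead:

import os

# Upload every Python file in the current directory instead of a fixed list
files = [f for f in os.listdir('.') if f.endswith('.py')]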

The Amazon Product Advertising API allows you to do a lot of stuff. Using their REST interface you can browse the entire catalogue of Amazon products and get the prices, pictures, descriptions and reviews for just about everything. It's also possible to use Amazon's shopping cart. Using this API it is possible to create your own e-commerce website with none of your own inventory, and without having to deal with credit card payments or the hassles that most e-commerce stores deal with.

It would be interesting to come up with your own value-add to the Amazon experience and offer that through your own website. Just off the top of my head, that might be something like:

  • A Digg-like site for products where people thumb up things they like
  • A more streamlined webpage for looking at reviews
  • A niche website that finds and sells just a small subset of the products
  • A college-specific site with links to the books used for each course

There are a lot of possibilities but to get started you need to understand how Amazon organizes and presents this information.

Once you sign up as an Amazon Associate and get the keys you need to access the service, you can build a request URL that will return XML with the information you asked for, or any error messages.

Each REST request requires an Operation argument, and each operation has a number of required and optional arguments. For example, Operation=ItemSearch will return the items that match the search criteria. The type of information you get back is determined by the ResponseGroup argument. You can get back just the images for a product by requesting ResponseGroup=Images, or get the editorial review by requesting ResponseGroup=EditorialReview. There are also some product-specific response groups, such as Tracks, which is valid for music albums.
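
For example, a request for just the images and editorial review of a single known item might look something like this (a sketch; [ASIN] is a placeholder for a real item ID):

http://ecs.amazonaws.com/onca/xml?
Service=AWSECommerceService&
Operation=ItemLookup&
AWSAccessKeyId=[Access Key ID]&
AssociateTag=[ID]&
ItemId=[ASIN]&
ResponseGroup=Images,EditorialReview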

When trying to dig through the available products you need to use Amazon's own hierarchy of product categories, called BrowseNodes, to find what you want. A BrowseNode is identified by a positive integer and given a name. They are created and deleted from the system regularly, so you shouldn't hard-code them into your programs. BrowseNodes are hierarchical: the root node "Books" has many child nodes such as "Mystery" and "Non-Fiction", and those in turn have more children. As you can imagine, a product may belong to multiple BrowseNodes, and a BrowseNode can have multiple parent nodes.
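
You can explore the hierarchy with the BrowseNodeLookup operation, which takes a BrowseNodeId and returns its name along with its children and ancestors. A request might look something like this (using 1000, the node for US Books that also shows up in the script in a later post):

http://ecs.amazonaws.com/onca/xml?
Service=AWSECommerceService&
Operation=BrowseNodeLookup&
AWSAccessKeyId=[Access Key ID]&
AssociateTag=[ID]&
BrowseNodeId=1000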

The SearchIndex is the high-level category for a product. This is a fixed list that includes Books, DVD, Apparel, Music, and Software. Separating searches by SearchIndex lets Amazon better optimize how they query their large catalogue of available products, and it returns more relevant results for a search query.

So a REST request to the Product Advertising API might look something like:

http://ecs.amazonaws.com/onca/xml?
Service=AWSECommerceService&
Operation=ItemSearch&
AWSAccessKeyId=[Access Key ID]&
AssociateTag=[ID]&
SearchIndex=Apparel&
Keywords=Shirt

This will return an XML document that contains a list of shirts that are available, as well as information about each shirt. Because the SearchIndex is Apparel, it will not return any books about shirts.
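
If you want to experiment before wading into the documentation, here's a minimal Python sketch of fetching that ItemSearch request and printing the item titles. Note that it leaves out the request signing Amazon is moving to; the script in a later post shows how to sign requests with boto.

import urllib
import xml.etree.cElementTree as ET

params = urllib.urlencode({
    'Service': 'AWSECommerceService',
    'Operation': 'ItemSearch',
    'AWSAccessKeyId': 'YOUR ACCESS KEY',  # fill in your own credentials
    'AssociateTag': 'YOUR TAG',
    'SearchIndex': 'Apparel',
    'Keywords': 'Shirt',
})
response = urllib.urlopen('http://ecs.amazonaws.com/onca/xml?' + params).read()
tree = ET.fromstring(response)

# Every tag in the response carries an XML namespace, so pull it off the root tag
ns = tree.tag.split('}')[0][1:]
for title in tree.findall('.//{%s}Title' % ns):
    print title.text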

You'll need to reference the API documentation if you want to write anything using this. Be prepared: the documentation is 600 pages long. It's available here.

I hope this post gives you an idea about where to start with Amazon's Product Advertising API. I have written a few Python scripts now that use it, and I will be cleaning them up this week to post here, so stay tuned for that.

I found this really neat bit of .bat file magic that lets you save your Python script code in a .bat file and run it in Windows just like any other script. The nice thing about this is that you don't have to create a separate "launch.bat" file with a single "start python script.py" line in it.

This makes running Python scripts on Windows more like it is on Linux or a Mac, where you can add a #!/usr/bin/env python line to the script and run it directly.

Here’s the bit of tricky batch file magic that does it:

@setlocal enabledelayedexpansion && python -x "%~f0" %* & exit /b !ERRORLEVEL!
#start python code here
print "hello world"

The way it works is that the first line of the file does two different things:

  1. It starts the Python interpreter, passing the name of the file itself in ("%~f0") along with any command line arguments (%*); the -x option tells Python to skip the first line, which contains batch code rather than Python.
  2. When Python finishes, the batch file immediately exits with Python's return code (exit /b !ERRORLEVEL!), so the rest of the file is never interpreted as batch commands.

This nifty trick makes it much nicer to write admin scripts with Python on Windows.

Update: fixed to properly pass command line arguments (the %* argument passes the .bat file's command line arguments through to Python).
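
Here's a quick sketch of what a complete script using the trick looks like, including reading the pass-through arguments (the file name greet.bat is just an example):

@setlocal enabledelayedexpansion && python -x "%~f0" %* & exit /b !ERRORLEVEL!
# Everything below this line is ordinary Python; save the whole file as greet.bat
import sys

print "Script:", sys.argv[0]
print "Arguments passed through from the .bat file:", sys.argv[1:]

Running greet.bat hello world from a command prompt prints both arguments.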

Amazon has a very comprehensive associate program that allows you to promote just about anything imaginable for any niche and earn a commission on anything you refer. The size of the catalogue is what makes Amazon such a great program. People make good money promoting Amazon products.

There is a great Python library, boto, for accessing the other Amazon web services such as S3 and EC2. However, it doesn't support the Product Advertising API.

With the Product Advertising API you have access to everything that you can read on the Amazon site about each product. This includes the product description, images, editor reviews, customer reviews and ratings. This is a lot of great information that you could easily find a good use for with your websites.

So how do you get at this information from within a Python program? Well the complicated part is dealing with the authentication that Amazon has put in place. To make that a bit easier I used the connection component from boto.

Here’s a demonstration snippet of code that will print out the top 10 best selling books on Amazon right now.

Example Usage:

$ python AmazonExample.py
Glenn Becks Common Sense: The Case Against an Out-of-Control Government, Inspired by Thomas Paine by Glenn Beck
Culture of Corruption: Obama and His Team of Tax Cheats, Crooks, and Cronies by Michelle Malkin
The Angel Experiment (Maximum Ride, Book 1) by James Patterson
The Time Travelers Wife by Audrey Niffenegger
The Help by Kathryn Stockett
South of Broad by Pat Conroy
Paranoia by Joseph Finder
The Girl Who Played with Fire by Stieg Larsson
The Shack [With Headphones] (Playaway Adult Nonfiction) by William P. Young
The Girl with the Dragon Tattoo by Stieg Larsson

To use this code you'll need an Amazon Associates account, and you'll need to fill in the keys and tag used for authentication.

Product Advertising API Python code:

#!/usr/bin/env python
# encoding: utf-8
"""
AmazonExample.py
 
Created by Matt Warren on 2009-08-17.
Copyright (c) 2009 HalOtis.com. All rights reserved.
"""
 
import time
import urllib
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import elementtree.ElementTree as ET
 
from boto.connection import AWSQueryConnection
 
AWS_ACCESS_KEY_ID = 'YOUR ACCESS KEY'
AWS_ASSOCIATE_TAG = 'YOUR TAG'
AWS_SECRET_ACCESS_KEY = 'YOUR SECRET KEY'
 
def amazon_top_for_category(browseNodeId):
    aws_conn = AWSQueryConnection(
        aws_access_key_id=AWS_ACCESS_KEY_ID,
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY, is_secure=False,
        host='ecs.amazonaws.com')
    aws_conn.SignatureVersion = '2'
    params = dict(
        Service='AWSECommerceService',
        Version='2009-07-01',
        SignatureVersion=aws_conn.SignatureVersion,
        AWSAccessKeyId=AWS_ACCESS_KEY_ID,
        AssociateTag=AWS_ASSOCIATE_TAG,
        Operation='ItemSearch',
        BrowseNode=browseNodeId,
        SearchIndex='Books',
        ResponseGroup='ItemAttributes,EditorialReview',
        Sort='salesrank',
        Timestamp=time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()))
    verb = 'GET'
    path = '/onca/xml'
    qs, signature = aws_conn.get_signature(params, verb, path)
    qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
    response = aws_conn._mexe(verb, qs, None, headers={})
    tree = ET.fromstring(response.read())
 
    NS = tree.tag.split('}')[0][1:]
 
    for item in tree.find('{%s}Items'%NS).findall('{%s}Item'%NS):
        title = item.find('{%s}ItemAttributes'%NS).find('{%s}Title'%NS).text
        author = item.find('{%s}ItemAttributes'%NS).find('{%s}Author'%NS).text
        print title, 'by', author
 
if __name__ == '__main__':
    amazon_top_for_category(1000) #Amazon category number for US Books

OK, so Yahoo search is on the way out and will be replaced by the search engine behind Bing, but that transition won't happen until sometime in 2010. Until then Yahoo still has about 20% of the search engine market share, so it's important to consider it as a source of traffic for your websites.

This script is similar to the Google and Bing SERP scrapers that I posted earlier on this site, but Yahoo's pages were slightly more complicated to parse. That's because Yahoo wraps result links in a redirect service, so the real URL has to be pulled out of each href (everything after the /** marker) with a regular expression and then unquoted.

I will be putting all these little components together into a larger program later.

Example Usage:

$ python yahooScrape.py
http://www.halotis.com/
http://www.halotis.com/2007/08/27/automation-is-key-automate-the-web/
http://twitter.com/halotis
http://www.scribd.com/halotis
http://www.topless-sandal.com/product_info.php/products_id/743?tsSid=71491a7bb080238335f7224573598606
http://feeds.feedburner.com/HalotisBlog
http://www.planet-tonga.com/sports/haloti_ngata.shtml
http://blog.oregonlive.com/ducks/2007/08/kellens_getting_it_done.html
http://friendfeed.com/mfwarren
http://friendfeed.com/mfwarren?start=30

Here’s the Script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
def yahoo_grab(query):
 
    address = "http://search.yahoo.com/search?p=%s" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    urlfile.close()
 
    soup = BeautifulSoup(page)
    url_pattern = re.compile('/\*\*(.*)')
    links =   [urllib.unquote_plus(url_pattern.findall(x.find('a')['href'])[0]) for x in soup.find('div', id='web').findAll('h3')]
 
    return links
 
if __name__=='__main__':
    # Example: print the top search result links for a query
    links = yahoo_grab('halotis')
    print '\n'.join(links)

Based on my last post for scraping the Google SERP, I decided to make the small change needed to scrape the organic search results from Bing.

I wasn't able to find a way to display 100 results per page in the Bing results, so this script only returns the top 10. It could be enhanced to loop through the pages of results, but I have left that out of this code (there's a rough sketch of it after the script).

Example Usage:

$ python BingScrape.py
http://twitter.com/halotis
http://www.halotis.com/
http://www.halotis.com/progress/
http://doi.acm.org/10.1145/367072.367328
http://runtoloseweight.com/privacy.php
http://twitter.com/halotis/statuses/2391293559
http://friendfeed.com/mfwarren
http://www.date-conference.com/archive/conference/proceedings/PAPERS/2001/DATE01/PDFFILES/07a_2.pdf
http://twitterrespond.com/
http://heatherbreen.com

Here’s the Python Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
 
from BeautifulSoup import BeautifulSoup
 
def bing_grab(query):
 
    address = "http://www.bing.com/search?q=%s" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    urlfile.close()
 
    soup = BeautifulSoup(page)
    links =   [x.find('a')['href'] for x in soup.find('div', id='results').findAll('h3')]
 
    return links
 
if __name__=='__main__':
    # Example: print the top search result links for a query
    links = bing_grab('halotis')
    print '\n'.join(links)
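
As mentioned above, the script only grabs the first page of results. Here's a rough sketch of looping through several pages by dropping a function like this into the script above (it assumes Bing's first parameter controls the result offset, which may change):

def bing_grab_pages(query, pages=3):
    '''Collect the organic results from the first few Bing pages (assumes the first= offset).'''
    links = []
    for page in range(pages):
        address = "http://www.bing.com/search?q=%s&first=%d" % (urllib.quote_plus(query), page * 10 + 1)
        request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
        page_html = urllib2.urlopen(request).read(200000)
        soup = BeautifulSoup(page_html)
        links += [x.find('a')['href'] for x in soup.find('div', id='results').findAll('h3')]
    return links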

Here's a short script that will scrape the first 100 listings in the Google organic results.

You might want to use this to find the position of your sites and track their ranking for certain target keyword phrases over time. That could be a very good way to determine, for example, whether your SEO efforts are working. Or you could use the list of URLs as a starting point for some other web crawling activity. There's a small helper for the rank-tracking idea after the script.

As the script is written it will just dump the list of URLs to a txt file.

It uses the BeautifulSoup library to help with parsing the HTML page.

Example Usage:

$ python GoogleScrape.py
$ cat links.txt
http://www.halotis.com/
http://www.halotis.com/2009/07/01/rss-twitter-bot-in-python/
http://www.blogcatalog.com/blogs/halotis.html
http://www.blogcatalog.com/topic/sqlite/
http://ieeexplore.ieee.org/iel5/10358/32956/01543043.pdf?arnumber=1543043
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1543043
http://doi.ieeecomputersociety.org/10.1109/DATE.2001.915065
http://rapidlibrary.com/index.php?q=hal+otis
http://www.tagza.com/Software/Video_tutorial_-_URL_re-directing_software-___HalOtis/
http://portal.acm.org/citation.cfm?id=367328
http://ag.arizona.edu/herbarium/db/get_taxon.php?id=20605&show_desc=1
http://www.plantsystematics.org/taxpage/0/genus/Halotis.html
http://www.mattwarren.name/
http://www.mattwarren.name/2009/07/31/net-worth-update-3-5/
http://newweightlossdiet.com/privacy.php
http://www.ingentaconnect.com/content/nisc/sajms/1988/00000006/00000001/art00002?crawler=true
http://www.ingentaconnect.com/content/nisc/sajms/2000/00000022/00000001/art00013?crawler=true
http://www.springerlink.com/index/etm69yghjva13xlh.pdf
http://www.springerlink.com/index/b7fytc095bc57x59.pdf
......
$

Here’s the script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
 
from BeautifulSoup import BeautifulSoup
 
def google_grab(query):
 
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    urlfile.close()
 
    soup = BeautifulSoup(page)
    links =   [x['href'] for x in soup.findAll('a', attrs={'class':'l'})]
 
    return links
 
if __name__=='__main__':
    # Example: Search written to file
    links = google_grab('halotis')
    open("links.txt","w+b").write("\n".join(links))