Category Archives: Internet

This is a simple Twitter Python script that checks your friends timeline and prints out any links that have been posted. In addition, it visits each of the URLs, finds the actual title of the destination page, and prints that alongside. This simple script demonstrates an easy way to gather some of the hottest trends on the internet the moment they happen.

If you set up a Twitter account within a niche and follow a few of the players in that niche, you can simply find any links posted, check whether they are on topic (using some keywords/heuristics), and then either notify yourself of the interesting content or automatically scrape it for use on one of your related websites. That gives you perhaps the most up-to-date content possible before it hits Google Trends. It also gives you a chance to promote it before the social news sites find it (or to be the first to submit it to them).
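The keyword/heuristics check mentioned above can be as simple as counting keyword hits in the page title. A minimal sketch (the keyword list, threshold, and URLs are made-up examples):

```python
NICHE_KEYWORDS = ['python', 'scraping', 'seo']  # hypothetical niche keywords

def is_on_topic(title, keywords=NICHE_KEYWORDS, threshold=1):
    """Crude relevance check: count keyword hits in the page title."""
    text = title.lower()
    hits = sum(1 for kw in keywords if kw in text)
    return hits >= threshold

# example (url, title) pairs as collected by the script below
links = [('', 'Scraping with Python'),
         ('', 'My cat photo gallery')]
on_topic = [(url, title) for (url, title) in links if is_on_topic(title)]
print(on_topic)
```

For anything beyond a quick filter you would want to weight keywords or check the page body too, but a title check is usually enough to cut the noise.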

With a bit more work you could parse out some of the meta tag keywords/description, crawl the website, or find and cut out the content from the page. If it’s a blog you could post a comment.

Example Usage:

$ python - Twitter Status - Tweets from users you follow may be missing from your timeline - Why Link Exchanges Are a Terrible, No-Good Idea - Food Blog Alliance - Frank and Trey - Gallery: Cute animals in the news this week

And here’s the python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
try:
    import json
except ImportError:
    import simplejson as json  # fallback for Python < 2.6
import twitter
from urllib2 import urlopen
import re
SETTINGS = {'user':'twitter user name', 'password':'your password here'}
def listFriendsURLs(user, password):
    re_pattern='.*?((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'	# HTTP URL
    rg = re.compile(re_pattern,re.IGNORECASE|re.DOTALL)
    api = twitter.Api(user, password)
    timeline = api.GetFriendsTimeline(user)
    for status in timeline:
        m =
        if m:
            httpurl =
            title = getTitle(httpurl)
            print httpurl, '-', title
def getTitle(url):
    req = urlopen(url)
    html =
    re_pattern = '<title>(.*?)</title>'
    rg = re.compile(re_pattern,re.IGNORECASE|re.DOTALL)
    m =
    if m:
        title =
        return title.strip()
    return None
if __name__ == '__main__':
    listFriendsURLs(SETTINGS['user'], SETTINGS['password'])

Here’s a simple web crawling script that will go from one URL and find all the pages it links to, up to a pre-defined depth. Web crawling is of course the lowest-level tool used by Google to create its multi-billion dollar business. You may not be able to compete with Google’s search technology, but being able to crawl your own sites, or those of your competitors, can be very valuable.

You could, for instance, routinely check your websites to make sure they are live and all the links are working, and have the script notify you of any 404 errors. By adding in a PageRank check you could identify better linking strategies to boost your PageRank scores. And you could identify possible leaks – paths a user could take that lead them away from where you want them to go.
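The 404-notification idea boils down to recording a status code per URL while crawling and flagging anything in the error range. A sketch of just that filtering step (the status map is example data; in the crawler you would record `e.code` from a caught `HTTPError`):

```python
def broken_links(status_by_url):
    """Return the URLs whose recorded HTTP status is an error (4xx/5xx)."""
    return sorted(url for url, status in status_by_url.items() if status >= 400)

# hypothetical results collected during a crawl
statuses = {'http://example.test/': 200,
            'http://example.test/old-page': 404,
            'http://example.test/api': 500}
print(broken_links(statuses))
```

From there, emailing yourself the list is one `smtplib` call away.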

Here’s the script:

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
from urllib2 import urlopen
class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)
        self.url = starting_url
        self.db = {self.url: 1}
        self.node = [self.url]
        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0
    def handle_starttag(self, tag, attrs):
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = dict(attrs).get('href')
            if not link:
                return
            if link[:4] != "http":  # resolve relative links against the site root
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')
            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)  # queue it for the next depth level
            self.db[link] = (self.db.get(link) or 0) + 1
    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]
            self.node = []
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res =
                    self.feed(res)
                except Exception:
                    self.reset()  # skip pages that fail to fetch or parse
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)
        return zorted
if __name__ == "__main__":
    spidey = Spider(starting_url = '', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." %(link, n, "s" if n != 1 else "")

Ok, so this isn’t my script, but it’s a much nicer version of the one I wrote that scrapes the actual Google Translate website to do the same thing. I’d like to thank Ashish Yadav for writing and sharing this.

Translating text is an easy way to create variations of content that are recognized as unique by the search engines. As part of a bigger SEO strategy this can make a big impact on your traffic. Or it could be used to provide an automated way to translate your website to another language.
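Before the script itself, one constraint is worth calling out: the API caps each request at 4500 characters, so long text has to be chunked and the translated pieces joined back together. That splitting logic on its own looks like this (a standalone sketch of the same idea the script uses):

```python
def get_splits(text, split_length=4500):
    """Break text into chunks no longer than the API's per-request limit."""
    return [text[i:i + split_length] for i in range(0, len(text), split_length)]

chunks = get_splits('x' * 10000)
print([len(c) for c in chunks])  # [4500, 4500, 1000]
```

Joining the per-chunk translations in order reassembles the full result.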

# -*- coding: utf-8 -*-
import re
import sys
import urllib
import simplejson
baseUrl = ""  # Google AJAX Language API endpoint
def getSplits(text, splitLength=4500):
    '''The Translate API has a limit on the length of text (4500 characters)
    that can be translated at once, so split longer text into chunks.'''
    return (text[index:index+splitLength] for index in xrange(0, len(text), splitLength))
def translate(text, src='', to='en'):
    '''A Python wrapper for the Google AJAX Language API:
    * Uses Google language detection when the source language is not provided
    * Splits up text if it's longer than 4500 characters, a limit put up by the API
    '''
    params = {'langpair': '%s|%s' % (src, to),
              'v': '1.0'}
    retText = ''
    for text in getSplits(text):
        params['q'] = text
        resp = simplejson.load(urllib.urlopen(baseUrl, urllib.urlencode(params)))
        try:
            retText += resp['responseData']['translatedText']
        except (KeyError, TypeError):
            raise Exception(resp.get('responseDetails', 'translation failed'))
    return retText
def test():
    msg = "      Write something you want to be translated to English,\n"\
          "      Enter ctrl+c to exit"
    print msg
    while True:
        text = raw_input('#>  ')
        retText = translate(text)
        print retText
if __name__=='__main__':
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)

Last night I attended an event here in Vancouver sponsored by neverblue ads. It was a chance to meet my affiliate manager in person to get some of his tips for taking things to the next level. I was also able to meet a ton of other affiliates and find out what was working for them and get a real feel for the industry right now.

One interesting thing that I learned that more advanced marketers are doing right now is to use PPC to do the initial testing and track down which keywords and which demographics are really working for a particular offer. Then they take that information and use it to intelligently purchase a media buy which could cost upwards of $10,000. That’s where the big boys are playing and making tens of thousands of dollars every day.

I was surprised to learn from my affiliate manager that of all the people that are actively working as an affiliate for CPA offers about 30% make more than $5000/month. That’s much higher than I expected.

Here’s a tip for you: if you haven’t heard of, sign up right now and find out if there are any meetups of people with similar interests in your city. I’ve joined a bunch of groups and met hundreds of local people now that are interested in making money online. Going to events like meetup202 last night with the guys from Neverblue really keeps me motivated and focused on growing my business.

I’m working on a project that required a way to iteratively go through Amazon BrowseNodes. To do that I wanted to do a breadth first search through the tree and came up with a rather nice way to do that in Python.

There are a few resources that can be useful for finding browse nodes. The website provides a crude way to browse through BrowseNodes, but it doesn’t offer the kind of control you might need in your own application.

Here’s a Python script that will print out browseNodes breadth first starting from the root node for books.

#!/usr/bin/env python
'''
Created by Matt Warren on 2009-09-08.
Copyright (c) 2009 All rights reserved.
'''
import time
import urllib
try:
    from xml.etree import ElementTree
except ImportError:
    from elementtree import ElementTree
from boto.connection import AWSQueryConnection

AWS_ACCESS_KEY_ID = 'your access key id here'
AWS_SECRET_ACCESS_KEY = 'your secret key here'
BROWSENODES = {}

def bfs(root, children=iter):
    queue = [root, ]
    visited = [root]
    while len(queue) > 0:
        node = queue.pop(0)
        yield node
        for child in children(node):
            if not child in visited:
                visited.append(child)
                queue.append(child)

def amazon_browsenodelookup_children(nodeId, searchIndex='Books'):
    aws_conn = AWSQueryConnection(
        aws_secret_access_key=AWS_SECRET_ACCESS_KEY, is_secure=False,
    aws_conn.SignatureVersion = '2'
    # request parameters (the original listing only showed Timestamp)
    params = dict(
        Timestamp=time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()))
    verb = 'GET'
    path = '/onca/xml'
    qs, signature = aws_conn.get_signature(params, verb, path)
    qs = path + '?' + qs + '&Signature=' + urllib.quote(signature)
    response = aws_conn._mexe(verb, qs, None, headers={})
    content =
    tree = ElementTree.fromstring(content)
    NS = tree.tag.split('}')[0][1:]
    children = []
    try:
        for node in tree.find('{%s}BrowseNodes'%NS).find('{%s}BrowseNode'%NS).find('{%s}Children'%NS).findall('{%s}BrowseNode'%NS):
            name = node.find('{%s}Name'%NS).text
            id = node.find('{%s}BrowseNodeId'%NS).text
            children.append( id )
            BROWSENODES[id] = name
    except AttributeError:  # leaf node: the response has no Children element
        return []
    return children

if __name__ == '__main__':
    BROWSENODES['1000'] = 'Books'
    count = 0
    LIMIT = 25
    for node in bfs('1000', amazon_browsenodelookup_children):
        count = count + 1
        if count > LIMIT:
            break
        print BROWSENODES[node], '-', node
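Since `bfs` takes any `children` callable, you can check the traversal order against a toy dict-based tree without touching Amazon’s API. The generator is repeated here so the snippet is self-contained (the category names are made up):

```python
def bfs(root, children=iter):
    """Yield nodes breadth-first, using `children` to expand each node."""
    queue = [root]
    visited = [root]
    while queue:
        node = queue.pop(0)
        yield node
        for child in children(node):
            if child not in visited:
                visited.append(child)
                queue.append(child)

TREE = {'Books': ['Mystery', 'Non-Fiction'],
        'Mystery': ['Thrillers'],
        'Non-Fiction': [],
        'Thrillers': []}
order = list(bfs('Books', lambda node: TREE.get(node, [])))
print(order)  # ['Books', 'Mystery', 'Non-Fiction', 'Thrillers']
```

Note how siblings come out before any grandchildren — exactly the level-by-level order you want when walking a category tree.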

Flickr has an amazing library of images and a stellar API for accessing and easily downloading them. I wanted to make use of their API to start downloading a collection of images to use on a future website project and so I started looking for a nice Python script to help me do it.

I found the awesome flickrpy python API wrapper that makes all the hard work very easy to use and wrote a short script that will search for images with a given tag(s) and download them to the current directory.

Expanding on this you could easily use PIL to modify the images and re-purpose them for use on another website such as a wordpress photoblog.
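If you go the PIL route, the only real decision is the target size: a thumbnail should scale down while preserving aspect ratio. That math is simple enough to sketch on its own (this helper just computes the dimensions; you would then pass them to PIL’s resize, or use `Image.thumbnail`, which does the same scaling internally):

```python
def thumbnail_size(width, height, max_side=300):
    """Scale (width, height) down so the longer side is at most max_side."""
    scale = min(1.0, float(max_side) / max(width, height))
    return (int(width * scale), int(height * scale))

print(thumbnail_size(600, 400))  # (300, 200)
print(thumbnail_size(200, 100))  # (200, 100) -- already small enough
```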

To use the script you’ll have to download flickrpy, and get a Flickr API key.

Here’s the Python script that will download 20 images from Flickr:

#!/usr/bin/env python
"""Usage: python TAGS
TAGS is a space delimited list of tags

Created by Matt Warren on 2009-09-08.
Copyright (c) 2009 All rights reserved.
"""
import sys
import shutil
import urllib
import flickr

NUMBER_OF_IMAGES = 20

#this is slow
def get_urls_for_tags(tags, number):
    photos = flickr.photos_search(tags=tags, tag_mode='all', per_page=number)
    urls = []
    for photo in photos:
        try:
            urls.append(photo.getURL(size='Large', urlType='source'))
        except flickr.FlickrError:  # some photos have no Large size available
            pass
    return urls

def download_images(urls):
    for url in urls:
        file, mime = urllib.urlretrieve(url)
        name = url.split('/')[-1]
        print name
        shutil.copy(file, './'+name)

def main(*argv):
    args = argv[1:]
    if len(args) == 0:
        print "You must specify at least one tag"
        return 1
    tags = [item for item in args]
    urls = get_urls_for_tags(tags, NUMBER_OF_IMAGES)
    download_images(urls)
    return 0

if __name__ == '__main__':
    sys.exit(main(*sys.argv))

In the last few months I have been making extremely good money promoting CPA offers from Neverblue, so I thought it would be worthwhile writing a quick script to keep me from having to log into the site to check my statistics.

Neverblue has a feature called the Global Postback URL, which they will ping every time a conversion is made, providing you with some details about the action. The most straightforward thing I wanted to do was convert that ping into an email. This works great for me since I’m only promoting the lower-volume, high-profit items right now, so I won’t get blasted with 100s of emails every day (I wish I had that problem).

Basically it’s a very simple PHP script that I put on my website. Here’s the neverblueemailer.php script:

<?php
$message = "Neverblue CPA Action Notification\n\n";
$message = $message. "affiliate_id:  " . $_GET["af"] . "\n";
$message = $message . "site_id:  " . $_GET["site"] . "\n";
$message = $message . "campaign_id:  " . $_GET["c"] . "\n";
$message = $message . "action_id:  " . $_GET["ac"] . "\n";
$message = $message . "subid:  " . $_GET["s"] . "\n";
$message = $message . "unique_conversion_id:  " . $_GET["u"] . "\n";
echo $message;
$to = '';
$subject = 'Neverblue CPA Action Notification';
$headers = "From:\r\nReply-To:";
$mail_sent = @mail( $to, $subject, $message, $headers );
echo $mail_sent ? "Mail sent" : "Mail failed";

You just need to edit your email information.

With that file uploaded to your website you can then hook it up to neverblue. Simply login to your account and put the URL in the Tools -> Global Postback URL -> Enter Your Non Site Specific Global Postback URL section. Then click Save Changes.


The URL you need is going to be something like this:{affiliate_id}&site={site_id}&c={campaign_id}&ac={action_id}&s={subid}&u={unique_conversion_id}
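If you want to sanity-check the postback before pasting it in, it’s just a query string over your script’s address. A sketch that builds one with sample values (the domain, path, and values here are made up; the parameter names match the PHP script above):

```python
base = 'http://your-site.example/neverblueemailer.php'  # hypothetical location
params = [('af', '12345'), ('site', '678'), ('c', '910'),
          ('ac', '1'), ('s', 'mysubid'), ('u', 'abc123')]
url = base + '?' + '&'.join('%s=%s' % (key, value) for (key, value) in params)
print(url)
```

Neverblue substitutes the `{...}` placeholders with real values when it pings; visiting a URL built like this simulates one conversion.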

That’s it. Test it yourself by visiting the URL in your browser. You should immediately get an email in your inbox.

I cannot recommend neverblue enough. They have quickly become my biggest source of online revenue. It’s worth signing up for to give CPA a try.

Yesterday I found a website that got me thinking.

It’s a niche website with a lot of unique content that is heavily SEO optimized. The business model was very simple – they have 4 CPA lead generation offers on each page and the goal is quite apparent: keep people on the site long enough with the content and drive them to one of the CPA offers. I imagine that the site is very successful at making money.

It’s a model that is simple, and a website that is not technically difficult to create.  However when I started to think about how to duplicate this business model for myself I found it to be a mountainous undertaking.

And so I’m going to ask you: if you needed to get 30,000-50,000 words of high-quality unique content that would keep people on your website for 10+ minutes, how would you get it? The two options I could think of are:

  1. Pay someone to research and write the content for you
  2. Buy a website that already has the unique content on it

Both of those options could cost $$$.  Is there another way to get great content fast without spending so much money?  Leave a comment with your thoughts.

By the way…  the interesting  website I found that started this train of thought is  How much do you think a website like that would be making?  How much would it cost to develop?

I schedule all my scripts to run on a WebFaction server. That gives me excellent up time, and the ability to install my own programs and python libraries. So far they have proved to be a kick-ass web hosting service.

I needed a way to push code updates to my scripts up to the server and so I quickly put together a simple script that uses FTP to transfer the files I need uploaded.

It’s not very sophisticated and there are probably better ways to deploy code such as using mercurial, or rsync to push out updates without stepping on remote code changes. But the FTP approach will work just fine.

This script uses a hard-coded list of files that I want to always push to the server.

Here it is:

#!/usr/bin/env python
import ftplib

HOST = ''
REMOTE_DIR = './bin/'
files = ('', '')
info = ('username', 'password')  # your FTP login here

def connect(site, dir, user=(), verbose=True):
    if verbose:
        print 'Connecting', site
    remote = ftplib.FTP(site)
    remote.login(*user)
    if verbose:
        print 'Changing directory', dir
    return remote

def putfile(remote, file, verbose=True):
    if verbose:
        print 'Uploading', file
    local = open(file, 'rb')
    remote.storbinary('STOR ' + file, local, 1024)
    if verbose:
        print 'Upload done.'

def disconnect(remote):

if __name__ == '__main__':
    remote = connect(HOST, REMOTE_DIR, info)
    for file in files:
        putfile(remote, file)
    disconnect(remote)

The Amazon Product Advertising API allows you to do a lot of stuff. Using their REST interface you can browse the entire catalogue of Amazon products and get the prices, pictures, descriptions and reviews for just about everything. It’s also possible to use Amazon’s shopping cart. Using this API it is possible to create your own e-commerce website with none of your own inventory, and without having to deal with credit card payments or the other hassles that most e-commerce stores deal with.

It would be interesting to come up with your own value-add to the Amazon experience and offer that through your own website. Just off the top of my head that might be something like:

  • A Digg-like site for products where people thumb up things they like
  • A more streamlined webpage for looking at reviews
  • A niche website that finds and sells just a small subset of the products
  • A college-specific site with links to the books used for each course

There are a lot of possibilities but to get started you need to understand how Amazon organizes and presents this information.

Once you sign up as an Amazon Associate and get the keys you need to access the service you can build the request URL that will return some XML with the information you asked for or any error messages.

Each REST request will require an Operation argument. Each operation has a number of required and optional arguments. For example the “Operation=ItemSearch” will return the items that match the search criteria. The type of information you get back is determined by the ResponseGroup required argument. You can get back just the images for a product by requesting ResponseGroup=Image, or get the editorial review by requesting ResponseGroup=EditorialReview. There are some product specific response groups such as Tracks that are valid for music albums.
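Putting those pieces together, a request is just the REST endpoint plus the query parameters described above. A sketch of assembling one (the parameter values are examples; a real request also needs your Access Key ID, a Timestamp, and a signature):

```python
endpoint = ''  # the REST endpoint
params = {'Service': 'AWSECommerceService',
          'Operation': 'ItemSearch',
          'SearchIndex': 'Apparel',
          'Keywords': 'Shirt',
          'ResponseGroup': 'Small,Images'}
# sorted order keeps the query string stable, which matters once you sign it
query = '&'.join('%s=%s' % (name, params[name]) for name in sorted(params))
url = endpoint + '?' + query
print(url)
```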

When trying to dig through the available products you need to use Amazon’s own hierarchy of product categories to find what you want. These are called BrowseNodes. A BrowseNode is identified with a positive integer and given a name. They are created and deleted from the system regularly, so you shouldn’t hard-code them in your programs. BrowseNodes are hierarchical, so the root node “Books” has many child nodes such as “Mystery” and “Non-Fiction”, and those in turn have more children. As you can imagine, a product may belong to multiple BrowseNodes and a BrowseNode can have multiple parent nodes.

The SearchIndex is the high-level category for a product. This is a fixed list that includes: Books, DVD, Apparel, Music, and Software. Separating searches by SearchIndex allows Amazon to better optimize how they query their large catalog of available products, and it returns more relevant results for a search query.

So a REST request to the Product Advertising API might look something like this (split across lines for readability; the SearchIndex, Keywords, and placeholder key reflect the example discussed below):
AWSAccessKeyId=[Access Key ID]&

This will return an XML document that contains a list of shirts that are available, as well as information about each shirt. Because the SearchIndex is Apparel, it will not return any books about shirts.

You’ll need to reference the API documentation if you want to write anything using this. Be prepared: the documentation is 600 pages long.

I hope this post gives you an idea about where to start with Amazon’s Product Advertising API. I have written a few Python scripts now that use it and will be cleaning them up this week to post here, so stay tuned for that.