Category Archives: Python

I got an email the other day from Frank Kern, who was pimping another make-money-online product from his cousin Trey. The Number Effect is a DVD containing the results of an experiment: he created an affiliate link for every one of the 12,000 products for sale on ClickBank, sent paid (PPV) traffic to all of those links, and tracked which ones were profitable. Out of the 12,000 he found 54 niches with profitable campaigns.

Trey went on to talk about the software he had written for this experiment; apparently it took his outsourced programmer a fair bit of work to get it going.

I thought it would be fun to try to implement the same script myself. It took about an hour to program the whole thing.

So if you want to create your own ClickBank affiliate link for every ClickBank product for sale, here’s a script that will do it. Keep in mind that I never did any work to make this thing fast, and it takes about 8 hours to scrape all 13,000 products, create the affiliate links, and resolve the URLs they point to. Sure, I could make it faster, but I’m lazy.

Here’s the python script to do it:

#!/usr/bin/env python
# encoding: utf-8
"""
ClickBankMarketScrape.py
 
Created by Matt Warren on 2010-09-07.
Copyright (c) 2010 HalOtis.com. All rights reserved.
 
"""
 
 
 
CLICKBANK_URL = 'http://www.clickbank.com'
MARKETPLACE_URL = CLICKBANK_URL+'/marketplace.htm'
AFF_LINK_FORM = CLICKBANK_URL+'/info/jmap.htm'
 
AFFILIATE = 'mfwarren'
 
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
import re
 
product_links = []
product_codes = []
pages_to_scrape = []
 
def get_category_urls():
	request = urllib2.Request(MARKETPLACE_URL, None)
	urlfile = urllib2.urlopen(request)
	page = urlfile.read()
	urlfile.close()
 
	soup = BeautifulSoup(page)
	parentCatLinks = [x['href'] for x in soup.findAll('a', {'class':'parentCatLink'})]
	return parentCatLinks
 
def get_products():
 
	fout = open('ClickBankLinks.csv', 'w')
 
	while len(pages_to_scrape) > 0:
 
		url = pages_to_scrape.pop()
		request = urllib2.Request(url, None)
		urlfile = urllib2.urlopen(request)
		page = urlfile.read()
		urlfile.close()
 
		soup = BeautifulSoup(page)
 
		results = [x.find('a') for x in soup.findAll('tr', {'class':'result'})]
 
		nextLink = soup.find('a', title='Next page')
		if nextLink:
			pages_to_scrape.append(nextLink['href'])
 
		for product in results:
			try:
				product_code = str(product).split('.')[1]
				product_codes.append(product_code)
				m = re.search('^<(.*)>(.*)<', str(product))
				title = m.group(2)
				my_link = get_hoplink(product_code)
				request = urllib2.Request(my_link)
				urlfile = urllib2.urlopen(request)
				display_url = urlfile.url
				#page = urlfile.read()  #continue here if you want to scrape keywords etc from landing page
 
				print my_link, display_url
				product_links.append({'code':product_code, 'aff_link':my_link, 'dest_url':display_url})
				fout.write(product_code + ', ' + my_link + ', ' + display_url + '\n')
				fout.flush()
			except:
				continue  # handle cases where destination url is offline
 
	fout.close()
 
def get_hoplink(vendor):
	request = urllib2.Request(AFF_LINK_FORM + '?affiliate=' + AFFILIATE + '&promocode=&submit=Create&vendor='+vendor+'&results=', None)
	urlfile = urllib2.urlopen(request)
	page = urlfile.read()
	urlfile.close()
	soup = BeautifulSoup(page)
	link = soup.findAll('input', {'class':'special'})[0]['value']
	return link
 
if __name__=='__main__':
	urls = get_category_urls()
	for url in urls:
		pages_to_scrape.append(CLICKBANK_URL+url)
	get_products()
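If you did want to speed it up, the slow part is resolving the hoplink and destination URL for each product one at a time. Here’s a rough, untested sketch of how that step could be farmed out to a handful of worker threads – it reuses get_hoplink and urllib2 from the script above, and the thread count is an arbitrary choice:

import threading
import Queue

NUM_WORKERS = 10  # arbitrary; tune to taste
work_queue = Queue.Queue()
resolved = []
resolved_lock = threading.Lock()

def worker():
    while True:
        try:
            product_code = work_queue.get_nowait()
        except Queue.Empty:
            return  # no more work to do
        try:
            my_link = get_hoplink(product_code)
            display_url = urllib2.urlopen(my_link).url
            with resolved_lock:
                resolved.append({'code': product_code, 'aff_link': my_link, 'dest_url': display_url})
        except Exception:
            pass  # skip products whose destination is offline
        finally:
            work_queue.task_done()

def resolve_all(product_codes):
    for code in product_codes:
        work_queue.put(code)
    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker)
        t.setDaemon(True)
        t.start()
    work_queue.join()  # block until every product has been attempted
    return resolved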

Django Dash is a 48 hour competition where teams from all over the world build a django based web application. You can see the finished web applications at http://djangodash.com/. All the projects were required to be open sourced and hosted on either github.com or bitbucket.org. The cool thing about that is you can see how all these web applications were built from the ground up and get a feel for how to build really compelling django apps.

The results have now been posted.

I’m sure there are lots of little tricks for quick development buried in those repos. I know I’ll be digging through them to get some inspiration for future projects.

Just 2 days ago I had an idea for a website. Today I’ve launched the site. I had a lot of fun coding the web application – that is why I love using Django.

I have to say that, considering I’m getting a bit rusty with Django and have never used any of the Facebook APIs, I am pretty impressed that I was able to get a reasonably decent-looking and functional website up in just 2 days.

Want to take a look at the web application? http://like.halotis.com is where I’ve hosted it for the time being.

The concept is pretty simple. People can post quotes, phrases, links, or jokes and then like them on Facebook. The Facebook integration lets the application promote itself virally as popular “Likeables” spread through the network.

It was a pretty simple concept with a trivial data model, so most of the development time was spent fiddling with the CSS and layouts. But I certainly have plenty of ideas for ways to increase engagement with the site going forward.
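Just to give a sense of how trivial the data model is, a “Likeable” boils down to something like this (a simplified sketch, not the exact code running on the site):

from django.db import models

class Likeable(models.Model):
    """A quote, phrase, link, or joke that visitors can Like on Facebook."""
    title = models.CharField(max_length=200)
    body = models.TextField(blank=True)
    link = models.URLField(blank=True)
    created = models.DateTimeField(auto_now_add=True)

    def __unicode__(self):
        return self.title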

This may be the first of many future django facebook applications from me.

I will admit that I have trouble from time to time remembering to keep my blog active and publish frequently. Compounding the problem, I have many tens of blogs out there that have become stale and effectively dead due to lack of attention. To help solve this I’ve put together a simple nagging script, run on a schedule, that checks a bunch of my sites to see how fresh their content is. Once a site passes a specified threshold for the number of days since its last post, I get a nagging email reminding me to write something for it.

#!/usr/bin/env python 
# -*- coding: utf-8 -*-
# (C) 2010 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import feedparser  # available at feedparser.org
import gmail       # gmail.py - helper module for sending email (not part of the standard library)
 
import datetime
import re
 
SETTINGS = [{'name':'yourblog.com', 'url':'http://www.example.com/feed/', 'email':'YOUR EMAIL', 'frequency':'3'},
            ]
 
today = datetime.datetime.today()
 
def check_blog(blog):
 
    d = feedparser.parse(blog['url'])
    lastPost = d.entries[0]
 
    pubDate = datetime.datetime(
            lastPost.modified_parsed[0],
            lastPost.modified_parsed[1],
            lastPost.modified_parsed[2],
            lastPost.modified_parsed[3],
            lastPost.modified_parsed[4],
            lastPost.modified_parsed[5])
 
    if today - pubDate > datetime.timedelta(days=int(blog['frequency'])):  # per-blog threshold from SETTINGS
        print "send email - last post " + str((today-pubDate).days) + " days ago."
        gmail.send_email(blog['name'] + ' needs attention! last post ' + str((today-pubDate).days) + " days ago.", 'Please write a post', to_addr=blog['email'])
    else:
        print "good - last post" + str((today-pubDate).days) + " days ago."
 
if __name__ == '__main__':
    for blog in SETTINGS:
        check_blog(blog)

I’ve added this script to the halotis-collection on bitbucket.org if you’re interested in pulling it from there.

I have been doing a lot of web development work lately. Mostly learning about how different people create their workflows and manage local development, testing, staging, and production deployment of code.

In the past I have used Apache Ant for deploying Java applications, and it is a bit cumbersome. Ant’s XML config files are limiting once you try to do something non-standard and can require writing special Java code to create new directives, and the resulting XML is not always easy to read.

For the last few days I have been using Fabric to write a few simple deploy scripts and I think this is a much nicer way of doing it. You get the full power of Python but a very simple syntax and easy command line usage.

Here’s a very simple deploy script that I am using to deploy some static files to my web server.

from fabric.api import *
 
#Fabric 0.9.0 compatible
# usage: $ fab prod deploy
 
REMOTE_HG_PATH = '/home/halotis/bin/hg'
 
def prod():
    """Set the target to production."""
    env.user = 'USERNAME'
    env.hosts = ['USERNAME.webfactional.com']
    env.remote_app_dir = 'webapps/APPLICATION'
    env.remote_push_dest = 'ssh://USERNAME@USERNAME.webfactional.com/%s' % env.remote_app_dir
    env.tag = 'production'
 
 
def deploy():
    """Deploy the site.
 
    This will tag the repository, and push changes to the remote location.
    """
    require('hosts', provided_by=[prod, ])
    require('remote_app_dir', provided_by=[prod, ])
    require('remote_push_dest', provided_by=[prod, ])
    require('tag', provided_by=[prod, ])
 
    local("hg tag --force %s" % env.tag)
    local("hg push %s --remotecmd %s" % (env.remote_push_dest, REMOTE_HG_PATH))
    run("cd %s; hg update -C %s" % (env.remote_app_dir, env.tag))

For this to work, though, you need to have a few things set up:

  • SSH access to the remote server
  • Mercurial (hg) installed on both the remote server and your development machine
  • A bootstrapped remote repository – FTP the .hg folder to the destination location (or script it; see the sketch below)
  • Fabric installed on your local development machine – $ pip install fabric
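If you’d rather not FTP the .hg folder by hand, the bootstrap step can be scripted as one more Fabric task, roughly like this (an untested sketch that assumes the same prod() settings and REMOTE_HG_PATH as above):

def bootstrap():
    """One-time setup: create an empty repository on the server and push to it."""
    require('hosts', provided_by=[prod, ])
    require('remote_app_dir', provided_by=[prod, ])
    require('remote_push_dest', provided_by=[prod, ])

    run("%s init %s" % (REMOTE_HG_PATH, env.remote_app_dir))
    local("hg push %s --remotecmd %s" % (env.remote_push_dest, REMOTE_HG_PATH))

After that, a normal $ fab prod deploy will tag, push, and update the working copy.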

Find out more about Fabric from the official site.

I have been moving more of my websites hosting over to my Webfaction account. It has been a good experience overall and the service there is much more powerful than what I get with my GoDaddy hosting account.

Yesterday I found a nice little script in the support forums that lets me point one of my domain names (or a subdomain) at my local computer by updating its DNS settings. This way I can remotely connect to my home computer over SSH/HTTP/FTP etc. using a URL that I will remember.

This is going to be useful as I am writing some Python Django web applications.

Here’s the script:

import urllib2
import xmlrpclib
import os
 
currentip = urllib2.urlopen('http://www.whatismyip.org').read()
 
if not os.path.isfile('lastip'):
    f = open('lastip', 'w')
    f.close()
 
with open('lastip', 'r') as f:
    lastip = f.read()
 
if lastip != currentip:
    # the IP has changed - update the DNS override through the Webfaction XML-RPC API
    server = xmlrpclib.ServerProxy('https://api.webfaction.com/')
    session_id, account = server.login('USERNAME', 'PASSWORD')
    server.delete_dns_override(session_id, 'YOURDOMAIN.com')
    server.create_dns_override(session_id, 'YOURDOMAIN.com', currentip, '', '', '', '')
 
    with open('lastip', 'w') as f:
        f.write(currentip)
 
    print('IP updated to %s' % currentip)
else:
    print('IP not updated')

Once this is run, I can update my router settings to forward the appropriate services to my computer and give the DNS servers enough time to propagate the new entry.

I can now connect to my Home computer using my own domain name. Pretty cool.

I have been hard at work testing out different approaches to Adwords. One of the keys is that I’m scripting a lot of the management of campaigns, ad groups, keywords, and ads. The Adwords API could be used, but the cost of using it would be significant for campaigns of my size, so I have been using the Adwords Editor to help manage everything. What makes it excellent is that the tool can import and export to/from CSV files, which makes it pretty simple to play with the data.

To get a file that this script will work with, go to the File menu in Google Adwords Editor and select “Export to CSV”, then choose “Export Selected Campaigns”. It will write out a CSV file.

This Python script will read those output csv files into a Python data structure which you can then manipulate and write back out to a file.

With the file modified you can then use the Adwords Editor’s “Import CSV” facility to get your changes back into the Editor and then uploaded to Adwords.

Having the ability to pull this data into Python, modify it, and then get it back into Adwords means that I can do a lot of really neat things:

  • create massive campaigns with a large number of targeted ads
  • invent bidding strategies that act individually at the keyword level
  • automate some of the management
  • pull in statistics from CPA networks to calculate ROIs
  • convert text ads into image ads

Here’s the script:

#!/usr/bin/env python
# coding=utf-8
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
"""
read and write exported campaigns from Adwords Editor
 
"""
 
import codecs
import csv
 
FIELDS = ['Campaign', 'Campaign Daily Budget', 'Languages', 'Geo Targeting', 'Ad Group', 'Max CPC', 'Max Content CPC', 'Placement Max CPC', 'Max CPM', 'Max CPA', 'Keyword', 'Keyword Type', 'First Page CPC', 'Quality Score', 'Headline', 'Description Line 1', 'Description Line 2', 'Display URL', 'Destination URL', 'Campaign Status', 'AdGroup Status', 'Creative Status', 'Keyword Status', 'Suggested Changes', 'Comment', 'Impressions', 'Clicks', 'CTR', 'Avg CPC', 'Avg CPM', 'Cost', 'Avg Position', 'Conversions (1-per-click)', 'Conversion Rate (1-per-click)', 'Cost/Conversion (1-per-click)', 'Conversions (many-per-click)', 'Conversion Rate (many-per-click)', 'Cost/Conversion (many-per-click)']
 
def readAdwordsExport(filename):
 
    campaigns = {}
 
    f = codecs.open(filename, 'r', 'utf-16')
    reader = csv.DictReader(f, delimiter='\t')
 
    for row in reader:
        #remove empty values from dict
        row = dict((i, j) for i, j in row.items() if j!='' and j != None)
        if row.has_key('Campaign Daily Budget'):  # campaign level settings
            campaigns[row['Campaign']] = {}
            for k,v in row.items():
                campaigns[row['Campaign']][k] = v
        if row.has_key('Max Content CPC'):  # AdGroup level settings
            if not campaigns[row['Campaign']].has_key('Ad Groups'):
                campaigns[row['Campaign']]['Ad Groups'] = {}
            campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']] = row
        if row.has_key('Keyword'):  # keyword level settings
            if not campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']].has_key('keywords'):
                campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['keywords'] = []
            campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['keywords'].append(row)
        if row.has_key('Headline'):  # ad level settings
            if not campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']].has_key('ads'):
                campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['ads'] = []
            campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['ads'].append(row)
    return campaigns
 
def writeAdwordsExport(data, filename):
    f = codecs.open(filename, 'w', 'utf-16')
    writer = csv.DictWriter(f, FIELDS, delimiter='\t')
    writer.writerow(dict(zip(FIELDS, FIELDS)))
    for campaign, d in data.items():
        writer.writerow(dict((i,j) for i, j in d.items() if i != 'Ad Groups'))
        for adgroup, ag in d['Ad Groups'].items():
            writer.writerow(dict((i,j) for i, j in ag.items() if i != 'keywords' and i != 'ads'))
            for keyword in ag['keywords']:
                writer.writerow(keyword)            
            for ad in ag['ads']:
                writer.writerow(ad)
    f.close()
 
if __name__=='__main__':
    data = readAdwordsExport('export.csv')
    print 'Campaigns:'
    print data.keys()
    writeAdwordsExport(data, 'output.csv')

This code is available in my public repository: http://bitbucket.org/halotis/halotis-collection/
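As a quick example of the kind of manipulation I’m talking about, here’s a hypothetical snippet that uses the two functions above to bump every keyword-level Max CPC by 10% and write the result back out for re-import (it assumes the keyword rows in your export carry a plain decimal Max CPC value):

data = readAdwordsExport('export.csv')
for campaign in data.values():
    for adgroup in campaign.get('Ad Groups', {}).values():
        for keyword in adgroup.get('keywords', []):
            if 'Max CPC' in keyword:
                # raise the keyword bid by 10%, keeping two decimal places
                keyword['Max CPC'] = '%.2f' % (float(keyword['Max CPC']) * 1.1)
writeAdwordsExport(data, 'output.csv')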

This is more of a helpful snippet than a useful program, but it can sometimes be useful to have some user agent strings handy for web scraping.

Some websites check the user agent string and will filter the results of a request. It’s a very simple way to prevent automated scraping, but it is also very easy to get around. The user agent can also be checked by spam filters to help detect automated posting.

A great resource for finding and understanding what user agent strings mean is UserAgentString.com.

This simple snippet reads a file containing the list of user agent strings you want to use and returns a random one from the list.

Here’s my source file UserAgents.txt:

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090913 Firefox/3.5.3
Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.1) Gecko/20090718 Firefox/3.5.1
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.1 (KHTML, like Gecko) Chrome/4.0.219.6 Safari/532.1
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.5.30729; .NET CLR 3.0.30729)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Win64; x64; Trident/4.0)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 2.0.50727; InfoPath.2)
Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)

And here is the python code that makes getting a random agent very simple:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import random
 
SOURCE_FILE='UserAgents.txt'
 
def get():
    """Return one random user agent string from the source file."""
    with open(SOURCE_FILE) as f:
        agents = f.readlines()
    return random.choice(agents).strip()
 
def getAll():
    """Return the full list of user agent strings."""
    with open(SOURCE_FILE) as f:
        agents = f.readlines()
    return [a.strip() for a in agents]
 
if __name__=='__main__':
    agents = getAll()
    for agent in agents:
        print agent

You can grab the source code for this along with my other scripts from the bitbucket repository.
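To actually use it when scraping, just drop the result into the User-Agent header of your request – something like this, assuming you’ve saved the snippet above as useragents.py:

import urllib2
import useragents  # the snippet above

request = urllib2.Request('http://www.example.com/',
                          headers={'User-Agent': useragents.get()})
page = urllib2.urlopen(request).read()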

Digg is by far the most popular social news site on the internet. With its simple “thumbs up” system the users of the site promote the most interesting and highest-quality stories, and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.

It’s a great site and one that I visit every day.

I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.

The short python script I came up with will return the search results from Digg in a standard python data structure so it’s simple to use. It parses out the title, destination, comment count, digg link, digg count, and summary for the top 100 search results.

You can perform advanced searches on digg by using a number of different flags:

  • +b Add to see buried stories
  • +p Add to see only promoted stories
  • +np Add to see only unpromoted stories
  • +u Add to see only upcoming stories
  • Put terms in “quotes” for an exact search
  • -d Remove the domain from the search
  • Add -term to exclude a term from your query (e.g. apple -iphone)
  • Begin your query with site: to only display stories from that URL.

This script also allows the search results to be sorted:

from DiggSearch import digg_search
digg_search('twitter', sort='newest')  #sort by newest first
digg_search('twitter', sort='digg')  # sort by number of diggs
digg_search('twitter -d')  # sort by best match

Here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
 
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
 
def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
 
    digg_results = []
    for page in range(1, pages + 1):
 
        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)
 
        #GET the page
        request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )
        urlfile = urllib2.urlopen(request)
        html = urlfile.read(200000)
        urlfile.close()
 
        #scrape it
        soup = BeautifulSoup(html)
        links = soup.findAll('h3', id=re.compile("title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile("diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
        for i in range(0,len(links)):
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(), 
                    'destination':links[i].find('a')['href'],
                    'comment_count':int(comments[i].string.split()[0]),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':body[i].find(text=True)
                    }
            digg_results.append(item)
 
        #last page early exit
        if len(links) < 10:
            break
 
    return digg_results
 
if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r

You can grab the source code from the bitbucket repository.

Have you ever wanted to track and assess your SEO efforts by seeing how they change your position in Google’s organic SERP? With this script you can now track and chart your position for any number of search queries and find the position of the site/page you are trying to rank.

This will allow you to visually identify any target keyword phrases that are doing well, and which ones may need some more SEO work.

This project is made up of a few different Python scripts:

  • SEOCheckConfig.py adds new target search queries to the database.
  • SEOCheck.py searches Google and saves the best position found (within the top 100 results).
  • SEOCheckCharting.py graphs all the results.

The charts produced look like this:

[Chart: position over time for each tracked search query]

The main part of the script is SEOCheck.py. This script should be scheduled to run regularly (I have mine running 3 times per day on my webfaction hosting account).
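The heart of SEOCheck.py is just fetching the first 100 organic results for a query and looking for the domain you care about. Here’s a stripped-down sketch of that check (the real script also records the result in the database with sqlalchemy, and the h3 class="r" selector reflects Google’s result markup at the time, so it will need updating whenever they change it):

import urllib, urllib2
from BeautifulSoup import BeautifulSoup

def google_position(query, domain):
    """Return the 1-based position of domain in the top 100 results, or None."""
    url = 'http://www.google.com/search?num=100&q=' + urllib.quote_plus(query)
    request = urllib2.Request(url, None, {'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urllib2.urlopen(request).read())
    links = []
    for h3 in soup.findAll('h3', {'class': 'r'}):
        a = h3.find('a')
        if a and a.get('href'):
            links.append(a['href'])
    for position, href in enumerate(links):
        if domain in href:
            return position + 1
    return None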

For a small SEO consultancy business this type of application generates the feedback and reports that you should be using to communicate with your clients. It identifies where the efforts should go and how successful you have been.

To use this set of scripts you first need to edit and run the SEOCheckConfig.py file. Add the queries and domains that you’d like to check to the SETTINGS variable, then run the script to load them into the database.
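The entries in SETTINGS pair a search query with the domain you want to rank for – roughly something like this (illustrative only; the exact field names are in SEOCheckConfig.py itself):

SETTINGS = [
    {'query': 'internet marketing scripts', 'domain': 'halotis.com'},
    {'query': 'python twitter script', 'domain': 'halotis.com'},
]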

Then schedule SEOCheck.py to run periodically. On Windows you can do that using Scheduled Tasks:
[Screenshot: Windows Scheduled Task dialog]

On either Mac OS X or Linux you can use crontab to schedule it.

To generate the Chart simply run the SEOCheckCharting.py script. It will plot all the results on one graph.

You can find and download all the source code for this in the HalOtis-Collection on bitbucket. It requires BeautifulSoup, matplotlib, and sqlalchemy libraries to be installed.