Category Archives: Internet

Just 2 days ago I had an idea for a website. Today I’ve launched the site. I had a lot of fun coding the web application – that is why I love using Django.

I have to say that, considering I’m getting a bit rusty with Django and have never used any of the Facebook APIs, I’m pretty impressed that I was able to get a reasonably decent-looking and functional website up in just 2 days.

Want to take a look at the web application? http://like.halotis.com is where I’ve hosted it for the time being.

The concept is pretty simple. People can post quotes, phrases, links, or jokes and then like them on Facebook. The integration with Facebook will allow the application to promote itself virally as popular “Likeables” spread through the network.

It was a pretty simple concept with a trivial data model, so most of the development time was spent fiddling with the CSS and layouts. But I certainly have plenty of ideas for ways to increase the engagement of the site going forward.
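
For a sense of how trivial the data model is, a “Likeable” for a site like this could be sketched in Django roughly as follows (the field names here are hypothetical, not necessarily what the live site uses):

from django.db import models

class Likeable(models.Model):
    """A single post that visitors can 'Like' on Facebook (illustrative sketch)."""
    text = models.TextField()                          # the quote, phrase, or joke itself
    url = models.URLField(blank=True)                  # optional link being shared
    created = models.DateTimeField(auto_now_add=True)  # when it was posted

    def __unicode__(self):
        return self.text[:50]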

This may be the first of many Django Facebook applications from me.

I have been doing a lot of web development work lately. Mostly learning about how different people create their workflows and manage local development, testing, staging, and production deployment of code.

In the past I have used Apache Ant for deploying Java applications, and it is a bit cumbersome. Ant uses XML config files, which become limiting once you try to do something non-standard and can sometimes require writing special Java code to create new directives. The resulting XML is not always easy to read.

For the last few days I have been using Fabric to write a few simple deploy scripts and I think this is a much nicer way of doing it. You get the full power of Python but a very simple syntax and easy command line usage.

Here’s a very simple deploy script that I am using to deploy some static files to my web server.

from fabric.api import *
 
#Fabric 0.9.0 compatible
# usage: $ fab prod deploy
 
REMOTE_HG_PATH = '/home/halotis/bin/hg'
 
def prod():
    """Set the target to production."""
    env.user = 'USERNAME'
    env.hosts = ['USERNAME.webfactional.com']
    env.remote_app_dir = 'webapps/APPLICATION'
    env.remote_push_dest = 'ssh://USERNAME@USERNAME.webfactional.com/%s' % env.remote_app_dir
    env.tag = 'production'
 
 
def deploy():
    """Deploy the site.
 
    This will tag the repository, and push changes to the remote location.
    """
    require('hosts', provided_by=[prod, ])
    require('remote_app_dir', provided_by=[prod, ])
    require('remote_push_dest', provided_by=[prod, ])
    require('tag', provided_by=[prod, ])
 
    local("hg tag --force %s" % env.tag)
    local("hg push %s --remotecmd %s" % (env.remote_push_dest, REMOTE_HG_PATH))
    run("cd %s; hg update -C %s" % (env.remote_app_dir, env.tag))

For this to work, though, you need to have a few things set up:

  • SSH access to the remote server
  • Mercurial (hg) installed on both the remote server and the local development machine
  • A bootstrapped remote repository – FTP the .hg folder to the destination location
  • Fabric installed on the local development machine – $ pip install fabric
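
Because the targets are just Python functions, pointing the same deploy task at a second environment (say, a staging server) is a one-function change. Here’s a hypothetical example – the hostname and paths are placeholders:

def staging():
    """Set the target to a (hypothetical) staging server."""
    env.user = 'USERNAME'
    env.hosts = ['staging.example.com']
    env.remote_app_dir = 'webapps/APPLICATION_STAGING'
    env.remote_push_dest = 'ssh://USERNAME@staging.example.com/%s' % env.remote_app_dir
    env.tag = 'staging'

Then $ fab staging deploy pushes and updates the staging checkout instead of production (adding staging to the provided_by lists gives a clearer error message if you forget to specify a target).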

Find out more about Fabric from the official site.

I have been moving more of my websites’ hosting over to my Webfaction account. It has been a good experience overall, and the service there is much more powerful than what I get with my GoDaddy hosting account.

Yesterday I found a nice little script in the support forums that allows me to use one of my domain names (or a subdomain) and redirect the DNS settings to my local computer. This way I can remotely connect to my home computer with SSH/HTTP/FTP etc. using a URL that I will remember.

This is going to be useful as I am writing some Python Django web applications.

Here’s the script:

import urllib2
import xmlrpclib
import os
 
# look up the current external IP address
currentip = urllib2.urlopen('http://www.whatismyip.org').read()

# make sure the cache file exists on the first run
if not os.path.isfile('lastip'):
    f = open('lastip', 'w')
    f.close()

with open('lastip', 'r') as f:
    lastip = f.read()

# only touch the DNS override when the IP has actually changed
if lastip != currentip:
    server = xmlrpclib.ServerProxy('https://api.webfaction.com/')
    session_id, account = server.login('USERNAME', 'PASSWORD')
    # replace the existing override with one pointing at the new IP
    server.delete_dns_override(session_id, 'YOURDOMAIN.com')
    server.create_dns_override(session_id, 'YOURDOMAIN.com', currentip, '', '', '', '')
 
    with open('lastip', 'w') as f:
        f.write(currentip)
 
    print('IP updated to %s' % currentip)
else:
    print('IP not updated')

Once this is run, I can update my router settings to forward the appropriate services to my computer and give the DNS servers enough time to propagate the new entry.

I can now connect to my home computer using my own domain name. Pretty cool.

I have been hard at work testing out different approaches to Adwords. One of the keys is that I’m scripting a lot of the management of campaigns, ad groups, keywords, and ads. The Adwords API could be used, but it costs money – a significant expense for campaigns of my size. So I have been using the Adwords Editor to help manage everything. What makes it excellent is that the tool can import and export to/from CSV files, which makes it pretty simple to play with the data.

To get a file that this script will work with, just go to the File menu in Google Adwords Editor, select “Export to CSV”, and then select “Export Selected Campaigns”. It will write out a CSV file.

This Python script will read those output csv files into a Python data structure which you can then manipulate and write back out to a file.

With the file modified you can then use the Adwords Editor’s “Import CSV” facility to get your changes back into the Editor and then uploaded to Adwords.

Having the ability to pull this data into Python, modify it, and then get it back into Adwords means that I can do a lot of really neat things:

  • Create massive campaigns with a large number of targeted ads
  • Invent bidding strategies that act individually at the keyword level
  • Automate some of the management
  • Pull in statistics from CPA networks to calculate ROIs
  • Convert text ads into image ads

Here’s the script:

#!/usr/bin/env python
# coding=utf-8
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
"""
read and write exported campaigns from Adwords Editor
 
"""
 
import codecs
import csv
 
FIELDS = ['Campaign', 'Campaign Daily Budget', 'Languages', 'Geo Targeting', 'Ad Group', 'Max CPC', 'Max Content CPC', 'Placement Max CPC', 'Max CPM', 'Max CPA', 'Keyword', 'Keyword Type', 'First Page CPC', 'Quality Score', 'Headline', 'Description Line 1', 'Description Line 2', 'Display URL', 'Destination URL', 'Campaign Status', 'AdGroup Status', 'Creative Status', 'Keyword Status', 'Suggested Changes', 'Comment', 'Impressions', 'Clicks', 'CTR', 'Avg CPC', 'Avg CPM', 'Cost', 'Avg Position', 'Conversions (1-per-click)', 'Conversion Rate (1-per-click)', 'Cost/Conversion (1-per-click)', 'Conversions (many-per-click)', 'Conversion Rate (many-per-click)', 'Cost/Conversion (many-per-click)']
 
def readAdwordsExport(filename):
 
    campaigns = {}
 
    f = codecs.open(filename, 'r', 'utf-16')
    reader = csv.DictReader(f, delimiter='\t')
 
    for row in reader:
        # remove empty values from the dict
        row = dict((i, j) for i, j in row.items() if j != '' and j is not None)
        if 'Campaign Daily Budget' in row:  # campaign level settings
            campaigns[row['Campaign']] = {}
            for k, v in row.items():
                campaigns[row['Campaign']][k] = v
        if 'Max Content CPC' in row:  # AdGroup level settings
            if 'Ad Groups' not in campaigns[row['Campaign']]:
                campaigns[row['Campaign']]['Ad Groups'] = {}
            campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']] = row
        if 'Keyword' in row:  # keyword level settings
            if 'keywords' not in campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]:
                campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['keywords'] = []
            campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['keywords'].append(row)
        if 'Headline' in row:  # ad level settings
            if 'ads' not in campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]:
                campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['ads'] = []
            campaigns[row['Campaign']]['Ad Groups'][row['Ad Group']]['ads'].append(row)
    return campaigns
 
def writeAdwordsExport(data, filename):
    f = codecs.open(filename, 'w', 'utf-16')
    writer = csv.DictWriter(f, FIELDS, delimiter='\t')
    writer.writerow(dict(zip(FIELDS, FIELDS)))
    for campaign, d in data.items():
        writer.writerow(dict((i,j) for i, j in d.items() if i != 'Ad Groups'))
        for adgroup, ag in d['Ad Groups'].items():
            writer.writerow(dict((i,j) for i, j in ag.items() if i != 'keywords' and i != 'ads'))
            for keyword in ag['keywords']:
                writer.writerow(keyword)            
            for ad in ag['ads']:
                writer.writerow(ad)
    f.close()
 
if __name__=='__main__':
    data = readAdwordsExport('export.csv')
    print 'Campaigns:'
    print data.keys()
    writeAdwordsExport(data, 'output.csv')
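
As a usage sketch, here’s the kind of tweak this makes possible – bumping the Max CPC on every keyword by 10% and writing the result back out (this assumes the export actually carries a 'Max CPC' value on the keyword rows, so treat it as illustrative):

data = readAdwordsExport('export.csv')
for campaign in data.values():
    for adgroup in campaign.get('Ad Groups', {}).values():
        for keyword in adgroup.get('keywords', []):
            if 'Max CPC' in keyword:
                # hypothetical bid rule: raise every keyword bid by 10%
                keyword['Max CPC'] = '%.2f' % (float(keyword['Max CPC']) * 1.1)
writeAdwordsExport(data, 'output.csv')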

This code is available in my public repository: http://bitbucket.org/halotis/halotis-collection/

Last week I started testing some new concepts on Adwords. A week has passed and I wanted to recap what has happened and some things that I have noticed and learned so far.

First off, the strategy that I’m testing is to use the content network exclusively. As a result some of the standard SEM practices don’t really apply. Click through rates are dramatically lower than on search and it takes some time to get used to 0.1% CTRs. It takes a lot of impressions to get traffic at those levels.

Luckily the inventory of ad space from people using Adsense is monstrous, and as a result there are plenty of opportunities for placements. So for my limited week of testing I have had about 150,000 impressions on my ads, resulting in 80 clicks.

The other thing to note is that there are comparatively few advertisers running ads on the content network, so the competition is almost non-existent. That makes the price per click very low. The total ad spend for the first week of testing was less than $10.

I have run into a number of problems in my testing that I never expected.

  • It’s not possible to use the Adwords API to build flash ads with the Display Ad Builder :(
  • There seems to be a bug with the Adwords Editor when trying to upload a lot of image ads.
  • It takes a long time for image ads to be approved and start running (none of my image ads have been approved yet)
  • Paying to use the Adwords API works out to be very expensive for the scale I want to use it at.
  • Optimizing the price is time-consuming since it can take days to see enough results.

With all those problems I’m still optimistic that I can find a way to scale things up more radically.  So far in the past week I have written a number of scripts that have helped me build out the campaigns, ad groups and ads.  It has gotten to the point where I can now upload over 1000 text ads to new campaigns, ad groups and keywords in one evening.

Since so far the testing has been inconclusive I’m going to hold off sharing the scripts I have for this strategy.  If it works out you can count on me recording some videos of the technique and the scripts to help.

Last night I had one of those sleepless nights. I’m sure you have had one of these before – after hearing a great idea your mind starts spinning with the possibilities and there’s no way you’ll be able to sleep. I got excited last night about a new approach to Google Adwords that just might have a lot of potential.

Google Adwords has never really proven to be a profitable way to drive traffic for me (though Microsoft Adcenter has). However several times a year for the past 4 years I have heard a little tip or seen someone use it successfully and have become intrigued enough to dive back in and test the waters again. Each time my testing has been plagued with failure and I have lost thousands of dollars trying to find a system that works.

Yesterday I got a tip, something new that I haven’t yet tried but that sounded promising. And so over the next few weeks I’m going to be running some tests. The problem with the approach I’m testing is that it requires creating a MASSIVE number of keyword targeted ads – a total of over 100,000 ads per niche.

It took me 2.5 hours last night to manually create 400 of the 100,000 ads I need (for the one niche I’m going to test first). There’s no feasible way to create all those ads manually, and I’m not interested in spending yet more money on ebooks or software that claims to make money or magically do the work for me. So I am going to program some scripts myself to test the techniques. Whether it works or not, I will let you know and share the code right here on this blog.

The testing started last night. Check back next week for the preliminary results (and maybe a hint about what I’m doing).

This is more of a helpful snippet than a useful program, but it can sometimes be handy to have some user agent strings around for web scraping.

Some websites check the user agent string and will filter the results of a request. It’s a very simple way to prevent automated scraping. But it is very easy to get around. The user agent can also be checked by spam filters to help detect automated posting.

A great resource for finding and understanding what user agent strings mean is UserAgentString.com.

This simple snippet uses a file containing the list of user agent strings that you want to use. It simply reads that file and returns a random one from the list.

Here’s my source file UserAgents.txt:

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090913 Firefox/3.5.3
Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.1) Gecko/20090718 Firefox/3.5.1
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.1 (KHTML, like Gecko) Chrome/4.0.219.6 Safari/532.1
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.5.30729; .NET CLR 3.0.30729)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Win64; x64; Trident/4.0)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 2.0.50727; InfoPath.2)
Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)

And here is the python code that makes getting a random agent very simple:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import random
 
SOURCE_FILE='UserAgents.txt'
 
def get():
    """Return a single random user agent string from the source file."""
    with open(SOURCE_FILE) as f:
        agents = f.readlines()
    return random.choice(agents).strip()
 
def getAll():
    """Return the full list of user agent strings from the source file."""
    with open(SOURCE_FILE) as f:
        agents = f.readlines()
    return [a.strip() for a in agents]
 
if __name__=='__main__':
    agents = getAll()
    for agent in agents:
        print agent
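
As a quick usage sketch, you can pair it with urllib2 when scraping (this assumes the snippet above was saved as UserAgents.py):

import urllib2
import UserAgents

# send the request with a randomly chosen User-Agent header
request = urllib2.Request('http://www.example.com/',
                          headers={'User-Agent': UserAgents.get()})
html = urllib2.urlopen(request).read()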

You can grab the source code for this along with my other scripts from the bitbucket repository.

Digg is by far the most popular social news site on the internet. With its simple “thumbs up” system, the users of the site promote the most interesting and high quality stories, and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.

It’s a great site and one that I visit every day.

I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.

The short python script I came up with will return the search results from Digg in a standard python data structure so it’s simple to use. It parses out the title, destination, comment count, digg link, digg count, and summary for the top 100 search results.

You can perform advanced searches on digg by using a number of different flags:

  • Add +b to see buried stories
  • Add +p to see only promoted stories
  • Add +np to see only unpromoted stories
  • Add +u to see only upcoming stories
  • Put terms in “quotes” for an exact search
  • Add -d to remove the domain from the search
  • Add -term to exclude a term from your query (e.g. apple -iphone)
  • Begin your query with site: to only display stories from that URL

This script also allows the search results to be sorted:

from DiggSearch import digg_search
digg_search('twitter', sort='newest')  #sort by newest first
digg_search('twitter', sort='digg')  # sort by number of diggs
digg_search('twitter -d')  # sort by best match

Here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
 
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
 
def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
 
    digg_results = []
    for page in range (1,pages):
 
        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)
 
        #GET the page
        request = urllib2.Request(address, None, {'User-Agent': USER_AGENT})
        urlfile = urllib2.urlopen(request)
        html = urlfile.read(200000)
        urlfile.close()
 
        #scrape it
        soup = BeautifulSoup(html)
        links = soup.findAll('h3', id=re.compile(r"title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile(r"diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
        for i in range(0,len(links)):
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(), 
                    'destination':links[i].find('a')['href'],
                    'comment_count':int(comments[i].string.split()[0]),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':body[i].find(text=True)
                    }
            digg_results.append(item)
 
        #last page early exit
        if len(links) < 10:
            break
 
    return digg_results
 
if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r
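
As a usage sketch, filtering the results once they’re in Python is trivial – for example, keeping only stories above some digg threshold (100 here is an arbitrary number, and this assumes digg_count comes back as a bare number):

results = digg_search('twitter', sort='digg', pages=3)
# keep only the stories that have crossed the (hypothetical) popularity threshold
popular = [r for r in results if int(r['digg_count']) > 100]
for r in popular:
    print r['title'], r['digg_link']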

You can grab the source code from the bitbucket repository.

Have you ever wanted to track and assess your SEO efforts by seeing how they change your position in Google’s organic SERP? With this script you can now track and chart your position for any number of search queries and find the position of the site/page you are trying to rank.

This will allow you to visually identify any target keyword phrases that are doing well, and which ones may need some more SEO work.

This python script has a number of different components.

  • SEOCheckConfig.py is used to add new target search queries to the database.
  • SEOCheck.py searches Google and saves the best position (in the top 100 results).
  • SEOCheckCharting.py graphs all the results.

The charts produced look like this:

[seocheck – sample output chart]

The main part of the script is SEOCheck.py. This script should be scheduled to run regularly (I have mine running 3 times per day on my webfaction hosting account).

For a small SEO consultancy business this type of application generates the feedback and reports that you should be using to communicate with your clients. It identifies where the efforts should go and how successful you have been.

To use this set of scripts you will first need to edit and run the SEOCheckConfig.py file. Add your own queries and domains that you’d like to check to the SETTINGS variable, then run the script to load those into the database.
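
For illustration, the SETTINGS entries pair a search query with the domain you want to rank for it – something roughly like this (the actual field names in the repository version may differ):

SETTINGS = [
    {'query': 'internet marketing scripts', 'domain': 'halotis.com'},
    {'query': 'seo position tracking',      'domain': 'halotis.com'},
]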

Then schedule SEOCheck.py to run periodically. On Windows you can do that using Scheduled Tasks.

On either Mac OSX or Linux you can use crontab to schedule it.
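
For example, a crontab entry to run it three times a day might look like this (the path is just a placeholder for wherever you keep SEOCheck.py):

# run the position check every 8 hours
0 */8 * * * python /home/USERNAME/seocheck/SEOCheck.py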

To generate the Chart simply run the SEOCheckCharting.py script. It will plot all the results on one graph.

You can find and download all the source code for this in the HalOtis-Collection on bitbucket. It requires BeautifulSoup, matplotlib, and sqlalchemy libraries to be installed.

There are many Twitter API libraries available for Python. I wanted to find out which one was the best and what the strengths and weaknesses of each are. However there are too many out there to find and review all of them. Instead here’s a bunch of the most popular Python Twitter API wrappers with a small review of each one, and some sample code showing off the syntax.

Python Twitter

This is the library that I personally use in most of my Twitter scripts. It’s very simple to use, but is not up to date with the latest developments at Twitter. It was written by DeWitt Clinton and available on Google Code. If you just want the basic API functionality this does a pretty decent job.

import twitter
api = twitter.Api('username', 'password')
statuses = api.GetPublicTimeline()
print [s.user.name for s in statuses]
users = api.GetFriends()
print [u.name for u in users]
statuses = api.GetUserTimeline('username')
print [s.text for s in statuses]
api.PostUpdate('I love python-twitter!')

twyt

Twyt is a pretty comprehensive library that seems to be solid and well organized. In some cases there is added complexity in parsing the JSON objects returned from the Twitter API. It is written and maintained by Andrew Price.

from twyt import twitter, data
t = twitter.Twitter()
t.set_auth("username", "password")
print t.status_friends_timeline()
print t.user_friends()
return_val = t.status_update("Testing 123")
s = data.Status()
s.load_json(return_val)
print s
t.status_destroy(s.id)

Twython

An up-to-date Python wrapper for the Twitter API. It supports Twitter’s main API, Twitter’s search API, and (soon) OAuth with the Twitter Streaming API. It is based on the Python Twitter library and is actively maintained by Ryan McGrath.

import twython
twitter = twython.setup(authtype="Basic", username="example", password="example")
twitter.updateStatus("See how easy this was?")
friends_timeline = twitter.getFriendsTimeline(count="150", page="3")
print [tweet["text"] for tweet in friends_timeline]

Tweepy

Tweepy is a pretty compelling Python Twitter API library. It’s up to date with the latest features of Twitter and actively being developed by Joshua Roesslein. It features OAuth support, Python 3 support, streaming API support, and its own cache system. Retweet streaming was recently added. If you want to use the most up-to-date features of the Twitter API in Python, or use Python 3, then you should definitely check out Tweepy.

import tweepy
api = tweepy.API.new('basic', 'username', 'password')
public_timeline = api.public_timeline()
print [tweet.text for tweet in public_timeline]
friends_timeline = api.friends_timeline()
print [tweet.text for tweet in friends_timeline]
u = tweepy.api.get_user('username')
friends = u.friends()
print [f.screen_name for f in friends]
api.update_status('tweeting with tweepy')

It’s pretty clear from the syntax samples that there’s not much difference between any of these Python Twitter API libraries when just getting the basics done. The differences only start to show up when you get into the latest features: OAuth, streaming, and the retweet functions really differentiate these libraries. I hope this overview helps you find and choose the library that’s right for your project.