Category Archives: Software

There are a number of services out there, such as Google Cash Detective, that will run searches on Google and save the advertisements so you can track who is advertising on what keywords over time. It's actually a very effective technique for finding out which ads are profitable.

After tracking a keyword for several weeks it's possible to see which ads have been running consistently over time. The nature of Pay Per Click is that only profitable advertisements will continue to run long term. So if you can identify which ads are profitable for which keywords, it should be possible to duplicate them and get some of that profitable traffic for yourself.

The following Python script probably breaks the Google terms of service, so consider it a guide for how this kind of HTML parsing could be done. It spoofs the User-agent header to appear as though it is a real browser, then runs a search for each keyword stored in an SQLite database and stores the ads displayed for that keyword back in the database.

The script makes use of the awesome Beautiful Soup library, which makes parsing HTML content really easy. But, as with any web scraping, the script is very fragile: it makes several assumptions about the structure of the Google results page, and if Google changes their site the script will break.

#!/usr/bin/env python
 
import sys
import urllib2
import re
import sqlite3
import datetime
 
from BeautifulSoup import BeautifulSoup  # available at: http://www.crummy.com/software/BeautifulSoup/
 
conn = sqlite3.connect("espionage.sqlite")
conn.row_factory = sqlite3.Row
 
def get_google_search_results(keywordPhrase):
	"""make the GET request to Google.com for the keyword phrase and return the HTML text
	"""
	url='http://www.google.com/search?hl=en&q=' + '+'.join(keywordPhrase.split())
	req = urllib2.Request(url)
	req.add_header('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13')
	page = urllib2.urlopen(req)
	HTML = page.read()
	return HTML
 
def scrape_ads(text, phraseID):
	"""Scrape the text as HTML, find and parse out all the ads and store them in a database
	"""
	soup = BeautifulSoup(text)
	#get the ads on the right hand side of the page
	ads = soup.find(id='rhsline').findAll('li')
	position = 0
	for ad in ads:
		position += 1
 
		#display url
		parts = ad.find('cite').findAll(text=True)
		site = ''.join([word.strip() for word in parts]).strip()
		ad.find('cite').replaceWith("")
 
		#the header line
		parts = ad.find('a').findAll(text=True)
		title = ' '.join([word.strip() for word in parts]).strip()
 
		#the destination URL
		href = ad.find('a')['href']
		start = href.find('&q=')
		if start != -1 :
			dest = href[start+3:]
		else :
			dest = None
			print 'error', href
 
		ad.find('a').replaceWith("")
 
		#body of ad
		brs = ad.findAll('br')
		for br in brs:
			br.replaceWith("%BR%")
		parts = ad.findAll(text=True)
		body = ' '.join([word.strip() for word in parts]).strip()
		line1 = body.split('%BR%')[0].strip()
		line2 = body.split('%BR%')[1].strip()
 
		#see if the ad is in the database
		c = conn.cursor()
		c.execute('SELECT adID FROM AdTable WHERE destination=? and title=? and line1=? and line2=? and site=? and phraseID=?', (dest, title, line1, line2, site, phraseID))
		result = c.fetchall() 
		if len(result) == 0:
			#NEW AD - insert into the table
			c.execute('INSERT INTO AdTable (`destination`, `title`, `line1`, `line2`, `site`, `phraseID`) VALUES (?,?,?,?,?,?)', (dest, title, line1, line2, site, phraseID))
			conn.commit()
			c.execute('SELECT adID FROM AdTable WHERE destination=? and title=? and line1=? and line2=? and site=? and phraseID=?', (dest, title, line1, line2, site, phraseID))
			result = c.fetchall()
		elif len(result) > 1:
			continue
 
		adID = result[0]['adID']
 
		c.execute('INSERT INTO ShowTime (`adID`,`date`,`time`, `position`) VALUES (?,?,?,?)', (adID, datetime.datetime.now(), datetime.datetime.now(), position))
 
 
def do_all_keywords():
	c = conn.cursor()
	c.execute('SELECT * FROM KeywordList')
	result = c.fetchall()
	for row in result:
		html = get_google_search_results(row['keywordPhrase'])
		scrape_ads(html, row['phraseID'])
	#commit the ShowTime records inserted by scrape_ads
	conn.commit()
 
if __name__ == '__main__' :
	do_all_keywords()
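Once a few weeks of data have accumulated, the interesting part is querying it. This is only a sketch, assuming the AdTable and ShowTime tables used by the script above (the tables themselves need to exist already); it lists the ads that have been seen on the most distinct days, which is a reasonable proxy for the ads that keep earning their keep:

def consistent_ads(min_days=14):
	"""List ads that have appeared on at least min_days distinct days."""
	c = conn.cursor()
	c.execute('''SELECT a.title, a.site, a.destination,
	                    COUNT(DISTINCT substr(s.date, 1, 10)) AS days_seen,
	                    AVG(s.position) AS avg_position
	             FROM AdTable a JOIN ShowTime s ON s.adID = a.adID
	             GROUP BY a.adID
	             HAVING COUNT(DISTINCT substr(s.date, 1, 10)) >= ?
	             ORDER BY days_seen DESC, avg_position ASC''', (min_days,))
	for row in c.fetchall():
		print row['site'], row['title'], row['days_seen'], row['avg_position']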

It is extremely useful to send emails from scripts.  Emails can alert you to errors as soon as they happen or can give you regular status updates about the running of your programs.

I have several scripts that run regularly to update various websites or scrape data from different places, and when dealing with the internet things change often. Code breaks constantly as the things it depends on change, so to keep everything running it's important to be notified when errors happen.

One of the greatest ways to do this is to have your programs send email messages to you.  I use Google’s Gmail SMTP server to relay my messages to me.  That way I don’t have to rely on having sendmail installed on the machine or hooking into something like MS Outlook to compose an email.

This small script uses smtplib to send plain text emails through Gmail's SMTP service.

#!/usr/bin/python
 
import smtplib
from email.MIMEText import MIMEText
 
GMAIL_LOGIN = 'myemail@gmail.com'
GMAIL_PASSWORD = 'password'
 
 
def send_email(subject, message, from_addr=GMAIL_LOGIN, to_addr=GMAIL_LOGIN):
    msg = MIMEText(message)
    msg['Subject'] = subject
    msg['From'] = from_addr
    msg['To'] = to_addr
 
    server = smtplib.SMTP('smtp.gmail.com', 587) #port 587 for STARTTLS (465 is for SSL)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(GMAIL_LOGIN,GMAIL_PASSWORD)
    server.sendmail(from_addr, to_addr, msg.as_string())
    server.close()
 
 
if __name__=="__main__":
    send_email('test', 'This is a test email')
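The way I typically use this is to wrap the risky part of a script in a try/except and mail myself the traceback when something blows up. Here's a minimal sketch of that pattern; do_the_work() is just a hypothetical stand-in for whatever your script really does:

import traceback

def do_the_work():
    #hypothetical placeholder for the real job your script performs
    raise ValueError("something on the site changed")

try:
    do_the_work()
except Exception:
    #mail the full traceback so the failure can be investigated right away
    send_email('Script error!', traceback.format_exc())
    raise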

Another in my series of Python scripting blog posts. This time I'm sharing a script that can rip through RSS feeds, devour their content, and stuff it into a database in a way that scales up to thousands of feeds. To accomplish this the script is multi-threaded.

The big problem with scaling up a web script like this is that there is a huge amount of latency when requesting something over the internet. Due to the bandwidth as well as remote processing time it can take as long as a couple of seconds to get anything back. Requesting one feed after the other in series will waste a lot of time, and that makes this type of script a prime candidate for some threading.

I borrowed parts of this script from this post: Threaded data collection with Python, including examples

What could you do with all this content? Just off the top of my head I can think of many interesting things to do:

  • Create histograms of the publish times of posts to find the most and least popular days and times for publishing (see the query sketch after the script)
  • Plot trends of certain words or phrases over time
  • Create your own aggregation website
  • Get the trending topics by counting the occurrence of words by day
  • Try writing some natural language processing algorithms

This script is set to 20 threads, but that really needs to be fine-tuned for best performance. Depending on your bandwidth and the sites you want to grab, you may want to tweak the THREAD_LIMIT value.

import sqlite3
import threading
import time
import Queue
from time import strftime
 
import feedparser     # available at http://feedparser.org
 
 
THREAD_LIMIT = 20
jobs = Queue.Queue(0)
rss_to_process = Queue.Queue(THREAD_LIMIT)
 
DATABASE = "rss.sqlite"
 
conn = sqlite3.connect(DATABASE)
conn.row_factory = sqlite3.Row
c = conn.cursor()
 
#set up the tables and seed the feed list (only insert the seed URL once so repeated runs don't create duplicates)
c.execute('CREATE TABLE IF NOT EXISTS RSSFeeds (id INTEGER PRIMARY KEY AUTOINCREMENT, url VARCHAR(1000));')
c.execute('CREATE TABLE IF NOT EXISTS RSSEntries (entry_id INTEGER PRIMARY KEY AUTOINCREMENT, id, url, title, content, date);')
c.execute("SELECT id FROM RSSFeeds WHERE url='http://www.halotis.com/feed/';")
if len(c.fetchall()) == 0:
    c.execute("INSERT INTO RSSFeeds(url) VALUES('http://www.halotis.com/feed/');")
 
feeds = c.execute('SELECT id, url FROM RSSFeeds').fetchall()
 
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stored them in the DB """
    for entry in items:
        c.execute('SELECT entry_id from RSSEntries WHERE url=?', (entry.link,))
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, date) VALUES (?,?,?,?,?)', (id, entry.link, entry.title, entry.summary, strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed)))
 
def thread():
    while True:
        try:
            id, feed_url = jobs.get(False) # False = Don't wait
        except Queue.Empty:
            return
 
        entries = feedparser.parse(feed_url).entries
        rss_to_process.put((id, entries), True) # This will block if full
 
for info in feeds: # Queue them up
    jobs.put([info['id'], info['url']])
 
for n in xrange(THREAD_LIMIT):
    t = threading.Thread(target=thread)
    t.start()
 
while threading.activeCount() > 1 or not rss_to_process.empty():
    # That condition means we want to do this loop if there are threads
    # running OR there's stuff to process
    try:
        id, entries = rss_to_process.get(True, 1) # Wait for up to a second
    except Queue.Empty:
        continue
 
    store_feed_items(id, entries)
 
conn.commit()
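With the entries stored, the first idea from the list above, a histogram of publish times, is just a couple of GROUP BY queries away. A rough sketch against the RSSEntries table created by this script; the date column holds 'YYYY-MM-DD HH:MM:SS' strings, which SQLite's strftime() understands:

def publish_time_histogram():
    """ Print how many stored entries were published per weekday and per hour """
    c.execute("SELECT strftime('%w', date) AS dow, COUNT(*) FROM RSSEntries GROUP BY dow ORDER BY dow")
    print "posts per day of week (0 = Sunday):"
    for dow, count in c.fetchall():
        print dow, count

    c.execute("SELECT strftime('%H', date) AS hour, COUNT(*) FROM RSSEntries GROUP BY hour ORDER BY hour")
    print "posts per hour of day:"
    for hour, count in c.fetchall():
        print hour, count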

Bit.ly offers a very simple API for creating short URLs. The service can also provide you with some basic click statistics. Unfortunately there are a few missing pieces to the API. To get around that you’ll have to keep a list of bit.ly links you want to track. Depending on your situation you may need to keep some of the information updated regularly and stored locally to do a deeper analysis of your links.

There are a couple of advanced tricks you can use to get more out of your tracking.

  1. Add GET arguments to the end of the URL to split test – If you want to track clicks from different sources that land at the same page you need to use different links. The easiest way to create two links to the same page is to append a GET argument. So if you wanted to promote my site http://halotis.com and wanted to compare Twitter to AdWords then you could create bit.ly links to http://halotis.com?from=twitter and http://halotis.com?from=adwords. You can add more information with more arguments such as http://halotis.com/?from=adwords&adgroup=group1. If you control the landing page, then you will see those arguments in Google Analytics and will have even more information about who clicked your links. (See the sketch right after this list.)
  2. Look at stats for any bit.ly link, including referring sites, real-time click time-lines, and locations, by adding a + to the end of it: http://bit.ly/10HYCo+
  3. Find out which other bit.ly users have shortened a link using the API (e.g. the bit.ly info for google.com)
  4. Use the JavaScript library to grab stats and embed them into a webpage (see code below)
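To make the split-testing idea from point 1 concrete, here is a small sketch using the same python-bitly library as the snippet further down; the from= values are just made-up labels for this example:

import bitly       #http://code.google.com/p/python-bitly/

BITLY_LOGIN = "YOUR_BITLY_LOGIN"
BITLY_API_KEY = "YOUR_BITLY_API_KEY"

b = bitly.Api(login=BITLY_LOGIN, apikey=BITLY_API_KEY)

landing_page = 'http://halotis.com'
for source in ('twitter', 'adwords'):
    #each traffic source gets its own bit.ly link to the same landing page
    short_url = b.shorten('%s?from=%s' % (landing_page, source))
    print source, short_url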

Get click count stats inserted with this Javascript example code. Just update the login & ApiKey and put this in the head section of your webpage:

<script type="text/javascript" charset="utf-8" src="http://bit.ly/javascript-api.js?version=latest&login=YOURBITLYLOGIN&apiKey=YOURAPIKEYGOESHERE"></script>
<script type="text/javascript" charset="utf-8">
	BitlyCB.myStatsCallback = function(data) {
		var results = data.results;
 
		var links = document.getElementsByTagName('a');
		for (var i=0; i < links.length; i++) {
			var a = links[i];
			if (a.href && a.href.match(/^http\:\/\/bit\.ly/)) {
				var hash = BitlyClient.extractBitlyHash(a.href);
				if (results.hash == hash || results.userHash == hash) {
					if (results.userClicks) {
						var uc = results.userClicks + " clicks on this bit.ly URL. ";
					} else {
						var uc = "";
					}
 
					if (results.clicks) {
						var c = results.clicks;
					} else {
						var c = "0";
					}
					c += " clicks on all shortened URLS for this source. ";
 
					var sp = BitlyClient.createElement('span', {'text': " [ " + uc + c + " ] "});
					a.parentNode.insertBefore(sp, a.nextSibling);
				}
			}
 
		};
 
	}
 
	// wait until page is loaded to call API
	BitlyClient.addPageLoadEvent(function(){
		var links = document.getElementsByTagName('a');
		var fetched = {};
		var hashes = [];
		for (var i=0; i < links.length; i++) {
			var a = links[i];
			if (a.href && a.href.match(/^http\:\/\/bit\.ly/)) {
				if (!fetched[a.href]) {
					BitlyClient.stats(BitlyClient.extractBitlyHash(a.href), 'BitlyCB.myStatsCallback');
					fetched[a.href] = true;
				}
			}
		};
 
	});
	</script>

If you want a small command-line script that fetches this data from bit.ly and prints it, check out this Python snippet, which uses the python-bitly library to make it very easy:

import bitly       #http://code.google.com/p/python-bitly/
BITLY_LOGIN = "YOUR_BITLY_LOGIN"
BITLY_API_KEY = "YOUR_BITLY_API_KEY"
 
short_url='http://bit.ly/31IqMl'
 
b = bitly.Api(login=BITLY_LOGIN,apikey=BITLY_API_KEY)
stats = b.stats(short_url)
print "%s - User clicks %s, total clicks: %s" % (short_url, stats.user_clicks, stats.total_clicks)

For the past two weeks I have been working on a project that has great potential to really take off in a big way. I'm developing the site using Python and the Django framework running on Google App Engine. I have a lot of good things to say about working with this development stack. Some of the big wins are:

  • Python is an awesome language – easy to read, write, and maintain. There are lots of libraries available, which makes development go faster.
  • Django is a great framework that makes developing webapps very clean. There's a great separation between templates, views, and URLs. Once I got the hang of how things are supposed to be done in Django, it's easy to get things up and running quickly.
  • Google App Engine has a really amazing admin interface that gives access to logging information, database tables, and website statistics. The free quotas are generous, it scales well, and it takes almost no time to set up. The GUI development app for OS X works really well and does development debugging better than the stock Django manage.py script.

But there have been some really frustrating points during the development of my first real web service running on GAE.

  • There are too many choices/variations of Django – none have great documentation
    • The built-in Django that comes with GAE is stripped down to the bare essentials – no admin interface, different forms, different Models, different User/authentication. Big portions of the documentation at djangoproject.org are useless if you use this version of Django.
    • app-engine-helper – provides a way to get more of standard Django installed. I haven't tried this one.
    • app-engine-patch – similar to helper, but development seems more active. app-engine-patch also includes a bunch of app-engine-ready applications and libraries such as jQuery, Blueprint CSS, and registration. It supports using standard Django user accounts and the admin interface.

The biggest problem I've had is with user registration and authentication. Between app-engine-patch and Google App Engine there seem to be at least four different authentication and session schemes, and multiple User models to choose from. Some require additional middleware; others don't. I want to use the registration application and standard Django Users, but it doesn't seem to want to work with a Model's UserProperty. To top it off there's very little documentation, and I haven't found an example application to see how it should be done. Argh.

The exciting news is that I expect to have my first web service up and running in about a week. The second one is in development and I expect to launch it in early August.

I was a little bored today and decided to write up a simple script that pushes RSS feed information out to Twitter and manages to keep track of the history so that tweets are not sent out more than once.

It was a very trivial little script to write, but it could be useful for something that I'm working on in the future.

The script makes use of an SQLite database to store history and bit.ly for shortening URLs. I've made heavy use of some really nice open source libraries to make for a very short and sweet little script.

Grab the necessary Python libraries:
python-twitter
python-bitly
feedparser

You’ll need to sign up for free accounts at Twitter and Bit.ly to use this script.

Hopefully someone out there can take this code example to do something really cool with Twitter and Python.

Update: I’ve added some bit.ly link tracking output to this script. After it twitters the RSS feed it will print out the click count information for every bit.ly link.

from time import strftime
import sqlite3
 
import twitter     #http://code.google.com/p/python-twitter/
import bitly       #http://code.google.com/p/python-bitly/
import feedparser  #available at feedparser.org
 
 
DATABASE = "tweets.sqlite"
 
BITLY_LOGIN = "bitlyUsername"
BITLY_API_KEY = "api key"
 
TWITTER_USER = "username"
TWITTER_PASSWORD = "password"
 
def print_stats():
	conn = sqlite3.connect(DATABASE)
	conn.row_factory = sqlite3.Row
	c = conn.cursor()
 
	b = bitly.Api(login=BITLY_LOGIN,apikey=BITLY_API_KEY)
 
	c.execute('SELECT title, url, short_url from RSSContent')
	all_links = c.fetchall()
 
	for row in all_links:
 
		short_url = row['short_url']
 
		if short_url is None:
			short_url = b.shorten(row['url'])
			c.execute('UPDATE RSSContent SET `short_url`=? WHERE `url`=?',(short_url,row['url']))
 
 
		stats = b.stats(short_url)
		print "%s - User clicks %s, total clicks: %s" % (row['title'], stats.user_clicks,stats.total_clicks)
 
	conn.commit()
 
def tweet_rss(url):
 
	conn = sqlite3.connect(DATABASE)
	conn.row_factory = sqlite3.Row
	c = conn.cursor()
 
	#create the table if it doesn't exist
	c.execute('CREATE TABLE IF NOT EXISTS RSSContent (`url`, `title`, `dateAdded`, `content`, `short_url`)')
 
	api = twitter.Api(username=TWITTER_USER, password=TWITTER_PASSWORD)
	b = bitly.Api(login=BITLY_LOGIN,apikey=BITLY_API_KEY)
 
	d = feedparser.parse(url)
 
	for entry in d.entries:
 
		#check for duplicates
		c.execute('select * from RSSContent where url=?', (entry.link,))
		if not c.fetchall():
 
			tweet_text = "%s - %s" % (entry.title, entry.summary)
 
			shortened_link = b.shorten(entry.link)
 
			t = (entry.link, entry.title, strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed), entry.summary, shortened_link)
			c.execute('insert into RSSContent (`url`, `title`,`dateAdded`, `content`, `short_url`) values (?,?,?,?,?)', t)
			print "%s.. %s" % (tweet_text[:115], shortened_link)
 
			api.PostUpdate("%s.. %s" % (tweet_text[:115], shortened_link))
 
	conn.commit()
 
if __name__ == '__main__':
  tweet_rss('http://www.halotis.com/feed/')
  print_stats()

I discovered this very handy trick for getting relevant YouTube videos in an RSS feed and have used it to build up some very powerful blog posting scripts that grab relevant content for some blogs that I have. It could also be helpful to pull these feeds into an RSS reader to quickly skim the newest videos relevant to a specific search. I thought I would share some of this with you, and hopefully you'll be able to use these scripts to get some ideas of your own.

To start with you need to create the URL of the RSS feed for the search.  To do that you can do the search on YouTube and click the RSS icon in the address bar.  The structure of the URL should be something like this:

http://gdata.youtube.com/feeds/base/videos?q=halotis%20marketing&client=ytapi-youtube-search&alt=rss&v=2
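Building that URL for any keyword phrase is just a matter of URL-encoding the query. A quick sketch (youtube_search_feed is just a helper name for this example):

import urllib

def youtube_search_feed(phrase):
	"""Return the YouTube search RSS feed URL for a keyword phrase."""
	return 'http://gdata.youtube.com/feeds/base/videos?q=%s&client=ytapi-youtube-search&alt=rss&v=2' % urllib.quote(phrase)

print youtube_search_feed('halotis marketing')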

The problem with the RSS feed is that it doesn't include the HTML required to embed the video. You have to parse the RSS content and find the URL for the video, which can then be used to create the embed code so you can post the videos somewhere else.

In my example code I have associated each URL with a target keyword phrase. The code below is not a comprehensive program, just an idea of how to go about repurposing YouTube RSS content.

import random
import sqlite3
from time import strftime, localtime

import feedparser  # available at: feedparser.org

DATABASE = "YouTubeMatches.sqlite"

conn = sqlite3.connect(DATABASE)
conn.row_factory = sqlite3.Row
c = conn.cursor()

#create the table if it doesn't exist
c.execute('CREATE TABLE IF NOT EXISTS YoutubeResources (`phrase`, `url`, `used`, `title`, `dateAdded`, `content`, `dateUsed`)')
 
def LoadIntoDatabase(phrase, url):

	d = feedparser.parse(url)
	for entry in d.entries:

		#check for duplicates with the url
		c.execute('select * from YoutubeResources where url=?', (entry.link,))
		if len(c.fetchall()) == 0:
			#only add the resources that are not already in the table
			t = (phrase, entry.link, 0, entry.title, strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed), entry.summary)
			c.execute('insert into YoutubeResources (`phrase`, `url`, `used`, `title`, `dateAdded`, `content`) values (?,?,?,?,?,?)', t)

	conn.commit()

def getYouTubeEmbedCode(phrase):

	#pick a random unused video for this keyword phrase
	c.execute("select * from YoutubeResources where phrase=? and used=0", (phrase,))
	result = c.fetchall()
	random.shuffle(result)

	content = result[0]

	#turn the watch URL into an embeddable URL
	url = content['url'].replace('?', '').replace('=', '/')
	embedCode = '<div class="youtube-video"><object width="425" height="344"><param name="movie" value="%s&hl=en&fs=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="%s&hl=en&fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></div>' % (url, url)

	#mark the video as used so it doesn't get picked again
	t = strftime("%Y-%m-%d %H:%M:%S", localtime())
	c.execute("UPDATE YoutubeResources SET used = '1', dateUsed = ? WHERE url = ?", (t, content['url']))
	conn.commit()

	return embedCode
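To tie it together, usage would look something like this; the keyword phrase and feed URL are just examples:

if __name__ == '__main__':
	phrase = 'halotis marketing'
	feed_url = 'http://gdata.youtube.com/feeds/base/videos?q=halotis%20marketing&client=ytapi-youtube-search&alt=rss&v=2'
	LoadIntoDatabase(phrase, feed_url)
	print getYouTubeEmbedCode(phrase)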


Dropbox is one of the best services there is for moving files from computer to computer and sharing videos, pictures, work files, or even applications. If you've ever emailed yourself a file, or wondered how to send a friend a video without publishing it on YouTube, then you should at least check out what Dropbox will do for you.

In a nutshell, Dropbox is a web service where you select a particular folder on your computer to be the “dropbox” and any file you put in there gets immediately synced up to the web and pulled down to any other computers that you have the software installed on. The files are also available through their website if you want to get at them from a computer without the software.

So the way that I have it set up is that I have it on my laptop and also on my computer at work. When I get some MP3s that I want to send to myself at the office, I'll just download them at home and copy them into my dropbox. Then the next day at work the files will already be on my computer and I can listen to them right away. Or when I find something online at work that I want to look at on my laptop, I can download it there and it will be on my laptop when I get home.

One of the other neat tips I’ve figured out is that I can use it to keep my personal project code in sync. I use Mercurial to manage the history of my code and I can keep a repository of that code in my dropbox. Then anytime I work on the code at home those changes are immediately available to me at the office, or anywhere with internet access.

It also acts as a small backup for my secret files. I have an encrypted file that contains some sensitive documents, and all my saved internet passwords. If I lose my laptop or the hard drive dies I will still be able to get at that information and restore it.

There’s also a special folder within the dropbox folder for shared files. Any files placed within the Public folder is visible to anyone with the URL. I’ve seen people use it as a place to host video files which are posted to blog sites which is a cool way to host the files and bandwidth for free. Here’s my secret text file that I made public.

The free account gives you 2GB of space which is not too shabby. I’ve found it to be an awesome tool for moving bigger files between computers and I would recommend it to just about anyone that wants to share or move files over the internet.

Check out dropbox at getdropbox.com

I am a programmer, and over the years that I've been working to make money online I have spent a considerable amount of time writing small scripts and tools to help me manage my marketing efforts. Already on this site I have published a couple of things, such as a URL redirection script that lets you create branded URLs very easily, and a Twitter client for Excel.

But I have a secret stash of software that I’ve written which I’m going to be making public very soon on this site.

Stay tuned for my program that re-purposes content from YouTube and some other sources and pushes it out to your network of blogs. This software will allow you to scale up your network of websites massively and drive some serious traffic.

Up to now I’ve been holding back on publishing my software on this site, but look forward to seeing more code examples and software ready for you to download and use in the near future.