Tag Archives: urllib

As much as I’ve found basic web scraping to be really simple with urllib and BeautifulSoup, it leaves some things to be desired. The BeautifulSoup project has languished, and recent versions have switched to an HTML parser that is less able to cope with the poorly encoded pages found on real websites.
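Just for comparison, the kind of basic scraping I mean only takes a few lines of urllib2 and BeautifulSoup (the URL here is just a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.example.com/').read()
soup = BeautifulSoup(page)
print soup.title.string                            # the page title
print [a.get('href') for a in soup.findAll('a')]   # every link on the page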

Scrapy is a full-on framework for scraping websites, and it offers many features, including a stand-alone command-line interface and a daemon tool, that make scraping websites much more systematic and organized.

I have yet to build any substantial scraping scripts based on Scrapy, but judging from the snippets I’ve read at http://snippets.scrapy.org, the documentation at http://doc.scrapy.org, and the project blog at http://blog.scrapy.org, it seems like a solid project with a good future and a lot of really great features that will make my scripts more automatable and standardized.
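To give a flavour of what that looks like, here is a minimal spider sketch; I’m hedging here since it follows the current Scrapy docs rather than anything I’ve shipped, and the domain is just a placeholder:

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # yield the href of every link on the page
        for href in response.xpath('//a/@href').extract():
            yield {'href': href}

You can run something like this with the scrapy runspider command and dump the items to JSON with the -o flag, which is part of what makes the framework feel more systematic than one-off scripts.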

I got an email the other day from Frank Kern, who was pimping another make-money-online product from his cousin Trey. The Number Effect is a DVD containing the results of an experiment where he created an affiliate link for every one of the 12,000 products for sale on ClickBank, sent paid (PPV) traffic to all of those links, and found which ones were profitable. Out of the 12,000, he found 54 niches with profitable campaigns.

Trey went on to talk about the software he had written for this experiment. Apparently it took his outsourced programmer a bit of work to get it going.

I thought it would be fun to try to implement the same script myself. It took about an hour to program the whole thing.

So if you want to create your own ClickBank affiliate link for every ClickBank product for sale, here’s a script that will do it. Keep in mind that I never did any work to make this thing fast, and it takes about 8 hours to scrape all 13,000 products, create the affiliate links, and resolve the URLs they redirect to. Sure, I could make it faster, but I’m lazy.

Here’s the python script to do it:

#!/usr/bin/env python
# encoding: utf-8
"""
ClickBankMarketScrape.py
 
Created by Matt Warren on 2010-09-07.
Copyright (c) 2010 HalOtis.com. All rights reserved.
 
"""
 
 
 
CLICKBANK_URL = 'http://www.clickbank.com'
MARKETPLACE_URL = CLICKBANK_URL+'/marketplace.htm'
AFF_LINK_FORM = CLICKBANK_URL+'/info/jmap.htm'
 
AFFILIATE = 'mfwarren'
 
import urllib, urllib2
from BeautifulSoup import BeautifulSoup
import re
 
product_links = []
product_codes = []
pages_to_scrape = []
 
def get_category_urls():
	request = urllib2.Request(MARKETPLACE_URL, None)
	urlfile = urllib2.urlopen(request)
	page = urlfile.read()
	urlfile.close()
 
	soup = BeautifulSoup(page)
	parentCatLinks = [x['href'] for x in soup.findAll('a', {'class':'parentCatLink'})]
	return parentCatLinks
 
def get_products():
 
	fout = open('ClickBankLinks.csv', 'w')
 
	while len(pages_to_scrape) > 0:
 
		url = pages_to_scrape.pop()
		request = urllib2.Request(url, None)
		urlfile = urllib2.urlopen(request)
		page = urlfile.read()
		urlfile.close()
 
		soup = BeautifulSoup(page)
 
		results = [x.find('a') for x in soup.findAll('tr', {'class':'result'})]
 
		nextLink = soup.find('a', title='Next page')
		if nextLink:
			pages_to_scrape.append(nextLink['href'])
 
		for product in results:
			try:
				product_code = str(product).split('.')[1]
				product_codes.append(product_code)
				m = re.search('^<(.*)>(.*)<', str(product))
				title = m.group(2)
				my_link = get_hoplink(product_code)
				request = urllib2.Request(my_link)
				urlfile = urllib2.urlopen(request)
				display_url = urlfile.url
				#page = urlfile.read()  #continue here if you want to scrape keywords etc from landing page
 
				print my_link, display_url
				product_links.append({'code':product_code, 'aff_link':my_link, 'dest_url':display_url})
				fout.write(product_code + ', ' + my_link + ', ' + display_url + '\n')
				fout.flush()
			except:
				continue  # handle cases where destination url is offline
 
	fout.close()
 
def get_hoplink(vendor):
	request = urllib2.Request(AFF_LINK_FORM + '?affiliate=' + AFFILIATE + '&promocode=&submit=Create&vendor='+vendor+'&results=', None)
	urlfile = urllib2.urlopen(request)
	page = urlfile.read()
	urlfile.close()
	soup = BeautifulSoup(page)
	link = soup.findAll('input', {'class':'special'})[0]['value']
	return link
 
if __name__=='__main__':
	urls = get_category_urls()
	for url in urls:
		pages_to_scrape.append(CLICKBANK_URL+url)
	get_products()

Have you ever wanted to track and assess your SEO efforts by seeing how they change your position in Google’s organic SERP? With this script you can now track and chart your position for any number of search queries and find the position of the site/page you are trying to rank.

This will allow you to visually identify which target keyword phrases are doing well and which ones may need some more SEO work.

This Python script has a number of different components:

  • SEOCheckConfig.py is used to add new target search queries to the database.
  • SEOCheck.py searches Google and saves the best position (within the top 100 results).
  • SEOCheckCharting.py graphs all of the results.

The charts produced look like this:

[seocheck chart screenshot]

The main part of the script is SEOCheck.py. This script should be scheduled to run regularly (I have mine running 3 times per day on my webfaction hosting account).

For a small SEO consultancy business this type of application generates the feedback and reports that you should be using to communicate with your clients. It identifies where the efforts should go and how successful you have been.

To use this set of scripts you will first need to edit and run the SEOCheckConfig.py file. Add your own queries and domains that you’d like to check to the SETTINGS variable, then run the script to load those into the database.
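The exact layout of SETTINGS is defined in SEOCheckConfig.py in the repository; purely as a hypothetical illustration, it might hold (search query, domain) pairs along these lines:

# Hypothetical illustration only -- the real SETTINGS structure is defined in
# SEOCheckConfig.py in the HalOtis-Collection repo and may differ.
SETTINGS = [
    ('python twitter scripts', 'halotis.com'),
    ('clickbank marketplace scraper', 'halotis.com'),
]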

Then schedule SEOCheck.py to run periodically. On Windows you can do that using Scheduled Tasks:
[Scheduled Task dialog screenshot]

On either Mac OSX or Linux you can use crontab to schedule it.

To generate the Chart simply run the SEOCheckCharting.py script. It will plot all the results on one graph.

You can find and download all the source code for this in the HalOtis-Collection on bitbucket. It requires BeautifulSoup, matplotlib, and sqlalchemy libraries to be installed.
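If you just want the core idea without pulling the repo, here is a rough sketch of the position-check step. This is not the actual SEOCheck.py; in particular, the h3/class "r" result markup is an assumption and Google changes it frequently:

# Rough sketch only, not the actual SEOCheck.py from the repo.
# The h3 / class 'r' result markup is an assumption that Google changes often.
import urllib, urllib2
from BeautifulSoup import BeautifulSoup

def serp_position(query, domain, num=100):
    url = 'http://www.google.com/search?' + urllib.urlencode({'q': query, 'num': num})
    request = urllib2.Request(url, None, {'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urllib2.urlopen(request).read())
    links = [h3.find('a') for h3 in soup.findAll('h3', {'class': 'r'})]
    for position, link in enumerate(links, 1):
        if link and domain in link.get('href', ''):
            return position
    return None  # not found in the first `num` results

if __name__ == '__main__':
    print serp_position('python scraping scripts', 'halotis.com')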

Ok, so this isn’t my script but it’s a much nicer version of the one I wrote that scrapes the actual Google Translate website to do the same thing. I’d like to thank Ashish Yadav for writing and sharing this.

Translating text is an easy way to create variations of content that are recognized as unique by the search engines. As part of a bigger SEO strategy this can make a big impact on your traffic. Or it could be used to provide an automated way to translate your website into another language.

# -*- coding: utf-8 -*-
 
import re
import sys
import urllib
import simplejson
 
baseUrl = "http://ajax.googleapis.com/ajax/services/language/translate"
 
def getSplits(text, splitLength=4500):
    '''
    The Translate API has a limit on the length of text (4500 characters) that can be
    translated at once, so split the text into chunks of at most that length.
    '''
    return (text[index:index+splitLength] for index in xrange(0, len(text), splitLength))
 
 
def translate(text, src='', to='en'):
    '''
    A Python wrapper for the Google AJAX Language API:
    * Uses Google Language Detection when the source language is not provided with the source text
    * Splits up the text if it's longer than 4500 characters, a limit imposed by the API
    '''
 
    params = ({'langpair': '%s|%s' % (src, to),
             'v': '1.0'
             })
    retText = ''
    for chunk in getSplits(text):
        params['q'] = chunk
        resp = simplejson.load(urllib.urlopen(baseUrl, urllib.urlencode(params)))
        try:
            retText += resp['responseData']['translatedText']
        except:
            raise
    return retText
 
 
def test():
    msg = "      Write something You want to be translated to English,\n"\
        "      Enter ctrl+c to exit"
    print msg
    while True:
        text = raw_input('#>  ')
        retText = translate(text)
        print retText
 
 
if __name__=='__main__':
    try:
        test()
    except KeyboardInterrupt:
        print "\n"
        sys.exit(0)
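As a quick usage sketch (assuming you save the script above as translate.py; the endpoint and the exact output shown are only illustrative):

>>> from translate import translate
>>> print translate('Hola mundo')
Hello world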

I was a bit hesitant to post this script since it is such a powerful marketing tool that it could be used very badly in the hands of a spammer. The basic premise is to respond directly to someone’s tweet if they mention your product or service. So, for example, I might want a tweet to go out directly to anyone who mentions twitter and python in a tweet, letting them know about this blog. This will accomplish the same thing as the TwitterHawk service, except you won’t have to pay per tweet.

To do this I had a choice. I could use a service like TweetBeep.com and then write a script that responded to the emails in my inbox, or I could use the Twitter Search API directly. The search API is so dead simple that I wanted to try that route.

The other thing to consider is that I don’t want to send a tweet to the same person more than once so I need to keep a list of twitter users that I have responded to. I used pickle to persist that list of usernames to disk so that it sticks around between uses.

The query functionality provided by the Twitter Search API is pretty cool and provides much more power than I have used in this script. For example, it is possible to geo-target, look up hashtags, or find replies to a particular user. You can check out the full spec at http://apiwiki.twitter.com/Twitter-API-Documentation
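To give a flavour, here are a few illustrative query URLs; the parameter names follow the old v1 search API and are just a sketch, not something this script relies on:

import urllib

base = 'http://search.twitter.com/search.json?'
print base + urllib.urlencode({'q': '#python'})                                  # hashtag lookup
print base + urllib.urlencode({'q': 'to:halotis'})                               # replies to a user
print base + urllib.urlencode({'q': 'python', 'geocode': '43.65,-79.38,50km'})   # geo-targeted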

Lastly, to keep it a bit simpler I’m ignoring the pagination in the search results, so this script will only respond to the first page worth of results. Adding a loop per page would be pretty straightforward, but I didn’t want to clutter up the code; there’s a rough sketch of the idea below.
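Purely as a sketch, a generator along these lines could walk every page; it assumes the v1 search API keeps returning a next_page query string in its JSON, which I haven’t exercised beyond the first page:

import urllib
import simplejson

def all_search_results(query):
    # assumption: the v1 search API includes a 'next_page' query string while more pages exist
    url = 'http://search.twitter.com/search.json?q=' + '+'.join(query.split())
    while True:
        data = simplejson.load(urllib.urlopen(url))
        for result in data['results']:
            yield result
        if 'next_page' not in data:
            break
        url = 'http://search.twitter.com/search.json' + data['next_page']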

Example Usage:

>>> import tweetBack
>>> tweetBack.tweet_back('python twitter', 'Here is a blog with some good Python scripts you might find interesting http://halotis.com', 'twitter_username', 'twitter_password')
@nooble sent message
@ichiro_j sent message
@Ghabrie1 sent message
.....

Here’s the Python Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
try:
   import json as simplejson
except:
   import simplejson  # http://undefined.org/python/#simplejson
import twitter     #http://code.google.com/p/python-twitter/
 
import urllib
import pickle
 
TWITTER_USER = 'username'
TWITTER_PASSWORD = 'password'
 
USER_LIST_FILE = 'tweetback.pck'
 
#read stored list of twitter users that have been responded to already in a file
try:
    f = open(USER_LIST_FILE, 'r')
    user_list = pickle.load(f)
except:
    user_list = []
 
def search_results(query):
    url = 'http://search.twitter.com/search.json?q=' + '+'.join(query.split())
    return simplejson.load(urllib.urlopen(url))
 
def tweet_back(query, tweet_reply, username=TWITTER_USER, password=TWITTER_PASSWORD):
    results = search_results(query)
 
    api = twitter.Api(username, password)
    try:
        for result in results['results']:
            if result['from_user'] not in user_list:
                api.PostUpdate('@' + result['from_user'] + ' ' + tweet_reply)
                print '@' + result['from_user'] + ' sent message'
 
                user_list.append(result['from_user'])
    except:
        print 'Failed to post update. may have gone over the twitter API limit.. please wait and try again'
 
    #write the updated user_list to disk so the same person is never messaged twice
    f = open(USER_LIST_FILE, 'w')
    pickle.dump(user_list, f)
    f.close()
 
if __name__=='__main__':
    tweet_back('python twitter', 'Here is a blog with some good Python scripts you might find interesting http://halotis.com')

Update: thanks tante for the simplejson note.

I noticed that several accounts are spamming the twitter trends. Go to twitter.com and select one of the trends in the right column. You’ll undoubtedly see some tweets that are blatantly inserting words from the trending topics list into unrelated ads.

I was curious just how easy it would be to get the trending topics to target them with tweets. Turns out it is amazingly simple and shows off some of the beauty of Python.

This script doesn’t actually do anything with the trend information; it simply downloads and prints out the list. But combine this code with the sample code from the RSS Twitter Bot in Python post and you’ll have a recipe for some seriously powerful promotion; there’s a small sketch of that combination after the code below.

import simplejson  # http://undefined.org/python/#simplejson
import urllib
 
result = simplejson.load(urllib.urlopen('http://search.twitter.com/trends.json'))
 
print [trend['name'] for trend in result['trends']]
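Here is a minimal sketch of that combination; the credentials are placeholders and it reuses python-twitter’s Api.PostUpdate exactly as in the tweet-back script above:

import urllib
import simplejson  # http://undefined.org/python/#simplejson
import twitter     # http://code.google.com/p/python-twitter/

result = simplejson.load(urllib.urlopen('http://search.twitter.com/trends.json'))

api = twitter.Api('username', 'password')  # placeholder credentials
for trend in result['trends'][:3]:
    # work the top three trending topics into a promotional tweet
    api.PostUpdate('Talking about %s over at http://halotis.com' % trend['name'])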