Scraping the ClickBank Marketplace

I got an email the other day from Frank Kern pimping another make-money-online product from his cousin Trey. The Number Effect is a DVD with the results of an experiment: he created an affiliate link for every one of the 12,000 products for sale on ClickBank, sent paid (PPV) traffic to all of them, and tracked which ones turned a profit. Out of those 12,000 products he found 54 niches with profitable campaigns.

Trey went on to talk about the software behind the experiment. Apparently it took his outsourced programmer a fair bit of work to get it working.

I thought it would be fun to try to implement the same script myself. It took me about an hour to program the whole thing.

So if you want to create your own ClickBank affiliate link for every product in the marketplace, here's a script that will do it. Keep in mind that I never did any work to make this thing fast: it takes about 8 hours to scrape all 13,000 products, create the affiliate links, and resolve the URLs they redirect to. Sure, I could make it faster, but I'm lazy.
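Most of those eight hours are spent blocked on one HTTP round trip at a time, so if you did want to speed it up, resolving the links concurrently with a thread pool is the obvious first step. Here's a minimal sketch of the idea in modern Python 3 (`concurrent.futures` didn't exist back when I wrote this); `resolve` is a hypothetical stand-in for the per-link urlopen() call:

```python
from concurrent.futures import ThreadPoolExecutor

def resolve(link):
    # Hypothetical stand-in for following a hoplink redirect with urlopen()
    # and returning urlfile.url; here it just normalizes the string.
    return link.lower()

def resolve_all(links, workers=10):
    # pool.map() keeps results in input order while up to `workers`
    # requests are in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(resolve, links))

print(resolve_all(['HTTP://EXAMPLE.COM/A', 'HTTP://EXAMPLE.COM/B']))
```

With ten workers the wall-clock time would drop to roughly a tenth, network willing.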

Here’s the python script to do it:

#!/usr/bin/env python
# encoding: utf-8
"""
ClickBankMarketScrape.py
 
Created by Matt Warren on 2010-09-07.
Copyright (c) 2010 HalOtis.com. All rights reserved.
 
"""
 
 
 
import re
import urllib2
from BeautifulSoup import BeautifulSoup

CLICKBANK_URL = 'http://www.clickbank.com'
MARKETPLACE_URL = CLICKBANK_URL + '/marketplace.htm'
AFF_LINK_FORM = CLICKBANK_URL + '/info/jmap.htm'

AFFILIATE = 'mfwarren'
 
product_links = []
product_codes = []
pages_to_scrape = []
 
def get_category_urls():
	request = urllib2.Request(MARKETPLACE_URL, None)
	urlfile = urllib2.urlopen(request)
	page = urlfile.read()
	urlfile.close()
 
	soup = BeautifulSoup(page)
	parentCatLinks = [x['href'] for x in soup.findAll('a', {'class':'parentCatLink'})]
	return parentCatLinks
 
def get_products():
 
	fout = open('ClickBankLinks.csv', 'w')
 
	while len(pages_to_scrape) > 0:
 
		url = pages_to_scrape.pop()
		request = urllib2.Request(url, None)
		urlfile = urllib2.urlopen(request)
		page = urlfile.read()
		urlfile.close()
 
		soup = BeautifulSoup(page)
 
		results = [x.find('a') for x in soup.findAll('tr', {'class':'result'})]
 
		nextLink = soup.find('a', title='Next page')
		if nextLink:
			pages_to_scrape.append(nextLink['href'])
 
		for product in results:
			try:
				product_code = str(product).split('.')[1]
				product_codes.append(product_code)
				m = re.search(r'^<(.*)>(.*)<', str(product))
				title = m.group(2)
				my_link = get_hoplink(product_code)
				request = urllib2.Request(my_link)
				urlfile = urllib2.urlopen(request)
				display_url = urlfile.url
				#page = urlfile.read()  #continue here if you want to scrape keywords etc from landing page
 
				print my_link, display_url
				product_links.append({'code':product_code, 'aff_link':my_link, 'dest_url':display_url})
				fout.write(product_code + ', ' + my_link + ', ' + display_url + '\n')
				fout.flush()
			except Exception:
				continue  # skip products whose destination URL is offline or unparsable
 
	fout.close()
 
def get_hoplink(vendor):
	request = urllib2.Request(AFF_LINK_FORM + '?affiliate=' + AFFILIATE + '&promocode=&submit=Create&vendor='+vendor+'&results=', None)
	urlfile = urllib2.urlopen(request)
	page = urlfile.read()
	urlfile.close()
	soup = BeautifulSoup(page)
	link = soup.findAll('input', {'class':'special'})[0]['value']
	return link
 
if __name__=='__main__':
	urls = get_category_urls()
	for url in urls:
		pages_to_scrape.append(CLICKBANK_URL+url)
	get_products()
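One last shortcut worth knowing: as far as I can tell, ClickBank hoplinks follow a predictable pattern, so you may not need to scrape the jmap.htm form at all — you can build the link directly from the affiliate and vendor nicknames. Verify the pattern against a few links the form actually produces before trusting it:

```python
def make_hoplink(affiliate, vendor):
    # Assumed hoplink format -- double-check against what jmap.htm
    # returns before using this for real campaigns.
    return 'http://%s.%s.hop.clickbank.net/' % (affiliate, vendor)

print(make_hoplink('mfwarren', 'somevendor'))
```

If the pattern holds, that eliminates one HTTP request per product and cuts the runtime roughly in half.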