Tag Archives: web scraping

deliciousIn yet another of my series of web scrapers this time I’m posting some code that will scrape links from delicious.com. This is a pretty cool way of finding links that other people have found relevant. And this could be used to generate useful content for visitors.

You could easily add this to a WordPress blogging robot script so that the newest links are posted in a weekly digest post. This type of promotion will get noticed by the people they link to, and spreads some of that link love. It will hopefully result in some reciprocal links for your site.

Another idea would be to create a link directory and seed it with links gathered from delicious. Or you could create a widget of the hottest links in your niche that automatically gets updated.

This script makes use of the BeautifulSoup library for parsing the HTML pages.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
"""
Scraper for Del.icio.us SERP.
 
This pulls the results for a match for a query on http://del.icio.us.
"""
 
import urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
def get_delicious_results(query, page_limit=10):
 
    page = 1
    links = []
 
    while page < page_limit :
        url='http://delicious.com/search?p=' + '%20'.join(query.split()) + '&context=all&lc=1&page=' + str(page)
        req = urllib2.Request(url)
        HTML = urllib2.urlopen(req).read()
        soup = BeautifulSoup(HTML)
 
        next = soup.find('a', attrs={'class':re.compile('.*next$', re.I)})
 
        #links is a list of (url, title) tuples
        links +=   [(link['href'], ''.join(link.findAll(text=True)) ) for link in soup.findAll('a', attrs={'class':re.compile('.*taggedlink.*', re.I)}) ]
 
        if next :
            page = page+1
        else :
            break
 
    return links
 
if __name__=='__main__':
    links = get_delicious_results('halotis marketing')
    print links

Today’s script will perform a search on Technorati and then scrape out the search results. It is useful because Technorati is up to date about things that are happening in the blogosphere. And that gives you a way to tune into everything going on there.

The scope of the blogosphere matched with Technoratis ability sort the results by the most recent is what makes this very powerful. This script will help you find up to the moment content which you can then data-mine for whatever purposes you want.

Possible uses:

  • Create a tag cloud of what is happening today within your niche
  • aggregate the content into your own site
  • post it to Twitter
  • convert the search results into an RSS feed

And here’s the Python code:

import urllib2
 
from BeautifulSoup import BeautifulSoup
 
def get_technorati_results(query, page_limit=10):
 
    page = 1
    links = []
 
    while page < page_limit :
        url='http://technorati.com/search/' + '+'.join(query.split()) + '?language=n&page=' + str(page)
        req = urllib2.Request(url)
        HTML = urllib2.urlopen(req).read()
        soup = BeautifulSoup(HTML)
 
        next = soup.find('li', attrs={'class':'next'}).find('a')
 
        #links is a list of (url, summary, title) tuples
        links +=   [(link.find('blockquote')['cite'], ''.join(link.find('blockquote').findAll(text=True)), ''.join(link.find('h3').findAll(text=True))) for link in soup.find('div', id='results').findAll('li', attrs={'class':'hentry'})]
 
        if next :
            page = page+1
        else :
            break
 
    return links
 
if __name__=='__main__':
    links = get_technorati_results('halotis marketing')
    print links

ea_logoIf you’re not familiar with Ezine articles they are basically niche content about 200 to 2000 words long that some ‘expert’ writes and shares for re-publishing the content under the stipulation that it includes the signature (and usually a link) for the author. Articles are great from both the advertiser and publisher perspective since the author can get good links back to their site for promotion and the publishers get quality content without having to write it themselves.

I thought it might be handy to have a script that could scrape an ezine website for articles and save them in a database for later use. A bit of Googling revealed no scripts out there to do this sort of thing so I decided to write it myself.

The script I wrote will perform a search on ezinearticles.com and then get the top 25 results and download all the content of the articles and store them in an sqlite database.

scaling this up should make it possible to source 1000s of articles using a keyword list as input.  Used correctly this script could generate massive websites packed with content in just a few minutes.

Here’s the script:

import sys
import urllib2
import urllib
import sqlite3
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13'
 
conn = sqlite3.connect("ezines.sqlite")
conn.row_factory = sqlite3.Row
 
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS Ezines (`url`, `title`, `summary`, `tail`, `content`, `signature`)')
conn.commit()
 
def transposed(lists):
   if not lists: return []
   return map(lambda *row: list(row), *lists)
 
def search(query):
    """Runs the search on ezineartles.com and returns the HTML
    """
    url='http://ezinearticles.com/search/?q=' + '+'.join(query.split())
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
    return HTML
 
def parse_search_results(HTML):
    """Givin the result of the search function this parses out the results into a list
    """
    soup = BeautifulSoup(HTML)
    match_titles = soup.findAll(attrs={'class':'srch_title'})
    match_sum = soup.findAll(attrs={'class':'srch_sum'})
    match_tail = soup.findAll(attrs={'class':'srch_tail'})
 
    return transposed([match_titles, match_sum, match_tail])
 
def get_article_content(url):
    """Parse the body and signature from the content of an article
    """
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
 
    soup = BeautifulSoup(HTML)
    return {'text':soup.find(id='body'), 'sig':soup.find(id='sig')}
 
def store_results(search_results):
    """put the results into an sqlite database if they haven't already been downloaded.
    """
    c = conn.cursor()
    for row in search_results:
        title = row[0]
        summary = row[1]
        tail = row[2]
 
        link = title.find('a').get('href')
        have_url = c.execute('SELECT url from Ezines WHERE url=?', (link, )).fetchall()
        if not have_url:
            content = get_article_content('http://ezinearticles.com' + link)
            c.execute('INSERT INTO Ezines (`title`, `url`, `summary`, `tail`, `content`, `signature`) VALUES (?,?,?,?,?,?)', 
                      (title.find('a').find(text=True), 
                       link, 
                       summary.find(text=True), 
                       tail.find(text=True), 
                       str(content['text']), 
                       str(content['sig'])) )
 
    conn.commit()
 
if __name__=='__main__':
    #example usage
    page = search('seo')
    search_results = parse_search_results(page)
 
    store_results(search_results)

translate_logoSometimes is can be quite useful to be able to translate content from one language to another from within a program. There are many compelling reasons why you might like the idea of auto translating text. The reason why I’m interested in writing this script is that it is useful to sometimes create unique content online for SEO reasons. Search engines like to see unique content rather than words that have been copied and pasted from other websites. What you’re looking for in web content is:

  1. A lot of it.
  2. Highly related to the keywords you’re targeting.

When trying to get a great position in the organic search results it is important to recognize that you’re competing against an army of low cost outsourced people that are pumping out page after page of mediocre content and then running scripts to generate thousands of back-links to the sites they are trying to rank.  It is very much impossible to get the top spot for any desirable keyword if you’re writing all the content yourself.  You need some help with this.

That’s where Google Translate comes in.

Take an article from somewhere, push it through a round trip of translation such as English->French->English and the content will then be unique enough that it won’t raise any flags that it has been copied from somewhere else on the internet.  The content may not be readable but it will make for fodder for the search engines to eat up.

Using this technique it is possible to build massive websites of unique content overnight and have it quickly rank highly.

Unfortunately Google doesn’t provide an API for translating text.  That means the script has to resort to scraping which is inherently prone to breaking.  The script uses BeautifulSoup to help with the parsing of the HTML content. (Note: I had to use the older 3.0.x series of BeautifulSoup to successfully parse the content)

The code for this was based on this script by technobabble.

import sys
import urllib2
import urllib
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
def translate(sl, tl, text):
    """ Translates a given text from source language (sl) to
        target language (tl) """
 
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)')]
 
    translated_page = opener.open(
        "http://translate.google.com/translate_t?" + 
        urllib.urlencode({'sl': sl, 'tl': tl}),
        data=urllib.urlencode({'hl': 'en',
                               'ie': 'UTF8',
                               'text': text.encode('utf-8'),
                               'sl': sl, 'tl': tl})
    )
 
    translated_soup = BeautifulSoup(translated_page)
 
    return translated_soup('div', id='result_box')[0].string
 
if __name__=='__main__':
    print translate('en', 'fr', u'hello')

To generate unique content you can use this within your own python program like this:

import translate
 
content = get_content()
new_content = translate('fr', 'en', translate('en','fr', content))
publish_content(new_content)

3038922333_79273fbb30_oThere are a number of services out there such as Google Cash Detective that will go run some searches on Google and then save the advertisements so you can track who is advertising for what keywords over time. It’s actually a very accurate technique for finding out what ads are profitable.

After tracking a keyword for several weeks it’s possible to see what ads have been running consistently over time. The nature of Pay Per Click is that only profitable advertisements will continue to run long term. So if you can identify what ads, for what keywords are profitable then it should be possible to duplicate them and get some of that profitable traffic for yourself.

The following script is a Python program that perhaps breaks the Google terms of service. So consider it as a guide for how this kind of HTML parsing could be done. It spoofs the User-agent to appear as though it is a real browser, and then does a search through all the keywords stored in an sqlite database and stores the ads displayed for that keyword in the database.

The script makes use of the awesome Beautiful Soup library. Beautiful Soup makes parsing HTML content really easy. But because of the nature of scraping the web it is very fragile since it makes several assumptions about the structure of the Google results page and if they change their site then the script could break.

#!/usr/bin/env python
 
import sys
import urllib2
import re
import sqlite3
import datetime
 
from BeautifulSoup import BeautifulSoup  # available at: http://www.crummy.com/software/BeautifulSoup/
 
conn = sqlite3.connect("espionage.sqlite")
conn.row_factory = sqlite3.Row
 
def get_google_search_results(keywordPhrase):
	"""make the GET request to Google.com for the keyword phrase and return the HTML text
	"""
	url='http://www.google.com/search?hl=en&q=' + '+'.join(keywordPhrase.split())
	req = urllib2.Request(url)
	req.add_header('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13')
	page = urllib2.urlopen(req)
	HTML = page.read()
	return HTML
 
def scrape_ads(text, phraseID):
	"""Scrape the text as HTML, find and parse out all the ads and store them in a database
	"""
	soup = BeautifulSoup(text)
	#get the ads on the right hand side of the page
	ads = soup.find(id='rhsline').findAll('li')
	position = 0
	for ad in ads:
		position += 1
 
		#display url
		parts = ad.find('cite').findAll(text=True)
		site = ''.join([word.strip() for word in parts]).strip()
		ad.find('cite').replaceWith("")
 
		#the header line
		parts = ad.find('a').findAll(text=True)
		title = ' '.join([word.strip() for word in parts]).strip()
 
		#the destination URL
		href = ad.find('a')['href']
		start = href.find('&q=')
		if start != -1 :
			dest = href[start+3:]
		else :
			dest = None
			print 'error', href
 
		ad.find('a').replaceWith("")
 
		#body of ad
		brs = ad.findAll('br')
		for br in brs:
			br.replaceWith("%BR%")
		parts = ad.findAll(text=True)
		body = ' '.join([word.strip() for word in parts]).strip()
		line1 = body.split('%BR%')[0].strip()
		line2 = body.split('%BR%')[1].strip()
 
		#see if the ad is in the database
		c = conn.cursor()
		c.execute('SELECT adID FROM AdTable WHERE destination=? and title=? and line1=? and line2=? and site=? and phraseID=?', (dest, title, line1, line2, site, phraseID))
		result = c.fetchall() 
		if len(result) == 0:
			#NEW AD - insert into the table
			c.execute('INSERT INTO AdTable (`destination`, `title`, `line1`, `line2`, `site`, `phraseID`) VALUES (?,?,?,?,?,?)', (dest, title, line1, line2, site, phraseID))
			conn.commit()
			c.execute('SELECT adID FROM AdTable WHERE destination=? and title=? and line1=? and line2=? and site=? and phraseID=?', (dest, title, line1, line2, site, phraseID))
			result = c.fetchall()
		elif len(result) > 1:
			continue
 
		adID = result[0]['adID']
 
		c.execute('INSERT INTO ShowTime (`adID`,`date`,`time`, `position`) VALUES (?,?,?,?)', (adID, datetime.datetime.now(), datetime.datetime.now(), position))
 
 
def do_all_keywords():
	c = conn.cursor()
	c.execute('SELECT * FROM KeywordList')
	result = c.fetchall()
	for row in result:
		html = get_google_search_results(row['keywordPhrase'])
		scrape_ads(html, row['phraseID'])
 
if __name__ == '__main__' :
	do_all_keywords()