Getting Ezine Article Content Automatically with Python

If you’re not familiar with Ezine articles, they are basically niche content, about 200 to 2000 words long, that some ‘expert’ writes and shares for re-publication under the stipulation that the content includes the author’s signature (and usually a link). Articles are great from both the advertiser’s and the publisher’s perspective: the author gets good links back to their site for promotion, and publishers get quality content without having to write it themselves.

I thought it might be handy to have a script that could scrape an ezine website for articles and save them in a database for later use. A bit of Googling revealed no existing scripts to do this sort of thing, so I decided to write one myself.

The script I wrote performs a search on EzineArticles.com, grabs the top 25 results, downloads the full content of each article, and stores everything in an SQLite database.

Scaling this up with a keyword list as input should make it possible to source thousands of articles. Used correctly, this script could generate massive websites packed with content in just a few minutes.
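The keyword-list idea boils down to a simple driver loop. Here’s a sketch; `harvest` and `search_and_store` are illustrative names, not part of the script below, and the callable you pass in would wrap the script’s actual search/parse/store steps:

```python
def harvest(keywords, search_and_store):
    """Run the scrape pipeline once per keyword; return per-keyword counts."""
    counts = {}
    for kw in keywords:
        # search_and_store is expected to scrape one keyword and
        # return how many articles it stored.
        counts[kw] = search_and_store(kw)
    return counts

# With the real script this would be something like:
#   harvest(keywords, lambda kw: store_results(parse_search_results(search(kw))))
# Here a stub standing in for the network calls keeps the sketch self-contained.
fetched = harvest(['seo', 'link building'], lambda kw: 25)
```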

Here’s the script:

import sys
import urllib2
import urllib
import sqlite3
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/ Safari/525.13'
conn = sqlite3.connect("ezines.sqlite")
conn.row_factory = sqlite3.Row
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS Ezines (`url`, `title`, `summary`, `tail`, `content`, `signature`)')
def transposed(lists):
    """Transpose a list of equal-length lists (rows <-> columns)."""
    if not lists:
        return []
    return map(lambda *row: list(row), *lists)
def search(query):
    """Run the search on EzineArticles.com and return the raw HTML."""
    # NOTE: the base search URL below is an assumption -- verify it
    # against the live site before use.
    url = 'http://ezinearticles.com/search/?q=' + '+'.join(query.split())
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
    return HTML
def parse_search_results(HTML):
    """Given the HTML from search(), parse the results into a list of
    (title, summary, tail) rows."""
    soup = BeautifulSoup(HTML)
    match_titles = soup.findAll(attrs={'class': 'srch_title'})
    match_sum = soup.findAll(attrs={'class': 'srch_sum'})
    match_tail = soup.findAll(attrs={'class': 'srch_tail'})
    return transposed([match_titles, match_sum, match_tail])
def get_article_content(url):
    """Parse the body and signature out of an article page."""
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
    soup = BeautifulSoup(HTML)
    return {'text': soup.find(id='body'), 'sig': soup.find(id='sig')}
def store_results(search_results):
    """Put the results into the sqlite database if they haven't already been downloaded."""
    c = conn.cursor()
    for row in search_results:
        title = row[0]
        summary = row[1]
        tail = row[2]
        link = title.find('a').get('href')
        have_url = c.execute('SELECT url from Ezines WHERE url=?', (link,)).fetchall()
        if not have_url:
            # NOTE: the base article URL below is an assumption -- verify it
            # against the live site before use.
            content = get_article_content('http://ezinearticles.com' + link)
            c.execute('INSERT INTO Ezines (`title`, `url`, `summary`, `tail`, `content`, `signature`) VALUES (?,?,?,?,?,?)',
                      (str(title), link, str(summary), str(tail),
                       str(content['text']), str(content['sig'])))
    conn.commit()
if __name__ == '__main__':
    # example usage
    page = search('seo')
    search_results = parse_search_results(page)
    store_results(search_results)
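The part of store_results worth exercising on its own is the dedupe check (SELECT before INSERT), since it’s what makes re-running the script safe. Here’s a minimal, self-contained sketch of the same pattern against an in-memory database; `store_if_new` and the sample article dict are illustrative names, not part of the script above, though the table matches the script’s schema:

```python
import sqlite3

# In-memory database with the same table the script creates.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS Ezines '
          '(`url`, `title`, `summary`, `tail`, `content`, `signature`)')

def store_if_new(row):
    """Insert the article only if its url hasn't been stored yet."""
    have_url = c.execute('SELECT url FROM Ezines WHERE url=?',
                         (row['url'],)).fetchall()
    if not have_url:
        c.execute('INSERT INTO Ezines VALUES (?,?,?,?,?,?)',
                  (row['url'], row['title'], row['summary'],
                   row['tail'], row['content'], row['signature']))
    conn.commit()

article = {'url': '/example-article.html', 'title': 'Example',
           'summary': 'A summary', 'tail': 'tail text',
           'content': 'Body text', 'signature': 'Author sig'}
store_if_new(article)
store_if_new(article)  # second call is a no-op: url already stored
count = c.execute('SELECT COUNT(*) FROM Ezines').fetchone()[0]
```

Parameter substitution (`?`) is used throughout, as in the script, so scraped titles containing quotes can’t break the SQL.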