WordPress is probably the best blogging software out there. This site runs on WordPress. It’s easy to install, amazingly extensible with themes and plugins, and very easy to use. In fact, the vast majority of the websites I maintain run on WordPress.

wordpresslib is a Python library that makes it possible to programmatically put new content onto a blog. It works with both self-hosted blogs and the freely hosted wordpress.com blogs, and it gives you the power to do these tasks:

  • Publishing new posts
  • Editing old posts
  • Publishing draft posts
  • Deleting posts
  • Changing post categories
  • Getting blog and user information
  • Uploading multimedia files like movies or photos
  • Getting the most recent posts
  • Getting the last post
  • Getting trackbacks of a post
  • Getting pingbacks of a post

When used in conjunction with some of the other scripts I have shared on this site, such as Getting Ezine Article Content Automatically with Python, Translating Text Using Google Translate and Python, and How To Get RSS Content Into An Sqlite Database With Python – Fast, it is possible to build a very powerful blog posting robot.

Here’s an example of just how simple it is to send a new post to a WordPress blog:

import wordpresslib

def wordpressPublish(url, username, password, content, title, category):
    # Connect to the blog's XML-RPC endpoint and select the first blog on the account
    wp = wordpresslib.WordPressClient(url, username, password)
    wp.selectBlog(0)

    # Build the post; categories takes a tuple of category IDs
    post = wordpresslib.WordPressPost()
    post.title = title
    post.description = content
    post.categories = (wp.getCategoryIdFromName(category),)

    # True publishes immediately instead of saving a draft
    idNewPost = wp.newPost(post, True)
    return idNewPost
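For example, a call might look like this (the URL and credentials below are placeholders, not a real blog):

# Hypothetical usage; replace the URL and credentials with your own.
# Self-hosted WordPress blogs expose the XML-RPC API at /xmlrpc.php.
wordpressPublish('http://www.example.com/xmlrpc.php', 'admin', 'secret',
                 'Post body goes here.', 'My First Automated Post', 'Uncategorized')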

I make use of this on a daily basis in various scripts that re-purpose content from other places and push it onto several aggregation blogs that I run. Over the next few posts I’ll be revealing exactly how that system works, so stay tuned (and make sure you’re subscribed to the RSS feed).

If you’re not familiar with Ezine articles, they are basically niche content, about 200 to 2000 words long, that some ‘expert’ writes and shares for republishing under the stipulation that the article keeps the author’s signature (and usually a link). Articles are great from both the advertiser and publisher perspectives, since the author gets good links back to their site for promotion and the publisher gets quality content without having to write it themselves.

I thought it might be handy to have a script that could scrape an ezine website for articles and save them in a database for later use. A bit of Googling revealed no existing scripts to do this sort of thing, so I decided to write one myself.

The script I wrote performs a search on ezinearticles.com, grabs the top 25 results, downloads the full content of each article, and stores everything in an SQLite database.

Scaling this up should make it possible to source thousands of articles using a keyword list as input (a sketch of that idea follows). Used correctly, this script could generate massive websites packed with content in just a few minutes.
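As a rough sketch (the keywords.txt file name is hypothetical), the scaling could be as simple as looping the search, parse_search_results, and store_results functions from the script below over a keyword list:

# Hypothetical scaling sketch: run the scraper once per keyword in a list,
# using the search/parse_search_results/store_results functions defined below
with open('keywords.txt') as f:
    for keyword in f:
        keyword = keyword.strip()
        if keyword:
            store_results(parse_search_results(search(keyword)))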

Here’s the script:

import sys
import urllib2
import sqlite3
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13'
 
conn = sqlite3.connect("ezines.sqlite")
conn.row_factory = sqlite3.Row
 
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS Ezines (`url`, `title`, `summary`, `tail`, `content`, `signature`)')
conn.commit()
 
def transposed(lists):
    # Turn a list of columns into a list of rows (like zip, but returns lists)
    if not lists: return []
    return map(lambda *row: list(row), *lists)
 
def search(query):
    """Runs the search on ezineartles.com and returns the HTML
    """
    url='http://ezinearticles.com/search/?q=' + '+'.join(query.split())
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
    return HTML
 
def parse_search_results(HTML):
    """Givin the result of the search function this parses out the results into a list
    """
    soup = BeautifulSoup(HTML)
    match_titles = soup.findAll(attrs={'class':'srch_title'})
    match_sum = soup.findAll(attrs={'class':'srch_sum'})
    match_tail = soup.findAll(attrs={'class':'srch_tail'})
 
    return transposed([match_titles, match_sum, match_tail])
 
def get_article_content(url):
    """Parse the body and signature from the content of an article
    """
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
 
    soup = BeautifulSoup(HTML)
    return {'text':soup.find(id='body'), 'sig':soup.find(id='sig')}
 
def store_results(search_results):
    """put the results into an sqlite database if they haven't already been downloaded.
    """
    c = conn.cursor()
    for row in search_results:
        title = row[0]
        summary = row[1]
        tail = row[2]
 
        link = title.find('a').get('href')
        have_url = c.execute('SELECT url from Ezines WHERE url=?', (link, )).fetchall()
        if not have_url:
            content = get_article_content('http://ezinearticles.com' + link)
            c.execute('INSERT INTO Ezines (`title`, `url`, `summary`, `tail`, `content`, `signature`) VALUES (?,?,?,?,?,?)', 
                      (title.find('a').find(text=True), 
                       link, 
                       summary.find(text=True), 
                       tail.find(text=True), 
                       str(content['text']), 
                       str(content['sig'])) )
 
    conn.commit()
 
if __name__=='__main__':
    #example usage
    page = search('seo')
    search_results = parse_search_results(page)
 
    store_results(search_results)
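Once it has run, pulling the stored articles back out is a simple query; a minimal sketch:

import sqlite3

# Minimal sketch: list the titles and URLs of everything collected so far
conn = sqlite3.connect('ezines.sqlite')
conn.row_factory = sqlite3.Row
for row in conn.execute('SELECT title, url FROM Ezines'):
    print row['title'], row['url']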

This script will create an image on the fly of a user’s most recent Twitter message. It could be used as an email or forum signature, or any place that allows you to embed a custom image, such as on a blog or website.

I saw a website that did this the other day and wanted to try to duplicate the functionality. It turns out it was pretty trivial, even for someone with very little PHP experience. I felt inspired enough to create a new website based on this script and called it TwitSig.us. Check it out.

It creates an image with the user’s @username and the tweet timestamp across the top, and the tweet text drawn below.

And here’s the code that does it:

<?php
include "twitter.php"; // from http://twitter.slawcup.com/twitter.class.phps
 
$t = new twitter();
$res = $t->userTimeline($_GET["user"], 1);
 
$my_img = imagecreatefrompng ( "base.png" );
 
$grey = imagecolorallocate( $my_img, 150, 150, 150 );
$red = imagecolorallocate( $my_img, 255, 0,  0 );
$text_colour = imagecolorallocate( $my_img, 0, 0, 0 );
 
if($res===false){
	imagestring( $my_img, 4, 30, 25, "no messages at this time",
	  $text_colour );
} else {
	$newtext = wordwrap($res->status->text, 65, "\n");
	imagettftext( $my_img, 10, 0, 10, 35, $text_colour, "Arial.ttf", $newtext);
	imagettftext( $my_img, 10, 0, 90, 15, $red, "Arial Bold.ttf", "@".$_GET["user"]);
	imagettftext( $my_img, 10, 0, 225, 15, $grey, "Arial.ttf", strftime("%a %d %b %H:%M %Y", strtotime($res->status->created_at)));
}
 
header( "Content-type: image/png" );
imagepng( $my_img );
?>

To get this script working for yourself, you’ll need the two font files (Arial.ttf and Arial Bold.ttf) and the base.png background image that the text is drawn onto.

Another in my series of Python scripting blog posts. This time I’m sharing a script that can rip through RSS feeds, devour their content, and stuff it into a database in a way that scales up to thousands of feeds. To accomplish this, the script is multi-threaded.

The big problem with scaling up a web script like this is the huge amount of latency in requesting anything over the internet. Between bandwidth and remote processing time, it can take a couple of seconds to get a response back. Requesting one feed after another in series wastes a lot of time, and that makes this type of script a prime candidate for threading.

I borrowed parts of this script from this post: Threaded data collection with Python, including examples

What could you do with all this content? Just off the top of my head I can think of many interesting things to do:

  • Create histograms of the publish times of posts to find the most and least popular days and times for publishing (a sketch of this follows the list)
  • Plot trends of certain words or phrases over time
  • Create your own aggregation website
  • Get trending topics by counting the occurrence of words by day
  • Try writing some natural language processing algorithms
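For the first idea, here is a minimal sketch, assuming the rss.sqlite database and RSSEntries table created by the script below, that counts stored entries per hour of the day:

import sqlite3

# Count entries per hour of day; the date column holds "%Y-%m-%d %H:%M:%S"
# strings, so SQLite's strftime can extract the hour directly
conn = sqlite3.connect('rss.sqlite')
for hour, count in conn.execute(
        "SELECT strftime('%H', date) AS hour, COUNT(*) FROM RSSEntries "
        "GROUP BY hour ORDER BY hour"):
    print hour, '#' * count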

This script is coded at 20 threads, but that really needs to be fine-tuned for the best performance. Depending on your bandwidth and the sites you want to grab, you may want to tweak the THREAD_LIMIT value.

import sqlite3
import threading
import Queue
from time import strftime
 
import feedparser     # available at http://feedparser.org
 
 
THREAD_LIMIT = 20
jobs = Queue.Queue(0)
rss_to_process = Queue.Queue(THREAD_LIMIT)
 
DATABASE = "rss.sqlite"
 
conn = sqlite3.connect(DATABASE)
conn.row_factory = sqlite3.Row
c = conn.cursor()
 
#insert initial values into feed database
c.execute('CREATE TABLE IF NOT EXISTS RSSFeeds (id INTEGER PRIMARY KEY AUTOINCREMENT, url VARCHAR(1000));')
c.execute('CREATE TABLE IF NOT EXISTS RSSEntries (entry_id INTEGER PRIMARY KEY AUTOINCREMENT, id, url, title, content, date);')
c.execute("INSERT INTO RSSFeeds(url) VALUES('http://www.halotis.com/feed/');")
 
feeds = c.execute('SELECT id, url FROM RSSFeeds').fetchall()
 
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stored them in the DB """
    for entry in items:
        c.execute('SELECT entry_id from RSSEntries WHERE url=?', (entry.link,))
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, date) VALUES (?,?,?,?,?)', (id, entry.link, entry.title, entry.summary, strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed)))
 
def thread():
    # Worker: pull feed URLs off the jobs queue until it is empty
    while True:
        try:
            id, feed_url = jobs.get(False) # False = Don't wait
        except Queue.Empty:
            return
 
        entries = feedparser.parse(feed_url).entries
        rss_to_process.put((id, entries), True) # This will block if full
 
for info in feeds: # Queue them up
    jobs.put([info['id'], info['url']])
 
for n in xrange(THREAD_LIMIT):
    t = threading.Thread(target=thread)
    t.start()
 
while threading.activeCount() > 1 or not rss_to_process.empty():
    # That condition means we want to do this loop if there are threads
    # running OR there's stuff to process
    try:
        id, entries = rss_to_process.get(True, 1) # Block for up to a second
    except Queue.Empty:
        continue
 
    store_feed_items(id, entries)
 
conn.commit()
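The script seeds the database with a single feed; to track more, just insert their URLs into the RSSFeeds table. A quick sketch (the feed URLs here are placeholders):

import sqlite3

# Hypothetical helper: register extra feeds to track; the URLs are placeholders
conn = sqlite3.connect('rss.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS RSSFeeds (id INTEGER PRIMARY KEY AUTOINCREMENT, url VARCHAR(1000));')
for url in ('http://www.example.com/feed/', 'http://www.example.org/rss.xml'):
    if not conn.execute('SELECT id FROM RSSFeeds WHERE url=?', (url,)).fetchall():
        conn.execute('INSERT INTO RSSFeeds(url) VALUES (?)', (url,))
conn.commit()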