Category Archives: Internet

A reader suggested that it might be useful to have a script that could fetch an RSS feed, translate it to another language, and republish that feed somewhere else. Thankfully that’s pretty easy to do in Python.

I wrote this script by taking bits and pieces from some of the other scripts that I’ve posted on this blog in the past. It’s surprising just how much of a resource this site has turned into.

It uses the Google Translate service to convert the RSS feed content from one language to another and simply echoes the new RSS content to standard output. If you wanted to republish the content, you could direct the output to a file and upload that to your web server.

Example Usage:

$ python translateRSS.py
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"><channel><title>HalOtis Marketing</title><link>http://www.halotis.com</link><description>Esprit d&#39;entreprise dans le 21ème siècle</description>
.....
</channel></rss>

Here’s the Script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import feedparser  # available at feedparser.org
from translate import translate  # available at http://www.halotis.com/2009/07/20/translating-text-using-google-translate-and-python/
import PyRSS2Gen # available at http://www.dalkescientific.com/Python/PyRSS2Gen.html
 
import datetime 
import re
 
def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)
 
def translate_rss(sl, tl, url):
 
    d = feedparser.parse(url)
 
    #unfortunately feedparser doesn't output rss so we need to create the RSS feed using PyRSS2Gen
    items = [PyRSS2Gen.RSSItem( 
        title = translate(sl, tl, x.title), 
        link = x.link, 
        description = translate(sl, tl, remove_html_tags(x.summary)), 
        guid = x.link, 
        pubDate = datetime.datetime( 
            x.modified_parsed[0], 
            x.modified_parsed[1], 
            x.modified_parsed[2], 
            x.modified_parsed[3], 
            x.modified_parsed[4], 
            x.modified_parsed[5])) 
        for x in d.entries]
 
    rss = PyRSS2Gen.RSS2( 
        title = d.feed.title, 
        link = d.feed.link, 
        description = translate(sl, tl, d.feed.description), 
        lastBuildDate = datetime.datetime.now(), 
        items = items) 
    #emit the feed 
    xml = rss.to_xml()
 
    return xml
 
if __name__ == '__main__':
  feed = translate_rss('en', 'fr', 'http://www.halotis.com/feed/')
  print feed
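
If you want to republish the translated feed rather than just print it, here is a minimal sketch. It assumes the script above is saved as translateRSS.py; the output filename is just an example, and you would upload that file to your web server.

from translateRSS import translate_rss
 
# Write the translated feed to a file instead of printing it.
feed = translate_rss('en', 'fr', 'http://www.halotis.com/feed/')
out = open('translated_feed.xml', 'w')  # example filename
out.write(feed)
out.close()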

This isn’t my script but I thought it would appeal to the readers of this blog. It’s a script that will look up the Google PageRank for any website, and it uses the same interface as the Google Toolbar to do it. I’d like to thank Fred Cirera for writing it, and you can check out his blog post about this script here.

I’m not exactly sure what I would use this for, but it might have applications for anyone who wants to do some really advanced SEO work, such as PageRank sculpting or finding the best websites to get links from.

The reason it is such an involved bit of math is that it needs to compute a checksum in order to work. It should be pretty reliable since it doesn’t involve any scraping.

Example usage:

$ python pagerank.py http://www.google.com/
PageRank: 10	URL: http://www.google.com/
 
$ python pagerank.py http://www.mozilla.org/
PageRank: 9	URL: http://www.mozilla.org/
 
$ python pagerank.py http://halotis.com
PageRank: 3   URL: http://www.halotis.com/

And the script:

#!/usr/bin/env python
#
#  Script for getting Google Page Rank of page
#  Google Toolbar 3.0.x/4.0.x Pagerank Checksum Algorithm
#
#  original from http://pagerank.gamesaga.net/
#  this version was adapted from http://www.djangosnippets.org/snippets/221/
#  by Corey Goldberg - 2010
#
#  Licensed under the MIT license: http://www.opensource.org/licenses/mit-license.php
 
 
 
import sys
import urllib
 
 
def get_pagerank(url):
    hsh = check_hash(hash_url(url))
    gurl = 'http://www.google.com/search?client=navclient-auto&features=Rank:&q=info:%s&ch=%s' % (urllib.quote(url), hsh)
    try:
        f = urllib.urlopen(gurl)
        rank = f.read().strip()[9:]
    except Exception:
        rank = 'N/A'
    if rank == '':
        rank = '0'
    return rank
 
 
def int_str(string, integer, factor):
    for i in range(len(string)) :
        integer *= factor
        integer &= 0xFFFFFFFF
        integer += ord(string[i])
    return integer
 
 
def hash_url(string):
    c1 = int_str(string, 0x1505, 0x21)
    c2 = int_str(string, 0, 0x1003F)
 
    c1 >>= 2
    c1 = ((c1 >> 4) & 0x3FFFFC0) | (c1 & 0x3F)
    c1 = ((c1 >> 4) & 0x3FFC00) | (c1 & 0x3FF)
    c1 = ((c1 >> 4) & 0x3C000) | (c1 & 0x3FFF)
 
    t1 = (c1 & 0x3C0) << 4
    t1 |= c1 & 0x3C
    t1 = (t1 << 2) | (c2 & 0xF0F)
 
    t2 = (c1 & 0xFFFFC000) << 4
    t2 |= c1 & 0x3C00
    t2 = (t2 << 0xA) | (c2 & 0xF0F0000)
 
    return (t1 | t2)
 
 
def check_hash(hash_int):
    hash_str = '%u' % (hash_int)
    flag = 0
    check_byte = 0
 
    i = len(hash_str) - 1
    while i >= 0:
        byte = int(hash_str[i])
        if 1 == (flag % 2):
            byte *= 2
            byte = byte / 10 + byte % 10
        check_byte += byte
        flag += 1
        i -= 1
 
    check_byte %= 10
    if 0 != check_byte:
        check_byte = 10 - check_byte
        if 1 == flag % 2:
            if 1 == check_byte % 2:
                check_byte += 9
            check_byte >>= 1
 
    return '7' + str(check_byte) + hash_str
 
 
 
if __name__ == '__main__':
    if len(sys.argv) != 2:
        url = 'http://www.google.com/'
    else:
        url = sys.argv[1]
 
    print get_pagerank(url)
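
The competitive-research idea above is basically a loop over candidate URLs. Here is a minimal sketch that continues from the script, assuming get_pagerank is importable or pasted alongside it; the URLs are just placeholders.

# Check a handful of candidate sites in one go; the URLs are placeholders.
candidates = [
    'http://www.google.com/',
    'http://www.mozilla.org/',
    'http://www.halotis.com/',
]
for url in candidates:
    print 'PageRank: %s\tURL: %s' % (get_pagerank(url), url)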

In yet another installment of my series of web scrapers, this time I’m posting some code that will scrape links from delicious.com. This is a pretty cool way of finding links that other people have found relevant, and it could be used to generate useful content for visitors.

You could easily add this to a WordPress blogging robot script so that the newest links are posted in a weekly digest post (a rough sketch of that follows the script below). This type of promotion will get noticed by the people you link to, and spreads some of that link love. It will hopefully result in some reciprocal links for your site.

Another idea would be to create a link directory and seed it with links gathered from delicious. Or you could create a widget of the hottest links in your niche that automatically gets updated.

This script makes use of the BeautifulSoup library for parsing the HTML pages.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
"""
Scraper for Del.icio.us SERP.
 
This pulls the results for a match for a query on http://del.icio.us.
"""
 
import urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
def get_delicious_results(query, page_limit=10):
 
    page = 1
    links = []
 
    while page < page_limit :
        url='http://delicious.com/search?p=' + '%20'.join(query.split()) + '&context=all&lc=1&page=' + str(page)
        req = urllib2.Request(url)
        HTML = urllib2.urlopen(req).read()
        soup = BeautifulSoup(HTML)
 
        next = soup.find('a', attrs={'class':re.compile('.*next$', re.I)})
 
        #links is a list of (url, title) tuples
        links +=   [(link['href'], ''.join(link.findAll(text=True)) ) for link in soup.findAll('a', attrs={'class':re.compile('.*taggedlink.*', re.I)}) ]
 
        if next :
            page = page+1
        else :
            break
 
    return links
 
if __name__=='__main__':
    links = get_delicious_results('halotis marketing')
    print links
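
As a rough idea of the weekly digest approach mentioned above, here is a minimal sketch that formats the scraped (url, title) tuples as an HTML list. Publishing the digest is left to whatever posting code you use, for example the wordpressPublish function shown later in this archive.

# Build a simple HTML digest of the scraped links.
links = get_delicious_results('halotis marketing')
digest = '<ul>\n'
for url, title in links:
    digest += '<li><a href="%s">%s</a></li>\n' % (url, title)
digest += '</ul>'
print digest  # or hand this string to your blog-posting function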

Today’s script will perform a search on Technorati and then scrape out the search results. It is useful because Technorati is up to date about things that are happening in the blogosphere. And that gives you a way to tune into everything going on there.

The scope of the blogosphere, matched with Technorati’s ability to sort the results by the most recent, is what makes this very powerful. This script will help you find up-to-the-moment content which you can then data-mine for whatever purposes you want.

Possible uses:

  • Create a tag cloud of what is happening today within your niche
  • Aggregate the content into your own site
  • Post it to Twitter
  • Convert the search results into an RSS feed (a rough sketch follows the script below)

And here’s the Python code:

import urllib2
 
from BeautifulSoup import BeautifulSoup
 
def get_technorati_results(query, page_limit=10):
 
    page = 1
    links = []
 
    while page < page_limit :
        url='http://technorati.com/search/' + '+'.join(query.split()) + '?language=n&page=' + str(page)
        req = urllib2.Request(url)
        HTML = urllib2.urlopen(req).read()
        soup = BeautifulSoup(HTML)
 
        next = soup.find('li', attrs={'class':'next'}).find('a')
 
        #links is a list of (url, summary, title) tuples
        links +=   [(link.find('blockquote')['cite'], ''.join(link.find('blockquote').findAll(text=True)), ''.join(link.find('h3').findAll(text=True))) for link in soup.find('div', id='results').findAll('li', attrs={'class':'hentry'})]
 
        if next :
            page = page+1
        else :
            break
 
    return links
 
if __name__=='__main__':
    links = get_technorati_results('halotis marketing')
    print links
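
Here is a minimal sketch of the RSS idea from the list above, continuing from the script and reusing PyRSS2Gen (the same library used in the RSS translation script earlier in this archive). The feed title, link and description are placeholder values.

import datetime
import PyRSS2Gen  # available at http://www.dalkescientific.com/Python/PyRSS2Gen.html
 
# Convert the (url, summary, title) tuples into an RSS feed.
results = get_technorati_results('halotis marketing')
items = [PyRSS2Gen.RSSItem(title=title, link=url, description=summary, guid=url)
         for url, summary, title in results]
rss = PyRSS2Gen.RSS2(
    title='Technorati results for halotis marketing',   # placeholder
    link='http://technorati.com/',                       # placeholder
    description='Latest blog posts matching the query',  # placeholder
    lastBuildDate=datetime.datetime.now(),
    items=items)
print rss.to_xml()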

Sometimes it’s useful to know where all the back-links to a website are coming from.

As a competitor it can give you information about how your competition is promoting their site. You can shortcut the process of finding the good places to get links from, and who might be a client or a good contact for your business by finding out who is linking to your competitors.

If you’re buying or selling a website, the number and quality of back-links help determine its value. Checking the links to a site should be on the checklist you use when buying a website.

With that in mind I wrote a short script that scrapes the links to a particular domain from the list that Alexa provides.

import urllib2
 
from BeautifulSoup import BeautifulSoup
 
def get_alexa_linksin(domain):
 
    page = 0
    linksin = []
 
    while True :
        url='http://www.alexa.com/site/linksin;'+str(page)+'/'+domain
        req = urllib2.Request(url)
        HTML = urllib2.urlopen(req).read()
        soup = BeautifulSoup(HTML)
 
        next = soup.find(id='linksin').find('a', attrs={'class':'next'})
 
        linksin += [(link['href'], link.string) for link in soup.find(id='linksin').findAll('a')]
 
        if next :
            page = page+1
        else :
            break
 
    return linksin
 
if __name__=='__main__':
    linksin = get_alexa_linksin('halotis.com')
    print linksin

I noticed that several accounts are spamming the twitter trends. Go to twitter.com and select one of the trends in the right column. You’ll undoubtedly see some tweets that are blatantly inserting words from the trending topics list into unrelated ads.

I was curious just how easy it would be to get the trending topics to target them with tweets. Turns out it is amazingly simple and shows off some of the beauty of Python.

This script doesn’t actually do anything with the trend information; it simply downloads and prints out the list. But combine this code with the sample code from RSS Twitter Bot in Python and you’ll have a recipe for some seriously powerful promotion (a rough sketch follows the snippet below).

import simplejson  # http://undefined.org/python/#simplejson
import urllib
 
result = simplejson.load(urllib.urlopen('http://search.twitter.com/trends.json'))
 
print [trend['name'] for trend in result['trends']]
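
As a rough sketch of the targeting idea, here is how the trend names could be worked into tweet text. Actually posting the message is left to the RSS Twitter Bot code mentioned above; post_tweet below is only a hypothetical placeholder for that step.

# Work the current trend names into tweet text; posting is left to your bot.
trends = [trend['name'] for trend in result['trends']]
for name in trends[:3]:
    message = 'Talking about %s today on http://www.halotis.com/' % name
    print message  # replace with your posting call, e.g. post_tweet(message)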

WordPress is probably the best blogging software out there. This site runs on WordPress. It’s easy to install, amazingly extensible with themes and plugins and very easy to use. In fact the vast majority of the websites I maintain run on WordPress.

wordpresslib is a Python library that makes it possible to programmatically put new content onto a blog. It works with both self-hosted blogs and the freely hosted wordpress.com blogs, and it gives you the power to do these tasks:

  • Publishing new posts
  • Editing old posts
  • Publishing draft posts
  • Deleting posts
  • Changing post categories
  • Getting blog and user information
  • Uploading multimedia files like movies or photos
  • Getting the most recent posts
  • Getting the last post
  • Getting trackbacks of a post
  • Getting pingbacks of a post

When used in conjunction with some of the other scripts I have shared on this site, such as Getting Ezine Article Content Automatically with Python, Translating Text Using Google Translate and Python, and How To Get RSS Content Into An Sqlite Database With Python – Fast, it is possible to build a very powerful blog posting robot.

Here’s an example of just how simple it is to send a new post to a wordpress blog:

import wordpresslib
 
def wordpressPublish(url, username, password, content, title, category):
 
	wp = wordpresslib.WordPressClient(url, username, password)
	wp.selectBlog(0)
 
	post = wordpresslib.WordPressPost()
	post.title = title
	post.description = content
	post.categories = (wp.getCategoryIdFromName(category),)
 
	idNewPost = wp.newPost(post, True)
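
As a rough sketch of the blog posting robot idea mentioned above, the wordpressPublish function can be combined with the translate function covered elsewhere on this site. The URL (which for wordpresslib normally points at the blog’s XML-RPC endpoint), credentials, category and sample text below are all placeholders.

from translate import translate  # from the Google Translate post on this site
 
# Round-trip translate some source text and post the result.
source_text = 'Some article text gathered by another script.'  # placeholder
unique_text = translate('fr', 'en', translate('en', 'fr', source_text))
wordpressPublish('http://www.example.com/xmlrpc.php', 'admin', 'password',
                 unique_text, 'Automated digest post', 'News')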

I make use of this on a daily basis in various scripts I have that re-purpose content from other places and push it onto several aggregation blogs that I have. Over the next few posts I’ll be revealing exactly how that system works so stay tuned. (and make sure you’re subscribed to the RSS feed)

If you’re not familiar with ezine articles, they are basically niche content, about 200 to 2000 words long, that some ‘expert’ writes and shares for republishing under the stipulation that it includes the signature (and usually a link) of the author. Articles are great from both the advertiser and publisher perspective, since the author can get good links back to their site for promotion and the publishers get quality content without having to write it themselves.

I thought it might be handy to have a script that could scrape an ezine website for articles and save them in a database for later use. A bit of Googling revealed no scripts out there to do this sort of thing so I decided to write it myself.

The script I wrote will perform a search on ezinearticles.com and then get the top 25 results and download all the content of the articles and store them in an sqlite database.

Scaling this up should make it possible to source thousands of articles using a keyword list as input (a rough sketch follows the script below). Used correctly, this script could generate massive websites packed with content in just a few minutes.

Here’s the script:

import sys
import urllib2
import urllib
import sqlite3
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13'
 
conn = sqlite3.connect("ezines.sqlite")
conn.row_factory = sqlite3.Row
 
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS Ezines (`url`, `title`, `summary`, `tail`, `content`, `signature`)')
conn.commit()
 
def transposed(lists):
   if not lists: return []
   return map(lambda *row: list(row), *lists)
 
def search(query):
    """Runs the search on ezineartles.com and returns the HTML
    """
    url='http://ezinearticles.com/search/?q=' + '+'.join(query.split())
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
    return HTML
 
def parse_search_results(HTML):
    """Givin the result of the search function this parses out the results into a list
    """
    soup = BeautifulSoup(HTML)
    match_titles = soup.findAll(attrs={'class':'srch_title'})
    match_sum = soup.findAll(attrs={'class':'srch_sum'})
    match_tail = soup.findAll(attrs={'class':'srch_tail'})
 
    return transposed([match_titles, match_sum, match_tail])
 
def get_article_content(url):
    """Parse the body and signature from the content of an article
    """
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
 
    soup = BeautifulSoup(HTML)
    return {'text':soup.find(id='body'), 'sig':soup.find(id='sig')}
 
def store_results(search_results):
    """put the results into an sqlite database if they haven't already been downloaded.
    """
    c = conn.cursor()
    for row in search_results:
        title = row[0]
        summary = row[1]
        tail = row[2]
 
        link = title.find('a').get('href')
        have_url = c.execute('SELECT url from Ezines WHERE url=?', (link, )).fetchall()
        if not have_url:
            content = get_article_content('http://ezinearticles.com' + link)
            c.execute('INSERT INTO Ezines (`title`, `url`, `summary`, `tail`, `content`, `signature`) VALUES (?,?,?,?,?,?)', 
                      (title.find('a').find(text=True), 
                       link, 
                       summary.find(text=True), 
                       tail.find(text=True), 
                       str(content['text']), 
                       str(content['sig'])) )
 
    conn.commit()
 
if __name__=='__main__':
    #example usage
    page = search('seo')
    search_results = parse_search_results(page)
 
    store_results(search_results)
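
Here is a minimal sketch of the keyword-list idea mentioned above, continuing from the script; the keywords are only examples.

# Scrape and store results for a whole list of keywords.
keywords = ['seo', 'affiliate marketing', 'email marketing']
for keyword in keywords:
    page = search(keyword)
    store_results(parse_search_results(page))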

Sometimes it can be quite useful to be able to translate content from one language to another from within a program. There are many compelling reasons why you might like the idea of auto-translating text. The reason I’m interested in writing this script is that it is sometimes useful to create unique content online for SEO reasons. Search engines like to see unique content rather than words that have been copied and pasted from other websites. What you’re looking for in web content is:

  1. A lot of it.
  2. Highly related to the keywords you’re targeting.

When trying to get a great position in the organic search results, it is important to recognize that you’re competing against an army of low-cost outsourced people who are pumping out page after page of mediocre content and then running scripts to generate thousands of back-links to the sites they are trying to rank. It is practically impossible to get the top spot for any desirable keyword if you’re writing all the content yourself. You need some help with this.

That’s where Google Translate comes in.

Take an article from somewhere, push it through a round trip of translation such as English->French->English, and the content will then be unique enough that it won’t raise any flags that it has been copied from somewhere else on the internet. The content may not be very readable, but it will make good fodder for the search engines to eat up.

Using this technique it is possible to build massive websites of unique content overnight and have it quickly rank highly.

Unfortunately Google doesn’t provide an API for translating text.  That means the script has to resort to scraping which is inherently prone to breaking.  The script uses BeautifulSoup to help with the parsing of the HTML content. (Note: I had to use the older 3.0.x series of BeautifulSoup to successfully parse the content)

The code for this was based on this script by technobabble.

import sys
import urllib2
import urllib
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
def translate(sl, tl, text):
    """ Translates a given text from source language (sl) to
        target language (tl) """
 
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)')]
 
    translated_page = opener.open(
        "http://translate.google.com/translate_t?" + 
        urllib.urlencode({'sl': sl, 'tl': tl}),
        data=urllib.urlencode({'hl': 'en',
                               'ie': 'UTF8',
                               'text': text.encode('utf-8'),
                               'sl': sl, 'tl': tl})
    )
 
    translated_soup = BeautifulSoup(translated_page)
 
    return translated_soup('div', id='result_box')[0].string
 
if __name__=='__main__':
    print translate('en', 'fr', u'hello')

To generate unique content you can use this within your own python program like this:

from translate import translate
 
content = get_content()
new_content = translate('fr', 'en', translate('en','fr', content))
publish_content(new_content)

This script will create an image on the fly of a user’s most recent Twitter message. It could be used as an email or forum signature or any place that allows you to embed a custom image, such as on a blog or website.

I saw a website that did this the other day and wanted to try to duplicate the functionality.  Turns out it was pretty trivial even for someone with very little PHP experience. So I felt inspired enough to create a new website based on this script and called it TwitSig.us. Check it out.

It creates an image showing the user’s latest tweet.

And here’s the code that does it:

<?php
include "twitter.php"; // from http://twitter.slawcup.com/twitter.class.phps
 
$t = new twitter();
$res = $t->userTimeline($_GET["user"], 1);
 
$my_img = imagecreatefrompng ( "base.png" );
 
$grey = imagecolorallocate( $my_img, 150, 150, 150 );
$red = imagecolorallocate( $my_img, 255, 0,  0 );
$text_colour = imagecolorallocate( $my_img, 0, 0, 0 );
 
if($res===false){
	imagestring( $my_img, 4, 30, 25, "no messages at this time",
	  $text_colour );
} else {
	$newtext = wordwrap($res->status->text, 65, "\n");
	imagettftext( $my_img, 10, 0, 10, 35, $text_colour, "Arial.ttf", $newtext);
	imagettftext( $my_img, 10, 0, 90, 15, $red, "Arial Bold.ttf", "@".$_GET["user"]);
	imagettftext( $my_img, 10, 0, 225, 15, $grey, "Arial.ttf", strftime("%a %d %b %H:%M %Y", strtotime($res->status->created_at)));
}
 
header( "Content-type: image/png" );
imagepng( $my_img );
?>

To get this script working for yourself you’ll need to make sure that you have the two font files and the base.png file for the background image that the text is put on.