Tag Archives: scripting

Last night I had one of those sleepless nights. I’m sure you have had one of these before – after hearing a great idea your mind starts spinning with the possibilities and there’s no way you’ll be able to sleep. I got excited last night about a new approach to Google Adwords that just might have a lot of potential.

Google Adwords has never really proven to be a profitable way to drive traffic for me (though Microsoft Adcenter has). However, several times a year for the past four years I have heard a little tip or seen someone use it successfully, and have become intrigued enough to dive back in and test the waters again. Each time my testing has been plagued with failure and I have lost thousands of dollars trying to find a system that works.

Yesterday I got a tip, something new that I haven’t yet tried but that sounded promising. And so over the next few weeks I’m going to be running some tests. The problem with the approach I’m testing is that it requires creating a MASSIVE number of keyword targeted ads – a total of over 100,000 ads per niche.

It took me 2.5 hours last night to manually create 400 of the 100,000 ads I need (for the one niche I’m going to test first). There’s no feasible way to create all those ads manually, and I’m not interested in spending yet more money on ebooks or software that claims to make money or magically do the work for me. So I am going to program some scripts myself to test the techniques. Whether it works or not, I will let you know and share the code right here on this blog.

The testing started last night. Check back next week for the preliminary results (and maybe a hint about what I’m doing).

This is more of a helpful snippet than a full program, but it can be handy to have some user agent strings available for web scraping.

Some websites check the user agent string and will filter the results of a request. It’s a very simple way to prevent automated scraping. But it is very easy to get around. The user agent can also be checked by spam filters to help detect automated posting.

A great resource for finding and understanding what user agent strings mean is UserAgentString.com.

This simple snippet reads a file containing the list of user agent strings that you want to use. It can source that file and return a random one from the list.

Here’s my source file UserAgents.txt:

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090913 Firefox/3.5.3
Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.1) Gecko/20090718 Firefox/3.5.1
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.1 (KHTML, like Gecko) Chrome/4.0.219.6 Safari/532.1
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.5.30729; .NET CLR 3.0.30729)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Win64; x64; Trident/4.0)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 2.0.50727; InfoPath.2)
Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)
Mozilla/4.0 (compatible; MSIE 6.1; Windows XP)

And here is the Python code that makes getting a random agent very simple:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import random
 
SOURCE_FILE='UserAgents.txt'
 
def get():
    """Return one random user agent string from the source file."""
    f = open(SOURCE_FILE)
    try:
        agents = f.readlines()
    finally:
        f.close()
    return random.choice(agents).strip()

def getAll():
    """Return every user agent string in the file, stripped of whitespace."""
    f = open(SOURCE_FILE)
    try:
        agents = f.readlines()
    finally:
        f.close()
    return [a.strip() for a in agents]

if __name__=='__main__':
    agents = getAll()
    for agent in agents:
        print agent

You can grab the source code for this along with my other scripts from the bitbucket repository.
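For example, here is roughly how you might use it to set the User-agent header on a urllib2 request (this assumes the snippet above is saved as a module named useragents.py; the URL is just a placeholder):

import urllib2

import useragents  # the snippet above, assumed to be saved as useragents.py

req = urllib2.Request('http://www.example.com/')
req.add_header('User-agent', useragents.get())  # pick a random browser string
html = urllib2.urlopen(req).read()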

Here’s a simple web crawling script that will start from one URL and find all the pages it links to, up to a pre-defined depth. Web crawling is of course the lowest-level tool used by Google to create its multi-billion dollar business. You may not be able to compete with Google’s search technology, but being able to crawl your own sites, or those of your competitors, can be very valuable.

You could, for instance, routinely check your websites to make sure they are live and all the links are working. It could notify you of any 404 errors (a rough sketch of such a check appears after the script below). By adding in a page rank check you could identify better linking strategies to boost your page rank scores. And you could identify possible leaks – paths a user could take that lead them away from where you want them to go.

Here’s the script:

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
from urllib2 import urlopen
 
class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)
        self.url = starting_url
        self.db = {self.url: 1}
        self.node = [self.url]
 
        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0
 
    def handle_starttag(self, tag, attrs):
        # only follow anchor tags, up to max_span links per page
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = dict(attrs).get('href')
            if not link:
                return
            if link[:4] != "http":
                # resolve relative links against the current url's domain
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')

            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)
            self.db[link] = (self.db.get(link) or 0) + 1
 
    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]
            self.node = []
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res = req.read()
                    self.feed(res)
                except Exception:
                    # skip pages that fail to download or parse
                    self.reset()
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)
        return zorted
 
if __name__ == "__main__":
    spidey = Spider(starting_url = 'http://www.7cerebros.com.ar', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." %(link,n, "s" if n is not 1 else "")

WordPress is probably the best blogging software out there. This site runs on WordPress. It’s easy to install, amazingly extensible with themes and plugins, and very easy to use. In fact, the vast majority of the websites I maintain run on WordPress.

wordpresslib is a Python library that makes it possible to programmatically put new content onto a blog. It works with both self-hosted blogs and the freely hosted wordpress.com blogs, and it gives you the power to do these tasks:

  • Publishing new posts
  • Editing old posts
  • Publishing draft posts
  • Deleting posts
  • Changing post categories
  • Getting blog and user information
  • Uploading multimedia files like movies or photos
  • Getting the most recent posts
  • Getting the last post
  • Getting trackbacks of a post
  • Getting pingbacks of a post

When used in conjunction with some of the other scripts I have shared on this site, such as Getting Ezine Article Content Automatically with Python, Translating Text Using Google Translate and Python, and How To Get RSS Content Into An Sqlite Database With Python – Fast, it is possible to build a very powerful blog posting robot.

Here’s an example of just how simple it is to send a new post to a WordPress blog:

import wordpresslib
 
def wordpressPublish(url, username, password, content, title, category):
    # url is the blog's XML-RPC endpoint, typically http://yourblog.com/xmlrpc.php
    wp = wordpresslib.WordPressClient(url, username, password)
    wp.selectBlog(0)

    post = wordpresslib.WordPressPost()
    post.title = title
    post.description = content
    post.categories = (wp.getCategoryIdFromName(category),)

    # the second argument publishes the post immediately instead of saving a draft
    idNewPost = wp.newPost(post, True)
    return idNewPost
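
A call along these lines would then publish a post and return its id (the URL, credentials and category here are placeholders; the category must already exist on the blog):

post_id = wordpressPublish('http://www.example.com/xmlrpc.php', 'admin', 'secret',
                           'This post was sent from a Python script.',
                           'Hello from wordpresslib', 'Uncategorized')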

I make use of this on a daily basis in various scripts that re-purpose content from other places and push it onto several aggregation blogs I run. Over the next few posts I’ll be revealing exactly how that system works, so stay tuned (and make sure you’re subscribed to the RSS feed).

If you’re not familiar with ezine articles, they are basically niche content, about 200 to 2,000 words long, that some ‘expert’ writes and shares for re-publication under the stipulation that the content includes the author’s signature (and usually a link). Articles are great from both the author’s and the publisher’s perspective: the author gets good links back to their site for promotion, and the publisher gets quality content without having to write it themselves.

I thought it might be handy to have a script that could scrape an ezine website for articles and save them in a database for later use. A bit of Googling revealed no scripts out there to do this sort of thing, so I decided to write one myself.

The script I wrote performs a search on ezinearticles.com, grabs the top 25 results, downloads the full content of each article, and stores it all in an SQLite database.

Scaling this up should make it possible to source thousands of articles using a keyword list as input (a rough sketch of that loop appears after the script). Used correctly, this script could generate massive websites packed with content in just a few minutes.

Here’s the script:

import sys
import urllib2
import urllib
import sqlite3
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13'
 
conn = sqlite3.connect("ezines.sqlite")
conn.row_factory = sqlite3.Row
 
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS Ezines (`url`, `title`, `summary`, `tail`, `content`, `signature`)')
conn.commit()
 
def transposed(lists):
   if not lists: return []
   return map(lambda *row: list(row), *lists)
 
def search(query):
    """Runs the search on ezineartles.com and returns the HTML
    """
    url='http://ezinearticles.com/search/?q=' + '+'.join(query.split())
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
    return HTML
 
def parse_search_results(HTML):
    """Givin the result of the search function this parses out the results into a list
    """
    soup = BeautifulSoup(HTML)
    match_titles = soup.findAll(attrs={'class':'srch_title'})
    match_sum = soup.findAll(attrs={'class':'srch_sum'})
    match_tail = soup.findAll(attrs={'class':'srch_tail'})
 
    return transposed([match_titles, match_sum, match_tail])
 
def get_article_content(url):
    """Parse the body and signature from the content of an article
    """
    req = urllib2.Request(url)
    req.add_header('User-agent', USER_AGENT)
    HTML = urllib2.urlopen(req).read()
 
    soup = BeautifulSoup(HTML)
    return {'text':soup.find(id='body'), 'sig':soup.find(id='sig')}
 
def store_results(search_results):
    """put the results into an sqlite database if they haven't already been downloaded.
    """
    c = conn.cursor()
    for row in search_results:
        title = row[0]
        summary = row[1]
        tail = row[2]
 
        link = title.find('a').get('href')
        have_url = c.execute('SELECT url from Ezines WHERE url=?', (link, )).fetchall()
        if not have_url:
            content = get_article_content('http://ezinearticles.com' + link)
            c.execute('INSERT INTO Ezines (`title`, `url`, `summary`, `tail`, `content`, `signature`) VALUES (?,?,?,?,?,?)', 
                      (title.find('a').find(text=True), 
                       link, 
                       summary.find(text=True), 
                       tail.find(text=True), 
                       str(content['text']), 
                       str(content['sig'])) )
 
    conn.commit()
 
if __name__=='__main__':
    #example usage
    page = search('seo')
    search_results = parse_search_results(page)
 
    store_results(search_results)
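
As mentioned above, scaling this up to a keyword list is mostly a matter of looping over the same search/store cycle. Here is a minimal sketch, assuming a hypothetical keywords.txt file with one keyword per line:

# replace the example usage above with a loop over a keyword list
keywords = [line.strip() for line in open('keywords.txt') if line.strip()]
for keyword in keywords:
    page = search(keyword)
    store_results(parse_search_results(page))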

Sometimes it can be quite useful to be able to translate content from one language to another from within a program. There are many compelling reasons why you might like the idea of auto-translating text. The reason I’m interested in writing this script is that it can be useful for creating unique content online for SEO purposes. Search engines like to see unique content rather than words that have been copied and pasted from other websites. What you’re looking for in web content is:

  1. A lot of it.
  2. Highly related to the keywords you’re targeting.

When trying to get a great position in the organic search results, it is important to recognize that you’re competing against an army of low-cost outsourced workers who are pumping out page after page of mediocre content and then running scripts to generate thousands of back-links to the sites they are trying to rank. It is all but impossible to get the top spot for any desirable keyword if you’re writing all the content yourself. You need some help with this.

That’s where Google Translate comes in.

Take an article from somewhere and push it through a round trip of translation, such as English->French->English, and the content will then be unique enough that it won’t raise any flags that it has been copied from somewhere else on the internet. The content may not be very readable, but it will make good fodder for the search engines to eat up.

Using this technique it is possible to build massive websites of unique content overnight and have it quickly rank highly.

Unfortunately, Google doesn’t provide an API for translating text. That means the script has to resort to scraping, which is inherently prone to breaking. The script uses BeautifulSoup to help with the parsing of the HTML content. (Note: I had to use the older 3.0.x series of BeautifulSoup to successfully parse the content.)

The code for this was based on this script by technobabble.

import sys
import urllib2
import urllib
 
from BeautifulSoup import BeautifulSoup # available at: http://www.crummy.com/software/BeautifulSoup/
 
def translate(sl, tl, text):
    """ Translates a given text from source language (sl) to
        target language (tl) """
 
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)')]
 
    translated_page = opener.open(
        "http://translate.google.com/translate_t?" + 
        urllib.urlencode({'sl': sl, 'tl': tl}),
        data=urllib.urlencode({'hl': 'en',
                               'ie': 'UTF8',
                               'text': text.encode('utf-8'),
                               'sl': sl, 'tl': tl})
    )
 
    translated_soup = BeautifulSoup(translated_page)
 
    return translated_soup('div', id='result_box')[0].string
 
if __name__=='__main__':
    print translate('en', 'fr', u'hello')

To generate unique content you can use this within your own Python program like this:

from translate import translate  # assumes the script above is saved as translate.py

content = get_content()          # placeholder: your own function that supplies the source text
# round trip English -> French -> English to produce a unique variation
new_content = translate('fr', 'en', translate('en', 'fr', content))
publish_content(new_content)     # placeholder: your own function that publishes the result