Tag Archives: program

In yet another of my series of web scrapers, this time I’m posting some code that will scrape links from delicious.com. This is a pretty cool way of finding links that other people have found relevant, and it could be used to generate useful content for visitors.

You could easily add this to a WordPress blogging robot script so that the newest links are posted in a weekly digest post (there’s a sketch of this after the script below). This type of promotion gets noticed by the people you link to and spreads some of that link love. It will hopefully result in some reciprocal links for your site.

Another idea would be to create a link directory and seed it with links gathered from delicious. Or you could create a widget of the hottest links in your niche that automatically gets updated.

This script makes use of the BeautifulSoup library for parsing the HTML pages.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
"""
Scraper for Del.icio.us SERP.
 
This pulls the results for a match for a query on http://del.icio.us.
"""
 
import urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
def get_delicious_results(query, page_limit=10):
    """Return a list of (url, title) tuples for a Delicious search query."""
    page = 1
    links = []

    # walk the paginated results until there is no "next" link
    # or we hit page_limit
    while page <= page_limit:
        url = 'http://delicious.com/search?p=' + '%20'.join(query.split()) + '&context=all&lc=1&page=' + str(page)
        req = urllib2.Request(url)
        HTML = urllib2.urlopen(req).read()
        soup = BeautifulSoup(HTML)

        next_page = soup.find('a', attrs={'class': re.compile('.*next$', re.I)})

        # each result anchor carries a "taggedlink" class;
        # collect (url, title) tuples
        links += [(link['href'], ''.join(link.findAll(text=True))) for link in soup.findAll('a', attrs={'class': re.compile('.*taggedlink.*', re.I)})]

        if next_page:
            page += 1
        else:
            break

    return links
 
if __name__=='__main__':
    links = get_delicious_results('halotis marketing')
    print links
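
As a sketch of the weekly-digest idea mentioned above, the (url, title) tuples could be formatted into a block of HTML ready to drop into a post. The digest_html helper below is my own illustration, not part of the original script:

def digest_html(query, page_limit=2):
    """Format scraped links as an HTML list for a digest post."""
    items = get_delicious_results(query, page_limit)
    lines = ['<ul>']
    for url, title in items:
        lines.append('<li><a href="%s">%s</a></li>' % (url, title))
    lines.append('</ul>')
    return '\n'.join(lines)

print digest_html('halotis marketing')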

I was a little bored today and decided to write up a simple script that pushes RSS feed information out to Twitter and keeps track of its history so that tweets are not sent out more than once.

It was a very trivial little script to write, but it could be useful for something I’m working on in the future.

The script makes use of an SQLite database to store history and bit.ly for shortening URLs. I’ve made heavy use of some really nice open source libraries to keep the script short and sweet.

Grab the necessary Python libraries:
python-twitter
python-bitly
feedparser

You’ll need to sign up for free accounts at Twitter and Bit.ly to use this script.

Hopefully someone out there can take this code example and do something really cool with Twitter and Python.

Update: I’ve added some bit.ly link tracking output to this script. After it tweets the RSS feed, it will print out the click-count information for every bit.ly link.

from time import strftime
import sqlite3
 
import twitter     #http://code.google.com/p/python-twitter/
import bitly       #http://code.google.com/p/python-bitly/
import feedparser  #available at feedparser.org
 
 
DATABASE = "tweets.sqlite"
 
BITLY_LOGIN = "bitlyUsername"
BITLY_API_KEY = "api key"
 
TWITTER_USER = "username"
TWITTER_PASSWORD = "password"
 
def print_stats():
	conn = sqlite3.connect(DATABASE)
	conn.row_factory = sqlite3.Row
	c = conn.cursor()
 
	b = bitly.Api(login=BITLY_LOGIN,apikey=BITLY_API_KEY)
 
	c.execute('SELECT title, url, short_url from RSSContent')
	all_links = c.fetchall()
 
	for row in all_links:
 
		short_url = row['short_url']
 
		if short_url is None:
			short_url = b.shorten(row['url'])
			c.execute('UPDATE RSSContent SET `short_url`=? WHERE `url`=?',(short_url,row['url']))
 
 
		stats = b.stats(short_url)
		print "%s - User clicks %s, total clicks: %s" % (row['title'], stats.user_clicks,stats.total_clicks)
 
	conn.commit()
 
def tweet_rss(url):
 
	conn = sqlite3.connect(DATABASE)
	conn.row_factory = sqlite3.Row
	c = conn.cursor()
 
	#create the table if it doesn't exist
	c.execute('CREATE TABLE IF NOT EXISTS RSSContent (`url`, `title`, `dateAdded`, `content`, `short_url`)')
 
	api = twitter.Api(username=TWITTER_USER, password=TWITTER_PASSWORD)
	b = bitly.Api(login=BITLY_LOGIN,apikey=BITLY_API_KEY)
 
	d = feedparser.parse(url)
 
	for entry in d.entries:
 
		#check for duplicates
		c.execute('select * from RSSContent where url=?', (entry.link,))
		if not c.fetchall():
 
			tweet_text = "%s - %s" % (entry.title, entry.summary)
 
			shortened_link = b.shorten(entry.link)
 
			t = (entry.link, entry.title, strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed), entry.summary, shortened_link)
			c.execute('insert into RSSContent (`url`, `title`,`dateAdded`, `content`, `short_url`) values (?,?,?,?,?)', t)
			print "%s.. %s" % (tweet_text[:115], shortened_link)
 
			api.PostUpdate("%s.. %s" % (tweet_text[:115], shortened_link))
 
	conn.commit()
 
if __name__ == '__main__':
    tweet_rss('http://www.halotis.com/feed/')
    print_stats()
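
If you want to push out more than one feed, the main block could loop over a list instead; a minimal sketch (the second feed URL is just a placeholder):

FEEDS = [
    'http://www.halotis.com/feed/',
    'http://example.com/feed/',   # placeholder - replace with your own
]

if __name__ == '__main__':
    for feed_url in FEEDS:
        tweet_rss(feed_url)
    print_stats()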

I discovered this very handy trick for getting relevant YouTube videos in an RSS feed, and I have used it to build some very powerful blog-posting scripts that grab relevant content for some blogs that I have. It could also be helpful to pull these feeds into an RSS reader to quickly skim the newest videos relevant to a specific search. I thought I would share some of this with you, and hopefully you’ll be able to use these scripts to get an idea of your own.

To start with, you need to create the URL of the RSS feed for the search. To do that, run the search on YouTube and click the RSS icon in the address bar. The structure of the URL should be something like this:

http://gdata.youtube.com/feeds/base/videos?q=halotis%20marketing&client=ytapi-youtube-search&alt=rss&v=2
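
If you’d rather build that URL in code than copy it from the address bar, something like this should work. The youtube_search_feed helper is my own name for it, and note that urlencode escapes spaces as + rather than %20, which should be equivalent:

import urllib

def youtube_search_feed(query):
    # build the gdata search-feed URL with the same parameters as above
    params = urllib.urlencode({
        'q': query,
        'client': 'ytapi-youtube-search',
        'alt': 'rss',
        'v': '2',
    })
    return 'http://gdata.youtube.com/feeds/base/videos?' + params

print youtube_search_feed('halotis marketing')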

The problem with the RSS feed is that it doesn’t include the HTML required to embed the video. You have to parse the RSS content and find the URL for the video, which can then be used to create the embed code so you can post the videos somewhere else.

In my example code I have categorized each URL to a target keyword phrase. The code below is not a comprehensive program, just an idea of how to go about repurposing YouTube RSS content.

import random
import time
import sqlite3

import feedparser  # available at: feedparser.org

DATABASE = "YouTubeMatches.sqlite"

conn = sqlite3.connect(DATABASE)
conn.row_factory = sqlite3.Row
c = conn.cursor()

#create the table if it doesn't exist
c.execute('CREATE TABLE IF NOT EXISTS YoutubeResources (`phrase`, `url`, `used`, `title`, `dateAdded`, `content`, `dateUsed`)')

def LoadIntoDatabase(phrase, url):

    d = feedparser.parse(url)
    for entry in d.entries:

        #check for duplicates with the url
        c.execute('select * from YoutubeResources where url=?', (entry.link,))
        if len(c.fetchall()) == 0:
            #only add the resources that are not already in the table
            t = (phrase, entry.link, 0, entry.title, time.strftime("%Y-%m-%d %H:%M:%S", entry.updated_parsed), entry.summary)
            c.execute('insert into YoutubeResources (`phrase`, `url`, `used`, `title`, `dateAdded`, `content`) values (?,?,?,?,?,?)', t)

    conn.commit()

def getYouTubeEmbedCode(phrase):

    #pick a random unused video for the phrase
    c.execute("select * from YoutubeResources where phrase=? and used=0", (phrase,))
    result = c.fetchall()
    random.shuffle(result)

    content = result[0]

    #turn the watch URL into the embeddable player URL
    url = content[1].replace('?', '').replace('=', '/')
    embedCode = '<div class="youtube-video"><object width="425" height="344"><param name="movie" value="%s&hl=en&fs=1"> </param><param name="allowFullScreen" value="true"> </param><param name="allowscriptaccess" value="always"> </param><embed src="%s&hl=en&fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"> </embed></object></div>' % (url, url)

    #mark the video as used, keyed on the original (unmodified) link
    t = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    c.execute("UPDATE YoutubeResources SET used = '1', dateUsed = ? WHERE url = ?", (t, content[1]))
    conn.commit()

    return embedCode
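
To tie the two functions together, usage might look something like this; the phrase and feed URL are just the example from earlier in the post:

if __name__ == '__main__':
    feed = 'http://gdata.youtube.com/feeds/base/videos?q=halotis%20marketing&client=ytapi-youtube-search&alt=rss&v=2'
    LoadIntoDatabase('halotis marketing', feed)
    print getYouTubeEmbedCode('halotis marketing')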