Showing posts with label CastSampler. Show all posts

Sunday, May 11, 2008

Caching RSS Feeds With feedcache

The past several years have seen a steady increase in the use of RSS and Atom feeds for data sharing. Blogs, podcasts, social networking sites, search engines, and news services are just a few examples of data sources delivered via such feeds. Working with internet services requires care, because inefficiencies in one client implementation may cause performance problems with the service that can be felt by all of the consumers accessing the same server. In this article, I describe the development of the feedcache package, and give examples of how you can use it to optimize the use of data feeds in your application.


This article was originally published by Python Magazine in November of 2007.

Sunday, August 12, 2007

feedcache 1.0

I am happy with the API for feedcache, so I bumped the release to 1.0 and beta status. The package includes tests and two separate examples (one using shelve, and another using threads and shove).

Sunday, August 5, 2007

New project: feedcache

Back in mid-June I promised Jeremy Jones that I would clean up some of the code I use for CastSampler.com to cache RSS and Atom feeds so he could look at it for his podgrabber project. I finally found some time to work on it this weekend (not quite 2 months later, sorry Jeremy).

The result is feedcache, which I have released in "alpha" status, for now. I don't usually bother releasing my code in alpha state, because that usually means I'm not actually using it anywhere with enough regularity to ensure that it is robust. I am going ahead and releasing feedcache early because I am hoping for some feedback on the API. I realized that the way I cache feeds for CastSampler.com is not the way all applications will want to cache them, so the design might be biased.

The Design

There are two aspects to caching the feed data: the high-level code that knows it is working with RSS or Atom feeds, and the low-level code that saves the data with a timestamp. The high-level Cache class is responsible for fetching, updating, and expiring feed content. The low-level storage classes are responsible for saving and restoring feed content.

Since the storage handling is separated from the cache management, it is possible to adapt the Cache to whatever sort of storage option might work best for you. So far, I have implemented two backend storage options. MemoryStorage keeps everything in memory, and is mostly useful for testing. ShelveStorage uses the shelve module to store all of the feed data in one file using pickles. I hope that the API for the backend storage manager is simple enough to make it easy for you to tie in your own backend if neither of these options is appealing. Something that uses memcached would be very interesting, for example.
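
To give a feel for how small a storage backend can be, here is a minimal sketch of an in-memory one. The class name DictStorage and its get()/set() signatures are assumptions made for illustration; they are not taken from the feedcache source, so check the package itself before writing a real backend.

```python
import time


class DictStorage(object):
    """Hypothetical in-memory backend: maps a URL to (timestamp, data)."""

    def __init__(self):
        self._records = {}

    def get(self, url):
        # Return (timestamp, data), or (None, None) when nothing is stored.
        return self._records.get(url, (None, None))

    def set(self, url, data, timestamp=None):
        # Save the data along with the time it was stored.
        if timestamp is None:
            timestamp = time.time()
        self._records[url] = (timestamp, data)


storage = DictStorage()
storage.set('http://example.com/feed.xml', {'title': 'Example'}, timestamp=100.0)
stored_at, data = storage.get('http://example.com/feed.xml')
```

A backend built on shelve or memcached would keep the same two operations and only change where the (timestamp, data) pair ends up.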


The Cache class uses a fairly simple algorithm to decide if it needs to update the stored data:


  1. If there is nothing stored for the URL, fetch the data.

  2. If there is something stored for the URL and its time-to-live has not passed, use that data. (This throttles repeated requests for the same feed content.)

  3. If the stored data has expired, use any available ETag and modification time header data to perform a conditional GET of the data. If new data is returned, update the stored data. If no new data is returned, update the time-to-live for the stored data and return what is stored.
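
The three steps above can be sketched roughly as follows. This is not the feedcache implementation: the get_feed() function, the record layout, and the fetch callable are all hypothetical, and the network call is replaced with a fake so the logic can be run on its own. In real code the fetch step would hand the stored ETag and modification time to feedparser.

```python
import time


def get_feed(url, storage, fetch, ttl=300, now=None):
    """Return feed data for url, fetching only when the stored copy is stale.

    storage: dict mapping url -> record dict (hypothetical layout).
    fetch:   callable(url, etag, modified) -> (data, etag, modified);
             data is None when the server reports nothing new.
    """
    now = time.time() if now is None else now
    record = storage.get(url)

    # Step 1: nothing stored, so fetch unconditionally.
    if record is None:
        data, etag, modified = fetch(url, None, None)
        storage[url] = {'data': data, 'etag': etag,
                        'modified': modified, 'stored_at': now}
        return data

    # Step 2: still within the time-to-live, so reuse the stored copy.
    if now - record['stored_at'] < ttl:
        return record['data']

    # Step 3: expired, so make a conditional request.
    data, etag, modified = fetch(url, record['etag'], record['modified'])
    if data is not None:
        record.update(data=data, etag=etag, modified=modified)
    # Either way, restart the time-to-live clock.
    record['stored_at'] = now
    return record['data']


calls = []


def fake_fetch(url, etag, modified):
    # Stand-in for the network: reports "not modified" once the ETag matches.
    calls.append((url, etag))
    if etag == 'v1':
        return None, etag, modified
    return 'feed body', 'v1', 'Sun, 05 Aug 2007 00:00:00 GMT'


store = {}
first = get_feed('http://example.com/rss', store, fake_fetch, ttl=300, now=0)
second = get_feed('http://example.com/rss', store, fake_fetch, ttl=300, now=100)
third = get_feed('http://example.com/rss', store, fake_fetch, ttl=300, now=400)
```

The second call falls inside the time-to-live and never touches the fetcher. The third call is past it, so it makes a conditional request, gets nothing new, and returns the stored copy with a refreshed timestamp.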



The feed data is retrieved and parsed by Mark Pilgrim's feedparser module, so the Cache really does just manage the contents of the backend storage.

Another benefit of separating the cache manager from the storage handler is that only the storage handler needs to be thread-safe. The storage handler is given to each Cache as an argument to the constructor. In a multi-threaded app, each thread can have its own Cache (which does the fetching, when needed) and share a single backend storage handler.
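
As a rough illustration of that arrangement, the sketch below shares one lock-protected storage object between two threads. LockedStorage and worker() are hypothetical names, and worker() is only a stand-in for a per-thread Cache; none of this is the feedcache API.

```python
import threading


class LockedStorage(object):
    """Hypothetical thread-safe backend guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}

    def get(self, url):
        with self._lock:
            return self._records.get(url)

    def set(self, url, data):
        with self._lock:
            self._records[url] = data


def worker(storage, urls):
    # Stand-in for a per-thread Cache: each thread does its own "fetching"
    # and writes the result into the shared storage. The check and the write
    # are separate calls, so two threads may both fetch the same URL; the
    # lock only keeps each individual read or write consistent.
    for url in urls:
        if storage.get(url) is None:
            storage.set(url, 'data for ' + url)


shared = LockedStorage()
url_groups = [['http://example.com/a', 'http://example.com/b'],
              ['http://example.com/b', 'http://example.com/c']]
threads = [threading.Thread(target=worker, args=(shared, group))
           for group in url_groups]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All of the locking lives in the storage class, so the code doing the fetching needs no synchronization of its own.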

Example

Here is a simple example program that uses a shelve file for storage. The example does not use multiple threads, but should still illustrate how to use the cache.

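# Import path assumed from the module names used below.
from feedcache import cache, shelvestorage

def main(urls=[]):
    print 'Saving feed data to ./.feedcache'
    storage = shelvestorage.ShelveStorage('.feedcache')
    storage.open()
    try:
        # The Cache fetches and expires feeds; the storage only saves them.
        fc = cache.Cache(storage)
        for url in urls:
            parsed_data = fc[url]
            print parsed_data.feed.title
            for entry in parsed_data.entries:
                print '\t', entry.title
    finally:
        # Always close the shelve file, even if a fetch fails.
        storage.close()
    return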


Additional Work

This project is still a work in progress, but I would appreciate any feedback you have, good or bad. And of course, report bugs if you find them!

Saturday, March 3, 2007

Things to Do

In no particular order:

  1. Cull my Google Reader subscriptions. 364 is too many.
  2. Finish reading Dreaming in Code.
  3. Add tagging support to codehosting.
  4. Verify all of the domains under my control with Google Webmaster Tools.
  5. Create a Trac plugin for code reviews based on the process we use at work.
  6. Change the monitor feeds on CastSampler.com so they do not include items without enclosures.
  7. Enhance BlogBackup to save enclosures and images linked from blog posts.
  8. Write a tool to convert an m3u file to an RSS/Atom feed for Patrick so he will set up a podcast of his demo recordings.
  9. Improve AppleScript support in Adium.
  10. Add support to Adium for notifications when a screen name appears in a chat message.

Sunday, January 28, 2007

CastSampler.com monitoring feeds

On the plane back from Phoenix this week, I implemented some changes to the way CastSampler.com republishes feeds for the sites a user subscribes to. The user page used to link directly to the original feed so it would be easy to copy it to a regular RSS reader to keep up to date on new shows. That link has been replaced with a "monitor" feed which uses the original description and title for each item, but replaces the link with a new URL that causes the show to be added to your CastSampler queue. The user page still links to the original home page for the feed, so I think I am doing enough as far as attribution and advertisement. Any author information included in the original feed is also passed through to the monitor feed. The OPML file generated for a user's feeds links to these "monitor" feeds instead of the original source, too.

The goal of these changes is to make it easy to use a feed-reader such as Bloglines or Google Reader to monitor podcasts from CastSampler. To add an episode to your queue, just click the link in the monitor feed to be directed to the appropriate CastSampler.com page.
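
The link rewriting at the heart of a monitor feed might look something like the sketch below. The QUEUE_URL pattern, the make_monitor_items() function, and the dictionary layout for items are all made up for illustration; the actual CastSampler.com code and URLs are not shown here.

```python
from urllib.parse import quote

# Hypothetical URL pattern; the real CastSampler.com URLs are not given here.
QUEUE_URL = 'http://www.example.com/queue/add?item='


def make_monitor_items(items):
    """Copy title, description, and author; point the link at the queue URL."""
    monitor_items = []
    for item in items:
        monitor_item = {
            'title': item['title'],
            'description': item['description'],
            # Encode the original link so it can travel as a query parameter.
            'link': QUEUE_URL + quote(item['link'], safe=''),
        }
        # Pass author information through when the original feed has it.
        if 'author' in item:
            monitor_item['author'] = item['author']
        monitor_items.append(monitor_item)
    return monitor_items


original = [{'title': 'Episode 1',
             'description': 'Pilot episode',
             'link': 'http://example.org/ep1.mp3',
             'author': 'Pat'}]
monitor = make_monitor_items(original)
```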

By the way, how cool is it to be able to develop a web app on my PowerBook while I'm on a plane? What an age to be alive.

Sunday, December 17, 2006

feed auto-discovery

I added feed auto-discovery to CastSampler.com today. It was pretty easy using the feedfinder.py module, except for one small problem. Something about the timelimit() decorator in that module causes problems with Django or mod_python (probably mod_python). When timelimit() is enabled, the finder either produces no URLs at all or an exception about "unmarshalling code objects" in a "restricted execution environment." It works great in my development environment, which does not use mod_python. To get it to work in production, I disabled the timelimit() decorator. I hope that does not come back to bite me in the future.

Tuesday, December 5, 2006

CastSampler.com

My most recent project is CastSampler.com, a tool for building a personal "mix-tape" style podcast. I tend to listen to one or two episodes from a lot of different shows, so I don't want to subscribe to the full show feed. Instead, I add the show to my CastSampler list, then I can add only those episodes that I want to my personal feed.

I have plenty of work left to do, but the basic features all work now so I would love to get some feedback.