Sunday, April 29, 2007

PyMOTW: linecache

Module: linecache
Purpose: Retrieve lines of text from files or imported Python modules, holding a cache of the results to make reading many lines from the same file more efficient.
Python Version: 1.4

Description:

The linecache module is used extensively throughout the Python standard library when dealing with Python source files. The implementation of the cache simply holds the contents of files, parsed into separate lines, in a dictionary in memory. The API returns the requested line(s) by indexing into a list. The time savings comes from not having to re-read the file and parse it into lines each time a line is requested. This is especially useful when looking for multiple lines from the same file, such as when producing a traceback for an error report.
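To make the caching idea concrete, here is a toy sketch of a line cache built on a dictionary. It is not the linecache implementation; the names _cache and get_line are invented for illustration, and the code is written for Python 3.

```python
# Toy illustration of the caching idea; NOT linecache's actual code.
# _cache and get_line are hypothetical names used only in this sketch.
_cache = {}  # maps filename -> list of lines (each keeps its newline)


def get_line(filename, lineno):
    """Return line `lineno` (counting from 1), or '' if unavailable."""
    if filename not in _cache:
        try:
            with open(filename) as f:
                # Read and split the file once; later calls reuse the list.
                _cache[filename] = f.readlines()
        except OSError:
            # Mirror the "never raise" behavior described in this post.
            return ''
    lines = _cache[filename]
    if 1 <= lineno <= len(lines):
        return lines[lineno - 1]
    return ''
```

After the first call for a given file, every later lookup is just a dictionary access followed by a list index.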

Example:

import linecache

import os
import tempfile


We will use some text produced by the Lorem Ipsum generator as sample input:

lorem = '''Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Vivamus eget elit. In posuere mi non risus. Mauris id quam posuere
lectus sollicitudin varius. Praesent at mi. Nunc eu velit. Sed augue
massa, fermentum id, nonummy a, nonummy sit amet, ligula. Curabitur
eros pede, egestas at, ultricies ac, pellentesque eu, tellus.

Sed sed odio sed mi luctus mollis. Integer et nulla ac augue convallis
accumsan. Ut felis. Donec lectus sapien, elementum nec, condimentum ac,
interdum non, tellus. Aenean viverra, mauris vehicula semper porttitor,
ipsum odio consectetuer lorem, ac imperdiet eros odio a sapien. Nulla
mauris tellus, aliquam non, egestas a, nonummy et, erat. Vivamus
sagittis porttitor eros.'''

# Create a temporary text file with some text in it
fd, temp_file_name = tempfile.mkstemp()
os.close(fd)
f = open(temp_file_name, 'wt')
try:
    f.write(lorem)
finally:
    f.close()


And now that we have a temporary file to work with, let's get on to the interesting bits. Reading the 5th line from the file is a simple one-liner. Notice that the line numbers in the linecache module start with 1, but if we split the string ourselves we start indexing the array from 0. We also need to strip the trailing newline from the value returned from the cache.

# Pick out the same line from source and cache.
# (Notice that linecache counts from 1)
print 'SOURCE: ', lorem.split('\n')[4]
print 'CACHE : ', linecache.getline(temp_file_name, 5).rstrip()


Next let's see what happens if the line we want is empty:

# Blank lines include the newline
print '\nBLANK : "%s"' % linecache.getline(temp_file_name, 6)


If the requested line number falls out of the range of valid lines in the file, linecache returns an empty string.

# The cache always returns a string, and uses
# an empty string to indicate a line which does
# not exist.
not_there = linecache.getline(temp_file_name, 500)
print '\nNOT THERE: "%s" includes %d characters' % (not_there, len(not_there))


The module never raises an exception, even if the file does not exist:

# Errors are even hidden if linecache cannot find the file
no_such_file = linecache.getline('this_file_does_not_exist.txt', 1)
print '\nNO FILE: ', no_such_file
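Because a missing line and a missing file both come back as an empty string, a caller that needs to tell them apart has to check for itself. The following is a minimal Python 3 sketch; describe_line is a hypothetical helper, and the os.path.exists() test ignores the sys.path search that linecache can also perform.

```python
import linecache
import os


def describe_line(filename, lineno):
    """Hypothetical helper: explain why linecache returned nothing."""
    line = linecache.getline(filename, lineno)
    if line:
        # Even a blank line is returned as '\n', so it is not "empty" here.
        return line.rstrip('\n')
    if not os.path.exists(filename):
        return '<no such file>'
    return '<no such line>'
```

A blank line in the file produces an empty result from describe_line(), while the two error cases produce the marker strings.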


Since the linecache module is used so heavily when producing tracebacks, one of the key features is the ability to find Python source modules in sys.path by specifying the base name of the module. The cache population code in linecache searches sys.path for the module if it cannot find the file directly.

# Look for the linecache module, using
# the built in sys.path search.
module_line = linecache.getline('linecache.py', 3)
print '\nMODULE : ', module_line


Example Output:


SOURCE: eros pede, egestas at, ultricies ac, pellentesque eu, tellus.
CACHE : eros pede, egestas at, ultricies ac, pellentesque eu, tellus.

BLANK : "
"

NOT THERE: "" includes 0 characters

NO FILE:

MODULE : This is intended to read lines from modules imported -- hence if a filename


References:

Download the full example.
Python Module of the Week

Thanks to Noah for the inspiration for this week's topic.

Lorem Ipsum generated by www.ipsum.com.

Updated 5/20/2007 with technorati tags.
Updated 9/5/2007 with minor formatting changes.



Sunday, April 22, 2007

codehosting now supports feedburner

I just posted a new version of my codehosting project for django which supports passing the Atom feeds for release updates through feedburner.com. There isn't anything tying the implementation to FeedBurner, of course, but since that's why I wanted the feature that's how I am describing it.

One tricky bit was that I wanted all of the existing subscribers to my feed(s) to be redirected to the FeedBurner URL. I couldn't just add a redirect rule in Apache, since not all of the feeds are set up with FeedBurner yet. So I opted for letting the django code handle the redirection. If a project has an external_feed property that is not null, that value is used as the URL for feeds for the project. So when someone accesses the old URL for the codehosting release feed (http://www.doughellmann.com/projects/feed/atom/codehosting/) they are redirected to http://feeds.feedburner.com/DougHellmann-codehosting instead. And FeedBurner looks at http://www.doughellmann.com/projects/local_feed/atom/codehosting/, which always produces the Atom content locally.

The "local_feed" URL is never included in any templates, so no web crawlers should ever find it by themselves.

This is one of those cases where I wish I had thought to include this feature from the beginning, since migrating the existing readers of the feed(s) required this hackish change. But it looks like it is working. I would be interested in any feedback anyone else might have on other ways I could have handled the redirects.

PyMOTW: textwrap

Module: textwrap
Purpose: Formatting text by adjusting where line breaks occur in a paragraph.
Python Version: 2.5

Description:

The textwrap module can be used to format text for output in situations where pretty-printing is desired. It offers programmatic functionality similar to the paragraph wrapping or filling features found in many text editors.

Example:

import textwrap

# Provide some sample text
sample_text = '''
	The textwrap module can be used to format text for output in situations
	where pretty-printing is desired. It offers programmatic functionality similar
	to the paragraph wrapping or filling features found in many text editors.
	'''


The fill() convenience function takes text as input and produces formatted text as output. Let's see what it does with the sample_text provided.

print 'No dedent:\n'
print textwrap.fill(sample_text)


The results are something less than what we want:

No dedent:

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors.


Notice the embedded tabs and extra spaces mixed into the middle of the
output. It looks pretty rough. Of course, we can do better. We want
to start by removing any common whitespace prefix from all of the lines
in the sample text. This allows us to use docstrings or embedded
multi-line strings straight from our Python code while removing the
formatting of the code itself. The sample string has an artificial
indent level introduced for illustrating this feature.

# Remove common whitespace prefix from the lines in the sample text
dedented_text = textwrap.dedent(sample_text).strip()
print 'Dedented:\n'
print dedented_text


The results are starting to look better:

Dedented:

The textwrap module can be used to format text for output in situations
where pretty-printing is desired. It offers programmatic functionality similar
to the paragraph wrapping or filling features found in many text editors.


Since "dedent" is the opposite of "indent", the result is a block of text with the common initial whitespace from each line removed. If one line is already indented more than another, some of the whitespace will not be removed.

	One tab.
		Two tabs.
	One tab.


becomes

One tab.
	Two tabs.
One tab.
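The same transformation can be written with explicit \t escapes so the whitespace is visible. This is a small Python 3 sketch rather than part of the original example script.

```python
import textwrap

# One leading tab is common to every line; the middle line has two.
text = '\tOne tab.\n\t\tTwo tabs.\n\tOne tab.\n'

result = textwrap.dedent(text)

# repr() makes the remaining tab on the middle line visible.
print(repr(result))
```

Only the whitespace shared by every line is removed, so the second line keeps one of its two tabs.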


Next, let's see what happens if we take the dedented text and pass it through fill() with a few different width values.

# Format the output with a few different max line width values
for width in [ 20, 60, 80 ]:
    print
    print '%d Columns:\n' % width
    print textwrap.fill(dedented_text, width=width)


This gives several sets of output in the specified widths:

20 Columns:

The textwrap module
can be used to
format text for
output in situations
where pretty-
printing is desired.
It offers
programmatic
functionality
similar to the
paragraph wrapping
or filling features
found in many text
editors.

60 Columns:

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors.

80 Columns:

The textwrap module can be used to format text for output in situations where
pretty-printing is desired. It offers programmatic functionality similar to the
paragraph wrapping or filling features found in many text editors.


Besides the width of the output, you can control the indent of the first line independently of subsequent lines.

# Demonstrate how to produce a hanging indent
print '\nHanging indent:\n'
print textwrap.fill(dedented_text, initial_indent='', subsequent_indent=' ')


This makes it relatively easy to produce a hanging indent, where the first line is indented less than the other lines.

Hanging indent:

The textwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping or
filling features found in many text editors.


The indent values can include non-whitespace characters, too, so the hanging indent can be prefixed with * to produce bullet points, etc. That came in handy when I converted my old zwiki content so I could import it into trac. I used the StructuredText package from Zope to parse the zwiki data, then created a formatter to produce trac's wiki markup as output. Using textwrap, I was able to format the output pages so almost no manual tweaking was needed after the conversion.
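As a rough illustration of the bullet-point idea, here is a Python 3 sketch. The bullet_list function, the sample items, and the 40 column width are all made up for this example; it is not the code from the zwiki conversion.

```python
import textwrap

# Sample items invented for this sketch.
items = [
    'The textwrap module can be used to format text for output.',
    'Indent strings may contain any characters, not only whitespace.',
]


def bullet_list(items, width=40):
    """Hypothetical helper: wrap each item as a '* ' bullet."""
    lines = []
    for item in items:
        lines.append(textwrap.fill(item, width=width,
                                   initial_indent='* ',
                                   subsequent_indent='  '))
    return '\n'.join(lines)


formatted = bullet_list(items)
print(formatted)
```

The two-space subsequent_indent matches the width of the '* ' prefix, so continuation lines line up under the first word of each bullet.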

References:

textwrap_example.py
Python Module of the Week

Updated 5/20/2007 with technorati tags.
Updated 9/5/2007 with minor formatting changes.



Sunday, April 15, 2007

PyMOTW: StringIO and cStringIO

Module: StringIO and cStringIO
Purpose: Work with text buffers using file-like API
Python Version: StringIO: 1.4, cStringIO: 1.5

Description:

The StringIO class provides a convenient means of working with text in memory using the file API (read, write, etc.). There are two separate implementations. The cStringIO module is written in C for speed, while the StringIO module is written in Python for portability. Using cStringIO to build large strings can offer performance savings over some other string concatenation techniques.

Example:

Here are some pretty standard, simple examples of using StringIO buffers:

#!/usr/bin/env python

"""Simple examples with StringIO module
"""

# Find the best implementation available on this platform
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO

# Writing to a buffer
output = StringIO()
output.write('This goes into the buffer. ')

print >>output, 'And so does this.'

# Retrieve the value written
print output.getvalue()

output.close() # discard buffer memory

# Initialize a read buffer
input = StringIO('Initial value for read buffer')

# Read from the buffer
print input.read()


This example uses read(), but of course the readline() and readlines() methods are also available. The StringIO class also provides a seek() method so it is possible to jump around in a buffer while reading, which can be useful for rewinding if you are using some sort of look-ahead parsing algorithm.
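Here is a small sketch of the rewinding idea. It assumes Python 3, where the class lives in the io module instead of the StringIO and cStringIO modules used above.

```python
import io

buf = io.StringIO('first line\nsecond line\n')

first = buf.readline()

# Remember where we are, look ahead one line, then rewind.
position = buf.tell()
peeked = buf.readline()
buf.seek(position)

# Reading again returns the line we peeked at.
again = buf.readline()
```

A look-ahead parser can use the same tell() and seek() pair to back out of a line it decides not to consume.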

Real world applications of StringIO include a web application stack where various parts of the stack may add text to the response, or testing the output generated by parts of a program which typically write to a file.

The application we are building at work includes a shell scripting interface in the form of several command line programs. Some of these programs are responsible for pulling data from the database and dumping it on the console (either to show the user, or so the text can serve as input to another command). The commands share a set of formatter plugins to produce a text representation of an object in a variety of ways (XML, bash syntax, human readable, etc.). Since the formatters normally write to standard output, testing the results would be a little tricky without the StringIO module. Using StringIO to intercept the output of the formatter gives us an easy way to collect the output in memory to compare against expected results.
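The testing approach might be sketched like this. The code assumes Python 3 (io.StringIO and contextlib.redirect_stdout), and format_record is a made-up stand-in for one of the formatter plugins, not the code from our application.

```python
import contextlib
import io


def format_record(record):
    """Hypothetical formatter that writes to standard output."""
    for key in sorted(record):
        print('%s=%s' % (key, record[key]))


# Send everything printed inside the block to an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    format_record({'name': 'example', 'id': 7})

captured = buffer.getvalue()
```

A test can then compare captured against the expected text without touching the console or the filesystem.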



Updated 5/20/2007 with technorati tags.
Updated 9/5/2007 with minor formatting changes.



Sunday, April 8, 2007

PyMOTW: Queue

Module: Queue
Purpose: Provides a thread-safe FIFO implementation
Python Version: at least 1.4

Description:

The Queue module provides a FIFO implementation suitable for multi-threaded programming. It can be used to pass messages or other data between producer and consumer threads safely. Locking is handled for the caller, so it is simple to have as many threads as you want working with the same Queue instance. A Queue's size (number of elements) may be restricted to throttle memory usage or processing.
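As a quick sketch of the size restriction, the snippet below builds a queue that holds at most two items. It assumes Python 3, where the module is named queue; in Python 2 the same class is Queue.Queue.

```python
import queue

# A queue that refuses to hold more than two items at once.
q = queue.Queue(maxsize=2)
q.put('a')
q.put('b')

try:
    # put_nowait() raises instead of blocking when the queue is full.
    q.put_nowait('c')
    overflowed = False
except queue.Full:
    overflowed = True

# Items come back out in the order they went in.
first = q.get()
```

A plain put() on a full queue would block until a consumer removed an item, which is how a bounded queue throttles a fast producer.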

This discussion assumes you already understand the general nature of a queue. If you don't, you may want to read an introduction to queues before continuing.



Example:

As an example of how to use the Queue class with multiple threads, we can create a very simplistic podcasting client. This client reads one or more RSS feeds, queues up the enclosures for download, and processes several downloads in parallel using threads. It is extremely simplistic and is entirely unsuitable for actual use, but the skeleton implementation gives us enough code to work with to provide an example of using the Queue module.

To start, we import a few useful modules:

from Queue import Queue

from threading import Thread
import time

import feedparser


Now, let's establish some operating parameters. Normally these would come from user inputs (preferences, a database, whatever). For our example we hard code a few values:

# Set up some global variables
num_fetch_threads = 2
enclosure_queue = Queue()

# A real app wouldn't use hard-coded data...
feed_urls = [ 'http://www.castsampler.com/cast/feed/rss/guest',
]


Next, we need to define the function that will run in the worker thread, processing the downloads. Again, for illustration purposes this only simulates the download. To actually download the enclosure, check out the urllib module, which we will cover in a later episode. In our example, we sleep a variable amount of time, depending on the thread id.

def downloadEnclosures(i, q):
    """This is the worker thread function.
    It processes items in the queue one after
    another. These daemon threads go into an
    infinite loop, and only exit when
    the main thread ends.
    """
    while True:
        print '%s: Looking for the next enclosure' % i
        url = q.get()
        print '%s: Downloading:' % i, url
        time.sleep(i + 2) # instead of really downloading the URL, we just pretend
        q.task_done()


Once this target function is defined, we can start the worker threads. Notice that downloadEnclosures() will block on the statement "url = q.get()" until the queue has something to return, so it is safe to start the threads before there is anything in the queue.

# Set up some threads to fetch the enclosures
for i in range(num_fetch_threads):
    worker = Thread(target=downloadEnclosures, args=(i, enclosure_queue,))
    worker.setDaemon(True)
    worker.start()


And now we retrieve the feed contents (using Mark Pilgrim's feedparser module) and enqueue the URLs of the enclosures. As soon as the first URL is added to the queue, one of the worker threads should pick it up and start downloading it. The loop below will continue to add items until the feed is exhausted, and the worker threads will take turns dequeuing URLs to download them.

# Download the feed(s) and put the enclosure URLs into
# the queue.
for url in feed_urls:
    response = feedparser.parse(url, agent='fetch_podcasts.py')
    for entry in response['entries']:
        for enclosure in entry.get('enclosures', []):
            print 'Queuing:', enclosure['url']
            enclosure_queue.put(enclosure['url'])


And the only thing left to do is wait for the queue to empty out again.

# Now wait for the queue to be empty, indicating that we have
# processed all of the downloads.
print '*** Main thread waiting'
enclosure_queue.join()
print '*** Done'


If you download the sample script and run it without modification, you should see output something like this:


0: Looking for the next enclosure
1: Looking for the next enclosure
Queuing: http://http.earthcache.net/htc-01.media.globix.net/COMP009996MOD1/Danny_Meyer.mp3
Queuing: http://feeds.feedburner.com/~r/drmoldawer/~5/104445110/moldawerinthemorning_show34_032607.mp3
Queuing: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/MBW-036.mp3
Queuing: http://media1.podtech.net/media/2007/04/PID_010848/Podtech_calacaniscast22_ipod.mp4
Queuing: http://media1.podtech.net/media/2007/03/PID_010592/Podtech_SXSW_KentBrewster_ipod.mp4
Queuing: http://media1.podtech.net/media/2007/02/PID_010171/Podtech_IDM_ChrisOBrien2.mp3
Queuing: http://feeds.feedburner.com/~r/drmoldawer/~5/96188661/moldawerinthemorning_show30_022607.mp3
*** Main thread waiting
0: Downloading: http://http.earthcache.net/htc-01.media.globix.net/COMP009996MOD1/Danny_Meyer.mp3
1: Downloading: http://feeds.feedburner.com/~r/drmoldawer/~5/104445110/moldawerinthemorning_show34_032607.mp3
0: Looking for the next enclosure
0: Downloading: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/MBW-036.mp3
1: Looking for the next enclosure
1: Downloading: http://media1.podtech.net/media/2007/04/PID_010848/Podtech_calacaniscast22_ipod.mp4
0: Looking for the next enclosure
0: Downloading: http://media1.podtech.net/media/2007/03/PID_010592/Podtech_SXSW_KentBrewster_ipod.mp4
0: Looking for the next enclosure
0: Downloading: http://media1.podtech.net/media/2007/02/PID_010171/Podtech_IDM_ChrisOBrien2.mp3
1: Looking for the next enclosure
1: Downloading: http://feeds.feedburner.com/~r/drmoldawer/~5/96188661/moldawerinthemorning_show30_022607.mp3
0: Looking for the next enclosure
1: Looking for the next enclosure
*** Done


YMMV, depending on whether anyone modifies the subscriptions in the guest account on CastSampler.com.
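For readers who want to try the pattern without feedparser or a network connection, here is a condensed sketch of the same producer and consumer arrangement. It assumes Python 3 (the queue module and the daemon keyword argument to Thread); the example.com URLs and the results list are invented for illustration, and the "download" is just a record of which worker saw which URL.

```python
import queue
import threading


def worker(worker_id, q, results, lock):
    """Pull URLs off the queue forever; the daemon flag lets us exit."""
    while True:
        url = q.get()
        try:
            # Pretend to download: just record who handled the URL.
            with lock:
                results.append((worker_id, url))
        finally:
            # Always mark the item done so join() can return.
            q.task_done()


work_queue = queue.Queue()
results = []
lock = threading.Lock()

for i in range(2):
    t = threading.Thread(target=worker,
                         args=(i, work_queue, results, lock),
                         daemon=True)
    t.start()

# Hypothetical enclosure URLs standing in for the parsed feed.
urls = ['http://example.com/episode%d.mp3' % n for n in range(5)]
for url in urls:
    work_queue.put(url)

# Block until every queued item has been marked done.
work_queue.join()
processed = sorted(url for _, url in results)
```

Which worker handles which URL varies from run to run, but join() does not return until task_done() has been called once for every URL.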

References:

Python Module of the Week


Updated 5/20/2007 with technorati tags.
Updated 9/5/2007 with minor formatting fixes.



Wednesday, April 4, 2007

This is a test from Google Docs. My friend Luis sent me a link to "Why I Switched from OpenOffice to Google Docs", which happened to mention that it supports "publishing", so I wanted to give it a try.

Sunday, April 1, 2007

PyMOTW: ConfigParser

Module: ConfigParser
Purpose: Read/write configuration files similar to Windows INI files
Python Version: 1.5

Description:

The ConfigParser module is very useful for creating user-editable configuration files for your applications. The configuration files are broken up into sections, and each section can contain name-value pairs for configuration data. Value interpolation using Python formatting strings is also supported, to build values which depend on one another (this is especially handy for paths or URLs).

Example:

At work, before we moved to svn and trac, we had rolled our own tool for conducting distributed code reviews. To prepare the code for review, a developer would write up a summary "approach" document, then attach code diffs to it. The approach document supported comments through the web page, so developers not located in our main office could also review code. The only trouble was, posting the diffs could be a bit of a pain. To make that part of the process easier, I wrote a command line tool to run against a CVS sandbox to automatically find and post the diffs.

For the tool to update the diffs on an approach, it needed to know how to reach the web server hosting the approach documents. Since our developers were not always in the office, the URL to reach the server from any given host might be port-forwarded through SSH. Rather than forcing each developer to use the same port-forwarding scheme, the tool used a simple config file to remember the URL.

A developer's configuration file might look like:

[portal]
url = http://%(host)s:%(port)s/Portal
username = dhellmann
host = localhost
password = SECRET
port = 8080


The "portal" section refers to the approach document web site. Once the diffs were ready to be posted to the site, the tool would load the config file using the ConfigParser module to access the URL. That might look something like this:

from ConfigParser import ConfigParser
import os

filename = os.path.join(os.environ['HOME'], '.approachrc')

config = ConfigParser()
config.read([filename])

url = config.get('portal', 'url')


In the example above, the value of the url variable is "http://localhost:8080/Portal". The "url" value from the config file contains 2 formatting strings: "%(host)s" and "%(port)s". The values of the host and port variables are automatically substituted in place of the formatting strings by the get() method.
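To experiment with the interpolation without creating a ~/.approachrc file, the same settings can be parsed from a string. This sketch assumes Python 3, where the module is named configparser and read_string() is available; it is not the code from the original tool.

```python
import configparser

# The same settings as the sample file, held in a string for convenience.
SAMPLE = """
[portal]
url = http://%(host)s:%(port)s/Portal
username = dhellmann
host = localhost
password = SECRET
port = 8080
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)

# get() substitutes host and port from the same section.
url = config.get('portal', 'url')

# raw=True returns the value exactly as written in the file.
raw_url = config.get('portal', 'url', raw=True)
```

Comparing url with raw_url shows exactly what the interpolation step contributed.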

Of course, this is old code, written for Python 2.1. The ConfigParser module has been improved in more recent versions. The SafeConfigParser class is a drop-in replacement for ConfigParser with improvements to the interpolation processing.

For this tool, I only needed string options. The ConfigParser supports other types of options as well: integer, floating point, and boolean. Since the option file format does not offer a way to associate a "type" with a value, the caller needs to know when to use a different method to retrieve options with these other types. For example, to retrieve a boolean option, use the getboolean() method instead of get(). The method arguments are the same, but the option value is converted to a boolean before being returned. Similarly, there are separate getint() and getfloat() methods.
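The typed accessors might be used as follows. The "settings" section and its option names are invented for this sketch, which assumes Python 3's configparser module.

```python
import configparser

config = configparser.ConfigParser()
# Hypothetical section and options, used only for illustration.
config.read_string("""
[settings]
verbose = yes
retries = 3
timeout = 2.5
""")

# The caller picks the accessor that matches the expected type.
verbose = config.getboolean('settings', 'verbose')
retries = config.getint('settings', 'retries')
timeout = config.getfloat('settings', 'timeout')

# Plain get() always returns the text as written.
as_text = config.get('settings', 'retries')
```

Nothing in the file marks retries as an integer; the conversion happens only because the caller asked for it with getint().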

The ConfigParser class also supports adding and removing sections to the file programmatically, and saving the results. This makes it possible to create a user interface for editing the configuration of your program, or to use the config file format for simple data files. For example, an app which needed to store a very small amount of data in a database-like format might take advantage of ConfigParser so the files would be human-readable as well.
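A minimal sketch of building and saving a configuration in code, again assuming Python 3's configparser. It writes to an in-memory buffer rather than a real file so it can run anywhere; the section and option names reuse the portal example.

```python
import configparser
import io

config = configparser.ConfigParser()

# Build the configuration in code instead of reading a file.
config.add_section('portal')
config.set('portal', 'host', 'localhost')
config.set('portal', 'port', '8080')

# Options (and whole sections) can be removed again.
config.remove_option('portal', 'port')

# write() accepts any file-like object opened for text.
out = io.StringIO()
config.write(out)
saved = out.getvalue()

# The saved text can be parsed back in.
reloaded = configparser.ConfigParser()
reloaded.read_string(saved)
```

Passing an open file object to write() instead of the StringIO buffer would save the same text to disk.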

Updated 5/20/2007 with technorati tags.

