Monday, August 27, 2007

PyMOTW: optparse

Module: optparse
Purpose: Command line option parser to replace getopt.
Python Version: 2.3

Description:

The optparse module is a modern alternative for command line option parsing that offers several features not available in getopt, including type conversion, option callbacks, and automatic help generation. There are many more features to optparse than can be covered here, but this introduction should get you started if you are writing a command line program soon.

Creating an OptionParser:

There are two phases to parsing options with optparse. First, the OptionParser instance is constructed and configured with the expected options. Then a sequence of options is fed in and processed.

import optparse
parser = optparse.OptionParser()


Usually, once the parser has been created, each option is added to the parser explicitly, with information about what to do when the option is encountered on the command line. It is also possible to pass a list of options to the OptionParser constructor, but that form does not seem to be used as frequently.

Defining Options:

Options can be added one at a time using the add_option() method. Any un-named string arguments at the beginning of the argument list are treated as option names. To create aliases for an option, for example to have a short and long form of the same option, simply pass both names.

Unlike getopt, which only parses the options, optparse is a full option processing library. Options can be handled with different actions, specified by the action argument to add_option(). Supported actions include storing the argument (singly, or as part of a list), storing a constant value when the option is encountered (including special handling for true/false values for boolean switches), counting the number of times an option is seen, and calling a callback.

The default action is to store the argument to the option. In this case, if a type is provided, the argument value is converted to that type before it is stored. If the dest argument is provided, the option value is saved to an attribute of that name on the options object returned when the command line arguments are parsed.
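As a rough illustration of the actions described above, the sketch below defines one option per action. The option names (--name, --include, and so on) are invented for the example and do not come from any real program.

```python
import optparse

parser = optparse.OptionParser()
# Default action is "store"; -n and --name are aliases for the same option.
parser.add_option('-n', '--name', dest='name')
# "append" collects repeated values into a list.
parser.add_option('-i', '--include', action='append', dest='includes')
# "count" tracks how many times the switch appears.
parser.add_option('-v', action='count', dest='verbosity', default=0)
# "store_const" saves a fixed value when the switch is seen.
parser.add_option('--fast', action='store_const', const='fast',
                  dest='mode', default='normal')
# type="int" converts the argument before it is stored.
parser.add_option('-s', '--size', type='int', dest='size')

options, args = parser.parse_args(
    ['-n', 'demo', '-i', 'a', '--include', 'b', '-vv', '--fast', '--size', '10'])
```

After parsing, options.includes holds both values as a list, options.verbosity is 2, and options.size is an integer rather than a string.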

Parsing a Command Line:

Once all of the options are defined, the command line is parsed by passing a sequence of argument strings to parse_args(). By default, the arguments are taken from sys.argv[1:], but you can also pass your own list. The options are processed using the GNU/POSIX syntax, so option and argument values can be mixed in the sequence.

The return value from parse_args() is a two-part tuple containing an optparse.Values instance and the list of arguments to the command that were not interpreted as options. The Values instance holds the option values as attributes, so if your option dest is "myoption", you access the value as: options.myoption.
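A minimal sketch of the two-part return value, using made-up option names (-q and --level) and an explicit argument list instead of sys.argv[1:]. It also shows options and positional arguments mixed in the same sequence.

```python
import optparse

parser = optparse.OptionParser()
parser.add_option('-q', action='store_true', dest='quiet', default=False)
parser.add_option('--level', type='int', dest='level', default=1)

# Options and non-option arguments are interleaved on purpose.
options, args = parser.parse_args(
    ['-q', 'input.txt', '--level', '3', 'extra.txt'])

quiet = options.quiet   # attribute named after dest="quiet"
level = options.level   # converted to int by type="int"
```

The values are read as attributes of options, while args keeps the leftover arguments in their original order.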

Simple Examples:

Here is a simple example with 3 different options: a boolean option (-a), a simple string option (-b), and an integer option (-c).

import optparse
parser = optparse.OptionParser()
parser.add_option('-a', action="store_true", default=False)
parser.add_option('-b', action="store", dest="b")
parser.add_option('-c', action="store", dest="c", type="int")
print parser.parse_args(['-a', '-bval', '-c', '3'])


The options on the command line are parsed with the same rules that getopt.gnu_getopt() uses, so there are two ways to pass values to single character options. The example above uses both forms, -bval and -c val.

$ python optparse_short.py 
(<Values at 0xe29b8: {'a': True, 'c': 3, 'b': 'val'}>, [])


Notice that the type of the value associated with 'c' in the output is an integer, since the OptionParser was told to convert the argument before storing it.

Unlike with getopt, "long" option names are not handled any differently by optparse:

parser = optparse.OptionParser()
parser.add_option('--noarg', action="store_true", default=False)
parser.add_option('--witharg', action="store", dest="witharg")
parser.add_option('--witharg2', action="store", dest="witharg2", type="int")
print parser.parse_args([ '--noarg', '--witharg', 'val', '--witharg2=3' ])


And the results are similar:

$ python optparse_long.py
(<Values at 0xd3ad0: {'noarg': True, 'witharg': 'val', 'witharg2': 3}>, [])


Comparing with getopt:

Here is an implementation of the same example program used in the discussion of getopt:

import optparse
import sys

print 'ARGV :', sys.argv[1:]

parser = optparse.OptionParser()
parser.add_option('-o', '--output',
                  dest="output_filename",
                  default="default.out",
                  )
parser.add_option('-v', '--verbose',
                  dest="verbose",
                  default=False,
                  action="store_true",
                  )
parser.add_option('--version',
                  dest="version",
                  default=1.0,
                  type="float",
                  )
options, remainder = parser.parse_args()

print 'VERSION :', options.version
print 'VERBOSE :', options.verbose
print 'OUTPUT :', options.output_filename
print 'REMAINING :', remainder


Notice how the options -o and --output are aliased by being added at the same time. Either option can be used on the command line:

$ python optparse_getoptcomparison.py -o output.txt
ARGV : ['-o', 'output.txt']
VERSION : 1.0
VERBOSE : False
OUTPUT : output.txt
REMAINING : []
$ python optparse_getoptcomparison.py --output output.txt
ARGV : ['--output', 'output.txt']
VERSION : 1.0
VERBOSE : False
OUTPUT : output.txt
REMAINING : []


And, any unique prefix of the long option can also be used:

$ python optparse_getoptcomparison.py --out output.txt
ARGV : ['--out', 'output.txt']
VERSION : 1.0
VERBOSE : False
OUTPUT : output.txt
REMAINING : []


Option Callbacks:

Besides saving the arguments for options directly, it is possible to define callback functions to be invoked when the option is encountered on the command line. Callbacks for options take 4 arguments: the optparse.Option instance causing the callback, the option string from the command line, any argument value associated with the option, and the optparse.OptionParser instance doing the parsing work.

import optparse

def flag_callback(option, opt_str, value, parser):
    print 'flag_callback:'
    print '\toption:', repr(option)
    print '\topt_str:', opt_str
    print '\tvalue:', value
    print '\tparser:', parser
    return

def with_callback(option, opt_str, value, parser):
    print 'with_callback:'
    print '\toption:', repr(option)
    print '\topt_str:', opt_str
    print '\tvalue:', value
    print '\tparser:', parser
    return

parser = optparse.OptionParser()
parser.add_option('--flag', action="callback", callback=flag_callback)
parser.add_option('--with',
                  action="callback",
                  callback=with_callback,
                  type="string",
                  help="Include optional feature")
parser.parse_args(['--with', 'foo', '--flag'])


In this example, the --with option is configured to take a string argument (other types are supported as well, of course).

$ python optparse_callback.py
with_callback:
option: <Option at 0x78b98: --with>
opt_str: --with
value: foo
parser: <optparse.OptionParser instance at 0x78b48>
flag_callback:
option: <Option at 0x7c620: --flag>
opt_str: --flag
value: None
parser: <optparse.OptionParser instance at 0x78b48>
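Callbacks do not store anything on their own. One common approach, sketched here with an invented --shout option, is for the callback to set an attribute on parser.values, the Values instance being filled in during parsing.

```python
import optparse

def record_upper(option, opt_str, value, parser):
    # Store a transformed copy of the argument under the option's dest.
    setattr(parser.values, option.dest, value.upper())

parser = optparse.OptionParser()
parser.add_option('--shout',
                  action='callback',
                  callback=record_upper,
                  type='string',
                  dest='shout')

options, args = parser.parse_args(['--shout', 'hello'])
shouted = options.shout
```

Because the callback writes to parser.values, the result is available from the object returned by parse_args() just like any stored option.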


Help Messages:

The OptionParser automatically adds a help option to all option sets, so the user can pass --help on the command line to see instructions for running the program. The help message lists all of the options, with an indication of whether or not they take an argument. It is also possible to pass help text to add_option() to give a more verbose description of an option.

parser = optparse.OptionParser()
parser.add_option('--no-foo', action="store_true",
default=False,
dest="foo",
help="Turn off foo",
)
parser.add_option('--with', action="store", help="Include optional feature")
parser.parse_args()


The options are listed in the order they were added, after the automatically generated help option, with aliases included on the same line. When the option takes an argument, the upper-cased dest value is included as an argument name in the help output. The help text is printed in the right column.

$ python optparse_help.py --help
Usage: optparse_help.py [options]

Options:
-h, --help show this help message and exit
--no-foo Turn off foo
--with=WITH Include optional feature
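The argument name shown in the help can be overridden. The following sketch assumes a program called "demo" and a placeholder name FEATURE, both chosen only for illustration, and uses format_help() to capture the help text as a string instead of printing it and exiting.

```python
import optparse

parser = optparse.OptionParser(prog='demo',
                               usage='%prog [options] input')
parser.add_option('--with',
                  action='store',
                  dest='feature',
                  metavar='FEATURE',  # shown in place of the dest name
                  help='Include optional feature')

# format_help() returns the same text that --help would print.
help_text = parser.format_help()
```

The usage argument replaces the default "[options]" usage line, with %prog expanded to the program name.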


Callbacks can be configured to take multiple arguments using the nargs option.

def with_callback(option, opt_str, value, parser):
    print 'with_callback:'
    print '\toption:', repr(option)
    print '\topt_str:', opt_str
    print '\tvalue:', value
    print '\tparser:', parser
    return

parser = optparse.OptionParser()
parser.add_option('--with',
                  action="callback",
                  callback=with_callback,
                  type="string",
                  nargs=2,
                  help="Include optional feature")
parser.parse_args(['--with', 'foo', 'bar'])


In this case, the arguments are passed to the callback function as a tuple via the value argument.

$ python optparse_callback_nargs.py 
with_callback:
option: <Option at 0x7c4e0: --with>
opt_str: --with
value: ('foo', 'bar')
parser: <optparse.OptionParser instance at 0x78a08>
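The nargs setting is not limited to callbacks. As a small sketch with an invented --point option, the default store action also collects multiple arguments into a tuple, applying the type conversion to each one.

```python
import optparse

parser = optparse.OptionParser()
parser.add_option('--point',
                  action='store',
                  type='int',
                  nargs=2,
                  dest='point')

# The two values after --point are consumed; 'rest' is left over.
options, args = parser.parse_args(['--point', '3', '4', 'rest'])
point = options.point
```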


References:

Python Module of the Week Home
Download Sample Code




Tuesday, August 21, 2007

feedcache 1.1

There is a new release of feedcache available tonight, based on a patch from Thomas Perl. The update includes Unicode support for URLs, a "force" flag to always download data, and an "offline" mode flag to never download data.

Thanks, Thomas!

Monday, August 20, 2007

PyMOTW: csv

Module: csv
Purpose: Read and write comma separated value files.
Python Version: 2.3 and later

Description:

The csv module is very useful for working with data exported from spreadsheets and databases into text files. There is no well-defined standard, so the csv module uses "dialects" to support parsing using different parameters. Along with a generic reader and writer, the module includes a dialect for working with Microsoft Excel.

Limitations:

The Python 2.5 version of csv does not support Unicode data. There are also "issues with ASCII NUL characters". Using UTF-8 or printable ASCII is recommended.

Reading:

To read data from a csv file, use the reader() function to create a reader object. The reader can be used as an iterator to process the rows of the file in order. For example:

import csv
import sys

f = open(sys.argv[1], 'rt')
try:
    reader = csv.reader(f)
    for row in reader:
        print row
finally:
    f.close()


The first argument to reader() is the source of text lines. In this case, it is a file, but any iterable is accepted (StringIO instances, lists, etc.). Other optional arguments can be given to control how the input data is parsed.
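Since any iterable of lines works, a quick way to experiment is to pass a list of strings. The sample lines here are invented for the sketch.

```python
import csv

# Each list item stands in for one line of a file.
lines = ['name,comment', 'first,"has, a comma"']
rows = list(csv.reader(lines))
```

The quoted field keeps its embedded comma, so each row still has exactly two values.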

The example file "testdata.csv" was exported from NeoOffice.

$ cat testdata.csv 
"Title 1","Title 2","Title 3"
1,"a",08/18/07
2,"b",08/19/07
3,"c",08/20/07
4,"d",08/21/07
5,"e",08/22/07
6,"f",08/23/07
7,"g",08/24/07
8,"h",08/25/07
9,"i",08/26/07


As it is read, each row of the input data is converted to a list of strings.

$ python csv_reader.py testdata.csv
['Title 1', 'Title 2', 'Title 3']
['1', 'a', '08/18/07']
['2', 'b', '08/19/07']
['3', 'c', '08/20/07']
['4', 'd', '08/21/07']
['5', 'e', '08/22/07']
['6', 'f', '08/23/07']
['7', 'g', '08/24/07']
['8', 'h', '08/25/07']
['9', 'i', '08/26/07']


If you know that certain columns have specific types, you can convert the strings yourself, but csv does not automatically convert the input. It does handle line breaks embedded within strings in a row (which is why a "row" is not always the same as a "line" of input from the file).

$ cat testlinebreak.csv 
"Title 1","Title 2","Title 3"
1,"first line
second line",08/18/07

$ python csv_reader.py testlinebreak.csv
['Title 1', 'Title 2', 'Title 3']
['1', 'first line\nsecond line', '08/18/07']
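A minimal sketch of converting the columns by hand, assuming the same three-column layout as testdata.csv. It is written for Python 3 and reads from an in-memory io.StringIO so it does not depend on a file; the date format string is an assumption based on the sample data.

```python
import csv
import datetime
import io

data = io.StringIO(
    '"Title 1","Title 2","Title 3"\n'
    '1,"a",08/18/07\n'
    '2,"b",08/19/07\n'
)

reader = csv.reader(data)
header = next(reader)  # skip the title row before converting

rows = []
for number, letter, date_text in reader:
    # csv hands back strings; convert each column explicitly.
    when = datetime.datetime.strptime(date_text, '%m/%d/%y').date()
    rows.append((int(number), letter, when))
```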


Writing:

When you have data to be imported into some other application, writing CSV files is just as easy as reading them. Use the writer() function to create a writer object. For each row, use writerow() to write the row.

import csv
import sys

f = open(sys.argv[1], 'wt')
try:
    writer = csv.writer(f)
    writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
    for i in range(10):
        writer.writerow( (i+1, chr(ord('a') + i), '08/%02d/07' % (i+1)) )
finally:
    f.close()


The output does not look exactly like the exported data used in the reader example:

$ python csv_writer.py testout.csv 
$ cat testout.csv
Title 1,Title 2,Title 3
1,a,08/01/07
2,b,08/02/07
3,c,08/03/07
4,d,08/04/07
5,e,08/05/07
6,f,08/06/07
7,g,08/07/07
8,h,08/08/07
9,i,08/09/07
10,j,08/10/07


The default quoting behavior is different for the writer, so the string column is not quoted. That is easy to change by adding a quoting argument to quote non-numeric values:

    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)


And now the strings are quoted:

$ python csv_writer_quoted.py testout_quoted.csv
$ cat testout_quoted.csv
"Title 1","Title 2","Title 3"
1,"a","08/01/07"
2,"b","08/02/07"
3,"c","08/03/07"
4,"d","08/04/07"
5,"e","08/05/07"
6,"f","08/06/07"
7,"g","08/07/07"
8,"h","08/08/07"
9,"i","08/09/07"
10,"j","08/10/07"


Quoting:

There are four different quoting options, defined as constants in the csv module.


QUOTE_ALL

Quote everything, regardless of type.


QUOTE_MINIMAL

Quote fields with special characters (anything that would confuse a parser configured with the same dialect and options). This is the default.


QUOTE_NONNUMERIC

Quote all fields that are not integers or floats. When used with the reader, input fields that are not quoted are converted to floats.


QUOTE_NONE

Do not quote anything on output. When used with the reader, quote characters are included in the field values (normally, they are treated as delimiters and stripped).
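To compare the four modes side by side, this sketch writes the same row with each one. It targets Python 3 and uses io.StringIO as the output; the render() helper and the sample row are made up for the example. QUOTE_NONE needs an escapechar here because one field contains the delimiter.

```python
import csv
import io

def render(quoting, **options):
    # Hypothetical helper: write one sample row and return the text.
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=quoting, **options)
    writer.writerow([1, 'a', 'x,y'])
    return buf.getvalue().strip()

quote_all = render(csv.QUOTE_ALL)
quote_minimal = render(csv.QUOTE_MINIMAL)
quote_nonnumeric = render(csv.QUOTE_NONNUMERIC)
quote_none = render(csv.QUOTE_NONE, escapechar='\\')

# On input, QUOTE_NONNUMERIC turns unquoted fields into floats.
parsed = list(csv.reader(['1,"a"'], quoting=csv.QUOTE_NONNUMERIC))
```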



Dialects:

There are many parameters to control how the csv module parses or writes data. Rather than passing each of these parameters to the reader and writer separately, they are grouped together conveniently into a "dialect" object. Dialect classes can be registered by name, so that callers of the csv module do not need to know the parameter settings in advance. The standard library includes two dialects: excel and excel-tab. The "excel" dialect is for working with data in the default export format for Microsoft Excel, and also works with OpenOffice or NeoOffice. For details on the dialect parameters and how they are used, refer to section 9.1.2 of the standard library documentation for the csv module.
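As a sketch of registering a dialect by name, the example below defines a pipe-delimited dialect. The name "pipes" and the sample lines are assumptions made for illustration.

```python
import csv

# Register a named dialect that only changes the delimiter.
csv.register_dialect('pipes', delimiter='|')

lines = ['a|b', '1|2']
rows = list(csv.reader(lines, dialect='pipes'))

# Registered names, including the built-in ones, can be listed.
names = csv.list_dialects()
```

Callers only need the name "pipes"; the delimiter setting stays in one place.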

DictReader and DictWriter:

In addition to working with sequences of data, the csv module includes classes for working with rows as dictionaries. The DictReader and DictWriter classes translate rows to dictionaries. Keys for the dictionary can be passed in, or inferred from the first row in the input (when the row contains headers).

import csv
import sys

f = open(sys.argv[1], 'rt')
try:
    reader = csv.DictReader(f)
    for row in reader:
        print row
finally:
    f.close()


The dictionary-based reader and writer are implemented as wrappers around the sequence-based classes, and use the same arguments and API. The only difference is that rows are dictionaries instead of lists or tuples.

$ python csv_dictreader.py testdata.csv 
{'Title 1': '1', 'Title 3': '08/18/07', 'Title 2': 'a'}
{'Title 1': '2', 'Title 3': '08/19/07', 'Title 2': 'b'}
{'Title 1': '3', 'Title 3': '08/20/07', 'Title 2': 'c'}
{'Title 1': '4', 'Title 3': '08/21/07', 'Title 2': 'd'}
{'Title 1': '5', 'Title 3': '08/22/07', 'Title 2': 'e'}
{'Title 1': '6', 'Title 3': '08/23/07', 'Title 2': 'f'}
{'Title 1': '7', 'Title 3': '08/24/07', 'Title 2': 'g'}
{'Title 1': '8', 'Title 3': '08/25/07', 'Title 2': 'h'}
{'Title 1': '9', 'Title 3': '08/26/07', 'Title 2': 'i'}


The DictWriter must be given a list of field names so it knows how the columns should be ordered in the output.

import csv
import sys

f = open(sys.argv[1], 'wt')
try:
    fieldnames = ('Title 1', 'Title 2', 'Title 3')
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    headers = {}
    for n in fieldnames:
        headers[n] = n
    writer.writerow(headers)
    for i in range(10):
        writer.writerow({ 'Title 1':i+1,
                          'Title 2':chr(ord('a') + i),
                          'Title 3':'08/%02d/07' % (i+1),
                          })
finally:
    f.close()


$ python csv_dictwriter.py testout.csv 
$ cat testout.csv
Title 1,Title 2,Title 3
1,a,08/01/07
2,b,08/02/07
3,c,08/03/07
4,d,08/04/07
5,e,08/05/07
6,f,08/06/07
7,g,08/07/07
8,h,08/08/07
9,i,08/09/07
10,j,08/10/07
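Later Python releases (2.7 and 3.2 onward) added a writeheader() method to DictWriter, which replaces the hand-built header dictionary. The sketch below assumes Python 3 and writes to an in-memory buffer, then reads the data back with DictReader to show the round trip.

```python
import csv
import io

buf = io.StringIO(newline='')
writer = csv.DictWriter(buf, fieldnames=['Title 1', 'Title 2'])
writer.writeheader()  # writes the field names as the first row
# Key order in the dictionary does not matter; fieldnames decides.
writer.writerow({'Title 2': 'a', 'Title 1': 1})

buf.seek(0)
rows = [dict(row) for row in csv.DictReader(buf)]
```

Note that the integer written for 'Title 1' comes back as the string '1', since the reader does not convert values.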


References:

Python Module of the Week Home
Download Sample Code
PEP 305, CSV File API




Updated: Fixed link to section 9.1.2.

Python Developer Networking

Jesse Noller is leading a campaign to have Python developers form a network via LinkedIn.com. He talks about it over on his blog, so check it out for the details.

According to Doug Napoleone's comment on Jesse's post, there is a more formal effort to set up a PyCon08 group and tie it in with the web site for the convention. I didn't realize that LinkedIn supported groups other than "employers". It looks like the right way to go for the community is a "networking group" (there are only a few types, and the others seem to imply a more formal organization than what we would have). Unfortunately, the groups feature is closed for right now.

For now, following Jesse's lead, I set up a position with the "Python community" organization and job title "None". I had to guess at the start date. :-)

I noticed several Python-oriented groups over on Facebook, so that might be an alternative if LinkedIn doesn't come through. Somehow LinkedIn feels more professional; maybe I just have a historical bias based on Facebook's origins, though.

Thursday, August 16, 2007

PyMOTW on O'Reilly ONLamp

I'm pleased to bring the Python Module of the Week series to the O'Reilly ONLamp site.

The main feed and home page are not moving, and all posts will continue to be posted here.

Updated: The announcement post is here.

Sunday, August 12, 2007

CommandLineApp

Back when Python 1.5.4 was hot and new, I wrote a class to serve as a basis for the many command line programs I was working on for myself and my employer. This was long before the Option Parsing Wars that resulted in the addition of optparse to the standard library. If optparse had been around, I probably wouldn't have written CommandLineApp, but since all I had to work with at the time was getopt, and it operated at such a low level, I hacked together a helper class.

The difference between CommandLineApp and optparse is that CommandLineApp treats your application as an object, just like everything else in the application. The application class is responsible for option processing, although it collaborates with getopt to do the parsing work.

To use it, you subclass CommandLineApp and define option handler methods and a main(). To invoke the program, call run(). The option handlers are identified by name, and used to build the list of supported options. "optionHandler_myopt" is called when "--myopt" is encountered. If the method takes an argument, so does your option. The docstrings for the callback methods are used to create the help output. Support for short-form usage (via -h) and long-form help (via --help) is built in to the base class.
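To make the naming convention concrete, here is a rough sketch of how handler methods could be discovered with inspect and turned into a getopt-style long option list. This is not the CommandLineApp source; the DemoApp class and the long_options() function are hypothetical, and the code assumes Python 3 (inspect.signature).

```python
import inspect

PREFIX = 'optionHandler_'

class DemoApp(object):
    # Hypothetical application class following the naming convention.
    def optionHandler_verbose(self):
        """Turn on verbose output."""

    def optionHandler_output(self, filename):
        """Write results to the named file."""

def long_options(app_class):
    # Build getopt long option names from the handler methods.
    names = []
    for name, func in inspect.getmembers(app_class, inspect.isfunction):
        if not name.startswith(PREFIX):
            continue
        params = list(inspect.signature(func).parameters)
        takes_arg = len(params) > 1  # anything beyond self
        names.append(name[len(PREFIX):] + ('=' if takes_arg else ''))
    return names

options = long_options(DemoApp)
```

A handler with a parameter beyond self produces a name ending in "=", which is how getopt marks a long option that requires an argument.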

The old version (released as 1.0), which had not received a substantial rewrite in many years (mostly because it still worked fine and I had more important projects to work on) can run under Python 1.4 through 2.5. It was some of the earliest complex Python code I ever wrote, and that is clear from the code quality (both style and substance). The new version has been tested under Python 2.5. It feels less hack-ish, since it uses inspect instead of scanning the class hierarchy and method signatures directly.

The 2.0 rewrite works in essentially the same way as 1.0, but is much more compact and (I think) the code is cleaner. I called it 2.0 because the class API is different in a few important ways from the earlier version. I still want to add argument validation (for non-option arguments to the program), but that will take a little more time.

feedcache 1.0

I am happy with the API for feedcache, so I bumped the release to 1.0 and beta status. The package includes tests and 2 separate examples (one with shelve and another using threads and shove).

PyMOTW: getopt

Module: getopt
Purpose: Command line option parsing
Python Version: 1.4

Description:

The getopt module is the old-school command line option parser which supports the conventions established by the Unix function getopt(). It parses an argument sequence, such as sys.argv, and returns a sequence of (option, argument) pairs and a sequence of non-option arguments.

Supported option syntax includes:

-a
-bval
-b val
--noarg
--witharg=val
--witharg val


Function Arguments

The getopt function takes three arguments:

The first argument is the sequence of arguments to be parsed. This usually comes from sys.argv[1:] (ignoring the program name in sys.argv[0]).

The second argument is the option definition string for single character options. If one of the options requires an argument, its letter is followed by a colon.

The third argument, if used, should be a sequence of the long-style option names. Long style options can be more than a single character, such as --noarg or --witharg. The option names in the sequence should not include the -- prefix. If any long option requires an argument, its name should have a suffix of =.

Short and long form options can be combined in a single call.
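A short sketch of one call that mixes both forms, with an invented argument list. The trailing file name is not an option, so it ends up in the second return value.

```python
import getopt

argv = ['-a', '-b', 'val', '--noarg', '--witharg=x', 'file.txt']

# 'ab:' means -a takes no argument and -b requires one.
opts, remainder = getopt.getopt(argv, 'ab:', ['noarg', 'witharg='])
```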

Short Form Options

If a program takes 2 options, -a and -b, with the -b option requiring an argument, the option definition string should be "ab:". The example below also defines -c, which takes an argument as well.

import getopt

print getopt.getopt(['-a', '-bval', '-c', 'val'], 'ab:c:')


$ python getopt_short.py 
([('-a', ''), ('-b', 'val'), ('-c', 'val')], [])


Long Form Options

If a program wants to take 2 options, --noarg and --witharg the sequence should be [ 'noarg', 'witharg=' ].

import getopt

print getopt.getopt([ '--noarg', '--witharg', 'val', '--witharg2=another' ],
                    '',
                    [ 'noarg', 'witharg=', 'witharg2=' ])


$ python getopt_long.py 
([('--noarg', ''), ('--witharg', 'val'), ('--witharg2', 'another')], [])


Example

Below is a more complete example program which takes 5 options: -o, -v, --output, --verbose, and --version. The -o, --output, and --version options require an argument.

import getopt
import sys

version = '1.0'
verbose = False
output_filename = 'default.out'

print 'ARGV :', sys.argv[1:]

options, remainder = getopt.getopt(sys.argv[1:], 'o:v', ['output=',
                                                         'verbose',
                                                         'version=',
                                                         ])
print 'OPTIONS :', options

for opt, arg in options:
    if opt in ('-o', '--output'):
        output_filename = arg
    elif opt in ('-v', '--verbose'):
        verbose = True
    elif opt == '--version':
        version = arg

print 'VERSION :', version
print 'VERBOSE :', verbose
print 'OUTPUT :', output_filename
print 'REMAINING :', remainder


The program can be called in a variety of ways.

$ python ./getopt_example.py
ARGV : []
OPTIONS : []
VERSION : 1.0
VERBOSE : False
OUTPUT : default.out
REMAINING : []


A single letter option can be separate from its argument:

$ python ./getopt_example.py -o foo
ARGV : ['-o', 'foo']
OPTIONS : [('-o', 'foo')]
VERSION : 1.0
VERBOSE : False
OUTPUT : foo
REMAINING : []


or combined:

$ python ./getopt_example.py -ofoo
ARGV : ['-ofoo']
OPTIONS : [('-o', 'foo')]
VERSION : 1.0
VERBOSE : False
OUTPUT : foo
REMAINING : []


A long form option can similarly be separate:

$ python ./getopt_example.py --output foo    
ARGV : ['--output', 'foo']
OPTIONS : [('--output', 'foo')]
VERSION : 1.0
VERBOSE : False
OUTPUT : foo
REMAINING : []


or combined, with =:

$ python ./getopt_example.py --output=foo
ARGV : ['--output=foo']
OPTIONS : [('--output', 'foo')]
VERSION : 1.0
VERBOSE : False
OUTPUT : foo
REMAINING : []


Abbreviating Long Form Options

The long form option does not have to be spelled out entirely, so long as a unique prefix is provided:

$ python ./getopt_example.py --o foo
ARGV : ['--o', 'foo']
OPTIONS : [('--output', 'foo')]
VERSION : 1.0
VERBOSE : False
OUTPUT : foo
REMAINING : []


If a unique prefix is not provided, an exception is raised.

$ python ./getopt_example.py --ver 2.0
ARGV : ['--ver', '2.0']
Traceback (most recent call last):
  File "./getopt_example.py", line 43, in <module>
    'version=',
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/getopt.py", line 89, in getopt
    opts, args = do_longs(opts, args[0][2:], longopts, args[1:])
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/getopt.py", line 153, in do_longs
    has_arg, opt = long_has_args(opt, longopts)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/getopt.py", line 180, in long_has_args
    raise GetoptError('option --%s not a unique prefix' % opt, opt)
getopt.GetoptError: option --ver not a unique prefix
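A real program would normally catch the exception instead of showing a traceback. A minimal sketch, reusing the option definitions from the example program:

```python
import getopt

message = None
failed_option = None
try:
    getopt.getopt(['--ver', '2.0'], 'o:v',
                  ['output=', 'verbose', 'version='])
except getopt.GetoptError as err:
    # --ver matches both --verbose and --version.
    message = str(err)
    failed_option = err.opt
```

The message text can be shown to the user, and the opt attribute identifies the option that caused the problem.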


Option processing stops as soon as the first non-option argument is encountered.

$ python ./getopt_example.py -v not_an_option --output foo
ARGV : ['-v', 'not_an_option', '--output', 'foo']
OPTIONS : [('-v', '')]
VERSION : 1.0
VERBOSE : True
OUTPUT : default.out
REMAINING : ['not_an_option', '--output', 'foo']


GNU-style Option Parsing

Python 2.3 added the gnu_getopt() function, which allows option and non-option arguments to be mixed on the command line in any order. After changing the call in the previous example to use gnu_getopt(), the difference becomes clear:

$ python ./getopt_gnu.py -v not_an_option --output foo
ARGV : ['-v', 'not_an_option', '--output', 'foo']
OPTIONS : [('-v', ''), ('--output', 'foo')]
VERSION : 1.0
VERBOSE : True
OUTPUT : foo
REMAINING : ['not_an_option']
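The change amounts to swapping the function name. This sketch runs both functions on the same arguments so the results can be compared directly; it assumes the POSIXLY_CORRECT environment variable is not set, since that makes gnu_getopt() stop at the first non-option argument too.

```python
import getopt

argv = ['-v', 'not_an_option', '--output', 'foo']
longopts = ['output=', 'verbose', 'version=']

# getopt() stops at the first non-option argument.
posix_result = getopt.getopt(argv, 'o:v', longopts)

# gnu_getopt() keeps scanning for options after it.
gnu_result = getopt.gnu_getopt(argv, 'o:v', longopts)
```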


Special Case: --

If getopt encounters -- in the input arguments, it stops processing the remaining arguments as options.

$ python ./getopt_example.py -v -- --output foo
ARGV : ['-v', '--', '--output', 'foo']
OPTIONS : [('-v', '')]
VERSION : 1.0
VERBOSE : True
OUTPUT : default.out
REMAINING : ['--output', 'foo']


References:

Python Module of the Week Home
Download Sample Code
getopt - Linux Command - Unix Command




Friday, August 10, 2007

ORM envy

Jonathan LaCour presented an overview of SQLAlchemy at last night's PyATL meeting, and now I have ORM envy. It's too bad I can't afford the effort that would be involved in replacing the in-house ORM we use at work, but I'll definitely consider using it for my own projects.




Sunday, August 5, 2007

PyMOTW: shelve

Module: shelve
Purpose: The shelve module implements persistent storage for arbitrary Python objects which can be pickled, using a dictionary-like API.
Python Version: 1.4

Description:

The shelve module can be used as a simple persistent storage option for Python objects when a relational database is overkill. The shelf is accessed by keys, just as with a dictionary. The values are pickled and written to a database created and managed by anydbm.

Creating a new Shelf

The simplest way to use shelve is via the DbfilenameShelf class. It uses anydbm to store the data. You can use the class directly, or simply call shelve.open():

import shelve

s = shelve.open('test_shelf.db')
try:
    s['key1'] = { 'int': 10, 'float':9.5, 'string':'Sample data' }
finally:
    s.close()


To access the data again, open the shelf and use it like a dictionary:

s = shelve.open('test_shelf.db')
try:
    existing = s['key1']
finally:
    s.close()

print existing


If you run both sample scripts, you should see:

$ python shelve_create.py
$ python shelve_existing.py
{'int': 10, 'float': 9.5, 'string': 'Sample data'}


The dbm module does not support multiple applications writing to the same database at the same time. If you know your client will not be modifying the shelf, you can tell shelve to open the database read-only.

s = shelve.open('test_shelf.db', flag='r')
try:
    existing = s['key1']
finally:
    s.close()

print existing


If your program tries to modify the database while it is opened read-only, an access error exception is generated. The exception type depends on the database module selected by anydbm when the database was created.

Write-back

Shelves do not track modifications to mutable objects, by default. That means if you change the contents of an item stored in the shelf, you must update the shelf explicitly by storing the item again.

s = shelve.open('test_shelf.db')
try:
    print s['key1']
    s['key1']['new_value'] = 'this was not here before'
finally:
    s.close()

s = shelve.open('test_shelf.db', writeback=True)
try:
    print s['key1']
finally:
    s.close()


In this example, the dictionary at 'key1' is not stored again, so when the shelf is re-opened, the changes have not been preserved.

$ python shelve_create.py
$ python shelve_withoutwriteback.py
{'int': 10, 'float': 9.5, 'string': 'Sample data'}
{'int': 10, 'float': 9.5, 'string': 'Sample data'}
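Without writeback, the fix is to fetch the item, change it, and assign it back to the same key. A minimal sketch, using a temporary directory so it does not touch test_shelf.db:

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'sketch_shelf')

s = shelve.open(path)
try:
    s['key1'] = {'int': 10}
    item = s['key1']           # this is a copy loaded from the database
    item['new_value'] = 'this was not here before'
    s['key1'] = item           # store it again so the change is saved
finally:
    s.close()

s = shelve.open(path)
try:
    stored = s['key1']
finally:
    s.close()
```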


To automatically catch changes to mutable objects stored in the shelf, open the shelf with writeback enabled. The writeback flag causes the shelf to remember all of the objects retrieved from the database using an in-memory cache. Each cached object is also written back to the database when the shelf is closed.

s = shelve.open('test_shelf.db', writeback=True)
try:
    print s['key1']
    s['key1']['new_value'] = 'this was not here before'
    print s['key1']
finally:
    s.close()

s = shelve.open('test_shelf.db', writeback=True)
try:
    print s['key1']
finally:
    s.close()


Although it reduces the chance of programmer error, and can make object persistence more transparent, using writeback mode may not be desirable in every situation. The cache consumes extra memory while the shelf is open, and pausing to write every cached object back to the database when it is closed can take extra time. Since there is no way to tell if the cached objects have been modified, they are all written back. If your application reads data more than it writes, writeback will add more overhead than you might want.

$ python shelve_create.py
$ python shelve_writeback.py
{'int': 10, 'float': 9.5, 'string': 'Sample data'}
{'int': 10, 'new_value': 'this was not here before', 'float': 9.5, 'string': 'Sample data'}
{'int': 10, 'new_value': 'this was not here before', 'float': 9.5, 'string': 'Sample data'}


Specific Shelf Types

The examples above all use the default shelf implementation. Using shelve.open() instead of one of the shelf implementations directly is a common usage pattern, especially if you do not care what type of database is used to store the data. There are times, however, when you do care. In those situations, you may want to use DbfilenameShelf or BsdDbShelf directly, or even subclass Shelf for a custom solution.

References:

Python Module of the Week Home
Download Sample Code
feedcache uses shelve as a default storage option




New project: feedcache

Back in mid-June I promised Jeremy Jones that I would clean up some of the code I use for CastSampler.com to cache RSS and Atom feeds so he could look at it for his podgrabber project. I finally found some time to work on it this weekend (not quite 2 months later, sorry Jeremy).

The result is feedcache, which I have released in "alpha" status, for now. I don't usually bother releasing my code in alpha state, because that usually means I'm not actually using it anywhere with enough regularity to ensure that it is robust. I am going ahead and releasing feedcache early because I am hoping for some feedback on the API. I realized that the way I cache feeds for CastSampler.com is not the way all applications will want to cache them, so the design might be biased.

The Design

There are two aspects to caching the feed data: the high level code that knows it is working with RSS or Atom feeds, and the low level code that saves the data with a timestamp. The high level Cache class is responsible for fetching, updating, and expiring feed content. The low level storage classes are responsible for saving and restoring feed content.

Since the storage handling is separated from the cache management, it is possible to adapt the Cache to whatever sort of storage option might work best for you. So far, I have implemented two backend storage options. MemoryStorage keeps everything in memory, and is mostly useful for testing. ShelveStorage uses the shelve module to store all of the feed data in one file using pickles. I hope that the API for the backend storage manager is simple enough to make it easy for you to tie in your own backend if neither of these options is appealing. Something that uses memcached would be very interesting, for example.


The Cache class uses a fairly simple algorithm to decide if it needs to update the stored data:


  1. If there is nothing stored for the URL, fetch the data.

  2. If there is something stored for the URL and its time-to-live has not passed, use that data. (This throttles repeated requests for the same feed content.)

  3. If the stored data has expired, use any available ETag and modification time header data to perform a conditional GET of the data. If new data is returned, update the stored data. If no new data is returned, update the time-to-live for the stored data and return what is stored.
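The three steps above can be sketched as a small decision function. This is only an illustration of the logic, not the feedcache implementation: the function name, the (timestamp, data) tuple layout, and the returned labels are all assumptions.

```python
def choose_action(stored, now, time_to_live):
    """Decide what to do for a URL (hypothetical helper).

    stored is None or a (timestamp, data) tuple.
    """
    if stored is None:
        return 'fetch'                # step 1: nothing cached yet
    stored_time, data = stored
    if now - stored_time < time_to_live:
        return 'use_stored'           # step 2: still fresh
    return 'conditional_get'          # step 3: expired, ask politely

first_visit = choose_action(None, now=1000, time_to_live=300)
fresh = choose_action((900, 'feed data'), now=1000, time_to_live=300)
expired = choose_action((500, 'feed data'), now=1000, time_to_live=300)
```

In the expired case the caller would send the saved ETag and modification time with the request, then either replace the stored data or just reset its timestamp.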



The feed data is retrieved and parsed by Mark Pilgrim's feedparser module, so the Cache really does just manage the contents of the backend storage.

Another benefit of separating the cache manager from the storage handler is that only the storage handler needs to be thread-safe. The storage handler is given to each Cache as an argument to the constructor. In a multi-threaded app, each thread can have its own Cache (which does the fetching, when needed) and share a single backend storage handler.

Example

Here is a simple example program that uses a shelve file for storage. The example does not use multiple threads, but should still illustrate how to use the cache.

def main(urls=[]):
    print 'Saving feed data to ./.feedcache'
    storage = shelvestorage.ShelveStorage('.feedcache')
    storage.open()
    try:
        fc = cache.Cache(storage)
        for url in urls:
            parsed_data = fc[url]
            print parsed_data.feed.title
            for entry in parsed_data.entries:
                print '\t', entry.title
    finally:
        storage.close()
    return


Additional Work

This project is still a work in progress, but I would appreciate any feedback you have, good or bad. And of course, report bugs if you find them!