Sunday, October 28, 2007

PyMOTW: commands

The commands module contains utility functions for working with shell command output under Unix.

Module: commands
Purpose: Run external shell commands and capture the status code and output.
Python Version: 1.4

Description:

Note: This module is made obsolete by the subprocess module.

There are 3 functions in the commands module for working with external commands. The functions are shell-aware and return the output or status code from the command.

getstatusoutput():

The function getstatusoutput() runs a command via the shell and returns the exit code and the text output (stdout and stderr combined). The exit codes are the same as for the C function wait() or os.wait(). The code is a 16-bit number. The low byte contains the signal number that killed the process. When the signal is zero, the high byte is the exit status of the program. If a core file was produced, the high bit of the low byte is set.

from commands import *

def run_command(cmd):
    print 'Running: "%s"' % cmd
    status, text = getstatusoutput(cmd)
    exit_code = status >> 8       # high byte: exit status
    signal_num = status & 0x7f    # low 7 bits: signal number
    print 'Signal: %d' % signal_num
    print 'Exit : %d' % exit_code
    print 'Core? : %s' % bool(status & 0x80)  # high bit of the low byte
    print 'Output:'
    print text
    print

run_command('ls -l *.py')
run_command('ls -l *.notthere')
run_command('echo "WAITING TO BE KILLED"; read input')


This example runs 2 commands which exit normally, and a third which blocks waiting to be killed from another shell. (Don't simply use Ctrl-C as the interpreter will intercept that signal. Use ps and grep in another window to find the read process and send it a signal with kill.)

$ python commands_getstatusoutput.py
Running: "ls -l *.py"
Signal: 0
Exit : 0
Core? : False
Output:
-rw-r--r-- 1 dhellman dhellman 1191 Oct 21 09:41 __init__.py
-rw-r--r-- 1 dhellman dhellman 1321 Oct 21 09:48 commands_getoutput.py
-rw-r--r-- 1 dhellman dhellman 1265 Oct 21 09:50 commands_getstatus.py
-rw-r--r-- 1 dhellman dhellman 1626 Oct 21 10:10 commands_getstatusoutput.py

Running: "ls -l *.notthere"
Signal: 0
Exit : 1
Core? : False
Output:
ls: *.notthere: No such file or directory

Running: "echo "WAITING TO BE KILLED"; read input"
Signal: 1
Exit : 0
Core? : False
Output:
WAITING TO BE KILLED


In this example, I used "kill -HUP $PID" to kill the read process.
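
The status layout described above can also be unpacked by hand with a few bit operations. Here is a minimal sketch, written for Python 3 rather than the Python 2 used in the rest of this post; decode_wait_status() is a hypothetical helper name, not part of the commands module.

```python
def decode_wait_status(status):
    """Split a 16-bit wait()-style status into (exit_code, signal_num, core_dumped)."""
    exit_code = (status >> 8) & 0xff    # high byte: exit status
    signal_num = status & 0x7f          # low 7 bits: terminating signal
    core_dumped = bool(status & 0x80)   # high bit of the low byte
    return exit_code, signal_num, core_dumped

print(decode_wait_status(256))  # a process that exited with status 1
print(decode_wait_status(1))    # a process killed by signal 1 (HUP)
```

The exit status is only meaningful when the signal number is zero, which matches the third run above, where the signal is 1 and the exit value is reported as 0.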

getoutput():

If the exit code is not useful for your application, you can use getoutput() to receive only the text output from the command.

from commands import *

text = getoutput('ls -l *.py')
print 'ls -l *.py:'
print text

print

text = getoutput('ls -l *.notthere')
print 'ls -l *.notthere:'
print text


$ python commands_getoutput.py
ls -l *.py:
-rw-r--r-- 1 dhellman dhellman 1191 Oct 21 09:41 __init__.py
-rw-r--r-- 1 dhellman dhellman 1321 Oct 21 09:48 commands_getoutput.py
-rw-r--r-- 1 dhellman dhellman 1265 Oct 21 09:50 commands_getstatus.py
-rw-r--r-- 1 dhellman dhellman 1626 Oct 21 10:10 commands_getstatusoutput.py

ls -l *.notthere:
ls: *.notthere: No such file or directory


getstatus():

Contrary to what you might expect, getstatus() does not run a command and return the status code. Instead, its argument is a filename which is combined with "ls -ld" to build a command to be run by getoutput(). The text output of the command is returned.

from commands import *

status = getstatus('commands_getstatus.py')
print 'commands_getstatus.py:', status
status = getstatus('notthere.py')
print 'notthere.py:', status
status = getstatus('$filename')
print '$filename:', status


As you can see from the output, the $ character in the argument to the last call is escaped so the environment variable name is not expanded.

$ python commands_getstatus.py
commands_getstatus.py: -rw-r--r-- 1 dhellman dhellman 1387 Oct 21 10:19 commands_getstatus.py
notthere.py: ls: notthere.py: No such file or directory
$filename: ls: $filename: No such file or directory
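
Since the note at the top says commands is made obsolete by subprocess, here is a rough sketch of how getstatusoutput() might be approximated on a Unix system. It assumes Python 3.5 or later, and get_status_output() is a name made up for this illustration. Unlike the commands version, it reports the exit code directly rather than the 16-bit wait() status.

```python
import subprocess

def get_status_output(cmd):
    """Run cmd through the shell and return (exit_code, combined_output)."""
    proc = subprocess.run(
        cmd,
        shell=True,                 # shell-aware, like the commands functions
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold stderr into stdout
        universal_newlines=True,    # decode the output to text
    )
    return proc.returncode, proc.stdout.rstrip('\n')

status, text = get_status_output('echo hello')
print(status, text)
```

Passing shell=True keeps wildcard and variable expansion working, at the cost of the usual quoting concerns when the command string includes untrusted input.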


References:

Python Module of the Week Home
Download Sample Code




Saturday, October 20, 2007

PyMOTW: itertools

The itertools module includes a set of functions for working with iterable (sequence-like) data sets.

Module: itertools
Purpose: Iterator functions for efficient looping
Python Version: 2.3

Description:

The functions provided are inspired by similar features of the "lazy functional programming language" Haskell and of SML. They are intended to be fast and use memory efficiently, and also to be hooked together to express more complicated iteration-based algorithms.

Iterator-based code may be preferred over code which uses lists for several reasons. Since data is not produced from the iterator until it is needed, the data does not all have to be stored in memory at the same time. Reducing memory usage can reduce swapping and other side-effects of large data sets, increasing performance.
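
To make the "not produced until it is needed" point concrete, here is a small sketch. It assumes Python 3, where the built-in map() is already lazy in the way imap() is in Python 2; the calls list is only there to record when the function actually runs.

```python
from itertools import count, islice

calls = []

def square(x):
    calls.append(x)   # record each time the function really runs
    return x * x

lazy = map(square, count())           # infinite input, nothing computed yet
first_three = list(islice(lazy, 3))   # pulls exactly three values

print(first_three)
print(calls)
```

Only three calls to square() are made, even though the input is unbounded.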

All of the examples below assume the contents of itertools was imported using from itertools import *.

Merging and Splitting Iterators:

The chain() function takes several iterators as arguments and returns a single iterator that produces the contents of all of them as though they came from a single sequence.

for i in chain([1, 2, 3], ['a', 'b', 'c']):
    print i


$ python itertools_chain.py 
1
2
3
a
b
c


izip() returns an iterator that combines the elements of several iterators into tuples. It works like the built-in function zip(), except that it returns an iterator instead of a list.

for i in izip([1, 2, 3], ['a', 'b', 'c']):
    print i


$ python itertools_izip.py 
(1, 'a')
(2, 'b')
(3, 'c')


The islice() function returns an iterator which returns selected items from the input iterator, by index.

print 'Stop at 5:'
for i in islice(count(), 5):
    print i


Stop at 5:
0
1
2
3
4


It takes the same arguments as the slice operator for lists: start, stop, and step. The start and step arguments are optional.

print 'Start at 5, Stop at 10:'
for i in islice(count(), 5, 10):
    print i


Start at 5, Stop at 10:
5
6
7
8
9


print 'By tens to 100:'
for i in islice(count(), 0, 100, 10):
    print i


By tens to 100:
0
10
20
30
40
50
60
70
80
90


The tee() function returns several independent iterators (defaults to 2) based on a single original input. It has semantics similar to the Unix tee utility, which repeats the values it reads from its input and writes them to a named file and standard output.

r = islice(count(), 5)
i1, i2 = tee(r)

for i in i1:
    print 'i1:', i
for i in i2:
    print 'i2:', i


$ python itertools_tee.py
i1: 0
i1: 1
i1: 2
i1: 3
i1: 4
i2: 0
i2: 1
i2: 2
i2: 3
i2: 4


Since the new iterators created by tee() share the input, you should not use the original iterator any more. If you do consume values from the original input, the new iterators will not produce those values:

r = islice(count(), 5)
i1, i2 = tee(r)

for i in r:
    print 'r:', i
    if i > 1:
        break

for i in i1:
    print 'i1:', i
for i in i2:
    print 'i2:', i


$ python itertools_tee_error.py
r: 0
r: 1
r: 2
i1: 3
i1: 4
i2: 3
i2: 4
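
One common way to use tee() safely is to hand the original input over completely and only touch the copies. The sketch below assumes Python 3 (zip() is lazy there, like izip() here); pairwise() is a helper defined just for this illustration.

```python
from itertools import tee

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)   # two independent views of the same input
    next(b, None)          # advance the second view by one item
    return zip(a, b)

pairs = list(pairwise([1, 2, 3, 4]))
print(pairs)
```

The original iterable is never consumed directly after the call to tee(), so neither copy misses any values.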


Converting Inputs:

The imap() function returns an iterator that calls a function on the values in the input iterators, and returns the results. It works like the built-in map(), except that it stops when any input iterator is exhausted (instead of inserting None values to completely consume all of the inputs).

In this first example, the lambda function multiplies the input values by 2:

print 'Doubles:'
for i in imap(lambda x:2*x, xrange(5)):
    print i


$ python itertools_imap.py
Doubles:
0
2
4
6
8


In a second example, the lambda function multiplies 2 arguments, taken from separate iterators, and returns a tuple with the original arguments and the computed value.

print 'Multiples:'
for i in imap(lambda x,y:(x, y, x*y), xrange(5), xrange(5,10)):
    print '%d * %d = %d' % i


Multiples:
0 * 5 = 0
1 * 6 = 6
2 * 7 = 14
3 * 8 = 24
4 * 9 = 36


The starmap() function is similar to imap(), but instead of constructing a tuple from multiple iterators it splits up the items in a single iterator as arguments to the mapping function using the * syntax. Where the mapping function to imap() is called f(i1, i2), the mapping function to starmap() is called f(*i).

values = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
for i in starmap(lambda x,y:(x, y, x*y), values):
    print '%d * %d = %d' % i


$ python itertools_starmap.py 
0 * 5 = 0
1 * 6 = 6
2 * 7 = 14
3 * 8 = 24
4 * 9 = 36
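
The difference in calling convention can be seen side by side in this short sketch. It assumes Python 3, where the built-in map() plays the role of imap(); the input numbers are arbitrary.

```python
from itertools import starmap

# map() takes one iterable per argument: f(i1, i2)
separate = list(map(pow, [2, 3, 10], [5, 2, 3]))

# starmap() takes one iterable of argument tuples: f(*i)
packed = list(starmap(pow, [(2, 5), (3, 2), (10, 3)]))

print(separate)
print(packed)
```

Both calls compute the same powers; only the shape of the input differs.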


Producing New Values:

The count() function returns an iterator that produces consecutive integers, indefinitely. The first number can be passed as an argument; the default is zero. There is no upper bound argument (see the built-in xrange() for more control over the result set). In this example, the iteration stops because the list argument is consumed.

for i in izip(count(1), ['a', 'b', 'c']):
    print i


$ python itertools_count.py 
(1, 'a')
(2, 'b')
(3, 'c')


The cycle() function returns an iterator that repeats the contents of the arguments it is given indefinitely. Since it has to remember the entire contents of the input iterator, it may consume quite a bit of memory if the iterator is long. In this example, a counter variable is used to break out of the loop after a few cycles.

i = 0
for item in cycle(['a', 'b', 'c']):
    i += 1
    if i == 10:
        break
    print (i, item)


$ python itertools_cycle.py
(1, 'a')
(2, 'b')
(3, 'c')
(4, 'a')
(5, 'b')
(6, 'c')
(7, 'a')
(8, 'b')
(9, 'c')


The repeat() function returns an iterator that produces the same value each time it is accessed. It keeps going forever, unless the optional times argument is provided to limit it.

for i in repeat('over-and-over', 5):
    print i


$ python itertools_repeat.py 
over-and-over
over-and-over
over-and-over
over-and-over
over-and-over


It is useful to combine repeat() with izip() or imap() when invariant values need to be included with the values from the other iterators.

for i, s in izip(count(), repeat('over-and-over', 5)):
    print i, s


$ python itertools_repeat_izip.py 
0 over-and-over
1 over-and-over
2 over-and-over
3 over-and-over
4 over-and-over


for i in imap(lambda x,y:(x, y, x*y), repeat(2), xrange(5)):
    print '%d * %d = %d' % i


$ python itertools_repeat_imap.py 
2 * 0 = 0
2 * 1 = 2
2 * 2 = 4
2 * 3 = 6
2 * 4 = 8


Filtering:

The dropwhile() function returns an iterator that returns elements of the input iterator after a condition becomes false for the first time. It does not filter every item of the input; after the condition is false the first time, all of the remaining items in the input are returned.

def should_drop(x):
    print 'Testing:', x
    return (x<1)

for i in dropwhile(should_drop, [ -1, 0, 1, 2, 3, 4, 1, -2 ]):
    print 'Yielding:', i


$ python itertools_dropwhile.py 
Testing: -1
Testing: 0
Testing: 1
Yielding: 1
Yielding: 2
Yielding: 3
Yielding: 4
Yielding: 1
Yielding: -2


The opposite of dropwhile(), takewhile() returns an iterator that returns items from the input iterator as long as the test function returns true.

def should_take(x):
    print 'Testing:', x
    return (x<2)

for i in takewhile(should_take, [ -1, 0, 1, 2, 3, 4, 1, -2 ]):
    print 'Yielding:', i


$ python itertools_takewhile.py
Testing: -1
Yielding: -1
Testing: 0
Yielding: 0
Testing: 1
Yielding: 1
Testing: 2


ifilter() returns an iterator that works like the built-in filter() does for lists, including only items for which the test function returns true. It is different from dropwhile() in that every item is tested before it is returned.

def check_item(x):
    print 'Testing:', x
    return (x<1)

for i in ifilter(check_item, [ -1, 0, 1, 2, 3, 4, 1, -2 ]):
    print 'Yielding:', i


$ python itertools_ifilter.py 
Testing: -1
Yielding: -1
Testing: 0
Yielding: 0
Testing: 1
Testing: 2
Testing: 3
Testing: 4
Testing: 1
Testing: -2
Yielding: -2


The opposite of ifilter(), ifilterfalse() returns an iterator that includes only items where the test function returns false.

def check_item(x):
    print 'Testing:', x
    return (x<1)

for i in ifilterfalse(check_item, [ -1, 0, 1, 2, 3, 4, 1, -2 ]):
    print 'Yielding:', i


$ python itertools_ifilterfalse.py
Testing: -1
Testing: 0
Testing: 1
Yielding: 1
Testing: 2
Yielding: 2
Testing: 3
Yielding: 3
Testing: 4
Yielding: 4
Testing: 1
Yielding: 1
Testing: -2


Grouping Data:

The groupby() function returns an iterator that produces sets of values grouped by a common key.

This example from the standard library documentation shows how to group keys in a dictionary which have the same value:

from itertools import *
from operator import itemgetter

d = dict(a=1, b=2, c=1, d=2, e=1, f=2, g=3)
di = sorted(d.iteritems(), key=itemgetter(1))
for k, g in groupby(di, key=itemgetter(1)):
    print k, map(itemgetter(0), g)


$ python itertools_groupby.py
1 ['a', 'c', 'e']
2 ['b', 'd', 'f']
3 ['g']


This more complicated example illustrates grouping related values based on some attribute. Notice that the input sequence needs to be sorted on the key in order for the groupings to work out as expected:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __repr__(self):
        return 'Point(%s, %s)' % (self.x, self.y)
    def __cmp__(self, other):
        return cmp((self.x, self.y), (other.x, other.y))

# Create a dataset of Point instances
data = list(imap(Point,
                 cycle(islice(count(), 3)),
                 islice(count(), 10),
                 )
            )
print 'Data:', data
print

# Try to group the unsorted data based on X values
print 'Grouped, unsorted:'
for k, g in groupby(data, lambda o:o.x):
    print k, list(g)
print

# Sort the data
data.sort()
print 'Sorted:', data
print

# Group the sorted data based on X values
print 'Grouped, sorted:'
for k, g in groupby(data, lambda o:o.x):
    print k, list(g)
print


$ python itertools_groupby_seq.py
Data: [Point(0, 0), Point(1, 1), Point(2, 2), Point(0, 3),
Point(1, 4), Point(2, 5), Point(0, 6), Point(1, 7),
Point(2, 8), Point(0, 9)]

Grouped, unsorted:
0 [Point(0, 0)]
1 [Point(1, 1)]
2 [Point(2, 2)]
0 [Point(0, 3)]
1 [Point(1, 4)]
2 [Point(2, 5)]
0 [Point(0, 6)]
1 [Point(1, 7)]
2 [Point(2, 8)]
0 [Point(0, 9)]

Sorted: [Point(0, 0), Point(0, 3), Point(0, 6), Point(0, 9),
Point(1, 1), Point(1, 4), Point(1, 7), Point(2, 2),
Point(2, 5), Point(2, 8)]

Grouped, sorted:
0 [Point(0, 0), Point(0, 3), Point(0, 6), Point(0, 9)]
1 [Point(1, 1), Point(1, 4), Point(1, 7)]
2 [Point(2, 2), Point(2, 5), Point(2, 8)]
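
Since the sort and the grouping need to agree on the key, it can be convenient to wrap the two steps together. This is only a sketch, assuming Python 3; group_by_key() is a helper invented for the example.

```python
from itertools import groupby

def group_by_key(items, key):
    """Sort items by key, then collect each run of equal keys into a list."""
    grouped = {}
    for k, run in groupby(sorted(items, key=key), key=key):
        grouped[k] = list(run)
    return grouped

words = ['apple', 'bob', 'avocado', 'cat', 'banana']
by_letter = group_by_key(words, key=lambda w: w[0])
print(by_letter)
```

Using the same key function for sorted() and groupby() guarantees that equal keys are adjacent, which is what groupby() depends on.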


References:

Python Module of the Week Home
Download Sample Code
The Standard ML Basis Library
Definition of Haskell and the Standard Libraries

[Updated 30 Oct 2007 to correct the description of dropwhile().]




Monday, October 15, 2007

Python Magazine wish-list

Brian and I have been compiling a list of topics we would like to have covered in the magazine. Since we're just starting, the field is really wide-open for anything, but sometimes it is easier to solicit articles about specific topics instead of just saying, "Write for us!"

A few of my personal wishes:

We have had a couple of PyGTK articles submitted already, but nothing for any of the other toolkits. Whenever I see the question "Which GUI toolkit should I use?" there are always a lot of responses for wxWindows and quite a few for Qt. We haven't had any submissions for articles on either yet, so if you use them and want to talk about it, yours might be the first.

I'm aware of several ORM-related books in the works right now, but that's another area where a short article (4000 words) on a focused aspect would be useful. Not all queries are equal (even if the result sets are), so how about a discussion of SQL optimization with your favorite ORM? Or how about adapting an ORM to an existing database? And my favorite topic: How the heck am I supposed to upgrade my schema when I make changes?

I need to write a trac plugin, but haven't had the time to figure out where to start. Will you write an article to show me how?

The List:

We'll be updating this list and will eventually post it online somewhere, but until we decide on the best way to do that, here is the "full" wish-list we have put together for now, in no specific order. Do not interpret the absence of a topic as lack of interest; we just haven't added it to the list, yet!

If you are interested in writing about these or other topics, contact us through the web site and let us know.

# High Performance Computing (HPC)

  * Parallel Python (pp) module
  * PyMOL
  * VTK
  * SciPy

# Browser

  * Django
  * Writing a Django app
  * TurboGears
  * CherryPy
  * Zope
  * Writing a Zope product
  * Plone
  * Writing a plugin for trac

# Web Services

  * XMLRPC
      o SimpleXMLRPCServer
      o xmlrpclib
  * SOAP
  * Flickr (Beej's API?)
  * Google Calendar/GData (w/ ElementTree)
  * Amazon
  * Yahoo

# System Administration

  * SNMP
  * LDAP
      o python-ldap
      o Luma (extending?)
  * User/Group management

# GUI Frameworks

  * wxPython
  * PyQt
  * PyGTK




Sunday, October 14, 2007

PyMOTW: shlex

The shlex module can be used to create mini-languages using simple syntaxes like the Unix shell. It is also handy for parsing quoted strings.

Module: shlex
Purpose: Lexical analysis of shell-style syntaxes.
Python Version: 1.5.2, with additions in later versions

Description:

The shlex module implements a class for parsing simple shell-like syntaxes. It can be used for writing your own domain specific language, or for parsing quoted strings (a task that is more complex than it seems, at first).

Quoted Strings:

A common problem when working with input text is to identify a sequence of quoted words as a single entity. Splitting the text on quotes does not always work as expected, especially if there are nested levels of quotes. Take the following text:

 """This string has embedded "double quotes" and 'single quotes' in it, and even "a 'nested example'"."""


A naive approach might attempt to construct a regular expression to find the parts of the text outside the quotes to separate them from the text inside the quotes, or vice versa. Such an approach would be unnecessarily complex and prone to errors resulting from edge cases like apostrophes or even typos. A better solution is to use a true parser, such as the one provided by the shlex module. Here is a simple example which prints the tokens identified in the input file:

import shlex
import sys

if len(sys.argv) != 2:
    print 'Please specify one filename on the command line.'
    sys.exit(1)

filename = sys.argv[1]
body = open(filename, 'rt').read()
print 'ORIGINAL:', repr(body)
print

print 'TOKENS:'
lexer = shlex.shlex(body)
for token in lexer:
    print repr(token)


When run on data with embedded quotes, the parser produces the list of tokens we expect:

$ python shlex_example.py quotes.txt
ORIGINAL: 'This string has embedded "double quotes" and \'single quotes\' in it, and even "a \'nested example\'".'

TOKENS:
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
','
'and'
'even'
'"a \'nested example\'"'
'.'


Isolated quotes such as apostrophes are also handled:

$ python shlex_example.py apostrophe.txt 
ORIGINAL: "This string has an embedded apostrophe, doesn't it?"

TOKENS:
'This'
'string'
'has'
'an'
'embedded'
'apostrophe'
','
"doesn't"
'it'
'?'


Comments:

Since the parser is intended to be used with command languages, it needs to handle comments. By default, any text following a # is considered part of a comment, and ignored. Due to the nature of the parser, only single character comment prefixes are supported. The set of comment characters used can be configured through the commenters property.

$ python shlex_example.py comments.txt
ORIGINAL: 'This line is recognized.\n# But this line is ignored.\nAnd this line is processed.'

TOKENS:
'This'
'line'
'is'
'recognized'
'.'
'And'
'this'
'line'
'is'
'processed'
'.'
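
As a sketch of changing the comment prefix, the snippet below switches commenters to a semicolon. It assumes Python 3 and an invented input string; everything from the semicolon to the end of the line is dropped.

```python
import shlex

lexer = shlex.shlex('key = value ; trailing note')
lexer.commenters = ';'   # use ";" instead of the default "#"

tokens = list(lexer)
print(tokens)
```

The = sign comes back as its own token because it is not in the default set of word characters.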


Split:

If you just need to split an existing string into component tokens, the convenience function split() is a simple wrapper around the parser.

import shlex

text = """This text has "quoted parts" inside it."""
print 'ORIGINAL:', repr(text)
print

print 'TOKENS:'
print shlex.split(text)


The result is a list:

$ python shlex_split.py 
ORIGINAL: 'This text has "quoted parts" inside it.'

TOKENS:
['This', 'text', 'has', 'quoted parts', 'inside', 'it.']
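
A typical use for split() is breaking up a command line so that quoted arguments stay together. A brief sketch, assuming Python 3 and a made-up command string:

```python
import shlex

command = 'cp "my file.txt" /tmp/dest'
args = shlex.split(command)
print(args)
```

The quoted filename comes back as a single argument with the quotes removed, the same way the shell would pass it to the program.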


Including Other Sources of Tokens:

The shlex class includes several configuration properties which allow us to control its behavior. The source property enables a feature for code (or configuration) re-use by allowing one token stream to include another. This is similar to the Bourne shell "source" operator, hence the name.

import shlex

text = """This text says to source quotes.txt before continuing."""
print 'ORIGINAL:', repr(text)
print

lexer = shlex.shlex(text)
lexer.wordchars += '.'
lexer.source = 'source'

print 'TOKENS:'
for token in lexer:
    print repr(token)


Notice the string source quotes.txt embedded in the original text. Since the source property of the lexer is set to "source", when the keyword is encountered the filename appearing in the next token is automatically included. In order to cause the filename to appear as a single token, the . character needs to be added to the list of characters which are included in words (otherwise "quotes.txt" becomes three tokens, "quotes", ".", "txt"). The output looks like:

$ python shlex_source.py 
ORIGINAL: 'This text says to source quotes.txt before continuing.'

TOKENS:
'This'
'text'
'says'
'to'
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
','
'and'
'even'
'"a \'nested example\'"'
'.'
'before'
'continuing.'


The "source" feature uses a method called sourcehook() to load the additional input source, so you can subclass shlex to provide your own implementation to load data from anywhere.
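
Here is one way such a subclass might look. This is a sketch only, assuming Python 3; the SNIPPETS dictionary and the SnippetLexer class are invented for the example and stand in for whatever storage you actually want to read from.

```python
import io
import shlex

# Hypothetical in-memory "files", keyed by name.
SNIPPETS = {'greeting': 'hello world'}

class SnippetLexer(shlex.shlex):
    def sourcehook(self, newfile):
        # Return (name, stream), the same shape the default hook returns.
        return newfile, io.StringIO(SNIPPETS[newfile])

lexer = SnippetLexer('start source greeting end')
lexer.source = 'source'

tokens = list(lexer)
print(tokens)
```

The tokens from the named snippet are spliced in where the source keyword appeared, and parsing then continues with the rest of the original input.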

Controlling the Parser:

I have already given an example changing the wordchars value to control which characters are included in words. It is also possible to set the quotes character to use additional or alternative quotes. Each quote must be a single character, so it is not possible to have different open and close quotes (no parsing on parentheses, for example).

import shlex

text = """|Col 1||Col 2||Col 3|"""
print 'ORIGINAL:', repr(text)
print

lexer = shlex.shlex(text)
lexer.quotes = '|'

print 'TOKENS:'
for token in lexer:
    print repr(token)


In this example, each table cell is wrapped in vertical bars:

$ python shlex_table.py 
ORIGINAL: '|Col 1||Col 2||Col 3|'

TOKENS:
'|Col 1|'
'|Col 2|'
'|Col 3|'


It is also possible to control the whitespace characters used to split words. If we modify the example in shlex_example.py to include period and comma, as follows:

lexer = shlex.shlex(body)
lexer.whitespace += '.,'


The results change to:

$ python shlex_whitespace.py quotes.txt 
ORIGINAL: 'This string has embedded "double quotes" and \'single quotes\' in it, and even "a \'nested example\'".'

TOKENS:
'This'
'string'
'has'
'embedded'
'"double quotes"'
'and'
"'single quotes'"
'in'
'it'
'and'
'even'
'"a \'nested example\'"'
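
The same idea can be used for simple comma-separated input. A sketch, assuming Python 3 and an invented input string:

```python
import shlex

lexer = shlex.shlex('a, "b c", d')
lexer.whitespace += ','   # treat commas as separators too

tokens = list(lexer)
print(tokens)
```

Because this is the default non-POSIX mode, the quoted value keeps its surrounding quotes, just as in the output above.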


Error Handling:

When the parser encounters the end of its input before all quoted strings are closed, it raises ValueError. When that happens, it is useful to examine some of the properties the parser maintains as it processes the input. For example, infile refers to the name of the file being processed (which might be different from the original file, if one file sources another). The lineno attribute reports the line where the error is discovered, which is typically the end of the file and may be far away from the opening quote. The token attribute contains the buffer of text not already included in a valid token. The error_leader() method produces a message prefix in a style similar to Unix compilers, which enables editors such as emacs to parse the error and take the user directly to the invalid line.

import shlex

text = """This line is ok.
This line has an "unfinished quote.
This line is ok, too.
"""

print 'ORIGINAL:', repr(text)
print

lexer = shlex.shlex(text)

print 'TOKENS:'
try:
    for token in lexer:
        print repr(token)
except ValueError, err:
    first_line_of_error = lexer.token.splitlines()[0]
    print 'ERROR:', lexer.error_leader(), str(err), 'following "' + first_line_of_error + '"'


The example above produces this output:

$ python shlex_errors.py 
ORIGINAL: 'This line is ok.\nThis line has an "unfinished quote.\nThis line is ok, too.\n'

TOKENS:
'This'
'line'
'is'
'ok'
'.'
'This'
'line'
'has'
'an'
ERROR: "None", line 4: No closing quotation following ""unfinished quote."
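
If you only need to know whether the input parsed, the same ValueError can be trapped around split(). A minimal sketch, assuming Python 3 syntax for the except clause; try_split() is a helper name made up for this example.

```python
import shlex

def try_split(text):
    """Return (tokens, None) on success or (None, message) on failure."""
    try:
        return shlex.split(text), None
    except ValueError as err:
        return None, str(err)

print(try_split('all "good here"'))
print(try_split('This has an "unfinished quote'))
```

This loses the line number and buffer details available from the lexer object, so it is better suited to short, single-line inputs.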


POSIX vs. Non-POSIX Parsing:

The default behavior for the parser is to use a backwards-compatible style which is not POSIX-compliant. For POSIX behavior, set the posix argument when constructing the parser.

import shlex

for s in [ 'Do"Not"Separate',
           '"Do"Separate',
           'Escaped \e Character not in quotes',
           'Escaped "\e" Character in double quotes',
           "Escaped '\e' Character in single quotes",
           r"Escaped '\'' \"\'\" single quote",
           r'Escaped "\"" \'\"\' double quote',
           "\"'Strip extra layer of quotes'\"",
           ]:
    print 'ORIGINAL :', repr(s)
    print 'non-POSIX:',

    non_posix_lexer = shlex.shlex(s, posix=False)
    try:
        print repr(list(non_posix_lexer))
    except ValueError, err:
        print 'error(%s)' % err


    print 'POSIX :',
    posix_lexer = shlex.shlex(s, posix=True)
    try:
        print repr(list(posix_lexer))
    except ValueError, err:
        print 'error(%s)' % err

    print


Here are a few examples of the differences in parsing behavior:

$ python shlex_posix.py
ORIGINAL : 'Do"Not"Separate'
non-POSIX: ['Do"Not"Separate']
POSIX : ['DoNotSeparate']

ORIGINAL : '"Do"Separate'
non-POSIX: ['"Do"', 'Separate']
POSIX : ['DoSeparate']

ORIGINAL : 'Escaped \\e Character not in quotes'
non-POSIX: ['Escaped', '\\', 'e', 'Character', 'not', 'in', 'quotes']
POSIX : ['Escaped', 'e', 'Character', 'not', 'in', 'quotes']

ORIGINAL : 'Escaped "\\e" Character in double quotes'
non-POSIX: ['Escaped', '"\\e"', 'Character', 'in', 'double', 'quotes']
POSIX : ['Escaped', '\\e', 'Character', 'in', 'double', 'quotes']

ORIGINAL : "Escaped '\\e' Character in single quotes"
non-POSIX: ['Escaped', "'\\e'", 'Character', 'in', 'single', 'quotes']
POSIX : ['Escaped', '\\e', 'Character', 'in', 'single', 'quotes']

ORIGINAL : 'Escaped \'\\\'\' \\"\\\'\\" single quote'
non-POSIX: error(No closing quotation)
POSIX : ['Escaped', '\\ \\"\\"', 'single', 'quote']

ORIGINAL : 'Escaped "\\"" \\\'\\"\\\' double quote'
non-POSIX: error(No closing quotation)
POSIX : ['Escaped', '"', '\'"\'', 'double', 'quote']

ORIGINAL : '"\'Strip extra layer of quotes\'"'
non-POSIX: ['"\'Strip extra layer of quotes\'"']
POSIX : ["'Strip extra layer of quotes'"]
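
The first two cases above can be reduced to a short comparison. This sketch assumes Python 3; tokens() is a small helper defined only for the example.

```python
import shlex

def tokens(text, posix):
    """Tokenize text with the given POSIX setting."""
    return list(shlex.shlex(text, posix=posix))

for sample in ['Do"Not"Separate', '"Do"Separate']:
    print(repr(sample))
    print('  non-POSIX:', tokens(sample, posix=False))
    print('  POSIX    :', tokens(sample, posix=True))
```

In POSIX mode the quotes are removed and adjacent pieces are joined into one word, while the non-POSIX parser keeps the quote characters in the tokens.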


References:

Python Module of the Week Home
Download Sample Code
effbot.org - The shlex module




PyATL Blog

Noah set up a group blog for PyATL members. It will be more convenient to follow announcements there than the Meetup group (do they even have an RSS feed?) though we will still need to post announcements to the python-groups blog separately. It's sort of too bad that wasn't set up as a planet-style aggregation, but I guess this works better to control what actually goes out on the feed and there are already 2 different planets for python blogs anyway.

Friday, October 12, 2007

The more things change...

Quote of the week, from Paul Graham's essay "How to Do Philosophy":

"Much to the surprise of the builders of the first digital computers," Rod Brooks wrote, "programs written for them usually did not work."

Wednesday, October 10, 2007

PyATL meetup Oct. 11th

The Python Atlanta Meetup group meets tomorrow night at Turner, on Techwood Drive. This month's theme is "Zope Related Technologies". Here's the schedule:

Oct. 11th Schedule: Round Table Discussion, Lightning Talks, Main Presentation

7:15-7:30 Meet at Turner Lobby.
7:30-7:45 Opening Remarks and setup.
7:45-8:25 20 Minute Interactive discussion Atlanta Plone and/or Derek Richardson
8:35-8:40 5 Minute Break
8:40-9:00 20 Minute Main Presentation: Drew Smathers, Zope 3
9:00-? General Discussion, Coding Sessions

Email is not a file transfer protocol

Brian Jones made my day when he said he didn't want to email articles back and forth for Python Magazine. He describes our editorial toolset, based on svn.

Reconsidering kids

This is a point in favor.

Tuesday, October 9, 2007

Python Community on LinkedIn

As usual, I'm a little late to the party and Jesse beat me to the punch. If you haven't already, head on over to LinkedIn and join the new Python community group set up by Danny Adair.

Sunday, October 7, 2007

PyMOTW: difflib

The difflib module contains several classes for comparing sequences, especially of lines of text from files, and manipulating the results.

Updated: I can't quite make the formatting for the examples come out the way I want with this blog template (the content column is too narrow, and apparently fixed). If you are viewing this through the web page on Blogger, I encourage you to download the sample code and run it yourself to see what it looks like. If there are any CSS wizards out there who can help make the table actually visible, I would appreciate any comments you might leave.

Module: difflib
Purpose: Library of tools for computing and working with differences between sequences, especially of lines in text files.
Python Version: 2.1

Description:

The SequenceMatcher class compares any 2 sequences of values, as long as the values are hashable. It uses a recursive algorithm to identify the longest contiguous matching blocks from the sequences, eliminating "junk" values. The Differ class works on sequences of text lines and produces human-readable deltas, including differences within individual lines. The HtmlDiff class produces similar results formatted as an HTML table.
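
As a quick illustration of SequenceMatcher on something smaller than the test data, here is a sketch that compares two short strings. It assumes Python 3, and the strings are arbitrary.

```python
import difflib

matcher = difflib.SequenceMatcher(None, 'abcd', 'bcde')

# Longest contiguous block shared by the two sequences.
block = matcher.find_longest_match(0, 4, 0, 4)
print(block.a, block.b, block.size)

# Similarity score: 2 * matched items / total items in both sequences.
print(matcher.ratio())
```

The shared block is "bcd", which starts at index 1 in the first string and index 0 in the second, and the ratio works out to 2 * 3 / 8.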

Test Data:

The examples below will all use this common test data in the difflib_data module:

text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec
mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
enim. Donec quis lectus a justo imperdiet tempus."""
text1_lines = text1.splitlines()

text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
imperdiet tempus. Suspendisse eu lectus. In nunc."""
text2_lines = text2.splitlines()


Differ Example:

Reproducing output similar to the diff command line tool is simple with the Differ class:

import difflib
from difflib_data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print '\n'.join(list(diff))


The output includes the original input values from both lists, including common values, and markup data to indicate what changes were made. Lines may be prefixed with - to indicate that they were in the first sequence, but not the second. Lines prefixed with + were in the second sequence, but not the first. If a line has an incremental change between versions, an extra line prefixed with ? is used to try to indicate where the change occurred within the line. If a line has not changed, it is printed with an extra blank space on the left column to let it line up with the other lines which may have other markup.

The beginning of both text segments is the same.

 1:   Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer


The second line has been changed to include a comma in the modified text. Both versions of the line are printed, with the extra information on line 4 showing the column where the text was modified, including the fact that the , character was added.

 2: - eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
 3: + eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
 4: ?                                                         +
 5:


Lines 6-9 of the output show where an extra space was removed.

 6: - tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec
 7: ?                                                             -
 8:
 9: + tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec


Next a more complex change was made, replacing several words in a phrase.

10: - mauris eget magna consequat convallis. Nam sed sem vitae odio
11: ?                                              - --
12:
13: + mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
14: ?                                            +++ +++++   +
15:


The last sentence in the paragraph was changed significantly, so the difference is represented by simply removing the old version and adding the new (lines 20-23).

16:   pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
17:   metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
18:   urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
19:   suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
20: - adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
21: - enim. Donec quis lectus a justo imperdiet tempus.
22: + adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
23: + imperdiet tempus. Suspendisse eu lectus. In nunc.
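
The prefix markers are easier to pick out on a tiny input. Here is a minimal sketch, using made-up word lists rather than the difflib_data text, and written with print() calls so it also runs under Python 3:

```python
import difflib

# Hypothetical sample data, not part of difflib_data.
before = ['alpha', 'beta', 'gamma']
after = ['alpha', 'gamma', 'delta']

result = list(difflib.Differ().compare(before, after))

for line in result:
    # The first two characters are the marker: '  ', '- ', '+ ' or '? '.
    marker, text = line[:2], line[2:]
    print('%r %s' % (marker, text))
```

Unchanged items keep the two-space marker, 'beta' is reported as removed and 'delta' as added. No ? lines appear because no line was edited in place.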


The ndiff() function produces essentially the same output. The processing is specifically tailored to working with text data and eliminating "noise" in the input.

diff = difflib.ndiff(text1_lines, text2_lines)
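
As a rough illustration of ndiff() on its own, the sketch below compares two one-line inputs that differ by a single character, which should be close enough for a ? guide line to be produced. The sample strings are made up for the example.

```python
import difflib

# Hypothetical one-line inputs that differ by a single character.
before = ['the quick brown fox']
after = ['the quick brown fix']

diff = list(difflib.ndiff(before, after))

# Guide lines carry their own trailing newline, so strip it before printing.
for line in diff:
    print(line.rstrip('\n'))

# Collect the guide lines that point at the changed column.
guides = [line for line in diff if line.startswith('? ')]
```

The trailing newline on the guide lines is also why blank lines show up after each ? line in the numbered output earlier in this section.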


Other Diff Formats:

Where the Differ class shows all of the inputs, a unified diff only includes modified lines and a bit of context. In version 2.3, a unified_diff() function was added to produce this sort of output:

import difflib
from difflib_data import *

diff = difflib.unified_diff(text1_lines, text2_lines, lineterm='')
print '\n'.join(list(diff))


The output should look familiar to users of svn or other version control tools:

$ python difflib_unified.py
---
+++
@@ -1,10 +1,10 @@
 Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
-eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
-tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec
-mauris eget magna consequat convallis. Nam sed sem vitae odio
+eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
+tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
+mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
 pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
 metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
 urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
 suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
-adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
-enim. Donec quis lectus a justo imperdiet tempus.
+adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
+imperdiet tempus. Suspendisse eu lectus. In nunc.
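
The empty --- and +++ lines in that output are there because no file names were given. A small sketch of filling them in and trimming the context follows. The file names are hypothetical labels (nothing is read from disk), and the word lists are made up for the example.

```python
import difflib

# Hypothetical sample data; only 'three' changes.
before = ['one', 'two', 'three', 'four', 'five']
after = ['one', 'two', 'THREE', 'four', 'five']

diff = list(difflib.unified_diff(
    before, after,
    fromfile='before.txt',  # label for the --- header (assumed name)
    tofile='after.txt',     # label for the +++ header (assumed name)
    n=1,                    # keep one line of context around each change
    lineterm=''))

print('\n'.join(diff))
```

With n=1 the hunk covers only 'two' through 'four', so the header reads @@ -2,3 +2,3 @@ instead of spanning the whole list.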


Using context_diff() produces similarly readable output:

$ python difflib_context.py
***
---
***************
*** 1,10 ****
  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
! eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
! tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec
! mauris eget magna consequat convallis. Nam sed sem vitae odio
  pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
  metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
  urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
  suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
! adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
! enim. Donec quis lectus a justo imperdiet tempus.
--- 1,10 ----
  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
! eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
! tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
! mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
  pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
  metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
  urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
  suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
! adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
! imperdiet tempus. Suspendisse eu lectus. In nunc.
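
The listing for difflib_context.py is not shown here. A minimal sketch of the call, assuming it mirrors the unified_diff() example and using made-up word lists in place of the difflib_data text, could look like this:

```python
import difflib

# Hypothetical sample data; only the middle item changes.
before = ['alpha', 'beta', 'gamma']
after = ['alpha', 'BETA', 'gamma']

# Same calling convention as unified_diff(): two sequences of lines,
# with lineterm='' because the inputs have no trailing newlines.
diff = list(difflib.context_diff(before, after, lineterm=''))

print('\n'.join(diff))
```

Changed lines are marked with ! in both halves of the hunk, and unchanged context lines are indented by two spaces.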


HTML Output:

HtmlDiff (new in Python 2.4) produces HTML output with the same information as the Differ class. This example uses make_table(), but the make_file() method produces a fully-formed HTML file as output.

import difflib
from difflib_data import *

d = difflib.HtmlDiff()
print d.make_table(text1_lines, text2_lines)


The output produces a table like the one below. (I have modified the styles slightly to make the table work on this blog.)

f |  1 | Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer      | f |  1 | Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
n |  2 | eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor     | n |  2 | eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
  |  3 | tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec    |   |  3 | tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
  |  4 | mauris eget magna consequat convallis. Nam sed sem vitae odio          |   |  4 | mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
  |  5 | pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu    |   |  5 | pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
  |  6 | metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris          |   |  6 | metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
  |  7 | urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,   |   |  7 | urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
  |  8 | suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta |   |  8 | suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
t |  9 | adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique   | t |  9 | adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
  | 10 | enim. Donec quis lectus a justo imperdiet tempus.                      |   | 10 | imperdiet tempus. Suspendisse eu lectus. In nunc.
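
For a complete page rather than a bare table, make_file() takes the same two sequences. The sketch below is only an illustration: the word lists and the column headings passed as fromdesc and todesc are made up for the example.

```python
import difflib

# Hypothetical sample data; only the middle item changes.
before = ['alpha', 'beta', 'gamma']
after = ['alpha', 'BETA', 'gamma']

d = difflib.HtmlDiff()
page = d.make_file(before, after,
                   fromdesc='original',  # heading for the left column (assumed)
                   todesc='modified')    # heading for the right column (assumed)

# make_file() returns the whole page as one string; saving it is up to the caller.
print('%d characters of HTML' % len(page))
```

Returning a string instead of writing a file leaves the caller free to save it, embed it, or send it over the network.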


Junk Data:

The Differ and HtmlDiff classes and the ndiff() function accept arguments to indicate which lines should be ignored, and which characters within a line should be ignored. This can be used to ignore markup or whitespace changes in two versions of a file, for example. The default for Differ is to not ignore any lines or characters explicitly, but to rely on the SequenceMatcher's ability to detect noise. The default for ndiff() is to ignore space and tab characters.
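
To get a feel for what "junk" means in practice, here is a small sketch. It first checks the two stock predicates that ship with difflib, then shows how a junk function changes the longest match SequenceMatcher finds. The find_longest_match() call is not used elsewhere in this article; the strings follow the pattern from the standard library documentation.

```python
import difflib

# The stock predicates: blanks and tabs are junk characters, and
# blank lines or lines holding a lone '#' are junk lines.
print(difflib.IS_CHARACTER_JUNK(' '))
print(difflib.IS_LINE_JUNK('  #  \n'))

a = ' abcd'
b = 'abcd abcd'

# No junk function: the leading blank takes part in the match.
plain = difflib.SequenceMatcher(None, a, b)
m1 = plain.find_longest_match(0, len(a), 0, len(b))

# Blanks declared junk: a match can no longer be anchored on one.
no_blanks = difflib.SequenceMatcher(lambda ch: ch == ' ', a, b)
m2 = no_blanks.find_longest_match(0, len(a), 0, len(b))

print(tuple(m1))
print(tuple(m2))
```

Without a junk function the best match is ' abcd' against the tail of the second string. With blanks treated as junk it becomes 'abcd' against the head of the second string.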

SequenceMatcher:

SequenceMatcher, which implements the comparison algorithm, can be used with sequences of any type of object as long as the object is hashable. For example, two lists of integers can be compared, and using get_opcodes() a set of instructions for converting the original list into the newer can be printed:

import difflib
from difflib_data import *

s1 = [ 1, 2, 3, 5, 6, 4 ]
s2 = [ 2, 3, 5, 4, 6, 1 ]

matcher = difflib.SequenceMatcher(None, s1, s2)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print ("%7s s1[%d:%d] (%s) s2[%d:%d] (%s)" %
           (tag, i1, i2, s1[i1:i2], j1, j2, s2[j1:j2]))