Sunday, November 30, 2008

preparing Sphinx output for Blogger

Blogger doesn't let me set a different option on an individual post, and since not all of the posts are PyMOTW articles I've been trying to keep the "convert line breaks" flag on because it makes it easier for posts like these. The results have been a little ugly, but I think I have that straightened out, finally.

I prepare the PyMOTW articles using reST and convert them to HTML with Sphinx. I have a custom template that spits out only the body of the HTML (with no html or body tags). Code passes through pygments automatically as part of the Sphinx processing. The results include newlines after most of the tags, though. Blogger was converting those newlines to br tags, even when the tags themselves were otherwise invisible (like table tags).

I needed a cleanup script anyway because Sphinx (or docutils, I'm not 100% certain) inserts permalink anchors for each header. The stylesheet I use for the PyMOTW site causes them to be hidden unless the user mouses over the link, but I didn't want them at all in the blog posts. A previous attempt at a cleanup script with BeautifulSoup stripped the permalinks but also removed the whitespace from within pre tags. A recent update to BeautifulSoup fixed that problem, so I gave it another try today.

Unfortunately, I couldn't find any combination of arguments to tell BeautifulSoup not to insert newlines between tags. The prettyPrint option was either ignored, or I don't understand how it is intended to be used. So I used BeautifulSoup to remove the permalinks but fell back on regular expressions for the newline handling.

I want to remove all newline characters immediately after closing tags, except if the tags are part of code or other pre-formatted output. Lines that do not end with tags are probably part of pre blocks, and whitespace is obviously important there. I realized that since pygments consistently uses span tags, as long as I ignored newlines after span tags I should be safe.

This is the script I came up with to take the HTML output of Sphinx and prepare it for posting through Blogger:

#!/usr/bin/env python
# encoding: utf-8
#
# Copyright (c) 2008 Doug Hellmann All rights reserved.
#
"""Clean a sphinx-generated HTML blob to make a blog post.
"""

import re
import sys
from BeautifulSoup import BeautifulSoup
from cStringIO import StringIO

# The post body is passed to stdin.
body = sys.stdin.read()
soup = BeautifulSoup(body)

# Remove the permalinks to each header since the blog does not have
# the styles to hide them.
links = soup.findAll('a', attrs={'class': "headerlink"})
for link in links:
    link.extract()

# Get BeautifulSoup's version of the string
s = soup.__str__(prettyPrint=False)

# Remove extra newlines. This depends on the fact that
# code blocks are passed through pygments, which wraps each part of the line
# in a span tag.
pattern = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL|re.IGNORECASE)
s = ''.join(pattern.sub(r'\1', l) for l in StringIO(s))
print s
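
The same intent can also be expressed with a negative lookbehind that skips only lines ending in a closing span tag. This is a sketch of an alternative, not the script used for the blog: the name strip_tag_newlines and the sample input are made up for illustration, it is written for Python 3, and it may treat some tags differently from the pattern above, so compare the results on real Sphinx output before relying on it.

```python
import re

# Match a newline that follows '>' unless the tag being closed is </span>.
# The lookbehind is fixed-width, as the re module requires.
TAG_NEWLINE = re.compile(r'(?<!</span)>\n$', re.IGNORECASE)


def strip_tag_newlines(html):
    """Drop the newline after every tag except a closing span tag."""
    lines = html.splitlines(True)  # keep the line endings
    return ''.join(TAG_NEWLINE.sub('>', line) for line in lines)


sample = (
    '<table>\n'
    '<tr><td>cell</td></tr>\n'
    '<pre><span class="k">import</span>\n'
    'plain text inside pre\n'
    '</pre>\n'
)
cleaned = strip_tag_newlines(sample)
print(repr(cleaned))
```

Lines that end in text or in a closing span tag keep their newline, so highlighted code and other pre-formatted output are left alone.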


Today's PyMOTW post on readline is the first example of the results.

Updated 1 Dec to change import line based on reader comment.

PyMOTW: readline


readline – Interface to the GNU readline library

Purpose: Provides an interface to the GNU readline library for interacting with the user at a command prompt.
Python Version: 1.4 and later

The readline module can be used to enhance interactive command line programs to make them easier to use. It is primarily used to provide command line text completion, or “tab completion”.

Note

Because readline interacts with the console content, printing debug messages
makes it difficult to see what is happening in the sample code versus what readline
is doing for free. The examples below use the logging module to write debug information
to a separate file. The log output is shown with each example.

Configuring

There are two ways to configure the underlying readline library, using a configuration file or the parse_and_bind() function. Configuration options include the keybinding to invoke completion, editing modes (vi or emacs), and many other values. Refer to the GNU readline library documentation for details.

The easiest way to enable tab-completion is through a call to parse_and_bind(). Other options can be set at the same time. This example changes the default editing controls to use “vi” mode instead of the default of “emacs”. To edit the line, press ESC then normal vi navigation keys.

import readline

readline.parse_and_bind('tab: complete')
readline.parse_and_bind('set editing-mode vi')

while True:
    line = raw_input('Prompt ("stop" to quit): ')
    if line == 'stop':
        break
    print 'ENTERED: "%s"' % line

The same configuration can be stored as instructions in a file read by the library with a single call. If myreadline.rc contains:

# Turn on tab completion
tab: complete

# Use vi editing mode instead of emacs
set editing-mode vi

the file can be read with read_init_file():

import readline

readline.read_init_file('myreadline.rc')

while True:
    line = raw_input('Prompt ("stop" to quit): ')
    if line == 'stop':
        break
    print 'ENTERED: "%s"' % line

Completing Text

As an example of how to build command line completion, we can look at a program that has a built-in set of possible commands and uses tab-completion when the user is entering instructions.

import readline
import logging

LOG_FILENAME = '/tmp/completer.log'
logging.basicConfig(filename=LOG_FILENAME,
                    level=logging.DEBUG,
                    )

class SimpleCompleter(object):

    def __init__(self, options):
        self.options = sorted(options)
        return

    def complete(self, text, state):
        response = None
        if state == 0:
            # This is the first time for this text, so build a match list.
            if text:
                self.matches = [s
                                for s in self.options
                                if s and s.startswith(text)]
                logging.debug('%s matches: %s', repr(text), self.matches)
            else:
                self.matches = self.options[:]
                logging.debug('(empty input) matches: %s', self.matches)

        # Return the state'th item from the match list,
        # if we have that many.
        try:
            response = self.matches[state]
        except IndexError:
            response = None
        logging.debug('complete(%s, %s) => %s',
                      repr(text), state, repr(response))
        return response

def input_loop():
    line = ''
    while line != 'stop':
        line = raw_input('Prompt ("stop" to quit): ')
        print 'Dispatch %s' % line

# Register our completer function
readline.set_completer(SimpleCompleter(['start', 'stop', 'list', 'print']).complete)

# Use the tab key for completion
readline.parse_and_bind('tab: complete')

# Prompt the user for text
input_loop()

The input_loop() function simply reads one line after another until the input value is "stop". A more sophisticated program could actually parse the input line and run the command.

The SimpleCompleter class keeps a list of “options” that are candidates for auto-completion. The complete() method for an instance is designed to be registered with readline as the source of completions. The arguments are a “text” string to complete and a “state” value, indicating how many times the function has been called with the same text. The function is called repeatedly with the state incremented each time. It should return a string if there is a candidate for that state value or None if there are no more candidates. The implementation of complete() here looks for a set of matches when state is 0, and then returns all of the candidate matches one at a time on subsequent calls.
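
The calling convention described above can be exercised without readline by calling complete() directly with increasing state values. The following is a minimal sketch written for Python 3, with the logging removed; the helper collect_candidates() is a made-up name used only to imitate the way the function is called repeatedly.

```python
class SimpleCompleter:
    """Restatement of the completer above, without logging."""

    def __init__(self, options):
        self.options = sorted(options)
        self.matches = []

    def complete(self, text, state):
        if state == 0:
            # First call for this text: build the match list.
            if text:
                self.matches = [s for s in self.options if s.startswith(text)]
            else:
                self.matches = self.options[:]
        try:
            return self.matches[state]
        except IndexError:
            return None


def collect_candidates(complete, text):
    """Call complete() with state 0, 1, 2, ... until it returns None."""
    results = []
    state = 0
    while True:
        candidate = complete(text, state)
        if candidate is None:
            return results
        results.append(candidate)
        state += 1


completer = SimpleCompleter(['start', 'stop', 'list', 'print'])
print(collect_candidates(completer.complete, 's'))   # ['start', 'stop']
print(collect_candidates(completer.complete, ''))    # all four options, sorted
print(collect_candidates(completer.complete, 'x'))   # []
```

Being able to call the completer this way also makes it easy to test the matching logic separately from the interactive prompt.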

When run, the initial output looks something like this:

$ python readline_completer.py
Prompt ("stop" to quit):

If you press TAB twice, a list of options is printed.

$ python readline_completer.py
Prompt ("stop" to quit):
list print start stop
Prompt ("stop" to quit):

The log file shows that complete() was called with two separate sequences of state values.

$ tail -f /tmp/completer.log
DEBUG:root:(empty input) matches: ['list', 'print', 'start', 'stop']
DEBUG:root:complete('', 0) => 'list'
DEBUG:root:complete('', 1) => 'print'
DEBUG:root:complete('', 2) => 'start'
DEBUG:root:complete('', 3) => 'stop'
DEBUG:root:complete('', 4) => None
DEBUG:root:(empty input) matches: ['list', 'print', 'start', 'stop']
DEBUG:root:complete('', 0) => 'list'
DEBUG:root:complete('', 1) => 'print'
DEBUG:root:complete('', 2) => 'start'
DEBUG:root:complete('', 3) => 'stop'
DEBUG:root:complete('', 4) => None

The first sequence is from the first TAB key-press. The completion algorithm asks for all candidates but does not expand the empty input line. Then on the second TAB, the list of candidates is recalculated so it can be printed for the user.

If next we type “l” and press TAB again, the screen shows:

Prompt ("stop" to quit): list

and the log reflects the different arguments to complete():

DEBUG:root:'l' matches: ['list']
DEBUG:root:complete('l', 0) => 'list'
DEBUG:root:complete('l', 1) => None

Pressing RETURN now causes raw_input() to return the value, and the while loop cycles.

Dispatch list
Prompt ("stop" to quit):

There are two possible completions for a command beginning with “s”. Typing “s”, then pressing TAB finds that “start” and “stop” are candidates, but only partially completes the text on the screen by adding a “t”.

The log file shows:

DEBUG:root:'s' matches: ['start', 'stop']
DEBUG:root:complete('s', 0) => 'start'
DEBUG:root:complete('s', 1) => 'stop'
DEBUG:root:complete('s', 2) => None

and the screen:

Prompt ("stop" to quit): st
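
The text that ends up on the screen is the longest prefix shared by all of the candidates. As a rough illustration of that calculation outside of readline (written for Python 3, and using os.path.commonprefix() only because it happens to compare strings character by character):

```python
import os.path

candidates = ['start', 'stop']

# commonprefix() works on plain strings, one character at a time,
# so it is usable here even though these are not file paths.
prefix = os.path.commonprefix(candidates)
print(prefix)
```

With only one candidate the common prefix is the whole word, which is why typing "l" and pressing TAB completed the full command earlier.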

Warning

If your completer function raises an exception, it is ignored silently and
readline assumes there are no matching completions.
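
One way to make such failures visible is to wrap the completer so that exceptions are logged before they disappear. This is only a sketch for Python 3; log_completer_errors() and broken_complete() are hypothetical names invented for the example.

```python
import functools
import logging


def log_completer_errors(complete):
    """Wrap a completer so exceptions are logged before being swallowed."""
    @functools.wraps(complete)
    def wrapper(text, state):
        try:
            return complete(text, state)
        except Exception:
            logging.exception('completer failed for %r, state %s', text, state)
            return None  # report "no more candidates"
    return wrapper


def broken_complete(text, state):
    # Deliberately buggy completer, used only to exercise the wrapper.
    raise ValueError('boom')


safe_complete = log_completer_errors(broken_complete)
result = safe_complete('st', 0)
print(result)
```

The wrapped function could then be passed to readline.set_completer() in place of the original, with logging configured to write to a file as in the other examples.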

Accessing the Completion Buffer

The completion algorithm above is simplistic because it only looks at the text argument passed to the function. It is also possible to use functions in the readline module to manipulate the text of the input buffer.

import readline
import logging

LOG_FILENAME = '/tmp/completer.log'
logging.basicConfig(filename=LOG_FILENAME,
                    level=logging.DEBUG,
                    )

class BufferAwareCompleter(object):

    def __init__(self, options):
        self.options = options
        self.current_candidates = []
        return

    def complete(self, text, state):
        response = None
        if state == 0:
            # This is the first time for this text, so build a match list.

            origline = readline.get_line_buffer()
            begin = readline.get_begidx()
            end = readline.get_endidx()
            being_completed = origline[begin:end]
            words = origline.split()

            logging.debug('origline=%s', repr(origline))
            logging.debug('begin=%s', begin)
            logging.debug('end=%s', end)
            logging.debug('being_completed=%s', being_completed)
            logging.debug('words=%s', words)

            if not words:
                self.current_candidates = sorted(self.options.keys())
            else:
                try:
                    if begin == 0:
                        # first word
                        candidates = self.options.keys()
                    else:
                        # later word
                        first = words[0]
                        candidates = self.options[first]

                    if being_completed:
                        # match options with portion of input
                        # being completed
                        self.current_candidates = [ w for w in candidates
                                                    if w.startswith(being_completed) ]
                    else:
                        # matching empty string so use all candidates
                        self.current_candidates = candidates

                    logging.debug('candidates=%s', self.current_candidates)

                except (KeyError, IndexError), err:
                    logging.error('completion error: %s', err)
                    self.current_candidates = []

        try:
            response = self.current_candidates[state]
        except IndexError:
            response = None
        logging.debug('complete(%s, %s) => %s', repr(text), state, response)
        return response


def input_loop():
    line = ''
    while line != 'stop':
        line = raw_input('Prompt ("stop" to quit): ')
        print 'Dispatch %s' % line

# Register our completer function
readline.set_completer(BufferAwareCompleter(
    {'list':['files', 'directories'],
     'print':['byname', 'bysize'],
     'stop':[],
     }).complete)

# Use the tab key for completion
readline.parse_and_bind('tab: complete')

# Prompt the user for text
input_loop()

In this example, commands with sub-options are being completed. The complete() method needs to look at the position of the completion within the input buffer to determine whether it is part of the first word or a later word. If the target is the first word, the keys of the options dictionary are used as candidates. If it is not the first word, then the first word is used to find candidates from the options dictionary.

There are three top-level commands, two of which have subcommands:

  • list
    • files
    • directories
  • print
    • byname
    • bysize
  • stop
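
The decision between "first word" and "later word" can be isolated in a plain function that takes the line and the begin and end positions as arguments instead of asking readline for them. This is a simplified sketch for Python 3, not the class above: candidates_for() is a hypothetical name, the top-level commands are sorted, and an unknown first word yields an empty list rather than a logged KeyError.

```python
OPTIONS = {
    'list': ['files', 'directories'],
    'print': ['byname', 'bysize'],
    'stop': [],
}


def candidates_for(line, begin, end, options=OPTIONS):
    """Return completion candidates for the word at line[begin:end]."""
    being_completed = line[begin:end]
    words = line.split()
    if not words or begin == 0:
        # Completing the first word: offer the top-level commands.
        pool = sorted(options)
    else:
        # Completing a later word: the first word selects the subcommands.
        pool = options.get(words[0], [])
    return [w for w in pool if w.startswith(being_completed)]


print(candidates_for('', 0, 0))            # ['list', 'print', 'stop']
print(candidates_for('list ', 5, 5))       # ['files', 'directories']
print(candidates_for('print by', 6, 8))    # ['byname', 'bysize']
print(candidates_for('bogus x', 6, 7))     # []
```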

Following the same sequence of actions as before, pressing TAB twice gives us the three top-level commands:

$ python readline_buffer.py
Prompt ("stop" to quit):
list print stop
Prompt ("stop" to quit):

and in the log:

DEBUG:root:origline=''
DEBUG:root:begin=0
DEBUG:root:end=0
DEBUG:root:being_completed=
DEBUG:root:words=[]
DEBUG:root:complete('', 0) => list
DEBUG:root:complete('', 1) => print
DEBUG:root:complete('', 2) => stop
DEBUG:root:complete('', 3) => None
DEBUG:root:origline=''
DEBUG:root:begin=0
DEBUG:root:end=0
DEBUG:root:being_completed=
DEBUG:root:words=[]
DEBUG:root:complete('', 0) => list
DEBUG:root:complete('', 1) => print
DEBUG:root:complete('', 2) => stop
DEBUG:root:complete('', 3) => None

If the first word is “list ” (with a space after the word), the candidates for completion are different:

Prompt ("stop" to quit): list
directories files

The log shows that the text being completed is not the full line, but the portion after the first word:

DEBUG:root:origline='list '
DEBUG:root:begin=5
DEBUG:root:end=5
DEBUG:root:being_completed=
DEBUG:root:words=['list']
DEBUG:root:candidates=['files', 'directories']
DEBUG:root:complete('', 0) => files
DEBUG:root:complete('', 1) => directories
DEBUG:root:complete('', 2) => None
DEBUG:root:origline='list '
DEBUG:root:begin=5
DEBUG:root:end=5
DEBUG:root:being_completed=
DEBUG:root:words=['list']
DEBUG:root:candidates=['files', 'directories']
DEBUG:root:complete('', 0) => files
DEBUG:root:complete('', 1) => directories
DEBUG:root:complete('', 2) => None

Input History

readline tracks the input history automatically. There are two different sets of functions for working with the history. The history for the current session can be accessed with get_current_history_length() and get_history_item(). That same history can be saved to a file to be reloaded later using write_history_file() and read_history_file(). By default the entire history is saved but the maximum length of the file can be set with set_history_length(). A length of -1 means no limit.

import readline
import logging
import os

LOG_FILENAME = '/tmp/completer.log'
HISTORY_FILENAME = '/tmp/completer.hist'

logging.basicConfig(filename=LOG_FILENAME,
                    level=logging.DEBUG,
                    )

def get_history_items():
    return [ readline.get_history_item(i)
             for i in xrange(1, readline.get_current_history_length() + 1)
             ]

class HistoryCompleter(object):

    def __init__(self):
        self.matches = []
        return

    def complete(self, text, state):
        response = None
        if state == 0:
            history_values = get_history_items()
            logging.debug('history: %s', history_values)
            if text:
                self.matches = sorted(h
                                      for h in history_values
                                      if h and h.startswith(text))
            else:
                self.matches = []
            logging.debug('matches: %s', self.matches)
        try:
            response = self.matches[state]
        except IndexError:
            response = None
        logging.debug('complete(%s, %s) => %s',
                      repr(text), state, repr(response))
        return response

def input_loop():
    if os.path.exists(HISTORY_FILENAME):
        readline.read_history_file(HISTORY_FILENAME)
    print 'Max history file length:', readline.get_history_length()
    print 'Startup history:', get_history_items()
    try:
        while True:
            line = raw_input('Prompt ("stop" to quit): ')
            if line == 'stop':
                break
            if line:
                print 'Adding "%s" to the history' % line
    finally:
        print 'Final history:', get_history_items()
        readline.write_history_file(HISTORY_FILENAME)

# Register our completer function
readline.set_completer(HistoryCompleter().complete)

# Use the tab key for completion
readline.parse_and_bind('tab: complete')

# Prompt the user for text
input_loop()

The HistoryCompleter remembers everything you type and uses those values when completing subsequent inputs.

$ python readline_history.py
Max history file length: -1
Startup history: []
Prompt ("stop" to quit): foo
Adding "foo" to the history
Prompt ("stop" to quit): bar
Adding "bar" to the history
Prompt ("stop" to quit): blah
Adding "blah" to the history
Prompt ("stop" to quit): b
bar blah
Prompt ("stop" to quit): b
Prompt ("stop" to quit): stop
Final history: ['foo', 'bar', 'blah', 'stop']

The log shows this output when the “b” is followed by two TABs.

DEBUG:root:history: ['foo', 'bar', 'blah']
DEBUG:root:matches: ['bar', 'blah']
DEBUG:root:complete('b', 0) => 'bar'
DEBUG:root:complete('b', 1) => 'blah'
DEBUG:root:complete('b', 2) => None
DEBUG:root:history: ['foo', 'bar', 'blah']
DEBUG:root:matches: ['bar', 'blah']
DEBUG:root:complete('b', 0) => 'bar'
DEBUG:root:complete('b', 1) => 'blah'
DEBUG:root:complete('b', 2) => None

When the script is run the second time, all of the history is read from the file.

$ python readline_history.py
Max history file length: -1
Startup history: ['foo', 'bar', 'blah', 'stop']
Prompt ("stop" to quit):

There are functions for removing individual history items and clearing the entire history, as well.
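
A short sketch of those functions, written for Python 3 and assuming a readline module backed by GNU readline is available. The history is filled with add_history() instead of interactive input so the example can run unattended. Note that get_history_item() counts from one while remove_history_item() counts from zero.

```python
import readline

# Start from a clean slate so earlier session history does not interfere.
readline.clear_history()
for entry in ['foo', 'bar', 'blah']:
    readline.add_history(entry)


def get_history_items():
    # get_history_item() uses one-based indexes.
    return [readline.get_history_item(i)
            for i in range(1, readline.get_current_history_length() + 1)]


print(get_history_items())

# remove_history_item() uses a zero-based position, so 1 removes 'bar'.
readline.remove_history_item(1)
after_remove = get_history_items()
print(after_remove)

readline.clear_history()
after_clear = get_history_items()
print(after_clear)
```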

Hooks

There are several hooks available for triggering actions as part of the interaction sequence. The startup hook is invoked immediately before printing the prompt, and the pre-input hook is run after the prompt but before reading text from the user.

import readline

def startup_hook():
    readline.insert_text('from startup_hook')

def pre_input_hook():
    readline.insert_text(' from pre_input_hook')
    readline.redisplay()

readline.set_startup_hook(startup_hook)
readline.set_pre_input_hook(pre_input_hook)
readline.parse_and_bind('tab: complete')

while True:
    line = raw_input('Prompt ("stop" to quit): ')
    if line == 'stop':
        break
    print 'ENTERED: "%s"' % line

Either hook is a potentially good place to use insert_text() to modify the input buffer.

$ python readline_hooks.py
Prompt ("stop" to quit): from startup_hook from pre_input_hook

If the buffer is modified inside the pre-input hook, you need to call redisplay() to update the screen.

See also

readline
The standard library documentation for this module.
GNU readline
Documentation for the GNU readline library.
readline init file format
The initialization and configuration file format.
effbot: The readline module
Effbot’s guide to the readline module.
cmd
The cmd module uses readline extensively to implement
tab-completion in the command interface. Some of the examples here
were adapted from the code in cmd.
rlcompleter
rlcompleter uses readline to add tab-completion to the interactive
Python interpreter.

PyMOTW Home



Updated 1 Dec to include link to pyreadline.

Wednesday, November 26, 2008

BlogBackup 1.4

A recent post on my blog exposed a problem with the unicode handling in BlogBackup. Release 1.4 fixes the problem with saving posts that contain unicode characters.

Monday, November 24, 2008

Book Review: Expert Python Programming


Neha Shaikh at Packt publishing sent me a copy of Tarek Ziadé's new book Expert Python Programming for review and I finally finished it this weekend. Overall, I liked the book.



My first impression was, "Really, a chapter on installing Python in an expert level book?" As it turns out, I'm glad that chapter was there because I learned about setting a default module to be imported when the interactive interpreter starts. I'm sure I'd heard of that feature at some point, but I've never actually tried it until now.



The book continues by covering a range of topics, alternating between introducing new tools and promoting techniques to make your coding better. The overall themes of the chapters progress from "make it work" to "make it work right" and then "make it work faster" -- just as your development cycles should.



There were a few sections where I would have liked him to go deeper into certain topics, but the author was clearly trying to introduce a wide variety of topics and achieved that goal. There are plenty of references to supplemental resources online, so it's easy to keep digging on your own. And, given the breadth of material covered, there's something here for everyone.



Although a few minor mistakes slipped through the editing process, Tarek has an errata page online and is posting corrections there.



In summary: Recommended.



Updated 1 Dec: Neha sent me a link to a sample chapter so you can check it out before buying.

Python Magazine for November 2008 is released



The November 2008 issue of Python Magazine is available for download now.



The cover story this month is Building E-commerce on Plone with GetPaid by Horacio Durán. He walks through several scenarios for configuring different types of sites that need to accept payments, from charity donations to shipping physical goods.



Gloria Jacobs' article, An Introduction to SQLAlchemy, uses a straightforward intranet application to illustrate the power and portability of SQLAlchemy.



Pablo Troncoso uses some novel techniques to interpret MP3 ID3 tags in A Grammar-Based Approach for Decoding Binary Streams.



And Matt Willson gives us several Clever Uses for Metaclasses.



In his regular column, Mark Mruss talks about the occasionally contentious subject of using "slots" in Python classes. He shows how to use them and covers the pros and cons to help you decide when they might be useful in your own applications.

Rick Harding continues his bzr tutorial series by showing how to use different workflows, depending on the style of development that makes you most comfortable.

Steve Holden talks about ways you can contribute to Python, including being active in the PyCon organization process and helping with the language itself.

And I take a look back at the growth of the magazine over the last year, since Python Magazine is now officially one year old!

Saturday, November 15, 2008

now on twitter

At the PyWorks convention this week there were enough people talking about interesting things I missed from not having a Twitter account that I decided to abandon my resistance and sign up. I'm still setting up the list of feeds I want to follow, and since people search is down I expect that to take a while.

Tuesday, November 11, 2008

virtualenvwrapper 1.4, now with .pth management

Version 1.4 of virtualenvwrapper includes a .pth file management function contributed by James Bennett and Jannis Leidel. The new add2virtualenv function makes it easy to share code between virtual environments without installing it in the system site-packages directory, by adding directories to a .pth file in the virtualenv.
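
For anyone unfamiliar with the mechanism, a .pth file is a text file in a site directory in which each line names a directory to add to sys.path. The sketch below (Python 3) imitates that with temporary directories and site.addsitedir(). It does not use virtualenvwrapper, and the file name extra_paths.pth is made up for the example rather than taken from add2virtualenv.

```python
import os
import site
import sys
import tempfile

# Stand-ins for a virtualenv's site-packages and a shared source tree.
site_dir = tempfile.mkdtemp(prefix='fake-site-packages-')
shared_dir = tempfile.mkdtemp(prefix='shared-code-')

# Each line of a .pth file names a directory to add to sys.path.
pth_path = os.path.join(site_dir, 'extra_paths.pth')  # hypothetical file name
with open(pth_path, 'w') as f:
    f.write(shared_dir + '\n')

# addsitedir() processes the .pth files it finds in the directory.
site.addsitedir(site_dir)

on_path = os.path.abspath(shared_dir) in [os.path.abspath(p) for p in sys.path]
print(on_path)
```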

Thanks, James and Jannis!

Sunday, November 9, 2008

PyMOTW: array


array – Sequence of fixed-type data

Purpose: Manage sequences of fixed-type numerical data efficiently.
Python Version: 1.4 and later

The array module defines a sequence data structure that looks very much like a list except that all of the members have to be of the same type. The types supported are listed in the standard library documentation. They are all numeric or other fixed-size primitive types such as bytes.
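
A small sketch of the "same type" restriction, written for Python 3 (where the 'c' type code used in the example below no longer exists, so an integer array is used instead):

```python
import array

a = array.array('i', [1, 2, 3])
print(a.typecode, a.itemsize)   # type code and size in bytes of one item

a.append(4)                     # same type: accepted
try:
    a.append('five')            # wrong type: rejected
    rejected = False
except TypeError:
    rejected = True

print(a.tolist(), rejected)
```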



array Initialization


An array is instantiated with an argument describing the type of data to be allowed, and possibly an initialization sequence.


import array
import binascii

s = 'This is the array.'
a = array.array('c', s)

print 'As string:', s
print 'As array :', a
print 'As hex :', binascii.hexlify(a)

In this example, the array is configured to hold a sequence of bytes and is initialized with a simple string.


$ python array_string.py
As string: This is the array.
As array : array('c', [84, 104, 105, 115, 32, 105, 115,
32, 116, 104, 101, 32, 97, 114, 114, 97, 121, 46])
As hex : 54686973206973207468652061727261792e



Manipulating Arrays


An array can be extended and otherwise manipulated in the same ways as other Python sequences.


import array

a = array.array('i', xrange(5))
print 'Initial :', a

a.extend(xrange(5))
print 'Extended:', a

print 'Slice :', a[3:6]

print 'Iterator:', list(enumerate(a))

$ python array_sequence.py
Initial : array('i', [0, 1, 2, 3, 4])
Extended: array('i', [0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
Slice : array('i', [3, 4, 0])
Iterator: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4),
(5, 0), (6, 1), (7, 2), (8, 3), (9, 4)]



Arrays and Files


The contents of an array can be written to and read from files using built-in methods coded efficiently for that purpose.


import array
import binascii
import tempfile

a = array.array('i', xrange(5))
print 'A1:', a

# Write the array of numbers to the file
output = tempfile.NamedTemporaryFile()
a.tofile(output.file) # must pass an *actual* file
output.flush()

# Read the raw data
input = open(output.name, 'rb')
raw_data = input.read()
print 'Raw Contents:', binascii.hexlify(raw_data)

# Read the data into an array
input.seek(0)
a2 = array.array('i')
a2.fromfile(input, len(a))
print 'A2:', a2

This example illustrates reading the data “raw”, directly from the binary file, versus reading it into a new array and converting the bytes to the appropriate types.


$ python array_file.py
A1: array('i', [0, 1, 2, 3, 4])
Raw Contents: 0000000001000000020000000300000004000000
A2: array('i', [0, 1, 2, 3, 4])
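
The same round trip can be sketched for Python 3, where tofile() accepts any binary file object. The check at the end only compares lengths and values, because the raw bytes depend on the platform's integer size and byte order.

```python
import array
import tempfile

a = array.array('i', range(5))

with tempfile.TemporaryFile() as f:
    a.tofile(f)                     # write the raw machine values
    f.seek(0)
    raw_data = f.read()             # the same bytes, uninterpreted
    f.seek(0)
    a2 = array.array('i')
    a2.fromfile(f, len(a))          # read len(a) items back

print(len(raw_data), a.itemsize * len(a))
print(a2 == a)
```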



Alternate Byte Ordering


If the data in the array is not in the native byte order, or needs to be swapped before being written to a file intended for a system with a different byte order, it is easy to convert the entire array without iterating over the elements from Python.


import array
import binascii

def to_hex(a):
    chars_per_item = a.itemsize * 2 # 2 hex digits
    hex_version = binascii.hexlify(a)
    num_chunks = len(hex_version) / chars_per_item
    for i in xrange(num_chunks):
        start = i*chars_per_item
        end = start + chars_per_item
        yield hex_version[start:end]

a1 = array.array('i', xrange(5))
a2 = array.array('i', xrange(5))
a2.byteswap()

fmt = '%10s %10s %10s %10s'
print fmt % ('A1 hex', 'A1', 'A2 hex', 'A2')
print fmt % (('-' * 10,) * 4)
for values in zip(to_hex(a1), a1, to_hex(a2), a2):
    print fmt % values

$ python array_byteswap.py
    A1 hex         A1     A2 hex         A2
---------- ---------- ---------- ----------
  00000000          0   00000000          0
  01000000          1   00000001   16777216
  02000000          2   00000002   33554432
  03000000          3   00000003   50331648
  04000000          4   00000004   67108864
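
To see where the large numbers come from, the swap can be reproduced for a single value by reversing its bytes by hand. This is a Python 3 sketch; swap_int() is a helper invented for the comparison and only handles the small non-negative values used here.

```python
import array

a = array.array('i', range(5))
swapped = array.array('i', a)
swapped.byteswap()


def swap_int(value, size):
    """Reverse the byte order of a non-negative integer of the given size."""
    return int.from_bytes(value.to_bytes(size, 'little'), 'big')


expected = [swap_int(v, a.itemsize) for v in a]
swapped_values = swapped.tolist()
print(swapped_values == expected)

# Swapping twice restores the original values.
swapped.byteswap()
print(swapped == a)
```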


See also



array

The standard library documentation for this module.

struct

The struct module.

Numerical Python

NumPy is a Python library for working with large datasets efficiently.



PyMOTW Home






Updated Nov. 10: Replaced NumPy link.

Saturday, November 8, 2008

virtualenvwrapper 1.3

John Shimek identified a nasty bug in rmvirtualenv and provided a patch to fix it. Release 1.3 includes the fix. I strongly recommend upgrading.

Tuesday, November 4, 2008

Atlanta will be full of hackers in November

Nov 12-14: PyWorks 2008

Nov 15: Google AppEngine Hack-a-thon, co-hosted by PyATL and Google

Nov 15-16: TurboGears Sprint

Sunday, November 2, 2008

PyWorks 2008 Nov 12-14

MTA is putting the finishing touches on our plans to host the first annual combined PyWorks and php|works conference Nov 12-14. This year's conference builds on the past success of php|works events by adding 2 new tracks of presentations for Python developers and a cross-over track with topics of interest to everyone.

The tutorials on Wednesday offer an excellent introduction to four of the popular frameworks available for Python web applications. Mark Ramm promises a fast paced introduction to TurboGears and its related components. Travis Cline will be presenting Django from the perspective of a PHP developer, comparing and contrasting the approaches used by both. Brandon Rhodes' Grok tutorial is sure to create a few Zope converts, and Noah Gift will show how to build and deploy AJAX applications on Google's AppEngine platform.

The keynote Thursday morning will be given by Kevin Dangoor, well known in the Python community for his leadership of the TurboGears project. Kevin will talk about the aspects of running successful open source projects aside from the code itself. Managing developers is challenging enough, but when your project is run by volunteers there are unique aspects to consider. He will offer suggestions for how to increase and manage the participation in a project based on his experiences with the TurboGears project.

The Python talks on Thursday and Friday illustrate the broad spectrum of application areas where the language is used, with topics such as language internals, artificial intelligence, user interface tools, scaling multi-threaded applications, and systems administration. Of course, there's a healthy dose of web development as well.

I'm really looking forward to the conference, and I hope to see you there!

PyMOTW in PDF format

I've had a couple of requests to make it easier to print the PyMOTW articles by using a print-specific style sheet on my site or blog. I decided that since it is so easy, I should just set up Sphinx to produce a PDF of the entire thing. As a result, starting with today's article on struct each release will be available as a PDF from the project home page.

PyMOTW: struct


struct – Working with Binary Data

Purpose: Convert between strings and binary data.
Python Version: 1.4 and later

The struct module includes functions for converting between strings of bytes and native Python data types such as numbers and strings.



Functions vs. Struct Class


There is a set of module-level functions for working with structured values, and there is also the Struct class (new in Python 2.5). Format specifiers are converted from their string format to a compiled representation, similar to the way regular expressions are. The conversion takes some resources, so it is typically more efficient to do it once when creating a Struct instance and call methods on the instance instead of using the module-level functions. All of the examples below use the Struct class.
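
As a quick comparison of the two styles, this Python 3 sketch packs the same values with the module-level function and with a Struct instance. The explicit little-endian prefix is an addition made here so the result does not depend on the machine running it.

```python
import struct

fmt = '<I 2s f'   # explicit little-endian so the result is the same everywhere
values = (1, b'ab', 2.7)

# Module-level function: the format string is passed on every call.
packed_by_function = struct.pack(fmt, *values)

# Struct instance: the format is compiled once and reused.
s = struct.Struct(fmt)
packed_by_instance = s.pack(*values)

print(packed_by_function == packed_by_instance)
print(s.size == struct.calcsize(fmt))
```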




Packing and Unpacking


Structs support packing data into strings, and unpacking data from strings using format specifiers made up of characters representing the type of the data and optional count and endian-ness indicators. For complete details, refer to the standard library documentation.


In this example, the format specifier calls for an integer or long value, a 2 character string, and a floating point number. The spaces between the format specifiers are included here for clarity, and are ignored when the format is compiled.


import struct
import binascii

values = (1, 'ab', 2.7)
s = struct.Struct('I 2s f')
packed_data = s.pack(*values)

print 'Original values:', values
print 'Format string :', s.format
print 'Uses :', s.size, 'bytes'
print 'Packed Value :', binascii.hexlify(packed_data)

The packed value is converted to a sequence of hex bytes for printing, since some of the characters are nulls.


$ python struct_pack.py
Original values: (1, 'ab', 2.7000000000000002)
Format string : I 2s f
Uses : 12 bytes
Packed Value : 0100000061620000cdcc2c40

If we pass the packed value to unpack(), we get basically the same values back (note the discrepancy in the floating point value).


import struct
import binascii

packed_data = binascii.unhexlify('0100000061620000cdcc2c40')

s = struct.Struct('I 2s f')
unpacked_data = s.unpack(packed_data)
print 'Unpacked Values:', unpacked_data

$ python struct_unpack.py
Unpacked Values: (1, 'ab', 2.7000000476837158)
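
The discrepancy comes from the "f" format code, which stores a 4-byte single-precision value, while Python floats are normally 8-byte doubles. A short Python 3 sketch of the difference, using "d" for comparison:

```python
import struct

value = 2.7

# 'f' stores a 4-byte single-precision float, so some precision is lost.
as_single = struct.unpack('<f', struct.pack('<f', value))[0]

# 'd' stores an 8-byte double, the precision Python floats normally use.
as_double = struct.unpack('<d', struct.pack('<d', value))[0]

print(as_single == value)
print(as_double == value)
print(abs(as_single - value) < 1e-6)
```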



Endianness


By default, values are encoded using the native C library notion of “endianness”. It is easy to override that choice by providing an explicit endianness directive in the format string.


import struct
import binascii

values = (1, 'ab', 2.7)
print 'Original values:', values

endianness = [
    ('@', 'native, native'),
    ('=', 'native, standard'),
    ('<', 'little-endian'),
    ('>', 'big-endian'),
    ('!', 'network'),
    ]

for code, name in endianness:
    s = struct.Struct(code + ' I 2s f')
    packed_data = s.pack(*values)
    print
    print 'Format string :', s.format, 'for', name
    print 'Uses :', s.size, 'bytes'
    print 'Packed Value :', binascii.hexlify(packed_data)
    print 'Unpacked Value :', s.unpack(packed_data)

$ python struct_endianness.py
Original values: (1, 'ab', 2.7000000000000002)

Format string : @ I 2s f for native, native
Uses : 12 bytes
Packed Value : 0100000061620000cdcc2c40
Unpacked Value : (1, 'ab', 2.7000000476837158)

Format string : = I 2s f for native, standard
Uses : 10 bytes
Packed Value : 010000006162cdcc2c40
Unpacked Value : (1, 'ab', 2.7000000476837158)

Format string : < I 2s f for little-endian
Uses : 10 bytes
Packed Value : 010000006162cdcc2c40
Unpacked Value : (1, 'ab', 2.7000000476837158)

Format string : > I 2s f for big-endian
Uses : 10 bytes
Packed Value : 000000016162402ccccd
Unpacked Value : (1, 'ab', 2.7000000476837158)

Format string : ! I 2s f for network
Uses : 10 bytes
Packed Value : 000000016162402ccccd
Unpacked Value : (1, 'ab', 2.7000000476837158)
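
The size difference between the first format and the others is due to alignment: in native mode, pad bytes may be inserted so that the float starts on an aligned offset, while the other modes pack the fields back to back. A Python 3 sketch using calcsize() follows. The native size is printed but not assumed, since it depends on the platform.

```python
import binascii
import struct

# Standard sizes, no alignment: 4 (I) + 2 (2s) + 4 (f) bytes.
standard_size = struct.calcsize('=I2sf')
print(standard_size)

# Native mode may add pad bytes before the float;
# the exact size depends on the platform and compiler.
native_size = struct.calcsize('@I2sf')
print(native_size)

# Byte order only changes how multi-byte values are laid out.
little = binascii.hexlify(struct.pack('<I', 1))
big = binascii.hexlify(struct.pack('>I', 1))
network = binascii.hexlify(struct.pack('!I', 1))
print(little, big, network)
```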



Buffers


Working with binary packed data is typically reserved for highly performance-sensitive situations or passing data into and out of extension modules. One way to optimize is to avoid allocating a new buffer for each packed structure. The pack_into() and unpack_from() methods support writing to pre-allocated buffers directly.


import struct
import binascii

s = struct.Struct('I 2s f')
values = (1, 'ab', 2.7)
print 'Original:', values

print
print 'ctypes string buffer'

import ctypes
b = ctypes.create_string_buffer(s.size)
print 'Before :', binascii.hexlify(b.raw)
s.pack_into(b, 0, *values)
print 'After :', binascii.hexlify(b.raw)
print 'Unpacked:', s.unpack_from(b, 0)

print
print 'array'

import array
a = array.array('c', '\0' * s.size)
print 'Before :', binascii.hexlify(a)
s.pack_into(a, 0, *values)
print 'After :', binascii.hexlify(a)
print 'Unpacked:', s.unpack_from(a, 0)

$ python struct_buffers.py
Original: (1, 'ab', 2.7000000000000002)

ctypes string buffer
Before : 000000000000000000000000
After : 0100000061620000cdcc2c40
Unpacked: (1, 'ab', 2.7000000476837158)

array
Before : 000000000000000000000000
After : 0100000061620000cdcc2c40
Unpacked: (1, 'ab', 2.7000000476837158)
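
The offset argument is what makes a pre-allocated buffer useful for more than one structure. As a sketch for Python 3, a bytearray (not used in the example above) can hold several records packed one after another. The sample records are invented, and use float values that survive single-precision storage exactly.

```python
import struct

s = struct.Struct('<I 2s f')
records = [(1, b'ab', 2.5), (2, b'cd', 4.0)]

# One pre-allocated buffer large enough for every record.
buf = bytearray(s.size * len(records))

for i, record in enumerate(records):
    # Write each record at its own offset instead of creating new strings.
    s.pack_into(buf, i * s.size, *record)

unpacked = [s.unpack_from(buf, i * s.size) for i in range(len(records))]
print(unpacked)
```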


See also



struct

The standard library documentation for this module.

array

The array module, for working with sequences of fixed-type values.

binascii

The binascii module, for producing ASCII representations of binary data.

Wikipedia: Endianness

Explanation of byte order and endianness in encoding.



PyMOTW Home