Code Interstices All the little things that happen between bouts of coding. Covering internet technologies, Python, Mac OS X, and open source.
Wednesday, February 25, 2009
New Editor in Chief at Python Magazine
Editing a magazine like this every month is, not surprisingly, a huge amount of work. With the evolution and expansion of my duties at my day job (almost everyone at MTA is part-time, even me), I've found less and less time available to dedicate to my editorial duties. It was a tough decision, but rather than let the quality of the magazine suffer from lack of attention I found someone to take over who will be able to devote the time necessary to keep up our high standards. I am very pleased to announce that, beginning with the March 2009 issue, Brandon Rhodes will assume the duties of Editor in Chief.
I met Brandon through PyATL, the Atlanta Python user group, where he frequently gives presentations and is now the lead organizer. After convincing him to write a couple of articles for us, I invited him to be an Associate Editor for the magazine. His work as an AE over the past year made him stand out as an excellent candidate to replace me. We've been working on the transition for a little while now, so I have no doubt it will go smoothly. I am looking forward to watching PyMag continue to thrive through Brandon's skill and leadership.
This past year has been a great opportunity and a lot of fun. I want to thank MTA, and all the feature authors, columnists, editors, reviewers, as well as Arbi, Marco, and everyone else who worked with me. I'm proud of the work we've all done together!
I also want to thank our readers and the larger Python Community for their support. In the spirit of open source we have been welcomed warmly and enthusiastically over the past year, even though the magazine was a new enterprise and MTA was not widely known among Pythonistas. Your continued support (and writing) will help make Python Magazine a valuable resource for all of us, so keep those ideas and article submissions coming.
As for me, I plan to continue to contribute in smaller ways, such as an occasional article or column. I also have a long list of other projects that I have put aside for the past year, and it is time to pick up one or two of those and see what I can make of them. If you're interested, keep an eye on this blog for updates.
Python Magazine for February 2009
The February 2009 issue of Python Magazine is available for download now.
On the cover this month we have a story about the development of Urban Mediator, a tool for collaborative city planning created at the University of Art and Design Helsinki. This story is another example of how the variety of problems solved with Python is endless.
Michael Noll brings us a how-to for Writing a Personal Link Recommendation Engine. By studying the data available through the delicious.com API, Michael's example app can find links that might be of interest or related to links you already have bookmarked.
In Reactive Programming with Traited Python, Judah De Paula discusses how Traits can be used to add explicit typing, reactive programming, and fast user interface development to your application.
JC Cruz creates a simple text editor to show us how to use an MDI interface in Multiple Documents on PyObjC.
In Mark Mruss' Welcome to Python column this month we learn the basics of working with file I/O.
Brandon Rhodes regales us with the story of how Python's universal newline mode gave him headaches at a recent client engagement and, more importantly, how he solved the problem.
In this month's installment of Pragmatic Testers, Grig Gheorghiu covers using mock objects as a way of simulating components to simplify test setup.
And finally, Steve Holden covers some of the many Python information sources available online (and off), and suggests ways for you to help improve those resources.
We've packed this issue with lots of good content, so go download your copy today!
Sunday, February 22, 2009
PyMOTW: tarfile
tarfile – Tar archive access
| Purpose: | Tar archive access. |
|---|---|
| Python Version: | 2.3 and later |
The tarfile module provides read and write access to UNIX tar archives, including compressed files. In addition to the POSIX standards, several GNU tar extensions are supported. Various UNIX special file types (hard and soft links, device nodes, etc.) are also handled.
Testing Tar Files
The is_tarfile() function returns a boolean indicating whether or not the
filename passed as an argument refers to a valid tar file.
import tarfile
for filename in [ 'README.txt', 'example.tar',
                  'bad_example.tar', 'notthere.tar' ]:
    try:
        print '%20s %s' % (filename, tarfile.is_tarfile(filename))
    except IOError, err:
        print '%20s %s' % (filename, err)
Notice that if the file does not exist, is_tarfile() raises an IOError.
$ python tarfile_is_tarfile.py
README.txt False
example.tar True
bad_example.tar False
notthere.tar [Errno 2] No such file or directory: 'notthere.tar'
Reading Meta-data from an Archive
Use the TarFile class to work directly with a tar archive. It supports methods for reading data about existing archives as well as modifying the archives by adding additional files.
To read the names of the files in an existing archive, use getnames():
import tarfile
t = tarfile.open('example.tar', 'r')
print t.getnames()
The return value is a list of strings with the names of the archive contents:
$ python tarfile_getnames.py
['README.txt']
In addition to names, meta-data about the archive members is available as instances of TarInfo objects. Load the meta-data via getmembers() and getmember().
import tarfile
import time
t = tarfile.open('example.tar', 'r')
for member_info in t.getmembers():
    print member_info.name
    print '\tModified:\t', time.ctime(member_info.mtime)
    print '\tMode :\t', oct(member_info.mode)
    print '\tType :\t', member_info.type
    print '\tSize :\t', member_info.size, 'bytes'
    print
$ python tarfile_getmembers.py
README.txt
Modified: Sun Feb 22 11:13:55 2009
Mode : 0644
Type : 0
Size : 75 bytes
If you know in advance the name of the archive member, you can retrieve its TarInfo object with getmember().
import tarfile
import time
t = tarfile.open('example.tar', 'r')
for filename in [ 'README.txt', 'notthere.txt' ]:
    try:
        info = t.getmember(filename)
    except KeyError:
        print 'ERROR: Did not find %s in tar archive' % filename
    else:
        print '%s is %d bytes' % (info.name, info.size)
If the archive member is not present, getmember() raises a KeyError.
$ python tarfile_getmember.py
README.txt is 75 bytes
ERROR: Did not find notthere.txt in tar archive
Extracting Files From an Archive
To access the data from an archive member within your program, use the extractfile() method, passing the member’s name.
import tarfile
t = tarfile.open('example.tar', 'r')
for filename in [ 'README.txt', 'notthere.txt' ]:
    try:
        f = t.extractfile(filename)
    except KeyError:
        print 'ERROR: Did not find %s in tar archive' % filename
    else:
        print filename, ':', f.read()
$ python tarfile_extractfile.py
README.txt : The examples for the tarfile module use this file and example.tar as data.
ERROR: Did not find notthere.txt in tar archive
If you just want to unpack the archive and write the files to the filesystem, use extract() or extractall() instead.
import tarfile
import os
os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
t.extract('README.txt', 'outdir')
print os.listdir('outdir')
$ python tarfile_extract.py
['README.txt']
Note
The standard library documentation includes a note stating that extractall() is safer than extract(), and it should be used in most cases.
import tarfile
import os
os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
t.extractall('outdir')
print os.listdir('outdir')
$ python tarfile_extractall.py
['README.txt']
If you only want to extract certain files from the archive, pass their TarInfo objects to extractall() through the members argument.
import tarfile
import os
os.mkdir('outdir')
t = tarfile.open('example.tar', 'r')
t.extractall('outdir', members=[t.getmember('README.txt')])
print os.listdir('outdir')
$ python tarfile_extractall_members.py
['README.txt']
Creating New Archives
To create a new archive, simply open the TarFile with a mode of 'w'. Any existing file is truncated and a new archive is started. To add files, use the add() method.
import tarfile
print 'creating archive'
out = tarfile.open('tarfile_add.tar', mode='w')
try:
    print 'adding README.txt'
    out.add('README.txt')
finally:
    print 'closing'
    out.close()
print
print 'Contents:'
t = tarfile.open('tarfile_add.tar', 'r')
for member_info in t.getmembers():
    print member_info.name
$ python tarfile_add.py
creating archive
adding README.txt
closing
Contents:
README.txt
Using Alternate Archive Member Names
It is possible to add a file to an archive using a name other than the original file name, by constructing a TarInfo object with an alternate arcname and passing it to addfile().
import tarfile
print 'creating archive'
out = tarfile.open('tarfile_addfile.tar', mode='w')
try:
    print 'adding README.txt as RENAMED.txt'
    info = out.gettarinfo('README.txt', arcname='RENAMED.txt')
    # Pass the open file as well; without it only the header is written.
    f = open('README.txt', 'rb')
    try:
        out.addfile(info, f)
    finally:
        f.close()
finally:
    print 'closing'
    out.close()
print
print 'Contents:'
t = tarfile.open('tarfile_addfile.tar', 'r')
for member_info in t.getmembers():
    print member_info.name
The archive includes only the changed filename:
$ python tarfile_addfile.py
creating archive
adding README.txt as RENAMED.txt
closing
Contents:
RENAMED.txt
Writing Data from Sources Other Than Files
Sometimes you want to write data to an archive but the data is not in a file on the filesystem. Rather than writing the data to a file, then adding that file to the tar archive, you can use addfile() to add data from an open file-like handle.
import tarfile
from cStringIO import StringIO
data = 'This is the data to write to the archive.'
out = tarfile.open('tarfile_addfile_string.tar', mode='w')
try:
    info = tarfile.TarInfo('made_up_file.txt')
    info.size = len(data)
    out.addfile(info, StringIO(data))
finally:
    out.close()
print
print 'Contents:'
t = tarfile.open('tarfile_addfile_string.tar', 'r')
for member_info in t.getmembers():
    print member_info.name
    f = t.extractfile(member_info)
    print f.read()
By first constructing a TarInfo object ourselves, we can give the archive member any name we wish. After setting the size, we can write the data to the archive using addfile() and passing a StringIO buffer as a source of the data.
$ python tarfile_addfile_string.py
Contents:
made_up_file.txt
This is the data to write to the archive.
Appending to Archives
In addition to creating new archives, it is possible to append to an existing file. To open a file to append to it, use mode 'a'.
import tarfile
print 'creating archive'
out = tarfile.open('tarfile_append.tar', mode='w')
try:
    out.add('README.txt')
finally:
    out.close()
print 'contents:', [m.name
                    for m in tarfile.open('tarfile_append.tar', 'r').getmembers()]
print 'adding index.rst'
out = tarfile.open('tarfile_append.tar', mode='a')
try:
    out.add('index.rst')
finally:
    out.close()
print 'contents:', [m.name
                    for m in tarfile.open('tarfile_append.tar', 'r').getmembers()]
The resulting archive ends up with 2 members:
$ python tarfile_append.py
creating archive
contents: ['README.txt']
adding index.rst
contents: ['README.txt', 'index.rst']
Working with Compressed Archives
Besides regular tar archive files, the tarfile module can work with archives compressed via the gzip or bzip2 protocols. To open a compressed archive, modify the mode string passed to open() to include ":gz" or ":bz2", depending on the compression method you want to use.
import tarfile
import os
fmt = '%-30s %-10s'
print fmt % ('FILENAME', 'SIZE')
print fmt % ('README.txt', os.stat('README.txt').st_size)
for filename, write_mode in [
    ('tarfile_compression.tar', 'w'),
    ('tarfile_compression.tar.gz', 'w:gz'),
    ('tarfile_compression.tar.bz2', 'w:bz2'),
    ]:
    out = tarfile.open(filename, mode=write_mode)
    try:
        out.add('README.txt')
    finally:
        out.close()
    print fmt % (filename, os.stat(filename).st_size),
    print [m.name for m in tarfile.open(filename, 'r:*').getmembers()]
When opening an existing archive for reading, you can specify "r:*" to have tarfile determine the compression method to use automatically.
$ python tarfile_compression.py
FILENAME SIZE
README.txt 75
tarfile_compression.tar 10240 ['README.txt']
tarfile_compression.tar.gz 213 ['README.txt']
tarfile_compression.tar.bz2 186 ['README.txt']
See also
- tarfile: The standard library documentation for this module.
- GNU tar manual: Documentation of the tar format, including extensions.
- zipfile: Similar access for ZIP archives.
- gzip: GNU zip compression.
- bz2: bzip2 compression.
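The listings in this post are written for Python 2. As a minimal sketch of how the in-memory technique from "Writing Data from Sources Other Than Files" might look on Python 3 (an assumption, since the post predates it), the version below swaps cStringIO for io.BytesIO and keeps the whole archive in memory. The helper names build_archive and read_member are made up for illustration.

```python
import io
import tarfile


def build_archive(name, payload, mode='w:gz'):
    """Return the bytes of a tar archive holding one in-memory member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as out:
        info = tarfile.TarInfo(name)
        info.size = len(payload)  # addfile() reads exactly this many bytes
        out.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_member(archive_bytes, name):
    """Read one member back, letting tarfile detect the compression."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:*') as t:
        return t.extractfile(name).read()


archive = build_archive('made_up_file.txt',
                        b'This is the data to write to the archive.')
print(read_member(archive, 'made_up_file.txt'))
```

Setting info.size before calling addfile() matters here just as it does in the Python 2 listing, because the size in the header decides how much of the buffer is copied into the archive.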
Friday, February 20, 2009
"Writing About Python" at PyCon

Of course, authoring is only one aspect of producing high quality professional writing. As an editor, it's no surprise that I think all authors can benefit from having someone else read their work. Even an informal review can help identify gaps in an explanation or problems with flow, not to mention typos.
Writers, reviewers, and editors can all benefit from sharing tips, offering advice on tricky problem passages, or just chatting about ongoing projects. That list sounds just like the sorts of things we talk about when discussing programming, doesn't it?
I'd like to bring people together in person for a group discussion and a chance to network. To kick things off, I'm working on organizing a "meetup" as an open space session at PyCon in March. If you are a blogger, published writer, reviewer, or editor -- or aspire to be -- I want you to come and participate. No prior experience is necessary; everyone is welcome.
Let's connect people interested in doing technical reviews with authors and editors. Let's introduce potential writers to the editors and publishers. Let's learn from one another and improve our prose, the same way we improve our code.
We'll have to wait to schedule a precise time until the conference, but right now I'm thinking about late Saturday morning or early afternoon. If you are interested in attending, please post a comment so I can gauge the size of the room we'll need. If you have a preference for the time, make sure to include that information. And if you have ideas for things we should talk about, let's hear them.
Update: Although I was thinking of "print" writing, other forms of publishing like podcasting are welcome, too. We need more good Python podcasts!
Update 2: See the wiki page for updates as we develop these plans.
Sunday, February 15, 2009
PyMOTW: grp
grp – Unix Group Database
| Purpose: | Read group data from Unix group database. |
|---|---|
| Python Version: | 1.4 and later |
The grp module can be used to read information about Unix groups from the group database (usually /etc/group). The read-only interface returns tuple-like objects with named attributes for the standard fields of a group record.
| Index | Attribute | Meaning |
|---|---|---|
| 0 | gr_name | Name |
| 1 | gr_passwd | Password, if any (encrypted) |
| 2 | gr_gid | Numerical id (integer) |
| 3 | gr_mem | Names of group members |
The name and password values are both strings, the GID is an integer, and the members are reported as a list of strings.
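To illustrate the table, here is a small sketch showing that the same record can be read by position or by attribute. It is written for Python 3 (an assumption; the listings in this post use Python 2), and group_as_dict is a made-up helper name.

```python
import grp


def group_as_dict(record):
    """Convert a group record to a dict using positional unpacking."""
    name, passwd, gid, members = record  # the record unpacks like a 4-tuple
    return {'name': name, 'passwd': passwd, 'gid': gid,
            'members': list(members)}


for record in grp.getgrall()[:3]:
    fields = group_as_dict(record)
    # Positional and attribute access refer to the same values.
    assert fields['name'] == record.gr_name
    assert fields['gid'] == record.gr_gid
    print(fields)
```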
Querying All Groups
Suppose you need to print a report of all of the “real” groups on a system, including their members (for our purposes, “real” is defined as having a name not starting with “_”). To load the entire group database, you would use getgrall(). The return value is a list with an undefined order, so you probably want to sort it before printing the report.
import grp
import operator
# Load all of the group data, sorted by group name
all_groups = grp.getgrall()
interesting_groups = sorted((g
                            for g in all_groups
                            if not g.gr_name.startswith('_')),
                            key=operator.attrgetter('gr_name'))
# Find the longest length for the name
name_length = max(len(g.gr_name) for g in interesting_groups) + 1
# Print report headers
fmt = '%-*s %4s %10s %s'
print fmt % (name_length, 'Name',
             'GID',
             'Password',
             'Members')
print '-' * name_length, '----', '-' * 10, '-' * 30
# Print the data
for g in interesting_groups:
    print fmt % (name_length, g.gr_name,
                 g.gr_gid,
                 g.gr_passwd,
                 ', '.join(g.gr_mem))
$ python grp_getgrall.py
Name GID Password Members
---------------------------------------- ---- ---------- ------------------------------
accessibility 90 *
admin 80 * root, dhellmann
authedusers 50
bin 7 *
certusers 29 * root, _jabber, _postfix, _cyrus, _calendar
com.apple.access_screensharing-disabled 101 dhellmann
consoleusers 53
daemon 1 * root
dhellmann 501
dialer 68 *
everyone 12
group 16
interactusers 51
kmem 2 * root
localaccounts 61
mail 6 *
netaccounts 62
netusers 52
network 69 *
nobody -2 *
nogroup -1 *
operator 5 * root
owner 10
postgres 401
procmod 9 * root
procview 8 * root
racemi 500 dhellmann
smmsp 25
staff 20 * root, test
sys 3 * root
tty 4 * root
utmp 45 *
wheel 0 * root
Group Memberships for a User
Another common task might be to print a list of all the groups for a given user:
import grp
username = 'dhellmann'
groups = [g.gr_name for g in grp.getgrall() if username in g.gr_mem]
print username, 'belongs to:', ', '.join(groups)
$ python grp_groups_for_user.py
dhellmann belongs to: _lpadmin, admin, com.apple.access_screensharing-disabled, racemi
Finding a Group By Name
As with pwd, it is also possible to query for information about a specific group, either by name or numeric id.
import grp
name = 'admin'
info = grp.getgrnam(name)
print 'Name :', info.gr_name
print 'GID :', info.gr_gid
print 'Password:', info.gr_passwd
print 'Members :', ', '.join(info.gr_mem)
$ python grp_getgrnam.py
Name : admin
GID : 80
Password: *
Members : root, dhellmann
Finding a Group by ID
To find the name of the group the current process is running as, combine getgrgid() with os.getgid().
import grp
import os
gid = os.getgid()
group_info = grp.getgrgid(gid)
print 'Currently running with GID=%s name=%s' % (gid, group_info.gr_name)
$ python grp_getgrgid_process.py
Currently running with GID=501 name=dhellmann
And to get the group name based on the permissions on a file, look up the group returned by os.stat().
import grp
import os
import sys
filename = 'grp_getgrgid_fileowner.py'
stat_info = os.stat(filename)
owner = grp.getgrgid(stat_info.st_gid).gr_name
print '%s is owned by %s (%s)' % (filename, owner, stat_info.st_gid)
$ python grp_getgrgid_fileowner.py
grp_getgrgid_fileowner.py is owned by dhellmann (501)
See also
- grp: The standard library documentation for this module.
- pwd: Read user data from the password database.
- spwd: Read user data from the shadow password database.
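One caveat about the membership example in "Group Memberships for a User": on many Unix systems gr_mem lists only supplementary members, so a user's primary group (the pw_gid field in the password database) may not show up. That is an assumption about typical configurations, not something the output in this post demonstrates. A minimal Python 3 sketch that accounts for it follows; groups_for_user and the sample records are hypothetical, and the records are stand-ins so the example behaves the same on any machine.

```python
from collections import namedtuple


def groups_for_user(username, primary_gid, all_groups):
    """Names of groups that list the user or match the primary GID."""
    names = set()
    for g in all_groups:
        if username in g.gr_mem or g.gr_gid == primary_gid:
            names.add(g.gr_name)
    return sorted(names)


# Stand-in records with the same attribute names grp uses.
Group = namedtuple('Group', 'gr_name gr_passwd gr_gid gr_mem')
sample = [
    Group('admin', '*', 80, ['root', 'dhellmann']),
    Group('dhellmann', '', 501, []),
    Group('wheel', '*', 0, ['root']),
]
print(groups_for_user('dhellmann', 501, sample))
```

Against the live databases, the same function could be called with pwd.getpwnam(username).pw_gid as the primary GID and grp.getgrall() as the group list.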
Sunday, February 8, 2009
PyMOTW: pwd
pwd – Unix Password Database
| Purpose: | Read user data from Unix password database. |
|---|---|
| Python Version: | 1.4 and later |
The pwd module can be used to read user information from the Unix password database (usually /etc/passwd). The read-only interface returns tuple-like objects with named attributes for the standard fields of a password record.
| Index | Attribute | Meaning |
|---|---|---|
| 0 | pw_name | The user’s login name |
| 1 | pw_passwd | Encrypted password (optional) |
| 2 | pw_uid | User id (integer) |
| 3 | pw_gid | Group id (integer) |
| 4 | pw_gecos | Comment/full name |
| 5 | pw_dir | Home directory |
| 6 | pw_shell | Application started on login, usually a command interpreter |
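As a quick illustration of the table, the sketch below reads the same fields by index and by attribute. It targets Python 3 (an assumption; the listings in this post use Python 2), and summarize is a made-up helper name.

```python
import pwd


def summarize(record):
    """Build a short summary using both access styles."""
    login = record[0]      # same value as record.pw_name
    home = record.pw_dir   # same value as record[5]
    return '%s -> %s' % (login, home)


for record in pwd.getpwall()[:3]:
    assert record[0] == record.pw_name
    assert record[5] == record.pw_dir
    print(summarize(record))
```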
Querying All Users
Suppose you need to print a report of all of the “real” users on a system, including their home directories (for our purposes, “real” is defined as having a name not starting with “_”). To load the entire password database, you would use getpwall(). The return value is a list with an undefined order, so you probably want to sort it before printing the report.
import pwd
import operator
# Load all of the user data, sorted by username
all_user_data = pwd.getpwall()
interesting_users = sorted((u
                            for u in all_user_data
                            if not u.pw_name.startswith('_')),
                            key=operator.attrgetter('pw_name'))
# Find the longest lengths for a few fields
username_length = max(len(u.pw_name) for u in interesting_users) + 1
home_length = max(len(u.pw_dir) for u in interesting_users) + 1
# Print report headers
fmt = '%-*s %4s %-*s %s'
print fmt % (username_length, 'User',
             'UID',
             home_length, 'Home Dir',
             'Description')
print '-' * username_length, '----', '-' * home_length, '-' * 30
# Print the data
for u in interesting_users:
    print fmt % (username_length, u.pw_name,
                 u.pw_uid,
                 home_length, u.pw_dir,
                 u.pw_gecos)
Most of the example code above deals with formatting the results nicely. The for loop at the end shows how to access fields from the records by name.
$ python pwd_getpwall.py
User UID Home Dir Description
---------- ---- ----------------- ------------------------------
daemon 1 /var/root System Services
dhellmann 527 /Users/dhellmann Doug Hellmann
nobody -2 /var/empty Unprivileged User
postgres 401 /var/empty PostgreSQL Server
root 0 /var/root System Administrator
Querying User By Name
If you need information about one user, it is not necessary to read the entire password database. Using getpwnam(), you can retrieve the information about a user by name.
import pwd
import sys
username = sys.argv[1]
user_info = pwd.getpwnam(username)
print 'Username:', user_info.pw_name
print 'Password:', user_info.pw_passwd
print 'Comment :', user_info.pw_gecos
print 'UID/GID :', user_info.pw_uid, '/', user_info.pw_gid
print 'Home :', user_info.pw_dir
print 'Shell :', user_info.pw_shell
The passwords on my system are stored outside of the main user database in a shadow file, so the password field, when set, is reported as all *.
$ python pwd_getpwnam.py dhellmann
Username: dhellmann
Password: ********
Comment : Doug Hellmann
UID/GID : 527 / 501
Home : /Users/dhellmann
Shell : /bin/bash
$ python pwd_getpwnam.py postgres
Username: postgres
Password:
Comment : PostgreSQL Server
UID/GID : 401 / 401
Home : /var/empty
Shell : /usr/bin/false
Querying User By UID
It is also possible to look up a user by their numerical user id. This is useful to find the owner of a file:
import pwd
import os
import sys
filename = 'pwd_getpwuid_fileowner.py'
stat_info = os.stat(filename)
owner = pwd.getpwuid(stat_info.st_uid).pw_name
print '%s is owned by %s (%s)' % (filename, owner, stat_info.st_uid)
$ python pwd_getpwuid_fileowner.py
pwd_getpwuid_fileowner.py is owned by dhellmann (527)
The numeric user id can also be used to find information about the user currently running a process:
import pwd
import os
uid = os.getuid()
user_info = pwd.getpwuid(uid)
print 'Currently running with UID=%s username=%s' % (uid, user_info.pw_name)
$ python pwd_getpwuid_process.py
Currently running with UID=527 username=dhellmann
See also
- pwd: The standard library documentation for this module.
- spwd: Secure password database access for systems using shadow passwords.
- grp: The grp module reads Unix group information.
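The file-owner lookup above, like its counterpart in the grp post, assumes the numeric id has an entry in the database. getpwuid() and getgrgid() raise KeyError when it does not, which can happen for files copied from another machine. Here is a hedged Python 3 sketch that falls back to the numeric id; owner_names is a hypothetical helper, and the temporary file only exists so the example has something to inspect.

```python
import grp
import os
import pwd
import tempfile


def owner_names(path):
    """Return (user, group) for path, falling back to the ids as strings."""
    st = os.stat(path)
    try:
        user = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:  # uid has no entry in the password database
        user = str(st.st_uid)
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:  # gid has no entry in the group database
        group = str(st.st_gid)
    return user, group


with tempfile.NamedTemporaryFile() as f:
    print('%s is owned by %s:%s' % ((f.name,) + owner_names(f.name)))
```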
Monday, February 2, 2009
Writing Technical Documentation with Sphinx, Paver, and Cog
Editing Text: TextMate
I work on a MacBook Pro, and use TextMate for editing the articles and source for PyMOTW. TextMate is the one tool I use regularly that is not open source. When I'm doing heavy editing of hundreds of files for my day job I use Aquamacs Emacs, but TextMate is better suited for prose editing and is easier to extend with quick actions. I discovered TextMate while looking for a native editor to use for Python Magazine, and after being able to write my own "bundle" to manage magazine articles (including defining a mode for the markup language we use) I was hooked.
Some of the features that I like about TextMate for prose editing are as-you-type spell-checking (I know some people hate this feature, but I find it useful), text statistics (word count, etc.), easy block selection (I can highlight a paragraph or several sentences and move them using cursor keys), a moderately good reStructuredText mode (emacs' is better, but TextMate's is good enough), paren and quote matching as you type, and very simple extensibility for repetitive tasks. I also like TextMate's project management features, since they make it easy to open several related files at the same time.
Version Control: svn
I started out using a private svn repository for all of my projects, including PyMOTW. I'm in the middle of evaluating hosted DVCS options for PyMOTW, but still haven't had enough time to give them all the research I think is necessary before making the move. The Python core developers are considering a similar move (PEP 374) so it will be interesting to monitor that discussion. No doubt we have different requirements (for example, they are hosting their own repository), but the experiences with the various DVCS tools will be useful input to my own decision.
Markup Language: reStructuredText
When I began posting, I wrote each article by hand using HTML. One of the first tasks that I automated was the step of passing the source code through pygments to produce a syntax colorized version. This worked well enough for me at the time, but restricted me to producing only HTML output. Eventually John Benediktsson contacted me with a version of many of the posts converted from HTML to reStructuredText.
When reStructuredText was first put forward in the '90's, I was heavily into Zope development. As such, I was using StructuredText for documenting my code, and in the Zope-based wiki that we ran at ZapMedia. I even wrote my own app to extract comments and docstrings to generate library documentation for a couple of libraries I had released as open source. I really liked StructuredText and, at first, I didn't like reStructuredText. Frankly, it looked ugly compared to what I was used to. It quickly gained acceptance in the general community though, and I knew it would give me options for producing other output formats for the PyMOTW posts, so when John sent me the markup files I took another look.
While re-acquainting myself with reST, I realized two things. First, although there is a bit more punctuation involved in the markup than with the original StructuredText, the markup language was designed with consistency in mind so it isn't as difficult to learn as my first impressions had led me to believe. Second, it turned out the part I thought was "ugly" was actually the part that made reST more powerful than StructuredText: It has a standard syntax for extension directives that users can define for their own documents.
Markup to Output: Sphinx
Before I made a final decision on switching from hand-coded HTML to reST, I needed a tool to convert to HTML (I still had to post the results on the blog, after all, and Blogger doesn't support reST). I first tried David Goodger's docutils package. The scripts it includes felt a little too much like "pieces" of a tool rather than a complete solution, though, and I didn't really want to assemble my own wrappers if I didn't have to -- I wanted to write text for this project, not code my own tools. Around this time, Georg Brandl had made significant progress on Sphinx, which turned out to be a more complete turn-key system for converting a pile of reST files to HTML or PDF. After a few hours of experimentation, I had a sample project set up and was generating HTML from my documents using the standard templates.
I decided that reStructuredText looked like the way to go.
HTML Templates: Jinja
My next step was to work out exactly how to produce all of the outputs I needed from reST inputs. Each post for the PyMOTW series ends up going to several different places:
- the PyMOTW source distribution (HTML)
- my Blogger blog (HTML)
- the PyMOTW project site (HTML)
- O'Reilly.com (HTML)
- the PyMOTW "book" (PDF)
Each of the four HTML outputs uses slightly different formatting, requiring separate templates (PDF is a whole different problem, covered below). The source distribution and project site are both full HTML versions of all of the documents, but use different templates. I decided to use the default Sphinx templates for the packaged version; I may change that later, but it works for the time being, and it's one less custom template to deal with. I wanted the online version to match the appearance of the rest of my site, so I needed to create a template for it. The two blogs use a third template (O'Reilly's site ignores a lot of the markup due to their Movable Type configuration, but the articles come out looking good enough so I can use the same template I use for my own blog without worrying about a separate custom template).
Sphinx uses Jinja templates to produce HTML output. The syntax for Jinja is very similar to Django's template language. As it happens, I use Django for the dynamic portion of my web site that I host myself. I lucked out, and my site's base template was simple enough to use with Sphinx without making any changes. Yay for compatibility!
Cleaning up HTML with BeautifulSoup
The blog posts need to be relatively clean HTML that I can upload to Blogger and O'Reilly, so they could not include any html or body tags or require any markup or styles not supported by either blogging engine. The template I came up with is a stripped down version that doesn't include the CSS and markup for sidebars, header, or footer. The result was almost exactly what I wanted, but had two problems. The easiest problem to handle was the permalinks generated by Sphinx. After each heading on the page, Sphinx inserts an anchor tag with a ¶ character and applies CSS styles that hide/show the tag when the user hovers over it. That's a nice feature for the main site and packaged content, but they didn't work for the blogs. I have no control over the CSS used at O'Reilly, so the tags were always visible. I didn't really care if they were included on the Blogger pages, so the simplest thing to do was stick with one "blogging" template and remove the permalinks.
The second, more annoying problem was that Blogger wanted to insert extra whitespace into the post. There is a configuration option on Blogger to treat line breaks in the post as "paragraph breaks" (I think they actually insert br tags). This is very convenient for normal posts with mostly straight text, since I can simply write each paragraph on one long line, wrapped visually by my editor, and break the paragraphs where I want them. The result is I can almost post directly from plain text input. Unfortunately, the option is applied to every post in the blog (even old posts), so changing it was not a realistic option -- I wasn't about to go back and re-edit every single post I had previously written.
Sphinx didn't have an option to skip generating the permalinks, and there was no way to express that intent in the template, so I fell back to writing a little script to strip them out after the fact. I used BeautifulSoup to find the tags I wanted removed, delete them from the parse tree, then assemble the HTML text as a string again. I added code to the same script to handle the whitespace issue by removing all newlines from the input unless they were inside pre tags, which Blogger handled correctly. The result was a single blob of partial HTML without newlines or permalinks that I could post directly to either blog without editing it by hand. Score a point for automation.
def clean_blog_html(body):
# Clean up the HTML
import re
import sys
from BeautifulSoup import BeautifulSoup
from cStringIO import StringIO
# The post body is passed to stdin.
soup = BeautifulSoup(body)
# Remove the permalinks to each header since the blog does not have
# the styles to hide them.
links = soup.findAll('a', attrs={'class':"headerlink"})
[l.extract() for l in links]
# Get BeautifulSoup's version of the string
s = soup.__str__(prettyPrint=False)
# Remove extra newlines. This depends on the fact that
# code blocks are passed through pygments, which wraps each part of the line
# in a span tag.
pattern = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL|re.IGNORECASE)
s = ''.join(pattern.sub(r'\1', l) for l in StringIO(s))
return s
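For readers on Python 3 without BeautifulSoup at hand, the same two cleanup steps can be approximated with the standard library alone. This is a minimal sketch, not the script PyMOTW uses: the function names are made up for illustration, regular expressions are a much blunter tool than a real parser, and it assumes the permalink anchors carry class="headerlink" exactly as in the code above and that pre blocks are never nested.

```python
import re

# Assumption: permalink anchors look like <a class="headerlink" ...>...</a>.
HEADERLINK = re.compile(
    r'<a\b[^>]*\bclass="headerlink"[^>]*>.*?</a>',
    re.DOTALL | re.IGNORECASE,
)

# The capturing group makes re.split() keep the pre blocks in its result.
PRE_BLOCK = re.compile(r'(<pre\b.*?</pre>)', re.DOTALL | re.IGNORECASE)


def strip_headerlinks(html):
    """Remove the permalink anchors from the HTML."""
    return HEADERLINK.sub('', html)


def strip_newlines_outside_pre(html):
    """Remove newlines everywhere except inside pre blocks."""
    cleaned = []
    for index, part in enumerate(PRE_BLOCK.split(html)):
        if index % 2 == 1:
            # Odd positions hold the captured pre blocks: keep them as-is.
            cleaned.append(part)
        else:
            cleaned.append(part.replace('\n', ''))
    return ''.join(cleaned)


def clean_html(html):
    """Apply both cleanup steps (hypothetical helper)."""
    return strip_newlines_outside_pre(strip_headerlinks(html))
```

Removing newlines outright joins any words that were wrapped across lines, so this only suits input where each paragraph already sits on one line, which is the situation described above.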
Code Syntax Highlighting: pygments
I wanted my posts to look as good as possible, and an important factor in the appearance would be the presentation of the source code. I adopted pygments in the early hand-coded HTML days, because it was easy to integrate into TextMate with a simple script.
pygmentize -f html -O cssclass=syntax $@
Binding the command to a key combination meant that with a few quick keypresses I had HTML ready to insert into the body of a post.
When I moved to Sphinx, using pygments became even easier because Sphinx automatically passes included source code through pygments as it generates its output. Syntax highlighting works for HTML and PDF, so I didn't need any custom processing.
Automation: Paver
Automation is important for my sense of well being. I hate dealing with mundane repetitive tasks, so once an article was written I didn't want to have to touch it to prepare it for publication to any of the final destinations. As I have written before, I started out using make to run various shell commands. I have since converted the entire process to Paver.
The stock Sphinx integration that comes with Paver didn't quite meet my needs, but by examining the source I was able to create my own replacement tasks in an afternoon. The main problem was the tight coupling between the code to run Sphinx and the code to find the options to pass to it. For normal projects with a single documentation output format (Paver assumes HTML with a single config file), this isn't a problem. PyMOTW's requirements are different, with the four output formats discussed above.
In order to produce different output with Sphinx, you need different configuration files. Since the base name for the file must always be conf.py, that means the files have to be stored in separate directories. One of the options passed to Sphinx on the command line tells it the directory to look in for its configuration file. Even though Paver doesn't fork() before calling Sphinx, it still uses the command line options to pass instructions. Creating separate Sphinx configuration files was easy. The problem was defining options in Paver to tell Sphinx about each configuration directory for the different output. Paver options are grouped into bundles, each of which is essentially a namespace. When a Paver task looks for an option, it scans through the bundles, possibly cascading to the global namespace, until it finds the option by name. The search can be limited to specific bundles, so that the same option name can be used to configure different tasks.
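The lookup described above can be modeled in a few lines of plain Python. This is only a sketch of the behavior as this article describes it, not Paver's implementation: find_option(), its arguments, and the option values are all hypothetical, and plain dictionaries stand in for Paver's bundles.

```python
def find_option(options, name, search_order, add_rest=False):
    """Return the first value for name found along search_order.

    options is a dict whose nested dicts play the role of bundles.
    """
    for bundle_name in search_order:
        bundle = options.get(bundle_name, {})
        if name in bundle:
            return bundle[name]
    if add_rest and name in options:
        # Fall back to a value defined outside the named bundles.
        return options[name]
    raise KeyError(name)


# Hypothetical option values, for illustration only.
options = {
    'confdir': 'global_conf',
    'sphinx': {'confdir': 'sphinx', 'builder': 'html'},
    'pdf': {'builder': 'latex'},
}

# The same option name resolves differently depending on the search path.
print(find_option(options, 'builder', ['pdf', 'sphinx']))
print(find_option(options, 'builder', ['sphinx']))
```

Limiting the search to named bundles is what lets two tasks share an option name such as builder without stepping on each other.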
The html task from paver.doctools sets the options search order to look for values first in the sphinx section, then globally. Once it has retrieved the path values, via _get_paths(), it invokes Sphinx.
def _get_paths():
    """look up the options that determine where all of the files are."""
    opts = options
    docroot = path(opts.get('docroot', 'docs'))
    if not docroot.exists():
        raise BuildFailure("Sphinx documentation root (%s) does not exist."
                           % docroot)
    builddir = docroot / opts.get("builddir", ".build")
    builddir.mkdir()
    srcdir = docroot / opts.get("sourcedir", "")
    if not srcdir.exists():
        raise BuildFailure("Sphinx source file dir (%s) does not exist"
                           % srcdir)
    htmldir = builddir / "html"
    htmldir.mkdir()
    doctrees = builddir / "doctrees"
    doctrees.mkdir()
    return Bunch(locals())

@task
def html():
    """Build HTML documentation using Sphinx. This uses the following
    options in a "sphinx" section of the options.

    docroot
      the root under which Sphinx will be working. Default: docs
    builddir
      directory under the docroot where the resulting files are put.
      default: build
    sourcedir
      directory under the docroot for the source files
      default: (empty string)
    """
    options.order('sphinx', add_rest=True)
    paths = _get_paths()
    sphinxopts = ['', '-b', 'html', '-d', paths.doctrees,
                  paths.srcdir, paths.htmldir]
    dry("sphinx-build %s" % (" ".join(sphinxopts),), sphinx.main, sphinxopts)
This didn't work for me because I needed to pass a separate configuration directory (not handled by the default _get_paths()) and different build and output directories. The simplest solution turned out to be re-implementing the Paver-Sphinx integration to make it more flexible. I created my own _get_paths() and made it look for the extra option values and use the directory structure I needed.
def _get_paths():
    """look up the options that determine where all of the files are."""
    opts = options
    docroot = path(opts.get('docroot', 'docs'))
    if not docroot.exists():
        raise BuildFailure("Sphinx documentation root (%s) does not exist."
                           % docroot)
    builddir = docroot / opts.get("builddir", ".build")
    builddir.mkdir()
    srcdir = docroot / opts.get("sourcedir", "")
    if not srcdir.exists():
        raise BuildFailure("Sphinx source file dir (%s) does not exist"
                           % srcdir)
    # Where is the sphinx conf.py file?
    confdir = path(opts.get('confdir', srcdir))
    # Where should output files be generated?
    outdir = opts.get('outdir', '')
    if outdir:
        outdir = path(outdir)
    else:
        outdir = builddir / opts.get('builder', 'html')
    outdir.mkdir()
    # Where are doctrees cached?
    doctrees = opts.get('doctrees', '')
    if not doctrees:
        doctrees = builddir / "doctrees"
    else:
        doctrees = path(doctrees)
    doctrees.mkdir()
    return Bunch(locals())
Then I defined a new function, run_sphinx(), to set up the options search path, look for the option values, and invoke Sphinx. I set add_rest to False to disable searching globally for an option to avoid namespace pollution from option collisions, since I knew I was going to have options with the same names but different values for each output format. I also look for a "builder", to support PDF generation.
def run_sphinx(*option_sets):
    """Helper function to run sphinx with common options.
    Pass the names of namespaces to be used in the search path
    for options.
    """
    if 'sphinx' not in option_sets:
        option_sets += ('sphinx',)
    kwds = dict(add_rest=False)
    options.order(*option_sets, **kwds)
    paths = _get_paths()
    sphinxopts = ['',
                  '-b', options.get('builder', 'html'),
                  '-d', paths.doctrees,
                  '-c', paths.confdir,
                  paths.srcdir, paths.outdir]
    dry("sphinx-build %s" % (" ".join(sphinxopts),), sphinx.main, sphinxopts)
    return
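It can help to see the argument list run_sphinx() assembles as the equivalent sphinx-build command line. The sketch below is standalone Python 3 and is not part of pavement.py: the two helper names are invented, and the paths passed in (especially the source directory "src") are placeholder values rather than the real PyMOTW settings. The -b, -d, and -c flags are the same ones used in the function above.

```python
import shlex


def sphinx_build_args(builder, doctrees, confdir, srcdir, outdir):
    """Assemble arguments in the same order run_sphinx() uses."""
    return [
        '-b', builder,        # which builder to run (html, latex, ...)
        '-d', str(doctrees),  # where the doctree cache is kept
        '-c', str(confdir),   # directory holding conf.py
        str(srcdir),
        str(outdir),
    ]


def as_command_line(args):
    """Render the arguments as a shell command, for logging or a dry run."""
    return ' '.join(shlex.quote(arg) for arg in ['sphinx-build'] + list(args))


# Placeholder values, for illustration only.
print(as_command_line(
    sphinx_build_args('latex', 'sphinx/doctrees', 'sphinx', 'src', 'web/latex')))
```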
With a working run_sphinx() function I could define several Sphinx-based tasks, each taking options with the same names but from different parts of the namespace. The tasks simply call run_sphinx() with the desired namespace search path. For example, to generate the HTML to include in the sdist package, the html task looks in the html bunch:
@task
@needs(['cog'])
def html():
    set_templates(options.html.templates)
    run_sphinx('html')
while generating the HTML output for the website uses a different set of options from the website bunch:
@task
@needs(['webtemplatebase', 'cog'])
def webhtml():
    set_templates(options.website.templates)
    run_sphinx('website')
    return
All of the option search paths also include the sphinx bunch, so values that do not change (such as the source directory) do not need to be repeated. The relevant portion of the options from the PyMOTW pavement.py file looks like this:
options(
    # ...
    sphinx = Bunch(
        sourcedir=PROJECT,
        docroot = '.',
        builder = 'html',
        doctrees='sphinx/doctrees',
        confdir = 'sphinx',
    ),
    html = Bunch(
        builddir='docs',
        outdir='docs',
        templates='pkg',
    ),
    website=Bunch(
        templates = 'web',
        #outdir = 'web',
        builddir = 'web',
    ),
    pdf=Bunch(
        templates='pkg',
        #outdir='pdf_output',
        builddir='web',
        builder='latex',
    ),
    blog=Bunch(
        sourcedir=path(PROJECT)/MODULE,
        builddir='blog_posts',
        outdir='blog_posts',
        confdir='sphinx/blog',
        doctrees='blog_posts/doctrees',
    ),
    # ...
)
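To see how these bunches combine, here is a small standalone model of the output-directory logic from my _get_paths(). It is a sketch under assumptions rather than the real build code: plain dictionaries and collections.ChainMap stand in for Paver's option search, the resolve() helper is hypothetical, the sourcedir and templates options are left out because they do not affect the result, and no directories are created.

```python
from collections import ChainMap
from pathlib import PurePosixPath

# Values copied from the options above (sourcedir and templates omitted).
SPHINX = {'docroot': '.', 'builder': 'html',
          'doctrees': 'sphinx/doctrees', 'confdir': 'sphinx'}
BUNCHES = {
    'html': {'builddir': 'docs', 'outdir': 'docs'},
    'website': {'builddir': 'web'},
    'pdf': {'builddir': 'web', 'builder': 'latex'},
    'blog': {'builddir': 'blog_posts', 'outdir': 'blog_posts',
             'confdir': 'sphinx/blog', 'doctrees': 'blog_posts/doctrees'},
}


def resolve(task):
    """Look in the task's bunch first, then fall back to the sphinx bunch."""
    opts = ChainMap(BUNCHES[task], SPHINX)
    docroot = PurePosixPath(opts.get('docroot', 'docs'))
    builddir = docroot / opts.get('builddir', '.build')
    outdir = opts.get('outdir', '')
    if outdir:
        outdir = PurePosixPath(outdir)
    else:
        # No explicit outdir: use a directory named after the builder.
        outdir = builddir / opts.get('builder', 'html')
    return {
        'builder': opts['builder'],
        'confdir': opts['confdir'],
        'outdir': str(outdir),
    }


for task in ('html', 'website', 'pdf', 'blog'):
    print(task, resolve(task))
```

With outdir commented out in the website and pdf bunches, both fall back to a directory named for the builder under the shared "web" build directory, so the HTML and LaTeX output do not overwrite each other.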
To find the sourcedir for the html task, _get_paths() first looks in the html bunch, then the sphinx bunch.
Capturing Program Output: cog
As an editor at Python Magazine, and reviewer for several books, I've discovered that one of the most frequent sources of errors with technical writing occurs in the production process where the output of running sample code is captured to be included in the final text. This is usually done manually by running the program and copying and pasting its output from the console. It's not uncommon for a bug to be found, or a library to change, requiring a change in the source code provided with the article. That change, in turn, means the output of commands may be different. Sometimes the change is minor, but at other times the output is different in some significant way. Since I've seen the problem come up so many times, I spent time thinking about and looking for a solution to avoid it in my own work.
During my research, a few people suggested that I switch to using doctests for my examples, but I felt there were several problems with that approach. First, the doctest format isn't very friendly for users who want to copy and paste examples into their own scripts. The reader has to select each line individually, and can't simply grab the entire block of code. Distributing the examples as separate scripts makes this easier, since they can simply copy the entire file and modify it as they want. Using individual .py files also makes it possible for some of the more complicated examples to run clients and servers at the same time from different scripts (as with SimpleXMLRPCServer, for example). But most importantly, using doctests does not solve the fundamental problem. Doctests tell me when the output has changed, but I still have to manually run the scripts to generate that output and paste it into my document in the first place. What I really wanted to be able to do was run the script and insert the output, whatever it was, without manually copying and pasting text from the console.
I finally found what I was looking for in cog, from Ned Batchelder. Ned describes cog as a "code generation tool", and most of the examples he provides on his site are in that vein. But cog is a more general purpose tool than that. It gives you a way to include arbitrary Python instructions in your source document, have them executed, and then have the source document change to reflect the output.
For each code sample, I wanted to include the Python source followed by the output it produces when run on the console. There is a reST directive to include the source file, so that part is easy:
.. include:: anydbm_whichdb.py
    :literal:
    :start-after: #end_pymotw_header
The include directive tells Sphinx that the file "anydbm_whichdb.py" should be treated as a literal text block (instead of more reST) and to only include the parts following the last line of the standard header I use in all my source code. Syntax highlighting comes for free when the literal block is converted to the output format.
Grabbing the command output was a little trickier. Normally with cog, one would embed the actual source to be run in the document. In my case, I had the text in an external file. Most of the source is Python, and I could just import it, but I would have to go to special lengths to capture any output and pass it to cog.out(), the cog function for including text in the processed document. I didn't want my example code littered with calls to cog.out() instead of print, so I needed to capture sys.stdout and sys.stderr. A bigger question was whether I wanted to have all of the sample files imported into the namespace of the build process. Considering both issues, it made sense to run the script in a separate process and capture the output.
There is a bit of setup work needed to run the scripts this way, so I decided to put it all into a function instead of including the boilerplate code in every cog block. The reST source for running anydbm_whichdb.py looks like:
.. {{{cog
.. cog.out(run_script(cog.inFile, 'anydbm_whichdb.py'))
.. }}}
.. {{{end}}}
The .. at the start of each line causes the reStructuredText parser to treat the line as a comment, so it is not included in the output. After passing the reST file through cog, it is rewritten to contain:
.. {{{cog
.. cog.out(run_script(cog.inFile, 'anydbm_whichdb.py'))
.. }}}

::

	$ python anydbm_whichdb.py
	dbhash

.. {{{end}}}
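The rewrite step is easier to picture with a toy version of the idea. The sketch below is not how cog itself is implemented; it is a minimal stand-in, with a hypothetical regenerate() function, that finds each block delimited by the markers I use, runs the embedded code with a fake cog object, and swaps whatever sat before the end marker for the fresh output.

```python
import types


def regenerate(text, begin='{{{cog', end='}}}', end_output='{{{end}}}'):
    """Re-run each embedded code block and replace its recorded output."""
    result = []
    lines = iter(text.splitlines(keepends=True))
    for line in lines:
        result.append(line)
        if begin not in line:
            continue
        # Whatever precedes the begin marker (".. " here) is also
        # stripped from the code lines that follow.
        prefix = line[:line.index(begin)]
        code = []
        for code_line in lines:
            result.append(code_line)
            if end in code_line:
                break
            if code_line.startswith(prefix):
                code_line = code_line[len(prefix):]
            code.append(code_line)
        # Run the code with a stand-in "cog" object that records output.
        chunks = []
        fake_cog = types.SimpleNamespace(out=chunks.append)
        exec(''.join(code), {'cog': fake_cog})
        # Discard the stale output, up to the end-of-output marker.
        for old_line in lines:
            if end_output in old_line:
                break
        else:
            raise ValueError('missing %r marker' % end_output)
        result.extend(chunks)
        result.append(old_line)
    return ''.join(result)


source = (
    ".. {{{cog\n"
    ".. cog.out('dbhash\\n')\n"
    ".. }}}\n"
    ".. {{{end}}}\n"
)
print(regenerate(source))
```

Running the function a second time on its own output gives the same text back, which is the property that makes it safe to re-process a document on every build.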
The run_script() function runs the Python script it is given, adds a prefix to make reST treat the following lines as literal text, then indents the script output. The script is run via Paver's sh() function, which wraps the subprocess module and supports the dry-run feature of Paver. Because the cog instructions are comments, the only part that shows up in the output is the literal text block with the command output.
def run_script(input_file, script_name, interpreter='python'):
    """Run a script in the context of the input_file's directory,
    return the text output formatted to be included as an rst
    literal text block.
    """
    from paver.runtime import sh
    from paver.path import path
    rundir = path(input_file).dirname()
    output_text = sh('cd %(rundir)s; %(interpreter)s %(script_name)s 2>&1' % vars(),
                     capture=True)
    response = '\n::\n\n\t$ %(interpreter)s %(script_name)s\n\t' % vars()
    response += '\n\t'.join(output_text.splitlines())
    while not response.endswith('\n\n'):
        response += '\n'
    return response

# Stuff run_script() into the builtins so we don't have to
# import it in all of the cog blocks where we want to use it.
__builtins__['run_script'] = run_script
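For anyone who wants the same behavior without Paver, here is a rough Python 3 equivalent built directly on the subprocess module. Treat it as a sketch: format_literal_block() and the label argument are my own inventions for this example, there is no dry-run support, and the script runs under the current interpreter (sys.executable) while the text shown to the reader still says "python".

```python
import subprocess
import sys
from pathlib import Path


def format_literal_block(command, output_text):
    """Format command output as a reST literal block, as run_script() does."""
    response = '\n::\n\n\t$ %s\n\t' % command
    response += '\n\t'.join(output_text.splitlines())
    while not response.endswith('\n\n'):
        response += '\n'
    return response


def run_script(input_file, script_name, interpreter=sys.executable,
               label='python'):
    """Run script_name from the directory holding input_file."""
    rundir = Path(input_file).parent
    completed = subprocess.run(
        [interpreter, script_name],
        cwd=str(rundir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1 in the shell version
        universal_newlines=True,
    )
    return format_literal_block('%s %s' % (label, script_name),
                                completed.stdout)


print(format_literal_block('python anydbm_whichdb.py', 'dbhash\n'))
```

Passing the arguments as a list avoids the shell entirely, so the "cd" step becomes the cwd argument.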
I defined run_script() in my pavement.py file, and added it to the __builtins__ namespace to avoid having to import it each time I wanted to use it from a source document.
A somewhat more complicated example shows another powerful feature of cog. Because it can run any arbitrary Python code, it is possible to establish the pre-conditions for a script before running it. For example, anydbm_new.py assumes that its output database does not already exist. I can ensure that condition by removing it before running the script.
.. {{{cog
.. from paver.path import path
.. from paver.runtime import sh
.. workdir = path(cog.inFile).dirname()
.. sh("cd %s; rm -f /tmp/example.db" % workdir)
.. cog.out(run_script(cog.inFile, 'anydbm_new.py'))
.. }}}
.. {{{end}}}
Since cog is integrated into Paver, all I had to do to enable it was define the options and import the module. I chose to change the begin and end tags used by cog because the default patterns ([[[cog and ]]]) appeared in the output of some of the scripts (printing nested lists, for example).
    cog=Bunch(
        beginspec='{{{cog',
        endspec='}}}',
        endoutput='{{{end}}}',
    ),
To process all of the input files through cog before generating the output, I added 'cog' to the @needs list for any task running Sphinx. Then it was simply a matter of running "paver html" or "paver webhtml" to generate the output.
Paver includes an "uncog" task to remove the cog output from your source files before committing to a source code repository, but I decided to include the cogged values in committed versions so I would be alerted if the output ever changed.
Generating PDF: TeX Live
Generating HTML using Sphinx and Jinja templates is fairly straightforward; PDF output wasn't quite so easy to set up. Sphinx actually produces LaTeX, another text-based format, as output, along with a Makefile to run third-party LaTeX tools to create the PDF. I started out experimenting on a Linux system (normally I use a Mac, but this box claimed to have the required tools installed). Due to the age of the system, however, the tools weren't compatible with the LaTeX produced by Sphinx. After some searching, and asking on the sphinx-dev mailing list, I installed a copy of TeX Live, a newer TeX distro. A few tweaks to my $PATH later and I was in business building PDFs right on my Mac.
My pdf task runs Sphinx with the "latex" builder, then runs make using the generated Makefile.
@task
@needs(['cog'])
def pdf():
    """Generate the PDF book.
    """
    set_templates(options.pdf.templates)
    run_sphinx('pdf')
    latex_dir = path(options.pdf.builddir) / 'latex'
    sh('cd %s; make' % latex_dir)
    return
I still need to experiment with some of the LaTeX options, including templates for pages in different sizes, logos, and styles. For now I'm happy with the default look.
Releasing
Once I had the "build" fully automated, it was time to address the distribution process. For each version, I need to:
- upload HTML, PDF, and tar.gz files to my server
- update PyPI
- post to my blog
- post to the O'Reilly blog
The HTML and PDF files are copied to my server using rsync, invoked from Paver. I use a web browser and the admin interface for django-codehosting to upload the tar.gz file containing the source distribution manually. That will be automated, eventually. Once the tar.gz is available, PyPI can be updated via the builtin task "paver register". That just leaves the two blog posts.
For my own blog, I use MarsEdit to post and edit entries. I find the UI easy to use, and I like the ability to work on drafts of posts offline. It is much nicer than the web interface for Blogger, and has the benefit of being AppleScript-able. I have plans to automate all of the steps right up to actually posting the new blog entry, but for now I copy the generated blog entry into a new post window by hand.
O'Reilly's blogging policy does not allow desktop clients (too much of a support issue for the tech staff), so I need to use their Movable Type web UI to post. As with MarsEdit, I simply copy the output and paste it into the field in the browser window, then add tags.
Tying it All Together
A quick overview of my current process is:
1. Pick a module, research it, and write examples in reST and Python. Include the Python source and use cog directives to bring in the script output.
2. Use the command "paver html" to produce HTML output to verify the results look good and I haven't messed up any markup.
3. Commit the changes to svn. When I'm done with the module, copy the "trunk" to a release branch for packaging.
4. Use "paver sdist" to create the tar.gz file containing the Python source and HTML documentation.
5. Upload the tar.gz file to my site.
6. Run "paver installwebsite" to regenerate the hosted version of the HTML and the PDF, then copy both to my web server.
7. Run "paver register" to update PyPI with the latest release information.
8. Run "paver blog" to generate the HTML to be posted to the blogs. The task opens a new TextMate window containing the HTML so it is ready to be copied.
9. Paste the blog post contents into MarsEdit, add tags, and send it to Blogger.
10. Paste the blog post contents into the MT UI for O'Reilly, add tags, verify that it renders properly, then publish.
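The Paver-driven steps in that list could be chained with a small driver script. This is only a sketch of the idea, not something in the PyMOTW repository: the release() helper is hypothetical, it defaults to a dry run that just prints each command, and the manual steps (committing to svn, uploading the tar.gz, and pasting into the two blogs) still have to happen by hand at the right points.

```python
import shlex
import subprocess

# The paver commands named in the steps above, in the order they are run.
PAVER_STEPS = (
    'paver html',
    'paver sdist',
    'paver installwebsite',
    'paver register',
    'paver blog',
)


def release(steps=PAVER_STEPS, dry_run=True, runner=subprocess.check_call):
    """Print each command and, unless dry_run is set, execute it."""
    executed = []
    for command in steps:
        print(command)
        if not dry_run:
            # check_call raises CalledProcessError if a step fails,
            # which stops the remaining steps from running.
            runner(shlex.split(command))
        executed.append(command)
    return executed


release(dry_run=True)
```

Because the tar.gz upload falls between "paver sdist" and "paver installwebsite", in practice the steps argument would be used to run the sequence in two halves.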
Try It Yourself
All of the source for PyMOTW (including the pavement.py file with configuration options, task definitions, and Sphinx integration) is available from the PyMOTW web site. Sphinx, Paver, cog, and BeautifulSoup are all open source projects. I've only tested the PyMOTW "build" on Mac OS X, but it should work on Linux without any major alterations. If you're on Windows, let me know if you get it working.