Sunday, June 24, 2007

PyMOTW: pickle and cPickle

Module: pickle and cPickle
Purpose: Python object serialization
Python Version: pickle at least 1.4, cPickle 1.5

Description:

The pickle module implements an algorithm for turning an arbitrary Python object into a series of bytes ("serializing" the object). The byte stream can then be transmitted or stored, and later reconstructed to create a new object with the same characteristics.

The cPickle module implements the same algorithm, in C instead of Python. It is many times faster than the Python implementation, but does not allow the user to subclass Pickler or Unpickler. If subclassing is not important for your use, you probably want to use cPickle.

Warning: The documentation for pickle makes clear that it offers no security guarantees. Be careful if you use pickle for inter-process communication or data storage and do not trust data you cannot verify as secure.

Example:

This first example of pickle encodes a data structure as a string, then prints the string to the console.

try:
    import cPickle as pickle
except:
    import pickle
import pprint


We first try to import cPickle, giving an alias of "pickle". If that import fails for any reason, we fall back to the native Python implementation in the pickle module. This gives us the faster implementation, if it is available, and the portable implementation otherwise.

Next we define a data structure made up of entirely native types. Instances of any class can be pickled, as will be illustrated in a later example. I chose native data types to start to keep the example simple.

data = [ { 'a':'A', 'b':2, 'c':3.0 } ]
print 'DATA:',
pprint.pprint(data)


And now we use pickle.dumps() to create a string representation of the value of data.

data_string = pickle.dumps(data)
print 'PICKLE:', data_string


By default, the pickle will use only ASCII characters. A more efficient binary format is also available, but I will be sticking with the ASCII version for these examples.

$ python pickle_string.py
DATA:[{'a': 'A', 'b': 2, 'c': 3.0}]
PICKLE: (lp1
(dp2
S'a'
S'A'
sS'c'
F3
sS'b'
I2
sa.
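The binary format mentioned above is selected with the protocol argument to dumps(). The following is a minimal sketch written in Python 3 syntax (an assumption; the examples in this post target Python 2), comparing protocol 0 with the highest protocol the running interpreter supports. The variable names are only illustrative.

```python
import pickle

data = [{'a': 'A', 'b': 2, 'c': 3.0}]

# Protocol 0 is the original human-readable format.
text_pickle = pickle.dumps(data, protocol=0)

# HIGHEST_PROTOCOL selects the newest binary format available.
binary_pickle = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print('protocol 0 size:', len(text_pickle))
print('binary size    :', len(binary_pickle))

# Both forms load back to an equal object; loads() detects the protocol.
restored_from_text = pickle.loads(text_pickle)
restored_from_binary = pickle.loads(binary_pickle)
```

The binary protocols are usually more compact for larger structures, but they are not meant to be read by eye.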


Once the data is serialized, you can write it to a file, socket, pipe, etc. Then later you can read the file and unpickle the data to construct a new object with the same values.

data1 = [ { 'a':'A', 'b':2, 'c':3.0 } ]
print 'BEFORE:',
pprint.pprint(data1)

data1_string = pickle.dumps(data1)

data2 = pickle.loads(data1_string)
print 'AFTER:',
pprint.pprint(data2)

print 'SAME?:', (data1 is data2)
print 'EQUAL?:', (data1 == data2)


As you see, the newly constructed object is equal to, but not the same object as, the original. No surprise there.

$ python pickle_unpickle.py
BEFORE:[{'a': 'A', 'b': 2, 'c': 3.0}]
AFTER:[{'a': 'A', 'b': 2, 'c': 3.0}]
SAME?: False
EQUAL?: True


In addition to dumps() and loads(), pickle provides a couple of convenience functions for working with file-like streams. It is possible to write multiple objects to a stream, and then read them from the stream without knowing in advance how many objects are written or how big they are.

try:
    import cPickle as pickle
except:
    import pickle
import pprint
from StringIO import StringIO

class SimpleObject(object):

    def __init__(self, name):
        self.name = name
        l = list(name)
        l.reverse()
        self.name_backwards = ''.join(l)
        return

data = []
data.append(SimpleObject('pickle'))
data.append(SimpleObject('cPickle'))
data.append(SimpleObject('last'))

# Simulate a file with StringIO
out_s = StringIO()

# Write to the stream
for o in data:
    print 'WRITING: %s (%s)' % (o.name, o.name_backwards)
    pickle.dump(o, out_s)
    out_s.flush()

# Set up a read-able stream
in_s = StringIO(out_s.getvalue())

# Read the data
while True:
    try:
        o = pickle.load(in_s)
    except EOFError:
        break
    else:
        print 'READ: %s (%s)' % (o.name, o.name_backwards)


The example simulates streams using StringIO buffers, so we have to play a little trickery to establish the readable stream. A simple database format might use pickles to store objects, too, though using the shelve module might be easier to work with.

$ python pickle_stream.py
WRITING: pickle (elkcip)
WRITING: cPickle (elkciPc)
WRITING: last (tsal)
READ: pickle (elkcip)
READ: cPickle (elkciPc)
READ: last (tsal)
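For comparison, here is a rough sketch of the shelve approach mentioned above, written in Python 3 syntax (an assumption; the rest of this post uses Python 2). It stores plain dictionaries under string keys in a temporary directory; the directory layout and key names are made up for illustration.

```python
import os
import shelve
import shutil
import tempfile

# Work in a throwaway directory so the example cleans up after itself.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'objects')

# Values are pickled behind the scenes; keys must be strings.
with shelve.open(path) as db:
    db['pickle'] = {'name': 'pickle', 'name_backwards': 'elkcip'}
    db['last'] = {'name': 'last', 'name_backwards': 'tsal'}

# Reopen the shelf and read the values back by key.
with shelve.open(path) as db:
    keys = sorted(db.keys())
    record = db['last']

print(keys)
print(record)

shutil.rmtree(workdir)
```

Unlike a raw stream of pickles, the shelf lets you fetch one object by key without reading everything that was written before it.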


In addition to storing data, pickles are very handy for inter-process communication. For example, using os.fork() and os.pipe(), one can establish worker processes which read job instructions from one pipe and write the results to another pipe. The core code for managing the worker pool and sending jobs in and receiving responses can be reused, since the job and response objects don't have to be of a particular class. If you are using pipes or sockets, do not forget to flush after dumping each object, to push the data through the connection to the other end.
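A minimal sketch of that idea, in Python 3 syntax and for POSIX systems only (both assumptions on my part): the parent forks a worker, the worker pickles a response dictionary into a pipe, and the parent unpickles it. The function name run_job_in_child and the shape of the response are hypothetical.

```python
import os
import pickle

def run_job_in_child(job):
    """Fork a worker that sums the numbers in job and pickles the answer back."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: close the unused end, write one pickled response, then exit.
        os.close(read_fd)
        with os.fdopen(write_fd, 'wb') as out:
            pickle.dump({'job': job, 'result': sum(job)}, out)
            out.flush()  # push the data through the pipe
        os._exit(0)
    # Parent: close the unused end and read the response object.
    os.close(write_fd)
    with os.fdopen(read_fd, 'rb') as inp:
        response = pickle.load(inp)
    os.waitpid(pid, 0)  # reap the child
    return response

print(run_job_in_child([1, 2, 3]))
```

A real worker pool would keep the pipes open and send many jobs, but the pickling step stays the same.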

When working with your own classes, you must ensure that the class being pickled appears in the namespace of the process reading the pickle. Only the data for the instance is pickled, not the class definition. The class name is used to find the constructor to create the new object when unpickling. Take this example, which writes instances of a class to a file:

try:
    import cPickle as pickle
except:
    import pickle
import sys

class SimpleObject(object):

    def __init__(self, name):
        self.name = name
        l = list(name)
        l.reverse()
        self.name_backwards = ''.join(l)
        return

if __name__ == '__main__':
    data = []
    data.append(SimpleObject('pickle'))
    data.append(SimpleObject('cPickle'))
    data.append(SimpleObject('last'))

    try:
        filename = sys.argv[1]
    except IndexError:
        raise RuntimeError('Please specify a filename as an argument to %s' % sys.argv[0])

    out_s = open(filename, 'wb')
    try:
        # Write to the stream
        for o in data:
            print 'WRITING: %s (%s)' % (o.name, o.name_backwards)
            pickle.dump(o, out_s)
    finally:
        out_s.close()


When I run the script, it will create a file I name as an argument on the command line:

$ python pickle_dump_to_file_1.py test.dat
WRITING: pickle (elkcip)
WRITING: cPickle (elkciPc)
WRITING: last (tsal)


A simplistic attempt to load the resulting pickled objects might look like:

try:
    import cPickle as pickle
except:
    import pickle
import pprint
from StringIO import StringIO
import sys

try:
    filename = sys.argv[1]
except IndexError:
    raise RuntimeError('Please specify a filename as an argument to %s' % sys.argv[0])

in_s = open(filename, 'rb')
try:
    # Read the data
    while True:
        try:
            o = pickle.load(in_s)
        except EOFError:
            break
        else:
            print 'READ: %s (%s)' % (o.name, o.name_backwards)
finally:
    in_s.close()


This version fails because there is no SimpleObject class available:

$ python pickle_load_from_file_1.py test.dat
Traceback (most recent call last):
  File "pickle_load_from_file_1.py", line 52, in <module>
    o = pickle.load(in_s)
AttributeError: 'module' object has no attribute 'SimpleObject'


A corrected version, which imports SimpleObject from the script which dumps the data, succeeds.

Add:

from pickle_dump_to_file_1 import SimpleObject


to the end of the import list, then run the script:

$ python pickle_load_from_file_2.py test.dat
READ: pickle (elkcip)
READ: cPickle (elkciPc)
READ: last (tsal)


There are some special considerations when pickling data types with values that cannot be pickled (sockets, file handles, database connections, etc.). Classes which use values which cannot be pickled can define __getstate__() and __setstate__() to return a subset of the state of the instance to be pickled. New-style classes can also define __getnewargs__(), which should return arguments to be passed to the class memory allocator (C.__new__()). Use of these features is covered in more detail in the standard library documentation.
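As a small illustration of __getstate__() and __setstate__(), the sketch below (Python 3 syntax, an assumption) uses a hypothetical Counter class that holds a threading.Lock, which cannot be pickled. The lock is dropped from the pickled state and recreated when the object is loaded.

```python
import pickle
import threading

class Counter(object):
    """Hypothetical class holding one attribute that cannot be pickled."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Copy the instance dictionary and leave out the lock.
        state = self.__dict__.copy()
        del state['lock']
        return state

    def __setstate__(self, state):
        # Restore the saved attributes, then build a fresh lock.
        self.__dict__.update(state)
        self.lock = threading.Lock()

original = Counter('example')
original.count = 3

restored = pickle.loads(pickle.dumps(original))
print(restored.name, restored.count)
```

Without the two methods, the call to dumps() would raise an error when it reached the lock.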

References:

Python Module of the Week
Example Code
Pickle: An interesting stack language by Alexandre Vassalotti

Updated 9/5/2007 with minor formatting changes.



Tuesday, June 19, 2007

DjangoKit help?

I spent a little time last night trying to assemble an application using DjangoKit without much success.

I'm running Python 2.5 on a PowerBook with Mac OS 10.4. I downloaded and installed PyObjC from source so it would compile (I thought) against the right version of Python, then installed DjangoKit using python setup.py install. Everything seemed to be working, and I was able to build an application. But when I ran that app, it produced an error about the version of the SQLite libraries being used (2 instead of 3) and missing libraries.

I gave up on Python 2.5, re-installed PyObjC and DjangoKit for 2.4 and tried again. Same error.

Just for grins, I copied the app over to my wife's laptop (she has a MacBook Pro). The result was, of course, a new error about the platform. No universal binaries? Really?

I'm sure there are options, or something, that I'm leaving out when I build the app. This was mostly an experiment, and I was in a hurry, so I gave up easily and just installed the django code I wanted on an existing (Linux) web server and let her use that instead of messing with a desktop application.

Has anyone else had more success building portable Python apps, esp. with django, on Mac OS X?

Updated:

Of course I knew better than to post in frustration when I posted this originally. In my haste, I didn't post sample code, the error message, or much of the rest of the information I would have wanted if I was the DjangoKit author trying to help someone out. Nonetheless, Tom did some digging anyway and offered suggestions. Others did as well. Thanks! I finally found time to follow up, and am coming closer to an answer.

Here are the full details:

The application is very, very simple. The model just contains 2 classes for creating an index of a pile of Cook's Illustrated magazines we have lying around the house. There is no front-end, since the admin views already provide the functionality she wanted. I thought I would be cute and bundle it as a desktop app for Ms. PyMOTW, instead of setting the app up on my web server. I have packaged the sample code and placed it on my server.

I have included 2 separate setup.py files (setup.py and djangokit_setup.py). I couldn't package the source using the DjangoKit version of setup:

$ python djangokit_setup.py sdist --force-manifest
Loading 'initial_data' fixtures...
No fixtures found.
running sdist
warning: sdist: missing required meta-data: name, url
warning: sdist: manifest template 'MANIFEST.in' does not exist (using default file list)
error: dist/Cook's Illustrated Index.app/Contents/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5/Frameworks/Python.framework/Versions/2.5: File name too long


I only seem to get the error if my dist directory includes the application, too. Otherwise I get a minimal package with the name 'UNKNOWN'. So, the tarball was packaged with a regular distutils setup.py. That's not a big problem, since it is easy to use separate files.

When I ran python djangokit_setup.py py2app, the first time it reported this error:

*** creating application bundle: Cook's Illustrated Index ***
error: can't copy 'media': doesn't exist or not a regular file


I eventually figured out (guessed) that even though I don't have any external media, I need a media directory at the same level in the directory tree as the setup file. Creating the directory let me create the app. Running that app gives me this traceback:

Traceback (most recent call last):
  File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/__boot__.py", line 31, in <module>
    _run('app.py')
  File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/__boot__.py", line 28, in _run
    execfile(path, globals(), globals())
  File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/app.py", line 9, in <module>
    from pysqlite2 import dbapi2 as sqlite
ImportError: No module named pysqlite2


That brings the error reporting up to date, without trying any of the suggestions in the comments, yet. As I mentioned, the code itself works if I run django outside of the packaged application (from the command line, etc.). So I'm confident that my own imports are valid, etc.

Based on a hint from Tom (in the comments, he suggests that I install pysqlite2), I tried editing the app.py file created inside the application to import from sqlite3 instead of pysqlite2. Editing the file directly didn't do it. Editing the copy already in my application changed the error message to:

Traceback (most recent call last):
  File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/__boot__.py", line 31, in <module>
    _run('app.py')
  File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/__boot__.py", line 28, in _run
    execfile(path, globals(), globals())
  File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/app.py", line 10, in <module>
    from sqlite3 import dbapi2 as sqlite
ImportError: No module named sqlite3


Next I tried editing the version of app.py in /Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/djangokit. After removing the application and rebuilding, I see the same error.

So I finally broke down and installed the pysqlite2 package Tom pointed out for me. The package seems to imply that it is for Python 2.4, and I'm running 2.5, but I installed it anyway.

The application packaged with Python 2.5 gave me "No module named pysqlite2" when I ran it. Repackaged using "python2.4 djangokit_setup.py py2app", I got the app to run, but it does not seem to actually work. The console log shows this:

2007-06-23 15:07:28.403 Cook's Illustrated Index[14739] creating support folder /Users/dhellmann/Library/Application Support/DjangoKit/CooksIndex
2007-06-23 15:07:28.405 Cook's Illustrated Index[14739] installing default database
Starting web server on port 10557
Unhandled exception in thread started by
Traceback (most recent call last):
File "/Users/dhellmann/Devel/personal/CooksIndex/trunk/dist/Cook's Illustrated Index.app/Contents/Resources/app.py", line 81, in startWebServer
handler = AdminMediaHandler(WSGIHandler(), path)
TypeError: __init__() takes exactly 2 arguments (3 given)


So, I am a lot closer but not quite where I would like to be. I don't really care whether I package under 2.4 or 2.5, so long as the result runs on my wife's laptop, which doesn't have any development packages installed.






Sunday, June 17, 2007

PyMOTW: os (Part 4)

Module: os (Part 4)

Description:

This week I am wrapping up coverage of the os module (saving os.path for a future post of its own) by discussing functions useful for working with multiple processes. I covered use of pipes in part 2, so this week we will look at system(), fork(), exec*(), and related functions.

Disclaimer

Many of these functions have limited portability. For a more consistent way to work with processes in a platform independent manner, see the subprocess module instead.

Running External Commands

The simplest way to run a separate command, without interacting with it at all, is os.system(). It takes a single string which is the command line to be executed by a sub-process running a shell.

import os

# Simple command
os.system('ls -l')


$ python os_system_example.py
total 168
-rw-r--r-- 1 dhellman dhellman 0 May 27 06:58 __init__.py
-rw-r--r-- 1 dhellman dhellman 1391 Jun 10 09:36 os_access.py
-rw-r--r-- 1 dhellman dhellman 1383 May 27 09:23 os_cwd_example.py
-rw-r--r-- 1 dhellman dhellman 1535 Jun 10 09:36 os_directories.py
-rw-r--r-- 1 dhellman dhellman 1613 May 27 09:23 os_environ_example.py
-rw-r--r-- 1 dhellman dhellman 2816 Jun 3 08:34 os_popen_examples.py
-rw-r--r-- 1 dhellman dhellman 1438 May 27 09:23 os_process_id_example.py
-rw-r--r-- 1 dhellman dhellman 1887 May 27 09:23 os_process_user_example.py
-rw-r--r-- 1 dhellman dhellman 1545 Jun 10 09:36 os_stat.py
-rw-r--r-- 1 dhellman dhellman 1638 Jun 10 09:36 os_stat_chmod.py
-rw-r--r-- 1 dhellman dhellman 1452 Jun 10 09:36 os_symlinks.py
-rw-r--r-- 1 dhellman dhellman 1279 Jun 17 12:17 os_system_example.py
-rw-r--r-- 1 dhellman dhellman 1672 Jun 10 09:36 os_walk.py


Since the command is passed directly to the shell for processing, it can even include shell syntax such as globbing or environment variables:

# Command with shell expansion
os.system('ls -l $HOME')


total 40
-rwx------ 1 dhellman dhellman 1328 Dec 13 2005 %backup%~
drwx------ 11 dhellman dhellman 374 Jun 17 12:11 Desktop
drwxr-xr-x 15 dhellman dhellman 510 May 27 07:50 Devel
drwx------ 29 dhellman dhellman 986 May 31 17:01 Documents
drwxr-xr-x 45 dhellman dhellman 1530 Jun 17 12:12 DownloadedApps
drwx------ 55 dhellman dhellman 1870 May 22 14:53 Library
drwx------ 8 dhellman dhellman 272 Mar 4 2006 Movies
drwx------ 10 dhellman dhellman 340 Feb 14 10:54 Music
drwx------ 12 dhellman dhellman 408 Jun 17 01:00 Pictures
drwxr-xr-x 5 dhellman dhellman 170 Oct 1 2006 Public
drwxr-xr-x 15 dhellman dhellman 510 May 12 15:19 Sites
drwxr-xr-x 4 dhellman dhellman 136 Jan 23 2006 iPod
-rw-r--r-- 1 dhellman dhellman 105 Mar 7 11:48 pgadmin.log
drwxr-xr-x 3 dhellman dhellman 102 Apr 29 16:32 tmp


Unless you explicitly run the command in the background, the call to os.system() blocks until it is complete. Standard input, output, and error from the child process are tied to the appropriate streams owned by the caller by default, but can be redirected using shell syntax.

import os
import time

print 'Calling...'
os.system('date; (sleep 3; date) &')

print 'Sleeping...'
time.sleep(5)


This is getting into shell trickery, though, and there are better ways to accomplish the same thing.

$ python os_system_background.py
Calling...
Sun Jun 17 12:27:20 EDT 2007
Sleeping...
Sun Jun 17 12:27:23 EDT 2007


Creating Processes with os.fork()

The POSIX functions fork() and exec*() (available under Mac OS X, Linux, and other UNIX variants) are available through the os module. Entire books have been written about reliably using these functions, so check your library or bookstore for more details than I will present here.

To create a new process as a clone of the current process, use os.fork():

import os

pid = os.fork()

if pid:
    print 'Child process id:', pid
else:
    print 'I am the child'


Your output will vary based on the state of your system each time you run the example, but it should look something like:

$ python os_fork_example.py
Child process id: 5883
I am the child


After the fork, you end up with 2 processes running the same code. To tell which one you are in, check the return value. If it is 0, you are inside the child process. If it is not 0, you are in the parent process and the return value is the process id of the child process.

From the parent process, it is possible to send the child signals. This is a bit more complicated to set up, and uses the signal module, so let's walk through the code. First we can define a signal handler to be invoked when the signal is received.

import os
import signal
import time

def signal_usr1(signum, frame):
    pid = os.getpid()
    print 'Received USR1 in process %s' % pid


Then we fork, and in the parent pause a short amount of time before sending a USR1 signal using os.kill(). The short pause gives the child process time to set up the signal handler.

print 'Forking...'
child_pid = os.fork()
if child_pid:
    print 'PARENT: Pausing before sending signal...'
    time.sleep(1)
    print 'PARENT: Signaling %s' % child_pid
    os.kill(child_pid, signal.SIGUSR1)


In the child, we set up the signal handler and go to sleep for a while to give the parent time to send us the signal:

else:
    print 'CHILD: Setting up signal handler'
    signal.signal(signal.SIGUSR1, signal_usr1)
    print 'CHILD: Pausing to wait for signal'
    time.sleep(5)


In a real app, you probably wouldn't need to (or want to) call sleep, of course.

$ python os_kill_example.py
Forking...
PARENT: Pausing before sending signal...
CHILD: Setting up signal handler
CHILD: Pausing to wait for signal
PARENT: Signaling 6053
Received USR1 in process 6053


As you see, a simple way to handle separate behavior in the child process is to check the return value of fork() and branch. For more complex behavior, you may want more code separation than a simple branch. In other cases, you may have an existing program you have to wrap. For both of these situations, you can use the os.exec*() series of functions to run another program. When you "exec" a program, the code from that program replaces the code from your existing process.

import os

child_pid = os.fork()
if child_pid:
    # Parent: wait for the child to finish
    os.waitpid(child_pid, 0)
else:
    # Child: replace this process with the ls program
    os.execlp('ls', 'ls', '-l', '/tmp/')


$ python os_exec_example.py       
total 40
drwxr-xr-x 2 dhellman wheel 68 Jun 17 14:35 527
prw------- 1 root wheel 0 Jun 15 19:24 afpserver_PIPE
drwx------ 3 dhellman wheel 102 Jun 17 12:13 emacs527
drwxr-xr-x 2 dhellman wheel 68 Jun 16 05:01 hsperfdata_dhellmann
-rw------- 1 nobody wheel 12 Jun 17 13:55 objc_sharing_ppc_4294967294
-rw------- 1 dhellman wheel 144 Jun 17 14:32 objc_sharing_ppc_527
-rw------- 1 security wheel 24 Jun 17 07:09 objc_sharing_ppc_92
drwxr-xr-x 4 dhellman dhellman 136 Jun 8 03:16 var_backups


There are many variations of exec*(), depending on what form you might have the arguments in, whether you want the path and environment of the parent process to be copied to the child, etc. Have a look at the library documentation for details.

For all variations, the first argument is a path or filename and the remaining arguments control how that program runs. They are either passed as command line arguments or override the process "environment" (see os.environ and os.getenv).

Waiting for a Child

Suppose you are using multiple processes to work around the threading limitations of Python and the Global Interpreter Lock. If you start several processes to run separate tasks, you will want to wait for one or more of them to finish before starting new ones, to avoid overloading the server. There are a few different ways to do that using wait() and related functions.

If you don't care, or don't know, which child process might exit first, os.wait() will return as soon as any one of them exits:

import os
import sys
import time

for i in range(3):
    print 'PARENT: Forking %s' % i
    worker_pid = os.fork()
    if not worker_pid:
        print 'WORKER %s: Starting' % i
        time.sleep(2 + i)
        print 'WORKER %s: Finishing' % i
        sys.exit(i)

for i in range(3):
    print 'PARENT: Waiting for %s' % i
    done = os.wait()
    print 'PARENT:', done


Notice that the return value from os.wait() is a tuple containing the process id and exit status ("a 16-bit number, whose low byte is the signal number that killed the process, and whose high byte is the exit status").

$ python os_wait_example.py
PARENT: Forking 0
PARENT: Forking 1
PARENT: Forking 2
PARENT: Waiting for 0
WORKER 0: Starting
WORKER 1: Starting
WORKER 2: Starting
WORKER 0: Finishing
PARENT: (6501, 0)
PARENT: Waiting for 1
WORKER 1: Finishing
PARENT: (6502, 256)
PARENT: Waiting for 2
WORKER 2: Finishing
PARENT: (6503, 512)


If you want a specific process, use os.waitpid().

import os
import sys
import time

workers = []
for i in range(3):
    print 'PARENT: Forking %s' % i
    worker_pid = os.fork()
    if not worker_pid:
        print 'WORKER %s: Starting' % i
        time.sleep(2 + i)
        print 'WORKER %s: Finishing' % i
        sys.exit(i)
    workers.append(worker_pid)

for pid in workers:
    print 'PARENT: Waiting for %s' % pid
    done = os.waitpid(pid, 0)
    print 'PARENT:', done


$ python os_waitpid_example.py
PARENT: Forking 0
WORKER 0: Starting
PARENT: Forking 1
WORKER 1: Starting
PARENT: Forking 2
WORKER 2: Starting
PARENT: Waiting for 6547
WORKER 0: Finishing
PARENT: (6547, 0)
PARENT: Waiting for 6548
WORKER 1: Finishing
PARENT: (6548, 256)
PARENT: Waiting for 6549
WORKER 2: Finishing
PARENT: (6549, 512)


wait3() and wait4() work in a similar manner, but return more detailed information about the child process with the pid, exit status, and resource usage.
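A rough sketch of os.wait4(), in Python 3 syntax and for POSIX systems only (both assumptions): the child does a little work and exits with status 3, and the parent receives the pid, the raw status, and a resource usage structure.

```python
import os

pid = os.fork()
if pid == 0:
    # Child: do a little work, then exit with status 3.
    total = sum(range(100000))
    os._exit(3)

# Parent: wait for that specific child and collect its resource usage.
finished_pid, status, usage = os.wait4(pid, 0)
exit_code = os.WEXITSTATUS(status)

print('pid      :', finished_pid)
print('exit code:', exit_code)
print('user time:', usage.ru_utime)
```

The third value has the same fields as the result of resource.getrusage(), such as ru_utime and ru_stime.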

Spawn

As a convenience, the os.spawn*() family of functions handles the fork() and exec*() calls for you in one statement:

os.spawnlp(os.P_WAIT, 'ls', 'ls', '-l', '/tmp/')


$ python os_exec_example.py       
total 40
drwxr-xr-x 2 dhellman wheel 68 Jun 17 14:35 527
prw------- 1 root wheel 0 Jun 15 19:24 afpserver_PIPE
drwx------ 3 dhellman wheel 102 Jun 17 12:13 emacs527
drwxr-xr-x 2 dhellman wheel 68 Jun 16 05:01 hsperfdata_dhellmann
-rw------- 1 nobody wheel 12 Jun 17 13:55 objc_sharing_ppc_4294967294
-rw------- 1 dhellman wheel 144 Jun 17 14:32 objc_sharing_ppc_527
-rw------- 1 security wheel 24 Jun 17 07:09 objc_sharing_ppc_92
drwxr-xr-x 4 dhellman dhellman 136 Jun 8 03:16 var_backups


Conclusion

There are a lot of other considerations to be taken into account when working with multiple processes, such as handling signals, closing duplicated file descriptors, etc. All of these topics are covered in reference books such as Advanced Programming in the UNIX(R) Environment.

Next week, I'll pick a module that won't take 4 weeks to write about. :-) Suggestions are welcome, as usual.

References:

Python Module of the Week
Sample Code
Delve into UNIX process creation
Advanced Programming in the UNIX(R) Environment

Updated 9/5/2007 with minor formatting changes.



Friday, June 15, 2007

PyAtl Presentations

Last night's Atlanta Python Meetup included several interesting presentations. Living so far outside of Atlanta, it isn't easy to make it down for as many of the meetings as I would like, but it was definitely worth the effort last night. We had a larger than usual crowd, due to the fact that Google was sponsoring pizza and providing speakers, all apparently part of their current recruiting drive.

Cary Hull of Google talked about twisted. I hadn't ever really looked at twisted closely, so the overview and examples he provided were new material for me. The talk was very informative, and twisted looks like something worth investigating, particularly since we are looking at redoing the architecture of part of our system at work, and we will want to handle a lot of sub-processes which might block on I/O from a number of sources. We already know we want to use processes to take advantage of multi-processor systems, but twisted seems to offer some nice tools for managing those processes.

Sandwiched between Cary and the final presenter, Luis Caamano gave a presentation on DynaCenter, our use of Pyro, and the event manager we have built with it. Luis' talk was our first "public" technical presentation from Racemi, and it seemed to be well received.

After Luis, Dan Morrill of Google spoke on cross-site scripting vulnerabilities. Lots of food for thought there, especially regarding the trustworthiness of data coming from your own database. Dan is on the web toolkit team at Google. He has obviously given similar presentations before, and it was clear that he knew what he was talking about. Due to a miscommunication about network access, the live demo he had planned wasn't possible. He wasn't fazed a bit, and proceeded to work up sample code on the fly in front of the audience.

Noah also announced the formation of the PyAtl Book Club, membership in which comes with a discount code for O'Reilly books. I'm looking forward to participating, since it will be something I can do without making that 90 minute drive. :-) If you are interested, even if you aren't in the Atlanta area, join our Google group.




Sunday, June 10, 2007

PyMOTW: os (Part 3)

Module: os (Part 3)

Description:

The previous installments covered process parameters and input/output. This week I will look at some of the functions for working with files and directories.

File Descriptors

The os module includes the standard set of functions for working with low-level "file descriptors" (integers representing open files owned by the current process). This is a lower-level API than is provided by file() objects. Although I promised to cover file descriptors last time, I am going to skip over describing them here, since it is generally easier to work directly with file() objects. Refer to the library documentation for details if you do need to use file descriptors.

Filesystem Permissions

The function os.access() can be used to test the access rights a process has for a file.

import os

print 'Testing:', __file__
print 'Exists:', os.access(__file__, os.F_OK)
print 'Readable:', os.access(__file__, os.R_OK)
print 'Writable:', os.access(__file__, os.W_OK)
print 'Executable:', os.access(__file__, os.X_OK)


Your results will vary depending on how you install the example code, but it should look something like this:

$ python os_access.py
Testing: os_access.py
Exists: True
Readable: True
Writable: True
Executable: False


The library documentation for os.access() includes 2 special warnings. First, there isn't much sense in calling os.access() to test whether a file can be opened before actually calling open() on it. There is a small, but real, window between the 2 calls during which the permissions on the file could change. The other warning applies mostly to networked filesystems which extend the POSIX permission semantics. Some filesystem types may respond to the POSIX call that a process has permission to access a file, then report a failure when the attempt is made using open() for some reason not tested via the POSIX call. All in all, it is better to call open() with the required mode and catch the IOError raised if there is a problem.
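A minimal sketch of that advice, in Python 3 syntax (an assumption; there IOError is an alias of OSError). The helper name read_text is made up, and the example builds its own temporary file so that it can run anywhere.

```python
import os
import tempfile

def read_text(path):
    """Return the contents of path, or None if it cannot be opened."""
    try:
        with open(path, 'r') as f:
            return f.read()
    except IOError as err:
        # Covers missing files, permission problems, and similar failures.
        print('Could not open %s: %s' % (path, err))
        return None

# Create a real file to read, plus a name that does not exist.
fd, existing = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    f.write('example')

found = read_text(existing)
missing = read_text(existing + '.does-not-exist')

os.unlink(existing)
```

Because the check and the open are the same operation, there is no window in which the permissions can change between them.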

More detailed information about the file can be accessed using os.stat() or os.lstat() (if you want the status of something that might be a symbolic link).

import os
import sys
import time

if len(sys.argv) == 1:
    filename = __file__
else:
    filename = sys.argv[1]

stat_info = os.stat(filename)

print 'os.stat(%s):' % filename
print '\tSize:', stat_info.st_size
print '\tPermissions:', oct(stat_info.st_mode)
print '\tOwner:', stat_info.st_uid
print '\tDevice:', stat_info.st_dev
print '\tLast modified:', time.ctime(stat_info.st_mtime)


Once again, your results will vary depending on how the example code was installed. Try passing different filenames on the command line to os_stat.py.

$ python os_stat.py
os.stat(os_stat.py):
Size: 1547
Permissions: 0100644
Owner: 527
Device: 234881026
Last modified: Sun Jun 10 08:13:26 2007

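The Permissions value in that output combines the file type with the permission bits. As a sketch in Python 3 syntax (an assumption; octal literals are spelled 0o100644 there, and stat.filemode() is not available in Python 2), the stat module can take the number apart:

```python
import stat

mode = 0o100644  # the st_mode value shown in the output above

permission_bits = stat.S_IMODE(mode)         # just the permission part
is_regular_file = stat.S_ISREG(mode)         # file type test
as_text = stat.filemode(mode)                # ls-style string

user_can_write = bool(mode & stat.S_IWUSR)   # owner write bit
other_can_write = bool(mode & stat.S_IWOTH)  # world write bit

print(oct(permission_bits), is_regular_file, as_text)
print(user_can_write, other_can_write)
```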

On Unix-like systems, file permissions can be changed using os.chmod(), passing the mode as an integer. Mode values can be constructed using constants defined in the stat module. Here is an example which toggles the user's execute permission bit:

import os
import stat

# Determine what permissions are already set using stat
existing_permissions = stat.S_IMODE(os.stat(__file__).st_mode)

if not os.access(__file__, os.X_OK):
    print 'Adding execute permission'
    new_permissions = existing_permissions | stat.S_IXUSR
else:
    print 'Removing execute permission'
    # mask out the user execute bit, so it is only ever cleared here
    new_permissions = existing_permissions & ~stat.S_IXUSR

os.chmod(__file__, new_permissions)


The script assumes you have the right permissions to modify the mode of the file to begin with:

$ python os_stat_chmod.py
Adding execute permission
$ python os_stat_chmod.py
Removing execute permission
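
One caveat with xor is that it flips the bit, so it only removes the permission when the bit is already set. If you want to force the bit on or off regardless of its current state, masking is an alternative. This is a sketch of that variation rather than what the script above does, and set_user_execute() is a hypothetical helper name.

```python
import stat


def set_user_execute(mode, enabled):
    """Return mode with the owner's execute bit forced on or off (hypothetical helper)."""
    if enabled:
        return mode | stat.S_IXUSR
    # AND with the inverted mask clears the bit whether or not it was set
    return mode & ~stat.S_IXUSR
```

The result can be passed to os.chmod() the same way as new_permissions in the example.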


Directories

There are several functions for working with directories on the filesystem, including creating, listing contents, and removing them.

import os

dir_name = 'os_directories_example'

print 'Creating', dir_name
os.makedirs(dir_name)

file_name = os.path.join(dir_name, 'example.txt')
print 'Creating', file_name
f = open(file_name, 'wt')
try:
    f.write('example file')
finally:
    f.close()

print 'Listing', dir_name
print os.listdir(dir_name)

print 'Cleaning up'
os.unlink(file_name)
os.rmdir(dir_name)


$ python os_directories.py
Creating os_directories_example
Creating os_directories_example/example.txt
Listing os_directories_example
['example.txt']
Cleaning up


There are 2 sets of functions for creating and deleting directories. When creating a new directory with os.mkdir(), all of the parent directories must already exist. When removing a directory with os.rmdir(), only the leaf directory (the last part of the path) is actually removed. In contrast, os.makedirs() and os.removedirs() operate on all of the nodes in the path. os.makedirs() will create any parts of the path which do not exist, and os.removedirs() will remove all of the parent directories (assuming it can).
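
To make the difference concrete, here is a sketch (Python 3) that exercises both sets of functions inside a scratch directory from tempfile.mkdtemp(). The function name, directory names, and the keep.txt marker file are all made up for the illustration. The marker file keeps the scratch directory non-empty so that os.removedirs() stops there.

```python
import os
import shutil
import tempfile


def demo_directory_functions():
    """Compare mkdir/makedirs/removedirs in a scratch directory (illustrative only)."""
    base = tempfile.mkdtemp()
    results = {}
    try:
        # Marker file: base stays non-empty, so removedirs() cannot remove it
        open(os.path.join(base, 'keep.txt'), 'w').close()

        # mkdir() fails because the parent directory 'missing' does not exist
        try:
            os.mkdir(os.path.join(base, 'missing', 'leaf'))
            results['mkdir_failed'] = False
        except OSError:
            results['mkdir_failed'] = True

        # makedirs() creates a, a/b, and a/b/c in one call
        nested = os.path.join(base, 'a', 'b', 'c')
        os.makedirs(nested)
        results['makedirs_created'] = os.path.isdir(nested)

        # removedirs() removes c, then b, then a, and stops at the non-empty base
        os.removedirs(nested)
        results['a_removed'] = not os.path.exists(os.path.join(base, 'a'))
        results['base_kept'] = os.path.isdir(base)
    finally:
        shutil.rmtree(base)
    return results


print(demo_directory_functions())
```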

Symbolic Links

For platforms and filesystems which support them, there are several functions for working with symlinks.

import os, tempfile

link_name = tempfile.mktemp()

print 'Creating link %s->%s' % (link_name, __file__)
os.symlink(__file__, link_name)

stat_info = os.lstat(link_name)
print 'Permissions:', oct(stat_info.st_mode)

print 'Points to:', os.readlink(link_name)

# Cleanup
os.unlink(link_name)


Notice that although os includes os.tempnam() for creating temporary filenames, it is not as secure as the tempfile module and produces a RuntimeWarning message when it is used. In general it is better to use the tempfile module.

$ python os_symlinks.py
Creating link /tmp/tmpRxRiHn->os_symlinks.py
Permissions: 0120755
Points to: os_symlinks.py
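
The 0120755 value comes from os.lstat(), which describes the link itself. A short sketch of the difference between os.lstat() and os.stat() follows (Python 3). It assumes a platform where os.symlink() is available, and it creates the link inside a private directory from tempfile.mkdtemp() instead of using mktemp(). The function and file names are hypothetical.

```python
import os
import shutil
import stat
import tempfile


def inspect_link():
    """Return (lstat sees a link, stat sees a link, readlink matches the target)."""
    work_dir = tempfile.mkdtemp()
    try:
        target = os.path.join(work_dir, 'target.txt')
        with open(target, 'w') as f:
            f.write('example')
        link_name = os.path.join(work_dir, 'link')
        os.symlink(target, link_name)

        link_is_link = stat.S_ISLNK(os.lstat(link_name).st_mode)     # the link itself
        followed_is_link = stat.S_ISLNK(os.stat(link_name).st_mode)  # the file it points to
        return link_is_link, followed_is_link, os.readlink(link_name) == target
    finally:
        shutil.rmtree(work_dir)


print(inspect_link())
```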


Walking a Directory Tree

The function os.walk() traverses a directory recursively and for each directory generates a tuple containing the directory path, any immediate sub-directories of that path, and the names of any files in that directory. This example shows a simplistic recursive directory listing.

import os, sys

# If we are not given a path to list, use /tmp
if len(sys.argv) == 1:
    root = '/tmp'
else:
    root = sys.argv[1]

for dir_name, sub_dirs, files in os.walk(root):
    print '\n', dir_name
    # Make the subdirectory names stand out with /
    sub_dirs = [ '%s/' % n for n in sub_dirs ]
    # Mix the directory contents together
    contents = sub_dirs + files
    contents.sort()
    # Show the contents
    for c in contents:
        print '\t%s' % c


$ python os_walk.py

/tmp
.KerberosLogin-0--1074266944 (inited,root,local)/
.KerberosLogin-527-4839472 (inited,gui,tty,local)/
527/
cs_cache_lock_527
cs_cache_lock_92
emacs527/
fry.log
hsperfdata_dhellmann/
objc_sharing_ppc_4294967294
objc_sharing_ppc_527
objc_sharing_ppc_92
svn.arg.1835l59
var_backups/

/tmp/.KerberosLogin-527-4839472 (inited,gui,tty,local)
KLLCCache.lock

/tmp/527

/tmp/emacs527
server

/tmp/hsperfdata_dhellmann
976

/tmp/var_backups
infodir.bak
local.nidump


To be continued...

Next time I'll wrap up this discussion of the os module with coverage of functions for creating and managing processes.

References:

Python Module of the Week
Example Source
Working with Files and Directories
tempfile module

Updated 9/5/2007 with minor formatting changes.



Sunday, June 3, 2007

Dialing down email distractions

I have been experimenting with various productivity hacks lately. I feel like I'm already fairly productive, based on tracking the amount of work I accomplish week-to-week. So instead of trying to do more, I'm trying to maintain the same level of output with less effort (and hopefully time).

One of the top tips I have seen repeatedly is to reduce the amount of time spent checking email, and only check it a couple of times per day. I'm not sure I'm ready to go that hard-core yet. Our team is spread out and relies heavily on email for communication. Most of the messages are important, and there aren't so many that I have trouble keeping up. The biggest issue is that they can be very time-sensitive. Our code-review process is triggered via email messages generated when a ticket is given a specific status in trac. If one of us does not notice the change, the author of the change might be blocked for some period of time from doing dependent work, until the code review can be completed. Since we want to encourage code reviews, we don't want those blockages to last too long. So, I can't turn email off entirely. But I can dial it down fairly far.

Besides cutting down on email during the week, I also want to break myself of the habit of checking work email over the weekend. Working for a startup, it is too easy (for me) to get sucked into giving up weekend after weekend. This is draining, so I'm not as fresh during the week as I would be if I avoided the work mail. But, I don't want to give up my personal email at the same time.

So, the question is, how do I strike that balance?
The first step was to decide what my new mail schedule would be. I tend to get up fairly early in the mornings, and enjoy breakfast on the patio when the weather is nice enough (it rained last night, finally). This is relaxing time, but not especially conducive to the long stretches of deep thought needed for development or debugging. So, mornings are a good time to do a lot of email, instant messaging with the rest of the team, review code or documentation, and the other sorts of tasks that don't take hours at a stretch to complete. Afternoons are too warm to sit outside and think anyway, so I move inside to work on coding projects then. That gives me a schedule: I am willing to have more interactions (and interruptions) in the morning than the afternoon and I want to "reclaim" weekends. The next step is to tell Mail the news.

I use the same email client (OS X's Mail.app) to access all of my mail accounts. Even though I use IMAP, this lets me read and search old messages when I don't have access to the network. The brute-force way would be to manually change the preference "Check for new mail" to the appropriate schedule. I hate doing things like this manually, though.

I've done some work with AppleScript and Mail in the past. This week (I'm not sure why it took me so long to figure out this approach) I realized I could use AppleScript to control how often Mail checks for new messages, and which accounts are checked.

Now that I have a general schedule identified, I can configure some events in iCal using AppleScripts to control Mail. I began by composing a few AppleScripts in Script Editor.

To check mail frequently in the morning:


To check mail infrequently, for the afternoons:



To turn off automatic checking entirely in the evening:


To disable my work account for the weekend:


To enable my work account on Monday mornings:


With the scripts in place, I configured events in iCal to run the scripts to adjust my settings at appropriate times.

Every weekday morning, I turn up the frequency to every 5 minutes. This ensures that by the time I am up and ready to look at email, the mailbox is up to date.

Around lunch time, I turn the frequency back down to once per hour. I find I don't even notice the change, and when I come back from lunch I am ready to settle in and concentrate. I don't make it through the afternoon without checking email, but stretching out the time between checks does help.

And in the evenings, I turn email off entirely. Note that this script only runs Monday-Thursday. On Friday, I leave Mail set to check messages once per hour, since I do receive personal messages over the weekend and I want to see those.

To avoid being sucked back into work, I disable that account entirely. Of course, on Monday morning, I have a similar job scheduled to run the MailCheckWorkEnable script to re-enable the account for the week.

Disabling the account entirely seemed like a drastic step, but is very effective. When Monday comes around, I am refreshed and ready to work again. I do not miss any personal mail, and have not been tempted to "just look at this one thing" from my work messages.




PyMOTW: os (Part 2)

Module: os (Part 2)

Description:

The previous installment covered process parameters. This time we'll cover some of the input/output features provided by the module.

Pipes:

The os module provides several functions for managing the I/O of child processes using pipes. The functions all work essentially the same way, but return different file handles depending on the type of input or output desired. For the most part, these functions are made obsolete by the new-ish subprocess module (added in 2.4), but there is a good chance you will encounter them if you are maintaining existing code.

The most commonly used pipe function is popen(). It creates a new process running the command given and attaches a single stream to the input or output of that process, depending on the mode argument. While popen functions work on Windows, some of these examples assume some sort of Unix-like shell. The descriptions of the streams also assume Unix-like terminology:

stdin - The "standard input" stream for a process (file descriptor 0) is readable by the process. This is usually where terminal input goes.

stdout - The "standard output" stream for a process (file descriptor 1) is writable by the process, and is used for displaying non-error information to the user.

stderr - The "standard error" stream for a process (file descriptor 2) is writable by the process, and is used for conveying error messages.

import os

print '\npopen, read:'
pipe_stdout = os.popen('echo "to stdout"', 'r')
try:
    stdout_value = pipe_stdout.read()
finally:
    pipe_stdout.close()
print '\tstdout:', repr(stdout_value)

print '\npopen, write:'
pipe_stdin = os.popen('cat -', 'w')
try:
    pipe_stdin.write('\tstdin: to stdin\n')
finally:
    pipe_stdin.close()


popen, read:
stdout: 'to stdout\n'

popen, write:
stdin: to stdin


The caller can only read from OR write to the streams associated with the child process, which limits the usefulness. The other popen variants provide additional streams so it is possible to work with stdin, stdout, and stderr as needed.

For example, popen2() returns a write-only stream attached to stdin of the child process, and a read-only stream attached to its stdout.

print '\npopen2:'
pipe_stdin, pipe_stdout = os.popen2('cat -')
try:
    pipe_stdin.write('through stdin to stdout')
finally:
    pipe_stdin.close()
try:
    stdout_value = pipe_stdout.read()
finally:
    pipe_stdout.close()
print '\tpass through:', repr(stdout_value)


This simplistic example illustrates bi-directional communication. The value written to stdin is read by cat (because of the '-' argument), then written back to stdout. Obviously a more complicated process could pass other types of messages back and forth through the pipe, even serialized objects.

popen2:
pass through: 'through stdin to stdout'
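
For comparison, here is a rough sketch of the same round trip using the subprocess module (written for Python 3, where the pipes carry bytes). To avoid depending on a Unix shell, the child is a second Python interpreter that copies stdin to stdout as a stand-in for cat. The names ECHO_CHILD and pass_through() are my own.

```python
import subprocess
import sys

# Stand-in for 'cat -': a child Python process that copies stdin to stdout
ECHO_CHILD = [sys.executable, '-c',
              'import sys; sys.stdout.write(sys.stdin.read())']


def pass_through(text):
    """Send text to the child's stdin and return what comes back on its stdout."""
    proc = subprocess.Popen(ECHO_CHILD,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    # communicate() writes the input, closes stdin, and reads stdout until EOF
    stdout_value, _ = proc.communicate(text.encode('ascii'))
    return stdout_value.decode('ascii')


print(repr(pass_through('through stdin to stdout')))
```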


In most cases, it is desirable to have access to both stdout and stderr. The stdout stream is used for message passing and the stderr stream is used for errors, so reading from it separately reduces the complexity for parsing any error messages. The popen3() function returns 3 open streams tied to stdin, stdout, and stderr of the new process.

print '\npopen3:'
pipe_stdin, pipe_stdout, pipe_stderr = os.popen3('cat -; echo ";to stderr" 1>&2')
try:
    pipe_stdin.write('through stdin to stdout')
finally:
    pipe_stdin.close()
try:
    stdout_value = pipe_stdout.read()
finally:
    pipe_stdout.close()
print '\tpass through:', repr(stdout_value)
try:
    stderr_value = pipe_stderr.read()
finally:
    pipe_stderr.close()
print '\tstderr:', repr(stderr_value)


Notice that we have to read from and close both streams separately. There are some issues related to flow control and sequencing when dealing with I/O for multiple processes. The I/O is buffered, and if the caller expects to be able to read all of the data from a stream then the child process must close that stream to indicate the end-of-file. For more information on these issues, refer to the Flow Control Issues section of the Python library documentation.

popen3:
pass through: 'through stdin to stdout'
stderr: ';to stderr\n'
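
One way to sidestep the sequencing problem is to let subprocess manage all three pipes. Popen.communicate() sends the input, then reads stdout and stderr together until both reach end-of-file, so a full pipe buffer on one stream cannot stall the child while the caller waits on the other. The sketch below (Python 3) assumes a child Python process standing in for the shell command. CHILD and run_child() are hypothetical names.

```python
import subprocess
import sys

# Child: copy stdin to stdout, then write a message to stderr
CHILD = [sys.executable, '-c',
         'import sys; sys.stdout.write(sys.stdin.read()); '
         'sys.stderr.write(";to stderr\\n")']


def run_child(text):
    """Return (stdout text, stderr text, exit status) for one run of the child."""
    proc = subprocess.Popen(CHILD,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    stdout_value, stderr_value = proc.communicate(text.encode('ascii'))
    return stdout_value.decode('ascii'), stderr_value.decode('ascii'), proc.returncode


out, err, status = run_child('through stdin to stdout')
print(repr(out), repr(err), status)
```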


And finally, popen4() returns 2 streams, stdin and a merged stdout/stderr. This is useful when the results of the command need to be logged, but not parsed directly.

print '\npopen4:'
pipe_stdin, pipe_stdout_and_stderr = os.popen4('cat -; echo ";to stderr" 1>&2')
try:
    pipe_stdin.write('through stdin to stdout')
finally:
    pipe_stdin.close()
try:
    stdout_value = pipe_stdout_and_stderr.read()
finally:
    pipe_stdout_and_stderr.close()
print '\tcombined output:', repr(stdout_value)


popen4:
combined output: 'through stdin to stdout;to stderr\n'
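
The subprocess module expresses the merged stream by passing stderr=subprocess.STDOUT. This is only a sketch (Python 3), again with a child Python process instead of a shell command. The child flushes stdout before writing to stderr so the two pieces arrive in a predictable order. MERGED_CHILD and combined_output() are illustrative names.

```python
import subprocess
import sys

# Child: copy stdin to stdout, flush, then write to stderr
MERGED_CHILD = [sys.executable, '-c',
                'import sys; sys.stdout.write(sys.stdin.read()); sys.stdout.flush(); '
                'sys.stderr.write(";to stderr\\n")']


def combined_output(text):
    """Return everything the child wrote to stdout and stderr as one string."""
    proc = subprocess.Popen(MERGED_CHILD,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)  # send stderr down the stdout pipe
    merged, _ = proc.communicate(text.encode('ascii'))
    return merged.decode('ascii')


print(repr(combined_output('through stdin to stdout')))
```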


Besides accepting a single string command to be given to the shell for parsing, popen2(), popen3(), and popen4() also accept a sequence of strings (command, followed by arguments). In this case, the arguments are not processed by the shell.

print '\npopen2, cmd as sequence:'
pipe_stdin, pipe_stdout = os.popen2(['cat', '-'])
try:
    pipe_stdin.write('through stdin to stdout')
finally:
    pipe_stdin.close()
try:
    stdout_value = pipe_stdout.read()
finally:
    pipe_stdout.close()
print '\tpass through:', repr(stdout_value)


popen2, cmd as sequence:
pass through: 'through stdin to stdout'
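
The practical consequence of skipping the shell is that characters such as $ and ; reach the program untouched. The subprocess module treats a list of strings the same way, so the following sketch (Python 3) uses it to show an argument arriving verbatim. The helper echo_argument() and the sample string are made up for the demonstration.

```python
import subprocess
import sys


def echo_argument(arg):
    """Run a child that writes its first argument to stdout, with no shell involved."""
    child = [sys.executable, '-c',
             'import sys; sys.stdout.write(sys.argv[1])', arg]
    return subprocess.check_output(child).decode('ascii')


# The shell metacharacters are passed through as ordinary text
print(repr(echo_argument('$HOME; echo injected')))
```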


To be continued...

Next time I'll cover working with file descriptors.

References:

Python Module of the Week
Example Source
Unix Concepts for more discussion of stdin, stdout, and stderr.
File Object Creation with the os module
subprocess module
Flow Control Issues

[Updated: Jesse posted on why not to use os.popen*() when working with threads.]

Updated 9/5/2007 with minor formatting changes.
