Thursday, November 29, 2007

Help Wanted: Give us work!

Due to overwhelming response, and the fact that we underestimated the skill-level of the contestants, the list of tasks assembled by the PSF for the Google Highly Open Participation contest is running out rapidly. The contestants are completing tasks faster than the existing mentors can think of new ones. We need your help to come up with more ideas for the contest, and soon!

This is your opportunity to commission someone to write that little feature you've never quite found the time to finish, translate the UI of your favorite program into your favorite language, or even figure out that library you've been meaning to try and write some documentation for it. Tasks should fall into one of these categories:

  1. Code: Tasks related to writing or refactoring code
  2. Documentation: Tasks related to creating/editing documents
  3. Outreach: Tasks related to community management and outreach/marketing
  4. Quality Assurance: Tasks related to testing and ensuring code is of high quality
  5. Research: Tasks related to studying a problem and recommending solutions
  6. Training: Tasks related to helping others learn more
  7. Translation: Tasks related to localization
  8. User Interface: Tasks related to user experience research or user interface design and interaction


We're trying to keep the tasks interesting, rather than pawning off all of the little administrivia sorts of things no one else wants to do. Part of the goal of the contest is to attract new contributors to projects, after all. Relatively small coding tasks (fixing a bug, adding a small feature, etc.) are especially welcome.

If you have some ideas for us, or just want to participate, head over to the google-highly-open-participation-psf project site and look at the existing tasks. If something sparks an idea, send it to the mentors list for discussion.

Task Guidelines:

From our NewTaskGuidelines:

The primary requirement for a new task is that it be specific.

It helps if it's relatively small, but there's nothing wrong with challenging students with bigger tasks ;). The idea is to have tasks take students at most 2-4 days with 2-3 hrs of work a day, but obviously that will vary with student skills.

Don't feel like you need to write down every detail needed to accomplish the task, but do not be vague. Students with good google fu (hah!) should be given enough info to figure stuff out, but we should be encouraging student communication.

Provide specific completion criteria for the task, so the completeness of a submission can be judged accurately. Code and documentation contributions can be sent to individual project sites, but should also be added to the task as comments or file attachments.

Ultimately your task must be approved by someone. For the moment the process is relatively informal: simply type up the description and send it to the mentors list, where we will try to come to a consensus decision about whether the task is suitable for students. Note that tasks cannot build on other tasks.

If you don't already have a project in mind but want to contribute by writing up a new project, go check out ProjectIdeaIndex and ProjectSuggestionNotes.





Wednesday, November 28, 2007

Google Highly Open Participation Contest & the PSF

For the past few days I've been one of several people helping Titus Brown set up the Python Software Foundation's portion of the Google Highly Open Participation(TM) contest. GHOP is an extension of Google's Summer of Code project, for students not yet in college. The goal of the contest is to attract young people to open source, and teach them how to participate. Check out the FAQ for more details.

The contest is structured around "tasks" specified by several participating projects and covering areas like documentation, outreach, coding, testing, and training. The GHOP home page has a list of the projects, and all of the details for the PSF tasks are in the wiki in our Google Code project: google-highly-open-participation-psf.

How You Can Participate:

Each group has a limited number of tasks to contribute to the overall contest, and while we've posted several tasks already, each group also left space open for suggestions from the wider community (this is an open source project, after all!). If you have ideas for Python-related tasks, check out the guidelines for instructions on submitting them. We need community participation!

Speaking of which: Many of the contestants will be new to open source in general, and most probably won't have experience in the participating projects. They're going to need help from patient mentors to solve the tasks set before them. If you have time, and the inclination, join the ghop-python google group and watch for questions you can answer. Also keep an eye out for them on comp.lang.python or other forums where they may be going for assistance.

This is a really exciting project, and I'm looking forward to watching how things unfold over the next couple of months. It sounds a little cliched, but these kids are the next generation of open source developers. This is a great way to introduce them to open source and pass on the traditions of cooperation and teamwork that form the basis of the thriving community we have today.



Tuesday, November 27, 2007

Python Magazine for November

Somehow the release of our November issue snuck right past me. There's a good range of articles this month, covering decorators, working with RSS feeds, IDLE, and Gtk. My column is about the use of Python in scientific applications, and Mark's discusses operator overloading. Brian's column addresses some of the feedback we've seen from readers of the October issue.

If you're a subscriber, you should have already received an email notification that the PDF is available for download. You can log in to the site and download your copy right away, and print copies are on the way. If you're not a subscriber yet, it isn't too late!

Sunday, November 25, 2007

PyCon: The PyCon 2007 podcast

The audio recordings from PyCon 2007 are finally being posted! I know it's hard to record in those big ballrooms at conventions, so I hope the audio quality on these is ok.


requiring packages with distutils

The documentation for distutils alleges that using the requires keyword allows a package to declare a dependency. I can't for the life of me make this do anything useful. What I expect to happen is when I use easy_install to download a package with another requirement, that required package should also be downloaded.

Here's what I have:

from distutils.core import setup
import os

setup(
    name = 'BlogBackup',
    version = '1.2',

    description = 'Script to dump a blog feed to files suitable for backing up or reprocessing.',
    long_description = """
This script uses the feedparser module to access an Atom or RSS feed and
download the individual entries to a backup directory. It tracks both
etag and modified headers for each feed to reduce processing overhead.
""",

    author = 'Doug Hellmann',
    author_email = 'doug.hellmann@example.com',

    url = 'http://www.doughellmann.com/projects/BlogBackup/',
    download_url = 'http://www.doughellmann.com/downloads/BlogBackup-1.2.tar.gz',

    classifiers = [ 'Development Status :: 4 - Beta',
                    'License :: OSI Approved :: BSD License',
                    'Programming Language :: Python',
                    'Intended Audience :: End Users/Desktop',
                    'Environment :: Console',
                    'Topic :: System :: Archiving :: Backup',
                    'Topic :: Utilities',
                    ],

    platforms = ('Any',),
    keywords = ('backup', 'archive', 'atom', 'rss', 'blog', 'weblog'),

    packages = [ 'blogbackuplib',
                 ],

    package_dir = { '': '.' },

    scripts = ['blogbackup'],

    requires=['CommandLineApp (>=2.5)'],
    )


I set up a new virtual environment without any site-packages. I have verified that if I run the virtual environment interpreter, I cannot import CommandLineApp (so it is not already installed). When I run easy_install BlogBackup, it downloads and installs the correct version (1.2). Here's the output:

$ easy_install BlogBackup
Searching for BlogBackup
Reading http://pypi.python.org/simple/BlogBackup/
Reading http://www.doughellmann.com/projects/BlogBackup/
Best match: BlogBackup 1.2
Downloading http://www.doughellmann.com/downloads/BlogBackup-1.2.tar.gz
Processing BlogBackup-1.2.tar.gz
Running BlogBackup-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-p9F4P3/BlogBackup-1.2/egg-dist-tmp-VRoy9D
zip_safe flag not set; analyzing archive contents...
Adding BlogBackup 1.2 to easy-install.pth file
Installing blogbackup script to /Users/dhellmann/Devel/personal/Projects/BlogBackup/Test/bin

Installed /Users/dhellmann/Devel/personal/Projects/BlogBackup/Test/lib/python2.5/site-packages/BlogBackup-1.2-py2.5.egg
Processing dependencies for BlogBackup
Finished processing dependencies for BlogBackup


It says "Processing dependencies", but does not download the CommandLineApp package.

Have I specified the requirements value incorrectly? Or am I expecting too much?

feedcache 1.3: purging and redirects

Release 1.3 of feedcache is available. This version supports purging the cache by the age of the contents.

After some discussion with Thomas Perl about how to handle redirects, I decided to leave the existing behavior alone. That means redirected feeds are returned, but not stored in the cache. It's up to the caller to recognize that a feed was redirected and update the list of URLs being checked, depending on the actual response code.

PyMOTW: inspect

The inspect module provides a variety of functions for introspecting on live objects and their source code.

Module: inspect
Purpose: Inspect live objects
Python Version: added in 2.1, with updates in 2.3 and 2.5

Description:

The inspect module provides functions for learning about live objects, including modules, classes, instances, functions, and methods. You can use functions in this module to retrieve the original source code for a function, look at the arguments to a method on the stack, and extract the sort of information useful for producing library documentation for your source code. My own CommandLineApp module uses inspect to determine the valid options to a command line program, as well as any arguments and their names so command line programs are self-documenting and the help text is generated automatically.

Module Information:

The first kind of introspection supported lets you probe live objects to learn about them. For example, it is possible to discover the classes and functions in a module, the methods of a class, etc. Let's start with the module-level details and work our way down to the function level.

To determine how the interpreter will treat and load a file as a module, use getmoduleinfo(). Pass a filename as the only argument, and the return value is a tuple including the module base name, the suffix of the file, the mode which will be used for reading the file, and the module type as defined in the imp module. It is important to note that the function looks only at the file's name, and does not actually check if the file exists or try to read the file.

import imp
import inspect
import sys

if len(sys.argv) >= 2:
    filename = sys.argv[1]
else:
    filename = 'example.py'

try:
    (name, suffix, mode, mtype) = inspect.getmoduleinfo(filename)
except TypeError:
    print 'Could not determine module type of %s' % filename
else:
    mtype_name = { imp.PY_SOURCE:'source',
                   imp.PY_COMPILED:'compiled',
                   }.get(mtype, mtype)

    mode_description = { 'rb':'(read-binary)',
                         'U':'(universal newline)',
                         }.get(mode, '')

    print 'NAME :', name
    print 'SUFFIX :', suffix
    print 'MODE :', mode, mode_description
    print 'MTYPE :', mtype_name


Here are a few sample runs:

$ python inspect_getmoduleinfo.py example.py 
NAME : example
SUFFIX : .py
MODE : U (universal newline)
MTYPE : source

$ python inspect_getmoduleinfo.py readme.txt
Could not determine module type of readme.txt

$ python inspect_getmoduleinfo.py notthere.pyc
NAME : notthere
SUFFIX : .pyc
MODE : rb (read-binary)
MTYPE : compiled


Example Module:

The rest of the examples for this tutorial use a single example source file, found in PyMOTW/inspect/example.py, which is included below and also available as part of the source distribution associated with this series of articles.

#!/usr/bin/env python

# This comment appears first
# and spans 2 lines.

# This comment does not show up in the output of getcomments().

"""Sample file to serve as the basis for inspect examples.
"""

def module_level_function(arg1, arg2='default', *args, **kwargs):
    """This function is declared in the module."""
    local_variable = arg1
    return

class A(object):
    """The A class."""
    def __init__(self, name):
        self.name = name

    def get_name(self):
        "Returns the name of the instance."
        return self.name

instance_of_a = A('sample_instance')

class B(A):
    """This is the B class.
    It is derived from A.
    """

    # This method is not part of A.
    def do_something(self):
        """Does some work"""
        pass

    def get_name(self):
        "Overrides version from A"
        return 'B(' + self.name + ')'


Modules:

It is possible to probe live objects to determine their components using getmembers(). The arguments to getmembers() are an object to scan (a module, class, or instance) and an optional predicate function which is used to filter the objects returned. The return value is a list of tuples with 2 values: the name of the member, and the member itself (its value). The inspect module includes several such predicate functions with names like ismodule(), isclass(), etc. You can, of course, provide your own predicate function as well.

The types of members which might be returned depend on the type of object scanned. Modules can contain classes and functions; classes can contain methods and attributes; and so on.

import inspect

import example

for name, data in inspect.getmembers(example):
    if name == '__builtins__':
        continue
    print '%s :' % name, repr(data)


This sample prints the members of the example module. Modules have a set of __builtins__, which are ignored in the output for this example because they are not actually part of the module and the list is long.

$ python inspect_getmembers_module.py
A : <class 'example.A'>
B : <class 'example.B'>
__doc__ : 'Sample file to serve as the basis for inspect examples.\n'
__file__ : '/Users/dhellmann/Documents/PyMOTW/branches/inspect/example.pyc'
__name__ : 'example'
instance_of_a : <example.A object at 0xbb810>
module_level_function : <function module_level_function at 0xc8230>


The predicate argument can be used to filter the types of objects returned.

import inspect

import example

for name, data in inspect.getmembers(example, inspect.isclass):
    print '%s :' % name, repr(data)


Notice that only classes are included in the output, now:

$ python inspect_getmembers_module_class.py
A : <class 'example.A'>
B : <class 'example.B'>
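The built-in predicates are not the only option. As a minimal sketch of a custom predicate (written for Python 3, and building a throwaway module object in memory instead of importing example.py so the snippet runs on its own; the is_instance_of_a name is made up for this illustration), getmembers() can be limited to instances of one class:

```python
import inspect
import types


class A(object):
    """Stand-in for the A class from example.py."""
    def __init__(self, name):
        self.name = name


def module_level_function(arg1):
    return arg1


# Build a small module object in memory so this sketch does not
# depend on example.py being importable.
example = types.ModuleType('example')
example.A = A
example.instance_of_a = A('sample_instance')
example.module_level_function = module_level_function


def is_instance_of_a(obj):
    """Custom predicate (hypothetical name): true only for instances of A."""
    return isinstance(obj, A)


members = inspect.getmembers(example, is_instance_of_a)
for name, value in members:
    print('%s : %s' % (name, value.name))
```

Any callable that takes one object and returns a true or false value should work the same way.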


Classes:

Classes can be scanned using getmembers() in the same way as modules, though the types of members are different.

import inspect
from pprint import pprint

import example

pprint(inspect.getmembers(example.A))


Since no filtering is applied, the output shows the attributes, methods, slots, and other members of the class:

$ python inspect_getmembers_class.py
[('__class__', <type 'type'>),
('__delattr__', <slot wrapper '__delattr__' of 'object' objects>),
('__dict__', <dictproxy object at 0xca090>),
('__doc__', 'The A class.'),
('__getattribute__', <slot wrapper '__getattribute__' of 'object' objects>),
('__hash__', <slot wrapper '__hash__' of 'object' objects>),
('__init__', <unbound method A.__init__>),
('__module__', 'example'),
('__new__', <built-in method __new__ of type object at 0x32ff38>),
('__reduce__', <method '__reduce__' of 'object' objects>),
('__reduce_ex__', <method '__reduce_ex__' of 'object' objects>),
('__repr__', <slot wrapper '__repr__' of 'object' objects>),
('__setattr__', <slot wrapper '__setattr__' of 'object' objects>),
('__str__', <slot wrapper '__str__' of 'object' objects>),
('__weakref__', <attribute '__weakref__' of 'A' objects>),
('get_name', <unbound method A.get_name>)]


To find the methods of a class, use the ismethod() predicate:

import inspect
from pprint import pprint

import example

pprint(inspect.getmembers(example.A, inspect.ismethod))


$ python inspect_getmembers_class_methods.py
[('__init__', <unbound method A.__init__>),
('get_name', <unbound method A.get_name>)]


If we look at class B, we see the override for get_name() as well as the new method, and the inherited __init__() method implemented in A.

import inspect
from pprint import pprint

import example

pprint(inspect.getmembers(example.B, inspect.ismethod))


Notice that even though __init__() is inherited from A, it is identified as a method of B.

$ python inspect_getmembers_class_methods_b.py
[('__init__', <unbound method B.__init__>),
('do_something', <unbound method B.do_something>),
('get_name', <unbound method B.get_name>)]
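The output above comes from Python 2, where functions looked up on a class are wrapped as unbound methods. Python 3 dropped unbound methods, so there the same lookup returns plain functions and ismethod() no longer matches them. A rough Python 3 sketch, re-declaring trimmed copies of A and B so it stands alone, uses isfunction() instead:

```python
import inspect


class A(object):
    def __init__(self, name):
        self.name = name

    def get_name(self):
        return self.name


class B(A):
    def do_something(self):
        pass

    def get_name(self):
        return 'B(' + self.name + ')'


# On Python 3 there are no unbound methods, so ismethod() finds
# nothing on the class itself; isfunction() is the predicate to use.
method_names = [name for name, value
                in inspect.getmembers(B, inspect.ismethod)]
function_names = [name for name, value
                  in inspect.getmembers(B, inspect.isfunction)]

print(method_names)
print(function_names)
```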


Documentation Strings:

The docstring for an object can be retrieved with getdoc(). The return value is the __doc__ attribute with tabs expanded to spaces and with indentation made uniform.

import inspect
import example

print 'B.__doc__:'
print example.B.__doc__
print
print 'getdoc(B):'
print inspect.getdoc(example.B)


Notice the difference in indentation on the second line of the docstring:

$ python inspect_getdoc.py 
B.__doc__:
This is the B class.
    It is derived from A.


getdoc(B):
This is the B class.
It is derived from A.
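In current Python versions the same cleanup is exposed on its own as inspect.cleandoc(), which makes the effect easy to see on a plain string. This is only a sketch: raw_doc is typed in by hand to mimic what B.__doc__ holds when the second line is indented four spaces inside the class body.

```python
import inspect

# The raw text as it would be stored for B's docstring when the
# second line is indented four spaces inside the class body.
raw_doc = 'This is the B class.\n    It is derived from A.\n    '

# cleandoc() removes the common indentation from the second line
# onward and drops leading and trailing blank lines.
cleaned = inspect.cleandoc(raw_doc)

print(repr(raw_doc))
print(repr(cleaned))
```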


In addition to the actual docstring, it is possible to retrieve the comments from the source file where an object is implemented, if the source is available. The getcomments() function looks at the source of the object and finds comments on lines preceding the implementation.

import inspect
import example

print inspect.getcomments(example.B.do_something)


The lines returned include the comment prefix, but any whitespace prefix is stripped off.

$ python inspect_getcomments_method.py 
# This method is not part of A.


When a module is passed to getcomments(), the return value is always the first comment in the module.

import inspect
import example

print inspect.getcomments(example)


Notice that contiguous lines from the example file are included as a single comment, but as soon as a blank line appears the comment is stopped.

$ python inspect_getcomments_module.py 
# This comment appears first
# and spans 2 lines.


Retrieving Source:

If the .py file is available, the original source code for the class or method can also be retrieved using getsource() and getsourcelines().

import inspect
import example

print inspect.getsource(example.A.get_name)


The original indent level is retained in this case.

$ python inspect_getsource_method.py
    def get_name(self):
        "Returns the name of the instance."
        return self.name


When a class is passed in, all of the methods for the class are included in the output.

import inspect
import example

print inspect.getsource(example.A)


$ python inspect_getsource_class.py 
class A(object):
    """The A class."""
    def __init__(self, name):
        self.name = name

    def get_name(self):
        "Returns the name of the instance."
        return self.name


If you need the lines of source split up, it can be easier to use getsourcelines() instead of getsource(). The return value from getsourcelines() is a tuple containing a list of strings (the lines from the source file), and a starting line number in the file where the source appears.

import inspect
import pprint
import example

pprint.pprint(inspect.getsourcelines(example.A.get_name))


$ python inspect_getsourcelines_method.py
(['    def get_name(self):\n',
  '        "Returns the name of the instance."\n',
  '        return self.name\n'],
 53)


If the source (.py) file is not available, getsource() and getsourcelines() raise an IOError.
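One way to see that failure without deleting any files is to create a function from a string, so no .py file backs it. The sketch below targets Python 3, where IOError is an alias of OSError; the '<generated>' filename, the generated_function name, and the safe_getsource() helper are all invented for the illustration.

```python
import inspect

# Define a function from a string so that no .py file backs it.
# The '<generated>' filename and the names below are made up for
# this illustration.
namespace = {'__name__': 'generated_example'}
code = compile('def generated_function():\n    return 42\n',
               '<generated>', 'exec')
exec(code, namespace)
func = namespace['generated_function']


def safe_getsource(obj):
    """Return the source for obj, or None if it cannot be found."""
    try:
        return inspect.getsource(obj)
    except (IOError, TypeError):
        # IOError (an alias of OSError on Python 3): no source file.
        # TypeError: built-in objects such as len have no Python source.
        return None


print(safe_getsource(func))
print(safe_getsource(len))
```

Catching TypeError as well is a choice made for this sketch, since built-ins are rejected before any file lookup happens.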

Method and Function Arguments:

In addition to the documentation for a function or method, it is possible to ask for a complete specification of the arguments the callable takes, including default values. The getargspec() function returns a tuple containing the list of positional argument names, the name of any variable positional arguments (e.g., *args), the name of any variable named arguments (e.g., **kwds), and default values for the arguments. If there are default values, they match up with the end of the positional argument list.

import inspect
import example

arg_spec = inspect.getargspec(example.module_level_function)
print 'NAMES :', arg_spec[0]
print '* :', arg_spec[1]
print '** :', arg_spec[2]
print 'defaults:', arg_spec[3]

args_with_defaults = arg_spec[0][-len(arg_spec[3]):]
print 'args & defaults:', zip(args_with_defaults, arg_spec[3])


Note that the first argument, arg1, does not have a default value. The single default therefore is matched up with arg2.

$ python inspect_getargspec_function.py
NAMES : ['arg1', 'arg2']
* : args
** : kwargs
defaults: ('default',)
args & defaults: [('arg2', 'default')]
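getargspec() was removed in Python 3.11, so on a current interpreter the same questions are answered through inspect.signature(). The following is a sketch of one way to pull out equivalent values, with module_level_function re-declared locally so the snippet is self-contained; the variable names are just for illustration.

```python
import inspect
from inspect import Parameter


def module_level_function(arg1, arg2='default', *args, **kwargs):
    """Same signature as the function in example.py."""
    local_variable = arg1
    return


sig = inspect.signature(module_level_function)
params = list(sig.parameters.values())

# Ordinary positional parameters, like the NAMES line above.
names = [p.name for p in params
         if p.kind == Parameter.POSITIONAL_OR_KEYWORD]

# Names of the *args and **kwargs style parameters, if any.
var_positional = [p.name for p in params
                  if p.kind == Parameter.VAR_POSITIONAL]
var_keyword = [p.name for p in params
               if p.kind == Parameter.VAR_KEYWORD]

# Each parameter carries its own default, so no index arithmetic
# is needed to pair names with default values.
defaults = dict((p.name, p.default) for p in params
                if p.default is not Parameter.empty)

print(names, var_positional, var_keyword, defaults)
```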


Class Hierarchies:

inspect includes 2 functions for working directly with class hierarchies. The first, getclasstree(), creates a tree-like data structure using nested lists and tuples based on the classes it is given and their base classes. Each element in the list returned is either a tuple with a class and its base classes, or another list containing tuples for subclasses.

import inspect
import example

class C(example.B):
    pass

class D(C, example.A):
    pass

def print_class_tree(tree, indent=-1):
    if isinstance(tree, list):
        for node in tree:
            print_class_tree(node, indent+1)
    else:
        print ' ' * indent, tree[0].__name__
    return

print_class_tree(inspect.getclasstree([example.A, example.B, C, D]))


The output from this example is the "tree" of inheritance for the A, B, C, and D classes. Note that D appears twice, since it inherits from both C and A.

$ python inspect_getclasstree.py 
 object
  A
   D
   B
    C
     D


If we call getclasstree() with unique=True, the output is different.

print_class_tree(inspect.getclasstree([example.A, example.B, C, D],
unique=True,
))


This time, D only appears in the output once:

$ python inspect_getclasstree_unique.py
 object
  A
   B
    C
     D


Method Resolution Order:

The other function for working with class hierarchies is getmro(), which returns a tuple of classes in the order they should be scanned when resolving an attribute that might be inherited from a base class. Each class in the sequence appears only once.

import inspect
import example

class C(object):
    pass

class C_First(C, example.B):
    pass

class B_First(example.B, C):
    pass

print 'B_First:'
for c in inspect.getmro(B_First):
    print '\t', c.__name__
print
print 'C_First:'
for c in inspect.getmro(C_First):
    print '\t', c.__name__


This output demonstrates the "depth-first" nature of the MRO search. For B_First, A also comes before C in the search order, because B is derived from A.

$ python inspect_getmro.py 
B_First:
         B_First
         B
         A
         C
         object

C_First:
         C_First
         C
         B
         A
         object
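The search is only depth-first up to a point. For new-style classes (which is every class on Python 3), a base class shared by several parents is held back until all of the classes derived from it have been listed. A small sketch with a diamond-shaped hierarchy shows this; the Left, Right, and Diamond classes are hypothetical and not part of example.py.

```python
import inspect


class A(object):
    pass


class Left(A):
    pass


class Right(A):
    pass


class Diamond(Left, Right):
    pass


# A strictly depth-first walk would visit A right after Left.
# The actual MRO lists Right first, because Right also derives from A.
mro_names = [c.__name__ for c in inspect.getmro(Diamond)]
print(mro_names)
```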


The Stack and Frames:

In addition to introspection of code objects, the inspect module includes several functions for inspecting the runtime environment while a program is running. Most of these functions work with the call stack, and operate on "call frames". Each frame record in the stack is a 6 element tuple containing the frame object, the filename where the code exists, the line number in that file for the current line being run, the function name being called, a list of lines of context from the source file, and the index into that list of the current line. Typically such information is used to build tracebacks when exceptions are raised. It can also be useful when debugging programs, since the stack frames can be interrogated to discover the argument values passed into the functions.

The function currentframe() returns the frame at the top of the stack (for the current function). The function getargvalues() returns a tuple with argument names, the names of the variable arguments, and a dictionary with local values from the frame. By combining them, we can see the arguments to functions and local variables at different points in the call stack.

import inspect

def recurse(limit):
    local_variable = '.' * limit
    print limit, inspect.getargvalues(inspect.currentframe())
    if limit <= 0:
        return
    recurse(limit - 1)
    return

if __name__ == '__main__':
    recurse(3)


The value for local_variable is included in the frame's local variables even though it is not an argument to the function.

$ python inspect_getargvalues.py 
3 (['limit'], None, None, {'local_variable': '...', 'limit': 3})
2 (['limit'], None, None, {'local_variable': '..', 'limit': 2})
1 (['limit'], None, None, {'local_variable': '.', 'limit': 1})
0 (['limit'], None, None, {'local_variable': '', 'limit': 0})


Using stack(), it is also possible to access all of the stack frames from the current frame to the first caller. This example is similar to the one above, except it waits until reaching the end of the recursion to print the stack information.

import inspect

def recurse(limit):
    local_variable = '.' * limit
    if limit <= 0:
        for frame, filename, line_num, func, source_code, source_index in inspect.stack():
            print '%s[%d]\n -> %s' % (filename, line_num, source_code[source_index].strip())
            print inspect.getargvalues(frame)
            print
        return
    recurse(limit - 1)
    return

if __name__ == '__main__':
    recurse(3)


The last part of the output represents the main program, outside of the recurse function.

$ python inspect_stack.py
inspect_stack.py[37]
-> for frame, filename, line_num, func, source_code, source_index in inspect.stack():
(['limit'], None, None, {'local_variable': '', 'line_num': 37, 'frame': <frame object at 0x61ba30>,
'filename': 'inspect_stack.py', 'limit': 0, 'func': 'recurse', 'source_index': 0,
'source_code': [' for frame, filename, line_num, func, source_code, source_index in inspect.stack():\n']})

inspect_stack.py[42]
-> recurse(limit - 1)
(['limit'], None, None, {'local_variable': '.', 'limit': 1})

inspect_stack.py[42]
-> recurse(limit - 1)
(['limit'], None, None, {'local_variable': '..', 'limit': 2})

inspect_stack.py[42]
-> recurse(limit - 1)
(['limit'], None, None, {'local_variable': '...', 'limit': 3})

inspect_stack.py[46]
-> recurse(3)
([], None, None, {'__builtins__': <module '__builtin__' (built-in)>,
'__file__': 'inspect_stack.py',
'inspect': <module 'inspect' from '/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/inspect.pyc'>,
'recurse': <function recurse at 0xc81b0>, '__name__': '__main__',
'__doc__': 'Inspecting the call stack.\n\n'})


There are other functions for building lists of frames in different contexts, such as when an exception is being processed. See the documentation for trace(), getouterframes(), and getinnerframes() for more details.
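As a quick sketch of the exception case (Python 3 syntax; the inner(), outer(), and traced_function_names() functions are invented for the example), trace() can be called inside an except block to get the frame records between the handler and the point where the exception was raised:

```python
import inspect


def inner():
    raise ValueError('boom')


def outer():
    inner()


def traced_function_names():
    """Return the function names seen between the handler and the raise."""
    try:
        outer()
    except ValueError:
        # Passing 0 for the context argument skips loading source
        # lines; only the frame records themselves are needed here.
        # Index 3 of each record is the function name.
        return [record[3] for record in inspect.trace(0)]


names = traced_function_names()
print(names)
```

The records come back ordered from the frame handling the exception down to the frame that raised it.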

References:

Python Module of the Week Home
Download Sample Code




Friday, November 23, 2007

virtualenv

A couple of days ago, Chris posted about using virtualenv to create sandboxes on Leopard instead of installing packages directly into the Frameworks directory. I'd heard of virtualenv, but never tried it before. After reading what Chris said, I downloaded it and gave it a try, and have to say, "Wow!"

I had been worried about installing a ton of dependencies on my system so I could test code associated with articles submitted to Python Magazine. I planned on setting up a VM, so I could at least isolate that code from my own development environment, but virtualenv is going to be much easier to deal with. I can create a separate environment for each article, and verify the dependencies as necessary.

When you run virtualenv, it sets up a fresh sandbox version of Python by copying/linking files from your default installation to create new bin and lib directories. It does not copy site-packages, so you have a clean place to install any packages you want (with easy_install, or by other means). It does include your default installation in the PYTHONPATH for the new sandbox, so you can use modules already installed there as well as anything you install into the sandbox.

The new sandbox also includes a simple script to activate/deactivate the environment for your current shell. When you activate an environment, your command prompt changes (at least under Mac OS X, and I assume other Unixen) to remind you that you're using that environment, and your PATH is automatically updated to use the new bin directory. You can also run the interpreter from the environment directly, without "activation", and it knows to look for modules using the correct path.

I'll probably still set up that VM, but I know I'll worry a lot less about conflicts between different modules as I test the code from our authors.




Tuesday, November 20, 2007

Algorithm Blogs » Python module usage statistics

Imri Goldberg has put together some interesting statistics about Python module use frequency by analyzing modules downloaded from PyPI.

I'm not surprised to see Zope modules showing up so high on the list, given the number of separate Zope-related packages posted there now. He might need a switch to filter out Zope modules from the counts. :-)

Monday, November 19, 2007

OSS projects for new Python developers?

Titus Brown is looking for small projects suitable for new developers to work on. The idea is to come up with something they can do to make a real contribution without feeling overwhelmed by the scope of the project.

If you have any suggestions, head over and let him know via comments on his blog.

Sunday, November 18, 2007

PyMOTW: urlparse

The urlparse module provides an interface for splitting up Uniform Resource Locator strings into their parts.

Module: urlparse
Purpose: Split URL into component pieces.
Python Version: since 1.4

Description:

The urlparse module provides functions for breaking URLs down into their component parts, as defined by the relevant RFCs.

Parsing:

The return value from the urlparse function is an object which acts like a tuple with 6 elements.

from urlparse import urlparse
parsed = urlparse('http://netloc/path;parameters?query=argument#fragment')
print parsed


The parts of the URL available through the tuple interface are the scheme, network location, path, parameters, query, and fragment. In this example, I use "http" for the scheme since "scheme" is not a valid scheme.

$ python urlparse_urlparse.py 
('http', 'netloc', '/path', 'parameters', 'query=argument', 'fragment')


Although the return value acts like a tuple, it is really a subclass of tuple which supports accessing the parts of the URL via named attributes instead of indexes. That's especially useful if, like me, you can't remember the index order. In addition to being easier to use for the programmer, the attribute API also offers access to several values not available in the tuple API.

from urlparse import urlparse
parsed = urlparse('http://user:pass@NetLoc:80/path;parameters?query=argument#fragment')
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'params :', parsed.params
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port


The username and password are available when present and None when not. The hostname is the same value as netloc, but all lower case letters are enforced. And the port is converted to an integer when present and None when not.

$ python urlparse_urlparseattrs.py 
scheme : http
netloc : user:pass@NetLoc:80
path : /path
params : parameters
query : query=argument
fragment: fragment
username: user
password: pass
hostname: netloc (netloc in lower case)
port : 80


The urlsplit function is an alternative to urlparse. It does not split the parameters from the URL. This is useful for URLs following RFC 2396, which supports parameters for each segment of the path.

from urlparse import urlsplit
parsed = urlsplit('http://user:pass@NetLoc:80/path;parameters/path2;parameters2?query=argument#fragment')
print parsed
print 'scheme :', parsed.scheme
print 'netloc :', parsed.netloc
print 'path :', parsed.path
print 'query :', parsed.query
print 'fragment:', parsed.fragment
print 'username:', parsed.username
print 'password:', parsed.password
print 'hostname:', parsed.hostname, '(netloc in lower case)'
print 'port :', parsed.port


Since the parameters are not split out, the tuple API will show 5 elements instead of 6, and there is no params attribute.

$ python urlparse_urlsplit.py 
('http', 'user:pass@NetLoc:80', '/path;parameters/path2;parameters2', 'query=argument', 'fragment')
scheme : http
netloc : user:pass@NetLoc:80
path : /path;parameters/path2;parameters2
query : query=argument
fragment: fragment
username: user
password: pass
hostname: netloc (netloc in lower case)
port : 80
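To make the difference between the two functions concrete, here is a small side-by-side sketch. It assumes Python 3 (urllib.parse instead of urlparse) and drops the credentials from the sample URL for brevity. Note that urlparse only separates the parameters of the last path segment.

```python
from urllib.parse import urlparse, urlsplit

url = 'http://netloc/path;parameters/path2;parameters2?query=argument#fragment'

split_result = urlsplit(url)
parse_result = urlparse(url)

# urlsplit leaves every ';parameters' section inside the path.
print(len(split_result), split_result.path)

# urlparse strips the parameters from the final path segment only.
print(len(parse_result), parse_result.path, parse_result.params)
```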


To simply strip the fragment identifier from a URL, as you might need to do to find a base page name from a URL, use urldefrag.

from urlparse import urldefrag
original = 'http://netloc/path;parameters?query=argument#fragment'
print original
url, fragment = urldefrag(original)
print url
print fragment


The return value is a tuple containing the base URL and the fragment.

$ python urlparse_urldefrag.py
http://netloc/path;parameters?query=argument#fragment
http://netloc/path;parameters?query=argument
fragment
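A URL without a fragment is handled the same way, as this short sketch suggests. It assumes Python 3, where urldefrag is imported from urllib.parse and the result also exposes url and fragment attributes.

```python
from urllib.parse import urldefrag

with_fragment = urldefrag('http://netloc/path?query=argument#fragment')
without_fragment = urldefrag('http://netloc/path?query=argument')

# When there is no fragment, the second element is an empty string.
print(with_fragment.url, repr(with_fragment.fragment))
print(without_fragment.url, repr(without_fragment.fragment))
```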


Unparsing:

There are several ways to assemble a split URL back together into a single string. The parsed URL object has a geturl() method.

from urlparse import urlparse
original = 'http://netloc/path;parameters?query=argument#fragment'
print 'ORIG :', original
parsed = urlparse(original)
print 'PARSED:', parsed.geturl()


Of course, it only works on the object returned by urlparse or urlsplit.

$ python urlparse_geturl.py 
ORIG : http://netloc/path;parameters?query=argument#fragment
PARSED: http://netloc/path;parameters?query=argument#fragment


If you have a tuple of values, you can use urlunparse() to assemble it into a URL.

from urlparse import urlparse, urlunparse
original = 'http://netloc/path;parameters?query=argument#fragment'
print 'ORIG :', original
parsed = urlparse(original)
print 'PARSED:', type(parsed), parsed
t = parsed[:]
print 'TUPLE :', type(t), t
print 'NEW :', urlunparse(t)


While the ParseResult returned by urlparse can be used as a tuple, in this example I explicitly create a new tuple to show that urlunparse works with normal tuples, too.

$ python urlparse_urlunparse.py 
ORIG : http://netloc/path;parameters?query=argument#fragment
PARSED: <class 'urlparse.ParseResult'> ('http', 'netloc', '/path', 'parameters', 'query=argument', 'fragment')
TUPLE : <type 'tuple'> ('http', 'netloc', '/path', 'parameters', 'query=argument', 'fragment')
NEW : http://netloc/path;parameters?query=argument#fragment


If the input URL included superfluous parts, those may be dropped from the unparsed version of the URL.

from urlparse import urlparse, urlunparse
original = 'http://netloc/path;?#'
print 'ORIG :', original
parsed = urlparse(original)
print 'PARSED:', type(parsed), parsed
t = parsed[:]
print 'TUPLE :', type(t), t
print 'NEW :', urlunparse(t)


In this case, the parameters, query, and fragment are all empty in the original URL, so the delimiters that introduce them are dropped. The new URL does not look the same as the original, but is equivalent according to the standard.

$ python urlparse_urlunparseextra.py 
ORIG : http://netloc/path;?#
PARSED: <class 'urlparse.ParseResult'> ('http', 'netloc', '/path', '', '', '')
TUPLE : <type 'tuple'> ('http', 'netloc', '/path', '', '', '')
NEW : http://netloc/path
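A common reason to unparse is to change one part of a URL and put it back together. The sketch below is one way to do that, assuming Python 3 (urllib.parse), where the parse result is a named tuple and so offers the _replace() method. The replacement query string is made up for the example.

```python
from urllib.parse import urlparse, urlunparse

parsed = urlparse('http://netloc/path;parameters?query=argument#fragment')

# _replace() comes from the named-tuple base class and returns a modified copy.
next_page = parsed._replace(query='page=2')
print(next_page.geturl())

# urlunparse() accepts any six-item sequence, not just a parse result.
rebuilt = urlunparse(('http', 'netloc', '/path', '', '', ''))
print(rebuilt)
```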


Joining:

In addition to parsing URLs, the urlparse module includes the urljoin() function for constructing absolute URLs from relative fragments.

from urlparse import urljoin
print urljoin('http://www.example.com/path/file.html', 'anotherfile.html')
print urljoin('http://www.example.com/path/file.html', '../anotherfile.html')


Notice that the relative portion of the path ("../") is taken into account when the second URL is computed.

$ python urlparse_urljoin.py
http://www.example.com/path/anotherfile.html
http://www.example.com/anotherfile.html
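The second argument does not have to be a path relative to the current directory. This sketch, which assumes Python 3 (urllib.parse) and uses made-up file names, shows how a path starting with "/" and a complete URL are treated.

```python
from urllib.parse import urljoin

base = 'http://www.example.com/path/file.html'

# A plain file name replaces the last segment of the base path.
sibling = urljoin(base, 'anotherfile.html')

# A leading slash is resolved against the root of the same host.
rooted = urljoin(base, '/top.html')

# A complete URL ignores the base entirely.
absolute = urljoin(base, 'http://other.example.org/index.html')

print(sibling)
print(rooted)
print(absolute)
```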


References:

Python Module of the Week Home
Download Sample Code
RFC 1738 - Uniform Resource Locators
RFC 2396 - Uniform Resource Identifiers




Saturday, November 17, 2007

love/hate python stdlib modules

Titus wants to know which standard library modules we use frequently, even though we don't like how they work or find the documentation confusing.

re - I can never remember the difference between search() and match()

timeit - I haven't actually used it that often, but find the API a little weird. Why do I pass the text of the code to time, instead of a callable like a function or method?

optparse - It's not very object oriented. Why do I give a name for the action, instead of instantiating different types of option handlers for different behaviors?

logging - There are so many options. It has great potential, but I always have to look for an example to get a basic configuration setup.

bisect - Why isn't this handled as a method of list? Something like insert_sorted()?

distutils - Don't even get me started.
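As a memory aid for two of the items above, here is a short sketch written for Python 3. The sample string and list are made up. match() only looks at the start of the string while search() scans the whole thing, and bisect.insort() is the closest thing the standard library has to an insert_sorted() for lists.

```python
import bisect
import re

text = 'abc'

# match() only succeeds if the pattern matches at the start of the string.
at_start = re.match('b', text)

# search() scans forward and succeeds at the first position that matches.
anywhere = re.search('b', text)
print(at_start, anywhere.start())

# insort() inserts into an already sorted list and keeps it sorted.
scores = [10, 20, 40]
bisect.insort(scores, 30)
print(scores)
```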

What are yours?

Friday, November 16, 2007

Racemi press on ZDNet

Dan Kusnetzky from ZDNet has posted this morning about Racemi and our product, DynaCenter.

DynaCenter repurposes servers on-the-fly from the iron up, making it easy to turn your disaster recovery assets into extra computing resources. When the production data center goes offline, the DR site can be brought online in little more than the amount of time it takes to reboot the servers. Similarly, in a test/development lab setup you can use DynaCenter to test an application under several operating systems on the same hardware, without manually swapping drives or re-installing anything. Point, click, reboot.

Kusnetzky is spot-on when he points out that our software does more than we market it as doing, though. Since we control the server's power and boot process, DynaCenter can also be used to manage a utility or grid computing environment and conserve power in a regular data center by powering down idle servers until load on an application rises to a point that they are actually needed.

In his conversation with our team, one little detail was not covered: We're writing the whole thing in Python.

[Updated: Python didn't come up in their conversation, so it's no surprise it wasn't mentioned in Dan's post.]

Thursday, November 15, 2007

The B-List: Instant web sites

Over at The B-List: Instant web sites, James Bennett has a nice description of the databrowse add-on for django, complete with screenshots. I'd never heard of the tool before, but it sounds extremely useful, and since I'm planning a new django-based site soon I'll definitely give it a try.

Wednesday, November 14, 2007

a private setuptools repository?

I've used easy_install for open source tools, but I have to admit I never thought of setting up a web server as a private repository for closed source apps. Jeremy makes it look simple.

new version of LinkingToMe

Version 0.3 of LinkingToMe remembers the link history and highlights new links since the previous run in bold.

Tuesday, November 13, 2007

Needed: SQL/Database design book recommendation

I need a book to teach someone about basic database design. They don't need relational algebra or calculus, and they don't have to be an expert about highly optimized storage, indexing, or anything like that. They just need some basic normalization, column type selection, and query help for what should be a pretty simple database.

They took a college class on RDBMSes, but the class and accompanying book were both terrible. The book from my class is great, but is more complex than what they want and need. I'm aware of the Dummies and Idiot series books, but I would prefer to avoid those, if possible.

I'd rather not give them something tied to a specific tool (since they haven't selected the tool they are going to use), but as long as the tool is not Access it's ok if the book is vendor-specific. We'll probably end up using Postgresql or SQLite for the actual database, but won't be doing anything that should require special features provided by either of those databases.

Does anyone have any recommendations?

Sunday, November 11, 2007

Book Review: Programming Collective Intelligence

The latest book I've been reading as part of the Atlanta Python Users' Group Book Club is Programming Collective Intelligence by Toby Segaran.

Disclosure: My copy of the book was provided free, as part of O'Reilly Media's support for the book club.

My Impressions:

I have to admit, I was a little concerned when I picked up Programming Collective Intelligence that my rusty math skills would be a hindrance to really understanding the material. But all of the statistics or linear algebra needed (not a lot) are explained quite clearly in context (something my college professors could never seem to manage). It did take me longer than it usually does to read a book of this size because this one is crammed full of great material. It has a high information density, but is still a pleasant read. Ending each chapter with a list of exercises you can use to explore the topics presented earlier in more depth was a nice touch.

While the source code is not always as clear as the prose (mostly due to variable name choices), it is accompanied by plenty of descriptive text. Most of the chapters build the source along with your own knowledge, rather than presenting a large complete program after a lengthy description. In fact, many of the inline examples are created using the Python interpreter command line, making it easy to work along with the text and experiment with the data on your own.

I definitely recommend this book. The algorithms covered are fascinating, and I'm already considering how I can use the optimization techniques to solve a sticky problem we've been trying to address at work.

Book Summary:

The first chapter introduces collective intelligence (combining input from a large group of people to achieve insight) and machine learning (adaptive algorithms which can be trained to perform a task more accurately or make predictions).

Chapter 2 dives right in to building a recommendation engine. The first small example program finds users with similar tastes in movies. This example is used to explore different ways to calculate similarity between data points, and how to use those values to rank other users who have critiqued movies based on how similar they are to you. Critics with taste similar to your own can be used to find a recommendation for a movie you have not seen. These ideas are expanded in a larger example which recommends links from del.icio.us. This is the first of many real example programs throughout the book which use Web 2.0 APIs to pull data from public sites. Chapter 2 closes with a discussion of the pros and cons of user-based vs. item-based filtering and when each is appropriate.

In chapter 3, the similarity calculations developed in chapter 2 are used to build data clustering algorithms (hierarchical, column, and K-Means). The first example groups blogs based on the words which appear in the posts on that blog. The example works through the entire process of breaking the input into words to be counted, all the way to visualization of the clustering results. Sample code for drawing dendrograms using PIL is included. Next, an example using Zebo.com discovers clusters in the preferences people have (Zebo lets users post lists of things they want).

Chapter 4 discusses the challenges experienced when building a full-text search engine. The example code starts out a little confusing because it stubs in the whole API instead of "evolving" the class throughout the chapter. But the discussion is clear, and once the code is complete it makes sense. The discussion of PageRank has especially good examples. Chapter 4 also introduces a simple neural network implementation and shows how to train it to include the click counts for search results in their rankings. The neural network code might have been clearer if it had used a functional programming style, but that might just be a personal preference. In general, the implementation is very straightforward and it should be possible to use it for other purposes. This was my first exposure to neural networks, and they strike me as surprisingly simple for something with such an exotic sounding name.

Chapter 5 covers "stochastic optimization" techniques for selecting the best result from several options in a set. Random searching, hill climbing, simulated annealing, and genetic algorithms are covered. The discussion also includes the limitations of optimization as an approach. Once the basic techniques are explained, the sample flight scheduling application is converted to use live data from Kayak.com.

In chapter 6, various algorithms for classifying documents are covered. A naive Bayesian spam filter is used to examine the challenges of breaking documents up into classifiable "features". There is good coverage of the techniques for limiting false classifications using separate thresholds and a description of how to combine the probabilities for each feature to calculate the probability of the source document belonging in one category or another. The Fisher method, used by SpamBayes, is also discussed. Once the classifier is complete, an example program for filtering blog feeds is built with it. The code samples in chapter 6 start to suffer from abbreviated symbol names, but once you figure out the abbreviations the rest of the structure of the code makes sense.

The material in chapter 7, Modeling with Decision Trees, reminded me of an expert systems class I had in college. In class, we had to build our decision trees by hand, but chapter 7 shows how to "train" a tree from input data with known outcomes. The material covers methods for splitting the tree into sets based on Gini impurity or entropy, and then building a tree recursively by repeatedly splitting sets until no more information is gained by having separate nodes in the decision tree. Again, once an example program is built with a simple dataset, the program is enhanced by introducing a web 2.0 site which can provide similar data. In this case, real estate price information from Zillow.com is used. To illustrate how the same decision tree code can be used with completely different types of data, a hotornot predictor is built with data from hotornot.com.
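To give a feel for the two measures mentioned for chapter 7, here is a minimal sketch using their standard definitions. It is written for Python 3 and is not the book's code. The function names and the sample outcomes are my own, and the functions assume a non-empty list of outcomes.

```python
from collections import Counter
from math import log2


def gini_impurity(outcomes):
    """Chance that two items drawn at random (with replacement) differ."""
    total = len(outcomes)
    return 1.0 - sum((n / total) ** 2 for n in Counter(outcomes).values())


def entropy(outcomes):
    """Average number of bits needed to describe an outcome in the set."""
    total = len(outcomes)
    return -sum((n / total) * log2(n / total)
                for n in Counter(outcomes).values())


def information_gain(parent, left, right, score=entropy):
    """How much a split lowers the weighted score of the two halves."""
    weight = len(left) / len(parent)
    return score(parent) - weight * score(left) - (1 - weight) * score(right)


mixed = ['buy', 'buy', 'skip', 'skip']
print(gini_impurity(mixed), entropy(mixed))
print(information_gain(mixed, ['buy', 'buy'], ['skip', 'skip']))
```

A split that separates the outcomes perfectly recovers all of the parent's entropy, which is the signal the tree builder uses to pick a split and to decide when further splitting is no longer worthwhile.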

Chapter 8 leaves the realm of strict classification and introduces tools for building models that predict the price of an item from multiple variables. As with the earlier chapters, several techniques are presented and their pros and cons are covered in detail. There are plenty of graphs to illustrate the importance of selecting the right number of neighbors for the k-nearest neighbors calculation, for example. This chapter also discusses optimizing the scale of data from heterogeneous variables, and weighting different variables based on how much they affect the outcome. The real world dataset for chapter 8 comes from eBay pricing data.

Chapter 9 returns to classification, and covers tools for classifying data where the division of the data can be expressed as a function of 2 or more variables. The problems with decision tree and basic linear classification are discussed in the context of a dating site match-making application. This segues into a discussion of kernel methods for dealing with non-linear classification. The input data for the match-maker includes the distance between the places where the 2 parties live, as calculated using the Yahoo! Maps API. Although support vector machines are discussed in theory, the actual code for working with them is written using the open source LIBSVM library due to the intense computational requirements. At the end of the chapter the completed match-maker is turned loose on Facebook data to predict "friends".

While the earlier chapters have focused on placing data into categories, chapter 10 covers techniques for discovering categories within the data itself. The first example presented is a tool for finding themes among news items in RSS feeds. The feature extraction technique discussed, non-negative matrix factorization, is implemented using the NumPy libraries for matrix math. The second example uses Yahoo! Finance APIs to examine trading volume for various stocks, looking for relationships.

Chapter 11 introduces a few techniques for genetic programming, evolving applications through trials and mutations. The sample code includes classes to represent programs as trees of data which are easier to mutate than raw text source would be. The chapter explains how to measure the success of any one program tree, apply random changes through mutation and crossover, then evolve a set of programs by identifying and retaining those which are becoming more successful at reaching the desired outcome. The importance of diversity for keeping the result set from getting stuck at a local maximum is stressed. The sample programs include a formula building tool and a player for a simple game.

Chapter 12 wraps up the book with a summary of all of the algorithms and techniques presented earlier, and is intended to serve as a reference. There is less code, but all of the examples use new data, so the prose is not just a repetition of what has already been seen. There are additional diagrams to help explain the techniques. In particular, the neural network details are expanded.

Web 2.0 APIs/Sites:

All of the examples throughout the book use either simple flat files for input, or a Web 2.0 API of some sort.



Open Source Libraries:

Many of the example programs use open source libraries to process or retrieve the data. All of those libraries are listed, with instructions for retrieving and installing them, in Appendix A.






Saturday, November 10, 2007

PyMOTW: pprint

The pprint module includes a "pretty printer" for producing aesthetically pleasing representations of your data structures.

Module: pprint
Purpose: Pretty-print data structures
Python Version: 1.4

Description:

The formatter used in the pprint module prints representations of data structures in a format which can be parsed correctly by the interpreter, and which is also easy for a human to read. The output is kept on a single line, if possible, and indented correctly when split across multiple lines.

The examples here all depend on pprint_data.py, which contains:

data = [ (i, { 'a':'A',
               'b':'B',
               'c':'C',
               'd':'D',
               'e':'E',
               'f':'F',
               'g':'G',
               'h':'H',
               })
         for i in xrange(3)
         ]


Printing:

The simplest way to use the module is with the pprint() function. It formats your object and writes it to the data stream passed as argument (or sys.stdout by default).

from pprint import pprint

from pprint_data import data

print 'PRINT:'
print data
print
print 'PPRINT:'
pprint(data)


$ python pprint_pprint.py
PRINT:
[(0, {'a': 'A', 'c': 'C', 'b': 'B', 'e': 'E', 'd': 'D', 'g': 'G', 'f': 'F', 'h': 'H'}), (1, {'a': 'A', 'c': 'C', 'b': 'B', 'e': 'E', 'd': 'D', 'g': 'G', 'f': 'F', 'h': 'H'}), (2, {'a': 'A', 'c': 'C', 'b': 'B', 'e': 'E', 'd': 'D', 'g': 'G', 'f': 'F', 'h': 'H'})]

PPRINT:
[(0,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D',
   'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H'}),
 (1,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D',
   'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H'}),
 (2,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D',
   'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H'})]


Formatting:

If you need to format a data structure, but do not want to write it directly to a stream (for logging purposes, for example) you can use pformat() to build a string representation that can then be passed to another function.

import logging
from pprint import pformat
from pprint_data import data

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)-8s %(message)s',
                    )

logging.debug('Logging pformatted data')
logging.debug(pformat(data))


$ python pprint_pformat.py
2007-10-21 18:10:32,881 DEBUG    Logging pformatted data
2007-10-21 18:10:32,884 DEBUG    [(0,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D',
   'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H'}),
 (1,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D',
   'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H'}),
 (2,
  {'a': 'A',
   'b': 'B',
   'c': 'C',
   'd': 'D',
   'e': 'E',
   'f': 'F',
   'g': 'G',
   'h': 'H'})]
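Because the formatted text is meant to be parseable by the interpreter, the string returned by pformat() can be turned back into the original structure. The sketch below illustrates that round trip. It assumes Python 3, rebuilds the sample data inline with range() instead of importing pprint_data, and uses ast.literal_eval() as a safe way to parse the text.

```python
import ast
from pprint import pformat

data = [(i, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D',
             'e': 'E', 'f': 'F', 'g': 'G', 'h': 'H'})
        for i in range(3)]

# pformat() returns the text that pprint() would have written.
text = pformat(data)
print(text)

# The text is a valid Python literal, so it can be parsed back.
restored = ast.literal_eval(text)
print(restored == data)
```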


Arbitrary Classes:

The PrettyPrinter class used by pprint() can also work with your own classes, if they define a __repr__ method.

from pprint import pprint

class node(object):
    def __init__(self, name, contents=[]):
        self.name = name
        self.contents = contents[:]
    def __repr__(self):
        return 'node(' + repr(self.name) + ', ' + repr(self.contents) + ')'

trees = [ node('node-1'),
          node('node-2', [ node('node-2-1')]),
          node('node-3', [ node('node-3-1')]),
          ]
pprint(trees)


$ python pprint_arbitrary_object.py
[node('node-1', []),
 node('node-2', [node('node-2-1', [])]),
 node('node-3', [node('node-3-1', [])])]


Recursion:

Recursive data structures are represented with a reference to the original source of the data, with the form <Recursion on typename with id=number>. For example:

from pprint import pprint

local_data = [ 'a', 'b', 1, 2 ]
local_data.append(local_data)

print 'id(local_data) =>', id(local_data)
pprint(local_data)


$ python pprint_recursion.py
id(local_data) => 486936
['a', 'b', 1, 2, <Recursion on list with id=486936>]
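The module also provides two helper functions for checking a structure before formatting it. The sketch below assumes Python 3 and reuses the recursive list from the example. isrecursive() reports whether a structure refers back to itself, and isreadable() reports whether the formatted text could be parsed back into the same value.

```python
from pprint import isreadable, isrecursive, pformat

local_data = ['a', 'b', 1, 2]
local_data.append(local_data)

# The list contains itself, so it is recursive and cannot be round-tripped.
print(isrecursive(local_data))
print(isreadable(local_data))

# The placeholder embeds the id() of the object being referenced.
print(pformat(local_data))
```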


Limiting Nested Output:

For very deep data structures, you may not want the output to include all of the details. It might be impossible to format the data properly, the formatted text might be too large to manage, or you may not need all of it. In that case, the depth argument can control how far down into the nested data structure the pretty printer goes.

from pprint import pprint

from pprint_data import data

pprint(data, depth=1)


$ python pprint_depth.py
[(0, {...}), (1, {...}), (2, {...})]
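If you try this on a current Python 3 interpreter, be aware that the depth is counted from the outermost container. The sketch below assumes Python 3 and rebuilds the sample data inline. There, depth=1 elides the tuples themselves, and depth=2 produces the output shown above.

```python
from pprint import pformat

data = [(i, {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D',
             'e': 'E', 'f': 'F', 'g': 'G', 'h': 'H'})
        for i in range(3)]

# Only the outer list is shown in full; each tuple is elided.
shallow = pformat(data, depth=1)
print(shallow)

# The list and its tuples are shown; each dictionary is elided.
deeper = pformat(data, depth=2)
print(deeper)
```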


Controlling Output Width:

The default output width for the formatted text is 80 columns. To adjust that width, use the width argument to pprint().

from pprint import pprint

from pprint_data import data

for d in data:
    for c in 'defgh':
        del d[1][c]

for width in [ 80, 20, 5 ]:
    print 'WIDTH =', width
    pprint(data, width=width)
    print


Notice that when the width is too low to accommodate the formatted data structure, the lines are not truncated or wrapped if that would introduce invalid syntax.

$ python pprint_width.py 
WIDTH = 80
[(0, {'a': 'A', 'b': 'B', 'c': 'C'}),
 (1, {'a': 'A', 'b': 'B', 'c': 'C'}),
 (2, {'a': 'A', 'b': 'B', 'c': 'C'})]

WIDTH = 20
[(0,
  {'a': 'A',
   'b': 'B',
   'c': 'C'}),
 (1,
  {'a': 'A',
   'b': 'B',
   'c': 'C'}),
 (2,
  {'a': 'A',
   'b': 'B',
   'c': 'C'})]

WIDTH = 5
[(0,
  {'a': 'A',
   'b': 'B',
   'c': 'C'}),
 (1,
  {'a': 'A',
   'b': 'B',
   'c': 'C'}),
 (2,
  {'a': 'A',
   'b': 'B',
   'c': 'C'})]


References:

Python Module of the Week Home
Download Sample Code




Sunday, November 4, 2007

Programming Collective Intelligence

I'm reading Programming Collective Intelligence for the PyATL book club this month. I've only started, but am already finding it fascinating. If you've read it, come join the discussion over at our google group.

[Updated Nov 14 - My review is available here.]

PyMOTW: shutil

The shutil module includes high-level file operations such as copying, setting permissions, etc.

Module: shutil
Purpose: High-level file operations.
Python Version: 1.4

Description:

The shutil module provides several functions for copying and removing entire files.

Copying Files:

copyfile() copies the contents of the source to the destination, and raises IOError if you do not have permission to write to the destination file. Because the function opens the input file for reading, regardless of its type, special files cannot be copied as new special files with copyfile().

import os
from shutil import *

print 'BEFORE:', os.listdir(os.getcwd())
copyfile('shutil_copyfile.py', 'shutil_copyfile.py.copy')
print 'AFTER:', os.listdir(os.getcwd())


$ python shutil_copyfile.py 
BEFORE: ['__init__.py', 'shutil_copyfile.py']
AFTER: ['__init__.py', 'shutil_copyfile.py', 'shutil_copyfile.py.copy']


copyfile() is written using the lower-level function copyfileobj(). While the arguments to copyfile() are file names, the arguments to copyfileobj() are open file handles. The optional third argument is a buffer length to use for reading in chunks (by default, the data is read in blocks of 16384 bytes; pass -1 to read the entire file at one time).

import os
from StringIO import StringIO
import sys
from shutil import *

class VerboseStringIO(StringIO):
    def read(self, n=-1):
        next = StringIO.read(self, n)
        print 'read(%d) =>' % n, next
        return next

lorem_ipsum = '''Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Vestibulum aliquam mollis dolor. Donec vulputate nunc ut diam.
Ut rutrum mi vel sem. Vestibulum ante ipsum.'''


print 'Default:'
input = VerboseStringIO(lorem_ipsum)
output = StringIO()
copyfileobj(input, output)

print

print 'All at once:'
input = VerboseStringIO(lorem_ipsum)
output = StringIO()
copyfileobj(input, output, -1)

print

print 'Blocks of 20:'
input = VerboseStringIO(lorem_ipsum)
output = StringIO()
copyfileobj(input, output, 20)


The default behavior is to read using large blocks:

$ python shutil_copyfileobj.py
Default:
read(16384) => Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Vestibulum aliquam mollis dolor. Donec vulputate nunc ut diam.
Ut rutrum mi vel sem. Vestibulum ante ipsum.
read(16384) =>


Use -1 to read all of the input at one time:

All at once:
read(-1) => Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
Vestibulum aliquam mollis dolor. Donec vulputate nunc ut diam.
Ut rutrum mi vel sem. Vestibulum ante ipsum.
read(-1) =>


Or use another positive integer to set your own block size:

Blocks of 20:
read(20) => Lorem ipsum dolor si
read(20) => t amet, consectetuer
read(20) => adipiscing elit.
V
read(20) => estibulum aliquam mo
read(20) => llis dolor. Donec vu
read(20) => lputate nunc ut diam
read(20) => .
Ut rutrum mi vel
read(20) => sem. Vestibulum ante
read(20) => ipsum.
read(20) =>
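The same experiment can be sketched for Python 3 with io.BytesIO in place of StringIO. The RecordingReader class and the 50-byte payload are made up for this illustration. Instead of printing each chunk, the reader just remembers the size requested by each read() call.

```python
import io
import shutil


class RecordingReader(io.BytesIO):
    """BytesIO that remembers the size passed to each read() call."""

    def __init__(self, payload):
        super().__init__(payload)
        self.requested = []

    def read(self, size=-1):
        self.requested.append(size)
        return super().read(size)


payload = b'x' * 50
source = RecordingReader(payload)
destination = io.BytesIO()

# Copy in blocks of 20 bytes; the final read returns b'' and ends the loop.
shutil.copyfileobj(source, destination, 20)

print(source.requested)
print(destination.getvalue() == payload)
```

Three reads return data (20, 20, and 10 bytes) and a fourth returns nothing, which matches the trailing empty read in the output above.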


The copy() function works like the Unix command line tool cp. If the named destination refers to a directory instead of a file, a new file is created in the directory using the base name of the source. The permissions of the file are copied along with the contents.

import os
from shutil import *

os.mkdir('example')
print 'BEFORE:', os.listdir('example')
copy('shutil_copy.py', 'example')
print 'AFTER:', os.listdir('example')


$ python shutil_copy.py
BEFORE: []
AFTER: ['shutil_copy.py']


copy2() works like copy(), but includes the access and modification times in the meta-data copied to the new file.

import os
from shutil import *
import time

def show_file_info(filename):
    stat_info = os.stat(filename)
    print '\tMode :', stat_info.st_mode
    print '\tCreated :', time.ctime(stat_info.st_ctime)
    print '\tAccessed:', time.ctime(stat_info.st_atime)
    print '\tModified:', time.ctime(stat_info.st_mtime)

os.mkdir('example')
print 'SOURCE:'
show_file_info('shutil_copy2.py')
copy2('shutil_copy2.py', 'example')
print 'DEST:'
show_file_info('example/shutil_copy2.py')


$ python shutil_copy2.py
SOURCE:
Mode : 33188
Created : Sun Oct 21 15:16:07 2007
Accessed: Sun Oct 21 15:16:11 2007
Modified: Sun Oct 21 15:16:07 2007
DEST:
Mode : 33188
Created : Sun Oct 21 15:16:11 2007
Accessed: Sun Oct 21 15:16:11 2007
Modified: Sun Oct 21 15:16:07 2007
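The difference between copy() and copy2() is easier to see when the source file has an obviously old timestamp. The following sketch assumes Python 3 and works in a temporary directory. The file names and the timestamp are arbitrary values chosen for the demonstration.

```python
import os
import shutil
import tempfile

OLD_TIMESTAMP = 1000000000  # an arbitrary moment in 2001, for the demo only

with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, 'source.txt')
    with open(source, 'w') as f:
        f.write('example data\n')
    # Backdate the access and modification times of the source file.
    os.utime(source, (OLD_TIMESTAMP, OLD_TIMESTAMP))

    plain = shutil.copy(source, os.path.join(workdir, 'copy.txt'))
    with_times = shutil.copy2(source, os.path.join(workdir, 'copy2.txt'))

    plain_mtime = int(os.stat(plain).st_mtime)
    kept_mtime = int(os.stat(with_times).st_mtime)

# copy() leaves the new file with a current timestamp; copy2() keeps the old one.
print(plain_mtime == OLD_TIMESTAMP, kept_mtime == OLD_TIMESTAMP)
```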


Copying File Meta-data:

By default when a new file is created under Unix, it receives permissions based on the umask of the current user. To copy the permissions from one file to another, use copymode().

from commands import *
from shutil import *

print 'BEFORE:', getstatus('file_to_change.txt')
copymode('shutil_copymode.py', 'file_to_change.txt')
print 'AFTER :', getstatus('file_to_change.txt')


First, I need to create a file to be modified:

$ touch file_to_change.txt
$ chmod ugo+w file_to_change.txt


Then running the example script will change the permissions.

$ python shutil_copymode.py
BEFORE: -rw-rw-rw- 1 dhellman dhellman 0 Oct 21 14:43 file_to_change.txt
AFTER : -rw-r--r-- 1 dhellman dhellman 0 Oct 21 14:43 file_to_change.txt
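The commands module used above is specific to Python 2. As a rough Python 3 alternative, the permission bits can be read with os.stat() and stat.S_IMODE(). The sketch below assumes a POSIX system and uses temporary files with made-up names and modes.

```python
import os
import shutil
import stat
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    template = os.path.join(workdir, 'template.txt')
    target = os.path.join(workdir, 'target.txt')
    for name in (template, target):
        open(name, 'w').close()

    # Give the two files different permissions to start with.
    os.chmod(template, 0o640)
    os.chmod(target, 0o600)

    before = stat.S_IMODE(os.stat(target).st_mode)
    shutil.copymode(template, target)
    after = stat.S_IMODE(os.stat(target).st_mode)

print(oct(before), oct(after))
```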


To copy other meta-data about the file (permissions, last access time, and last modified time), use copystat().

import os
from shutil import *
import time

def show_file_info(filename):
    stat_info = os.stat(filename)
    print '\tMode :', stat_info.st_mode
    print '\tCreated :', time.ctime(stat_info.st_ctime)
    print '\tAccessed:', time.ctime(stat_info.st_atime)
    print '\tModified:', time.ctime(stat_info.st_mtime)

print 'BEFORE:'
show_file_info('file_to_change.txt')
copystat('shutil_copystat.py', 'file_to_change.txt')
print 'AFTER :'
show_file_info('file_to_change.txt')


$ python shutil_copystat.py
BEFORE:
Mode : 33206
Created : Sun Oct 21 15:01:23 2007
Accessed: Sun Oct 21 14:43:26 2007
Modified: Sun Oct 21 14:43:26 2007
AFTER :
Mode : 33188
Created : Sun Oct 21 15:01:44 2007
Accessed: Sun Oct 21 15:01:43 2007
Modified: Sun Oct 21 15:01:39 2007


Working With Directory Trees:

The shutil module includes 3 functions for working with directory trees. To copy a directory from one place to another, use copytree(). It recurses through the source directory tree, copying files to the destination. The destination directory must not exist in advance. The symlinks argument controls whether symbolic links are copied as links or as files. The default is to copy the contents to new files. If the option is true, new symlinks are created within the destination tree.

Note: The documentation for copytree() says it should be considered a sample implementation, rather than a tool. You may want to copy the implementation and make it more robust, or add features like a progress meter.

from commands import *
from shutil import *

print 'BEFORE:'
print getoutput('ls -rlast /tmp/example')
copytree('example', '/tmp/example')
print 'AFTER:'
print getoutput('ls -rlast /tmp/example')


$ python shutil_copytree.py
BEFORE:
ls: /tmp/example: No such file or directory
AFTER:
total 8
8 -rw-r--r-- 1 dhellman wheel 1627 Oct 21 15:16 shutil_copy2.py
0 drwxr-xr-x 3 dhellman wheel 102 Oct 21 15:16 .
0 drwxrwxrwt 18 root wheel 612 Oct 21 15:26 ..
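The requirement that the destination must not exist is easy to trip over. Here is a small sketch of what happens on a second copy, assuming Python 3 (where the error is reported as FileExistsError) and using a temporary directory with made-up names.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, 'example')
    os.mkdir(source)
    with open(os.path.join(source, 'data.txt'), 'w') as f:
        f.write('payload\n')

    destination = os.path.join(workdir, 'example_copy')

    # The first copy creates the destination directory itself.
    shutil.copytree(source, destination)
    copied = sorted(os.listdir(destination))

    # A second copy to the same place fails because it now exists.
    try:
        shutil.copytree(source, destination)
    except FileExistsError:
        second_copy_failed = True
    else:
        second_copy_failed = False

print(copied, second_copy_failed)
```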


To remove a directory and its contents, use rmtree(). Errors are raised as exceptions by default. Errors can be ignored if the second argument is true, and a special error handler function can be provided in the third argument.

from commands import *
from shutil import *

print 'BEFORE:'
print getoutput('ls -rlast /tmp/example')
rmtree('/tmp/example')
print 'AFTER:'
print getoutput('ls -rlast /tmp/example')


$ python shutil_rmtree.py
BEFORE:
total 8
8 -rw-r--r-- 1 dhellman wheel 1627 Oct 21 15:16 shutil_copy2.py
0 drwxr-xr-x 3 dhellman wheel 102 Oct 21 15:16 .
0 drwxrwxrwt 18 root wheel 612 Oct 21 15:26 ..
AFTER:
ls: /tmp/example: No such file or directory
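The effect of the second argument can be sketched as follows. The example assumes Python 3, passes the flag by its keyword name ignore_errors, and uses a temporary directory with made-up names so nothing real is deleted.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    missing = os.path.join(workdir, 'not-there')

    # By default, a problem such as a missing directory raises an exception.
    try:
        shutil.rmtree(missing)
    except OSError as err:
        raised = type(err).__name__
    else:
        raised = None

    # With ignore_errors set, the same call returns quietly.
    shutil.rmtree(missing, ignore_errors=True)

    # A populated tree is removed along with everything inside it.
    populated = os.path.join(workdir, 'tree')
    os.makedirs(os.path.join(populated, 'sub'))
    shutil.rmtree(populated)
    still_there = os.path.exists(populated)

print(raised, still_there)
```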


To move a file or directory from one place to another, use move(). The semantics are similar to those of the Unix command mv. If the source and destination are within the same filesystem, the source is simply renamed. Otherwise the source is copied to the destination and then the source is removed.

import os
from shutil import *

print 'BEFORE: example : ', os.listdir('example')
move('example', 'example2')
print 'AFTER : example2: ', os.listdir('example2')


$ python shutil_move.py
BEFORE: example : ['shutil_copy.py']
AFTER : example2: ['shutil_copy.py']


References:

Python Module of the Week Home
Download Sample Code

Updated 12 Sept to fix rmtree() example.




See who is linking to you

Google's Webmaster Tools site provides a reporting feature to let you see who is linking to you. Unfortunately, the report is backwards from the orientation I want to read it. It lists the remote links for each of your local pages. I want to see all of the local pages linked on a remote site grouped together. That helps me recognize trends and identify people who might be blogging about what I write here.

Luckily, in addition to the interactive report on the tools web site, you can download the data in a CSV file to be manipulated in any way you want. I put together a little script to produce an HTML file which shows what sites link to me, and the target links on my site. For example, I was a bit surprised to discover several links on an Israeli site. Here's a segment of the output of LinkingToMe:



There are a lot of links on del.icio.us to the PyMOTW articles, and while that's cool it isn't very useful in this report. I included an option to filter out sites by their hostname, and included several bookmarking sites as defaults.
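
The regrouping and hostname filtering described above might look something like the sketch below. It is not the LinkingToMe script itself. It assumes, purely for illustration, that each CSV row holds a local page URL followed by the remote page that links to it (the real export may be laid out differently). The sample data, `IGNORED_HOSTS`, and `group_by_remote_host` are all hypothetical names, and the code targets a current Python 3 interpreter.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse

# ASSUMPTION: each row is "local page, remote page"; the real export may differ.
SAMPLE_CSV = """\
http://example.com/a.html,http://remote.example.org/post1
http://example.com/b.html,http://remote.example.org/post2
http://example.com/a.html,http://del.icio.us/someone
"""

# Hostnames to leave out of the report (illustrative default).
IGNORED_HOSTS = {'del.icio.us'}

def group_by_remote_host(csv_file, ignored_hosts=IGNORED_HOSTS):
    """Map each remote hostname to the sorted local pages it links to."""
    grouped = defaultdict(set)
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        local_url, remote_url = row[0], row[1]
        host = urlparse(remote_url).hostname
        if host is None or host in ignored_hosts:
            continue
        grouped[host].add(local_url)
    return {host: sorted(pages) for host, pages in grouped.items()}

report = group_by_remote_host(io.StringIO(SAMPLE_CSV))
for host, pages in sorted(report.items()):
    print(host)
    for page in pages:
        print('   ', page)
```

Grouping into a set per host removes duplicate local pages, and the ignore list is applied before grouping so bookmarking sites never appear in the result.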

So far I'm not doing any other processing on the data (such as downloading the titles on those remote pages). Perhaps when I have a little more time I'll enhance the script.

[Corrected country of origin for whatsup.co.il.]

Friday, November 2, 2007

PyMOTW Feed temporarily broken, but fixed

For those of you who subscribe directly to the PyMOTW feed, I apologize for the temporary interruption in service. Apparently I reached a maximum feed size and FeedBurner cut me off. I didn't realize there was an issue because I had given up on the FeedBulletin notifications: they cried wolf every day, first reporting a delay in updating and then, in the same email, reporting that everything was working again. In any event, the feed at http://feeds.feedburner.com/PyMOTW is working correctly again, and should start updating soon.

Thursday, November 1, 2007

Python Magazine is here to stay

Word came in this morning, via Brian, that Python Magazine is "viable". That's great news! I've been having a good time reading the articles (and code) you have submitted, and working with Brian, Arbi, and everyone else at MTA to put it together.

So, if you've been holding off on submitting your proposal for an article, or subscribing, you can stop waiting. Head over to pythonmagazine.com and take care of both today.




Atlanta Python Meetup November Meeting

The next PyAtl meeting is on Nov. 8th. The topic this month is GUI toolkits, and the agenda calls for a bunch of lightning talks. I'm disappointed that I won't be able to make it, since there are several toolkits on the list that I've never used before.

Agenda:
PyGTK 2.0: Jeremy Jones
wxPython: Mark Adams
pyglet: Drew Smathers
Curses: Jeremy Jones
Divmod Nevow Athena: Cary Hull
Tkinter: Bernard Matthews
Python/Cocoa/Leopard: Noah Gift

We're still looking for someone to talk about PyQt, Jython, and IronPython. If you would like to present on any of those topics, please get in touch with Noah Gift.