Sunday, May 31, 2009

Data persistence tools in the Python Standard Library

This is the first in a new sub-series of Python Module of the Week posts with a feature-based introduction to modules in the standard library, organized by what your goal might be. Each article may include cross-references to several modules from different parts of the library, and show how they relate to one another. Feedback is welcome, as usual.


Data Persistence and Exchange


Python provides several modules for storing data. There are basically two aspects to persistence: converting the in-memory object back and forth into a format for saving it, and working with the storage of the converted data.




Serializing Objects


Python includes two modules capable of converting objects into a transmittable or storable format (serializing): pickle and json. It is most common to use pickle, since there is a fast C implementation and it is integrated with some of the other standard library modules that actually store the serialized data, such as shelve. Web-based applications may want to examine json, however, since it integrates better with some of the existing web service storage applications.





Storing Serialized Objects


Once the in-memory object is converted to a storable format, the next step is to decide how to store the data. A simple flat-file with serialized objects written one after the other works for data that does not need to be indexed in any way. But Python includes a collection of modules for storing key-value pairs in a simple database using one of the DBM format variants.


The simplest interface to take advantage of the DBM format is provided by shelve. Simply open the shelve file, and access it through a dictionary-like API. Objects saved to the shelve are automatically pickled and saved without any extra work on your part.


One drawback of shelve is that with the default interface you can’t guarantee which DBM format will be used. That won’t matter if your application doesn’t need to share the database files between hosts with different libraries, but if that is needed you can use one of the classes in the module to ensure a specific format is selected (Specific Shelf Types).



If you’re going to be passing a lot of data around via JSON anyway, using json and anydbm can provide another persistence mechanism. Since the DBM database keys and values must be strings, however, the objects won’t be automatically re-created when you access the value in the database.




Relational Databases


The excellent sqlite3 in-process relational database is available with most Python distributions. It stores its database in memory or in a local file, and all access is from within the same process, so there is no network lag. The compact nature of sqlite3 makes it especially well suited for embedding in desktop applications or development versions of web apps.



All access to the database is through the Python DBI 2.0 API, by default, as no object relational mapper (ORM) is included. The most popular general purpose ORM is SQLAlchemy, but others such as Django’s native ORM layer also support SQLite. SQLAlchemy is easy to install and set up, but if your objects aren’t very complicated and you are worried about overhead, you may want to use the DBI interface directly.




Data Exchange Through Standard Formats


Although not usually considered a true persistence format csv, or comma-separated-value, files can be an effective way to migrate data between applications. Most spreadsheet programs and databases support both export and import using CSV, so dumping data to a CSV file is frequently the simplest way to move data out of your application and into an analysis tool.

Monday, May 25, 2009

Japanese translation of PyMOTW

I am very pleased to announce that Tetsuya Morimoto is creating a Japanese translation of the Python Module of the Week series.

Tetsuya has used Python for 1.5 years. He has experience at a Linux Distributor using Python with yum, anaconda, and rpm-tools while building RPM packages. Now, he uses it to make useful tools for himself, and is interested in application frameworks such as Django, mercurial and wxPython. Tetsuya is a member of Python Japan User's Group and Python Code Reading.

Thanks, Tetsuya!

Sunday, May 24, 2009

PyMOTW: json


json – JavaScript Object Notation Serializer

Purpose:Encode Python objects as JSON strings, and decode JSON strings into Python objects.
Python Version:2.6

The json module provides an API similar to pickle for converting in-memory Python objects to a serialized representation known as “JavaScript Object Notation”. Unlike pickle, JSON has the benefit of having implementations in many languages (especially JavaScript), making it suitable for inter-application communication. JSON is probably most widely used for communicating between the web server and client in an AJAX application, but is not limited to that problem domain.

Encoding and Decoding Simple Data Types

The encoder understands Python’s native types by default (int, float, list, tuple, dict).

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

data_string = json.dumps(data)
print 'JSON:', data_string

Values are encoded in a manner very similar to Python’s repr() output.

$ python json_simple_types.py
DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
JSON: [{"a": "A", "c": 3.0, "b": [2, 4]}]

Encoding, then re-decoding may not give exactly the same type of object.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
data_string = json.dumps(data)
print 'ENCODED:', data_string

decoded = json.loads(data_string)
print 'DECODED:', decoded

print 'ORIGINAL:', type(data[0]['b'])
print 'DECODED :', type(decoded[0]['b'])

For example, tuples are converted to JSON lists.

$ python json_simple_types_decode.py
ENCODED: [{"a": "A", "c": 3.0, "b": [2, 4]}]
DECODED: [{u'a': u'A', u'c': 3.0, u'b': [2, 4]}]
ORIGINAL: <type 'tuple'>
DECODED : <type 'list'>

Human-consumable vs. Compact Output

Another benefit of JSON over pickle is that the results are human-readable. The dumps() function accepts several arguments to make the output even nicer. For example, sort_keys tells the encoder to output the keys of a dictionary in sorted, instead of random, order.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

unsorted = json.dumps(data)
print 'JSON:', json.dumps(data)
print 'SORT:', json.dumps(data, sort_keys=True)

first = json.dumps(data, sort_keys=True)
second = json.dumps(data, sort_keys=True)

print 'UNSORTED MATCH:', unsorted == first
print 'SORTED MATCH :', first == second

Sorting makes it easier to scan the results by eye, and also makes it possible to compare JSON output in tests.

$ python json_sort_keys.py
DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
JSON: [{"a": "A", "c": 3.0, "b": [2, 4]}]
SORT: [{"a": "A", "b": [2, 4], "c": 3.0}]
UNSORTED MATCH: False
SORTED MATCH : True

For highly-nested data structures, you will want to specify a value for indent, so the output is formatted nicely as well.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)

print 'NORMAL:', json.dumps(data, sort_keys=True)
print 'INDENT:', json.dumps(data, sort_keys=True, indent=2)

When indent is a non-negative integer, the output more closely resembles that of pprint, with leading spaces for each level of the data structure matching the indent level.

$ python json_indent.py
DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
NORMAL: [{"a": "A", "b": [2, 4], "c": 3.0}]
INDENT: [
{
"a": "A",
"b": [
2,
4
],
"c": 3.0
}
]

Verbose output like this increases the number of bytes needed to transmit the same amount of data, however, so it isn’t the sort of thing you necessarily want to use in a production environment. In fact, you may want to adjust the settings for separating data in the encoded output to make it even more compact than the default.

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]
print 'DATA:', repr(data)
print 'repr(data) :', len(repr(data))
print 'dumps(data) :', len(json.dumps(data))
print 'dumps(data, indent=2) :', len(json.dumps(data, indent=2))
print 'dumps(data, separators):', len(json.dumps(data, separators=(',',':')))

The separators argument to dumps() should be a tuple containing the strings to separate items in a list and keys from values in a dictionary. The default is (', ', ': '). By removing the whitespace, we can produce a more compact output.

$ python json_compact_encoding.py
DATA: [{'a': 'A', 'c': 3.0, 'b': (2, 4)}]
repr(data) : 35
dumps(data) : 35
dumps(data, indent=2) : 76
dumps(data, separators): 29

Encoding Dictionaries

The JSON format expects the keys to a dictionary to be strings. If you have other types as keys in your dictionary, trying to encode the object will produce a TypeError. One way to work around that limitation is to skip over non-string keys using the skipkeys argument:

import json

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0, ('d',):'D tuple' } ]

print 'First attempt'
try:
print json.dumps(data)
except TypeError, err:
print 'ERROR:', err

print
print 'Second attempt'
print json.dumps(data, skipkeys=True)

Rather than raising an exception, the non-string key is simply ignored.

$ python json_skipkeys.py
First attempt
ERROR: key ('d',) is not a string

Second attempt
[{"a": "A", "c": 3.0, "b": [2, 4]}]

Working with Your Own Types

All of the examples so far have used Pythons built-in types because those are supported by json natively. It isn’t uncommon, of course, to have your own types that you want to be able to encode as well. There are two ways to do that.

First, we’ll need a class to encode:

class MyObj(object):
def __init__(self, s):
self.s = s
def __repr__(self):
return '<MyObj(%s)>' % self.s

The simple way of encoding a MyObj instance is to define a function to convert an unknown type to a known type. You don’t have to do the encoding yourself, just convert one object to another.

import json
import json_myobj

obj = json_myobj.MyObj('instance value goes here')

print 'First attempt'
try:
print json.dumps(obj)
except TypeError, err:
print 'ERROR:', err

def convert_to_builtin_type(obj):
print 'default(', repr(obj), ')'
# Convert objects to a dictionary of their representation
d = { '__class__':obj.__class__.__name__,
'__module__':obj.__module__,
}
d.update(obj.__dict__)
return d

print
print 'With default'
print json.dumps(obj, default=convert_to_builtin_type)

In convert_to_builtin_type(), instances of classes not recognized by json are converted to dictionaries with enough information to re-create the object if a program has access to the Python modules necessary.

$ python json_dump_default.py
First attempt
ERROR: <MyObj(instance value goes here)> is not JSON serializable

With default
default( <MyObj(instance value goes here)> )
{"s": "instance value goes here", "__module__": "json_myobj", "__class__": "MyObj"}

To decode the results and create a MyObj instance, we need to tie in to the decoder so we can import the class from the module and create the instance. For that, we use the object_hook argument to loads().

The object_hook is called for each dictionary decoded from the incoming data stream, giving us a chance to convert the dictionary to another type of object. The hook function should return the object it wants the calling application to receive instead of the dictionary.

import json

def dict_to_object(d):
if '__class__' in d:
class_name = d.pop('__class__')
module_name = d.pop('__module__')
module = __import__(module_name)
print 'MODULE:', module
class_ = getattr(module, class_name)
print 'CLASS:', class_
args = dict( (key.encode('ascii'), value) for key, value in d.items())
print 'INSTANCE ARGS:', args
inst = class_(**args)
else:
inst = d
return inst

encoded_object = '[{"s": "instance value goes here", "__module__": "json_myobj", "__class__": "MyObj"}]'

myobj_instance = json.loads(encoded_object, object_hook=dict_to_object)
print myobj_instance

Since json converts string values to unicode objects, we need to re-encode them as ASCII strings before using them as keyword arguments to the class constructor.

$ python json_load_object_hook.py
MODULE: <module 'json_myobj' from '/Users/dhellmann/Documents/PyMOTW/src/PyMOTW/json/json_myobj.pyc'>
CLASS: <class 'json_myobj.MyObj'>
INSTANCE ARGS: {'s': u'instance value goes here'}
[<MyObj(instance value goes here)>]

Similar hooks are available for the built-in types integers (parse_int), floating point numbers (parse_float), and constants (parse_constant).


Encoder and Decoder Classes

Besides the convenience functions we have already examined, the json module provides classes for encoding and decoding. By using the classes directly, you have access to extra APIs and can create subclasses to customize their behavior.

The JSONEncoder provides an iterable interface for producing “chunks” of encoded data, making it easier for you to write to files or network sockets without having to represent an entire data structure in memory.

import json

encoder = json.JSONEncoder()
data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]

for part in encoder.iterencode(data):
print 'PART:', part

As you can see, the output is generated in logical units, rather than being based on any size value.

$ python json_encoder_iterable.py
PART: [
PART: {
PART: "a"
PART: :
PART: "A"
PART: ,
PART: "c"
PART: :
PART: 3.0
PART: ,
PART: "b"
PART: :
PART: [
PART: 2
PART: ,
PART: 4
PART: ]
PART: }
PART: ]

The encode() method is basically equivalent to ''.join(encoder.iterencode()), with some extra error checking up front.

To encode arbitrary objects, we can override the default() method with an implementation similar to what we used above in convert_to_builtin_type().

import json
import json_myobj

class MyEncoder(json.JSONEncoder):

def default(self, obj):
print 'default(', repr(obj), ')'
# Convert objects to a dictionary of their representation
d = { '__class__':obj.__class__.__name__,
'__module__':obj.__module__,
}
d.update(obj.__dict__)
return d

obj = json_myobj.MyObj('internal data')
print obj
print MyEncoder().encode(obj)

The output is the same as the previous implementation.

$ python json_encoder_default.py
<MyObj(internal data)>
default( <MyObj(internal data)> )
{"s": "internal data", "__module__": "json_myobj", "__class__": "MyObj"}

Decoding text, then converting the dictionary into an object takes a little more work to set up than our previous implementation, but not much.

import json

class MyDecoder(json.JSONDecoder):

def __init__(self):
json.JSONDecoder.__init__(self, object_hook=self.dict_to_object)

def dict_to_object(self, d):
if '__class__' in d:
class_name = d.pop('__class__')
module_name = d.pop('__module__')
module = __import__(module_name)
print 'MODULE:', module
class_ = getattr(module, class_name)
print 'CLASS:', class_
args = dict( (key.encode('ascii'), value) for key, value in d.items())
print 'INSTANCE ARGS:', args
inst = class_(**args)
else:
inst = d
return inst

encoded_object = '[{"s": "instance value goes here", "__module__": "json_myobj", "__class__": "MyObj"}]'

myobj_instance = MyDecoder().decode(encoded_object)
print myobj_instance

And the output is the same as the earlier example.

$ python json_decoder_object_hook.py
MODULE: <module 'json_myobj' from '/Users/dhellmann/Documents/PyMOTW/src/PyMOTW/json/json_myobj.pyc'>
CLASS: <class 'json_myobj.MyObj'>
INSTANCE ARGS: {'s': u'instance value goes here'}
[<MyObj(instance value goes here)>]

Working with Streams and Files

In all of the examples so far, we have assumed that we could (and should) hold the encoded version of the entire data structure in memory at one time. With large data structures it may be preferable to write the encoding directly to a file-like object. The convenience functions load() and dump() accept references to a file-like object to use for reading or writing.

import json
import tempfile

data = [ { 'a':'A', 'b':(2, 4), 'c':3.0 } ]

f = tempfile.NamedTemporaryFile(mode='w+')
json.dump(data, f)
f.flush()

print open(f.name, 'r').read()

A socket would work in much the same way as the normal file handle used here.

$ python json_dump_file.py
[{"a": "A", "c": 3.0, "b": [2, 4]}]

Although it isn’t optimized to read only part of the data at a time, the load() function still offers the benefit of encapsulating the logic of generating objects from stream input.

import json
import tempfile

f = tempfile.NamedTemporaryFile(mode='w+')
f.write('[{"a": "A", "c": 3.0, "b": [2, 4]}]')
f.flush()
f.seek(0)

print json.load(f)
$ python json_load_file.py
[{u'a': u'A', u'c': 3.0, u'b': [2, 4]}]

Mixed Data Streams

The JSONDecoder includes the raw_decode() method for decoding a data structure followed by more data, such as JSON data with trailing text. The return value is the object created by decoding the input data, and an index into that data indicating where decoding left off.

import json

decoder = json.JSONDecoder()
def get_decoded_and_remainder(input_data):
obj, end = decoder.raw_decode(input_data)
remaining = input_data[end:]
return (obj, end, remaining)

encoded_object = '[{"a": "A", "c": 3.0, "b": [2, 4]}]'
extra_text = 'This text is not JSON.'

print 'JSON first:'
obj, end, remaining = get_decoded_and_remainder(' '.join([encoded_object, extra_text]))
print 'Object :', obj
print 'End of parsed input :', end
print 'Remaining text :', repr(remaining)

print
print 'JSON embedded:'
try:
obj, end, remaining = get_decoded_and_remainder(
' '.join([extra_text, encoded_object, extra_text])
)
except ValueError, err:
print 'ERROR:', err


Unfortunately, this only works if the object appears at the beginning of the input.

$ python json_mixed_data.py
JSON first:
Object : [{u'a': u'A', u'c': 3.0, u'b': [2, 4]}]
End of parsed input : 35
Remaining text : ' This text is not JSON.'

JSON embedded:
ERROR: No JSON object could be decoded

See also

json
The standard library documentation for this module.
http://json.org/
JavaScript Object Notation (JSON)
http://code.google.com/p/simplejson/
simplejson, from Bob Ippolito, et al, is the externally maintained development version of the json library included with Python 2.6 and Python 3.0. It maintains backwards compatibility with Python 2.4 and Python 2.5.

PyMOTW Home

The canonical version of this article

The Case for Working With Your Hands - NYTimes.com

The Case for Working With Your Hands - NYTimes.com



Some diagnostic situations contain a lot of variables. Any given symptom may have several possible causes, and further, these causes may interact with one another and therefore be difficult to isolate. In deciding how to proceed, there often comes a point where you have to step back and get a larger gestalt. Have a cigarette and walk around the lift. The gap between theory and practice stretches out in front of you, and this is where it gets interesting. What you need now is the kind of judgment that arises only from experience; hunches rather than rules. For me, at least, there is more real thinking going on in the bike shop than there was in the think tank.


Although I like this article, and understand the source of Mr. Crawford's frustration with corporate culture, I have to take some exception with the categorization of all "knowledge work" as less honest than trade work. The quote above, for example, could just as easily apply to software as motorcycles.

I've spent the last several weeks at work completely refactoring a major aspect of our application because we had finally reached the end of the long blind alley the existing design lead us to follow (it lasted us 7 years, so it was more like a blind turnpike). As I worked my way back out onto the main road and found the right direction to take, I broke and re-fixed several ancillary features, wandered off path myself a few times, and spent a lot of time pondering the variables involved and how they interact.

The resulting design is proving to be more malleable and simpler to test, and as a result of both of those aspects it is vastly easier to maintain. That's a big win for us, since even though I didn't add any customer-visible features to it in this release, by making the code easier to modify I was able to find and fix a major race condition. The redesign also lays the groundwork for future planned enhancements.

That's all by way of saying that working with your hands is honorable, but so can be working with your mind. Just because companies like the abstracting service described in the article do a half-assed job of their work, that doesn't automatically mean all companies do.