The dircache module includes a function for caching directory listings.
Module: dircache
Purpose: Cache directory listings, updating when the modification time of a directory changes.
Python Version: 1.4 and later
Listing Directory Contents:
The main function in the dircache API is listdir(), a wrapper around os.listdir() that caches the results and returns the same list each time it is called with a given path, unless the modification time of the named directory changes.
import dircache
path = '.'
first = dircache.listdir(path)
second = dircache.listdir(path)
print 'Contents :', first
print 'Identical:', first is second
print 'Equal :', first == second
It is important to recognize that the exact same list is returned each time, so it should not be modified in place.
$ python dircache_listdir.py
Contents : ['.svn', '__init__.py', 'dircache_annotate.py', 'dircache_listdir.py',
'dircache_listdir_file_added.py', 'dircache_reset.py']
Identical: True
Equal : True
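To make the caching behavior concrete, here is a minimal sketch of the idea in Python 3, where dircache is no longer available. It is not the module's source: cached_listdir() and _cache are hypothetical names used for illustration, and the only assumption is that the cache is keyed on the directory's modification time, as described above. It also shows why callers should copy the list before changing it.

```python
import os
import tempfile

# Hypothetical stand-in for a dircache-style cache:
# maps path -> (directory mtime, sorted list of names)
_cache = {}

def cached_listdir(path):
    """Return a cached listing, rescanning only when the directory mtime changes."""
    mtime = os.stat(path).st_mtime
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # the very same list object as last time
    names = sorted(os.listdir(path))
    _cache[path] = (mtime, names)
    return names

with tempfile.TemporaryDirectory() as path:
    open(os.path.join(path, 'example.txt'), 'w').close()

    first = cached_listdir(path)
    second = cached_listdir(path)
    print('Identical:', first is second)

    # Copy before modifying so the cached list is left intact
    mine = list(first)
    mine.append('not-a-real-file')
    print('Cached   :', cached_listdir(path))
    print('Copy     :', mine)
```

Because the cache hands back the stored list itself rather than a copy, any in-place change made by one caller would be seen by every later caller until the directory's modification time changes.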
Of course, if the contents of the directory change, the directory is rescanned.
import dircache
import os
path = '/tmp'
file_to_create = os.path.join(path, 'pymotw_tmp.txt')
# Look at the directory contents
first = dircache.listdir(path)
# Create the new file
open(file_to_create, 'wt').close()
# Rescan the directory
second = dircache.listdir(path)
# Remove the file we created
os.unlink(file_to_create)
print 'Identical :', first is second
print 'Equal :', first == second
print 'Difference:', list(set(second) - set(first))
In this case the new file causes a new list to be constructed.
$ python dircache_listdir_file_added.py
Identical : False
Equal : False
Difference: ['pymotw_tmp.txt']
It is also possible to reset the entire cache, discarding its contents so that each path will be rechecked.
import dircache
path = '/tmp'
first = dircache.listdir(path)
dircache.reset()
second = dircache.listdir(path)
print 'Identical :', first is second
print 'Equal :', first == second
print 'Difference:', list(set(second) - set(first))
$ python dircache_reset.py
Identical : False
Equal : True
Difference: []
Annotated Listings:
The other interesting function provided by the dircache module is annotate(). When called, annotate() modifies a list, such as the one returned by listdir(), in place, adding a '/' to the end of each name that represents a directory. (Sorry Windows users, although it uses os.path.join() to construct names to test, it always appends a '/', not os.sep.)
import dircache
from pprint import pprint
path = '../../trunk'
contents = dircache.listdir(path)
annotated = contents[:]
dircache.annotate(path, annotated)
fmt = '%20s\t%20s'
print fmt % ('ORIGINAL', 'ANNOTATED')
print fmt % (('-' * 20,)*2)
for o, a in zip(contents, annotated):
    print fmt % (o, a)
$ python dircache_annotate.py
            ORIGINAL               ANNOTATED
--------------------    --------------------
           .DS_Store               .DS_Store
                .svn                   .svn/
           ChangeLog               ChangeLog
         LICENSE.txt             LICENSE.txt
            MANIFEST                MANIFEST
         MANIFEST.in             MANIFEST.in
      MANIFEST.in.in          MANIFEST.in.in
            Makefile                Makefile
              PyMOTW                 PyMOTW/
          README.txt              README.txt
         setup.py.in             setup.py.in
      static_content         static_content/
       template.html           template.html
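For readers without access to dircache, the behavior of annotate() described above can be approximated in a few lines of Python 3. This is a sketch under that description, not the module's code: the function below is a local re-implementation, and the temporary directory with its 'subdir' and 'file.txt' entries is invented for the example.

```python
import os
import tempfile

def annotate(head, names):
    """Append '/' in place to each entry of names that is a directory under head."""
    for i, name in enumerate(names):
        if os.path.isdir(os.path.join(head, name)):
            # Always '/', never os.sep, matching the behavior described above
            names[i] = name + '/'

with tempfile.TemporaryDirectory() as path:
    os.mkdir(os.path.join(path, 'subdir'))
    open(os.path.join(path, 'file.txt'), 'w').close()

    contents = sorted(os.listdir(path))
    annotated = contents[:]  # annotate a copy so the original listing is unchanged
    annotate(path, annotated)

print('ORIGINAL :', contents)
print('ANNOTATED:', annotated)
```

Annotating a copy matters for the same reason as before: if the list came from a cache, modifying it in place would change what later callers see.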
5 comments:
At work I was running a Python daemon that used dircache to look for new job files dropped into a queue directory. Every few months I'd get a job that just wouldn't be noticed. I suspected a race in the dircache module, but never gathered enough information for a bug report. Lacking time to pursue this, I changed the daemon to use os.listdir instead, and the problem went away.
So, I could be totally wrong on this, but it seems logical...
Assuming a local file system on an operating system like Linux that utilizes a directory entry cache, the actual disk load generated with dircache.listdir ought to be the same as the disk load generated with os.listdir.
The real benefit comes from not having to regenerate the list returned each time. It's especially evident with directories containing a large number of files.
Actually, you really don't want to use dircache...
Marius' experience where he was using dircache to watch a job directory is typical.
The reason is simple:
- Dircache uses the directory modification timestamp to determine whether the contents of the directory have changed every time you call dircache.listdir. It does this by issuing a "stat" system call on the directory (you can see this using strace).
- On Linux, for the most commonly used filesystems (ext2/ext3), the modification timestamp only has a 1 second resolution.
These two facts conspire to make dircache pretty useless for monitoring a directory which changes frequently (i.e. more than once per second).
You HAVE to use os.listdir() for "real" applications for now. ext4 will have better resolution timestamps (down to the nanosecond), so it will solve this issue. I have no idea what the situation on OS X or Windows is, but I wouldn't trust it.
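The failure mode described in this comment can be simulated in Python 3 without waiting for an unlucky timing. The sketch below is only an illustration: cached_listdir() and _cache are hypothetical stand-ins for a dircache-style cache, 'job-001' is an invented file name, and os.utime() is used to put the directory's timestamps back after adding a file, imitating a change that lands within the same timestamp tick.

```python
import os
import tempfile

# Hypothetical dircache-style cache: path -> (mtime in ns, sorted list of names)
_cache = {}

def cached_listdir(path):
    """Rescan only when the directory's modification time differs from the cached one."""
    mtime = os.stat(path).st_mtime_ns
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]
    names = sorted(os.listdir(path))
    _cache[path] = (mtime, names)
    return names

with tempfile.TemporaryDirectory() as path:
    before = cached_listdir(path)
    stamp = os.stat(path)

    # Drop a new job file, then restore the directory's old timestamps to
    # simulate a change the timestamp resolution is too coarse to record
    open(os.path.join(path, 'job-001'), 'w').close()
    os.utime(path, ns=(stamp.st_atime_ns, stamp.st_mtime_ns))

    stale = cached_listdir(path)       # served from the cache
    fresh = sorted(os.listdir(path))   # always reads the directory

    _cache.clear()                     # rough equivalent of a cache reset
    after_reset = cached_listdir(path)

print('Cached     :', stale)
print('os.listdir :', fresh)
print('After reset:', after_reset)
```

In this simulation the cached listing misses the new file until the cache is cleared, while os.listdir() sees it immediately. Whether a real workload hits this depends on the filesystem's timestamp resolution and how quickly the directory changes.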
Thanks for the feedback, everyone!
It sounds like this module isn't that useful for watching a directory's contents, due to the granularity of the time value from stat and the fact that most modern operating systems are going to cache the filesystem contents anyway.
I wonder if we should just drop it from the standard library?
While there's really no difference in disk traffic, there is still a benefit in that the directory list isn't rebuilt every time. That is especially true for directories with a lot of entries.
Now, does it need its own module or should the algorithm be treated more as a possible optimization technique? Don't know.
In a directory with 10 entries:
jeff@martian:~/t$ python -m timeit -s 'import os' 'os.listdir(".")'
10000 loops, best of 3: 27 usec per loop
jeff@martian:~/t$ python -mtimeit -s 'import dircache' 'dircache.listdir(".")'
100000 loops, best of 3: 6.22 usec per loop
In a directory with 10k entries (this is where the big difference ought to show up):
jeff@martian:~/t$ python -m timeit -s 'import os' 'os.listdir(".")'
100 loops, best of 3: 11.1 msec per loop
jeff@martian:~/t$ python -mtimeit -s 'import dircache' 'dircache.listdir(".")'
100000 loops, best of 3: 6.39 usec per loop