Sunday, October 7, 2007

PyMOTW: difflib

The difflib module contains several classes for comparing sequences, especially of lines of text from files, and manipulating the results.

Updated: I can't quite make the formatting for the examples come out the way I want with this blog template (the content column is too narrow, and apparently fixed). If you are viewing this through the web page on Blogger, I encourage you to download the sample code and run it yourself to see what it looks like. If there are any CSS wizards out there who can help make the table actually visible, I would appreciate any comments you might leave.

Module: difflib
Purpose: Library of tools for computing and working with differences between sequences, especially of lines in text files.
Python Version: 2.1

Description:

The SequenceMatcher class compares any 2 sequences of values, as long as the values are hashable. It uses a recursive algorithm to identify the longest contiguous matching blocks from the sequences, eliminating "junk" values. The Differ class works on sequences of text lines and produces human-readable deltas, including differences within individual lines. The HtmlDiff class produces similar results formatted as an HTML table.

Test Data:

The examples below will all use this common test data in the difflib_data module:

text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
enim. Donec quis lectus a justo imperdiet tempus."""
text1_lines = text1.splitlines()

text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
enim. Donec quis lectus a justo imperdiet tempus."""
text2_lines = text2.splitlines()


Differ Example:

Reproducing output similar to the diff command line tool is simple with the Differ class:

import difflib
from difflib_data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print '\n'.join(list(diff))


The output includes the original input values from both lists, including common values, and markup data to indicate what changes were made. Lines may be prefixed with - to indicate that they were in the first sequence, but not the second. Lines prefixed with + were in the second sequence, but not the first. If a line has an incremental change between versions, an extra line prefixed with ? is used to try to indicate where the change occurred within the line. If a line has not changed, it is printed with an extra blank space on the left column to let it line up with the other lines which may have other markup.

The beginning of both text segments is the same.

 1:   Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer


The second line has been changed to include a comma in the modified text. Both versions of the line are printed, with the extra information on line 4 showing the column where the text was modified, including the fact that the , character was added.

 2: - eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
3: + eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
4: ? +
5:


Lines 6-9 of the output shows where an extra space was removed.

 6: - tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec
7: ? -
8:
9: + tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec


Next a more complex change was made, replacing several words in a phrase.

10: - mauris eget magna consequat convallis. Nam sed sem vitae odio
11: ? - --
12:
13: + mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
14: ? +++ +++++ +
15:


The last sentence in the paragraph was changed significantly, so the difference is represented by simply removing the old version and adding the new (lines 20-23).

16:   pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
17: metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
18: urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
19: suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
20: - adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
21: - enim. Donec quis lectus a justo imperdiet tempus.
22: + adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
23: + imperdiet tempus. Suspendisse eu lectus. In nunc.


The ndiff() function produces essentially the same output. The processing is specifically tailored to working with text data and eliminating "noise" in the input.

diff = difflib.ndiff(text1_lines, text2_lines)


Other Diff Formats:

Where the Differ class shows all of the inputs, a unified diff only includes modified lines and a bit of context. In version 2.3, a unified_diff() function was added to produce this sort of output:

import difflib
from difflib_data import *

diff = difflib.unified_diff(text1_lines, text2_lines, lineterm='')
print '\n'.join(list(diff))


The output should look familiar to users of svn or other version control tools:

$ python difflib_unified.py
---
+++
@@ -1,10 +1,10 @@
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
-eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
-tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
-mauris eget magna consequat convallis. Nam sed sem vitae odio
+eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
+tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
+mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
-adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
-enim. Donec quis lectus a justo imperdiet tempus.
+adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
+imperdiet tempus. Suspendisse eu lectus. In nunc.


Using context_diff() produces similar readable output:

$ python difflib_context.py
***
---
***************
*** 1,10 ****
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
! eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
! tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
! mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
! adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
! enim. Donec quis lectus a justo imperdiet tempus.
--- 1,10 ----
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
! eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
! tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
! mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
! adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
! imperdiet tempus. Suspendisse eu lectus. In nunc.


HTML Output:

HtmlDiff (new in Python 2.4) produces HTML output with the same information as the Diff class. This example uses make_table(), but the make_file() method produces a fully-formed HTML file as output.

import difflib
from difflib_data import *

d = difflib.HtmlDiff()
print d.make_table(text1_lines, text2_lines)


The output produces a table like the one below. (I have modified the styles slightly to make the table work on this blog.)












f1Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integerf1Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
n2eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitorn2eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
3tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec3tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
4mauris eget magna consequat convallis. Nam sed sem vitae odio4mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
5pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu5pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
6metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris6metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
7urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,7urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
8suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta8suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
t9adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristiquet9adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
10enim. Donec quis lectus a justo imperdiet tempus.10imperdiet tempus. Suspendisse eu lectus. In nunc. 


Junk Data:

All of the functions which produce diff sequences accept arguments to indicate which lines should be ignored, and which characters within a line should be ignored. This can be used to ignore markup or whitespace changes in two versions of file, for example. The default for Differ is to not ignore any lines or characters explicitly, but to rely on the SequenceMatcher's ability to detect noise. The default for ndiff is to ignore space and tab characters.

SequenceMatcher:

SequenceMatcher, which implements the comparison algorithm, can be used with sequences of any type of object as long as the object is hashable. For example, two lists of integers can be compared, and using get_opcodes() a set of instructions for converting the original list into the newer can be printed:

import difflib
from difflib_data import *

s1 = [ 1, 2, 3, 5, 6, 4 ]
s2 = [ 2, 3, 5, 4, 6, 1 ]

matcher = difflib.SequenceMatcher(None, s1, s2)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
print ("%7s s1[%d:%d] (%s) s2[%d:%d] (%s)" %
(tag, i1, i2, s1[i1:i2], j1, j2, s2[j1:j2]))


$ python difflib_seq.py
delete s1[0:1] ([1]) s2[0:0] ([])
equal s1[1:4] ([2, 3, 5]) s2[0:3] ([2, 3, 5])
insert s1[4:4] ([]) s2[3:4] ([4])
equal s1[4:5] ([6]) s2[4:5] ([6])
replace s1[5:6] ([4]) s2[5:6] ([1])


You can use SequenceMatcher with your own classes, as well as built-in types.

References:

Python Module of the Week Home
Download Sample Code
Pattern Matching: The Gestalt Approach - Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener published in Dr. Dobb’s Journal in July, 1988.



Technorati Tags:
,


4 comments:

Michael Watkins said...

Add this CSS will at least allow horizontal scrolling:

table.diff {
display:block;
overflow:auto;
}

Doug said...

Thanks, Michael, that's much better!

metapundit.net said...

I was happy to see this week's PYMOTW show up in my rss reader - I've just been mucking about with difflib. One thing I ran into that I didn't understand was the linejunk/charjunk functionality. I assumed that a line for which my linejunk function returned True would be ignored in the diff. This doesn't seem to be the case - I can't quite figure out in fact what it's supposed to do.

The following code:

---------
import difflib

text1 = """aaaa
abcd
#present in both
qwer
"""
text1_lines = text1.splitlines()

text2 = """aaaa
abdc
#present in both
qwer
#this line should be ignored?
"""
text2_lines = text2.splitlines()

def comments(line):
if line[0] == '#': return True

diff = difflib.ndiff(text1_lines, text2_lines, linejunk=comments)
for line in diff: print line

---------

Results in the following output:
---------
aaaa
- abcd
? -

+ abdc
? +

#present in both
qwer
+ #this line should be ignored?

---------
It didn't filter out the comment lines that appear in both inputs and didn't ignore the comment line that only appeared in one input... I was pressed for time and just pre-filtered my lists of lines with a regex before passing them to the diff module, but still wonder what linejunk/charjunk are supposed to do...

Doug said...

@metapundit.net:

It took a little while to work out a response to your question. The junk functions do not filter content from being compared entirely, they just help with the algorithm that is looking for the longest matches. When a junk function is provided, the matching sequences do not include junk. The example from SequenceMatcher.find_longest_match() in the module looks something like:

>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
(0, 4, 5)

>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
>>> s.find_longest_match(0, 5, 0, 9)
(1, 0, 4)

In the first case, without junk filtering, the match is made with the full sequence from the first string argument ([0:5]), so it appears late in the second argument ([4:9]).

In the second case, with junk filtering, the space in the first string is ignored for the purposes of finding a match. The result is the first occurance of "abcd" matches for both strings ([1:5], and [0:4]).

I've added another example file (difflib_junk.py) to release 1.23 of the sample code.