Sunday, November 30, 2008

preparing Sphinx output for Blogger

Blogger's "convert line breaks" flag can't be set differently for an individual post. Since not all of my posts are PyMOTW articles, I've been trying to keep the flag on because it makes writing posts like this one easier. The results have been a little ugly, but I think I finally have that straightened out.

I prepare the PyMOTW articles using reST and convert them to HTML with Sphinx. I have a custom template that spits out only the body of the HTML (with no html or body tags). Code passes through pygments automatically as part of the Sphinx processing. The results include newlines after most of the tags, though. Blogger was converting those newlines to br tags, even when the tags themselves were otherwise invisible (like table tags).

I needed a cleanup script anyway because Sphinx (or docutils, I'm not 100% certain) inserts permalink anchors for each header. The stylesheet I use for the PyMOTW site causes them to be hidden unless the user mouses over the link, but I didn't want them at all in the blog posts. A previous attempt at a cleanup script with BeautifulSoup stripped the permalinks but also removed the whitespace from within pre tags. A recent update to BeautifulSoup fixed that problem, so I gave it another try today.

Unfortunately, I couldn't find any combination of arguments to tell BeautifulSoup not to insert newlines between tags. The prettyPrint option was either ignored, or I don't understand how it is intended to be used. So I used BeautifulSoup to remove the permalinks but fell back on regular expressions for the newline handling.

I want to remove all newline characters immediately after closing tags, except if the tags are part of code or other pre-formatted output. Lines that do not end with tags are probably part of pre blocks, and whitespace is obviously important there. I realized that since pygments consistently uses span tags, as long as I ignored newlines after span tags I should be safe.

This is the script I came up with to take the HTML output of Sphinx and prepare it for posting through Blogger:

#!/usr/bin/env python
# encoding: utf-8
#
# Copyright (c) 2008 Doug Hellmann All rights reserved.
#
"""Clean a sphinx-generated HTML blob to make a blog post.
"""

import re
import sys
from BeautifulSoup import BeautifulSoup
from cStringIO import StringIO

# The post body is passed to stdin.
body = sys.stdin.read()
soup = BeautifulSoup(body)

# Remove the permalinks to each header since the blog does not have
# the styles to hide them.
links = soup.findAll('a', attrs={'class': "headerlink"})
for link in links:
    link.extract()

# Get BeautifulSoup's version of the string
s = soup.__str__(prettyPrint=False)

# Remove extra newlines. This depends on the fact that
# code blocks are passed through pygments, which wraps each part of the line
# in a span tag.
pattern = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL|re.IGNORECASE)
s = ''.join(pattern.sub(r'\1', l) for l in StringIO(s))
print s
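
One detail of the pattern in the script is worth spelling out: each character class excludes a single letter of "span" at its own position, rather than excluding the word as a whole. That means a line ending in a bare closing pre tag also keeps its newline, because the "p" lands in the second position. The sketch below compares that pattern with a hypothetical alternative based on a negative lookbehind. The lookbehind version and the helper names strip_positional and strip_lookbehind are illustrations, not part of the original script, and the sample lines are made up. It uses only the standard re module and is written for Python 3.

```python
import re

# Pattern copied from the script: each character class rules out one
# letter of "span" at its own position, not the word as a whole.
positional = re.compile(r'([^s][^p][^a][^n]>)\n$', re.DOTALL | re.IGNORECASE)

# Hypothetical alternative (not from the original script): a negative
# lookbehind that rejects only a literal "</span" before the final ">".
lookbehind = re.compile(r'(?<!</span)>\n$', re.IGNORECASE)


def strip_positional(line):
    """Apply the script's pattern to one line of HTML."""
    return positional.sub(r'\1', line)


def strip_lookbehind(line):
    """Apply the lookbehind variant to one line of HTML."""
    return lookbehind.sub('>', line)


# Made-up sample lines covering the interesting cases.
samples = [
    '</table>\n',                       # structural tag
    '<span class="k">import</span>\n',  # pygments line inside a pre block
    '</pre></div>\n',                   # end of a highlighted block
    '</pre>\n',                         # bare closing pre tag
    'plain text\n',                     # no tag at the end of the line
]

for line in samples:
    print(repr(line), repr(strip_positional(line)), repr(strip_lookbehind(line)))
```

With these samples, both versions drop the newline after the table and div lines and keep it after the span line and the plain text line. Only the lookbehind version drops it after the bare closing pre tag.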


Today's PyMOTW post on readline is the first example of the results.

Updated 1 Dec to change import line based on reader comment.

9 comments:

Georg said...

I've now added a new config value to Sphinx tip that can be used to switch off these permalinks, called "html_add_permalinks".

Doug Hellmann said...

Sweet! Now if I could just figure out how to get rid of those extra newlines...

Drew Perttula said...

"from BeautifulSoup import *" should be "from BeautifulSoup import BeautifulSoup".

That's especially true if one's readers are learning python idioms by reading one's blog for example code :)

Doug Hellmann said...

Good point, Drew. I had been using * while testing a much more complicated version that attempted to subclass Tag and BeautifulSoup to deal with the newline issue, but I abandoned that approach. Thanks for pointing out the flub!

airfoyle said...

This is somewhat off-topic, but I have had trouble finding help on this....

When I paste my html into blogger --- after performing transformations such as the ones you describe --- all my newlines go away. I am copying from emacs on a Mac running OS 10.4. I've tried inserting explicit ^M^Js into the emacs buffer before copying it; perhaps the X11 copy function is deleting them.

What's odd is that inserting the newlines by hand back into the post using the blogger editor has no effect.

Any ideas or pointers to some magic source of information? Google has failed me here because putting "blogger" into the search just finds a lot of blogs.

Thanks!

Doug Hellmann said...

@airfoyle - Do you have the "Convert Line Breaks" option set to "yes" for your blog?

airfoyle said...

> Do you have the "Convert Line Breaks" option set to "yes" for your blog?

Nope.

I have found a solution, if not an explanation. If I delete all the newlines and put in explicit paragraph tags then I can get things under control.

I'm willing to run my html through the appropriate emacs-global-replace every time, but the whole thing is puzzling.

Thanks.

Doug Hellmann said...

@airfoyle - What happens if you turn on the convert linebreaks option?

airfoyle said...

> What happens if you turn on the convert linebreaks option?

Then it behaves more consistently; either way I have to squeeze out all the newlines to get my html markup to override whatever Blogger is trying to do. It's a nuisance, but a minor one, I guess.

Thanks.