convert xhtml back to html

Discussion in 'Python' started by Tim Arnold, Apr 24, 2008.

  1. Tim Arnold

    Tim Arnold Guest

    hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
    create CHM files. That application really hates xhtml, so I need to convert
    self-ending tags (e.g. <br />) to plain html (e.g. <br>).

    Seems simple enough, but I'm having some trouble with it. regexps trip up
    because I also have to take into account 'img', 'meta', 'link' tags, not
    just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
    that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
    enough of a regexp pro to figure out that lookahead stuff.

    I'm not sure where to start now; I looked at BeautifulSoup and
    BeautifulStoneSoup, but I can't see how to modify the actual tag.

    --Tim Arnold
    Tim Arnold, Apr 24, 2008

  2. Gary Herron

    Gary Herron Guest

    Whether or not you can find an application that does what you want, I
    don't know, but at the very least I can say this much.

    You should not be reading and parsing the text yourself! XHTML is valid
    XML, and there are lots of ways to read and parse XML with Python.
    (ElementTree is what I use, but other choices exist.) Once you use an
    existing package to read your files into an internal tree structure
    representation, it should be a relatively easy job to traverse the tree
    to emit the tags and text you want.

    Gary Herron
    Gary Herron, Apr 24, 2008
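A minimal sketch of the tree-based approach described above, using the standard-library ElementTree and its HTML serialisation mode. The sample string is borrowed from elsewhere in this thread and has no namespace; a real document that declares the XHTML namespace would need extra namespace handling, which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# Sample input without a namespace declaration (an assumption of this sketch).
xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

# Parse the markup into a tree instead of editing the text directly.
root = ET.fromstring(xhtml)

# method="html" writes void elements such as <br> and <img> without "/>".
html_text = ET.tostring(root, encoding="unicode", method="html")
print(html_text)
```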

  3. Hi, I'm not sure if this is very helpful but the following works on
    the very simple example below.
    '<p>hello <img src="/img.png"> spam <br> bye </p>'
    Arnaud Delobelle, Apr 24, 2008
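The code from the reply above is not preserved here. The following is only a sketch of one regex-based approach, not necessarily what was posted. The names `VOID_TAGS`, `SELF_CLOSING` and `xhtml_to_html` are made up for illustration, and the tag list is taken from the original question.

```python
import re

# Tags named in the original question; extend as needed (hypothetical list).
VOID_TAGS = ("br", "hr", "img", "meta", "link")

# Match "<tag ...attributes... />" for the listed tags only.
# [^<>]*? refuses to cross another angle bracket, so a tag whose
# attribute value contains ">" is left unchanged rather than mangled.
SELF_CLOSING = re.compile(
    r"<(%s)\b([^<>]*?)\s*/>" % "|".join(VOID_TAGS),
    re.IGNORECASE,
)

def xhtml_to_html(text):
    """Rewrite self-closing void tags as plain HTML start tags."""
    return SELF_CLOSING.sub(r"<\1\2>", text)

converted = xhtml_to_html('<p>hello <img src="/img.png"/> spam <br /> bye </p>')
print(converted)
```

Because the pattern stops at any `<` or `>`, an `img` whose alt text contains a literal greater-than sign would simply not be converted.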
  4. You might try XIST (

    Code looks like this:

    from ll.xist import parsers
    from ll.xist.ns import html

    xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

    doc = parsers.parsestring(xhtml)
    print doc.bytes(xhtml=0)

    This outputs:

    <p>hello <img src="/img.png"> spam <br> bye </p>

    (and a warning that the alt attribute is missing in the img ;))

    Walter Dörwald, Apr 24, 2008
  5. Tim Arnold

    Tim Arnold Guest

    I agree and I'd really rather not parse it myself. However, ET will clean up
    the file which in my case includes some comments required as metadata, so
    that won't work. Oh, I could get ET to read it and write a new parser--I see
    what you mean. I think I need to subclass so I could get ET to honor those
    comments too.
    That's one way to go, I was just hoping for something easier.
    Tim Arnold, Apr 24, 2008
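On the comment problem: in current Python (3.8 and later, so not available when this thread was written) the standard-library `TreeBuilder` accepts `insert_comments=True`, which avoids subclassing. A minimal sketch, assuming the metadata comments sit inside the root element; comments before the root are still not kept by this approach.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a metadata comment inside the root element.
xhtml = '<div><!-- metadata: keep me --><p>text<br/></p></div>'

# Ask the tree builder to keep comments instead of discarding them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(xhtml, parser=parser)

# Serialise as HTML; the comment is written back out with the tree.
html_text = ET.tostring(root, encoding="unicode", method="html")
print(html_text)
```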
  6. Tim Arnold

    Tim Arnold Guest

    Thanks for that. It is helpful--I guess I had a brain malfunction. Your
    example will work for me I'm pretty sure, except in some cases where the IMG
    alt text contains a gt sign. I'm not sure that's even possible, so maybe
    this will do the job.
    Tim Arnold, Apr 24, 2008
  7. I'll second the recommendation to use xsl-t, set the output to html.

    The code for an XSL-T to do it would be basically:
    <xsl:stylesheet xmlns:xsl="" version="1.0">
    <xsl:output method="html" />
    <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
    </xsl:stylesheet>

    you would probably want to do other stuff than just copy it out but
    that's another case.

    Also, from my recollection the solution in CHM to make XHTML br
    elements behave correctly was <br /> as opposed to <br/>, at any rate
    I've done projects generating CHM and my output markup was well formed
    XML at all occasions.

    Bryan Rasmussen
    bryan rasmussen, Apr 24, 2008
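A sketch of how such a stylesheet could be applied from Python. It assumes the third-party lxml package is installed. The namespace URI is the standard XSLT 1.0 one, filled in here because the post above shows it empty. The sample input has no namespace, which is an assumption of this sketch.

```python
from lxml import etree

# Identity-style stylesheet with HTML output, as outlined in the post above.
XSLT_SOURCE = b"""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>
"""

# Compile the stylesheet once; the result is a reusable callable.
transform = etree.XSLT(etree.XML(XSLT_SOURCE))

doc = etree.XML(b'<p>hello <img src="/img.png"/> spam <br/> bye </p>')
result = transform(doc)

# str() serialises the result using the stylesheet's output method.
html_text = str(result)
print(html_text)
```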
  8. This should do the job in lxml 2.x:

    from lxml import etree

    tree = etree.parse("thefile.xhtml")
    tree.write("thefile.html", method="html")

    Stefan Behnel, Apr 24, 2008
  9. wow, that's pretty nice there.

    Just to know: what's the performance like on XML instances of 1 GB?

    Bryan Rasmussen
    bryan rasmussen, Apr 24, 2008
  10. bryan rasmussen top-posted:
    That's a pretty big file, although you didn't mention what kind of XML
    language you want to handle and what you want to do with it.

    lxml is pretty conservative in terms of memory:

    But the exact numbers depend on your data. lxml holds the XML tree in memory,
    which is a lot bigger than the serialised data. So, for example, if you have
    2GB of RAM and want to parse a serialised 1GB XML file full of little
    one-element integers into an in-memory tree, get prepared for lunch. With a
    lot of long text string content instead, it might still fit.

    However, lxml also has a couple of step-by-step and stream parsing APIs:

    They might do what you want.

    Stefan Behnel, Apr 25, 2008
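To illustrate the step-by-step idea, here is a minimal sketch using `iterparse` from the standard-library ElementTree; lxml provides an `iterparse` with a similar interface. The function name `count_elements` and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count elements named `tag` without keeping their content in memory."""
    count = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop text, attributes and children of the finished element.
        # The root still holds the emptied child objects themselves.
        elem.clear()
    return count

# Small in-memory stand-in for a large file of one-element integers.
sample = io.BytesIO(b"<root>" + b"<n>1</n>" * 1000 + b"</root>")
total = count_elements(sample, "n")
print(total)
```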
  11. If you are operating with huge XML files (say, larger than available
    RAM) repeatedly, an XML database may also be a good option.

    My current favorite in this realm is Sedna (free, Apache 2.0 license).
    Among other features, it has facilities for indexing within documents
    and collections (faster queries) and transactional sub-document updates
    (safely modify parts of a document without rewriting the entire
    document). I have been working on a python interface to it recently
    (zif.sedna, in pypi).

    Regarding RAM consumption, a Sedna database uses approximately 100 MB of
    RAM by default, and that does not change much, no matter how much (or
    how little) data is actually stored.

    For a quick idea of Sedna's capabilities, the Sedna folks have put up an
    on-line demo serving and xquerying an extract from Wikipedia (in the
    range of 20 GB of data) using a Sedna server, at . Along with the on-line demo, they provide
    instructions for deploying the technology locally.

    - Jim Washington
    Jim Washington, Apr 25, 2008
  12. Tim Arnold

    Tim Arnold Guest

    Thanks Bryan, Walter, John, Marc, and Stefan. I finally went with the xslt
    transform which works very well and is simple. regexps would work, but they
    just scare me somehow. Bryan, my tags were formatted as <br /> but the help
    compiler would issue warnings on each one resulting in log files with
    thousands of warnings. It did finish the compile though, but it made
    understanding the logs too painful.

    Stefan, I *really* look forward to being able to use lxml when I move to RH
    linux next month. I've been using hp10.20 and never could get the requisite
    libraries to compile. Once I make that move, maybe I won't have as many
    markup related questions here!

    thanks again to all for the great suggestions.
    --Tim Arnold
    Tim Arnold, Apr 25, 2008