convert xhtml back to html

Discussion in 'Python' started by Tim Arnold, Apr 24, 2008.

  1. Tim Arnold

    Tim Arnold Guest

    hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
    create CHM files. That application really hates xhtml, so I need to convert
    self-ending tags (e.g. <br />) to plain html (e.g. <br>).

    Seems simple enough, but I'm having some trouble with it. regexps trip up
    because I also have to take into account 'img', 'meta', 'link' tags, not
    just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
    that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
    enough of a regexp pro to figure out that lookahead stuff.

    I'm not sure where to start now; I looked at BeautifulSoup and
    BeautifulStoneSoup, but I can't see how to modify the actual tag.

    thanks,
    --Tim Arnold
     
    Tim Arnold, Apr 24, 2008
    #1

  2. Gary Herron

    Gary Herron Guest

    Tim Arnold wrote:
    > hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
    > create CHM files. That application really hates xhtml, so I need to convert
    > self-ending tags (e.g. <br />) to plain html (e.g. <br>).
    >
    > Seems simple enough, but I'm having some trouble with it. regexps trip up
    > because I also have to take into account 'img', 'meta', 'link' tags, not
    > just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
    > that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
    > enough of a regexp pro to figure out that lookahead stuff.
    >
    > I'm not sure where to start now; I looked at BeautifulSoup and
    > BeautifulStoneSoup, but I can't see how to modify the actual tag.
    >
    > thanks,
    > --Tim Arnold
    >
    >
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >

    Whether or not you can find an application that does what you want, I
    don't know, but at the very least I can say this much.

    You should not be reading and parsing the text yourself! XHTML is valid
    XML, and there are lots of ways to read and parse XML with Python.
    (ElementTree is what I use, but other choices exist.) Once you use an
    existing package to read your files into an internal tree structure
    representation, it should be a relatively easy job to traverse the tree
    to emit the tags and text you want.
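
    A minimal sketch of that approach, assuming Python 3 and the standard
    library's xml.etree.ElementTree: parse the markup as XML, then serialise
    the tree with method="html", which writes empty elements such as br and
    img without the trailing slash. The sample string is made up for
    illustration, and real XHTML files normally declare the XHTML namespace,
    which this sketch does not deal with.

```python
import xml.etree.ElementTree as ET

# Illustrative input only: a small, namespace-free fragment.
xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

# Parse the fragment as XML into an element tree.
root = ET.fromstring(xhtml)

# Serialise using HTML rules instead of XML rules.
html = ET.tostring(root, encoding="unicode", method="html")
print(html)  # -> <p>hello <img src="/img.png"> spam <br> bye </p>
```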


    Gary Herron
     
    Gary Herron, Apr 24, 2008
    #2

  3. "Tim Arnold" <> writes:

    > hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
    > create CHM files. That application really hates xhtml, so I need to convert
    > self-ending tags (e.g. <br />) to plain html (e.g. <br>).
    >
    > Seems simple enough, but I'm having some trouble with it. regexps trip up
    > because I also have to take into account 'img', 'meta', 'link' tags, not
    > just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
    > that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
    > enough of a regexp pro to figure out that lookahead stuff.


    Hi, I'm not sure if this is very helpful but the following works on
    the very simple example below.

    >>> import re
    >>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
    >>> xtag = re.compile(r'<([^>]*?)/>')
    >>> xtag.sub(r'<\1>', xhtml)

    '<p>hello <img src="/img.png"> spam <br> bye </p>'
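
    A slightly more defensive variant of the substitution above, as a sketch
    only: it rewrites just a fixed list of tags and tolerates whitespace
    before the closing slash. The list VOID_TAGS and the helper name unclose
    are assumptions made for illustration, so adjust the list to whatever
    tags your pages actually use.

```python
import re

# Assumed list of tags to rewrite; extend or trim as needed.
VOID_TAGS = ("br", "hr", "img", "meta", "link", "input")

# Group 1 is the tag name, group 2 the attributes (if any).
# Optional whitespace before "/>" is dropped.
pattern = re.compile(r"<(%s)\b([^>]*?)\s*/>" % "|".join(VOID_TAGS), re.IGNORECASE)

def unclose(xhtml):
    """Rewrite <tag ... /> as <tag ...> for the tags listed in VOID_TAGS."""
    return pattern.sub(r"<\1\2>", xhtml)

sample = '<p>hello <img src="/img.png" /> spam <br /> bye <span/></p>'
result = unclose(sample)
print(result)
```

    Tags outside the list, such as the <span/> in the sample, are left
    untouched, which avoids turning them into unclosed start tags.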


    --
    Arnaud
     
    Arnaud Delobelle, Apr 24, 2008
    #3
  4. Arnaud Delobelle wrote:
    > "Tim Arnold" <> writes:
    >
    >> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
    >> create CHM files. That application really hates xhtml, so I need to convert
    >> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
    >>
    >> Seems simple enough, but I'm having some trouble with it. regexps trip up
    >> because I also have to take into account 'img', 'meta', 'link' tags, not
    >> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to do
    >> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm not
    >> enough of a regexp pro to figure out that lookahead stuff.

    >
    > Hi, I'm not sure if this is very helpful but the following works on
    > the very simple example below.
    >
    >>>> import re
    >>>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
    >>>> xtag = re.compile(r'<([^>]*?)/>')
    >>>> xtag.sub(r'<\1>', xhtml)

    > '<p>hello <img src="/img.png"> spam <br> bye </p>'


    You might try XIST (http://www.livinglogic.de/Python/xist):

    Code looks like this:

    from ll.xist import parsers
    from ll.xist.ns import html

    xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'

    doc = parsers.parsestring(xhtml)
    print doc.bytes(xhtml=0)

    This outputs:

    <p>hello <img src="/img.png"> spam <br> bye </p>

    (and a warning that the alt attribute is missing in the img ;))

    Servus,
    Walter
     
    Walter Dörwald, Apr 24, 2008
    #4
  5. Tim Arnold

    Tim Arnold Guest

    "Gary Herron" <> wrote in message
    news:...
    > Tim Arnold wrote:
    >> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop
    >> to create CHM files. That application really hates xhtml, so I need to
    >> convert self-ending tags (e.g. <br />) to plain html (e.g. <br>).
    >>
    >> Seems simple enough, but I'm having some trouble with it. regexps trip up
    >> because I also have to take into account 'img', 'meta', 'link' tags, not
    >> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to
    >> do that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work.
    >> I'm not enough of a regexp pro to figure out that lookahead stuff.
    >>
    >> I'm not sure where to start now; I looked at BeautifulSoup and
    >> BeautifulStoneSoup, but I can't see how to modify the actual tag.
    >>
    >> thanks,
    >> --Tim Arnold
    >>
    >>
    >> --
    >> http://mail.python.org/mailman/listinfo/python-list
    >>

    > Whether or not you can find an application that does what you want, I
    > don't know, but at the very least I can say this much.
    >
    > You should not be reading and parsing the text yourself! XHTML is valid
    > XML, and there are lots of ways to read and parse XML with Python.
    > (ElementTree is what I use, but other choices exist.) Once you use an
    > existing package to read your files into an internal tree structure
    > representation, it should be a relatively easy job to traverse the tree to
    > emit the tags and text you want.
    >
    >
    > Gary Herron
    >

    I agree and I'd really rather not parse it myself. However, ET will clean up
    the file which in my case includes some comments required as metadata, so
    that won't work. Oh, I could get ET to read it and write a new parser--I see
    what you mean. I think I need to subclass so I could get ET to honor those
    comments too.
    That's one way to go, I was just hoping for something easier.
    thanks,
    --Tim
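
    For what it is worth, a sketch of one way to keep comments with the
    standard library, assuming Python 3.8 or later, where
    xml.etree.ElementTree.TreeBuilder accepts insert_comments=True. Comments
    that sit outside the root element are still not kept by this approach.
    The sample markup and the comment text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative input: a comment carrying metadata inside the root element.
xhtml = '<div><!-- meta: keep-me --><p>text<br/></p></div>'

# Ask the tree builder to insert comments into the tree instead of dropping them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(xhtml, parser=parser)

# Serialise with HTML rules; the comment is written back out.
html = ET.tostring(root, encoding="unicode", method="html")
print(html)
```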
     
    Tim Arnold, Apr 24, 2008
    #5
  6. Tim Arnold

    Tim Arnold Guest

    "Arnaud Delobelle" <> wrote in message
    news:...
    > "Tim Arnold" <> writes:
    >
    >> hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop
    >> to
    >> create CHM files. That application really hates xhtml, so I need to
    >> convert
    >> self-ending tags (e.g. <br />) to plain html (e.g. <br>).
    >>
    >> Seems simple enough, but I'm having some trouble with it. regexps trip up
    >> because I also have to take into account 'img', 'meta', 'link' tags, not
    >> just the simple 'br' and 'hr' tags. Well, maybe there's a simple way to
    >> do
    >> that with regexps, but my simpleminded <img[^(/>)]+/> doesn't work. I'm
    >> not
    >> enough of a regexp pro to figure out that lookahead stuff.

    >
    > Hi, I'm not sure if this is very helpful but the following works on
    > the very simple example below.
    >
    >>>> import re
    >>>> xhtml = '<p>hello <img src="/img.png"/> spam <br/> bye </p>'
    >>>> xtag = re.compile(r'<([^>]*?)/>')
    >>>> xtag.sub(r'<\1>', xhtml)

    > '<p>hello <img src="/img.png"> spam <br> bye </p>'
    >
    >
    > --
    > Arnaud


    Thanks for that. It is helpful--I guess I had a brain malfunction. Your
    example will work for me I'm pretty sure, except in some cases where the IMG
    alt text contains a gt sign. I'm not sure that's even possible, so maybe
    this will do the job.
    thanks,
    --Tim
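
    On the gt sign: a literal > is allowed inside an XML attribute value, and
    a pattern built on [^>] will not match a tag that contains one, so that
    tag would be left unconverted. A sketch of a pattern that matches quoted
    attribute values explicitly, so a > inside the quotes does no harm. The
    names ATTR, SELF_CLOSED and unclose are assumptions made for
    illustration, and the pattern expects every attribute value to be quoted,
    as XHTML requires.

```python
import re

# One attribute: whitespace, a name, "=", then a single- or double-quoted value.
ATTR = r"""\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*')"""

# Group 1 is the tag name, group 2 is the run of attributes (possibly empty).
SELF_CLOSED = re.compile(r"<([\w:.-]+)((?:%s)*)\s*/>" % ATTR)

def unclose(xhtml):
    """Rewrite <tag attrs /> as <tag attrs>, even if a value contains '>'."""
    return SELF_CLOSED.sub(r"<\1\2>", xhtml)

sample = '<img src="/a.png" alt="a > b" /><br/>'
result = unclose(sample)
print(result)
```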
     
    Tim Arnold, Apr 24, 2008
    #6
  7. I'll second the recommendation to use xsl-t, set the output to html.


    The code for an XSL-T to do it would be basically:
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="html" />
    <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
    </xsl:stylesheet>

    you would probably want to do other stuff than just copy it out but
    that's another case.

    Also, as I recall, the way to make XHTML br elements behave correctly in
    CHM was to write <br /> rather than <br/>. In any case, I have done
    projects generating CHM, and my output markup was well-formed XML every
    time.

    Cheers,
    Bryan Rasmussen

     
    bryan rasmussen, Apr 24, 2008
    #7
  8. Tim Arnold wrote:
    > hi, I've got lots of xhtml pages that need to be fed to MS HTML Workshop to
    > create CHM files. That application really hates xhtml, so I need to convert
    > self-ending tags (e.g. <br />) to plain html (e.g. <br>).


    This should do the job in lxml 2.x:

    from lxml import etree

    tree = etree.parse("thefile.xhtml")
    tree.write("thefile.html", method="html")

    http://codespeak.net/lxml

    Stefan
     
    Stefan Behnel, Apr 24, 2008
    #8
  9. wow, that's pretty nice there.

    Just to know: what's the performance like on XML instances of 1 GB?

    Cheers,
    Bryan Rasmussen


     
    bryan rasmussen, Apr 24, 2008
    #9
  10. bryan rasmussen top-posted:
    > On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <> wrote:
    >> from lxml import etree
    >>
    >> tree = etree.parse("thefile.xhtml")
    >> tree.write("thefile.html", method="html")
    >>
    >> http://codespeak.net/lxml

    >
    > wow, that's pretty nice there.
    >
    > Just to know: what's the performance like on XML instances of 1 GB?


    That's a pretty big file, although you didn't mention what kind of XML
    language you want to handle and what you want to do with it.

    lxml is pretty conservative in terms of memory:

    http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    But the exact numbers depend on your data. lxml holds the XML tree in memory,
    which is a lot bigger than the serialised data. So, for example, if you have
    2GB of RAM and want to parse a serialised 1GB XML file full of little
    one-element integers into an in-memory tree, get prepared for lunch. With a
    lot of long text string content instead, it might still fit.

    However, lxml also has a couple of step-by-step and stream parsing APIs:

    http://codespeak.net/lxml/parsing.html#the-target-parser-interface
    http://codespeak.net/lxml/parsing.html#the-feed-parser-interface
    http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

    They might do what you want.
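
    As a rough illustration of the iterparse style of processing, here is a
    sketch that uses the standard library's xml.etree.ElementTree.iterparse
    rather than lxml (lxml.etree offers an iterparse with a similar
    interface). The function name count_elements and the sample data are
    made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count elements named `tag` without keeping their content in memory."""
    count = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
        # Drop the text, attributes and children of the finished element.
        elem.clear()
    return count

# Illustrative data; a real source would be a file name or an open file.
sample = io.BytesIO(b"<root><n>1</n><n>2</n><n>3</n></root>")
total = count_elements(sample, "n")
print(total)  # -> 3
```

    Clearing each element after it has been handled keeps the tree from
    growing with the size of the input, at the cost of only being able to
    look at one element at a time.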

    Stefan
     
    Stefan Behnel, Apr 25, 2008
    #10
  11. Stefan Behnel wrote:
    > bryan rasmussen top-posted:
    >
    >> On Thu, Apr 24, 2008 at 9:55 PM, Stefan Behnel <> wrote:
    >>
    >>> from lxml import etree
    >>>
    >>> tree = etree.parse("thefile.xhtml")
    >>> tree.write("thefile.html", method="html")
    >>>
    >>> http://codespeak.net/lxml
    >>>

    >> wow, that's pretty nice there.
    >>
    >> Just to know: what's the performance like on XML instances of 1 GB?
    >>

    >
    > That's a pretty big file, although you didn't mention what kind of XML
    > language you want to handle and what you want to do with it.
    >
    > lxml is pretty conservative in terms of memory:
    >
    > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
    >
    > But the exact numbers depend on your data. lxml holds the XML tree in memory,
    > which is a lot bigger than the serialised data. So, for example, if you have
    > 2GB of RAM and want to parse a serialised 1GB XML file full of little
    > one-element integers into an in-memory tree, get prepared for lunch. With a
    > lot of long text string content instead, it might still fit.
    >
    > However, lxml also has a couple of step-by-step and stream parsing APIs:
    >
    > http://codespeak.net/lxml/parsing.html#the-target-parser-interface
    > http://codespeak.net/lxml/parsing.html#the-feed-parser-interface
    > http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
    >

    If you are operating with huge XML files (say, larger than available
    RAM) repeatedly, an XML database may also be a good option.

    My current favorite in this realm is Sedna (free, Apache 2.0 license).
    Among other features, it has facilities for indexing within documents
    and collections (faster queries) and transactional sub-document updates
    (safely modify parts of a document without rewriting the entire
    document). I have been working on a python interface to it recently
    (zif.sedna, in pypi).

    Regarding RAM consumption, a Sedna database uses approximately 100 MB of
    RAM by default, and that does not change much, no matter how much (or
    how little) data is actually stored.

    For a quick idea of Sedna's capabilities, the Sedna folks have put up an
    on-line demo serving and xquerying an extract from Wikipedia (in the
    range of 20 GB of data) using a Sedna server, at
    http://wikidb.dyndns.org/ . Along with the on-line demo, they provide
    instructions for deploying the technology locally.

    - Jim Washington
     
    Jim Washington, Apr 25, 2008
    #11
  12. Tim Arnold

    Tim Arnold Guest

    "bryan rasmussen" <> wrote in message
    news:...
    > I'll second the recommendation to use xsl-t, set the output to html.
    >
    >
    > The code for an XSL-T to do it would be basically:
    > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    > version="1.0">
    > <xsl:output method="html" />
    > <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
    > </xsl:stylesheet>
    >
    > you would probably want to do other stuff than just copy it out but
    > that's another case.
    >
    > Also, from my recollection the solution in CHM to make XHTML br
    > elements behave correctly was <br /> as opposed to <br/>, at any rate
    > I've done projects generating CHM and my output markup was well formed
    > XML at all occasions.
    >
    > Cheers,
    > Bryan Rasmussen


    Thanks Bryan, Walter, John, Marc, and Stefan. I finally went with the xslt
    transform which works very well and is simple. regexps would work, but they
    just scare me somehow. Bryan, my tags were formatted as <br /> but the help
    compiler would issue warnings on each one resulting in log files with
    thousands of warnings. It did finish the compile though, but it made
    understanding the logs too painful.

    Stefan, I *really* look forward to being able to use lxml when I move to RH
    linux next month. I've been using hp10.20 and never could get the requisite
    libraries to compile. Once I make that move, maybe I won't have as many
    markup related questions here!

    thanks again to all for the great suggestions.
    --Tim Arnold
     
    Tim Arnold, Apr 25, 2008
    #12