transforming xhtml to html (resolving namespace dependencies)

Discussion in 'XML' started by Andy, Jan 30, 2011.

  1. Andy

    Andy Guest

    Hi,

    I am using Apache xalan to transform xhtml files to html files.

    My XSLT stylesheet is:

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" encoding="UTF-8"/>
    <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
    </xsl:stylesheet>

    Seems to work. For example, I had an xhtml file which had entities
    defined in DOCTYPE and those were resolved successfully.

    However, I'm more concerned with another document:

    It's an XHTML file and begins with:

    <html xmlns:v="urn:schemas-microsoft-com:vml"
          xmlns:o="urn:schemas-microsoft-com:office:office"
          xmlns:w="urn:schemas-microsoft-com:office:word"
          xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
          xmlns:xlink="http://www.w3.org/1999/xlink"
          xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
          content="text/html; charset=utf-8"/>

    My concern is whether Xalan resolves all the dependencies such an XHTML
    file has on the schemas referenced in the html tag.

    Will it?

    The Xalan HTML output began with:

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
          xmlns:o="urn:schemas-microsoft-com:office:office"
          xmlns:v="urn:schemas-microsoft-com:vml"
          xmlns:w="urn:schemas-microsoft-com:office:word"
          xmlns:xlink="http://www.w3.org/1999/xlink">

    So I'm obviously concerned that the dependencies are still there!

    If it's OK, can I strip all those xmlns attributes in the <html> tag?

    Or maybe I need a much better XSLT stylesheet.

    Thanks,
    Andy
    Andy, Jan 30, 2011
    #1

  2. Sun, 30 Jan 2011 15:30:44 -0800 (PST), /Andy/:

    > It's an xhtml file and begins with:
    >
    > <html xmlns:v="urn:schemas-microsoft-com:vml"
    >       xmlns:o="urn:schemas-microsoft-com:office:office"
    >       xmlns:w="urn:schemas-microsoft-com:office:word"
    >       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    >       xmlns:xlink="http://www.w3.org/1999/xlink"
    >       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
    >       content="text/html; charset=utf-8"/>


    It is not XHTML, it's an MS Office output for HTML which happens to
    be some sort of XML.

    > My concern is that xalan resolve all dependencies in such an xhtml
    > file on the schemas referenced in the html tag.
    >
    > Will it???


    I don't see any direct schema references, but I don't think you need
    any in this case.

    > The xalan output to html began with:
    >
    > <html xmlns="http://www.w3.org/1999/xhtml"
    >       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    >       xmlns:o="urn:schemas-microsoft-com:office:office"
    >       xmlns:v="urn:schemas-microsoft-com:vml"
    >       xmlns:w="urn:schemas-microsoft-com:office:word"
    >       xmlns:xlink="http://www.w3.org/1999/xlink">
    >
    > So I'm obviously concerned that the dependencies are still there!
    >
    > If it's ok, can I strip all those xmlns attributes in the <html> tag?


    Yes, if you want to output pure HTML you need to strip those
    namespace declaration attributes off. See the
    'exclude-result-prefixes' attribute [1][2].

    > Or maybe I need a much better xslt stylesheet.


    I guess you would need to include templates for converting elements
    like <o:p> into HTML ones - <p>. The amount of crap MS Office outputs
    as HTML is enormous. I can't give you all the rules you need for
    converting such a file to clean HTML. You may also look at HTML
    Tidy [3].

    [1] http://www.w3.org/TR/xslt#stylesheet-element
    [2] http://www.w3.org/TR/xslt#literal-result-element
    [3] http://tidy.sourceforge.net/

    --
    Stanimir
    Stanimir Stamenkov, Jan 31, 2011
    #2

  3. Stanimir Stamenkov wrote:
    > Sun, 30 Jan 2011 15:30:44 -0800 (PST), /Andy/:


    >> The xalan output to html began with:
    >>
    >> <html xmlns="http://www.w3.org/1999/xhtml"
    >>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    >>       xmlns:o="urn:schemas-microsoft-com:office:office"
    >>       xmlns:v="urn:schemas-microsoft-com:vml"
    >>       xmlns:w="urn:schemas-microsoft-com:office:word"
    >>       xmlns:xlink="http://www.w3.org/1999/xlink">
    >>
    >> So I'm obviously concerned that the dependencies are still there!
    >>
    >> If it's ok, can I strip all those xmlns attributes in the <html> tag?

    >
    > Yes, if you want to output pure HTML you need to strip those namespace
    > declaration attributes off. See the 'exclude-result-prefixes' attribute
    > [1][2].


    exclude-result-prefixes does not help if namespaces are copied from an
    input node, as was done in the posted stylesheet by using
    <xsl:copy-of select="/"/>

    You would need to write a stylesheet doing
    <xsl:template match="*">
    <xsl:element name="{name()}" namespace="{namespace-uri()}">
    <!-- copy attributes directly; the built-in rule would emit their values as text -->
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
    </xsl:element>
    </xsl:template>
    in XSLT 1.0, to make sure elements are copied but their namespace nodes
    are not automatically copied. But even this way, as long as elements in
    a certain namespace are copied through, the result document when
    serialized is going to declare those namespaces.
    So in that input document you could only get rid of e.g.
    xmlns:w="urn:schemas-microsoft-com:office:word" as long as there are no
    elements in that namespace copied.

    If you want to strip all namespaces then use
    <xsl:template match="*">
    <xsl:element name="{local-name()}">
    <!-- copy attributes directly; the built-in rule would emit their values as text -->
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
    </xsl:element>
    </xsl:template>
    or perhaps add templates for elements in namespaces like
    urn:schemas-microsoft-com:office:word so that they are not copied at all,
    if you don't need or want such elements.
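
    A minimal sketch of a complete XSLT 1.0 stylesheet along these lines,
    assuming the input is well-formed XML. The choice to discard VML (v:)
    and Word (w:) elements entirely, and to unwrap every other non-XHTML
    element while keeping its content, is an assumption made for
    illustration and is not something stated in the thread.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:x="http://www.w3.org/1999/xhtml"
        xmlns:v="urn:schemas-microsoft-com:vml"
        xmlns:w="urn:schemas-microsoft-com:office:word">

      <xsl:output method="html" encoding="UTF-8"/>

      <!-- XHTML elements: recreate them under their local name, in no namespace -->
      <xsl:template match="x:*">
        <xsl:element name="{local-name()}">
          <!-- keep only attributes in no namespace (xml:lang, xlink:* are dropped) -->
          <xsl:copy-of select="@*[namespace-uri() = '']"/>
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:template>

      <!-- Assumption: VML and Word elements are discarded with their content -->
      <xsl:template match="v:* | w:*"/>

      <!-- Any other element (for example o:p): drop the wrapper, keep its children -->
      <xsl:template match="*">
        <xsl:apply-templates/>
      </xsl:template>

      <!-- Text is copied by the built-in rule; comments need an explicit rule -->
      <xsl:template match="comment()">
        <xsl:copy/>
      </xsl:template>
    </xsl:stylesheet>
    ```

    Because no element is copied from the input and no literal result
    element is used, the serialized output should carry no xmlns
    declarations at all.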


    --

    Martin Honnen
    http://msmvps.com/blogs/martin_honnen/
    Martin Honnen, Jan 31, 2011
    #3
  4. Andy wrote:

    > However, I'm more concerned with another document:
    >
    > It's an xhtml file and begins with:
    >
    > <html xmlns:v="urn:schemas-microsoft-com:vml"
    >       xmlns:o="urn:schemas-microsoft-com:office:office"
    >       xmlns:w="urn:schemas-microsoft-com:office:word"
    >       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    >       xmlns:xlink="http://www.w3.org/1999/xlink"
    >       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
    >       content="text/html; charset=utf-8"/>
    >
    > My concern is that xalan resolve all dependencies in such an xhtml
    > file on the schemas referenced in the html tag.
    >
    > Will it???


    There are namespace declarations in that document. An XML parser does
    not resolve the URLs in namespace declarations.
    Schemas are not referenced, that would be done with
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="urn:schemas-microsoft-com:vml
    http://example.com/someschema.xsd"

    --

    Martin Honnen
    http://msmvps.com/blogs/martin_honnen/
    Martin Honnen, Jan 31, 2011
    #4
  5. Andy

    Andy Guest

    On Jan 30, 9:53 pm, Stanimir Stamenkov wrote:
    > Sun, 30 Jan 2011 15:30:44 -0800 (PST), /Andy/:
    >
    > > It's an xhtml file and begins with:
    >
    > > <html xmlns:v="urn:schemas-microsoft-com:vml"
    > >       xmlns:o="urn:schemas-microsoft-com:office:office"
    > >       xmlns:w="urn:schemas-microsoft-com:office:word"
    > >       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    > >       xmlns:xlink="http://www.w3.org/1999/xlink"
    > >       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
    > >       content="text/html; charset=utf-8"/>

    >
    > It is not XHTML, it's an MS Office output for HTML which happens to
    > be some sort of XML.
    >


    Let me tell you about my bigger problem. I have thousands of EPUBs, which
    are wrappers for chapters with an XHTML content type. The EPUB format
    requires each chapter to be XHTML. My requirement is to convert them to
    HTML. One problem I've seen is a DOCTYPE prelude to the chapter that
    defines entity substitutions local to that chapter. Xalan + XSLT with the
    simple stylesheet I posted in the first question resolves those entities
    to HTML entities successfully.

    But the MS Office generated EPUB document made me worried that the
    creator of an EPUB chapter (XHTML subdocument) could, in a variety of
    ways, embed references to schemas other than just the XHTML schema,
    and still expect Firefox to resolve all those dependencies in its
    parser.

    I.e., the schemas referenced in the XHTML chapter might define
    entities. Is there an XSLT stylesheet that would tell Xalan to
    resolve all the entities in externally referenced schemas other than
    the XHTML schema itself?

    The second question concerns this MS Office generated EPUB. There
    are schemas referenced that probably define the structure of the o:, v:
    and m: tags. What would Firefox's parser do with such a tag if I told
    Firefox that the page content type was "application/xhtml+xml"? And
    is there a simple stylesheet (one that doesn't special-case every external
    schema tag definition) that will resolve each XHTML page to HTML (via
    the Xalan XSLT interpreter) the same way Firefox does?

    Andy
    Andy, Jan 31, 2011
    #5
  6. Andy

    Andy Guest

    On Jan 31, 4:06 am, Martin Honnen wrote:
    > Andy wrote:
    > > However, I'm more concerned with another document:

    >
    > > It's an xhtml file and begins with:
    >
    > > <html xmlns:v="urn:schemas-microsoft-com:vml"
    > >       xmlns:o="urn:schemas-microsoft-com:office:office"
    > >       xmlns:w="urn:schemas-microsoft-com:office:word"
    > >       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    > >       xmlns:xlink="http://www.w3.org/1999/xlink"
    > >       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
    > >       content="text/html; charset=utf-8"/>

    >
    > > My concern is that xalan resolve all dependencies in such an xhtml
    > > file on the schemas referenced in the html tag.

    >
    > > Will it???

    >
    > There are namespace declarations in that document. An XML parser does
    > not resolve the URLs in namespace declarations.
    > Schemas are not referenced, that would be done with
    >    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    >    xsi:schemaLocation="urn:schemas-microsoft-com:vml http://example.com/someschema.xsd"
    >
    > --


    So you are saying that for this particular document, I can safely
    strip the declarations from the html tag?

    What I noticed is that when I display the document in Firefox, even
    though it is not large, it takes 10 to 15 seconds to load the
    page, which made me think that Firefox was going out to the Microsoft
    site and parsing those external schemas. For what purpose, if the
    "schemas are not referenced"?


    >
    >         Martin Honnen
    >        http://msmvps.com/blogs/martin_honnen/
    Andy, Jan 31, 2011
    #6
  7. Andy wrote:

    >>> <html xmlns:v="urn:schemas-microsoft-com:vml"
    >>>       xmlns:o="urn:schemas-microsoft-com:office:office"
    >>>       xmlns:w="urn:schemas-microsoft-com:office:word"
    >>>       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    >>>       xmlns:xlink="http://www.w3.org/1999/xlink"
    >>>       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
    >>>       content="text/html; charset=utf-8"/>

    >>
    >>> My concern is that xalan resolve all dependencies in such an xhtml
    >>> file on the schemas referenced in the html tag.

    >>
    >>> Will it???

    >>
    >> There are namespace declarations in that document. An XML parser does
    >> not resolve the URLs in namespace declarations.
    >> Schemas are not referenced, that would be done with
    >> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    >> xsi:schemaLocation="urn:schemas-microsoft-com:vml http://example.com/someschema.xsd"
    >>
    >> --

    >
    > So you are saying that for this particular document, I can safely
    > strip the declarations from the html tag?


    I can't assess that; removing namespace declarations is only safe as
    long as there are no elements or attributes in those namespaces. And you
    have not shown any contents of that document, just the root element with
    the namespace declarations.

    > What I noticed is that when I display the document in firefox, even if
    > it does not have much size, it takes 10 to 15 seconds to load the
    > page, which made me think that firefox was going out to the microsoft
    > site and parsing those external schemas. For what purpose if the
    > "schemas are not referenced"?


    Well, the document does not reference any schemas; it simply uses XML
    namespace declarations. An XML parser simply recognizes elements and
    attributes based on their namespace, i.e. if you send application/xml or
    text/xml or application/xhtml+xml (or other MIME types that trigger XML
    parsing) to a browser, then it only renders an HTML link if it finds an
    'a' element in the XHTML namespace http://www.w3.org/1999/xhtml that has
    an 'href' attribute. An 'a' element in no namespace does not have any
    meaning as a link.
    And the XML parser in Firefox is not even schema aware (it is Expat, I
    think), so even if you referenced a schema with xsi:schemaLocation, it
    wouldn't matter to Firefox.
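
    To illustrate the point about namespaces deciding what an element
    means, here is a small hand-written document (the URL and text are
    placeholders). Served with an XML MIME type, only the first 'a' element
    would be expected to behave as a link, because the second one has been
    moved out of the XHTML namespace with xmlns="".

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Namespace example</title></head>
      <body>
        <!-- 'a' in the XHTML namespace: recognized as a hyperlink -->
        <p><a href="http://example.com/">a link</a></p>
        <!-- 'a' in no namespace: just an unknown element with some text -->
        <p><a xmlns="" href="http://example.com/">not a link</a></p>
      </body>
    </html>
    ```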

    I don't know why it took that long to load and render your document but
    it is certainly not because of namespace declarations.


    --

    Martin Honnen
    http://msmvps.com/blogs/martin_honnen/
    Martin Honnen, Jan 31, 2011
    #7
  8. Peter Flynn

    Peter Flynn Guest

    On 31/01/11 14:33, Andy wrote:
    [...]
    > Let me tell you about my bigger problem. I have 1000s of epubs, which
    > are wrappers for xhtml content type chapters. The requirement is that
    > each of the chapters be xhtml, which is enforced by the epub format.
    > My requirement is that I convert them to html. One problem I've seen
    > is a DOCTYPE prelude to the chapter that defines entity substitutions
    > local to that chapter. xalan + xslt with the simple stylesheet I
    > posted in the first question resolves those entities to html entities
    > successfully.
    >
    > But the MS Office generated epub document made me worried that in a
    > variety of ways, the creator of an epub chapter (xhtml subdocument)
    > could embed references to schemas other than just the xhtml schema,
    > and still expect firefox to resolve all those dependencies in its
    > parser.


    AFAIK FF does not pay any attention to resolving namespace URIs (it's
    not required by XML in any case: they merely have to be present; actual
    schema locations can be specified separately with the xsi:schemaLocation
    attribute). FF doesn't even resolve DTD references, FFS :)

    In any case, if you are stripping off all this gunk and making plain ol'
    HTML, there won't be any namespaces for a browser to resolve...

    > I.e. The schemas referenced in the xhtml chapter might define
    > entities, there.


    Schemas can't declare entities. Only DTDs can do that.
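
    For comparison, this is roughly what a chapter-local entity declared in
    the DOCTYPE's internal subset looks like. The entity name and its value
    are made up for illustration. Any non-validating XML parser, including
    the one behind an XSLT processor, expands such an entity while parsing,
    so the stylesheet only ever sees the replacement text.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html [
      <!-- hypothetical entity, declared locally for this chapter only -->
      <!ENTITY pubname "Example Press">
    ]>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>Chapter 1</title></head>
      <body>
        <p>Published by &pubname;.</p>
      </body>
    </html>
    ```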

    > The second question is like with this MS Office generated epub. There
    > are schemas referenced that probably define the structure of o: and v:
    > and m: tags. What would Firefox's parser do with such a tag if I told
    > Firefox that the page content type was "application/xhtml+xml"?


    Probably ignore it, but why not try it and see?
    I thought you were generating HTML from these epubs, not XHTML.

    > is there a simple stylesheet (that doesn't special case every external
    > schema tag definition) that will resolve each xhtml page to html (via
    > xalan xslt interpreter) the same way firefox does?


    Tidy.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Feb 3, 2011
    #8
  9. Peter Flynn

    Peter Flynn Guest

    On 30/01/11 23:30, Andy wrote:
    [...]
    > It's an xhtml file and begins with:
    >
    > <html xmlns:v="urn:schemas-microsoft-com:vml"
    >       xmlns:o="urn:schemas-microsoft-com:office:office"
    >       xmlns:w="urn:schemas-microsoft-com:office:word"
    >       xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    >       xmlns:xlink="http://www.w3.org/1999/xlink"
    >       xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type"
    >       content="text/html; charset=utf-8"/>


    That's Word's "Save As...HTML". It's a horrendous kludge, and even omits
    the title element, the one text-bearing element in HTML that is actually
    compulsory :)

    > My concern is that xalan resolve all dependencies in such an xhtml
    > file on the schemas referenced in the html tag.
    >
    > Will it???


    No, and I'm not clear why you'd want to do that. There are no
    dependencies there unless you are a copy of word.exe :)

    Stanimir's suggestion of HTML Tidy is worth following. This needs the
    bogus o:p elements replacing (I suggest span); you can then clean out
    the rest of the rubbish with the -c and -n options:

    $ sed -e "s+o:p>+span>+g" foo.htm | tidy -c -n -asxml - >foo.xhtml
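
    If the file is already well-formed XML, the o:p replacement could also
    be done in XSLT instead of sed. This is only a sketch under that
    assumption: the class name "office-p" is invented for illustration,
    and the other namespace declarations on the html element are left in
    place because xsl:copy carries them over.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:o="urn:schemas-microsoft-com:office:office"
        xmlns="http://www.w3.org/1999/xhtml"
        exclude-result-prefixes="o">

      <!-- Identity rule: copy every node that has no more specific template -->
      <xsl:template match="@* | node()">
        <xsl:copy>
          <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- Replace each o:p with an XHTML span, keeping attributes and content -->
      <xsl:template match="o:p">
        <span class="office-p">
          <xsl:apply-templates select="@* | node()"/>
        </span>
      </xsl:template>
    </xsl:stylesheet>
    ```

    Unlike the sed substitution, this also handles empty <o:p/> elements
    and o:p start tags that carry attributes.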

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Feb 3, 2011
    #9
  10. Peter Flynn

    Peter Flynn Guest

    On 31/01/11 14:39, Andy wrote:
    [...]
    > So you are saying that for this particular document, I can safely
    > strip the declarations from the html tag?


    Unless there are any private element types embedded in there which have
    the same names (modulo the namespace) as XHTML element types but
    different semantics.

    You could write a little XSLT script to pass over the document and check
    for that. <o:p> is a good example, as it is a p element type in an o
    namespace, yet it gets embedded in a span inside an HTML p element type.
    As I suggested earlier, you can trivially convert those to a span, with
    a specific class if you wanted to.
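
    One possible shape for such a checking script, as a sketch rather than
    a finished tool: it prints the name and namespace URI of every element
    outside the XHTML namespace, and of every attribute that is in a
    namespace. Names such as o:p, whose local part matches an XHTML element,
    then have to be picked out of the listing by eye.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:output method="text" encoding="UTF-8"/>

      <xsl:template match="/">
        <!-- every element whose namespace is not the XHTML namespace -->
        <xsl:for-each select="//*[namespace-uri() != 'http://www.w3.org/1999/xhtml']">
          <xsl:value-of select="name()"/>
          <xsl:text> {</xsl:text>
          <xsl:value-of select="namespace-uri()"/>
          <xsl:text>}&#10;</xsl:text>
        </xsl:for-each>
        <!-- every attribute that is in some namespace (xml:lang included) -->
        <xsl:for-each select="//@*[namespace-uri() != '']">
          <xsl:text>@</xsl:text>
          <xsl:value-of select="name()"/>
          <xsl:text> {</xsl:text>
          <xsl:value-of select="namespace-uri()"/>
          <xsl:text>}&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
    ```

    An empty result would suggest that the extra xmlns declarations are
    unused and can be dropped without changing the document's meaning.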

    > What I noticed is that when I display the document in firefox, even if
    > it does not have much size, it takes 10 to 15 seconds to load the
    > page, which made me think that firefox was going out to the microsoft
    > site and parsing those external schemas.


    No, FF is probably parsing the XML and then converting it to its
    internal HTML-based rendering model, so it's doing twice as much work as
    it does when loading plain HTML.

    > For what purpose if the "schemas are not referenced"?


    I think you may be confusing schemas with namespaces.

    Schemas are for guiding the formation of a document, and for providing a
    validating parser with a "reference map" of possible element type
    locations and node structures. Their principal use in rendering is --
    like DTDs -- to provide information about default attribute values; and
    these are minimal in HTML anyway.

    Namespaces are a way of identifying and disambiguating element and
    attribute types which have the same name but come from different
    backgrounds or have different semantics. This lets you embed (for
    example) MathML in DocBook without <arg> in MathML being confused with
    <arg> in DocBook; you also see this in XSLT, if you want to output
    MathML: <xsl:otherwise> cannot be confused with <m:otherwise>.
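
    A small sketch of the XSLT case: both elements have the local name
    "otherwise", but the prefix ties one to the XSLT namespace (an
    instruction the processor executes) and the other to the MathML
    namespace (an element written to the output). The input element "case"
    and its "when" attribute are hypothetical, and the generated MathML is
    only schematic.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:m="http://www.w3.org/1998/Math/MathML">

      <!-- "case" is a hypothetical input element with an optional "when" attribute -->
      <xsl:template match="case">
        <xsl:choose>
          <xsl:when test="@when">
            <m:piece>
              <m:ci><xsl:value-of select="."/></m:ci>
              <m:ci><xsl:value-of select="@when"/></m:ci>
            </m:piece>
          </xsl:when>
          <!-- XSLT namespace: the fallback branch of xsl:choose -->
          <xsl:otherwise>
            <!-- MathML namespace: an element emitted into the result -->
            <m:otherwise>
              <m:ci><xsl:value-of select="."/></m:ci>
            </m:otherwise>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:template>
    </xsl:stylesheet>
    ```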

    Unfortunately, some document type designers think you're not
    well-dressed unless you obfuscate everything with vast namespaces. They
    have their place, and can be very useful, but they are often abused as a
    substitute for rigorous document type analysis.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Feb 3, 2011
    #10