How to differentiate between <XX></XX> and <XX/> with SAX

Discussion in 'XML' started by dpj5754@yahoo.fr, Jul 26, 2004.

  1. Guest

    Is there a simple and deterministic way to tell the difference
    between these two sequences:

    <XX></XX>

    and

    <XX/>

    The EndElement callback does not provide this information.

    Thanks,
    Pascal.
     
    , Jul 26, 2004
    #1

  2. Rolf Magnus Guest

    wrote:

    > Is there a simple and deterministic way to tell the difference
    > between these two sequences:
    >
    > <XX></XX>
    >
    > and
    >
    > <XX/>


    No. Their meaning is exactly the same. Why do you think you need that?

    > The EndElement callback does not provide this information.
     
    Rolf Magnus, Jul 26, 2004
    #2

  3. Rolf Magnus wrote:

    > wrote:
    >
    >
    >>Is there a simple and deterministic way to tell the difference
    >>between these two sequences:
    >>
    >><XX></XX>
    >>
    >>and
    >>
    >><XX/>

    >
    >
    > No. Their meaning is exactly the same. Why do you think you need that?


    Doesn't the first sample have an empty text() node as its first child,
    while the second doesn't?

    Franck,e-

    >
    >
    >>The EndElement callback does not provide this information.

    >
    >
    >
     
    Franck Guillaud, Jul 26, 2004
    #3
  4. In article <4104fbf8$0$15283$>,
    Franck Guillaud <> wrote:

    >>><XX></XX>
    >>>
    >>><XX/>


    > Doesn't the first sample have an empty text() node as its first child,
    > while the second doesn't?


    No.

    (XML itself doesn't define any such thing as a "text node". The
    Infoset has character information items, and there aren't any of them
    in either case. The XPath data model doesn't have a text node in either
    case, and SAX parsers do not call the characters method.)
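
The SAX part of this can be checked directly. The following is a minimal sketch, assuming Python's standard xml.sax module (the thread does not name a particular parser); Recorder and sax_events are hypothetical names used only for illustration. It records every callback and compares the event sequences for the two forms.

```python
import xml.sax


class Recorder(xml.sax.ContentHandler):
    """Collects SAX callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Would be recorded if the parser reported any character data.
        self.events.append(("chars", content))


def sax_events(xml_bytes):
    """Parse xml_bytes and return the list of recorded SAX events."""
    recorder = Recorder()
    xml.sax.parseString(xml_bytes, recorder)
    return recorder.events


print(sax_events(b"<XX></XX>"))
print(sax_events(b"<XX/>"))
```

Both calls report one start event and one end event for XX and no characters event, so a handler written against this interface has nothing to tell the two forms apart.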

    -- Richard
     
    Richard Tobin, Jul 26, 2004
    #4
  5. Rolf Magnus wrote:

    >
    > No. Their meaning is exactly the same. Why do you think you need that?
    >


    OK, everywhere I read that they are the same.
    But this is only true for XML, not for HTML, and even if it were
    true for HTML, it would still not hold in practice because of the
    way browsers interpret it.

    What I need is to parse manually written HTML.
    In HTML, <BR/> is interpreted differently than <BR></BR>.

    So, I have two basic reasons to do this:

    - I need it: the parser must preserve the difference, because
    it must output the tags it does not process exactly as they were
    entered, so that the output is interpreted correctly.

    - Even if it were not needed for a technical reason, if the
    developer who wrote the HTML page decided on <XX/>, I
    prefer to output <XX/> rather than the other form, so that the
    developer can easily read the output of my program and does not have
    to wonder about some "strange" conversion.

    Summary:

    We do not live in a perfect world, with perfect standards perfectly
    implemented by perfect developers. So we need a "stable" way to
    tell the difference. I like standards very much (I have a networking
    background, you know: ISO, IETF, IEEE, ATM FORUM, FR FORUM, EIA, etc.),
    but I live in a non-standard world. I must adapt to survive :)

    Thanks for your help.
    Pascal.
     
    Pascal Dufour, Jul 26, 2004
    #5
  6. Pascal Dufour wrote:
    > Rolf Magnus wrote:
    >
    > >
    > > No. Their meaning is exactly the same. Why do you think you need that?
    > >

    >
    > OK, everywhere I read that they are the same.
    > But this is only true for XML, not for HTML, and even if it were
    > true for HTML, it would still not hold in practice because of the
    > way browsers interpret it.
    >
    > What I need is to parse manually written HTML.
    > In HTML, <BR/> is interpreted differently than <BR></BR>.


    You can't parse HTML with an XML parser. However, you can parse HTML
    with an SGML parser. Additionally, you can use a tool that converts
    HTML to XML on a best-effort basis, such as the CyberNeko HTML Parser:
    http://www.apache.org/~andyc/neko/doc/html/

    >
    > So, I have two basic reasons to do this:
    >
    > - I need it: the parser must preserve the difference, because
    > it must output the tags it does not process exactly as they were
    > entered, so that the output is interpreted correctly.


    There's something quite confusing here: you're talking about parsing as
    if it were outputting. These two processes are opposites: parsing gives
    access to a data model, and serializing (I prefer this term) renders
    this data model in XML character form (file, character stream...).

    You can't act on the XML data model, because it is governed by a set of
    stable specifications, but you can act on the serialization. For this
    purpose, formatter tools often provide a set of options that let you
    tune the output. You can also write your own formatter.
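
As one illustration of such a serializer option, here is a minimal sketch assuming Python's xml.etree.ElementTree (an assumption; the post does not name a tool). Its tostring function accepts a short_empty_elements flag that selects the output form for every empty element, regardless of how the input was written.

```python
import xml.etree.ElementTree as ET

# The parsed tree is identical whichever form the input used.
element = ET.fromstring("<XX/>")

# Default: empty elements are written with an empty-element tag.
short_form = ET.tostring(element, encoding="unicode")

# Option switched off: a start-tag immediately followed by an end-tag.
long_form = ET.tostring(element, encoding="unicode", short_empty_elements=False)

print(short_form)
print(long_form)
```

Note that the flag applies to the whole output, so it picks one convention globally. It does not reproduce a per-element choice made by the original author.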

    >
    > - Even if it were not needed for a technical reason, if the
    > developer who wrote the HTML page decided on <XX/>, I
    > prefer to output <XX/> rather than the other form, so that the
    > developer can easily read the output of my program and does not have
    > to wonder about some "strange" conversion.
    >
    > Summary:
    >
    > We do not live in a perfect world, with perfect standards perfectly
    > implemented by perfect developers. So we need a "stable" way to
    > tell the difference. I like standards very much (I have a networking
    > background, you know: ISO, IETF, IEEE, ATM FORUM, FR FORUM, EIA, etc.),
    > but I live in a non-standard world. I must adapt to survive :)
    >
    > Thanks for your help.
    > Pascal.
    >



    --
    Cordialement,

    ///
    (. .)
    -----ooO--(_)--Ooo-----
    | Philippe Poulard |
    -----------------------
     
    Philippe Poulard, Jul 27, 2004
    #6
  7. Pascal Dufour wrote:

    > OK, everywhere I read that they are the same.
    > But this is only true for XML, not for HTML, and even if it were
    > true for HTML, it would still not hold in practice because of the
    > way browsers interpret it.


    well for HTML (but this is after all an XML newsgroup) the situation is
    completely different.
    <BR/> and <BR></BR>
    are _both_ syntax errors ( /> is always a syntax error in HTML, and BR
    has no end tag as it is declared EMPTY in the HTML DTD, so </BR> is also
    an error)

    Of course a browser may or may not have some lax silent error recovery
    from either of these situations, but in any case the behaviour will be
    browser-specific.


    > - Even if it were not needed for a technical reason, if the
    > developer who wrote the HTML page decided on <XX/>, I
    > prefer to output <XX/> rather than the other form.


    So long as you are clearly writing HTML rather than XML there's nothing
    wrong with you doing that. XSLT, for example, when writing HTML cannot
    distinguish the inputs <BR/> and <BR></BR>, as the input is XML and
    these are the same, but in either case an "identity" transform will
    produce the HTML syntax
    <BR>
    if the html output method is being used (which it is by default if the
    top-level output element is <html>).
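
A comparable behaviour can be seen outside XSLT. This is a rough sketch assuming Python's xml.etree.ElementTree (an assumption; the post is about XSLT, not Python), whose "html" serialization method writes elements such as BR without an end tag whichever form the XML input used. The helper name to_html is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_html(xml_text: str) -> str:
    """Parse XML input and serialize it with ElementTree's HTML output method."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode", method="html")


# Both XML spellings of the empty BR element produce the same HTML output.
print(to_html("<P>a<BR/>b</P>"))
print(to_html("<P>a<BR></BR>b</P>"))
```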

    David
     
    David Carlisle, Jul 27, 2004
    #7
  8. Stefan Ram Guest

    David Carlisle <> writes:
    > XSLT for example, if writing html can not
    >distinguish the inputs of <BR/> and <BR></BR> as the input is XML and
    >these are the same,


    Actually, in XML, the notion "element" is not an abstract one,
    but a concrete non-terminal symbol of the syntax.

    Therefore, as elements, the element "<br/>" and the element
    "<br></br>" are two /different/ elements, just as "<br/>" also
    is a different element than "<br />".

    You might say, that they have the same element type, the same
    contents and the same number, names and value of attributes
    (here: none). Or, possibly, that they have the same
    "infoset", but the infoset specification is not part of the
    XML specification.
     
    Stefan Ram, Jul 27, 2004
    #8
  9. In article <>,
    David Carlisle <> wrote:

    % are _both_ syntax errors ( /> is always a syntax error in HTML, and BR

    Actually, it's not, although its meaning is not the same as in XML. <br />
    means the same as <br>>.
    --

    Patrick TJ McPhee
    East York Canada
     
    Patrick TJ McPhee, Jul 28, 2004
    #9

  10. > Actually, it's not, although its meaning is not the same as in XML. <br />
    > means the same as <br>>.


    Ooops sorry I was thinking that was turned off in HTML's SGML decl, but
    apparently not. Still (most:) of my point holds, in fact that means
    that the situation is worse than I indicated: if you rely on <br/>
    working in the browser after sending the file with an html mime type you
    are not just relying on lax error recovery, you are relying on
    non-conformant HTML parsing.


    David
     
    David Carlisle, Jul 28, 2004
    #10
  11. Philippe Poulard wrote:

    >> So, I have two basic reasons to do this:
    >>
    >> - I need it: the parser must preserve the difference, because
    >> it must output the tags it does not process exactly as they were
    >> entered, so that the output is interpreted correctly.

    >
    >
    > There's something quite confusing here: you're talking about parsing as
    > if it were outputting. These two processes are opposites: parsing gives
    > access to a data model, and serializing (I prefer this term) renders
    > this data model in XML character form (file, character stream...).
    >


    No, what I meant is:

    A - Developers write a file formatted in a certain way
    B - I parse the file and create an in-memory representation (a tree)
    C - I process the tree and replace certain tags with some data
    D - I put the tree back into textual form.

    If during B I lose the information about the format used in phase A, I
    cannot reproduce it in D.

    So if

    - In A I have <XX/>
    - In B I do not know whether it was <XX/> or <XX></XX>
    - C ... doesn't play a role in this discussion
    - In D I must make an arbitrary choice for the output (serialisation).

    This is true even if we forget HTML.
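
One possible workaround for step B, sketched under stated assumptions: it is not a SAX feature, and it only works when the raw input is still available. The sketch assumes Python's xml.parsers.expat, whose CurrentByteIndex attribute gives the input offset of the event being reported. In the start-element callback that is the offset of the opening '<', so the raw tag can be inspected to see whether it ends in '/>'. The names tag_is_self_closing and element_forms are hypothetical, and elements coming from entity expansion are not handled.

```python
import xml.parsers.expat


def tag_is_self_closing(source: bytes, start: int) -> bool:
    """Scan the raw tag beginning at source[start]; True if it ends in '/>'."""
    quote = None
    for i in range(start, len(source)):
        ch = source[i:i + 1]
        if quote is not None:
            if ch == quote:
                quote = None      # closing quote of an attribute value
        elif ch in (b'"', b"'"):
            quote = ch            # opening quote of an attribute value
        elif ch == b">":
            return source[i - 1:i] == b"/"
    return False


def element_forms(source: bytes):
    """Return (name, is_self_closing) for every element, in document order."""
    forms = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # During this callback CurrentByteIndex is the offset of the tag's '<'.
        forms.append((name, tag_is_self_closing(source, parser.CurrentByteIndex)))

    parser.StartElementHandler = on_start
    parser.Parse(source, True)
    return forms


print(element_forms(b"<r><XX/><XX></XX><YY a='x>y'/></r>"))
```

The recorded flag could be stored on each tree node in step B and consulted in step D. The quote tracking is there because '>' may legally appear inside an attribute value.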
     
    Pascal Dufour, Jul 28, 2004
    #11
  12. "Stefan Ram" <-berlin.de> wrote in message
    news:-berlin.de...
    > Actually, in XML, the notion "element" is not an abstract one,
    > but a concrete non-terminal symbol of the syntax.


    Mr. Ram, I think your conclusions regarding the term "element" are
    inconsistent with usage throughout the XML spec.

    > Therefore, as elements, the element "<br/>" and the element
    > "<br></br>" are two /different/ elements, [snip]


    What do you mean by different? Do you mean the two forms denote elements
    that have different structures? This is the notion of "different" that is
    at the heart of the preceding discussion. The definition of "element"
    clearly shows that "<br/>" and "<br></br>" are two forms that each denote a
    single (empty) element. This equivalence is stated explicitly:

    "The representation of an empty element is either a start-tag immediately
    followed by an end-tag, or an empty-element tag."

    > just as "<br/>" also
    > is a different element than "<br />".


    Per the spec, whitespace between the element name and trailing slash is not
    significant:

    [3] S ::= (#x20 | #x9 | #xD | #xA)+
    [5] Name ::= (Letter | '_' | ':') (NameChar)*
    [44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'

    /kmc

    Reference: "Extensible Markup Language (XML) 1.0 (Third Edition)"
    http://www.w3.org/TR/REC-xml
     
    Keith M. Corbett, Aug 10, 2004
    #12
  13. Stefan Ram Guest

    "Keith M. Corbett" <> writes:
    >> Therefore, as elements, the element "<br/>" and the element
    >> "<br></br>" are two /different/ elements, [snip]

    >What do you mean by different?


    Extensionally different, that is:

    An element is a certain sequence of Unicode characters. This
    is specified by the XML-syntax (BNF).

    Two elements differ if they differ as such sequences.

    > Do you mean the two forms denote elements
    >that have different structures?


    In XML, you do not /denote/ elements. You /write/ elements.

    An element is the actual sequence of characters, e.g., "<X/>"
    /is/ an element, it does not /denote/ an element. (An element
    might /denote/ something, like a book or a notion - depending
    on the XML application.)

    Let me use a comparison to make the notions clear: In C, the
    literal "02" and the literal "002" are two /different/ literals,
    even though they /denote/ the same value. The XML elements
    are like those literals, not like the values.


    > This is the notion of "different" that is
    >at the heart of the preceding discussion. The definition of "element"
    >clearly shows that "<br/>" and "<br></br>" are two forms that each denote a
    >single (empty) element.


    These are indeed two forms. However, they do not /denote/
    (empty) elements, they /are/ empty elements. (Just as "2" in C
    does not /denote/ a literal, but /is/ a literal [and /denotes/
    a value].)

    > This equivalence is stated explicitly:
    >"The representation of an empty element is either a start-tag immediately
    >followed by an end-tag, or an empty-element tag."


    Both /are/ elements according to the definition:

    [39] element ::= EmptyElemTag | STag content ETag

    The part you quoted seems to intend to state that, e.g.,
    both "<X></X>" and "<X/>" are empty elements. It does not
    say, that they are the same element.

    >> just as "<br/>" also
    >> is a different element than "<br />".

    >Per the spec, whitespace between the element name and trailing slash is not
    >significant:
    > [3] S ::= (#x20 | #x9 | #xD | #xA)+
    > [5] Name ::= (Letter | '_' | ':') (NameChar)*
    > [44] EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'


    What you quote here defines "S" (white space) and then
    uses this definition to define "EmptyElemTag". It does not
    state that this is "insignificant". (An XML application
    might choose to consider it insignificant, which
    surely nearly all XML applications do.)
     
    Stefan Ram, Aug 11, 2004
    #13
  14. In article <-berlin.de>,
    Stefan Ram <-berlin.de> wrote:

    > Let me use a comparison to make the notions clear: In C, the
    > literal "02" and the literal "002" are two /different/ literals,
    > even though they /denote/ the same value. The XML elements
    > are like those literals, not like the values.


    The XML spec itself does not have this distinction. It describes the
    syntax, and in a few places says that things are insignificant, though
    it does not attempt to be exhaustive about this. For example, it
    says that the order of attributes is insignificant, but does not
    mention whether the order of elements is significant.

    The <x/> and <x></x> forms are syntactically different, and it may be
    useful for some applications to preserve this difference for human
    convenience, but applications are intended to treat them as
    semantically equivalent, and this is explicit for applications layered
    on the Infoset, which does not distinguish between the two syntactic
    forms.
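
A small illustration of "syntactically different, semantically equivalent", as a sketch assuming Python 3.8 or later, where xml.etree.ElementTree provides canonicalize (an implementation of Canonical XML 2.0). The three spellings are different character sequences, yet canonicalization maps them all to the same output.

```python
import xml.etree.ElementTree as ET

# Three syntactically different ways to write the same empty element.
spellings = ["<x/>", "<x />", "<x></x>"]

# As plain strings they are all distinct...
distinct_strings = set(spellings)

# ...but their canonical forms collapse to a single value.
canonical_forms = {ET.canonicalize(text) for text in spellings}

print(len(distinct_strings), canonical_forms)
```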

    That seems to be enough: I don't see any point discussing whether they
    are "the same element", or whether <x/> is an element or denotes one.
    I'm sure you could come up with a consistent story either way, and
    neither would tell us anything we don't already know.

    -- Richard
     
    Richard Tobin, Aug 11, 2004
    #14
