Re: Space between ending and starting tag not ignorable in an XML document? ...

Discussion in 'XML' started by Joe Kesselman, Jul 12, 2011.

  1. On 7/11/2011 6:15 PM, lbrt chx _ gemale kom wrote:
    > What do you call the carriage return and the four running spaces
    > after the ending "</sitename>" and before the starting "<base>" if you
    > know this is not an XHTML document? "inline text" anyway?

    Text that happens to be whitespace. If the parent element was declared
    in the DTD as only allowing element content, you could call it
    whitespace-in-element-content; see answers to the immediately preceding
    question about "ignorable" whitespace.
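
To make "whitespace in element content" concrete, here is a minimal sketch using the standard JAXP DOM API. The `siteinfo` wrapper element, the DTD, the sample values and the class name are all assumptions made up for this illustration; only `sitename` and `base` appear in the question. When the DTD says `siteinfo` may contain only elements, a validating parser can be asked to drop the whitespace between them; without a DTD the parser has no basis for doing so.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ElementContentWhitespace {
    // Hypothetical document body; the <siteinfo> wrapper is an assumption.
    static final String BODY =
        "<siteinfo>\n    <sitename>Wikipedia</sitename>\n"
        + "    <base>http://example.org/</base>\n</siteinfo>";

    // Hypothetical DTD declaring <siteinfo> as element-only content.
    static final String DTD =
        "<!DOCTYPE siteinfo [\n"
        + "<!ELEMENT siteinfo (sitename, base)>\n"
        + "<!ELEMENT sitename (#PCDATA)>\n"
        + "<!ELEMENT base (#PCDATA)>\n"
        + "]>\n";

    // Counts the child nodes of the root element after parsing.
    static int countRootChildren(String xml, boolean validate) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The parser needs the content model, so validation is switched on
        // whenever a DTD is supplied.
        factory.setValidating(validate);
        factory.setIgnoringElementContentWhitespace(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("with DTD:    " + countRootChildren(DTD + BODY, true));
        System.out.println("without DTD: " + countRootChildren(BODY, false));
    }
}
```

With the DTD, only the two element children should remain; without it, the three whitespace text nodes are kept alongside them, because nothing tells the parser they are insignificant.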

    > Then there is something I don't get: What would be the XPath
    > declaration to address the text between an ending and another
    > starting tag?

    Don't think in terms of tags. Think in terms of tree nodes. This is a
    text node whose immediately preceding sibling is the element that has
    just been ended. There are several possible ways of writing that,
    depending on how much you know about the possible structure of the
    document and how readable (versus efficient) you want the path to be.
    For example,
    is a pretty direct translation of "the first node following the sitename
    element, but only if it is a text node".
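
One expression that fits that description is `sitename/following-sibling::node()[1][self::text()]`. The sketch below evaluates it with the standard `javax.xml.xpath` API. The expression as written here, the `siteinfo` wrapper element, the sample values and the class name are assumptions for illustration, not taken from the poster's actual document.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class FollowingTextNode {
    // Hypothetical sample document; only <sitename> and <base> come from the thread.
    static final String XML =
        "<siteinfo>\n    <sitename>Wikipedia</sitename>\n"
        + "    <base>http://example.org/</base>\n</siteinfo>";

    // "The first node following sitename, but only if it is a text node."
    static final String PATH =
        "/siteinfo/sitename/following-sibling::node()[1][self::text()]";

    // Returns the text of that node, or null if the next sibling is not text.
    static String textAfterSitename(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node node = (Node) xpath.evaluate(
                PATH, new InputSource(new StringReader(xml)), XPathConstants.NODE);
        return node == null ? null : node.getNodeValue();
    }

    public static void main(String[] args) throws Exception {
        String text = textAfterSitename(XML);
        // Print the length rather than the raw whitespace so it is visible.
        System.out.println(text == null ? "no text node" : "text length: " + text.length());
    }
}
```

The `[1]` picks the nearest following sibling of any kind, and `[self::text()]` then keeps it only if it is a text node, so the path selects nothing when `<base>` immediately follows `</sitename>`.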

    > ~ How is it that, knowing this is XML, this sequence of
    > characters could be relevant?

    If the parent element supports mixed content, the whitespace may be a
    meaningful part of its content. Only the application knows that for
    sure; the parser can't assume, and can't even reliably hint.

    > ~ As far as I know XMLReaders (I mostly code Java) can read into the
    > actual document and get the encoding and readjust their own encoding,
    > so I thought that they could also notice if they are processing XML
    > or XHTML and do their thing internally

    XHTML *is* XML -- it's just a particular XML-based language. There's
    nothing inherently different about XHTML versus other XML, at the XML
    processing level. The things which distinguish it, outside of using a
    specific namespace, are all semantics -- and semantics are implemented
    by the application, not the parser.
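
As a small illustration of that point, the sketch below parses an XHTML fragment with the same namespace-aware JAXP parser one would use for any other XML. The sample markup and the class name are assumptions. All the parser reports is an ordinary element that happens to carry the XHTML namespace URI; deciding what that means is left to the calling code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XhtmlIsXml {
    // Returns the namespace URI of the root element, or null if it has none.
    static String rootNamespace(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical XHTML fragment.
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<body><p>hi</p></body></html>";
        System.out.println(rootNamespace(xhtml));
        System.out.println(rootNamespace("<siteinfo/>"));
    }
}
```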

    > Inline00Test ~ There should be some way to let xerces know I only
    > want the characters within the starting and corresponding closing tag

    Write your SAX handler, or other layers of your application, to track
    the context and discard anything it doesn't consider semantically
    important. Sorry, but the definition of XML and the XML APIs really do
    make this the application's responsibility.
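
A minimal sketch of such a context-tracking SAX handler follows. It assumes one particular policy, namely "keep text only for elements that have no child elements", which is just one guess at what "semantically important" means here; the class name, the `siteinfo` wrapper and the sample values are likewise hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class LeafTextHandler extends DefaultHandler {
    // Text collected per leaf element name, in document order.
    final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder buffer = new StringBuilder();
    // True while the most recent tag seen was a start tag.
    private boolean leaf;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Anything buffered so far sat between two tags of the parent: drop it.
        buffer.setLength(0);
        leaf = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so accumulate.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (leaf) {
            // No child element intervened, so this is start-tag-to-end-tag text.
            values.put(qName, buffer.toString());
        }
        leaf = false;
        buffer.setLength(0);
    }

    static Map<String, String> leafText(String xml) throws Exception {
        LeafTextHandler handler = new LeafTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        System.out.println(leafText(
            "<siteinfo>\n    <sitename>Wikipedia</sitename>\n"
            + "    <base>http://example.org/</base>\n</siteinfo>"));
    }
}
```

Note that this policy also throws away real text in mixed content, which is exactly why the decision has to live in the application rather than in the parser.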

    Or, if you prefer, preprocess the document through an application which
    will filter the document according to a set of rules you've defined. For
    example, you could use XSLT to do so, writing a stylesheet which
    recognizes and does not copy those text nodes. Massive overkill for this
    simple a filtering task, but it would save you having to write your own
    filtering at the Java level... and most XSLT processors can produce SAX
    output which you could feed directly into your application's logic.
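
For the XSLT route, a sketch along these lines might do, using the `javax.xml.transform` API that ships with the JDK. The stylesheet is an assumption about what the filtering rules could look like: an identity transform plus one empty template that swallows whitespace-only text nodes. The class name and sample document are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StripWhitespaceXslt {
    // Identity transform, except that whitespace-only text nodes are not copied.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='text()[not(normalize-space())]'/>"
        + "</xsl:stylesheet>";

    static String strip(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        System.out.println(strip(
            "<siteinfo>\n    <sitename>Wikipedia</sitename>\n"
            + "    <base>http://example.org/</base>\n</siteinfo>"));
    }
}
```

To feed the filtered events straight into application logic instead of serializing them, the `StreamResult` can be swapped for a `javax.xml.transform.sax.SAXResult` wrapping a `ContentHandler`. As with any blanket rule, this stylesheet would also remove whitespace-only text inside mixed content, so a real one would probably need to be more selective.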

    Joe Kesselman,

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Jul 12, 2011