whitespace in element content

Wolfgang Jeltsch · Oct 31, 2004

Hello,

It is often convenient to insert whitespace into an XML document in order to
format it nicely. For example, take this snippet of a notional DocBook XML
document:

<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>

I want the whitespace in this snippet of code to be handled as follows:

(1) The whitespace between "<para>" and "This" as well as the whitespace
between "sentence." and "</para>" shall be discarded.

(2) Each other sequence of adjacent whitespace characters shall be
transformed into a single space character.

But how do XML processors and applications deal with this issue?

In section 2.10 of "Extensible Markup Language (XML) 1.0 (Third Edition)",
one can read:

In editing XML documents, it is often convenient to use "white
space" (spaces, tabs, and blank lines) to set apart the markup for
greater readability. Such white space is typically not intended for
inclusion in the delivered version of the document.

But who decides which whitespace shall be considered as whitespace that is
just used to set apart the markup? And is whitespace just used to indent
lines of text also not intended for inclusion in the delivered version?
What is this "delivered version" of the document?

I'd be thankful for any clarification.

Best wishes,
Wolfgang

Richard Tobin · Oct 31, 2004

Wolfgang Jeltsch said:
But who decides which whitespace shall be considered as whitespace that is
just used to set apart the markup? And is whitespace just used to indent
lines of text also not intended for inclusion in the delivered version?
What is this "delivered version" of the document?

As far as the XML spec is concerned, deciding which whitespace is
significant or not is a job for the application, which really means
"everything except the parser". A conformant parser must give all the
whitespace to the application, which can then decide what to do with
it.
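
A minimal sketch of what "all the whitespace" means in practice, assuming
Python's standard xml.etree.ElementTree module as the parser (the choice of
library and the variable names are illustrative assumptions, not something
named in this thread). It parses the snippet from the question and shows that
the line breaks arrive in the application untouched:

```python
import xml.etree.ElementTree as ET

# The snippet from the original question, line breaks included.
doc = """<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>"""

para = ET.fromstring(doc)
word = para.find("wordasword")

# Text before the first child element, leading newline included.
print(repr(para.text))
# Text after the child element, inner and trailing newlines included.
print(repr(word.tail))
```

Whatever is done with those newlines afterwards is up to the code that
consumes the tree.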

Of course, there may be other standard programs or libraries layered
on top of the XML parser which you might not consider to be the
application. XSLT for example allows you to specify that some
whitespace is to be stripped from its input. From the point of view
of the parser, XSLT is the application, but you may regard it as just
a library that you're using.

I want the whitespace in this snippet of code to be handled as follows:

(1) The whitespace between "<para>" and "This" as well as the whitespace
between "sentence." and "</para>" shall be discarded.

(2) Each other sequence of adjacent whitespace characters shall be
transformed into a single space character.

This is a fairly common form of whitespace normalization and often
goes under the name of "tokenization". For example, XML itself treats
tokenized attributes like this. Among other things, you could use an
XML Schema processor to do this normalization.
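
One way the application could do this itself is sketched below, again
assuming Python's xml.etree.ElementTree. The helper names collapse and
normalize_para are hypothetical. The sketch collapses whitespace inside each
text node separately, so a run of whitespace that is split by a tag is not
merged across that tag:

```python
import re
import xml.etree.ElementTree as ET

# The four characters XML treats as white space.
_WS = re.compile(r"[ \t\r\n]+")

def collapse(text):
    """Rule (2): replace each run of whitespace with one space."""
    return _WS.sub(" ", text or "")

def normalize_para(elem):
    """Apply rules (1) and (2) to the mixed content of elem, in place."""
    for node in elem.iter():
        node.text = collapse(node.text)
        if node is not elem:
            node.tail = collapse(node.tail)
    # Rule (1): drop whitespace right after the start tag ...
    elem.text = elem.text.lstrip(" ")
    # ... and right before the end tag.
    children = list(elem)
    if children:
        children[-1].tail = children[-1].tail.rstrip(" ")
    else:
        elem.text = elem.text.rstrip(" ")

doc = """<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>"""

para = ET.fromstring(doc)
normalize_para(para)
result = ET.tostring(para, encoding="unicode")
print(result)
```

The space before "<wordasword>" and the one after "</wordasword>" are kept,
because only the very start and the very end of the element's content are
trimmed.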

-- Richard

cr88192 · Nov 1, 2004

Wolfgang Jeltsch said:
But how do XML processors and applications deal with this issue?

In section 2.10 of "Extensible Markup Language (XML) 1.0 (Third Edition)",
one can read:

In editing XML documents, it is often convenient to use "white
space" (spaces, tabs, and blank lines) to set apart the markup for
greater readability. Such white space is typically not intended for
inclusion in the delivered version of the document.

But who decides which whitespace shall be considered as whitespace that is
just used to set apart the markup? And is whitespace just used to indent
lines of text also not intended for inclusion in the delivered version?
What is this "delivered version" of the document?

I'd be thankful for any clarification.

My parser uses what could be called the "newline whitespace assertion",
namely:
any initial whitespace is ignored;
a newline and any whitespace following it are replaced with a single space
(unless it is at the end of the text).

<foo>Hello World
Again</foo>

is parsed as:
<foo>Hello World Again</foo>
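
A rough sketch of that rule in Python, as one possible reading of the
description above and not the actual code of the parser in question. The
function name newline_rule is hypothetical, and the sketch assumes that a
newline run at the very end of the text is simply dropped:

```python
import re

def newline_rule(text):
    """Apply the newline-based whitespace rule to one run of character data."""
    # Any initial whitespace is ignored.
    text = text.lstrip(" \t\r\n")
    # A newline run at the end of the text is dropped (assumption).
    text = re.sub(r"\r?\n[ \t\r\n]*\Z", "", text)
    # Every other newline, plus the whitespace after it, becomes one space.
    return re.sub(r"\r?\n[ \t\r\n]*", " ", text)

print(newline_rule("Hello World\n    Again"))
```

Unlike the normalization discussed earlier in the thread, this leaves runs of
spaces that do not involve a newline as they are.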
