whitespace in element content

Wolfgang Jeltsch

Hello,

It is often convenient to insert whitespace into an XML document in order to
format it nicely. For example, take this snippet of a notional DocBook XML
document:

<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>

I want the whitespace in this snippet of code to be handled as follows:

(1) The whitespace between "<para>" and "This" as well as the whitespace
between "sentence." and "</para>" shall be discarded.

(2) Each other sequence of adjacent whitespace characters shall be
transformed into a single space character.

But how do XML processors and applications deal with this issue?

In section 2.10 of "Extensible Markup Language (XML) 1.0 (Third Edition)",
one can read:

In editing XML documents, it is often convenient to use "white
space" (spaces, tabs, and blank lines) to set apart the markup for
greater readability. Such white space is typically not intended for
inclusion in the delivered version of the document.

But who decides which whitespace shall be considered as whitespace that is
just used to set apart the markup? And is whitespace just used to indent
lines of text also not intended for inclusion in the delivered version?
What is this "delivered version" of the document?

I'd be thankful for any clarification.

Best wishes,
Wolfgang
 
Richard Tobin

Wolfgang Jeltsch said:
But who decides which whitespace shall be considered as whitespace that is
just used to set apart the markup? And is whitespace just used to indent
lines of text also not intended for inclusion in the delivered version?
What is this "delivered version" of the document?

As far as the XML spec is concerned, deciding which whitespace is
significant or not is a job for the application, which really means
"everything except the parser". A conformant parser must give all the
whitespace to the application, which can then decide what to do with
it.
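
A minimal sketch of that behaviour, using Python's standard xml.etree.ElementTree module as the parser (my choice for illustration, not something the thread prescribes). The names SNIPPET, para and word are made up for this example. It shows that the line breaks in the sample paragraph arrive in the application untouched:

```python
import xml.etree.ElementTree as ET

# The sample paragraph from the original post.
SNIPPET = """<para>
This is a longer paragraph.
With <wordasword>longer</wordasword> I mean that it contains more than
one sentence.
</para>"""

para = ET.fromstring(SNIPPET)
word = para.find("wordasword")

# The parser hands over the character data as written, line breaks included.
# Deciding what to do with them is left to the code that calls the parser.
print(repr(para.text))  # text between <para> and <wordasword>
print(repr(word.tail))  # text between </wordasword> and </para>
```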

Of course, there may be other standard programs or libraries layered
on top of the XML parser which you might not consider to be the
application. XSLT for example allows you to specify that some
whitespace is to be stripped from its input. From the point of view
of the parser, XSLT is the application, but you may regard it as just
a library that you're using.

Wolfgang Jeltsch said:
I want the whitespace in this snippet of code to be handled as follows:

(1) The whitespace between "<para>" and "This" as well as the whitespace
between "sentence." and "</para>" shall be discarded.

(2) Each other sequence of adjacent whitespace characters shall be
transformed into a single space character.

This is a fairly common form of whitespace normalization and often
goes under the name of "tokenization". For example, XML itself treats
tokenized attributes like this. Among other things, you could use an
XML Schema processor to do this normalization.
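
If the application does the normalization itself, rules (1) and (2) can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names collapse and normalized_text are hypothetical, the whitespace set is the one XML 1.0 defines (space, tab, carriage return, line feed), and the sketch flattens the paragraph to plain text, so the <wordasword> markup is dropped from the result.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 white space characters: space, tab, carriage return, line feed.
XML_WS = re.compile(r"[ \t\r\n]+")


def collapse(text: str) -> str:
    """Rule (2): turn each whitespace run into one space.
    Rule (1): then discard the space left at either end."""
    return XML_WS.sub(" ", text).strip(" ")


def normalized_text(element: ET.Element) -> str:
    """Join all character data below the element, then normalize it once,
    so runs that span an inline element boundary are collapsed too."""
    return collapse("".join(element.itertext()))


para = ET.fromstring(
    "<para>\n"
    "This is a longer paragraph.\n"
    "With <wordasword>longer</wordasword> I mean that it contains more than\n"
    "one sentence.\n"
    "</para>"
)
print(normalized_text(para))
```

Normalizing the joined text rather than each text node separately matters here: trimming every node on its own would also delete the space between "With" and "longer".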

-- Richard
 
cr88192

Wolfgang Jeltsch said:
But how do XML processors and applications deal with this issue?

My parser uses what could be called the "newline whitespace assertion",
namely:
any initial whitespace is ignored;
a newline, together with any whitespace following it, is eaten and replaced
with a single space (unless it is at the end of the text).


<foo>Hello World
Again</foo>

is parsed as:
<foo>Hello World Again</foo>
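
The rule above could be sketched roughly as follows. This is one reading of the description, not the actual parser code: the function name newline_rule is hypothetical, and I am assuming that "end of the text" means a newline run at the very end is dropped rather than turned into a space.

```python
import re


def newline_rule(text: str) -> str:
    """Apply the 'newline whitespace' rule to one run of character data."""
    # Initial whitespace is ignored.
    text = text.lstrip(" \t\r\n")
    # A newline (plus trailing whitespace) at the very end is dropped.
    text = re.sub(r"\r?\n[ \t\r\n]*\Z", "", text)
    # Any other newline, with the whitespace after it, becomes one space.
    return re.sub(r"\r?\n[ \t\r\n]*", " ", text)


print(newline_rule("Hello World\nAgain"))
```

Note that whitespace which does not follow a newline is left alone, so two spaces in the middle of a line stay two spaces. That is where this rule differs from the full collapse asked for in the original post.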
 
