Whitespace problems, xml-parsing

WP · Apr 15, 2008

Hello, I have the following xml-file:
<?xml version="1.0" encoding="UTF-8"?>
<staff xmlns="myns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="myns staff.xsd">
<employee hasQuit="false">
<id>4711</id>
<name>Linda</name>
<address>
<street>Some street 1337</street>
<city>Boston</city>
</address>
</employee>
<employee hasQuit="false">
<id>4712</id>
<name>Michael</name>
<address>
<street>Another street 122</street>
<city>Stockholm</city>
</address>
</employee>
</staff>
which is valid according to its schema.
I'm very rusty at Java, and this is the first time I've worked with
XML in any programming language. My problem is that when I parse the
file I get a lot of text nodes containing just whitespace, even though
I thought I had set the parser to ignore such whitespace. The output is:
Are we ignoring element content whitespace? true
text data start

text data end
text data start

text data end
text data start
4711
text data end
text data start

text data end
text data start
Linda
text data end
text data start

text data end
text data start

text data end
text data start
Some street 1337
text data end
text data start

text data end
text data start
Boston
text data end
text data start

text data end
text data start

text data end
text data start

text data end
text data start

text data end
text data start
4712
text data end
text data start

text data end
text data start
Michael
text data end
text data start

text data end
text data start

text data end
text data start
Another street 122
text data end
text data start

text data end
text data start
Stockholm
text data end
text data start

text data end
text data start

text data end
text data start

text data end

And my code:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.XMLConstants;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

public class DOM_Demo {
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();

            factory.setIgnoringElementContentWhitespace(true);
            factory.setNamespaceAware(true);
            factory.setSchema(loadSchema("staff.xsd"));

            System.out.println("Are we ignoring element content whitespace? "
                + factory.isIgnoringElementContentWhitespace());

            DocumentBuilder document_builder =
                factory.newDocumentBuilder();
            Document doc1 = document_builder.parse("staff.xml");

            traverse(doc1.getFirstChild());
        }
        catch (Throwable t) {
            System.out.println("Exception caught: " +
                t.getLocalizedMessage());
        }
    }

    private static Schema loadSchema(String schemaFile) throws Throwable {
        SchemaFactory schema_factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

        return schema_factory.newSchema(new java.io.File(schemaFile));
    }

    private static void traverse(Node current_node) {
        if (current_node.getNodeType() == Node.TEXT_NODE) {
            Text text = (Text)current_node;
            if (!text.isElementContentWhitespace()) {
                System.out.println("text data start");
                System.out.println(text.getData());
                System.out.println("text data end");
            }
            else {
                System.out.println("element content whitespace");
            }
        }
        else if (current_node.getNodeType() == Node.ELEMENT_NODE) {
            NodeList children = current_node.getChildNodes();

            for (int i = 0; i < children.getLength(); ++i) {
                traverse(children.item(i));
            }
        }
    }
}

Sorry for the long post, but I wanted to include all details. I want to
get rid of all text nodes that don't reside in elements that are
supposed to contain text (id, name, street, city). Hope you understand
what I mean.

Thanks for reading and thanks for any replies!

- WP

RedGrittyBrick · Apr 15, 2008

WP said:
Hello, I have the following xml-file:
<?xml version="1.0" encoding="UTF-8"?>
<staff xmlns="myns"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="myns staff.xsd">
<employee hasQuit="false">
<id>4711</id>
<name>Linda</name>
<address>
<street>Some street 1337</street>
<city>Boston</city>
</address>
</employee> ....
</staff>
which is valid according to its schema.
I'm very rusty at Java, and this is the first time I've worked with
XML in any programming language. My problem is that when I parse the
file I get a lot of text nodes containing just whitespace, even though
I thought I had set the parser to ignore such whitespace. ....
import javax.xml.parsers.DocumentBuilderFactory; ....
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();

factory.setIgnoringElementContentWhitespace(true);

Maybe it is because of this ...
"Note that only whitespace which is directly contained within
element content that has an element only content model (see
XML Rec 3.2.1) will be eliminated."
From API reference documentation.
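
For what it's worth, here is a minimal sketch of the case that note seems to describe. It assumes the content model comes from a DTD and that DTD validation is switched on with setValidating(true); the class name, the sample document and its inline DTD are made up for illustration and are not your staff.xml. It counts the text nodes directly under the root for both settings so you can compare on your own JDK. Whether a schema supplied via setSchema() has the same effect is a separate question that this does not answer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DtdWhitespaceDemo {
    // Hypothetical sample: an indented document with an internal DTD that
    // gives <staff> and <employee> element-only content models.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE staff [\n"
        + "  <!ELEMENT staff (employee*)>\n"
        + "  <!ELEMENT employee (id, name)>\n"
        + "  <!ELEMENT id (#PCDATA)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "]>\n"
        + "<staff>\n"
        + "  <employee>\n"
        + "    <id>4711</id>\n"
        + "    <name>Linda</name>\n"
        + "  </employee>\n"
        + "</staff>\n";

    // Parses the sample with the whitespace flag set, with or without DTD validation.
    static Document parse(boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(
            new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
    }

    // Counts the text nodes that are direct children of the given node.
    static int countTextChildren(Node parent) {
        int count = 0;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        for (boolean validating : new boolean[] {false, true}) {
            Document doc = parse(validating);
            System.out.println("validating=" + validating
                + ", text nodes directly under <staff>: "
                + countTextChildren(doc.getDocumentElement()));
        }
    }
}
```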

WP · Apr 15, 2008

RedGrittyBrick said:
Maybe it is because of this ...
"Note that only whitespace which is directly contained within
element content that has an element only content model (see
XML Rec 3.2.1) will be eliminated."
From API reference documentation.

Thanks for your reply. However, according to my standalone parser
(oxygenxml) the type's content type is already element only. Maybe it's
because I'm not explicitly turning on validation? I thought, since I
have a schema, I shouldn't need to do that. But if I do turn it on it
complains that it cannot find the schema and also wants me to supply
my own error handler (sorry, at another computer now and don't have
the exact error messages in front of me). Any ideas?

WP · Apr 15, 2008

WP said:
Thanks for your reply. However, according to my standalone parser
(oxygenxml) the type's content type is already element only. Maybe it's
because I'm not explicitly turning on validation? I thought, since I
have a schema, I shouldn't need to do that. But if I do turn it on it
complains that it cannot find the schema and also wants me to supply
my own error handler (sorry, at another computer now and don't have
the exact error messages in front of me). Any ideas?

I have now turned on validation and fixed things so that the schema is
found properly (I had missed a call to setFeature() on the
DocumentBuilderFactory object). It didn't solve anything, however; the
output is still as in my OP. Then I stumbled upon this:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6545684
so it seems that it worked like I want it to in jdk5 but was regressed
in jdk6 and then "fixed" even though jdk5 did wrong. But I'm running
the latest JDK so maybe that fix was reverted. Sigh. How do people
handle this? I will be reading schemas and files where the content is
unknown beforehand, how am I to know what whitespace is just eye-candy
(indentation) and should be discarded and what is actual data and
should be kept?

RedGrittyBrick · Apr 16, 2008

WP said:
I have now turned on validation and fixed things so that the schema is
found properly (I had missed a call to setFeature() on the
DocumentBuilderFactory object). It didn't solve anything, however; the
output is still as in my OP. Then I stumbled upon this:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6545684
so it seems that it worked like I want it to in jdk5 but was regressed
in jdk6 and then "fixed" even though jdk5 did wrong. But I'm running
the latest JDK so maybe that fix was reverted. Sigh.

How do people handle this?

I wrote this when first learning Java + XML a while ago. It looks a
bit lame now, but I think it does what you want. It discards whitespace
used for indentation but retains all whitespace (including leading and
trailing whitespace) within data elements.

-------------------------------- 8< ----------------------------------
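import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class ParseXMLbyDOM {

    public static void main(String[] args) {

        String filename = "XML/animals.xml";

        String uri = "file:" + new File(filename).getAbsolutePath();
        Document doc = null;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory
                    .newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            doc = builder.parse(uri);
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // doc stays null if parsing failed; doRecursive() copes with that.
        doRecursive(doc, "");
    }

    // Visits every child of the given node, passing along the dotted path so far.
    private static void doRecursive(Node node, String name) {
        if (node == null)
            return;
        NodeList nodes = node.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n == null)
                continue;
            doNode(n, name);
        }
    }

    private static void doNode(Node node, String name) {
        String nodeName = "unknown";
        switch (node.getNodeType()) {
        case Node.ELEMENT_NODE:
            // Extend the dotted path (e.g. inventory.animal.name) and descend.
            if (name.length() == 0) {
                nodeName = node.getNodeName();
            } else {
                nodeName = name + "." + node.getNodeName();
            }
            doRecursive(node, nodeName);
            break;
        case Node.TEXT_NODE:
            String text = node.getNodeValue();
            // Skip text nodes that hold nothing but whitespace (indentation).
            // This replaces the earlier test against "\n *" and "\\r", which
            // missed tabs and compared against a literal backslash followed by 'r'.
            if (text.trim().isEmpty()) {
                break;
            }
            String type = "";
            // Note: getAttributes() returns null for text nodes, so type
            // stays empty here, which is why the output shows "()".
            NamedNodeMap attrs = node.getAttributes();
            if (attrs != null) {
                Node attr = attrs.getNamedItem("type");
                if (attr != null) {
                    type = attr.getNodeValue();
                }
            }
            System.out.println(name + "(" + type + ") = '"
                    + text + "'.");
            nodeName = "unknown";
            break;
        default:
            System.out.println("Other node "
                    + node.getNodeType() + " : "
                    + node.getClass());
            break;
        }
    }
}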
-------------------------------- 8< ----------------------------------
<inventory>
<animal type="mammal">
<name>Fred</name>
<species>Hippo</species>
<weight units="Kg">1552</weight>
</animal>
<animal type="reptile">
<name>
Gert
AKA Gertrude
the galloping reptile
</name>
<species>Croc</species>
</animal>
</inventory>
-------------------------------- 8< ----------------------------------
inventory.animal.name() = 'Fred'.
inventory.animal.species() = 'Hippo'.
inventory.animal.weight() = '1552'.
inventory.animal.name() = '
Gert
AKA Gertrude
the galloping reptile
'.
inventory.animal.species() = 'Croc'.
-------------------------------- 8< ----------------------------------

WP said:
I will be reading schemas and files where the content is
unknown beforehand, how am I to know what whitespace is just eye-candy
(indentation) and should be discarded and what is actual data and
should be kept?

I don't think XML explicitly differentiates between "eye candy"
whitespace and "actual data" whitespace.

If so, you'll have to invent your own heuristics for this.
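
One such heuristic, as a rough sketch only: treat a whitespace-only text node as indentation when its parent also has element children, and leave text in leaf elements alone. The class and method names below are made up for illustration. It is a guess about intent, not something the parser guarantees, and it would also drop meaningful whitespace between inline elements in mixed content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class IndentationStripper {

    // True if the node has at least one element among its direct children.
    static boolean hasElementChild(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    // Removes whitespace-only text nodes from every node that also has
    // element children; text inside leaf elements is never touched.
    static void stripIndentation(Node node) {
        boolean elementContent = hasElementChild(node);
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (elementContent && child.getNodeValue().trim().isEmpty()) {
                    node.removeChild(child);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripIndentation(child);
            }
            child = next;
        }
    }

    // Counts all text nodes in the subtree below the given node.
    static int countTextNodes(Node node) {
        int count = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                count++;
            } else {
                count += countTextNodes(c);
            }
        }
        return count;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample; note the spaces around "Linda" are data.
        String xml = "<staff>\n  <employee>\n    <id>4711</id>\n"
                + "    <name> Linda </name>\n  </employee>\n</staff>";
        Document doc = parse(xml);
        System.out.println("text nodes before: " + countTextNodes(doc));
        stripIndentation(doc);
        System.out.println("text nodes after:  " + countTextNodes(doc));
    }
}
```

Run stripIndentation(doc) once after parsing and then traverse the tree as before; the leading and trailing spaces in leaf elements such as name survive.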

WP · Apr 16, 2008

RedGrittyBrick said:
I wrote this when first learning Java + XML a while ago. It looks a
bit lame now, but I think it does what you want. It discards whitespace
used for indentation but retains all whitespace (including leading and
trailing whitespace) within data elements.

Thanks, I will give it a try later and post back. Thanks for sharing this!
[snip]

- WP
