Whitespace problems, xml-parsing

Discussion in 'Java' started by WP, Apr 15, 2008.

  1. WP

    WP Guest

    Hello, I have the following xml-file:
    <?xml version="1.0" encoding="UTF-8"?>
    <staff xmlns="myns"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="myns staff.xsd">
      <employee hasQuit="false">
        <id>4711</id>
        <name>Linda</name>
        <address>
          <street>Some street 1337</street>
          <city>Boston</city>
        </address>
      </employee>
      <employee hasQuit="false">
        <id>4712</id>
        <name>Michael</name>
        <address>
          <street>Another street 122</street>
          <city>Stockholm</city>
        </address>
      </employee>
    </staff>
    which is valid according to its schema.
    I'm very rusty at Java, and this is the first time I've worked with
    XML in any programming language. My problem is that when I parse the
    file I get a lot of text nodes containing only whitespace, even though
    I thought I had told the parser to ignore such whitespace. The output is:
    Are we ignoring element content whitespace? true
    text data start


    text data end
    text data start


    text data end
    text data start
    4711
    text data end
    text data start


    text data end
    text data start
    Linda
    text data end
    text data start


    text data end
    text data start


    text data end
    text data start
    Some street 1337
    text data end
    text data start


    text data end
    text data start
    Boston
    text data end
    text data start


    text data end
    text data start


    text data end
    text data start


    text data end
    text data start


    text data end
    text data start
    4712
    text data end
    text data start


    text data end
    text data start
    Michael
    text data end
    text data start


    text data end
    text data start


    text data end
    text data start
    Another street 122
    text data end
    text data start


    text data end
    text data start
    Stockholm
    text data end
    text data start


    text data end
    text data start


    text data end
    text data start


    text data end

    And my code:
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.validation.Schema;
    import javax.xml.validation.SchemaFactory;
    import javax.xml.XMLConstants;

    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.Text;

    public class DOM_Demo {
        public static void main(String[] args) {
            try {
                DocumentBuilderFactory factory =
                    DocumentBuilderFactory.newInstance();

                factory.setIgnoringElementContentWhitespace(true);
                factory.setNamespaceAware(true);
                factory.setSchema(loadSchema("staff.xsd"));

                System.out.println("Are we ignoring element content whitespace? "
                    + factory.isIgnoringElementContentWhitespace());

                DocumentBuilder document_builder =
                    factory.newDocumentBuilder();
                Document doc1 = document_builder.parse("staff.xml");

                traverse(doc1.getFirstChild());
            }
            catch (Throwable t) {
                System.out.println("Exception caught: " +
                    t.getLocalizedMessage());
            }
        }

        private static Schema loadSchema(String schemaFile) throws Throwable {
            SchemaFactory schema_factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);

            return schema_factory.newSchema(new java.io.File(schemaFile));
        }

        private static void traverse(Node current_node) {
            if (current_node.getNodeType() == Node.TEXT_NODE) {
                Text text = (Text)current_node;
                if (!text.isElementContentWhitespace()) {
                    System.out.println("text data start");
                    System.out.println(text.getData());
                    System.out.println("text data end");
                }
                else {
                    System.out.println("element content whitespace");
                }
            }
            else if (current_node.getNodeType() == Node.ELEMENT_NODE) {
                NodeList children = current_node.getChildNodes();

                for (int i = 0; i < children.getLength(); ++i) {
                    traverse(children.item(i));
                }
            }
        }
    }

    Sorry for the long post, but I wanted to include all the details. I
    want to get rid of all text nodes that don't belong to elements that
    are supposed to contain text (id, name, street, city). I hope you
    understand what I mean.

    Thanks for reading and thanks for any replies!

    - WP
     
    WP, Apr 15, 2008
    #1

  2. Maybe it is because of this ...
    "Note that only whitespace which is directly contained within
    element content that has an element only content model (see
    XML Rec 3.2.1) will be eliminated."
    From API reference documentation.
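
To illustrate the quoted note, here is a minimal sketch under stated assumptions: the sample document and its internal DTD are made up for the demonstration (the thread itself uses an XSD, not a DTD), and the class and method names are hypothetical. With a DTD that declares `staff` as element-only content and DTD validation switched on, the parser has enough information to classify the indentation as element content whitespace and drop it.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ElementOnlyWhitespaceDemo {
    // Hypothetical sample: the internal DTD declares <staff> as element-only content.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE staff [\n"
        + "  <!ELEMENT staff (name*)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "]>\n"
        + "<staff>\n"
        + "  <name>Linda</name>\n"
        + "  <name>Michael</name>\n"
        + "</staff>\n";

    // Parses XML and returns how many child nodes <staff> ends up with in the DOM.
    static int countStaffChildren(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // DTD validation
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);

            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow validation errors instead of letting them be printed.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });

            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("whitespace kept:    " + countStaffChildren(false));
        System.out.println("whitespace ignored: " + countStaffChildren(true));
    }
}
```

When the flag is off, the indentation shows up as extra text nodes between the `name` elements; when it is on, only the two elements remain.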
     
    RedGrittyBrick, Apr 15, 2008
    #2

  3. WP

    WP Guest

    Thanks for your reply, however, according to my standalone parser
    (oxygenxml) the type's content type is already element only. Maybe
    because I'm not explicitly turning on validation? I thought, since I
    have a schema, I shouldn't need to do that. But if I do turn it on it
    complains that it cannot find the schema and also wants me to supply
    my own error handler (sorry, at another computer now and don't have
    the exact error messages in front of me). Any ideas?
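
For reference, a hedged sketch of schema validation with a custom error handler. In JAXP, `setValidating(true)` requests DTD validation; when a `Schema` object is supplied through `setSchema()`, the documented approach is to leave `setValidating` at its default of false. That may be related to the complaint described above, though the exact error messages are not available to confirm it. The tiny inline schema, the sample documents, and the class and method names are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class SchemaValidationDemo {
    // Hypothetical schema: <staff> contains one or more integer <id> elements.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"staff\">"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name=\"id\" type=\"xs:int\" maxOccurs=\"unbounded\"/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Parses xml against XSD and returns the warnings and errors reported.
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema); // setValidating stays false: that flag is for DTDs

            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    problems.add("warning: " + e.getMessage());
                }
                public void error(SAXParseException e) {
                    problems.add("error: " + e.getMessage());
                }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // not well-formed: give up
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate("<staff><id>4711</id></staff>"));
        System.out.println(validate("<staff><id>abc</id></staff>"));
    }
}
```

Collecting the messages in a list, rather than printing them, makes it easy to decide afterwards whether the document should be accepted.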
     
    WP, Apr 15, 2008
    #3
  4. WP

    WP Guest

    I have now turned on validation and fixed things so that the schema is
    found properly (I had missed a call to setFeature() on the
    DocumentBuilderFactory object). It didn't solve anything, however;
    the output is still as in my OP. Then I stumbled upon this:
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6545684
    so it seems that it worked like I want it to in jdk5 but was regressed
    in jdk6 and then "fixed" even though jdk5 did wrong. But I'm running
    the latest JDK so maybe that fix was reverted. Sigh. How do people
    handle this? I will be reading schemas and files where the content is
    unknown beforehand, how am I to know what whitespace is just eye-candy
    (indentation) and should be discarded and what is actual data and
    should be kept?
     
    WP, Apr 15, 2008
    #4

  5. I wrote this when first learning Java + XML a while ago. It looks a
    bit lame now, but I think it does what you want. It discards whitespace
    used for indentation but retains all whitespace (including leading and
    trailing whitespace) within data elements.


    -------------------------------- 8< ----------------------------------
    import java.io.File;
    import java.io.IOException;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;

    import org.w3c.dom.Document;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.SAXException;

    public class ParseXMLbyDOM {

        public static void main(String[] args) {

            String filename = "XML/animals.xml";

            String uri = "file:" + new File(filename).getAbsolutePath();
            Document doc = null;
            try {
                DocumentBuilderFactory factory = DocumentBuilderFactory
                        .newInstance();
                DocumentBuilder builder = factory.newDocumentBuilder();
                doc = builder.parse(uri);
            } catch (ParserConfigurationException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
            doRecursive(doc, "");
        }

        // Visits every child of node; name is the dotted path of node itself.
        private static void doRecursive(Node node, String name) {
            if (node == null)
                return;
            NodeList nodes = node.getChildNodes();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node n = nodes.item(i);
                if (n == null)
                    continue;
                doNode(n, name);
            }
        }

        private static void doNode(Node node, String name) {
            String nodeName = "unknown";
            switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                if (name.length() == 0) {
                    nodeName = node.getNodeName();
                } else {
                    nodeName = name + "." + node.getNodeName();
                }
                doRecursive(node, nodeName);
                break;
            case Node.TEXT_NODE:
                String text = node.getNodeValue();
                // Skip whitespace-only text (indentation between elements).
                // The earlier test, "\n *" or the literal string "\\r", missed
                // tabs and real carriage returns.
                if (text.trim().length() == 0) {
                    break;
                }
                String type = "";
                // Note: getAttributes() returns null for Text nodes, so
                // type is always empty here.
                NamedNodeMap attrs = node.getAttributes();
                if (attrs != null) {
                    Node attr = attrs.getNamedItem("type");
                    if (attr != null) {
                        type = attr.getNodeValue();
                    }
                }
                System.out.println(name + "(" + type + ") = '"
                        + text + "'.");
                nodeName = "unknown";
                break;
            default:
                System.out.println("Other node "
                        + node.getNodeType() + " : "
                        + node.getClass());
                break;
            }
        }
    }
    -------------------------------- 8< ----------------------------------
    <inventory>
    <animal type="mammal">
    <name>Fred</name>
    <species>Hippo</species>
    <weight units="Kg">1552</weight>
    </animal>
    <animal type="reptile">
    <name>
    Gert
    AKA Gertrude
    the galloping reptile
    </name>
    <species>Croc</species>
    </animal>
    </inventory>
    -------------------------------- 8< ----------------------------------
    inventory.animal.name() = 'Fred'.
    inventory.animal.species() = 'Hippo'.
    inventory.animal.weight() = '1552'.
    inventory.animal.name() = '
    Gert
    AKA Gertrude
    the galloping reptile
    '.
    inventory.animal.species() = 'Croc'.
    -------------------------------- 8< ----------------------------------

    I don't think XML explicitly differentiates between "eye candy"
    whitespace and "actual data" whitespace.

    If so, you'll have to invent your own heuristics for this.
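
One such heuristic, sketched here as an assumption rather than a standard technique: treat a whitespace-only text node as indentation when its parent also has element children, and remove it from the DOM after parsing. Whitespace inside leaf elements is kept. The class and method names are hypothetical, and documents with genuine mixed content (text and elements side by side) could lose significant whitespace under this rule.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IndentationStripper {
    // Removes whitespace-only text nodes from every node that has element children.
    static void strip(Node node) {
        NodeList children = node.getChildNodes();

        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                break;
            }
        }

        // Walk backwards so that removals do not shift the unvisited indexes.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                strip(child);
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && hasElementChild
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            }
        }
    }

    // Convenience wrapper: parse a string and strip the resulting tree.
    static Document parseAndStrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            strip(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseAndStrip("<a>\n  <b> x </b>\n  <c>   </c>\n</a>");
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Because the rule looks only at the shape of the tree, it needs neither a DTD nor a schema, which suits files whose content is unknown beforehand.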
     
    RedGrittyBrick, Apr 16, 2008
    #5
  6. WP

    WP Guest

    Thanks, I will give it a try later and post back. Thanks for sharing this!
    [snip]

    - WP
     
    WP, Apr 16, 2008
    #6
