Java DOM processing XML Keeping Carriage Returns

Discussion in 'XML' started by Dr. Laurence Leff, Dec 27, 2004.

  1. I am writing a Java program to read in an XML file, modify some elements
    slightly, and then write it out. The XML file is prepared
    in DocBook.

    It works fine, except that it is disturbing the carriage returns in places
    where they have meaning.

    Attached are a sample input file, the sample output file, and a simplified
    version of my Java program. My real program examines certain elements' attributes
    and adds certain elements to the DOM data structure.

    How best to write such a program without disturbing the carriage returns
    between "<programlisting>" and "</programlisting>"?

    Here is the input file. I need to preserve the formatting of the material
    between "<programlisting>" and "</programlisting>".

    <test>
    <programlisting>
    a
    b
    c d e f
    </programlisting>
    </test>



    Here is the output file -- observe how the "a b c d e f" are now on
    one line.
    <?xml version="1.0" encoding="UTF-8"?>
    <test>
    <programlisting> a b c d e f </programlisting>
    </test>

    Here is the Java program:

    import java.text.*;
    import java.io.*;

    import javax.xml.parsers.*;
    import org.apache.xml.serialize.*;
    import org.w3c.dom.*;
    import org.xml.sax.*;

    public class Test {
        static PrintWriter debug = null;
        static Document document = null;
        static String OutputFileName;

        static Element CreateElement (String ElementName, String Contents){
            Element ToReturn;
            ToReturn = document.createElement(ElementName);
            Text T = document.createTextNode(Contents);
            ToReturn.appendChild(T);
            return ToReturn;
        }

        static Element CreateElement (String ElementName, String Contents, String AttributeName, String AttributeValue){
            Element ToReturn = CreateElement(ElementName,Contents);
            ToReturn.setAttribute(AttributeName,AttributeValue);
            return ToReturn;
        }

        public static void main (String[] args) throws FileNotFoundException {
            try {
                debug = new PrintWriter (new FileWriter("debug.out"));
            }
            catch (Exception d) {System.out.println("cannot open debug file");}
            Text T;
            int j;
            DocumentBuilder parser = null;
            // Here we read in the data from the XML file
            DocumentBuilderFactory Factory = DocumentBuilderFactory.newInstance();
            String xmlFile = args[0];
            File file = new File (xmlFile);
            try {
                parser = Factory.newDocumentBuilder();
            }
            catch (ParserConfigurationException pce) {
                System.out.println ("Parser Configuration Exception " + pce.getMessage());
                System.exit(0);
            }
            try {
                document = parser.parse(file);
            }
            catch (SAXException se) {
                System.out.println ("SAX Exception on parsing document " + se.getMessage());
                System.exit(0);
            }
            catch (IOException ioe) {
                System.out.println ("IO Exception on parsing document " + ioe.getMessage());
                System.exit(0);
            }
            FileWriter out = null;
            XMLSerializer X = null;
            Element root = null;
            OutputFileName=args[0];
            try {
                out = new FileWriter(OutputFileName+".OUT"+".xml");
                out.flush();
                OutputFormat o = new OutputFormat(document);
                o.setIndent(5);
                o.setIndenting(true);
                X = new XMLSerializer(o);
                X.setOutputCharStream(out);
            }
            catch (IOException e0) {
                System.out.println ("problem in setting up to save XML file" + e0.getMessage());
                e0.printStackTrace();
            }

            // use the XML functions to dump the materials
            try {
                X.serialize(document);
                out.flush();
            } catch (IOException e2) {
                System.out.println("error writing file " + e2.getMessage());
                e2.printStackTrace();
            }

            debug.close();
        }
    }

    Dr. Laurence Leff Western Illinois University, Macomb IL 61455 ||(309) 298-1315
    Stipes 447 Assoc. Prof. of Computer Sci. Pager: 309-367-0787 FAX: 309-298-2302
    Secretary: eContracts Technical Committee OASIS Legal XML Member Section
     
    Dr. Laurence Leff, Dec 27, 2004
    #1

  2. /Dr. Laurence Leff/:

    > It works fine, except that it is disturbing the carriage returns in places
    > where they have meaning.


    <http://www.w3.org/TR/REC-xml/#sec-line-ends>:

    > ... XML processor MUST behave as if it normalized all line breaks in
    > external parsed entities (including the document entity) on input,
    > before parsing, by translating both the two-character sequence #xD
    > #xA and any #xD that is not followed by #xA to a single #xA
    > character.


    So you have to include the carriage returns in the source XML data
    using character references (&#13;).
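
    To make the quoted rule concrete, here is a minimal sketch (not taken
    from any program in this thread) that parses two small strings with the
    standard JAXP DocumentBuilder and prints the character codes of the
    resulting text. The class name LineEndDemo and its helper methods are
    made up for illustration. A literal CR LF pair should come back as a
    single LF (10), while a CR written as &#13; should survive as 13.

```java
import java.io.StringReader;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class LineEndDemo {
    /** Parses an XML string and returns the text content of its root element. */
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    /** Returns the decimal character codes of a string, separated by spaces. */
    static String codes(String s) {
        return s.chars().mapToObj(Integer::toString).collect(Collectors.joining(" "));
    }

    public static void main(String[] args) throws Exception {
        // Literal CR LF in the source: the parser normalizes it before parsing.
        System.out.println(codes(rootText("<test>a\r\nb</test>")));
        // CR given as a character reference: it is resolved after normalization.
        System.out.println(codes(rootText("<test>a&#13;\nb</test>")));
    }
}
```").equals("a\nb");
assert LineEndDemo.rootText("<test>a&#13;\nb</test>").equals("a\r\nb");
assert LineEndDemo.codes("a\nb").equals("97 10 98");
</test>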

    --
    Stanimir
     
    Stanimir Stamenkov, Dec 28, 2004
    #2

  3. /Stanimir Stamenkov/:

    > So you have to include the carriage returns in the source XML data using
    > character references (&#13;).


    As far as I see:

    Document doc;
    ... // initialize a new empty 'doc'
    Element elem = doc.createElement("test");
    elem.appendChild(doc.createTextNode(
    "A line of text.\r\n\r\nAnother line."));
    doc.appendChild(elem);
    // serialize the 'doc'

    Serialization using the standard JAXP Transformations API (using a
    "copy transformer") correctly outputs &#13; in place of the CR
    characters, so they will be read back again the next time.
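
    For readers who have not used a "copy transformer", the following is a
    minimal sketch of that approach, assuming the transformer built into
    the JDK. A Transformer created without a stylesheet copies its source
    to its result unchanged, so it can serve as a DOM serializer. The class
    name CopyTransformerDemo and its method names are made up for
    illustration; with another TransformerFactory on the classpath the
    exact form of the character reference may differ.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CopyTransformerDemo {
    /** Builds the small document from the snippet above using the DOM factory methods. */
    static Document buildDoc() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element elem = doc.createElement("test");
        elem.appendChild(doc.createTextNode("A line of text.\r\n\r\nAnother line."));
        doc.appendChild(elem);
        return doc;
    }

    /** Serializes a DOM document with an identity ("copy") transformer. */
    static String serialize(Document doc) throws Exception {
        // newTransformer() without a stylesheet gives the identity transformation.
        Transformer copy = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        copy.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Look for character references where the CR characters were.
        System.out.println(serialize(buildDoc()));
    }
}
```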

    --
    Stanimir
     
    Stanimir Stamenkov, Dec 28, 2004
    #3
  4. Dr. Stamenkov:

    Thank you for your quick responses to my question on using Java
    to process XML files that contain formatted text such as programs.

    I tried your suggestion of using the character reference for carriage
    return, &#13;. (I wrote a Perl script to identify my
    programlisting sections and make the replacement.)

    It did not help.

    Here is the sample input (after including the entity references)

    <section><para>
    abc
    def
    <programlisting>
    ghi
    jkl
    mno
    </programlisting>
    ghi
    jkl
    </para></section>

    Here is the output of the Java program. Observe that the carriage return
    and formatting in the "programlisting" section are being changed.

    <?xml version="1.0" encoding="UTF-8"?>
    <section>
    <para> abc def <programlisting> ghi jkl mno </programlisting>
    ghi jkl </para>
    </section>

    This output was from the same Java program as before.

    I also tried the startPreserving() option on the OutputSerializer
    and removing the invocations of the methods setIndent and setIndenting.
    I also tried removing the carriage returns between the lines, replacing
    them with the character reference &#13;.
    These made no change.

    I then tried the other suggestion:

    import java.text.*;
    import java.io.*;

    import javax.xml.parsers.*;
    import org.apache.xml.serialize.*;
    import org.w3c.dom.*;
    import org.xml.sax.*;

    public class T{
        static PrintWriter debug = null;
        static String OutputFileName;

        public static void main (String[] args) throws ParserConfigurationException,FileNotFoundException {
            try {
                debug = new PrintWriter (new FileWriter("debug.out"));
            }
            catch (Exception d) {System.out.println("cannot open debug file");}
            Text T;
            int j;
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            Document doc;
            DocumentBuilder db = dbf.newDocumentBuilder();
            doc = db.newDocument();
            Element elem = doc.createElement("test");
            elem.appendChild(doc.createTextNode(
                "A line of text \r\n\r\nAnother line"));

            doc.appendChild(elem);

            FileWriter out = null;
            XMLSerializer X = null;
            Element root = null;
            try {
                out = new FileWriter(OutputFileName+".OUT"+".xml");
                out.flush();
                OutputFormat o = new OutputFormat(doc);
                //o.setIndent(5);
                //o.setIndenting(true);
                X = new XMLSerializer(o);
                X.setOutputCharStream(out);
            }
            catch (IOException e0) {
                System.out.println ("problem in setting up to save XML file" + e0.getMessage());
                e0.printStackTrace();
            }

            // use the XML functions to dump the materials
            try {
                X.serialize(doc);
                out.flush();
            } catch (IOException e2) {
                System.out.println("error writing file " + e2.getMessage());
                e2.printStackTrace();
            }

            debug.close();
        }
    }

    I got this output:

    <?xml version="1.0" encoding="UTF-8"?>
    <test>A line of text Another line</test>

    Perhaps, there is a version or configuration problem with my parser software.
    I am using Xerces-2_4_0.

    Thank you for any further assistance that you or anyone else
    reading this newsgroup can provide.

    Dr. Laurence Leff Western Illinois University, Macomb IL 61455 ||(309) 298-1315
    Stipes 447 Assoc. Prof. of Computer Sci. Pager: 309-367-0787 FAX: 309-298-2302
    Secretary: eContracts Technical Committee OASIS Legal XML Member Section
     
    Dr. Laurence Leff, Dec 28, 2004
    #4
  5. /Dr. Laurence Leff/:

    > I tried your suggestion of using the entity reference for carriage
    > return, &#13;,
    > [...]
    >
    > It did not help.
    >
    > Here is the sample input (after including the entity references)
    >
    > <section><para>
    > abc
    > def
    > <programlisting>
    > ghi
    > jkl
    > mno
    > </programlisting>
    > ghi
    > jkl
    > </para></section>
    >

    [...]
    >
    > Perhaps, there is a version or configuration problem with my parser software.
    > I am using Xerces-2_4_0.


    As I mentioned in my previous reply, I used the JAXP
    Transformations API (part of the standard Java 1.4 framework) to
    serialize the data. I have Xerces2 version 2.6.2, but I haven't used
    its 'serialize' package. It could be that these 'Serializer' classes need
    additional configuration, or that they just don't behave well in the
    version you have. The Xerces version I have provides an implementation
    of the DOM Level 3 Load and Save API (which is now part of the
    standard Java 5 framework), but I haven't tried that either.
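
    As an illustration of the DOM Level 3 Load and Save route mentioned
    above (untested in this thread), here is a minimal sketch that parses
    the first sample input from a string and writes it back with an
    LSSerializer. The class name LoadSaveDemo and its method names are made
    up for illustration. With pretty-printing left at its default (off),
    the text inside <programlisting> should keep its line breaks.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

public class LoadSaveDemo {
    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM document through the DOM Level 3 Load and Save API. */
    static String save(Document doc) {
        // Ask the DOM implementation for its Load and Save feature.
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        String input = "<test>\n<programlisting>\na\nb\nc d e f\n</programlisting>\n</test>";
        System.out.println(save(parse(input)));
    }
}
```";
String before = LoadSaveDemo.parse(input).getDocumentElement().getTextContent();
String after = LoadSaveDemo.parse(LoadSaveDemo.save(LoadSaveDemo.parse(input))).getDocumentElement().getTextContent();
assert before.equals(after);
assert after.contains("a\nb\nc d e f");
</test>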

    I've prepared an example for you to try:

    http://www.geocities.com/stanio/test/XMLInputOutputTest.java
    http://www.geocities.com/stanio/test/input.xml

    It reads the "input.xml" file (which is a copy of the sample input
    you've given above) and dumps its contents to the console, with CR and
    LF characters replaced by "[CR]" and "[LF]" strings
    (all on one line). Then it saves the read DOM data to "output.xml".

    In addition, a "test.xml" file is created with DOM data constructed
    using the 'Document' factory methods (as in my previous example).
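
    The linked source is not reproduced in this thread, so the following is
    only a sketch of the dumping step described above, not the actual
    XMLInputOutputTest.java. The class name WhitespaceDump and the method
    names are made up for illustration. It parses a file given on the
    command line (or a small built-in sample if none is given) and prints
    the document's text on one line with CR and LF made visible.

```java
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WhitespaceDump {
    /** Replaces CR and LF with visible markers so the text fits on one line. */
    static String visible(String s) {
        return s.replace("\r", "[CR]").replace("\n", "[LF]");
    }

    /** Returns the text content of a parsed document with line ends marked. */
    static String dump(Document doc) {
        return visible(doc.getDocumentElement().getTextContent());
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc;
        if (args.length > 0) {
            // Parse the file named on the command line, e.g. input.xml.
            doc = builder.parse(new File(args[0]));
        } else {
            // Built-in sample: one CR as a character reference, then a literal LF.
            doc = builder.parse(new InputSource(
                    new StringReader("<programlisting>ghi&#13;\njkl</programlisting>")));
        }
        System.out.println(dump(doc));
    }
}
```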

    --
    Stanimir
     
    Stanimir Stamenkov, Dec 29, 2004
    #5
