Xml parser and character encoding

Discussion in 'Java' started by Ghislain Benrais, Jun 26, 2006.

  1. Hello,
    I am new to java and I run a short program processing xml files.
    Everything ran very well until I received xml files with the character
    itself instead of its numerical reference (for instance 'é' instead of
    '&#233;'). I thought Java would handle it, but unexpectedly it handles it
    under DOS but not under Linux!
    Do you have any explanation?
    Input file :
    =======
    <?xml version="1.0" encoding="ISO-8859-1" ?>
    <values>
    <value>détail</value>
    <value>d&#233;tail</value>
    </values>
    Java program :
    ==========
    package javaapplication2;

    import java.io.*;
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import java.util.*;

    public class Main extends DefaultHandler {
        private String CData;
        // Encoding
        static String encoding;
        private Writer out;

        public Main(String[] args) {
            super();
            encoding = "ISO-8859-15";
            try {
                XMLReader xr = XMLReaderFactory.createXMLReader();
                xr.setContentHandler(this);
                out = new OutputStreamWriter(new FileOutputStream("out.txt"), encoding);
                InputSource input = null;
                input = new InputSource(new FileReader("file.xml"));
                xr.parse(input);
                out.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            // TODO code application logic here
            Main main = new Main(args);
        }

        //----------------------------------------------------------------------
        // Parser methods
        //----------------------------------------------------------------------
        public void startElement(String namespaceURI, String localName,
                String qName, Attributes attr) throws SAXException {
            CData = new String("");
        }

        public void characters(char[] chars, int iStart, int iLen) {
            CData = CData + new String(chars, iStart, iLen);
        }

        public void endElement(String namespaceURI, String localName,
                String qName) throws SAXException {
            if (localName.equals("value")) {
                try {
                    out.write(CData + "\n");
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return;
            }
        }
    }
    Result if run from DOS
    ================
    détail
    détail
    Result if run from Linux
    =================
    d?tail
    détail


    Thanks in advance,
    Ghislain
    Ghislain Benrais, Jun 26, 2006
    #1

  2. Ghislain Benrais

    Oliver Wong Guest

    "Ghislain Benrais" <> wrote in message
    news:e7osgo$sge$...
    > Hello,
    > I am new to java and I run a short program processing xml files.
    > Everything ran very well until I received xml files with the character
    > itself instead of its numerical reference (for instance 'é' instead of
    > '&#233;'). I thought java would handle it but unexpectedly, it handles it
    > under DOS but doesn't handle it under Linux !
    > Do you have any explanations ?
    > Input file :
    > =======
    > <?xml version="1.0" encoding="ISO-8859-1" ?>

    [most of the code snipped]
    > input = new InputSource(new FileReader("file.xml"));


    From http://java.sun.com/j2se/1.5.0/docs/api/java/io/FileReader.html:

    <quote>
    The constructors of this class assume that the default character encoding
    and the default byte-buffer size are appropriate. To specify these values
    yourself, construct an InputStreamReader on a FileInputStream.
    </quote>

    In other words, you're not specifying the encoding in the reader, so it
    falls back to the platform default encoding, and that encoding doesn't
    match the encoding used in your XML file.

    Did you try using the constructor of InputSource which takes a byte
    stream instead of a character stream?
    http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/InputSource.html#InputSource(java.io.InputStream)
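
A minimal sketch of the byte-stream approach, under some assumptions: the class and method names (ByteStreamParseDemo, parseText) are made up for illustration, the document is built in memory rather than read from file.xml, and SAXParserFactory is used in place of the thread's XMLReaderFactory, which newer JDKs mark as deprecated. The point it tries to show is that the parser, given raw bytes, can apply the encoding declared in the file itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original program.
class ByteStreamParseDemo {

    // Parses a document from raw bytes and returns all character data it contains.
    static String parseText(InputStream bytes) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // A byte-stream InputSource leaves the decoding to the parser,
        // which can then honor the encoding declaration in the document.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(bytes), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same shape as the input file in the thread, encoded as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>"
                + "<values><value>d\u00e9tail</value></values>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = parseText(new ByteArrayInputStream(latin1));
        // Compare instead of printing the text, so the result does not
        // depend on the console's own encoding.
        System.out.println(text.equals("d\u00e9tail"));
    }
}
```

For a real file, the same parseText call would presumably take a FileInputStream on "file.xml" instead of the in-memory stream.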

    - Oliver
    Oliver Wong, Jun 26, 2006
    #2

  3. I tried :
    input = new InputSource(new FileInputStream("file.xml"));
    and it works !
    Thank you Oliver !
    Ghislain Benrais, Jun 26, 2006
    #3
  4. Ghislain Benrais

    Chris Uppal Guest

    Ghislain Benrais wrote:

    > I tried :
    > input = new InputSource(new FileInputStream("file.xml"));
    > and it works !


    But now you are overriding the encoding specified in the input file with the
    one used by the FileInputStream -- and that will be whatever your Java system
    default is.

    As far as I can see, your earlier code would have used the charset specified in
    the XML file, and -- as far as I can tell that /ought/ to work correctly. I
    have no idea why it doesn't.

    -- chris
    Chris Uppal, Jun 26, 2006
    #4
  5. Ghislain Benrais

    Oliver Wong Guest

    "Chris Uppal" <-THIS.org> wrote in message
    news:44a009db$0$659$...
    > Ghislain Benrais wrote:
    >
    >> I tried :
    >> input = new InputSource(new FileInputStream("file.xml"));
    >> and it works !

    >
    > But now you are overriding the encoding specified in the input file with
    > the
    > one used by the FileInputStream -- and that will be whatever your Java
    > system
    > default is.
    >
    > As far as I can see, your earlier code would have used the charset
    > specified in
    > the XML file, and -- as far as I can tell that /ought/ to work correctly.
    > I
    > have no idea why it doesn't.


    The original code specifies the *OUTPUT* encoding, but not the input
    one.

    - Oliver
    Oliver Wong, Jun 26, 2006
    #5
  6. Ghislain Benrais

    Oliver Wong Guest

    "Oliver Wong" <> wrote in message
    news:uXTng.15389$B91.14232@edtnps82...
    >
    > "Chris Uppal" <-THIS.org> wrote in message
    > news:44a009db$0$659$...
    >> Ghislain Benrais wrote:
    >>
    >>> I tried :
    >>> input = new InputSource(new FileInputStream("file.xml"));
    >>> and it works !

    >>
    >> But now you are overriding the encoding specified in the input file with
    >> the
    >> one used by the FileInputStream -- and that will be whatever your Java
    >> system
    >> default is.
    >>
    >> As far as I can see, your earlier code would have used the charset
    >> specified in
    >> the XML file, and -- as far as I can tell that /ought/ to work correctly.
    >> I
    >> have no idea why it doesn't.

    >
    > The original code specifies the *OUTPUT* encoding, but not the input
    > one.


    Oops, sorry, I misread your post, Chris.

    Here's what I suspect is happening in the original code: A FileReader is
    created with no specified encoding. A FileReader doesn't know anything about
    XML, so it's not going to look for an XML declaration and check its
    encoding attribute. Instead, the FileReader just uses the system default
    encoding: it reads a stream of bytes from the disk, transforms them into a
    stream of characters, and passes these characters to the XMLReader. By the
    time the XMLReader receives these characters, they've already been decoded
    under some specific encoding, so it's "too late" for the XMLReader to use
    the encoding information specified in the XML file.

    That's why I suggested the OP use the constructor which takes in a
    stream of bytes instead. The XMLReader will probably decode the first few
    bytes using ASCII or UTF-8, until it finds an encoding specified in the
    file, in which case it does whatever magic it needs to do to switch encoding
    mid-stream.

    And it turns out that's what the OP actually did. FileInputStream
    processes files as a stream of bytes, and not as a stream of characters, so
    no encoding/decoding is done by FileInputStream.
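
The effect described above can be reproduced without any XML at all. This is only a sketch, and it rests on an assumption the thread never confirms: that the default encoding on the Linux machine was UTF-8. The class name WrongDecoderDemo and the decode helper are hypothetical, and ISO-8859-1 stands in for the program's ISO-8859-15 output encoding (both place 'é' at byte 0xE9, and neither can represent U+FFFD).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original program.
class WrongDecoderDemo {

    // Decodes bytes with a fixed charset, the way a Reader does,
    // without looking at any XML encoding declaration.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // 'é' is the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = "d\u00e9tail".getBytes(StandardCharsets.ISO_8859_1);

        // Decoder matches the file: the text survives.
        String right = decode(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals("d\u00e9tail"));   // true

        // Assumed Linux default (UTF-8): 0xE9 followed by 't' is not valid
        // UTF-8, so the decoder substitutes U+FFFD.
        String wrong = decode(latin1, StandardCharsets.UTF_8);
        System.out.println(wrong.equals("d\uFFFDtail"));   // true

        // Writing the damaged text to a Latin-1 style output turns the
        // unmappable U+FFFD into '?', matching the "d?tail" in the question.
        byte[] out = wrong.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(out, StandardCharsets.US_ASCII)); // d?tail
    }
}
```

The numeric reference in the second <value> element is plain ASCII, which decodes identically under either charset, so it would be unaffected in this scenario.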

    - Oliver
    Oliver Wong, Jun 26, 2006
    #6
  7. Ghislain Benrais

    Chris Uppal Guest

    Oliver Wong wrote:

    > > As far as I can see, your earlier code would have used the charset
    > > specified in
    > > the XML file, and -- as far as I can tell that /ought/ to work
    > > correctly. I
    > > have no idea why it doesn't.

    >
    > The original code specifies the *OUTPUT* encoding, but not the input
    > one.


    Yes, precisely. And if the input encoding is not specified from code, then (as
    I understand it) the SAX implementation is /supposed/ to take it from the XML
    (where, in the OP's example, it was declared as "ISO-8859-1"). Using a
    FileInputStream means that the input is decoded by that stream before the XML
    parser sees it -- which may not be what is desired. More specifically, the
    code I commented on uses the Java system default decoder (whatever that happens
    to be) -- which is almost certainly not what is desired.

    -- chris
    Chris Uppal, Jun 26, 2006
    #7
  8. Ghislain Benrais

    Chris Uppal Guest

    I wrote:

    > Yes, precisely. And if the input encoding is not specified from code,
    > then (as I understand it) the SAX implementation is /supposed/ to take it
    > from the XML (where, in the OP's example, it was declared as "ISO-8859-1").
    > Using a FileInputStream means that the input is decoded by that stream
    > before the XML parser sees it -- which may not be what is desired. More
    > specifically, the code I commented on uses the Java system default
    > decoder (whatever that happens to be) -- which is almost certainly not
    > what is desired.


    Oops, sorry, I misread your post Oliver.

    ;-) (But the "sorry" is real)

    I misread both your post and the OP, in fact. I was under the impression that
    he was originally using a FileInputStream, and you were "correcting" that to a
    FileReader. My mistake.

    -- chris
    Chris Uppal, Jun 26, 2006
    #8
  9. Ghislain Benrais

    Dale King Guest

    Oliver Wong wrote:
    >
    > That's why I suggested the OP use the constructor which takes in a
    > stream of bytes instead. The XMLReader will probably decode the first
    > few bytes using ASCII or UTF-8, until it finds an encoding specified in
    > the file, in which case it does whatever magic it needs to do to switch
    > encoding mid-stream.


    It is UTF-8 by the way. XML can get encoding information from:

    - an external transport protocol (e.g. HTTP or MIME), which is really the
    only reason to use a Reader as input to XMLReader
    - an encoding declaration, as in <?xml version='1.0' encoding='UTF-8'?>
    - a byte order mark

    If none of the above are present it is a fatal error for the XML to be
    in anything but UTF-8.
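
As a sketch of the first case in the list, here is how a Reader might be used when the charset is known from outside the document. The names ExternalCharsetDemo, parseText and charsetFromTransport are hypothetical, and the charset is simply hard-coded where a real program would take it from, say, an HTTP Content-Type header. The document deliberately carries no encoding declaration; since the parser is handed characters rather than bytes, it does no byte decoding of its own.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original program.
class ExternalCharsetDemo {

    // Decodes the body with a charset supplied by the caller, then parses
    // the resulting character stream and returns its character data.
    static String parseText(byte[] body, Charset charsetFromTransport) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // The Reader performs the decoding; the parser only sees characters.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(body), charsetFromTransport);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(reader), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // No XML declaration and no byte order mark: the bytes alone do not
        // say that they are ISO-8859-1.
        byte[] body = "<values><value>d\u00e9tail</value></values>"
                .getBytes(StandardCharsets.ISO_8859_1);

        String text = parseText(body, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("d\u00e9tail"));
    }
}
```

Passing the wrong charset here would silently corrupt the text, which is the same failure mode as the FileReader in the original program.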
    --
    Dale King
    Dale King, Jun 28, 2006
    #9