Detect XML document encodings with SAX

Discussion in 'Java' started by Sebastian, Nov 21, 2012.

  1. Sebastian

    Sebastian Guest

    Hello there,

    I discovered this post:
    http://www.ibm.com/developerworks/library/x-tipsaxxni/

    and implemented both approaches (SAX and Xerces XNI).

    Unfortunately, for the attached XML file, both methods
    output an encoding of UTF-8, while looking at the file
    makes it clear that it is not UTF-8 encoded (all characters,
    including the umlaut and the Euro-sign, take one byte, and the
    declared encoding also is not UTF-8).

    Does anyone have an idea why that is so? And how I could
    go about making some XML parser determine the correct encoding?

    -- Sebastian
    Sebastian, Nov 21, 2012
    #1
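The observation that the umlaut and the Euro sign each take one byte can be checked mechanically: in UTF-8 every non-ASCII character occupies at least two bytes, so a strict decoder rejects such a file. A minimal sketch, assuming the input fits in memory; the class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Hypothetical helper: true if the bytes are well-formed UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed input.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A negative result rules UTF-8 out, but a positive one proves little: a pure-ASCII file passes this check whatever its declared encoding.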

  2. Sebastian

    Lew Guest

    Sebastian wrote:
    > I discovered this post:
    > http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >
    > and implemented both approaches (SAX and Xerces XNI).
    >
    > Unfortunately, for the attached XML file, both methods


    Don't do attachments on Usenet.

    > output an encoding of UTF-8, while looking at the file


    as they should. XML should be encoded in UTF-8 nearly always.

    But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

    > makes it clear that it is not UTF-8 encoded (all characters,
    > including the umlaut and the Euro-sign, take one byte, and the
    > declared encoding also is not UTF-8).


    http://sscce.org/

    > Does anyone have an idea why that is so? And how I could


    You used the default encoding in your Writer.

    > go about making some XML parser determine the correct encoding?


    Your problem is writing the file, no? That has nothing to do with parsing.

    If your problem is with reading the file, then the encoding in the XML declaration
    should suffice to guide the parser. But then why do you talk about methods that
    "output an encoding"?

    However, according to
    http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
    supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
    ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP,
    as you would have learned had you researched your question.

    So it looks like you must not accept XML documents with such a non-standard
    encoding.

    Show us the code, or at least an SSCCE of it.

    --
    Lew
    Lew, Nov 21, 2012
    #2

  3. Sebastian

    Sebastian Guest

    On 21.11.2012 20:31, Lew wrote:
    > Sebastian wrote:
    >> I discovered this post:
    >> http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >>
    >> and implemented both approaches (SAX and Xerces XNI).

    [snip]

    >
    > Your problem is writing the file, no? That has nothing to do with parsing.

    No, it is with parsing the file. Parsing with the purpose of detecting
    the encoding.

    > If your problem is with reading the file, then the encoding in the XML declaration
    > should suffice to guide the parser.

    My question is exactly why in this case this does not suffice.

    >But then why do you talk about methods that
    > "output an encoding"?

    I meant the System.out.println() statements in the code.

    [snip]

    > Show us the code, or at least an SSCCE of it.
    >

    I was referring to the code in the IBM developerworks article that I
    linked to. Perhaps I should simply have copied out that code into my
    original post. So here goes:

    import org.xml.sax.*;
    import org.xml.sax.ext.*;
    import org.xml.sax.helpers.*;

    import java.io.IOException;

    public class SAXEncodingDetector extends DefaultHandler {

        /**
         * Print the encodings of all URLs given on the command line.
         */
        public static void main(String[] args) throws SAXException,
                IOException {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            SAXEncodingDetector handler = new SAXEncodingDetector();
            parser.setContentHandler(handler);
            for (int i = 0; i < args.length; i++) {
                try {
                    parser.parse(args[i]);
                }
                catch (SAXException ex) {
                    // startDocument() aborts the parse on purpose; print
                    // whatever encoding was recorded before that.
                    System.out.println(handler.encoding);
                }
            }
        }

        private String encoding;
        private Locator2 locator;

        @Override
        public void setDocumentLocator(Locator locator) {
            if (locator instanceof Locator2) {
                this.locator = (Locator2) locator;
            }
            else {
                this.encoding = "unknown";
            }
        }

        @Override
        public void startDocument() throws SAXException {
            if (locator != null) {
                this.encoding = locator.getEncoding();
            }
            throw new SAXException("Early termination");
        }

    }
    Sebastian, Nov 21, 2012
    #3
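One thing that may be worth testing against the code above (a hypothesis, not something confirmed in this thread): `startDocument()` may be reported before the parser has read the XML declaration, in which case `Locator2.getEncoding()` at that point would only reflect the encoding inferred from the first bytes. The sketch below records the locator's answer both at `startDocument()` and at the first `startElement()` so the two can be compared. `LocatorTiming` and `probe` are names invented here, and the result depends on the parser implementation in use.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

public class LocatorTiming extends DefaultHandler {
    private Locator2 locator;
    private String atStartDocument;
    private String atFirstElement;

    @Override
    public void setDocumentLocator(Locator l) {
        if (l instanceof Locator2) {
            locator = (Locator2) l;
        }
    }

    @Override
    public void startDocument() {
        if (locator != null) {
            atStartDocument = locator.getEncoding();
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (locator != null) {
            atFirstElement = locator.getEncoding();
        }
        // The root element is enough; abort so huge files are not read fully.
        throw new SAXException("Early termination");
    }

    /** Hypothetical helper: returns {value at startDocument, value at first element}. */
    public static String[] probe(byte[] xml) throws Exception {
        LocatorTiming handler = new LocatorTiming();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xml)), handler);
        } catch (SAXException stop) {
            // Either the deliberate abort above or a real parse error;
            // in both cases report what was recorded so far.
        }
        return new String[] {handler.atStartDocument, handler.atFirstElement};
    }
}
```

If the two values differ for a file declared as windows-1250, the timing of the query is the likely culprit; if they are both "UTF-8", the cause lies elsewhere.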
  4. Sebastian

    Lew Guest

    Sebastian wrote:
    > Lew wrote:
    >> Sebastian wrote:
    >>> I discovered this post:
    >>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >>>
    >>> and implemented both approaches (SAX and Xerces XNI).

    >
    > [snip]
    >
    >> Your problem is writing the file, no? That has nothing to do with parsing.

    >
    > No, it is with parsing the file. Parsing with the purpose of detecting
    > the encoding.


    Not clear from your phrasing.

    >> If your problem is with reading the file, then the encoding in the XML declaration
    >> should suffice to guide the parser.

    >
    > My question is exactly why in this case this does not suffice.


    Did my answer to that question not suffice?

    I notice you didn't address my answer in your response; in fact you snipped it.

    --
    Lew
    Lew, Nov 22, 2012
    #4
  5. Sebastian

    Sebastian Guest

    On 22.11.2012 01:37, Lew wrote:
    > Sebastian wrote:
    >> Lew wrote:
    >>> Sebastian wrote:
    >>>> I discovered this post:
    >>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >>>>
    >>>> and implemented both approaches (SAX and Xerces XNI).

    >>
    >> [snip]
    >>
    >>> Your problem is writing the file, no? That has nothing to do with parsing.

    >>
    >> No, it is with parsing the file. Parsing with the purpose of detecting
    >> the encoding.

    >
    > Not clear from your phrasing.
    >
    >>> If your problem is with reading the file, then the encoding in the XML declaration
    >>> should suffice to guide the parser.

    >>
    >> My question is exactly why in this case this does not suffice.

    >
    > Did my answer to that question not suffice?
    >
    > I notice you didn't address my answer in your response; in fact you snipped it.


    The answer cannot be that windows-1250 is non-standard. In fact, the
    declared encoding of the XML file does not seem to matter. The code will
    always output "UTF-8".

    I am using Java 7 on Windows XP.

    -- Sebastian
    Sebastian, Nov 22, 2012
    #5
  6. Sebastian

    markspace Guest

    On 11/21/2012 10:41 PM, Sebastian wrote:

    >
    > The answer cannot be that windows-1250 is non-standard. In fact, the
    > declared encoding of the XML file does not seem to matter. The code will
    > always output "UTF-8".
    >


    Maybe this quote from the article will help you out:

    "This approach works 90 percent of the time, maybe a little more. But
    SAX parsers aren't required to support the Locator interface, much less
    Locator2, and a few don't. A second option, if you know you're using
    Xerces, is to work with XNI"


    Since the output of the program is "unknown", I'd guess that this
    particular SAX parser doesn't support Locator2, like it says.
    markspace, Nov 22, 2012
    #6
  7. On 22/11/12 07:18, markspace wrote:
    > On 11/21/2012 10:41 PM, Sebastian wrote:
    >>
    >> The answer cannot be that windows-1250 is non-standard. In fact, the
    >> declared encoding of the XML file does not seem to matter. The code will
    >> always output "UTF-8".
    >>

    >
    > Maybe this quote from the article will help you out:
    >
    > "This approach works 90 percent of the time, maybe a little more. But
    > SAX parsers aren't required to support the Locator interface, much
    > less Locator2, and a few don't. A second option, if you know you're
    > using Xerces, is to work with XNI"
    >
    >
    > Since the output of the program is "unknown", I'd guess that this
    > particular SAX parser doesn't support Locator2, like it says.


    Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
    is getting a Locator2.


    --
    ss at comp dot lancs dot ac dot uk
    Steven Simpson, Nov 22, 2012
    #7
  8. Sebastian

    Roedy Green Guest

    On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Does anyone have an idea why that is so? And how I could
    >go about making some XML parser determine the correct encoding?


    See http://mindprod.com/products2.html#ENCODINGRECOGNISER

    This is a manual assist tool to help you guess the encoding.

    Encodings are not embedded in any way in files. You just have to know.

    ARGHHH!

    See http://mindprod.com/jgloss/encoding.html
    for how to use native2ascii to interconvert encodings.

    The XML world likes UTF-8. Using anything else is just asking for
    trouble.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Students who hire or con others to do their homework are as foolish
    as couch potatoes who hire others to go to the gym for them.
    Roedy Green, Nov 22, 2012
    #8
  9. Sebastian

    markspace Guest

    On 11/21/2012 11:53 PM, Steven Simpson wrote:

    > Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
    > is getting a Locator2.



    Oh, well mine doesn't. I guess we have two different implementations.
    Sorry can't guess what is up with yours.
    markspace, Nov 22, 2012
    #9
  10. On 2012-11-22 11:24, Roedy Green <> wrote:
    > On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
    ><> wrote, quoted or indirectly quoted
    > someone who said :
    >>Does anyone have an idea why that is so? And how I could
    >>go about making some XML parser determine the correct encoding?

    >
    > See http://mindprod.com/products2.html#ENCODINGRECOGNISER
    >
    > This is a manual assist tool to help you guess the encoding.


    No need to guess.

    > Encodings are not embedded in any way in files. You just have to know.


    Not true for XML. The file Sebastian posted starts with

    <?xml version="1.0" encoding="windows-1250"?>

    hp


    --
    _ | Peter J. Holzer | Fluch der elektronischen Textverarbeitung:
    |_|_) | Sysadmin WSR | Man feilt solange an seinen Text um, bis
    | | | | die Satzbestandteile des Satzes nicht mehr
    __/ | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel
    Peter J. Holzer, Nov 23, 2012
    #10
  11. Sebastian

    Arne Vajhøj Guest

    Sebastian wrote:
    > I discovered this post:
    > http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >
    > and implemented both approaches (SAX and Xerces XNI).
    >
    > Unfortunately, for the attached XML file, both methods
    > output an encoding of UTF-8, while looking at the file


    I tried.

    And I cannot get it to work either.

    SAX detects UTF-8 no matter what the encoding really is.

    StAX seems never to detect it, and W3C DOM seems to
    always detect it correctly.

    I cannot offer an explanation. Obviously the parsers
    need to detect it correctly internally. Otherwise they
    could not parse correctly.

    Code below.

    Arne

    ====

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamReader;

    import org.w3c.dom.Document;
    import org.xml.sax.InputSource;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.ext.Locator2;
    import org.xml.sax.helpers.XMLReaderFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public class XmlEncodingDectect {
        private static final String FNM1 = "/work/foobar1.xml";
        private static final String FNM2 = "/work/foobar2.xml";
        private static final String FNM3 = "/work/foobar3.xml";

        private static void gen1() throws IOException {
            PrintWriter pw = new PrintWriter(new FileWriter(FNM1));
            pw.println("<?xml version='1.0' encoding='UTF-8'?>");
            pw.println("<root/>");
            pw.close();
        }

        private static void gen2() throws IOException {
            PrintWriter pw = new PrintWriter(new FileWriter(FNM2));
            pw.println("<?xml version='1.0' encoding='ISO-8859-1'?>");
            pw.println("<root/>");
            pw.close();
        }

        private static void gen3() throws IOException {
            PrintWriter pw = new PrintWriter(new FileWriter(FNM3));
            pw.println("<?xml version='1.0'?>");
            pw.println("<root/>");
            pw.close();
        }

        private static String encoding;

        private static String detectSAX(String fnm) throws SAXException,
                IOException {
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                private Locator2 locator;

                @Override
                public void setDocumentLocator(Locator locator) {
                    if (locator instanceof Locator2) {
                        this.locator = (Locator2) locator;
                    } else {
                        encoding = "Unknown";
                    }
                }

                @Override
                public void startDocument() throws SAXException {
                    if (locator != null) {
                        encoding = locator.getEncoding();
                    }
                }
            });
            parser.parse(new InputSource(new FileInputStream(fnm)));
            return encoding;
        }

        private static String detectW3CDOM(String fnm) throws
                ParserConfigurationException, FileNotFoundException, SAXException,
                IOException {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new FileInputStream(fnm)));
            String encoding = doc.getXmlEncoding();
            return encoding != null ? encoding : "Unknown";
        }

        private static String detectStAX(String fnm) throws
                FileNotFoundException, XMLStreamException {
            XMLInputFactory xif = XMLInputFactory.newInstance();
            XMLStreamReader xsr = xif.createXMLStreamReader(new
                    FileInputStream(fnm));
            String encoding = null;
            while (xsr.hasNext()) {
                xsr.next();
                switch (xsr.getEventType()) {
                    case XMLStreamReader.START_DOCUMENT:
                        encoding = xsr.getEncoding();
                        break;
                    default:
                        break;
                }
            }
            return encoding != null ? encoding : "Unknown";
        }

        public static void main(String[] args) throws IOException,
                SAXException, ParserConfigurationException, XMLStreamException {
            gen1();
            System.out.println(detectSAX(FNM1));
            System.out.println(detectW3CDOM(FNM1));
            System.out.println(detectStAX(FNM1));
            gen2();
            System.out.println(detectSAX(FNM2));
            System.out.println(detectW3CDOM(FNM2));
            System.out.println(detectStAX(FNM2));
            gen3();
            System.out.println(detectSAX(FNM3));
            System.out.println(detectW3CDOM(FNM3));
            System.out.println(detectStAX(FNM3));
        }
    }
    Arne Vajhøj, Nov 24, 2012
    #11
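A possible reason the StAX variant above never reports anything: an `XMLStreamReader` is already positioned on `START_DOCUMENT` when it is created, so a loop that calls `next()` before looking at the event type never sees that event. The sketch below queries the reader in its initial state and streams nothing further, which also suits large files. `StaxDeclaredEncoding` is a name invented here, and which value a given StAX implementation returns is an assumption to verify.

```java
import java.io.ByteArrayInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxDeclaredEncoding {
    /**
     * Hypothetical helper: returns the encoding named in the XML
     * declaration, or null if the declaration does not name one.
     */
    public static String declaredEncoding(byte[] xml) throws XMLStreamException {
        XMLStreamReader xsr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            // Ask before the first call to next(): the reader starts out
            // on the START_DOCUMENT event.
            if (xsr.getEventType() == XMLStreamReader.START_DOCUMENT) {
                return xsr.getCharacterEncodingScheme();
            }
            return null;
        } finally {
            xsr.close();
        }
    }
}
```

`getCharacterEncodingScheme()` reports the declared encoding, while `getEncoding()` reports the input encoding if the reader knows it; the two need not agree.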
  12. Sebastian

    Arne Vajhøj Guest

    On 11/21/2012 2:31 PM, Lew wrote:
    > Sebastian wrote:
    >> I discovered this post:
    >> http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >>
    >> and implemented both approaches (SAX and Xerces XNI).
    >>
    >> Unfortunately, for the attached XML file, both methods

    >
    > Don't do attachments on Usenet.
    >
    >> output an encoding of UTF-8, while looking at the file

    >
    > as they should.


    No.

    If the XML prolog specifies another encoding than UTF-8,
    then it should not return UTF-8.

    > XML should be encoded in UTF-8 nearly always.


    XML allows for other encodings.

    And Java XML parsers support it.

    So it should always work.

    > But SAX is a parser, so it doesn't output, it inputs. What are you telling us?


    Output usually means System.out.println - that works fine with a parser.

    > If your problem is with reading the file, then the encoding in the XML declaration
    > should suffice to guide the parser. But then why do you talk about methods that
    > "output an encoding"?


    Because he wants to know what it is.

    > However, according to
    > http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
    > supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
    > ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
    > and EUC-JP,
    > as you would have learned had you researched your question.
    >
    > So it looks like you must not accept XML documents with such a
    > non-standard encoding.


    Those who have researched would know that the XML spec does not
    limit the encodings at all. An XML processor must support UTF-8
    and UTF-16, but is free to support others.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #12
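Since support for encodings beyond UTF-8 and UTF-16 is optional, it can be useful to check up front whether the platform knows a declared name at all. A minimal sketch using the JVM's charset registry; `EncodingSupport` is an illustrative name, and it is an assumption that the XML parser in use accepts exactly the names the JVM does (a parser may keep its own table of encoding names).

```java
import java.nio.charset.Charset;

public class EncodingSupport {
    /** Hypothetical helper: true if this JVM can decode the named charset. */
    public static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "UTF-16", "windows-1250"}) {
            System.out.println(name + " usable: " + isUsable(name));
        }
    }
}
```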
  13. Sebastian

    Arne Vajhøj Guest

    On 11/22/2012 2:18 AM, markspace wrote:
    > On 11/21/2012 10:41 PM, Sebastian wrote:
    >> The answer cannot be that windows-1250 is non-standard. In fact, the
    >> declared encoding of the XML file does not seem to matter. The code will
    >> always output "UTF-8".
    >>

    >
    > Maybe this quote from the article will help you out:
    >
    > "This approach works 90 percent of the time, maybe a little more. But
    > SAX parsers aren't required to support the Locator interface, much less
    > Locator2, and a few don't. A second option, if you know you're using
    > Xerces, is to work with XNI"
    >
    > Since the output of the program is "unknown", I'd guess that this
    > particular SAX parser doesn't support Locator2, like it says.


    Except that it does not return Unknown - it returns UTF-8.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #13
  14. Sebastian

    Arne Vajhøj Guest

    On 11/23/2012 6:13 PM, Peter J. Holzer wrote:
    > On 2012-11-22 11:24, Roedy Green <> wrote:
    >> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
    >> <> wrote, quoted or indirectly quoted
    >> someone who said :
    >>> Does anyone have an idea why that is so? And how I could
    >>> go about making some XML parser determine the correct encoding?

    >>
    >> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
    >>
    >> This is a manual assist tool to help you guess the encoding.

    >
    > No need to guess.
    >
    >> Encodings are not embedded in any way in files. You just have to know.

    >
    > Not true for XML. The file Sebastian posted starts with
    >
    > <?xml version="1.0" encoding="windows-1250"?>


    New around here?

    Don't expect Roedy's posts to relate that much to what he is
    replying to.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #14
  15. Sebastian

    Lew Guest

    Arne Vajhøj wrote:
    > Lew wrote:
    >> Sebastian wrote:

    [snip]
    >>> output an encoding of UTF-8, while looking at the file

    >> as they should.

    >
    > No.
    >
    > If the XML prolog specifies another encoding than UTF-8,
    > then it should not return UTF-8.


    True, but I'm saying they should specify UTF-8 in the prolog.

    >> XML should be encoded in UTF-8 nearly always.


    See?

    > XML allows for other encodings.


    So? You should use UTF-8 nearly always, i.e., unless there's a compelling
    reason not to.

    > And Java XML parsers support it.


    For those rare times when you deviate from the usual UTF-8.

    > So it should always work.


    >> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

    >
    > Output usually means System.out.println - that works fine with a parser.


    His phrasing wasn't clear to me. That's why I asked for clarification.

    I could have guessed, too.

    >> If your problem is with reading the file, then the encoding in the XML declaration


    See? You're preaching to the choir.

    >> should suffice to guide the parser. But then why do you talk about methods that


    >> "output an encoding"?

    >
    > Because he wants to know what it is.
    >
    >> However, according to
    >> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
    >> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
    >> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
    >> and EUC-JP,
    >> So it looks like you must not accept XML documents with such a
    >> non-standard encoding.

    >
    > Those who have researched would know that the XML spec does not
    > limit the encodings at all. An XML processor must support UTF-8
    > and UTF-16, but is free to support others.


    Perhaps the OP's parser doesn't exercise that freedom, judging by the
    symptoms.

    'sall I'm sayin'.

    Obviously I don't know the answer, but he's asking for suggestions
    to investigate, AIUI. He's having encoding problems. His XML is apparently
    encoded in Windows-1252, a notoriously funky encoding especially for
    the variety of characters with which one might wish to deal. So why not
    investigate obtaining material that isn't in such a notoriously funky
    encoding, like, oh, say, the old reliable standard UTF-8?

    Perhaps that isn't feasible, for reasons as yet unstated, but that's
    the nature of brainstorming.

    --
    Lew
    Lew, Nov 24, 2012
    #15
  16. Sebastian

    Sebastian Guest

    Sebastian wrote:
    > I discovered this post:
    > http://www.ibm.com/developerworks/library/x-tipsaxxni/
    >
    > and implemented both approaches (SAX and Xerces XNI).
    >
    > Unfortunately, for the attached XML file, both methods
    > output an encoding of UTF-8, while looking at the file


    On 24.11.2012 11:14, Lew wrote:
    [snip]
    >
    > Obviously I don't know the answer, but he's asking for suggestions
    > to investigate, AIUI. He's having encoding problems. His XML is apparently
    > encoded in Windows-1252, a notoriously funky encoding especially for
    > the variety of characters with which one might wish to deal. So why not
    > investigate obtaining material that isn't in such a notoriously funky
    > encoding, like, oh, say, the old reliable standard UTF-8?
    >
    > Perhaps that isn't feasible, for reasons as yet unstated, but that's
    > the nature of brainstorming.


    Here's the background to my question:
    I am dealing with other people's code that processes XML files.
    Unfortunately, that code, which I have no control over, seems to use
    some home-grown parsing algorithm, which DOES NOT always detect
    encodings correctly, but expects to be told them.

    The XML files come from several sources in different encodings, and I
    cannot dictate anything there either.

    So I thought, well, why don't I add a little preprocessor to discover
    the encoding to give to that terrible file processor I'm stuck with.
    Shouldn't be that hard, because, as Arne said:

    > On 24.11.2012 03:11, Arne Vajhøj wrote:
    > Obviously the parsers
    > need to detect it correctly internally. Otherwise they
    > could not parse correctly.


    The only approach that seems to work (at least for Arne), namely
    W3C DOM, is out of the question for me, because the files are
    potentially huge and I cannot keep a complete document model in memory.
    I need something along the lines of SAX. I'll have to look around some more.

    -- Sebastian

    PS: The author of that article from which I took the code isn't just
    anyone. Elliotte Rusty Harold hosts the XML web site
    http://www.cafeconleche.org/ and is affiliated with the University of
    North Carolina. Perhaps I could try to get in touch with him.
    Sebastian, Nov 24, 2012
    #16
  17. Sebastian

    Arne Vajhøj Guest

    On 11/24/2012 4:18 PM, Sebastian wrote:
    > On 24.11.2012 11:14, Lew wrote:
    > [snip]
    >>
    >> Obviously I don't know the answer, but he's asking for suggestions
    >> to investigate, AIUI. He's having encoding problems. His XML is
    >> apparently
    >> encoded in Windows-1252, a notoriously funky encoding especially for
    >> the variety of characters with which one might wish to deal. So why not
    >> investigate obtaining material that isn't in such a notoriously funky
    >> encoding, like, oh, say, the old reliable standard UTF-8?
    >>
    >> Perhaps that isn't feasible, for reasons as yet unstated, but that's
    >> the nature of brainstorming.

    >
    > Here's the background to my question:
    > I am dealing with other people's code that processes XML files.
    > Unfortunately, that code, which I have no control over, seems to use
    > some home-grown parsing algorithm, which DOES NOT always detect
    > encodings correctly, but expects to be told them.
    >
    > The XML files come from several sources in different encodings, and I
    > cannot dictate anything there either.


    I would consider it tempting to rewrite that app to use a standard
    XML parser.

    It would solve this problem and possibly also some future problems.

    > So I thought, well, why don't I add a little preprocessor to discover
    > the encoding to give to that terrible file processor I'm stuck with.
    > Shouldn't be that hard, because, as Arne said:
    >
    > > On 24.11.2012 03:11, Arne Vajhøj wrote:
    > > Obviously the parsers
    > > need to detect it correctly internally. Otherwise they
    > > could not parse correctly.

    >
    > The only approach that seems to work (at least for Arne), namely
    > W3C DOM, is out of the question for me, because the files are
    > potentially huge and I cannot keep a complete document model in memory.
    > I need something along the lines of SAX. I'll have to look around some
    > more.


    What about just reading the first few lines until you have the
    XML declaration?

    Parsing the encoding out of that should be simple.

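    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    private static final Pattern encpat =
            Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");

    private static String detectSimple(String fnm) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(fnm));
        String firstpart = "";
        String line;
        // Read up to the end of the first tag; stop at end of file so that
        // an input without '>' cannot loop forever.
        while (!firstpart.contains(">") && (line = br.readLine()) != null) {
            firstpart += line;
        }
        br.close();
        Matcher m = encpat.matcher(firstpart);
        if (m.find()) {
            return m.group(1);
        } else {
            return "Unknown";
        }
    }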

    I do not like the solution, but given the restrictions in the
    context, then maybe it is what you need.

    > PS: The author of that article from which I took the code isn't just
    > anyone. Elliotte Rusty Harold hosts the XML web site
    > http://www.cafeconleche.org/ and is affiliated with the University of
    > North Carolina. Perhaps I could try to get in touch with him.


    Teaching at a university is no guarantee of good practical
    programming skills.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #17
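A variant of the declaration-sniffing idea above that works on raw bytes rather than a `FileReader`, which decodes with the platform default charset. It reads at most the first kilobyte and decodes it as ISO-8859-1, which maps every byte to a character and so never fails. This is a sketch under stated assumptions: `XmlDeclSniffer`, the 1024-byte limit, and the regular expression are choices made here, and the approach only covers ASCII-compatible encodings (a UTF-16 file yields null).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlDeclSniffer {
    // Matches only a declaration at the very start of the input.
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*(['\"])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Returns the declared encoding name, or null if none is found. */
    public static String sniff(byte[] head) {
        int off = 0;
        // Skip a UTF-8 byte order mark if present.
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            off = 3;
        }
        String text = new String(head, off, head.length - off,
                StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    /** Reads only the first 1024 bytes of the file, so size does not matter. */
    public static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1024];
            int n = 0;
            int r;
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) > 0) {
                n += r;
            }
            return sniff(Arrays.copyOf(buf, n));
        }
    }
}
```

A null result means no encoding was declared in a form this sketch recognizes; what default to fall back to is left to the caller.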
  18. Sebastian

    Arne Vajhøj Guest

    On 11/24/2012 5:14 AM, Lew wrote:
    > Arne Vajhøj wrote:
    >> Lew wrote:
    >>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

    >>
    >> Output usually means System.out.println - that works fine with a parser.

    >
    > His phrasing wasn't clear to me. That's why I asked for clarification.


    Then maybe we need "How to ask for clarifications the smart way".

    >>> However, according to
    >>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
    >>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
    >>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
    >>> and EUC-JP,
    >>> So it looks like you must not accept XML documents with such a
    >>> non-standard encoding.

    >>
    >> Those who have researched would know that the XML spec does not
    >> limit the encodings at all. An XML processor must support UTF-8
    >> and UTF-16, but is free to support others.

    >
    > Perhaps the OP's parser doesn't exercise that freedom, judging by the
    > symptoms.


    There is nothing in the OP's symptoms that indicates a lack of support
    for encodings.

    The OP's symptom is that the file parses fine with encoding XYZ, but when
    asked by the caller the parser wrongly claims to be using UTF-8.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #18
  19. Sebastian

    Arne Vajhøj Guest

    On 11/24/2012 5:14 AM, Lew wrote:
    > Obviously I don't know the answer, but he's asking for suggestions
    > to investigate, AIUI. He's having encoding problems. His XML is apparently
    > encoded in Windows-1252, a notoriously funky encoding especially for
    > the variety of characters with which one might wish to deal.


    CP-1252 is just another encoding. It is not more or less funky than
    any other encoding.

    In fact it is identical to ISO-8859-1 for all characters except
    128-159, which are control characters/unmapped in ISO-8859-1 but have
    various extra values in CP-1252.

    > So why not
    > investigate obtaining material that isn't in such a notoriously funky
    > encoding, like, oh, say, the old reliable standard UTF-8?


    If one can choose the data files and the software, then life is easy.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #19
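The relationship between CP-1252 and ISO-8859-1 described above can be seen by decoding single bytes with both charsets. A small sketch; `Cp1252Demo` and `decodeSingleByte` are illustrative names, and it assumes the JVM ships the `windows-1252` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Demo {
    /** Decodes one byte with the given charset and returns its code point. */
    public static int decodeSingleByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // 0x41 and 0xE4 lie outside the 128-159 range; 0x80 lies inside it.
        for (int b : new int[] {0x41, 0x80, 0xE4}) {
            System.out.printf("0x%02X  ISO-8859-1: U+%04X  windows-1252: U+%04X%n",
                    b,
                    decodeSingleByte(b, StandardCharsets.ISO_8859_1),
                    decodeSingleByte(b, cp1252));
        }
    }
}
```

Byte 0x80 is where the two differ: ISO-8859-1 yields the control character U+0080, while windows-1252 yields the Euro sign U+20AC.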
  20. Sebastian

    markspace Guest

    On 11/24/2012 1:18 PM, Sebastian wrote:
    > I am dealing with other people's code that processes XML files.
    > Unfortunately, that code, which I have no control over, seems to use
    > some home-grown parsing algorithm, which DOES NOT always detect
    > encodings correctly, but expects to be told them.



    That's not a big deal. Several of the Java components work this way.
    Open the file with an assumed encoding, and test the encoding. If you
    are wrong, throw an exception, which causes the stream to be re-opened
    with the correct encoding (now that the correct encoding has been detected).

    Be careful you're not subverting an established, working process here.

    I personally am still looking for an SSCCE, as your last one didn't
    reproduce the error for me.
    markspace, Nov 25, 2012
    #20