SAX character array

Discussion in 'Java' started by Duane Evenson, May 26, 2006.

  1. I am trying to parse a zipped XML file (an OpenDocument spreadsheet). The
    XML is one long line.

    The SAX parser passes character arrays of only 2048 characters. When a
    cell's text spans that boundary, the parser makes a second call to
    characters(), so the character data arrives split into two pieces.

    What can I do to fix this?

    Here's the pertinent portion of my code:
    ZipFile zf;
    DefaultHandler handler = new ParseHandler();
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    try {
        zf = new ZipFile(DATA_FILE_NAME);
        SAXParser saxParser = factory.newSAXParser();
        saxParser.parse(zf.getInputStream(zf.getEntry(CONTENT_FILE_NAME)),
                handler);
    } catch ...

    TIA
    Duane Evenson, May 26, 2006
    #1

  2. Adam Maass Guest

    "Duane Evenson" <> wrote:
    >I am trying to parse a zipped XML file (open document spreadsheet). It is
    > composed of one long line of code.
    >
    > The SAX parser takes character arrays of only 2048 characters. When a
    > character argument spans this break, the result is a second parser call to
    > characters(). The character data ends up being split into two components.
    >
    > What can I do to fix this?
    >
    > Here's the pertinent portion of my code:
    > ZipFile zf;
    > DefaultHandler handler = new ParseHandler();
    > SAXParserFactory factory = SAXParserFactory.newInstance();
    > factory.setNamespaceAware(true);
    > try {
    > zf = new ZipFile(DATA_FILE_NAME);
    > SAXParser saxParser = factory.newSAXParser();
    > saxParser.parse(zf.getInputStream(zf.getEntry(CONTENT_FILE_NAME)),
    > handler);
    > } catch ...
    >
    > TIA
    >


    In your implementation of characters(), you need to use a StringBuffer:

    StringBuffer buf = new StringBuffer();

    public void characters(char[] ch, int start, int length)
    {
        buf.append(ch, start, length);
    }


    Depending on the structure of the XML you're parsing, you may need to keep a
    stack of StringBuffers or pull other tricks so that characters() picks up
    the right StringBuffer to append to.
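
    A minimal sketch of the "stack of buffers" idea, for illustration only.
    The class name BufferStackHandler, the collect() helper, and the
    "name=text" result format are all hypothetical and not from the thread; it
    uses StringBuilder where the post uses StringBuffer. Each startElement()
    pushes a fresh buffer, characters() appends to the top one, and
    endElement() pops it, so text from separate characters() calls is joined
    per element.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one text buffer per open element, kept on a stack.
public class BufferStackHandler extends DefaultHandler {
    private final Deque<StringBuilder> stack = new ArrayDeque<>();
    private final List<String> results = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A new element starts: give it its own buffer.
        stack.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; always append to the top buffer.
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is complete: its buffer now holds all of its direct text.
        results.add(qName + "=" + stack.pop());
    }

    // Illustrative helper: parse a string and return "name=text" entries in closing order.
    public static List<String> collect(String xml) {
        try {
            BufferStackHandler handler = new BufferStackHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    Because each element gets its own buffer, text that belongs to a parent
    element is not mixed with the text of its children.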
    Adam Maass, May 26, 2006
    #2

  3. On Fri, 26 May 2006 09:18:47 -0700, Adam Maass wrote:

    >
    > In your implementation of characters(), you need to use a StringBuffer:
    >
    > StringBuffer buf = new StringBuffer();
    > public void characters(char[] ch, int start, int length)
    > {
    > buf.append(ch, start, length);
    > }
    >
    >
    > Depending on the structure of the XML you're parsing, you may need to keep a
    > stack of StringBuffers or pull other tricks so that characters() picks up
    > the right StringBuffer to append to.


    That isn't the problem, or at least it isn't the solution. It would produce
    one string buffer holding all the spreadsheet cells concatenated together,
    and I want to process each cell separately.
    Here is a code fragment from my program and its output:

    public void characters(char[] buf, int offset, int len)
            throws SAXException {
        String str = new String(buf, offset, len);
        System.out.println("buf.length: " + buf.length + " offset: " + offset
                + " len: " + len + " str: " + str);
    }

    # characters() should be called once per spreadsheet cell
    buf.length: 2048 offset: 525 len: 10 str: 24/12/1999
    buf.length: 2048 offset: 635 len: 10 str: Overwaitea
    buf.length: 2048 offset: 726 len: 9 str: Groceries
    buf.length: 2048 offset: 835 len: 4 str: 4.99
    buf.length: 2048 offset: 920 len: 3 str: CAD
    buf.length: 2048 offset: 1004 len: 8 str: BoM - MC
    buf.length: 2048 offset: 1093 len: 1 str: x
    buf.length: 2048 offset: 1175 len: 9 str: Groceries
    buf.length: 2048 offset: 1265 len: 1 str: x
    buf.length: 2048 offset: 1401 len: 1 str: x
    buf.length: 2048 offset: 1570 len: 10 str: 30/12/1999
    buf.length: 2048 offset: 1680 len: 7 str: Gas Bar
    buf.length: 2048 offset: 1768 len: 3 str: Gas
    buf.length: 2048 offset: 1872 len: 5 str: 10.51
    buf.length: 2048 offset: 1958 len: 3 str: CAD
    buf.length: 2048 offset: 2042 len: 6 str: BoM -
    # Note how the string is split across calls to characters
    # and how it happens at the end of the character array.
    buf.length: 2048 offset: 0 len: 2 str: MC
    buf.length: 2048 offset: 83 len: 1 str: x
    buf.length: 2048 offset: 165 len: 3 str: Gas
    ....

    I need to find some way to overcome this segmentation of the input data.
    Duane Evenson, May 27, 2006
    #3
  4. On Sat, 27 May 2006 11:50:09 GMT, Duane Evenson <>
    wrote:

    >
    > This isn't the problem, or at least the solution. This would result in one
    > string buffer composed of all the spreadsheet cells concatenated together.
    > I want to process each cell separately.
    > Here is a code fragment from my program and the output:
    >
    > public void characters(char buf[], int offset, int len)
    > throws SAXException {
    > String str = new String(buf, offset, len);
    > System.out.println("buf.length: " + buf.length + " offset: " + offset
    > + " len: " + len + " str: "+ str);
    > }


    The problem is that characters() may be called more than once while
    parsing a single element. Create a StringBuffer when startElement() fires
    for the element you want to capture, append the contents of every
    characters() call to it, and convert it to a String ONLY when you get
    the matching endElement() call.

    The reason is that the SAX parser calls characters() when it reaches
    the end of a bufferload, so the text of any element that is split over
    more than one bufferload produces multiple calls. The SAX API permits
    this: a parser may deliver contiguous character data in one chunk or in
    several.
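
    The approach described above could be sketched roughly as follows. This is
    an illustration under assumptions, not code from the thread: the class
    name CellTextHandler, the parse() helper, and the idea of selecting the
    element by a local name passed to the constructor are hypothetical, and the
    right element name for a real spreadsheet has to be checked against the
    actual file. StringBuilder is used in place of StringBuffer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of each element named cellName.
public class CellTextHandler extends DefaultHandler {
    private final String cellName;                 // assumed local name of the element to capture
    private final List<String> cells = new ArrayList<>();
    private StringBuilder current;                 // non-null only while inside a captured element

    public CellTextHandler(String cellName) {
        this.cellName = cellName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (cellName.equals(localName)) {
            current = new StringBuilder();         // start a fresh buffer for this cell
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);     // may run several times for one cell
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && cellName.equals(localName)) {
            cells.add(current.toString());         // the cell is complete only here
            current = null;
        }
    }

    public List<String> getCells() {
        return cells;
    }

    // Illustrative helper: parse an XML string and return the captured texts.
    public static List<String> parse(String xml, String cellName) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);       // needed so localName is populated
            SAXParser parser = factory.newSAXParser();
            CellTextHandler handler = new CellTextHandler(cellName);
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.getCells();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    The same handler could be passed to saxParser.parse() with the ZipFile
    input stream; each cell's text is then processed in endElement(), however
    many characters() calls it took to deliver it.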


    >
    > # each call to characters should occur for each spreadsheet cell
    > buf.length: 2048 offset: 525 len: 10 str: 24/12/1999
    > buf.length: 2048 offset: 635 len: 10 str: Overwaitea
    > buf.length: 2048 offset: 726 len: 9 str: Groceries
    > buf.length: 2048 offset: 835 len: 4 str: 4.99
    > buf.length: 2048 offset: 920 len: 3 str: CAD
    > buf.length: 2048 offset: 1004 len: 8 str: BoM - MC
    > buf.length: 2048 offset: 1093 len: 1 str: x
    > buf.length: 2048 offset: 1175 len: 9 str: Groceries
    > buf.length: 2048 offset: 1265 len: 1 str: x
    > buf.length: 2048 offset: 1401 len: 1 str: x
    > buf.length: 2048 offset: 1570 len: 10 str: 30/12/1999
    > buf.length: 2048 offset: 1680 len: 7 str: Gas Bar
    > buf.length: 2048 offset: 1768 len: 3 str: Gas
    > buf.length: 2048 offset: 1872 len: 5 str: 10.51
    > buf.length: 2048 offset: 1958 len: 3 str: CAD
    > buf.length: 2048 offset: 2042 len: 6 str: BoM -
    > # Note how the string is split across calls to characters
    > # and how it happens at the end of the character array.
    > buf.length: 2048 offset: 0 len: 2 str: MC
    > buf.length: 2048 offset: 83 len: 1 str: x
    > buf.length: 2048 offset: 165 len: 3 str: Gas
    > ...
    >
    > I need to find some way to overcome this segmentation of the input data.
    >




    William Brogden, May 27, 2006
    #4