SAX character array

Duane Evenson

I am trying to parse a zipped XML file (an OpenDocument spreadsheet). The
XML is all on one long line.

The SAX parser seems to read its input into a character array of only 2048
characters. When a run of character data spans the end of that array, the
parser makes a second call to characters(), and the data ends up split into
two pieces.

What can I do to fix this?

Here's the pertinent portion of my code:
ZipFile zf;
DefaultHandler handler = new ParseHandler();
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
try {
    zf = new ZipFile(DATA_FILE_NAME);
    SAXParser saxParser = factory.newSAXParser();
    saxParser.parse(zf.getInputStream(zf.getEntry(CONTENT_FILE_NAME)),
            handler);
} catch ...

TIA
 
Adam Maass

In your implementation of characters(), you need to use a StringBuffer:

StringBuffer buf = new StringBuffer();

public void characters(char[] ch, int start, int length)
{
    buf.append(ch, start, length);
}


Depending on the structure of the XML you're parsing, you may need to keep a
stack of StringBuffers or pull other tricks so that characters() picks up
the right StringBuffer to append to.
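
A minimal sketch of the stack idea, in case it helps. The class name StackedTextHandler and the "name=text" result format are made up for illustration, and it uses StringBuilder instead of StringBuffer since a handler like this runs on a single thread. Each startElement pushes a fresh buffer, characters() appends to whichever buffer is on top, and endElement pops it:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps one text buffer per open element.
class StackedTextHandler extends DefaultHandler {
    private final Deque<StringBuilder> stack = new ArrayDeque<>();
    private final List<String> results = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A new element starts, so it gets its own buffer.
        stack.push(new StringBuilder());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element; always append.
        if (!stack.isEmpty()) {
            stack.peek().append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is complete, so its buffer now holds all of its direct text.
        results.add(qName + "=" + stack.pop());
    }

    List<String> getResults() {
        return results;
    }

    public static void main(String[] args) throws Exception {
        StackedTextHandler handler = new StackedTextHandler();
        String xml = "<a>x<b>y</b>z</a>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        System.out.println(handler.getResults()); // prints [b=y, a=xz]
    }
}
```

Note that the text of a child element is not included in its parent's buffer here; whether you want that depends on your document.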
 
Duane Evenson

That isn't the problem, or at least it isn't the solution. It would give me
one string buffer with all the spreadsheet cells concatenated together.
I want to process each cell separately.
Here is a code fragment from my program and its output:

public void characters(char buf[], int offset, int len)
        throws SAXException {
    String str = new String(buf, offset, len);
    System.out.println("buf.length: " + buf.length + " offset: " + offset
            + " len: " + len + " str: " + str);
}

# there should be one call to characters() per spreadsheet cell
buf.length: 2048 offset: 525 len: 10 str: 24/12/1999
buf.length: 2048 offset: 635 len: 10 str: Overwaitea
buf.length: 2048 offset: 726 len: 9 str: Groceries
buf.length: 2048 offset: 835 len: 4 str: 4.99
buf.length: 2048 offset: 920 len: 3 str: CAD
buf.length: 2048 offset: 1004 len: 8 str: BoM - MC
buf.length: 2048 offset: 1093 len: 1 str: x
buf.length: 2048 offset: 1175 len: 9 str: Groceries
buf.length: 2048 offset: 1265 len: 1 str: x
buf.length: 2048 offset: 1401 len: 1 str: x
buf.length: 2048 offset: 1570 len: 10 str: 30/12/1999
buf.length: 2048 offset: 1680 len: 7 str: Gas Bar
buf.length: 2048 offset: 1768 len: 3 str: Gas
buf.length: 2048 offset: 1872 len: 5 str: 10.51
buf.length: 2048 offset: 1958 len: 3 str: CAD
buf.length: 2048 offset: 2042 len: 6 str: BoM -
# Note how the string is split across calls to characters
# and how it happens at the end of the character array.
buf.length: 2048 offset: 0 len: 2 str: MC
buf.length: 2048 offset: 83 len: 1 str: x
buf.length: 2048 offset: 165 len: 3 str: Gas
....

I need to find some way to overcome this segmentation of the input data.
 
William Brogden

The problem is that characters() may be called more than once while
parsing a single element. You should create a StringBuffer when you get
startElement for the element you want to capture, append every
characters() call to it, and convert it to a String ONLY when you get
the endElement call.

The reason is that the SAX parser calls characters() when it reaches
the end of a bufferload, so any element text split over more than one
bufferload will get multiple calls.
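
Roughly, that could look like the sketch below. The class name CellTextCollector and the target element name "p" are assumptions for illustration (I am guessing the cell text sits in a paragraph element with that local name; check your content.xml), and it uses StringBuilder in place of StringBuffer, which is enough for a single-threaded handler:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the complete text of each target element.
class CellTextCollector extends DefaultHandler {
    // Assumed local name of the element holding the cell text.
    private static final String TARGET = "p";

    private final List<String> cells = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a target element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (TARGET.equals(localName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Append only; the text may arrive in more than one call.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (TARGET.equals(localName) && current != null) {
            // Only now is the cell text known to be complete.
            cells.add(current.toString());
            current = null;
        }
    }

    List<String> getCells() {
        return cells;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is filled in
        CellTextCollector handler = new CellTextCollector();
        String xml = "<row><p>BoM - MC</p><p>x</p><p>Gas</p></row>";
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        System.out.println(handler.getCells()); // prints [BoM - MC, x, Gas]
    }
}
```

With this, a cell such as "BoM - MC" that arrives as "BoM - " in one characters() call and "MC" in the next still ends up as a single string, because nothing is processed until endElement.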
 
