SAX parsing goes 'all funny' on value [en]

Fred

Hi,

I am parsing a small XML document and the parsing goes 'all funny'
when parsing this element: <useragent>Mozilla/4.61 [en] (WinNT;
I)</useragent>

I've created a subclass of org.xml.sax.helpers.DefaultHandler, and an
instance of this subclass is set on my
org.apache.xerces.parsers.SAXParser:

SAXParser parser = new SAXParser();
parser.setContentHandler(pdh);
parser.setErrorHandler(pdh);

I've found that the

public void characters(char[] ch, int offset, int length) throws
SAXException

method is called once per element parsed, and my debug output confirms
this. For example, when parsing <useragent>MobileExplorer/3.00 (Mozilla/1.22;
compatible; MMEF300; Microsoft; Windows; GenericLarge)</useragent> it
reads:

D: reading characters...(useragent) length=89, offset=721,
found='MobileExplorer/3.00 (Mozilla/1.22; compatible; MMEF300;
Microsoft; Windows; GenericLarge)'
D: ending element (useragent) current element value is :
[MobileExplorer/3.00 (Mozilla/1.22; compatible; MMEF300; Microsoft;
Windows; GenericLarge)]


But... when parsing <useragent>Mozilla/4.61 [en] (WinNT;
I)</useragent>
the debug output reads

D: reading characters...(useragent) length=16, offset=1097,
found='Mozilla/4.61 [en'
D: reading characters...(useragent) length=1, offset=0, found=']'
D: reading characters...(useragent) length=11, offset=1114, found='
(WinNT; I)'
D: ending (useragent) current element value is : [ (WinNT; I)]

It calls the characters method three times?!
Does the [en] bit in the element value have anything to do with this?
Would like to understand what and why.

(As a 'temp fix' I thought to have the DefaultHandler's characters(...)
method concatenate the characters read until endElement(...) is
invoked, but that seems to break everything.)

Thanks for your input.
Fred.
 
Julian Reschke

Fred said:
(As a 'temp fix' I thought to have the DefaultHandler's characters(...)
method concatenate characters read, till the endElement(...) is
invoked; but that seems to break everything.)

I think that's how SAX is supposed to work. There's no guarantee that
you're only getting a single event here.
 
Eric Bohlman

Julian Reschke said:
I think that's how SAX is supposed to work. There's no guarantee that
you're only getting a single event here.

It *is* how SAX is supposed to work. Keep in mind that character data in
XML can be arbitrarily long; if a parser had to deliver character data in a
single chunk, it could find itself constantly allocating and reallocating
buffers. Not imposing such a requirement greatly simplifies buffer
management in a parser; it can use a fixed-size internal buffer and just
call the character handler when everything up to the end of the buffer is
character data, rather than having to shift everything around. That can
greatly speed up parsing.
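
To make the point above concrete, here is a minimal sketch of a handler that accumulates character data across several characters(...) calls and only reads the value in endElement(...). The class name UserAgentHandler, the field lastUserAgent and the helper parseUserAgent are made up for illustration, and the sketch uses the JDK's javax.xml.parsers.SAXParserFactory rather than instantiating org.apache.xerces.parsers.SAXParser directly; the handler logic should be the same either way. It assumes leaf elements that contain only text. If concatenating 'broke everything', one possible cause (a guess, since the original handler is not shown) is a buffer that is never cleared between elements.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of <useragent> elements.
class UserAgentHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String lastUserAgent;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Begin each element with an empty buffer (fine for text-only leaf elements).
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int offset, int length) {
        // The parser may call this several times per element: append, never overwrite.
        text.append(ch, offset, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only here is the element's character data known to be complete.
        if ("useragent".equals(qName)) {
            lastUserAgent = text.toString();
        }
        text.setLength(0);
    }

    // Illustrative helper: parses an XML string and returns the last <useragent> value.
    static String parseUserAgent(String xml) {
        try {
            UserAgentHandler handler = new UserAgentHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.lastUserAgent;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling UserAgentHandler.parseUserAgent(...) on the element from the question should return the whole string, no matter how many chunks the parser chooses to deliver it in.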
 
