problem with reading stream

Durango@google

Hello, I am trying to read a stream from a URL into an InputSource and
pass that to a DOM parser.
The URL points to an RSS file, and I am able to read it successfully.

The problem is that the data read from the stream causes
the parser to throw errors.
This is because the RSS stream starts with a newline character. If
the newline character is removed, then the parser works fine.
I was able to read the RSS stream and write it to a file, omitting the
newline character.
After that I read the file stream into the InputSource and passed it
to the DOM parser.

This worked fine; however, I do not want to save it to a file. Rather, I
would like to know if there is
a way to manipulate the stream before it is wrapped in the InputSource
and passed to the parser.

Does anyone have any suggestions for the issue I am facing?
 
Owen Jacobson

First - it is completely legal for an XML file to begin with
whitespace, comments, processing instructions, or a tag - so (in the
absence of an example) I have to suppose that your RSS parser is a bit
buggy.

However, until the bug you file is fixed, you can probably work around
it a few ways: you can use a BufferedInputStream around the base input
stream in combination with the mark method to do this:

1. Mark in preparation for reading one byte.
2. Read a byte.
3. If the byte matches one of the whitespace tokens, go to 1.
4. reset() to "un-read" the last byte.

After all of this, the buffered stream will be at a point where the
next reads will begin at the first non-whitespace byte. You might
want to consider using a Reader rather than an InputStream for this,
so that you get character-oriented rather than byte-oriented
behaviour, though.
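
The mark/reset steps above could look roughly like the following minimal sketch. The names `LeadingWhitespaceSkipper` and `skipLeadingWhitespace` are made up for illustration, and the sketch assumes the leading whitespace is single-byte ASCII (space, tab, CR, LF), which would not hold for a UTF-16 feed.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class LeadingWhitespaceSkipper {

    // Hypothetical helper: returns a stream positioned at the first
    // non-whitespace byte of 'raw' (or at end of stream if there is none).
    static InputStream skipLeadingWhitespace(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        while (true) {
            in.mark(1);                 // 1. remember the position before one byte
            int b = in.read();          // 2. read a byte
            if (b == -1) {
                return in;              // stream held only whitespace (or nothing)
            }
            boolean whitespace = b == ' ' || b == '\t' || b == '\r' || b == '\n';
            if (!whitespace) {
                in.reset();             // 4. "un-read" the non-whitespace byte
                return in;
            }
            // 3. whitespace: loop and mark again
        }
    }

    public static void main(String[] args) throws IOException {
        // A byte array stands in for the stream obtained from the feed URL.
        byte[] feed = "\n<?xml version=\"1.0\"?><rss/>".getBytes(StandardCharsets.UTF_8);
        InputStream in = skipLeadingWhitespace(new ByteArrayInputStream(feed));
        System.out.println((char) in.read());
    }
}
```

The returned stream can be handed to `new InputSource(stream)` in place of the raw URL stream, so no temporary file is needed.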

What does feedvalidator say about the RSS file at the URL? It's
possible the RSS resource itself is defective, which should be
reported (with a feedvalidator link for emphasis) to whoever's
publishing the feed.

-o
 
Gordon Beaton

> This worked fine, however I do not want to save it to a file, rather
> I would like to know if there is a way to manipulate the stream
> before passing it to the InputSource before it gets passed to the
> parser?

Just read the newline from the stream yourself, and it won't be there
for the parser to read.

/gordon

 
Mike Schilling

Owen Jacobson wrote:

> First - it is completely legal for an XML file to begin with
> whitespace, comments, processing instructions, or a tag - so (in the
> absence of an example) I have to suppose that your RSS parser is a
> bit buggy.

No, actually it isn't. See the XML standard at
http://www.xml.com/axml/testaxml.htm : it allows whitespace after the
end of the document, but not before the beginning. And the reason for
this should be clear: since an XML document could use many different
encodings, there would be a catch-22 in trying to discard whitespace
that precedes the XML declaration that indicates what encoding is
being used. See Appendix F of the standard.
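
One way to see this behaviour with whatever parser is in use is a small check like the sketch below. It assumes the JDK's built-in DOM parser; `WellFormednessCheck` and `parses` are illustrative names, not part of any library. As far as the XML 1.0 grammar goes, the leading newline is only fatal when an XML declaration follows it; whitespace directly before the root element, with no declaration at all, should still be accepted.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormednessCheck {

    // Hypothetical helper: true if the parser accepts the text, false if it
    // reports a well-formedness error.
    static boolean parses(String xml) throws IOException, ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Declaration at the very start of the document.
        System.out.println(parses("<?xml version=\"1.0\"?><rss/>"));
        // Newline before the declaration: the case from this thread.
        System.out.println(parses("\n<?xml version=\"1.0\"?><rss/>"));
        // Newline before the root element, no declaration.
        System.out.println(parses("\n<rss/>"));
    }
}
```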
 
Owen Jacobson

> No, actually it isn't. See the XML standard at
> http://www.xml.com/axml/testaxml.htm : it allows whitespace after the
> end of the document, but not before the beginning. And the reason for
> this should be clear: since an XML document could use many different
> encodings, there would be a catch-22 in trying to discard whitespace
> that precedes the XML declaration that indicates what encoding is
> being used. See Appendix F of the standard.

...Huh. You're right. I've never encountered it because I don't
usually put whitespace before a document - there's no point, it's just
wasted bytes. In that case, the OP's appropriate course of action
would be to tell whoever's publishing the RSS to fix the feed.

-o
 
Tom Anderson

> ...Huh. You're right. I've never encountered it because I don't usually
> put whitespace before a document - there's no point, it's just wasted
> bytes.

I got bitten by this recently. I wrote a JSP that started like this:

<%@page contentType="application/xhtml+xml"%>
<?xml version="1.0" encoding="UTF-8"?>
<!-- then doctype, and actual content -->

I thought I was being a really good boy and setting my content-type right.
But all that happened is that the client fell over with a parsing error!
Because, of course, after the page directive JSP tag, and before the XML
declaration, there's a newline. So, after JSP processing, my page actually
started "\n<?xml", which is against the rules.

You can fix it easily by deleting the newline, and having the XML decl
follow directly on the heels of the page directive.

> In that case, the OP's appropriate course of action would be to tell
> whoever's publishing the RSS to fix the feed.

Good luck with that.

tom
 
Daniel Pitts

Tom said:

> I got bitten by this recently. I wrote a JSP that started like this:
>
> <%@page contentType="application/xhtml+xml"%>
> <?xml version="1.0" encoding="UTF-8"?>
> <!-- then doctype, and actual content -->
>
> I thought I was being a really good boy and setting my content-type
> right. But all that happened is that the client fell over with a parsing
> error! Because, of course, after the page directive JSP tag, and before
> the XML declaration, there's a newline. So, after JSP processing, my
> page actually started "\n<?xml", which is against the rules.
>
> You can fix it easily by deleting the newline, and having the XML decl
> follow directly on the heels of the page directive.
>
> Good luck with that.
>
> tom

The technique I use for this situation is:
<%@ page contentType="application/xhtml+xml"
%><?xml version="1.0" encoding="UTF-8" ?>

It has always been a little frustrating that there isn't better
whitespace control in JSP.
 
Mike Schilling

Gordon said:

> Just read the newline from the stream yourself, and it won't be there
> for the parser to read.

One simple (and reusable) way to do this is to create a FilterInputStream
that discards leading whitespace and let the XML parser read from that.
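
A filter along those lines might be sketched as follows. The class name `LeadingWhitespaceSkippingInputStream` is invented for this example, and it assumes the leading whitespace is single-byte ASCII. It only covers the `read` methods; `skip` and `available` are left at their inherited behaviour, so this is a starting point rather than a complete implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical filter: drops whitespace bytes at the start of the wrapped
// stream, then passes everything else through unchanged.
final class LeadingWhitespaceSkippingInputStream extends FilterInputStream {
    private boolean started = false;       // has the leading whitespace been skipped yet?
    private boolean firstPending = false;  // is the first real byte waiting to be returned?
    private int first;                     // the first non-whitespace byte

    LeadingWhitespaceSkippingInputStream(InputStream in) {
        super(in);
    }

    private void skipLeading() throws IOException {
        if (started) {
            return;
        }
        started = true;
        int b;
        do {
            b = in.read();
        } while (b == ' ' || b == '\t' || b == '\r' || b == '\n');
        if (b != -1) {
            first = b;
            firstPending = true;
        }
    }

    @Override
    public int read() throws IOException {
        skipLeading();
        if (firstPending) {
            firstPending = false;
            return first;
        }
        return in.read();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        skipLeading();
        if (len == 0) {
            return 0;
        }
        if (firstPending) {
            firstPending = false;
            buf[off] = (byte) first;
            return 1;                      // a short read is permitted by the contract
        }
        return in.read(buf, off, len);
    }

    @Override
    public boolean markSupported() {
        return false;                      // mark/reset is not tracked by this sketch
    }

    public static void main(String[] args) throws Exception {
        // A byte array stands in for the stream obtained from the feed URL.
        byte[] feed = "\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><rss version=\"2.0\"/>"
                .getBytes(StandardCharsets.UTF_8);
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new LeadingWhitespaceSkippingInputStream(new ByteArrayInputStream(feed)));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Because the filter is just another InputStream, it can wrap the URL stream once and be passed to `new InputSource(stream)` or straight to the parser.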
 
