HTML parsing by REXML

Paul Argentoff

Hello world.

Sorry for returning to a fairly well-discussed topic (though not in the sense I need).
I can't parse HTML files with REXML, since some tags in HTML are left open (such as
<link>, etc). Document.new fails with an error message about such a tag.

Is there any workaround? I really like REXML, I'm accustomed to it, and I don't want
to use another lib.

Dario Linsky

Hi,

Paul Argentoff said:
Sorry for returning to sorta well-discussed (but not in a sense I need)
topic.
I can't parse xml files by rexml since some tags in html are open (such as
<link>, etc). Document.new errors with a message about such a tag.

Do I understand your problem correctly: REXML raises an exception
because a tag is not closed? If so, a possible solution would be to
use XHTML instead of plain HTML.

Mark Hubbart

Dario Linsky said:
Hi,

Do I understand your problem right, that REXML gives you an Exception
because you did not close a tag? If so, a possible solution would be to
use XHTML instead of normal HTML.

To expand on this:
HTML is not XML. XHTML *is* XML.
In HTML, you can have unclosed tags that look like this:
<link rel="stylesheet" href="default.css">
but in XHTML, the tag has to close itself:
<link rel="stylesheet" href="default.css"/>
XHTML does this to be compatible with XML, which it is based on.
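
A minimal sketch of that difference as REXML sees it, assuming only the standard rexml/document library. The helper name well_formed? and the sample markup are made up for illustration:

```ruby
require 'rexml/document'

# Hypothetical helper: true if REXML accepts the markup as well-formed XML.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

# HTML style: <link> is never closed, so </head> arrives while it is still open.
html  = '<head><link rel="stylesheet" href="default.css"></head>'
# XHTML style: the same tag closes itself.
xhtml = '<head><link rel="stylesheet" href="default.css"/></head>'

puts well_formed?(html)   # false
puts well_formed?(xhtml)  # true

# Once the markup is well-formed, the usual REXML calls work.
doc = REXML::Document.new(xhtml)
puts doc.root.elements['link'].attributes['href']  # default.css
```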

If you have a bunch of regular HTML files that you need to parse, I
would suggest running them through "HTML Tidy", which can convert them
to well-formed XHTML for you. You should then be able to parse them
with REXML. See http://tidy.sourceforge.net/
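
A rough sketch of how that pipeline could look from Ruby, assuming the tidy command-line program is installed and on the PATH. The helper names tidy_command and parse_html_file are hypothetical, and the choice of options is an assumption: -asxhtml asks for XHTML output, -quiet suppresses non-essential output, and -numeric writes numeric entities so that named HTML entities do not reach the XML parser.

```ruby
require 'open3'
require 'rexml/document'

# Hypothetical helper: the tidy command line used to convert one file to XHTML.
def tidy_command(path)
  ['tidy', '-asxhtml', '-quiet', '-numeric', path]
end

# Hypothetical helper: run tidy on an HTML file and parse its output with REXML.
# Raises Errno::ENOENT if tidy is not installed.
def parse_html_file(path)
  xhtml, _messages, status = Open3.capture3(*tidy_command(path))
  # tidy exits with 0 on success, 1 if it issued warnings, 2 if it hit errors.
  code = status.exitstatus
  raise "tidy could not clean up #{path}" if code.nil? || code > 1
  REXML::Document.new(xhtml)
end
```

With tidy available, something like parse_html_file('page.html') (the file name is a placeholder) should return a REXML::Document that can be queried as usual.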

--Mark

Yan-Fa Li

Yeah, I had the same problem recently. I think that since HTML allows lax
closing of elements, REXML will just barf. In the end I used regular
expressions to pick out the lines I was interested in and to
capture the fields I wanted. It works really well. There's also an HTML
parser class based on the Python one, but it was so badly documented and
seemed so poorly supported that I chose not to use it.
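
For illustration only, a small sketch of that kind of regex approach. The sample markup and the patterns are assumptions, not the code actually used, and this sort of matching is fragile compared to a real parser:

```ruby
# Made-up sample input with unclosed <link> tags, as plain HTML allows.
html = <<~HTML
  <head>
  <link rel="stylesheet" href="default.css">
  <link rel="alternate" href="feed.xml">
  </head>
HTML

# Keep only the lines containing a <link> tag, then capture each href value.
hrefs = html.each_line
            .grep(/<link\b/i)
            .map { |line| line[/href="([^"]*)"/i, 1] }
            .compact

p hrefs  # ["default.css", "feed.xml"]
```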

Ben Giddings

Yan-Fa Li said:
Yeah I had the same problem recently. I think since html allows lax
closing of elements rexml will just barf. In the end I used regular
expressions to slurp catch the lines I was interested in and regex to
capture the fields I wanted. Works really well. There's also a html
parser class based on the python one, but it was so badly documented and
it seems to be poorly supported that I chose not to use it.

You might have more luck with mine:

http://rubyforge.org/projects/htmltokenizer/

It is more forgiving, and pretty easy to use.

Ben