HTML parsing by REXML

Paul Argentoff · Apr 1, 2004

Hello world.

Sorry for returning to a sorta well-discussed topic (though not discussed in
the sense I need). I can't parse HTML files with REXML since some tags in HTML
are left open (such as <link>, etc.). Document.new raises an error with a
message about such a tag.

Is there any workaround? I really like REXML, I'm accustomed to it, and I
don't want to use another lib.

Dario Linsky · Apr 1, 2004

Hi,

Paul Argentoff said:
Sorry for returning to a sorta well-discussed topic (though not discussed in
the sense I need). I can't parse HTML files with REXML since some tags in HTML
are left open (such as <link>, etc.). Document.new raises an error with a
message about such a tag.

Do I understand your problem right, that REXML gives you an Exception
because you did not close a tag? If so, a possible solution would be to
use XHTML instead of normal HTML.

Mark Hubbart · Apr 1, 2004

Dario Linsky said:
Hi,

Do I understand your problem right, that REXML gives you an Exception
because you did not close a tag? If so, a possible solution would be to
use XHTML instead of normal HTML.

to expand on this:
html is not xml. xhtml *is* xml.
in html, you can have unclosed tags that look like this:
<link rel="stylesheet" href="default.css">
but in xhtml, the tag has to close itself:
<link rel="stylesheet" href="default.css"/>
xhtml does this to be compatible with xml, which it is based on.
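
A minimal sketch of that difference as REXML sees it. The two snippets and the helper name `stylesheet_href` are made up for illustration; the only assumption about REXML is that it rejects an element that is never closed, which is the error described at the top of the thread.

```ruby
require "rexml/document"

# Same markup twice: once with an unclosed <link> (plain HTML),
# once with a self-closing <link/> (XHTML).
HTML_SNIPPET  = '<html><head><link rel="stylesheet" href="default.css"></head></html>'
XHTML_SNIPPET = '<html><head><link rel="stylesheet" href="default.css"/></head></html>'

# Hypothetical helper: returns the href of the first <link> element,
# or nil if REXML refuses to parse the markup.
def stylesheet_href(markup)
  doc  = REXML::Document.new(markup)
  link = REXML::XPath.first(doc, "//link")
  link && link.attributes["href"]
rescue REXML::ParseException
  nil
end

p stylesheet_href(HTML_SNIPPET)   # unclosed <link>: parse error, so nil
p stylesheet_href(XHTML_SNIPPET)  # self-closed <link/>: "default.css"
```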

If you have a bunch of regular html files that you need to parse, I
would suggest running them through "HTML Tidy", which can convert them
to well-formed xhtml for you. You should then be able to parse them
with REXML. See http://tidy.sourceforge.net/
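
A rough sketch of that Tidy-then-REXML pipeline, assuming the `tidy` command-line tool is installed and on your PATH. The helper name `tidy_to_xhtml` is hypothetical; `-asxhtml`, `-quiet` and `-numeric` are standard Tidy options (convert to XHTML, suppress chatter, emit numeric rather than named entities).

```ruby
require "open3"
require "rexml/document"

# Hypothetical helper: pipes HTML through the external `tidy` program and
# returns well-formed XHTML, or nil if tidy is missing or gives up.
def tidy_to_xhtml(html)
  xhtml, _warnings, status =
    Open3.capture3("tidy", "-asxhtml", "-quiet", "-numeric", stdin_data: html)
  # Tidy exits with 0 (clean) or 1 (warnings only) when it produced output;
  # 2 means it hit errors it could not repair.
  [0, 1].include?(status.exitstatus) ? xhtml : nil
rescue Errno::ENOENT
  nil # tidy is not installed
end

sloppy = '<html><head><title>t</title><link rel="stylesheet" href="default.css"></head>' \
         "<body><p>hello</body></html>"

if (xhtml = tidy_to_xhtml(sloppy))
  doc = REXML::Document.new(xhtml)
  puts doc.root.name
else
  puts "tidy unavailable or failed"
end
```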

--Mark

Yan-Fa Li · Apr 1, 2004

Yeah, I had the same problem recently. I think that since HTML allows lax
closing of elements, REXML will just barf. In the end I used regular
expressions to catch the lines I was interested in, and more regexes to
capture the fields I wanted. Works really well. There's also an HTML
parser class based on the Python one, but it was so badly documented and
seems so poorly supported that I chose not to use it.
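
Something along these lines, as a sketch only: one regex picks out the tags of interest and a second captures the field from each. The sample page, the constant names and `link_hrefs` are invented for the example, and it only handles double-quoted attributes.

```ruby
# Hypothetical input with unclosed <link> tags, as in plain HTML.
PAGE_SOURCE = <<~PAGE
  <html>
  <head>
  <link rel="stylesheet" href="default.css">
  <link rel="alternate" href="feed.xml">
  </head>
  </html>
PAGE

LINK_TAG  = /<link\b[^>]*>/i              # step 1: catch the tags we care about
HREF_ATTR = /\bhref\s*=\s*"([^"]*)"/i     # step 2: capture the field inside each

# Returns the href value of every <link> tag that has one.
def link_hrefs(html)
  html.scan(LINK_TAG).map { |tag| tag[HREF_ATTR, 1] }.compact
end

p link_hrefs(PAGE_SOURCE)  # => ["default.css", "feed.xml"]
```

This ignores comments, single-quoted attributes and anything split in odd ways, so it suits pages whose layout you already know.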

Ben Giddings · Apr 1, 2004

Yan-Fa Li said:
Yeah, I had the same problem recently. I think that since HTML allows lax
closing of elements, REXML will just barf. In the end I used regular
expressions to catch the lines I was interested in, and more regexes to
capture the fields I wanted. Works really well. There's also an HTML
parser class based on the Python one, but it was so badly documented and
seems so poorly supported that I chose not to use it.

You might have more luck with mine:

http://rubyforge.org/projects/htmltokenizer/

It is more forgiving, and pretty easy to use.

Ben
