HTML Parsing?


Martin Hart

Hi all,

I need to access an http server and interpret some data from the page I get
back (basically for some minimal tests of a website). I know that I can use
the Net::HTTP class to connect and retrieve the page, but then I am left with
a string full of stuff.

What do people use to parse this into something useful? Is REXML an option
(although the html is not likely to be valid xml)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example I need to see what the contents
of a text control are - or what the caption of the <h2> tag is).

I suppose using regexps is an option as well, but just wondering if I am
missing some cool library that already does all this stuff?
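
For what it's worth, the regexp route mentioned above could look roughly like the sketch below. It is only an illustration: the helper names (fetch_page, first_h2_caption, text_input_value) and the sample markup are made up for this example, and regexps of this kind will break on comments, unquoted attribute values and other real-world HTML, so a proper parser is the safer choice for anything beyond minimal tests.

```ruby
require 'net/http'
require 'uri'

# Fetch a page body as a String. Hypothetical helper, shown only to tie
# the sketch to Net::HTTP; it is not called below, so no network is needed.
def fetch_page(url)
  Net::HTTP.get(URI.parse(url))
end

# Return the text of the first <h2> element with nested tags stripped,
# or nil if the page has no <h2>.
def first_h2_caption(html)
  match = html.match(%r{<h2[^>]*>(.*?)</h2>}im)
  match && match[1].gsub(/<[^>]+>/, '').strip
end

# Return the value attribute of the <input> with the given name, or nil.
# Only handles quoted attribute values (an assumption of this sketch).
def text_input_value(html, name)
  html.scan(/<input\b[^>]*>/i).each do |tag|
    next unless tag =~ /\bname\s*=\s*["']?#{Regexp.escape(name)}["']?(?=[\s>\/])/i
    return $1 if tag =~ /\bvalue\s*=\s*"([^"]*)"/i
    return $1 if tag =~ /\bvalue\s*=\s*'([^']*)'/i
  end
  nil
end

# Made-up sample page standing in for what Net::HTTP would return.
SAMPLE_HTML = <<HTML
<html><body>
<h2 class="title">Order <b>Status</b></h2>
<form><input type="text" name="customer" value="Martin">
<input type="text" name="city" value='Dunstable'></form>
</body></html>
HTML

caption  = first_h2_caption(SAMPLE_HTML)
customer = text_input_value(SAMPLE_HTML, 'customer')
puts "h2 caption: #{caption.inspect}, customer field: #{customer.inspect}"
```

In a real test one would pass the result of fetch_page to the two helpers instead of the sample string.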

Thanks for any advice

Martin

--
Martin Hart
Arnclan Limited
53 Union Street
Dunstable, Beds
LU6 1EX
http://www.arnclanit.com
 

Emmanuel Touzery

What do people use to parse this into something useful? Is REXML an option
(although the html is not likely to be valid xml)? I have looked at the
html-parser on RAA but do not seem to be able to individually access the
components of the returned page (for example I need to see what the
contents of a text control are - or what the caption of the <h2> tag is.

see the thread at
http://www.ruby-talk.org/cgi-bin/vframe.rb/ruby/ruby-talk/91265?91157-91621+split-mode-vertical

emmanuel
 

Martin Hart

For the OP: you can use the above library to convert HTML into a
REXML::Document, then pull it apart as you please.
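
As a rough idea of the "pull it apart" step: once a REXML::Document is in hand, XPath gets at individual elements. The sketch below skips the HTML-to-REXML conversion entirely and assumes the markup is already well-formed (the sample page is invented for the example); only the REXML calls themselves are standard.

```ruby
require 'rexml/document'

# Invented, well-formed sample standing in for the converted page.
# Real-world HTML usually needs converting or cleaning up first.
SAMPLE_XHTML = <<XHTML
<html><body>
<h2>Order Status</h2>
<form>
<input type="text" name="customer" value="Martin"/>
<input type="text" name="city" value="Dunstable"/>
</form>
</body></html>
XHTML

doc = REXML::Document.new(SAMPLE_XHTML)

# Caption of the first <h2> anywhere in the document.
h2_caption = doc.elements['//h2'].text

# Contents of the text control named "customer".
customer_input = REXML::XPath.first(doc, "//input[@name='customer']")
customer_value = customer_input.attributes['value']

puts "h2 caption: #{h2_caption.inspect}, customer field: #{customer_value.inspect}"
```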

Gavin

thanks for all the advice - I can't believe that I missed the similar thread
started by Gavin only 4 days ago :-(

Sorry for the noise.

Cheers,
Martin

--
Martin Hart
Arnclan Limited
53 Union Street
Dunstable, Beds
LU6 1EX
http://www.arnclanit.com
 

Martin Hart

OK feel free to call me an idiot here, but what versions of html-parser and
htmltools are you running?

I downloaded both the html-parser and the patched-html-parser from RAA which
installed themselves into site_ruby/ (not where I'd expect them -
site_ruby/1.8/...). I did this because htmltools appears to depend on one of
them - although not mentioned in the README (version 1.06).

Then I downloaded htmltools from rubyforge which first fails to install
because the sgml-parser.rb file is not in "html/sgml-parser" which is where
it is supposed(?) to be.

Anyway, after moving files to where I presume they should be installed to, the
htmltools library fails to install because the tests do not run (all 15 unit
tests fail with "NameError: uninitialized constant
HTML::TestStackingParser").

My environment is ruby 1.8.1 linux.

My next step is to just install the files by hand and then try again - but I
would be interested to hear if anybody else has experienced similar
installation problems - or if I am just missing something obvious?

Cheers,
Martin
 

Gavin Sinclair


daz

cc: Johannes Brodwall (email)
------------------------------
[...]
Then I downloaded htmltools from rubyforge which first fails to install
because the sgml-parser.rb file is not in "html/sgml-parser" which is where
it is supposed(?) to be.

Anyway, after moving files to where I presume they should be installed to, the
htmltools library fails to install because the tests do not run (all 15 unit
tests fail with "NameError: uninitialized constant
HTML::TestStackingParser").

My next step is to just install the files by hand and then try again - but I
would be interested to hear if anybody else has experienced similar
installation problems - or if I am just missing something obvious?

In ruby-htmltools/test/tc_stacking-parser.rb, replace
line 35:
@parser = HTML::TestStackingParser.new(true, self)
with:
@parser = TestStackingParser.new(true, self)
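
A minimal sketch of why the qualified name fails, assuming (not checked against the htmltools source) that HTML is a module and that the test helper class is defined at the top level of the test file. The class bodies here are empty stand-ins, not the library's code:

```ruby
module HTML
  # Stand-in for the library's parser class.
  class StackingParser; end
end

# Defined at the top level, outside module HTML (assumed layout).
class TestStackingParser < HTML::StackingParser; end

# Look the helper up through the HTML namespace and report the error
# message, or nil if the lookup unexpectedly succeeds.
def qualified_lookup_error
  HTML::TestStackingParser
  nil
rescue NameError => e
  e.message
end

qualified_error = qualified_lookup_error
unqualified     = TestStackingParser.new   # the unqualified name resolves fine

puts "qualified lookup:   #{qualified_error}"
puts "unqualified lookup: #{unqualified.class}"
```

The constant lives in Object, not in HTML, so only the unqualified reference finds it.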

----
Occurrences of "set_up" have been changed to "setup" for Test::Unit.
For consistency, all "tear_down" should be changed to "teardown".
----

It seems that Johannes' idea is to include sgml-parser with the
updated htmltools library. (It's in his CVS tarball)
IMHO, this would make a good home for the whole of html-parser (patched)
(only 31Kb including docs). As long as the original author and packager
are credited in the README, I don't know that anyone would object on grounds
other than duplication of dormant library code. Development could be
continued here.

There should be no need (?) to distribute the RDoc output now that it's
built into Ruby.


daz
 

Martin Hart

It seems that Johannes' idea is to include sgml-parser with the
updated htmltools library.
[snip]

Thanks - I got there in the end anyway by manually installing all the files I
had downloaded and tweaking them as necessary.

Just to append a note to the mini thread that started on packaging as a result
of this... while a packaging system with all the works would be great, it
seems to me what is really needed soonest is a definitive place where we can
take downloads from. I got the versions of code that I am using from RAA...

Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools. One by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on rubyforge. I don't know the history of these -
it may well be that they are the same product that has changed ownership etc,
but it does cause confusion (at least in my case :) when two people download
the same thing from two different places. There is no common frame of
reference, we think that we are talking about the same code but we may not
be.


Cheers,
Martin

--
Martin Hart
Arnclan Limited
53 Union Street
Dunstable, Beds
LU6 1EX
http://www.arnclanit.com
 

Gavin Sinclair

It seems that Johannes' idea is to include sgml-parser with the
updated htmltools library.
[snip]
Thanks - I got there in the end anyway by manually installing all the files I
had downloaded and tweaking them as necessary.
Just to append a note to the mini thread that started on packaging as a result
of this... while a packaging system with all the works would be great, It
seems to me what is really needed soonest is a definitive place where we can
take downloads from. I got the versions of code that I am using from RAA...
Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools. One by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on rubyforge. I don't know the history of these -
it may well be that they are the same product that has changed ownership etc,
but it does cause confusion (at least in my case :) when two people download
the same thing from two different places. There is no common frame of
reference, we think that we are talking about the same code but we may not
be.

True, but this is an isolated case. I've never seen so much
fragmentation with a Ruby library as I've seen with htmltools :)

Since there is an htmltools project on RubyForge, that should become
the definitive one, once it's ensured that it's fully up to date.
I'll be doing more HTML parsing fairly soon, so I'll try to do my bit
in this area.

Cheers,
Gavin
 

daz

Martin Hart said:
Where I came unstuck is that there appear to be two different(?) versions of
ruby-htmltools. One by Ned Konz that is linked to from RAA, and one by
Johannes Brodwall that is on rubyforge. I don't know the history of these -
it may well be that they are the same product that has changed ownership etc,
but it does cause confusion (at least in my case :) when two people download
the same thing from two different places. There is no common frame of
reference, we think that we are talking about the same code but we may not
be.

Until I read this thread, I was unaware of 1.06 on RubyForge which *is*
an "updated for Ruby 1.8" version of 1.04 from RAA. ((garbage sentence))

This problem isn't too common atm, but you're right - this example is in a bit
of a mess. The issue is understood by those who matter.

Sorry we had to share the same inconvenience :)


Cheers,


daz
 

Johannes Brodwall

daz said:
Until I read this thread, I was unaware of 1.06 on RubyForge which *is*
an "updated for Ruby 1.8" version of 1.04 from RAA. ((garbage sentence))

This problem isn't too common atm, but you're right - this example is in a bit
of a mess. The issue is understood by those who matter.

Sorry we had to share the same inconvenience :)

Thank you all for the feedback, and especially to daz for alerting me
directly (I haven't paid attention to ruby-talk lately).

I have updated the tarball to include sgml-parser. Sorry about the
slip-up.

I will not have much time to work on the project for a while. If anyone
wants to lend a hand, please speak up.


~Johannes
 
