Help with net/http


Atomic Bomb

I am trying to screen scrape a webpage and pull out the name, address,
city, state, zip and phone on a site that lists apartments for rent.

Here is my code:
------------------------
require 'net/http'
require 'uri'

temparray = Array.new

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/connecticut/0')
}
# puts res.body

# Strip double quotes and keep only the lines that contain table cells
res.body.each_line {|line|
  line.gsub!(/\"/, '')
  temparray.push(line) if line =~ /<td\svalign=top/
}

# Remove the unwanted markup from each saved line
temparray.each do |j|
  # j.gsub!(/<a\shref=\/map.*<\/a>/,'')
  j.gsub!(/\shref=\/map\//,'')
  j.gsub!(/\d+\sclass=map>Map\&nbsp\;It!/,'')
  j.gsub!(/<\/td>/,'')
  j.gsub!(/<td\svalign=top>/, '')
  j.gsub!(/<td\svalign=top\snowrap>/, '')
  j.gsub!(/<tr\sbgcolor=white>/, '<br>')
  j.gsub!(/MapIt!/, ', ')
  j.gsub!(/\(/, ', (')
  j.gsub!(/<\/tr>/,'')

  puts j
end
----------------------
I can fetch the HTML from the page. I strip the double quotes with gsub!,
then push each line that contains <td valign=top onto an array. I then
iterate through the array and try to remove what I don't want with more
gsub! calls. The output still has HTML tags in it and looks good if I
write it to an HTML page (you can see the output here:
http://www.holy-name.org/ct.html), but I really need to remove the HTML
tags and get just the important facts into a CSV file. Since there are 4
elements in the array for each record, the only way I could get it to
work on a web page was to add a <br> between records.

Is there a better way to pull out the pertinent info and avoid all the
HTML tags?

thanks

atomic
 
Alex Stahl

Nokogiri provides a great interface for accessing the data trapped
inside markup.

Try something like:

page = Nokogiri::HTML res.body
data = []
page.xpath("//xpath/to/table").each do |node|
data << node.xpath("./rel/xpath/to/data/text()")
end




 
A. Mcbomb

Thanks Alex.

I tried following the instructions to install the Nokogiri gem but it
gave me a few errors. I tried linking the libraries during the install:

[server01][/]$ gem install nokogiri -- --with-xml2-lib=/usr/local/lib
--with-xml2-include=/usr/local/include/libxml2
--with-xslt-lib=/usr/local/lib
--with-xslt-include=/usr/local/include/libxslt
Building native extensions. This could take a while...
Successfully installed nokogiri-1.4.4
1 gem installed
Installing ri documentation for nokogiri-1.4.4...

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with
Installing RDoc documentation for nokogiri-1.4.4...

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with
---

As a test, I created a test file with the following code:

require 'open-uri'
doc = Nokogiri::HTML(open("http://www.anysite.com/"))

But when I run it, I get the following, so I don't think the gem is
installed correctly:

[server01][/usr/bin]$ ./test.rb
./test.rb:7: uninitialized constant Nokogiri (NameError)

Would you be able to suggest anything to help me get Nokogiri installed
and working?

thanks

atomic
 
Jesús Gabriel y Galán

You have to require 'nokogiri'

Jesus.
 
A. Mcbomb

I didn't realize that, Jesus, but it didn't fix my installation.
When I run the test script, here's what I get:

[server01][/usr/bin]$ ./test.rb
./test.rb:6:in `require': no such file to load -- nokogiri (LoadError)
        from ./test.rb:6


thanks

atomic
 
Jesús Gabriel y Galán

Did you require rubygems before requiring nokogiri? The typical ways are:

export RUBYOPT=rubygems

or calling ruby -rubygems ./test.rb

or adding require 'rubygems' to your script (there have been
discussions here about why this is not recommended, especially for
library code).

In general, to use a gem you have to require rubygems before requiring the
gem.
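
A middle ground between the options above is to try a plain require first and load RubyGems only when that fails. The following is a minimal sketch of that idea, not something taken from the thread: require_with_rubygems is a hypothetical helper name, and 'json' is only a stand-in for 'nokogiri' so the snippet runs without extra gems. On Ruby 1.9 and later RubyGems is loaded by default, so the fallback branch should never be needed there.

```ruby
# Hypothetical helper (not part of Ruby or RubyGems): try a plain require
# and fall back to loading RubyGems only if the library cannot be found.
def require_with_rubygems(name)
  require name
rescue LoadError
  require 'rubygems'
  require name   # a second LoadError here propagates to the caller
end

# 'json' stands in for 'nokogiri' in this sketch.
require_with_rubygems 'json'
puts JSON.generate('loaded' => true)
```

If the library is missing even with RubyGems loaded, the second require raises LoadError as usual, so a failed install is still visible.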

Jesus.
 
A. Mcbomb

That definitely helped, Jesus....thanks.

Here is what I get now when I run the test script:

[server01][/usr/bin]$ ./test.rb
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
upgrade your version of libxml2 and re-install nokogiri. If you like using
libxml2 version 2.6.16, but don't like this warning, please define the constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.


This sounds like I should upgrade. Would you recommend simply replacing
the one library (libxml2) and then reinstalling the gem, or are there
other libraries that I should replace as well, such as libxslt?

thank you so much for helping me to get this going Jesus!

atomic
 
Jesús Gabriel y Galán

What OS (and version) are you on? I have a pretty old version of
Ubuntu (8.10) and have libxml2.so.2.6.32.
To upgrade a library correctly, please use your OS facilities (apt,
yum or whatever).

Jesus.
 
A. Mcbomb

Here's what my server is running:

Linux version 2.6.9-42.0.3.EL.wh1smp ([email protected]) (gcc version 3.4.6
20060404 (Red Hat 3.4.6-11)) #1 SMP Fri Aug 14 15:48:17 MDT 2009


The problem I run into is that since this is a shared hosting server,
they don't allow me to add RPMs to the server.

Do you know of a way to update the library, with a binary file for
instance?
Which library do I need, so I can look around?

thanks again Jesus

atomic
 
Scott Hill

Installing gems local to your user account might help get around some
issues. Depending on what your host allows/supports, you might look into
using RVM to manage your Ruby installations and gems.

--Scott
 
Jesús Gabriel y Galán

The problem is not the gem; it is the libxml2 dependency. I don't know
how to install a library locally on RedHat, maybe the OP can
investigate that. And then you have to tell nokogiri to use that local
version. I haven't looked into it, maybe it's easy, I don't know.
Maybe some other person on the list can help the OP further.

Jesus.
 
A. Mcbomb

I got one of my servers updated and I'm now running Nokogiri without
errors which is great news.

Here is my new code:
-------------------
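require 'rubygems'
require 'net/http'
require 'uri'
require 'nokogiri'

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/connecticut/0')
}

# Parse the response body and print the text of every link inside a table cell
page = Nokogiri::HTML res.body
page.xpath("//tr//td/a").each do |node|
  puts node.text
end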
-----------------
This returns some of the data that I need but not all of it.
I do not understand this line:

page.xpath("//tr/td")

I know it is supposed to be the path to the data I need but I'm not sure
how I can get to all the data I need from the URL, it seems like some of
the data is between tags that I can't figure out.

This is one record from the webpage in HTML:
-----
<tr bgcolor=white><td valign=top><a
href="/map/22-glenbrook-road-condo-associate/stamford-connecticut-06902-(203)327-4028/14741"
title="Condominium Office Rental and Leasing, Condominiums and
Townhouses, Condominium and Townhouse Rental and Leasing ">22 Glenbrook
Road Condo Associates</a></td><td valign=top>
<a
href="/map/22-glenbrook-road-condo-associate/stamford-connecticut-06902-(203)327-4028/14741"
class=map>Map&nbsp;It!</a>&nbsp;&nbsp;</td><td valign=top>22 Glenbrook
Road</td>
<td valign=top>Stamford,&nbsp;&nbsp;CT&nbsp;&nbsp;06902</td>
<td valign=top nowrap>(203) 327-4028</td></tr>
-----
I need to be able to get the following information out for one record:

22 Glenbrook Road Condo Associates,22 Glenbrook Road,Stamford,CT,06902,(203) 327-4028

I thought that if I configured Nokogiri with:
page.xpath("//tr/td")

...that it would get me inside these table tags, but it's not working.

Can you possibly point out where I'm going wrong?

thanks for the help,

atomic
 
A. Mcbomb

Hang on! It is working now. As I was writing my last post, I realized I
had been using:

page.xpath("//tr//td/a") and changed it to page.xpath("//tr/td")

and tried that after my last post.

I get the following output, which is good except for the "Â" type
characters. What is the best way to get rid of those and combine each
record on one line, separated only by commas?


MapÂ It!Â Â
90 Gerrish Avenue
East Haven,Â Â CTÂ Â 06512
(203) 466-2605
Avalon Bay Communities

MapÂ It!Â Â
66 Glenbrook Road No. 200
Stamford,Â Â CTÂ Â 06902
(203) 357-0986
Avalon Grove Luxury Apartments


thanks again,

atomic
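
The "Â" characters are most likely the page's &nbsp; entities: they decode to the non-breaking space U+00A0, which is the byte pair C2 A0 in UTF-8 and tends to show up as "Â" plus a space when a terminal assumes Latin-1. Here is a minimal sketch of cleaning the cell text and writing one CSV line per record, using only the standard csv library. The names clean and record_to_csv are made up for illustration, and the cell order (name, "Map It!" link, street, "City, ST ZIP", phone) is an assumption taken from the sample record posted earlier in this thread.

```ruby
require 'csv'

NBSP = "\u00A0"  # what &nbsp; becomes after HTML parsing

# Replace non-breaking spaces with plain spaces, collapse runs of
# whitespace (including line breaks) and trim both ends.
def clean(text)
  text.gsub(NBSP, ' ').gsub(/\s+/, ' ').strip
end

# cells: the text of the five <td> cells of one row, assumed to be in the
# order: name, "Map It!" link, street, "City, ST ZIP", phone.
def record_to_csv(cells)
  name, _map, street, locality, phone = cells.map { |c| clean(c) }
  city, state_zip = locality.split(',', 2)
  state, zip = state_zip.to_s.split
  [name, street, city, state, zip, phone].to_csv
end

# Cell text as it might come back for the sample record.
cells = [
  "22 Glenbrook\nRoad Condo Associates",
  "\nMap#{NBSP}It!#{NBSP}#{NBSP}",
  "22 Glenbrook\nRoad",
  "Stamford,#{NBSP}#{NBSP}CT#{NBSP}#{NBSP}06902",
  "(203) 327-4028"
]

puts record_to_csv(cells)
```

With Nokogiri, cells could be built per row with something like tr.xpath('./td').map { |td| td.text }, so that the four or five values of one record stay together instead of being printed one per line.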

 
brabuhr

You might also consider the mechanize library:
http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html

e.g.

require 'rubygems'
require 'mechanize'

Mechanize.new.get("http://www.apartment-directory.info/alabama/0") do |page|
  page.search('//tr').each do |tr|
    tds = tr.search('./td')
    puts tds[0].text.chomp rescue nil
    puts tds[2].text.chomp rescue nil
    puts tds[3].text.chomp rescue nil
    puts
  end
end

This sample script as-is is too greedy; it loops over every row of
every table instead of just the interesting one.

$ ruby i.rb
[some garbage from other tables]
...
Aquadome Apartment
1619 8th Street Southwest
Decatur,  AL  35601

Arbor Park Apartments
175 Sloan Avenue East
Talladega,  AL  35160

Arbor Place Apartments
515 Fox Run Parkway No. 9A
Opelika,  AL  36801

Arbor Pointe Apartments
100 Dairy Road
Mobile,  AL  36612

Arboretum Apartments
1800 Arboretum Circle
Birmingham,  AL  35216

Arbors On Taylor
485 Taylor Road
Montgomery,  AL  36117

Arrow Head Apartments
129 South Union Avenue
Ozark,  AL  36360
...
 
Hassan Schroeder

> page = Nokogiri::HTML res.body
> page.xpath("//tr//td/a").each do |node|
>   puts node.text
> end
> -----------------
> This returns some of the data that I need but not all of it.
> I do not understand this line:
>
>   page.xpath("//tr/td")

That's not what you're using.

> I know it is supposed to be the path to the data I need but I'm not sure
> how I can get to all the data I need from the URL, it seems like some of
> the data is between tags that I can't figure out.

If you *did* use `//tr/td` you *would* get all the information in the table,
only some of which is within anchor (a) tags.

 
A. Mcbomb

I never heard of mechanize but I see from the doco that it requires
Nokogiri to run. I have a working copy of Nokogiri and just did a
successful 'gem install mechanize' but when I ran your basic script, I
get:

[[email protected] bin]# ./mechanize.rb
./mechanize.rb:7: uninitialized constant Mechanize (NameError)
        from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
        from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
        from ./mechanize.rb:5



Am I missing something?

atomic
 
brabuhr

Hmm... I'm not sure there... oh wait... Could it be confused since
your file is (also) named 'mechanize'?:

$ cp i.rb mechanize.rb
$ ruby mechanize.rb
./mechanize.rb:5: uninitialized constant Mechanize (NameError)
        from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
        from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `require'
        from mechanize.rb:3

Yeah, I think that's it. Try renaming your script.
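
A likely explanation, assuming Ruby 1.8 where the current directory is on $LOAD_PATH: require 'mechanize' finds ./mechanize.rb before it ever looks for the gem, so the script loads itself again and the Mechanize constant is never defined. The check below is only a sketch of how to spot such a clash up front; script_shadows? is a hypothetical helper, not part of Ruby or Mechanize.

```ruby
# Hypothetical helper: true when a script's file name is the same as the
# library it is about to require, which can make `require` pick up the
# script itself if the script's directory is searched first.
def script_shadows?(script_path, library)
  File.basename(script_path, '.rb') == library
end

puts script_shadows?('./mechanize.rb', 'mechanize')  # true
puts script_shadows?('./scrape.rb', 'mechanize')     # false
```

Calling it with __FILE__ at the top of a script would flag the problem before the require runs.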
 
A. Mcbomb

I renamed the script and it worked!
Pretty nice....thanks.

My last question though is, what is the easiest way to get rid of all
the garbage that I don't want from the other tables?

thanks a lot

atomic
 
brabuhr

Try to narrow down the xpath used to pull stuff out. E.g.

//table[2]/tbody/tr

to only get the rows from the second table.
 