Atomic Bomb
I am trying to screen-scrape a web page and pull out the name, address,
city, state, zip, and phone from a site that lists apartments for rent.
Here is my code:
------------------------
require 'net/http'
require 'uri'

temparray = Array.new
url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/connecticut/0')
}
# puts res.body

# Keep only the table-cell lines, with the quote marks stripped out
res.body.each_line { |line|
  line.gsub!(/"/, '')
  temparray.push(line) if line =~ /<td\svalign=top/
}

# Scrub the markup I know about out of each cell
temparray.each do |j|
  # j.gsub!(/<a\shref=\/map.*<\/a>/, '')
  j.gsub!(/\shref=\/map\//, '')
  j.gsub!(/\d+\sclass=map>Map&nbsp;It!/, '')
  j.gsub!(/<\/td>/, '')
  j.gsub!(/<td\svalign=top>/, '')
  j.gsub!(/<td\svalign=top\snowrap>/, '')
  j.gsub!(/<tr\sbgcolor=white>/, '<br>')
  j.gsub!(/MapIt!/, ', ')
  j.gsub!(/\(/, ', (')
  j.gsub!(/<\/tr>/, '')
  puts j
end
----------------------
I am able to grab the HTML from the page. I then gsub! out the "
characters and push each line that contains <td valign=top onto an
array. Next I iterate through the array and try to remove what I don't
want with more gsub! calls. The output from this still has HTML tags in
it; it looks fine when I output it as an HTML page (you can see the
result here: http://www.holy-name.org/ct.html), but I really need to
strip the HTML tags and get just the important facts into a CSV file.
Since there are 4 elements in the array for each record, the only way I
could get it to display properly on a web page was to add a <br>
between records.
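I suppose I could strip whatever tags are left with a catch-all regex
and then slice the flat array into rows of 4 before writing them out.
Something like this untested sketch, assuming the standard csv library
(the file name ct.csv is just made up):
------------------------
require 'csv'

# Catch-all: strip any remaining HTML tags and trim whitespace
cleaned = temparray.map { |cell| cell.gsub(/<[^>]+>/, '').strip }

# Each record is 4 cells, so slice the flat array into rows
CSV.open('ct.csv', 'w') do |csv|
  cleaned.each_slice(4) { |row| csv << row }
end
----------------------
That at least gets rid of the tags, but it still leans on regexes for
everything.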
Is there a better way to pull out the pertinent info and avoid all the
HTML tags?
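(I have seen HTML parsers like Nokogiri mentioned for this kind of
thing; I am guessing the idiomatic version looks something like the
sketch below, though I have not tried it and the selectors are just my
guess at the page's structure:)
------------------------
require 'nokogiri'
require 'open-uri'

# Parse the page properly instead of regexing the raw HTML
doc = Nokogiri::HTML(URI.open('http://www.apartment-directory.info/connecticut/0'))

# Pull the text out of each row's cells; no tags left to scrub
doc.css('tr').each do |row|
  cells = row.css('td').map { |td| td.text.strip }
  puts cells.join(', ') unless cells.empty?
end
----------------------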
thanks
atomic