How do I extract the ID name of a div and its content?

Hans Wurst

Hi folks,
sorry to bother you with such a mundane question, but I have tried for 3 hours
and don't know how to do it :-(
I need to extract the strings, that is, everything between "" or ''.

This is the input:

<body>
<div id='pagewrapper'>
<div id='header'>
<p>Ruby Forum is cool</p>
</div>
<div id='navbar'>
<ul>
<li><%= link_to 'Cities', cities_path %></li>
<li><%= link_to 'Restaurants', restaurants_path %></li>
<li><%= link_to 'Categories', categories_path %></li>
<li><%= link_to 'Products', products_path %></li>
</ul>

Desired output:

pagewrapper
header
navbar
Cities
Restaurants
Categories
Products


Please help
Thanks in advance
pannda
 
Robert Dober

Disclaimer: If there are other ' characters somewhere in the document
(comments, CDATA sections, text elements), this will break miserably and
you will need Hpricot or another HTML parser.
If, however, your data is simple enough and you can fulfill that
prerequisite, it becomes very easy...

robert@siena:~/log/ruby/ML 12:56:41
505/6 > cat strings.rb && ruby strings.rb
#!/usr/bin/ruby
# vim: sw=2 ts=2 ft=ruby expandtab tw=0 nu syn=on:
# file: strings.rb


text = DATA.read

# Parentheses avoid Ruby's "ambiguous first argument" warning for the regex literal.
p text.scan(/'(.*?)'/)
__END__
<body>
<div id='pagewrapper'>
<div id='header'>
<p>Ruby Forum is cool</p>
</div>
<div id='navbar'>
<ul>
<li><%= link_to 'Cities', cities_path %></li>
<li><%= link_to 'Restaurants', restaurants_path %></li>
<li><%= link_to 'Categories', categories_path %></li>
<li><%= link_to 'Products', products_path %></li>
</ul>
[["pagewrapper"], ["header"], ["navbar"], ["Cities"], ["Restaurants"],
["Categories"], ["Products"]]
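
The original question mentions both "" and '', and the output above is a list of one-element arrays. A minimal sketch of a variant that handles both quote styles and returns plain strings might look like this. The helper name quoted_strings and the sample input are made up for illustration, and the same disclaimer applies: stray quotes elsewhere in the document will break it.

```ruby
# quoted_strings is a hypothetical helper name, not part of the script above.
def quoted_strings(text)
  # (["']) captures the opening quote; \1 requires the same character to close it.
  # With two groups, scan yields [quote, content] pairs, so keep only the content.
  text.scan(/(["'])(.*?)\1/).map { |_quote, content| content }
end

sample = <<~ERB
  <div id="pagewrapper">
  <div id='header'>
  <li><%= link_to 'Cities', cities_path %></li>
ERB

p quoted_strings(sample)
```

For this sample it should print ["pagewrapper", "header", "Cities"].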
 
Brian Candler

Hans said:
but I have tried for 3 hours and don't know how to do it :-(

So, what approach or approaches did you try? What was the solution that
got closest to what you were trying, and what did it output?

There are a whole host of ways you might be approaching this. For
example, you might be using the Hpricot (HTML parsing) library, or REXML
(XML parsing), or simple regular expression matching.

If all you want is the bits between 'single quotes' then a regular
expression match is probably easiest. Try using String#scan, and give it
a regular expression which matches a single quote followed by any number
of non-single-quote characters followed by a single quote.
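
A minimal sketch of that idea, assuming the input only uses single quotes around the strings you care about (the helper name single_quoted_strings and the sample text are made up for illustration):

```ruby
# single_quoted_strings is a hypothetical helper name.
def single_quoted_strings(text)
  # A single quote, any run of non-single-quote characters, then a closing quote.
  # scan returns one-element arrays per match because of the group; flatten unwraps them.
  text.scan(/'([^']*)'/).flatten
end

sample = <<~ERB
  <div id='navbar'>
  <li><%= link_to 'Cities', cities_path %></li>
ERB

puts single_quoted_strings(sample)
```

For this sample it should print navbar and Cities on separate lines.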

Or you could use String#split("'") and keep only the odd-numbered
elements of the returned array.
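
A rough sketch of the split variant, assuming the quotes in the input are balanced (the helper name odd_split_parts is made up for illustration). Splitting on ' alternates between text outside and inside the quotes, so the pieces at odd indexes are the quoted strings:

```ruby
# odd_split_parts is a hypothetical helper name.
def odd_split_parts(text)
  # Index 0 is the text before the first quote, index 1 the first quoted string,
  # index 2 the text after it, and so on.
  text.split("'").select.with_index { |_part, index| index.odd? }
end

p odd_split_parts("<div id='navbar'><li><%= link_to 'Cities', cities_path %></li>")
```

One caveat: split drops trailing empty strings, so an empty '' at the very end of the input would be lost, and an unbalanced quote shifts everything after it.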

I'm sure if you post your actual code and what it does, someone will
help you tweak it to work.
 
Hans Wurst

Thaaaaaaaank you, Robert!
Thumbs up! This is exactly what I wanted: clean and simple, without
Hpricot, REXML or overly complicated regexes.

Have a nice weekend
greetz Pannda
 
Hans Wurst

Hi Brian,

Brian said:
I'm sure if you post your actual code and what it does, someone will
help you tweak it to work.

My actual code was this:

require 'rubygems'
require 'yaml'
require 'hpricot'

# Read the input file and strip the markup from every line.
# I tried the three options below, one at a time.
html_datei = File.readlines(ARGV[0]).collect do |line|
  # Option 1: line.gsub(/<\/?[^>]*>/, "").to_yaml
  # Option 2: line.gsub(/</, "").gsub(/>/, "").strip.to_yaml.gsub(/---/, "").lstrip
  # Option 3:
  line.gsub(/^<\/?[^>]*>/, "").lstrip
end

# All 3 options worked, but I couldn't figure out how to get the text
# between "" or '', so I tried Hpricot.

doc = open(ARGV[0]) { |f| Hpricot(f).search("div") }

# But how do I go on from here? I couldn't figure out the Hpricot
# documentation, because .to_inner_html doesn't work.

yaml_datei = File.new(ARGV[1], 'w+')
yaml_datei << html_datei
yaml_datei << doc
yaml_datei.close

So, that was my actual code.
 
