Regexp Ruby selection

touffik

Hi folks,
I'm trying to write a Ruby script that selects the content of an HTML
table in an HTML page.
I used Rubular to test my regexp, which is
/ <td class="TabIntCenContenuto"[^>]*>(.*)&nbsp; /
With Rubular, the result of my expression is:
Result 1
1. 12345678
Result 2
1. SAN FRANCESCO DA PAOLA
Result 3
1. Via San Francesco Da Paola, 10
Result 4
1. 10123
Result 5
1. TORINO
etc.
But with my script:

File.open('D:/testt/1.txt', 'r') do |filein|
  while line = filein.gets
    p line if line =~ /<td class="TabIntCenContenuto"[^>]*>/ .. line =~ /\/A&nbsp;/
  end
  fileout.puts p
end
end

I got this result:
"</td><td class=\"TabIntCenContenuto\">12345678&nbsp;\n"
"</td><td class=\"TabIntCenContenuto\">SAN FRANCESCO DA PAOLA&nbsp;</td>\n"
"<td class=\"TabIntCenContenuto\">Via San Francesco Da Paola, 10&nbsp;</td>\n"
"<td class=\"TabIntCenContenuto\">10123&nbsp;</td>\n"
"<td class=\"TabIntCenContenuto\" align=\"left\">TORINO&nbsp;</td>\n"

I thought the .. between the two "line =~" tests worked like (...) in
Rubular, which captures the content?
Moreover, I would like to transform this HTML into XML, but I can't
work out how to turn these HTML lines into XML such as:
<root>
<number>12345678</number>
But there is no 'name' attribute or anything similar on the <td>, so a
match/replace would be difficult?
...

So, if someone can help me, I would be very grateful.
Have a nice day ;)
 
Srijayanth Sridhar

Any specific reason you can't use hpricot or other HTML parsers?

Jayanth
 
Shadowfirebird

I'll second that. Hpricot is really quite remarkable. It'll almost
certainly save you days and days of pain. Unless you are doing this
for fun / learning, of course.

--
Me, I imagine places that I have never seen / The colored lights in
fountains, blue and green / And I imagine places that I will never go
/ Behind these clouds that hang here dark and low
But it's there when I'm holding you / There when I'm sleeping too /
There when there's nothing left of me / Hanging out behind the
burned-out factories / Out of reach but leading me / Into the
beautiful sea
 
Sebastian Hungerecker

I thought the .. between 2 "line =~" was like (...) in rubular which
let catch the content ??

Generally in Ruby, .. denotes a range, like starting_value .. end_value.
In this case, though, it denotes a flip-flop, which is evil and should never
ever be used because it makes my head hurt. Here's what it does, though:
some_loop {
  do_something if foo .. bar
}
This will do nothing until foo is true. When foo is true, it will do_something.
It will then keep doing so in every iteration of the loop until bar
becomes true (that iteration included). Once bar has become true, it will stop
doing so until foo is true again.
So, as a summary: it doesn't do what you thought it did. It only decides
whether the statement runs; it captures nothing. As a matter of fact, it
doesn't do anything sane. So just keep as far away from it as possible.
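
To make the on/off behaviour described above concrete, here is a minimal sketch. The input list and the "START"/"END" markers are made up for illustration; they are not from the original script.

```ruby
# Hypothetical input: the markers "START" and "END" are invented for this sketch.
lines = %w[a START b c END d]

selected = []
lines.each do |line|
  # Flip-flop: switches on when the left test is true and stays on
  # until the right test is true (that line is still included).
  selected << line if (line == 'START')..(line == 'END')
end

p selected
```

This should print ["START", "b", "c", "END"]. Note that the flip-flop only gates the statement; the whole line is kept each time, which is consistent with the original script printing complete lines rather than the cell contents.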


HTH,
Sebastian
 
touffik

Any specific reason you can't use hpricot or other HTML parsers?

I didn't know about this tool for Ruby. I once used a parser named tidy, but
that's all. I'll try it now and let you know.

So I was wrong. Thank you for explaining my misuse of .. in the loop.

Thanks.
 
Peña, Botp

# I'm trying to code a ruby script that select the content of a HTML
# table in a HTML page.
# I used rubular to test my regexp syntax which is
# / <td class="TabIntCenContenuto"[^>]*>(.*)&nbsp; /

the re is fine, you can use that

# with rubular the result of my expression is :
# Result 1
# 1. 12345678
# Result 2
# 1. SAN FRANCESCO DA PAOLA
# Result 3
# 1. Via San Francesco Da Paola, 10
# Result 4
# 1. 10123
# Result 5
# 1. TORINO
# etc....
# But with my script :
#
# File.open('D:/testt/1.txt', 'r') do |filein|
# while line = filein.gets
# p line if line =~ /<td class="TabIntCenContenuto"[^>]*>/ .. line =~ /\/A&nbsp;/
# end
# fileout.puts p
# end
# end
# I got this result
# "</td><td class=\"TabIntCenContenuto\">12345678&nbsp;\n"
# "</td><td class=\"TabIntCenContenuto\">SAN FRANCESCO DA PAOLA&nbsp;</td>\n"
# "<td class=\"TabIntCenContenuto\">Via San Francesco Da Paola, 10&nbsp;</td>\n"
# "<td class=\"TabIntCenContenuto\">10123&nbsp;</td>\n"
# "<td class=\"TabIntCenContenuto\" align=\"left\">TORINO&nbsp;</td>\n"

you already got it, but you did not capture

sample code & run,

botp@botp-desktop:~$ cat test.rb
File.open('test.txt') do |f|
  while line = f.gets
    if line =~ /<td class="TabIntCenContenuto"[^>]*>(.*)&nbsp;/
      p $1
    end
  end
end

botp@botp-desktop:~$ ruby test.rb
"12345678"
"SAN FRANCESCO DA PAOLA"
"Via San Francesco Da Paola,10"
"10123"
"TORINO"

# I thought the .. between 2 "line =~" was like (...) in rubular which
# let catch the content ??

you are making it harder. keep it simple.
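
As a variation on the capture approach above, here is a small sketch that pulls every cell out of a string in one pass with String#scan. The helper name cell_texts and the sample HTML are assumptions for illustration, and the capture is made non-greedy, (.*?), as a precaution in case two cells ever share one line.

```ruby
# Same pattern as in the thread, but with a non-greedy capture group.
CELL_RE = /<td class="TabIntCenContenuto"[^>]*>(.*?)&nbsp;/

# Hypothetical helper: returns the captured text of every matching cell.
def cell_texts(html)
  # scan returns one array per match (one entry per group), so flatten it.
  html.scan(CELL_RE).flatten
end

# Sample input modelled on the lines quoted in the thread.
html = <<~HTML
  </td><td class="TabIntCenContenuto">12345678&nbsp;
  </td><td class="TabIntCenContenuto">SAN FRANCESCO DA PAOLA&nbsp;</td>
  <td class="TabIntCenContenuto" align="left">TORINO&nbsp;</td>
HTML

p cell_texts(html)
```

For a real file, the same helper could be fed File.read('test.txt'). A regexp like this is only a reasonable bet while the markup stays as regular as the sample; for anything messier, the HTML parser suggested earlier in the thread is likely the safer route.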

# Moreover I would like to transform this html code in XML. But I can"t
# find an idea how to transform these HTML line in XML.
# <root>
# <number>12345678</number>
# But there is no attribut 'name' or wathever in the <td> so making and
# match/replace would be difficult ?

if the html is nicely formatted, you can loop through the table.
if you want to be sure, try outputting all the data you can capture first. Then output that again with xml tags inserted.

do not worry. xml, like html, is just text with tags. Manipulating text is a good learning exercise for ruby.
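
The "output it again with xml tags inserted" idea could be sketched roughly as follows. Since the cells carry no names, the element names here are assigned purely by column position and are hypothetical (only number appears in the original question); row_to_xml and xml_escape are illustrative helpers, not part of any library.

```ruby
# Hypothetical element names, chosen by column position (not taken from the page).
FIELD_NAMES = %w[number name address zip city].freeze

# Minimal escaping of the characters that are special in XML text.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

def xml_escape(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

# Wraps one row of captured values in tags, pairing each value with a name.
def row_to_xml(values)
  fields = FIELD_NAMES.zip(values).map do |tag, value|
    "  <#{tag}>#{xml_escape(value.to_s)}</#{tag}>"
  end
  (['<root>'] + fields + ['</root>']).join("\n")
end

values = ['12345678', 'SAN FRANCESCO DA PAOLA',
          'Via San Francesco Da Paola, 10', '10123', 'TORINO']
puts row_to_xml(values)
```

This assumes every row has the same five columns in the same order; if the real table differs, the list of names would need adjusting.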

kind regards -botp
 
