Finding a sentence (more than one word & punctuation (, . ;)) in a string?

Kev Jackson

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev
 
Erik Veenstra

given this string
" <td valign=\"top\"> message</td> <td valign=\"top\"> the
message to echo.</td> <td valign=\"top\" align=\"center\">
Yes, unless data is included in a character section within
this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is
included in a character section within this element."]

?

s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}

gegroet,
Erik V. - http://www.erikveen.dds.nl/
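
For reference, a minimal sketch of the split approach above, assuming the input string from the question (with a line break inside two of the cells). The helper name text_between_tags is hypothetical and not from the thread; the extra gsub is an added step that collapses the line breaks so the result matches the requested output exactly.

```ruby
# Hypothetical helper (not from the thread): split on tags, drop the
# empty pieces, and collapse internal whitespace into single spaces.
def text_between_tags(html)
  html.split(/\s*<[^<>]*>\s*/)
      .reject { |piece| piece.empty? }
      .map { |piece| piece.gsub(/\s+/, " ") }
end

s = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

p text_between_tags(s)
# => ["message", "the message to echo.", "Yes, unless data is included in a character section within this element."]
```

The empty strings appear because adjacent tags (for example </td> followed by <td ...>) leave nothing between two matches of the separator, which is why the reject step is needed.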
 
Robert Klemme

Kev said:
given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data
is included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

If you really want sentences, this will work:
s.scan /\w+(?:[\s,]+\w+)*[.;?!]/
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character
section within this element."]
s.scan /\w+(?:,?\s+\w+)*[.;?!]/
=> ["the message\nto echo.", "Yes, unless data is\nincluded in a character
section within this element."]

Kind regards

robert
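
A small sketch of the scan approach above, using the second pattern as posted. The constant SENTENCE and the method sentences are hypothetical names; the whitespace collapse is an added step. Note that "message" is not returned, because the pattern requires terminating punctuation.

```ruby
# The second pattern from the post: words separated by whitespace
# (optionally preceded by a comma), ending in one of . ; ? !
SENTENCE = /\w+(?:,?\s+\w+)*[.;?!]/

# Hypothetical helper: find sentences and flatten embedded line breaks.
def sentences(text)
  text.scan(SENTENCE).map { |match| match.gsub(/\s+/, " ") }
end

s = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

p sentences(s)
# => ["the message to echo.", "Yes, unless data is included in a character section within this element."]
```

Tag names and attribute values are skipped only because none of them happens to be followed by sentence punctuation in this input; that is an accident of the data rather than a guarantee.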
 
Mark Woodward

Hi all,

Erik Veenstra wrote:
....
s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}

gegroet,
Erik V. - http://www.erikveen.dds.nl/

As a newbie I thought I'd have a go at this.
What I was trying to do was take Erik's code above, get the text between
tags into an array and then print it out as:
[message, the message to echo, Yes, unless data is included...]

I can do it by the look of things, but if there are any suggestions on how
to improve this I'd appreciate them. I.e., is the {} block the most efficient
way to fill the array? Is there a better way to print it out?


# --------------------------------
foo = " <td valign=\"top\">message</td> <td valign=\"top\">the
message to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless
data is included in a character section within this element.</td> </tr> "

# I want to fill an array so I can display in the format
# [message, the message to echo, Yes, unless...]
a = Array.new

# I think I understand this.
# /\s*<[^<>]*>\s*/ = find all tags
# \s* find 0 or more spaces
# <[^<>]*> find anything between and including <>
# \s* as above
# and reject them (.reject)
# whats left (text between tags) use as x in the block |x|

# x seemed to include empty strings so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

# Trying to find the best way to print this???
# nothing like what I want
# puts "--- print a ---"
print a

# extra space after last item
# puts "\n\n--- print \"[\" a.each{|x| print x + \", \" print \"]\" ---"
print "[ "
a.each{|x| print x + ", "}
print "]"

# close but must know array size
# puts "\n\n print \"[\" + a[0] + \", \" + a[1] + \", \" + a[2] + \"]\""
print "[" + a[0] + ", " + a[1] + ", " + a[2] + "]\n"

# probably the most 'right' output wise
puts "\n\n--- for i in 0...a.length-1 ---"
print "[ "
for i in 0...a.length-1
print a[i] + ", "
end
print a[a.length-1]
print "]"
# --------------------------------

thanks,

Mark
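
On the printing question, one possible sketch uses Array#join, which puts the separator only between elements, so there is no trailing comma and no need to know the array size. The helper name bracketed is hypothetical and the sample array is shortened for illustration.

```ruby
# Hypothetical helper: format an array as "[a, b, c]" without quotes.
def bracketed(items)
  "[" + items.join(", ") + "]"
end

a = ["message", "the message to echo.", "Yes, unless data is included..."]

puts bracketed(a)   # [message, the message to echo., Yes, unless data is included...]
puts bracketed([])  # []
```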
 
Mark Woodward

Mark Woodward wrote:
....
# x seemed to include empty strings so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

Hmm, here's the first improvement? Seems I can use a << x to append to
an array:

# x seemed to include ""??? so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}
 
Xavier Noria

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the
message to echo.</td> <td valign=\"top\" align=\"center\">Yes,
unless data is included in a character section within this
element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included
in a character section within this element."]

There have been several simple approaches proposed in this thread
that may work for what you want. Just in case, if you needed
something more robust you could have a glance at existing Perl
modules that solve this problem like Lingua::EN::Sentence.

-- fxn
 
Ross Bamford

Mark Woodward wrote:
...
# x seemed to include empty strings so only add x to the array if not
""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

Hmm, here's the first improvement? Seems I can use a << x to append to
an array:

# x seemed to include ""??? so only add x to the array if not ""
foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}

I'm not sure what you're trying to do here, but I think split returns an
array already, operated on by reject in this case, which returns the new
array. So with Erik's code:

a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
p a
# => ["message", "the message to echo.", ... etc ... ]

I guess an alternative similar to your approach above might be:

b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty? then ary
else ary << x end }
p b
# => ["message", "the message to echo.", ... etc ... ]
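
To make the comparison concrete, here is a rough sketch that runs both forms on the same pieces and checks that they agree. The method names keep_with_reject and keep_with_inject are hypothetical, and the sample array stands in for what split returns.

```ruby
# reject returns a new array without the elements the block accepts.
def keep_with_reject(pieces)
  pieces.reject { |x| x.empty? }
end

# inject threads an accumulator through the block; each iteration must
# return the accumulator, so both branches evaluate to ary.
def keep_with_inject(pieces)
  pieces.inject([]) { |ary, x| x.empty? ? ary : ary << x }
end

pieces = ["", "message", "", "the message to echo."]

p keep_with_reject(pieces)                              # ["message", "the message to echo."]
p keep_with_reject(pieces) == keep_with_inject(pieces)  # true
p pieces.length                                         # 4, the original array is untouched
```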

Note the 'p' method, which prints out using 'inspect'. Alternatively, you
could have done:

puts b.inspect
print "#{b.inspect}\n"

and so on. Another nitpick about your example: in most Ruby I've seen,
people tend to prefer unless over an if with a negated condition. So
where you have:

if x != ""

I'd tend to use:

unless x == ""

or (more likely):

unless x.empty?

Cheers,
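
As a quick check of the three printing forms mentioned above, the following sketch captures what each one writes and compares the results. The captured helper is hypothetical (not from the thread) and temporarily swaps $stdout for a StringIO; the sample array is shortened.

```ruby
require "stringio"

# Hypothetical helper: run a block and return whatever it wrote to
# $stdout as a string, restoring the real $stdout afterwards.
def captured
  original = $stdout
  $stdout = StringIO.new
  yield
  $stdout.string
ensure
  $stdout = original
end

b = ["message", "the message to echo."]

via_p     = captured { p(b) }
via_puts  = captured { puts b.inspect }
via_print = captured { print "#{b.inspect}\n" }

puts via_p               # ["message", "the message to echo."]
puts via_p == via_puts   # true
puts via_p == via_print  # true

# Plain puts on the array prints one element per line instead,
# with no brackets or quotes.
puts b
```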
 
Mark Woodward

Hi Ross,

I'm not sure what you're trying to do here,

makes 2 of us ;-)

but I think split returns
an array already, operated on by reject in this case, which returns the
new array. So with Erik's code:

a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
p a
# => ["message", "the message to echo.", ... etc ... ]

Exactly what I was trying to do. I thought it had to be an array but
couldn't figure out how to print it like ["","",""] like the OP wanted.
p a - now that's embarrassing! 2 letters and it works. Compare that to my
gibberish :-(. We all have to start somewhere I guess!

I guess an alternative similar to your approach above might be:

b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty?
then ary else ary << x end }
p b
# => ["message", "the message to echo.", ... etc ... ]

Note the 'p' method, which prints out using 'inspect'. Alternatively,
you could have done:

puts b.inspect
print "#{b.inspect}\n"

steady on! ;-)

and so on. Another nitpick about your example: in most Ruby I've seen,
people tend to prefer unless over an if with a negated condition. So
where you have:

if x != ""

I'd tend to use:

unless x == ""

or (more likely):

unless x.empty?

Nitpick away! I appreciate it. It's been a good little exercise re p,
puts, print and chaining methods etc. I've been reading the pickaxe
book, but reading's not good enough. I need to write some code. If I can
make a fool of myself here but learn something at the same time then
that's great!

thanks,
 
Ross Bamford

Exactly what I was trying to do. I thought it had to be an array but
couldn't figure out how to print it like ["","",""] like the OP wanted.
p a - now that's embarrassing! 2 letters and it works. Compare that to my
gibberish :-(. We all have to start somewhere I guess!

Absolutely. My early Ruby was probably some of the least Rubyish Ruby
around :) Check out the 'show_array' nonsense here at
http://roscopeco.co.uk/code/noob/basic-syn2.rb - ouch. (I later refactored
it a bit to http://roscopeco.co.uk/code/noob/arrays.html).

Nitpick away! I appreciate it. It's been a good little exercise re p,
puts, print and chaining methods etc. I've been reading the pickaxe
book, but reading's not good enough. I need to write some code. If I can
make a fool of myself here but learn something at the same time then
that's great!

Heh, I definitely know what you mean there - I have to do stuff to learn
too. That said, though, I just got my paper pickaxe (finally, this
morning!) and it's much better having something solid to refer to without
having to switch to the browser and all that, so I can at least check I'm
making sense :)

Cheers,
 
Gene Tani

Kev said:
given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

if this is an HTML table extraction thing, rubyful soup is the easiest
way to do it
http://www.crummy.com/software/RubyfulSoup/documentation.html

there's also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/
 
Mark Woodward

Ross said:
Heh, I definitely know what you mean there - I have to do stuff to
learn too. That said, though, I just got my paper pickaxe (finally,
this morning!) and it's much better having something solid to refer to
without having to switch to the browser and all that, so I can at least
check I'm making sense :)

Yeah, I've been using the PDF version of Pickaxe (version 2) but will order
the felled trees version I think. Also 'The Ruby Way' version 2 when it
is published. Whatever it takes ;-)

thanks again,
 
Kev Jackson

Gene said:
Kev Jackson wrote:

given this string

" <td valign=\"top\">message</td> <td valign=\"top\">the message
to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
included in a character section within this element.</td> </tr> "

how can I get this result

["message", "the message to echo.", "Yes, unless data is included in a
character section within this element."]

?

I've tried scan + regexp, but the best I've got so far is

[["message"]]

with this

r.scan(/\">(\w+\s*)<\/td>/)

Thanks
Kev

if this is an HTML table extraction thing, rubyful soup is the easiest
way to do it
http://www.crummy.com/software/RubyfulSoup/documentation.html

there's also the htmltokenizer.getText() method, (which i just now
discovered by googling) which allows you to extract from before 1 tag
at a time
http://htmltokenizer.rubyforge.org/doc/

That is indeed what the problem domain is (did the <td> give it away?).

Basically I have a whole lot of html files and I need to re-write them
as xml (sort of docbook-ish, but not quite). I'm using builder
(fantastic bit of kit by the way), but my original files sometimes
contain things like

"<td valign=\"top\">append</td>
<td valign=\"top\">Append to an existing file (or
<a
href=\"http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String,
boolean)\" target=\"_blank\">
open a new file / overwrite an existing file</a>)?
</td>
<td valign=\"top\" align=\"center\">No - default is false.</td>"

And anything I try basically means that I end up with either nothing
extracted or the whole table extracted! My thoughts were to try a
simple conversion and then fix things manually afterwards (i.e. get 95% of
the conversion done through a script and then apply some elbow grease to
finish off the parts that take too much time to work out)

I'm now off to read about this tokenizer ^^^ and see if it does what I
want - obviously I'd love to have an automated solution (there are 1000+
html docs I need to convert).

I must admit to beginning to loathe HTML's lack of structural information
- if this was a docbook file I'd have very few problems converting it (I
could choose many options), but HTML is so limited in its ability to
express what meaning some section has [sigh]

Thanks to all for the suggested regexps - I never intended it to become
a mini Ruby Quiz :)
Kev
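
For the nested <a> case above, one possible regex-only sketch works cell by cell: capture everything between each <td ...> and </td>, then remove any tags left inside. The helper name cell_texts is hypothetical and not from the thread. It assumes <td> elements are never nested inside one another and that no attribute value contains a ">" character; a real HTML parser would be more robust.

```ruby
# Hypothetical helper: return the text of each <td> cell, with nested
# tags removed and whitespace collapsed. A sketch, not an HTML parser.
def cell_texts(html)
  html.scan(/<td\b[^>]*>(.*?)<\/td>/mi).flatten.map do |inner|
    # First drop nested tags such as <a ...> and </a>, then collapse
    # runs of whitespace (including newlines) and trim the ends.
    inner.gsub(/<[^>]*>/, "").gsub(/\s+/, " ").strip
  end
end

html = <<~'HTML'
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
  <a
  href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String,
  boolean)" target="_blank">
  open a new file / overwrite an existing file</a>)?
  </td>
  <td valign="top" align="center">No - default is false.</td>
HTML

p cell_texts(html)
# => ["append", "Append to an existing file (or open a new file / overwrite an existing file)?", "No - default is false."]
```

Removing tags outright keeps punctuation tight against the link text, at the cost of joining words that were separated only by a tag (for example a <br>); replacing tags with a space instead is the other obvious trade-off.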
 
