Finding a sentence (more than one word & punctuation (, . ;)) in a string?

Discussion in 'Ruby' started by Kev Jackson, Jan 11, 2006.

  1. Kev Jackson

    Kev Jackson Guest

    given this string

    " <td valign=\"top\">message</td> <td valign=\"top\">the message
    to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
    included in a character section within this element.</td> </tr> "

    how can I get this result

    ["message", "the message to echo.", "Yes, unless data is included in a
    character section within this element."]

    ?

    I've tried scan + regexp, but the best I've got so far is

    [["message"]]

    with this

    r.scan(/\">(\w+\s*)<\/td>/)

    Thanks
    Kev
    Kev Jackson, Jan 11, 2006
    #1
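
A note on why the scan above stops at [["message"]]: \w+\s* only matches a single word plus trailing whitespace, so any cell containing several words or punctuation fails to match. A minimal sketch of one possible fix, assuming the cells contain no nested tags, is to capture "anything that is not a <" instead. The helper name cell_texts is made up for illustration.

```ruby
# Hypothetical helper; the sample string is the one from the question.
def cell_texts(html)
  # Capture everything between an opening <td ...> tag and its </td>,
  # as long as the cell contains no further tags.
  cells = html.scan(/<td[^>]*>([^<]*)<\/td>/).flatten  # one group => array of 1-element arrays
  # Fold line breaks and runs of spaces into single spaces.
  cells.map { |text| text.gsub(/\s+/, " ").strip }
end

r = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

p cell_texts(r)
```

The flatten call is what turns the nested [["message"], ...] result of scan into a flat array of strings.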

  2. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    > given this string
    >
    > " <td valign=\"top\"> message</td> <td valign=\"top\"> the
    > message to echo.</td> <td valign=\"top\" align=\"center\">
    > Yes, unless data is included in a character section within
    > this element.</td> </tr> "
    >
    > how can I get this result
    >
    > ["message", "the message to echo.", "Yes, unless data is
    > included in a character section within this element."]
    >
    > ?


    s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}

    gegroet,
    Erik V. - http://www.erikveen.dds.nl/
    Erik Veenstra, Jan 11, 2006
    #2
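
To see what the split suggestion does step by step, here is a small sketch. The method name split_cells is invented for illustration, and the whitespace folding at the end is an optional extra that is not part of the one-liner above.

```ruby
# Hypothetical wrapper around the split-on-tags idea.
def split_cells(s)
  # Every tag, together with the whitespace around it, acts as a separator.
  pieces = s.split(/\s*<[^<>]*>\s*/)
  # Adjacent tags (for example "</td> <td ...>") leave empty strings behind.
  cells = pieces.reject { |x| x.empty? }
  # Optional: fold line breaks inside a cell into single spaces.
  cells.map { |x| x.gsub(/\s+/, " ") }
end

s = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

p s.split(/\s*<[^<>]*>\s*/)   # raw pieces, including the empty strings
p split_cells(s)
```

Because the tags themselves are the separators, this approach does not care which tags appear, which is both its strength and its weakness.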

  3. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Kev Jackson wrote:
    > given this string
    >
    > " <td valign=\"top\">message</td> <td valign=\"top\">the message
    > to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data
    > is included in a character section within this element.</td> </tr> "
    >
    > how can I get this result
    >
    > ["message", "the message to echo.", "Yes, unless data is included in a
    > character section within this element."]
    >
    > ?
    >
    > I've tried scan + regexp, but the best I've got so far is
    >
    > [["message"]]
    >
    > with this
    >
    > r.scan(/\">(\w+\s*)<\/td>/)
    >
    > Thanks
    > Kev


    If you really want sentences, this will work:

    >> s.scan /\w+(?:[\s,]+\w+)*[.;?!]/

    => ["the message\nto echo.", "Yes, unless data is\nincluded in a character
    section within this element."]
    >> s.scan /\w+(?:,?\s+\w+)*[.;?!]/

    => ["the message\nto echo.", "Yes, unless data is\nincluded in a character
    section within this element."]

    Kind regards

    robert
    Robert Klemme, Jan 11, 2006
    #3
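
A small sketch built on the first sentence pattern above. Note that a lone "message" is not returned, since the pattern requires a terminator (. ; ? !). The method name sentences and the whitespace folding are additions for illustration only.

```ruby
# Hypothetical helper: words separated by whitespace or commas,
# ended by one of . ; ? !
def sentences(text)
  found = text.scan(/\w+(?:[\s,]+\w+)*[.;?!]/)
  # Replace embedded line breaks with single spaces.
  found.map { |m| m.gsub(/\s+/, " ") }
end

s = " <td valign=\"top\">message</td> <td valign=\"top\">the message\n" \
    "to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is\n" \
    "included in a character section within this element.</td> </tr> "

p sentences(s)
```

Tag names and attribute values are skipped only because none of them happens to be followed by a terminator, so this is fragile on other markup.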
  4. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Hi all,

    Erik Veenstra wrote:
    ....
    > s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
    >
    > gegroet,
    > Erik V. - http://www.erikveen.dds.nl/
    >


    As a newbie I thought I'd have a go at this.
    What I was trying to do was take Erik's code above, get the text between
    tags into an array and then print it out as:
    [message, the message to echo, Yes, unless data is included...]

    I can do it by the look of things, but if there are any suggestions on how
    to improve this I'd appreciate it. I.e. is the {} the most efficient way
    to fill the array? Is there a better way to print it out?


    # --------------------------------
    foo = " <td valign=\"top\">message</td> <td valign=\"top\">the
    message to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless
    data is included in a character section within this element.</td> </tr> "

    # I want to fill an array so I can display in the format
    # [message, the message to echo, Yes, unless...]
    a = Array.new

    # I think I understand this.
    # /\s*<[^<>]*>\s*/ = find all tags
    # \s* find 0 or more spaces
    # <[^<>]*> find anything between and including <>
    # \s* as above
    # and reject them (.reject)
    # whats left (text between tags) use as x in the block |x|

    # x seemed to include empty strings so only add x to the array if not ""
    foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

    # Trying to find the best way to print this???
    # nothing like what I want
    # puts "--- print a ---"
    print a

    # extra space after last item
    # puts "\n\n--- print \"[\" a.each{|x| print x + \", \" print \"]\" ---"
    print "[ "
    a.each{|x| print x + ", "}
    print "]"

    # close but must know array size
    # puts "\n\n print \"[\" + a[0] + \", \" + a[1] + \", \" + a[2] + \"]\""
    print "[" + a[0] + ", " + a[1] + ", " + a[2] + "]\n"

    # probably the most 'right' output wise
    puts "\n\n--- for i in 0...a.length-1 ---"
    print "[ "
    for i in 0...a.length-1
      print a[i] + ", "
    end
    print a[a.length-1]
    print "]"
    # --------------------------------

    thanks,

    Mark
    Mark Woodward, Jan 11, 2006
    #4
  5. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Mark Woodward wrote:
    ....
    >
    > # x seemed to include empty strings so only add x to the array if not ""
    > foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}


    Hmm, here's the first improvement? Seems I can use a << x to append to
    an array:

    # x seemed to include ""??? so only add x to the array if not ""
    foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}


    --
    Mark
    Mark Woodward, Jan 11, 2006
    #5
  6. Xavier Noria

    Xavier Noria Guest

    Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    On Jan 11, 2006, at 8:08, Kev Jackson wrote:

    > given this string
    >
    > " <td valign=\"top\">message</td> <td valign=\"top\">the
    > message to echo.</td> <td valign=\"top\" align=\"center\">Yes,
    > unless data is included in a character section within this
    > element.</td> </tr> "
    >
    > how can I get this result
    >
    > ["message", "the message to echo.", "Yes, unless data is included
    > in a character section within this element."]


    There have been several simple approaches proposed in this thread
    that may work for what you want. Just in case, if you needed
    something more robust you could have a glance at existing Perl
    modules that solve this problem like Lingua::EN::Sentence.

    -- fxn
    Xavier Noria, Jan 11, 2006
    #6
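
To illustrate why a dedicated sentence module can be worth a look, here is a deliberately naive splitter in Ruby. It is not a port of Lingua::EN::Sentence; the method name and the sample text are made up, and the rule "a sentence ends at . ? or ! followed by whitespace" is an assumption that breaks on abbreviations.

```ruby
# Naive sketch: split after ., ? or ! when followed by whitespace.
def naive_sentences(text)
  text.gsub(/\s+/, " ").strip.split(/(?<=[.?!])\s+/)
end

p naive_sentences("Yes, unless data is included. See Dr. Smith for details.")
# The abbreviation "Dr." is wrongly treated as the end of a sentence.
```

Handling abbreviations, initials and decimal numbers is where a purpose-built library earns its keep.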
  7. Ross Bamford

    Ross Bamford Guest

    Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    On Wed, 11 Jan 2006 10:52:08 -0000, Mark Woodward wrote:

    > Mark Woodward wrote:
    > ...
    >> # x seemed to include empty strings so only add x to the array if not
    >> ""
    >> foo.split(/\s*<[^<>]*>\s*/).reject{|x| a.insert(-1,x) if x != ""}

    >
    > Hmm, here's the first improvement? Seems I can use a << x to append to
    > an array:
    >
    > # x seemed to include ""??? so only add x to the array if not ""
    > foo.split(/\s*<[^<>]*>\s*/).reject{|x| a << x if x != ""}
    >


    I'm not sure what you're trying to do here, but I think split returns an
    array already, operated on by reject in this case, which returns the new
    array. So with Erik's code:

    a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
    p a
    # => ["message", "the message to echo.", ... etc ... ]

    I guess an alternative similar to your approach above might be:

    b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty? then ary
    else ary << x end }
    p b
    # => ["message", "the message to echo.", ... etc ... ]

    Note the 'p' method, which prints out using 'inspect'. Alternatively, you
    could have done:

    puts b.inspect
    print "#{b.inspect}\n"

    and so on. Another nitpick about your example is that, in most Ruby I've
    seen, people tend to prefer unless over negating the condition of an
    if. So where you have:

    if x != ""

    I'd tend to use:

    unless x == ""

    or (more likely):

    unless x.empty?

    Cheers,

    --
    Ross Bamford -
    Ross Bamford, Jan 11, 2006
    #7
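
The points in the post above can be tried out with a short sketch. In the quoted reject call, the block's side effect is what fills a, and the array that reject itself returns is thrown away; returning the filtered array directly avoids the extra variable. The method name non_empty and the sample data are invented for illustration.

```ruby
# Hypothetical helper: reject already returns a new, filtered array.
def non_empty(items)
  items.reject { |x| x.empty? }
end

a = non_empty(["", "message", "", "the message to echo."])

p a                              # prints the inspect form, then a newline
puts a.inspect                   # same text as p
print "#{a.inspect}\n"           # same again, via string interpolation
puts "[" + a.join(", ") + "]"    # unquoted form, without knowing the array size
```

The last line gives the [message, the message to echo.] style of output asked about earlier in the thread, with no trailing separator to clean up.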
  8. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Hi Ross,

    Ross Bamford wrote:
    > On Wed, 11 Jan 2006 10:52:08 -0000, Mark Woodward <>
    > wrote:


    > I'm not sure what you're trying to do here,


    makes 2 of us ;-)

    but I think split returns
    > an array already, operated on by reject in this case, which returns the
    > new array. So with the Erik's code:
    >
    > a = s.split(/\s*<[^<>]*>\s*/).reject{|x| x.empty?}
    > p a
    > # => ["message", "the message to echo.", ... etc ... ]


    Exactly what I was trying to do. I thought it had to be an array but
    couldn't figure out how to print it like ["","",""] like the OP wanted.
    p a - now that's embarrassing! 2 letters and it works. Compare that to my
    gibberish :-(. We all have to start somewhere I guess!

    >
    > I guess an alternative similar to your approach above might be:
    >
    > b = foo.split(/\s*<[^<>]*>\s*/).inject([]) { |ary,x| if x.empty?
    > then ary else ary << x end }
    > p b
    > # => ["message", "the message to echo.", ... etc ... ]
    >
    > Note the 'p' method, which prints out using 'inspect'. Alternatively,
    > you could have done:
    >
    > puts b.inspect
    > print "{b.inspect}\n"


    steady on! ;-)

    >
    > and so on. Another nitpick about your example, is that in most Ruby
    > I've seen people tend to prefer using unless rather than !negating the
    > condition to if. So where you have:
    >
    > if x != ""
    >
    > I'd tend to use:
    >
    > unless x == ""
    >
    > or (more likely):
    >
    > unless x.empty?


    Nitpick away! I appreciate it. It's been a good little exercise re p,
    puts, print and chaining methods etc. I've been reading the pickaxe
    book, but reading's not good enough. I need to write some code. If I can
    make a fool of myself here but learn something at the same time then
    that's great!

    >
    > Cheers,
    >


    thanks,

    --
    Mark
    Mark Woodward, Jan 11, 2006
    #8
  9. Ross Bamford

    Ross Bamford Guest

    Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    On Wed, 11 Jan 2006 11:40:58 -0000, Mark Woodward wrote:

    > Exactly what I was trying to do. I thought it had to be an array but
    > couldn't figure out how to print it like ["","",""] like the OP wanted.
    > p a - now thats embarrassing! 2 letters and it works. Compare that to my
    > gibberish :-(. We all have to start somewhere I guess!
    >


    Absolutely. My early Ruby was probably some of the least Rubyish Ruby
    around :) Check out the 'show_array' nonsense here at
    http://roscopeco.co.uk/code/noob/basic-syn2.rb - ouch. (I later refactored
    it a bit to http://roscopeco.co.uk/code/noob/arrays.html).

    > Nitpick away! I appreciate it. Its been a good little exercise re p,
    > puts, print and chaining methods etc. I've been reading the pickaxe
    > book, but readings not good enough. I need to write some code. If I can
    > make a fool of myself here but learn something at the same time then
    > thats great!
    >


    Heh, I definitely know what you mean there - I have to do stuff to learn
    too. That said, though, I just got my paper pickaxe (finally, this
    morning!) and it's much better having something solid to refer to without
    having to switch to the browser and all that, so I can at least check I'm
    making sense :)

    Cheers,

    --
    Ross Bamford -
    Ross Bamford, Jan 11, 2006
    #9
  10. Gene Tani

    Gene Tani Guest

    Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Kev Jackson wrote:
    > given this string
    >
    > " <td valign=\"top\">message</td> <td valign=\"top\">the message
    > to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
    > included in a character section within this element.</td> </tr> "
    >
    > how can I get this result
    >
    > ["message", "the message to echo.", "Yes, unless data is included in a
    > character section within this element."]
    >
    > ?
    >
    > I've tried scan + regexp, but the best I've got so far is
    >
    > [["message"]]
    >
    > with this
    >
    > r.scan(/\">(\w+\s*)<\/td>/)
    >
    > Thanks
    > Kev


    if this is an HTML table extraction thing, rubyful soup is the easiest
    way to do it
    http://www.crummy.com/software/RubyfulSoup/documentation.html

    there's also the htmltokenizer getText() method (which I just now
    discovered by googling), which lets you extract the text before one tag
    at a time
    http://htmltokenizer.rubyforge.org/doc/
    Gene Tani, Jan 11, 2006
    #10
  11. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Ross Bamford wrote:

    > Heh, I definitely know what you mean there - I have to do stuff to
    > learn too. That said, though, I just got my paper pickaxe (finally,
    > this morning!) and it's much better having something solid to refer to
    > without having to switch to the browser and all that, so I can at least
    > check I'm making sense :)


    Yeah, I've been using the PDF version of Pickaxe (version 2) but will order
    the felled trees version I think. Also 'The Ruby Way' version 2 when it
    is published. Whatever it takes ;-)

    thanks again,

    --
    Mark
    Mark Woodward, Jan 11, 2006
    #11
  12. Kev Jackson

    Kev Jackson Guest

    Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    Gene Tani wrote:

    >Kev Jackson wrote:
    >
    >
    >>given this string
    >>
    >>" <td valign=\"top\">message</td> <td valign=\"top\">the message
    >>to echo.</td> <td valign=\"top\" align=\"center\">Yes, unless data is
    >>included in a character section within this element.</td> </tr> "
    >>
    >>how can I get this result
    >>
    >>["message", "the message to echo.", "Yes, unless data is included in a
    >>character section within this element."]
    >>
    >>?
    >>
    >>I've tried scan + regexp, but the best I've got so far is
    >>
    >>[["message"]]
    >>
    >>with this
    >>
    >>r.scan(/\">(\w+\s*)<\/td>/)
    >>
    >>Thanks
    >>Kev
    >>
    >>

    >
    >if this is an HTML table extraction thing, rubyful soup is the easiest
    >way to do it
    >http://www.crummy.com/software/RubyfulSoup/documentation.html
    >
    >there's also the htmltokenizer.getText() method, (which i just now
    >discovered by googling) which allows you to extract from before 1 tag
    >at a time
    >http://htmltokenizer.rubyforge.org/doc/
    >http://htmltokenizer.rubyforge.org/doc/
    >
    >
    >
    >

    That is indeed what the problem domain is (did the <td> give it away?).

    Basically I have a whole lot of html files and I need to re-write them
    as xml (sort of docbook-ish, but not quite). I'm using builder
    (fantastic bit of kit by the way), but my original files sometimes
    contain things like

    "<td valign=\"top\">append</td>
    <td valign=\"top\">Append to an existing file (or
    <a
    href=\"http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String,
    boolean)\" target=\"_blank\">
    open a new file / overwrite an existing file</a>)?
    </td>
    <td valign=\"top\" align=\"center\">No - default is false.</td>"

    And anything I try basically means that I end up with either nothing
    extracted or the whole table extracted! My thoughts were to try a
    simple conversion and then fix things manually afterwards (ie get 95% of
    the conversion done through a script and then apply some elbow grease to
    finish off the parts that take too much time to work out)

    I'm now off to read about this tokenizer ^^^ and see if it does what I
    want - obviously I'd love to have an automated solution (there are 1000+
    html docs I need to convert).

    I must admit to beginning to loathe HTML's lack of structural information
    - if this was a docbook file I'd have very few problems converting it (I
    could choose many options), but html is so limited in its ability to
    express what meaning some section has [sigh]

    Thanks to all for the suggested regexps - I never intended it to become
    a mini Ruby Quiz :)
    Kev
    Kev Jackson, Jan 12, 2006
    #12
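
For cells like the "append" example above, where an <a> element sits inside a <td>, one regexp-only option is to grab each cell first and strip the inner tags afterwards. This is only a sketch, assuming cells are never nested inside other cells; the method name is made up, and it is no substitute for a real parser.

```ruby
# Hypothetical helper: take each <td>...</td> body, then drop inner tags.
def cell_texts_with_nested_tags(html)
  # /m lets . match newlines; .*? stops at the first closing </td>.
  html.scan(/<td[^>]*>(.*?)<\/td>/m).flatten.map do |inner|
    inner.gsub(/<[^>]*>/, "")   # remove tags such as <a ...> and </a>
         .gsub(/\s+/, " ")      # fold line breaks into single spaces
         .strip
  end
end

html = <<~HTML
  <td valign="top">append</td>
  <td valign="top">Append to an existing file (or
  <a
  href="http://java.sun.com/j2se/1.4.2/docs/api/java/io/FileWriter.html#FileWriter(java.lang.String,
  boolean)" target="_blank">
  open a new file / overwrite an existing file</a>)?
  </td>
  <td valign="top" align="center">No - default is false.</td>
HTML

p cell_texts_with_nested_tags(html)
```

Removing a tag outright can glue two words together when the markup has no whitespace around it, so the results still want a manual check.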
  13. Re: Finding a sentence (more than one word & punctuation (, . ;)) in a string?

    A quick scan says that you've got legit xml there, why not use REXML?
    It's included in the ruby standard libs. Or any of the above html/xml
    parsing libraries with xpath to pluck your values out.

    REXML Docs:
    http://ruby-doc.org/stdlib/

    REXML Homepage:
    http://www.germane-software.com/software/rexml

    .adam
    Adam Sanderson, Jan 12, 2006
    #13
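
A rough sketch of the REXML route. It assumes the fragment is well-formed XML once wrapped in a single root element (the wrapper name row is invented here), so the stray </tr> in the very first sample would have to be removed beforehand. The helper names and the shortened sample fragment are for illustration only.

```ruby
require "rexml/document"

# Collect the text of a node and all of its descendants.
def all_text(node)
  node.children.map do |child|
    if child.is_a?(REXML::Text)
      child.value
    elsif child.respond_to?(:children)
      all_text(child)
    else
      ""
    end
  end.join
end

# Hypothetical helper: wrap the cells in one root element and read each <td>.
def td_texts(fragment)
  doc = REXML::Document.new("<row>#{fragment}</row>")
  REXML::XPath.match(doc, "//td").map do |td|
    all_text(td).gsub(/\s+/, " ").strip
  end
end

fragment = '<td valign="top">append</td>' \
           '<td valign="top">Append to an existing file (or ' \
           '<a target="_blank">open a new file</a>)?</td>'

p td_texts(fragment)
```

Real-world HTML is often not well-formed, in which case REXML raises a parse error and one of the HTML-specific libraries mentioned earlier is the safer choice.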
