My regexp stupidity needs assistance before I lose all my hair!

Discussion in 'Ruby' started by trans. (T. Onoma), Jan 17, 2005.

  1. Let me be painfully honest: I hate parsing, especially w/ regexp, and I don't
    care if it's because I'm stupid and suck at it. It shouldn't have to be this
    hair-pulling! Anyway... Can someone please give me the regular expression to
    match the first square bracket's contents? In this case it would be "Hello".

    s = <<-EOS
    [Hello]
    This is[b.] a test.
    [Hello.]
    EOS

    Much obliged,
    T.
     
    trans. (T. Onoma), Jan 17, 2005
    #1

  2. trans.  (T. Onoma)

    Zach Dennis Guest



    The trick here is to make sure you are non-greedy.

    s =~ /\[([^\]]*)\]/

    Zach
     
    Zach Dennis, Jan 17, 2005
    #2

  3. I think that this is what you need: /\[[\w]+\]/

    This little application might help you (not sure if it is 100% Ruby
    compatible, but may be a start) called TestRexp, which you can get
    here: http://regexpstudio.com/RegExpStudio.html

    hth,
    Douglas
     
    Douglas Livingstone, Jan 17, 2005
    #3
  4. trans.  (T. Onoma)

    Zach Dennis Guest



    Almost forgot, $1 is the match you are looking for.
     
    Zach Dennis, Jan 17, 2005
    #4
  5. trans.  (T. Onoma)

    Zach Dennis Guest

    Ah, this is nicer and shorter than mine... I think I will use this one
    too. =)

    Zach
     
    Zach Dennis, Jan 17, 2005
    #5
  6. trans.  (T. Onoma)

    Glenn Parker Guest



    s =~ /\[([^\]]*)\]/
    puts $1
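
    For anyone following along, here is a minimal sketch of the same idea run
    against the sample string from the first post. The helper name
    first_bracket_contents is made up for illustration, and it uses String#[]
    with a capture index instead of $1, which is just one way to do it.

```ruby
# The sample text from the original post.
SAMPLE = <<-EOS
[Hello]
This is[b.] a test.
[Hello.]
EOS

# Hypothetical helper: returns the contents of the first [...] pair,
# or nil when the text contains no bracket pair at all.
def first_bracket_contents(text)
  # \[ and \] are literal brackets; [^\]]* is "any run of characters
  # that are not a closing bracket", captured as group 1.
  text[/\[([^\]]*)\]/, 1]
end

puts first_bracket_contents(SAMPLE)   # prints Hello
```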
     
    Glenn Parker, Jan 17, 2005
    #6
  7. | > Let me be painfully honest: I hate parsing, especially w/ regexp, and I
    | > don't care if it's because I'm stupid and suck at it. It shouldn't have to
    | > be this hair-pulling! Anyway... Can someone please give me the regular
    | > expression to match the first square bracket's contents? In this case it
    | > would be "Hello".
    | >
    | > s = <<-EOS
    | > [Hello]
    | > This is[b.] a test.
    | > [Hello.]
    | > EOS
    |
    | The trick here is to make sure you are non-greedy.
    |
    | s =~ /\[([^\]]*)\]/

    Thanks. I _see_ now why mine wasn't working, though I don't _understand_ why
    it wasn't working. I was using the / /x extension, because I generally like
    to space the parts of my regexps out to read easier, but for some reason that
    causes the above to match instead. Oh well, I just won't do that.

    Thanks All for your responses!
    T.
     
    trans. (T. Onoma), Jan 17, 2005
    #7
  8. 26 pm, Zach Dennis wrote:
    | | trans. (T. Onoma) wrote:
    | | > Let me be painfully honest: I hate parsing, especially w/ regexp, and I
    | | > don't care if it's because I'm stupid and suck at it. It shouldn't have
    | | > to be this hair-pulling! Anyway... Can someone please give me the regular
    | | > expression to match the first square bracket's contents? In this case
    | | > it would be "Hello".
    | | >
    | | > s = <<-EOS
    | | > [Hello]
    | | > This is[b.] a test.
    | | > [Hello.]
    | | > EOS
    | |
    | | The trick here is to make sure you are non-greedy.
    | |
    | | s =~ /\[([^\]]*)\]/
    |
    | Thanks. I _see_ now why mine wasn't working, though I don't _understand_
    | why it wasn't working. I was using the / /x extension, because I generally
    | like to space the parts of my regexps out to read easier, but for some reason
    | that causes the above to match instead. Oh well, I just won't do that.

    Oops scratch that. That's not the reason either (sigh). But I got it working
    now anyway. Thanks.

    T.
     
    trans. (T. Onoma), Jan 17, 2005
    #8
  9. And I was thinking "ooh, Zach's looks like a better way to do it" :)

    Douglas
     
    Douglas Livingstone, Jan 17, 2005
    #9
  10. trans.  (T. Onoma)

    Assaph Mehr Guest

    Thanks. I _see_ now why mine wasn't working, though I don't


    It has to do with the pattern matching being greedy, not the /x flag.
    Your pattern will match a '[' then as many characters as possible -
    including ']' - until a final closing ']'.
    There are two solutions:
    1. As shown, match any non ']'.
    2. Make the match non greedy: %r{ \[(.+?)\] }x

    HTH,
    Assaph
    ps. If you want all occurrences in the string, use String#scan instead
    of String#match.
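
    A small sketch of the difference, using a line with two bracket pairs so
    the greedy behaviour actually shows up. The constant names are only for
    illustration.

```ruby
LINE = "[this] [is a test]"

# Greedy: .+ runs to the end of the line, then backs up to the LAST "]".
GREEDY = LINE[/\[(.+)\]/, 1]

# Non-greedy: .+? stops at the first "]" it can reach.
LAZY = LINE[/\[(.+?)\]/, 1]

# Negated class: [^\]]* can never step over a "]" in the first place.
NEGATED = LINE[/\[([^\]]*)\]/, 1]

p GREEDY    # "this] [is a test"
p LAZY      # "this"
p NEGATED   # "this"

# String#scan returns every match; with one capture group each element
# is a one-item array, hence the flatten.
p LINE.scan(/\[(.+?)\]/).flatten   # ["this", "is a test"]
```

    Note that the non-greedy .+? needs at least one character, so it will not
    match an empty "[]", while [^\]]* will.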
     
    Assaph Mehr, Jan 17, 2005
    #10
  11. trans.  (T. Onoma)

    Mark Hubbart Guest



    Or:

    s =~ /\[.*?\]/

    which uses the ? non-greedy modifier to ensure that only the very next
    "]" is matched. For example:


    str = <<EOT
    [this] [is a test]
    here are[some]brackets
    [brackets ]
    [] no words
    no brackets
    EOT
    ==>"[this] [is a test]\nhere are[some]brackets\n[brackets ]\n[] no
    words\nno brackets\n"

    str.each_line{|line| p line.scan(/\[.*?\]/)}
    ["[this]", "[is a test]"]
    ["[some]"]
    ["[brackets ]"]
    ["[]"]
    []

    cheers,
    Mark
     
    Mark Hubbart, Jan 17, 2005
    #11
  12. On Monday 17 January 2005 04:51 pm, Assaph Mehr wrote:
    | > Thanks. I _see_ now why mine wasn't working, though I don't
    | > _understand_ why it wasn't working. I was using the / /x extension,
    | > because I generally like to space the parts of my regexps out to
    | > read easier, but for some reason that causes the above to
    | > match instead. Oh well, I just won't do that.
    |
    | It has to do with the pattern matching being greedy, not the /x flag.
    | Your pattern will match a '[' then as many characters as possible -
    | including ']' - until a final closing ']'.
    | There are two solutions:
    | 1. As shown, match any non ']'.
    | 2. Make the match non greedy: %r{ \[(.+?)\] }x
    |
    | HTH,
    | Assaph
    | ps. If you want all occurrences in the string, use String#scan instead
    | of String#match.

    Thanks Assaph,

    I had an escape character match in the regexp:

    / [^`] \[(.+?)\] /x

    That was messing it up (Don't really know why) but I just "zeroed" it:

    / (?=[^`]) \[(.+?)\] /x

    And that did the trick.

    Just one of those things where you overlook what you think you know to
    the point of seizure ;)

    T.
     
    trans. (T. Onoma), Jan 17, 2005
    #12
  13. trans.  (T. Onoma)

    Assaph Mehr Guest

    I had an escape character match in the regexp:
    That's because [^`] will match 'a single character that is not `'.
    When you did the zero-width lookahead, you made it into 'possibly a
    character, so long as it's not ` '.

    Hope this makes sense :)
     
    Assaph Mehr, Jan 17, 2005
    #13
  14. trans.  (T. Onoma)

    Mark Hubbart Guest

    I may be reading this wrong, but I think that with the zero-width
    lookahead, it is now ensuring that the first character of the match is
    not a backtick. Which, since it's always going to be a square bracket,
    makes the lookahead superfluous.

    If you need escaping, try:
    /(?: # escape sequence match
    ^ | [^`] # alternate: match either "start of line" or a non-backtick.
    )
    ( # non-greedy [foo] match
    \[.*?\]
    )/x

    ... then use $1. This one won't match any paired square brackets
    immediately preceded by a backtick.
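
    To make the difference concrete, here is a rough comparison of the
    lookahead version, the alternation above, and a negative lookbehind. The
    lookbehind is my own addition rather than something proposed in this
    thread, and it assumes a Ruby whose regex engine supports (?<!...), which
    1.9 and later do. The constant names are illustrative only.

```ruby
TEXT = "see [one] and `[two]"

# Lookahead: (?=[^`]) only checks that the NEXT character is not a
# backtick. The next character is always "[", so nothing is filtered.
LOOKAHEAD = / (?=[^`]) \[(.+?)\] /x

# Alternation: consume start-of-line or one non-backtick character
# before the bracket pair.
CONSUMING = / (?: ^ | [^`] ) (\[.*?\]) /x

# Assumption, not from the thread: a negative lookbehind inspects the
# PREVIOUS character without consuming it.
LOOKBEHIND = / (?<!`) \[(.*?)\] /x

p TEXT.scan(LOOKAHEAD).flatten    # ["one", "two"]
p TEXT.scan(CONSUMING).flatten    # ["[one]"]
p TEXT.scan(LOOKBEHIND).flatten   # ["one"]

# Because the alternation consumes a character, directly adjacent pairs
# are skipped by scan; the lookbehind does not have that problem.
p "[a][b]".scan(CONSUMING).flatten    # ["[a]"]
p "[a][b]".scan(LOOKBEHIND).flatten   # ["a", "b"]
```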

    cheers,
    Mark
     
    Mark Hubbart, Jan 17, 2005
    #14
  15. trans.  (T. Onoma)

    John Carter Guest

    Given your other posts in this forum I cannot believe that you are stupid.

    So here are some meta-hints on how to "suck less" at Regexes...

    Always use the %r{}x form of regexes.

    This neatly avoids the leaning toothpick syndrome when\/matching\/paths

    The x modifier allows you to use white space and even comments within the
    regex to make it readable. (Larry Wall of perl fame regrets he didn't make
    it the default...)

    My .emacs has a key-binding that will produce "=~ %r{ }x" and leave the
    cursor in the middle.
    (global-set-key [(control %)]
    `(lambda ()
    (interactive)
    (insert "=~ %r{ }x")
    (backward-char 4)
    ))

    Pull the development of the regex outside the development of your app.
    Unit tests are good for that, or even if you just make a wee small script
    or do it on the command line or in irb.
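
    As a rough idea of what such a "wee small script" could look like, here is
    a sketch that checks one pattern against a table of inputs. BRACKET, CASES
    and failures are names invented for this example, and the cases are
    borrowed from elsewhere in the thread.

```ruby
# Hypothetical pattern under development, kept apart from the application.
BRACKET = %r{ \[ ( [^\]]* ) \] }x

# Each pair is input text => expected contents of the first bracket pair.
CASES = {
  "[Hello]"             => "Hello",
  "This is[b.] a test." => "b.",
  "[ fix ] and [other]" => " fix ",
  "no brackets"         => nil
}

# Returns the cases that failed as [text, expected, actual] triples.
def failures(pattern, cases)
  cases.map { |text, expected| [text, expected, text[pattern, 1]] }
       .reject { |_, expected, actual| expected == actual }
end

failed = failures(BRACKET, CASES)
puts failed.empty? ? "all cases pass" : failed.inspect
```

    Swapping in a different pattern and rerunning shows straight away which
    inputs stopped matching, which fits the "grow it slowly" advice.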

    If you are doing it on the command line beware of nasty interactions
    between the string and quoting conventions of the shell and ruby.

    (Speaking Unix now...)
    eg. ruby -e "blah" is A Very Bad Idea. The shell will peek inside the
    "blah" and do things that you really definitely don't want
    happening in a regex. Solution: use single quotes, bash never looks
    inside them. Downside: it means you must _never_ use single quotes in the
    ruby fragment blah, but you can use double quotes.

    ruby -e 'blah'

    Grow the regex slowly. Start with the smallest thing, make it match.

    If you immediately write down a large regex, odds on it will match
    nothing.

    Sheer murderous frustration lies that way.

    Start small, or strip away stuff on the right hand side of the regex until
    you match something. Then slowly start adding it back.

    File.read(fileName) is cute. It allows you to pull the whole file in at
    once as one string and then you can match across lines.
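
    A quick sketch of that, using a throwaway temp file so it runs on its own.
    The helper name and the file contents are made up for the example. One
    thing to keep in mind: a negated class like [^\]] happily matches
    newlines, whereas "." only does so with the /m flag.

```ruby
require "tempfile"

# Hypothetical helper: reads the whole file into one string and returns
# the contents of the first [...] pair, which may span several lines.
def first_bracket_in_file(path)
  File.read(path)[/\[([^\]]*)\]/, 1]
end

Tempfile.create("brackets") do |file|
  file.write("intro [spans\ntwo lines] outro\n")
  file.flush
  p first_bracket_in_file(file.path)   # "spans\ntwo lines"
end
```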

    Be aware that since standards are such good things, everyone has their
    own. I.e. POSIX (grep) regexes are different to Emacs regexes which are
    different to Ruby regexes. grep even provides two different regex
    languages! Ruby and perl regexes are very similar.
    It isn't. Really. Do what I suggest and you will slowly find regexes are
    really a very fun and powerful way of doing things.



    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email :
    New Zealand

    "The notes I handle no better than many pianists. But the pauses
    between the notes -
    ah, that is where the art resides!' - Artur Schnabel
     
    John Carter, Jan 17, 2005
    #15
  16. On Monday 17 January 2005 06:33 pm, John Carter wrote:
    | Grow the regex slowly. Start with the smallest thing, make it match.
    |
    | If you immediately write down a large regex, odds on it will match
    | nothing.

    Ah this is my major problem. I tend to write whole chunks of code at once and
    then go back and tweak to perfection. Not always the best way to go. And
    regexp is a perfect example of when not to do this.

    Thanks. That lesson will surely help a great deal.

    T.
     
    trans. (T. Onoma), Jan 18, 2005
    #16
  17. Hi,

    On Tuesday, 18 Jan 2005, 06:26:35 +0900, Douglas Livingstone wrote:
    What are the square brackets for? As far as I can see, /\[\w+\]/
    works, too.

    Bertram
     
    Bertram Scharpf, Jan 18, 2005
    #17
  18. trans.  (T. Onoma)

    Zach Dennis Guest

    In a regular expression, square brackets represent a character class. A
    character class looks for one character matching any of the characters
    that make up that character class. Say you are looking for the words
    "fix" or "fox" in a sentence.

    You could write:

    /f(i|o)x/

    or you could write:

    /f[io]x/

    You can also negate a character class, and match anything that is NOT in
    the character class. You do this by starting your character class with a
    caret ^

    Say you wanted to find anything f-x, but not "fox"

    /f[^o]x/

    this will find "fix", "fex", "fux", "fgx", etc.. but not "fox".
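
    If you want to try those two in irb, something along these lines should
    do. The word list is just made up for the example.

```ruby
WORDS = %w[fix fox fex fgx fax]

# [io] matches exactly one character that is either "i" or "o".
p WORDS.grep(/f[io]x/)    # ["fix", "fox"]

# [^o] matches exactly one character that is anything except "o".
p WORDS.grep(/f[^o]x/)    # ["fix", "fex", "fgx", "fax"]

# A negated class still has to consume a character, so "fx" fails.
p("fx" =~ /f[^o]x/)       # nil
```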

    In the regular expression: /\[[\w]+\]/

    \[ = you are looking for a literal left square bracket
    [\w]+ = you are looking for a character class with any word character
    one or more times
    \] = you are looking for a closing right square bracket

    This will find the "fix" in the sentence "This is a [fix]", but this
    regular expression will fail if you do "This is a [ fix ]", because the
    spaces before the "f" and after the "x" are not considered word
    characters. A better regular expression is (sorry Doug, I'm taking it
    back, I like mine better now):

    /\[([^\]]*)\]/

    which will match anything inside of square brackets. This will match:

    "This is a [fix]" $1 will equal "fix"
    "This is a [ fix ]" $1 will equal " fix "
    "This is a [ *sentence inside of a fix* ]" $1 will equal " *sentence
    inside of a fix* "
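
    Side by side, as a sketch (STRICT and LOOSE are just labels I picked for
    the two patterns):

```ruby
STRICT = /\[(\w+)\]/       # word characters only
LOOSE  = /\[([^\]]*)\]/    # anything up to the next "]"

["This is a [fix]", "This is a [ fix ]"].each do |text|
  # String#[] with a capture index returns group 1, or nil on no match.
  p [text[STRICT, 1], text[LOOSE, 1]]
end
# ["fix", "fix"]
# [nil, " fix "]
```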

    I hope this was helpful.

    Zach
     
    Zach Dennis, Jan 18, 2005
    #18
  19. Shortcuts like \w define character classes, so the brackets are not
    needed, as the other poster hinted at. ;)

    \w+ and [\w]+ are identical

    You can put them in classes if you want, mainly to add to them:

    [\w']+ match word and ' characters
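
    A quick check of all three, as a sketch with a made-up phrase:

```ruby
PHRASE = "don't stop"

# \w+ and [\w]+ behave the same: the apostrophe splits the word.
p PHRASE.scan(/\w+/)      # ["don", "t", "stop"]
p PHRASE.scan(/[\w]+/)    # ["don", "t", "stop"]

# Adding ' to the class keeps contractions in one piece.
p PHRASE.scan(/[\w']+/)   # ["don't", "stop"]
```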

    Hope that helps.

    James Edward Gray II
     
    James Edward Gray II, Jan 18, 2005
    #19
  20. trans.  (T. Onoma)

    Zach Dennis Guest

    Almost forgot to hit up your question...

    /\[[\w]+\]/

    and

    /\[\w+\]/

    are basically the same since \w covers a whole character class of word
    characters.

    Zach
     
    Zach Dennis, Jan 18, 2005
    #20