regular expression match and exclude

Discussion in 'Ruby' started by Azalar ---, Sep 7, 2008.

  1. Azalar ---

    Azalar --- Guest

    I am parsing a web page full of image links that also contain links to
    the thumbnails for those images.

    Here is my test data..

    galleries/image1.jpg
    galleries/image1thumb.jpg
    galleries/image2.jpg
    galleries/image2thumb.jpg
    galleries/image3.jpg
    galleries/image3thumb.jpg

    If I use this expression:
    /galleries.*(?=thumb).*jpg/

    the results are just those lines containing the word thumb.

    What I want to do is invert this and return only the lines that
    DON'T contain the word thumb.
    --
    Posted via http://www.ruby-forum.com/.
     
    Azalar ---, Sep 7, 2008
    #1

  2. On Sun, Sep 7, 2008 at 3:09 PM, Azalar --- <> wrote:
    > I am parsing a web page full of image links that also contain links to
    > the thumbnails for those images.
    >
    > Here is my test data..
    >
    > galleries/image1.jpg
    > galleries/image1thumb.jpg
    > galleries/image2.jpg
    > galleries/image2thumb.jpg
    > galleries/image3.jpg
    > galleries/image3thumb.jpg
    >
    > If i use this expression..
    > /galleries.*(?=thumb).*jpg/
    >
    > The results are just those lines containing the word thumb.
    >
    > What I want to do is inverse this though and return only return that
    > DONT contain the word thumb.


    If you have all the links in an array, I would do the opposite: match
    the ones that contain thumb and reject those from the array:

    irb(main):001:0> links = %w{galleries/image1.jpg
    galleries/image1thumb.jpg galleries/image2.jpg
    galleries/image2thumb.jpg galleries/image3.jpg
    galleries/image3thumb.jpg}
    => ["galleries/image1.jpg", "galleries/image1thumb.jpg",
    "galleries/image2.jpg", "galleries/image2thumb.jpg",
    "galleries/image3.jpg", "galleries/image3thumb.jpg"]

    irb(main):007:0> links.reject {|l| l =~ /thumb/}
    => ["galleries/image1.jpg", "galleries/image2.jpg", "galleries/image3.jpg"]

    You can be as specific as you need with the regexp:

    irb(main):008:0> links.reject {|l| l =~ /galleries\/.*thumb.*jpg/}
    => ["galleries/image1.jpg", "galleries/image2.jpg", "galleries/image3.jpg"]
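
A minimal sketch of the same filter for newer Rubies, assuming Ruby 2.3 or later (where `Enumerable#grep_v` exists; it was not available when this thread was written). The `links` array is the test data from the original post.

```ruby
links = %w[
  galleries/image1.jpg galleries/image1thumb.jpg
  galleries/image2.jpg galleries/image2thumb.jpg
  galleries/image3.jpg galleries/image3thumb.jpg
]

# grep_v keeps the elements that do NOT match the pattern,
# which gives the same result as reject { |l| l =~ /thumb/ }.
full_size = links.grep_v(/thumb/)

# partition splits the array in a single pass, in case the
# thumbnails are needed as well as the full-size images.
thumbs, images = links.partition { |l| l =~ /thumb\.jpg\z/ }

p full_size
p thumbs
```

`partition` returns the matching elements first, so `thumbs` holds the thumbnail links and `images` the rest.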


    Hope this helps,

    Jesus.
     
    Jesús Gabriel y Galán, Sep 7, 2008
    #2

  3. Azalar ---

    Azalar --- Guest

    I should mention that image isn't a fixed word in this case I just used
    that as an example.
    It represents whatever the name of the image is

    So the test data could be..

    galleries/blah.jpg
    galleries/blahthumb.jpg
    galleries/landscape.jpg
    galleries/landscapethumb.jpg
    galleries/foo.jpg
    galleries/foothumb.jpg


    Robert Klemme wrote:
    > 2008/9/7 Azalar --- <>:
    >>
    >> The array is created when i use the scan method so if i used reject it
    >> would become a two pass process which is what i am doing anyway so I was
    >> wondering if regular expressions had built in support for this.

    >
    > Will be hard. It's easier if you do
    >
    > %r{galleries/image\d+\.jpg\z}
    >
    > Kind regards
    >
    > robert


     
    Azalar ---, Sep 7, 2008
    #3
  4. Axel Etzold

    Axel Etzold Guest

    -------- Original Message --------
    > Date: Mon, 8 Sep 2008 02:46:12 +0900
    > From: Azalar --- <>
    > To:
    > Subject: Re: regular expression match and exclude


    >
    > I should mention that image isn't a fixed word in this case I just used
    > that as an example.
    > It represents whatever the name of the image is
    >
    > So the test data could be..
    >
    > galleries/blah.jpg
    > galleries/blahthumb.jpg
    > galleries/landscape.jpg
    > galleries/landscapethumb.jpg
    > galleries/foo.jpg
    > galleries/foothumb.jpg
    >
    >
    > Robert Klemme wrote:
    > > 2008/9/7 Azalar --- <>:
    > >>
    > >> The array is created when i use the scan method so if i used reject it
    > >> would become a two pass process which is what i am doing anyway so I was
    > >> wondering if regular expressions had built in support for this.

    > >
    > > Will be hard. It's easier if you do
    > >
    > > %r{galleries/image\d+\.jpg\z}
    > >
    > > Kind regards
    > >
    > > robert



    Dear Azalar,

    To search for a text pattern that is not followed by something else, you need
    regular expressions with negative lookahead:

    http://www.regular-expressions.info/lookaround.html

    As far as I know, this is not supported by Ruby 1.8.x regexps, but there is a
    separate library, Oniguruma, with Ruby bindings available for this.

    http://oniguruma.rubyforge.org/

    Best regards,

    Axel
     
    Axel Etzold, Sep 7, 2008
    #4
  5. 2008/9/7 Axel Etzold <>:

    > to achieve searching for some text pattern that is not followed by something else, you need
    > regular expressions with negative lookahead
    >
    > http://www.regular-expressions.info/lookaround.html
    >
    > As far as I know, this is not supported by Ruby's 1.8.x regexps,


    Negative lookahead *is* supported by that (and earlier) version.

    The problem is that, with the test data given, the negative lookahead
    is difficult to get right:

    irb(main):016:0> %w{foo.jpg foothumb.jpg}.each do |s|
    irb(main):017:1* p [s, /\A\w+(?!thumb)\.jpg\z/ =~ s]
    irb(main):018:1> end
    ["foo.jpg", 0]
    ["foothumb.jpg", 0]
    => ["foo.jpg", "foothumb.jpg"]

    I am not saying that it won't work but off the top of my head I do not
    have a solution that works. In any case, it's easier to exclude
    matches outside the regular expression engine. One way would be to
    make thumb a capturing group and check whether the group is present,
    like

    irb(main):022:0> %w{foo.jpg foothumb.jpg}.each do |s|
    irb(main):023:1* p [s, /\A\w+?(thumb)?\.jpg\z/ =~ s, $1]
    irb(main):024:1> end
    ["foo.jpg", 0, nil]
    ["foothumb.jpg", 0, "thumb"]
    => ["foo.jpg", "foothumb.jpg"]

    and use that criterion for exclusion.
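
A minimal sketch of that exclusion step, assuming the file names are already in an array. The `names` array and the `images` variable are made up for illustration; the pattern is the one from the irb session above.

```ruby
names = %w[foo.jpg foothumb.jpg landscape.jpg landscapethumb.jpg]

# Optional capture group: group 1 is "thumb" for thumbnails and nil otherwise.
pattern = /\A\w+?(thumb)?\.jpg\z/

images = names.select do |s|
  m = pattern.match(s)
  # Keep the name only if it matched and the thumb group did not participate.
  m && m[1].nil?
end

p images
```

Using the `MatchData` object instead of `$1` keeps the check local to the block, which is a matter of taste rather than a requirement.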

    Now you can meditate on why the negative lookahead did not work. :)
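
For anyone who would rather skip the meditation, a sketch of what is going on: `\w+` is greedy and swallows the whole name including "thumb", so the lookahead is only tested at the position just before ".jpg", where the next characters are never "thumb". Looking backwards from that position does work, but note the assumption: lookbehind needs Ruby 1.9 or later, the 1.8 engine does not have it. The variable names below are illustrative only.

```ruby
names = %w[foo.jpg foothumb.jpg]

# The lookahead runs after \w+ has consumed the name. At that point the
# upcoming text is ".jpg", so (?!thumb) succeeds for both strings.
lookahead = /\A\w+(?!thumb)\.jpg\z/
lookahead_hits = names.map { |s| !!(lookahead =~ s) }

# A negative lookbehind checks the five characters BEFORE ".jpg" instead.
# Assumption: Ruby 1.9+ (Oniguruma/Onigmo engine).
lookbehind = /\A\w+(?<!thumb)\.jpg\z/
images = names.select { |s| s =~ lookbehind }

p lookahead_hits
p images
```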

    Kind regards

    robert

    --
    use.inject do |as, often| as.you_can - without end
     
    Robert Klemme, Sep 7, 2008
    #5
  6. From: Azalar --- <>
    # The array is created when i use the scan method so if i used
    # reject it
    # would become a two pass process which is what i am doing
    # anyway so I was
    # wondering if regular expressions had built in support for this.

    well if it's thumb for thumbs sake, we can be stubborn about it :)

    irb(main):036:0> re2
    => /galleries.*?([^t][^h][^u][^m][^b]).jpg/
    irb(main):038:0> g.select{|x| x=~re2 }
    => ["galleries/image1.jpg", "galleries/image2.jpg", "galleries/image3.jpg"]

    kind regards -botp
     
    Peña, Botp, Sep 8, 2008
    #6
  7. 2008/9/8 Caleb Clausen <>:
    > On 9/7/08, Peña, Botp <> wrote:
    >> irb(main):036:0> re2
    >> => /galleries.*?([^t][^h][^u][^m][^b]).jpg/
    >> irb(main):038:0> g.select{|x| x=~re2 }
    >> => ["galleries/image1.jpg", "galleries/image2.jpg", "galleries/image3.jpg"]
    >
    > Unfortunately this way will fail to match (for instance) "bomb.jpg".
    > If you want to go this route, you need to do something like this:
    > (untested!)
    >
    > /^galleries.*([^t]humb|[^h]umb|[^u]mb|[^m]b|[^b])\.jpg$/ #bleah
    >
    > Patrick He wrote:
    >> /^galleries\/((?!thumb).)+\.jpg$/

    >
    > Ah, brilliant! But it doesn't quite work. It fails to match
    > "thumb_foo.jpg", which probably should match here. A simple
    > modification should fix it, tho:
    >
    > /^galleries\/((?!thumb\.jpg$).)+\.jpg$/
    >
    > There. That's a pretty nice pure regexp.


    Now just make the group non capturing and you're a tad more efficient. :)

    %r{\Agalleries/(?:(?!thumb\.jpg\z).)+\.jpg\z}
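
A rough comparison of the character-class pattern and the lookahead pattern above, run against some made-up names (including the "bomb.jpg" and "thumb_foo.jpg" cases mentioned in the quotes). The last part is an assumption about the original use case: a hypothetical HTML fragment and an adapted pattern without `\A`/`\z`, so that `String#scan` can pull the links out of a page in one pass.

```ruby
links = %w[
  galleries/blah.jpg galleries/blahthumb.jpg
  galleries/bomb.jpg galleries/bombthumb.jpg
  galleries/thumb_foo.jpg galleries/thumb_foothumb.jpg
]

# Character-class attempt: the five characters before ".jpg" must each
# differ from the corresponding letter of "thumb".
char_class = /galleries.*?([^t][^h][^u][^m][^b]).jpg/

# Lookahead version: no position in the name may be the start of a
# trailing "thumb.jpg".
tempered = %r{\Agalleries/(?:(?!thumb\.jpg\z).)+\.jpg\z}

by_char_class = links.select { |l| l =~ char_class }
by_tempered   = links.select { |l| l =~ tempered }

# Hypothetical page fragment. The pattern is adapted for scanning: no
# string anchors, and the name stops at a quote or whitespace.
html = '<a href="galleries/foo.jpg"><img src="galleries/foothumb.jpg"></a>'
in_page = %r{galleries/(?:(?!thumb\.jpg)[^"\s])+\.jpg}
found = html.scan(in_page)

p by_char_class
p by_tempered
p found
```

The character-class version drops "galleries/bomb.jpg" because "m" and "b" sit in the same positions as in "thumb", while the lookahead version keeps it.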

    Cheers

    robert


    --
    use.inject do |as, often| as.you_can - without end
     
    Robert Klemme, Sep 8, 2008
    #7