how greedy is nongreedy in regexp ?

Discussion in 'Perl Misc' started by peter pilsl, Nov 8, 2004.

  1. peter pilsl

    peter pilsl Guest

    the following substitution does not do what I want. So I ask myself where
    the knot in my brain is this time.

    I want to clean up urls and make a/b/c/../e/f => a/b/e/f

    $a="a/b/c/../e/f";
    $a=~s,/(.*?)/\.\./,/,;
    print $1,"\n",$a,"\n"

    which gives:

    b/c
    a/e/f


    That's still too greedy for me ;) my desired match would be "c".
    Is there a way to achieve this? Maybe some look-behind hack? I'm not
    familiar with this.

    My alternative approach would be to simply split the whole stuff and
    iterate through it. This approach, however, seems to be more
    cost-intensive than I'd like it to be.


    thnx,
    peter


    --
    http://www2.goldfisch.at/know_list
     
    peter pilsl, Nov 8, 2004
    #1
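A minimal sketch of the split-and-iterate alternative mentioned in post #1, assuming plain slash-separated paths. The helper name `collapse_dotdot` is hypothetical and does not come from the thread. It walks the segments once and keeps a stack, so the cost is linear in the number of segments.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse "segment/.." pairs
# by walking the path segments and keeping a stack of those seen so far.
sub collapse_dotdot {
    my ($path) = @_;
    my @out;
    # A limit of -1 keeps trailing empty fields, so a trailing "/" survives.
    for my $seg (split m{/}, $path, -1) {
        if ($seg eq '..' && @out && $out[-1] ne '..' && $out[-1] ne '') {
            pop @out;          # ".." cancels the previous real segment
        }
        else {
            push @out, $seg;   # ordinary segment, or a ".." with nothing to cancel
        }
    }
    return join '/', @out;
}

print collapse_dotdot('a/b/c/../e/f'), "\n";   # prints a/b/e/f
```

A leading ".." that has nothing to cancel is left in place rather than dropped, which is a design choice of this sketch, not something the thread settles.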

  2. Anno Siegel

    Anno Siegel Guest

    peter pilsl <> wrote in comp.lang.perl.misc:
    >
    >
    > the following substitution does not what I want. So I ask myself where
    > the knot in my brain is this time.
    >
    > I want to clean up urls and make a/b/c/../e/f => a/b/e/f


    The standard module File::Spec has canonpath() to do this. Since
    canonpath() does a logical cleanup without looking at a physical
    file system, it should work for URLs.

    Anno
     
    Anno Siegel, Nov 8, 2004
    #2

  3. Anno Siegel <-berlin.de> wrote:
    > peter pilsl <> wrote in comp.lang.perl.misc:
    >>
    >>
    >> the following substitution does not what I want. So I ask myself where
    >> the knot in my brain is this time.
    >>
    >> I want to clean up urls and make a/b/c/../e/f => a/b/e/f


    > The standard module File::Spec has canonpath() to do this. Since
    > canonpath() does a logical cleanup without looking at a physical
    > file system, it should work for URLs.


    Note that this is not a valid cleanup in the presence of symbolic links.
    So when run on Unix (or Mac or OS2), File::Spec->canonpath() will not
    convert the string.

    $ perl -MFile::Spec -le 'print File::Spec->canonpath("a/b/c/../d/e");'
    a/b/c/../d/e

    Because Win32 filesystems don't support symlinks, you can explicitly run
    the win32 module to do this...

    $ perl -MFile::Spec::Win32 -le 'print File::Spec::Win32->canonpath("a/b/c/../d/e");'
    a\b\d\e

    Whether you can cope with the backslashes is another question....

    --
    Darren Dunham
    Senior Technical Consultant TAOS http://www.taos.com/
    Got some Dr Pepper? San Francisco, CA bay area
    < This line left intentionally blank to confuse you. >
     
    Darren Dunham, Nov 8, 2004
    #3
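One way to cope with the backslashes from post #3 is to translate them back afterwards. This is only a sketch: `canon_with_slashes` is a hypothetical name, and it assumes a bare path string like the one in the thread. A full URL with a scheme and host would likely be mangled, because canonpath() is a file-path tool.

```perl
use strict;
use warnings;
use File::Spec::Win32;

# Hypothetical helper: let the Win32 variant of canonpath() collapse the
# "..", then turn the backslashes it emits back into forward slashes.
sub canon_with_slashes {
    my ($path) = @_;
    my $clean = File::Spec::Win32->canonpath($path);
    $clean =~ tr{\\}{/};
    return $clean;
}

print canon_with_slashes('a/b/c/../d/e'), "\n";
```

Given the a\b\d\e output shown in post #3, this should print a/b/d/e, though the exact result depends on the installed File::Spec version.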
  4. Anno Siegel wrote:
    > peter pilsl <> wrote in comp.lang.perl.misc:
    >
    >>
    >>the following substitution does not what I want. So I ask myself where
    >>the knot in my brain is this time.
    >>
    >>I want to clean up urls and make a/b/c/../e/f => a/b/e/f

    >
    >
    > The standard module File::Spec has canonpath() to do this.


    But since this is a question about canonicalising URLs it would seem
    more appropriate to use the URI module. However the canonical() method
    of URI doesn't do this. Is this right or should it be considered a bug
    in URI?
     
    Brian McCauley, Nov 10, 2004
    #4
  5. peter pilsl wrote:

    > Subject: Re: how greedy is nongreedy in regexp ?


    [ snip ]

    Others have answered the question in the message body, but to answer the
    question in the subject line...

    Non-greediness is local. Non-greediness does _not_ trump finding the
    leftmost match. When there are two (non-)greedy subexpressions, the first
    one's (non-)greediness takes priority.
     
    Brian McCauley, Nov 10, 2004
    #5
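To illustrate the point in post #5 with the path from post #1: the match has to start at the first "/", so the lazy `.*?` is forced to stretch across "b/c" to reach "/../". A hedged sketch of one possible fix follows, using a character class that cannot cross a slash. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $path = 'a/b/c/../e/f';

# Original pattern: the leftmost starting point wins, so .*? expands
# from the first "/" until "/../" can match, capturing "b/c".
(my $lazy = $path) =~ s{/(.*?)/\.\./}{/};
print "$lazy\n";    # a/e/f

# [^/]+ cannot cross a slash, so the attempt at the first "/" fails and
# the match moves right; only "c" and the ".." after it are removed.
(my $fixed = $path) =~ s{/[^/]+/\.\./}{/};
print "$fixed\n";   # a/b/e/f

# Repeating the substitution until it stops matching handles stacked "..".
my $nested = 'a/b/c/../../d';
1 while $nested =~ s{/[^/]+/\.\./}{/};
print "$nested\n";  # a/d
```

This sketch does not handle every case: the pattern needs a "/" before the segment, so a pair at the very start of the string is left alone, and `[^/]+` will also accept ".." itself as the segment to remove.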
  6. Stuart Moore

    Stuart Moore Guest

    Brian McCauley wrote:
    >
    > But since this is a question about canonicalising URLs it would seem
    > more appropriate to use the URI module. However the canonical() method
    > of URI doesn't do this. Is this right or should it be considered a bug
    > in URI?


    Is www.foo.com/a/b/../c/d always the same as www.foo.com/a/c/d ?

    I'd guess on most webservers it is, but surely you can't guarantee it in
    general. Suppose the dots were instead input to a script?

    Stuart
     
    Stuart Moore, Nov 11, 2004
    #6
  7. URI canonicalization (was: how greedy is nongreedy in regexp ?)

    Stuart Moore wrote:

    > Brian McCauley wrote:
    >
    >>
    >> But since this is a question about canonicalising URLs it would seem
    >> more appropriate to use the URI module. However the canonical()
    >> method of URI doesn't do this. Is this right or should it be
    >> considered a bug in URI?

    >
    >
    > Is www.foo.com/a/b/../c/d always the same as www.foo.com/a/c/d ?
    >
    > I'd guess on most webservers it is, but surely you can't guarantee it in
    > general.


    I think you can guarantee it on standards-conforming web servers. At
    least that's the implication of the W3C standards for URIs.

    > Suppose the dots were instead input to a script?


    Then at least one of them must be encoded.

    So I think it's fairly clear from the standards that

    http://foo/a/b/c/../../d/./e/f

    Should canonicalize to

    http://foo/a/d/e/f

    It also follows that both of...

    http://foo/a/b/../c
    http://foo/a/b/../c

    ...should canonicalise to the same thing and it shouldn't be..

    http://foo/a/b/../c

    ...because that's a non-canonical form of..

    http://foo/a/c

    It is however unclear to me what the correct canonical form would be. I
    think it should probably be

    http://foo/a/b/../c
     
    Brian McCauley, Nov 11, 2004
    #7
  8. Joe Schaefer

    Joe Schaefer Guest

    Brian McCauley <> writes:

    > But since this is a question about canonicalising URLs it would seem more
    > appropriate to use the URI module. However the canonical() method of URI
    > doesn't do this. Is this right or should it be considered a bug in URI?


    I think it mostly depends on whether or not the URI is relative
    or absolute. According to rfc 2396 Section 4:

    URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]

    The syntax for relative URI is a shortened form of that for absolute
    URI, where some prefix of the URI is missing and certain path
    components ("." and "..") have a special meaning when, and only when,
    interpreting a relative path. The relative URI syntax is defined in
    Section 5.

    Section 5.2 step 6 talks about the semantics of "../", but that's only
    in the narrow context of resolving a relative uri into an absolute one.
    AFAICT 2396 discusses "canonicalization" of an absolute URI only as it
    relates to %XX escapes.

    --
    Joe Schaefer
     
    Joe Schaefer, Nov 11, 2004
    #8
  9. Ben Morrow

    Ben Morrow Guest

    Quoth Brian McCauley <>:
    >
    >
    > Anno Siegel wrote:
    > > peter pilsl <> wrote in comp.lang.perl.misc:
    > >
    > >>
    > >>the following substitution does not what I want. So I ask myself where
    > >>the knot in my brain is this time.
    > >>
    > >>I want to clean up urls and make a/b/c/../e/f => a/b/e/f

    > >
    > >
    > > The standard module File::Spec has canonpath() to do this.

    >
    > But since this is a question about canonicalising URLs it would seem
    > more appropriate to use the URI module. However the canonical() method
    > of URI doesn't do this. Is this right or should it be considered a bug
    > in URI?


    This is right. The only time .. and . are significant in a URI is when
    they are on the left-hand end of a relative URI which is being made
    absolute, when they are specified to mean what you'd expect.

    I would consider this a bug in the URI spec, myself; as it means that,
    say,

    http://foo/a/./b

    is a valid URI distinct from

    http://foo/a/b

    ; but neither can be made relative to

    http://foo/a

    without losing the distinction. Ach, well. :)

    Ben

    --
    I've seen things you people wouldn't believe: attack ships on fire off
    the shoulder of Orion; I watched C-beams glitter in the dark near the
    Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
    Time to die.
     
    Ben Morrow, Nov 11, 2004
    #9
  10. Ben Morrow wrote:
    > Quoth Brian McCauley <>:
    >
    >>
    >>Anno Siegel wrote:
    >>
    >>>peter pilsl <> wrote in comp.lang.perl.misc:
    >>>
    >>>
    >>>>the following substitution does not what I want. So I ask myself where
    >>>>the knot in my brain is this time.
    >>>>
    >>>>I want to clean up urls and make a/b/c/../e/f => a/b/e/f
    >>>
    >>>
    >>>The standard module File::Spec has canonpath() to do this.

    >>
    >>But since this is a question about canonicalising URLs it would seem
    >>more appropriate to use the URI module. However the canonical() method
    >>of URI doesn't do this. Is this right or should it be considered a bug
    >>in URI?

    >
    >
    > This is right. The only time .. and . are significant in a URI is when
    > they are on the left-hand end of a relative URI which is being made
    > absolute, when they are specified to mean what you'd expect.
    >
    > I would consider this a bug in the URI spec, myself; as it means that,
    > say,
    >
    > http://foo/a/./b
    >
    > is a valid URI distinct from
    >
    > http://foo/a/b


    That's what the RFC says but I think I found another (later?) document
    on W3C that says that /./ or /../ are invalid in absolute URLs and must
    be written using %2E.
     
    Brian McCauley, Nov 13, 2004
    #10
  11. Brian McCauley wrote:

    > That's what the RFC says but I think I found another (later?) document
    > on W3C that says that /./ or /../ are invalid in absolute URLs and must
    > be written using %2E.


    Found it. It's still a draft but it's also widespread current practice.

    http://www.gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-07.html#path

    and

    http://www.gbiv.com/protocols/uri/r...-uri-rfc2396bis-07.html#relative-dot-segments

    "[...] removing the special "." and ".." complete path
    segments from referenced path. This is done [...] whether
    or not the path was relative, [...]"

    Discussion leading to this change can be found here:

    http://www.gbiv.com/protocols/uri/rev-2002/issues.html#033-dot-segments
     
    Brian McCauley, Nov 13, 2004
    #11
  12. Joe Schaefer wrote:

    > Brian McCauley <> writes:
    >
    >
    >>But since this is a question about canonicalising URLs it would seem more
    >>appropriate to use the URI module. However the canonical() method of URI
    >>doesn't do this. Is this right or should it be considered a bug in URI?

    >
    >
    > I think it mostly depends on whether or not the URI is relative
    > or absolute. According to rfc 2396


    Yes, I was getting ahead of myself. You are, of course, right about
    RFC2396; I was looking at the draft of its successor.

    I think it would be helpful if the URI module had an option to implement
    the new semantics since they are already widespread in other software.

    Maybe I'll submit a patch.
     
    Brian McCauley, Nov 13, 2004
    #12