how greedy is nongreedy in regexp ?

peter pilsl

The following substitution does not do what I want, so I ask myself where
the knot in my brain is this time.

I want to clean up urls and make a/b/c/../e/f => a/b/e/f

$a="a/b/c/../e/f";
$a=~s,/(.*?)/\.\./,/,;
print $1,"\n",$a,"\n"

which gives:

b/c
a/e/f


That's still too greedy for me ;) My desired match would be "c".
Is there a way to achieve this? Maybe some look-behind hack? I'm not
familiar with this.

My alternative approach would be to simply split the whole thing and
iterate through it. This approach, however, seems to be more
cost-intensive than I'd like it to be.


thnx,
peter
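
For reference, the split-and-iterate alternative mentioned above can be sketched as a single pass over the segments with a stack. This is only a minimal sketch: the name collapse_dotdot is hypothetical, and it assumes a plain "/"-separated path in which empty segments and a leading ".." should be left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse "dir/.." pairs
# by walking the segments once and keeping a stack of what survives.
sub collapse_dotdot {
    my ($path) = @_;
    my @stack;
    # The -1 limit keeps trailing empty fields, so "a/b/" keeps its slash.
    for my $seg (split m{/}, $path, -1) {
        if ($seg eq '..' && @stack && $stack[-1] ne '..' && $stack[-1] ne '') {
            pop @stack;          # ".." cancels the previous real segment
        } else {
            push @stack, $seg;   # ordinary segment, or a ".." with nothing to cancel
        }
    }
    return join '/', @stack;
}

print collapse_dotdot('a/b/c/../e/f'), "\n";   # prints a/b/e/f
```

Each segment is pushed and popped at most once, so the work grows linearly with the length of the path.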
 
Anno Siegel

peter pilsl said:
The following substitution does not do what I want, so I ask myself where
the knot in my brain is this time.

I want to clean up urls and make a/b/c/../e/f => a/b/e/f

The standard module File::Spec has canonpath() to do this. Since
canonpath() does a logical cleanup without looking at a physical
file system, it should work for URLs.

Anno
 
Darren Dunham

Anno Siegel said:
The standard module File::Spec has canonpath() to do this. Since
canonpath() does a logical cleanup without looking at a physical
file system, it should work for URLs.

Note that this is not a valid cleanup in the presence of symbolic links.
So when run on Unix (or Mac or OS2), File::Spec->canonpath() will not
convert the string.

$ perl -MFile::Spec -le 'print File::Spec->canonpath("a/b/c/../d/e");'
a/b/c/../d/e

Because Win32 filesystems don't support symlinks, you can explicitly run
the win32 module to do this...

$ perl -MFile::Spec::Win32 -le 'print File::Spec::Win32->canonpath("a/b/c/../d/e");'
a\b\d\e

Whether you can cope with the backslashes is another question....
 
Brian McCauley

Anno said:
The standard module File::Spec has canonpath() to do this.

But since this is a question about canonicalising URLs it would seem
more appropriate to use the URI module. However the canonical() method
of URI doesn't do this. Is this right or should it be considered a bug
in URI?
 
Brian McCauley

peter said:
Subject: Re: how greedy is nongreedy in regexp ?

[ snip ]

Others have answered the question in the message body, but to answer the
question in the subject line...

Non-greediness is local. Non-greediness does _not_ trump finding the
leftmost match. When there are two (non-)greedy subexpressions, the first
one's (non-)greediness takes priority.
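
Applied to the pattern from the original post, this can be illustrated with a small sketch. The variable names are made up for the example, and the second pattern is just one possible fix, not something proposed in the thread: it restricts the capture to non-slash characters so that it cannot span more than one path segment.

```perl
use strict;
use warnings;

my $url = 'a/b/c/../e/f';

# Original pattern: the match starts at the leftmost "/" from which any
# match is possible, so the lazy .*? still has to stretch across "b/c".
(my $lazy = $url) =~ s{/(.*?)/\.\./}{/};
my $lazy_capture = $1;

# Capturing only non-slash characters confines the match to one segment.
(my $segment = $url) =~ s{/([^/]+)/\.\./}{/};
my $segment_capture = $1;

print "$lazy_capture -> $lazy\n";          # prints b/c -> a/e/f
print "$segment_capture -> $segment\n";    # prints c -> a/b/e/f
```

This handles a single ".." only. A path with several of them would need the substitution repeated, and [^/]+ can itself match "..", so it is not a complete canonicaliser.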
 
Stuart Moore

Brian said:
But since this is a question about canonicalising URLs it would seem
more appropriate to use the URI module. However the canonical() method
of URI doesn't do this. Is this right or should it be considered a bug
in URI?

Is www.foo.com/a/b/../c/d always the same as www.foo.com/a/c/d ?

I'd guess on most webservers it is, but surely you can't guarantee it in
general. Suppose the dots were instead input to a script?

Stuart
 
Brian McCauley

Stuart said:
Is www.foo.com/a/b/../c/d always the same as www.foo.com/a/c/d ?

I'd guess on most webservers it is, but surely you can't guarantee it in
general.

I think you can guarantee it on standards-conforming web servers. At
least that's the implication of the W3C standards for URIs.

Stuart said:
Suppose the dots were instead input to a script?

Then at least one of them must be encoded.

So I think it's fairly clear from the standards that

http://foo/a/b/c/../../d/./e/f

Should canonicalize to

http://foo/a/d/e/f

It also follows that both of...

http://foo/a/b/../c
http://foo/a/b/../c

...should canonicalise to the same thing and it shouldn't be..

http://foo/a/b/../c

...because that's a non-canonical form of..

http://foo/a/c

It is however unclear to me what the correct canonical form would be. I
think it should probably be

http://foo/a/b/%2E%2E/c
 
Joe Schaefer

Brian McCauley said:
But since this is a question about canonicalising URLs it would seem more
appropriate to use the URI module. However the canonical() method of URI
doesn't do this. Is this right or should it be considered a bug in URI?

I think it mostly depends on whether or not the URI is relative
or absolute. According to rfc 2396 Section 4:

URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]

The syntax for relative URI is a shortened form of that for absolute
URI, where some prefix of the URI is missing and certain path
components ("." and "..") have a special meaning when, and only when,
interpreting a relative path. The relative URI syntax is defined in
Section 5.

Section 5.2 step 6 talks about the semantics of "../", but that's only
in the narrow context of resolving a relative uri into an absolute one.
AFAICT 2396 discusses "canonicalization" of an absolute URI only as it
relates to %XX escapes.
 
Ben Morrow

Brian McCauley said:
But since this is a question about canonicalising URLs it would seem
more appropriate to use the URI module. However the canonical() method
of URI doesn't do this. Is this right or should it be considered a bug
in URI?

This is right. The only time .. and . are significant in a URI is when
they are on the left-hand end of a relative URI which is being made
absolute, when they are specified to mean what you'd expect.

I would consider this a bug in the URI spec, myself; as it means that,
say,

http://foo/a/./b

is a valid URI distinct from

http://foo/a/b

; but neither can be made relative to

http://foo/a

without losing the distinction. Ach, well. :)

Ben
 
Brian McCauley

Ben said:
This is right. The only time .. and . are significant in a URI is when
they are on the left-hand end of a relative URI which is being made
absolute, when they are specified to mean what you'd expect.

I would consider this a bug in the URI spec, myself; as it means that,
say,

http://foo/a/./b

is a valid URI distinct from

http://foo/a/b

That's what the RFC says but I think I found another (later?) document
on W3C that says that /./ or /../ are invalid in absolute URLs and must
be written using %2E.
 
Brian McCauley

Brian said:
That's what the RFC says but I think I found another (later?) document
on W3C that says that /./ or /../ are invalid in absolute URLs and must
be written using %2E.

Found it. It's still a draft but it's also widespread current practice.

http://www.gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-07.html#path

and

http://www.gbiv.com/protocols/uri/r...-uri-rfc2396bis-07.html#relative-dot-segments

"[...] removing the special "." and ".." complete path
segments from referenced path. This is done [...] whether
or not the path was relative, [...]"

Discussion leading to this change can be found here:

http://www.gbiv.com/protocols/uri/rev-2002/issues.html#033-dot-segments
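
As a rough illustration of what removing dot segments regardless of relativity could look like, here is a sketch in plain Perl. It is modelled on the remove_dot_segments procedure as later published in RFC 3986, section 5.2.4, so the draft's exact wording may differ. The wrapper canonical_dots and its URI-splitting regex are assumptions made for the example (a simple scheme://authority/path?query#fragment shape), not part of the URI module.

```perl
use strict;
use warnings;

# Sketch modelled on remove_dot_segments (RFC 3986, section 5.2.4).
sub remove_dot_segments {
    my ($in) = @_;
    my $out = '';
    while (length $in) {
        if    ($in =~ s{^\.\.?/}{})          { }              # leading "../" or "./"
        elsif ($in =~ s{^/\.(?:/|\z)}{/})    { }              # "/./" or a final "/."
        elsif ($in =~ s{^/\.\.(?:/|\z)}{/})  {                # "/../" or a final "/.."
            $out =~ s{/?[^/]*\z}{};                           #   drop last output segment
        }
        elsif ($in =~ /^\.\.?\z/)            { $in = '' }     # bare "." or ".."
        else  { $in =~ s{^(/?[^/]*)}{}; $out .= $1 }          # move one segment to output
    }
    return $out;
}

# Hypothetical wrapper: split off the path, clean it, and reassemble.
sub canonical_dots {
    my ($uri) = @_;
    my ($prefix, $path, $rest) = $uri =~ m{^(\w+://[^/?#]*)([^?#]*)(.*)\z}s
        or return $uri;    # not in the assumed shape: leave it alone
    return $prefix . remove_dot_segments($path) . $rest;
}

print canonical_dots('http://foo/a/b/c/../../d/./e/f'), "\n";   # prints http://foo/a/d/e/f
```

Percent-encoded dots are deliberately left alone, so http://foo/a/b/%2E%2E/c passes through unchanged.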
 
Brian McCauley

Joe said:
I think it mostly depends on whether or not the URI is relative
or absolute. According to rfc 2396

Yes, I was getting ahead of myself. You are, of course, right about
RFC 2396; I was looking at the draft of its successor.

I think it would be helpful if the URI module had an option to implement
the new semantics since they are already widespread in other software.

Maybe I'll submit a patch.
 
