Need help with regular expression to parse URLs

Neil

Hello:

I am having trouble figuring out how to write a regular expression to
parse out parts of a URL.

For example, I am trying to parse the url
http://jammconsulting.com/jamm/page/test/*/*/*/*.html
into several substrings. The URL should begin with
http://jammconsulting.com/jamm/*/*/
and then have a group of parameters in the form */*
and then end with .html

So, for example, this url:
http://jammconsulting.com/jamm/page/products/Brand/Abc.html

Should give me Brand and Abc as parameters.

I wrote this regular expression:
^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?

It seems to be working fine for most urls, but it barfed on this one:
http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

The matcher gives me 1 group with this value: s/Backpacks

I don't understand how that could have happened. I was expecting to
get two groups:
Stuff/Bags-%26-Luggage
Bags-%26-Totes/Backpacks

Any ideas what went wrong?

Also, is there a way to tell the pattern to further parse the group
into Stuff and Bags-%26-Luggage separately, or should I do that with
another Pattern I apply to the group after I extract it from the main URL?

Thanks,
Neil
 
Knute Johnson

Neil said:
Hello:

I am having trouble figuring out how to write a regular expression to
parse out parts of a URL.

For example, I am trying to parse the url
http://jammconsulting.com/jamm/page/test/*/*/*/*.html
into several substrings. The URL should begin with
http://jammconsulting.com/jamm/*/*/
and then have a group of parameters in the form */*
and then end with .html

So, for example, this url:
http://jammconsulting.com/jamm/page/products/Brand/Abc.html

Should give me Brand and Abc as parameters.

I wrote this regular expression:
^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?

It seems to be working fine for most urls, but it barfed on this one:
http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

The matcher gives me 1 group with this value: s/Backpacks

I don't understand how that could have happened. I was expecting to
get two groups:
Stuff/Bags-%26-Luggage
Bags-%26-Totes/Backpacks

Any ideas what went wrong?

Also, is there a way to tell the pattern to further parse the group
into Stuff and Bags-%26-Luggage separately, or should I do that with
another Pattern I apply to the group after I extract it from the main URL?

Thanks,
Neil

--
Neil Aggarwal, (281)846-8957, www.JAMMConsulting.com
Will your e-commerce site go offline if you have
a DB server failure, fiber cut, flood, fire, or other disaster?
If so, ask about our geographically redundant database system.

There is no way (that I know of) to get two groups without specifying
two sets of parentheses in the regex.
 
Knute Johnson

Neil said:
There is no way (that I know of) to get two groups without specifying
two sets of parentheses in the regex.

If I change my regex to be:
^http://jammconsulting.com/jamm/[^/]+/[^/]+/(([^/]+)/([^/]+))*\\.html?

I get this result:

Group 1: s/Backpacks
Group 2: s
Group 3: Backpacks

Which is splitting up the subexpression but the outer group is wrong
in the first place.

Any ideas?

--
Neil Aggarwal, (281)846-8957, www.JAMMConsulting.com
Will your e-commerce site go offline if you have
a DB server failure, fiber cut, flood, fire, or other disaster?
If so, ask about our geographically redundant database system.

import java.util.regex.*;

public class test {
    public static void main(String[] args) {
        String str =
            // "http://jammconsulting.com/jamm/page/products/Brand/Abc.html";
            "http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        // Two explicit capturing groups, one per pair of path elements.
        Pattern p = Pattern.compile("http://.*/(.*/.*)/(.*/.*)\\.html");
        Matcher m = p.matcher(str);
        System.out.println(m.matches());
        System.out.println(m.group(1));
        System.out.println(m.group(2));
    }
}

C:\Documents and Settings\Knute Johnson>java test
true
Stuff/Bags-%26-Luggage
Bags-%26-Totes/Backpacks
 
markspace

Neil said:
So, for example, this url:
http://jammconsulting.com/jamm/page/products/Brand/Abc.html

Should give me Brand and Abc as parameters.


I get "Brand/Abc" as one single capture group, not two separate things.

Don't get confused, there are two groups in the Matcher. The first is
the WHOLE STRING. It's what the whole regex matches. That's not a
capturing group, it's just the first "group" in the matcher group list.
The second group (argument or index #1) is the first capturing group,
if any. Don't confuse matcher.group(0) and matcher.group(1), they're
two different things, really.

In other words, if you're not testing this carefully and you think
matcher.group(0) is the first capturing group, that's your mistake.
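
A small sketch of the group(0) versus group(1) distinction, assuming the single-pair URL from the original post. The class and method names are made up for illustration, and the pattern is a simplified variant of Neil's (no `*`, dot in the host name escaped), not anything posted in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class GroupZeroDemo {
    // Exactly one capturing group: the "Brand/Abc" pair before ".html".
    static final Pattern P = Pattern.compile(
        "^http://jammconsulting\\.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)\\.html?$");

    // Returns group(n) for the given URL, or null if the URL does not match.
    static String group(String url, int n) {
        Matcher m = P.matcher(url);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/Brand/Abc.html";
        // groupCount() counts capturing groups only; group(0) is not included.
        System.out.println("groupCount = " + P.matcher(url).groupCount()); // 1
        System.out.println("group(0)   = " + group(url, 0)); // the whole URL
        System.out.println("group(1)   = " + group(url, 1)); // Brand/Abc
    }
}
```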

I wrote this regular expression:
^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?

It seems to be working fine for most urls, but it barfed on this one:
http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

The matcher gives me 1 group with this value: s/Backpacks


I get that result too. However, there's no way that regex is "working"
on "most urls" unless the target data is different than what you are
showing us, or you've got some post processing of the capturing group
that masks the problem.

I don't understand how that could have happened. I was expecting to
get two groups:
Stuff/Bags-%26-Luggage
Bags-%26-Totes/Backpacks

Any ideas what went wrong?

I agree with Eric: you'll need two capture groups:

^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)/([^/]+/[^/]+)\.html?


I don't understand what the * was in the end of your regex: "*\.html" ?
I'm not an expert so maybe I missed something though. Note the above is
regex, not a Java string. You'll need to double up the \ at the end for
Java string escape sequences.
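
For what it's worth, here is a sketch of that two-group regex written as a Java string literal, with the backslash doubled. The class and method names are invented for the example. Note that it only matches URLs with exactly two pairs after the fixed prefix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class TwoPairs {
    // Same regex as above; "\." becomes "\\." inside a Java string literal.
    static final Pattern TWO_PAIRS = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)/([^/]+/[^/]+)\\.html?");

    // Returns the two captured pairs, or an empty array if there is no match.
    static String[] pairs(String url) {
        Matcher m = TWO_PAIRS.matcher(url);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
                   + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        for (String pair : pairs(url)) {
            System.out.println(pair);
        }
        // Prints:
        // Stuff/Bags-%26-Luggage
        // Bags-%26-Totes/Backpacks
    }
}
```

A URL with only one pair, such as the Brand/Abc example, does not match this pattern at all, so a variable number of pairs still needs a different approach.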

Also, is there a way to tell the pattern to further parse the group
into Stuff and Bags-%26-Luggage separately, or should I do that with
another Pattern I apply to the group after I extract it from the main URL?


Depends how you want to "further parse" the target string. Example?
 
markspace

Knute said:
Pattern p = Pattern.compile("http://.*/(.*/.*)/(.*/.*)\\.html");


Just curious: how efficient do you think this is? I think that the
first .* will match the whole string, then the regex will start backing
off slowly one character at a time until it gets a match on the rest of
the pattern. This may happen multiple times as each other .* is also
backed off one character at a time to try to produce a match.

A smart matcher could spot constant ".html" at the end and maybe
optimize the resulting compiled regex, but I don't know enough about
regex to predict whether this is really likely or even possible.

The OP's regex string was a bit superior in that regard, he just needed
the extra capturing group.
 
Knute Johnson

markspace said:
Just curious: how efficient do you think this is? I think that the
first .* will match the whole string, then the regex will start backing
off slowly one character at a time until it gets a match on the rest of
the pattern. This may happen multiple times as each other .* is also
backed off one character at a time to try to produce a match.

It's probably horrible. I didn't really play with it other than to make
the two groups work.

markspace said:
A smart matcher could spot constant ".html" at the end and maybe
optimize the resulting compiled regex, but I don't know enough about
regex to predict whether this is really likely or even possible.

The OP's regex string was a bit superior in that regard, he just needed
the extra capturing group.

No doubt.
 
Wojtek

Neil wrote :
I am having trouble figuring out how to write a regular expression to
parse out parts of a URL.

Not to dis regex, but...

I read this thread and think that I could have written a custom parser
in less time, and probably with better performance.
 
Roedy Green

Neil said:
I am having trouble figuring out how to write a regular expression to
parse out parts of a URL.

The URL/URI classes are designed to take URLs apart and put them back
together. You probably don't even have to roll your own regex.

Even if it does not do everything, you can get it to strip out the piece
you need, which you can then process with a simple regex.
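
As a rough illustration of that suggestion, java.net.URI can pull the path out of the URL before any regex is applied. The class and helper names below are made up; only the URI methods are real. One thing to watch: getPath() decodes % escapes, while getRawPath() leaves them alone.

```java
import java.net.URI;

// Hypothetical demo class, for illustration only.
class UriParts {
    // Path with % escapes left as they appear in the URL.
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    // Path with % escapes decoded (%26 becomes &).
    static String path(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
                   + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        System.out.println(URI.create(url).getHost()); // jammconsulting.com
        System.out.println(rawPath(url));
        // /jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html
        System.out.println(path(url));
        // /jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html
    }
}
```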
--
Roedy Green Canadian Mind Products
http://mindprod.com

"You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.
 
markspace

Wojtek said:
Not to dis regex, but...

I read this thread and think that I could have written a custom parser
in less time, and probably with better performance.


Seriously? It took me about two minutes of fiddling with the regex
before I felt I had the answer, and some of that included just messing
around to make absolutely sure I was doing what I thought I was doing.

If you can write a custom parser in two minutes, I'd like to see it.

Also, the regex will be more flexible when requirements do inevitably
change.
 
Wojtek

markspace wrote :
Seriously? It took me about two minutes of fiddling with the regex before I
felt I had the answer, and some of that included just messing around to make
absolutely sure I was doing what I thought I was doing.

If you can write a custom parser in two minutes, I'd like to see it.

Well maybe three minutes... or so :)

For this one, the start of the parse would be the length of the base
URI "http://jammconsulting.com/jamm/page/products/", then read through
the remainder gathering characters into a StringBuffer. When the exit
point is reached for that "block" (a forward slash), place the
StringBuffer.toString() into an ArrayList and go again. When ".html" is
reached, exit the loop.

Print out the ArrayList. Done.

So :
http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

would produce:
Stuff
Bags-%26-Luggage
Bags-%26-Totes
Backpacks
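
A minimal sketch of the hand-written parser described above, under a few assumptions: the base URI is fixed, the URL ends in ".html", and StringBuilder stands in for StringBuffer. The class, constant and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, for illustration only.
class HandParser {
    static final String BASE = "http://jammconsulting.com/jamm/page/products/";
    static final String SUFFIX = ".html";

    // Collects the slash-separated elements between BASE and ".html".
    // Returns an empty list if the URL does not have the expected shape.
    static List<String> parse(String url) {
        List<String> parts = new ArrayList<>();
        if (!url.startsWith(BASE) || !url.endsWith(SUFFIX)) {
            return parts;
        }
        int end = url.length() - SUFFIX.length();
        StringBuilder current = new StringBuilder();
        for (int i = BASE.length(); i < end; i++) {
            char c = url.charAt(i);
            if (c == '/') {
                // End of one block: store it and start the next.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString()); // the block just before ".html"
        }
        return parts;
    }

    public static void main(String[] args) {
        String url = BASE + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        for (String part : parse(url)) {
            System.out.println(part);
        }
    }
}
```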

markspace wrote :
Also, the regex will be more flexible when requirements do inevitably change.

I write a lot of parsers and find them easier than regex, but then I do
not pretend to be a regex master, so creating a regex is almost like a
black art to me. I read through the docs, use a dynamic tester, and
cross my fingers. Both hands...
 
Roedy Green



try:

"http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+)/([^/]+)/([^.]+)\\.html"

or much easier:

String [] chunks = Pattern.compile( "/" ).split( s );
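
A sketch of how the split approach might be finished off, assuming the parameters always start at the seventh chunk (after "http:", the empty chunk, the host, "jamm" and the two fixed path elements). The class and method names and the index 6 are assumptions for this example only.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class SplitDemo {
    static final Pattern SLASH = Pattern.compile("/");

    // Returns the path elements after the first two, with ".html" stripped
    // from the last one. Returns an empty array if there are none.
    static String[] params(String url) {
        String[] chunks = SLASH.split(url);
        // chunks: "http:", "", host, "jamm", X, Y, then the parameters
        if (chunks.length <= 6) {
            return new String[0];
        }
        String[] params = Arrays.copyOfRange(chunks, 6, chunks.length);
        int last = params.length - 1;
        if (params[last].endsWith(".html")) {
            params[last] = params[last].substring(0, params[last].length() - ".html".length());
        }
        return params;
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
                   + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        System.out.println(Arrays.toString(params(url)));
        // [Stuff, Bags-%26-Luggage, Bags-%26-Totes, Backpacks]
    }
}
```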
 
Tom Anderson

Complicated regexes are such a bitch to debug. We need a tool that
shows you just how far it got.

There's a good regexp plugin for Eclipse (and there are doubtless others
than this):

http://brosinski.com/regex/

It doesn't quite do what you say, but it does live updating of a match
display as you edit the pattern, which goes a long way towards letting you
play with regexps interactively.

tom
 
Tom Anderson

Neil said:
I wrote this regular expression:
^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?

It seems to be working fine for most urls, but it barfed on this one:
http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

The matcher gives me 1 group with this value: s/Backpacks

I don't understand how that could have happened. I was expecting to
get two groups:
Stuff/Bags-%26-Luggage
Bags-%26-Totes/Backpacks

Any ideas what went wrong?

You have two problems.

Firstly, the repeated group as written has no way to admit slashes
*between* pairs of path elements. Expand the repetition by hand (three
times, here):

[^/]+/[^/]+[^/]+/[^/]+[^/]+/[^/]+

You get the slash between elements in a pair, but not between pairs. This
explains your results. You need something that expands to:

[^/]+/[^/]+/[^/]+/[^/]+/[^/]+/[^/]+

Like:

^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?
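
To see the difference, here is a small sketch that runs the original pattern and the one just above against the problem URL. The class and method names are invented for the example; the two regexes are copied from the thread as Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class PairSeparatorDemo {
    // Original: no slash allowed between repeated pairs.
    static final String ORIGINAL =
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?";
    // Corrected: each repetition starts with its own slash.
    static final String CORRECTED =
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?";

    // Returns group(1) after find(), or null if the regex finds nothing.
    static String groupOne(String regex, String url) {
        Matcher m = Pattern.compile(regex).matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
                   + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        // The original is forced to split "Bags-%26-Totes" so that the
        // pairs can sit next to each other without a slash in between.
        System.out.println(groupOne(ORIGINAL, url));  // s/Backpacks
        System.out.println(groupOne(CORRECTED, url)); // /Bags-%26-Totes/Backpacks
    }
}
```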

You can get the individual elements with smaller capturing groups (here
making the pair-level group non-capturing):

^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?

Secondly, you get one matching group per occurrence of a capturing group
in the *pattern*, not per occurrence of the subpattern in the match. That
is, if the above pair group matches five times, you'll still only get a
single pair of captured groups (the last ones). That, I think, means
there's no way to use a regular expression to do what you want to do here.
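
A quick sketch of that behaviour, using the pattern with the non-capturing outer group from above. The class and method names are made up. The pair group repeats twice on this URL, but only the captures from the last repetition are kept.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class LastIterationDemo {
    static final Pattern P = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?");

    // Returns { group(1), group(2) } after find(), or null if nothing is found.
    static String[] captures(String url) {
        Matcher m = P.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
                   + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        // Two capturing groups in the pattern, however often they repeat.
        System.out.println(P.matcher(url).groupCount()); // 2
        String[] c = captures(url);
        System.out.println(c[0]); // Bags-%26-Totes
        System.out.println(c[1]); // Backpacks
        // "Stuff" and "Bags-%26-Luggage" were matched but are not retained.
    }
}
```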

At least, not directly. What you can do is make a regexp which matches a
single occurrence of a pair of elements, and then use the Matcher's find()
method to loop over all occurrences in the string. Like so:

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Split {
    public static void main(String... args) throws URISyntaxException {
        Pattern whole = Pattern.compile("^/jamm/[^/]+/[^/]+(.*?)\\.html?$");
        Pattern pair = Pattern.compile("([^/]+)/([^/]+)");
        for (String s: args) {
            URI uri = new URI(s);
            String path = uri.getPath();
            Matcher wholeMatch = whole.matcher(path);
            if (wholeMatch.matches()) {
                Matcher pairMatch = pair.matcher(wholeMatch.group(1));
                while (pairMatch.find()) {
                    String first = pairMatch.group(1);
                    String second = pairMatch.group(2);
                    System.out.println(Integer.toString(pairMatch.start()) + "\t" + first + "\t" + second);
                }
            }
        }
    }
}

Note that rather than matching against the raw URL string, I'm going via
java.net.URI; this saves me having to match the other bits of the URL
explicitly, and also takes care of resolving % escapes.

markspace said:
I don't understand what the * was in the end of your regex: "*\.html" ?

It's a quantifier on the preceding group - the one which captures the
paired path components like 'Stuff/Bags-%26-Luggage'. It means that there
can be any number of such pairs.

tom
 