Need help with regular expression to parse URLs

Discussion in 'Java' started by Neil, Aug 10, 2009.

  1. Neil

    Neil Guest

    Hello:

    I am having trouble figuring out how to write a regular expression to
    parse out parts of a URL.

    For example, I am trying to parse the url
    http://jammconsulting.com/jamm/page/test/*/*/*/*.html
    into several substrings. The URL should begin with
    http://jammconsulting.com/jamm/*/*/
    and then have a group of parameters in the form */*
    and then end with .html

    So, for example, this url:
    http://jammconsulting.com/jamm/page/products/Brand/Abc.html

    Should give me Brand and Abc as parameters.

    I wrote this regular expression:
    ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?

    It seems to be working fine for most urls, but it barfed on this one:
    http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

    The matcher gives me 1 group with this value: s/Backpacks

    I don't understand how that could have happened. I was expecting to
    get two groups:
    Stuff/Bags-%26-Luggage
    Bags-%26-Totes/Backpacks

    Any ideas what went wrong?

    Also, is there a way to tell the pattern to further parse the group
    into
    Stuff and Bags-%26-Luggage separately or should I do that with another
    Pattern I apply to the group after I extract it from the main url?

    Thanks,
    Neil

    --
    Neil Aggarwal, (281)846-8957, www.JAMMConsulting.com
    Will your e-commerce site go offline if you have
    a DB server failure, fiber cut, flood, fire, or other disaster?
    If so, ask about our geographically redundant database system.
    Neil, Aug 10, 2009
    #1

  2. Neil wrote:
    > Hello:
    >
    > I am having trouble figuring out how to write a regular expression to
    > parse out parts of a URL.
    >
    > For example, I am trying to parse the url
    > http://jammconsulting.com/jamm/page/test/*/*/*/*.html
    > into several substrings. The URL should begin with
    > http://jammconsulting.com/jamm/*/*/
    > and then have a group of parameters in the form */*
    > and then end with .html
    >
    > So, for example, this url:
    > http://jammconsulting.com/jamm/page/products/Brand/Abc.html
    >
    > Should give me Brand and Abc as parameters.
    >
    > I wrote this regular expression:
    > ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?
    >
    > It seems to be working fine for most urls, but it barfed on this one:
    > http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html
    >
    > The matcher gives me 1 group with this value: s/Backpacks
    >
    > I don't understand how that could have happened. I was expecting to
    > get two groups:
    > Stuff/Bags-%26-Luggage
    > Bags-%26-Totes/Backpacks
    >
    > Any ideas what went wrong?
    >
    > Also, is there a way to tell the pattern to further parse the group
    > into
    > Stuff and Bags-%26-Luggage separately or should I do that with another
    > Pattern I apply to the group after I extract it from the main url?
    >
    > Thanks,
    > Neil


    There is no way (that I know of) to get two groups without specifying
    two sets of parentheses in the regex.

    --

    Knute Johnson
    email s/nospam/knute2009/

    --
    Posted via NewsDemon.com - Premium Uncensored Newsgroup Service
    ------->>>>>>http://www.NewsDemon.com<<<<<<------
    Unlimited Access, Anonymous Accounts, Uncensored Broadband Access
    Knute Johnson, Aug 10, 2009
    #2

  3. Neil

    Neil Guest

    > There is no way (that I know of) to get two groups without specifying
    > two sets of parentheses in the regex.


    If I change my regex to be:
    ^http://jammconsulting.com/jamm/[^/]+/[^/]+/(([^/]+)/([^/]+))*\\.html?

    I get this result:

    Group 1: s/Backpacks
    Group 2: s
    Group 3: Backpacks

    Which is splitting up the subexpression but the outer group is wrong
    in the first place.

    Any ideas?

    Neil, Aug 10, 2009
    #3
  4. Neil wrote:
    >> There is no way (that I know of) to get two groups without specifying
    >> two sets of parentheses in the regex.

    >
    > If I change my regex to be:
    > ^http://jammconsulting.com/jamm/[^/]+/[^/]+/(([^/]+)/([^/]+))*\\.html?
    >
    > I get this result:
    >
    > Group 1: s/Backpacks
    > Group 2: s
    > Group 3: Backpacks
    >
    > Which is splitting up the subexpression but the outer group is wrong
    > in the first place.
    >
    > Any ideas?


    import java.util.regex.*;

    public class test {
        public static void main(String[] args) {
            // Shorter sample with a single pair (would not match the two-pair pattern below):
            // "http://jammconsulting.com/jamm/page/products/Brand/Abc.html";
            String str =
                "http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
            // Two explicit capturing groups, one per pair of path elements.
            Pattern p = Pattern.compile("http://.*/(.*/.*)/(.*/.*)\\.html");
            Matcher m = p.matcher(str);
            System.out.println(m.matches());
            System.out.println(m.group(1));
            System.out.println(m.group(2));
        }
    }

    C:\Documents and Settings\Knute Johnson>java test
    true
    Stuff/Bags-%26-Luggage
    Bags-%26-Totes/Backpacks

    --

    Knute Johnson
    email s/nospam/knute2009/

    Knute Johnson, Aug 10, 2009
    #4
  5. Neil

    markspace Guest

    Neil wrote:
    >
    > So, for example, this url:
    > http://jammconsulting.com/jamm/page/products/Brand/Abc.html
    >
    > Should give me Brand and Abc as parameters.



    I get "Brand/Abc" as one single capture group, not two separate things.

    Don't get confused, there are two groups in the Matcher. The first is
    the WHOLE STRING. It's what the whole regex matches. That's not a
    capturing group, it's just the first "group" in the matcher group list.
    The second group (argument or index #1) is the first capturing group,
    if any. Don't confuse matcher.group(0) and matcher.group(1), they're
    two different things, really.

    In other words, if you're not testing this carefully and you think
    matcher.group(0) is the first capturing group, that's your mistake.
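To make the group(0) versus group(1) distinction concrete, here is a minimal sketch that runs the original pattern against the shorter URL. The class name GroupZeroDemo and the helper groups() are invented for illustration; the pattern is copied from the original post as a Java string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupZeroDemo {
    // The pattern from the original post, written as a Java string literal.
    private static final Pattern P = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?");

    /** Returns { group(0), group(1) }, or null when the pattern is not found. */
    static String[] groups(String url) {
        Matcher m = P.matcher(url);
        if (!m.find()) {
            return null;
        }
        // group(0) is the whole match; group(1) is the first capturing group.
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("http://jammconsulting.com/jamm/page/products/Brand/Abc.html");
        System.out.println("group(0) = " + g[0]); // the entire URL
        System.out.println("group(1) = " + g[1]); // Brand/Abc
    }
}
```

Matcher.groupCount() reports 1 for this pattern, because group 0 is not included in the count.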


    >
    > I wrote this regular expression:
    > ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?
    >
    > It seems to be working fine for most urls, but it barfed on this one:
    > http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html
    >
    > The matcher gives me 1 group with this value: s/Backpacks



    I get that result too. However, there's no way that regex is "working"
    on "most urls" unless the target data is different than what you are
    showing us, or you've got some post processing of the capturing group
    that masks the problem.


    >
    > I don't understand how that could have happened. I was expecting to
    > get two groups:
    > Stuff/Bags-%26-Luggage
    > Bags-%26-Totes/Backpacks
    >
    > Any ideas what went wrong?


    I agree with Eric: you'll need two capture groups:

    ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)/([^/]+/[^/]+)\.html?


    I don't understand what the * was in the end of your regex: "*\.html" ?
    I'm not an expert so maybe I missed something though. Note the above is
    regex, not a Java string. You'll need to double up the \ at the end for
    Java string escape sequences.
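As a rough illustration of the doubled backslash, the regex above could be embedded in Java like this. The class name TwoPairsDemo and the helper pairs() are made up for the example, and it assumes the URL carries exactly two pairs, which is all this pattern accepts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPairsDemo {
    // Same regex as above; only the backslash before ".html" is doubled for Java.
    private static final Pattern P = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)/([^/]+/[^/]+)\\.html?");

    /** Returns the two captured pairs, or null when the URL does not match. */
    static String[] pairs(String url) {
        Matcher m = P.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] p = pairs("http://jammconsulting.com/jamm/page/products/"
            + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html");
        System.out.println(p[0]); // Stuff/Bags-%26-Luggage
        System.out.println(p[1]); // Bags-%26-Totes/Backpacks
    }
}
```

Note that a URL with only one pair, such as the Brand/Abc example, returns null here, since both capturing groups are mandatory.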


    >
    > Also, is there a way to tell the pattern to further parse the group
    > into
    > Stuff and Bags-%26-Luggage separately or should I do that with another
    > Pattern I apply to the group after I extract it from the main url?



    Depends how you want to "further parse" the target string. Example?
    markspace, Aug 10, 2009
    #5
  6. Neil

    markspace Guest

    Knute Johnson wrote:

    > Pattern p = Pattern.compile("http://.*/(.*/.*)/(.*/.*)\\.html");



    Just curious: how efficient do you think this is? I think that the
    first .* will match the whole string, then the regex will start backing
    off slowly one character at a time until it gets a match on the rest of
    the pattern. This may happen multiple times as each other .* is also
    backed off one character at a time to try to produce a match.

    A smart matcher could spot constant ".html" at the end and maybe
    optimize the resulting compiled regex, but I don't know enough about
    regex to predict whether this is really likely or even possible.

    The OP's regex string was a bit superior in that regard, he just needed
    the extra capturing group.
    markspace, Aug 10, 2009
    #6
  7. Neil

    markspace Guest

    markspace wrote:

    > I agree with Eric: you'll need two capture groups:


    Oops, meant "Knute." :)
    markspace, Aug 10, 2009
    #7
  8. markspace wrote:
    > Knute Johnson wrote:
    >
    >> Pattern p = Pattern.compile("http://.*/(.*/.*)/(.*/.*)\\.html");

    >
    >
    > Just curious: how efficient do you think this is? I think that the
    > first .* will match the whole string, then the regex will start backing
    > off slowly one character at a time until it gets a match on the rest of
    > the pattern. This may happen multiple times as each other .* is also
    > backed off one character at a time to try to produce a match.


    It's probably horrible. I didn't really play with it other than to make
    the two groups work.

    > A smart matcher could spot constant ".html" at the end and maybe
    > optimize the resulting compiled regex, but I don't know enough about
    > regex to predict whether this is really likely or even possible.
    >
    > The OP's regex string was a bit superior in that regard, he just needed
    > the extra capturing group.


    No doubt.

    --

    Knute Johnson
    email s/nospam/knute2009/

    Knute Johnson, Aug 10, 2009
    #8
  9. markspace wrote:
    > markspace wrote:
    >
    >> I agree with Eric: you'll need two capture groups:

    >
    > Oops, meant "Knute." :)


    Erik is my brother :).

    --

    Knute Johnson
    email s/nospam/knute2009/

    Knute Johnson, Aug 10, 2009
    #9
  10. Neil

    Wojtek Guest

    Neil wrote :
    > I am having trouble figuring out how to write a regular expression to
    > parse out parts of a URL.


    Not to dis regex, but...

    I read this thread and think that I could have written a custom parser
    in less time, and probably with better performance.

    --
    Wojtek :)
    Wojtek, Aug 10, 2009
    #10
  11. Neil

    Roedy Green Guest

    On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
    <> wrote, quoted or indirectly quoted someone
    who said :

    >
    >I am having trouble figuring out how to write a regular expression to
    >parse out parts of a URL.


    The URL/URI classes are designed to take URLs apart and put them back
    together. You probably don't even have to roll your own regex.

    Even if it does not do everything, you can get it to strip out the piece
    you need, which you can then process with a simple regex.
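A small sketch of that idea, using java.net.URI to pull out just the path before any regex is applied. The class name UriPathDemo and its two helpers are hypothetical; getPath() and getRawPath() are the standard URI accessors.

```java
import java.net.URI;

final class UriPathDemo {
    /** Path with percent-escapes decoded, e.g. %26 becomes &. */
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    /** Path exactly as written in the URL, escapes left alone. */
    static String rawPath(String url) {
        return URI.create(url).getRawPath();
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
            + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        System.out.println(decodedPath(url));
        System.out.println(rawPath(url));
    }
}
```

Which of the two is wanted depends on whether the parameters should come out as Bags-&-Luggage or Bags-%26-Luggage.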
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    "You can have quality software, or you can have pointer arithmetic; but you cannot have both at the same time."
    ~ Bertrand Meyer (born: 1950 age: 59) 1989, creator of design by contract and the Eiffel language.
    Roedy Green, Aug 10, 2009
    #11
  12. Neil

    markspace Guest

    Wojtek wrote:
    > Neil wrote :
    >> I am having trouble figuring out how to write a regular expression to
    >> parse out parts of a URL.

    >
    > Not to dis regex, but...
    >
    > I read this thread and think that I could have written a custom parser
    > in less time, and probably with better performance.
    >



    Seriously? It took me about two minutes of fiddling with the regex
    before I felt I had the answer, and some of that included just messing
    around to make absolutely sure I was doing what I thought I was doing.

    If you can write a custom parser in two minutes, I'd like to see it.

    Also, the regex will be more flexible when requirements do inevitably
    change.
    markspace, Aug 10, 2009
    #12
  13. Neil

    Roedy Green Guest

    On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
    <> wrote, quoted or indirectly quoted someone
    who said :

    >([^/]+/[^/]+)


    This sort of thing might be easier to process by extracting a chunk of
    the big string, and doing a regex split.

    http://mindprod.com/jgloss/regex.html#SPLIT
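One way this might look, as a sketch only: a single capturing group grabs everything between the two leading path elements and ".html", and String.split() does the rest. The class name SplitChunkDemo, the helper segments() and the escaped host name in the pattern are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitChunkDemo {
    // Group 1 is the chunk holding all the parameters, slashes included.
    private static final Pattern CHUNK = Pattern.compile(
        "^http://jammconsulting\\.com/jamm/[^/]+/[^/]+/(.*)\\.html?$");

    /** Returns the individual path elements of the chunk, or an empty list. */
    static List<String> segments(String url) {
        Matcher m = CHUNK.matcher(url);
        if (!m.matches() || m.group(1).isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1).split("/"));
    }

    public static void main(String[] args) {
        System.out.println(segments("http://jammconsulting.com/jamm/page/products/"
            + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html"));
    }
}
```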
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Roedy Green, Aug 10, 2009
    #13
  14. Neil

    Roedy Green Guest

    On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
    <> wrote, quoted or indirectly quoted someone
    who said :

    >http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html


    Complicated regexes are such a bitch to debug. We need a tool that
    shows you just how far it got.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Roedy Green, Aug 10, 2009
    #14
  15. Neil

    Wojtek Guest

    markspace wrote :
    > Wojtek wrote:
    >> Neil wrote :
    >>> I am having trouble figuring out how to write a regular expression to
    >>> parse out parts of a URL.

    >>
    >> Not to dis regex, but...
    >>
    >> I read this thread and think that I could have written a custom parser in
    >> less time, and probably with better performance.
    >>

    >
    >
    > Seriously? It took me about two minutes of fiddling with the regex before I
    > felt I had the answer, and some of that included just messing around to make
    > absolutely sure I was doing what I thought I was doing.
    >
    > If you can write a custom parser in two minutes, I'd like to see it.


    Well maybe three minutes... or so :)

    For this one, the start of the parse would be the length of the base
    URI "http://jammconsulting.com/jamm/page/products/", then read through
    the remainder gathering characters into a StringBuffer. When the exit
    point is reached for that "block" (a forward slash), place the
    StringBuffer.toString() into an ArrayList and go again. When ".html" is
    reached, exit the loop.

    Print out the ArrayList. Done.

    So :
    http://jammconsulting.com/jamm/page/products/Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html

    would produce:
    Stuff
    Bags-%26-Luggage
    Bags-%26-Totes
    Backpacks
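The loop described above could be sketched roughly as follows. This is one reading of the description, not code from the thread: the class name HandParser, the constants BASE and SUFFIX, and the use of StringBuilder in place of StringBuffer are all choices made for the example, and the base URI is assumed to be fixed.

```java
import java.util.ArrayList;
import java.util.List;

final class HandParser {
    // Assumption: every URL starts with this exact base and ends with ".html".
    private static final String BASE = "http://jammconsulting.com/jamm/page/products/";
    private static final String SUFFIX = ".html";

    static List<String> parse(String url) {
        if (!url.startsWith(BASE) || !url.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("Unexpected URL: " + url);
        }
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int end = url.length() - SUFFIX.length(); // stop before ".html"
        for (int i = BASE.length(); i < end; i++) {
            char c = url.charAt(i);
            if (c == '/') {
                // End of one block: store it and start collecting the next.
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString()); // the block just before ".html"
        return parts;
    }

    public static void main(String[] args) {
        for (String part : parse(BASE + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html")) {
            System.out.println(part);
        }
    }
}
```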


    > Also, the regex will be more flexible when requirements do inevitably change.


    I write a lot of parsers and find them easier than regex, but then I do
    not pretend to be a regex master, so creating a regex is almost like a
    black art to me. I read through the docs, use a dynamic tester, and
    cross my fingers. Both hands...

    --
    Wojtek :)
    Wojtek, Aug 10, 2009
    #15
  16. Neil

    Roedy Green Guest

    On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
    <> wrote, quoted or indirectly quoted someone
    who said :

    >http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html



    try:


    "http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+)/([^/]+)/([^.]+)\\.html"


    or much easier:

    String [] chunks = Pattern.compile( "/" ).split( s );
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Roedy Green, Aug 10, 2009
    #16
  17. Neil

    Stefan Ram Guest

    Stefan Ram, Aug 10, 2009
    #17
  18. Neil

    Tom Anderson Guest

    On Mon, 10 Aug 2009, Roedy Green wrote:

    > On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html

    >
    > Complicated regexes are such a bitch to debug. We need a tool that
    > shows you just how far it got.


    There's a good regexp plugin for Eclipse (and there are doubtless others
    than this):

    http://brosinski.com/regex/

    It doesn't quite do what you say, but it does live updating of a match
    display as you edit the pattern, which goes a long way towards letting you
    play with regexps interactively.

    tom

    --
    I do not fear death. I had been dead for billions and billions of years
    before I was born. -- Mark Twain
    Tom Anderson, Aug 10, 2009
    #18
  19. Neil

    Tom Anderson Guest

    On Mon, 10 Aug 2009, markspace wrote:

    > Neil wrote:
    >
    >> I wrote this regular expression:
    >> ^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?
    >>
    >> It seems to be working fine for most urls, but it barfed on this one:
    >> http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html
    >>
    >> The matcher gives me 1 group with this value: s/Backpacks
    >>
    >> I don't understand how that could have happened. I was expecting to
    >> get two groups:
    >> Stuff/Bags-%26-Luggage
    >> Bags-%26-Totes/Backpacks
    >>
    >> Any ideas what went wrong?


    You have two problems.

    Firstly, the repeated group as written has no way to admit slashes
    *between* pairs of path elements. Expand the repetition by hand (three
    times, here):

    [^/]+/[^/]+[^/]+/[^/]+[^/]+/[^/]+

    You get the slash between elements in a pair, but not between pairs. This
    explains your results. You need something that expands to:

    [^/]+/[^/]+/[^/]+/[^/]+/[^/]+/[^/]+

    Like:

    ^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?
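A quick sketch to check the effect of moving the slash inside the repeated group. The class name SeparatorDemo and the helper lastRepeat() are invented for the example; the two patterns are the original one and the corrected one shown just above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SeparatorDemo {
    static final String ORIGINAL =
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+/([^/]+/[^/]+)*\\.html?";
    static final String FIXED =
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(/[^/]+/[^/]+)*\\.html?";

    /** Returns group(1) after find(), or null when the pattern is not found. */
    static String lastRepeat(String regex, String url) {
        Matcher m = Pattern.compile(regex).matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://jammconsulting.com/jamm/page/products/"
            + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html";
        // With no slash between repetitions, the engine has to split words
        // mid-way to find a match, so the last repetition is "s/Backpacks".
        System.out.println(lastRepeat(ORIGINAL, url)); // s/Backpacks
        // With the slash inside the group, each repetition is a whole pair.
        System.out.println(lastRepeat(FIXED, url));    // /Bags-%26-Totes/Backpacks
    }
}
```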

    You can get the individual elements with smaller capturing groups (here
    making the pair-level group non-capturing):

    ^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?

    Secondly, you get one matching group per occurrence of a capturing group
    in the *pattern*, not per occurrence of the subpattern in the match. That
    is, if the above pair group matches five times, you'll still only get a
    single pair of captured groups (the last ones). That, i think, means
    there's no way to use a regular expression to do what you want to do here.
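The point about repeated groups can be seen in a small sketch. The class name LastIterationDemo and its helpers are made up for illustration; the pattern is the non-capturing version given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LastIterationDemo {
    private static final Pattern P = Pattern.compile(
        "^http://jammconsulting.com/jamm/[^/]+/[^/]+(?:/([^/]+)/([^/]+))*\\.html?");

    /** Number of capturing groups in the pattern, however often they repeat. */
    static int groupCount() {
        return P.matcher("").groupCount();
    }

    /** Returns { group(1), group(2) } after a match, or null. */
    static String[] captured(String url) {
        Matcher m = P.matcher(url);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] g = captured("http://jammconsulting.com/jamm/page/products/"
            + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html");
        System.out.println(groupCount());      // 2
        System.out.println(g[0] + " " + g[1]); // Bags-%26-Totes Backpacks
    }
}
```

The first pair (Stuff, Bags-%26-Luggage) was matched along the way but is no longer reachable through the Matcher once the group has repeated.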

    At least, not directly. What you can do is make a regexp which matches a
    single occurrence of a pair of elements, and then use the Matcher's find()
    method to loop over all occurrences in the string. Like so:

    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    public class Split {
        public static void main(String... args) throws URISyntaxException {
            // Captures everything between the two leading path elements and ".html".
            Pattern whole = Pattern.compile("^/jamm/[^/]+/[^/]+(.*?)\\.html?$");
            // Matches one pair of path elements at a time.
            Pattern pair = Pattern.compile("([^/]+)/([^/]+)");
            for (String s: args) {
                URI uri = new URI(s);
                String path = uri.getPath();
                Matcher wholeMatch = whole.matcher(path);
                if (wholeMatch.matches()) {
                    Matcher pairMatch = pair.matcher(wholeMatch.group(1));
                    while (pairMatch.find()) {
                        String first = pairMatch.group(1);
                        String second = pairMatch.group(2);
                        System.out.println(Integer.toString(pairMatch.start()) + "\t" + first + "\t" + second);
                    }
                }
            }
        }
    }

    Note that rather than matching against the raw URL string, i'm going via
    java.net.URI; this saves me having to match the other bits of the URL
    explicitly, and also takes care of resolving % escapes.

    > I don't understand what the * was in the end of your regex: "*\.html" ?


    It's a quantifier on the preceding group - the one which captures the
    paired path components like 'Stuff/Bags-%26-Luggage'. It means that there
    can be any number of such pairs.

    tom

    Tom Anderson, Aug 10, 2009
    #19
  20. Neil

    Tom Anderson Guest

    On Mon, 10 Aug 2009, Roedy Green wrote:

    > On Mon, 10 Aug 2009 11:35:04 -0700 (PDT), Neil
    > <> wrote, quoted or indirectly quoted someone
    > who said :
    >
    >> http://jammconsulting.com/jamm/page/products/Stuff/Bags-&-Luggage/Bags-&-Totes/Backpacks.html

    >
    > or much easier:
    >
    > String [] chunks = Pattern.compile( "/" ).split( s );


    This is absolutely the right thing to do (yes, i know i've just posted a
    completely different solution - split() is better), and i'm shocked that
    nobody else has suggested it yet.

    Writing a loop to iterate over the elements of the chunks array in pairs
    is a pain, but a very minor one.
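For what it's worth, the pair loop might look something like this. It is only a sketch: the class name ChunkPairs, the constant FIRST_PARAM and the error handling are assumptions, and it takes for granted that the first six chunks of the split are "http:", an empty string, the host, "jamm" and the two leading path elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ChunkPairs {
    private static final Pattern SLASH = Pattern.compile("/");
    private static final int FIRST_PARAM = 6; // assumption: chunks 0-5 are not parameters
    private static final String SUFFIX = ".html";

    static List<String[]> pairs(String url) {
        String[] chunks = SLASH.split(url);
        int last = chunks.length - 1;
        if (last < FIRST_PARAM || !chunks[last].endsWith(SUFFIX)) {
            throw new IllegalArgumentException("Unexpected URL shape: " + url);
        }
        // Drop ".html" from the final chunk.
        chunks[last] = chunks[last].substring(0, chunks[last].length() - SUFFIX.length());
        if ((chunks.length - FIRST_PARAM) % 2 != 0) {
            throw new IllegalArgumentException("Odd number of parameter elements: " + url);
        }
        List<String[]> result = new ArrayList<>();
        for (int i = FIRST_PARAM; i < chunks.length; i += 2) {
            result.add(new String[] { chunks[i], chunks[i + 1] });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : pairs("http://jammconsulting.com/jamm/page/products/"
                + "Stuff/Bags-%26-Luggage/Bags-%26-Totes/Backpacks.html")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```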

    tom

    Tom Anderson, Aug 10, 2009
    #20
