Need help with regex

Christophe Vanfleteren

Hello,

I'm having trouble finding the right regex for the following
problem:


Assume you have files in a directory in the following form:

*/GROUP1 - GROUP2/GROUP3 - GROUP4.extension

Group 1 can consist of any alphanumeric character, plus some other chars
(space, underscore, ...). Group 3 consists only of digits.
Group 2 and 4 can contain anything, except a File.separator (since that is
used to split on).

This is the regex (using java.util.regex) I use to split all this (all on
one line):
.*/([\w\s'&_,\.\-]+)\s+-\s+([\p{Graph}\s&&[^/]]+)/(\d+)\s+-\s+([\p{Graph}\s&&[^/]]+).*

I am able to retrieve these 4 separate groups, but I get into problems once
the first group also contains a - (minus) char.


When group 1 looks like X XX-ZZZ, all of it should still be treated as the
first group. At the moment the regex doesn't match, since I don't allow
"-" in the first group. But if I do allow it, I can no longer split
on " - " (since I also allow spaces in the first group).

So I should be able to construct a regex that allows spaces and the "-" in
the first group, but still starts the second group once it finds a " - ".

I tried messing around with non-capturing groups, like ([\w\s'&_,\.\-]+),
but I guess that's not the way it should be done.


Do any regex experts have tips on how I should construct this regex?
 
Christophe Vanfleteren

Made a mistake here:

([\w\s'&_,\.\-]+)
should be
([\w\s'&_,\.\-&&^(?:\s\-\s)]+)
 
Christophe Vanfleteren

Roedy said:
One way to simplify this is to use a split to get some pieces, then
make a regex for each piece.

See http://mindprod.com/jgloss/regex.html

One giant regex can boggle the mind.

The giant regex you see is the result of 4 smaller regexes :)

The problem is that I can't easily start to work with a splitter, since the
GROUP1 - GROUP2/GROUP3 - GROUP4 pattern could also be something like
GROUP1/GROUP2 - GROUP3 - GROUP4 or GROUP1/(GROUP2) - GROUP3 - GROUP4.

At the moment, I have a number of schemes like that, and I test each file
until one scheme matches.
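
The "try each scheme until one matches" approach described above could look roughly like the sketch below. The class name SchemeMatcher, the two patterns and the sample paths are all placeholders made up for illustration; they are not the actual schemes used in this thread, and "/" is assumed as the file separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SchemeMatcher {
    // Placeholder schemes, tried in order (assumptions, not the real ones).
    private static final List<Pattern> SCHEMES = new ArrayList<>();
    static {
        // GROUP1 - GROUP2/GROUP3 - GROUP4.extension
        SCHEMES.add(Pattern.compile(".*/([^/]+) - ([^/]+)/(\\d+) - ([^/]+)\\.\\w+"));
        // GROUP1/GROUP2 - GROUP3 - GROUP4.extension
        SCHEMES.add(Pattern.compile(".*/([^/]+)/([^/]+) - (\\d+) - ([^/]+)\\.\\w+"));
    }

    /** Returns the groups of the first scheme that matches, or null if none does. */
    static String[] parse(String path) {
        for (Pattern scheme : SCHEMES) {
            Matcher m = scheme.matcher(path);
            if (m.matches()) {
                String[] groups = new String[m.groupCount()];
                for (int i = 0; i < groups.length; i++) {
                    groups[i] = m.group(i + 1); // group(0) is the whole match
                }
                return groups;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints: X XX-ZZZ | Foo | 12 | Bar
        System.out.println(String.join(" | ", parse("/data/X XX-ZZZ - Foo/12 - Bar.txt")));
        // prints: Foo | Bar | 12 | Baz
        System.out.println(String.join(" | ", parse("/data/Foo/Bar - 12 - Baz.txt")));
    }
}
```

Since the first matching scheme wins, the more specific schemes would normally go first in the list.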

But maybe I shouldn't have shown the entire regex, as it is just the part
where I want to be able to match spaces and "-" in the same group, without
matching " - ", that I have problems with.
 
Christophe Vanfleteren

Roedy said:
One way to simplify this is to use a split to get some pieces, then
make a regex for each piece.

See http://mindprod.com/jgloss/regex.html

One giant regex can boggle the mind.

Ok, I read your page again carefully, and the description for (?!X) seemed
useful, so I tried it. It now works.

The part for the first group now looks like this:
([\w\s'&_,\.\-(?!\s).]+)
 
Alan Moore

Ok, I read your page again carefully, and the description for (?!X) seemed
useful, so I tried it. It now works.

The part for the first group now looks like this:
([\w\s'&_,\.\-(?!\s).]+)

If that works, it's only by accident. Inside the character class,
"(?!\s)" isn't interpreted as a negative lookahead, but as the
individual characters '(', '?', '!' and ')', plus the whitespace
shorthand (again). Also, inside a character class, a dot is not a
metacharacter; it just matches a dot (so you didn't have to escape the
first one, and the second one isn't doing what you think it is). I
think you were trying to do this:

((?:[\w\s'&,.]|-(?!\s))+)

Here, the character class takes care of all the allowable characters
except the hyphen, while the second alternative matches the hyphen
only if it's not followed by a whitespace character. That can be done
more efficiently, but I think a much simpler approach may be in order.
If you know the groups will always be separated by the sequence " - ",
why not just use that to find the groups:

.*/([^/]+) - ([^/]+)/([^/]+) - ([^/]+)\.\w+

You should be able to use the same approach for the other formats you
listed.
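
For reference, here is a minimal sketch of the simpler pattern above used with java.util.regex. The class name PathSplitDemo and the sample paths are made up for illustration; only the pattern itself comes from this post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSplitDemo {
    // The pattern from the post, with backslashes doubled for a Java string literal.
    static final Pattern SIMPLE =
            Pattern.compile(".*/([^/]+) - ([^/]+)/([^/]+) - ([^/]+)\\.\\w+");

    /** Returns the four groups, or null if the whole path does not match. */
    static String[] split(String path) {
        Matcher m = SIMPLE.matcher(path);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // A hyphen inside group 1 is fine, because only " - " separates groups.
        // prints: X XX-ZZZ | Second | 01 | Fourth
        System.out.println(String.join(" | ",
                split("/files/X XX-ZZZ - Second/01 - Fourth.ext")));

        // [^/]+ is greedy, so with several " - " in one path segment
        // the split happens at the last one.
        // prints: XXX - YYY | ZZZ | 01 | T
        System.out.println(String.join(" | ",
                split("/files/XXX - YYY - ZZZ/01 - T.ext")));
    }
}
```

Whether splitting at the last " - " is the desired result depends on the naming scheme; a reluctant quantifier such as ([^/]+?) for the first group would split at the first " - " instead.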
 
Christophe Vanfleteren

Hi Alan,

You were correct: after your post I noticed that XXX-ZZZ got matched, but in
the case of XXX - YYY - ZZZ, XXX - YYY got counted as the first group.
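
A small sketch that shows why this happens, using only the group-1 fragments quoted in this thread. The class name CharClassDemo, the helper leadingMatch and the test strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // My group-1 fragment: "(?!\s)" sits inside [...], so it is not a lookahead.
    static final Pattern ACCIDENTAL =
            Pattern.compile("([\\w\\s'&_,\\.\\-(?!\\s).]+)");

    // Alan's alternative: a hyphen is allowed only if no whitespace follows it.
    static final Pattern LOOKAHEAD =
            Pattern.compile("((?:[\\w\\s'&,.]|-(?!\\s))+)");

    /** Returns what the pattern matches at the start of s, or null. */
    static String leadingMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // '(', '?', '!' and ')' are plain members of the character class.
        System.out.println(leadingMatch(ACCIDENTAL, "(?!)"));            // (?!)
        // Nothing stops the class at " - ", so it swallows everything.
        System.out.println(leadingMatch(ACCIDENTAL, "XXX - YYY - ZZZ")); // XXX - YYY - ZZZ
        // The lookahead version stops in front of the "-" of " - ".
        // Note the trailing space, which a following \s+-\s+ takes back.
        System.out.println("[" + leadingMatch(LOOKAHEAD, "X XX-ZZZ - YYY") + "]"); // [X XX-ZZZ ]
        // Here the parentheses are not part of the class, so nothing matches.
        System.out.println(leadingMatch(LOOKAHEAD, "(?!)"));             // null
    }
}
```

In the full regex, the fragment first swallows the whole XXX - YYY - ZZZ and then backtracks only as far as the last " - ", which is why XXX - YYY ends up in the first group.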

The regex you gave works perfectly, thanks a lot.
 
