need help with regex

?

-

I have a sample text:
"Key: Value Key: Value2 Key: Value3 Subkey: apple Subkey: orange Key:
Value 4"

I need to extract:
"Subkey: apple Subkey: orange"

I have this regular expression:
p.compile("Key: Value2\\s.*(Subkey:\\s.*)Key:");

which only succeeds in extracting "Subkey: orange".

I am pretty sure the solution is to repeat the group using
(Subkey:\\s.*)* <--- extra asterisk.

BUT in Java, the bracket means something else and I don't know how to
make it work.

Any help is appreciated. Thank you.
 
J

John C. Bollinger

- said:
BUT in java, the bracket means something else and i don't know how to
make it work.

I'm not sure what you mean by that. Where does a bracket come into it
at all? If you mean the closing parenthesis, it has the same meaning in
Java regular expressions that it does in Perl regular expressions. You
do need to mark the Subkey part of the pattern as appearing zero or more
times (via an asterisk), but that by itself won't fix the whole problem,
which also includes that your ".*" subexpressions are matching too much
of the input. It might help to use reluctant quantifiers instead of
greedy ones.

The problem is trickier than it appears, assuming that you want to be
able to generalize it to other legal expressions in the apparent
language of the string example. For instance, what if the Subkeys occur
on the last Key in the String? What if more than one Key has Subkeys?
 
A

Alan Moore

- said:
i have a regex expression:
p.compile("Key: Value2\\s.*(Subkey:\\s.*)Key:");

which only succeeds in extracting "Subkey: orange"

i am pretty sure the solution is to repeat the portion using
(Subkey:\\s.*)* <--- extra asterisk.

The problem with your regex is that the first ".*" initially matches
all the way to the end of the line. Then the regex engine has to
backtrack in order to match the rest of the pattern--but it only
backtracks as far as it has to, i.e., to the *last* occurrence of
"Subkey:". Adding the asterisk where you suggested only makes things
worse, because now it doesn't have to match the parenthesized
expression even once.
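
To make the backtracking behaviour concrete, here is a small sketch that runs the original pattern, and the variant with the extra asterisk, against the sample text from the question. The class name GreedyDemo and the helper firstGroup are made up for this illustration, and the sample is assumed to be a single line with one space where the post wraps.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Sample text from the question, assumed to be one line.
    public static final String SAMPLE =
        "Key: Value Key: Value2 Key: Value3 Subkey: apple Subkey: orange Key: Value 4";

    // Returns group 1 of the first match, or null if the pattern does not
    // match or the group took no part in the match.
    public static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy ".*" runs to the end of the line and backtracks only as far
        // as the LAST "Subkey:", so the first subkey is swallowed.
        String greedy = "Key: Value2\\s.*(Subkey:\\s.*)Key:";
        System.out.println("[" + firstGroup(greedy, SAMPLE) + "]");   // [Subkey: orange ]

        // With the extra asterisk the group may match zero times, so the
        // greedy ".*" swallows both subkeys and the group captures nothing.
        String starred = "Key: Value2\\s.*(Subkey:\\s.*)*Key:";
        System.out.println("[" + firstGroup(starred, SAMPLE) + "]");  // [null]
    }
}
```

The brackets in the output are only there to make the trailing space in the first capture visible.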

The simplest solution is to make the first ".*" non-greedy: ".*?".
You do need to add a quantifier to the subexpression, but just tacking
on another asterisk is a bad idea. Whenever you have a regex of the
form (x*)*, you run the risk that the regex will take forever to
report failure. I suggest you modify the subexpression so that it
doesn't rely on backtracking. Assuming the Subkey values can't
contain spaces, this should work:

"Key: Value2\\s.*?((?:\\sSubkey:\\s\\S++)+)"

Notice that I also had to match the space preceding each Subkey entry
in order for the quantifier to work, so the captured group starts with
that space. I also used a possessive plus inside the subexpression to
avoid the neverending nonmatch problem, although in this case it isn't
really necessary.
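
For anyone who wants to try it, the pattern above can be exercised with a few lines of Java. This is only a sketch: the class name SubkeyExtract and the method extract are invented for the example, and the trim() call is there because the captured group begins with the whitespace in front of the first Subkey entry.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubkeyExtract {
    // The pattern from the post: reluctant ".*?" up to the first run of
    // " Subkey: value" entries, with a possessive "\\S++" for each value.
    private static final Pattern SUBKEYS =
        Pattern.compile("Key: Value2\\s.*?((?:\\sSubkey:\\s\\S++)+)");

    // Returns the run of Subkey entries following "Key: Value2",
    // or null when the input does not match.
    public static String extract(String input) {
        Matcher m = SUBKEYS.matcher(input);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sample =
            "Key: Value Key: Value2 Key: Value3 Subkey: apple Subkey: orange Key: Value 4";
        System.out.println(extract(sample));  // Subkey: apple Subkey: orange
    }
}
```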
 
?

-

Thank you so much.
 
?

-

Hi Alan, and/or any experts out there: can you help me out with this?

Similar to the earlier post using regular expression, I am trying to
extract the corresponding subkeys. However, this time it should take
into account the newline (\n,\r, etc) and '#'.

---- Start of sample string ------------------
# This is a comment line .

Key: Value
Key: Value2
Key: Value3
Subkey: apple
Subkey: orange # This marks another comment.

Key: Value4
Subkey: papaya
Subkey: # This subkey is empty

Key:Value5 # This key value has no empty space but is still valid
Subkey: watermelon
Key: Value6 # There is no blank line to separate this record
# There is no subkey

---- End of sample string ------------------

Desired results when using pattern.matcher(string) where string is

1) "Key: Value" or "Key: Value2" or "Key: Value3"
produces
Subkey: apple
Subkey: orange

2) "Key: Value4"
produces
Subkey: papaya
Subkey:

3) "Key: Value5" or "Key:Value5"
produces
Subkey: watermelon

This is way too complex for me; if it is chicken feed to you, please take
the time to help me out. Thank you so much.
 
A

Alan Moore

There's no way to do all that with one regex, if that's what you're
looking for. I suggest you read the data in one line at a time (if
it's in a file), or split it into lines with str.split("[\r\n]+").
Then write a simple state machine to process the data one line at a
time. You can still use regexes for the individual lines, and they'll
be much simpler than what we were talking about in the other thread:

Pattern keyPat = Pattern.compile("^Key:\\s*+(\\S*)");

Pattern subPat = Pattern.compile("^Subkey:\\s*+([^#\\s]*)");

Again, I'm assuming that the values can't contain whitespace; if they
can, different regexes will be needed, especially for Subkey lines.
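
A minimal sketch of the line-by-line state machine described above, using the two patterns as given. Everything else is an assumption made for illustration: the class name SubkeyParser, the parse method, and the grouping rule that consecutive Key lines share the Subkey lines that follow them (which is what the desired results in the question seem to ask for). Comment lines and blank lines simply match neither pattern and are skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubkeyParser {
    private static final Pattern KEY_PAT = Pattern.compile("^Key:\\s*+(\\S*)");
    private static final Pattern SUB_PAT = Pattern.compile("^Subkey:\\s*+([^#\\s]*)");

    // Sample data from the question.
    public static final String SAMPLE =
        "# This is a comment line .\n\n"
        + "Key: Value\nKey: Value2\nKey: Value3\n"
        + "Subkey: apple\nSubkey: orange # This marks another comment.\n\n"
        + "Key: Value4\nSubkey: papaya\nSubkey: # This subkey is empty\n\n"
        + "Key:Value5 # This key value has no empty space but is still valid\n"
        + "Subkey: watermelon\n"
        + "Key: Value6 # There is no blank line to separate this record\n"
        + "# There is no subkey\n";

    // Maps each key value to the subkey values of its record.
    public static Map<String, List<String>> parse(String text) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;     // subkeys of the record being read
        boolean lastWasSubkey = false;   // state: was the last data line a Subkey?
        for (String line : text.split("[\r\n]+")) {
            Matcher key = KEY_PAT.matcher(line);
            Matcher sub = SUB_PAT.matcher(line);
            if (key.find()) {
                // A Key line right after Subkey lines starts a new record;
                // otherwise it joins the current run of keys (assumed rule).
                if (current == null || lastWasSubkey) {
                    current = new ArrayList<>();
                }
                result.put(key.group(1), current);
                lastWasSubkey = false;
            } else if (current != null && sub.find()) {
                current.add(sub.group(1));   // may be "" for an empty subkey
                lastWasSubkey = true;
            }
            // Comment lines and anything else are ignored.
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> records = parse(SAMPLE);
        System.out.println(records.get("Value2"));  // [apple, orange]
        System.out.println(records.get("Value5"));  // [watermelon]
    }
}
```

Keys in the same run share one list object here, which keeps the sketch short; copy the list per key if the records need to be modified independently.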
 
?

-

Hi Alan, I don't think it is impossible, merely a bit more complex.
I managed to do it but am stuck at stripping out the comments,
which means I get:

Subkey: apple
Subkey: orange # This marks another comment.

rather than a:

Subkey: apple
Subkey: orange

It can easily be solved with a second pattern match or indexOf, but
I am trying to expand my knowledge by learning how to do it in just one
regex.
 
A

Alan Moore

That's what I meant when I said you can't do it all with one regex:
you can't capture all the Subkey entries together without also
capturing their comments. And even if it is possible due to some
quirk in the structure of the data, it isn't worth the *effort* to do
it with regexes. If the data structure changes, that regex you worked
so hard on will become useless.
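
As a rough illustration of the two-step route mentioned in the question, the comments can be stripped in a separate pass before any Subkey matching is done. This sketch assumes that a '#' always starts a comment and never appears inside a value; the class name CommentStripper and the method strip are made up for the example.

```java
public class CommentStripper {
    // Removes everything from '#' to the end of each line, together with the
    // spaces or tabs in front of it. '.' does not match line terminators by
    // default, so the line breaks themselves are kept.
    public static String strip(String text) {
        return text.replaceAll("[ \\t]*#.*", "");
    }

    public static void main(String[] args) {
        String block = "Subkey: apple\nSubkey: orange # This marks another comment.\n";
        System.out.print(strip(block));
        // Subkey: apple
        // Subkey: orange
    }
}
```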

The most valuable thing you can learn about regexes is that there's
only so much you can (or should) do with them. For a parsing task
like this one, you're better off using program logic in addition to
(or even instead of) regexes. But I don't want to discourage you from
learning about regexes; I love them, and use them all the time. If
you don't already have it, you should definitely get The Book:

http://www.amazon.com/exec/obidos/ASIN/0596002890/masteringregu-20
 
