Help simplify complex regexp needing positive lookahead and reluctant quantifiers

david.karr

My code is in Java, but my problem is a complicated regexp.
Ironically, I think I'm more likely to get a better response in here
than elsewhere. It's too bad there's no "regular expressions"
newsgroup (that I can find).

My sample data is the following (abstracted from real data):
--------------
*XXXlkjsflkw34lkjsfd
2XXXlkjsdfojsfjoimf344
3XXXabcdef9999999
4XXX9f9f9f9f9f9f9f9f
5XXXg8g8g8g8g8g8g8g
6XXXe6e6e6e6e6e6e6e6e
YYY=D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
ZZZ=gggggggggggg
AAA=hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
/XXX 2
--------------

The important elements are "XXX", "YYY", "ZZZ", and "AAA". The "YYY",
"ZZZ", and "AAA" entries could appear in any order, some could be
missing, and others like them could be added. What I'd like to build is a
regexp that captures each of "YYY", "ZZZ", and "AAA" along with its
"associated data", up to either the next "[A-Z]{3}=" or the ending
"/XXX". If I can get the "associated data" into group values, I can
use other regexps for the detail in those group values.

The regexp that I've built so far comes close to solving this, but not
quite. This is what I have so far (translated from Java string syntax
to Perl):

--------------
"(?sm)\\*.{3}.*\n" .
"2.{3}.*\n" .
"3.{3}.*\n" .
"4.{3}.*\n" .
"5.{3}.*\n" .
"6.{3}.*\n" .
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" .
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" .
" ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3})" .
"/[A-Z]{3}.*"
--------------

You can ignore for now the fact that I'm not verifying that all the
places that require "XXX" are all "XXX". The problem area is the
"[A-Z]{3}=" groups. This regexp works for my sample data, but I wasn't
able to simplify those three repeated lines into a single expression,
which would handle any number of those. I tried the following, to
replace those three lines:

"( ([A-Z]{3}=)(.*?)(?= [A-Z]{3}=|/[A-Z]{3}))*"

but that didn't seem to work, and I'm not sure why.
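
One possible contributor, though the post does not say exactly what went wrong: in java.util.regex, a capturing group placed inside a quantified group keeps only the text from its last iteration, so a single starred expression cannot hand back every key. The sketch below illustrates just that behavior on a deliberately simplified input; the class name, method name, and the "KEY=value;" format are made up for the illustration and are not part of the original data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows that a capture inside a repeated group
// reports only its final iteration.
class LastCaptureDemo {

    // Matches zero or more "KEY=value;" items and returns what group 1 holds
    // afterwards, or null when the whole input does not match.
    static String lastKey(String input) {
        Matcher m = Pattern.compile("(?:([A-Z]{3})=\\w+;)*").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Three items are consumed, but only the last key survives in group 1.
        System.out.println(lastKey("YYY=a;ZZZ=b;AAA=c;")); // prints AAA
    }
}
```

The earlier keys are matched but overwritten on each pass through the loop, which is why the three repeated lines expose three separate sets of groups while the starred version exposes only one.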

The following is the output from my Java program, using the working
regexp, where it iterated through the found groups. I provide this
just as another view of what I'm trying to capture:

--------------
group[YYY=]
group[D/23333333
-xxxxxxxxxxxx
-yyyyyyyyyyyy
]
group[ZZZ=]
group[gggggggggggg
]
group[AAA=]
group[hhhhhhhhhh
-jjjjjjjjjjj
-kkkkkkkkkkk
]
--------------
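
A common way to get a listing like the one above for any number of sections is to apply a single-section pattern repeatedly with Matcher.find() instead of repeating it inside one large expression. The following is only a sketch under stated assumptions: the class and method names are invented, and it assumes each key starts at the beginning of a line, as the sample appears here. The original regexp has a space in front of each key, so the "^" anchors would need adjusting if the real data is indented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls out every "KEY=" section with its associated data.
class SectionSplitter {

    // Assumption: keys and the closing "/XXX" begin at the start of a line.
    // DOTALL lets the reluctant ".*?" cross newlines; MULTILINE makes "^"
    // match at each line start, including inside the lookahead.
    private static final Pattern SECTION = Pattern.compile(
            "^([A-Z]{3}=)(.*?)(?=^[A-Z]{3}=|^/[A-Z]{3})",
            Pattern.DOTALL | Pattern.MULTILINE);

    // Returns {key, data} pairs in the order they appear in the record.
    static List<String[]> split(String record) {
        List<String[]> sections = new ArrayList<>();
        Matcher m = SECTION.matcher(record);
        while (m.find()) {                 // one iteration per section
            sections.add(new String[] { m.group(1), m.group(2) });
        }
        return sections;
    }

    public static void main(String[] args) {
        String record =
                "*XXXlkjsflkw34lkjsfd\n"
              + "2XXXlkjsdfojsfjoimf344\n"
              + "3XXXabcdef9999999\n"
              + "4XXX9f9f9f9f9f9f9f9f\n"
              + "5XXXg8g8g8g8g8g8g8g\n"
              + "6XXXe6e6e6e6e6e6e6e6e\n"
              + "YYY=D/23333333\n"
              + "-xxxxxxxxxxxx\n"
              + "-yyyyyyyyyyyy\n"
              + "ZZZ=gggggggggggg\n"
              + "AAA=hhhhhhhhhh\n"
              + "-jjjjjjjjjjj\n"
              + "-kkkkkkkkkkk\n"
              + "/XXX 2\n";
        for (String[] s : split(record)) {
            System.out.println("group[" + s[0] + "]");
            System.out.println("group[" + s[1] + "]");
        }
    }
}
```

Because the lookahead does not consume the next key, each find() call resumes exactly where the following section begins. This sketch does not validate the six header lines; that check could stay in a separate expression.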
 
Sherm Pendley

> My code is in Java, but my problem is a complicated regexp.
> Ironically, I think I'm more likely to get a better response in here
> than elsewhere. It's too bad there's no "regular expressions"
> newsgroup (that I can find).

No, but there is definitely a Java group.

I'm not just being snide - implementations of regular expressions vary. An
answer you get here may not apply to Java, and answers you get here or in a
Java group may not apply to sed, and so forth. You'd be far better off
asking your question in a group that's focused on the particular
implementation that you're using.

sherm--
 
