Need help reading a perl regexp - someone clue me?


Don Bruder

I've got a "canned" regexp I'm trying to analyze that I can't quite
follow due to one of the constructs used in it. Can anyone
translate/verify my translation for me?

Here's the segment that's throwing me (It's a very small sub-section of
a rather large and complex regexp - We're talking something on the order
of 300+ characters worth of "rather large and complex")

[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

Now, if I'm reading this rightly, and I'm not totally hopeless as far as
my understanding of Perl regexps goes, this should be looking to match
"any two letters, followed by pretty much any punctuation mark (including
parens, braces, and brackets of all flavors, but (seemingly) excluding
the "bar" (AKA "OR") character) or any of the digits 0, 1, 3, 4, 6, or
7, followed by any two letters".

How far off base am I with that interpretation?

Should I be ignoring any usual "special meaning" of the 'bar' character
when it appears as part of a square-bracketed set, and therefore taking
the overall regexp to mean that the "bar" character *IS NOT* being
excluded or used in its "special" capacity?
 

Gunnar Hjalmarsson

Don said:
[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

Should I be ignoring any usual "special meaning" of the 'bar'
character when it appears as part of a square-bracketed set, and
therefore taking the overall regexp to mean that the "bar"
character *IS NOT* being excluded or used in its "special"
capacity?

What happened when you tested it?

What you are calling a "square-bracketed set" is a character class, and
the answer is yes.
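One way to run the test suggested above is a minimal sketch like the following. The patterns `[a|b]` and `(?:a|b)` are illustrative stand-ins chosen for this example, not part of the regexp under discussion. It compares a character class containing `|` with a real alternation.

```perl
use strict;
use warnings;

# Inside [...], "|" is an ordinary character: this class matches a, |, or b.
my $class = qr/\A[a|b]\z/;

# Outside a class, "|" separates alternatives: this matches "a" or "b" only.
my $alternation = qr/\A(?:a|b)\z/;

for my $s ('a', 'b', '|') {
    printf "%-4s class: %-4s alternation: %s\n",
        "'$s'",
        ($s =~ $class       ? 'yes' : 'no'),
        ($s =~ $alternation ? 'yes' : 'no');
}
```

Only the class accepts the string "|", which is the behaviour the answer describes.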
 

J Krugman

Don Bruder said:
I've got a "canned" regexp I'm trying to analyze that I can't quite
follow due to one of the constructs used in it. Can anyone
translate/verify my translation for me?
Here's the segment that's throwing me (It's a very small sub-section of
a rather large and complex regexp - We're talking something on the order
of 300+ characters worth of "rather large and complex")
[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

Now, if I'm reading this rightly, and I'm not totally hopeless as far as
my understanding of Perl regexps goes, this should be looking to match
"any two letters, followed by pretty much any punctuation mark (including
parens, braces, and brackets of all flavors, but (seemingly) excluding
the "bar" (AKA "OR") character) or any of the digits 0, 1, 3, 4, 6, or
7, followed by any two letters".

Why exclude "|"? It's right there in the character class, and
there's no ^ at the beginning of that class, so that regexp is
*supposed* to match "AB|CD".

Most of those backslashes are superfluous, BTW. You only need the
ones before $ and ].
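The claim about superfluous backslashes can be checked mechanically. The sketch below assumes the pattern is written as a Perl regexp literal, where an unescaped `$` would be subject to variable interpolation. It compares the class as posted with a trimmed copy that keeps only the backslashes before `$` and `]`, over every printable ASCII character. The variable names are invented for this example.

```perl
use strict;
use warnings;

# The class exactly as it appears in the thread.
my $original = qr/[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"]/;

# The same class, keeping only the backslashes before "$" and "]".
my $trimmed = qr/[.,;:?%!&+^~`'\$*=#|013467()[\]{}<>"]/;

# Collect every printable ASCII character on which the two classes disagree.
my @differ = grep { ($_ =~ $original) xor ($_ =~ $trimmed) }
             map  { chr($_) } 32 .. 126;

print @differ ? "differ on: @differ\n" : "identical over printable ASCII\n";
```

If the two classes are equivalent, `@differ` ends up empty.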
 

Ben Morrow

Don Bruder said:
I've got a "canned" regexp I'm trying to analyze that I can't quite
follow due to one of the constructs used in it. Can anyone
translate/verify my translation for me?

Here's the segment that's throwing me (It's a very small sub-section of
a rather large and complex regexp - We're talking something on the order
of 300+ characters worth of "rather large and complex")

[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

Good God who wrote that?!
None of those backslashes are necessary except for the one before ].
Those [a-zA-Z] should almost certainly be [[:alpha:]].

I would strongly recommend breaking the regex up into bits as you
understand it. Assign each 'chunk' to a variable with qr//, and use /x
on the bits so you can separate things out decently. For instance,
that bit you have there can be written:

my $code = qr/[[:alpha:]]{2}/;
my $symbol = qr/[.,;:...<>"]/;

/$code $symbol $code/x;

(I'm making the entirely unjustified assumption that the two-letter
sequences are some sort of code, to illustrate that you want to give
the pieces names which reflect their function, rather than merely what
they match). See how much more readable that is?
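A runnable version of that decomposition might look like the sketch below. The names `$code` and `$symbol` follow the post. The contents of `$symbol` are an assumption: they are the full character class from the original segment with the superfluous backslashes dropped, since the post elides the middle of the class. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Names follow the post's guess that the letter pairs are some sort of code.
my $code   = qr/[[:alpha:]]{2}/;
my $symbol = qr/[.,;:?%!&+^~`'\$*=#|013467()[\]{}<>"]/;

# /x lets the pieces be spaced out; that whitespace is not part of the pattern.
my $segment = qr/$code $symbol $code/x;

for my $text ('ab.cd', 'AB|CD', 'ab cd', 'ab2cd') {
    printf "%-6s %s\n", $text, ($text =~ $segment ? 'matches' : 'no match');
}
```

Note that `[[:alpha:]]` and `[a-zA-Z]` agree on ASCII letters but can differ on non-ASCII input, so the swap is not guaranteed to be behaviour-preserving for every data set.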

Should I be ignoring any usual "special meaning" of the 'bar' character
when it appears as part of a square-bracketed set,

Yes, you should. Read perldoc perlre again. Nothing is significant in
a [] class except ] (except at the start), ^ (if at the start), -
(except at either end), and \.
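The positional rules listed above can be illustrated with a few tiny classes. This is a sketch with made-up patterns and variable names, not something taken from the regexp under discussion.

```perl
use strict;
use warnings;

# "^" negates only when it is the first character of the class.
my $negated       = qr/\A[^ab]\z/;   # any single character except a or b
my $literal_caret = qr/\A[a^b]\z/;   # a, ^, or b

# "-" forms a range only when it sits between two characters.
my $range         = qr/\A[a-c]\z/;   # a, b, or c
my $literal_dash  = qr/\A[ac-]\z/;   # a, c, or -

# "]" is literal when it comes first in the class.
my $literal_close = qr/\A[]a]\z/;    # ] or a

print "caret is literal mid-class\n"     if '^' =~ $literal_caret;
print "dash is literal at the end\n"     if '-' =~ $literal_dash;
print "bracket is literal at the start\n" if ']' =~ $literal_close;
```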

Ben
 

Don Bruder

Ben Morrow said:
Don Bruder said:
I've got a "canned" regexp I'm trying to analyze that I can't quite
follow due to one of the constructs used in it. Can anyone
translate/verify my translation for me?

Here's the segment that's throwing me (It's a very small sub-section of
a rather large and complex regexp - We're talking something on the order
of 300+ characters worth of "rather large and complex")

[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

Good God who wrote that?!

Dunno. 'Tweren't me. I'm just trying to understand it.

None of those backslashes are necessary except for the one before ].
Those [a-zA-Z] should almost certainly be [[:alpha:]].

Tell the person who wrote it, not me! :) The extra backslashes aren't
what's giving me the problem, though. (FWIW, I'm not a Perl programmer,
and the regexp in question is part of another package that is not, to my
knowledge, written in Perl. Questions about the syntax of its regexps
are referred to the Perl regexp documentation, though, so either the
package is written in Perl and I'm only seeing a piece of one of the
plugins (very likely), or they coded it to the Perl regexp standard
because that was easier than "scratch-building" their own regexp
package.)

I would strongly recommend breaking the regex up into bits as you
understand it.

Been doing pretty much that as I walked through it.

Assign each 'chunk' to a variable with qr//, and use /x
on the bits so you can separate things out decently. For instance,
that bit you have there can be written:

my $code = qr/[[:alpha:]]{2}/;
my $symbol = qr/[.,;:...<>"]/;

/$code $symbol $code/x;

(I'm making the entirely unjustified assumption that the two-letter
sequences are some sort of code, to illustrate that you want to give
the pieces names which reflect their function, rather than merely what
they match). See how much more readable that is?

Agreed on the readability. But since I'm not interested in trying to
"tweak" it or anything like that - only UNDERSTAND it - I'll be leaving
it "as-is".

Yes, you should.

Bingo. That's the answer I needed, and cleared a big part of the "fog" I
was stumbling around in. Now to figure out why only "013467" in the list
of digits... I could easily understand ALL digits, but having only that
particular sub-set of digits just doesn't seem to make any sense, either
on the surface, or in the context of what I know it's *SUPPOSED* to be
doing.
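To see exactly what the class leaves out, a short sketch such as the following lists the digits and the ASCII punctuation characters it rejects. It does not explain why those particular digits were chosen. It only enumerates the behaviour, and the variable names are invented for this example.

```perl
use strict;
use warnings;

# The class exactly as it appears in the regexp.
my $class = qr/[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"]/;

# Digits that the class does not accept.
my @rejected_digits = grep { $_ !~ $class } 0 .. 9;

# Printable ASCII characters that are neither letters nor digits
# and that the class does not accept.
my @rejected_punct = grep { $_ =~ /[^A-Za-z0-9]/ && $_ !~ $class }
                     map  { chr($_) } 33 .. 126;

print "rejected digits:      @rejected_digits\n";
print "rejected punctuation: @rejected_punct\n";
```

By this enumeration the class skips the digits 2, 5, 8, and 9, and the punctuation characters `-`, `/`, `@`, `\`, and `_`.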

In case anybody's interested, here's the full regexp that I'm trying to
understand:

(Beware of line-wrap - there are no literal space/carriage
return/linefeed characters in the string other than the regulation CR/LF
pair at the very end, following the "/i")

/\s(?!(?:fn|re):|(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)[a
-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}(?<!\.(?:
(?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es|f[ir]|g[uz]|h[kn
rtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]|p[lmrty]|ru|s[eghm]|t[cnv]|
u[ksu]|v[gi])|:no|['`"](?:ed|ll|[rv]e))(?:[,'\?!]|\.?\s)/i

Its "advertised purpose" is to go through a block of text looking for a
string consisting of
"<space><alpha-char><alpha-char><period><alpha-char><alpha-char><space>",
with no interest in whether the two letters on either side of the
period are upper or lower case.

It appears (from my analysis - which may be in error) that several
two-character top-level internet domain names (.us, .uk, .se, .cn, .br,
.ru, and quite a few others), a small handful of common filename
extensions (.db, .gz, .js, etc.), and a few other two-letter combinations
(dr., mr./ms., etc.) are special-cased to exclude them from causing a
match. It *DOES* work as advertised, so that's not at issue. I'm not
trying to debug it, tweak it, or otherwise mess with it. I just wanted
to know how/why it was doing what it did before I changed what happens
if/when it finds a match (and potentially broke something due to not
understanding exactly what was being matched).
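For anyone who wants to poke at the full regexp, here is a minimal sketch that rebuilds it from single-quoted chunks (split only to keep the lines short) and tries it on a few sample strings. The sample strings and the names `$source`, `$full`, and `@samples` are made up for illustration. It also assumes the pattern is being run by Perl itself, which the thread does not confirm.

```perl
use strict;
use warnings;

# The thread's regexp, split into chunks only to keep the lines short.
# q/.../ does not interpolate, so "\$" reaches the regexp engine unchanged.
my $source = join '',
    q/\s(?!(?:fn|re):|(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)/,
    q/[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}/,
    q/(?<!\.(?:(?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es|f[ir]|/,
    q/g[uz]|h[knrtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]|p[lmrty]|ru|/,
    q/s[eghm]|t[cnv]|u[ksu]|v[gi])|:no|['`"](?:ed|ll|[rv]e))/,
    q/(?:[,'\?!]|\.?\s)/;
my $full = qr/$source/i;

# Hypothetical inputs; each has surrounding spaces so it can be a candidate.
my @samples = (' ab.cd ', ' ab|cd ', ' ab2cd ', ' co.uk ', ' mr.xy ', ' ab.Cd ');

for my $text (@samples) {
    printf "%-9s %s\n", "'$text'", ($text =~ $full ? 'match' : 'no match');
}
```

Of these samples, only ' ab.cd ' and ' ab|cd ' should match. ' ab2cd ' fails on the digit, ' co.uk ' is caught by the look-behind list, ' mr.xy ' by the look-ahead list, and ' ab.Cd ' by the case-sensitive `(?-i:[A-Z][a-z]{1})` branch, which rejects a capital letter followed by a lowercase one after a period even though the rest of the pattern ignores case.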
 
