Need help reading a perl regexp - someone clue me?

Discussion in 'Perl Misc' started by Don Bruder, Feb 4, 2004.

  1. Don Bruder

    Don Bruder Guest

    I've got a "canned" regexp I'm trying to analyze that I can't quite
    follow due to one of the constructs used in it. Can anyone
    translate/verify my translation for me?

    Here's the segment that's throwing me (It's a very small sub-section of
    a rather large and complex regexp - We're talking something on the order
    of 300+ characters worth of "rather large and complex")

    [a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

    Now, if I'm reading rightly, and I'm not totally hopeless as far as my
    understanding of perl regexps goes, this should be looking to match "any
    two letters followed by pretty much any punctuation mark (including
    parens, braces, and brackets of all flavors, but (seemingly) excluding
    the "bar" (AKA "OR") character) or any of the digits 0, 1, 3, 4, 6, or
    7, followed by any two letters."

    How far off base am I with that interpretation?

    Should I be ignoring any usual "special meaning" of the 'bar' character
    when it appears as part of a square-bracketed set, and therefore taking
    the overall regexp to mean that the "bar" character *IS NOT* being
    excluded or used in its "special" capacity?

    --
    Don Bruder - <--- Preferred Email - SpamAssassinated.
    Hate SPAM? See <http://www.spamassassin.org> for some seriously great info.
    I will choose a path that's clear: I will choose Free Will! - N. Peart
    Fly trap info pages: <http://www.sonic.net/~dakidd/Horses/FlyTrap/index.html>
     
    Don Bruder, Feb 4, 2004
    #1

  2. Gunnar Hjalmarsson

    Don Bruder wrote:
    >
    > [a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}


    <snip>

    > Should I be ignoring any usual "special meaning" of the 'bar'
    > character when it appears as part of a square-bracketed set, and
    > therefore taking the overall regexp to mean that the "bar"
    > character *IS NOT* being excluded or used in its "special"
    > capacity?


    What happened when you tested it?

    What you are calling a "square-bracketed set" is a character class, and
    the answer is yes.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Feb 4, 2004
    #2

  3. J Krugman

    J Krugman Guest

    In <A0aUb.12559$> Don Bruder <> writes:

    >I've got a "canned" regexp I'm trying to analyze that I can't quite
    >follow due to one of the constructs used in it. Can anyone
    >translate/verify my translation for me?


    >Here's the segment that's throwing me (It's a very small sub-section of
    >a rather large and complex regexp - We're talking something on the order
    >of 300+ characters worth of "rather large and complex")


    >[a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}


    >Now, if I'm reading rightly, and I'm not totally hopeless as far as my
    >understanding of perl regexps goes, this should be looking to match "any
    >two letters followed by pretty much any punctuation mark (including
    >parens, braces, and brackets of all flavors, but (seemingly) excluding
    >the "bar" (AKA "OR") character) or any of the digits 0, 1, 3, 4, 6, or
    >7, followed by any two letters.


    Why exclude "|"? It's right there in the character class, and
    there's no ^ at the beginning of that class, so that regexp is
    *supposed* to match "AB|CD".

    Most of those backslashes are superfluous, BTW. You only need the
    ones before $ and ].
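
    Both points are easy to check. A minimal sketch follows; the variable
    names are made up for illustration, and $stripped is simply the original
    class with every backslash dropped except the ones before $ and ].

    ```perl
    use strict;
    use warnings;

    # The punctuation class exactly as it appears in the original regexp.
    my $original = qr/[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"]/;

    # The same class keeping only the backslashes before $ and ].
    my $stripped = qr/[.,;:?%!&+^~`'\$*=#|013467()[\]{}<>"]/;

    # Two letters, one character from the class, two letters. Under /x the
    # spaces between the interpolated pieces are ignored.
    my $letters = qr/[a-zA-Z]{2}/;
    my $segment = qr/$letters $original $letters/x;

    print "AB|CD matches\n" if 'AB|CD' =~ $segment;

    # Compare the two classes over every printable ASCII character.
    my @differences = grep { ($_ =~ $original) xor ($_ =~ $stripped) }
                      map  { chr($_) } 32 .. 126;
    my $verdict = @differences ? "classes differ on: @differences\n"
                               : "classes agree\n";
    print $verdict;
    ```

    If the two classes agree on every character, the extra backslashes are
    noise rather than part of the meaning.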
     
    J Krugman, Feb 4, 2004
    #3
  4. Ben Morrow

    Ben Morrow Guest

    Don Bruder <> wrote:
    >
    > I've got a "canned" regexp I'm trying to analyze that I can't quite
    > follow due to one of the constructs used in it. Can anyone
    > translate/verify my translation for me?
    >
    > Here's the segment that's throwing me (It's a very small sub-section of
    > a rather large and complex regexp - We're talking something on the order
    > of 300+ characters worth of "rather large and complex")
    >
    > [a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}


    Good God who wrote that?!
    None of those backslashes are necessary except for the one before ].
    Those [a-zA-Z] should almost certainly be [[:alpha:]].

    I would strongly recommend breaking the regex up into bits as you
    understand it. Assign each 'chunk' to a variable with qr//, and use /x
    on the bits so you can separate things out decently. For instance,
    that bit you have there can be written:

    my $code = qw/[[:alpha:]]{2}/;
    my $symbol = qr/[.,;:...<>"]/;

    /$code $symbol $code/x;

    (I'm making the entirely unjustified assumption that the two-letter
    sequences are some sort of code, to illustrate that you want to give
    the pieces names which reflect their function, rather than merely what
    they match). See how much more readable that is?
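
    For anyone who wants to try the idea, here is a runnable sketch. It
    assumes the symbol class is the one from the original regexp with the
    unneeded backslashes removed; the names $code and $symbol and the test
    strings are illustrative only.

    ```perl
    use strict;
    use warnings;

    # Hypothetical piece names: whether the two-letter runs really are
    # "codes" is an assumption made purely for illustration.
    my $code   = qr/[[:alpha:]]{2}/;
    my $symbol = qr/[.,;:?%!&+^~`'\$*=#|013467()[\]{}<>"]/;

    # Under /x the spaces between the pieces are ignored by the regexp
    # engine, so they only help the human reader.
    my $segment = qr/$code $symbol $code/x;

    for my $text ('AB|CD', 'ab.cd', 'ab cd', 'ab2cd') {
        printf "%-5s %s\n", $text,
            ($text =~ $segment ? 'matches' : 'does not match');
    }
    ```

    The first two strings should match. The last two should not, because
    neither a space nor the digit 2 is in the symbol class.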

    > Should I be ignoring any usual "special meaning" of the 'bar' character
    > when it appears as part of a square-bracketed set,


    Yes, you should. Read perldoc perlre again. Nothing is significant in
    a [] class except ] (except at the start), ^ (if at the start), -
    (except at either end), and \.
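
    A small sketch that exercises those rules one at a time. The classes
    and test strings are invented for the purpose, not taken from the
    regexp under discussion.

    ```perl
    use strict;
    use warnings;

    # Each row: a class anchored to a single character, a test string,
    # and a note on which rule it exercises.
    my @checks = (
        [ qr/^[|*+?.()]$/, '|', '| and friends are literal in a class' ],
        [ qr/^[^ab]$/,     'c', '^ first negates: c is not a or b'     ],
        [ qr/^[^ab]$/,     'a', '^ first negates: a is excluded'       ],
        [ qr/^[ab^]$/,     '^', '^ elsewhere is a literal caret'       ],
        [ qr/^[-az]$/,     '-', '- at the start is a literal dash'     ],
        [ qr/^[-az]$/,     'm', '- at the start makes no range'        ],
        [ qr/^[a-z]$/,     'm', '- in the middle makes a range'        ],
    );

    # Record 1 for a match and 0 otherwise, then print one line per row.
    my @results = map { [ $_->[2], ($_->[1] =~ $_->[0] ? 1 : 0) ] } @checks;
    printf "%-40s %s\n", $_->[0], ($_->[1] ? 'match' : 'no match')
        for @results;
    ```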

    Ben

    --
    Musica Dei donum optimi, trahit homines, trahit deos. |
    Musica truces mollit animos, tristesque mentes erigit. |
    Musica vel ipsas arbores et horridas movet feras. |
     
    Ben Morrow, Feb 4, 2004
    #4
  5. Ben Morrow

    Ben Morrow Guest

    Ben Morrow <> wrote:
    > Don Bruder <> wrote:
    >
    > > [a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

    >
    > Good God who wrote that?!
    > None of those backslashes are necessary except for the one before
    > ]...


    ...and the one before $.

    > my $code = qw/[[:alpha:]]{2}/;
                  ^ r
    Apologies.

    Ben

    --
    Heracles: Vulture! Here's a titbit for you / A few dried molecules of the gall
    From the liver of a friend of yours. / Excuse the arrow but I have no spoon.
    (Ted Hughes, [ Heracles shoots Vulture with arrow. Vulture bursts into ]
    /Alcestis/) [ flame, and falls out of sight. ]
     
    Ben Morrow, Feb 4, 2004
    #5
  6. Don Bruder

    Don Bruder Guest

    In article <bvrh6d$21n$>,
    Ben Morrow <> wrote:

    > Don Bruder <> wrote:
    > >
    > > I've got a "canned" regexp I'm trying to analyze that I can't quite
    > > follow due to one of the constructs used in it. Can anyone
    > > translate/verify my translation for me?
    > >
    > > Here's the segment that's throwing me (It's a very small sub-section of
    > > a rather large and complex regexp - We're talking something on the order
    > > of 300+ characters worth of "rather large and complex")
    > >
    > > [a-zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}

    >
    > Good God who wrote that?!


    Dunno. 'Tweren't me. I'm just trying to understand it.

    > None of those backslashes are necessary except for the one before ].
    > Those [a-zA-Z] should almost certainly be [[:alpha:]].


    Tell the person who wrote it, not me! :) Extra backslashes aren't giving
    me the problem, though. (FWIW, I'm not a Perl programmer, and the regexp
    in question is part of another package, not (to my knowledge) Perl, but
    questions regarding the syntax of such regexps are referred to the Perl
    Regexp documentation - Either the package is written in Perl, and I'm
    only seeing a piece of one of the plugins (very likely) or they coded it
    to the Perl regexp standard since it was easier than "scratch-building"
    their own regexp package)

    > I would strongly recommend breaking the regex up into bits as you
    > understand it.


    Been doing pretty much that as I walked through it.

    > Assign each 'chunk' to a variable with qr//, and use /x
    > on the bits so you can separate things out decently. For instance,
    > that bit you have there can be written:
    >
    > my $code = qw/[[:alpha:]]{2}/;
    > my $symbol = qr/[.,;:...<>"]/;
    >
    > /$code $symbol $code/x;
    >
    > (I'm making the entirely unjustified assumption that the two-letter
    > sequences are some sort of code, to illustrate that you want to give
    > the pieces names which reflect their function, rather than merely what
    > they match). See how much more readable that is?


    Agreed on the readability. But since I'm not interested in trying to
    "tweak" it or anything like that - only UNDERSTAND it - I'll be leaving
    it "as-is".


    > > Should I be ignoring any usual "special meaning" of the 'bar' character
    > > when it appears as part of a square-bracketed set,

    >
    > Yes, you should.


    Bingo. That's the answer I needed, and cleared a big part of the "fog" I
    was stumbling around in. Now to figure out why only "013467" in the list
    of digits... I could easily understand ALL digits, but having only that
    particular sub-set of digits just doesn't seem to make any sense, either
    on the surface, or in the context of what I know it's *SUPPOSED* to be
    doing.

    In case anybody's interested, here's the full regexp that I'm trying to
    understand:

    (Beware of line-wrap - there are no literal space/carriage
    return/linefeed characters in the string other than the regulation CR/LF
    pair at the very end, following the "/i")

    /\s(?!(?:fn|re):|(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)[a
    -zA-Z]{2}[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"][a-zA-Z]{2}(?<!\.(?:
    (?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es|f[ir]|g[uz]|h[kn
    rtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]|p[lmrty]|ru|s[eghm]|t[cnv]|
    u[ksu]|v[gi])|:no|['`"](?:ed|ll|[rv]e))(?:[,'\?!]|\.?\s)/i

    Its "advertised purpose" is to go through a block of text looking for a
    string consisting of
    "<space><alpha-char><alpha-char><period><alpha-char><alpha-char><space>",
    with no interest in whether the two letters on either side of the
    period are upper or lower case.

    It appears (from my analysis - which may be in error) that several
    two-character top level internet domain names (.us, .uk, .se, .cn, .br,
    .ru, and quite a few others), a small handful of common filename
    extensions (.db, .gz, .js, etc), and a few other two-letter combinations
    (dr., mr./ms., etc) are special-cased to exclude them from causing a
    match. It *DOES* work as advertised, so that's not at issue. I'm not
    trying to debug it, tweak it, or otherwise mess with it; I just wanted
    to know how/why it was doing what it did before I changed the behavior
    that happens if/when it finds a match (and potentially broke something
    due to not understanding exactly what was being matched).
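
    One way to check that reading, as a sketch only: the regexp is rebuilt
    from pieces split where the analysis above divides it, then run against
    a few sample sentences. The piece names and the sentences are invented
    for illustration and are not part of the original package.

    ```perl
    use strict;
    use warnings;

    # q/.../ keeps every backslash exactly as written in the original.
    my $lead_in     = q/\s(?!(?:fn|re):|(?:cc|to)=|(?:ma|qu|un)[`'"]|(?:dr|m[rst]|li|st|td)\.)/;
    my $two_letters = q/[a-zA-Z]{2}/;
    my $separator   = q/[.,\;:?%!&+^~`'\$*=\#|013467\(\)\[\]\{\}<>"]/;
    my $exclusions  = q/(?<!\.(?:(?-i:[A-Z][a-z]{1})|a[eiu]|b[ebmrsz]|c[afhnrx]|d[bek]|es|f[ir]|g[uz]/
                    . q/|h[knrtu]|i[elnqrst]|j[mops]|k[prwy]|m[ckx]|n[loz]|p[lmrty]|ru|s[eghm]|t[cnv]/
                    . q/|u[ksu]|v[gi])|:no|['`"](?:ed|ll|[rv]e))/;
    my $trailer     = q/(?:[,'\?!]|\.?\s)/;

    # Glue the pieces back together and compile once, case-insensitively.
    my $pattern = join '', $lead_in, $two_letters, $separator,
                           $two_letters, $exclusions, $trailer;
    my $full    = qr/$pattern/i;

    for my $text (
        'see ab.cd here',          # plain letter pairs around a period
        'mail me at xx.uk today',  # ".uk" is on the look-behind list
        'ask mr.ed now',           # "mr." is rejected by the look-ahead
        'type AB|CD now',          # "|" is an ordinary class member
        'see ab2cd here',          # 2 is not among the digits 013467
    ) {
        printf "%-24s %s\n", $text, ($text =~ $full ? 'match' : 'no match');
    }
    ```

    Run as written, the first and fourth sentences should report a match
    and the other three should not, which agrees with the special-casing
    described above.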

    --
    Don Bruder - <--- Preferred Email - SpamAssassinated.
    Hate SPAM? See <http://www.spamassassin.org> for some seriously great info.
    I will choose a path that's clear: I will choose Free Will! - N. Peart
    Fly trap info pages: <http://www.sonic.net/~dakidd/Horses/FlyTrap/index.html>
     
    Don Bruder, Feb 4, 2004
    #6
