Regular Expressions & pattern matching (mis)understanding

Discussion in 'Perl Misc' started by Robert Stelmack, Feb 16, 2004.

  1. I have embedded some marked text in a large file to indicate chapters and
    page numbers. I want to read that file and strip out the page number to be
    displayed with the line of text or following text so as to have a page
    number reference for the displayed text. I have read and reread the perlre
    and looked for examples on the Internet, but I must be missing a basic
    concept. I also have O'Reilly's Programming Perl book, but the examples are
    sometimes hard to apply to what I am trying to do. Here is what I tried to
    get to work (with various syntax changes):

    #!/usr/bin/perl
    $_= "He had come to pass his experience along to me - if <page>10</page>I
    cared to have it.";
    $PAGE =~ /[<page>][0-9][<\/page>]/;
    s,[<page>][0-9][</page>],,g;
    printf "(p.$PAGE) contains [$_]:\n";

    The output I expected was:

    (p.10) contains [He had come to pass his experience along to me - if I cared
    to have it.]

    ...but instead it displayed:

    bash-2.05b$ test.cgi
    (p.) contains [He had come to pass his experience along to me - if
    <page>10</page>I cared to have it.]:


    I really want to get my head around pattern matching and binding since all
    my working code looks too much like my old FORTRAN programs.

    Cheers,

    Bob
    Robert Stelmack, Feb 16, 2004
    #1

  2. On Mon, 16 Feb 2004 12:08:17 +0000, Robert Stelmack wrote:


    > The output I expected was:
    >
    > (p.10) contains [He had come to pass his experience along to me - if I cared
    > to have it.]
    >
    > ...but instead it displayed:
    >
    > bash-2.05b$ test.cgi
    > (p.) contains [He had come to pass his experience along to me - if
    > <page>10</page>I cared to have it.]:


    Completely untested and probably wrong, but it might give you a kick in
    the right direction:

    /^(.*)<page>([0-9]+)</page>(.*)/;
    $page = $2;
    $rest = $1 . $3;
    print "(p.$page) contains $rest\n";

    > I really want to get my head around pattern matching and binding since all
    > my working code looks too much like my old FORTRAN programs.


    Look for capturing in the regexp chapter of your favorite perl book

    --
    NPV

    "the large print giveth, and the small print taketh away"
    Tom Waits - Step right up
    Nils Petter Vaskinn, Feb 16, 2004
    #2

  3. Robert Stelmack wrote:
    >
    > #!/usr/bin/perl
    > $_= "He had come to pass his experience along to me -
    > if <page>10</page>I cared to have it.";
    > $PAGE =~ /[<page>][0-9][<\/page>]/;


    Several mistakes in that line.

    - You need parentheses around $PAGE to enforce list context.
    - You need the '=' operator to assign the captured string.
    - Brackets are for character classes, and shall not be used around
    <page> etc.
    - You need [0-9]+ so that it matches one or more digits.
    - You need to surround the latter with parentheses to capture it.

    This is what I suppose you mean:

    ($PAGE) = /<page>([0-9]+)<\/page>/;
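
    A minimal sketch of why the parentheses matter, assuming the same
    <page> markers as in the original post. The variable names $text,
    $matched and $page are only illustrative. In scalar context the match
    returns a true/false value; in list context it returns the captured
    substrings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "if <page>10</page>I cared to have it.";

# Scalar context: the match returns true/false, not the captured text.
my $matched = $text =~ /<page>([0-9]+)<\/page>/;

# List context (note the parentheses): the match returns the captures.
my ($page) = $text =~ /<page>([0-9]+)<\/page>/;

print "scalar context gives: $matched\n";   # 1
print "list context gives:   $page\n";      # 10
```

    Forgetting the parentheses is easy to miss because the scalar result
    (1) still looks like a plausible page number.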

    > s,[<page>][0-9][</page>],,g;


    This is what I suppose you mean:

    s,<page>[0-9]+</page>,,;

    (not sure why you are using the /g modifier)

    > printf "(p.$PAGE) contains [$_]:\n";


    No need to use printf(). print() is sufficient.

    print "(p.$PAGE) contains [$_]:\n";

    But the $PAGE variable is redundant. Instead you can do:

    s,<page>([0-9]+)</page>,,;
    print "(p.$1) contains [$_]:\n";

    HTH

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 16, 2004
    #3
  4. Robert Stelmack <robert.h.stelmack*REMOVE*@boeing.com> wrote:

    > I have read and reread the perlre



    Have you read the regex _tutorial_ too?

    perldoc perlretut


    > but I must be missing a basic
    > concept.



    You are missing multiple concepts simultaneously. See below.


    > I also have O'Reilly's Programming Perl book,



    The best book for extracting the text-processing power from regexes is:

    "Mastering Regular Expressions" (2nd edition) O'Reilly


    > #!/usr/bin/perl



    You should ask for all the help you can get:

    use strict;
    use warnings;

    Have you seen the Posting Guidelines that are posted here frequently?


    > $_= "He had come to pass his experience along to me - if <page>10</page>I
    > cared to have it.";



    It is absolutely essential that we have exactly the same string as
    you if we are to help you with matching that string.

    Consider wrapping long strings yourself (in valid Perl) so your
    newsreader won't break stuff for you.


    > $PAGE =~ /[<page>][0-9][<\/page>]/;



    1) you are trying to match the pattern against the string contained
    in $PAGE, but the string is really in $_.
    (perl would have warned you about that if you had asked it to...)

    2) a "character class" matches a _single_ character.
    [<page>] is exactly equivalent to [aegp<>] since the
    listed characters are the same.

    3) your pattern will match only single-digit numbers, you need to
    allow multiple digit characters between the "tags".

    4) you need "capturing parentheses" around the page number digits
    if you want access to them later.

    5) you don't need the m// at all if you are going to s/// with
    the same pattern. s/// does nothing if the match fails.

    6) the \d shortcut char class matches the same chars as [0-9].
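
    The points above can be checked with a short script. This is only a
    sketch: $old is the pattern from the original post, $new is one
    possible corrected form, and "a1e" is a made-up junk string used to
    show what the character classes really accept.

```perl
use strict;
use warnings;

my $old = qr/[<page>][0-9][<\/page>]/;   # the pattern from the original post
my $new = qr/<page>(\d+)<\/page>/;       # literal tags, one or more digits, captured

# A character class matches exactly one character out of the listed set.
my @hits = grep { /^[<page>]$/ } qw( a e g p < > x 1 );
print "single characters matched by [<page>]: @hits\n";

my $tagged = "me - if <page>10</page>I cared";
my $old_on_tagged = $tagged =~ $old ? "matches" : "does not match";
my $old_on_junk   = "a1e"   =~ $old ? "matches" : "does not match";
print "old pattern $old_on_tagged the tagged text\n";
print "old pattern $old_on_junk 'a1e'\n";

# List context returns the captured digits.
my ($page) = $tagged =~ $new;
print "new pattern captures page $page\n";
```

    The old pattern wants one class character, exactly one digit, then one
    class character, so the two-digit "10" between ">" and "<" never fits.
    That is why the tags were left in place in the original output.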


    > s,[<page>][0-9][</page>],,g;



    After this statement all of the "tags" will be gone, and it will
    be "too late" to apply further processing to them (such as print()).


    > printf "(p.$PAGE) contains [$_]:\n";



    You should use print() unless you make use of the formatting
    that printf() provides.


    > The output I expected was:
    >
    > (p.10) contains [He had come to pass his experience along to me - if I cared
    > to have it.]



    ----------------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;

    $_= "He had come to pass his experience along to me - if "
      . "<page>10</page>I cared to have it.";

    while ( s,<page>(\d+)</page>,,g ) {
        print "(p.$1) contains [$_]:\n";
    }
    ----------------------------------
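
    For the larger task described in the first post (a page reference for
    every displayed line), a rough sketch might look like the following.
    It assumes that a marker applies to the line it appears on and to all
    following lines until the next marker. The @lines array is
    hypothetical sample data standing in for the real file, and '?' is an
    arbitrary placeholder for text seen before any marker.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the large file.
my @lines = (
    "He had come to pass his experience along to me - if "
        . "<page>10</page>I cared to have it.",
    "A line with no marker keeps the last page number seen.",
    "<page>11</page>A new page starts here.",
);

my $page = '?';   # placeholder until the first marker is seen
my @out;
for my $line (@lines) {
    # Remember the last marker on the line, then strip every marker.
    $page = $1 while $line =~ /<page>(\d+)<\/page>/g;
    $line =~ s/<page>\d+<\/page>//g;
    push @out, "(p.$page) contains [$line]";
}
print "$_\n" for @out;
```

    Reading from a real file would replace the array with a
    while (<$fh>) loop; the matching and stripping steps stay the same.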


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Feb 16, 2004
    #4
  5. Nils Petter Vaskinn wrote:

    > and probably wrong


    > /^(.*)<page>([0-9]+)</page>(.*)/;
    > $page = $2;
    > $rest = $1 . $3;



    Using the dollar-digit variables without first ensuring that
    the match succeeded is indeed wrong.
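
    A small sketch of what goes wrong, using two made-up strings ($first
    and $second are illustrative names). A failed match does not reset
    $1, so it still holds the value from the last successful match in the
    same scope.

```perl
use strict;
use warnings;

my $first  = "see <page>10</page> here";
my $second = "no marker on this line";

$first =~ /<page>(\d+)<\/page>/;       # succeeds
my $page_a = $1;

$second =~ /<page>(\d+)<\/page>/;      # fails, but $1 is left untouched
my $page_b = $1;                       # stale value from the first match

# Guarded version: only look at $1 when the match succeeded.
my $page_c = $second =~ /<page>(\d+)<\/page>/ ? $1 : undef;

print "unguarded: $page_b\n";          # unguarded: 10
print "guarded:   ", defined $page_c ? $page_c : "(no page marker)", "\n";
```

    The unguarded version silently reports page 10 for a line that has no
    marker at all.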


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Feb 16, 2004
    #5
  6. Uri Guttman

    >>>>> "GH" == Gunnar Hjalmarsson writes:
    GH> Robert Stelmack wrote:
    >> #!/usr/bin/perl
    >> $_= "He had come to pass his experience along to me -
    >> if <page>10</page>I cared to have it.";
    >> $PAGE =~ /[<page>][0-9][<\/page>]/;



    GH> Several mistakes in that line.

    GH> - You need parentheses around $PAGE to enforce list context.

    why is list context needed?

    GH> This is what I suppose you mean:

    GH> ($PAGE) = /<page>([0-9]+)<\/page>/;

    and why do you have list context there? you never use the grabbed
    results in that line.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Feb 16, 2004
    #6
  7. Uri Guttman wrote:
    >>>>>>"GH" == Gunnar Hjalmarsson writes:

    > GH> Robert Stelmack wrote:
    > >> #!/usr/bin/perl
    > >> $_= "He had come to pass his experience along to me -
    > >> if <page>10</page>I cared to have it.";
    > >> $PAGE =~ /[<page>][0-9][<\/page>]/;

    >
    > GH> Several mistakes in that line.
    >
    > GH> - You need parentheses around $PAGE to enforce list context.
    >
    > why is list context needed?
    >
    > GH> This is what I suppose you mean:
    >
    > GH> ($PAGE) = /<page>([0-9]+)<\/page>/;
    >
    > and why do you have list context there? you never use the grabbed
    > results in that line.


    No, but two lines further down, OP's code presupposes that $PAGE
    contains the page number. After having suggested a minimum of changes
    to OP's code, I also mentioned that the whole line is redundant, and
    that the page number well can be captured in the s/// operator.

    What's your message, Uri?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Feb 16, 2004
    #7
  8. Uri Guttman

    >>>>> "GH" == Gunnar Hjalmarsson writes:

    GH> Uri Guttman wrote:
    >>>>>>> "GH" == Gunnar Hjalmarsson writes:


    >> >> $PAGE =~ /[<page>][0-9][<\/page>]/;


    GH> - You need parentheses around $PAGE to enforce list context.

    >> why is list context needed?


    well, without any grabs nor assignment, list context is meaningless
    there. the line has =~.

    GH> This is what I suppose you mean:
    GH> ($PAGE) = /<page>([0-9]+)<\/page>/;

    >> and why do you have list context there? you never use the grabbed
    >> results in that line.


    GH> What's your message, Uri?

    i didn't see the change from =~ to = in that line. so it was more than
    just your previous comment about list context being needed.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Feb 16, 2004
    #8
  9. On Mon, 16 Feb 2004 08:38:27 -0600, Tad McClellan wrote:

    > Nils Petter Vaskinn wrote:
    >
    >> and probably wrong

    >
    >> /^(.*)<page>([0-9]+)</page>(.*)/;
    >> $page = $2;
    >> $rest = $1 . $3;

    >
    >
    > Using the dollar-digit variables without first ensuring that
    > the match succeeded is indeed wrong.


    And i should probably have escaped that '/'

    if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {
        $page = $2;
        $rest = $1 . $3;
    }

    --
    NPV

    "the large print giveth, and the small print taketh away"
    Tom Waits - Step right up
    Nils Petter Vaskinn, Feb 17, 2004
    #9
  10. Ben Morrow

    Nils Petter Vaskinn wrote:
    >
    > And i should probably have escaped that '/'
    >
    > if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {


    Bleech!

    if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

    Ben

    --
    Musica Dei donum optimi, trahit homines, trahit deos. |
    Musica truces molit animos, tristesque mentes erigit. |
    Musica vel ipsas arbores et horridas movet feras. |
    Ben Morrow, Feb 17, 2004
    #10
  11. Ben Morrow writes:

    > Nils Petter Vaskinn wrote:
    > >
    > > if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

    >
    > Bleech!
    >
    > if ( m|^(.*)<page>(\d+)</page>(.*)| ) {


    I think the ability to use alternate quoting delimiters is often
    overrated. I don't like to increase the amount of contextual
    information needed to understand what I'm looking at. In the case of
    a long regex I don't want to have to remember that it's using some
    delimiter other than /. Compared to the tiny effort of the extra
    keystroke to escape each / I don't think the small loss of readability
    is justified.

    --
    \\ ( )
    . _\\__[oo
    .__/ \\ /\@
    . l___\\
    # ll l\\
    ###LL LL\\
    Brian McCauley, Feb 18, 2004
    #11
  12. Also sprach Brian McCauley:

    > Ben Morrow writes:
    >
    >> Nils Petter Vaskinn wrote:
    >> >
    >> > if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

    >>
    >> Bleech!
    >>
    >> if ( m|^(.*)<page>(\d+)</page>(.*)| ) {

    >
    > I think the ability to use alternate quoting delimiters is often
    > overrated. I don't like to increase the amount of contextual
    > information needed to understand what I'm looking at. In the case of
    > a long regex I don't want to have to remember that it's using some
    > delimiter other than /. Compared to the tiny effort of the extra
    > keystroke to escape each / I don't think the small loss of readability
    > is justified.


    And particularly, using a delimiter that has special meaning in regexps
    (such as '|') is always a bad idea. Under these circumstances I prefer
    to use '!' or '#' or in fact any character that visually stands out and
    is not meta in its semantics.
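
    A sketch of the practical difference, assuming a second, purely
    hypothetical <chapter> marker alongside <page>. With '|' as the
    delimiter, an unescaped '|' inside the pattern would end the pattern
    instead of meaning alternation. With '!' the pattern reads normally.

```perl
use strict;
use warnings;

my $text = "if <page>10</page>I cared";

# '!' is not a regex metacharacter, so '/' needs no escape and
# '|' keeps its usual meaning of alternation inside the pattern.
my ($kind, $num) = $text =~ m!<(page|chapter)>(\d+)</\1>!;

print "$kind $num\n";   # page 10
```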

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Feb 18, 2004
    #12
  13. Uri Guttman

    >>>>> "BM" == Brian McCauley writes:

    BM> Ben Morrow writes:
    >> Nils Petter Vaskinn wrote:
    >> >
    >> > if ( /^(.*)<page>([0-9]+)<\/page>(.*)/ ) {

    >>
    >> Bleech!
    >>
    >> if ( m|^(.*)<page>(\d+)</page>(.*)| ) {


    BM> I think the ability to use alternate quoting delimiters is often
    BM> overrated. I don't like to increase the amount of contextual
    BM> information needed to understand what I'm looking at. In the case of
    BM> a long regex I don't want to have to remember that it's using some
    BM> delimiter other than /. Compared to the tiny effort of the extra
    BM> keystroke to escape each / I don't think the small loss of readability
    BM> is justified.

    i have to disagree. i find \ annoying to see when it is not needed. just
    choose a delimiter that works with this regex. i like paired delims like
    {} or []. and if the regex gets too long or complex, /x is called
    for. and then paired delims work very well:

    s{
        blah
    }
    {
        replace
    }sexi ;
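
    Applied to the page markers from this thread, the same layout could
    look roughly like this. It is a sketch only: $text and $count are
    illustrative names, and only the /x flag is used. Note that comments
    inside an /x pattern are still subject to variable interpolation and
    delimiter matching, so they should avoid '$', '@' and braces.

```perl
use strict;
use warnings;

my $text = "if <page>10</page>I cared to have it.";

# s/// returns the number of substitutions made, so it doubles as a
# success test before the capture is used.
my $count = $text =~ s{
    <page>      # opening marker
    (\d+)       # page number, captured
    </page>     # closing marker; the slash needs no escape here
}{}x;

print "(p.$1) contains [$text]\n" if $count;
```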

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Feb 18, 2004
    #13