YARQ - Yet another regex question

Discussion in 'Perl Misc' started by sjp, Mar 29, 2005.

  1. sjp

    sjp Guest

    Hi folks,

    I'm parsing through a series of delimited records. Some of the records
    use '\t' for the delimiter, and others use '=09' as the delimiter. My
    program handles the tab-delimited records fine, but records that use '=09'
    have erroneous line breaks after '=' signs, like so:

    93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
    71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
    RECREATION DIVISION=09

    I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
    an "Can't modify constant item in substitution (s///) at
    /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

    What is the proper way to do it?

    Thanks,

    SJP
     
    sjp, Mar 29, 2005
    #1

  2. sjp <> wrote in
    news:2gh2e.15366$Go4.14046@trnddc05:

    > I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails


    Is that supposed to be $line?

    > What is the proper way to do it?


    One way would be to read the error message, then fix the error in the
    given location, instead of asking hundreds of people to guess what your
    script looks like.

    Sinan.
     
    A. Sinan Unur, Mar 29, 2005
    #2

  3. sjp

    Paul Lalli Guest

    sjp wrote:
    > Hi folks,
    >
    > I'm parsing through a series of delimited records. Some of the records
    > use '\t' for the delimiter, and others use '=09' as the delimiter. My
    > program handles the tab-delimited records fine, but records that use '=09'
    > have erroneous line breaks after '=' signs, like so:
    >
    > 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
    > 71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
    > RECREATION DIVISION=09
    >
    > I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
    > an "Can't modify constant item in substitution (s///) at
    > /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error


    What the heck is 'Sline'? Are you sure you don't mean $line?
    Conceivably, perl thinks that 'Sline' is some sort of constant item.

    You are enabling strict and warnings, right?

    Also, = is not special in a regexp. There's no reason to escape it.

    Beyond that, I don't understand what your actual issue is. How do the
    records being delimited by '=09' relate to the records having \n
    characters after some = characters?

    Paul Lalli
     
    Paul Lalli, Mar 29, 2005
    #3
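
A minimal sketch of the typo discussed above, assuming the OP's statement really was typed with a capital S instead of a dollar sign. The sample string and the variable names `$ok` and `$fixed` are illustrative and not taken from the OP's script. The string eval is only there so the compile failure can be caught and printed instead of aborting the demo.

```perl
use strict;
use warnings;

# Illustrative fragment of a record with a soft line break after '='
my $line = "OR=099=\n71320000";

# Compile the mistyped statement in a string eval so the failure is trappable.
# 'Sline' is a bareword, not a variable, so perl refuses to compile this.
my $ok = eval q{ Sline =~ s/=\n//g; 1 };
print "Typo version did not compile: $@" unless $ok;

# With the sigil in place the same substitution works; '=' needs no escaping.
(my $fixed = $line) =~ s/=\n//g;
print "$fixed\n";
```

The substitution itself is unchanged between the two versions; only the sigil differs.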
  4. sjp

    John Bokma Guest

    Paul Lalli wrote:

    > sjp wrote:
    >> Hi folks,
    >>
    >> I'm parsing through a series of delimited records. Some of the
    >> records use '\t' for the delimiter, and others use '=09' as the
    >> delimiter. My program handles the tab-delimited records fine, but
    >> records that use '=09' have erroneous line breaks after '=' signs,
    >> like so:
    >>
    >> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
    >> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
    >> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09
    >>
    >> I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails
    >> with
    > > an "Can't modify constant item in substitution (s///) at
    > > /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

    >
    > What the heck is 'Sline'? Are you sure you don't mean $line?
    > Conceivably, perl thinks that 'Sline' is some sort of constant item.
    >
    > You are enabling strict and warnings, right?
    >
    > Also, = is not special in a regexp. There's no reason to escape it.
    >
    > Beyond that, I don't understand what your actual issue is. How does
    > the records being delimited by '=09' relate to the records having \n
    > characters after some = characters?


    =
    09

    is not

    =09

    the =xx encoding is used in email (I forgot the name), I would *fix*
    that first, and then do the parsing.

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Mar 29, 2005
    #4
  5. John Bokma <> wrote in news:Xns9628860D818B6castleamber@130.133.1.4:

    > Paul Lalli wrote:
    >
    >> sjp wrote:
    >>> Hi folks,
    >>>
    >>> I'm parsing through a series of delimited records. Some of the
    >>> records use '\t' for the delimiter, and others use '=09' as the
    >>> delimiter. My program handles the tab-delimited records fine, but
    >>> records that use '=09' have erroneous line breaks after '=' signs,
    >>> like so:
    >>>
    >>> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
    >>> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
    >>> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09
    >>>


    ....

    >> Beyond that, I don't understand what your actual issue is. How does
    >> the records being delimited by '=09' relate to the records having \n
    >> characters after some = characters?

    >
    > =
    > 09
    >
    > is not
    >
    > =09


    But there are no such cases in the data the OP posted.

    > the =xx encoding is used in email (I forgot the name), I would *fix*
    > that first, and then do the parsing.


    Base64. The CPAN module MIME::Base64 allows one to convert
    Base64 encoded strings. On the other hand, I am not sure
    if the data the OP posted really is Base64.

    The following seems to satisfy the OP's requirements:

    #! perl

    use strict;
    use warnings;

    my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
    71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
    RECREATION DIVISION=09};

    $d =~ s/=09/\t/g;
    $d =~ s/=\n//g;

    print $d;
    __END__
     
    A. Sinan Unur, Mar 29, 2005
    #5
  6. sjp

    Paul Lalli Guest

    John Bokma wrote:
    >
    > Paul Lalli wrote:
    >>
    >>sjp wrote:
    >>>
    >>>My program handles the tab-delimited records fine, but
    >>> records that use '=09' have erroneous line breaks after '=' signs,
    >>> like so:
    >>>93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
    >>>AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
    >>>RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09

    >>
    >>Beyond that, I don't understand what your actual issue is. How does
    >>the records being delimited by '=09' relate to the records having \n
    >>characters after some = characters?

    >
    > =
    > 09
    >
    > is not
    >
    > =09
    >
    > the =xx encoding is used in email (I forgot the name), I would *fix*
    > that first, and then do the parsing.



    There is no instance of
    =
    09

    anywhere in the OP's data. The way it sounds to me is that the OP is
    concerned about \n's after *any* = character.

    I admit, of course, that I could be quite wrong. But in fact, there is
    no instance of any "=\n" anywhere in the OP's data, so I don't think we
    can really know what the OP is talking about until the OP himself clarifies.

    Paul Lalli
     
    Paul Lalli, Mar 29, 2005
    #6
  7. sjp

    sjp Guest

    On Tue, 29 Mar 2005 19:10:40 +0000, John Bokma wrote:

    > Paul Lalli wrote:
    >
    >> sjp wrote:
    >>> Hi folks,
    >>>
    >>> I'm parsing through a series of delimited records. Some of the
    >>> records use '\t' for the delimiter, and others use '=09' as the
    >>> delimiter. My program handles the tab-delimited records fine, but
    >>> records that use '=09' have erroneous line breaks after '=' signs,
    >>> like so:
    >>>
    >>> 93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
    >>> AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
    >>> RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09
    >>>
    >>> I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails
    >>> with
    >> > an "Can't modify constant item in substitution (s///) at
    >> > /usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

    >>
    >> What the heck is 'Sline'? Are you sure you don't mean $line?
    >> Conceivably, perl thinks that 'Sline' is some sort of constant item.
    >>
    >> You are enabling strict and warnings, right?
    >>
    >> Also, = is not special in a regexp. There's no reason to escape it.
    >>
    >> Beyond that, I don't understand what your actual issue is. How does
    >> the records being delimited by '=09' relate to the records having \n
    >> characters after some = characters?

    >
    > =
    > 09
    >
    > is not
    >
    > =09
    >
    > the =xx encoding is used in email (I forgot the name), I would *fix*
    > that first, and then do the parsing.


    You're right, John. I'm parsing a very large email archive file and an
    indeterminate number of attachments in the file are encoded
    "quoted-printable". So the real issue, I suppose is how to properly
    decode an indeterminate number of quoted-printable records from a mail
    archive before processing the records contained in that archive.

    Thanks for helping me to frame the problem.
     
    sjp, Mar 29, 2005
    #7
  8. sjp

    John Bokma Guest

    A. Sinan Unur wrote:

    > But there are no such cases in the data the OP posted.


    Yup, classical bad post / wrong example :-D

    I think "we" see this every day here?

    > The following seems to satisfy the OP's requirements:
    >
    > #! perl
    >
    > use strict;
    > use warnings;
    >
    > my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA
    > AVE=09=09NEWBERG=09OR=099= 71320000=098.33=09I=09CA=09PARK
    > RANGER=09=0966.64=09=09=09=09=09=09PARKS & = RECREATION DIVISION=09};
    >
    > $d =~ s/=09/\t/g;
    > $d =~ s/=\n//g;


    If you swap those two, yes.

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Mar 29, 2005
    #8
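
A small sketch of why the order of the two substitutions can matter. The record below is hypothetical: it assumes a soft line break that falls between the '=' and the '09' of a delimiter, which is the case John describes and which does not occur in the OP's posted sample.

```perl
use strict;
use warnings;

# Hypothetical record where the line break splits an =09 token
my $raw = "OR=\n09NEWBERG";

# Order from the posted script: decode =09 first, then join lines
(my $decode_first = $raw) =~ s/=09/\t/g;
$decode_first =~ s/=\n//g;

# Swapped order: join lines first, then decode =09
(my $join_first = $raw) =~ s/=\n//g;
$join_first =~ s/=09/\t/g;

# Show tabs visibly so the difference is easy to see
for my $result ($decode_first, $join_first) {
    (my $visible = $result) =~ s/\t/<TAB>/g;
    print "$visible\n";
}
```

Decoding first never sees a complete '=09' in this input, so the '09' survives as literal text; joining first restores the token before it is decoded.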
  9. sjp

    John Bokma Guest

    Paul Lalli wrote:

    > There is no instance of
    > =
    > 09
    >
    > anywhere in the OP's data.


    Of course not, because the OP posted a wrong example :-D.

    Does that never happen here?

    > The way it sounds to me is that the OP is
    > concerned about \n's after *any* = character.
    >
    > I admit, of course, that I could be quite wrong. But in fact, there
    > is no instance of any "=\n" anywhere in the OP's data, so I don't
    > think we can really know what the OP is talking about until the OP
    > himself clarifies.


    My best guess:

    =
    xx

    should become

    =xx

    and then if xx = 09 it should be replaced with \t

    I would have the decoding be handled by a dedicated Perl module.

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Mar 29, 2005
    #9
  10. sjp

    John Bokma Guest

    sjp wrote:

    > On Tue, 29 Mar 2005 19:10:40 +0000, John Bokma wrote:


    [ snip ]

    >> the =xx encoding is used in email (I forgot the name), I would *fix*
    >> that first, and then do the parsing.

    >
    > You're right, John. I'm parsing a very large email archive file and an
    > indeterminate number of attachments in the file are encoded
    > "quoted-printable".


    Yup, that's the one :-D

    > So the real issue, I suppose is how to properly
    > decode an indeterminate number of quoted-printable records from a mail
    > archive before processing the records contained in that archive.


    I am really sure that there are Perl modules that handle this.

    <http://search.cpan.org/~gaas/MIME-Base64-Perl-1.00/lib/MIME/QuotedPrint/Perl.pm>

    > Thanks for helping me to frame the problem.


    :) You're welcome.

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Mar 29, 2005
    #10
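
A minimal sketch of the module-based approach, assuming the MIME::QuotedPrint module that ships with Perl rather than the pure-Perl variant linked above (both provide a `decode_qp` function). It decodes only the sample record from the first post; locating the quoted-printable parts inside a mail archive is a separate problem and is not attempted here.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Sample record from the first post, soft line breaks included
my $encoded = join "\n",
    '93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=',
    '71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =',
    'RECREATION DIVISION=09';

# decode_qp joins soft line breaks and turns =XX escapes into bytes
my $decoded = decode_qp($encoded);

# A limit of -1 keeps trailing empty fields
my @fields = split /\t/, $decoded, -1;
print scalar(@fields), " fields\n";
print "$_\n" for grep { length } @fields;
```

After decoding, the record can go through the same tab-delimited code path as the records that were never encoded.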
  11. sjp

    Guest

    On Tue, 29 Mar 2005 18:37:50 GMT, sjp <>
    wrote:

    >Hi folks,
    >
    >I'm parsing through a series of delimited records. Some of the records
    >use '\t' for the delimiter, and others use '=09' as the delimiter. My
    >program handles the tab-delimited records fine, but records that use '=09'
    >have erroneous line breaks after '=' signs, like so:
    >
    >93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
    >71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
    >RECREATION DIVISION=09
    >
    >I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
    >an "Can't modify constant item in substitution (s///) at
    >/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error
    >
    >What is the proper way to do it?
    >
    >Thanks,
    >
    >SJP



    $line =~ s/[=\n]+//g;
     
    , Apr 1, 2005
    #11
  12. <> wrote:
    > On Tue, 29 Mar 2005 18:37:50 GMT, sjp <>
    > wrote:
    >


    >>but records that use '=09'
    >>have erroneous line breaks after '=' signs, like so:


    >>I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
    >>an "Can't modify constant item in substitution (s///) at



    >>What is the proper way to do it?



    > $line =~ s/[=\n]+//g;



    That does not do what was asked for.

    The OP wants to remove the 2-character sequence "=\n".

    That code removes all equal signs and all newlines.


    In fact, the OP's pattern match would do it just fine if he
    had typed "$" instead of "S".


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 1, 2005
    #12
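
A short sketch of the difference Tad describes, using a made-up string that contains a lone '=', a lone newline, and one "=\n" pair. The variable names are illustrative.

```perl
use strict;
use warnings;

# Made-up input: a plain '=', one "=\n" pair, and a plain newline
my $line = "a=b=\nc\nd";

# Character class: deletes every '=' and every newline
(my $by_class = $line) =~ s/[=\n]+//g;

# Two-character sequence: deletes only '=' immediately followed by newline
(my $by_sequence = $line) =~ s/=\n//g;

for my $result ($by_class, $by_sequence) {
    (my $visible = $result) =~ s/\n/\\n/g;   # show newlines as \n
    print "$visible\n";
}
```

The character class also eats the '=' in 'a=b' and the real line break before 'd', which is why it does not do what was asked.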
  13. sjp

    Guest

    On Tue, 29 Mar 2005 18:37:50 GMT, sjp <>
    wrote:

    >Hi folks,
    >
    >I'm parsing through a series of delimited records. Some of the records
    >use '\t' for the delimiter, and others use '=09' as the delimiter. My
    >program handles the tab-delimited records fine, but records that use '=09'
    >have erroneous line breaks after '=' signs, like so:
    >
    >93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
    >71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
    >RECREATION DIVISION=09
    >

    That's funny, I see "09" but I don't see "\n" or even ("=\n") CRLF 13,10.
    Some "line breaks" are 10, some are CRLF. '10' or '13' is not an
    escape character, and neither is '09', but isn't '=09' a printable
    representation of an escape sequence? But I don't see '1013'.
    When do you process these records? Is it in binary form?
    Regex only knows a few '\letter' escape control codes.
    You may want to go with a strictly hex representation of '\n' (even though
    it's not visible here) by having either the \x0a or \x0d with the '='.

    while ($line =~ s/(=\x0d\x0a|=[\x0a\x0d])//) {};

    I write it this way, without the 'g' modifier, because I don't think
    'backtracking' is done in this case since there are no quantifiers;
    that could be your problem.

    for a quick test, try this:

    while ($line =~ s/=\n//) {};

    gluck!!


    >I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
    >an "Can't modify constant item in substitution (s///) at
    >/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error
    >
    >What is the proper way to do it?
    >
    >Thanks,
    >
    >SJP
     
    , Apr 9, 2005
    #13
  14. wrote in
    news::

    > while ($line =~ s/=\n//) {};


    In general, you might want to use

    1 while ( ... );

    instead of putting an empty block at the end.

    However, you don't really need a while loop there:

    $line =~ s/=\n//g;

    would be preferable.

    Just because I am bored and looking for something to do:

    #! /usr/bin/perl

    use strict;
    use warnings;

    sub make_loop_replacer {
        my $s = 'a';
        $s .= "=\n" for (1 .. 100_000);
        $s .= 'b';
        sub { 1 while $s =~ s/=\n// }
    }

    sub make_sg_replacer {
        my $s = 'a';
        $s .= "=\n" for (1 .. 100_000);
        $s .= 'b';
        sub { $s =~ s/=\n//g }
    }

    use Benchmark ':all';

    cmpthese 5_000_000, {
        loop => make_loop_replacer(),
        sg   => make_sg_replacer(),
    };

    __END__

    D:\Home\asu1\UseNet\clpmisc> t
              Rate loop   sg
    loop 2908668/s   -- -50%
    sg   5763689/s  98%   --



    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 9, 2005
    #14
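
A hedged variation on the benchmark above. In the posted version each closure modifies its own captured `$s`, so after the first call there is nothing left to remove and the remaining iterations time a failed match. The sketch below copies a fresh string on every call instead. The names `strip_loop`, `strip_sg`, and `$template`, the shorter 1,000-pair string, and the one-CPU-second run time are all choices made for this illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Shorter test string than in the post, so each call stays cheap
my $template = 'a' . ("=\n" x 1_000) . 'b';

# Remove one "=\n" per pass until none are left
sub strip_loop {
    my ($s) = @_;
    1 while $s =~ s/=\n//;
    return $s;
}

# Remove every "=\n" in a single pass with /g
sub strip_sg {
    my ($s) = @_;
    $s =~ s/=\n//g;
    return $s;
}

# A negative count asks Benchmark to run each sub for that many CPU seconds
cmpthese(-1, {
    loop => sub { strip_loop($template) },
    sg   => sub { strip_sg($template) },
});
```

Because each call works on its own copy, every iteration does the full removal; no timings are quoted here since they depend on the machine.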
