YARQ - Yet another regex question

sjp · Mar 29, 2005

Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

What is the proper way to do it?

Thanks,

SJP

A. Sinan Unur · Mar 29, 2005

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails

Is that supposed to be $line?

What is the proper way to do it?

One way would be to read the error message, then fix the error in the
given location, instead of asking hundreds of people to guess what your
script looks like.

Sinan.

Paul Lalli · Mar 29, 2005

sjp said:
Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

What the heck is 'Sline'? Are you sure you don't mean $line?
Conceivably, perl thinks that 'Sline' is some sort of constant item.

You are enabling strict and warnings, right?

Also, = is not special in a regexp. There's no reason to escape it.

Beyond that, I don't understand what your actual issue is. How does the
records being delimited by '=09' relate to the records having \n
characters after some = characters?

Paul Lalli
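
For readers following along, here is a minimal sketch of what the typo does. It assumes the statement was typed as 'Sline' (a bareword) rather than '$line'; the sample string is a shortened piece of the posted record. The broken statement is compiled inside a string eval so that the compile-time error lands in $@ instead of stopping the script. The exact wording of the message may differ between perl versions and strict settings.

```perl
use strict;
use warnings;

# Compile the typo'd statement in a string eval so the compile-time
# error is captured in $@ instead of killing the script.
# 'Sline' (capital S instead of $) is a bareword, not a variable.
my $ok = eval q{ Sline =~ s/=\n//g; 1 };
print "typo version failed to compile:\n$@\n" unless $ok;

# With the sigil corrected, the substitution runs on a real scalar.
my $line = "OR=099=\n71320000";
$line =~ s/=\n//g;
print "fixed: $line\n";
```

The unescaped '=' in the corrected statement behaves the same as '\=' would.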

John Bokma · Mar 29, 2005

Paul said:
What the heck is 'Sline'? Are you sure you don't mean $line?
Conceivably, perl thinks that 'Sline' is some sort of constant item.

You are enabling strict and warnings, right?

Also, = is not special in a regexp. There's no reason to escape it.

Beyond that, I don't understand what your actual issue is. How does
the records being delimited by '=09' relate to the records having \n
characters after some = characters?

=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

A. Sinan Unur · Mar 29, 2005

John Bokma said:
....

=
09

is not

=09

But there are no such cases in the data the OP posted.

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

Base64. The CPAN module MIME::Base64 allows one to convert
Base64 encoded strings. On the other hand, I am not sure
if the data the OP posted really is Base64.

The following seems to satisfy the OP's requirements:

#! perl

use strict;
use warnings;

my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09};

$d =~ s/=09/\t/g;
$d =~ s/=\n//g;

print $d;
__END__

Paul Lalli · Mar 29, 2005

John said:
=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

There is no instance of
=
09

anywhere in the OP's data. The way it sounds to me is that the OP is
concerned about \n's after *any* = character.

I admit, of course, that I could be quite wrong. But in fact, there is
no instance of any "=\n" anywhere in the OP's data, so I don't think we
can really know what the OP is talking about until the OP himself clarifies.

Paul Lalli

sjp · Mar 29, 2005

=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

You're right, John. I'm parsing a very large email archive file and an
indeterminate number of attachments in the file are encoded
"quoted-printable". So the real issue, I suppose is how to properly
decode an indeterminate number of quoted-printable records from a mail
archive before processing the records contained in that archive.

Thanks for helping me to frame the problem.

John Bokma · Mar 29, 2005

A. Sinan Unur said:
But there are no such cases in the data the OP posted.

Yup, classical bad post / wrong example :-D

I think "we" see this every day here?

The following seems to satisfy the OP's requirements:

#! perl

use strict;
use warnings;

my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09};

$d =~ s/=09/\t/g;
$d =~ s/=\n//g;

If you swap those two, yes.

John Bokma · Mar 29, 2005

Paul said:
There is no instance of
=
09

anywhere in the OP's data.

Of course not, because the OP posted a wrong example :-D.

Does that never happen here?

The way it sounds to me is that the OP is
concerned about \n's after *any* = character.

I admit, of course, that I could be quite wrong. But in fact, there
is no instance of any "=\n" anywhere in the OP's data, so I don't
think we can really know what the OP is talking about until the OP
himself clarifies.

My best guess:

=
xx

should become

=xx

and then if xx = 09 it should be replaced with \t

I would have the decoding be handled by a dedicated Perl module.

John Bokma · Mar 29, 2005

[ snip ]

You're right, John. I'm parsing a very large email archive file and an
indeterminate number of attachments in the file are encoded
"quoted-printable".

Yup, that's the one :-D

So the real issue, I suppose is how to properly
decode an indeterminate number of quoted-printable records from a mail
archive before processing the records contained in that archive.

I am really sure that there are Perl modules that handle this.

<http://search.cpan.org/~gaas/MIME-Base64-Perl-1.00/lib/MIME/QuotedPrint/Perl.pm>

Thanks for helping me to frame the problem.

You're welcome.
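
A minimal sketch of the module route, assuming the MIME::QuotedPrint module that ships with perl and its decode_qp function. The input is the record from the original post; the variable names are made up for illustration. Extracting the quoted-printable parts from the mail archive is a separate step that is not shown here.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# The sample record from the original post, soft line breaks included.
my $encoded = join "\n",
    '93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=',
    '71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =',
    'RECREATION DIVISION=09';

# decode_qp removes the "=\n" soft line breaks and turns =XX escapes
# (such as =09 for a tab) back into the bytes they stand for.
my $decoded = decode_qp($encoded);

# After decoding, the record is ordinary tab-delimited text.
# The -1 limit keeps trailing empty fields.
my @fields = split /\t/, $decoded, -1;
print "name: $fields[4] $fields[5], zip: $fields[10]\n";
```

Decoding first means the existing tab-delimited parser can be reused unchanged for both kinds of record.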

robic0 · Apr 1, 2005

Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

What is the proper way to do it?

Thanks,

SJP

$line =~ s/[=\n]+//g;

Tad McClellan · Apr 1, 2005

but records that use '=09'
have erroneous line breaks after '=' signs, like so:
I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at

What is the proper way to do it?

$line =~ s/[=\n]+//g;

That does not do what was asked for.

The OP wants to remove the 2-character sequence "=\n".

That code removes all equal signs and all newlines.

In fact, the OP's pattern match would do it just fine if he
had typed "$" instead of "S".
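
The difference between the two patterns can be sketched on a shortened piece of the posted record (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $record = "OR=099=\n71320000=098.33";

# Character class: deletes every '=' and every newline, wherever they occur.
(my $class = $record) =~ s/[=\n]+//g;

# Two-character sequence: deletes only an '=' immediately followed by a newline.
(my $seq = $record) =~ s/=\n//g;

print "class:    $class\n";   # the =09 escapes lose their '='
print "sequence: $seq\n";     # the =09 escapes survive for later decoding
```

With the character class, the '=09' delimiters can no longer be told apart from ordinary digits.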

rbric · Apr 9, 2005

Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

That's funny, I see "09" but I don't see "\n" or even (\n=) CRLF 13,10.
Some "line breaks" are 10, some are CRLF. '10' or '13' is not an
escape character, and neither is '09', but isn't '=09' a printable
string representing an escape sequence? But I don't see '1013'.
When do you process these records? Is it in binary form?
Regex only knows a few '\letter' escape control codes.
You may want to go with a strictly hex representation of '\n' (even though
it's not visible here) by having either \x0a or \x0d with the '='.

while ($line =~ s/(=\x0d\x0a|=[\x0a\x0d])//) {};

I write it this way, without the 'g' modifier, because I don't think
'backtracking' is done in this case since there are no quantifiers;
that could be your problem.

for a quick test, try this:

while ($line =~ s/=\n//) {};

gluck!!
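
If the archive really does mix LF and CRLF line endings, a single pattern with an optional carriage return may be enough. A minimal sketch, with two made-up sample strings standing in for lines read in binary mode:

```perl
use strict;
use warnings;

# Hypothetical samples: the same soft break with Unix and DOS line endings.
my %samples = (
    lf   => "PARKS & =\x0aRECREATION",
    crlf => "PARKS & =\x0d\x0aRECREATION",
);

for my $name (sort keys %samples) {
    # '=' followed by an optional CR and then LF is treated as one soft break.
    (my $joined = $samples{$name}) =~ s/=\x0d?\x0a//g;
    print "$name: $joined\n";
}
```

The /g modifier already finds every occurrence in one pass, so no surrounding while loop is needed.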

A. Sinan Unur · Apr 9, 2005

rbric said:

while ($line =~ s/=\n//) {};

In general, you might want to use

1 while ( ... );

instead of putting an empty block at the end.

However, you don't really need a while loop there:

$line =~ s/=\n//g;

would be preferable.

Just because I am bored and looking for something to do:

#! /usr/bin/perl

use strict;
use warnings;

sub make_loop_replacer {
    my $s = 'a';
    $s .= "=\n" for (1 .. 100_000);
    $s .= 'b';
    sub { 1 while $s =~ s/=\n// }
}

sub make_sg_replacer {
    my $s = 'a';
    $s .= "=\n" for (1 .. 100_000);
    $s .= 'b';
    sub { $s =~ s/=\n//g }
}

use Benchmark ':all';

cmpthese 5_000_000, {
    loop => make_loop_replacer(),
    sg   => make_sg_replacer(),
};

__END__

D:\Home\asu1\UseNet\clpmisc> t
          Rate loop   sg
loop 2908668/s   -- -50%
sg   5763689/s  98%   --
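
One caveat when reading those numbers: each closure captures its own $s and modifies it in place, so the first call removes every "=\n" and the remaining iterations run against a string with nothing left to replace. A variation that copies a fresh string on every iteration might look like the sketch below. The smaller input size and iteration count are arbitrary choices to keep the run short, and no timings are claimed for it.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Smaller input than the original post so the sketch finishes quickly.
my $template = 'a' . ("=\n" x 1_000) . 'b';

cmpthese(100, {
    # Copy the template each time so every iteration has real work to do.
    loop => sub { my $s = $template; 1 while $s =~ s/=\n//; },
    sg   => sub { my $s = $template; $s =~ s/=\n//g; },
});
```

The copy adds the same fixed cost to both variants, so the comparison between them stays meaningful.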
