YARQ - Yet another regex question

sjp

Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

What is the proper way to do it?

Thanks,

SJP
 
A. Sinan Unur

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails

Is that supposed to be $line?
What is the proper way to do it?

One way would be to read the error message, then fix the error in the
given location, instead of asking hundreds of people to guess what your
script looks like.

Sinan.
 
Paul Lalli

sjp said:
Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

What the heck is 'Sline'? Are you sure you don't mean $line?
Conceivably, perl thinks that 'Sline' is some sort of constant item.

You are enabling strict and warnings, right?

Also, = is not special in a regexp. There's no reason to escape it.

Beyond that, I don't understand what your actual issue is. How does the
records being delimited by '=09' relate to the records having \n
characters after some = characters?

Paul Lalli
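
To illustrate the point about 'Sline': a minimal sketch, assuming only core Perl, that compiles the mistyped statement inside a string eval so the failure can be inspected instead of aborting the script. Without the $ sigil perl does not see a variable at all, so the substitution has nothing it can modify. The sample string is made up for the demonstration, and the exact wording of the message may differ between perl versions and strictness settings.

```perl
use strict;
use warnings;

# Made-up sample: one soft line break ("=" followed by a newline).
my $line = "PARKS & =\nRECREATION DIVISION=09";

# Mistyped: 'Sline' is a bareword, not the scalar $line, so this cannot compile.
my $ok  = eval q{ Sline =~ s/=\n//g; 1 };
my $err = $@;
print "typo version failed to compile: $err" unless $ok;

# Corrected: with the sigil, the substitution edits $line in place.
$line =~ s/=\n//g;
print "$line\n";
```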
 
John Bokma

Paul said:
What the heck is 'Sline'? Are you sure you don't mean $line?
Conceivably, perl thinks that 'Sline' is some sort of constant item.

You are enabling strict and warnings, right?

Also, = is not special in a regexp. There's no reason to escape it.

Beyond that, I don't understand what your actual issue is. How does
the records being delimited by '=09' relate to the records having \n
characters after some = characters?

=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.
 
A. Sinan Unur

John Bokma said:
....


=
09

is not

=09

But there are no such cases in the data the OP posted.
the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

Base64. The CPAN module MIME::Base64 allows one to convert
Base64 encoded strings. On the other hand, I am not sure
if the data the OP posted really is Base64.

The following seems to satisfy the OP's requirements:

#! perl

use strict;
use warnings;

my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09};

$d =~ s/=09/\t/g;
$d =~ s/=\n//g;

print $d;
__END__
 
Paul Lalli

John said:
=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.


There is no instance of
=
09

anywhere in the OP's data. The way it sounds to me is that the OP is
concerned about \n's after *any* = character.

I admit, of course, that I could be quite wrong. But in fact, there is
no instance of any "=\n" anywhere in the OP's data, so I don't think we
can really know what the OP is talking about until the OP himself clarifies.

Paul Lalli
 
sjp

=
09

is not

=09

the =xx encoding is used in email (I forgot the name), I would *fix*
that first, and then do the parsing.

You're right, John. I'm parsing a very large email archive file and an
indeterminate number of attachments in the file are encoded
"quoted-printable". So the real issue, I suppose is how to properly
decode an indeterminate number of quoted-printable records from a mail
archive before processing the records contained in that archive.

Thanks for helping me to frame the problem.
 
John Bokma

A. Sinan Unur said:
But there are no such cases in the data the OP posted.

Yup, classical bad post / wrong example :-D

I think "we" see this every day here?
The following seems to satisfy the OP's requirements:

#! perl

use strict;
use warnings;

my $d = q{93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09};

$d =~ s/=09/\t/g;
$d =~ s/=\n//g;

If you swap those two, yes.
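
A small sketch of why the order can matter, using a hypothetical record (not taken from the OP's data) in which the soft line break falls in the middle of an "=09" token. If the tabs are replaced first, the split token is never recognised as a tab.

```perl
use strict;
use warnings;

# Hypothetical record in which a soft line break splits an "=09" token.
my $wrapped = "NEWBERG=\n09OR=09PARKS";

# Order as posted: tabs first, then soft breaks.
(my $tabs_first = $wrapped) =~ s/=09/\t/g;
$tabs_first =~ s/=\n//g;

# Swapped order: soft breaks first, then tabs.
(my $breaks_first = $wrapped) =~ s/=\n//g;
$breaks_first =~ s/=09/\t/g;

my @a = split /\t/, $tabs_first,   -1;
my @b = split /\t/, $breaks_first, -1;
printf "tabs first:   %d fields\n", scalar @a;
printf "breaks first: %d fields\n", scalar @b;
```

With the breaks removed first, the rejoined "=09" is seen by the second substitution and the record splits into the expected three fields.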
 
John Bokma

Paul said:
There is no instance of
=
09

anywhere in the OP's data.

Of course not, because the OP posted a wrong example :-D.

Does that never happen here?
The way it sounds to me is that the OP is
concerned about \n's after *any* = character.

I admit, of course, that I could be quite wrong. But in fact, there
is no instance of any "=\n" anywhere in the OP's data, so I don't
think we can really know what the OP is talking about until the OP
himself clarifies.

My best guess:

=
xx

should become

=xx

and then if xx = 09 it should be replaced with \t

I would have the decoding be handled by a dedicated Perl module.
 
John Bokma

[ snip ]
You're right, John. I'm parsing a very large email archive file and an
indeterminate number of attachments in the file are encoded
"quoted-printable".

Yup, that's the one :-D
So the real issue, I suppose is how to properly
decode an indeterminate number of quoted-printable records from a mail
archive before processing the records contained in that archive.

I am really sure that there are Perl modules that handle this.

<http://search.cpan.org/~gaas/MIME-Base64-Perl-1.00/lib/MIME/QuotedPrint/Perl.pm>
Thanks for helping me to frame the problem.

:) You're welcome.
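
A minimal sketch of the module-based approach, assuming the core MIME::QuotedPrint module (rather than the pure-Perl variant linked above) and assuming each attachment body has already been extracted from the archive as a string. The helper name decode_record is made up for this example. decode_qp removes the soft line breaks and turns every =XX escape into its byte, so =09 becomes a real tab and the record can be split as usual.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: decode one quoted-printable record and split it on tabs.
sub decode_record {
    my ($raw) = @_;
    my $decoded = decode_qp($raw);       # handles "=\n" soft breaks and =XX escapes
    return split /\t/, $decoded, -1;     # -1 keeps trailing empty fields
}

# The sample record from the original post, one string per physical line.
my $raw = join "\n",
    '93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=',
    '71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =',
    'RECREATION DIVISION=09';

my @fields = decode_record($raw);
print join('|', @fields), "\n";
```

Locating the quoted-printable parts inside the mail archive is a separate step that this sketch does not cover.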
 
robic0

Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
/usr/local/bin/mailparse line 18, near "s/\=\n//g;" error

What is the proper way to do it?

Thanks,

SJP


$line =~ s/[=\n]+//g;
 
Tad McClellan

but records that use '=09'
have erroneous line breaks after '=' signs, like so:
I'd like to remove the '=' and EOL, but 'Sline =~ s/\=\n//g;' fails with
an "Can't modify constant item in substitution (s///) at
What is the proper way to do it?

$line =~ s/[=\n]+//g;


That does not do what was asked for.

The OP wants to remove the 2-character sequence "=\n".

That code removes all equal signs and all newlines.


In fact, the OP's pattern match would do it just fine if he
had typed "$" instead of "S".
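
The difference between the two patterns can be seen on a short made-up string. This is only a sketch; the sample text is an assumption chosen to contain both a soft line break and an "=09" token.

```perl
use strict;
use warnings;

# Made-up sample: one soft break ("=\n"), one "=09" token, one real newline.
my $record = "PARKS & =\nRECREATION=09DIVISION\n";

# Character class: deletes every '=' and every newline, wherever they occur.
(my $class = $record) =~ s/[=\n]+//g;

# Two-character sequence: deletes only '=' immediately followed by a newline.
(my $seq = $record) =~ s/=\n//g;

print "class: [$class]\n";
print "seq:   [$seq]\n";
```

The character-class version also eats the '=' of "=09" and the final newline, which destroys the delimiter the OP still needs.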
 
rbric

Hi folks,

I'm parsing through a series of delimited records. Some of the records
use '\t' for the delimiter, and others use '=09' as the delimiter. My
program handles the tab-delimited records fine, but records that use '=09'
have erroneous line breaks after '=' signs, like so:

93=093=094/1/2004=09=09HARNEY=09JAMES=09808 SITKA AVE=09=09NEWBERG=09OR=099=
71320000=098.33=09I=09CA=09PARK RANGER=09=0966.64=09=09=09=09=09=09PARKS & =
RECREATION DIVISION=09

That's funny, I see "09" but I don't see "\n", or even a CRLF (13,10), after the '='.
Some line breaks are a bare 10, some are CRLF. Neither '10' nor '13' is an
escape character, and neither is '09', but isn't '=09' a printable
representation of an escape sequence? But I don't see '1013'.
When do you process these records? Is it in binary form?
A regex only knows a few '\letter' escape control codes.
You may want to go with a strict hex representation of '\n' (even though
it's not visible here) by using either \x0a or \x0d with the '='.

while ($line =~ s/(=\x0d\x0a|=[\x0a\x0d])//) {};

I write it this way, without the 'g' modifier, because I don't think
'backtracking' is done in this case since there are no quantifiers;
that could be your problem.

for a quick test, try this:

while ($line =~ s/=\n//) {};

Good luck!
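
A compact sketch of the same idea, assuming the input may mix LF and CRLF line endings: an optional \x0d before \x0a covers both, and the /g modifier removes every soft break in one pass. The helper name strip_soft_breaks is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: remove quoted-printable soft line breaks,
# whether the line ends in LF or CRLF.
sub strip_soft_breaks {
    my ($text) = @_;
    $text =~ s/=\x0d?\x0a//g;
    return $text;
}

my $mixed = "a=\x0d\x0ab=\x0ac=09";
print strip_soft_breaks($mixed), "\n";
```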
 
A. Sinan Unur

rbric said:
while ($line =~ s/=\n//) {};

In general, you might want to use

1 while ( ... );

instead of putting an empty block at the end.

However, you don't really need a while loop there:

$line =~ s/=\n//g;

would be preferable.

Just because I am bored and looking for something to do:

#! /usr/bin/perl

use strict;
use warnings;

sub make_loop_replacer {
    my $s = 'a';
    $s .= "=\n" for (1 .. 100_000);
    $s .= 'b';
    sub { 1 while $s =~ s/=\n// }
}

sub make_sg_replacer {
    my $s = 'a';
    $s .= "=\n" for (1 .. 100_000);
    $s .= 'b';
    sub { $s =~ s/=\n//g }
}

use Benchmark ':all';

cmpthese 5_000_000, {
    loop => make_loop_replacer(),
    sg   => make_sg_replacer(),
};

__END__

D:\Home\asu1\UseNet\clpmisc> t
          Rate loop   sg
loop 2908668/s   -- -50%
sg   5763689/s  98%   --
 
