Unexpected regex Behavior

Mark Shelor

Is it true that defining $/ to an integer reference (to read
fixed-length records) affects the meaning of the end-of-string symbol
($) in regex's?

For example, let's say I'm reading 4096-byte chunks from a file, and
wish to do special processing if any chunk ends with the carriage-return
character (\015). So, I start with code that looks like:

    local $/ = \4096;
    while (defined (my $rec = <F>)) {
        while ($rec =~ /\015$/) {
            # do special processing ...
        }
        ...
    }

Oddly, this doesn't seem to work. It ends up matching chunks that
contain, but don't necessarily end with, \015.

Instead, I have to do this:

    local $/ = \4096;
    while (defined (my $rec = <F>)) {
        while (substr($rec, -1) eq "\015") {
            # do special processing ...
        }
        ...
    }

Any idea what's going on?

Thanks, Mark
 
MSG

Mark said:
Is it true that defining $/ to an integer reference (to read
fixed-length records) affects the meaning of the end-of-string symbol
($) in regex's?

    local $/ = \4096;
    while (defined (my $rec = <F>)) {
        while ($rec =~ /\015$/) {
            # do special processing ...

($) is not exactly the end-of-string symbol; it is the end-of-line symbol.
(\Z) or (\z) is the end-of-string symbol and should serve your purpose.

Also, I feel that "if" is better than the "while" loop (the 2nd one),
since you only want to match one \015 at the end of the string.
 
John W. Krahn

Mark said:
Is it true that defining $/ to an integer reference (to read
fixed-length records) affects the meaning of the end-of-string symbol
($) in regex's?

No, it is not true.

perldoc perlre
[snip]
By default, the "^" character is guaranteed to match only the beginning
of the string, the "$" character only the end (or before the newline at
the end), and Perl does certain optimizations with the assumption that
the string contains only one line. Embedded newlines will not be
matched by "^" or "$". You may, however, wish to treat a string as a
multi-line buffer, such that the "^" will match after any newline
within the string, and "$" will match before any newline. At the cost
of a little more overhead, you can do this by using the /m modifier on
the pattern match operator. (Older programs did this by setting $*,
but this practice is now deprecated.)


So the regular expression will match with either "\015" or "\015\012" at the
end of the string. If you want it to only match at the end of the string use
/\015\z/ or the substr() expression.
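
To make the difference concrete, here is a minimal sketch that tries the three anchors against a few made-up records. The sample strings and the helper name anchors_matching are only for illustration and are not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical sample records; "\015" is CR and "\012" is LF.
my %samples = (
    'ends with CR'    => "abc\015",
    'ends with CR LF' => "abc\015\012",
    'CR in middle'    => "a\015bc",
);

# Returns the anchors (space-separated) for which /\015<anchor>/ matches.
sub anchors_matching {
    my ($rec) = @_;
    my @hits;
    push @hits, '$'  if $rec =~ /\015$/;    # end, or before a final newline
    push @hits, '\Z' if $rec =~ /\015\Z/;   # same rule as $ without /m
    push @hits, '\z' if $rec =~ /\015\z/;   # absolute end of string only
    return join ' ', @hits;
}

for my $name (sort keys %samples) {
    printf "%-16s => %s\n", $name, anchors_matching($samples{$name}) || '(none)';
}
```

Only the "\015\012" record separates the anchors: $ and \Z accept it, \z does not.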



John
 
Mark Shelor

Now it all makes perfect sense. Thanks for citing the reference, and
thanks to you and MSG for the helpful replies.

As a side remark to MSG's response, both $ and \Z match *before* newline
at the end, so only /\015\z/ will work in this case.
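
For anyone who wants to try this without a real file, here is a rough sketch of the fixed-length read loop using \z. It reads from an in-memory filehandle; the helper name chunks_ending_in_cr, the 4-byte record size, and the test data are all assumptions made up for the demonstration.

```perl
use strict;
use warnings;

# Returns the 0-based indexes of the fixed-length chunks that end in CR.
# $data and $size are illustrative parameters, not from the original code.
sub chunks_ending_in_cr {
    my ($data, $size) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory file: $!";
    local $/ = \$size;                       # read $size bytes per record
    my @hits;
    my $i = 0;
    while (defined(my $rec = <$fh>)) {
        push @hits, $i if $rec =~ /\015\z/;  # \z: absolute end of string only
        $i++;
    }
    close $fh;
    return @hits;
}

# Three 4-byte chunks: "ab\015\012", "cde\015", "fgh\012"
my @hits = chunks_ending_in_cr("ab\015\012cde\015fgh\012", 4);
print "chunks ending in CR: @hits\n";
```

Only the second chunk (index 1) is reported. With /\015$/ in place of /\015\z/, the first chunk would be reported as well, since it ends in "\015\012".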

Regards, Mark
 
