John, I would have agreed with you. Plainly we're both wrong, as the follow-ups, not to mention the documentation, have shown. But what is it we're (mis)remembering? There must be some circumstance in which the newline (\n) behaves differently from the other whitespace characters.
Not sure if this helps, but I searched the manual for the patterns
/white.*newline, /\\s and /\\s.*newline, which yielded:
perlintro
....
More complex regular expressions
You don't just have to match on fixed strings. In fact, you can match on
just about anything you could dream of by using more complex regular
expressions. These are documented at great length in perlre, but for the
meantime, here's a quick cheat sheet:
.    a single character
\s   a whitespace character (space, tab, newline, ...)
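For what it's worth, the cases I know of where newline is treated specially involve "." and "$" rather than "\s": without the /s modifier "." matches any character except a newline, and "$" can match just before a trailing newline. A minimal sketch of that difference (the variable names are only illustrative):

```perl
use strict;
use warnings;

# "\s" treats newline like any other whitespace character.
my @ws = grep { /\A\s\z/ } (" ", "\t", "\n");

# "." skips newline unless the /s modifier is given.
my $dot_plain = ("\n" =~ /\A.\z/)  ? 1 : 0;   # 0
my $dot_s     = ("\n" =~ /\A.\z/s) ? 1 : 0;   # 1

# "$" matches before a trailing newline, but not before a trailing space.
my $dollar_nl = ("a\n" =~ /a$/) ? 1 : 0;      # 1
my $dollar_sp = ("a "  =~ /a$/) ? 1 : 0;      # 0

print scalar(@ws), " $dot_plain $dot_s $dollar_nl $dollar_sp\n";
```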
perlglossary
....
continuation
The treatment of more than one physical "line" as a single logical line.
"Makefile" lines are continued by putting a backslash before the
"newline". Mail headers as defined by RFC 822 are continued by putting a
space or tab *after* the newline. In general, lines in Perl do not need
any form of continuation mark, because "whitespace" (including newlines)
is gleefully ignored. Usually.
perlrequick
....
Perl has several abbreviations for common character classes:
* \d is a digit and represents
[0-9]
* \s is a whitespace character and represents
[\ \t\r\n\f]
perlretut
....
* \s matches a whitespace character, the set [\ \t\r\n\f] and others
....
The "[:digit:]", "[:word:]", and "[:space:]" correspond to the
familiar "\d", "\w", and "\s" character classes.
perlfaq4
....
How do I strip blank space from the beginning/end of a string?
(contributed by brian d foy)
A substitution can do this for you. For a single line, you want to replace
all the leading or trailing whitespace with nothing. You can do that with a
pair of substitutions.
s/^\s+//;
s/\s+$//;
You can also write that as a single substitution, although it turns out the
combined statement is slower than the separate ones. That might not matter
to you, though.
s/^\s+|\s+$//g;
In this regular expression, the alternation matches either at the beginning
or the end of the string since the anchors have a lower precedence than the
alternation. With the "/g" flag, the substitution makes all possible
matches, so it gets both. Remember, the trailing newline matches the "\s+",
and the "$" anchor can match to the physical end of the string, so the
newline disappears too. Just add the newline to the output, which has the
added benefit of preserving "blank" (consisting entirely of whitespace)
lines which the "^\s+" would remove all by itself.
while( <> )
{
    s/^\s+|\s+$//g;
    print "$_\n";
}
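To see the newline being swallowed along with the rest of the trailing whitespace, here is a small sketch that wraps the same substitution in a helper. The name trim() is my own, not something from perlfaq4:

```perl
use strict;
use warnings;

# Hypothetical helper around the perlfaq4 single-substitution form.
sub trim {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;   # leading run, then trailing run (newline included)
    return $line;
}

my $trimmed = trim("  \thello world \t\n");   # "hello world"
my $blank   = trim(" \t \n");                 # "" (the newline is gone as well)

print "[$trimmed] [$blank]\n";
```

The interior space in "hello world" survives because neither alternative can match there: it is not at the start of the string, and it is not followed by the end of the string.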
For a multi-line string, you can apply the regular expression to each
logical line in the string by adding the "/m" flag (for "multi-line"). With
the "/m" flag, the "$" matches *before* an embedded newline, so it doesn't
remove it. It still removes the newline at the end of the string.
$string =~ s/^\s+|\s+$//gm;
Remember that lines consisting entirely of whitespace will disappear, since
the first part of the alternation can match the entire string and replace it
with nothing. If you need to keep embedded blank lines, you have to do a little
more work. Instead of matching any whitespace (since that includes a
newline), just match the other whitespace.
$string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
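The difference between the two /m forms is easier to see side by side. A minimal sketch on a made-up three-line string whose middle line holds only whitespace:

```perl
use strict;
use warnings;

# Sample input (an assumption for illustration): the middle line is whitespace only.
my $text = "  first  \n \t \n  last  \n";

# "\s" also matches "\n", so the whitespace-only line and the final newline vanish.
(my $collapsed = $text) =~ s/^\s+|\s+$//gm;

# Matching only tab, form feed and space leaves every newline in place.
(my $kept = $text) =~ s/^[\t\f ]+|[\t\f ]+$//mg;

# $collapsed is "first\nlast"
# $kept      is "first\n\nlast\n"
print length($collapsed), " ", length($kept), "\n";
```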
perlrebackslash
....
"\w" is a character class that matches any single *word* character (letters,
digits, underscore). "\d" is a character class that matches any decimal
digit, while the character class "\s" matches any whitespace character. New
in perl 5.10.0 are the classes "\h" and "\v" which match horizontal and
vertical whitespace characters.
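A short sketch of how "\h" and "\v" divide things up, assuming perl 5.10.0 or later as the excerpt says. The sample string is made up:

```perl
use strict;
use warnings;
use 5.010;   # \h and \v are not available before perl 5.10.0

my $tab_is_h = ("\t" =~ /\A\h\z/) ? 1 : 0;   # 1: tab is horizontal whitespace
my $nl_is_h  = ("\n" =~ /\A\h\z/) ? 1 : 0;   # 0: newline is not
my $nl_is_v  = ("\n" =~ /\A\v\z/) ? 1 : 0;   # 1: newline is vertical whitespace

# Trimming with \h instead of \s leaves the line terminator alone.
(my $line = " \tvalue\t \n") =~ s/\A\h+|\h+$//g;   # "value\n"

print "$tab_is_h $nl_is_h $nl_is_v\n";
```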
perlrecharclass
....
Whitespace
"\s" matches any single character that is considered whitespace. The exact
set of characters matched by "\s" depends on whether the source string is in
UTF-8 format and the locale or EBCDIC code page that is in effect. If it's
in UTF-8 format, "\s" matches what is considered whitespace in the Unicode
database; the complete list is in the table below. Otherwise, if there is a
locale or EBCDIC code page in effect, "\s" matches whatever is considered
whitespace by the current locale or EBCDIC code page. Without a locale or
EBCDIC code page, "\s" matches the horizontal tab ("\t"), the newline
("\n"), the form feed ("\f"), the carriage return ("\r"), and the space.
(Note that it doesn't match the vertical tab, "\cK".) Perhaps the most
notable possible surprise is that "\s" matches a non-breaking space only if
the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
code page that is in effect has that character. See "Locale, EBCDIC, Unicode
and UTF-8".
....
Note that unlike "\s", "\d" and "\w", "\h" and "\v" always match the same
characters, regardless whether the source string is in UTF-8 format or not.
The set of characters they match is also not influenced by locale nor EBCDIC
code page.
One might think that "\s" is equivalent to "[\h\v]". This is not true. The
vertical tab ("\x0b") is not matched by "\s", it is however considered
vertical whitespace. Furthermore, if the source string is not in UTF-8
format, and any locale or EBCDIC code page that is in effect doesn't include
them, the next line (ASCII-platform "\x85") and the no-break space
(ASCII-platform "\xA0") characters are not matched by "\s", but are by "\v"
and "\h" respectively. If the source string is in UTF-8 format, both the
next line and the no-break space are matched by "\s".
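The no-break space case can be checked directly. This is a sketch under some assumptions: an ASCII platform, no locale in effect, and no "unicode_strings" feature or /u modifier, so the default rules described above apply. I have left the vertical tab out on purpose, because the statement about it describes the perl version quoted here; from perl 5.18 on, "\s" matches the vertical tab too.

```perl
use strict;
use warnings;

my $nbsp = "\xA0";                           # no-break space as a plain byte string

my $s_native = ($nbsp =~ /\s/) ? 1 : 0;      # 0: not whitespace under native 8-bit rules
my $h_native = ($nbsp =~ /\h/) ? 1 : 0;      # 1: \h matches regardless of string format

utf8::upgrade($nbsp);                        # same character, now in UTF-8 format
my $s_utf8 = ($nbsp =~ /\s/) ? 1 : 0;        # 1: Unicode rules now apply

print "$s_native $h_native $s_utf8\n";
```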
The following table is a complete listing of characters matched by "\s",
"\h" and "\v" as of Unicode 5.2.
The first column gives the code point of the character (in hex format), the
second column gives the (Unicode) name. The third column indicates by which
class(es) the character is matched (assuming no locale or EBCDIC code page
is in effect that changes the "\s" matching).
0x00009  CHARACTER TABULATION       h s
0x0000a  LINE FEED (LF)              vs
0x0000b  LINE TABULATION             v
0x0000c  FORM FEED (FF)              vs
0x0000d  CARRIAGE RETURN (CR)        vs
0x00020  SPACE                      h s
0x00085  NEXT LINE (NEL)             vs  [1]
0x000a0  NO-BREAK SPACE             h s  [1]
0x01680  OGHAM SPACE MARK           h s
0x0180e  MONGOLIAN VOWEL SEPARATOR  h s
0x02000  EN QUAD                    h s
0x02001  EM QUAD                    h s
0x02002  EN SPACE                   h s
0x02003  EM SPACE                   h s
0x02004  THREE-PER-EM SPACE         h s
0x02005  FOUR-PER-EM SPACE          h s
0x02006  SIX-PER-EM SPACE           h s
0x02007  FIGURE SPACE               h s
0x02008  PUNCTUATION SPACE          h s
0x02009  THIN SPACE                 h s
0x0200a  HAIR SPACE                 h s
0x02028  LINE SEPARATOR              vs
0x02029  PARAGRAPH SEPARATOR         vs
0x0202f  NARROW NO-BREAK SPACE      h s
0x0205f  MEDIUM MATHEMATICAL SPACE  h s
0x03000  IDEOGRAPHIC SPACE          h s
[1] NEXT LINE and NO-BREAK SPACE only match "\s" if the source string is in
UTF-8 format, or the locale or EBCDIC code page that is in effect
includes them.
perl561delta
....
Unicode support
...
The Unicode character classes \p{Blank} and \p{SpacePerl} have been
added. "Blank" is like C isblank(), that is, it contains only
"horizontal whitespace" (the space character is, the newline isn't), and
the "SpacePerl" is the Unicode equivalent of "\s" (\p{Space} isn't,
since that includes the vertical tabulator character, whereas "\s"
doesn't.)