Why is this sub removing newlines??

John Black

This sub is just supposed to strip off whitespace (at both the beginning and end of a
string). But it's also stripping off newlines at the end of the string! Why would that be?
\s does not include newline, right?

sub trim
{
    my $string = shift;
    $string =~ s/^\s+//;    # leading
    $string =~ s/\s+$//;    # trailing
    return $string;
}

John Black
 
Rainer Weikusat

John Black said:
This sub is just supposed to strip off whitespace (at both the beginning and end of a
string). But it's also stripping off newlines at the end of the string! Why would that be?
\s does not include newline, right?

sub trim
{
    my $string = shift;
    $string =~ s/^\s+//;    # leading
    $string =~ s/\s+$//;    # trailing
    return $string;
}

[rw@sable]~#perl -e 'print "\n" =~ /\s/, "\n"'
1
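
A slightly longer version of the same check, as a minimal sketch: it tests a handful of characters against \s and reports which ones match. The helper name is_ws and the choice of sample characters are illustrative only. Vertical tab is deliberately left out, since its treatment depends on the Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the single character matches \s, else 0.
sub is_ws {
    my $char = shift;
    return $char =~ /\A\s\z/ ? 1 : 0;
}

# Label/character pairs to probe (an arbitrary selection).
my @samples = (
    [ 'space',           ' '  ],
    [ 'tab',             "\t" ],
    [ 'newline',         "\n" ],
    [ 'carriage return', "\r" ],
    [ 'form feed',       "\f" ],
    [ 'letter a',        'a'  ],
);

for my $sample (@samples) {
    my ($name, $char) = @$sample;
    printf "%-16s %s\n", $name,
        is_ws($char) ? 'matches \s' : 'does not match \s';
}
```

Everything except the letter should be reported as matching, the newline included.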
 
Charlton Wilbur

JB> \s does not include newline, right?

perldoc perlrecharclass:

"\s" matches any single character considered whitespace.

and the following table:

0x00009  CHARACTER TABULATION   h s
0x0000a  LINE FEED (LF)          vs
0x0000b  LINE TABULATION         v
0x0000c  FORM FEED (FF)          vs
0x0000d  CARRIAGE RETURN (CR)    vs
0x00020  SPACE                  h s
0x00085  NEXT LINE (NEL)         vs  [1]
0x000a0  NO-BREAK SPACE         h s  [1]

So yes, newline *is* considered whitespace.

Charlton
 
Jim Gibson

John Black said:
This sub is just supposed to strip off whitespace (at both the beginning and end of a
string). But it's also stripping off newlines at the end of the string! Why would that be?
\s does not include newline, right?

'perldoc perlre' contains these excerpts:

Character Classes and other Special Escapes
....
In addition, Perl defines the following:

Sequence Note Description
....
\s [3] Match a whitespace character
....
[3] See "Backslash sequences" in perlrecharclass for details.
(end)

Following that reference to 'perldoc perlrecharclass' yields:

Whitespace

"\s" matches any single character that is considered whitespace. The
exact set of characters matched by "\s" depends on whether the source
string is in UTF-8 format and the locale or EBCDIC code page that is in
effect. If it's in UTF-8 format, "\s" matches what is considered
whitespace in the Unicode database; the complete list is in the table
below. Otherwise, if there is a locale or EBCDIC code page in effect,
"\s" matches whatever is considered whitespace by the current locale or
EBCDIC code page. Without a locale or EBCDIC code page, "\s" matches
the horizontal tab ("\t"), the newline ("\n"), the form feed ("\f"),
the carriage return ("\r"), and the space. (Note that it doesn't match
the vertical tab, "\cK".) Perhaps the most notable possible surprise
is that "\s" matches a non-breaking space only if the non-breaking
space is in a UTF-8 encoded string or the locale or EBCDIC code page
that is in effect has that character. See "Locale, EBCDIC, Unicode and
UTF-8".
(end)

So, yes, \s does include the newline.
 
gamo

On 05/12/13 19:50, John Black wrote:
This sub is just supposed to strip off whitespace (at both the beginning and end of a
string). But it's also stripping off newlines at the end of the string! Why would that be?
\s does not include newline, right?

sub trim
{
    my $string = shift;
    $string =~ s/^\s+//;    # leading
    $string =~ s/\s+$//;    # trailing
    return $string;
}

John Black

This is absurd, but maybe it does just what you want:

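:~/test$ cat test.trim
#!/usr/bin/perl
use strict;
use warnings;

my $s = " only this:
";
print trim($s);

# Strip literal spaces only (not tabs or newlines) from both ends.
sub trim {
    my $string = shift;
    my $space = ' ';
    $string =~ s/^$space+//;      # leading spaces
    $string = reverse $string;
    $string =~ s/^$space+//;      # trailing spaces, now at the front
    $string = reverse $string;
    return $string;
}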

:~/test$ perl test.trim
only this:
:~/test$

Best regards
 
John Black

Henry Law said:
John, I would have agreed with you. Plainly we're both wrong, as the
follow-ups, not to mention the documentation, have shown, but what is it
we're (mis)remembering? There's some circumstance in which newline \n
behaves differently from the other white space characters.

Now that I see that \s includes both vertical and horizontal whitespace characters, it makes more
sense. Up to this point, I've been using \s as a shortcut for spaces or tabs. I'll have to
keep this in mind - I had wanted that trim function to not strip the newlines (and not add
any either if there wasn't one). Should not be hard to work around. Thanks all.
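
One possible workaround, as a sketch only: trim horizontal whitespace from both ends and leave a single trailing newline alone if there is one. This assumes Perl 5.10 or later for \h, which also covers non-ASCII horizontal spaces such as the no-break space; the sub name trim_keep_newline is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of trim: removes leading and trailing horizontal
# whitespace but keeps (and never adds) a final newline.
sub trim_keep_newline {
    my $string = shift;
    $string =~ s/\A\h+//;            # leading spaces/tabs
    $string =~ s/\h+(\n?)\z/$1/;     # trailing spaces/tabs, newline put back
    return $string;
}

print trim_keep_newline("  padded line \t\n");
```

[ \t] can replace \h on older perls or when only ASCII space and tab are meant.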

John Black
 
Rainer Weikusat

Henry Law said:
John, I would have agreed with you. Plainly we're both wrong, as the
follow-ups, not to mention the documentation, have shown, but what is
it we're (mis)remembering? There's some circumstance in which newline
\n behaves differently from the other white space characters.

Guess: There's a circumstance where it behaves differently from other
characters, namely, a . won't match \n unless the s-flag is used
together with the match operator.
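
That guess is easy to check. A minimal sketch (the helper name dot_matches_newline is invented for the example) that matches a lone newline against . with and without the /s flag:

```perl
use strict;
use warnings;

# Hypothetical helper: does "." match a lone newline, with or without /s?
sub dot_matches_newline {
    my ($use_s_flag) = @_;
    return $use_s_flag
        ? ( "\n" =~ /./s ? 1 : 0 )
        : ( "\n" =~ /./  ? 1 : 0 );
}

printf "without /s: %s\n", dot_matches_newline(0) ? 'match' : 'no match';
printf "with /s:    %s\n", dot_matches_newline(1) ? 'match' : 'no match';
```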
 
Jürgen Exner

John Black said:
Now that I see that \s includes vertical and horizontal types of characters, it makes more
sense. Up to this point,

Try looking at it from a programming language point of view. Most modern
programming languages are free-format, i.e. in the program code a single
space is as good as 20 tabs or as 5 newlines. Therefore there is some
sense in including all of them in \s.

jue
 
Jim Gibson

Ben said:
That's a pretty old copy of that documentation. Since 5.14 the Unicode
Bug has been fixed, and character-class matching no longer depends on
the internal format of the string.

Thanks. It's from 5.12.4, which is what I am using.
 
Janek Schleicher

On 05.12.2013 19:50, John Black wrote:
This sub is just supposed to strip off whitespace (at both the beginning and end of a
string). But it's also stripping off newlines at the end of the string! Why would that be?
\s does not include newline, right?

sub trim
{
    my $string = shift;
    $string =~ s/^\s+//;    # leading
    $string =~ s/\s+$//;    # trailing
    return $string;
}

BTW, is there any reason to reinvent the wheel?
There are several CPAN modules for one of the most frequently needed jobs:
- https://metacpan.org/pod/String::Trim
- https://metacpan.org/pod/String::Strip
- https://metacpan.org/pod/Text::Trim

Using one of them makes the source code more readable, shorter, and easier to
maintain, and it will often have fewer bugs.


Greetings,
Janek
 
Rainer Weikusat

Janek Schleicher said:
On 05.12.2013 19:50, John Black wrote:

BTW, is there any reason to reinvent the wheel?
There are several CPAN modules for one of the most frequently needed jobs:
- https://metacpan.org/pod/String::Trim
- https://metacpan.org/pod/String::Strip
- https://metacpan.org/pod/Text::Trim

Using a gross oversimplification, there is only one 'wheel'[*] but there
are already at least three different CPAN modules for deleting
characters at the beginning or the end of a string. Consequently, none
of them can be the equivalent of 'the wheel' for solving this problem.

[*] Actually, there are all kinds of different wheels and new kinds are
constantly being invented.
 
perlpilot

Rainer Weikusat said:
Guess: There's a circumstance where it behaves differently from other
characters, namely, a . won't match \n unless the s-flag is used
together with the match operator.

Also, there's the fact that $ in regex matches the end of the string or before the newline at the end. If you're thinking of or expecting that second behavior and have forgotten about greediness, you may expect that the newline wouldn't be removed in the expression s/\s+$//;
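
Both effects can be seen side by side in a small sketch (all sub names here are invented for the example): $ is satisfied before a final newline while \z is not, and the greedy \s+ still swallows that newline, whereas a class without \n leaves it in place.

```perl
use strict;
use warnings;

# "$" matches at the very end of the string or just before a final newline;
# "\z" matches only at the very end.
sub ends_with_dollar { return $_[0] =~ /abc$/  ? 1 : 0 }
sub ends_with_z      { return $_[0] =~ /abc\z/ ? 1 : 0 }

# Greedy \s+ consumes the newline as well, so it is removed.
sub strip_ws {
    my $s = shift;
    $s =~ s/\s+$//;
    return $s;
}

# [ \t]+ cannot consume the newline; "$" matches just before it instead.
sub strip_blanks {
    my $s = shift;
    $s =~ s/[ \t]+$//;
    return $s;
}

printf "/abc\$/  on \"abc\\n\": %d\n", ends_with_dollar("abc\n");
printf "/abc\\z/ on \"abc\\n\": %d\n", ends_with_z("abc\n");
printf "strip_ws keeps newline:     %s\n",
    strip_ws("abc  \n") =~ /\n\z/ ? 'yes' : 'no';
printf "strip_blanks keeps newline: %s\n",
    strip_blanks("abc  \n") =~ /\n\z/ ? 'yes' : 'no';
```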

Maybe a bit of a stretch, but as long we're guessing what's in other people's heads ... :)

-Scott
 
John Black

Of course, it's probably easier to just use [ \t] if that's what you mean...

Well, in many long regexes \s is used a lot, and they are already ugly enough without
substituting [ \t] everywhere. I think that now that I know \n is included, I can be careful
and work around that when it matters with [ \t] or something else. Thanks.

John Black
 
$Bill

Henry Law said:
John, I would have agreed with you. Plainly we're both wrong, as the follow-ups, not to mention the documentation, have shown, but what is it we're (mis)remembering? There's some circumstance in which newline \n behaves differently from the other white space characters.

Not sure if this helps, but I searched the manual for
/white.*newline and /\\s and /\\s.*newline and it yielded:

perlintro
....
More complex regular expressions
You don't just have to match on fixed strings. In fact, you can match on
just about anything you could dream of by using more complex regular
expressions. These are documented at great length in perlre, but for the
meantime, here's a quick cheat sheet:

. a single character
\s a whitespace character (space, tab, newline, ...)

perlglossary
....
continuation
The treatment of more than one physical "line" as a single logical line.
"Makefile" lines are continued by putting a backslash before the
"newline". Mail headers as defined by RFC 822 are continued by putting a
space or tab *after* the newline. In general, lines in Perl do not need
any form of continuation mark, because "whitespace" (including newlines)
is gleefully ignored. Usually.

perlrequick
....
Perl has several abbreviations for common character classes:

* \d is a digit and represents [0-9]

* \s is a whitespace character and represents [\ \t\r\n\f]

perlretut
....
* \s matches a whitespace character, the set [\ \t\r\n\f] and others
....
The "[:digit:]", "[:word:]", and "[:space:]" correspond to the
familiar "\d", "\w", and "\s" character classes.

perlfaq4
....
How do I strip blank space from the beginning/end of a string?
(contributed by brian d foy)

A substitution can do this for you. For a single line, you want to replace
all the leading or trailing whitespace with nothing. You can do that with a
pair of substitutions.

s/^\s+//;
s/\s+$//;

You can also write that as a single substitution, although it turns out the
combined statement is slower than the separate ones. That might not matter
to you, though.

s/^\s+|\s+$//g;

In this regular expression, the alternation matches either at the beginning
or the end of the string since the anchors have a lower precedence than the
alternation. With the "/g" flag, the substitution makes all possible
matches, so it gets both. Remember, the trailing newline matches the "\s+",
and the "$" anchor can match to the physical end of the string, so the
newline disappears too. Just add the newline to the output, which has the
added benefit of preserving "blank" (consisting entirely of whitespace)
lines which the "^\s+" would remove all by itself.

while( <> )
{
s/^\s+|\s+$//g;
print "$_\n";
}

For a multi-line string, you can apply the regular expression to each
logical line in the string by adding the "/m" flag (for "multi-line"). With
the "/m" flag, the "$" matches *before* an embedded newline, so it doesn't
remove it. It still removes the newline at the end of the string.

$string =~ s/^\s+|\s+$//gm;

Remember that lines consisting entirely of whitespace will disappear, since
the first part of the alternation can match the entire string and replace it
with nothing. If you need to keep embedded blank lines, you have to do a little
more work. Instead of matching any whitespace (since that includes a
newline), just match the other whitespace.

$string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
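
To see what that last substitution does to a multi-line string, here is a minimal sketch; the sub name trim_lines and the sample text are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution above: trims tabs, form feeds
# and spaces at the start and end of every line, leaving newlines untouched.
sub trim_lines {
    my $text = shift;
    $text =~ s/^[\t\f ]+|[\t\f ]+$//mg;
    return $text;
}

my $sample = "  one  \n\n  two  \n";
print trim_lines($sample);
```

The empty middle line survives, because the character class cannot match a newline.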

perlrebackslash
....
"\w" is a character class that matches any single *word* character (letters,
digits, underscore). "\d" is a character class that matches any decimal
digit, while the character class "\s" matches any whitespace character. New
in perl 5.10.0 are the classes "\h" and "\v" which match horizontal and
vertical whitespace characters.

perlrecharclass
....
Whitespace
"\s" matches any single character that is considered whitespace. The exact
set of characters matched by "\s" depends on whether the source string is in
UTF-8 format and the locale or EBCDIC code page that is in effect. If it's
in UTF-8 format, "\s" matches what is considered whitespace in the Unicode
database; the complete list is in the table below. Otherwise, if there is a
locale or EBCDIC code page in effect, "\s" matches whatever is considered
whitespace by the current locale or EBCDIC code page. Without a locale or
EBCDIC code page, "\s" matches the horizontal tab ("\t"), the newline
("\n"), the form feed ("\f"), the carriage return ("\r"), and the space.
(Note that it doesn't match the vertical tab, "\cK".) Perhaps the most
notable possible surprise is that "\s" matches a non-breaking space only if
the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
code page that is in effect has that character. See "Locale, EBCDIC, Unicode
and UTF-8".
....
Note that unlike "\s", "\d" and "\w", "\h" and "\v" always match the same
characters, regardless whether the source string is in UTF-8 format or not.
The set of characters they match is also not influenced by locale nor EBCDIC
code page.

One might think that "\s" is equivalent to "[\h\v]". This is not true. The
vertical tab ("\x0b") is not matched by "\s", it is however considered
vertical whitespace. Furthermore, if the source string is not in UTF-8
format, and any locale or EBCDIC code page that is in effect doesn't include
them, the next line (ASCII-platform "\x85") and the no-break space
(ASCII-platform "\xA0") characters are not matched by "\s", but are by "\v"
and "\h" respectively. If the source string is in UTF-8 format, both the
next line and the no-break space are matched by "\s".

The following table is a complete listing of characters matched by "\s",
"\h" and "\v" as of Unicode 5.2.

The first column gives the code point of the character (in hex format), the
second column gives the (Unicode) name. The third column indicates by which
class(es) the character is matched (assuming no locale or EBCDIC code page
is in effect that changes the "\s" matching).

0x00009  CHARACTER TABULATION        h s
0x0000a  LINE FEED (LF)               vs
0x0000b  LINE TABULATION              v
0x0000c  FORM FEED (FF)               vs
0x0000d  CARRIAGE RETURN (CR)         vs
0x00020  SPACE                       h s
0x00085  NEXT LINE (NEL)              vs  [1]
0x000a0  NO-BREAK SPACE              h s  [1]
0x01680  OGHAM SPACE MARK            h s
0x0180e  MONGOLIAN VOWEL SEPARATOR   h s
0x02000  EN QUAD                     h s
0x02001  EM QUAD                     h s
0x02002  EN SPACE                    h s
0x02003  EM SPACE                    h s
0x02004  THREE-PER-EM SPACE          h s
0x02005  FOUR-PER-EM SPACE           h s
0x02006  SIX-PER-EM SPACE            h s
0x02007  FIGURE SPACE                h s
0x02008  PUNCTUATION SPACE           h s
0x02009  THIN SPACE                  h s
0x0200a  HAIR SPACE                  h s
0x02028  LINE SEPARATOR               vs
0x02029  PARAGRAPH SEPARATOR          vs
0x0202f  NARROW NO-BREAK SPACE       h s
0x0205f  MEDIUM MATHEMATICAL SPACE   h s
0x03000  IDEOGRAPHIC SPACE           h s

[1] NEXT LINE and NO-BREAK SPACE only match "\s" if the source string is in
UTF-8 format, or the locale or EBCDIC code page that is in effect
includes them.

perl561delta
....
Unicode support
...
The Unicode character classes \p{Blank} and \p{SpacePerl} have been
added. "Blank" is like C isblank(), that is, it contains only
"horizontal whitespace" (the space character is, the newline isn't), and
the "SpacePerl" is the Unicode equivalent of "\s" (\p{Space} isn't,
since that includes the vertical tabulator character, whereas "\s"
doesn't.)
 
Janek Schleicher

On 06.12.2013 15:29, Rainer Weikusat wrote:
Using a gross oversimplification, ...

So, you also prefer to write
s/\r?\n$// instead of oversimplifying with chomp?

I'd rather write two easy lines that express exactly what we
intend to do

use WhateverModule::Trim|Strip;
....
trim($string);

than half a dozen lines in nearly every script.

All I wonder is why trim/strip isn't a built-in function like chomp.

If we use a regexp in program logic, it should usually do something
that is specific to our program, maybe s/blue/green/ or s/(\d+)/2*$1/ge.

Well, o.k., maybe I get religious here, so TMTOWTDI.


Greetings,
Janek
 
C.DeRykus

John Black said:
keep this in mind - I had wanted that trim function to not strip the newlines (and not add any either if there wasn't one). Should not be hard to work around. Thanks all.

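Another option: a regex that leaves a trailing newline in place. The
trailing part must not be allowed to match \n itself, otherwise the
greedy \s+ takes the newline along with the spaces:

$string =~ s/ ^\s+ | [^\S\n]+(?=\n?\z) //gx;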
 
