Why is this sub removing newlines??

Discussion in 'Perl Misc' started by John Black, Dec 5, 2013.

  1. John Black

    John Black Guest

    This sub is just supposed to strip off whitespace (at both the beginning and end of a
    string). But it's also stripping off newlines at the end of the string! Why would that be?
    \s does not include newline, right?

    sub trim()
    {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
    }

    John Black
    John Black, Dec 5, 2013
    #1

  2. John Black <> writes:
    > This sub is just supposed to strip off whitespace (at both the beginning and end of a
    > string). But its also stripping off newlines at the end of the string! Why would that be?
    > \s does not include newline, right?
    >
    > sub trim()
    > {
    > my $string = shift;
    > $string =~ s/^\s+//;
    > $string =~ s/\s+$//;
    > return $string;
    > }


    [rw@sable]~#perl -e 'print "\n" =~ /\s/, "\n"'
    1
    Rainer Weikusat, Dec 5, 2013
    #2

  3. >>>>> "JB" == John Black <> writes:

    JB> \s does not include newline, right?

    perldoc perlrecharclass:

    "\s" matches any single character considered whitespace.

    and the following table:

    0x00009  CHARACTER TABULATION   h s
    0x0000a  LINE FEED (LF)          vs
    0x0000b  LINE TABULATION         v
    0x0000c  FORM FEED (FF)          vs
    0x0000d  CARRIAGE RETURN (CR)    vs
    0x00020  SPACE                  h s
    0x00085  NEXT LINE (NEL)         vs  [1]
    0x000a0  NO-BREAK SPACE         h s  [1]

    So yes, newline *is* considered whitespace.
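
    The h/v/s columns can be checked directly. The following is only a minimal sketch: the helper name `classes_for` and the sample labels are made up for illustration, and it assumes Perl 5.10 or later for `\h` and `\v`. The vertical tab is left out on purpose, because whether `\s` matches it differs between Perl versions (the documentation quoted in this thread excludes it; newer perls, 5.18 and later as far as I know, include it).

```perl
use strict;
use warnings;

# Report which of \h, \v and \s match a single character, using the same
# "h", "v", "s" letters as the table above.
sub classes_for {
    my ($char) = @_;
    my $flags = '';
    $flags .= 'h' if $char =~ /\h/;
    $flags .= 'v' if $char =~ /\v/;
    $flags .= 's' if $char =~ /\s/;
    return $flags;
}

# Illustrative labels paired with the ASCII characters from the table.
my @samples = (
    [ 'TAB',   "\t" ],
    [ 'LF',    "\n" ],
    [ 'FF',    "\f" ],
    [ 'CR',    "\r" ],
    [ 'SPACE', " "  ],
);

for my $sample (@samples) {
    my ($label, $char) = @$sample;
    printf "%-5s %s\n", $label, classes_for($char);
}
```

    The line feed reports both "v" and "s", which is why a trailing `\s+` takes the newline with it.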

    Charlton


    --
    Charlton Wilbur
    Charlton Wilbur, Dec 5, 2013
    #3
  4. John Black

    hymie! Guest

    In our last episode, the evil Dr. Lacto had captured our hero,
    John Black <>, who said:
    >\s does not include newline, right?


    perldoc perlre

    "\s" means the five characters "[ \f\n\r\t]"

    --hymie! http://lactose.homelinux.net/~hymie
    -------------------------------------------------------------------------------
    hymie!, Dec 5, 2013
    #4
  5. John Black

    Jim Gibson Guest

    In article <-september.org>,
    John Black <> wrote:

    > This sub is just supposed to strip off whitespace (at both the beginning and
    > end of a
    > string). But its also stripping off newlines at the end of the string! Why
    > would that be?
    > \s does not include newline, right?


    'perldoc perlre' contains these excerpts:

    Character Classes and other Special Escapes
    ....
    In addition, Perl defines the following:

    Sequence Note Description
    ....
    \s [3] Match a whitespace character
    ....
    [3] See "Backslash sequences" in perlrecharclass for details.
    (end)

    Following that reference to 'perldoc perlrecharclass' yields:

    Whitespace

    "\s" matches any single character that is considered whitespace. The
    exact set of characters matched by "\s" depends on whether the source
    string is in UTF-8 format and the locale or EBCDIC code page that is in
    effect. If it's in UTF-8 format, "\s" matches what is considered
    whitespace in the Unicode database; the complete list is in the table
    below. Otherwise, if there is a locale or EBCDIC code page in effect,
    "\s" matches whatever is considered whitespace by the current locale or
    EBCDIC code page. Without a locale or EBCDIC code page, "\s" matches
    the horizontal tab ("\t"), the newline ("\n"), the form feed ("\f"),
    the carriage return ("\r"), and the space. (Note that it doesn't match
    the vertical tab, "\cK".) Perhaps the most notable possible surprise
    is that "\s" matches a non-breaking space only if the non-breaking
    space is in a UTF-8 encoded string or the locale or EBCDIC code page
    that is in effect has that character. See "Locale, EBCDIC, Unicode and
    UTF-8".
    (end)

    So, yes, \s does include the newline.

    --
    Jim Gibson
    Jim Gibson, Dec 5, 2013
    #5
  6. John Black

    gamo Guest

    On 05/12/13 19:50, John Black wrote:
    > This sub is just supposed to strip off whitespace (at both the beginning and end of a
    > string). But its also stripping off newlines at the end of the string! Why would that be?
    > \s does not include newline, right?
    >
    > sub trim()
    > {
    > my $string = shift;
    > $string =~ s/^\s+//;
    > $string =~ s/\s+$//;
    > return $string;
    > }
    >
    > John Black
    >


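    This is absurd, but maybe it does just what you want:

    :~/test$ cat test.trim
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $s = " only this:
    ";
    print trim($s);


    sub trim {
        my $string = shift;
        my $space = ' ';
        # Anchor with ^ so that only leading spaces go, never spaces inside the string.
        $string =~ s/^$space+//;
        # Reverse so that trailing spaces become leading ones, then strip again.
        $string = reverse $string;
        $string =~ s/^$space+//;
        $string = reverse $string;
        return $string;
    }

    :~/test$ perl test.trim
    only this:
    :~/test$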

    Best regards
    gamo, Dec 5, 2013
    #6
  7. John Black

    John Black Guest

    In article <>, says...
    >
    > On 05/12/13 18:50, John Black wrote:
    > > \s does not include newline, right?

    >
    > John, I would have agreed with you. Plainly we're both wrong, as the
    > follow-ups, not to mention the documentation, have shown, but what is it
    > we're (mis)remembering? There's some circumstance in which newline \n
    > behaves differently from the other white space characters.


    Now that I see that \s includes vertical and horizontal types of characters, it makes more
    sense. Up to this point, I've been using \s as a shortcut for spaces or tabs. I'll have to
    keep this in mind - I had wanted that trim function to not strip the newlines (and not add
    any either if there wasn't one). Should not be hard to work around. Thanks all.

    John Black
    John Black, Dec 5, 2013
    #7
  8. Henry Law <> writes:
    > On 05/12/13 18:50, John Black wrote:
    >> \s does not include newline, right?

    >
    > John, I would have agreed with you. Plainly we're both wrong, as the
    > follow-ups, not to mention the documentation, have shown, but what is
    > it we're (mis)remembering? There's some circumstance in which newline
    > \n behaves differently from the other white space characters.


    Guess: There's a circumstance where it behaves differently from other
    characters, namely, a . won't match \n unless the s-flag is used
    together with the match operator.
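
    That guess is easy to check. A minimal sketch, with a made-up sample string:

```perl
use strict;
use warnings;

# Sample string with an embedded newline (illustrative only).
my $text = "a\nb";

# Without /s, "." refuses to match the newline between "a" and "b".
print "without /s: ", ($text =~ /a.b/  ? "match" : "no match"), "\n";

# With /s, "." matches any character, newline included.
print "with /s:    ", ($text =~ /a.b/s ? "match" : "no match"), "\n";
```

    So the newline really is special for ".", just not for \s.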
    Rainer Weikusat, Dec 5, 2013
    #8
  9. John Black <> wrote:
    >In article <>, says...
    >>
    >> On 05/12/13 18:50, John Black wrote:
    >> > \s does not include newline, right?

    >>
    >> John, I would have agreed with you. Plainly we're both wrong, as the
    >> follow-ups, not to mention the documentation, have shown, but what is it
    >> we're (mis)remembering? There's some circumstance in which newline \n
    >> behaves differently from the other white space characters.

    >
    >Now that I see that \s includes vertical and horizontal types of characters, it makes more
    >sense. Up to this point,


    Try looking at it from a programming language point of view. Most modern
    programming languages are free-format, i.e. in the program code a single
    space is as good as 20 tabs or as 5 newlines. Therefore there is some
    sense in including all of them in \s.

    jue
    Jürgen Exner, Dec 5, 2013
    #9
  10. John Black

    Jim Gibson Guest

    In article <>, Ben Morrow
    <> wrote:

    > Quoth Jim Gibson <>:
    > >
    > > Following that reference to 'perldoc perlrecharclass' yields:
    > >
    > > Whitespace
    > >

    <snipped>

    > That's a pretty old copy of that documentation. Since 5.14 the Unicode
    > Bug has been fixed, and character-class matching no longer depends on
    > the internal format of the string.
    >
    > Ben


    Thanks. It's from 5.12.4, which is what I am using.

    --
    Jim Gibson
    Jim Gibson, Dec 6, 2013
    #10
  11. On 05.12.2013 19:50, John Black wrote:
    > This sub is just supposed to strip off whitespace (at both the beginning and end of a
    > string). But its also stripping off newlines at the end of the string! Why would that be?
    > \s does not include newline, right?
    >
    > sub trim()
    > {
    > my $string = shift;
    > $string =~ s/^\s+//;
    > $string =~ s/\s+$//;
    > return $string;
    > }


    BTW,
    is there any reason to reinvent the wheel?
    There are several CPAN modules that do this very common job:
    - https://metacpan.org/pod/String::Trim
    - https://metacpan.org/pod/String::Strip
    - https://metacpan.org/pod/Text::Trim

    Using one of them makes the source code more readable, shorter, and easier
    to maintain, and it will often have fewer bugs.


    Greetings,
    Janek
    Janek Schleicher, Dec 6, 2013
    #11
  12. Janek Schleicher <> writes:
    > On 05.12.2013 19:50, John Black wrote:
    >> This sub is just supposed to strip off whitespace (at both the beginning and end of a
    >> string). But its also stripping off newlines at the end of the string! Why would that be?
    >> \s does not include newline, right?
    >>
    >> sub trim()
    >> {
    >> my $string = shift;
    >> $string =~ s/^\s+//;
    >> $string =~ s/\s+$//;
    >> return $string;
    >> }

    >
    > BTW,
    > is there any reason to reinvent the wheel.
    > There are several CPAN-modules doing one of the most often needed Jobs:
    > - https://metacpan.org/pod/String::Trim
    > - https://metacpan.org/pod/String::Strip
    > - https://metacpan.org/pod/Text::Trim


    Using a gross oversimplification, there is only one 'wheel'[*] but there
    are already at least three different CPAN modules for deleting
    characters at the beginning or the end of a string. Consequently, none
    of them can be the equivalent of 'the wheel' for solving this problem.

    [*] Actually, there are all kinds of different wheels and new kinds are
    constantly being invented.
    Rainer Weikusat, Dec 6, 2013
    #12
  13. John Black

    Guest

    On Thursday, December 5, 2013 4:00:51 PM UTC-6, Rainer Weikusat wrote:
    > Henry Law <> writes:
    > > On 05/12/13 18:50, John Black wrote:
    > >> \s does not include newline, right?
    > >
    > > John, I would have agreed with you. Plainly we're both wrong, as the
    > > follow-ups, not to mention the documentation, have shown, but what is
    > > it we're (mis)remembering? There's some circumstance in which newline
    > > \n behaves differently from the other white space characters.
    >
    > Guess: There's a circumstance where it behaves differently from other
    > characters, namely, a . won't match \n unless the s-flag is used
    > together with the match operator.


    Also, there's the fact that $ in regex matches the end of the string or before the newline at the end. If you're thinking of or expecting that second behavior and have forgotten about greediness, you may expect that the newline wouldn't be removed in the expression s/\s+$//;

    Maybe a bit of a stretch, but as long as we're guessing what's in other people's heads ... :)
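
    The interaction between `$` and greediness can be seen in a short experiment. This is only a sketch on a made-up sample string; the variable names are illustrative.

```perl
use strict;
use warnings;

# Sample line: text, two trailing spaces, then a newline (illustrative only).
my $line = "text  \n";

# "$" matches at the very end of the string or just before a final newline;
# "\z" matches only at the very end.
print "\$ matches before the newline\n" if     $line =~ /text  $/;
print "\\z does not match there\n"      unless $line =~ /text  \z/;

# Greedy \s+ swallows the spaces and the newline, and "$" is still
# satisfied at the very end, so the newline is removed.
(my $greedy = $line) =~ s/\s+$//;     # $greedy is "text"

# Lazy \s+? stops as soon as "$" can match, which is before the newline.
(my $lazy = $line) =~ s/\s+?$//;      # $lazy is "text\n"

printf "greedy: %d chars, lazy: %d chars\n", length($greedy), length($lazy);
```

    The lazy form is here only to illustrate greediness. It is not a general fix: on "text\n" it would still remove the newline, because \s+? has to consume at least one character.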

    -Scott
    , Dec 6, 2013
    #13
  14. John Black

    John Black Guest

    In article <>, says...
    > Of course, it's
    > probably easier to just use [ \t] if that's what you mean...


    Well, for many long regexes \s is used a lot and they are already ugly enough without
    substituting [ \t] everywhere. I think that now that I know \n is included, I can be careful
    and work around that when it matters with [ \t] or something else. Thanks.
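
    One compact "something else" is \h, the horizontal-whitespace class that the documentation quoted elsewhere in this thread lists as new in Perl 5.10.0. A minimal sketch, assuming Perl 5.10 or later; the name `trim_horizontal` is made up for illustration. Note that \h is not an exact synonym for [ \t]: it also matches the no-break space and the Unicode spaces marked "h" in the perlrecharclass table.

```perl
use strict;
use warnings;

# Hypothetical helper: trims horizontal whitespace only, so newlines survive.
sub trim_horizontal {
    my ($string) = @_;
    $string =~ s/^\h+//;    # leading spaces and tabs
    $string =~ s/\h+$//;    # trailing ones; "$" can match before a final "\n"
    return $string;
}

print trim_horizontal("  \tonly this:  \n");
```

    This neither strips a trailing newline nor adds one when there was none.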

    John Black
    John Black, Dec 6, 2013
    #14
  15. John Black

    $Bill Guest

    On 12/5/2013 13:30, Henry Law wrote:
    > On 05/12/13 18:50, John Black wrote:
    >> \s does not include newline, right?

    >
    > John, I would have agreed with you. Plainly we're both wrong, as the follow-ups, not to mention the documentation, have shown, but what is it we're (mis)remembering? There's some circumstance in which newline \n behaves differently from the other white space characters.


    Not sure if this helps, but I searched the manual for
    /white.*newline and /\\s and /\\s.*newline and it yielded:

    perlintro
    ....
    More complex regular expressions
    You don't just have to match on fixed strings. In fact, you can match on
    just about anything you could dream of by using more complex regular
    expressions. These are documented at great length in perlre, but for the
    meantime, here's a quick cheat sheet:

    . a single character
    \s a whitespace character (space, tab, newline, ...)

    perlglossary
    ....
    continuation
    The treatment of more than one physical "line" as a single logical line.
    "Makefile" lines are continued by putting a backslash before the
    "newline". Mail headers as defined by RFC 822 are continued by putting a
    space or tab *after* the newline. In general, lines in Perl do not need
    any form of continuation mark, because "whitespace" (including newlines)
    is gleefully ignored. Usually.

    perlrequick
    ....
    Perl has several abbreviations for common character classes:

    * \d is a digit and represents

    [0-9]

    * \s is a whitespace character and represents

    [\ \t\r\n\f]

    perlretut
    ....
    * \s matches a whitespace character, the set [\ \t\r\n\f] and others
    ....
    The "[:digit:]", "[:word:]", and "[:space:]" correspond to the
    familiar "\d", "\w", and "\s" character classes.

    perlfaq4
    ....
    How do I strip blank space from the beginning/end of a string?
    (contributed by brian d foy)

    A substitution can do this for you. For a single line, you want to replace
    all the leading or trailing whitespace with nothing. You can do that with a
    pair of substitutions.

    s/^\s+//;
    s/\s+$//;

    You can also write that as a single substitution, although it turns out the
    combined statement is slower than the separate ones. That might not matter
    to you, though.

    s/^\s+|\s+$//g;

    In this regular expression, the alternation matches either at the beginning
    or the end of the string since the anchors have a lower precedence than the
    alternation. With the "/g" flag, the substitution makes all possible
    matches, so it gets both. Remember, the trailing newline matches the "\s+",
    and the "$" anchor can match to the physical end of the string, so the
    newline disappears too. Just add the newline to the output, which has the
    added benefit of preserving "blank" (consisting entirely of whitespace)
    lines which the "^\s+" would remove all by itself.

    while( <> )
    {
    s/^\s+|\s+$//g;
    print "$_\n";
    }

    For a multi-line string, you can apply the regular expression to each
    logical line in the string by adding the "/m" flag (for "multi-line"). With
    the "/m" flag, the "$" matches *before* an embedded newline, so it doesn't
    remove it. It still removes the newline at the end of the string.

    $string =~ s/^\s+|\s+$//gm;

    Remember that lines consisting entirely of whitespace will disappear, since
    the first part of the alternation can match the entire string and replace it
    with nothing. If need to keep embedded blank lines, you have to do a little
    more work. Instead of matching any whitespace (since that includes a
    newline), just match the other whitespace.

    $string =~ s/^[\t\f ]+|[\t\f ]+$//mg;

    perlrebackslash
    ....
    "\w" is a character class that matches any single *word* character (letters,
    digits, underscore). "\d" is a character class that matches any decimal
    digit, while the character class "\s" matches any whitespace character. New
    in perl 5.10.0 are the classes "\h" and "\v" which match horizontal and
    vertical whitespace characters.

    perlrecharclass
    ....
    Whitespace
    "\s" matches any single character that is considered whitespace. The exact
    set of characters matched by "\s" depends on whether the source string is in
    UTF-8 format and the locale or EBCDIC code page that is in effect. If it's
    in UTF-8 format, "\s" matches what is considered whitespace in the Unicode
    database; the complete list is in the table below. Otherwise, if there is a
    locale or EBCDIC code page in effect, "\s" matches whatever is considered
    whitespace by the current locale or EBCDIC code page. Without a locale or
    EBCDIC code page, "\s" matches the horizontal tab ("\t"), the newline
    ("\n"), the form feed ("\f"), the carriage return ("\r"), and the space.
    (Note that it doesn't match the vertical tab, "\cK".) Perhaps the most
    notable possible surprise is that "\s" matches a non-breaking space only if
    the non-breaking space is in a UTF-8 encoded string or the locale or EBCDIC
    code page that is in effect has that character. See "Locale, EBCDIC, Unicode
    and UTF-8".
    ....
    Note that unlike "\s", "\d" and "\w", "\h" and "\v" always match the same
    characters, regardless whether the source string is in UTF-8 format or not.
    The set of characters they match is also not influenced by locale nor EBCDIC
    code page.

    One might think that "\s" is equivalent to "[\h\v]". This is not true. The
    vertical tab ("\x0b") is not matched by "\s", it is however considered
    vertical whitespace. Furthermore, if the source string is not in UTF-8
    format, and any locale or EBCDIC code page that is in effect doesn't include
    them, the next line (ASCII-platform "\x85") and the no-break space
    (ASCII-platform "\xA0") characters are not matched by "\s", but are by "\v"
    and "\h" respectively. If the source string is in UTF-8 format, both the
    next line and the no-break space are matched by "\s".

    The following table is a complete listing of characters matched by "\s",
    "\h" and "\v" as of Unicode 5.2.

    The first column gives the code point of the character (in hex format), the
    second column gives the (Unicode) name. The third column indicates by which
    class(es) the character is matched (assuming no locale or EBCDIC code page
    is in effect that changes the "\s" matching).

    0x00009  CHARACTER TABULATION        h s
    0x0000a  LINE FEED (LF)               vs
    0x0000b  LINE TABULATION              v
    0x0000c  FORM FEED (FF)               vs
    0x0000d  CARRIAGE RETURN (CR)         vs
    0x00020  SPACE                       h s
    0x00085  NEXT LINE (NEL)              vs  [1]
    0x000a0  NO-BREAK SPACE              h s  [1]
    0x01680  OGHAM SPACE MARK            h s
    0x0180e  MONGOLIAN VOWEL SEPARATOR   h s
    0x02000  EN QUAD                     h s
    0x02001  EM QUAD                     h s
    0x02002  EN SPACE                    h s
    0x02003  EM SPACE                    h s
    0x02004  THREE-PER-EM SPACE          h s
    0x02005  FOUR-PER-EM SPACE           h s
    0x02006  SIX-PER-EM SPACE            h s
    0x02007  FIGURE SPACE                h s
    0x02008  PUNCTUATION SPACE           h s
    0x02009  THIN SPACE                  h s
    0x0200a  HAIR SPACE                  h s
    0x02028  LINE SEPARATOR               vs
    0x02029  PARAGRAPH SEPARATOR          vs
    0x0202f  NARROW NO-BREAK SPACE       h s
    0x0205f  MEDIUM MATHEMATICAL SPACE   h s
    0x03000  IDEOGRAPHIC SPACE           h s

    [1] NEXT LINE and NO-BREAK SPACE only match "\s" if the source string is in
    UTF-8 format, or the locale or EBCDIC code page that is in effect
    includes them.

    perl561delta
    ....
    Unicode support
    ...
    The Unicode character classes \p{Blank} and \p{SpacePerl} have been
    added. "Blank" is like C isblank(), that is, it contains only
    "horizontal whitespace" (the space character is, the newline isn't), and
    the "SpacePerl" is the Unicode equivalent of "\s" (\p{Space} isn't,
    since that includes the vertical tabulator character, whereas "\s"
    doesn't.)
    $Bill, Dec 6, 2013
    #15
  16. On 06.12.2013 15:29, Rainer Weikusat wrote:
    > Janek Schleicher <> writes:
    >> On 05.12.2013 19:50, John Black wrote:
    >>> This sub is just supposed to strip off whitespace (at both the beginning and end of a
    >>> string). But its also stripping off newlines at the end of the string! Why would that be?
    >>> \s does not include newline, right?
    >>>
    >>> sub trim()
    >>> {
    >>> my $string = shift;
    >>> $string =~ s/^\s+//;
    >>> $string =~ s/\s+$//;
    >>> return $string;
    >>> }

    >>
    >> BTW,
    >> is there any reason to reinvent the wheel.
    >> There are several CPAN-modules doing one of the most often needed Jobs:
    >> - https://metacpan.org/pod/String::Trim
    >> - https://metacpan.org/pod/String::Strip
    >> - https://metacpan.org/pod/Text::Trim

    >
    > Using a gross oversimplification, ...


    So, you also prefer to write
    s/\r?\n$// instead of oversimplifying with chomp?

    I'd rather write two easy lines that express exactly what we
    intend to do

    use WhateverModule::Trim|Strip;
    ....
    trim($string);

    than half a dozen lines in nearly every script.

    All I wonder is why trim/strip isn't a built-in function like chomp.

    If we use a regex in program logic, it should usually do something
    that is specific to our program, maybe s/blue/green/ or s/(\d+)/2*$1/ge.

    Well, OK, maybe I'm getting religious here, so TMTOWTDI.


    Greetings,
    Janek
    Janek Schleicher, Dec 7, 2013
    #16
  17. John Black

    C.DeRykus Guest

    On Thursday, December 5, 2013 1:56:46 PM UTC-8, John Black wrote:
    > ...


    > keep this in mind - I had wanted that trim function to not strip the newlines (and not add any either if there wasn't one). Should not be hard to workaround. Thanks all.
    >
    >


    Another option: a regex that'd handle any
    trailing newline:

    $string =~ s/ ^\s+ | \s+(?=\n|)$ //gx;


    --
    Charles DeRykus
    C.DeRykus, Dec 7, 2013
    #17
  18. "C.DeRykus" <> writes:

    > On Thursday, December 5, 2013 1:56:46 PM UTC-8, John Black wrote:
    >> ...

    >
    >> keep this in mind - I had wanted that trim function to not strip the
    >> newlines (and not add any either if there wasn't one). Should not be
    >> hard to workaround. Thanks all.
    >>

    >
    > Another option: a regex that'd handle any
    > trailing newline:
    >
    > $string =~ s/ ^\s+ | \s+(?=\n|)$ //gx;


    Surely this strips the newline?

    --
    Ben.
    Ben Bacarisse, Dec 7, 2013
    #18
  19. John Black

    C.DeRykus Guest

    On Saturday, December 7, 2013 4:34:44 AM UTC-8, Ben Bacarisse wrote:
    > "C.DeRykus" <> writes:
    >
    > > On Thursday, December 5, 2013 1:56:46 PM UTC-8, John Black wrote:>

    >
    > >> keep this in mind - I had wanted that trim function to not strip the
    > >> newlines (and not add any either if there wasn't one). Should not be
    > >>

    > > Another option: a regex that'd handle any
    > > trailing newline:
    > >
    > > $string =~ s/ ^\s+ | \s+(?=\n|)$ //gx;

    >
    > Surely this strips the newline?
    >


    Indeed. I was slipping off the end... I think (hope) a redemptive tweak will do it:

    $string =~ s/ ^s+ | \s++(?=\n) /gx;

    --
    Charles DeRykus
    C.DeRykus, Dec 7, 2013
    #19
  20. John Black

    C.DeRykus Guest

    On Saturday, December 7, 2013 5:49:21 AM UTC-8, C.DeRykus wrote:
    > On Saturday, December 7, 2013 4:34:44 AM UTC-8, Ben Bacarisse wrote:
    > > "C.DeRykus" <> writes:
    > > > On Thursday, December 5, 2013 1:56:46 PM UTC-8, John Black wrote:
    > > >> keep this in mind - I had wanted that trim function to not strip the
    > > >> newlines (and not add any either if there wasn't one). Should not be
    > > >
    > > > Another option: a regex that'd handle any
    > > > trailing newline:
    > > >
    > > > $string =~ s/ ^\s+ | \s+(?=\n|)$ //gx;
    > >
    > > Surely this strips the newline?
    >
    > Indeed. I was slipping off the end... I think,hope a redemptive tweak will do it:
    >
    > $string =~ s/ ^s+ | \s++(?=\n) /gx;


    Sorry, more redemption is needed.
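
    For what it is worth, here is one way the goal stated earlier in the thread (trim both ends, keep a trailing newline if there was one, never add one) could be met. It is only a sketch, not the regex the poster was working toward; the name `trim_keep_newline` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strip whitespace at both ends, but put back one
# trailing newline if the input ended with one.
sub trim_keep_newline {
    my ($string) = @_;
    # Leading whitespace, newlines included. A string that is nothing but
    # whitespace is emptied completely by this step.
    $string =~ s/\A\s+//;
    # Lazily match the trailing whitespace, capture a final "\n" if there is
    # one, and substitute just that captured newline back in.
    $string =~ s/\s*?(\n?)\z/$1/;
    return $string;
}

print trim_keep_newline("  only this:  \n");
```

    Two substitutions are used instead of one alternation with /g, so the second pattern cannot come back and match the newline it has just preserved. A run of several trailing newlines collapses to a single one.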

    --
    Charles DeRykus
    C.DeRykus, Dec 7, 2013
    #20