Why does chomp leave newlines?

Discussion in 'Perl Misc' started by Mark Healey, May 8, 2004.

  1. Mark Healey

    Mark Healey Guest

    First some fragments

    I get the array thusly:

    @searchTerms=split(/\n/,$queryHash{"searchText"});

    I later print it:


    sub printSearchTerms
    {
        foreach(@searchTerms)
        {
            chomp;
            print ("$_<BR>\n");
        }
    }

    And yet I get:

    point loma
    <br>
    mission hills
    <br>
    hillcrest
    <br>
    bankers hill
    <br>
    university heights<br>

    What's up?
     
    Mark Healey, May 8, 2004
    #1

  2. perldoc -f chomp

    chomp removes a trailing input record separator ("$/"), which defaults
    to the newline of the OS it is running on. If newlines in your data are
    CR-LF and the default newline on your OS is not, then you may need to
    set $/ = "\015\012"; before using chomp on that data.
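
A minimal sketch of the behaviour described above, assuming a perl where "\n" is LF (Unix or Windows). The sample string and variable names are made up for illustration. It compares chomp under the default $/ with chomp under $/ set locally to CR LF.

```perl
use strict;
use warnings;

# Hypothetical input: one line ending in CR LF, as a form submission would send it.
my $line = "point loma\015\012";

# Default $/ is "\n" (assumed to be LF here), so chomp strips only the LF
# and the CR stays on the end of the string.
my $default = $line;
chomp $default;

# With $/ set locally to CR LF, chomp strips both characters.
my $crlf = $line;
{
    local $/ = "\015\012";
    chomp $crlf;
}

printf "default \$/: %d chars left, CRLF \$/: %d chars left\n",
    length $default, length $crlf;
```

The leftover CR is what shows up as a stray line break before each <BR> in the output quoted in the first post.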
     
    David Efflandt, May 8, 2004
    #2

  3. Mark Healey

    Mark Healey Guest

    Since this is supposed to be a CGI script and I don't know what OSes
    are going to be making requests, is there any way to set $/ to several
    different possibilities such as CRLF, CR alone or LF alone?

    I'd still like a function that removes all leading and trailing
    whitespace. I suppose I could do it with regexps but that would be
    kind of ugly.
     
    Mark Healey, May 8, 2004
    #3
  4. Bob Walton

    Bob Walton Guest


    No. The value of $/ is a string, not a regexp. Unless you do something
    like [untested]:

    {local $/;$/="\n";chomp}
    {local $/;$/="\r";chomp}
    {local $/;$/="\r\n";chomp} #not needed?
    #etc?


    Why ugly? It should be simple [untested]:

    sub trim{
    my $s=shift;
    $s=~s/^\s+//; # strip leading whitespace
    $s=~s/\s+$//; # strip trailing whitespace, including CR and LF
    return $s;
    }
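
As a rough illustration of how the same two substitutions could replace the chomp in the original loop, here is a sketch that trims every element of an array in place. The sample data is invented. It relies on foreach aliasing $_ to each element and on \s matching both CR and LF.

```perl
use strict;
use warnings;

# Hypothetical search terms with stray CRs and spaces left over from split(/\n/, ...).
my @searchTerms = ("point loma\015", " mission hills \015", "hillcrest");

# foreach aliases $_ to each element, so the substitutions modify the array itself.
for (@searchTerms) {
    s/^\s+//;    # leading whitespace
    s/\s+$//;    # trailing whitespace, including CR and LF
}

print "$_<BR>\n" for @searchTerms;
```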
     
    Bob Walton, May 8, 2004
    #4
  5. Dave Cross

    Dave Cross Guest

    Of course, you can never be sure that your input is coming from a browser :)

    Dave...
     
    Dave Cross, May 8, 2004
    #5
  6. This is codified, e.g. for HTML4.01, in the appropriate parts of
    http://www.w3.org/TR/html401/interact/forms.html#h-17.13.3

    What they submit is a CR followed by an LF.

    I don't see how a browser can be expected to submit something that's a
    logical Perl concept (\r and/or \n) rather than real control
    characters. See perlport, where it's explained that an appropriate
    notation in Perl for the ASCII CR LF sequence would be \015\012.

    And just to correct the sloppy wording: hitting Enter in a textarea
    input control does not in itself submit anything. The newline(s)
    would be part of the data when the form is finally submitted by other
    means.

    No. \r\n would be \012\015 on at least one operating system, and
    something else again on an EBCDIC-based architecture. What would be
    submitted by the client, and received by the server, would still be
    \015\012.

    That should be irrelevant. The HTML specification covers the
    interworking requirements for all kinds of client, not only browsers
    /per se/.

    But certainly it would seem wise to tolerate other newline
    conventions, no matter what the specification might demand. Contrary
    to the issue addressed a bit earlier in this thread, I don't see any
    way to handle that solely by means of settings of $/ - it's necessary
    to either do some kind of harmonisation separately, or to write code
    which explicitly handles any of the plausible representations.
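
One way the "harmonisation" step might look, as a sketch only: rewrite every plausible line ending to a single LF before splitting. The helper name normalize_newlines and the sample text are assumptions for illustration, not something from this thread. The octal escapes are used so the code means the same thing regardless of what "\n" is on the platform.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite CR LF, lone CR, and lone LF to a single LF.
# CR LF is listed first in the alternation so it is treated as one line ending.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\015\012|\015|\012/\012/g;
    return $text;
}

# Made-up input mixing all three conventions.
my $mixed = "point loma\015\012mission hills\015hillcrest\012bankers hill";
my @terms = split /\012/, normalize_newlines($mixed);

print scalar(@terms), " terms\n";
```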
     
    Alan J. Flavell, May 8, 2004
    #6
  7. gnari

    gnari Guest

    ....
    change your split to:
    (my $tmp=$queryHash{"searchText"}) =~ s/^\s+|\s+$//g;
    @searchTerms=split(/ *[\r\n]+ */,$tmp);

    and drop the chomp;

    this will remove all leading and trailing spaces , including
    the ones around the newlines
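
A self-contained sketch of this approach, with an invented input string standing in for $queryHash{"searchText"}. It first trims the whole text, then splits on any run of CR/LF characters together with the spaces around them, so no chomp is needed afterwards.

```perl
use strict;
use warnings;

# Hypothetical textarea content: CR LF line endings, stray spaces, one blank line.
my $text = "  point loma \015\012 mission hills\015\012\015\012hillcrest  \015\012";

# Copy the text and strip leading and trailing whitespace from the copy.
(my $tmp = $text) =~ s/^\s+|\s+$//g;

# Split on runs of CR/LF, swallowing the spaces on either side.
my @searchTerms = split / *[\015\012]+ */, $tmp;

print "$_<BR>\n" for @searchTerms;
```

Because the character class matches a run of line-ending characters, blank lines in the input collapse rather than producing empty terms.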

    gnari
     
    gnari, May 8, 2004
    #7
  8. Dave Cross

    Dave Cross Guest

    It's not. I'm simply pointing out for the benefit of the original poster
    that you should never assume that you know how the input to your CGI
    program is generated.

    Of course, you know this and you're just arguing for the sake of it.

    Dave...
     
    Dave Cross, May 8, 2004
    #8
  9. Joe Smith

    Joe Smith Guest

    Are you aware that there are instances where "\r\n" is not the
    same as "\015\012"? When dealing with data read from a
    network connection, it is better to use "\015\012".
    You've got no argument from me there.
    -Joe
     
    Joe Smith, May 9, 2004
    #9
  10. Joe Smith

    Joe Smith Guest

    Any time you read a text file on MacOS Classic.
    The end-of-line character, \015, is converted to \n on input
    and \n on output is converted to \015. If the text file does
    happen to contain \012, it is converted to \r on input and
    \r on output is converted to \012.

    This means that many Perl scripts written for Unix can run
    unmodified on MacOS Classic, when it comes to reading and
    writing lines in files on the native file system. This also
    means that Unix perl scripts doing I/O to TCP/IP sockets have
    problems on MacOS Classic if they use the logical end-of-line
    character (\n) instead of the ASCII code for linefeed (\012).

    References:

    perldoc -f binmode

    Mac OS, all variants of Unix, and Stream_LF files on
    VMS use a single character to end each line in the
    external representation of text (even though that
    single character is CARRIAGE RETURN on Mac OS and
    LINE FEED on Unix and most VMS files). In other
    systems like OS/2, DOS and the various flavors of
    MS-Windows your program sees a "\n" as a simple
    "\cJ", but what's stored in text files are the two
    characters "\cM\cJ". That means that, if you don't
    use binmode() on these systems, "\cM\cJ" sequences
    on disk will be converted to "\n" on input, and any
    "\n" in your program will be converted back to
    "\cM\cJ" on output. This is what you want for text
    files, but it can be disastrous for binary files.

    perldoc Socket

    Also, some common socket "newline" constants are provided:
    the constants "CR", "LF", and "CRLF", as well as $CR, $LF,
    and $CRLF, which map to "\015", "\012", and "\015\012". If
    you do not want to use the literal characters in your
    programs, then use the constants provided here. They are
    not exported by default, but can be imported individually,
    and with the ":crlf" export tag:

    use Socket qw(:DEFAULT :crlf);
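
A short sketch of using those constants, importing only the ":crlf" tag. The request line is a made-up example. The point is that $CRLF is always the two bytes 015 012, whatever "\r\n" happens to mean on the platform.

```perl
use strict;
use warnings;
use Socket qw(:crlf);    # imports CR, LF, CRLF and $CR, $LF, $CRLF

# Hypothetical protocol line terminated with the network line ending.
my $request_line = "GET / HTTP/1.0" . $CRLF;

# Strip exactly one trailing CR LF, independent of the platform's "\n".
(my $stripped = $request_line) =~ s/$CRLF\z//;

# Show the byte values that make up $CRLF.
print join(" ", map { ord } split //, $CRLF), "\n";
```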


    -Joe
     
    Joe Smith, May 10, 2004
    #10
  11. That's very confusing. On MacOS Classic, surely \n _is_ \015: there
    is no conversion involved. See perldoc perlport, one version of which
    says:

    Perl uses "\n" to represent the "logical" newline, where
    what is logical may depend on the platform in use. In
    MacPerl, "\n" always means "\015". In DOSish perls, "\n"
    usually means "\012", but when accessing a file in "text"
    mode, STDIO translates it to (or from) "\015\012", depending
    on whether you're reading or writing. Unix does the
    same thing on ttys in canonical mode. "\015\012" is commonly
    referred to as CRLF.

    and so on.
    Again, no "conversion" takes place, as I understand the message of
    perlport. The only "conversion" needed is in the heads of certain
    folks.
    Which is why the FAQs say don't do that. THERE IS NO PROBLEM, other
    than the ones created by a refusal to read the documentation.

    [your useful additional references snipped for brevity, but I think
    they support my contention that Perl does not perform any
    "conversion" in this situation.]
     
    Alan J. Flavell, May 10, 2004
    #11