Why does chomp leave newlines?

Mark Healey

First some fragments

I get the array thusly:

@searchTerms = split(/\n/, $queryHash{"searchText"});

I later print it:


sub printSearchTerms
{
    foreach (@searchTerms)
    {
        chomp;
        print("$_<BR>\n");
    }
}

And yet I get:

point loma
<br>
mission hills
<br>
hillcrest
<br>
bankers hill
<br>
university heights<br>

What's up?
 

David Efflandt

perldoc -f chomp

It removes the input record separator ($/), whose default value depends
on the OS it is running on. If the newlines in your data are CR-LF and
the default newlines in your OS are not, then you may need to set
$/ = "\015\012"; before using chomp on that data.
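
To make the symptom concrete, here is a small sketch. The sample string is made up, and it assumes the form data arrives with CR LF between lines. Splitting on LF alone leaves a CR on the end of each field, and chomp does not touch it because the field no longer ends in $/:

```perl
use strict;
use warnings;

# Hypothetical textarea data as a browser would submit it: CR LF between lines.
my $text = "point loma\015\012mission hills\015\012hillcrest";

# Splitting on LF alone leaves a CR at the end of every field but the last.
my @terms = split /\012/, $text;

# chomp removes a trailing $/ only; with $/ set to LF the CR survives.
{
    local $/ = "\012";
    chomp @terms;
}
my $leftover_cr = grep { /\015\z/ } @terms;
print "fields still ending in CR: $leftover_cr\n";

# Splitting on the full CR LF sequence leaves nothing for chomp to do.
my @fixed = split /\015\012/, $text;
print join("|", @fixed), "\n";
```

Note that setting $/ = "\015\012" would not help after the first split either, since the LF has already been consumed as the delimiter.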
 

Mark Healey

Since this is supposed to be a CGI script and I don't know what OSes
are going to be making requests, is there any way to set $/ to several
different possibilities such as CRLF, CR alone or LF alone?

I'd still like a function that removes all leading and trailing
whitespace. I suppose I could do it with regexps but that would be
kind of ugly.
 

Bob Walton

Mark said:
....
Since this is supposed to be a CGI script and I don't know what OSes
are going to be making requests, is there any way to set $/ to several
different possibilities such as CRLF, CR alone or LF alone?


No. The value of $/ is a string, not a regexp. Unless you do something
like [untested]:

{ local $/ = "\n";   chomp }
{ local $/ = "\r";   chomp }
{ local $/ = "\r\n"; chomp }   # not needed?
# etc?

I'd still like a function that removes all leading and trailing
whitespace. I suppose I could do it with regexps but that would be
kind of ugly.


Why ugly? It should be simple [untested]:

sub trim {
    my $s = shift;
    $s =~ s/^\s+//;   # strip leading whitespace
    $s =~ s/\s+$//;   # strip trailing whitespace (includes \r and \n)
    return $s;
}
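
Since \s matches both CR and LF, the same two substitutions could also stand in for chomp in the original poster's loop. A minimal sketch, with made-up sample data in @searchTerms:

```perl
use strict;
use warnings;

# Hypothetical input: terms with stray CRs and spaces left over from the split.
my @searchTerms = ("point loma\015", "  mission hills \015", "hillcrest");

# foreach aliases $_ to each element, so the substitutions edit the array in place.
foreach (@searchTerms) {
    s/^\s+//;   # leading whitespace
    s/\s+$//;   # trailing whitespace, including CR and LF
}

print "$_<BR>\n" for @searchTerms;
```

This modifies the array itself, which may or may not be what you want; the trim sub works on a copy instead.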
 

Dave Cross

All modern browsers submit \r\n when ENTER is pressed
while a cursor is inside a multiline text area box for a
cgi form action submission. This is independent of all
operating systems.

Of course, you can never be sure that your input is coming from a browser :)

Dave...
 

Alan J. Flavell

This is codified, e.g for HTML4.01, in the appropriate parts of
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.3

What they submit is a CR followed by an LF.

I don't see how a browser can be expected to submit something that's a
logical Perl concept (\r and/or \n) rather than real control
characters. See perlport, where it's explained that an appropriate
notation in Perl for the ASCII CR LF sequence would be \015\012.

And just to correct the sloppy wording: hitting Enter in a textarea
input control does not in itself submit anything. The newline(s)
would be part of the data when the form is finally submitted by other
means.

No. \r\n would be \012\015 on at least one operating system, and
something else again on an EBCDIC-based architecture. What would be
submitted by the client, and received by the server, would still be
\015\012.

Of course, you can never be sure that your input is coming from a browser :)

That should be irrelevant. The HTML specification covers the
interworking requirements for all kinds of client, not only browsers
/per se/.

But certainly it would seem wise to tolerate other newline
conventions, no matter what the specification might demand. Contrary
to the issue addressed a bit earlier in this thread, I don't see any
way to handle that solely by means of settings of $/ - it's necessary
to either do some kind of harmonisation separately, or to write code
which explicitly handles any of the plausible representations.
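
Both options might look something like the sketch below. The sub names normalize_newlines and split_lines are invented for illustration and are not from any module:

```perl
use strict;
use warnings;

# Hypothetical helper: harmonise CR LF, lone CR and lone LF to the logical "\n".
sub normalize_newlines {
    my ($text) = @_;
    # CR LF is listed first so that it is not counted as two line breaks.
    $text =~ s/\015\012|\015|\012/\n/g;
    return $text;
}

# Hypothetical helper: split directly on any of the plausible representations.
sub split_lines {
    my ($text) = @_;
    return split /\015\012|\015|\012/, $text;
}

my $mixed = "point loma\015\012mission hills\015hillcrest\012bankers hill";
my @terms = split_lines($mixed);
print scalar(@terms), " terms\n";
```

Blank lines in the input would show up as empty fields in the middle of the list, so a real script would probably want to filter those out as well.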
 

gnari

....
I'd still like a function that removes all leading and trailing
whitespace. I suppose I could do it with regexps but that would be
kind of ugly.

change your split to:
(my $tmp = $queryHash{"searchText"}) =~ s/^\s+|\s+$//g;
@searchTerms = split(/ *[\r\n]+ */, $tmp);

and drop the chomp;

this will remove all leading and trailing spaces, including
the ones around the newlines

gnari
 

Dave Cross

This is related to modern browser behavior in what way?

It's not. I'm simply pointing out for the benefit of the original poster
that you should never assume that you know how the input to your CGI
program is generated.

Of course, you know this and you're just arguing for the sake of it.

Dave...
 

Joe Smith

Purl said:
Which is a \r\n submission. Car - automobile, horse - equine.

Are you aware that there are instances where "\r\n" is not the
same as "\015\012"? When dealing with data read from a
network connection, it is better to use "\015\012".
Anal.
Purl Gurl

You've got no argument from me there.
-Joe
 

Joe Smith

Purl said:
If you would, provide a case example of \r\n not
being the same as \015\012 syntax.

Any time you read a text file on MacOS Classic.
The end-of-line character, \015, is converted to \n on input
and \n on output is converted to \015. If the text file does
happen to contain \012, it is converted to \r on input and
\r on output is converted to \012.

This means that many Perl scripts written for Unix can run
unmodified on MacOS Classic, when it comes to reading and
writing lines in files on the native file system. This also
means that Unix perl scripts doing I/O to TCP/IP sockets have
problems on MacOS Classic if they use the logical end-of-line
character (\n) instead of the ASCII code for linefeed (\012).

References:

perldoc -f binmode

Mac OS, all variants of Unix, and Stream_LF files on
VMS use a single character to end each line in the
external representation of text (even though that
single character is CARRIAGE RETURN on Mac OS and
LINE FEED on Unix and most VMS files). In other
systems like OS/2, DOS and the various flavors of
MS-Windows your program sees a "\n" as a simple
"\cJ", but what's stored in text files are the two
characters "\cM\cJ". That means that, if you don't
use binmode() on these systems, "\cM\cJ" sequences
on disk will be converted to "\n" on input, and any
"\n" in your program will be converted back to
"\cM\cJ" on output. This is what you want for text
files, but it can be disastrous for binary files.

perldoc Socket

Also, some common socket "newline" constants are provided:
the constants "CR", "LF", and "CRLF", as well as $CR, $LF,
and $CRLF, which map to "\015", "\012", and "\015\012". If
you do not want to use the literal characters in your
programs, then use the constants provided here. They are
not exported by default, but can be imported individually,
and with the ":crlf" export tag:

use Socket qw(:DEFAULT :crlf);
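
As a rough illustration of those constants in use (the sample data is made up, and only the :crlf tag is imported here), something along these lines accepts either CR LF or a bare LF as the line terminator:

```perl
use strict;
use warnings;
use Socket qw(:crlf);   # imports CR, LF, CRLF and $CR, $LF, $CRLF

# Hypothetical data as it might arrive from a network client.
my $data = "point loma${CRLF}mission hills${LF}hillcrest";

# An optional CR followed by LF, spelled with the Socket variables.
my @lines = split /$CR?$LF/, $data;
print join("|", @lines), "\n";
```

Because $CR and $LF are fixed at "\015" and "\012", the pattern means the same thing regardless of what "\r" and "\n" mean on the local platform.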


-Joe
 

Alan J. Flavell

Any time you read a text file on MacOS Classic.
The end-of-line character, \015, is converted to \n on input

That's very confusing. On MacOS Classic, surely \n _is_ \015: there
is no conversion involved. See perldoc perlport, one version of which
says:

Perl uses "\n" to represent the "logical" newline, where
what is logical may depend on the platform in use. In
MacPerl, "\n" always means "\015". In DOSish perls, "\n"
usually means "\012", but when accessing a file in "text"
mode, STDIO translates it to (or from) "\015\012", depending
on whether you're reading or writing. Unix does the
same thing on ttys in canonical mode. "\015\012" is commonly
referred to as CRLF.

and so on.

and \n on output is converted to \015. If the text file does
happen to contain \012, it is converted to \r on input and
\r on output is converted to \012.

Again, no "conversion" takes place, as I understand the message of
perlport. The only "conversion" needed is in the heads of certain
folks.

This means that many Perl scripts written for Unix can run
unmodified on MacOS Classic, when it comes to reading and
writing lines in files on the native file system.
This also means that Unix perl scripts doing I/O to TCP/IP sockets
have problems on MacOS Classic if they use the logical end-of-line
character (\n) instead of the ASCII code for linefeed (\012).

Which is why the FAQs say don't do that. THERE IS NO PROBLEM, other
than the ones created by a refusal to read the documentation.

[your useful additional references snipped for brevity, but I think
they support my contention that Perl does not perform any
"conversion" in this situation.]
 
