How to process a data file with special (non-displayable) characters?

goomania

I am using Perl on Unix to process some data files I got from others.
The data is broken into separate lines. However, when I tried to use
code like this:

#####################################################################

@lines = split("\n", $text);
# $text is the variable where I stored the content of the data file
print "@lines\n";

#####################################################################

to split the lines of the data file into the array @lines, the
output is not the same as the original data file.

Then I found out, after opening the data file in emacs, that there is
a "^M" character appended to the end of each line of the file. The
"^M" is invisible if I read the file with the Unix "more" command.

Can anyone please tell me what those "^M" characters are for, and how
to use Perl to handle files with such special characters rather than
the standard ones like "\n" for newline on Unix?

Thanks a lot for your help!

Andy
 
Paul Lalli

The ^M characters are the extra character Windows places at the end of
each line. Windows uses "\r\n" for newlines, whereas Unix uses "\n".
The ^M is how emacs displays the "\r" (carriage return).

You have two basic options. One is to change the code to process
Windows-style newlines (split on "\r\n" instead of "\n"). Preferred,
IMHO, however, is to convert the file to Unix format.

man dos2unix
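
For the first option, here is a minimal sketch of splitting on either line-ending style. It assumes the whole file is already in $text, as in the original post; the sample string and the helper name split_lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lines, accepting both
# Unix ("\n") and Windows ("\r\n") line endings.
sub split_lines {
    my ($text) = @_;
    return split /\r?\n/, $text;
}

# Sample data standing in for the file content (an assumption).
my $text  = "alpha\r\nbeta\r\ngamma\r\n";
my @lines = split_lines($text);

# Print one line per row, with plain Unix newlines.
print "$_\n" for @lines;
```

If dos2unix is not installed, something like perl -pi -e 's/\r$//' datafile (where datafile is a placeholder name) should do roughly the same conversion in place.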

Paul Lalli
 
goomania

Thanks a lot for the information. In addition to the two options you
kindly suggested, I also found out that the output is correct if I use
"^M" to split the $text. I am guessing that using "^M" may be
equivalent to using "\r\n". Am I right?

Thanks,

Andy
 
Paul Lalli

Never tried that method, so I can't be sure. My instinct however, is
to say "no". As I indicated previously, "^M" is how emacs is
representing the "\r" that Windows includes before every "\n". So it
seems to me that if you had lines ending with ^M and split on that
character, your first line would contain no newline, and then
every subsequent line would contain a newline at the *start* of the
string.

Again, I could be wrong, as I've not tried it out. I still recommend
just fixing the data file before processing it, using dos2unix.
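
The guess above is easy to check with a small sketch. The sample string and the helper name split_on_cr are made up for illustration; the split is on "\r" only, which is the character emacs shows as ^M.

```perl
use strict;
use warnings;

# Hypothetical helper: split on the carriage return alone.
sub split_on_cr {
    my ($text) = @_;
    return split /\r/, $text;
}

# Sample text with Windows line endings (an assumption).
my $text   = "one\r\ntwo\r\nthree\r\n";
my @fields = split_on_cr($text);

# Show leftover newlines as a literal \n so they are visible.
for my $field (@fields) {
    (my $shown = $field) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
# prints:
# [one]
# [\ntwo]
# [\nthree]
# [\n]
```

So the first field has no newline and every later field starts with one. Printed with "@lines", those leading newlines still put each record on its own line, which may be why the output looked correct.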

Paul Lalli
 
John W. Krahn

goomania said:
I am using Perl on Unix to process some data files I got from others.
The data is broken into separate lines. However, when I tried to use
code like this:

#####################################################################

@lines = split("\n", $text);
# $text is the variable where I stored the content of the data file
print "@lines\n";

#####################################################################

to split the lines of the data file into the array @lines, the
output is not the same as the original data file.

perldoc -q "Why do I get weird spaces when I print an array of lines"
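
A short sketch of what that FAQ entry is about: interpolating an array in double quotes joins its elements with $", which is a single space by default. The sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Sample lines that still carry their newlines (an assumption).
my @lines = ("first\n", "second\n", "third\n");

# "@lines" inserts $" (a space by default) between elements,
# so the second and third lines gain a leading space.
my $interpolated = "@lines";

# Joining on the empty string (or print @lines) adds nothing.
my $joined = join '', @lines;

print $interpolated;
print $joined;
```

In the snippet from the original post, split has already removed the "\n" characters, so "@lines" puts every record on one line separated by spaces. Something like print "$_\n" for @lines avoids that.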

Then I found out, after opening the data file in emacs, that there is
a "^M" character appended to the end of each line of the file. The
"^M" is invisible if I read the file with the Unix "more" command.

Try this instead:

@lines = split /\s*\n/, $text;
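
This works because \s also matches "\r", so the pattern consumes any whitespace in front of the newline. A minimal sketch with made-up sample text:

```perl
use strict;
use warnings;

# Sample text mixing Windows endings, trailing blanks, and a blank line
# (made up for illustration).
my $text = "one\r\ntwo  \r\n\r\nthree\n";

# \s* swallows "\r", spaces, and tabs that precede the newline.
my @lines = split /\s*\n/, $text;

print "[$_]\n" for @lines;
# prints:
# [one]
# [two]
# [three]
```

Note the side effects: trailing spaces are stripped too, and because \s matches "\n" as well, blank lines are dropped. If blank lines matter, splitting on /\r?\n/ is probably the safer choice.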



John
 
