How to process a data file with special (non-displayable) characters?

goomania · Aug 15, 2005

I am using Perl on Unix to process a data file I received from others.
The data are broken into separate lines. However, when I tried to use
code such as:

#####################################################################

@lines = split("\n", $text);
# $text is the variable where I stored the content of the data file
print "@lines\n";

#####################################################################

to parse the lines of the data file into the array @lines, the
output is not the same as the original data file.

Then I found out, after opening the data file in emacs, that a "^M"
character is appended to the end of each line of the file. The "^M" is
invisible if I read the file with the Unix "more" command.

Can anyone please tell me what those "^M" characters are for, and how
to use Perl to handle files with such special characters rather than
the standard ones like "\n" for newline on Unix?

Thanks a lot for your help!

Andy

Paul Lalli · Aug 15, 2005

goomania said:
I am using Perl on Unix to process a data file I received from others.
The data are broken into separate lines. However, when I tried to use
code such as:

#####################################################################

@lines = split("\n", $text);
# $text is the variable where I stored the content of the data file
print "@lines\n";

#####################################################################

to parse the lines of the data file into the array @lines, the
output is not the same as the original data file.

Then I found out, after opening the data file in emacs, that a "^M"
character is appended to the end of each line of the file. The "^M" is
invisible if I read the file with the Unix "more" command.

Can anyone please tell me what those "^M" characters are for, and how
to use Perl to handle files with such special characters rather than
the standard ones like "\n" for newline on Unix?

The ^M characters are the extra character Windows places at the end of
each line. Windows uses "\r\n" for newlines, whereas Unix uses "\n".
The ^M characters are equivalent to the \r's.

You have two basic options. One is to change the code to process
Windows-style newlines (split on "\r\n" instead of "\n"). Preferred,
IMHO, however, is to convert the file to Unix format.

man dos2unix
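
For the first option, here is a minimal sketch of a split that accepts
both line-ending styles. The sample $text is made up for illustration
and is not the poster's data; the pattern /\r?\n/ is one common way to
do this, not the only one.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the contents of the data file.
my $text = "alpha\r\nbeta\r\ngamma\r\n";

# Split on an optional carriage return followed by a newline, so both
# Windows ("\r\n") and Unix ("\n") line endings are handled.
my @lines = split /\r?\n/, $text;

# Print one record per line, using Unix newlines.
print "$_\n" for @lines;
```

Because the "\r" is consumed as part of the separator, none of the
elements of @lines carries a trailing ^M.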

Paul Lalli

goomania · Aug 15, 2005

Paul Lalli wrote:

The ^M characters are the extra character Windows places at the end of
each line. Windows uses "\r\n" for newlines, whereas Unix uses "\n".
The ^M characters are equivalent to the \r's.

You have two basic options. One is to change the code to process
Windows-style newlines (split on "\r\n" instead of "\n"). Preferred,
IMHO, however, is to convert the file to Unix format.

man dos2unix

Paul Lalli

Thanks a lot for the information. In addition to the two options you
kindly suggested, I also found that the output is correct if I use
"^M" to split $text. I am guessing that using "^M" may be equivalent
to using "\r\n". Am I right?

Thanks,

Andy

Paul Lalli · Aug 15, 2005

goomania said:
Thanks a lot for the information. In addition to the two options you
kindly suggested, I also found that the output is correct if I use
"^M" to split $text. I am guessing that using "^M" may be equivalent
to using "\r\n". Am I right?

Never tried that method, so I can't be sure. My instinct, however, is
to say "no". As I indicated previously, "^M" is how emacs represents
the "\r" that Windows includes before every "\n". So it seems to me
that if you had lines ending with ^M and tried to split on that
character, your first line would contain no newline, and every
subsequent line would contain a newline at the *start* of the string.

Again, I could be wrong, as I've not tried it out. I still recommend
just fixing the data file before processing it, using dos2unix.
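
The expectation about splitting on the carriage return can be checked
with a short sketch. It assumes the "^M" in the split was entered as a
literal carriage return (written here as \r), and it uses a made-up
sample string rather than the real data file.

```perl
use strict;
use warnings;

# Hypothetical sample with Windows-style line endings.
my $text = "alpha\r\nbeta\r\ngamma\r\n";

# Split on the carriage return only; the "\n" stays in the data.
my @lines = split /\r/, $text;

# Show each element with its newlines made visible.
for my $line (@lines) {
    (my $shown = $line) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
# Prints:
# [alpha]
# [\nbeta]
# [\ngamma]
# [\n]
```

Printing such elements one after another still looks right on screen,
since each "\n" is still present, which may be why the output appeared
correct.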

Paul Lalli

John W. Krahn · Aug 16, 2005

goomania said:
I am using Perl on Unix to process a data file I received from others.
The data are broken into separate lines. However, when I tried to use
code such as:

#####################################################################

@lines = split("\n", $text);
# $text is the variable where I stored the content of the data file
print "@lines\n";

#####################################################################

to parse the lines of the data file into the array @lines, the
output is not the same as the original data file.

perldoc -q "Why do I get weird spaces when I print an array of lines"
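
In short, interpolating an array into a double-quoted string joins its
elements with the value of $" (a single space by default), not with
newlines. A minimal sketch with made-up data:

```perl
use strict;
use warnings;

# Hypothetical lines, already split and free of line endings.
my @lines = ("alpha", "beta", "gamma");

# Interpolation inserts $" (a space by default) between elements.
my $interpolated = "@lines";        # "alpha beta gamma"

# join puts back whatever separator you ask for.
my $joined = join "\n", @lines;     # "alpha\nbeta\ngamma"

print "$joined\n";
```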

Then I found out, after opening the data file in emacs, that a "^M"
character is appended to the end of each line of the file. The "^M" is
invisible if I read the file with the Unix "more" command.

Try this instead:

@lines = split /\s*\n/, $text;
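
A small sketch of what that pattern does, using a made-up sample
string. Since \s matches the carriage return as well as spaces, tabs
and newlines, the separator swallows any trailing whitespace before the
newline; as a side effect, blank lines are dropped too.

```perl
use strict;
use warnings;

# Hypothetical sample: trailing spaces, CRLF endings, one blank line,
# and a final Unix-style newline.
my $text = "alpha  \r\nbeta\r\n\r\ngamma\n";

# \s* consumes any whitespace (including "\r") before the newline.
my @lines = split /\s*\n/, $text;

print "[$_]\n" for @lines;
# Prints:
# [alpha]
# [beta]
# [gamma]
```

If blank lines or trailing spaces are meaningful in the data, a
narrower pattern such as /\r?\n/ would preserve them.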

John
