How to process a data file with special (non-displayable) characters?

Discussion in 'Perl Misc' started by goomania, Aug 15, 2005.

  1. goomania

    goomania Guest

    I am using Perl on Unix to process a data file I received from someone
    else. The data is broken into separate lines. However, when I tried to
    use code such as:

    #####################################################################

    @lines = split("\n", $text);
    # $text is the variable where I stored the content of the data file
    print "@lines\n";

    #####################################################################

    to split the lines of the data file into the array @lines, the output
    is not the same as the original data file.

    Then, after opening the data file in emacs, I found that a "^M"
    character is appended to the end of each line of the file. The "^M" is
    invisible when I read the file with the Unix "more" command.

    Can anyone please tell me what those "^M" characters are for, and how
    to use Perl to handle files with such special characters rather than
    standard ones like "\n" for newline on Unix?

    Thanks a lot for your help!

    Andy
     
    goomania, Aug 15, 2005
    #1

  2. goomania

    Paul Lalli Guest

    goomania wrote:
    > I am using Perl on Unix to process some data file I got from others.
    > Data are broken into and displayed on separate lines . However, when I
    > tried to use code as:
    >
    > #####################################################################
    >
    > @lines = split("\n", $text);
    > # $text is the variable where I stored the content of the data file
    > print "@lines\n";
    >
    > #####################################################################
    >
    > to parse the different lines of data file to the array @lines, the
    > output are not the same as the original data file.
    >
    > Then I found out that, after I opened the data file with emacs and read
    > the file content, there are characters of "^M" appended to end of each
    > line of the file. "^M" is invisible if I read the file by unix "more"
    > command.
    >
    > Can anyone please tell me what are those "^M" for and how to use Perl
    > to handle files with such special characters rather than standard ones
    > like "\n" for newline on Unix?


    The ^M characters are the extra character Windows places at the end of
    each line. Windows uses "\r\n" for newlines, whereas Unix uses "\n".
    The ^M is how the "\r" (carriage return, Ctrl-M) is being displayed.

    You have two basic options. One is to change the code to process
    Windows-style newlines (split on "\r\n" instead of "\n"). Preferred,
    IMHO, however, is to convert the file to Unix format.

    man dos2unix
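
    The first option can be sketched roughly as below. This is only a
    minimal illustration: the sample string is made up and stands in for
    the real file contents, and the pattern /\r?\n/ is one common way to
    accept both Windows and Unix line endings, not the only one.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the contents of the data file:
# three lines with Windows-style "\r\n" endings.
my $text = "alpha\r\nbeta\r\ngamma\r\n";

# Split on an optional carriage return followed by a newline, so the
# same code handles both "\r\n" (Windows) and "\n" (Unix) line endings.
my @lines = split /\r?\n/, $text;

# Print one element per row; no "\r" is left at the end of any element.
print "$_\n" for @lines;
```

    With the original split("\n", $text), every element keeps its trailing
    "\r". On a terminal a carriage return moves the cursor back to the
    start of the line, so later text overwrites earlier text, which is
    probably why the printed output looked different from the file.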

    Paul Lalli
     
    Paul Lalli, Aug 15, 2005
    #2

  3. goomania

    goomania Guest

    Paul Lalli wrote:

    > The ^M characters are the extra character Windows places at the end of
    > each line. Windows uses "\r\n" for newlines, whereas Unix uses "\n".
    > The ^M characters are equivalent to the \r's.
    >
    > You have two basic options. One is to change the code to process
    > Windows-style newlines (split on "\r\n" instead of "\n"). Preferred,
    > IMHO, however, is to convert the file to Unix format.
    >
    > man dos2unix
    >
    > Paul Lalli


    Thanks a lot for the information. In addition to the two options you
    kindly suggested, I also found that the output is correct if I use
    "^M" to split $text. I am guessing that using "^M" is equivalent to
    using "\r\n". Am I right?

    Thanks,

    Andy
     
    goomania, Aug 15, 2005
    #3
  4. goomania

    Paul Lalli Guest

    goomania wrote:
    > Paul Lalli wrote:
    >
    > > The ^M characters are the extra character Windows places at the end of
    > > each line. Windows uses "\r\n" for newlines, whereas Unix uses "\n".
    > > The ^M characters are equivalent to the \r's.
    > >
    > > You have two basic options. One is to change the code to process
    > > Windows-style newlines (split on "\r\n" instead of "\n"). Preferred,
    > > IMHO, however, is to convert the file to Unix format.
    > >
    > > man dos2unix

    >
    > Thanks a lot for the information. In addition to those two options you
    > nicely suggested, I also find out that the output is correct if I use
    > "^M" to split the $text. I am guessing maybe using "^M" is equivalent
    > to using "\r\n". Am I right?


    I've never tried that method, so I can't be sure. My instinct,
    however, is to say "no". As I indicated previously, "^M" is how emacs
    represents the "\r" that Windows includes before every "\n". So it
    seems to me that if you had lines ending with ^M and tried to split on
    that character, your first line would contain no newline, and every
    subsequent line would contain a newline at the *start* of the
    string.

    Again, I could be wrong, as I've not tried it out. I still recommend
    just fixing the data file before processing it, using dos2unix.
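
    For anyone who wants to check this, here is a small sketch. It assumes
    that "^M" means a literal carriage return ("\r") used as the split
    pattern, and the sample string is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample with Windows-style line endings.
my $text = "alpha\r\nbeta\r\ngamma\r\n";

# Split on the carriage return alone (the character emacs shows as ^M).
my @pieces = split /\r/, $text;

# Make the otherwise invisible newlines visible for inspection.
for my $piece (@pieces) {
    (my $shown = $piece) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
```

    Under that assumption the pieces are "alpha", "\nbeta", "\ngamma" and a
    final "\n": the first element has no newline and the rest begin with
    one. Printed with "@pieces\n" this can still look close to the original
    file on a terminal, which may be why the output appeared correct.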

    Paul Lalli
     
    Paul Lalli, Aug 15, 2005
    #4
  5. goomania wrote:
    > I am using Perl on Unix to process some data file I got from others.
    > Data are broken into and displayed on separate lines . However, when I
    > tried to use code as:
    >
    > #####################################################################
    >
    > @lines = split("\n", $text);
    > # $text is the variable where I stored the content of the data file
    > print "@lines\n";
    >
    > #####################################################################
    >
    > to parse the different lines of data file to the array @lines, the
    > output are not the same as the original data file.


    perldoc -q "Why do I get weird spaces when I print an array of lines"
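
    In short, that FAQ entry is about interpolating an array inside double
    quotes. A minimal sketch of the effect, with made-up lines standing in
    for the real data:

```perl
use strict;
use warnings;

# Hypothetical lines that still carry their trailing newlines.
my @lines = ("alpha\n", "beta\n", "gamma\n");

# Interpolating an array in double quotes joins its elements with the
# value of $" (the list separator), which is a single space by default.
my $interpolated = "@lines";

# Joining with the empty string adds nothing between the elements;
# "print @lines" behaves the same way while $, is left unset.
my $joined = join '', @lines;

print $interpolated;   # second and third lines start with a space
print $joined;         # lines appear exactly as stored
```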


    > Then I found out that, after I opened the data file with emacs and read
    > the file content, there are characters of "^M" appended to end of each
    > line of the file. "^M" is invisible if I read the file by unix "more"
    > command.


    Try this instead:

    @lines = split /\s*\n/, $text;
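
    A rough sketch of what that pattern does, compared with /\r?\n/ (the
    comparison pattern and the sample text are my own additions, chosen
    only to show the difference):

```perl
use strict;
use warnings;

# Hypothetical sample: Windows line endings, one line with trailing
# spaces, and one blank line between "beta" and "gamma".
my $text = "alpha  \r\nbeta\r\n\r\ngamma\r\n";

# \s* consumes any whitespace before the newline, including "\r".
my @trimmed = split /\s*\n/, $text;

# For comparison: remove only an optional "\r" before each newline.
my @exact = split /\r?\n/, $text;

print scalar(@trimmed), " fields with /\\s*\\n/\n";
print scalar(@exact),   " fields with /\\r?\\n/\n";
```

    Because \s also matches "\n", a run of blank lines is swallowed as part
    of one separator, and trailing spaces or tabs are dropped as well. That
    is often what you want, but if blank lines or trailing whitespace are
    significant in the data, the narrower pattern may be the safer choice.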



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Aug 16, 2005
    #5