odd file with ^@ characters

Discussion in 'Perl Misc' started by cartercc@gmail.com, Jan 3, 2008.

  1. Guest

    I was given an odd CSV file this morning, with data fields delimited by
    ",". I do not know the provenance, and neither does the person who
    gave it to me. It's an 11M data file with 1279 rows that contains
    data for an urgent report, and the target system is Windows.
    Neither Excel nor Access will take the file, but it opens fine
    with cat, head, etc.

    I moved the file to my Unix system and looked at it in vi. This is
    what it looks like, for example, 'perl is good' looks like this:
    ^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d
    When I open the output file in Windows, I get either ? or the
    square "unrecognizable character" symbols.

    I ~think~ I know what the problem is, but I haven't found the answer.
    Here are some of the things I've tried. The best result is a file with
    ASCII characters but with large strings of ?????????????????? at the
    end of each line. Is this an encoding problem? If so, how do I
    convert the characters into plain ASCII that Excel or Access will
    accept?

    Thanks, CC
    (code follows)

    #!/usr/bin/perl -w
    use strict;
    use Encode; #tried various decode functions, also binmode

    #open INFILE, "<BB.TXT";
    open INFILE, "<:utf8", 'BB.TXT';
    open OUTFILE, ">:utf8", 'cleanBB.txt';

    while (<INFILE>)
    {
        # print $_;
        my $line = $_;
        # $line = decode_utf8($line);
        # $line =~ s/\x{0000}//g;
        # $line =~ s/<.*>//g;
        $line =~ s/[^[:ascii:]]+//g;
        $line =~ s/<(.*)>//g; #removed file data between <>, not HTML.
        print OUTFILE $line;
    }
    close INFILE;
    close OUTFILE;
     
    , Jan 3, 2008
    #1

  2. writes:

    > I moved the file to my Unix system and looked at it in vi. This is
    > what it looks like, for example, 'perl is good' looks like this:
    > ^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d


    Could be UTF-16BE or UCS-2BE.
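
One rough way to check this guess is to look at where the NUL bytes fall. The sketch below is only an illustration, not something from the thread: the helper name guess_utf16_order is hypothetical, and the heuristic assumes mostly ASCII text, where the zero high byte lands on even offsets for big-endian data and on odd offsets for little-endian data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Guess the byte order of a UTF-16 sample. A BOM is checked first;
# otherwise NUL bytes are counted at even and odd offsets.
# (Hypothetical helper; the heuristic only makes sense for mostly-ASCII text.)
sub guess_utf16_order {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 2);
    return 'UTF-16BE (BOM)' if $head eq "\xFE\xFF";
    return 'UTF-16LE (BOM)' if $head eq "\xFF\xFE";

    my ($even, $odd) = (0, 0);
    for my $i (0 .. length($bytes) - 1) {
        next unless substr($bytes, $i, 1) eq "\x00";
        if ($i % 2) { $odd++ } else { $even++ }
    }
    return 'UTF-16BE (guess)' if $even > $odd;
    return 'UTF-16LE (guess)' if $odd > $even;
    return 'unknown';
}

if (@ARGV) {
    # Inspect the first 4096 bytes of the file named on the command line.
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    read($fh, my $sample, 4096);
    close $fh;
    print guess_utf16_order($sample), "\n";
}
else {
    # Hand-built sample: "perl" with a NUL before each letter, as seen in vi.
    print guess_utf16_order("\x00p\x00e\x00r\x00l"), "\n";
}
```

Run without arguments, this prints "UTF-16BE (guess)" for the hand-built sample, which matches the ^@p^@e^@r^@l pattern quoted above.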

    //Makholm
     
    Peter Makholm, Jan 3, 2008
    #2

  3. wrote:
    >I was given an odd CSV file this morning, with data fields delimited by
    >",". I do not know the provenance, and neither does the person who
    >gave it to me. It's an 11M data file with 1279 rows that contains
    >data for an urgent report, and the target system is Windows.
    >Neither Excel nor Access will take the file, but it opens fine
    >with cat, head, etc.
    >
    >I moved the file to my Unix system and looked at it in vi. This is
    >what it looks like, for example, 'perl is good' looks like this:
    >^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d


    Nothing to do with Perl, but it appears that the file is encoded in a 16-bit
    encoding, most likely UTF-16.
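
If the file does turn out to be UTF-16, one possible way to convert it in Perl is to decode the raw bytes with the core Encode module and write them back out as UTF-8. This is a minimal sketch under assumptions: the helper name utf16_to_utf8 is hypothetical, the input and output paths come from the command line, and UTF-16BE is assumed because of the NUL-before-letter pattern. It slurps the whole file, which should be acceptable for an 11M input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode raw UTF-16 bytes and re-encode them as UTF-8.
# $encoding defaults to 'UTF-16', which expects a BOM; pass 'UTF-16BE'
# or 'UTF-16LE' explicitly when the data has none.
# (Hypothetical helper name, not from the thread.)
sub utf16_to_utf8 {
    my ($bytes, $encoding) = @_;
    $encoding ||= 'UTF-16';
    my $text = decode($encoding, $bytes);
    return encode('UTF-8', $text);
}

if (@ARGV == 2) {
    my ($in, $out) = @ARGV;

    # Read the input as raw bytes so no layer touches the data first.
    open my $ifh, '<:raw', $in or die "Cannot open $in: $!";
    my $bytes = do { local $/; <$ifh> };
    close $ifh;

    # Assumption: big-endian without a BOM, as the vi output suggests.
    open my $ofh, '>:raw', $out or die "Cannot open $out: $!";
    print {$ofh} utf16_to_utf8($bytes, 'UTF-16BE');
    close $ofh or die "Cannot close $out: $!";
}
```

Opening the input with ":utf8", as in the original script, reads the bytes under the wrong encoding, so the NUL bytes survive into the output.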

    >When I open the output file in Windows, I get either ? or the
    >square "unrecognizable character" symbols.


    This may or may not work: Try opening the file in Firefox (yes, in a web
    browser), and change "View -> Encoding -> More" to Unicode (UTF-16).
    This should give you a readable display, which you can then either
    copy and paste or save in a different encoding that is compatible
    with your other tools.

    Another option: Windows tools have the habit of adding a byte order mark
    (BOM) to any Unicode file, whether it's needed or not. Maybe it's just
    that whatever program created that file did not write the BOM and therefore
    the Windows programs don't recognize the encoding.
    If that is the case you could use your favourite editor to just inject the
    BOM at the beginning of the file.
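
Injecting the BOM can also be scripted instead of done in an editor. The sketch below is an assumption-laden illustration: the helper name add_utf16_bom is hypothetical, the caller has to know the byte order already, and whether Excel or Access then accepts the file is untested here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Prepend a UTF-16 byte order mark to raw bytes unless one is present.
# $order is 'BE' or 'LE'. (Hypothetical helper name, not from the thread.)
sub add_utf16_bom {
    my ($bytes, $order) = @_;
    my %bom_for = (BE => "\xFE\xFF", LE => "\xFF\xFE");
    my $bom = $bom_for{$order} or die "Unknown byte order: $order";

    # Leave the data alone if it already starts with either BOM.
    my $head = substr($bytes, 0, 2);
    return $bytes if $head eq "\xFE\xFF" or $head eq "\xFF\xFE";

    return $bom . $bytes;
}

if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    open my $ifh, '<:raw', $in or die "Cannot open $in: $!";
    my $bytes = do { local $/; <$ifh> };
    close $ifh;

    # Assumption: big-endian data, as the NUL-before-letter pattern suggests.
    open my $ofh, '>:raw', $out or die "Cannot open $out: $!";
    print {$ofh} add_utf16_bom($bytes, 'BE');
    close $ofh or die "Cannot close $out: $!";
}
```

Writing to a separate output file keeps the original untouched in case the byte-order guess is wrong.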

    jue


     
    Jürgen Exner, Jan 3, 2008
    #3
  4. Guest

    On Jan 3, 1:38 pm, Jürgen Exner wrote:
    > Nothing to do with Perl, but it appears that the file is encoded in a 16-bit
    > encoding, most likely UTF-16.


    Yes, thanks, UTF-16 it is. Since the guys who will work with this will
    use MS apps, I'll bow out ... but I may be back if they ask me to do
    something funky with the file (like create a script to spit out a
    report, which they may do since I think this will be a continuing
    task.)

    CC
     
    , Jan 3, 2008
    #4
