odd file with ^@ characters

cartercc

I was given an odd CSV file this morning, with data fields delimited by
",". I do not know the provenance, and neither does the person who
gave it to me. It's an 11M data file with 1279 rows that
contains data for an urgent report, and the target system is Windows.
Neither Excel nor Access will take the file, but it opens fine
with cat, head, etc.

I moved the file to my Unix system and looked at it in vi. For
example, 'perl is good' looks like this:
^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d
When I open the output file in Windows, I get either ? or the
square unrecognizable-character symbols.

I ~think~ I know what the problem is, but I haven't found the answer.
Here are some of the things I've tried. The best result is a file with
ASCII characters but with large strings of ?????????????????? at the
end of each line. Is this an encoding problem? If so, how do I
convert the characters into plain ASCII that Excel or Access will
accept?

Thanks, CC
(code follows)

#!/usr/bin/perl -w
use strict;
use Encode; #tried various decode functions, also binmode

#open INFILE, "<BB.TXT";
open INFILE, "<:utf8", 'BB.TXT';
open OUTFILE, ">:utf8", 'cleanBB.txt';

while (<INFILE>)
{
# print $_;
my $line = $_;
# $line = decode_utf8($line);
# $line =~ s/\x{0000}//g;
# $line =~ s/<.*>//g;
$line =~ s/[^[:ascii:]]+//g;
$line =~ s/<(.*)>//g; #removed file data between <>, not HTML.
print OUTFILE $line;
}
close INFILE;
close OUTFILE;
 
Peter Makholm

I moved the file to my Unix system and looked at it in vi. This is
what it looks like, for example, 'perl is good' looks like this:
^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d

Could be UTF-16BE or UCS-2BE.

//Makholm
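
A minimal sketch of how that guess could be checked and acted on, assuming the file is mostly ASCII text stored as UTF-16. The helper names guess_utf16_order and utf16_to_utf8 are hypothetical, and the NUL-counting test is only a heuristic: NULs before each character (as in the vi display above) suggest big-endian, NULs after each character suggest little-endian.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Guess the UTF-16 byte order of a raw byte string (hypothetical helper).
# A BOM decides if present; otherwise count NUL bytes at even and odd
# offsets in the first 4096 bytes. Heuristic only, for mostly-ASCII data.
sub guess_utf16_order {
    my ($raw) = @_;
    return 'UTF-16BE' if $raw =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $raw =~ /\A\xFF\xFE/;

    my ($even, $odd) = (0, 0);
    my $limit = length($raw) < 4096 ? length($raw) : 4096;
    for my $i (0 .. $limit - 1) {
        next unless substr($raw, $i, 1) eq "\0";
        if ($i % 2) { $odd++ } else { $even++ }
    }
    return $even >= $odd ? 'UTF-16BE' : 'UTF-16LE';
}

# Decode raw UTF-16 bytes and return the same text as UTF-8 bytes.
sub utf16_to_utf8 {
    my ($raw) = @_;
    my $text = decode(guess_utf16_order($raw), $raw);
    $text =~ s/\A\x{FEFF}//;    # drop a leading BOM character, if any
    return encode('UTF-8', $text);
}

# The byte pattern shown in vi: a NUL before every character.
my $sample = "\0p\0e\0r\0l\0 \0i\0s\0 \0g\0o\0o\0d";
print guess_utf16_order($sample), "\n";    # prints UTF-16BE
print utf16_to_utf8($sample), "\n";        # prints perl is good
```

For a real file, one option is to open it with the '<:raw' layer, read it in one go (11M fits comfortably in memory) and pass the bytes to utf16_to_utf8. Whether the Windows tools then accept the UTF-8 result is untested here.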
 
Jürgen Exner

I moved the file to my Unix system and looked at it in vi. This is
what it looks like, for example, 'perl is good' looks like this:
^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d

Nothing to do with Perl, but it appears as if the file is encoded in a 16-bit
encoding, most likely UTF-16.

When I open up the output file in windows, I either get the ? or the
square unrecognizable character symbols.

This may or may not work: Try opening the file in Firefox (yes, in a web
browser), and change "View -> Encoding -> More" to Unicode(UTF16).
This should give you a readable display, which you can then either
copy and paste or save in a different encoding that is compatible
with your other tools.

Another option: Windows tools have the habit of adding a byte order mark
(BOM) to any Unicode file, whether it's needed or not. Maybe it's just
that whatever program created that file did not write the BOM and therefore
the Windows programs don't recognize the encoding.
If that is the case you could use your favourite editor to just inject the
BOM at the beginning of the file.
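
A rough sketch of injecting the BOM from Perl rather than an editor. The helper name add_utf16_bom is hypothetical, and the byte order ('BE' or 'LE') is an assumption the caller has to supply; the BOM bytes themselves are FE FF for big-endian and FF FE for little-endian UTF-16.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the byte string with a UTF-16 BOM prepended, unless it already
# starts with one. $order ('BE' or 'LE') is supplied by the caller.
sub add_utf16_bom {
    my ($raw, $order) = @_;
    return $raw if $raw =~ /\A(?:\xFE\xFF|\xFF\xFE)/;
    my $bom = $order eq 'LE' ? "\xFF\xFE" : "\xFE\xFF";
    return $bom . $raw;
}

# Big-endian sample matching the ^@p^@e^@r^@l pattern from the thread.
my $fixed = add_utf16_bom("\0p\0e\0r\0l", 'BE');
print unpack('H*', $fixed), "\n";    # prints feff007000650072006c
```

When applying this to a file, both the input and output handles would need the ':raw' layer so the bytes pass through untouched. Whether Excel or Access then accepts a big-endian file is not something this sketch can promise.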

jue

 
cartercc

Nothing to do with Perl, but it appears as if the file is encoded in a 16-bit
encoding, most likely UTF-16.

Yes, thanks, UTF16 it is. Since the guys who will work with this will
use MS apps, I'll bow out ... but I may be back if they ask me to do
something funky with the file (like create a script to spit out a
report, which they may do since I think this will be a continuing
task.)

CC
 