cartercc
I was given an odd CSV file this morning, with data fields delimited by
",". I do not know the provenance, and neither does the person who
gave it to me. It's an 11M data file with 1279 rows that contains data
for an urgent report, and the target system is Windows. Neither Excel
nor Access will take the file, but it opens up fine with cat, head,
etc.
I moved the file to my Unix system and looked at it in vi. For
example, 'perl is good' looks like this:
^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d
When I open up the output file in Windows, I get either ? or the
square unrecognizable-character symbols.
I ~think~ I know what the problem is, but I haven't found the answer.
Here are some of the things I've tried. The best result is a file with
ASCII characters but with large strings of ?????????????????? at the
end of each line. Is this an encoding problem? If so, how do I
convert the characters into plain ASCII that Excel or Access will
accept?
Thanks, CC
(code follows)
#!/usr/bin/perl -w
use strict;
use Encode; #tried various decode functions, also binmode
#open INFILE, "<BB.TXT";
open INFILE, "<:utf8", 'BB.TXT';
open OUTFILE, ">:utf8", 'cleanBB.txt';
while (<INFILE>)
{
    # print $_;
    my $line = $_;
    # $line = decode_utf8($line);
    # $line =~ s/\x{0000}//g;
    # $line =~ s/<.*>//g;
    $line =~ s/[^[:ascii:]]+//g;
    $line =~ s/<(.*)>//g; #removed file data between <>, not HTML.
    print OUTFILE $line;
}
close INFILE;
close OUTFILE;
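
A note on the ^@ characters: vi displays a NUL byte as ^@, and a NUL in
front of every ASCII character is what UTF-16 big-endian text looks
like, so a :utf8 layer would not decode it. What follows is only a
minimal sketch under that assumption and has not been run against
BB.TXT itself. The helper names guess_utf16 and utf16_to_ascii are made
up for illustration; decode is the standard Encode function.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the UTF-16 variant from the first bytes.
# A byte-order mark decides it when present; otherwise a leading NUL
# suggests big-endian, matching the ^@p^@e^@r^@l pattern seen in vi.
sub guess_utf16 {
    my ($head) = @_;
    return 'UTF-16'   if $head =~ /\A(?:\xFE\xFF|\xFF\xFE)/;  # BOM present
    return 'UTF-16BE' if $head =~ /\A\x00/;
    return 'UTF-16LE';
}

# Hypothetical helper: decode raw UTF-16 bytes into characters, then
# drop whatever is still outside the ASCII range.
sub utf16_to_ascii {
    my ($bytes) = @_;
    my $text = decode(guess_utf16($bytes), $bytes);
    $text =~ s/[^[:ascii:]]+//g;
    return $text;
}

# Runs only when given an input and an output file name.
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;

    # Read the raw bytes; no decoding layer on the handle.
    open my $ifh, '<:raw', $in or die "Cannot read $in: $!";
    my $bytes = do { local $/; <$ifh> };
    close $ifh;

    # Write the converted text back out byte for byte.
    open my $ofh, '>:raw', $out or die "Cannot write $out: $!";
    print {$ofh} utf16_to_ascii($bytes);
    close $ofh or die "Cannot close $out: $!";
}
```

Saved under a name such as utf16_to_ascii.pl (hypothetical), it would
be run as: perl utf16_to_ascii.pl BB.TXT cleanBB.txt. Note that NUL is
itself an ASCII character, so s/[^[:ascii:]]+//g on the undecoded bytes
cannot strip the ^@ bytes; the decode step has to come first.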
",". I do not know the provenance, and neither does the person who
gave it to me. It's a data file with 1279 rows and size is 11M that
contains data for an urgent report, and the target system is Windows.
Neither Excel nor Access will take the file, but it opens up fine in
with cat, head, etc.
I moved the file to my Unix system and looked at it in vi. This is
what it looks like, for example, 'perl is good' looks like this:
^@p^@e^@r^@l^@ ^@i^@s^@ ^@g^@o^@o^@d
When I open up the output file in windows, I either get the ? or the
square unrecognizible character symbols.
I ~think~ I know what the problem is, but I haven't found the answer.
Here are some of the things I've tried. The best result is a file with
ASCII characters but with large strings of ?????????????????? at the
end of each line. Is this an encoding problem? and if so, how do I
convert the characters into plain ASCII that Excel or Access will
accept?
Thanks, CC
(code follows)
#!/usr/bin/perl -w
use strict;
use Encode; #tried various decode functions, also binmode
#open INFILE, "<BB.TXT";
open INFILE, "<:utf8", 'BB.TXT';
open OUTFILE, ">:utf8", 'cleanBB.txt';
while (<INFILE>)
{
# print $_;
my $line = $_;
# $line = decode_utf8($line);
# $line =~ s/\x{0000}//g;
# $line =~ s/<.*>//g;
$line =~ s/[^[:ascii:]]+//g;
$line =~ s/<(.*)>//g; #removed file data between <>, not HTML.
print OUTFILE $line;
}
close INFILE;
close OUTFILE;