Converting codepages to UTF8

P

Hello,

Is there a Perl module which implements conversion of codepages
(such as you get when running "chcp" in a command prompt) to UTF8?
Something that allows me to specify, for example, codepage 437 and
then convert from it to UTF8. I've looked through the documentation for
the module Encode, but it doesn't seem to deal with codepages at all.

Thank you for any information you can provide that will nudge me in
the right direction.


Best regards,
Angela Druss
 
Dr.Ruud

P schreef:
Is there a Perl module which implements conversion of codepages
(such as you get when running "chcp" in a command prompt) to UTF8?
Something that allows me to specify, for example, codepage 437 and
then convert from it to UTF8. I've looked through the documentation for
the module Encode, but it doesn't seem to deal with codepages at all.


chcp is a command that displays or sets the console's active code page.

C:\>chcp /?
Displays or sets the active code page number.

CHCP [nnn]

nnn Specifies a code page number.

Type CHCP without a parameter to display the active code page number.


What do you want to do? If you want to convert a file from one encoding
to another, look for 'iconv'.
 
P

Dr.Ruud said:
P schreef:
Is there a Perl module which implements conversion of
codepages (such as you get when running "chcp" in a
command prompt) to UTF8? Something that allows me to
specify, for example, codepage 437 and then convert
from it to UTF8. I've looked through the documentation
for the module Encode, but it doesn't seem to deal with
codepages at all.


chcp is a command that displays or sets the console's active code page.

C:\>chcp /?
Displays or sets the active code page number.

CHCP [nnn]

nnn Specifies a code page number.

Type CHCP without a parameter to display the active code
page number.


Yes, if you call chcp without a parameter you can establish
the code page. I need that information to know what
I'm converting from.

What do you want to do? If you want to convert a file from
one encoding to another, look for 'iconv'.


That's not exactly what I want to do. I have one file, which
is in UTF8, which contains a set of strings. I want to
determine whether any of the strings matches any file name
in a specified directory. Since there can be special
characters in the file names (and in the strings in the UTF8
file), sometimes I'll get false negatives, because a simple
eq between the strings in the UTF8 file and the file names in
the directory won't match (due to the different encodings).
So I want to normalise the directory listing first (and this
should be dependent on the code page, because different
users might be using different code pages) and compare the
resulting list to the list in the UTF8 file. Does that make
sense? :)


Thank you for your input.
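
A minimal sketch of one way to read the active code page from within
Perl on Windows, by parsing the output of chcp; the exact wording of
that output depends on the Windows language, so the digit-matching
below is only an assumption:

#!/usr/bin/perl
use strict;
use warnings;

# Run chcp and pull the code page number out of its output
# (e.g. "Active code page: 437"); assumes the number is the
# last run of digits on the line.
my $chcp = `chcp`;
my ($cp) = $chcp =~ /(\d+)\D*$/
    or die "Couldn't determine the active code page from: $chcp";
my $encoding = "cp$cp";    # e.g. "cp437", usable as an Encode encoding name
print "Active code page is $cp (Encode name: $encoding)\n";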
 
Donald King

P said:
Hello,

Is there a Perl module which implements conversion of codepages
(such as you get when running "chcp" in a command prompt) to UTF8?
Something that allows me to specify, for example, codepage 437 and
then convert from it to UTF8. I've looked through the documentation for
the module Encode, but it doesn't seem to deal with codepages at all.

Thank you for any information you can provide that will nudge me in
the right direction.


Best regards,
Angela Druss

The Encode module should do what you want. As far as I know, Encode
supports all the codepages out there. Assuming that $filename has raw
octets in the native codepage, something like:

$unicodefn = decode("cp437", $filename);

... should do the trick. The resulting string will be in Perl's Unicode
format -- keep in mind that while Perl uses UTF-8 internally, Perl
treats Unicode strings differently from strings of raw UTF-8 octets.
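
A small sketch of that distinction, assuming a cp437 file name that
contains the octet 0x81 (u-umlaut in cp437); the name is made up for
illustration:

use strict;
use warnings;
use Encode qw(decode encode);

my $raw     = "M\x81ller.txt";            # raw cp437 octets, as readdir() would return them
my $unicode = decode('cp437', $raw);      # Perl character string: "Müller.txt"
my $octets  = encode('UTF-8', $unicode);  # raw UTF-8 octets (the u-umlaut becomes two bytes)

printf "cp437 octets : %d bytes\n", length $raw;      # 10
printf "characters   : %d chars\n", length $unicode;  # 10 (one character per letter)
printf "UTF-8 octets : %d bytes\n", length $octets;   # 11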
 
Dr.Ruud

P schreef:
I have one file, which
is in UTF8, which contains a set of strings. I want to
determine whether any of the strings matches any file name
in a specified directory.

Since there can be special
characters in the file names (and in the strings in the UTF8
file), sometimes I'll get false negatives, because a simple
eq between the strings in the UTF8 file and the file names in
the directory won't match (due to the different encodings).

So I want to normalise the directory listing first (and this
should be dependent on the code page, because different
users might be using different code pages) and compare the
resulting list to the list in the UTF8 file. Does that make
sense? :)

Yes, that is much clearer. I'll assume that you have Windows and maybe
Cygwin.


Have you read perllocale, perluniintro, perlunicode, perlebcdic?


Use the command:

for /f "tokens=4" %w in ('chcp') do dir >text.%w

to create a file called "text.437" (if your chcp is 437)
with the dir-output for the current directory.


Under cygwin, you can use the command:

iconv -f CP437 -t UTF-8 text.437 > text.utf8

to convert the file from cp437 to utf8.


But that second step can also be done with Perl.

(Almost) platform-independent way to see all available encodings:

perl -MEncode -e "print join $/, Encode->encodings(':all')" |more

Now it is your turn to create some code and try to make it work.
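
For reference, a hedged sketch of doing that second step in Perl
instead of iconv, using the PerlIO :encoding layers mentioned in
`perldoc Encode` (file names as in the example above):

#!/usr/bin/perl
use strict;
use warnings;

# Read cp437 octets, write UTF-8 octets; the layers do the conversion.
open my $in,  '<:encoding(cp437)', 'text.437'  or die "text.437: $!";
open my $out, '>:encoding(UTF-8)', 'text.utf8' or die "text.utf8: $!";
print {$out} $_ while <$in>;
close $out or die "text.utf8: $!";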
 
P

Dr.Ruud said:
P schreef:


Yes, that is much clearer. I'll assume that you have
Windows and maybe Cygwin.


Have you read perllocale, perluniintro, perlunicode,
perlebcdic?

Yes, I have, and while I consider myself slightly more
intelligent than a garden gnome, I must admit that these
issues concerning character encoding are beyond my abilities
of comprehension (at least at present).

Use the command:

for /f "tokens=4" %w in ('chcp') do dir >text.%w

to create a file called "text.437" (if your chcp is 437)
with the dir-output for the current directory.


I assume this is a demonstration, rather than part of a
solution? Or are you saying I'll have to write a temporary
file in this way to solve my problem?

Under cygwin, you can use the command:

iconv -f CP437 -t UTF-8 text.437 > text.utf8

to convert the file from cp437 to utf8.


I don't have iconv.

But that second step can also be done with Perl.

(Almost) platform-independent way to see all available
encodings:

perl -MEncode -e "print join $/, Encode->encodings(':all')" |more


OK, this, and Mr King's reply tell me that Encode is capable
of doing this. I need 'cp437', 'cp850' and 'cp852'
(depending on which machine I'm using). For the rest of this
post I'll assume that I'll be using 'cp437'.

Now it is your turn to create some code and try to make it
work.


Here's the script (stripped for the purposes of this post)
*before* tackling the encoding issues:

----------
#!/usr/bin/perl
use warnings;
use strict;

opendir(DIR, '.') or die "Can't open input directory: $!";

my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

while (<DATA>) {
    chomp;

    if ( exists $files{$_} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}

__DATA__
Đorđe Balašević
----------


A file named "Ðorde Bala-evic" *does* exist in the CWD, yet
when I run this script I get:

ÄorÄ?e Bala-eviÄ? doesn't match.


So I tried the following fix:

----------
while (<DATA>) {
    chomp;

    my $key = decode('cp437', $_);

    if ( exists $files{$key} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}
 
Dr.Ruud

P schreef:
#!/usr/bin/perl
use warnings;
use strict;

I think you need this:

use Encode qw(cp437 cp850 cp852);

or maybe

use Encode::Byte;

but see also the remarks about PerlIO in `perldoc Encode`.

opendir(DIR, '.') or die "Can't open input directory: $!";

Alternative:

opendir my $dir, '.'
    or die "Can't open input directory: $!";

my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);


Maybe:

my %files = map { $_ => 1 } grep { ! m/\A\.\.?\z/s } readdir $dir;

or:

my %files = map { $_ => 1 } grep -f, readdir $dir;

(untested)
 
P

Dr.Ruud said:
P schreef:


I think you need this:

use Encode qw(cp437 cp850 cp852);


But those are just arguments to the decode() subroutine. They aren't
exported by Encode.pm so that gives errors.

or maybe

use Encode::Byte;


According to the documentation for Encode::Byte, decode()
loads Encode::Byte implicitly.

but see also the remarks about PerlIO in `perldoc Encode`.


Those remarks only show a way to do the conversion on the fly.
The result is exactly the same, though.

Alternative:

opendir my $dir, '.'
or die "Can't open input directory: $!";



Maybe:

my %files = map { $_ => 1 } grep { ! m/\A\.\.?\z/s } readdir $dir;

or:

my %files = map { $_ => 1 } grep -f, readdir $dir;


These tips don't address the issue, though.


Thanks anyway.
 
P

Dr.Ruud said:
P schreef:


my $cp = 'cp437';


map { decode( $cp, $_ ) => 1 } grep ...


This does exactly what my code does, except at a different point in
time.
The result is the same.


Thanks anyway.
 
Dr.Ruud

P schreef:
Dr.Ruud:

But those are just arguments to the decode() subroutine. They aren't
exported by Encode.pm so that gives errors.

AFAIK, Encode exports by default just a few popular encodings. You need
to tell Encode explicitly which encodings you want it to export.
To get them all, you can use ':ALL', but you mentioned that you only
need those three.
 
Dr.Ruud

P schreef:
Dr.Ruud:


This does exactly what my code does, except at a different point in
time. The result is the same.

I think you are very wrong here. The value of the key needs to be in its
utf8 representation (flagged properly as utf8) at key setup time. The key
gets hashed.

Something to play with:

map { decode( $cp, $_ ) => $_ } grep { !m/^\.\.?$/ } readdir $dir;
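
A minimal sketch of that point, with a made-up name: a key stored
undecoded is hashed from the raw cp437 octets, so a decoded lookup
string will never find it.

use strict;
use warnings;
use Encode qw(decode);

my $raw  = "M\x81ller.txt";                       # cp437 octets, as readdir() returns them
my $name = decode('UTF-8', "M\xC3\xBCller.txt");  # the same name, read from a UTF-8 file

my %undecoded = ( $raw                  => 1 );   # key stored as raw octets
my %decoded   = ( decode('cp437', $raw) => 1 );   # key decoded at setup time

print exists $undecoded{$name} ? "match\n" : "no match\n";  # no match
print exists $decoded{$name}   ? "match\n" : "no match\n";  # match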
 
P

Dr.Ruud said:
P schreef:

AFAIK, Encode exports by default just a few popular encodings. You need
to tell Encode explicitly which encodings you want it to export.
To get them all, you can use ':ALL', but you mentioned that you only
need those three.


I see by this and other responses you have provided, that you know as
much about this as I do, which is to say not much at all. Thanks
anyway.
 
Dr.Ruud

P schreef:
Dr.Ruud:

I see by this and other responses you have provided, that you know as
much about this as I do, which is to say not much at all. Thanks
anyway.

Well, it has been a while since I used it.

The following (without 'use Encoding') works here as I expect, on a
Windows 2000 system with ActivePerl 5.8.8.

#!/usr/bin/perl
use strict;
use warnings;

# First, in a DOS-box, do this:
# C:\Perl\Projects\misc> chcp 437
# C:\Perl\Projects\misc> echo test > "Đorđe Balašević.txt"

chomp( my @datanames = <DATA> );

my $path = 'C:/Perl/Projects/misc';
opendir my $dir, $path or die $!;
my @filenames = grep { -f "$path/$_" } readdir $dir;

for my $dataname ( @datanames ) {
    for my $filename ( @filenames ) {
        print "$filename\n" if $filename eq $dataname;
    }
}

__DATA__
Đorđe Balašević.txt
zoölogic.dat
test.txt
 
Dr.Ruud

P schreef:
#!/usr/bin/perl
use warnings;
use strict;

opendir(DIR, '.') or die "Can't open input directory: $!";
my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

while (<DATA>) {
    chomp;
    if ( exists $files{$_} ) {
        print "$_ matches.\n";
    }
    else {
        print "$_ doesn't match.\n";
    }
}

__DATA__
Đorđe Balašević

A file named "Đorđe Balašević" *does* exist in the CWD, yet
when I run this script I get:
ÄorÄ?e Bala-eviÄ? doesn't match.

On a win2k system here with ActivePerl 5.8.8, it prints "matches".
What are the OS and Perl versions on your systems?


This simplified version prints "matches" too.

#!/usr/bin/perl
use strict; use warnings;
while (<DATA>) {
    chomp;
    if (-f) { print "$_ matches.\n" }
    else    { print "$_ doesn't match.\n" }
}
__DATA__
Đorđe Balašević

It doesn't make a difference whether I have created the file in a
DOS-box (cp437), or in Windows Explorer.
 
Peter J. Holzer

Dr.Ruud said:
The following (without 'use Encoding') works here as I expect, on a
Windows 2000 system with ActivePerl 5.8.8.
[code snipped]

AFAICS you aren't doing any charset conversion here, i.e. you are
depending on the charset of the script being the same as the charset of
the filesystem. But that isn't the case for Angela. She has a filesystem
with CP437 filenames and a file with UTF-8 filenames. So she has to
convert them to a common charset before she can compare them. You hit
(part of) the correct solution in another message with

| $cp = 'cp437';
| map { decode( $cp, $_ ) => 1 } grep ...

That converts all filenames from cp437 to the internal perl
representation.

So we still need to convert the filenames from the file to that
representation[0]. Since they are in UTF-8, we can simply stick a

$_ = decode('utf-8', $_);

before or after the chomp. That's probably the solution which is easiest
to understand: Whenever we read something in a particular encoding, we
have to decode it first. (And when you want to write something, you have
to encode it)

Perl 5.8 can do the decoding for you if you are reading from a file with
"IO layers". If you open a file with

open(F, "<:encoding($inputencoding)", $filename);

all data read from F will automatically be decoded. I recommend doing it
this way because then you don't have to sprinkle lots of decode() and
encode() calls through your code.

hp
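
Putting the pieces together, a hedged sketch of the original script
with both sides decoded to Perl character strings (cp437 assumed,
untested):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $cp = 'cp437';    # or cp850 / cp852, depending on the machine

opendir my $dir, '.' or die "Can't open input directory: $!";
my %files = map  { decode($cp, $_) => 1 }
            grep { !m/\A\.\.?\z/ } readdir $dir;
closedir $dir;

binmode DATA,   ':encoding(UTF-8)';   # decode the __DATA__ section on read
binmode STDOUT, ':encoding(UTF-8)';   # pick whatever your console actually uses

while ( my $name = <DATA> ) {
    chomp $name;
    print exists $files{$name} ? "$name matches.\n" : "$name doesn't match.\n";
}

__DATA__
Đorđe Balašević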
 
Dr.Ruud

Peter J. Holzer schreef:
Dr.Ruud:
The following (without 'use Encoding') works here as I expect, on a
Windows 2000 system with ActivePerl 5.8.8.
[code snipped]

AFAICS you aren't doing any charset conversion here, i.e. you are
depending on the charset of the script being the same as the charset of
the filesystem.

Yes, that's why I mentioned: without 'use Encoding'.

But that isn't the case for Angela. She has a filesystem
with CP437 filenames and a file with UTF-8 filenames.

I guess so too, but I am still not absolutely sure about that; that's
why I showed how it works on a Unicode-aware OS.
I wonder why she still hasn't mentioned more version details; older
software can behave rather differently.

So she has to
convert them to a common charset before she can compare them. You hit
(part of) the correct solution in another message with


That converts all filenames from cp437 to the internal perl
representation.

Yes, I really hoped that would get things going.

There may be situations where you would rather use from_to(), to
remain in "octets" space, but that might bite :) as well, because
characters like composites can have more than one representation (in
octets), which would make it miss some equalities. Perl's Unicode
support is quite reliable.
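
A short sketch of the composite issue, using Unicode::Normalize (in
core since 5.8): "é" as one precomposed code point and as "e" plus a
combining accent compare unequal until both sides are normalised.

use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{00E9}";     # é, LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "e\x{0301}";    # e followed by COMBINING ACUTE ACCENT

print $precomposed eq $decomposed           ? "equal\n" : "not equal\n";  # not equal
print NFC($precomposed) eq NFC($decomposed) ? "equal\n" : "not equal\n";  # equal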

Handy for debugging:

print((map {sprintf "[%s x%x %d] ", $_, ord($_), ord($_)} split //,
$buffer), "\n");

which shows 'abc' as '[a x61 97] [b x62 98] [c x63 99] '.


So we still need to convert the filenames from the file to that
representation[0]. Since they are in UTF-8, we can simply stick a

$_ = decode('utf-8', $_);

before or after the chomp.

I assumed that the file is read in utf-8 mode, meaning that the layer is
specified with the open() or even with a binmode().

Perl 5.8 can do the decoding for you if you are reading from a file
with "IO layers". If you open a file with

open(F, "<:encoding($inputencoding)", $filename);

all data read from F will automatically be decoded. I recommend doing
it this way because then you don't have to sprinkle lots of decode()
and encode() calls through your code.

OK, that's what I meant too; it's good that you mention the version
dependence.

For UTF-8, you can shorten it to: open(my $fh, '<:utf8', $filename).
(since Perl 5.?.? ?)
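
One caveat: the ':utf8' layer only flags the data as UTF-8 without
validating it, while ':encoding(UTF-8)' checks the octets. A short
sketch of the stricter spelling, with a made-up file name:

use strict;
use warnings;

my $filename = 'text.utf8';    # hypothetical file, just for illustration
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";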
 
