Converting codepages to UTF8

Discussion in 'Perl Misc' started by P, Mar 30, 2006.

  1. P

    P Guest

    Hello,

    Is there a Perl module which implements converting of codepages
    (such as you get when running "chcp" in a command prompt) to UTF8?
    Something that allows me to specify, for example, codepage 437 and
    then converting it to UTF8. I've looked through the documentation for
    the module Encode, but it doesn't seem to deal with codepages at all.

    Thank you for any information you can provide that will nudge me in
    the right direction.


    Best regards,
    Angela Druss
     
    P, Mar 30, 2006
    #1

  2. P

    Dr.Ruud Guest

    P wrote:

    > Is there a Perl module which implements converting of codepages
    > (such as you get when running "chcp" in a command prompt) to UTF8?
    > Something that allows me to specify, for example, codepage 437 and
    > then converting it to UTF8. I've looked through the documentation for
    > the module Encode, but it doesn't seem to deal with codepages at all.



    chcp is a command to change the parameters of the display.

    C:\>chcp /?
    Displays or sets the active code page number.

    CHCP [nnn]

    nnn Specifies a code page number.

    Type CHCP without a parameter to display the active code page number.


    What do you want to do? If you want to convert a file from one encoding
    to another, look for 'iconv'.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 30, 2006
    #2

  3. P

    P Guest

    Dr.Ruud wrote:
    > P wrote:
    >
    > > Is there a Perl module which implements converting of
    > > codepages (such as you get when running "chcp" in a
    > > command prompt) to UTF8? Something that allows me to
    > > specify, for example, codepage 437 and then converting
    > > it to UTF8. I've looked through the documentation for
    > > the module Encode, but it doesn't seem to deal with
    > > codepages at all.

    >
    >
    > chcp is a command to change the parameters of the display.
    >
    > C:\>chcp /? Displays or sets the active code page number.
    >
    > CHCP [nnn]
    >
    > nnn Specifies a code page number.
    >
    > Type CHCP without a parameter to display the active code
    > page number.



    Yes, if you call chcp without a parameter you can find out
    the active code page. That information is necessary to know what
    I'm converting from.


    > What do you want to do? If you want to convert a file from
    > one encoding to another, look for 'iconv'.



    That's not exactly what I want to do. I have one file, which
    is in UTF8, which contains a set of strings. I want to
    determine whether any of the strings matches any file name
    in a specified directory. Since there can be special
    characters in the file names (and in the strings in the UTF8
    file), sometimes I'll get false negatives, because a simple
    eq on the strings in the UTF8 file and on the file names in
    the directory won't match (due to the different encodings).
    So I want to normalise the directory listing first (and this
    should be dependent on the code page, because different
    users might be using different code pages) and compare the
    resulting list to the list in the UTF8 file. Does that make
    sense? :)


    Thank you for your input.


    --
    Best regards,
    Angela Druss
     
    P, Mar 30, 2006
    #3
  4. P

    Donald King Guest

    P wrote:
    >
    > Hello,
    >
    > Is there a Perl module which implements converting of codepages
    > (such as you get when running "chcp" in a command prompt) to UTF8?
    > Something that allows me to specify, for example, codepage 437 and
    > then converting it to UTF8. I've looked through the documentation for
    > the module Encode, but it doesn't seem to deal with codepages at all.
    >
    > Thank you for any information you can provide that will nudge me in
    > the right direction.
    >
    >
    > Best regards,
    > Angela Druss
    >


    The Encode module should do what you want. As far as I know, Encode
    supports all the codepages out there. Assuming that $filename has raw
    octets in the native codepage, something like:

    $unicodefn = decode("cp437", $filename);

    ... should do the trick. The resulting string will be in Perl's Unicode
    format -- keep in mind that while Perl uses UTF-8 internally, Perl
    treats Unicode strings differently from strings of raw UTF-8 octets.
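The distinction between a decoded character string and a string of raw UTF-8 octets can be seen in a minimal sketch. The sample value is hypothetical; it relies only on the fact that "é" is the single octet 0x82 in code page 437.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: "café" as raw octets in code page 437 (é = 0x82).
my $cp437_octets = "caf\x82";

# decode() turns octets into a Perl character string.
my $chars = decode('cp437', $cp437_octets);

# encode() turns the character string back into octets, here UTF-8.
my $utf8_octets = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 octets: %d\n",
    length($chars), length($utf8_octets);

# The character string and the UTF-8 octet string are not equal under eq.
print "eq: ", ($chars eq $utf8_octets ? "same" : "different"), "\n";
```

The two strings describe the same text, but eq compares them element by element, so a character string only matches another character string.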

    --
    Donald King, a.k.a. Chronos Tachyon
    http://chronos-tachyon.net/
     
    Donald King, Mar 30, 2006
    #4
  5. P

    Dr.Ruud Guest

    P wrote:

    > I have one file, which
    > is in UTF8, which contains a set of strings. I want to
    > determine whether any of the strings matches any file name
    > in a specified directory.
    >
    > Since there can be special
    > characters in the file names (and in the strings in the UTF8
    > file), sometimes I'll get false negatives, because a simple
    > eq on the strings in the UTF8 file and on the file names in
    > the directory won't match (due to the different encodings).
    >
    > So I want to normalise the directory listing first (and this
    > should be dependent on the code page, because different
    > users might be using different code pages) and compare the
    > resulting list to the list in the UTF8 file. Does that make
    > sense? :)


    Yes, that is much clearer. I'll assume that you have Windows and maybe
    Cygwin.


    Have you read perllocale, perluniintro, perlunicode, perlebcdic?


    Use the command:

    for /f "tokens=4" %w in ('chcp') do dir >text.%w

    to create a file called "text.437" (if your chcp is 437)
    with the dir-output for the current directory.
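The same extraction of the code page number could be sketched in Perl. The helper name codepage_from_chcp is hypothetical, and the sample line assumes English-language chcp output; the sketch only assumes that the number is the last run of digits in the output.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the text printed by chcp into an Encode name.
# Assumes the code page number is the last run of digits in the output.
sub codepage_from_chcp {
    my ($output) = @_;
    return $output =~ /(\d+)\D*\z/ ? "cp$1" : undef;
}

# On Windows the text would come from running chcp, e.g. my $output = `chcp`;
my $sample = "Active code page: 437\n";   # assumed English-language output
print "Encode name: ", codepage_from_chcp($sample), "\n";
```

The returned name ("cp437" for the sample) is in the form that decode() accepts as its first argument.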


    Under cygwin, you can use the command:

    iconv -f CP437 -t UTF-8 text.437 > text.utf8

    to convert the file from cp437 to utf8.


    But that second step can also be done with Perl.

    (Almost) platform-independent way to see all available encodings:

    perl -MEncode -e "print join $/, Encode->encodings(':all')" |more
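Instead of scanning the full list by eye, a short script can ask Encode for specific names. This is a minimal sketch using Encode's find_encoding(); the shortlist of code pages is only an example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Example shortlist of code pages to check.
my %found;
for my $name (qw(cp437 cp850 cp852)) {
    my $enc = find_encoding($name);   # false for an unknown name
    $found{$name} = $enc ? $enc->name : undef;
    printf "%-6s %s\n", $name,
        defined $found{$name} ? "available as $found{$name}" : "not found";
}
```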

    Now it is your turn to create some code and try to make it work.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 30, 2006
    #5
  6. P

    P Guest

    Dr.Ruud wrote:
    > P wrote:
    >
    > > I have one file, which is in UTF8, which contains a set
    > > of strings. I want to determine whether any of the
    > > strings matches any file name in a specified directory.
    > >
    > > Since there can be special characters in the file names
    > > (and in the strings in the UTF8 file), sometimes I'll
    > > get false negatives, because a simple eq on the strings
    > > in the UTF8 file and on the file names in the directory
    > > won't match (due to the different encodings).
    > >
    > > So I want to normalise the directory listing first (and
    > > this should be dependent on the code page, because
    > > different users might be using different code pages) and
    > > compare the resulting list to the list in the UTF8 file.
    > > Does that make sense? :)

    >
    > Yes, that is much clearer. I'll assume that you have
    > Windows and maybe Cygwin.
    >
    >
    > Have you read perllocale, perluniintro, perlunicode,
    > perlebcdic?


    Yes, I have, and while I consider myself slightly more
    intelligent than a garden gnome, I must admit that these
    issues concerning character encoding are beyond my abilities
    of comprehension (at least at present).


    > Use the command:
    >
    > for /f "tokens=4" %w in ('chcp') do dir >text.%w
    >
    > to create a file called "text.437" (if your chcp is 437)
    > with the dir-output for the current directory.



    I assume this is a demonstration, rather than part of a
    solution? Or are you saying I'll have to write a temporary
    file in this way to solve my problem?


    > Under cygwin, you can use the command:
    >
    > iconv -f CP437 -t UTF-8 text.437 > text.utf8
    >
    > to convert the file from cp437 to utf8.



    I don't have iconv.


    > But that second step can also be done with Perl.
    >
    > (Almost) platform-independent way to see all available
    > encodings:
    >
    > perl -MEncode -e "print join $/, Encode->encodings(':all')" |more



    OK, this, and Mr King's reply tell me that Encode is capable
    of doing this. I need 'cp437', 'cp850' and 'cp852'
    (depending on which machine I'm using). For the rest of this
    post I'll assume that I'll be using 'cp437'.


    > Now it is your turn to create some code and try to make it
    > work.



    Here's the script (stripped for the purposes of this post)
    *before* tackling the encoding issues:

    ----------
    #!/usr/bin/perl
    use warnings;
    use strict;

    opendir(DIR, '.') or die "Can't open input directory: $!";

    my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

    while (<DATA>) {
        chomp;

        if ( exists $files{$_} ) {
            print "$_ matches.\n";
        }
        else {
            print "$_ doesn't match.\n";
        }
    }

    __DATA__
    Ðorde Bala-evic
    ----------


    A file named "Ðorde Bala-evic" *does* exist in the CWD, yet
    when I run this script I get:

    ÄorÄ?e Bala-eviÄ? doesn't match.


    So I tried the following fix:

    ----------
    while (<DATA>) {
        chomp;

        my $key = decode('cp437', $_);

        if ( exists $files{$key} ) {
            print "$_ matches.\n";
        }
        else {
            print "$_ doesn't match.\n";
        }
    }
    ----------


    But this gives the same exact result. What am I doing wrong?

    --
    Best regards,
    Angela Druss
     
    P, Mar 31, 2006
    #6
  7. P

    Dr.Ruud Guest

    P wrote:

    > #!/usr/bin/perl
    > use warnings;
    > use strict;


    I think you need this:

    use Encode qw(cp437 cp850 cp852);

    or maybe

    use Encode::Byte;

    but see also the remarks about PerlIO in `perldoc Encode`.


    > opendir(DIR, '.') or die "Can't open input directory: $!";


    Alternative:

    opendir my $dir, '.'
    or die "Can't open input directory: $!";

    > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);



    Maybe:

    my %files = map { $_ => 1 } grep { ! m/\A\.\.?\z/s } readdir $dir;

    or:

    my %files = map { $_ => 1 } grep -f, readdir $dir;

    (untested)


    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 31, 2006
    #7
  8. P

    Dr.Ruud Guest

    P wrote:


    my $cp = 'cp437';

    > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);


    map { decode( $cp, $_ ) => 1 } grep ...

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 31, 2006
    #8
  9. P

    P Guest

    Dr.Ruud wrote:
    > P wrote:
    >
    > > #!/usr/bin/perl
    > > use warnings;
    > > use strict;

    >
    > I think you need this:
    >
    > use Encode qw(cp437 cp850 cp852);



    But those are just arguments to the decode() subroutine. They aren't
    exported by Encode.pm so that gives errors.


    > or maybe
    >
    > use Encode::Byte;



    According to the documentation for Encode::Byte decode()
    loads Encode::Byte implicitly.


    > but see also the remarks about PerlIO in `perldoc Encode`.



    Those remarks only show a way to do the encoding on-the-fly.
    The result is exactly the same, though.


    > > opendir(DIR, '.') or die "Can't open input directory: $!";

    >
    > Alternative:
    >
    > opendir my $dir, '.'
    > or die "Can't open input directory: $!";
    >
    > > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

    >
    >
    > Maybe:
    >
    > my %files = map { $_ => 1 } grep { ! m/\A\.\.?\z/s } readdir $dir;
    >
    > or:
    >
    > my %files = map { $_ => 1 } grep -f, readdir $dir;



    These tips don't address the issue, though.


    Thanks anyway.


    --
    Best regards,
    Angela Druss
     
    P, Mar 31, 2006
    #9
  10. P

    P Guest

    Dr.Ruud wrote:
    > P wrote:
    >
    >
    > my $cp = 'cp437';
    >
    > > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

    >
    > map { decode( $cp, $_ ) => 1 } grep ...



    This does exactly what my code does, except at a different point in
    time. The result is the same.


    Thanks anyway.


    --
    Best regards,
    Angela Druss
     
    P, Mar 31, 2006
    #10
  11. P

    Dr.Ruud Guest

    P wrote:
    > Dr.Ruud:


    >> I think you need this:
    >>
    >> use Encode qw(cp437 cp850 cp852);

    >
    > But those are just arguments to the decode() subroutine. They aren't
    > exported by Encode.pm so that gives errors.


    AFAIK, Encode exports by default just a few popular encodings. You need
    to tell Encode explicitly which encodings you want it to export.
    To get them all, you can use ':ALL', but you mentioned that you only
    need those three.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 31, 2006
    #11
  12. P

    Dr.Ruud Guest

    P wrote:
    > Dr.Ruud:
    >> P:


    >> my $cp = 'cp437';
    >>
    >>> my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);

    >>
    >> map { decode( $cp, $_ ) => 1 } grep ...

    >
    >
    > This does exactly what my code does, except at a different point in
    > time. The result is the same.


    I think you are very wrong here. The value of the key needs to be in its
    utf8 presentation (flagged properly as utf8), at key setup time. The key
    gets hashed.

    Something to play with:

    map { decode( $cp, $_ ) => $_ } grep { ! m/^\.\.?$/ } readdir $dir;

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Mar 31, 2006
    #12
  13. P

    P Guest

    Dr.Ruud wrote:
    > P wrote:
    > > Dr.Ruud:

    >
    > >> I think you need this:
    > >>
    > >> use Encode qw(cp437 cp850 cp852);

    > >
    > > But those are just arguments to the decode() subroutine. They aren't
    > > exported by Encode.pm so that gives errors.

    >
    > AFAIK, Encode exports by default just a few popular encodings. You need
    > to tell Encode explicitly which encodings you want it to export.
    > To get them all, you can use ':ALL', but you mentioned that you only
    > need those three.



    I see by this and other responses you have provided, that you know as
    much about this as I do, which is to say not much at all. Thanks
    anyway.


    --
    Best regards,
    Angela Druss
     
    P, Apr 1, 2006
    #13
  14. P

    Dr.Ruud Guest

    P wrote:
    > Dr.Ruud:
    >> P:
    >>> Dr.Ruud:


    >>>> I think you need this:
    >>>>
    >>>> use Encode qw(cp437 cp850 cp852);
    >>>
    >>> But those are just arguments to the decode() subroutine. They aren't
    >>> exported by Encode.pm so that gives errors.

    >>
    >> AFAIK, Encode exports by default just a few popular encodings. You
    >> need to tell Encode explicitly which encodings you want it to export.
    >> To get them all, you can use ':ALL', but you mentioned that you only
    >> need those three.

    >
    > I see by this and other responses you have provided, that you know as
    > much about this as I do, which is to say not much at all. Thanks
    > anyway.


    Well, it has been a while since I used it.

    The following (without 'use Encoding') works here as I expect, on a
    Windows 2000 system with ActivePerl 5.8.8.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # First, in a DOS-box, do this:
    # C:\Perl\Projects\misc> chcp 437
    # C:\Perl\Projects\misc> echo test > Ðorde Bala-evic.txt

    chomp( my @datanames = <DATA> );

    my $path = 'C:/Perl/Projects/misc';
    opendir my $dir, $path or die $!;
    my @filenames = grep { -f "$path/$_" } readdir $dir;

    for my $dataname ( @datanames ) {
        for my $filename ( @filenames ) {
            print "$filename\n" if $filename eq $dataname;
        }
    }

    __DATA__
    Ðorde Bala-evic.txt
    zoölogic.dat
    test.txt

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 2, 2006
    #14
  15. P

    Dr.Ruud Guest

    P wrote:

    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > opendir(DIR, '.') or die "Can't open input directory: $!";
    > my %files = map { $_ => 1 } grep { $_ !~ m/^\.\.?$/ } readdir(DIR);
    >
    > while (<DATA>) {
    > chomp;
    > if ( exists $files{$_} ) {
    > print "$_ matches.\n";
    > }
    > else {
    > print "$_ doesn't match.\n";
    > }
    > }
    >
    > __DATA__
    > Ðorde Bala-evic
    >
    > A file named "Ðorde Bala-evic" *does* exist in the CWD, yet
    > when I run this script I get:
    > ÄorÄ?e Bala-eviÄ? doesn't match.


    On a win2k system here with ActivePerl 5.8.8, it prints "matches".
    What is the OS and Perl version at your systems?


    This simplified version prints "matches" too.

    #!/usr/bin/perl
    use strict; use warnings;
    while (<DATA>) {
        chomp;
        if (-f) { print "$_ matches.\n" }
        else    { print "$_ doesn't match.\n" }
    }
    __DATA__
    Ðorde Bala-evic

    It doesn't make a difference whether I have created the file in a
    DOS-box (cp437), or in Windows Explorer.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 2, 2006
    #15
  16. P

    Peter J. Holzer Guest

    Dr.Ruud wrote:
    > The following (without 'use Encoding') works here as I expect, on a
    > Windows 2000 system with ActivePerl 5.8.8.
    >

    [code snipped]

    AFAICS you aren't doing any charset conversion here, i.e. you are
    depending on the charset of script being the same as the charset of
    filesystem. But that isn't the case for Angela. She has a filesystem
    with CP437 filenames and a file with UTF-8 filenames. So she has to
    convert them to a common charset before she can compare them. You hit
    (part of) the correct solution in another message with

    | $cp = 'cp437';
    | map { decode( $cp, $_ ) => 1 } grep ...

    That converts all filenames from cp437 to the internal perl
    representation.

    So we still need to convert the filenames from the file to that
    representation[0]. Since they are in UTF-8, we can simply stick a

    $_ = decode('utf-8', $_);

    before or after the chomp. That's probably the solution which is easiest
    to understand: Whenever we read something in a particular encoding, we
    have to decode it first. (And when you want to write something, you have
    to encode it)
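Putting the two decoding steps described above together gives roughly the following. This is a minimal sketch, not Angela's actual script: the octet strings are hypothetical stand-ins for what readdir and the UTF-8 file would deliver, and cp437 is assumed as the code page of the directory listing.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-ins for real input, written as raw octets:
# "zoölogic.dat" as readdir might return it under code page 437 (ö = 0x94) ...
my @dir_octets  = ("zo\x94logic.dat", "test.txt");
# ... and names as lines read from a UTF-8 file (ö = 0xC3 0xB6).
my @list_octets = ("zo\xC3\xB6logic.dat\n", "missing.txt\n");

my $cp = 'cp437';   # assumed code page of the directory listing

# Decode the file names once, when the hash keys are built.
my %files = map { decode($cp, $_) => 1 } @dir_octets;

my @matched;
for my $line (@list_octets) {
    my $name = decode('UTF-8', $line);   # decode the UTF-8 side as well
    chomp $name;
    push @matched, $name if exists $files{$name};
}

printf "%d of %d names matched\n", scalar @matched, scalar @list_octets;
```

Both sides are character strings by the time they are compared, so the name with "ö" matches even though its octets differ between the two sources.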

    Perl 5.8 can do the decoding for you if you are reading from a file with
    "IO layers". If you open a file with

    open(F, "<:encoding($inputencoding)", $filename);

    all data read from F will automatically be decoded. I recommend doing it
    this way because then you don't have to sprinkle lots of decode() and
    encode() calls through your code.
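A minimal sketch of the layer approach follows. To stay self-contained it reads from an in-memory file rather than a file on disk; the octets are a hypothetical stand-in for one line of a UTF-8 file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a UTF-8 file on disk: an in-memory file
# holding the octets of "zoölogic.dat" followed by a newline.
my $octets = "zo\xC3\xB6logic.dat\n";

open(my $fh, '<:encoding(UTF-8)', \$octets)
    or die "Can't open in-memory file: $!";
chomp(my $name = <$fh>);
close $fh;

# The layer has already decoded the line: 12 characters, not 13 octets.
printf "%d characters\n", length $name;
```

With a real file, the reference to $octets would be replaced by the file name; the rest stays the same.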

    hp

    --
    _ | Peter J. Holzer | Löschung von at.usenet.schmankerl?
    |_|_) | Sysadmin WSR/LUGA |
    | | | | Diskussion derzeit in at.usenet.gruppen
    __/ | http://www.hjp.at/ |
     
    Peter J. Holzer, Apr 2, 2006
    #16
  17. P

    Dr.Ruud Guest

    Peter J. Holzer wrote:
    > Dr.Ruud:


    >> The following (without 'use Encoding') works here as I expect, on a
    >> Windows 2000 system with ActivePerl 5.8.8.
    >>

    > [code snipped]
    >
    > AFAICS you aren't doing any charset conversion here, i.e. you are
    > depending on the charset of script being the same as the charset of
    > filesystem.


    Yes, that's why I mentioned: without 'use Encoding'.


    > But that isn't the case for Angela. She has a filesystem
    > with CP437 filenames and a file with UTF-8 filenames.


    I guess so too, but I am still not absolutely sure about that, that's
    why I showed how it works on a Unicode-aware OS.
    I wonder why she still hasn't mentioned more version details, older
    software can behave rather differently.


    > So she has to
    > convert them to a common charset before she can compare them. You hit
    > (part of) the correct solution in another message with
    >
    >> $cp = 'cp437';
    >> map { decode( $cp, $_ ) => 1 } grep ...

    >
    > That converts all filenames from cp437 to the internal perl
    > representation.


    Yes, I really hoped that would get things going.

    There may be situations where you rather want to use from_to(), to
    remain in "octets" space, but that might bite :) as well, because
    characters like composites can have more than one representation (in
    octets), which would make it miss some equalities. Perl's Unicode
    support is quite reliable.
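The point about composites having more than one representation applies to decoded character strings too, so decoding alone may not be enough. One possible way to handle it, sketched here under the assumption that canonical equivalence is what is wanted, is to normalise both sides with the core module Unicode::Normalize before comparing. The sample strings are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Two character strings that both display as "zoölogic.dat":
my $precomposed = "zo\x{F6}logic.dat";    # ö as one code point, U+00F6
my $decomposed  = "zoo\x{308}logic.dat";  # o followed by U+0308 COMBINING DIAERESIS

my $plain_equal = $precomposed eq $decomposed;
my $nfc_equal   = NFC($precomposed) eq NFC($decomposed);

print "plain eq:  ", ($plain_equal ? "equal" : "different"), "\n";
print "after NFC: ", ($nfc_equal   ? "equal" : "different"), "\n";
```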

    Handy for debugging:

    print((map {sprintf "[%s x%x %d] ", $_, ord($_), ord($_)} split //,
    $buffer), "\n");

    which shows 'abc' as '[a x61 97] [b x62 98] [c x63 99] '.



    > So we still need to convert the filenames from the file to that
    > representation[0]. Since they are in UTF-8, we can simply stick a
    >
    > $_ = decode('utf-8', $_);
    >
    > before or after the chomp.


    I assumed that the file is read in utf-8 mode, meaning that the layer is
    specified with the open() or even with a binmode().


    > Perl 5.8 can do the decoding for you if you are reading from a file
    > with "IO layers". If you open a file with
    >
    > open(F, "<:encoding($inputencoding)", $filename);
    >
    > all data read from F will automatically be decoded. I recommend doing
    > it this way because then you don't have to sprinkle lots of decode()
    > and encode() calls through your code.


    OK, that's what I meant too; it's good that you mention the version
    dependence.

    For UTF-8, you can shorten it to: open(my $fh, '<:utf8', $filename).
    (since Perl 5.?.? ?)

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 2, 2006
    #17
