Creating UNICODE filenames with PERL 5.8

Discussion in 'Perl Misc' started by Allan Yates, Nov 17, 2003.

  1. Allan Yates

    Allan Yates Guest

    I have been having distinct trouble creating file names in PERL
    containing UNICODE characters. I am running ActiveState PERL 5.8 on
    Windows 2000.

    For a simple test, I picked a UNICODE character that could be
    displayed by Windows Explorer. I can select the character(U+0636) from
    'charmap' and cut/paste into a filename on Windows Explorer and the
    character displays the same as it does in 'charmap'. This proves that
    I have the font available.

    When I attempt to create the same filename with PERL, I end up with a
    filename two characters long: Ø¶

    If somebody could point me in the correct direction, I would very much
    appreciate it. I have read the UNICODE documents included with PERL as
    well as searching the newsgroups and the web, and everything appears to
    indicate this should work.

    Perl program:

    $name = chr(0x0636);

    if (!open(FILE, ">uni_names/$name")) {
        print STDERR "Could not open ($!): $name\n";
    }

    close(FILE);


    Thanks,

    Allan.
    a y a t e s a t s i g n i a n t d o t c o m
    Allan Yates, Nov 17, 2003
    #1

  2. On Mon, 17 Nov 2003, Allan Yates wrote:

    > I have been having distinct trouble creating file names in PERL
    > containing UNICODE characters. I am running ActiveState PERL 5.8 on
    > Windows 2000.


    N.B. I have limited expertise in this specific area, but some of the
    locals around here seem to look to me to answer Unicode questions of
    any kind, so I'll give it a try, as long as you take the answers with
    the necessary grains of salt...

    First important question is - have you set the option for wide
    character API in system calls?

    > For a simple test, I picked a UNICODE character that could be
    > displayed by Windows Explorer. I can select the character(U+0636) from


    that'd be Arabic letter DAD, right?

    Its utf-8 representation will be two octets: 0xd8, 0xb6.
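    (A quick check of those octets; a sketch using the core Encode module, which ships with Perl 5.8:)

```perl
use strict;
use warnings;
use Encode qw(encode);

my $dad    = chr(0x0636);            # ARABIC LETTER DAD
my $octets = encode('UTF-8', $dad);  # the UTF-8 byte string
print join(' ', map { sprintf '0x%02x', ord } split //, $octets), "\n";
# prints "0xd8 0xb6"
```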

    > 'charmap' and cut/paste into a filename on Windows Explorer and the
    > character displays the same as it does in 'charmap'. This proves that
    > I have the font available.


    (I think that's the least of your worries at the moment...)

    > When I attempt to create the same filename with PERL, I end up with a
    > filename two characters long: Ø¶


    Those look like 0xd8 and 0xb6 to me...

    At a quick glance, I suspect we are seeing the pair of octets that
    represent the character in utf-8 (Perl's internal representation)
    rather than as what Win32 would use, which AIUI is utf-16LE (which in
    this case would come out as 0x3606, IINM). However, I'm not sure that
    (other than for diagnostic purposes) you should ever need to tangle
    with it in that form, since Perl ought to know what to do in a (wide)
    system call.
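    (To see the Win32-side form described above, Encode can also produce the UTF-16LE octets directly; a sketch:)

```perl
use strict;
use warnings;
use Encode qw(encode);

my $dad = chr(0x0636);               # ARABIC LETTER DAD
my $le  = encode('UTF-16LE', $dad);  # the byte order Win32 uses
print join(' ', map { sprintf '0x%02x', ord } split //, $le), "\n";
# prints "0x36 0x06"
```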

    The system call is evidently treating them as two one-byte characters,
    hence my question about wide system calls. Look for the reference to
    wide system calls in the perlrun page, and the other references to
    which it links.

    > If somebody could point me in the correct direction, I would very much
    > appreciate it. I have read the UNICODE documents included with PERL as


    OK, but there are also some Win32-specific documents/web-pages that
    come with the ActivePerl distribution. In some situations they might
    be just what you need.

    > well as searching the newsgroups and the web, and everything appears to
    > indicate this should work.


    If the above is not the answer, then maybe Win32API::File has
    something for you - but I've never been there myself, so don't pay too
    much attention to that.

    > Perl program:


    But did you start it with the -C option, or set the wide system calls
    thingy? I think that may prove to be the key.

    Good luck, and please report your findings.
    Alan J. Flavell, Nov 17, 2003
    #2

  3. Allan Yates

    Ben Morrow Guest

    (Allan Yates) wrote:
    > I have been having distinct trouble creating file names in PERL


    Perl or perl, not PERL.

    > containing UNICODE


    I'm not so sure about UNICODE...

    > For a simple test, I picked a UNICODE character that could be
    > displayed by Windows Explorer. I can select the character(U+0636) from
    > 'charmap' and cut/paste into a filename on Windows Explorer and the
    > character displays the same as it does in 'charmap'. This proves that
    > I have the font available.
    >
    > When I attempt to create the same filename with PERL, I end up with a
    > filename two characters long: Ø¶


    OK, your problem here is that Win2k is being stupid about Unicode: any
    sensible OS that understood UTF8 would be fine :). My guess would be
    that Windows stores filenames in utf16 with a BOM, and if it doesn't
    find a BOM it assumes ASCII/'Windows ANSI'... so try this:

    use Encode;

    > $name = chr(0x0636);


    $name = encode "utf16", $name;

    > if (!open(FILE,">uni_names/$name")) {
    > print STDERR "Could not open ($!): $name\n";
    > }
    >
    > close (FILE);


    If that works, then we could really do with an addition to the 'open'
    pragma to do it for you: use open NAMES => "utf16";... hmmm.
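    (For what it's worth, Encode resolves "utf16" to UTF-16, which on encoding prepends a BOM and assumes big-endian order, so the suggestion above would turn DAD into four octets; a sketch:)

```perl
use strict;
use warnings;
use Encode qw(encode);

my $name = chr(0x0636);
$name = encode('UTF-16', $name);  # what "utf16" resolves to: BOM + big-endian
print join(' ', map { sprintf '0x%02x', ord } split //, $name), "\n";
# prints "0xfe 0xff 0x06 0x36"
```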

    If it fails, delete your file in uni_names and create one by
    copy/pasting that character out of charmap. Then run

    #!/usr/bin/perl

    use warnings;
    use bytes;

    opendir my $U, "uni_names";
    my @n = readdir $U;
    $, = $\ = "\n";
    print map { "$_: " . join ' ', map { ord } split // } @n;

    __END__

    and tell me what it says.

    Ben

    --
    And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
    * *
    Ben Morrow, Nov 17, 2003
    #3
  4. Allan Yates

    Allan Yates Guest

    The key was the missing "-C". I didn't clue in from the documentation
    that this was important. Once I added that command line parameter, the
    file was created with the correct name.

    My next step was to read the file name from the directory. However, I
    thought I read in some documentation somewhere that 'readdir' is not
    UNICODE aware. I seemed to prove this by reading the directory
    containing the file I just created. It comes back with a two character
    file name that 'ord' into 0xd8 and 0xb6 as you indicated.

    Do you know of a method of reading directories to get the UNICODE file
    names?


    Thanks,

    Allan.

    "Alan J. Flavell" <> wrote in message news:<>...
    > On Mon, 17 Nov 2003, Allan Yates wrote:
    >
    > > I have been having distinct trouble creating file names in PERL
    > > containing UNICODE characters. I am running ActiveState PERL 5.8 on
    > > Windows 2000.

    >
    > N.B I have limited expertise in this specific area, but some of the
    > locals around here seem to look to me to answer Unicode questions of
    > any kind, so I'll give it a try, as long as you take the answers with
    > the necessary grains of salt...
    >
    > First important question is - have you set the option for wide
    > character API in system calls?
    >
    > > For a simple test, I picked a UNICODE character that could be
    > > displayed by Windows Explorer. I can select the character(U+0636) from

    >
    > that'd be Arabic letter DAD, right?
    >
    > Its utf-8 representation will be two octets: 0xd8, 0xb6.
    >
    > > 'charmap' and cut/paste into a filename on Windows Explorer and the
    > > character displays the same as it does in 'charmap'. This proves that
    > > I have the font available.

    >
    > (I think that's the least of your worries at the moment...)
    >
    > > When I attempt to create the same filename with PERL, I end up with a
    > > filename two characters long: ض

    >
    > Those look like 0xd8 and 0xb6 to me...
    >
    > At a quick glance, I suspect we are seeing the pair of octets that
    > represent the character in utf-8 (Perl's internal representation)
    > rather than as what Win32 would use, which AIUI is utf-16LE (which in
    > this case would come out as 0x3606, IINM). However, I'm not sure that
    > (other than for diagnostic purposes) you should ever need to tangle
    > with it in that form, since Perl ought to know what to do in a (wide)
    > system call.
    >
    > The system call is evidently treating them as two one-byte characters,
    > hence my question about wide system calls. Look for the reference to
    > wide system calls in the perlrun page, and the other references to
    > which it links.
    >
    > > I somebody could point me in the correct direction, I would very much
    > > appreciate it. I have read the UNICODE documents included with PERL as

    >
    > OK, but there are also some Win32-specific documents/web-pages that
    > come with the ActivePerl distribution. In some situations they might
    > be just what you need.
    >
    > > well searching the newgroups and the web, and everything appears to
    > > indicate this should work.

    >
    > If the above is not the answer, then maybe Win32API::File has
    > something for you - but I've never been there myself, so don't pay too
    > much attention to that.
    >
    > > Perl program:

    >
    > But did you start it with the -C option, or set the wide system calls
    > thingy? I think that may prove to be the key.
    >
    > Good luck, and please report your findings.
    Allan Yates, Nov 17, 2003
    #4
  5. Allan Yates

    Allan Yates Guest

    But

    You are correct that unicode is not an acronym and should not be
    capitalised. My deepest apologies for offending you through the use of
    my grammer. I was not aware that grammer police were covering this
    newsgroup. PERL is an acronym, "Practical Extraction and Report
    Language", and thus may be capitalised.


    Allan.

    P.S. Please don't even think of chastising me for top posting versus
    bottom posting. Different people have different preferences.

    P.P.S. For the people who have ignored my grammer and helped me in my
    quest, I am very appeciative.

    Abigail <> wrote in message news:<>...
    > Allan Yates () wrote on MMMDCCXXX September MCMXCIII in
    > <URL:news:>:
    > \\ I have been having distinct trouble creating file names in PERL
    > \\ containing UNICODE characters. I am running ActiveState PERL 5.8 on
    > \\ Windows 2000.
    >
    > Neither Perl, nor Unicode are acronyms, so they aren't spelled in
    > all caps. If you do, it's like you are shouting. And that's rude.
    >
    >
    > Abigail
    Allan Yates, Nov 18, 2003
    #5
  6. Allan Yates

    Ben Morrow Guest

    (Allan Yates) wrote:
    > You are correct that unicode is not an acronym and should not be
    > capitalised. My deepest apologies for offending you through the use of
    > my grammer. I was not aware that grammer police were covering this
    > newsgroup.


    'Grammar police' cover every ng worth having, the reason being that it
    is very much easier to understand people when their spelling/grammar/
    punctuation is correct.

    > PERL is an acronym, "Practical Extraction and Report Language", and
    > thus may be capitalised.


    Nope, it isn't. From perlfaq1:

    | But never write "PERL", because perl is not an acronym, apocryphal
    | folklore and post- facto expansions notwithstanding.

    > P.S. Please don't even think of chastising me for top posting versus
    > bottom posting. Different people have different preferences.


    No they don't. Only idiots prefer top-posting.

    *PLONK*

    Ben

    --
    If I were a butterfly I'd live for a day, / I would be free, just blowing away.
    This cruel country has driven me down / Teased me and lied, teased me and lied.
    I've only sad stories to tell to this town: / My dreams have withered and died.
    <=>=<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=>=<=> (Kate Rusby)
    Ben Morrow, Nov 18, 2003
    #6
  7. Ben Morrow () wrote:
    : [...]
    : OK, your problem here is that Win2k is being stupid about Unicode: any
    : sensible OS that understood UTF8 would be fine :).

    Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
    the simple expedient of using 16 bit characters. It is hardware that is
    stupid, by continuing to use ancient tiny 8 bit elementary units.

    Imagine if all that hardware still used 16 or 24 bit memory addresses.
    Imagine if all our communication and hardware backbones still actually
    transmitted data in single digit bit sizes.

    Character size was always a compromise between functionality and memory.
    Character size continually increased from the first character-manipulating
    electronic equipment of the (gee, way way back, believe it or not) 1930s
    until the 1980s, when it suddenly solidified into a standard elementary
    unit that was still a compromise in terms of size, but is now clearly too
    small.

    Character size remains frozen due to one of Murphy's laws regarding the
    success of hardware first built using compromises that were appropriate
    twenty years ago.
    Malcolm Dew-Jones, Nov 19, 2003
    #7
  8. Allan Yates

    Ben Morrow Guest

    [OT] Re: Creating UNICODE filenames with PERL 5.8

    (Malcolm Dew-Jones) wrote:
    > Ben Morrow () wrote:
    > : OK, your problem here is that Win2k is being stupid about Unicode: any
    > : sensible OS that understood UTF8 would be fine :).
    >
    > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
    > the simple expedient of using 16 bit characters. It is hardware that is
    > stupid, by continuing to use ancient tiny 8 bit elementary units.


    OK, I invited that with gratuitous OS-bashing :)... nevertheless:

    1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
    work around those who started assuming it was before the standards
    were properly in place.

    2. Given that the world does, in fact, use 8-bit bytes, any 16-bit
    encoding has this small problem of endianness... again, solved
    (IMHO) less-than-elegantly by the Unicode Consortium.

    3. Given that the most widespread character set is likely to be either
    ASCII or Chinese ideograms, and ideograms won't fit into less than
    16 bits anyway, it seems pretty silly to encode a 7-bit charset
    with 16 bits per character.

    4. It also seems pretty silly to break everything in the world that
    relies on a byte of 0 meaning end-of-string, not to mention '/'
    being '/' (or '\', or whatever, as appropriate).

    et cetera

    Ben

    --
    And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
    * *
    Ben Morrow, Nov 19, 2003
    #8
  9. Allan Yates <> wrote:

    > PERL is an acronym,



    No it isn't, smarty pants.


    > P.S. Please don't even think of chastising me for top posting versus
    > bottom posting. Different people have different preferences.



    No chastisement, just ignoration in perpetuity.

    *plonk*


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Nov 19, 2003
    #9
  10. Also sprach Allan Yates:

    > P.S. Please don't even think of chastising me for top posting versus
    > bottom posting. Different people have different preferences.


    Right. And unless you write those articles solely for yourself, the
    preferences of your readers count and not yours. So stop top-posting or
    the regulars will stop reading your posts.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Nov 19, 2003
    #10
  11. Allan Yates

    Anno Siegel Guest

    Allan Yates <> wrote in comp.lang.perl.misc:
    > But
    >
    > You are correct that unicode is not an acronym and should not be
    > capitalised. My deepest apologies for offending you through the use of
    > my grammer. I was not aware that grammer police were covering this
    > newsgroup.


    Grammar. And grammar isn't the problem, spelling is.

    > PERL is an acronym, "Practical Extraction and Report
    > Language", and thus may be capitalised.


    Nope. That was retro-fitted.

    > Allan.
    >
    > P.S. Please don't even think of chastising me for top posting versus
    > bottom posting. Different people have different preferences.


    Complaining about grammar police and playing thought police?

    > P.P.S. For the people who have ignored my grammer and helped me in my
    > quest, I am very appeciative.


    Translation: "Others can **** off". I think you got what you want.

    Anno
    Anno Siegel, Nov 19, 2003
    #11
  12. On Wed, 18 Nov 2003, Malcolm Dew-Jones wrote:

    > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
    > the simple expedient of using 16 bit characters.


    ...which unfortunately turns out to be somewhat of a mistake, seeing
    that Unicode went and broke the 16-bit boundary.

    > It is hardware that is
    > stupid, by continuing to use ancient tiny 8 bit elementary units.


    utf-8 is the closest they managed to get to variable-length character
    encoding. It's not perfect, but it gets around quite a lot of the
    compatibility problems that exist with other approaches.

    > Imagine if all that hardware still used 16 or 24 bit memory addresses.


    Imagine if every us-ascii character were required to occupy 64 bits?
    And then there's legacy data to think about.

    > Character size was always a compromise between functionality and memory.


    Agreed.

    > Character size continually increased from the first character manipulating
    > electronic equipment of the (gee, way way back 1930's or so, believe it or
    > not)


    Interestingly, those early codes regularly had shift-in and shift-out
    codes to extend their repertoire. A practice which faded out for a
    while, almost got reborn in a big way in ISO-2022, and then -
    iso-10646/Unicode and associated encodings. I wonder what the future
    holds in store? ;-)

    > Character size remains frozen due to one of Murphy's laws regarding the
    > success of hardware first built using compromises that were appropriate
    > twenty years ago.


    It's easy to poke fun, but it's harder to come up with a viable
    compromise IMHO.

    all the best
    Alan J. Flavell, Nov 19, 2003
    #12
  13. On Wed, 19 Nov 2003, Anno Siegel reveals to all and sundry that:

    > Allan Yates <> wrote in comp.lang.perl.misc:
    >
    > > P.P.S. For the people who have ignored my grammer and helped me in my
    > > quest, I am very appeciative.


    I don't think so. The O.P. could show appreciation by trying to fit in
    with the conventions of Usenet, and participate in the sharing.
    Spitting in the group's collective face is no way to show one's
    appreciation, that's for sure.

    > Translation: "Others can **** off".


    I took the hint, too.

    > I think you got what you want.


    I guess he did, just the once. Well, I hope more-perceptive others
    can learn from his mistakes, and I mean not only in terms of technical
    content but also in terms of newsgroup interaction.

    So much for pot luck.
    Alan J. Flavell, Nov 19, 2003
    #13
  14. Re: [OT] Re: Creating UNICODE filenames with PERL 5.8

    Some history required...


    "Ben Morrow" <> wrote in message
    news:bpel00$sl6$...
    > (Malcolm Dew-Jones) wrote:
    > > Ben Morrow () wrote:
    > > : OK, your problem here is that Win2k is being stupid about Unicode: any
    > > : sensible OS that understood UTF8 would be fine :).
    > >
    > > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
    > > the simple expedient of using 16 bit characters. It is hardware that is
    > > stupid, by continuing to use ancient tiny 8 bit elementary units.

    >
    > OK, I invited that with gratuitous OS-bashing :)... nevertheless:
    >
    > 1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
    > work around those who started assuming it was before the standards
    > were properly in place.


    Unicode 1.0 WAS a 16-bit character set. So there. UTF16 is a representation
    of Unicode 3.0 which was selected to be backwards compatible with Unicode
    1.0.

    The reason why NT doesn't use UTF-8 is that --- wait for it --- it wasn't
    invented back then. UTF-8 was specified in 1993, and adopted as an ISO
    standard in 1994. Windows NT shipped in 1993, after 5 years in development.
    Guess what: Decision on character set had to be made in the eighties.

    Yes, they got it wrong. They should have selected UTF-8. They should have
    INVENTED UTF-8.

    So you can knock them for not having the foresight to know that 65535
    characters wouldn't be enough. That's a mistake a lot of people made, and
    with hindsight it is unaccountable: it required a conscious decision to
    exclude uncommon characters. The best explanation I have heard for why this
    is wrong: "An uncommon character is a common character if it is your name,
    or the name of the place where you live".

    But don't knock them for not using UTF-8. Clearly anyone designing an OS now
    would use UTF-8, of course.

    Cheers,
    Ben Liddicott
    Ben Liddicott, Nov 19, 2003
    #14
  15. Allan Yates

    Ben Morrow Guest

    Re: [OT] Re: Creating UNICODE filenames with PERL 5.8

    "Ben Liddicott" <> wrote:
    > Some history required...
    >
    > "Ben Morrow" <> wrote in message
    > news:bpel00$sl6$...
    > [...]
    > > 1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
    > > work around those who started assuming it was before the standards
    > > were properly in place.
    >
    > Unicode 1.0 WAS a 16-bit character set. So there. UTF16 is a representation
    > of Unicode 3.0 which is selected to be backwards compatible with Unicode
    > 1.0.


    OK. This doesn't stop it being completely wrong. Given the choice between
    breaking compatibility with the few people who implemented Unicode 1.0,
    breaking compatibility with everyone else who was still assuming everything
    was a superset of ASCII, and creating seven[1] different, incompatible
    representations of the supposed answer to character-encoding problems, it
    is fairly clear to me at least which is the right answer.

    Not to mention that, because of the endianness problem, ucs-2 was
    broken as an encoding from the start.

    [1] utf8, utf16 BE, LE and with BOM, utf32 ditto.

    > So you can knock them for not having the foresight to know that 65535
    > characters wouldn't be enough.


    I can also knock them for not having changed in the ten years since
    NT3.5 was released. It is not *that* difficult a change to implement,
    as Perl 5.8 has demonstrated; even though it has some nasty bits,
    ditto.

    Ben

    --
    $.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
    $x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
    {$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t #
    $J::u::s::t, $a::n::o::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
    Ben Morrow, Nov 19, 2003
    #15
  16. Re: [OT] Re: Creating UNICODE filenames with PERL 5.8

    On Wed, 19 Nov 2003, Ben Liddicott wrote:

    > Guess what: Decision on character set had to be made in the eighties.


    Yeah: as far as I recall, IBM invented DBCS EBCDIC. Doubtless a fine
    standard for its time. But things move on.
    Alan J. Flavell, Nov 19, 2003
    #16
    Probably your best bet is to try to use Unicode::String to convert the file
    names to utf-8. Perl is obviously reading the filenames using the Unicode
    API (otherwise you would get REPLACEMENT CHARACTER instead), but not
    recognising that it has done so.

    Alternatively, with Win32::API you can use the Win32 functions
    FindFirstFileW, FindNextFileW and FindClose (FindClose has no separate W
    variant). This should be pretty much guaranteed to work.

    Alternatively you can see if File::Find works, though I suspect it may
    suffer the same problems.

    Alternatively again, you can try spawning a cmd shell, and parsing the
    output. This is only going to be any good if ${^WIDE_SYSTEM_CALLS} affects
    qx() or open("command |"), and I don't know if it does or not.

    If you specify /u to cmd.exe, it sets the console output to UTF-16, which
    you could convert back by hand, using Unicode::String. I'm not entirely sure
    how one could send unicode in through $sDirName, though. Experimentation may
    tell you.

    use IO::File;

    # /u means unicode, /c means run command and exit
    my $sDirCommand = qq(cmd.exe /u /c dir /a "$sDirName");
    my $fh = IO::File->new("$sDirCommand |")   # trailing "|" opens a pipe
        or die "cannot run dir: $!";

    Cheers,
    Ben Liddicott


    "Allan Yates" <> wrote in message
    news:...
    > The key was the missing "-C". I didn't clue in from the documentation
    > that this was important. Once I added that command line parameter, the
    > file was created with the correct name.
    >
    > My next step was to read the file name from the directory. However, I
    > thought I read in some documentation somewhere that 'readdir' is not
    > UNICODE aware. I seemed to prove this by reading the directory
    > containing the file I just created. It comes back with a two character
    > file name that 'ord' into 0xd8 and 0xb6 as you indicated.
    >
    > Do you know of a method of reading directories to get the UNICODE file
    > names?
    >
    >
    Ben Liddicott, Nov 19, 2003
    #17
  18. Allan Yates

    Ben Morrow Guest

    [stop top-posting]

    "Ben Liddicott" <> wrote:
    > "Allan Yates" <> wrote in message
    > news:...
    > > The key was the missing "-C". I didn't clue in from the documentation
    > > that this was important. Once I added that command line parameter, the
    > > file was created with the correct name.


    Note that the functionality of -C no longer exists under 5.8.1, and
    perl581delta claims it didn't work under 5.8.0 either.

    > > My next step was to read the file name from the directory. However, I
    > > thought I read in some documentation somewhere that 'readdir' is not
    > > UNICODE aware. I seemed to prove this by reading the directory
    > > containing the file I just created. It comes back with a two character
    > > file name that 'ord' into 0xd8 and 0xb6 as you indicated.
    > >
    > > Do you know of a method of reading directories to get the UNICODE file
    > > names?

    >
    > Probably your best bet is to try to use Unicode::String to convert the file
    > names to utf-8. It is obviously reading the filenames using the Unicode API,
    > (otherwise you would get REPLACEMENT CHARACTER instead), but not recognising
    > that it has done so.


    No. The right answer is to use Encode::decode to convert *from* utf16.
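    (A sketch of that direction, assuming the name arrives as raw UTF-16LE octets, the byte order Win32 uses:)

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw  = "\x36\x06";                # U+0636 as UTF-16LE octets
my $name = decode('UTF-16LE', $raw);  # back to a one-character Perl string
printf "U+%04X\n", ord $name;
# prints "U+0636"
```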

    > Alternatively you can see if File::Find works, though I suspect it may
    > suffer the same problems.


    Why don't you look? A quick grep through perldoc -m File::Find shows
    that the names come straight out of readdir, so yes, it will suffer
    exactly the same problems.

    > Alternatively again, you can try spawning a cmd shell, and parsing the
    > output. This is only going to be any good if ${^WIDE_SYSTEM_CALLS} affects
    > qx() or open("command |"), and I don't know if it does or not.


    Bleech. And no, -C will have no effect on this; rather, it will be
    affected by the PerlIO layers pushed onto the filehandle.

    > If you specify /u to cmd.exe, it sets the console output to UTF-16, which
    > you could convert back by hand, using Unicode::String. I'm not entirely sure
    > how one could send unicode in through $sDirName, though.


    Either -C will use a Unicode-aware pipe-opening API, and it will Just
    Work, or use Encode::encode to encode it into whatever Windows expects
    command lines to be specified in, probably utf16.

    Ben

    --
    I've seen things you people wouldn't believe: attack ships on fire off the
    shoulder of Orion; I've watched C-beams glitter in the darkness near the
    Tannhauser Gate. All these moments will be lost, in time, like tears in rain.
    Time to die. |-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-|
    Ben Morrow, Nov 19, 2003
    #18
  19. Alan J. Flavell () wrote:
    : On Wed, 18 Nov 2003, Malcolm Dew-Jones wrote:

    : > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
    : > the simple expedient of using 16 bit characters.

    : ...which unfortunately turns out to be somewhat of a mistake, seeing
    : that Unicode went and broke the 16-bit boundary.

    Which was also a mistake. "Character" now includes all the hieroglyphics
    of places like China (but why not all the hieroglyphics of, say, ancient
    Egypt? why not all the standardized international road symbols?). When
    the Arabians invented the modern idea of characters, it became widely
    recognized as much more powerful, fundamentally better, and fundamentally
    "different" from the old single-picture-means-a-word method of writing.
    Now we have jumped backwards 1800 years. Things like Chinese writing
    should not be treated using standardized application-level encodings, just
    as we now standardize many markup languages for encoding other higher-level
    data. ($0.02)

    : > It is hardware that is
    : > stupid, by continuing to use ancient tiny 8 bit elementary units.

    : utf-8 is the closest they managed to get to variable-length character
    : encoding. It's not perfect, but it gets around quite a lot of the
    : compatibility problems that exist with other approaches.

    : > Imagine if all that hardware still used 16 or 24 bit memory addresses.

    : Imagine if every us-ascii character were required to occupy 64 bits?

    First, it would never be 64 bits per character. Even if we hardcoded
    current Unicode values, it would be no more than 24 bits per character.

    That's three (or two, at 16 bits) times the space, which for the vast
    majority of users would be irrelevant anyway due to the enormous increase
    in storage capacities.

    Also, it is almost a norm to store any static data in compressed format,
    and compression tools would utilize the larger character size to pack more
    data, so the total storage space required for a lot of data would not
    increase.

    Things that would truly be affected, such as humongous databases, already
    have to use many mechanisms to be able to manipulate the data, and I'm
    sure they could find ways to handle the larger volumes, probably by using
    the exact reverse of wide characters.

    : And then there's legacy data to think about.

    stored on legacy systems, and manipulated using legacy software and
    hardware.

    This is Murphy's law. Because the old systems have been successful, new
    systems can't be made better.

    : > Character size was always a compromise between functionality and memory.

    : Agreed.

    : > Character size continually increased from the first character manipulating
    : > electronic equipment of the (gee, way way back 1930's or so, believe it or
    : > not)

    : Interestingly, those early codes regularly had shift-in and shift-out
    : codes to extend their repertoire. A practice which faded out for a
    : while,

    yes, as soon as hardware costs made larger characters possible, they got
    rid of the kludginess.

    : almost got reborn in a big way in ISO-2022, and then -
    : iso-10646/Unicode and associated encodings. I wonder what the future
    : holds in store? ;-)
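    The shift-code idea mentioned above is easy to see in ISO-2022-JP,
    which still uses escape sequences to switch character sets
    mid-stream. A small sketch (in Python, for illustration only) using
    the standard `iso2022_jp` codec:

```python
# ISO-2022-JP switches character sets with ESC sequences, much like
# the shift-in/shift-out codes of early telegraph-era encodings.
encoded = "\u3042".encode("iso2022_jp")  # U+3042 HIRAGANA LETTER A
print(encoded)

# ESC $ B shifts into JIS X 0208; ESC ( B shifts back to ASCII.
assert encoded.startswith(b"\x1b$B")
assert encoded.endswith(b"\x1b(B")
```

    The out-of-sync hazard is visible here too: lose one escape sequence
    and every byte after it is misinterpreted.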

    : > Character size remains frozen due to one of murphy's laws regarding the
    : > success of hardware first build using compromises that were appropriate
    : > twenty years ago.

    : It's easy to poke fun, but it's harder to come up with a viable
    : compromise IMHO.

    I am out of time, to say more.
    Malcolm Dew-Jones, Nov 19, 2003
    #19
  20. Allan Yates

    Ben Morrow Guest

    (Malcolm Dew-Jones) wrote:
    > Alan J. Flavell () wrote:
    > : On Wed, 18 Nov 2003, Malcolm Dew-Jones wrote:
    >
    > : > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
    > : > the simple expedient of using 16 bit characters.
    >
    > : ...which unfortunately turns out to be somewhat of a mistake, seeing
    > : that Unicode went and broke the 16-bit boundary.
    >
    > Which was also a mistake. "Character" now includes all the hieroglyphics
    > of places like China (but why not all the hieroglyphics of, say, ancient
    > Egypt?


    Proposed.

    > why not all the standardized international road symbols?


    I see no reason why these should not also be added.

    > ). When
    > the Arabs invented the modern idea of characters, it became widely
    > recognized as much more powerful, fundamentally better, and fundamentally
    > "different" from the old single-picture-means-a-word method of writing.
    > Now we have jumped backwards 1800 years.


    I think this is a little arrogant, to say the least. Chinese ideograms
    (which are not the same as hieroglyphs) have served the needs of the
    Chinese admirably: two people from opposite ends of the country,
    speaking mutually unintelligible languages, can nevertheless
    communicate perfectly through the existence of a common form of
    writing.

    Apart from that, one of the basic reasons for inventing 'other' character
    encodings was so that one could write his own name without resorting
    to markup. There are an awful lot of people whose names require
    Chinese ideograms to spell...

    Note that I do not disagree with you that many of the choices about
    what is 'in' and what 'out' of Unicode seem more than a little
    arbitrary... :)

    > Things like Chinese writing
    > should not be treated using standardized application-level encodings, just
    > as we now standardize many markup languages for encoding other higher-level
    > data. ($0.02)


    I'm afraid I don't follow what you mean here.

    > : And then there's legacy data to think about.
    >
    > stored on legacy systems, and manipulated using legacy software and
    > hardware.
    >
    > This is Murphy's law. Because the old systems have been successful, new
    > systems can't be made better.


    They can, and are being. Intelligence just needs to be applied at
    every stage. A case in point: utf8 both keeps legacy compatibility
    *and* is more extensible than ucs2.
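    The legacy-compatibility claim is easy to check: pure ASCII data is
    byte-for-byte identical under UTF-8, while UCS-2/UTF-16 rewrites
    every byte. A minimal sketch (Python, for illustration only):

```python
legacy = "plain old ASCII"

# A legacy ASCII file is already valid UTF-8, unchanged.
assert legacy.encode("utf-8") == legacy.encode("ascii")

# UCS-2/UTF-16 doubles the size and interleaves zero bytes,
# so legacy ASCII tools cannot read it as-is.
assert legacy.encode("utf-16-le") != legacy.encode("ascii")
assert len(legacy.encode("utf-16-le")) == 2 * len(legacy)
```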

    > : Interestingly, those early codes regularly had shift-in and shift-out
    > : codes to extend their repertoire. A practice which faded out for a
    > : while,
    >
    > yes, as soon as hardware costs made larger characters possible, they got
    > rid of the kludginess.


    Agreed, shifting is nasty and has serious problems, such as getting
    out of sync. UTF-16 surrogates, though, are pure eeevilll.....
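    For anyone who hasn't met them: once Unicode outgrew 16 bits, UTF-16
    had to spend *two* 16-bit code units (a surrogate pair) on anything
    above U+FFFF. A sketch (Python, for illustration only):

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies above U+FFFF, so UTF-16
# must encode it as a high/low surrogate pair.
units = "\U0001d11e".encode("utf-16-be")
assert len(units) == 4  # two 16-bit units = 4 bytes

high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")

# High surrogates live in D800-DBFF, low surrogates in DC00-DFFF.
assert 0xD800 <= high <= 0xDBFF
assert 0xDC00 <= low <= 0xDFFF
print(hex(high), hex(low))  # 0xd834 0xdd1e
```

    This is the wrinkle that turned "fixed-width 16-bit" systems like
    NT's into variable-width ones after all.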

    Ben

    --
    EAT
    KIDS (...er, whoops...)
    FOR
    99p
    Ben Morrow, Nov 19, 2003
    #20
