Creating Unicode filenames with Perl 5.8


Malcolm Dew-Jones

Malcolm Dew-Jones ([email protected]) wrote:

: Now we have jumped backwards 1800 years. Things like Chinese writing
: should not be treated using standardized application level encodings, just
: as we now standardize many markup languages for encoding other higher level
: data. ($0.02)

when he meant to say

: should be treated using standardized application level encodings, just
: as we now standardize many markup languages for encoding other higher level
: data. ($0.02)
 

Malcolm Dew-Jones

Ben Liddicott ([email protected]) wrote:
: Some history required...


: : > (e-mail address removed) (Malcolm Dew-Jones) wrote:
: > > Ben Morrow ([email protected]) wrote:
: > > : OK, your problem here is that Win2k is being stupid about Unicode: any
: > > : sensible OS that understood UTF8 would be fine :).
: > >
: > > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
: > > the simple expedient of using 16 bit characters. It is hardware that is
: > > stupid, by continuing to use ancient tiny 8 bit elementary units.
: >
: > OK, I invited that with gratuitous OS-bashing :)... nevertheless:
: >
: > 1. Unicode is *NOT* a 16-bit character set. UTF16 is an evil bodge to
: > work around those who started assuming it was before the standards
: > were properly in place.

: Unicode 1.0 WAS a 16-bit character set. So there. UTF16 is a representation
: of Unicode 3.0 which is selected to be backwards compatible with Unicode
: 1.0.

: The reason why NT doesn't use UTF-8 is that --- wait for it --- it wasn't
: invented back then. UTF-8 was specified in 1993, and adopted as an ISO
: standard in 1994. Windows NT shipped in 1993, after 5 years in development.
: Guess what: Decision on character set had to be made in the eighties.

: Yes, they got it wrong. They should have selected UTF-8. They should have
: INVENTED UTF-8.

Well no. The original philosophy of Unicode was simply to increase the
character size. If "alternative" hardware such as Alpha, PowerPC, etc.
(which were already incompatible with old hardware in many ways) had been
designed to use 16 bits as the smallest addressable unit of data, then the
NT decision would have been an extremely good one.
 

Malcolm Dew-Jones

Ben Morrow ([email protected]) wrote:
: (e-mail address removed) (Malcolm Dew-Jones) wrote:
: > Ben Morrow ([email protected]) wrote:
: > : OK, your problem here is that Win2k is being stupid about Unicode: any
: > : sensible OS that understood UTF8 would be fine :).
: >
: > Hum, NT has been handling unicode for at least ten years (3.5, 1993) by
: > the simple expedient of using 16 bit characters. It is hardware that is
: > stupid, by continuing to use ancient tiny 8 bit elementary units.

: OK, I invited that with gratuitous OS-bashing :)... nevertheless:

: 2. Given that the world does, in fact, use 8-bit bytes, any 16-bit
: encoding has this small problem of endianness... again, solved
: (IMHO) less-than-elegantly by the Unicode Consortium.

Endianness is a hardware problem. 16-bit character hardware would not
have this problem. For other hardware, this is identical to the problem
encountered when transmitting numeric values over media such as the
Internet, and it is solved just as easily (and has been) by specifying
the order in which the high and low parts are transmitted.
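
(Illustrative sketch only, not from the original exchange: Perl's pack and
unpack templates are one way to pin down that agreed order explicitly. The
code point below is an arbitrary example.)

use strict;
use warnings;

# U+00E9, an arbitrary 16-bit code unit used only for illustration
my $code = 0x00E9;

# serialise the same value with an agreed byte order
my $big    = pack 'n', $code;   # network (big-endian) order: bytes 0x00 0xE9
my $little = pack 'v', $code;   # little-endian order:        bytes 0xE9 0x00

# the receiver recovers the value simply by using the same agreed template
printf "big-endian:    %04X\n", unpack 'n', $big;
printf "little-endian: %04X\n", unpack 'v', $little;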

: 3. Given that the most widespread character set is likely to be either
: ASCII or Chinese ideograms, and ideograms won't fit into less than
: 16 bits anyway, it seems pretty silly to encode a 7-bit charset
: with 16 bits per character.

Hum, did you notice the contradiction in what you say: "character set"
versus "ideograms"?

You might also say it seems silly to encode a 7-bit charset in 8 bits. I
think it's silly to worry about a few bits of storage when the software
involved in handling larger characters would be greatly simplified by
simply updating the size of characters. Just look at the number of years
and bugs it has taken to introduce Unicode handling to many applications,
and compare that to the time it took the NT team originally to update
Notepad - it consisted basically of recompiling it with characters defined
to be a larger size.


: 4. It also seems pretty silly to break everything in the world that
: relies on a byte of 0 meaning end-of-string, not to mention '/'
: being '/' (or '\', or whatever, as appropriate).

Huh? If you used 16-bit characters, then the (16-bit) character with
numeric value 0 is still the null terminator.
 

Malcolm Dew-Jones

Ben Morrow ([email protected]) wrote:


: > So you can knock them for not having the foresight to know that 65535
: > characters wouldn't be enough.

: I can also knock them for not having changed in the ten years since
: NT3.5 was released. It is not *that* difficult a change to implement,
: as Perl 5.8 has demonstrated; even though it has some nasty bits,
: ditto.

If any alternative hardware had used 16-bit characters, then Perl 4 would
have had built-in support for 65535 characters on those platforms with
virtually no changes, as would every other application handling character
data.
 

Ben Liddicott

What you may be seeing is that STDOUT is writing UTF-8, which is being
treated as ISO-8859-1, and translated to UTF-16 for display. I haven't
managed to establish this, or come up with a workaround.
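
If the cause really is STDOUT's encoding, one thing to try (a sketch only,
assuming the console actually expects UTF-8) is an explicit output layer:

use strict;
use warnings;

# tell Perl what encoding STDOUT should produce, instead of letting
# UTF-8 octets be shown as if they were ISO-8859-1
binmode STDOUT, ':encoding(UTF-8)';

my $name = "caf\x{e9}.txt";     # a name containing U+00E9, for illustration
print "name:  $name\n";
print "chars: ", length($name), "\n";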

My own testing indicates that, under 5.8.0, with -C, you can read Unicode
filenames correctly using glob, and then create new files with related names
which preserve the Unicode characters. I haven't tested readdir, so maybe that
doesn't work. I haven't tested 5.8.1, so probably that doesn't work either.

Therefore I suggest you try "glob qq($dir/*)" instead of readdir, which
should also be reasonably portable.
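
Spelled out as a small script rather than a one-liner (the script and
directory names here are made up for illustration; it assumes the same
5.8.0 with -C setup described above):

# unicode_names.pl (hypothetical name) -- run as: perl -C unicode_names.pl <dir>
use strict;
use warnings;
use File::Copy ();

my $dir = shift @ARGV or die "usage: $0 <directory>\n";

# quote the pattern with qq("...") so spaces in $dir do not split it
# into several glob patterns
for my $f (glob qq("$dir/*.eml")) {
    print "found: $f\n";

    # copy to a related name, which in my tests preserved the Unicode characters
    File::Copy::copy($f, "$f.2.eml")
        or warn "copy failed for $f: $!\n";
}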

Watch line wrapping:

perl -C -MFile::Copy -e "use utf8; foreach my $f (glob qq("$ARGV[0]")){print
length($f), qq($f\n); open(FILEH, $f) or die qq(Can't open $f: $!); print
<FILEH>; File::Copy::copy $f, qq($f.2.eml) }" "T*1.eml"


In other words, given a file:
Test 1 with Attachement ? RFC2231.eml
(the squiggle is ARABIC LETTER DAD)

It successfully creates file:
Test 1 with Attachement ? RFC2231.eml.2.eml
and prints out:
######################
39Test 1 with Attachement ??? RFC2231.eml
From: "Ben Liddicott" <[email protected]>
To: "(e-mail address removed)"
Subject: Test 1
######################

(remainder omitted, as irrelevant).

Now, 39 is the wrong length, as it appears to be counting ARABIC LETTER DAD
as three.
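
One plausible explanation, offered only as a guess: the name is coming back
as raw encoding octets rather than as characters, so length() counts bytes.
The snippet below only illustrates that general byte-versus-character
difference; it does not reproduce the exact 39 above:

use strict;
use warnings;
use Encode qw(decode);

# ARABIC LETTER DAD (U+0636) as UTF-8 octets inside a byte string
my $octets = "Test \xD8\xB6.eml";

print "as octets:     ", length($octets), "\n";   # counts the encoding bytes

my $chars = decode('UTF-8', $octets);             # turn the octets into characters
print "as characters: ", length($chars), "\n";    # counts the DAD as one character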

However, with "use bytes" instead of "use utf8", it fails altogether, with:
39Test 1 with Attachement ??? RFC2231.eml
Can't open Test 1 with Attachement ??? RFC2231.eml: No such file or
directory at -e line 1.

Clearly, something is not quite right here.

Cheers,
Ben Liddicott
 
