UTF-8 vs wchar_t

James Kuyper

> The forward slash character poses a problem, because it can't appear in
> file names.

Technically, that's correct, but it gives a wrong impression. You'll
never get an error indication from a Unix utility due to trying to put a
'/' character in a file name. In almost all contexts, Unix routines
require pathname arguments, not filename arguments (there are probably
exceptions, but I can't think of any right now). You can put a '/'
character in a path name, but it will never be interpreted as an invalid
part of a filename; it will always be interpreted as a valid directory
separator. Any error indications you get will be due to that
interpretation leading to a non-existent directory or, less likely, a
directory for which you don't have the needed permissions.
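
For illustration, a minimal standard-C sketch (the pathname "a/b.txt" is
just a made-up example): trying to create a file whose intended name
contains a '/' fails not because the name is rejected, but because the
'/' is taken as a directory separator.

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The intended "filename" contains a '/', but fopen() takes a
     * pathname, so "a/b.txt" means "file b.txt in directory a".
     * If no directory "a" exists, creation fails with ENOENT
     * ("No such file or directory"), not "invalid filename". */
    FILE *f = fopen("a/b.txt", "w");
    if (f == NULL)
        printf("fopen failed: %s\n", strerror(errno));
    else
        fclose(f);
    return 0;
}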
 
Seebs

> Technically, that's correct, but it gives a wrong impression. You'll
> never get an error indication from a Unix utility due to trying to put a
> '/' character in a file name. In almost all contexts, Unix routines
> require pathname arguments, not filename arguments (there are probably
> exceptions, but I can't think of any right now). You can put a '/'
> character in a path name, but it will never be interpreted as an invalid
> part of a filename; it will always be interpreted as a valid directory
> separator. Any error indications you get will be due to that
> interpretation leading to a non-existent directory or, less likely, a
> directory for which you don't have the needed permissions.

Veering wildly off-topic, but it's an awesome story:

There was once a network filesystem driver which exported Unix filesystems
to Mac clients, which had the curious trait that it was possible for a Mac
user to create a file on the Unix machine which had a slash in its actual
file name. The results were spectacular.

-s
 
Seebs

> Do you have any idea how that was achieved, internally in the network
> driver running on the Unix side of the connection?

None. I can't imagine how you could do this. But then, it was the
late 80s or so, and there were all sorts of crazy things people were
doing in custom kernel drivers.

(I also once managed to misconfigure a machine such that "telnet localhost"
got me a login prompt from a different machine. Yes, it was really trying
to connect to 127.0.0.1, and getting a different machine.)

-s
 
James Kuyper

> Veering wildly off-topic, but it's an awesome story:
>
> There was once a network filesystem driver which exported Unix filesystems
> to Mac clients, which had the curious trait that it was possible for a Mac
> user to create a file on the Unix machine which had a slash in its actual
> file name. The results were spectacular.

Do you have any idea how that was achieved, internally in the network
driver running on the Unix side of the connection?
 
glen herrmannsfeldt

> None. I can't imagine how you could do this. But then, it was the
> late 80s or so, and there were all sorts of crazy things people were
> doing in custom kernel drivers.

NFS servers seem to run in a strange place in the file system.
I suppose it is possible that NFS would create a file with a '/'
inside it, which would then be impossible to access locally.

Around that time frame, I accidentally put in a hard link to
a directory through NFS, which promptly crashed the server system.
When it rebooted, the client then retried the failed link, again
crashing the server.

Note also that NFS was designed to be as OS-independent as possible,
so path separators are not passed through. The path has to be parsed
on the client (which might not be Unix).
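
A toy sketch of what "parsed on the client" means; the types and
functions below are made-up stand-ins, not the real NFS client
interface, which lives in the kernel and works with opaque file
handles over RPC:

#include <stdio.h>
#include <string.h>

typedef int fh_t;   /* stand-in for an opaque NFS file handle */

static fh_t lookup_component(fh_t dir, const char *name)
{
    /* A real client would send an NFS LOOKUP RPC here.  Note that
     * `name` is a single component: a '/' never crosses the wire
     * as a separator, so a broken client could send one as data. */
    printf("LOOKUP in fh=%d: \"%s\"\n", dir, name);
    return dir + 1;    /* pretend the server returned a new handle */
}

int main(void)
{
    char path[] = "usr/local/bin";
    fh_t fh = 0;       /* pretend root handle */

    /* The client splits the path; only components cross the wire. */
    for (char *p = strtok(path, "/"); p != NULL; p = strtok(NULL, "/"))
        fh = lookup_component(fh, p);
    return 0;
}
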
> (I also once managed to misconfigure a machine such that
> "telnet localhost" got me a login prompt from a different machine.
> Yes, it was really trying to connect to 127.0.0.1, and getting
> a different machine.)

Also around that time I had a machine with one physical network
port running an OS genned for three. To make it happy, I put
127.2 and 127.3 on the other two ports. (With appropriate subnet mask.)

-- glen
 
Stephen Sprunk

> The forward slash character poses a problem, because it can't
> appear in file names.

A '/' will never appear in a UTF-8 string except as a real U+002F,
unlike in several other character encodings. One of the design goals of
UTF-8 was strict compatibility with ASCII for all code units from
0x00-0x7F, which is sufficient (actually, overkill) for dealing with the
special meanings assigned to various ASCII characters. As a result, you
can safely pass UTF-8 strings to/through the Unix API.
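
That safety is easy to demonstrate: every byte of a multi-byte UTF-8
sequence has its high bit set, so a byte-wise search for '/' (0x2F)
can only ever match a real U+002F. (The string below is a made-up
example.)

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "naïve/café" in UTF-8: ï is 0xC3 0xAF, é is 0xC3 0xA9.  None
     * of those continuation bytes can collide with '/' (0x2F). */
    const char *path = "na\xC3\xAFve/caf\xC3\xA9";
    const char *slash = strchr(path, '/');
    if (slash != NULL)
        printf("separator at byte offset %td\n", slash - path);
    return 0;
}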

If Windows had better support for CP_UTF8, the same would be true of the
(original) Win32 "ANSI" (i.e. non-UTF-16) API, completely removing the
need for the parallel "Unicode" (i.e. UTF-16) API and all the attendant
problems that having two APIs causes.
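
For illustration, this is roughly the conversion dance the split forces
on Windows code that starts from UTF-8 (a hedged sketch: error handling
is omitted and the fixed MAX_PATH buffer is a simplification):

#include <windows.h>

void open_utf8_path(const char *utf8)
{
    wchar_t wide[MAX_PATH];

    /* Convert the UTF-8 pathname to UTF-16 by hand, because the
     * "W" API only accepts wide strings. */
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, MAX_PATH);

    HANDLE h = CreateFileW(wide, GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
}
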
> UTF-8 was not the first multi-byte character set encoding to aim for
> ASCII compatibility. Big5, Shift-JIS, and the ISO-2022 family are all
> functionally ASCII compatible.

They're also more complicated and can only represent a small subset of
Unicode code points, so they're not suitable candidates for worldwide
use. UTF-8 is.

> And these may actually still be more common in Asia than UTF-8.

According to web survey statistics, UTF-8 has a stunning ~77% of web
sites. ISO-8859-1 comes in at ~12%, and everything else is in the
single digits.

Certain countries complain that UTF-8 is larger than UTF-16 for their
scripts, which fall in the range U+8000 to U+FFFF, but that is only true
for pure text. Add punctuation, Arabic numerals, HTML markup, etc. to a
document and UTF-8 usually wins. Even if not, given the other problems
that UTF-16 causes, UTF-8 is the clear winner.
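
The arithmetic behind that claim, with made-up proportions: a code
point in U+8000..U+FFFF takes 3 bytes in UTF-8 versus 2 in UTF-16,
while every ASCII character of markup takes 1 byte versus 2, so mixed
documents even out quickly.

#include <stdio.h>

/* Bytes per code point, BMP only (surrogate pairs ignored). */
static int utf8_len(unsigned cp)  { return cp < 0x80 ? 1 : cp < 0x800 ? 2 : 3; }
static int utf16_len(unsigned cp) { (void)cp; return 2; }

int main(void)
{
    /* 10 CJK characters plus varying amounts of ASCII markup:
     *  0 ASCII: 30 vs 20 (UTF-16 wins)
     * 10 ASCII: 40 vs 40 (tie)
     * 20 ASCII: 50 vs 60 (UTF-8 wins) */
    for (int ascii = 0; ascii <= 20; ascii += 10) {
        int u8  = 10 * utf8_len(0x8A9E)  + ascii * utf8_len('<');
        int u16 = 10 * utf16_len(0x8A9E) + ascii * utf16_len('<');
        printf("10 CJK + %2d ASCII: UTF-8 %2d bytes, UTF-16 %2d bytes\n",
               ascii, u8, u16);
    }
    return 0;
}
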
> OS X has settled on UTF-8 internally, which has already caused
> interoperability issues, e.g. with networked file systems. My money
> says it was a similarly poor decision. I'm sure UTF-8 will have more
> staying power than UCS-2, but most of the caveats stem from relying
> on any non-ASCII encoding. They shouldn't have chosen any poison.

You've got to choose something, and UTF-8 is the most backward
compatible and most universal. If everyone (e.g. Microsoft) would quit
trying to force obsolete, script-specific encodings down other people's
throats and just get on the UTF-8 bandwagon, we'd all be better off.

S
 
Stephen Sprunk

> My frame of reference was e-mail. But I've just been informed that,
> with the exception of Japan, UTF-8 has surpassed the various national
> encodings in Asia.

Japan has an unusually good national encoding, plus they're more
insular; they tend to do their own thing when it comes to technology
standards, e.g. almost completely ignoring GSM.

China, Hong Kong and Taiwan have several national encodings that are
mutually incompatible and computationally expensive; China requires
support for GB18030 by law, but in my experience most users actually end
up using UTF-8, which is no worse even for internal use.

India's several national encodings are a mess, so UTF-8 is a better
solution even for communicating amongst themselves, despite their rather
vocal (and mostly incorrect) complaints about it.

I don't know much about the rest of Asia, except that there are many
national encodings that are now rarely seen in the wild.

> When I was working on I18N stuff a few years ago I rarely saw UTF-8
> used overseas, and the company I worked for processed hundreds of
> millions of e-mail messages every day.

Use of UTF-8 has definitely grown over the last decade, especially on
the Web. Email is a bit different; national encodings tend to be used
more because Windows won't let users select a UTF-8-flavored locale and
popular software (e.g. Outlook) hides how to override that, which
wouldn't occur to most users anyway. One minor (and completely
transparent) code change by Microsoft and email would be 90%+ UTF-8
within a year or two.

On the plus side, recent versions of Outlook are smart enough to
automatically switch to UTF-8 when the selected national encoding fails,
which is a major improvement. Why they don't just do that by default is
a mystery, but it's probably related to Microsoft's persistent
conflation of "Unicode" and UCS-2.

S
 
Xavier Roche

> UTF-8 was not the first multi-byte character set encoding to aim for ASCII
> compatibility. Big5, Shift-JIS, and the ISO-2022 family are all functionally
> ASCII compatible.

(A bit more off-topic)

Not ISO-2022: its insane escape sequences are in the ASCII range
(more precisely, they are introduced by ESC). (I won't mention the
poor design choices in the various ISO-2022 encodings - they are all
equally bad: poorly crafted, incompatible, and plagued by idiotic
decisions.)
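
To make that concrete, here is the single character あ (U+3042)
encoded in ISO-2022-JP; every byte, including the two payload bytes,
falls in the ASCII range, which is exactly what UTF-8 avoids:

#include <stdio.h>

int main(void)
{
    /* ESC $ B switches to JIS X 0208, the character itself is the
     * two bytes 0x24 0x22, and ESC ( B switches back to ASCII. */
    const unsigned char jis[] = { 0x1B, '$', 'B', 0x24, 0x22,
                                  0x1B, '(', 'B', '\0' };

    /* A scanner that assumes ASCII transparency misreads the two
     * payload bytes as '$' and '"'. */
    for (const unsigned char *p = jis; *p != '\0'; p++)
        if (*p >= 0x20 && *p < 0x7F)
            printf("byte 0x%02X looks like ASCII '%c'\n", *p, *p);
    return 0;
}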
 
glen herrmannsfeldt

(snip on file name characters)
> And there were other odd effects - you could sometimes have files in a
> directory that you simply could not see from some types of
> workstations. Or a file could have several different names (in
> different name spaces), and if you were particularly lucky you could
> set up a real mess of inconsistent names (you could, for example, copy
> several files that had both long and DOS names from one directory to
> another, and end up with the DOS names swapped in the destination).

MS-DOS was funny about names. I once had a file with a name of COM3;
I believe the extension doesn't matter. Later the disk was on a
system with a COM3 device, such that you couldn't access a file with
that name.

-- glen
 
