UTF-8 vs wchar_t

James Kuyper

> The forward slash character poses a problem, because it can't appear in
> file names.

Technically, that's correct, but it gives a wrong impression. You'll
never get an error indication from a Unix utility due to trying to put a
'/' character in a file name. In almost all contexts, Unix routines
require pathname arguments, not filename arguments (there are probably
exceptions, but I can't think of any right now). You can put a '/'
character in a path name, but it will never be interpreted as an invalid
part of a filename; it will always be interpreted as a valid directory
separator. Any error indications you get will be due to that
interpretation leading to a non-existent directory or, less likely, a
directory for which you don't have the needed permissions.
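
For illustration, a minimal standard-C sketch (the pathname "a/b.txt" is
just a made-up example): trying to create a file whose intended name
contains a '/' fails not because the name is rejected, but because the
'/' is taken as a directory separator.

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The intended "filename" contains a '/', but fopen() takes a
     * pathname, so "a/b.txt" means "file b.txt in directory a".
     * If no directory "a" exists, creation fails with ENOENT
     * ("No such file or directory"), not "invalid filename". */
    FILE *f = fopen("a/b.txt", "w");
    if (f == NULL)
        printf("fopen failed: %s\n", strerror(errno));
    else
        fclose(f);
    return 0;
}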
 
Seebs

> Technically, that's correct, but it gives a wrong impression. You'll
> never get an error indication from a Unix utility due to trying to put a
> '/' character in a file name. In almost all contexts, Unix routines
> require pathname arguments, not filename arguments (there are probably
> exceptions, but I can't think of any right now). You can put a '/'
> character in a path name, but it will never be interpreted as an invalid
> part of a filename; it will always be interpreted as a valid directory
> separator. Any error indications you get will be due to that
> interpretation leading to a non-existent directory or, less likely, a
> directory for which you don't have the needed permissions.

Veering wildly off-topic, but it's an awesome story:

There was once a network filesystem driver which exported Unix filesystems
to Mac clients, which had the curious trait that it was possible for a Mac
user to create a file on the Unix machine which had a slash in its actual
file name. The results were spectacular.

-s
 
Seebs

> Do you have any idea how that was achieved, internally in the network
> driver running on the Unix side of the connection?

None. I can't imagine how you could do this. But then, it was the
late 80s or so, and there were all sorts of crazy things people were
doing in custom kernel drivers.

(I also once managed to misconfigure a machine such that "telnet localhost"
got me a login prompt from a different machine. Yes, it was really trying
to connect to 127.0.0.1, and getting a different machine.)

-s
 
James Kuyper

> Veering wildly off-topic, but it's an awesome story:
>
> There was once a network filesystem driver which exported Unix filesystems
> to Mac clients, which had the curious trait that it was possible for a Mac
> user to create a file on the Unix machine which had a slash in its actual
> file name. The results were spectacular.

Do you have any idea how that was achieved, internally in the network
driver running on the Unix side of the connection?
 
glen herrmannsfeldt

> None. I can't imagine how you could do this. But then, it was the
> late 80s or so, and there were all sorts of crazy things people were
> doing in custom kernel drivers.

NFS servers seem to run in a strange place in the file system.
I suppose it is possible that NFS would create a file with a '/'
inside it, which would then be impossible to access locally.

Around that time frame, I accidentally put in a hard link to
a directory through NFS, which promptly crashed the server system.
When it rebooted, the client then retried the failed link, again
crashing the server.

Note also that NFS was designed to be as OS-independent as possible,
so path separators are not passed through. The path has to be parsed
on the client (which might not be Unix).
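
A toy sketch of what "parsed on the client" means; the types and
functions below are made-up stand-ins, not the real NFS client
interface, which lives in the kernel and works with opaque file
handles over RPC:

#include <stdio.h>
#include <string.h>

typedef int fh_t;   /* stand-in for an opaque NFS file handle */

static fh_t lookup_component(fh_t dir, const char *name)
{
    /* A real client would send an NFS LOOKUP RPC here.  Note that
     * `name` is a single component: a '/' never crosses the wire
     * as a separator, so a broken client could send one as data. */
    printf("LOOKUP in fh=%d: \"%s\"\n", dir, name);
    return dir + 1;    /* pretend the server returned a new handle */
}

int main(void)
{
    char path[] = "usr/local/bin";
    fh_t fh = 0;       /* pretend root handle */

    /* The client splits the path; only components cross the wire. */
    for (char *p = strtok(path, "/"); p != NULL; p = strtok(NULL, "/"))
        fh = lookup_component(fh, p);
    return 0;
}
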
> (I also once managed to misconfigure a machine such that
> "telnet localhost" got me a login prompt from a different machine.
> Yes, it was really trying to connect to 127.0.0.1, and getting
> a different machine.)

Also around that time I had a machine with one physical network
port running an OS genned for three. To make it happy, I put
127.2 and 127.3 on the other two ports. (With appropriate subnet mask.)

-- glen
 
Stephen Sprunk

> The forward slash character poses a problem, because it can't
> appear in file names.

A '/' will never appear in a UTF-8 string except as a real U+002F,
unlike in several other character encodings. One of the design goals of
UTF-8 was strict compatibility with ASCII for all code units from
0x00-0x7F, which is sufficient (actually, overkill) for dealing with the
special meanings assigned to various ASCII characters. As a result, you
can safely pass UTF-8 strings to/through the Unix API.
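
That safety is easy to demonstrate: every byte of a multi-byte UTF-8
sequence has its high bit set, so a byte-wise search for '/' (0x2F)
can only ever match a real U+002F. (The string below is a made-up
example.)

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "naïve/café" in UTF-8: ï is 0xC3 0xAF, é is 0xC3 0xA9.  None
     * of those continuation bytes can collide with '/' (0x2F). */
    const char *path = "na\xC3\xAFve/caf\xC3\xA9";
    const char *slash = strchr(path, '/');
    if (slash != NULL)
        printf("separator at byte offset %td\n", slash - path);
    return 0;
}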

If Windows had better support for CP_UTF8, the same would be true of the
(original) Win32 "ANSI" (i.e. non-UTF-16) API, completely removing the
need for the parallel "Unicode" (i.e. UTF-16) API and all the attendant
problems that having two APIs causes.
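
For illustration, this is roughly the conversion dance the split forces
on Windows code that starts from UTF-8 (a hedged sketch: error handling
is omitted and the fixed MAX_PATH buffer is a simplification):

#include <windows.h>

void open_utf8_path(const char *utf8)
{
    wchar_t wide[MAX_PATH];

    /* Convert the UTF-8 pathname to UTF-16 by hand, because the
     * "W" API only accepts wide strings. */
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, MAX_PATH);

    HANDLE h = CreateFileW(wide, GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
}
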
> UTF-8 was not the first multi-byte character set encoding to aim for
> ASCII compatibility. Big5, Shift-JIS, and the ISO-2022 family are all
> functionally ASCII compatible.

They're also more complicated and can only represent a small subset of
Unicode code points, so they're not suitable candidates for worldwide
use. UTF-8 is.

> And these may actually still be more common in Asia than UTF-8.

According to web survey statistics, UTF-8 has a stunning ~77% of web
sites. ISO-8859-1 comes in at ~12%, and everything else is in the
single digits.

Certain countries complain that UTF-8 is larger than UTF-16 for their
scripts, which fall in the range U+8000 to U+FFFF, but that is only true
for pure text. Add punctuation, Arabic numerals, HTML markup, etc. to a
document and UTF-8 usually wins. Even if not, given the other problems
that UTF-16 causes, UTF-8 is the clear winner.
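
The arithmetic behind that claim, with made-up proportions: a code
point in U+8000..U+FFFF takes 3 bytes in UTF-8 versus 2 in UTF-16,
while every ASCII character of markup takes 1 byte versus 2, so mixed
documents even out quickly.

#include <stdio.h>

/* Bytes per code point, BMP only (surrogate pairs ignored). */
static int utf8_len(unsigned cp)  { return cp < 0x80 ? 1 : cp < 0x800 ? 2 : 3; }
static int utf16_len(unsigned cp) { (void)cp; return 2; }

int main(void)
{
    /* 10 CJK characters plus varying amounts of ASCII markup:
     *  0 ASCII: 30 vs 20 (UTF-16 wins)
     * 10 ASCII: 40 vs 40 (tie)
     * 20 ASCII: 50 vs 60 (UTF-8 wins) */
    for (int ascii = 0; ascii <= 20; ascii += 10) {
        int u8  = 10 * utf8_len(0x8A9E)  + ascii * utf8_len('<');
        int u16 = 10 * utf16_len(0x8A9E) + ascii * utf16_len('<');
        printf("10 CJK + %2d ASCII: UTF-8 %2d bytes, UTF-16 %2d bytes\n",
               ascii, u8, u16);
    }
    return 0;
}
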
> OS X has settled on UTF-8 internally, which has already caused
> interoperability issues, e.g. with networked file systems. My money
> says it was a similarly poor decision. I'm sure UTF-8 will have more
> staying power than UCS-2, but most of the caveats stem from relying
> on any non-ASCII encoding. They shouldn't have chosen any poison.

You've got to choose something, and UTF-8 is the most backward
compatible and most universal. If everyone (e.g. Microsoft) would quit
trying to force obsolete, script-specific encodings down other people's
throats and just get on the UTF-8 bandwagon, we'd all be better off.

S
 
Stephen Sprunk

> My frame of reference was e-mail. But I've just been informed that,
> with the exception of Japan, UTF-8 has surpassed the various national
> encodings in Asia.

Japan has an unusually good national encoding, plus they're more
insular; they tend to do their own thing when it comes to technology
standards, e.g. almost completely ignoring GSM.

China, Hong Kong and Taiwan have several national encodings that are
mutually incompatible and computationally expensive; China requires
support for GB18030 by law, but in my experience most users actually end
up using UTF-8, which is no worse even for internal use.

India's several national encodings are a mess, so UTF-8 is a better
solution even for communicating amongst themselves, despite their rather
vocal (and mostly incorrect) complaints about it.

I don't know much about the rest of Asia, except that there are many
national encodings that are now rarely seen in the wild.

> When I was working on I18N stuff a few years ago I rarely saw UTF-8
> used overseas, and the company I worked for processed hundreds of
> millions of e-mail messages every day.

Use of UTF-8 has definitely grown over the last decade, especially on
the Web. Email is a bit different; national encodings tend to be used
more because Windows won't let users select a UTF-8-flavored locale and
popular software (e.g. Outlook) hides how to override that, which
wouldn't occur to most users anyway. One minor (and completely
transparent) code change by Microsoft and email would be 90%+ UTF-8
within a year or two.

On the plus side, recent versions of Outlook are smart enough to
automatically switch to UTF-8 when the selected national encoding fails,
which is a major improvement. Why they don't just do that by default is
a mystery, but it's probably related to Microsoft's persistent
conflation of "Unicode" and UCS-2.

S
 
Xavier Roche

> UTF-8 was not the first multi-byte character set encoding to aim for ASCII
> compatibility. Big5, Shift-JIS, and the ISO-2022 family are all functionally
> ASCII compatible.

(A bit more off-topic)

Not ISO-2022: its insane escape sequences are in the ASCII range
(more precisely, they are introduced by ESC). (I won't mention the
poor design choices in the various ISO-2022 encodings - they are all
equally bad: poorly crafted, incompatible, and plagued by idiotic
decisions.)
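
To make that concrete, here is the single character あ (U+3042)
encoded in ISO-2022-JP; every byte, including the two payload bytes,
falls in the ASCII range, which is exactly what UTF-8 avoids:

#include <stdio.h>

int main(void)
{
    /* ESC $ B switches to JIS X 0208, the character itself is the
     * two bytes 0x24 0x22, and ESC ( B switches back to ASCII. */
    const unsigned char jis[] = { 0x1B, '$', 'B', 0x24, 0x22,
                                  0x1B, '(', 'B', '\0' };

    /* A scanner that assumes ASCII transparency misreads the two
     * payload bytes as '$' and '"'. */
    for (const unsigned char *p = jis; *p != '\0'; p++)
        if (*p >= 0x20 && *p < 0x7F)
            printf("byte 0x%02X looks like ASCII '%c'\n", *p, *p);
    return 0;
}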
 
glen herrmannsfeldt

(snip on file name characters)
> And there were other odd effects - you could sometimes have files in a
> directory that you simply could not see from some types of
> workstations. Or a file could have several different names (in
> different name spaces), and if you were particularly lucky you could
> set up a real mess of inconsistent names (you could, for example, copy
> several files that had both long and DOS names from one directory to
> another, and end up with the DOS names swapped in the destination).

MS-DOS was funny about names. I once had a file with a name of COM3;
I believe the extension doesn't matter. Later the disk was on a
system with a COM3 device, such that you couldn't access a file with
that name.

-- glen
 
