A proposal to handle file encodings


Sven Köhler

On 23.11.2012 19:21, Jan Burse wrote:
For example when you edit a HTML file locally, you don't
have this HTTP header information. Also where does the HTTP
header get the charset information in the first place?

Scenario 1:
- HTTP returns only mimetype=text/html without
the charset option.
- The browser then reads the HTML doc meta tag and
adjusts the charset.

Scenario 2:
- HTTP returns mimetype=text/html; charset=<encoding>
fetched from the HTML file meta tag.
- The browser does not read the HTML doc meta tag, and
follows the charset found in the mimetype.

In both scenarios 1 + 2, the meta tag is used. I don't
know whether there is a scenario 3, and where would
such a scenario take the encoding from?

Scenario 3:

The Apache configuration sets a default charset and sends Content-Type:
text/html; charset=iso-8859-1 even though the meta tag in the file
specifies UTF-8.

Luckily, this feature can be turned off. I'm not sure what the
default config is at the moment. Also, I don't know of any web server
that actually implements scenario 2. Mostly, the charset in the
HTTP header is specified by dynamic web pages (JSP, PHP, ASP), as they
allow setting the headers.
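
As an illustration of that last point, here is a minimal servlet sketch (assuming the javax.servlet API; the class name is made up) of a dynamic page putting the charset into the Content-Type header itself:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet: the charset in the HTTP header comes from the
// application, not from anything the server discovers in the file.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Sends "Content-Type: text/html; charset=UTF-8"
        resp.setContentType("text/html; charset=UTF-8");
        PrintWriter out = resp.getWriter();      // writer uses the declared charset
        out.println("<!DOCTYPE html><html><head>");
        out.println("<meta charset=\"UTF-8\">"); // keep the meta tag consistent
        out.println("</head><body>Hello, UTF-8 world: äöü</body></html>");
    }
}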


Also, why is this discussion in the Java newsgroup?
Just because Java sometimes asks programmers to specify the charset?


Regards,
Sven
 

Sven Köhler

On 24.11.2012 00:11, Peter J. Holzer wrote:
Not true in practice. Almost all encodings used in the real world are
some superset of ASCII, and you only need to recognize ASCII characters
to find the relevant meta tag.

With the exception of UTF-16LE/BE, for example.
Or is a BOM mandatory for UTF-16? The downside of BOMs is that they
break features like includes. Many include mechanisms just copy the
byte stream, so BOMs end up in the middle of the page.


Regards,
Sven
 

Joshua Cranmer

No. Normally it isn't guessing at all. It just uses the configured
charset.

And how is the configured charset not guessing? If a server is serving
static files from a directory, I'm willing to bet that most
administrators won't bother changing the default setting and instead
will just hope that the default works.

I've had enough charset pains to know that much of it (particularly in
English-speaking regions) is going to be people blindly using default
settings. And I also know that not all tools agree on their default settings.
 

BGB

The problem is primarily raw text files with no indication of the
encoding.

The HTML encoding declaration is incompetent: you can't read it without
already knowing the encoding, so it is just a confirmation. Thankfully the
encoding comes in the HTTP header -- a case where meta information is available.

it works because most usable encodings have ASCII as a subset, so
whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
can still be parsed.

for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

with some cleverness, it could probably also be extended to support
EBCDIC, basically just try reading as EBCDIC and see if it "makes sense".
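
a rough sketch of that detection order (hypothetical helper, not a complete sniffer): check for a BOM first, otherwise fall back to an ASCII-compatible assumption and let the later header/meta scan refine it.

import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;

// caller should wrap the raw stream as: new PushbackInputStream(rawStream, 4)
public final class BomSniffer {
    public static Charset sniff(PushbackInputStream in) throws IOException {
        byte[] b = new byte[3];
        int n = in.read(b, 0, 3);   // may return fewer bytes; good enough for a sketch
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Charset.forName("UTF-8");      // UTF-8 BOM, already consumed
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            if (n == 3) in.unread(b[2]);          // push back the byte after the BOM
            return Charset.forName("UTF-16BE");
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            if (n == 3) in.unread(b[2]);
            return Charset.forName("UTF-16LE");
        }
        if (n > 0) in.unread(b, 0, n);            // no BOM: push everything back
        return Charset.forName("US-ASCII");       // placeholder for "ASCII-compatible"
    }
}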

I feel angry about this. What asshole dreamed up the idea of
exchanging files in various encodings without any labelling of the
encoding? That there is no universal way of identifying the format of
a file is astounding. Parents who thought this way would send their
kids out into the world not knowing their names, addresses, or
genders.

It sounds like something one of those people who live on beer and
pizza, with a roomful of old pizza boxes lying around would have come
up with. I wish Martha Stewart had gone into programming.

this is overdramatizing the issue.


at first I thought it was about binary formats, which can often be
identified if needed by checking for magic values (sometimes augmented
with things like header checksums, ... which can reduce the likelihood of
false positives).
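
for example, a magic-value check only takes a few lines; the signatures below are just illustrative, not an exhaustive list.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// sketch of identifying a binary format by its magic bytes
public final class MagicSniff {
    private static final byte[] PNG  = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] RIFF = {'R', 'I', 'F', 'F'};   // AVI, WAV containers

    public static String guess(Path file) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            int n = in.read(head);
            if (n < 3) return "unknown";
        }
        if (startsWith(head, PNG))  return "png";
        if (startsWith(head, JPEG)) return "jpeg";
        if (startsWith(head, RIFF)) return "riff container";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}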

OTOH, one can get into the whole thing of container formats, where a
glob of opaque binary data is often wrapped up in such a format with
some identification of what it is. typical examples of such container
formats are video formats (AVI, MKV, MP4, OGG/OGM, ...),
which may contain frames using any number of codecs, and may sometimes
add additional capabilities, such as the ability to multiplex or
interleave data chunks, ...

for some of my own stuff, I am using informal container formats loosely
based on the JPEG file format (itself mostly based on a system of
"markers"). it works...


or such...
 

Joshua Cranmer

it works because most usable encodings have ASCII as a subset, so
whether it is UTF-8 or 8859-1 or similar doesn't matter, as the header
can still be parsed.

Well, there's also the minor issue that some encodings use the same name
for slightly (or sometimes greatly) different variants--I think Big5 is
an offender here in having a few different variants in mapping
multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
both laughably useless, since they pretend that the 8th bit is never set.
for UTF-16, there is typically the BOM, so if a BOM is seen, assume UTF-16.

In the HTML 5 specification (which is far closer to reality as far as
HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
charset information, including what the HTTP header claims the charset is.
with some cleverness, it could probably also be extended to support
EBCDIC, basically just try reading as EBCDIC and see if it "makes sense".

I think EBCDIC is dead as far as web-compatibility is concerned, but the
HTML 5 spec also specifies that the scanning for the <meta happens by
looking for the ASCII octets in particular, so any non-ASCII-compatible
charset (in particular, EBCDIC and UTF-7) is probably in practice
unusable on the web.

And, seriously, if you're designing a new format that contains textual
data, require UTF-8.
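
One way to enforce that requirement in Java is to decode strictly and reject malformed input instead of silently substituting replacement characters; a minimal sketch:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// A format that *requires* UTF-8 can validate it on input: malformed byte
// sequences raise an exception rather than turning into U+FFFD.
public final class StrictUtf8 {
    public static String decodeOrFail(byte[] data) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }
}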

[1] HTML 4.01 is a 13-year-old specification which was never fully
implemented by browsers and is laughably irrelevant to how modern
browsers actually look at input. The HTML 5 specification, though still
a draft, is much more grounded in reality, at least as far as how
browsers are actually going to parse the mangled crap people claim is
HTML; it was developed, in part, by reverse engineering what browsers
actually DID rather than relying on what an ancient spec said they should do.
 

BGB

Well, there's also the minor issue that some encodings use the same name
for slightly (or sometimes greatly) different variants--I think Big5 is
an offender here in having a few different variants in mapping
multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
both laughably useless, since they pretend that the 8th bit is never set.

well, you only need to read far enough to read the header, then you can
re-read in the needed encoding, if needed.

example:
assume ASCII, try to read header;
see that encoding says UTF-8 or 8859-1 or KOI-8R or whatever else;
reset, read again, "for real this time".
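
in Java, that two-pass idea might look roughly like the sketch below (hypothetical helper; the regex only covers the simple <meta charset=...> form, the real HTML5 prescan is more involved):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class MetaSniff {
    private static final Pattern META =
            Pattern.compile("<meta\\s+charset=[\"']?([A-Za-z0-9._-]+)", Pattern.CASE_INSENSITIVE);

    public static String readHtml(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        // first pass: decode a prefix as ISO-8859-1 (ASCII-compatible, never fails)
        int prefixLen = Math.min(raw.length, 1024);
        String prefix = new String(raw, 0, prefixLen, StandardCharsets.ISO_8859_1);
        Charset cs = StandardCharsets.UTF_8;      // fallback if nothing is declared
        Matcher m = META.matcher(prefix);
        if (m.find()) {
            cs = Charset.forName(m.group(1));
        }
        // second pass: "for real this time"
        return new String(raw, cs);
    }
}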

for UTF-16, there is typically the BOM, so if a BOM is seen, assume
UTF-16.

In the HTML 5 specification (which is far closer to reality as far as
HTML parsing is concerned than HTML 4 is [1]), the BOM trumps all other
charset information, including what the HTTP header claims the charset is.

well, yes, partly. if you ignore the BOM and assume ASCII or 8859-1 or
similar, then the document can't be parsed.

I think EBCDIC is dead as far as web-compatibility is concerned, but the
HTML 5 spec also specifies that the scanning for the <meta happens by
looking for the ASCII octets in particular, so any non-ASCII-compatible
charset (in particular, EBCDIC and UTF-7) is probably in practice
unusable on the web.

pretty much, but not theoretically impossible at least.

And, seriously, if you're designing a new format that contains textual
data, require UTF-8.

this is pretty much what I do.
though things aren't always really clear-cut as to whether it is
plain ASCII or UTF-8, this can be glossed over:
if it is textual, it is meant to be UTF-8, and falling short of this is
an implementation issue.

I sometimes support UTF-16, but usually in these areas it is a shim to
detect the BOM and convert the data to UTF-8, and other times the UTF-8
is converted back to UTF-16 as-needed.

[1] HTML 4.01 is a 13-year-old specification which was never fully
implemented by browsers and is laughably irrelevant to how modern
browsers actually look at input. The HTML 5 specification, though still
a draft, is much more grounded in reality, at least as far as how
browsers are actually going to parse the mangled crap people claim is
HTML; it was developed, in part, by reverse engineering what browsers
actually DID rather than relying on what an ancient spec said they should do.

makes sense.
 

Jan Burse

Joshua said:
Well, there's also the minor issue that some encodings use the same name
for slightly (or sometimes greatly) different variants--I think Big5 is
an offender here in having a few different variants in mapping
multioctet chars to Unicode code points, and "ASCII" and "EBCDIC" are
both laughably useless, since they pretend that the 8th bit is never set.

According to Wikipedia:

"Generally, this encoding form is rarely used, even on EBCDIC based
mainframes for which it was designed. IBM EBCDIC based mainframe
operating systems, like z/OS, usually use UTF-16 for complete Unicode
support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit
support UTF-16 on IBM mainframes."

http://en.wikipedia.org/wiki/UTF-EBCDIC
 

Peter J. Holzer

Yes, but only pretty basic ones.

They are arbitrary key/value pairs. You can put any information there;
there is no restriction to "basic" information (whatever that might be).
They are limited to a single block (typically 4 kB), though, so MIME
type, character set, keywords, etc. are OK, but a thumbnail image might
be problematic.
Here we're talking about hypothetically storing stuff like character
encoding

This one is even somewhat standardized: user.charset is documented on
http://www.freedesktop.org/wiki/CommonExtendedAttributes which probably
means that some GUI programs are actually using it (besides the Apache
module where it originated).

To return to the topic of this group: Is there a Java library for
setting and retrieving xattrs?
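
Since Java 7 the answer seems to be yes: java.nio.file exposes user xattrs through UserDefinedFileAttributeView, provided the file system and mount options allow them. A minimal sketch of setting and reading user.charset:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.UserDefinedFileAttributeView;

// attribute names are given without the "user." prefix; the view adds it
public final class XattrDemo {
    public static void main(String[] args) throws IOException {
        Path p = Paths.get(args[0]);
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(p, UserDefinedFileAttributeView.class);
        if (view == null) {
            System.err.println("user xattrs not supported on this file system");
            return;
        }

        // set user.charset
        view.write("charset", StandardCharsets.US_ASCII.encode("UTF-8"));

        // read it back
        ByteBuffer buf = ByteBuffer.allocate(view.size("charset"));
        view.read("charset", buf);
        buf.flip();
        System.out.println("user.charset = " + StandardCharsets.US_ASCII.decode(buf));
    }
}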

Of course, but if the metadata is external to the file as it is in the
'other fork' in an Apple filing system, you still have to make sure that
cp, mv and friends have all been rewritten to handle that.

Why "but"? That's exactly what I wrote. The kernel doesn't know what a
process is intending to do with a file, therefore programs like cp,
tar, etc. must be rewritten to handle xattrs explicitly. (And many of
them have been rewritten, of course. Xattrs aren't new.)

You may well find that it's easier to pull metadata management into the
kernel because then you've only got one piece of code to maintain
rather than tweaks in umpteen utility programs and libraries.

The problem is that this just doesn't fit into the Unix system call
scheme. There is no "copy" system call. The kernel just sees that a
process opens one file for reading and another file for writing. It
cannot assume that this process wants to copy the metadata from the
first to the second file. Of course Linux could introduce such a system
call, but then those umpteen utility programs and libraries would still
have to be modified to use that new system call.

hp
 

Peter J. Holzer

Of course.

I can see two ways of handling it:

(1) introduce a pair of system calls to retrieve and store the metadata
associated with a file,

There are of course already system calls to do that (how else would you
get at the data?). There are four of them (list, get, set, remove),
however, not two, so ...
and, yes, programs would need modification, but the amount would be
trivial because you'd be looking at one extra line of code per file
involved in the metadata transfer.

.... it's 3 extra lines, not 1. Not including error handling, of course.
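
Sketched with Java's UserDefinedFileAttributeView (assuming both file systems support user xattrs), those three extra calls for a copy would look roughly like this:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

// the "three extra lines" (list, get, set) a copy utility would need;
// error handling and unsupported-filesystem checks omitted
public final class XattrCopy {
    public static void copyUserXattrs(Path src, Path dst) throws IOException {
        UserDefinedFileAttributeView in =
                Files.getFileAttributeView(src, UserDefinedFileAttributeView.class);
        UserDefinedFileAttributeView out =
                Files.getFileAttributeView(dst, UserDefinedFileAttributeView.class);
        for (String name : in.list()) {                      // list
            ByteBuffer buf = ByteBuffer.allocate(in.size(name));
            in.read(name, buf);                              // get
            buf.flip();
            out.write(name, buf);                            // set
        }
    }
}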

But I don't think that's the problem. The problem is that a) you have to
do it and b) you have to think about how to do it. Plus there is no
consensus that it should be done at all (user_xattr isn't even enabled
by default on ext*). Microsoft and Apple have it easier: If they say
that some information has to be stored in an alternate stream/resource
fork, programmers will do it. Linux has no central authority which can
force programmers to do anything.

(2) alternatively it may be possible to do the job by adding a mode or
two to the file opening operations.

You mean an optional 4th parameter to open(2)?
If they were defaulted appropriately, many programs could silently
copy the metadata along with the data

I still don't see how that could work. That implies that the kernel
somehow guesses that you want to use the metadata from some file you
opened for reading for the file you are just opening for writing. While
that would be the right behaviour for "cp" or similar programs, I doubt
it would be right for the majority of programs.

It also raises the question of what the kernel should do if the process
doesn't have the necessary privileges to set some xattrs (or if the file
system doesn't support them). Fail? Silently drop them? I don't think
the kernel should make that decision. It's up to the application to
decide what's sensible ("mechanism, not policy" was a guiding principle
in the design of the Unix system call interface).
and/or automagically apply the appropriate transforms, such as charset
transforms, during the transfer.

That again makes no sense at the unix system call interface, which deals
only with byte streams.

It does however make a lot of sense for higher-level interfaces. So
it might be a good idea for java.io.FileReader to check the user.charset
xattr of the file and apply the appropriate encoding.
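
java.io.FileReader itself won't do that, but a small helper along these lines could (hypothetical class; it follows the user.charset convention from the freedesktop.org page mentioned above):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

// honour a user.charset xattr if present, otherwise use the given fallback
public final class XattrReader {
    public static Reader open(Path file, Charset fallback) throws IOException {
        Charset cs = fallback;
        UserDefinedFileAttributeView view =
                Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
        if (view != null && view.list().contains("charset")) {
            ByteBuffer buf = ByteBuffer.allocate(view.size("charset"));
            view.read("charset", buf);
            buf.flip();
            cs = Charset.forName(StandardCharsets.US_ASCII.decode(buf).toString().trim());
        }
        return new BufferedReader(new InputStreamReader(Files.newInputStream(file), cs));
    }
}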

Thinking about it a little more, (2) is definitely the best solution
because it would be rather useful to be able to default the metadata
applied to a new file with a similar mechanism to that used for the
permission bits.

umask(2) is actually pretty broken IMHO.

hp
 

Peter J. Holzer

But, by definition, if you were using metadata to control the character
encoding (which is where this discussion started) or to define the file
as containing keyed, fixed field records, you would not be trying to
write a byte stream.

We were obviously talking past each other. I was only talking about
mechanisms like xattr, alternate streams or resource forks, not about
revamping the whole unix file model.

hp
 
