A proposal to handle file encodings

Roedy Green

The problem with encodings is that they are not attached to or embedded
in a file in any way. You are just supposed to know how a file is
encoded.

Here is my idea to solve the problem.

We invent a new encoding.

Files in this encoding begin with a 0 byte, then an ASCII string
giving the name of a conventional encoding, then another 0 byte.

When you read a file with this encoding, the header is invisible to
your application. When you write a file, a header naming UTF-8 gets
written automatically.

You write your app to read and write this new encoding, e.g.
"labeled".

You can write a utility to import files into your labeled universe by
detecting or guessing or being told the encoding. It gets a header.
Other than that the file is unmodified.
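
A minimal Java sketch of what a reader for such "labeled" files might
look like (the format is the proposal above; the class and method names
are hypothetical):

import java.io.*;
import java.nio.charset.Charset;

public class LabeledFiles {
    // Open a "labeled" file: consume the header (0 byte, ASCII encoding
    // name, 0 byte) and return a Reader that decodes the rest of the
    // stream with the named conventional encoding.
    public static Reader openLabeled(File f) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(f));
        if (in.read() != 0) {
            in.close();
            throw new IOException("not a labeled file: no leading 0 byte");
        }
        StringBuilder name = new StringBuilder();
        int b;
        while ((b = in.read()) > 0) {
            name.append((char) b);  // header bytes are plain ASCII
        }
        if (b != 0) {  // end of file before the terminating 0 byte
            in.close();
            throw new IOException("unterminated encoding header");
        }
        return new InputStreamReader(in, Charset.forName(name.toString()));
    }
}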
 
Joerg Meier

> The problem with encodings is that they are not attached to or embedded
> in a file in any way. You are just supposed to know how a file is
> encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding, then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header naming UTF-8 gets
> written automatically.
>
> You write your app to read and write this new encoding, e.g.
> "labeled".
>
> You can write a utility to import files into your labeled universe by
> detecting or guessing or being told the encoding. It gets a header.
> Other than that the file is unmodified.

I can't tell whether you are being serious or riffing on that old
"now you have 25 standards" joke.

However, in case you are serious: this ugly and error-prone hack really
belongs with a language capable of doing OS-level/file-system black
magic like that in a somewhat sensible way. Like C.

Kind regards,
Joerg
 
Arne Vajhøj

> The problem with encodings is that they are not attached to or embedded
> in a file in any way. You are just supposed to know how a file is
> encoded.
>
> Here is my idea to solve the problem.
>
> We invent a new encoding.
>
> Files in this encoding begin with a 0 byte, then an ASCII string
> giving the name of a conventional encoding, then another 0 byte.
>
> When you read a file with this encoding, the header is invisible to
> your application. When you write a file, a header naming UTF-8 gets
> written automatically.
>
> You write your app to read and write this new encoding, e.g.
> "labeled".

It is a bad idea to have meta data in the file body. This meta data
should be where the rest of the meta data is.

But even if it were moved to the file info area, I doubt
the idea is good.

It enforces a limitation that a text file can only have
one encoding; that limitation does not exist today.

There are practical problems:
* different systems support different encodings (sometimes the
same encoding has different names) - what should a system
do with an unknown encoding?
* there will be a huge number of legacy files without this meta
data - what should a system do with those?

And even if those problems were solved - would it really create
any benefits?

It would take many years to get such an approach approved and
widely implemented. Most likely >10 years. By that time I would
expect UTF-8 to be almost universally used for new text files,
making this proposal obsolete.

> You can write a utility to import files into your labeled universe by
> detecting or guessing or being told the encoding.

Which just repeats the existing problems.

> It gets a header. Other than that the file is unmodified.

Solved much more easily by using meta data.

Arne
 
markspace

> Solved much more easily by using meta data.


I think Roedy is talking about the physical encoding of the meta data.
I personally agree with him in this regard: meta data should be encoded
into the physical file.

Consider for example a meta data format that we all use: the Jar file.

Each single Jar file is actually composed of many pieces of information.
Class files, resources, libraries, the manifest file, etc. And yet
it's all encoded into a single physical file. You never lose pieces of
the file just because you made a copy of the file. You never have to
worry about the meta data changing on a new system just because it's *new*.

Contrast that with other schemes. Macintosh, I believe, uses a meta
data format where the data is in one file, and the meta data occupies a
second physical file with a name like .file-name.meta (I don't use Macs
so I'm not 100% sure). So if you use a raw copy command ("cp" from the
Unix command line) you *don't* get the meta data, because you forgot to
copy it.

I hope you can all quickly see how obviously broken that is. Since we
all use Jar files I think you can all reflect on the idea that it's a
good solution. Have you ever had a problem with a Jar file retaining
its meta data? Is it ever desirable to have a Jar file's meta data
revert to nulls just because you FTP'ed the file someplace? I've never
desired that "feature".

It seems obvious to me. Encoding the meta data into a single physical
file is by far the better solution.

No, where I think Roedy goes wrong is to invent a *new* file format. My
solution: use what's there already, just use Jar files.

Proposal: Add a property "Data-Archive" like so:

Manifest-Version: 1.0
Data-Archive: /data

Where the value of the Data-Archive is the path to the primary data
stream (within the Zip/Jar file). You can just add an encoding or
mime-type or any other property to the manifest you like to describe
your data stream and you're set.
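
A sketch of how a consumer might open such a file. Data-Archive and
Data-Charset are the hypothetical attribute names of this proposal, not
standard Jar attributes:

import java.io.*;
import java.util.jar.*;

public class DataArchiveReader {
    // Open the primary data stream named by the manifest's Data-Archive
    // attribute, decoded per the (hypothetical) Data-Charset attribute.
    public static Reader open(String path) throws IOException {
        JarFile jar = new JarFile(path);
        Manifest mf = jar.getManifest();
        String entryName = mf.getMainAttributes().getValue("Data-Archive");
        String charset = mf.getMainAttributes().getValue("Data-Charset");
        // Zip entry names carry no leading slash.
        JarEntry entry = jar.getJarEntry(
            entryName.startsWith("/") ? entryName.substring(1) : entryName);
        return new InputStreamReader(jar.getInputStream(entry),
            charset != null ? charset : "UTF-8");
    }
}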

Note that this is already being done. Open Office uses Jar files as its
native file format. They just rename the extension as they wish, and
open the file appropriately for a Jar file. They also store a lot more
meta data than just a couple of properties, so they effectively have
their own format, not this simple one.

It might be useful to try to solve some common cases for data and
meta-data. What I've got here is a single data stream and a single
"type" property. It wouldn't be hard to extend this to several streams
and several properties each. I think that would be the only other
useful general case; after that you should just roll your own solution.

BTW if anyone is copying this up to their website (mindprod), please
credit appropriately: Brenden Towey.
 
Roedy Green

> Each single Jar file is actually composed of many pieces of information.
> Class files, resources, libraries, the manifest file, etc. And yet
> it's all encoded into a single physical file. You never lose pieces of
> the file just because you made a copy of the file. You never have to
> worry about the meta data changing on a new system just because it's *new*.

Yes, yes! The OS people have proved incompetent at keeping metadata
separate from the file. We need formats where the metadata is part
of the file. With text files the most important piece of metadata is
the encoding. We already do it sometimes: jpg, jar, csv (sometimes),
video files.

More generally, the mime type is something you should be able to get
with File.getMime().

Imagine if you could do:

File.getEncoding()
File.getVersion()
File.getCopyrightOwner()
File.getCopyrightDate()

A metadata-compliant file would look just like any other but with a
header of the form:
0 <meta>...</meta> 0

The meta data could be stored as XML. That gives you the ability to add
extra info without having to change the standard.

The header is in 7-bit ASCII.
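
A sketch of how such a header might be queried. Nothing here is a real
java.io.File API; the 0 <meta>...</meta> 0 layout is the one proposed
above:

import java.io.*;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class MetaHeader {
    // Read the 0 <meta>...</meta> 0 header and return the text of one
    // named element, e.g. getMeta(f, "encoding") standing in for the
    // imagined File.getEncoding().
    public static String getMeta(File f, String tag) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
            if (in.read() != 0) throw new IOException("no meta header");
            StringBuilder xml = new StringBuilder();
            int b;
            while ((b = in.read()) > 0) xml.append((char) b);  // 7-bit ASCII
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml.toString())));
            return doc.getElementsByTagName(tag).item(0).getTextContent();
        }
    }
}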


We should be using somewhat more complicated formats for files with
embedded metadata.

As an application programmer you want to be able to have the system
parse it for you. You get to pretend it is not there, but with the
ability to query it.

This reminds me a bit of the innovation of ANSI labelled mag tapes
back in the 60s.

The dBase people got this right long ago. You don't go writing files
without a header describing the format of what is in the file.
 
Jan Burse

Hi,

If your files are HTML, then you can note the encoding in the
header, via a meta tag:

<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
</body>
</html>
http://de.wikipedia.org/wiki/Meta-Element#.C3.84quivalente_zu_HTTP-Kopfdaten

If your files are XML, then you can note the encoding in the
XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>
http://de.wikipedia.org/wiki/XML-Deklaration

If your file is plain text, you can insert a BOM, which allows a couple
of encodings to be detected automatically; the BOM is then skipped
during reading. The BOM is:

\uFEFF
http://de.wikipedia.org/wiki/Byte_Order_Mark
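
A sketch of BOM sniffing in Java, covering the three most common BOMs.
Note that Java's standard decoders do not strip a UTF-8 BOM for you;
the reader still has to skip it:

import java.nio.charset.Charset;

public class BomSniffer {
    // Guess the charset from the first bytes of a file;
    // returns null when no BOM is present (encoding unknown).
    public static Charset fromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
            return Charset.forName("UTF-8");     // EF BB BF
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF)
            return Charset.forName("UTF-16BE");  // FE FF
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE)
            return Charset.forName("UTF-16LE");  // FF FE
        return null;
    }
}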

Would this not cover your requirements?

Bye
 
Roedy Green

> Would this not cover your requirements?

The problem is primarily raw text files with no indication of the
encoding.

The HTML encoding is incompetent. You can't read it without knowing
the encoding. It is just a confirmation. Thankfully the encoding comes
in the HTTP header -- a case where meta information is available.

I feel angry about this. What asshole dreamed up the idea of
exchanging files in various encodings without any labelling of the
encoding? That there is no universal way of identifying the format of
a file is astounding. Parents who thought this way would send their
kids out into the world not knowing their names, addresses, or
genders.

It sounds like something one of those people who live on beer and
pizza, with a roomful of old pizza boxes lying around, would have come
up with. I wish Martha Stewart had gone into programming.
 
Jan Burse

Roedy said:
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.

For example, when you edit an HTML file locally, you don't
have this HTTP header information. Also, where does the HTTP
header get the charset information in the first place?

Scenario 1:
- HTTP returns only mimetype=text/html without
the charset option.
- The browser then reads the HTML doc meta tag, and
adjusts the charset.

Scenario 2:
- HTTP returns mimetype=text/html; charset=<encoding>
fetched from the HTML file meta tag.
- The browser does not read the HTML doc meta tag, and
follows the charset found in the mimetype.

In both scenarios 1 + 2, the meta tag is used. I don't
know whether there is a scenario 3, and where would such
a scenario take the encoding from?

Bye
 
Joshua Cranmer

> The problem is primarily raw text files with no indication of the
> encoding.
>
> The HTML encoding is incompetent. You can't read it without knowing
> the encoding. It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.

Except that sometimes the HTTP header is wrong. I have seen enough
UTF-8/ISO 8859-1 mojibake that I don't tend to place great confidence in
metadata except at the most direct level in the protocol (e.g., though
RFC 3977 dictates that NNTP transport is all done in UTF-8, I have
enough experience to know that this is a fiction not borne out by
reality; but if a message says that it has an encoding of UTF-8 in its
header, I'll trust that the message body is actually UTF-8).

In general, the optimal way to handle encodings in this modern day and
age is the following extremely simple algorithm:
1. Always write out UTF-8.
2. When reading, if it doesn't fail to parse as UTF-8, assume it's
UTF-8. Otherwise, assume it's the "platform default" (which generally
means ISO 8859-1).
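
A minimal Java sketch of that algorithm. ISO 8859-1 is used as the
fallback; since every byte sequence is valid ISO 8859-1, the fallback
cannot fail:

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class EncodingGuess {
    // Decode as UTF-8 if the bytes are valid UTF-8,
    // otherwise fall back to ISO 8859-1.
    public static String decode(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}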
 
Peter J. Holzer

> You can't read it without knowing the encoding.

Not true in practice. Almost all encodings used in the real world are
some superset of ASCII, and you only need to recognize ASCII characters
to find the relevant meta tag.

> It is just a confirmation. Thankfully the encoding comes
> in the HTTP header -- a case where meta information is available.
[...]
> Scenario 2:
> - HTTP returns mimetype=text/html; charset=<encoding>
>   fetched from the HTML file meta tag.

Which web server does this? I think CERN httpd did, back in the 1990s,
but I don't think any of the current crop of servers does, at least not
without some extra plugins. Normally the charset is taken from the
server config.

hp
 
Jan Burse

Peter said:
> Which web server does this? I think CERN httpd did, back in the 1990s,
> but I don't think any of the current crop of servers does, at least not
> without some extra plugins. Normally the charset is taken from the
> server config.

It's the only way to retrieve the charset:
http://tools.ietf.org/html/rfc2045#section-5.1

It's also the only way to set the charset in dynamic pages.
For example, in JSP one has to do the following:

<%@page contentType="text/html; charset=UTF-8" %>

There is a header field Content-Encoding, which
is not what Roedy wants, I guess, since the term
"Encoding" refers to compression there:
http://en.wikipedia.org/wiki/HTTP_compression

I guess Roedy wants the charset.

Bye
 
Jan Burse

Joshua said:
> In general, the optimal way to handle encodings in this modern day and
> age is the following extremely simple algorithm:
> 1. Always write out UTF-8.
> 2. When reading, if it doesn't fail to parse as UTF-8, assume it's
> UTF-8. Otherwise, assume it's the "platform default" (which generally
> means ISO 8859-1).

This advice is only valid if you cannot influence the charset
on the server side, for example by setting an appropriate mimetype.
But otherwise it works perfectly fine.

What is a little bit annoying is that I didn't find a MimeType
decoder for the client side that would easily deliver the
charset parameter. So I had to write my own.

In the class comment of this custom decoder I wrote:

* <p>Needed for pre JRE 1.5 code, since later in JRE 1.6 the
* activation framework has been bundled and one can use
* javax.activation.MimeType</p>

Just wrap your con.getContentType() into this class, and then
call getParameter().
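
For example, on JRE 6+ where the activation framework is bundled,
something like this should do (the URL is just a placeholder):

import java.net.URL;
import java.net.URLConnection;
import javax.activation.MimeType;

public class CharsetFromContentType {
    public static void main(String[] args) throws Exception {
        URLConnection con = new URL("http://example.com/").openConnection();
        // getContentType() returns e.g. "text/html; charset=UTF-8"
        MimeType mt = new MimeType(con.getContentType());
        System.out.println(mt.getParameter("charset"));  // null if absent
    }
}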

Bye
 
Peter J. Holzer

> It's the only way to retrieve the charset:
> http://tools.ietf.org/html/rfc2045#section-5.1

That section defines the meaning of the Content-Type header; it doesn't
say anything about how that header is derived. It certainly doesn't say
anything about a web server (RFC 2045 is about mail, not web) extracting
the content type from an HTML file (the word "html" isn't even
mentioned).

> It's also the only way to set the charset in dynamic pages.
> For example, in JSP one has to do the following:
>
> <%@page contentType="text/html; charset=UTF-8" %>

This is something completely different from
<meta http-equiv="content-type" content="text/html; charset=...">

The former is a JSP directive which gets translated into some Java code
which sets the Content-Type header of the HTTP response (probably by
calling setContentType() of the ServletResponse object).

The latter is just an element of the HTML response. It is typically
interpreted by the browser (but only if no charset was specified in the
HTTP header), not by the server.
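
In servlet terms the difference looks roughly like this (a sketch; the
JSP container generates equivalent code for the directive):

import java.io.IOException;
import javax.servlet.http.*;

public class ContentTypeDemo extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Sets the HTTP Content-Type *header*, which is what the JSP
        // contentType directive compiles down to.
        resp.setContentType("text/html; charset=UTF-8");
        // The meta element, by contrast, is just part of the HTML *body*.
        resp.getWriter().println("<html><head>"
            + "<meta http-equiv=\"content-type\""
            + " content=\"text/html; charset=UTF-8\">"
            + "</head><body></body></html>");
    }
}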

hp
 
Roedy Green

> Not true in practice. Almost all encodings used in the real world are
> some superset of ASCII, and you only need to recognize ASCII characters
> to find the relevant meta tag.

You still have the 8-bit vs. 16-bit question, which you can figure out
with the BOM in most cases. It is still Mickey Mouse. The encoding
should be at the very front and encoded in ASCII or something fixed.
 
Roedy Green

> I guess Roedy wants the charset.

In HTTP the meta information is in the HTTP header. This is all very
well except that the server is just guessing. It is serving a
standard header for all documents with a given extension. The meta
info needs to be in the document itself. Ditto for the MIME type.

If the document is transported compressed, e.g. with SPDY
(http://mindprod.com/jgloss/spdy.html),
and decompressed on the other end, then that compression is not part of
the document meta data. If it is kept around compressed, e.g. zip, then
it is.

When it arrives and is saved on disk, the meta info needs to be
retained, so that an editor knows how to deal with it. The only way
you can do that is if the meta info is embedded in the file.

The half-assed way we do things depends on the fact that encodings are
not all that different. You can get it wrong and still muddle through.
 
Peter J. Holzer

> You still have the 8-bit vs. 16-bit question, which you can figure out
> with the BOM in most cases.

In this case the encoding is already known and the meta element must not
be used:

| The META declaration must only be used when the character encoding is
| organized such that ASCII-valued bytes stand for ASCII characters (at
| least until the META element is parsed).
-- http://www.w3.org/TR/1999/REC-html401-19991224/charset.html

> It is still Mickey Mouse.

That wasn't your claim. Your claim was that it's impossible, while all
browsers of the last 15 years or so have demonstrated that it is
possible in practice - on billions of web sites.

> The encoding should be at the very front and encoded in ASCII or
> something fixed.

It is encoded in ASCII, and it

| should appear as early as possible in the HEAD element.
-- http://www.w3.org/TR/1999/REC-html401-19991224/charset.html

And of course there is always the HTTP header. In fact your whole
proposal sounds like an extremely simplified version of the MIME
header, which was invented 20 years ago and is widely used.

And frankly, you picked the least interesting aspect of MIME: you could
just require that UTF-8 is the only permissible encoding for plain text
files. That's much simpler and more likely to be implemented than
requiring that all text files must start with a header declaring the
encoding. At the same time you are missing out on other aspects of plain
text files (e.g., newline as line end vs. paragraph end, flowed text)
and of course everything except plain text.

hp
 
Peter J. Holzer

> In HTTP the meta information is in the HTTP header. This is all very
> well except that the server is just guessing.

No. Normally it isn't guessing at all. It just uses the configured
charset.

> It is serving a standard header for all documents with a given
> extension.

Right. It is the responsibility of the server operator to make sure that
the extension matches the intended content-type. The server doesn't look
into the file to derive the content-type.

(For the "static files in a file system" case. Of course there are lots
of other cases, most prominently CMSs, where the finished HTML document
is assembled out of pieces stored in a database)

> The meta info needs to be in the document itself. Ditto for the MIME type.

Then you wouldn't need a mime-type. That was invented precisely because
not all file formats are self-identifying.

hp
 
Peter J. Holzer

> IBM got it pretty much right in the OS/400 operating system. The metadata,
> which is held in the filing system catalogue, is transparently and
> permanently associated with the file. It's a general mechanism: the system
> provides standard metadata for source files, executables etc. and the
> developer creates the metadata for, e.g., fixed-field data files with
> keyed access. The only demerit is that it uses a rather ugly two-level
> filing system.
>
> The UNIX/Linux equivalent would be to keep the meta-data in the file's
> inode alongside the access permissions

File attributes have existed on ext* filesystems for a very long time.

> and to modify the file copy and move operations

There is no file copy operation on the OS level. The kernel just sees
that a process is creating and writing a new file. It doesn't know
whether this process intends this new file to be an identical copy of
some other file.

rename(2) of course preserves file attributes, because it doesn't change
the file at all (except the ctime entry), only the directories linking
to it.

cp, rsync, tar, etc. have options to copy the attributes along with
the "normal" content. But the problem is that there are a lot of
utilities working on files and they would all have to be modified.
And worse, there isn't any standard for using those attributes, so
nobody uses them, so there is little incentive to modify them.
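
Incidentally, Java can already read and write such attributes through
NIO.2. A sketch, assuming a filesystem with user xattrs enabled (on
ext* the attribute below shows up as user.charset):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.nio.file.attribute.UserDefinedFileAttributeView;

public class XattrCharset {
    public static void main(String[] args) throws Exception {
        Path p = Paths.get("example.txt");
        UserDefinedFileAttributeView view =
            Files.getFileAttributeView(p, UserDefinedFileAttributeView.class);
        // Record the encoding as a user attribute ...
        view.write("charset", StandardCharsets.US_ASCII.encode("UTF-8"));
        // ... and read it back.
        ByteBuffer buf = ByteBuffer.allocate(view.size("charset"));
        view.read("charset", buf);
        buf.flip();
        System.out.println(StandardCharsets.US_ASCII.decode(buf));
    }
}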

hp
 
Sven Köhler

On 23.11.2012 02:25, Arne Vajhøj wrote:
> It is a bad idea to have meta data in the file body. This meta data
> should be where the rest of the meta data is.

Now which OS actually supports this idea?

Are you saying that XML is bad because it contains metadata (i.e. the
encoding/charset) inside the file body?
 
