How to tell character encoding?

A

Aaron Fude

Hi,

This is not a java question, but I have a java application in mind and
this is the only place where I get my computer questions answered.

Suppose I see text on a webpage in a foreign language, e.g. French. This
page, for example:

http://www.gabay.com/sources/Liste_Fiche.asp?CV=117

How can I determine the encoding used for the foreign text? What's the
easiest way? If it's nonstandard, how do I convert it to something more
standard? (Like a Unicode.)

Many thanks in advance,

Aaron
 
J

Joshua Cranmer

Aaron said:
How can I determine the encoding used for the foreign text? What's the
easiest way? If it's nonstandard, how do I convert it to something more
standard? (Like a Unicode.)

Determining encodings with 100% accuracy is impossible. The easiest way
to figure out the encoding of a page is to find the associated metadata
that lists it. For example, HTTP provides a header that declares the
charset, and so do most MIME messages (your email was sent as
ISO-8859-1, I can confirm). That, of course, assumes that the server is
sending its data correctly, which is not necessarily a safe assumption.

In cases such as HTTP or email, the library you use is probably smart
enough to find the charset metadata and handle that information for you.
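If you do end up reading the header yourself, pulling the charset out of a Content-Type value could look roughly like the sketch below. The class and method names are made up for illustration, and the parsing is deliberately simple (it does not handle every quoting rule the HTTP grammar allows).

```java
import java.util.Optional;

final class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type header value, if present. */
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return Optional.of(value);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

With java.net.URLConnection, the string to pass in would typically come from getContentType().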

Now suppose no one gives you any metadata, such as when you're looking
at a local file. In that case, there is typically a platform-default
encoding that is used unless you say otherwise (if you pay careful
attention in Java, it will by default treat text on English-version
Windows as Cp1252, a close relative of ISO 8859-1, and text on most
Linux systems as UTF-8).
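To see what your own JVM considers the platform default, something like the following should do. The class name is just a placeholder. Note that on JDK 18 and later the default is UTF-8 on every platform unless it is overridden, so the Windows/Linux difference mostly applies to older JDKs.

```java
import java.nio.charset.Charset;

final class DefaultCharsetDemo {
    /** Name of the charset the JVM uses when no charset is given explicitly. */
    static String platformDefault() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + platformDefault());
        // The system property usually agrees with the value above.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Relying on the default is exactly what makes files non-portable, so passing an explicit Charset to readers and writers is usually the safer habit.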

You can always try to do statistical analysis to guess which encoding is
correct. This is mainly useful for deciding between two encodings, such
as ISO 8859-1 and UTF-8. If you have text which is not valid UTF-8, then
obviously it cannot be UTF-8; if you always have multiple high-bit
sequences in a row, it's more likely UTF-8 than ISO 8859-1 (you can
generally tell when UTF-8 is being misinterpreted as the latter, as you
will see stuff like Ã©); if you have no high bits set, it doesn't
matter. Unless you have EBCDIC, but I'm going to discount that possibility.
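The "not valid UTF-8, so it cannot be UTF-8" test is easy to sketch with a strict decoder. This is only the two-way choice described above, with ISO 8859-1 assumed as the fallback; the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8OrLatin1 {
    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guess: UTF-8 if the bytes are valid UTF-8, otherwise ISO 8859-1. */
    static Charset guess(byte[] data) {
        return isValidUtf8(data) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8));
        System.out.println(guess(latin1));
    }
}
```

Pure ASCII input passes the UTF-8 check too, which matches the "it doesn't matter" case: both decodings give the same text.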
 
J

Jukka Lahtinen

Aaron Fude said:
Suppose I see text on a webpage in a foreign language, e.g. French. This
page, for example:
http://www.gabay.com/sources/Liste_Fiche.asp?CV=117
How can I determine the encoding used for the foreign text? What's the

Look at the <head> element, there should be a meta element with
the attribute http-equiv="Content-Type" and content attribute telling the
charset. If there isn't, I'd expect the contents to be iso-8859-1.
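A rough way to do that lookup in Java is a regular expression over the start of the document. This is a heuristic sketch, not an HTML parser, and the names are hypothetical. It decodes the first few kilobytes as ISO-8859-1 purely so the ASCII markup can be scanned before the real encoding is known.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharset {
    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Finds the first charset declared in a meta element of the given markup. */
    static Optional<String> find(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Same, but starting from raw bytes; only the first 4096 bytes are scanned. */
    static Optional<String> findInBytes(byte[] data) {
        int length = Math.min(data.length, 4096);
        return find(new String(data, 0, length, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        String head = "<head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\"></head>";
        System.out.println(find(head));
    }
}
```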
 
A

Arne Vajhøj

Aaron said:
This is not a java question, but I have a java application in mind and
this is the only place where I get my computer questions answered.

Suppose I see text on a webpage in a foreign language, e.g. French. This
page, for example:

http://www.gabay.com/sources/Liste_Fiche.asp?CV=117

How can I determine the encoding used for the foreign text? What's the
easiest way? If it's nonstandard, how do I convert it to something more
standard? (Like a Unicode.)

For web pages use the following logic:

if encoding specified in META tag then
    use that
else if encoding specified in HTTP header then
    use that
else
    use ISO-8859-1
end

Arne
 
A

Arne Vajhøj

Arne said:
For web pages use the following logic:

if encoding specified in META tag then
use that
else if encoding specified in HTTP header then
use that
else
use ISO-8859-1
end

I think I have some Java code to do it if you are
interested.

Arne
 
R

Roedy Green

How can I determine the encoding used for the foreign text? What's the
easiest way? If it's nonstandard, how do I convert it to something more
standard? (Like a Unicode.)

I wrote a utility to assist. It still requires guessing. See
http://mindprod.com/applet/encodingrecogniser.html

See http://mindprod.com/jgloss/encoding.html
for info on how to convert.

There is native2ascii, which, used twice (once forward, once with
-reverse), converts.

You can write a little utility to read/write. See
http://mindprod.com/applet/fileio.html
for the code.

You can use HunkIO.readEntireFile to read the file in one fell swoop
with one encoding and write it with another. See
http://mindprod.com/products1.html#HUNKIO
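For readers without HunkIO, the same read-with-one-encoding, write-with-another idea can be sketched with the standard library alone. This is not the HunkIO code, just an illustration with made-up names, and it reads the whole file into memory, so it only suits files of modest size.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class Transcode {
    /** Decodes the bytes with one charset and re-encodes the text with another. */
    static byte[] convert(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    /** Reads a whole file, converts it, and writes the result to the target path. */
    static void convertFile(Path source, Charset from, Path target, Charset to)
            throws IOException {
        Files.write(target, convert(Files.readAllBytes(source), from, to));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("latin1", ".txt");
        Path out = Files.createTempFile("utf8", ".txt");
        Files.write(in, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        convertFile(in, StandardCharsets.ISO_8859_1, out, StandardCharsets.UTF_8);
        System.out.println(Files.size(in) + " bytes in, " + Files.size(out) + " bytes out");
    }
}
```

Characters that the target charset cannot represent are silently replaced by getBytes, so converting to Unicode encodings is safe while converting away from them may lose data.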
--
Roedy Green Canadian Mind Products
http://mindprod.com

"Simplicity is prerequisite for reliability,"
~ Edsger Wybe Dijkstra (born: 1930-05-11 died: 2002-08-06 at age: 72)
 
A

Arne Vajhøj

rossum said:
else if Byte Order Mark (BOM) present then
use that

You could do that.

But note that the BOM bytes are two valid bytes in ISO-8859-1.

The chance of these two coming as the first two bytes in
a file is extremely small, but it is possible.
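A BOM check along these lines might look like the sketch below (hypothetical class name). It only knows the UTF-8 and UTF-16 marks and ignores UTF-32, whose little-endian mark also begins with FF FE.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class BomSniffer {
    /** Returns the charset implied by a leading byte order mark, if there is one. */
    static Optional<Charset> detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bom = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        System.out.println(detect(bom));
        // The same two bytes are legal ISO-8859-1 text (thorn, y with diaeresis).
        System.out.println(new String(bom, 0, 2, StandardCharsets.ISO_8859_1));
    }
}
```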

Arne
 
L

Lothar Kimmeringer

Arne said:
You could do that.

But note that the BOM bytes are two valid bytes in ISO-8859-1.

The chance of these two coming as the first two bytes in
a file is extremely small, but it is possible.

Especially the chances of having a BOM but no Content-Type with
charset-attribute. OTOH, Microsoft IIS can't cope with folded
request-headers correctly so you can't assume anything in
this world.


Regards, Lothar
--
Lothar Kimmeringer E-Mail: (e-mail address removed)
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!
 
M

markspace

else if Byte Order Mark (BOM) present then
use that


Are you sure about that? I thought the HTTP spec said that if there
were no meta or other content tags, then the default was ISO-8859-1.
The BOM thing might actually make certain types of files accidentally
incorrect, I think.

 
T

Tom Anderson

Ooops.

You are correct.

I guess I only tested META tag with no HTTP header.

I tentatively consider that a bug in the spec - i'd prefer a meta tag to
be able to override the protocol header. The reason being that the server
serving up some static content doesn't always know the charset it's in,
but the person writing that content does.

tom
 
R

Roedy Green

This is not a java question, but I have a java application in mind and
this is the only place where I get my computer questions answered.

In the old days the notion you would possess a file without knowing
what was on it would have been ludicrous. You needed a program or at
least a detailed record layout, and all kinds of other trivia. Without
it the file might as well be blank.

It did not dawn on the ancient ones that every file needed a bundle of
metadata permanently glued to it. Steve Jobs was one of the first to
be enlightened. Even the very early Macs had data and resource forks.

The ancient ones did not share files, except with great ceremonies
involving lawyers. There was only one encoding within any one
institution, so the question of what encoding was used never came up.
Nearly all programs were written from scratch for that institution. I
recall my bafflement on learning about VisiCalc, early word processors
and accounting programs for the Apple][. How could the same program
be sold uncustomised to more than one customer and still be useful?

With global sharing of data, suddenly it became clear that we should
have been attaching meta-information to files.
--
Roedy Green Canadian Mind Products
http://mindprod.com

"People think of security as a noun, something you go buy. In reality, it’s an abstract concept like happiness. Openness is unbelievably helpful to security."
~ James Gosling (born: 1955-05-18 age: 54), inventor of Java.
 
A

Arne Vajhøj

Steven said:
Tom said:
Steven Simpson wrote:
I think you're supposed to check HTTP before <meta>, at least for
HTML: [...]
<http://www.w3.org/TR/html4/charset.html#h-5.2.2>
I tentatively consider that a bug in the spec - i'd prefer a meta tag
to be able to override the protocol header. The reason being that the
server serving up some static content doesn't always know the charset
it's in, but the person writing that content does.

I know what you mean, but I think I get what the spec is trying to do,
i.e. allow the embedded setting to be overridden without having to alter
the document, perhaps following a more general principle that a
container should be able to override its contents.

That may be the intention.

But given that:
* access to server config usually implies access to HTML files
* access to HTML files does not imply access to server config
then I agree with Tom that the opposite of current behavior
would be more useful.
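For reference, the order the HTML 4 spec asks for (HTTP header first, then the meta element, then a fallback) could be sketched as below. The names are made up, and the ISO-8859-1 fallback is the assumption used in this thread rather than something every client does. Swapping the first two checks gives the order argued for here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetPrecedence {
    /** Picks a charset: HTTP header wins, then the meta element, then ISO-8859-1. */
    static Charset choose(String httpCharset, String metaCharset) {
        Charset fromHttp = lookup(httpCharset);
        if (fromHttp != null) {
            return fromHttp;
        }
        Charset fromMeta = lookup(metaCharset);
        if (fromMeta != null) {
            return fromMeta;
        }
        return StandardCharsets.ISO_8859_1;
    }

    /** Returns the named charset, or null if the name is missing or unknown. */
    private static Charset lookup(String name) {
        if (name == null || name.trim().isEmpty()) {
            return null;
        }
        try {
            String trimmed = name.trim();
            return Charset.isSupported(trimmed) ? Charset.forName(trimmed) : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(choose("UTF-8", "ISO-8859-1"));
        System.out.println(choose(null, "UTF-8"));
        System.out.println(choose(null, null));
    }
}
```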

Arne
 
R

Roedy Green

I tentatively consider that a bug in the spec - i'd prefer a meta tag to
be able to override the protocol header. The reason being that the server
serving up some static content doesn't always know the charset it's in,
but the person writing that content does.

Imagine something like JSP that prepares the document in 16-bit, then
it is converted to some encoding that the user likes based on the
request header. In this case it is possible the womb knows more than
the program building the content about what encoding finally goes out
the wire. The womb knows about any compression. The programmer does
not.

It would be best to get it right in both places. This allows the
client to pick it up from either place safely.

It seems to me that embedded encodings are too late. You have to know
at least the approximate encoding before you can parse the internal
encoding. I think it is more documentation intended for the user who
does a view source.

--
Roedy Green Canadian Mind Products
http://mindprod.com

"People think of security as a noun, something you go buy. In reality, it’s an abstract concept like happiness. Openness is unbelievably helpful to security."
~ James Gosling (born: 1955-05-18 age: 54), inventor of Java.
 
A

Arne Vajhøj

Roedy said:
Imagine something like JSP that prepares the document in 16-bit, then
it is converted to some encoding that the user likes based on the
request header. In this case it is possible the womb knows more than
the program building the content about what encoding finally goes out
the wire.

In JSP neither is the right way.

In JSP it should be specified in the page directive.

Roedy said:
The womb knows about any compression. The programmer does
not.

Compression and charset are orthogonal.

Charset is used after decompression.
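A small round trip shows the layering: the bytes are compressed and decompressed without any knowledge of the charset, and the charset only matters once the original bytes are back. This is an illustrative sketch using gzip with invented names; readAllBytes needs Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipThenDecode {
    /** Compresses raw bytes; the charset of the text is irrelevant here. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses first, then applies the charset to the recovered bytes. */
    static String gunzipAndDecode(byte[] compressed, Charset charset) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(in.readAllBytes(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        for (Charset cs : new Charset[] { StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1 }) {
            byte[] wire = gzip(text.getBytes(cs));
            System.out.println(cs + ": " + gunzipAndDecode(wire, cs).equals(text));
        }
    }
}
```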

Arne
 
M

Mike Schilling

Arne said:
Steven said:
Tom said:
Steven Simpson wrote:
I think you're supposed to check HTTP before <meta>, at least for
HTML: [...]
<http://www.w3.org/TR/html4/charset.html#h-5.2.2>
I tentatively consider that a bug in the spec - i'd prefer a meta
tag to be able to override the protocol header. The reason being
that the server serving up some static content doesn't always know
the charset it's in, but the person writing that content does.

I know what you mean, but I think I get what the spec is trying to
do, i.e. allow the embedded setting to be overridden without having
to alter the document, perhaps following a more general principle
that a container should be able to override its contents.

That may be the intention.

But given that:
* access to server config usually implies access to HTML files
* access to HTML files does not imply access to server config
then I agree with Tom that the opposite of current behavior
would be more useful.

If the sender has no idea of the encoding, it shouldn't put one into
the content type; this allows the data to identify itself. If, on the
other hand, the sender is a program that knows damned well that it
just converted chars to UTF-8, it needs a way to say so, overriding
any text in the data which says that it began life as ISO-8859-1.
 
A

Arne Vajhøj

Mike said:
Arne said:
Steven said:
Tom Anderson wrote:
Steven Simpson wrote:
I think you're supposed to check HTTP before <meta>, at least
for
HTML: [...]
<http://www.w3.org/TR/html4/charset.html#h-5.2.2>
I tentatively consider that a bug in the spec - i'd prefer a meta
tag to be able to override the protocol header. The reason being
that the server serving up some static content doesn't always know
the charset it's in, but the person writing that content does.
I know what you mean, but I think I get what the spec is trying to
do, i.e. allow the embedded setting to be overridden without having
to alter the document, perhaps following a more general principle
that a container should be able to override its contents.
That may be the intention.

But given that:
* access to server config usually implies access to HTML files
* access to HTML files does not imply access to server config
then I agree with Tom that the opposite of current behavior
would be more useful.

If the sender has no idea of the encoding, it shouldn't put one into
the content type; this allows the data to identify itself. If, on the
other hand, the sender is a program that knows damned well that it
just converted chars to UTF-8, it needs a way to say so, overriding
any text in the data which says that it began life as ISO-8859-1.

Simple web servers usually serve files as BLOBs. They do not
convert any charset.

And often they set a charset for text/html.

Arne
 
M

Mike Schilling

Arne said:
Mike said:
Arne said:
Steven Simpson wrote:
Tom Anderson wrote:
Steven Simpson wrote:
I think you're supposed to check HTTP before <meta>, at least for
HTML: [...]
<http://www.w3.org/TR/html4/charset.html#h-5.2.2>
I tentatively consider that a bug in the spec - i'd prefer a meta
tag to be able to override the protocol header. The reason being
that the server serving up some static content doesn't always know
the charset it's in, but the person writing that content does.
I know what you mean, but I think I get what the spec is trying to
do, i.e. allow the embedded setting to be overridden without having
to alter the document, perhaps following a more general principle
that a container should be able to override its contents.
That may be the intention.

But given that:
* access to server config usually implies access to HTML files
* access to HTML files does not imply access to server config
then I agree with Tom that the opposite of current behavior
would be more useful.

If the sender has no idea of the encoding, it shouldn't put one into
the content type; this allows the data to identify itself. If, on the
other hand, the sender is a program that knows damned well that it
just converted chars to UTF-8, it needs a way to say so, overriding
any text in the data which says that it began life as ISO-8859-1.

Simple web servers usually serve files as BLOBs. They do not
convert any charset.

Sure, but they're not the only HTTP clients (or servers). Say I've
written a servlet that wants to return some XML, which I've got in
memory as a DOM or a character string. In either case, it's
inconvenient to figure out whether it has an XML header or, if so,
what encoding that specifies. It's much simpler for me to serialize
it (or convert it) to UTF-8 and put that in the content-type.
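As a sketch of that approach using only the JDK's XML classes (the servlet part is left out so the example stays self-contained; the class name, the sample document and the content-type constant are assumptions for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class XmlAsUtf8 {
    /** Header value that matches the bytes produced by serialize(). */
    static final String CONTENT_TYPE = "application/xml; charset=UTF-8";

    /** Serializes the DOM as UTF-8, whatever encoding the document started in. */
    static byte[] serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a tiny in-memory document to serialize. */
    static Document sample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("greeting");
            root.setTextContent("caf\u00e9");
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = serialize(sample());
        System.out.println(CONTENT_TYPE);
        System.out.println(new String(body, StandardCharsets.UTF_8));
    }
}
```

In a servlet, the bytes would go to the response output stream and the constant to setContentType, so the header and the payload cannot disagree.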

On the other hand, I could (in theory) write a web server that accepts
lots of odd charsets for PUTs but saves everything as UTF-8, to be
nice to clients. It should report a content-type of UTF-8, and that
should override the

Arne said:
And often they set a charset for text/html.

That's wrong. But the problem is the web server's claiming knowledge
it doesn't possess, not the spec.
 
