platform's default charset ?

Chris Uppal

gk said:
so, this means, each encoding recognises the other encoding... and that's
why they are able to revert back.

Not quite. Your argument is sensible but what you don't (yet ;-) know is that
all or nearly all character encodings overlap for a certain range of
characters. Specifically, the printable ASCII characters have the same
numerical values in CP1252, ISO8859-1, and nearly all other character encodings
(including ASCII). What's more the Unicode assigned code-points (numbers to
you and me) for those characters are the same too.

So the String "ABC" contains the chars with numerical values 0x41 0x42 0x43. If
we translate that to bytes using ISO8859-1 then we will get bytes with values
0x41 0x42 0x43. But don't let that mislead you: outside that limited range
(essentially the printable characters in the range 32-126) things become very
different.
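The overlap is easy to verify from Java itself. This small demo (the class name and code are mine, not from the thread) encodes the same text with two charsets and compares the bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OverlapDemo {
    public static void main(String[] args) {
        // Printable ASCII gets the same byte values in almost every encoding.
        byte[] a = "ABC".getBytes(StandardCharsets.ISO_8859_1);
        byte[] b = "ABC".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(a, b)); // true: 0x41 0x42 0x43 both times

        // Outside that range the encodings diverge immediately.
        byte[] latin1 = "\u00E9".getBytes(StandardCharsets.ISO_8859_1); // e-acute: one byte, 0xE9
        byte[] utf8   = "\u00E9".getBytes(StandardCharsets.UTF_8);      // two bytes, 0xC3 0xA9
        System.out.println(latin1.length + " byte vs " + utf8.length + " bytes");
    }
}
```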

In a way that overlap is very handy. It means that if someone sends me an
old-fashioned, 8-bit, text file (not Unicode) written in English then the
chances are that I'll be able to read it without me having to try to find out
what codepage the author used to create it. Which is a good thing because (a)
there's a good chance that the author hasn't got the faintest idea what a
code-page /is/ let alone which one s/he used to create the file, and (b) I
don't want to mess around trying to change code-page. Unfortunately, that only
works for text using the restricted range of characters. As soon as you start
using accented characters, or characters from non-English orthographies, the
whole thing breaks down and life becomes very awkward. Which is what Unicode
is /intended/ to avoid.

But in a way, it's a very Bad Thing too. Because of the overlap, it's very
hard (at least for people handling mostly English text) to see when they've
made a mistake with their programming. Or when they've carelessly, or
sloppily, made assumptions about the code-page in use. It would be nice to
have (perhaps as part of the standard JDK) a debugging Charset which mapped
Unicode data to some sort of recognisable gibberish -- case-inverted or even
"rot13" would do. For all I know, there could be one there already, and I've
missed it...
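To illustrate why such mistakes stay invisible with English-only test data (a sketch of mine, not from the thread): encode with one charset, decode with another, and only the non-ASCII character betrays the error.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetDemo {
    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Decode with the wrong charset: the ASCII part survives untouched,
        // so a plain-English sample would never reveal the bug.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // prints "cafÃ©" -- only the accent shows the mistake
    }
}
```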

-- chris
 
Thomas Hawtin

Chris said:
But in a way, it's a very Bad Thing too. Because of the overlap, it's very
hard (at least for people handling mostly English text) to see when they've
made a mistake with their programming. Or when they've carelessly, or
sloppily, made assumptions about the code-page in use. It would be nice to
have (perhaps as part of the standard JDK) a debugging Charset which mapped
Unicode data to some sort of recognisable gibberish -- case-inverted or even
"rot13" would do. For all I know, there could be one there already, and I've
missed it...

UTF-16LE should more or less fit the bill. Perhaps UTF-16BE would work
better with single characters (not entirely sure what happens with a
single byte), although it is more common.

export LANG=tr_TR.UTF-16LE
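What makes UTF-16LE stand out as "recognisable gibberish" for ASCII input is the interleaved NUL bytes. A quick check (my sketch, not from the thread):

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    public static void main(String[] args) {
        byte[] le = "AB".getBytes(StandardCharsets.UTF_16LE);
        // ASCII text becomes letter, NUL, letter, NUL... -- hard to miss.
        for (byte b : le) {
            System.out.printf("%02x ", b); // prints: 41 00 42 00
        }
        System.out.println();
    }
}
```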

Tom Hawtin
 
Roedy Green

sorry, i meant ... i am NOT talking about cryptography and the
different versions of encoding.

i am talking about these simple charset encodings.

so am I.
 
Roedy Green

Which is a good thing because (a)
there's a good chance that the author hasn't got the faintest idea what a
code-page /is/ let alone which one s/he used to create the file, and (b) I
don't want to mess around trying to change code-page.

And the encoding used is NOT embedded at the head of the document the
way you might imagine it would be handled. The receiver just has to
KNOW what encoding it is.
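This is exactly why Java's file-reading APIs take an explicit Charset: the bytes carry no label, so the reader has to supply the knowledge out of band. A small sketch (class and file names are mine):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".txt");
        Files.write(p, "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1));

        // Nothing in the file records which encoding the writer used;
        // the reader just has to KNOW.
        byte[] raw = Files.readAllBytes(p);
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1)); // café
        System.out.println(new String(raw, StandardCharsets.UTF_8));      // caf? -- 0xE9 is malformed UTF-8 here
        Files.delete(p);
    }
}
```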

This reminds me back in the early 80s I wrote one of the first
electronic medical billing programs for doctors for whom this was a
complete novelty and status symbol. On a demo, one doctor was
horrified, "You mean you have to TYPE; it doesn't just KNOW?"

Another doctor was furious at my incompetence when he discovered that
he would lose keying when he rebooted his machine in the middle of
data entry. I tried to explain that he should not reboot. There was
no need to. He replied that he simply LIKED rebooting and he was not
about to change his nervous habit.
 
Roedy Green

It would be nice to
have (perhaps as part of the standard JDK) a debugging Charset which mapped
Unicode data to some sort of recognisable gibberish -- case-inverted or even
"rot13" would do. For all I know, there could be one there already, and I've
missed it...

what do you do with this?
 
Chris Uppal

Thomas Hawtin wrote:

[me:]
It would be nice to have (perhaps as part of the standard JDK) a
debugging Charset which mapped Unicode data to some sort of
recognisable gibberish -- case-inverted or even "rot13" would do. For
all I know, there could be one there already, and I've missed it...

UTF-16LE should more or less fit the bill. [...]
export LANG=tr_TR,UTF-16LE

That's a thought. Not too sure about those NUL bytes though (haven't tried
it yet).

BTW, for anyone who's interested, I rummaged around the Web a little and found
a rot13 Charset, and the corresponding CharsetProvider, at the website for Ron
Hitchens's "Java NIO" book (which I haven't read). The website is
http://www.javanio.info/
the code (which is /not/ free for commercial use) is in:
filearea/bookexamples/unpacked/com/ronsoft/books/nio/charset
under the above root. See the files:
RonsoftCharsetProvider.java
Rot13Charset.java

The first of those files provides sketchy instructions for installing the new
Charset; note that the instructions contain a typo; the filename
META-INF/services/java.nio.charsets.spi.CharsetProvider
should be
META-INF/services/java.nio.charset.spi.CharsetProvider
(no 's' on the end of charset).

-- chris
 
Roedy Green

BTW, for anyone who's interested, I rummaged around the Web a little and found
a rot13 Charset, and the corresponding CharsetProvider, at the website for Ron
Hitchens's "Java NIO" book (which I haven't read). The website is
http://www.javanio.info/

If you feel up to rolling your own, the instructions for how to do it
are at http://mindprod.com/jgloss/encoding.html#ROLLYOUROWN

It is a bunch of mindless housekeeping BS plus writing decodeLoop
and encodeLoop methods to interconvert byte[] <=> char[].
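Roedy's outline can be made concrete. Below is a hypothetical, minimal debugging Charset of the kind Chris asked for: it case-inverts ASCII letters, one byte per char, so any text that passes through it is instantly recognisable. All names here (the class, the "X-CASE-INVERT" charset name) are my own illustration, not code from the thread or from Ron Hitchens's book.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class CaseInvertCharset extends Charset {

    public CaseInvertCharset() {
        super("X-CASE-INVERT", null); // canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof CaseInvertCharset;
    }

    private static char flip(char c) {
        if (c >= 'a' && c <= 'z') return (char) (c - 'a' + 'A');
        if (c >= 'A' && c <= 'Z') return (char) (c - 'A' + 'a');
        return c;
    }

    // The "decodeLoop" half of the housekeeping: bytes -> chars.
    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1f, 1f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    out.put(flip((char) (in.get() & 0xFF)));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    // And the "encodeLoop" half: chars -> bytes.
    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1f, 1f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get();
                    if (c > 0xFF) { // not representable in one byte
                        in.position(in.position() - 1);
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) flip(c));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public static void main(String[] args) {
        Charset cs = new CaseInvertCharset();
        byte[] bytes = "Hello".getBytes(cs);       // encodes as "hELLO"
        System.out.println(new String(bytes, cs)); // decodes back: Hello
    }
}
```

To make it visible everywhere, register it through a CharsetProvider and the META-INF/services file discussed above.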
 
Chris Uppal

Roedy Green wrote:

[me:]
It would be nice to
have (perhaps as part of the standard JDK) a debugging Charset
which mapped Unicode data to some sort of recognisable gibberish --
case-inverted or even "rot13" would do.[...]

what do you do with this?

The problem for me, and I think for other programmers, is that you
can't /see/ when something is happening using the wrong Charset. Since
I'm only an English speaker, the only sample text I can read uses
English characters throughout, and so if I use a wrong Charset there
won't be any obvious differences (as "gk" found). So I'd like to be
able to either set the default Charset to something that is instantly
recognisable if it gets used when I'm not expecting it, or explicitly
use my debugging charset, so that I can follow the data through and see
that it is used everywhere that I intend.

Just a debugging aid. I'd have little use for it if I were -- say
-- Korean.

It would probably be helpful as a teaching tool too (although I am not
a teacher), since it would emphasise the difference between the
character sequences in String (or similar) and the byte sequences
produced by encoding -- differences that can be lost on those whose
native language is ASCII-compatible.

-- chris
 
Roedy Green

The problem for me, and I think for other programmers, is that you
can't /see/ when something is happening using the wrong Charset. Since
I'm only an English speaker, the only sample text I can read uses
English characters throughout, and so if I use a wrong Charset there
won't be any obvious differences (as "gk" found). So I'd like to be
able to either set the default Charset to something that is instantly
recognisable if it gets used when I'm not expecting it, or explicitly
use my debugging charset, so that I can follow the data through and see
that it is used everywhere that I intend.

A very simple one might convert char s -> byte f, or one that
implemented some ligatures (see
http://mindprod.com/jgloss/ligature.html) to give an early American look
to the page.

It then becomes a fully legit Charset you might use in real life.
It can piggyback on any other charset, adding ligaturisation to it.

See http://mindprod.com/jgloss/encoding.html#ROLLYOUROWN

for how to proceed. Even a newbie could tackle this one.
 
ozgwei

Chris said:
The problem for me, and I think for other programmers, is that you
can't /see/ when something is happening using the wrong Charset. Since
I'm only an English speaker, the only sample text I can read uses
English characters throughout, and so if I use a wrong Charset there
won't be any obvious differences (as "gk" found). So I'd like to be
able to either set the default Charset to something that is instantly
recognisable if it gets used when I'm not expecting it, or explicitly
use my debugging charset, so that I can follow the data through and see
that it is used everywhere that I intend.

Just a debugging aid. I'd have little use for it if I were -- say
-- Korean.

It would probably be helpful as a teaching tool too (although I am not
a teacher), since it would emphasise the difference between the
character sequences in String (or similar) and the byte sequences
produced by encoding -- differences that can be lost on those whose
native language is ASCII-compatible.

Have you tried EBCDIC? The encoding name is Cp1047. But I don't know
whether it is available in JVMs other than IBM's...
 
Chris Uppal

ozgwei wrote:

[me:]
The problem for me, and I think for other programmers, is that you
can't see when something is happening using the wrong Charset.
[...]
Have you tried EBCDIC? The encoding name is Cp1047. But I don't know
whether it is available in JVMs other than IBM's...

Thanks for the suggestion.

java -Dfile.encoding=Cp1047 my.test.Application

produces satisfying gibberish ;-)

(Actually it's probably /too/ gibberishish; Thomas's suggested UTF-16
works a little better.)
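The same effect can be seen without changing file.encoding globally. A sketch of mine (Cp1047 ships with full JDKs as part of the extended charsets, but isn't guaranteed in every runtime, hence the check):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EbcdicGibberish {
    public static void main(String[] args) {
        if (!Charset.isSupported("Cp1047")) {
            System.out.println("Cp1047 not available in this runtime");
            return;
        }
        byte[] ebcdic = "Hello".getBytes(Charset.forName("Cp1047"));
        // Viewed through an ASCII-compatible charset, EBCDIC bytes are
        // gibberish for *every* character, accented or not.
        System.out.println(new String(ebcdic, StandardCharsets.ISO_8859_1));
    }
}
```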

-- chris
 
Piotr Kobzda

Roedy said:
prior to that you had to look at a System property. It might even have
been restricted to signed applets. See
http://mindprod.com/jgloss/encoding.html I should have it all
documented there.

A less restrictive alternative to querying System properties is:

String defaultEncodingName =
    new java.io.OutputStreamWriter(System.out).getEncoding();


Regards,
piotr
 
