writing (char) 129 to file

leov

I write a string containing the character (char) 129 or hex 0x81 to a
FileWriter instance.
The default character encoding is Cp1252. Immediately before writing it
to the file, my String contains "\u0081". In the output file the char
0x3F appears instead. So far I have figured out that I probably have to set a
different character encoding for the FileWriter.
- how can I set another character encoding for FileWriter? It supports the
method 'getEncoding()', but no setEncoding()
- which encoding will support the 0x81 (1-byte) character?

thx
leo
 
Thomas Fritsch

leov said:
I write a string containing the character (char) 129 or hex 0x81 to a
FileWriter instance.
The default character encoding is Cp1252. Immediately before writing it
to the file, my String contains "\u0081". In the output file the char
0x3F appears instead. So far I have figured out that I probably have to set a
different character encoding for the FileWriter.
- how can I set another character encoding for FileWriter? It supports the
method 'getEncoding()', but no setEncoding()
And FileWriter doesn't have a constructor taking an encoding, either.

Instead of using
    Writer writer = new FileWriter(...);
you should use
    Writer writer =
        new OutputStreamWriter(new FileOutputStream(...), encoding);
- which encoding will support the 0x81 (1-byte) character?
What do you mean by a 1-byte character 0x81?
(1) The 2-byte char '\u0081'. Its meaning is defined by the
Unicode spec. See www.unicode.org
(2) The single byte 0x81. Its meaning varies from encoding to
encoding. See http://mindprod.com/jgloss/encoding.html
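For what it's worth, a minimal sketch of that suggestion (the file name is made
up): ISO-8859-1 maps the code points U+0000..U+00FF one-to-one onto the bytes
0x00..0xFF, so '\u0081' really comes out as the byte 0x81, whereas the default
Cp1252 encoder has no mapping for it and substitutes '?' (0x3F).

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class WriteChar129
{
    public static void main(String[] args) throws Exception
    {
        // ISO-8859-1 encodes '\u0081' as the single byte 0x81;
        // the default Cp1252 encoder would replace it with '?' (0x3F).
        Writer writer = new OutputStreamWriter(
            new FileOutputStream("out.dat"), "ISO-8859-1");
        writer.write("\u0081");
        writer.close();
    }
}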
 
Oliver Wong

Thomas Fritsch said:
What do you mean by a 1-byte character 0x81?
(1) The 2-byte char '\u0081'. Its meaning is defined by the
Unicode spec. See www.unicode.org

To be precise, I don't think the Unicode spec defines a byte-length for
its characters. That is, the 129th character in the Unicode standard
(where 129 in decimal = 81 in hexadecimal) does not intrinsically have a
length of 2 bytes.

Particular encodings of the characters have length, but the character
itself doesn't have a length. In UTF-16, '\u0081' has a length of 2 bytes.
In other encodings, it might have other lengths.
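As a rough illustration (the class name is made up), getBytes() shows the
different lengths; note that with Cp1252 the unmappable character is simply
replaced by '?' (0x3F), which is exactly what the original poster observed.

public class EncodingLengths
{
    public static void main(String[] args) throws Exception
    {
        String s = "\u0081";
        String[] encodings = { "UTF-8", "UTF-16BE", "ISO-8859-1", "Cp1252" };
        for (String enc : encodings)
        {
            byte[] bytes = s.getBytes(enc);
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes)
                hex.append(Integer.toHexString(b & 0xFF)).append(' ');
            // e.g. "UTF-8: 2 byte(s): c2 81"
            System.out.println(enc + ": " + bytes.length + " byte(s): " + hex);
        }
    }
}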

To the OP, are you asking "Which encoding will encode the Unicode
character '\u0081' as the byte 0x81?"?

- Oliver
 
Thomas Fritsch

Oliver said:
To be precise, I don't think the Unicode spec defines a byte-length for
its characters. That is, the 129th character in the Unicode standard
(where 129 in decimal = 81 in hexadecimal) does not intrinsically have a
length of 2 bytes.
Agreed! Unicode characters are just abstract numbers without any length.
And there are actually characters defined beyond 0x10000 (Cuneiform, Gothic,
Linear B, ...).
BTW: I suspect that Sun now regrets the Java 1.0 design decision that a
char is 2 bytes long.
 
John O'Conner

Oliver said:
Yes. They allude to this regret in the Javadocs too:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

- Oliver


I think that given the situation, we came up with the most reasonable
solution for 1.5. Unicode had evolved past 65k characters for a long
time...frankly, we ignored it as long as possible. With 1.5, the demand
was overwhelming...and legitimate, real characters had shown up in the
Unicode 4.0 specification. We had to find some way to move Java up to
the new 4.0 spec. We considered practically everything...making a new
char32 type, using ints exclusively as characters, changing the
definition of char to be 32 bits wide, etc. Finally, we have what we
have now...after much debate. It isn't perfect, but it works.
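A small sketch of what that solution looks like to a programmer (the class name
is made up): char stays 16 bits, a supplementary character occupies a surrogate
pair, and the int-based methods added in 1.5 work in code points.

public class SupplementaryDemo
{
    public static void main(String[] args)
    {
        // U+10400 is a supplementary character; in UTF-16 it needs
        // the surrogate pair \ud801 \udc00.
        String s = new String(Character.toChars(0x10400)) + "a";

        System.out.println(s.length());                             // 3 char values
        System.out.println(s.codePointCount(0, s.length()));        // 2 code points
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 10400
    }
}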

Best of luck,
John O'Conner
 
Roedy Green

BTW: I suspect that Sun now regrets the Java 1.0 design decision that a
char is 2 bytes long.

I don't think so. Going to 32-bit chars would double the RAM requirement
for character processing. That is mostly what I do with Java. It
would cut my effective RAM heap in two. This would mean more frequent
GC. Those characters are mainly needed for Chinese, and even then I
understand they are optional.
 
Oliver Wong

Roedy Green said:
I don't think so. Going to 32-bit chars would double the RAM requirement
for character processing. That is mostly what I do with Java. It
would cut my effective RAM heap in two. This would mean more frequent
GC. Those characters are mainly needed for Chinese, and even then I
understand they are optional.

I'm not sure if one of the specifications forbids this, but perhaps Java
could *appear* to be using 32-bit chars, while the VM actually internally uses
UTF-16 or even UTF-8 encoding.

I think it'd be more elegant (though perhaps less practical) if the char
data type were not considered a numeric type at all and did not have any
bit size. As Unicode expands, so would the implementations of the char data
type, without breaking existing code (since existing code shouldn't
depend on char being 16 bits wide or anything like that).

- Oliver
 
Stefan Ram

Oliver Wong said:
I'm not sure if one of the specifications forbids this, but
perhaps Java could *appear* to be using 32-bit chars, while the
VM actually internally uses UTF-16 or even UTF-8 encoding.

This (with UTF-8) is done in Perl 5.
 
Roedy Green

This (with UTF-8) is done in Perl 5.

the problem with that is charAt, indexOf etc all greatly slow down.
Even substring could be a beast if you actually try to figure out the
length in bytes.
 
Oliver Wong

Roedy Green said:
the problem with that is charAt, indexOf etc all greatly slow down.
Even substring could be a beast if you actually try to figure out the
length in bytes.

When you're dealing with Unicode characters above \uffff, charAt()
doesn't do what one would expect it to do... Is it better to have a fast
implementation that works some of the time, or a slow implementation that
works all the time?
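A quick sketch of the surprise (the class name is made up): for a string
holding the single supplementary character U+10400, charAt(0) returns only the
high surrogate, while codePointAt(0) returns the whole character.

public class CharAtSurprise
{
    public static void main(String[] args)
    {
        // one character, but two chars (a surrogate pair)
        String s = new String(Character.toChars(0x10400));

        System.out.println(Integer.toHexString(s.charAt(0)));       // d801 (high surrogate)
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 10400
    }
}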

Actually, perhaps we could have multiple implementations of the String
interface. You could have an 8-bit-per-character String implementation for
strings which consist mostly of English characters, a 16-bit implementation
for String for European languages and mathematical symbols, and a 32-bit
implementation to handle everything else (for now).

Since most Java programs use strings like so:

<example>
String foo = "Hello world";
</example>

instead of

<example>
String foo = new String("Hello world");
</example>

the compiler could actually, at compile time, look at what kind of
string it is dealing with, and use the appropriate subclass. Similar
intelligence (except at runtime instead of compile time) could be built into
BufferedReader and other classes which act as factories for Strings.

- Oliver
 
Stefan Ram

Roedy Green said:
the problem with that is charAt, indexOf etc all greatly slow down.

This is what we've got today in Java to get the nth character
from a string (because of the surrogate pairs used): one cannot
just skip (n-1) char values, but has to analyze each char
value for its surrogate property.

So the current Java solution combines problems from both
worlds: it needs more complicated algorithms to care for
surrogate pairs (so getting to the nth character is slower),
but this is not even hidden from the client by a layer, so he
needs to be aware of it.

It is not obvious that UTF-8 algorithms are slow, because
the data is so small that it might often fit into cache
memory. Using UCS4 might simplify algorithms, but more
strings might not fit into cache memory completely,
which might slow down operations.

Perl 5 might have the suspected slowdown, but at least it has
a layer over its internal UTF-8, so that the client does not
have to be aware of it. His algorithms on strings look simple
and express the intentions of the programmer, not distorted by
having to care for surrogate pairs. In the long run, code that
expresses the programmer's intention more cleanly might even
lead to more chances for optimization. For example: Perl might
change its internal representation to UCS4 later, while Java
must keep surrogate pairs, because clients have been written
which expect them.
 
Stefan Ram

This is what we've got today in Java to get the nth character
from a string (because of the surrogate pairs used): one cannot
just skip (n-1) char values, but has to analyze each char
value for its surrogate property.

One might use:

final java.lang.String chString = string.substring( n - 1, n );
final int ch = java.lang.Character.codePointAt( chString, 0 );
 
Chris Uppal

Oliver said:
Actually, perhaps we could have multiple implementations of the String
interface. You could have an 8-bit-per-character String implementation for
strings which consist mostly of English characters, a 16-bit
implementation for String for European languages and mathematical
symbols, and a 32-bit implementation to handle everything else (for now).

I put together an implementation of the same basic idea (for Smalltalk -- where
the absence of static typing allows such things to work a lot better).

There's a separation between the interface to my strings (which are
intersubstitutable with the implementation's built-in String class), and their
physical representation. One of the physical classes represents its data as an
internal Array of UnicodeCharacters (this is mainly meant as a
simple-as-possible implementation for sanity checking and unit tests). Most of
the other implementations keep their data as a ByteArray internally and use one
or another UnicodeByteEncoding to interpret it. There are encodings for
UTF-8/16/32, plus the obvious-but-doesn't-actually-exist "UTF-24", and Java's
weird encoding.

One of the features I plan, but haven't got around to implementing yet, is for
the variable-width encoded strings to keep a record of the first "glitch" in
the encoding -- the first position where there's a character which doesn't fit
in the encoding's minimum width. That should (I hope) mean that UTF-8 can be
used efficiently in space /and/ time for data which is predominantly ASCII.

Writing about it here reminds me that I really ought to get that stuff
finished...

-- chris
 
Roedy Green

Actually, perhaps we could have multiple implementations of the String
interface. You could have an 8-bit-per-character String implementation for
strings which consist mostly of English characters, a 16-bit implementation
for String for European languages and mathematical symbols, and a 32-bit
implementation to handle everything else (for now)

that makes sense. To the programmer they could all be treated as the
same type, whatever the internal representation.

You could do it like this:

A string literal could have a two-bit marker:

00 stored as 8 bits per char NO MULTICHAR STRINGS Unicode 0..FF
(a greater range than a single-byte UTF-8 char)

01 stored as 16 bits per char no multichars

10 stored as 32 bits per char no multichars.

A string then has many possible internal and hidden representations.

It would even be possible for a string to be a list of the calls to
append that created it, or an array that is a hodgepodge of the three sizes.

The String class would be at liberty to reorganise Strings, collapsing
pieces, making them all one piece of the largest size, or splitting
them to isolate just a few difficult characters leaving the rest in
narrower strings.

This sounds horribly complicated, but even a newbie could implement
such a string class. It is just a lot of bookkeeping. In the cases
where a string has a single segment, the code is almost as fast as the
code we use today, and it would actually use LESS RAM, since so many
strings are made completely of characters in the range 0..FF.

The difficult part comes in optimising. When to split, when to join.
Actually splitting and joining are trivial.

Any JVM maker or AOT compiler maker could implement this idea today with
16/8-bit Strings and you would never know unless you peeked inside. The big
payoff for mixed-width strings internally would come if Java started
using 32-bit Strings as the default.

Similarly, optimisers might internally use arrays of byte or int
instead of long when the optimiser determines that this actually
suffices.
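A rough sketch of that bookkeeping for the single-segment case (all names are
invented, not a real JVM mechanism): the factory picks the narrowest backing
array that can hold every code point in the string, and indexed access then
needs no surrogate scanning.

public final class CompactString
{
    private final byte[] narrow;   // every code point fits in 0x00..0xFF
    private final char[] medium;   // every code point fits in 0x0000..0xFFFF
    private final int[]  wide;     // arbitrary code points

    private CompactString(byte[] n, char[] m, int[] w)
    { narrow = n; medium = m; wide = w; }

    public static CompactString of(String s)
    {
        int count = s.codePointCount(0, s.length());
        int[] cps = new int[count];
        int max = 0;
        for (int i = 0, j = 0; i < s.length(); j++)
        {
            int cp = s.codePointAt(i);
            cps[j] = cp;
            if (cp > max) max = cp;
            i += Character.charCount(cp);
        }
        if (max <= 0xFF)
        {
            byte[] b = new byte[count];
            for (int j = 0; j < count; j++) b[j] = (byte) cps[j];
            return new CompactString(b, null, null);
        }
        if (max <= 0xFFFF)
        {
            char[] c = new char[count];
            for (int j = 0; j < count; j++) c[j] = (char) cps[j];
            return new CompactString(null, c, null);
        }
        return new CompactString(null, null, cps);
    }

    /** The n-th code point, found without scanning for surrogates. */
    public int codePointAt(int n)
    {
        if (narrow != null) return narrow[n] & 0xFF;
        if (medium != null) return medium[n];
        return wide[n];
    }

    public int length()
    {
        if (narrow != null) return narrow.length;
        if (medium != null) return medium.length;
        return wide.length;
    }
}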
 
Stefan Ram

One might use:
final java.lang.String chString = string.substring( n - 1, n );
final int ch = java.lang.Character.codePointAt( chString, 0 );

No! It seems as if the substring index is not the number
of code points, but the number of char values.

public class Main
{ public static void main( final java.lang.String[] args )
{ java.lang.System.out.println( "\udb40\udc50a".substring( 1 )); }}

The above string literal should contain only two code points,
the second one being "a". But substring( 1 ) seems to give
"\udc50a", which contains two chars but is possibly not a
meaningful Unicode code point sequence at all.

So how does one get the second code point?

public class Main
{ public static void main( final java.lang.String[] args )
{ final java.lang.String text = "\udb40\udc50a";
java.lang.System.out.println
( text.substring( text.offsetByCodePoints( 0, 1 ))); }}
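And if one wants the second code point as an int rather than as a substring,
one might (same string, same idea) combine offsetByCodePoints with codePointAt:

public class Main
{ public static void main( final java.lang.String[] args )
  { final java.lang.String text = "\udb40\udc50a";
    final int second = text.codePointAt( text.offsetByCodePoints( 0, 1 ));
    /* prints "61", the code point of "a" */
    java.lang.System.out.println( java.lang.Integer.toHexString( second )); }}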
 
