Transmitting strings via TCP from a Windows C++ client to a Java server

Q

qqq111

Hi all,

We have a C++ client that runs on Windows and needs to transmit
char* / wchar_t* strings to and from a Java server.

The client should correctly handle both 'standard' languages and East
Asian languages (i.e. using wchar_t).

Now, I'm sure there is a best practice for doing so; I just haven't
found it yet :)

My best bet would be always encoding the string in UTF-8 before
sending it via the net, but I could be wrong.

Your help will be highly appreciated.

Thanks,

Gilad
 
R

Roedy Green

Now, I'm sure there is a best practice for doing so; I just haven't
found it yet :)

How about UTF-8 encoding? It handles all the 16-bit chars. It is
reasonably efficient for American English, using just 8-bit chars. It
does not have an endianness ambiguity.

HTTP has heard of it and it tends to be an accepted encoding.

You could use a one-byte length field giving either the char or byte
count inside. Or you could use a Java-style big-endian length field
compatible with DataInputStream.readUTF.
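
For example, here is a minimal Java sketch of the reading end, assuming a
two-byte big-endian length prefix that counts bytes (the class and method
names are only illustrative):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    class MessageReader {
        // Reads one length-prefixed message: a 2-byte big-endian byte count,
        // followed by that many bytes of UTF-8 data.
        static String readMessage(InputStream in) throws IOException {
            DataInputStream din = new DataInputStream(in);
            int byteCount = din.readUnsignedShort();  // big-endian, 0..65535
            byte[] buf = new byte[byteCount];
            din.readFully(buf);          // blocks until the whole payload arrives
            return new String(buf, "UTF-8");
        }
    }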

see http://mindprod.com/jgloss/utf.html
 
Q

qqq111

Hi Roedy,

The only problem I have with UTF-8 is that it is poorly supported in
Windows. In fact, I did not manage to find a Win C++ API that converts
strings to UTF-8.

My other thought was to use UTF-16/UCS-2 format, internally used by
both Win (client) and Java (server), but as you have stated, there's
the endian issue.

BTW, your site is at a high position on my Java best-of list :)

Best,
Gilad
 
C

Chris Uppal

qqq111 said:
We have a C++ client that runs on Windows and needs to transmit
char* / wchar_t* strings to and from a Java server.

The client should correctly handle both 'standard' languages and East
Asian languages (i.e. using wchar_t).

The obvious options are:

Use UTF-8.
Advantages: Compact /if/ you send mostly ASCII text. Easily readable (for
debugging) /if/ you send mostly ASCII text. No byte-order issues.
Disadvantages: Consumes more bandwidth if you send mostly non-ASCII. Requires
explicit en/de-coding on the Windows box (perfectly possible, but you have to
write the code for it).

Use UTF-16LE.
Advantages: Compact in the cases where UTF-8 is not. Requires no special
handling in the Windows code (since that's the native format for a wstring), and
you always have to specify an encoding at the Java end anyway, so it makes no
difference which encoding you use from the Java point of view.
Disadvantages: Consumes more bandwidth if you send mostly ASCII text.

Without knowing your requirements, I can't guess which option would be best
for you, but I don't think any other options make sense.
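
To illustrate that last point, a tiny sketch of the Java end, assuming the raw
message bytes have already been read off the socket (the method name is only
illustrative):

    class DecodeExample {
        // Either way the charset must be named explicitly; only the name changes.
        static String decode(byte[] payload, boolean utf16le)
                throws java.io.UnsupportedEncodingException {
            return new String(payload, utf16le ? "UTF-16LE" : "UTF-8");
        }
    }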

Some other points to consider.

If you choose UTF-8 then don't use java.io.DataInputStream.readUTF() or the
corresponding write method. They don't do what the method names suggest.
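
A sketch of the writing side that avoids those methods, assuming the sort of
length-prefixed framing suggested earlier in the thread (the two-byte byte
count is my assumption, not part of any standard):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    class MessageWriter {
        // Encodes to real UTF-8 and prefixes the byte count, avoiding
        // DataOutputStream.writeUTF() and its "modified UTF-8".
        static void writeMessage(OutputStream out, String s) throws IOException {
            byte[] payload = s.getBytes("UTF-8");
            if (payload.length > 0xFFFF) {
                throw new IOException("message too long for a 2-byte length prefix");
            }
            DataOutputStream dout = new DataOutputStream(out);
            dout.writeShort(payload.length);  // big-endian; counts bytes, not chars
            dout.write(payload);
            dout.flush();
        }
    }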

If you choose UTF-16LE then you should consider whether a BOM (byte order mark)
is forbidden, tolerated, or required by your protocol. Alternatively you could
merely mandate UTF-16 (either byte order) and /require/ a BOM -- that would give
you flexibility if you anticipate creating non-Windows clients (which I doubt).

If you choose UTF-8 then you should consider whether a BOM is forbidden or
tolerated by your protocol.

If your choice between UTF-8 and -16 is significantly swayed by bandwidth
considerations, then it might be worth considering zlib compression.
Java already understands that, and it's easy to use ZLIB1.DLL from Windows
code.
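
For what it's worth, a minimal sketch of the Java side of that, using
java.util.zip (which speaks the same zlib stream format as ZLIB1.DLL):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    class ZlibExample {
        // Compresses a byte array into a zlib-wrapped deflate stream.
        static byte[] compress(byte[] raw) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DeflaterOutputStream dos = new DeflaterOutputStream(bos);
            dos.write(raw);
            dos.finish();  // writes the zlib trailer
            return bos.toByteArray();
        }

        // Reverses compress().
        static byte[] decompress(byte[] compressed) throws IOException {
            InflaterInputStream iis =
                new InflaterInputStream(new ByteArrayInputStream(compressed));
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = iis.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        }
    }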

If your protocol is of the form:
<character count><character data>
then you should be very clear about what you mean by a "character", especially
if you use UTF-16 (where there may be more 16-bit wchars / Java chars than
actual Unicode characters). Is the BOM (if any) included in the count?

-- chris
 
Q

qqq111

Very interesting input, Chris. It does seem
that UTF-8 is the right way for us...


1. Our data will mainly consist of ASCII text

2. It turns out Windows does have an API for to/from UTF-8
conversions. See WideCharToMultiByte and
MultiByteToWideChar (the code page should be set to CP_UTF8)

3. Our system does not use DataInputStream, but rather
CharsetEncoder/CharsetDecoder (see the sketch after this list).

4. Each of our msgs is indeed preceded by a length field
(as a fixed-size text field). Length is measured in Java
characters and doubled to obtain the size in bytes.

5. The BOM issue is, frankly, news to me. If I limit myself to
UTF-8 strings only, and stick to standard Win/Java APIs at
both the client & server end, do I need to worry about a BOM?
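
A minimal sketch of the CharsetEncoder/CharsetDecoder route from point 3
(the class and method names are purely illustrative):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CharsetEncoder;

    class Codec {
        private static final Charset UTF8 = Charset.forName("UTF-8");

        static ByteBuffer encode(String s) throws CharacterCodingException {
            CharsetEncoder enc = UTF8.newEncoder();  // encoders are not thread-safe
            return enc.encode(CharBuffer.wrap(s));   // throws on malformed input
        }

        static String decode(ByteBuffer bytes) throws CharacterCodingException {
            CharsetDecoder dec = UTF8.newDecoder();
            return dec.decode(bytes).toString();
        }
    }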


Thanks in advance,


Gilad
 
C

Chris Uppal

qqq111 said:

But first a request. /Please/ follow Usenet etiquette and say who you are
replying to and quote selectively from the post as you reply. Normally I just
ignore people who don't follow "The Rules"; I'm making an exception in this
case on a whim ;-)

4. Each of our msgs is indeed preceded by a length field
(as a fixed-size text field). Length is measured in Java
characters and doubled to obtain the size in bytes.

That algorithm will not give you the size in bytes of a UTF-8 encoded string.
There is no way to compute the length of the UTF-8 encoding of a Unicode
sequence that does not involve scanning every character. The easiest thing, of
course, is just to let the platform do the encoding and then transmit the
length of the resulting byte array. If you want to calculate the length
yourself, then it's a bit messy -- the main problem is that in Java or Windows
the input data is encoded as UTF-16 so you have to undo that encoding and then
re-encode the result as UTF-8. Not especially difficult, but more work than
you might expect if you are used to relying on strlen() and the like.
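
A quick illustration of why the doubling rule only works for UTF-16, using an
arbitrary two-character Japanese string:

    import java.io.UnsupportedEncodingException;

    class LengthDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String ascii = "hello";            // 5 chars -> 5 UTF-8 bytes, not 10
            String japanese = "\u65e5\u672c";  // 2 chars -> 6 UTF-8 bytes, not 4
            System.out.println(ascii.length() * 2 + " vs "
                + ascii.getBytes("UTF-8").length);
            System.out.println(japanese.length() * 2 + " vs "
                + japanese.getBytes("UTF-8").length);
        }
    }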

It would work for UTF-16. But if you decide to stick with UTF-8 (which sounds
better to me) then I suggest you prototype your receiving code (for both
platforms) before you set the protocol in stone.

Whatever you do, make very sure that your documentation (formal or informal) of
the protocol is /very/ clear about the meaning of the size field. Remember
that the word "character" is ambiguous -- it could mean Java char-s, C++
wchar-s, or (most confusingly) Unicode characters. An inexperienced programmer
could even assume it meant "byte".

5. The BOM issue is, frankly, news to me. If I limit myself to
UTF-8 strings only, and stick to standard Win/Java APIs at
both the client & server end, do I need to worry about a BOM?

I doubt it. The important thing is to have made a conscious (and documented)
decision. I would probably decide that a BOM must not be used, unless there's
something in your project's requirements that I don't know about.

-- chris
 
Q

qqq111

Hi,

Chris said:
Normally I just ignore people who don't follow "The Rules"

Thanks for not ignoring me ;-)

That algorithm will not give you the size in bytes of a UTF-8 encoded string

You're right, of course.
[easiest way to calc utf-8 buffer len ] is just to let the platform
do the encoding and then transmit the length of the resulting byte array

That is what we'll probably do, in the end.
make very sure [doc] is /very/ clear about the meaning of the size field

Agree - very important to clearly state the 'type of length'.


As a side note: you've mentioned zlib in a prior post. We do plan to
compress parts of the network-transferred data. We plan, however, on
using an open-source lib called LZMA ( http://www.7-zip.org ),
which achieves impressive compression ratios at a reasonable CPU cost
(see: http://tukaani.org/lzma/ ).
Do you feel we've missed any important considerations here?


Thanks again,

Gilad
 
R

Roedy Green

If you choose UTF-8 then you should consider whether a BOM is forbidden or
tolerated by your protocol.

the BOM for UTF-8 looks like this:

EF BB BF

It is a misnomer. You don't need a byte order mark for UTF-8 since there are
no low/high bytes to order. It is more like a file signature to indicate
a UTF-8 encoded file. Otherwise it will at a casual glance look no
different from any native platform encoding.
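
If a peer does send one anyway, recognising and dropping it on the Java side
takes only a few lines. A minimal sketch, assuming the message is already in
a byte array:

    class BomStripper {
        // Drops a leading UTF-8 signature (EF BB BF) if present.
        static byte[] stripUtf8Bom(byte[] data) {
            if (data.length >= 3
                    && data[0] == (byte) 0xEF
                    && data[1] == (byte) 0xBB
                    && data[2] == (byte) 0xBF) {
                byte[] stripped = new byte[data.length - 3];
                System.arraycopy(data, 3, stripped, 0, stripped.length);
                return stripped;
            }
            return data;
        }
    }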
 
Q

qqq111

Hi Roedy,
I posted the code for [ UTF-8 enc/dec ]

Apparently Win does have the APIs for UTF-8 (and other formats) enc/dec:
encoding: WideCharToMultiByte(CP_UTF8, ...)
decoding: MultiByteToWideChar(CP_UTF8, ...)

Note that for the conversions to succeed, your C++ app should be
compiled with the _UNICODE flag.

Best,
Gilad
 
C

Chris Uppal

qqq111 said:
Thanks for not ignoring me ;-)

Thank /you/ for listening!

We do plan to compress parts of the network-transferred data.
We plan, however, on using an open-source lib called LZMA
( http://www.7-zip.org ),
which achieves impressive compression ratios at a reasonable CPU cost
(see: http://tukaani.org/lzma/ ).
Do you feel we've missed any important considerations here?

I don't know anything about that library or compression scheme myself (beyond
what it says on the website). It certainly looks OK, and using the same
library for your C++ and Java code would probably make things easier (if only
for support queries). The only /potential/ issue I'd raise[*] is that the
[de]compression times are highly asymmetrical with compression being rather
compute-intensive. If the bulk of the compression happens on the clients,
leaving the server to do (mostly) only decompression, then that will work very
well for you. But if the situation is the other way around, then I'd want to
do a bit of measuring and a few sums before committing to LZMA. I'm not
suggesting that /would/ be a problem, just something to check (which you may
well have done already).

([*] Apart from a suggestion that you get your lawyers to OK the license --
which is my standard line for anything with LGPL.)

-- chris
 
C

Chris Uppal

Roedy said:

Roedy, I don't want to sound too hostile, but that page is full of
errors and is /very badly/ misleading.

UTF-8 is a standard. It has /nothing at all/ to do with the format used
in JNI, classfiles, and in the ObjectOutputStream.writeUtf8() method.
/Nothing/. You should not conflate the two.

UTF-8 does not include a prepended length count.

UTF-8 takes between 1 and 4 bytes (inclusive) to encode a Unicode
character. Your encoder does not work properly for either:
* Unicode characters outside the 16-bit range.
* java.lang.Strings containing logical characters
outside that range (for which you have to decode
the UTF-16 before you can encode again into UTF-8).

The UTF-8 decoder has similar problems, and in addition does not
perform the mandatory checks for illegal uses of non-shortest-form
encodings (necessary for security).

Unicode characters outside the 16-bit range are /not/ represented as
surrogate pairs in UTF-8. That /only/ happens in UTF-16.

I strongly recommend that you review that page, and remove all
references to Sun's perversion, except a warning that
ObjectOutputStream.writeUtf8() does not write valid UTF-8. Move the
description of Sun's encoding onto a different page if you think there's
any value in describing it. Also you should either fix the en/decoder
code examples, or make it very much more obvious that they don't
en/decode standards-compliant UTF-8 (i.e. don't work).
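
To make the difference concrete, here is a small sketch contrasting
DataOutputStream.writeUTF() with genuine UTF-8, for a string containing NUL
and a character outside the BMP (U+10302 is just an example):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            // NUL plus U+10302 (stored in the String as a surrogate pair)
            String s = "\u0000" + new String(Character.toChars(0x10302));

            byte[] real = s.getBytes("UTF-8");       // standard UTF-8

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);   // 2-byte length + "modified UTF-8"
            byte[] modified = bos.toByteArray();

            System.out.println("real UTF-8: " + toHex(real));
            System.out.println("writeUTF:   " + toHex(modified));
        }

        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            return sb.toString();
        }
    }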

-- chris
 
C

Chris Uppal

I said:
[...] warning that ObjectOutputStream.writeUtf8() does not write valid UTF-8.

That should be expanded:

DataOutputStream.writeUTF() does not write valid UTF-8. Nor do the
other IO classes implementing java.io.DataOutput, such as ObjectOutputStream
and RandomAccessFile. Similarly the corresponding readUTF() methods
do not decode UTF-8 correctly.

-- chris
 
R

Roedy Green

DataOutputStream.writeUTF() does not write valid UTF-8. Nor do the
other IO classes implementing java.io.DataOutput, such as ObjectOutputStream
and RandomAccessFile. Similarly the corresponding readUTF() methods
do not decode UTF-8 correctly.

At times I feel like I'm at the top of a steep ski hill when I start a little
essay. Once you put something out there, you are committed to getting
it right, no matter how long it takes you.

The simplest little things turn into black holes for time.

All you said sounded correct except that I am pretty sure I read that
UTF-8 had been extended to use surrogate pairs to encode 32-bit characters.
That is not just a Sun thing.
 
C

Chris Uppal

Roedy said:
All you said sounded correct except that I am pretty sure I read that
UTF-8 had been extended to use surrogate pairs to encode 32-bit characters.
That is not just a Sun thing.

It's perfectly possible that you did read that. It's not true, though.
A great deal of junk has been written about Unicode.

-- chris
 
C

Chris Uppal

Roedy said:
I have rewritten the essay and written an experiment explorer program
to back up much of what I say.

see http://mindprod.com/jgloss/utf.html

Thanks for making the changes.

I haven't actually checked the code -- it seems safe to assume it does
what you say it does -- but with that proviso it seems pretty much OK.
I still think you could usefully make it clearer that your example
en/decoding code is not actually useful (because incomplete); I know
you /do/ say that, but it's buried away and (IMO) gives the impression
that it "doesn't really matter".

However, there is still one major error. It's near the bottom under
"Exploring Java's UTF Support". First off, it still isn't plain that 2
out of the four options you mention (1 and 3) have /nothing at all/ to
do with UTF-8. The so-called "modified UTF-8" format is not compatible
(upwards or downwards) with UTF-8. So I don't think you should mix
references to the two together, and certainly not intermingle them as
if they were all of comparable relevance. Specifically, the page
states (slightly further up, under "DataOutputStream.writeUTF()") that
the length is "followed by a standard UTF-8 byte encoding of the
String"; that is simply not true. You note already that Quasi-UTF-8
encodes 0x0 differently from UTF-8, which all by itself is enough to
make writeUTF() useless for interoperability with standards compliant
encodings. However there is also a major difference in how it encodes
characters off the BMP. E.g. the Unicode character:
U+10302
will encode in UTF-8 as (taken from the Unicode Standard 4.0.1, table
3.3):
0xF0 0x90 0x8C 0x82
whereas under Sun's scheme it encodes as:
0xED 0xA0 0x80 0xED 0xBC 0x82
(I'm using unsigned bytes here).

BTW, you also express some opinions on the (non-)value of the >16-bit
Unicode characters. I have no problem with your expressing your
opinions on your own webpages. I just wanted to add that I don't agree
with them.

-- chris
 
R

Roedy Green

However, there is still one major error. It's near the bottom under
"Exploring Java's UTF Support". First off, it still isn't plain that 2
out of the four options you mention (1 and 3) have /nothing at all/ to
do with UTF-8. The so-called "modified UTF-8" format is not compatible
(upwards or downwards) with UTF-8. So I don't think you should mix
references to the two together, and certainly not intermingle them as
if they were all of comparable relevance. Specifically, the page
states (slightly further up, under "DataOutputStream.writeUTF()") that
the length is "followed by a standard UTF-8 byte encoding of the
String"; that is simply not true. You note already that Quasi-UTF-8
encodes 0x0 differently from UTF-8, which all by itself is enough to
make writeUTF() useless for interoperability with standards compliant
encodings

I disagree. The only difference for 16-bit is the way 0 is encoded,
and the Sun encoding comes out in the wash even when you decode making
no special provision for it. You are making a mountain out of a null.
They behave 99% the same way, so it makes sense to discuss them both
under http://mindprod.com/jgloss/utf.html.
It is even less of a difference from a practical point of view than
the presence or absence of BOMs.

Personally, I don't see the point of any great rush to support 32-bit
Unicode. The new symbols will be rarely used. Consider what's there.
The only ones I would conceivably use are musical symbols and
Mathematical Alphanumeric symbols (especially the German black letters
so favoured in real analysis). The rest I can't imagine ever using
unless I took up a career in anthropology, i.e. linear B syllabary (I
have not a clue what it is), linear B ideograms (Looks like symbols
for categorising cave petroglyphs), Aegean Numbers (counting with
stones and sticks), Old Italic (looks like Phoenician), Gothic
(medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
(George Bernard Shaw's phonetic script), Osmanya (Somalian), Cypriot
syllabary, Byzantine music symbols (looks like Arabic), Musical
Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
extensions(Chinese Japanese Korean) and tags (letters with blank
“price tags”).

I think 32-bit Unicode becomes a matter of the tail wagging the dog,
spurred by the technical challenge rather than a practical necessity.
In the process, ordinary 16-bit character handling is turned into a
bleeding mess, for almost no benefit.

I think we should for the most part simply ignore 32-bit and continue
using the String class as we always have, presuming every character is
16 bits.
 
R

Roedy Green

E.g. the Unicode character:
U+10302
will encode in UTF-8 as (taken from the Unicode Standard 4.0.1, table
3.3):
0xF0 0x90 0x8C 0x82
whereas under Sun's scheme it encodes as:
0xED 0xA0 0x80 0xED 0xBC 0x82
(I'm using unsigned bytes here).

I have done some tsk-tsking over this that should warm the cockles of
your heart. I have also included exploration of code points in the test
program. I have also shown how 21-bit code points are encoded, though
I have not put code into the sample UTF code to handle code points by
decoding UTF-16 and recoding as UTF-8. I wanted to explain how this
worked, not confuse the heck out of people with code they won't
likely ever use.
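
For what it's worth, the standard String machinery (on a recent JDK) already
does that decode-and-recode step; a tiny sketch (the choice of U+10302 as the
test character is arbitrary):

    class CodePointDemo {
        public static void main(String[] args)
                throws java.io.UnsupportedEncodingException {
            // One code point off the BMP occupies two Java chars (a surrogate pair)
            String s = new String(Character.toChars(0x10302));
            System.out.println("chars:       " + s.length());                      // 2
            System.out.println("code points: " + s.codePointCount(0, s.length())); // 1
            System.out.println("UTF-8 bytes: " + s.getBytes("UTF-8").length);      // 4
        }
    }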
 
