Transmitting strings via tcp from a windows c++ client to a Java server

Discussion in 'Java' started by qqq111, Feb 19, 2006.

  1. qqq111

    qqq111 Guest

    Hi all,

    We have a C++ client that runs on Windows and needs to transmit
    char* / wchar_t* strings to and from a Java server.

    The client should correctly handle both 'standard' languages and East
    Asian languages (i.e. those that require wchar_t).

    Now, I'm sure there is a best practice for doing so, I just haven't
    found it yet :)

    My best bet would be always encoding the string in UTF-8 before
    sending it via the net, but I could be wrong.

    Your help will be highly appreciated.


    qqq111, Feb 19, 2006

  2. Roedy Green

    Roedy Green Guest

    How about UTF-8 encoding? It handles all the 16-bit chars. It is
    reasonably efficient for American English, using just 8-bit chars. It
    does not have an endian ambiguity.

    HTTP has heard of it and it tends to be an accepted encoding.

    You could use a one-byte length field, counting either chars or bytes
    inside. Or you could use a Java-style big-endian length field
    compatible with DataInputStream.readUTF.
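    In Java terms, such a frame might look like the sketch below (an
    illustration, not part of the original thread): a two-byte big-endian
    byte count followed by standard UTF-8 data. Note that
    DataInputStream.readUTF uses the same frame shape but decodes Sun's
    modified UTF-8, so the two only agree for text containing no NULs and
    no supplementary characters.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Frame {
    // Frame layout: 2-byte big-endian byte count, then standard UTF-8 data.
    // Strings whose UTF-8 encoding exceeds 65535 bytes would need a wider field.
    public static void writeFrame(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeShort(utf8.length);   // big-endian, like readUTF's length field
        out.write(utf8);
    }

    public static String readFrame(DataInputStream in) throws IOException {
        byte[] buf = new byte[in.readUnsignedShort()];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Round-trip helper for demonstration: write a frame to memory, read it back.
    public static String roundTrip(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            writeFrame(new DataOutputStream(bos), s);
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
            return readFrame(in);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```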

    Roedy Green, Feb 19, 2006

  3. qqq111

    qqq111 Guest

    Hi Roedy,

    The only problem I have with UTF-8 is its poor support in Windows.
    In fact, I did not manage to find a Win C++ API that converts strings
    to/from UTF-8.

    My other thought was to use UTF-16/UCS-2 format, internally used by
    both Win (client) and Java (server), but as you have stated, there's
    the endian issue.

    BTW, your site is at a high position on my Java-best list :)

    qqq111, Feb 20, 2006
  4. Chris Uppal

    Chris Uppal Guest

    The obvious options are:

    Use UTF-8.
    Advantages: Compact /if/ you send mostly ASCII text. Easily readable (for
    debugging) /if/ you send mostly ASCII text. No byte-order issues.
    Disadvantages: Consumes more bandwidth if you send mostly non-ASCII. Requires
    explicit en/de-coding on the Windows box (perfectly possible, but you have to
    write the code for it).

    Use UTF16-LE.
    Advantages: Compact in the cases where UTF-8 is not. Requires no special
    handling in the Windows code (since that's the native format for a wstring) and
    you always have to specify an encoding at the Java end so it makes no
    difference which encoding you use from the Java point-of-view.
    Disadvantages: Consumes more bandwidth if you send mostly ASCII text.

    Without knowing your requirements, I can't guess which option would be best
    for you, but I don't think any other options make sense.
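    The size trade-off described above is easy to check from the Java side;
    a small illustrative sketch (not part of the original thread), measuring
    the same text under both encodings:

```java
import java.nio.charset.Charset;

public class EncodingSizes {
    // Number of bytes the string occupies under the named encoding.
    public static int size(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // ASCII text: UTF-8 wins (1 byte/char vs 2 bytes/char).
        System.out.println(size("hello", "UTF-8"));     // 5
        System.out.println(size("hello", "UTF-16LE"));  // 10
        // CJK text: UTF-16LE wins (2 bytes/char vs 3 bytes/char).
        System.out.println(size("\u65e5\u672c\u8a9e", "UTF-8"));     // 9
        System.out.println(size("\u65e5\u672c\u8a9e", "UTF-16LE"));  // 6
    }
}
```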

    Some other points to consider.

    If you choose UTF8 then don't use DataInputStream.readUTF() or the
    corresponding write method. They don't do what the method names suggest.

    If you choose UTF16-LE then you should consider whether a BOM (byte order mark)
    is forbidden, tolerated, or required by your protocol. Alternatively you could
    mandate merely UTF16 (either byte order) and /require/ a BOM -- that would give
    you flexibility if you anticipate creating non Windows clients (which I doubt).

    If you choose UTF8 then you should consider whether a BOM is forbidden or
    tolerated by your protocol.

    If your choice between UTF-8 and -16 is significantly swayed by bandwidth
    considerations, then it might be worthwhile considering using zlib compression.
    Java already understands that, and it's easy to use the ZLIB1.DLL from Windows.
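    On the Java side, zlib is available out of the box via
    java.util.zip.Deflater/Inflater; a round-trip sketch (illustrative,
    not from the original thread):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZlibRoundTrip {
    // Compress a byte array with the default zlib settings.
    public static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a complete zlib stream back to the original bytes.
    public static byte[] decompress(byte[] data) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) break; // truncated input
                out.write(buf, 0, n);
            }
            inflater.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }
}
```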

    If your protocol is of the form:
    <character count><character data>
    then you should be very clear about what you mean by a "character", especially
    if you use UTF16 (where there may be more 16-bit wchars / Java chars than
    actual Unicode characters). Is the BOM (if any) included in the count?
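    In Java terms the ambiguity looks like this (an illustrative sketch;
    U+10302 is used below only as an arbitrary supplementary character,
    written as its UTF-16 surrogate pair):

```java
public class CharCountDemo {
    // Count of 16-bit UTF-16 code units (what String.length() reports).
    public static int utf16Units(String s) {
        return s.length();
    }

    // Count of actual Unicode characters (code points).
    public static int unicodeChars(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "A\uD800\uDF02"; // 'A' plus U+10302 as a surrogate pair
        System.out.println(utf16Units(s));   // 3
        System.out.println(unicodeChars(s)); // 2
    }
}
```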

    -- chris
    Chris Uppal, Feb 20, 2006
  5. qqq111

    qqq111 Guest

    Very interesting input, Chris. It does seem
    that UTF-8 is the right way for us...

    1. Our data will mainly consist of ASCII text

    2. It turns out Windows does have an API for to/from UTF-8
    conversions. See WideCharToMultiByte and
    MultiByteToWideChar (code page should be set to CP_UTF8)

    3. Our system does not use DataInputStream, but rather:

    4. Each of our msgs is indeed preceded by a length field
    (a fixed-size text field). Length is measured in Java
    characters and multiplied by 2 to obtain the size in bytes

    5. The BOM issue is, frankly, news to me. If I limit myself to
    UTF-8 strings only, and stick to standard Win/Java api at
    both client & server end, do I need to worry about BOM ?

    Thanks in advance,

    qqq111, Feb 21, 2006
  6. Chris Uppal

    Chris Uppal Guest

    But first a request. /Please/ follow Usenet etiquette and say who you are
    replying to and quote selectively from the post as you reply. Normally I just
    ignore people who don't follow "The Rules"; I'm making an exception in this
    case on a whim ;-)

    That algorithm will not give you the size in bytes of a UTF-8 encoded string.
    There is no way to compute the length of the UTF-8 encoding of a Unicode
    sequence that does not involve scanning every character. The easiest thing, of
    course, is just to let the platform do the encoding and then transmit the
    length of the resulting byte array. If you want to calculate the length
    yourself, then it's a bit messy -- the main problem is that in Java or Windows
    the input data is encoded as UTF-16 so you have to undo that encoding and then
    re-encode the result as UTF-8. Not especially difficult, but more work than
    you might expect if you are used to relying on strlen() and the like.
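    The "let the platform encode, then measure" approach might look like
    this in Java (an illustrative sketch, not from the original thread); it
    also shows why the chars-times-two rule above does not give the UTF-8
    byte count:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Correct: encode first, then measure the resulting byte array.
    public static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // The flawed chars-times-two heuristic (only valid for UTF-16 payloads).
    public static int naiveSize(String s) {
        return s.length() * 2;
    }

    public static void main(String[] args) {
        System.out.println(utf8ByteLength("abc"));    // 3
        System.out.println(naiveSize("abc"));         // 6 -- too big for ASCII
        System.out.println(utf8ByteLength("\u65e5")); // 3
        System.out.println(naiveSize("\u65e5"));      // 2 -- too small for CJK
    }
}
```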

    It would work for UTF-16. But if you decide to stick with UTF-8 (which sounds
    better to me) then I suggest you prototype your receiving code (for both
    platforms) before you set the protocol in stone.

    Whatever you do, make very sure that your documentation (formal or informal) of
    the protocol is /very/ clear about the meaning of the size field. Remember
    that the word "character" is ambiguous -- it could mean Java char-s, C++
    wchar-s, or (most confusingly) Unicode characters. An inexperienced programmer
    could even assume it meant "byte".

    I doubt it. The important thing is to have made a conscious (and documented)
    decision. I would probably decide that a BOM must not be used, unless there's
    something in your project's requirements that I don't know about.

    -- chris
    Chris Uppal, Feb 22, 2006
  7. qqq111

    qqq111 Guest


    Thanks for not ignoring me ;-)

    You're right, of course.
    That is what we'll probably do, in the end.
    Agree - very important to clearly state the 'type of length'.

    As a side note: you've mentioned zlib in a prior post. We do plan to
    compress parts of the network-transferred data. We plan, however, on
    using an open-source lib called LZMA, which achieves impressive
    compression ratios at a reasonable CPU cost.
    Do you feel we've missed any important considerations here?

    Thanks again,

    qqq111, Feb 23, 2006
  8. Roedy Green

    Roedy Green Guest

    It is not hard. I posted the code for it at

    The code is in Java but I think it would likely compile as C with the
    right typedefs.
    Roedy Green, Feb 24, 2006
  9. Roedy Green

    Roedy Green Guest

    The BOM for UTF-8 looks like this:

    EF BB BF

    It is a misnomer. You don't need a byte order mark for UTF-8 since
    there are no lo-hi bytes to order. It is more like a file signature to
    indicate a UTF-8 encoded file. Otherwise it will at a casual glance
    look no different from any native platform encoding.
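    If a protocol tolerates this signature, the receiver can strip it
    before decoding; a sketch (illustrative, not from the original thread):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Signature {
    // Decode UTF-8 bytes, skipping the EF BB BF signature if present.
    public static String decode(byte[] b) {
        int start = (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) ? 3 : 0;
        return new String(b, start, b.length - start, StandardCharsets.UTF_8);
    }
}
```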
    Roedy Green, Feb 24, 2006
  10. qqq111

    qqq111 Guest

    Hi Roedy,
    Apparently Win does have the API for UTF-8 (and other formats) enc/dec:
    encoding: WideCharToMultiByte(CP_UTF8...)
    decoding: MultiByteToWideChar(CP_UTF8...)

    Note that for the conversions to succeed, your C++ app should be
    compiled with the _UNICODE flag.

    qqq111, Feb 24, 2006
  11. Chris Uppal

    Chris Uppal Guest

    Thank /you/ for listening!

    I don't know anything about that library or compression scheme myself (beyond
    what it says on the website). It certainly looks OK, and using the same
    library for your C++ and Java code would probably make things easier (if only
    for support queries). The only /potential/ issue I'd raise[*] is that the
    [de]compression times are highly asymmetrical with compression being rather
    compute-intensive. If the bulk of the compression happens on the clients,
    leaving the server to do (mostly) only decompression, then that will work very
    well for you. But if the situation is the other way around, then I'd want to
    do a bit of measuring and a few sums before committing to LZMA. I'm not
    suggesting that /would/ be a problem, just something to check (which you may
    well have done already).

    ([*] Apart from a suggestion that you get your lawyers to OK the license --
    which is my standard line for anything with LGPL.)

    -- chris
    Chris Uppal, Feb 24, 2006
  12. Roedy Green

    Roedy Green Guest

    I have improved the code to provide both encode and decode, and a
    test harness you can use to ensure that they both give the same
    results as the Sun classes.
    Roedy Green, Feb 24, 2006
  13. Chris Uppal

    Chris Uppal Guest

    Roedy, I don't want to sound too hostile, but that page is full of
    errors and is /very badly/ misleading.

    UTF-8 is a standard. It has /nothing at all/ to do with the format used
    in JNI, classfiles, and in the ObjectOutputStream.writeUtf8() method.
    /Nothing/. You should not conflate the two.

    UTF-8 does not include a prepended length count.

    UTF-8 takes between 1 and 4 bytes (inclusive) to encode a Unicode
    character. Your encoder does not work properly for either:
    * Unicode characters outside the 16-bit range.
    * java.lang.Strings containing logical characters
    outside that range (for which you have to decode
    the UTF-16 before you can encode again into UTF-8).

    The UTF-8 decoder has similar problems, and in addition does not
    perform the mandatory checks for illegal uses of non-shortest-form
    encodings (necessary for security).

    Unicode characters outside the 16-bit range are /not/ represented as
    surrogate pairs in UTF-8. That /only/ happens in UTF-16.

    I strongly recommend that you review that page, and remove all
    references to Sun's perversion, except a warning that
    ObjectOutputStream.writeUtf8() does not write valid UTF-8. Move the
    description of Sun's encoding onto a different page if you think there's
    any value in describing it. Also you should either fix the en/decoder
    code examples, or make it very much more obvious that they don't
    en/decode standards-compliant UTF-8 (i.e. don't work).

    -- chris
    Chris Uppal, Feb 24, 2006
  14. Chris Uppal

    Chris Uppal Guest

    That should be expanded:

    DataOutputStream.writeUTF() does not write valid UTF-8. Nor do the
    other IO classes implementing DataOutput, such as ObjectOutputStream
    and RandomAccessFile. Similarly the corresponding readUTF() methods
    do not decode UTF-8 correctly.
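    The NUL case makes the difference easy to observe: writeUTF prepends a
    two-byte length and encodes U+0000 as the two-byte sequence C0 80,
    which strict UTF-8 forbids as a non-shortest form. An illustrative
    sketch (not part of the original thread):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    // Capture exactly what DataOutputStream.writeUTF() puts on the wire.
    public static byte[] writeUtfBytes(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

    For the one-character string "\u0000", writeUTF emits four bytes
    (00 02 C0 80), while standard UTF-8 encodes the same string as the
    single byte 00.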

    -- chris
    Chris Uppal, Feb 24, 2006
  15. Roedy Green

    Roedy Green Guest

    At times I feel like I'm at the top of a steep ski hill when I start
    a little essay. Once you put something out there, you are committed to
    getting it right, no matter how long it takes you.

    The simplest little things turn into black holes for time.

    All you said sounded correct, except I am pretty sure I read that
    UTF-8 had been extended to use surrogate pairs to encode 32-bit
    characters. That is not just a Sun thing.
    Roedy Green, Feb 24, 2006
  16. Chris Uppal

    Chris Uppal Guest

    It's perfectly possible that you did read that. It's not true, though.
    A great deal of junk has been written about Unicode.

    -- chris
    Chris Uppal, Feb 27, 2006
  17. Roedy Green

    Roedy Green Guest

    I have rewritten the essay and written an experiment explorer program
    to back up much of what I say.

    Roedy Green, Feb 27, 2006
  18. Chris Uppal

    Chris Uppal Guest

    Thanks for making the changes.

    I haven't actually checked the code -- it seems safe to assume it does
    what you say it does -- but with that proviso it seems pretty much OK.
    I still think you could usefully make it clearer that your example
    en/decoding code is not actually useful (because incomplete). I know
    you /do/ say that, but it's buried away and (IMO) gives the impression
    that it "doesn't really matter".

    However, there is still one major error. It's near the bottom under
    "Exploring Java's UTF Support". First off, it still isn't plain that 2
    out of the four options you mention (1 and 3) have /nothing at all/ to
    do with UTF-8. The so-called "modified UTF-8" format is not compatible
    (upwards or downwards) with UTF-8. So I don't think you should mix
    references to the two together, and certainly not intermingle them as
    if they were all of comparable relevance. Specifically, the page
    states (slightly further up, under "DataOutputStream.writeUTF()") that
    the length is "followed by a standard UTF-8 byte encoding of the
    String"; that is simply not true. You note already that Quasi-UTF-8
    encodes 0x0 differently from UTF-8, which all by itself is enough to
    make writeUTF() useless for interoperability with standards compliant
    encodings. However there is also a major difference in how it encodes
    characters off the BMP. E.g. one such character will encode in UTF-8
    as (taken from the Unicode Standard 4.0.1):
    0xF0 0x90 0x8C 0x82
    whereas under Sun's scheme it encodes as:
    0xED 0xA0 0x80 0xED 0xBC 0x82
    (I'm using unsigned bytes here).
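    For what it's worth, the difference is reproducible in Java (an
    illustrative sketch, not from the original thread; the byte sequences
    above both decode to U+10302, written below as its surrogate pair):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SupplementaryDemo {
    // Standard UTF-8: one four-byte sequence per supplementary character.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Sun's modified UTF-8: each UTF-16 surrogate encoded separately (3+3 bytes).
    public static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF(s);
            byte[] framed = bos.toByteArray();
            // Drop writeUTF's two-byte length prefix to get just the encoding.
            byte[] body = new byte[framed.length - 2];
            System.arraycopy(framed, 2, body, 0, body.length);
            return body;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

    For "\uD800\uDF02" (U+10302), standardUtf8 yields F0 90 8C 82 while
    modifiedUtf8 yields ED A0 80 ED BC 82, matching the bytes quoted above.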

    BTW, you also express some opinions on the (non-)value of the >16-bit
    Unicode characters. I have no problem with your expressing your
    opinions on your own webpages. I just wanted to add that I don't agree
    with them.

    -- chris
    Chris Uppal, Feb 28, 2006
  19. Roedy Green

    Roedy Green Guest

    I disagree. The only difference for 16-bit is the way 0 is encoded,
    and the Sun encoding comes out in the wash even when you decode making
    no special provision for it. You are making a mountain out of a null.
    They behave 99% the same way so it makes sense to discuss them both
    under the same heading.
    It is even less of a difference from a practical point of view than
    the presence or absence of BOMs.

    Personally, I don't see the point of any great rush to support 32-bit
    Unicode. The new symbols will be rarely used. Consider what's there.
    The only ones I would conceivably use are musical symbols and
    Mathematical Alphanumeric symbols (especially the German black letters
    so favoured in real analysis). The rest I can't imagine ever using
    unless I took up a career in anthropology, i.e. Linear B syllabary (I
    have not a clue what it is), Linear B ideograms (looks like symbols
    for categorising cave petroglyphs), Aegean Numbers (counting with
    stones and sticks), Old Italic (looks like Phoenician), Gothic
    (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian
    (George Bernard Shaw's phonetic script), Osmanya (Somalian), Cypriot
    syllabary, Byzantine music symbols (looks like Arabic), Musical
    Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK
    extensions(Chinese Japanese Korean) and tags (letters with blank
    “price tags”).

    I think 32-bit Unicode becomes a matter of the tail wagging the dog,
    spurred by the technical challenge rather than a practical necessity.
    In the process, ordinary 16-bit character handling is turned into a
    bleeding mess, for almost no benefit.

    I think we should for the most part simply ignore 32-bit and continue
    using the String class as we always have, presuming every character is
    16 bits.
    Roedy Green, Feb 28, 2006
  20. Roedy Green

    Roedy Green Guest

    I have done some tsk-tsking over this that should warm the cockles of
    your heart. I have also included exploration of codepoints in the test
    program. I have also shown how 21-bit code points are encoded, though
    I have not put code into the sample UTF code to handle codepoints by
    decoding UTF-16 and recoding as UTF-8. I wanted to explain how
    this worked, not confuse the heck out of people with code they won't
    likely ever use.
    Roedy Green, Mar 1, 2006
