Character encoding between Win and *nix

Discussion in 'Java' started by pietdejong@gmail.com, Oct 22, 2004.

  1. Guest

    Hello all,

    I want to use a Windows client to send a character stream via HTTP to
    a *nix server, which should interpret my characters as such. My
    problem lies with non-ASCII chars. For example: if my char is 'ø'
    (byte 248), I want to keep that byte, and not the *nix interpretation
    of it 'Ã¸' (bytes 195,184).

    The best way of handling things would probably be to do no
    interpretation, but since I'm parsing the HTTP headers line by line, I
    would like to use Java's BufferedReader (which already does a
    byte-to-char transformation). Besides that, I'm using a bean-like
    structure to hold the information passed by the char stream for
    processing. Since my bean has some String fields, it is needless to
    say that some byte-char transformation has to be done.

    Basically, what I would like to get to work is the following: the
    Windows client passes byte 248 via HTTP to the *nix server. The *nix
    server implements a method getString() that internally reproduces the
    correct chars, but in the end writes the initial bytes again... or, in
    case that's not possible, a user-defined char encoding.

    Is this possible?

    Sincerely,

    Piet de Jong
     
    , Oct 22, 2004
    #1

  2. wrote:
    > Hello all,
    >
    > I want to use a Windows client to send a character stream via http
    > to a *nix server, that should interpret my characters as such. My
    > problem
    > lies with non-ascii chars. For example: if my char is 'ø' (byte 248),
    > I
    > want to keep that byte, and not the *nix interpretation of it 'Ã¸'
    > (bytes 195,184).
    >
    > The best way of handling things would probably be to do no
    > interpretation,
    > but since I'm parsing the HTTP headers line by line, I would like to
    > use
    > Java's BufferedReader (that makes already a byte-to-char
    > transformation).
    > Besides that, I'm using a bean like structure holding the information
    > passed by the charstream, to process. Since my bean has some String
    > fields, it is needless to say that some byte-char transformation has to
    > be
    > done.
    >
    > Basically, what I would like to get to work is the following:
    > win-client passes byte 248 via http to *nix server. *nix server
    > implements a
    > method getString() that reproduces internally correct chars, but in the
    > end
    > writes initial bytes again..., or in case that's not possible, a user
    > defined
    > char encoding.
    >


    I would suggest using Base64 encoding. Jakarta commons-codec has a
    Base64 class you can use to encode and decode Base64 strings to and
    from byte arrays. I don't know what you would use for your Windows
    client code, though.
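
    Something like this, for example (an untested sketch; it assumes
    commons-codec is on the classpath and that you control the wire format):

    import org.apache.commons.codec.binary.Base64;

    public class Base64RoundTrip {
        public static void main(String[] args) throws Exception {
            byte[] original = { (byte) 248 };   // the single ISO-8859-1 byte for 'ø'

            // Encode the raw bytes as a plain-ASCII Base64 string for transport.
            String wire = new String(Base64.encodeBase64(original), "US-ASCII");

            // On the receiving side, decode back to the exact original bytes.
            byte[] restored = Base64.decodeBase64(wire.getBytes("US-ASCII"));

            System.out.println(wire);                // prints "+A=="
            System.out.println(restored[0] & 0xff);  // prints 248
        }
    }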


    > Is this possible?
    >
    > Sincerely,
    >
    > Piet de Jong
    >
     
    Bryan Castillo, Oct 22, 2004
    #2

  3. wrote:
    > I want to use a Windows client to send a character stream via http
    > to a *nix server, that should interpret my characters as such.


    Whatever that means. Later you tell us you want a byte-interpretation,
    not a "character as such" interpretation.

    In general, I would suggest you read the HTTP standards and construct a
    data stream entirely in line with HTTP. You hint that you do the
    transmission in the header, which is not a good idea. I would suggest
    you transmit your data in the body.

    I would also suggest you stop mixing characters and bytes. If you want
    to transmit bytes, don't just do a 1:1 substitution with characters.
    Don't use the Java character IO system for binary IO. If you do
    transmit your data in the header, also follow the standards for
    producing valid header data.

    > The best way of handling things would probably be to do no
    > interpretation,
    > but since I'm parsing the HTTP headers line by line, I would like to
    > use
    > Java's BufferedReader (that makes already a byte-to-char
    > transformation).


    Write a real HTTP parser instead, or use one of the existing ones.
    URLConnection is a simple solution that comes with Java. The Apache
    project has much more sophisticated tools.

    > Basically, what I would like to get to work is the following:
    > win-client passes byte 248 via http to *nix server. *nix server
    > implements a
    > method getString() that reproduces internally correct chars, but in the
    > end
    > writes initial bytes again..., or in case that's not possible, a user
    > defined
    > char encoding.


    Again, stop mixing chars and bytes. It doesn't make sense to pile up
    more and more code to correct something you did wrong earlier. Fix the
    root cause, not the symptoms. You can't process the entire HTTP message
    with a Reader, so don't do it. Your effort is better spent on getting
    the basics right than on writing correction code.

    /Thomas
     
    Thomas Weidenfeller, Oct 22, 2004
    #3
  4. wrote:
    > Hello all,

    Hello!
    >
    > I want to use a Windows client to send a character stream via http
    > to a *nix server, that should interpret my characters as such. My
    > problem
    > lies with non-ascii chars. For example: if my char is 'ø' (byte 248),
    > I

    Or to be more precise: Byte 248 is the representation of char 'ø' in the
    ISO-8859-1 encoding.
    > want to keep that byte, and not the *nix interpretation of it 'Ã¸'
    > (bytes 195,184).

    Bytes {195, 184} are the representation of char 'ø' in the UTF-8 encoding.
    >
    > The best way of handling things would probably be to do no
    > interpretation,

    Doing "no interpretation" probably suggests using the ISO-8859-1
    encoding on both sides, because this encoding translates char-values to
    byte-values simply by throwing away the high byte of each char. But even
    then you can't handle chars beyond '\u00ff'.
    > but since I'm parsing the HTTP headers line by line, I would like to
    > use
    > Java's BufferedReader (that makes already a byte-to-char
    > transformation).

    ...but probably the wrong transformation.
    Construct your BufferedReader by:
    new BufferedReader(new InputStreamReader(inputStream, "ISO-8859-1"))
    and similarly with the same encoding name for the OutputStreamWriter on
    the other machine.
    Do *not* omit the encoding name, because then you would get the
    java-default encoding, which may be (and in your case *is*) different
    for different machines.
    It is essential that both sides (client and server) agree on a common
    char-encoding when transmitting/receiving text.
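
    As a small compilable sketch of that idea (the class and method names
    here are just for illustration):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    class EncodingSketch {
        // Receiving side: decode the incoming bytes as ISO-8859-1.
        static BufferedReader reader(Socket socket) throws IOException {
            return new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "ISO-8859-1"));
        }

        // Sending side: encode outgoing chars with the same charset, so that
        // '\u00f8' ('ø') goes over the wire as the single byte 248.
        static Writer writer(Socket socket) throws IOException {
            return new BufferedWriter(
                new OutputStreamWriter(socket.getOutputStream(), "ISO-8859-1"));
        }
    }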

    > Besides that, I'm using a bean like structure holding the information
    > passed by the charstream, to process. Since my bean has some String
    > fields, it is needless to say that some byte-char transformation has to
    > be
    > done.
    >
    > Basically, what I would like to get to work is the following:
    > win-client passes byte 248 via http to *nix server. *nix server
    > implements a
    > method getString() that reproduces internally correct chars, but in the
    > end
    > writes initial bytes again..., or in case that's not possible, a user
    > defined
    > char encoding.
    >
    > Is this possible?
    >
    > Sincerely,
    >
    > Piet de Jong
    >

    BTW: You can specify a char-encoding in the HTTP header, for example
    "text/html; charset=ISO-8859-1".
    You can also set the Java default encoding via a system property,
    for example: java -Dfile.encoding=ISO-8859-1 ...

    Hope this helps...

    --
    "Thomas:Fritsch$ops:de".replace(':','.').replace('$','@')
     
    Thomas Fritsch, Oct 22, 2004
    #4
  5. On 2004-10-22, <> wrote:
    > I want to use a Windows client to send a character stream via http
    > to a *nix server, that should interpret my characters as such. My
    > problem
    > lies with non-ascii chars. For example: if my char is 'ø' (byte 248),
    > I
    > want to keep that byte, and not the *nix interpretation of it 'Ã¸'
    > (bytes 195,184).


    If the server is getting those bytes, chances are it is because the
    client is sending the UTF-8 encoding of the '\xf8' (byte 248) character.

    > The best way of handling things would probably be to do no
    > interpretation,
    > but since I'm parsing the HTTP headers line by line,

    [snip]

    If an HTTP header value contains UTF-8 encoded values, you have a larger
    concern. The values of an HTTP header should be encoded in ISO-8859-1.
    If the client is encoding them in UTF-8, it should be using the
    "encoded-words" syntax given in http://www.faqs.org/rfcs/rfc2047.html.

    It would help if you posted the bytes that the server is receiving for
    the whole header (and not just the one character). Until we know what
    bytes are being received, it is difficult to say how those bytes should
    be interpreted as characters.
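
    For illustration, here is a sketch of how such an encoded-word could be
    built (it reuses the commons-codec Base64 class mentioned earlier;
    decoding on the receiving side is the mirror image):

    import org.apache.commons.codec.binary.Base64;

    class EncodedWordSketch {
        // Builds an RFC 2047 "encoded-word" for a header value.
        // For the string "ø" this yields: =?UTF-8?B?w7g=?=
        static String encodeWord(String value) throws java.io.UnsupportedEncodingException {
            byte[] utf8 = value.getBytes("UTF-8");
            return "=?UTF-8?B?"
                + new String(Base64.encodeBase64(utf8), "US-ASCII") + "?=";
        }
    }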
     
    A. Bolmarcich, Oct 22, 2004
    #5
  6. Yamin Guest

    wrote in message news:<>...

    > I want to use a Windows client to send a character stream via http
    > to a *nix server, that should interpret my characters as such. My
    > problem
    > lies with non-ascii chars. For example: if my char is 'ø' (byte 248),
    > I
    > want to keep that byte, and not the *nix interpretation of it 'Ã¸'
    > (bytes 195,184).

    ...
    >


    I'd really like to know what kind of data you are actually passing
    between the two programs, whether or not you have control over both
    ends, and what you are actually trying to do.

    1. You actually want to transmit binary data (numbers like int...). If
    it's meant to be binary data, then why not simply transmit a string
    representation of those numbers on a byte-for-byte basis? On your
    Windows client you simply encode everything byte for byte into a
    string (really easy to do), then transmit that big string. You will
    end up using roughly 2x the bandwidth of sending the raw binary data,
    but that's what you get for sending it through HTTP.

    2. For lord knows what reason, you actually want to transmit textual
    data from your Windows PC to the Unix server without the character
    translation. Guess what: I'd do the exact same thing here. On the
    Windows client, get the numerical value of your characters. Hopefully
    your character set has each character being a fixed size. Convert
    these to bytes, and then convert that into a String representation
    (see the sketch below).
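
    A rough, untested sketch of that byte-for-byte idea, using hex digits
    so the result stays plain ASCII (roughly doubling the size):

    class HexCodec {
        // Sender: turn arbitrary bytes into a plain-ASCII string, two
        // hex digits per byte.
        static String toHex(byte[] data) {
            StringBuffer sb = new StringBuffer(data.length * 2);
            for (int i = 0; i < data.length; i++) {
                int b = data[i] & 0xff;
                if (b < 0x10) sb.append('0');   // keep a fixed width of two digits
                sb.append(Integer.toHexString(b));
            }
            return sb.toString();
        }

        // Receiver: turn the hex string back into the exact original bytes.
        static byte[] fromHex(String hex) {
            byte[] out = new byte[hex.length() / 2];
            for (int i = 0; i < out.length; i++) {
                out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
            }
            return out;
        }
    }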

    Yamin
     
    Yamin, Oct 22, 2004
    #6
  7. Guest

    I'll try to be a little bit more specific, while answering some of the
    questions that came in replies:

    I do not have control over the client, as a matter of fact I don't even
    know for sure if it is a Windows client.

    The connections my server accepts include non-http requests, so I
    cannot use URLConnection (or can I?)

    As a matter of fact, the data I'm processing is in the HTTP body (in
    case I do receive an HTTP request). In this body I can also find
    information about the encoding. In fact, the encoding specified there
    is the only way for me to know which kind of characters I'm dealing
    with... Should this information be unavailable (since that is also
    possible), then I can consider the data to be encoded as UTF-8.

    Thanks a bunch for all the replies, they keep me going a bit, seeing
    as this is a problem I've been struggling with for the last week or
    so.

    Piet


    wrote:
    > Hello all,
    >
    > I want to use a Windows client to send a character stream via http
    > to a *nix server, that should interpret my characters as such. My
    > problem
    > lies with non-ascii chars. For example: if my char is 'ø' (byte 248),
    > I
    > want to keep that byte, and not the *nix interpretation of it 'Ã¸'
    > (bytes 195,184).
    >
    > The best way of handling things would probably be to do no
    > interpretation,
    > but since I'm parsing the HTTP headers line by line, I would like to
    > use
    > Java's BufferedReader (that makes already a byte-to-char
    > transformation).
    > Besides that, I'm using a bean like structure holding the information
    > passed by the charstream, to process. Since my bean has some String
    > fields, it is needless to say that some byte-char transformation has to
    > be
    > done.
    >
    > Basically, what I would like to get to work is the following:
    > win-client passes byte 248 via http to *nix server. *nix server
    > implements a
    > method getString() that reproduces internally correct chars, but in the
    > end
    > defined
    > char encoding.
    >
    > Is this possible?
    >
    > Sincerely,
    >
    > Piet de Jong
     
    , Oct 25, 2004
    #7
  8. wrote:

    > I'll try to be a little bit more specific, while answering some of the
    > questions that came in replies:
    >
    > I do not have control over the client, as a matter of fact I don't even
    > know for sure if it is a Windows client.
    >
    > The connections my server accepts include non-http requests, so I
    > cannot use URLConnection (or can I?)
    >
    > As a matter of fact, the data I'm processing is in the HTTP body (in
    > case I do receive a HTTP request). In this body I can also find
    > information about the encoding. As a matter of fact, the encoding
    > specified here is the only way for me of knowing with which kind of
    > characters I'm dealing... Should however this information be
    > unavailable (since also that is possible), then I can consider the data
    > encoded as UTF-8..


    Well that doesn't sound so hard. You locate the encoding specification
    in the message, if present, and use it to construct an InputStreamReader
    around the input byte stream. If no encoding specification is available
    then you do the same assuming UTF-8 as a default. Read the content via
    the InputStreamReader, either directly or indirectly, and you've got it.
    If you remember the encoding used to read the request data then you
    can apply the same encoding to the outbound response data.
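
    In outline, something like this (a sketch only; how the declared
    encoding is located in the message is left to your parser):

    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.io.UnsupportedEncodingException;

    class BodyDecoder {
        // Wrap the body bytes in a Reader using the encoding declared in the
        // message, falling back to UTF-8 when no encoding is declared.
        static Reader bodyReader(InputStream body, String declaredCharset)
                throws UnsupportedEncodingException {
            String charset = (declaredCharset != null) ? declaredCharset : "UTF-8";
            return new InputStreamReader(body, charset);
        }
    }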

    Do note, however, that this depends on the client either providing a
    correct character encoding specification or using the same encoding that
    the server assumes for a default (UTF-8). If the client, for instance,
    encodes the data with ISO-8859-1 but doesn't specify an encoding
    (perhaps because ISO-8859-1 is the default for HTTP) then your program
    will not behave as desired. If this is not satisfactory then you need
    to change the server design.


    John Bollinger
     
    John C. Bollinger, Oct 25, 2004
    #8
  9. Guest

    So wait up, I should make a backup of the entire message then, because
    I need it to locate the encoding specification... BTW what do you mean
    by directly or indirectly reading the content?

    To determine encoding I would parse the entire message using any given
    charset, since the specification would be in plain old ASCII anyway...

    Whether or not the client is providing a correct char encoding
    specification is not my concern. I (the server) am acting according to
    the specifications...

    Thx,

    Piet

    PS: for further explanation of my problem, take also a look at
    http://groups-beta.google.com/group..._doneTitle=Back to overview&#ca03e45987b8aafa
     
    , Oct 26, 2004
    #9
  10. wrote:

    > So wait up, I should make a backup of the entire message then, because
    > I need it to locate the encoding specification...


    Yes, you may need to cache up to the entire message if you need to
    search for the encoding specification inside the message itself. One
    technique you could use would be to copy all the bytes read to a
    ByteArrayOutputStream until you are satisfied that you know what
    character encoding to apply. If you still have unread input at that
    point then you could wrap the ByteArrayOutputStream's byte array and the
    socket's InputStream together into one logical InputStream.
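
    One way to stitch the two back together is java.io.SequenceInputStream;
    roughly (a sketch, with buffer sizing and error handling omitted):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.io.SequenceInputStream;

    class StreamRejoiner {
        // 'cached' holds the bytes already consumed while looking for the
        // encoding specification; 'rest' is the socket's InputStream with
        // the unread remainder. The SequenceInputStream replays the cached
        // bytes first and then continues with the live stream, so the
        // message parser sees one logical InputStream.
        static InputStream rejoin(ByteArrayOutputStream cached, InputStream rest) {
            return new SequenceInputStream(
                new ByteArrayInputStream(cached.toByteArray()), rest);
        }
    }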

    I would strongly advise you to perform a protocol-specific read of
    incoming messages before handing the message content off to your
    message parser. Consider, for instance, just some of the complexities
    of HTTP, which is only one of the protocols you want to support:

    (1) Part of the message (the request line and headers) is required by
    the protocol to be encoded in a specific charset, and it may include a
    specification of a different charset for another part of the message

    (2) The message body may have been subjected to a transfer encoding
    (e.g. chunked or gzip) that must be decoded before the body is otherwise
    processed

    (3) In the case of a persistent connection, the end of the message may
    not be marked by the end of the input stream.

    > BTW what do you mean
    > by directly or indirectly reading the content?


    I mean that you could read directly from the InputStreamReader or from
    some other reader (e.g. a BufferedReader) wrapped around it.

    > To determine encoding I would parse the entire message using any given
    > charset, since the specification would be in plain old ASCII anyway...


    No, you cannot use just any charset. Many of the more common ones do
    coincide with ASCII over the range of ASCII characters (0x00 - 0x7f),
    but others do not. You must also be careful to detect malformed
    messages and handle them appropriately, and it is conceivable that some
    malformed messages would be disguised by some charsets.

    > Whether or not the client is providing a correct char encoding
    > specification is not my concern. I (the server) am acting according to
    > the specifications...


    Well, yes and no. A user might very reasonably argue that messages sent
    via HTTP have the character encoding *implicitly* specified as
    ISO-8859-1, the default for HTTP, unless that is explicitly overridden
    in the message header. Combined with the fact that the user might not
    have direct control over the client's behavior in this regard, I think
    you would be well advised to take such considerations into account.


    John Bollinger
     
    John C. Bollinger, Oct 26, 2004
    #10
  11. Guest

    Hi John,

    Taking all of your considerations into account, I guess I'd be better
    off using an HttpURLConnection, like somebody proposed earlier. The
    problem, however, is that in fact I'm supporting one other protocol as
    well. I might be able to catch that case before instantiating the
    HttpURLConnection, since the differentiation between the two protocols
    is based on the first byte. I've been struggling, however, with how
    and where to instantiate the HttpURLConnection. Do you think I should
    rewrite the framework?

    I'm inserting some code snippets, trying to make things more clear...

    // Server class

    ServerSocket ss = new ServerSocket(port);
    while (true)
    {
        Socket s = ss.accept();
        Worker w = null;
        // ...
        synchronized (semaphore)
        {
            w = new Worker();
            w.setSocket(s);
        }
    }


    // Worker class

    private Socket s;
    private InputStream is;
    private OutputStream os;

    Worker()
    {
        s = null;
    }

    synchronized void setSocket(Socket s)
    {
        this.s = s;
        notify();
    }

    public synchronized void run()
    {
        while (true)
        {
            if (s == null)
            {
                try
                {
                    wait();
                }
                catch (InterruptedException e)
                {
                    continue;
                }
            }
            try
            {
                execute();
            }
            catch (Throwable e)
            {
                e.printStackTrace();
            }
            // ...
        }
    }

    void execute()
        throws Throwable
    {
        s.setSoTimeout(Server.timeout * 1000);
        s.setTcpNoDelay(true);
        is = new BufferedInputStream(s.getInputStream());
        os = s.getOutputStream();

        try
        {
            PushbackInputStream pis = new PushbackInputStream(is);
            int pduType = pis.read();
            pis.unread((byte) pduType);
            if (pduType == 0x1) handleOtherProtocol(); // ...

            // ...
        }
        finally
        {
            try
            {
                is.close();
                os.close();
            }
            finally
            {
                s.close();
            }
        }
    }
     
    , Oct 27, 2004
    #11
  12. wrote:

    > Hi John,
    >
    > Taking all of your considerations into account, I guess I'd be better
    > off using an HttpURLConnection, like somebody proposed earlier. The
    > problem, however, is that in fact I'm supporting one other protocol as
    > well. I might be able to catch that case before instantiating the
    > HttpURLConnection, since the differentiation between the two protocols
    > is based on the first byte. I've been struggling, however, with how
    > and where to instantiate the HttpURLConnection. Do you think I should
    > rewrite the framework?


    The _server_ cannot use an HttpURLConnection. The client would use one
    to *send* an HTTP request; if you need to receive and process HTTP
    requests then you need something else. Have you considered writing your
    application as a web application using servlets? The servlet
    architecture is designed for the kind of thing you are trying to do, at
    least on the HTTP side. It might well be possible to get it to handle
    your other protocol as well, as the base Servlet class is not
    protocol-specific. A big advantage of servlets is that the low-level
    details are handled for you, which relieves you of a major coding and
    maintenance burden. I have never looked into the details of teaching
    Tomcat about protocols other than HTTP, but I think you'd still be ahead
    even if you had to put an adapter in front of an HTTP-only servlet to
    translate your other protocol(s) into HTTP.
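
    On the HTTP side, the skeleton would be roughly this (a sketch only,
    not a drop-in replacement for your worker; the class name is made up):

    import java.io.BufferedReader;
    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class MessageServlet extends HttpServlet {
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // The container has already parsed the request line and headers.
            // getReader() applies the charset declared in the request's
            // Content-Type header (falling back to ISO-8859-1 if none is given).
            BufferedReader in = req.getReader();
            // ... parse the body, fill the bean, and write the response via resp ...
        }
    }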


    John Bollinger
     
    John C. Bollinger, Oct 27, 2004
    #12
  13. Guest

    No, I haven't considered it, since I don't want to introduce a servlet
    container for size's sake. My package is currently no more than 200k
    and I would like to keep it around that size. Introducing Tomcat (or
    any other servlet container for that matter) would significantly
    increase this, I guess... but maybe I'm wrong? Would you happen to
    know any existing packages a lot smaller than Tomcat?

    Piet
     
    , Oct 27, 2004
    #13
  14. wrote in message news:<>...
    > No, I haven't considered it, since I don't want to introduce a servlet
    > container for size's sake. My package is currently no more than 200k
    > and I would like to keep it around that size. Introducing Tomcat (or
    > any other servlet container for that matter) would significantly
    > increase this, I guess... but maybe I'm wrong? Would you happen to
    > know any existing packages a lot smaller than Tomcat?
    >
    > Piet



    http://jetty.mortbay.org/jetty/index.html


    Their site says:

    A HTTP/1.1 server can be configured in a jar file under 300KB.

    I believe that includes simple support for servlets, no JSPs, etc.
     
    Bryan Castillo, Oct 28, 2004
    #14
