Character encoding between Win and *nix

pietdejong

Hello all,

I want to use a Windows client to send a character stream via HTTP to a
*nix server, which should interpret my characters as such. My problem
lies with non-ASCII chars. For example: if my char is 'ø' (byte 248), I
want to keep that byte, and not the *nix interpretation of it, 'Ã¸'
(bytes 195, 184).

The best way of handling things would probably be to do no
interpretation at all, but since I'm parsing the HTTP headers line by
line, I would like to use Java's BufferedReader (which already makes a
byte-to-char transformation). Besides that, I'm using a bean-like
structure to hold the information passed by the char stream for
processing. Since my bean has some String fields, it is needless to say
that some byte-char transformation has to be done.

Basically, what I would like to get to work is the following: the
win-client passes byte 248 via HTTP to the *nix server. The *nix server
implements a method getString() that reproduces internally correct
chars, but in the end writes the initial bytes again... or, in case
that's not possible, a user-defined char encoding.

Is this possible?

Sincerely,

Piet de Jong
 
Bryan Castillo

> I want to use a Windows client to send a character stream via HTTP to
> a *nix server, which should interpret my characters as such. My
> problem lies with non-ASCII chars.
[snip]

I would suggest using Base64 encoding. Jakarta commons-codec has a
Base64 object you can use to encode and decode Base64 strings to and
from byte arrays. I don't know what you would use for your Windows
client code, though.
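
For illustration, a minimal round-trip sketch with the commons-codec
Base64 class (the wrapper class name is made up for the demo; the byte
248 is the 'ø' from the original post):

import org.apache.commons.codec.binary.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] raw = { (byte) 248 };             // 'ø' in ISO-8859-1

        // Encode to pure ASCII ("+A=="), safe to embed in any HTTP message.
        byte[] encoded = Base64.encodeBase64(raw);
        String wire = new String(encoded);       // ASCII only, so no charset issues

        // Decode on the server: the original byte 248 comes back untouched.
        byte[] decoded = Base64.decodeBase64(wire.getBytes());
        System.out.println(decoded[0] & 0xff);   // prints 248
    }
}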
 
Thomas Weidenfeller

> I want to use a Windows client to send a character stream via http
> to a *nix server, that should interpret my characters as such.

Whatever that means. Later you tell us you want a byte-interpretation,
not a "character as such" interpretation.

In general, I would suggest you read the HTTP standards and construct a
data stream entirely in line with HTTP. You hint that you do the
transmission in the header, which is not a good idea. I would suggest
you transmit your data in the body.

I would also suggest you stop mixing characters and bytes. If you want
to transmit bytes, don't just do a 1:1 substitution with characters.
Don't use the Java character IO system for binary IO. If you do
transmit your data in the header, also consider the standards for
producing valid header data.
> The best way of handling things would probably be to do no
> interpretation, but since I'm parsing the HTTP headers line by line, I
> would like to use Java's BufferedReader (that makes already a
> byte-to-char transformation).

Write a real HTTP parser instead, or use one of the existing ones.
URLConnection is a simple solution that comes with Java. The Apache
project has much more sophisticated tools.
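
On the client side, a standards-conforming request along those lines
could be as simple as this sketch (the URL is made up; note the charset
declared in the Content-Type header):

import java.io.*;
import java.net.*;

public class PostClient {
    public static void main(String[] args) throws IOException {
        // Hypothetical server address, for illustration only.
        URLConnection conn = new URL("http://server.example/submit").openConnection();
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; charset=ISO-8859-1");

        // The writer encodes 'ø' as the single byte 248, per the declared charset.
        Writer out = new OutputStreamWriter(conn.getOutputStream(), "ISO-8859-1");
        out.write("\u00f8");
        out.close();

        conn.getInputStream().close();   // actually sends the request
    }
}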
> Basically, what I would like to get to work is the following:
> win-client passes byte 248 via http to *nix server.
[snip]

Again, stop mixing chars and bytes. It doesn't make sense to pile up
more and more code to correct something which you did wrong earlier.
Fix the root cause, not the symptoms. You can't process the entire HTTP
message with a Reader, so don't do it. Your effort is better spent on
getting the basics right than on writing correction code.

/Thomas
 
Thomas Fritsch

> Hello all,

Hello!

> I want to use a Windows client to send a character stream via http
> to a *nix server, that should interpret my characters as such. My
> problem lies with non-ascii chars. For example: if my char is 'ø'
> (byte 248), I

Or to be more precise: byte 248 is the representation of char 'ø' in
the ISO-8859-1 encoding.

> want to keep that byte, and not the *nix interpretation of it 'Ã¸'
> (bytes 195,184).

Bytes {195, 184} are the representation of char 'ø' in the UTF-8
encoding.
> The best way of handling things would probably be to do no
> interpretation,

Doing "no interpretation" probably suggests using the ISO-8859-1
encoding on both sides, because this encoding translates char-values to
byte-values simply by throwing away the high byte of each char. But
even then you can't handle chars beyond '\u00ff'.
> but since I'm parsing the HTTP headers line by line, I would like to
> use Java's BufferedReader (that makes already a byte-to-char
> transformation).

....but probably the wrong transformation. Construct your
BufferedReader by:

    new BufferedReader(new InputStreamReader(inputStream, "ISO-8859-1"))

and similarly, with the same encoding name, for the OutputStreamWriter
on the other machine. Do *not* omit the encoding name, because then you
would get the Java default encoding, which may be (and in your case
*is*) different on different machines. It is essential that both sides
(client and server) agree on a common char-encoding when
transmitting/receiving text.
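
A minimal sketch of that round trip, with ByteArray streams standing in
for the real client and server streams:

import java.io.*;

public class Iso88591RoundTrip {
    public static void main(String[] args) throws IOException {
        // Writing side: 'ø' (\u00f8) becomes the single byte 248.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, "ISO-8859-1");
        out.write("\u00f8");
        out.close();
        System.out.println(bytes.toByteArray()[0] & 0xff);   // prints 248

        // Reading side: the same byte comes back as 'ø', not as two bytes.
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes.toByteArray()), "ISO-8859-1"));
        System.out.println((char) in.read());                // prints ø
    }
}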
> Besides that, I'm using a bean like structure holding the information
> passed by the charstream, to process.
[snip]
BTW: You can specify a char-encoding in the HTTP header, for example
like "text/html; charset=ISO-8859-1". You can also set the Java default
encoding by a special system property, for example:

    java -Dfile.encoding=ISO-8859-1 ...

Hope this helps...
 
A. Bolmarcich

> I want to use a Windows client to send a character stream via http
> to a *nix server, that should interpret my characters as such. My
> problem lies with non-ascii chars. For example: if my char is 'ø'
> (byte 248), I want to keep that byte, and not the *nix interpretation
> of it 'Ã¸' (bytes 195,184).

If the server is getting those bytes, chances are it is because the
client is sending the UTF-8 encoding of the '\xf8' (byte 248)
character.
> The best way of handling things would probably be to do no
> interpretation, but since I'm parsing the HTTP headers line by line,
[snip]

If an HTTP header value contains UTF-8 encoded values, you have a
larger concern. Values of an HTTP header should be encoded in
ISO-8859-1. If the client is encoding them in UTF-8, it should be using
the "encoded-words" syntax given in
http://www.faqs.org/rfcs/rfc2047.html.

It would help if you posted the bytes that the server is receiving for
the whole header (and not just the one character). Until we know what
bytes are being received, it is difficult to say how those bytes are to
be interpreted as characters.
 
Yamin

> I want to use a Windows client to send a character stream via http
> to a *nix server, that should interpret my characters as such. My
> problem lies with non-ascii chars. For example: if my char is 'ø'
> (byte 248), I want to keep that byte, and not the *nix interpretation
> of it 'Ã¸' (bytes 195,184). ....

I'd really like to know what kind of data you are actually passing
between the two programs, whether or not you have control over both
ends, and what you are actually trying to do. Two possibilities:

1. You actually want to transmit binary data (numbers like int...). If
it's meant to be binary data, then why not simply transmit the string
representation of those numbers on a byte-for-byte basis? On your
Windows client you simply encode everything on a byte-for-byte basis
into a string... really easy to do. Then transmit that big string. You
will end up using 2x the bandwidth compared to transmitting binary-only
data... but that's what you get for sending through HTTP.

2. For lord knows what reason, you actually want to transmit textual
data from your Windows PC to the Unix server without the character
translation... guess what, I'd do the exact same thing here. On the
Windows client, get the numerical value of your characters. Hopefully
your character set has each character being a set size. Convert these
to bytes, and then convert that into a String representation, as
sketched below.
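
A sketch of that byte-for-byte idea using hex digits (the helper names
are mine, not Yamin's); each byte becomes two ASCII characters, hence
the 2x bandwidth:

public class HexCodec {
    // Encode each byte as two hex characters, all plain ASCII.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (int i = 0; i < data.length; i++) {
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    // Reverse the transformation on the receiving side.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        String wire = toHex(new byte[] { (byte) 248 });
        System.out.println(wire);                       // prints "f8"
        System.out.println(fromHex(wire)[0] & 0xff);    // prints 248
    }
}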

Yamin
 
pietdejong

I'll try to be a little bit more specific, while answering some of the
questions that came in replies:

I do not have control over the client; as a matter of fact, I don't
even know for sure that it is a Windows client.

The connections my server accepts include non-HTTP requests, so I
cannot use URLConnection (or can I?)

The data I'm processing is in the HTTP body (in case I do receive an
HTTP request). In this body I can also find information about the
encoding. In fact, the encoding specified there is the only way for me
of knowing what kind of characters I'm dealing with... Should this
information be unavailable (which is also possible), then I can
consider the data encoded as UTF-8.

Thanks a bunch for all the replies, they keep me going a bit... seeing
as this is a problem I've been struggling with for the last week or so.

Piet
 
John C. Bollinger

> The data I'm processing is in the HTTP body (in case I do receive an
> HTTP request). In this body I can also find information about the
> encoding. [snip] Should this information be unavailable (which is
> also possible), then I can consider the data encoded as UTF-8.

Well, that doesn't sound so hard. You locate the encoding specification
in the message, if present, and use it to construct an
InputStreamReader around the input byte stream. If no encoding
specification is available, then you do the same assuming UTF-8 as a
default. Read the content via the InputStreamReader, either directly or
indirectly, and you've got it. If you remember the encoding used to
read the request data, then you can apply the same encoding to the
outbound response data. See the sketch below.
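
A compact sketch of that flow (detectEncoding() is a hypothetical
helper standing in for the actual message parsing):

import java.io.*;

public class EncodingAwareRead {
    // Hypothetical: scans the buffered message for an encoding
    // specification and returns its name, or null if none was found.
    static String detectEncoding(byte[] message) {
        return null;   // placeholder for real parsing logic
    }

    static Reader openContent(byte[] message) throws IOException {
        String enc = detectEncoding(message);
        if (enc == null) {
            enc = "UTF-8";   // server-side default
        }
        return new InputStreamReader(new ByteArrayInputStream(message), enc);
    }
}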

Do note, however, that this depends on the client either providing a
correct character encoding specification or using the same encoding that
the server assumes for a default (UTF-8). If the client, for instance,
encodes the data with ISO-8859-1 but doesn't specify an encoding
(perhaps because ISO-8859-1 is the default for HTTP) then your program
will not behave as desired. If this is not satisfactory then you need
to change the server design.


John Bollinger
(e-mail address removed)
 
pietdejong

So wait up, I should make a backup of the entire message then, because
I need it to locate the encoding specification... BTW, what do you mean
by directly or indirectly reading the content?

To determine the encoding I would parse the entire message using any
given charset, since the specification would be in plain old ASCII
anyway...

Whether or not the client is providing a correct char encoding
specification is not my concern. I (the server) am acting according to
the specifications...

Thx,

Piet

PS: for further explanation of my problem, take also a look at
http://groups-beta.google.com/group..._doneTitle=Back+to+overview&#ca03e45987b8aafa
 
John C. Bollinger

> So wait up, I should make a backup of the entire message then,
> because I need it to locate the encoding specification...

Yes, you may need to cache up to the entire message if you need to
search for the encoding specification inside the message itself. One
technique you could use would be to copy all the bytes read to a
ByteArrayOutputStream until you are satisfied that you know what
character encoding to apply. If you still have unread input at that
point, then you could wrap the ByteArrayOutputStream's byte array and
the socket's InputStream together into one logical InputStream.
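
One way to realize that wrapping with standard classes
(SequenceInputStream is my suggestion, and encodingKnown() is a
hypothetical test, not something from the thread):

import java.io.*;

public class CachingRead {
    static InputStream cacheUntilEncodingKnown(InputStream socketIn)
            throws IOException {
        ByteArrayOutputStream cache = new ByteArrayOutputStream();
        int b;
        while (!encodingKnown(cache.toByteArray())
                && (b = socketIn.read()) != -1) {
            cache.write(b);   // remember every byte we consume
        }
        // Replay the cached bytes, then continue with the live stream.
        return new SequenceInputStream(
                new ByteArrayInputStream(cache.toByteArray()), socketIn);
    }

    // Hypothetical: true once the bytes seen so far reveal the charset.
    static boolean encodingKnown(byte[] soFar) {
        return false;   // placeholder for real detection logic
    }
}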

I would strongly advise you to perform a protocol-specific read of
incoming messages before handing the message content off to your
message parser. Consider, for instance, just some of the complexities
of HTTP, which is only one of the protocols you want to support:

(1) Parts of the message (request line and headers) are required by the
protocol to be encoded in a specific charset, and may include a
specification of a different charset for another part of the message.

(2) The message body may have been subjected to a transfer encoding
(e.g. chunked or gzip) that must be decoded before the body is
otherwise processed.

(3) In the case of a persistent connection, the end of the message may
not be marked by the end of the input stream.
> BTW, what do you mean by directly or indirectly reading the content?

I mean that you could read directly from the InputStreamReader or from
some other reader (e.g. a BufferedReader) wrapped around it.
> To determine the encoding I would parse the entire message using any
> given charset, since the specification would be in plain old ASCII
> anyway...

No, you cannot use just any charset. Many of the more common ones do
coincide with ASCII over the range of ASCII characters (0x00 - 0x7f),
but others do not. You must also be careful to detect malformed
messages and handle them appropriately, and it is conceivable that some
malformed messages would be disguised by some charsets.
> Whether or not the client is providing a correct char encoding
> specification is not my concern. I (the server) am acting according
> to the specifications...

Well, yes and no. A user might very reasonably argue that messages
sent via HTTP have the character encoding *implicitly* specified as
ISO-8859-1, the default for HTTP, unless that is explicitly overridden
in the message header. Combined with the fact that the user might not
have direct control over the client's behavior in this regard, I think
you would be well advised to take such considerations into account.


John Bollinger
(e-mail address removed)
 
pietdejong

Hi John,

Taking all of your considerations into account, I guess I'd be better
off using an HttpURLConnection, like somebody proposed earlier. The
problem, however, is that I'm in fact supporting one other protocol as
well. I might be able to head off that possibility before instantiating
the HttpURLConnection, since the differentiation between both protocols
is based on the first byte. I've been struggling, however, with how and
where to instantiate the HttpURLConnection. Do you think I should
rewrite the framework?

I'm inserting some code snippets, trying to make things more clear...

// Server class

ServerSocket ss = new ServerSocket(port);
while (true)
{
    Socket s = ss.accept();
    Worker w = null;
    ....
    synchronized (semaphore)
    {
        w = new Worker();
        w.setSocket(s);
    }
}


// Worker class

private Socket s;
private InputStream is;
private OutputStream os;

Worker()
{
    s = null;
}

synchronized void setSocket(Socket s)
{
    this.s = s;
    notify();
}

public synchronized void run()
{
    while (true)
    {
        // Loop (rather than a plain if) so that a spurious wakeup or an
        // interrupt doesn't let us proceed with s still null.
        while (s == null)
        {
            try
            {
                wait();
            }
            catch (InterruptedException e)
            {
                // ignore and re-check
            }
        }
        try
        {
            execute();
        }
        catch (Throwable e)
        {
            e.printStackTrace();
        }
        ....
    }
}

void execute()
    throws Throwable
{
    s.setSoTimeout(Server.timeout * 1000);
    s.setTcpNoDelay(true);
    is = new BufferedInputStream(s.getInputStream());
    os = s.getOutputStream();

    try
    {
        PushbackInputStream pis = new PushbackInputStream(is);
        int pduType = pis.read();
        if (pduType == -1)
            return;               // connection closed before any data
        pis.unread(pduType);      // push the byte back for the real parser
        if (pduType == 0x1) handleOtherProtocol()...

        ....
    }
    finally
    {
        try
        {
            is.close();
            os.close();
        }
        finally
        {
            s.close();
        }
    }
}
 
John C. Bollinger

> Taking all of your considerations into account, I guess I'd be better
> off using an HttpURLConnection, like somebody proposed earlier.
[snip]
> I've been struggling, however, with how and where to instantiate the
> HttpURLConnection. Do you think I should rewrite the framework?

The _server_ cannot use an HttpURLConnection. The client would use one
to *send* an HTTP request; if you need to receive and process HTTP
requests, then you need something else. Have you considered writing
your application as a web application using servlets? The servlet
architecture is designed for the kind of thing you are trying to do, at
least on the HTTP side. It might well be possible to get it to handle
your other protocol as well, as the base Servlet class is not
protocol-specific. A big advantage of servlets is that the low-level
details are handled for you, which relieves you of a major coding and
maintenance burden. I have never looked into the details of teaching
Tomcat about protocols other than HTTP, but I think you'd still be
ahead even if you had to put an adapter in front of an HTTP-only
servlet to translate your other protocol(s) into HTTP.
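
To give a flavor of the servlet route, a minimal sketch of an echo
servlet (the class name and behavior are illustrative only, not from
this thread):

import java.io.*;
import javax.servlet.http.*;

public class EchoServlet extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // The container parses the headers; we only ask for the charset.
        String enc = req.getCharacterEncoding();
        if (enc == null) enc = "UTF-8";       // Piet's stated fallback

        BufferedReader in = new BufferedReader(
                new InputStreamReader(req.getInputStream(), enc));
        resp.setContentType("text/plain; charset=" + enc);
        Writer out = new OutputStreamWriter(resp.getOutputStream(), enc);

        // Echo the body back using the same encoding it arrived in.
        for (int c; (c = in.read()) != -1; ) {
            out.write(c);
        }
        out.flush();
    }
}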


John Bollinger
(e-mail address removed)
 
pietdejong

No, I haven't considered it, since I don't want to introduce a servlet
container for size's sake. My package is currently no more than 200k
and I would like to keep it around that size. Introducing Tomcat (or
any other servlet container, for that matter) would significantly
increase this, I guess... but maybe I'm wrong? Would you happen to know
any existing packages a lot smaller than Tomcat?

Piet
 
Bryan Castillo

> No, I haven't considered it, since I don't want to introduce a
> servlet container for size's sake. [snip] Would you happen to know
> any existing packages a lot smaller than Tomcat?


http://jetty.mortbay.org/jetty/index.html

Their site says:

    A HTTP/1.1 server can be configured in a jar file under 300KB.

I believe that includes simple support for servlets, but no JSPs, etc.
 
