utf8 url encoded letters and their %values

pulvens

First my problem:

I have JavaScript running on the server side, receiving both POST
and GET requests.
The POST requests work fine; non-Latin-1 letters are displayed
correctly.
But if I type the request directly into the address bar, I get the
replacement character (65533) instead.
If I type ?text=ø, the browser sends ?text=%F8, and the receiving
script fails to read it correctly.
If I type ?text=%C3%B8, the browser sends it correctly and the receiving
script succeeds.

What is the difference? Is %F8 UTF-16 or UTF-8? And is there any way
to influence it?

The problem is the same in both IE8 and Firefox 3.5.
The page is in UTF-8; both the HTML header and the page meta tag
specify that.

The script runs on an IIS server.
Here is the test script:
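function main() {
    // Request('text') is the ASP request value, converted to a plain string.
    var t = String(Request('text'));
    // Print each character's index and its UTF-16 code unit.
    for (var i = 0; i < t.length; i++) {
        Content.add(i + '=' + t.charCodeAt(i) + '<br />');
    }
    Content.add(t + '<br />');
}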

I guess my questions are:
Why does the JavaScript String(obj) function not understand the %F8
encoding?
Do you have any suggestions on how to solve this problem?

Thanks
-Rune
 
Musaul Karim

Have you checked at which point it fails on the server? Is it at
Request(), or is it at String()?

You might want to check what value is returned by Request().

F8 is basically the hex value for utf-8 character 'ø'. You'll get this
if you run escape('ø');

%C3%B8 is the URL-encoded value for the same character, i.e.
encodeURI('ø');

If you are generating the get request via script, you should be
encoding your characters using encodeURI(). afaik escape() is
deprecated as it doesn't handle non-ascii characters properly.
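
The difference between the two functions can be checked directly. This is a minimal sketch, assuming an engine that still ships the legacy escape()/unescape() pair (JScript, browsers, and Node.js do). The ø is written as \u00F8 so the result does not depend on the encoding of the script file itself.

```javascript
// U+00F8 is the letter 'ø', written as an escape sequence.
var ch = '\u00F8';

// escape() writes the code point itself as a single %XX.
var legacy = escape(ch);                 // "%F8"

// encodeURIComponent() (like encodeURI()) writes the UTF-8 bytes.
var utf8 = encodeURIComponent(ch);       // "%C3%B8"

// Each form round-trips only through its own decoder.
var backLegacy = unescape(legacy);       // "\u00F8"
var backUtf8 = decodeURIComponent(utf8); // "\u00F8"
```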

How do you intend to generate the request?
 
Musaul Karim

Hans-Georg said:
> Nope, any character at or above 128 (0x80) is encoded with at
> least two bytes in UTF-8. Only ASCII characters (up to 127) are
> identical in UTF-8 and ASCII or ANSI.

Yeah, sorry. I meant the Unicode code point hex value is F8. The UTF-8
hex value is C3 B8.
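
How code point F8 becomes the two bytes C3 B8 can be worked out by hand. The sketch below is only an illustration: utf8TwoByte is a hypothetical helper name, and it deliberately covers only the two-byte range U+0080..U+07FF.

```javascript
// Encode one code point in the range U+0080..U+07FF as two UTF-8 bytes.
function utf8TwoByte(codePoint) {
  if (codePoint < 0x80 || codePoint > 0x7FF) {
    throw new RangeError('only U+0080..U+07FF is handled in this sketch');
  }
  var lead = 0xC0 | (codePoint >> 6);     // 110xxxxx carries the upper bits
  var trail = 0x80 | (codePoint & 0x3F);  // 10xxxxxx carries the low 6 bits
  return [lead, trail];
}

var bytes = utf8TwoByte(0xF8);  // [0xC3, 0xB8], i.e. %C3%B8 when percent-encoded
```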
 
pulvens

> Have you checked at which point it fails on the server? Is it at
> Request(), or is it at String()?

Well, good point. Request returns the misinterpreted values. So this
might easily be the wrong forum.
If I look at the query string as such, everything is correct.

> F8 is basically the hex value for utf-8 character 'ø'. You'll get this
> if you run escape('ø');
>
> %C3%B8 is the URL-encoded value for the same character, i.e.
> encodeURI('ø');

Thanks for the clarification. I was momentarily confused there.

> If you are generating the get request via script, you should be
> encoding your characters using encodeURI(). afaik escape() is
> deprecated as it doesn't handle non-ascii characters properly.

Good to know. But I would like to be able to get requests from
different sources.

> How do you intend to generate the request?

Well, mainly from a Perl script on some other server. But I was
hoping to make a general service that could accept typed-in URLs as
well. In that case Firefox sends ø as %F8 and IE as ?? (neither %F8
nor %C3%B8, maybe as the Unicode character). So I am dropping that idea
for now.

Is it possible to write a JavaScript function that reads and
converts the requests correctly, i.e. guesses the encoding and converts
it to some 'standard' encoding?

-Rune
 
Thomas 'PointedEars' Lahn

pulvens said:
> I have a javascript running at the server side, receiving both POST
> and GET requests.

You should have said that you are using JScript in ASP.

> The POST request works fine, non-Latin-1 letters are displayed
> correctly.

An HTTP POST request should include a Content-Type header that defines the
format and encoding of the body (if the charset parameter is omitted, the
default is ISO-8859-1), so the server, regardless of its default character
set, should be able to decode it properly.

GET requests are different because there is no message body for which a
Content-Type header would apply (which is why RFC 3986 defines it instead.)

Also keep in mind that the supported URI length is limited in IE/MSHTML to
2083 characters, so you probably would want to use POST requests anyway:

> But if I type the request directly into the address bar, I get the
> replacement (65533) char instead.

That is probably because the query part cannot be decoded by the server. A
Unicode-supporting application is required to use a replacement sequence if
decoding of a byte sequence is not possible; U+FFFD is the primary
possibility for doing that. (There are at least four others, see below.)
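
The replacement behaviour can be reproduced outside ASP. This is a sketch under the assumption of a runtime that provides the TextDecoder API (modern browsers, Node.js); classic ASP JScript does not have it, so it only illustrates what a UTF-8 decoder does with the byte F8.

```javascript
// Assumes a runtime with TextDecoder; not available in classic ASP JScript.
var decoder = new TextDecoder('utf-8');

// 0xF8 can never appear in valid UTF-8, so it decodes to U+FFFD.
var bad = decoder.decode(new Uint8Array([0xF8]));
var badCode = bad.charCodeAt(0);    // 65533

// 0xC3 0xB8 is the valid UTF-8 sequence for U+00F8.
var good = decoder.decode(new Uint8Array([0xC3, 0xB8]));
var goodCode = good.charCodeAt(0);  // 248
```
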

> If I type ?text=ø the browser sends ?text=%F8, and the receiving
> script fails to read it correctly.

Understandable, see below.

> If I type ?text=%C3%B8 the browser sends it correctly and the receiving
> script succeeds.

Understandable, too.

> What is the difference, is %F8 UTF-16 or UTF-8?

`%F8' can be neither, and no part of either. BTW, UTF-16 and UTF-8 are only
different character encodings for the same Unicode character set (see
below).

> And is there any way to influence it?
>
> The problem is the same for both IE8 and Firefox 3.5.
> The page is in utf-8, both the html header and the page meta tag
> specify that.

There is no HTML header. There is an HTTP (response) header, and if that
header begins with `Content-Type:' then its value takes precedence over the
META _element_ (<meta http-equiv="Content-Type" content="...">). (You only
need the META element when the resource should be displayed without an HTTP
server; unfortunately few browsers manage to duplicate the Content-Type
header as a META element when saving the document on the local filesystem.)

> The script runs at an IIS server.
> Here is the test script:
> function main() {
>     var t = String(Request('text'));
>     for(var i =0; i < t.length;i++){

    for (var i = 0, len = t.length; i < len; i++) {

is more efficient and more readable. (Allman style¹, which I use and
recommend, even requires the brace to be placed below the `f', but YMMV.)

>         Content.add(i+'='+t.charCodeAt(i)+'<br />');
>     }
>     Content.add(t+'<br />');
> }

> I guess my questions are:
> Why does the javascript String(obj) function not understand the %F8
> encoding?

First of all, AFAIK String() does nothing here but return the passed
value, as that is a string value already. AFAIK, it is never going to decode
anything. Second, how non-ASCII characters are encoded depends primarily on
the client.

If the client uses a percent-encoding not defined in RFC 3986, you have to
deal with that, for example by guessing the encoding used and applying
unescape(). That is relatively easy to do for some codes for characters of
8-bit character sets because a UTF-8 code unit is never going to be one of
C0, C1, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, and FF.²
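
The guessing approach could look roughly like the sketch below. decodeQueryValue is a hypothetical helper, not part of ASP or JScript. It assumes you pass it the raw, still percent-encoded value (for example, one cut out of the QUERY_STRING server variable), not the string that Request() has already decoded. It tries UTF-8 first and falls back to unescape() only when the bytes are not valid UTF-8.

```javascript
// Hypothetical helper: decode one raw (still percent-encoded) query value.
function decodeQueryValue(raw) {
  var s = String(raw);
  try {
    // Valid UTF-8 percent-encoding (RFC 3986 style) decodes cleanly.
    return decodeURIComponent(s);
  } catch (e) {
    // decodeURIComponent throws URIError on invalid UTF-8 such as a lone %F8.
    // unescape() maps each %XX to the code point XX, i.e. reads it as Latin-1.
    return unescape(s);
  }
}

var fromUtf8 = decodeQueryValue('%C3%B8');  // "\u00F8"
var fromLatin1 = decodeQueryValue('%F8');   // "\u00F8"
```

This remains a guess: a Latin-1 sequence that happens to be valid UTF-8 (for instance %C3%B8 meant as two Latin-1 characters) is read as UTF-8.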

It would be better, though, if the client used UTF-8 percent-encoding as
defined by RFC 3986 to begin with. You can encourage the client to do so if
you declare and use UTF-8 (instead of an encoding for an 8-bit character
set, like ISO-8859-1) for serving your content, the former with the
following HTTP header:

Content-Type: ...; charset=utf-8

or any case variation thereof (see also
<http://www.iana.org/assignments/character-sets>). (Observe that I am using
UTF-8 for encoding this posting [to pass the footnote 1 character], Google
Groups that you are using should be able to decode it, and your browser
should be able to display it.)

However, your posting suggests that the client might not behave anyway; in
that case the problem must be the client or the form with which the data is
submitted, because it works on a great number of other Web sites, including
those which I have worked on.

But better check the *received* response headers first, for example in
Firefox with Firebug or LiveHTTPHeaders.

> Do you have any suggestions how to solve this problem?

HTH


PointedEars
___________
¹ <http://en.wikipedia.org/wiki/Indent_style>
² <http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences>
 
