different encoding handling between old ASP and ASP.Net

Guest

Hi...

Just noticed something odd... In old ASP if you had query parameters that
were invalid for their encoding (broken utf-8, say), ASP would give you back
chars representing the 8-bit byte value of the broken encoding, so you still
got something for every input byte.

This appears to have changed radically in ASP.Net, going down to the base
System.Text.Encoding object. Now, it appears to simply vaporize bytes that
don't fit in the encoding. You don't even get a ? placeholder like you get
in so many other contexts in ASP.

Could anyone explain why there was such a dramatic change in the handling
of error cases? Is there a way using the .net framework to know if you had
an encoding error?

An example of the input:
/test.aspx?query=%C7%D1%B1%DB%BA%A3%B3%CA%B9%E6
In the above, C7, A3, B3, and E6 don't make a valid utf-8 stream, but
looking for Request.QueryString ("query") gives me the decoded version with
no representation of the offending bytes: just three characters, 1137, 1786,
and 697 (which don't render in IE either, by the way).

Request.QueryString ("query") in ASP would yield a 10-character string, with
each of the original bytes converted to the raw 8-bit value.

Seems like a pretty big difference in handling things and I don't see a way
of getting any kind of indication (Exception or something) that there was a
conversion error.
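
The byte-level facts are easy to check outside .NET. Here is a small sketch in Python (not .NET, but the UTF-8 rules are the same): a strict decoder rejects the stream outright, while an "ignore" policy reproduces exactly the three surviving characters described above.

```python
from urllib.parse import unquote_to_bytes

# The query-string bytes from the example above
raw = unquote_to_bytes("%C7%D1%B1%DB%BA%A3%B3%CA%B9%E6")

# A strict decoder refuses the stream outright -- there is no silent loss.
try:
    raw.decode("utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Skipping bad bytes mirrors what ASP.Net appears to do: C7, A3, B3 and E6
# are dropped, and only the three well-formed two-byte sequences survive.
lenient = raw.decode("utf-8", errors="ignore")

print(strict_ok)                  # False
print([ord(c) for c in lenient])  # [1137, 1786, 697]
```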

Thanks
-mark
 
Joerg Jooss

Mark said:
Hi...

Just noticed something odd... In old ASP if you had query parameters
that were invalid for their encoding (broken utf-8, say), ASP would
give you back chars representing the 8-bit byte value of the broken
encoding, so you still got something for every input byte.

This appears to have changed radically in ASP.Net, going down to the
base System.Text.Encoding object. Now, it appears to simply vaporize
bytes that don't fit in the encoding. You don't even get a ?
placeholder like you get in so many other contexts in asp.

Could anyone explain why there was such a dramatic change in the
handling of error cases? Is there a way using the .net framework to
know if you had an encoding error?

An example of the input:
/test.aspx?query=%C7%D1%B1%DB%BA%A3%B3%CA%B9%E6
In the above, C7, A3, B3, and E6 don't make a valid utf-8 stream, but
looking for Request.QueryString ("query") gives me the decoded
version, just missing any representation of the offending characters,
i.e. three characters 1137, 1786, and 697 (which don't render in IE
either by the way).

Request.QueryString ("query") in ASP would yield a 10-character
string, with each of the original bytes converted to the raw 8-bit
value.

What does that mean? 0xC7, 0xA3, 0xB3, and 0xE6 are all meaningless in
UTF-8. There's no way to replace these bytes with a replacement
character, because that character's meaning would be ambiguous -- is it
the real character or a replacement? Whatever ASP does in this
situation, it's wrong.

Cheers,
 
Guest

Hi Joerg...

Actually, none of the vaporized characters in the original example are
prohibited from utf-8 per se; what was broken about the original example was
that %C7 was followed by %D1; to be legal utf-8, it would have to have been
followed by a continuation byte in the %80-%BF range.

Taken together, the example string that was supposed to be utf-8 *as a
whole* is invalid, and the question was more about what's the appropriate way
to respond to that. ASP responded to an invalid utf-8 string by not trying
to find valid bits in it but by giving as close to a "raw" approximation as
it could.

ASP.Net treats it like panning for gold. It sifts through the stream until
it finds byte combos that are legal, keeps those, and drops the rest. It
doesn't even put in a ? as a placeholder, like so many of the other APIs do. I
don't see how that's any less "wrong" than what ASP does.

What perplexes me more is why the discontinuity? It's just another thing
that won't work the same way when migrating from ASP to ASP.Net. If there's
a rationalization why picking out bits and pieces from an invalid stream is
better than not trying to translate it at all, I'd be curious to know.

If I were God, I'd say that the "right" way to do it in .Net would be to
throw an invalid format exception when garbage is fed to an Encoding class.
But given how expensive Exception processing is, I could understand why they
might not want to do that. Next down on my most "right" list would be to
have HttpUtility.UrlDecode() return an instance of an object where one
member would be the successfully translated string (if any) and another
member would be an array of the raw bytes. Then you could test the result
and make use of the bits if you chose.
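
That proposed result object is easy to sketch. In Python rather than .NET, and with purely hypothetical names (DecodedParam and decode_param are illustrative, not any real API), it might look like this:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sketch of the result object proposed above: decode strictly,
# and on failure hand back the raw bytes so the caller can decide what to do.
class DecodedParam:
    def __init__(self, text, raw_bytes):
        self.text = text            # decoded string, or None on failure
        self.raw_bytes = raw_bytes  # original input bytes, always available

def decode_param(percent_encoded, encoding="utf-8"):
    raw = unquote_to_bytes(percent_encoded)
    try:
        return DecodedParam(raw.decode(encoding, errors="strict"), raw)
    except UnicodeDecodeError:
        return DecodedParam(None, raw)

good = decode_param("%C3%A9")                         # valid UTF-8 ("é")
bad = decode_param("%C7%D1%B1%DB%BA%A3%B3%CA%B9%E6")  # the broken stream
```

A caller can then test whether text is None and fall back to the raw bytes, which preserves every input byte the way classic ASP did.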

Thanks
_mark
 
Steven Cheng[MSFT]

Hi Mark,

Thanks for your posting.
Yes, I can reproduce the screen you got; however, this is in fact not
caused by an underlying charset processing difference between ASP and
ASP.NET. More precisely, it is caused by the different globalization
support and configuration between ASP and ASP.NET.

In ASP, globalization configuration is limited, so generally there are two
things to set:
1. The codePage value for the serverside page, through
<%@ Language="VBScript" CodePage="65001" %> or
<%
Session.CodePage = 65001
%>

Either of the two approaches above sets the server-side page's request
processing charset to UTF-8 (code page 65001), so the incoming query string
will be decoded as UTF-8. If you set neither, ASP will use the default
charset (the system locale on the server) to decode the string in the
incoming request.

In ASP.NET, we don't need to set these, since ASP.NET by default uses UTF-8
as the request/response encoding; we can find the default setting in
web.config's <globalization> element.

2. When the server page writes content to the client side, the browser will
automatically use the proper encoding to display the page. In ASP we can
use the following code to set it explicitly (if not, the server's default
charset will be used):
<%
Response.Charset = "UTF-8"
%>

In ASP.NET, as I mentioned above, UTF-8 is also the default setting.
This info also tells the client browser to automatically choose the
correct encoding to display the page content. If we don't set it
explicitly, we need to manually adjust the client browser's
View --> Encoding to UTF-8 to display the correct content.

Now, as for the byte sequence you mentioned:

%C7%D1%B1%DB%BA%A3%B3%CA%B9%E6

when using UTF-8 to decode them, they'll be parsed as three characters that
typically have no glyph, so we should see three empty squares on the page
(this is the correct behavior). We can also confirm this by running the
code below in a .NET WinForms app:
winform app:
================
// Raw bytes from the query string %C7%D1%B1%DB%BA%A3%B3%CA%B9%E6
byte[] bytes = {0xC7,0xD1,0xB1,0xDB,0xBA,0xA3,0xB3,0xCA,0xB9,0xE6};

// Decode as UTF-8; invalid byte sequences are silently dropped
string str = System.Text.Encoding.UTF8.GetString(bytes);

MessageBox.Show(string.Format("string:{0}, length:{1}", str, str.Length));
================

The reason you got different behavior in ASP may be that ASP used your
server's system locale to parse the query string rather than UTF-8. So I
suggest you try the following page, which explicitly sets the server
page's code page and response charset to UTF-8:
==============================
<%@ Language="VBScript" %>

<%
Session.CodePage = 65001
Response.CharSet = "utf-8"
%>

<%
dim str

str = Request.QueryString("str")

Response.Write("<br>String: " & str)
Response.Write("<br>Length: " & Len(str))
%>

========================

Then, when we pass
%C7%D1%B1%DB%BA%A3%B3%CA%B9%E6

as the query string, we get three empty squares displayed on the page
(make sure the client browser is using UTF-8 encoding to display the
page), which is identical to the behavior of the ASP.NET page (using
UTF-8 request/response encoding).
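
The system-locale point can be illustrated the same way. Decoded as a single-byte charset (Latin-1 here, standing in for whatever ANSI code page the server happened to use), every byte maps to exactly one character, which reproduces the 10-character string classic ASP returned. A Python sketch:

```python
# The same raw bytes from the example query string
raw = bytes.fromhex("C7D1B1DBBAA3B3CAB9E6")

# A single-byte charset maps every byte to exactly one character, so no
# input is ever "invalid" -- this reproduces the 10-character ASP result.
as_latin1 = raw.decode("latin-1")

print(len(as_latin1))                       # 10
print([ord(c) for c in as_latin1] == list(raw))  # True
```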

If anything is unclear, or you have any other related questions, please
feel free to post here. Thanks,

Steven Cheng
Microsoft Online Support

Get Secure! www.microsoft.com/security
(This posting is provided "AS IS", with no warranties, and confers no
rights.)
 
Joerg Jooss

Mark said:
Hi Joerg...

Actually, none of the vaporized characters in the original example
are prohibited from utf-8 per se; what was broken about the original
example was that %C7 was followed by %D1; to be legal utf-8, it
would have to have been followed by %BF or lower.

Yep -- I was talking about bytes, not characters.
Taken together, the example string that was supposed to be utf-8 *as
a whole* is invalid, and the question was more about what's the
appropriate way to respond to that. ASP responded to an invalid
utf-8 string by not trying to find valid bits in it but by giving as
close to a "raw" approximation as it could.

ASP.Net treats it like panning for gold. It sifts through the stream
until it finds byte combos that are legal, keeps those, and drops the
rest. It doesn't even put in ? as a placeholder, like so many of the
other apis do. I don't see how that's any less "wrong" than what ASP
does.

As I pointed out, replacement characters are misleading, because you
have no idea whether the '?' is genuine or a replacement.

What we really need here is a HttpRequest property that indicates
whether form data or the query string were decoded without skipping
input bytes.
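
One way such an indicator could be implemented, sketched here in Python under the assumption that the lenient decoder only ever drops input bytes: decode leniently, re-encode, and compare against the original bytes. The function name is hypothetical.

```python
def decoded_losslessly(raw: bytes, encoding: str = "utf-8") -> bool:
    # Decode the way a lenient decoder would (dropping bad sequences),
    # then re-encode; any skipped input byte makes the round trip differ.
    return raw.decode(encoding, errors="ignore").encode(encoding) == raw

ok = decoded_losslessly("häll-o".encode("utf-8"))      # valid stream
lost = decoded_losslessly(bytes.fromhex("C7D1B1DBBAA3B3CAB9E6"))

print(ok)    # True
print(lost)  # False
```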

Cheers,
 
