xml, character encoding, asp question

Mark

Hi...

I've been doing a lot of work both creating and consuming web services, and
I notice there seems to be a discontinuity between a number of the different
cogs in the wheel centering around windows-1252 and that it is not equivalent
to iso-8859-1.

Looking in the registry under HKEY_CLASSES_ROOT\MIME\Database\Charset and
\Codepage, it seems that all variations on iso-8859-1 (latin1, etc) are
mapped to code page 1252, which I'm assuming is windows-1252 in execution
terms. So if I set the codepage=1252 and Response.Charset=iso-8859-1 in ASP,
it seems that I'm *really* going to get out windows-1252, not iso-8859-1.
This becomes somewhat noticeable in html since a lot of commonly used elements
(like the free-floating bullet •), which *aren't* really 8859-1, get
interpreted as such in browsers.

I occasionally run into problems, however, because MSXML doesn't appear to
be using the mime database to determine how to process the encoding
declaration (or at least it's got some different mapping hidden somewhere).
MSXML appears to treat the range 128-159 the way the ansi standard defines
them - undefined control sequences. As such, when you're processing xml
(either xml to xml or xml to html via xsl), if you get what is *intended* to
be a bullet (149) or curly quotes or any of those other extensions that are
really windows-1252 in your xml, msxml won't make the association and
translate the characters properly going between character sets. And
unfortunately a lot of web services don't accept or generate "windows-1252"
as an encoding declaration.

So...
1) Am I correct in assuming that MSXML is using different encoding routines
than IIS/ASP?

2) Is there a @Codepage I can specify that will produce real latin 1 in asp?

3) Will ASP.Net be more standards compliant? and/or does ASP.Net use the
mime database under the covers too?

4) Just as an aside, does anybody have a clue why output generated via xsl
with encoding utf-8 doesn't display properly in IE?

Thanks
-Mark
 
[MSFT]

Hello Mark,

MSXML has two methods to load XML: the LoadXML method and the Load method.

The LoadXML method always takes a Unicode BSTR that is encoded in UCS-2 or
UTF-16 only. If you pass in anything other than a valid Unicode BSTR to
LoadXML, it will fail to load.

The Load method implements the following algorithm for determining the
character encoding or character set of the XML document:

1. If the Content-Type HTTP header defines a character set, this character
set overrides anything in the XML document itself. This obviously doesn't
apply to SAFEARRAY and IStream mechanisms because there is no HTTP header.
2. If there is a 2-byte Unicode byte-order mark, it assumes the encoding is
UTF-16. It can handle both big endian and little endian.
3. If there is a 4-byte Unicode byte-order mark (0xFF 0xFE 0x00 0x00 for
little endian), it assumes the encoding is UTF-32. It can handle both big
endian and little endian.
4. Otherwise, it assumes the encoding is UTF-8 unless it finds an XML
declaration with an encoding attribute that specifies some other character
set (such as ISO-8859-1, Windows-1252, Shift-JIS, and so on).
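
As a rough illustration, the four steps above can be sketched in Python. This is only a sketch of the described order of checks, not MSXML's actual code; the function name guess_xml_encoding and its http_charset parameter are made up for the example. The 4-byte mark is tested before the 2-byte one because the little-endian UTF-32 mark (FF FE 00 00) begins with the little-endian UTF-16 mark (FF FE).

```python
import codecs
import re
from typing import Optional

# Hypothetical helper: mirrors the four steps listed above, nothing more.
XML_DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']"""
)

def guess_xml_encoding(data: bytes, http_charset: Optional[str] = None) -> str:
    # Step 1: a charset from the Content-Type header overrides the document.
    if http_charset:
        return http_charset
    # Step 3 before step 2: FF FE 00 00 also starts with FF FE.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    # Step 2: 2-byte byte-order mark, either endianness.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Step 4: use the XML declaration if it names an encoding, else UTF-8.
    match = XML_DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"

doc = b'<?xml version="1.0" encoding="iso-8859-1"?><a/>'
print(guess_xml_encoding(doc))                  # declaration wins
print(guess_xml_encoding(doc, "windows-1252"))  # header wins
```

Note that none of these steps look at what the bytes 128-159 actually contain; the declared name is taken at face value.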

"Windows-1252" should be the right thing to produce latin 1. ASP.NET also has
a codepage property similar to ASP's; however, the characters will be
Unicode in its code-behind.

Luke
 
Mark

Hi Luke...

Thanks for responding, but the response is a little too narrow to address
any of the questions I asked. We're using the Load() method to load the
response from web services, so the detection of the encoding is not the
issue. The issue is that the mappings between character sets that MSXML uses
don't appear to be the same as those of other apis available to ASP (like
Server.HTMLEncode() and Server.UrlEncode()) and other C++ apis (like
WideCharToMultiByte() and MultiByteToWideChar()).

Near as I can tell, everything other than MSXML doing encoding conversion
seems to be working from the HKEY_CLASSES_ROOT\MIME\Database\Charset &
CodePage system. Also near as I can tell, that system doesn't differentiate
between windows-1252 and iso-8859-1, even though they are *not* equivalent
(1252 is a superset of 8859-1). I probably wouldn't be running into as many
annoying inconsistencies if MSXML was standards-noncompliant in the same way,
but MSXML *does* recognize the difference between windows-1252 and iso-8859-1
and does process/output things differently. And since many of the web
services we consume come from other vendors, we don't have the option of just
telling them to use "windows-1252" instead of "iso-8859-1" in their xml
encoding headers.

First, I'm looking for ways to get MSXML and ASP to work together
consistently, if possible. If not, at least try to define what to avoid.
It's also of parenthetical interest whether ASP.Net has fixed any of these
inconsistencies; I haven't done trial cases myself to test it yet.

Take the small bullet as a good example. Putting &#149; in your html gets you a
small bullet in IE, though this is only a legitimate interpretation if your
encoding is windows-1252 - not iso-8859-1 or any other non-windows-12*
encoding. 149 is a legal character in unicode, just not the bullet character.
In unicode the bullet character is 8226. If I have a literal 149 character
in an xml document with a declared encoding of windows-1252, MSXML will
map that up to 8226 as part of the character set mapping when the xml
is parsed; how it gets represented when I spiel it out via xsl or
Response.Write depends on the output encoding I use.

If that same xml document, however, has a declared encoding of iso-8859-1,
MSXML doesn't map the 149 to anything at all - it doesn't recognize that it
has any particular meaning. So if my xsl stylesheet applied to that dom
outputs utf8, what comes out is a two byte representation of 149 - c2 95. IE
doesn't recognize those characters as meaning anything in particular and what
it displays is garbage. Hence the reason for my posting.
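
The same byte can be traced through both readings outside of MSXML. A minimal sketch using Python's standard codecs, which are assumed here to stand in for the two mappings described above ("cp1252" is Python's name for windows-1252):

```python
raw = b"\x95"  # the single byte 149

as_1252 = raw.decode("cp1252")        # windows-1252 reading: the bullet
as_latin1 = raw.decode("iso-8859-1")  # ISO 8859-1 reading: a C1 control

# Code point, then the UTF-8 bytes each reading turns into on output.
print(ord(as_1252), as_1252.encode("utf-8").hex())      # 8226 e280a2
print(ord(as_latin1), as_latin1.encode("utf-8").hex())  # 149 c295
```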

Ironically, there are some web services out there which have the same
misunderstanding of the difference between windows-1252 and iso-8859-1 that
you do. They generate xml with an encoding of "iso-8859-1" when they are
including 1252 characters between 128-159. It's frustrating that while MSXML
is more standards compliant in recognizing the difference, that standards
compliance causes garbage to come out the back end of the meat grinder.
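
One possible workaround, sketched in Python rather than ASP: after parsing a document that claims iso-8859-1, reinterpret any characters left in the 128-159 range as the windows-1252 characters they were most likely meant to be. The helper name remap_c1_as_cp1252 is made up for this example, and the approach assumes the sender never really intends C1 control characters.

```python
def remap_c1_as_cp1252(text: str) -> str:
    """Reinterpret U+0080..U+009F as their windows-1252 equivalents."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no mapping in
                # Python's cp1252 codec, so they are left unchanged.
                pass
        out.append(ch)
    return "".join(out)

# Bytes from a service that says "iso-8859-1" but sends 1252 punctuation.
parsed = b"\x93quoted\x94 \x95 caf\xe9".decode("iso-8859-1")
fixed = remap_c1_as_cp1252(parsed)
print(fixed.encode("utf-8"))
```

Characters above 159, such as the e-acute in the sample, are identical in both character sets and pass through untouched.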

Thanks
Mark
 
[MSFT]

Hi Mark,

I think we can specify the encoding in xsl, for example:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="iso-8859-1" />
<xsl:template match="Books">

I tested the above code in IE and it displays char 149 correctly.

Luke
 
Tony Proctor

I can't help you much here Mark, but I can sympathise. We're going to be
hitting this problem ourselves soon so I'm especially interested in this
thread.

I know all too well that 'Windows Latin-1' (code page 1252) is *not* the same
as the ISO latin-1 set (iso 8859/1). There are some subtle differences where
MS have tried to make better use of some of the lesser-used parts of the ISO
set.

Tony Proctor
 
Mark

Hi Luke...

Again, thanks for responding. We're getting closer to an understanding of
the problem, but not yet any resolution.

Yes, you can change the output encoding designation in xsl, and yes you can
use "iso-8859-1" and it will output a literal 149 and yes IE will display it
- usually. But this only delivers us to the doorstep of understanding the
inconsistencies that make this difficult to work with in ASP.

If you want to have any good support for internationalization on your
website, you really can't use windows-1252 OR iso-8859-1 (same thing as far
as ASP goes) as your ASP page's code page, because the output encoding from
IIS (or the encoding IE receives, depending on how you look at it) will
influence how IE processes the form elements that it tries to encode for
resubmission.

The big problem is that an IE page with 1252 encoding lets you copy/paste,
say, chinese into the form element and it looks good in the form element, but
IE does a terrible job encoding those inputs on a url. It uses a
non-standard encoding format to construct the url and the tools in ASP for
interpreting are marginal.

To get really *good* support for url encoding from IE (or other browsers),
you have to set your page encoding to utf-8. If you do that, IE will use
utf-8 to stream international user input in the url encoding, and it does it
in a standard way.
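
For reference, this is what standard utf-8 percent-encoding of non-Latin input looks like. A small Python sketch; it only illustrates the encoding itself, not what IE or ASP does internally:

```python
from urllib.parse import quote, unquote

text = "\u4e2d\u6587"  # two Chinese characters

# UTF-8 percent-encoding: each byte of the UTF-8 form becomes %XX.
encoded = quote(text, encoding="utf-8")
print(encoded)  # %E4%B8%AD%E6%96%87

# The receiving side can reverse it without guessing.
print(unquote(encoded, encoding="utf-8") == text)  # True

# windows-1252 has no bytes for these characters at all.
try:
    quote(text, encoding="cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False
```

Since a 1252 page has no standard way to carry these characters, the browser has to fall back on something else, which is where the trouble on the receiving side starts.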

But if you use utf-8 encoding and you're working with xml in your asp page,
then the *real* difference between windows-1252 and iso-8859-1 *does* become
a problem. Because, as I've been saying, MSXML is standards-compliant and
does recognize the difference while the rest of ASP is *not* standards
compliant in how it handles the two.

So these inconsistencies really put a web developer in a bind. Which
feature do you want to drop - internationalization? Use of web services?
Use of xml? Or do you just have to bend over backward as a developer trying
to develop all of your own tools to work around the fact that the MS tools
for this are inconsistent? Seems like the last one to me, but I thought I
would ask to see if these sorts of things were on the MS radar screen.

Thanks
-mark
 
[MSFT]

Hi Mark,

I understand your complaint on this issue. It is really tough to take care
of all this stuff. The best thing I can suggest is to migrate to
ASP.NET. It has better support for internationalization and web services.
You can handle the web service with the XML classes in .NET, convert it to
utf8 and send the result to the client side.

Luke
 
Tony Proctor

Re: question (2) Mark, I've found a reference to a code page that I didn't
know existed: 28591. This is supposed to be exactly equivalent to ISO 8859/1.

If this works (I haven't tried it) then it won't solve all problems though.
The Euro symbol, for instance, is a very important character in Windows
Latin-1, but it isn't present in the ISO Latin-1. I believe ISO copes with it
using a newer ISO 8859/15 (Latin-9). The code page equivalent for this,
apparently, is 28605.
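
The Euro difference is easy to check with any codec library. A short Python sketch, using Python's codec names rather than Windows code page numbers:

```python
euro = "\u20ac"  # the Euro sign

in_1252 = euro.encode("cp1252")         # Windows Latin-1
in_latin9 = euro.encode("iso-8859-15")  # ISO Latin-9
print(in_1252, in_latin9)               # b'\x80' b'\xa4'

# ISO 8859-1 has no Euro sign, so encoding fails outright.
try:
    euro.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)  # False
```

So even the two sets that do have the Euro put it at different byte values, and a byte 0x80 read as ISO 8859-1 is just another control character.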

Tony Proctor
 
