xml, character encoding, asp question

Discussion in 'ASP General' started by Mark, Mar 7, 2005.

  1. Mark

    Mark Guest

    Hi...

    I've been doing a lot of work both creating and consuming web services, and
    I notice there seems to be a discontinuity between a number of the different
    cogs in the wheel centering around windows-1252 and that it is not equivalent
    to iso-8859-1.

    Looking in the registry under HKEY_CLASSES_ROOT\MIME\Database\Charset and
    \Codepage, it seems that all variations on iso-8859-1 (latin1, etc) are
    mapped to code page 1252, which I'm assuming is windows-1252 in execution
    terms. So if I set the codepage=1252 and Response.Charset=iso-8859-1 in ASP,
    it seems that I'm *really* going to get out windows-1252, not iso-8859-1.
    This becomes somewhat noticeable in html since a lot of commonly used elements
    (like the free-floating bullet •), which *aren't* really 8859-1, get
    interpreted as such in browsers.

    I occasionally run into problems, however, because MSXML doesn't appear to
    be using the mime database to determine how to process the encoding
    declaration (or at least it's got some different mapping hidden somewhere).
    MSXML appears to treat the range 128-159 the way the ansi standard defines
    them - undefined control sequences. As such, when you're processing xml
    (either xml to xml or xml to html via xsl), if you get what is *intended* to
    be a bullet (149) or curly quotes or any of those other extensions that are
    really windows-1252 in your xml, msxml won't make the association and
    translate the characters properly going between character sets. And
    unfortunately a lot of web services don't accept or generate "windows-1252"
    as an encoding declaration.

    So...
    1) Am I correct in assuming that MSXML is using different encoding routines
    than IIS/ASP?

    2) Is there a @Codepage I can specify that will produce real latin 1 in asp?

    3) Will ASP.Net be more standards compliant? and/or does ASP.Net use the
    mime database under the covers too?

    4) just as an aside anybody have a clue why when output via xsl for
    encoding utf-8 doesn't display properly in IE?

    Thanks
    -Mark
     
    Mark, Mar 7, 2005
    #1

  2. Mark

    [MSFT] Guest

    Hello Mark,

    MSXML has two methods to load XML: the LoadXML method and the Load method.

    The LoadXML method always takes a Unicode BSTR that is encoded in UCS-2 or
    UTF-16 only. If you pass in anything other than a valid Unicode BSTR to
    LoadXML, it will fail to load.

    The Load method implements the following algorithm for determining the
    character encoding or character set of the XML document:

    1. If the Content-Type HTTP header defines a character set, this character
    set overrides anything in the XML document itself. This obviously doesn't
    apply to SAFEARRAY and IStream mechanisms because there is no HTTP header.
    2. If there is a 2-byte Unicode byte-order mark, it assumes the encoding is
    UTF-16. It can handle both big endian and little endian.
    3. If there is a 4-byte Unicode byte-order mark (0xFF 0xFE 0x00 0x00 for
    little endian, 0x00 0x00 0xFE 0xFF for big endian), it assumes the encoding
    is UTF-32. It can handle both big endian and little endian.
    4. Otherwise, it assumes the encoding is UTF-8 unless it finds an XML
    declaration with an encoding attribute that specifies some other character
    set (such as ISO-8859-1, Windows-1252, Shift-JIS, and so on).
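
    The detection order in the list above can be sketched roughly as follows.
    This is an illustration in Python, not MSXML's actual code; the function
    name guess_xml_encoding and its parameters are made up for the example. It
    uses the standard Unicode byte-order marks and tests the 4-byte marks
    before the 2-byte ones, because the UTF-32 little-endian mark begins with
    the same two bytes as the UTF-16 little-endian mark.

```python
import re
from typing import Optional

# Standard Unicode byte-order marks, longest first so that the UTF-32 LE
# mark (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

# Matches an XML declaration at the very start and captures its encoding name.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def guess_xml_encoding(data: bytes, http_charset: Optional[str] = None) -> str:
    """Hypothetical helper: pick an encoding name for raw XML bytes."""
    # 1. A charset from the HTTP Content-Type header overrides everything.
    if http_charset:
        return http_charset.lower()
    # 2./3. Otherwise look for a byte-order mark.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 4. Otherwise honour the XML declaration, falling back to UTF-8.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"
```

    Note that this only picks the label. Whether byte 149 under that label
    becomes a bullet is a separate question of which mapping table is applied
    afterwards.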

    "Windows-1252" should be right thing to produce latin 1. ASP.NET also has
    a codepage property, similar to ASP; however, characters will be
    Unicode in its code-behind.

    Luke
     
    [MSFT], Mar 8, 2005
    #2

  3. Mark

    Mark Guest

    Hi Luke...

    Thanks for responding, but the response is a little too narrow to address
    any of the questions I asked. We're using the Load() method to load the
    response from web services, so the detection of the encoding is not the
    issue. The issue is that the mappings between character sets that MSXML uses
    don't appear to be the same as other apis available to ASP (like
    Server.HTMLEncode() and Server.UrlEncode()) and other C++ apis (like
    WideCharToMultiByte() and MultiByteToWideChar()).

    Near as I can tell, everything other than MSXML doing encoding conversion
    seems to be working from the HKEY_CLASSES_ROOT\MIME\Database\Charset &
    CodePage system. Also near as I can tell, that system doesn't differentiate
    between windows-1252 and iso-8859-1, even though they are *not* equivalent
    (1252 is a superset of 8859-1). I probably wouldn't be running into as many
    annoying inconsistencies if MSXML was standards-noncompliant in the same way,
    but MSXML *does* recognize the difference between windows-1252 and iso-8859-1
    and does process/output things differently. And since many of the web
    services we consume come from other vendors, we don't have the option of just
    telling them to use "windows-1252" instead of "iso-8859-1" in their xml
    encoding headers.

    First, I'm looking for ways to get MSXML and ASP to work together
    consistently, if possible. If not, at least try to define what to avoid.
    It's also of parenthetical interest whether ASP.Net has fixed any of these
    inconsistencies; I haven't done trial cases myself to test it yet.

    Take the small bullet as a good example. Putting &#149; in your html gets you a
    small bullet in IE, though this is only a legitimate interpretation if your
    encoding is windows-1252 - not iso-8859-1 or any other non-windows-12*
    encoding. 149 is a legal character in unicode, just not the bullet character.
    In unicode the bullet character is 8226. If I have a literal 149 character
    in an xml document with a declared encoding of windows-1252, MSXML will
    map that to 8226 as part of the character set mapping when the xml
    is parsed; how it gets represented when I spiel it out via xsl or
    Response.Write depends on the output encoding I use.

    If that same xml document, however, has a declared encoding of iso-8859-1,
    MSXML doesn't map the 149 to anything at all - it doesn't recognize that it
    has any particular meaning. So if my xsl stylesheet applied to that dom
    outputs utf8, what comes out is a two byte representation of 149 - c2 95. IE
    doesn't recognize those characters as meaning anything in particular and what
    it displays is garbage. Hence the reason for my posting.
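
    The two cases can be reproduced outside MSXML. The following is a minimal
    sketch in Python, assuming Python's cp1252 and latin-1 codecs are fair
    stand-ins for the windows-1252 and iso-8859-1 mappings discussed here; the
    variable names are only for illustration.

```python
RAW = b"\x95"  # the single byte 149

# Declared as windows-1252: byte 149 is the bullet, Unicode 8226 (U+2022).
as_cp1252 = RAW.decode("cp1252")

# Declared as iso-8859-1: byte 149 stays code point 149, a C1 control.
as_latin1 = RAW.decode("latin-1")

# Re-encode both as UTF-8, as a utf-8 stylesheet output would.
print(ord(as_cp1252), as_cp1252.encode("utf-8").hex())  # 8226 e280a2
print(ord(as_latin1), as_latin1.encode("utf-8").hex())  # 149 c295
```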

    Ironically, there are some web services out there which have the same
    misunderstanding of the difference between windows-1252 and iso-8859-1 that
    you do. They generate xml with an encoding of "iso-8859-1" when they are
    including 1252 characters between 128-159. It's frustrating that while MSXML
    is more standards compliant in recognizing the difference, that standards
    compliance causes garbage to come out the back end of the meat grinder.
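
    One possible workaround, sketched here in Python purely as an illustration
    (the function name is hypothetical, and nothing in this thread establishes
    that MSXML or ASP offers anything like it): after parsing a document that
    was labelled iso-8859-1, remap any characters left in the range 128-159 to
    what windows-1252 would have meant by those bytes. It assumes the sender
    really did mean windows-1252, which is a guess about the data, not a fact.

```python
def fix_mislabelled_cp1252(text: str) -> str:
    """Remap C1 control characters (128-159) as if they were cp1252 bytes."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                ch = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values are unassigned in Python's cp1252
                # codec; leave those characters untouched.
                pass
        out.append(ch)
    return "".join(out)
```

    For example, fix_mislabelled_cp1252("a\x95b") gives "a\u2022b", so the
    stray 149 becomes a real bullet before any output encoding is applied.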

    Thanks
    Mark
     
    Mark, Mar 8, 2005
    #3
  4. Mark

    [MSFT] Guest

    Hi Mark,

    I think we can specify the encoding in xsl, for example:

    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" encoding="iso-8859-1" />
    <xsl:template match="Books">

    I tested the above code in IE and it displays char 149 correctly.

    Luke
     
    [MSFT], Mar 9, 2005
    #4
  5. Mark

    Tony Proctor Guest

    I can't help you much here Mark, but I can sympathise. We're going to be
    hitting this problem ourselves soon so I'm especially interested in this
    thread.

    I know all too well that 'Windows Latin-1' (code page 1252) is *not* the same
    as the ISO latin-1 set (iso 8859/1). There are some subtle differences where
    MS have tried to make better use of some of the lesser-used parts of the ISO
    set.

    Tony Proctor
     
    Tony Proctor, Mar 9, 2005
    #5
  6. Mark

    Mark Guest

    Hi Luke...

    Again, thanks for responding. We're getting closer to an understanding of
    the problem, but not yet any resolution.

    Yes, you can change the output encoding designation in xsl, and yes you can
    use "iso-8859-1" and it will output a literal 149 and yes IE will display it
    - usually. But this only delivers us to the doorstep of understanding the
    inconsistencies that make this difficult to work with in ASP.

    If you want to have any good support for internationalization on your
    website, you really can't use windows-1252 OR iso-8859-1 (same thing as far
    as ASP goes) as your ASP page's code page, because the output encoding from
    IIS (or the encoding IE receives, depending on how you look at it)
    influences how IE encodes form elements for resubmission.

    The big problem is that an IE page with 1252 encoding lets you copy/paste,
    say, chinese into the form element and it looks good in the form element, but
    IE does a terrible job encoding those inputs on a url. It uses a
    non-standard encoding format to construct the url and the tools in ASP for
    interpreting are marginal.

    To get really *good* support for url encoding from IE (or other browsers),
    you have to set your page encoding to utf-8. If you do that, IE will use
    utf-8 to stream international user input in the url encoding, and it does it
    in a standard way.
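
    To illustrate what utf-8 url encoding of international input looks like,
    here is a small Python sketch. It only shows standard percent-encoding with
    urllib and makes no claim about what IE itself sends; the sample text is
    arbitrary.

```python
from urllib.parse import quote, unquote

text = "中文"  # sample non-Latin form input (illustrative)

# UTF-8 can represent the input, so it percent-encodes cleanly
# and decodes back to the same text.
encoded = quote(text, encoding="utf-8")
decoded = unquote(encoded, encoding="utf-8")
print(encoded)  # %E4%B8%AD%E6%96%87

# windows-1252 has no bytes for these characters at all.
try:
    quote(text, encoding="cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
print(cp1252_ok)  # False
```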

    But if you use utf-8 encoding and you're working with xml in your asp page,
    then the *real* difference between windows-1252 and iso-8859-1 *does* become
    a problem. Because, as I've been saying, MSXML is standards-compliant and
    does recognize the difference while the rest of ASP is *not* standards
    compliant in how it handles the two.

    So these inconsistencies really put a web developer in a bind. Which
    feature do you want to drop - internationalization? Use of web services?
    Use of xml? Or do you just have to bend over backward as a developer trying
    to develop all of your own tools to work around the fact that the MS tools
    for this are inconsistent? Seems like the last one to me, but I thought I
    would ask to see if these sorts of things were on the MS radar screen.

    Thanks
    -mark
     
    Mark, Mar 9, 2005
    #6
  7. Mark

    [MSFT] Guest

    Hi Mark,

    I understand your complaint on this issue. It is really tough to take care
    of all this stuff. The best thing I can suggest is to migrate to
    ASP.NET. It has better support for internationalization and web services.
    You can handle the web service with the XML classes in .NET, convert the
    result to UTF-8, and send it to the client side.

    Luke
     
    [MSFT], Mar 10, 2005
    #7
  8. Mark

    Tony Proctor Guest

    Re: question (2) Mark, I've found a reference to a code page that I didn't
    know existed: 28591. This is supposed to be exactly equivalent to ISO 8859/1.

    If this works (I haven't tried it) then it won't solve all problems though.
    The Euro symbol, for instance, is a very important character in Windows
    Latin-1, but it isn't present in the ISO Latin-1. I believe ISO cope with it
    using a newer ISO 8859/15 (Latin-9). The code page equivalent for this,
    apparently, is 28605.
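
    The Euro difference is easy to check. This is a minimal Python sketch in
    which Python's own codecs stand in for the Windows code pages; that they
    behave the same as the code pages named above is an assumption, and the
    helper name try_encode is made up for the example.

```python
EURO = "\u20ac"  # the Euro sign


def try_encode(ch, codec):
    """Return the encoded bytes, or None if the codec has no such character."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


results = {
    codec: try_encode(EURO, codec)
    for codec in ("cp1252", "iso-8859-1", "iso-8859-15")
}
print(results)
# {'cp1252': b'\x80', 'iso-8859-1': None, 'iso-8859-15': b'\xa4'}
```

    So the Euro sits at byte 128 in Windows Latin-1, is absent from ISO
    Latin-1, and sits at byte 164 in Latin-9.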

    Tony Proctor
     
    Tony Proctor, May 5, 2005
    #8