Reading UTF-8 Data from XML file

Discussion in 'ASP .Net' started by Matt Hollingworth, May 26, 2005.

  1. We have an XML file that contains text in various languages, e.g. English,
    French, German and Chinese.
    We currently have a StringWriter object that reads this in and transforms
    against an XslTransform object.
    The problem arises when we encounter Chinese characters; these characters
    just come out as garbage in the Internet Explorer browser.

    Setting the charset on the .aspx page, in the web.config and in the
    .xsl file used for the transform has no effect.

    Using a simple transform in classic ASP, we can correctly display the text
    as it's meant to be seen; however, getting the same output in C# seems a
    lot trickier.

    After trying various 'fixes' posted on several developer sites, nothing has
    worked and the problem is still there.
    We overrode the StringWriter object to allow changing the Encoding type
    to force UTF-8, but to no avail.

    When the transform is complete, we return the result of the StringWriter
    object's .ToString() method. This is where the error seems to lie:
    checking .Encoding.EncodingName just prior to returning, it's
    labelled 'Unicode (UTF-8)', yet when output
    to screen via a Text Literal, all we see is garbage.

    Some of the characters are replaced with ???????. We know our browser is
    functioning correctly because we can see these types of text on
    http://www.yahoo.com.hk
    Matt Hollingworth, May 26, 2005
    #1

  2. Joerg Jooss (Guest)

    Matt Hollingworth wrote:

    > We have an XML file that contains text in various languages, e.g.
    > English, French, German and Chinese.
    > We currently have a StringWriter object that reads this in and
    > transforms against an XslTransform object.


    I really don't believe that you use a String*Writer* to *read* input
    ;-)

    > The problem arises when we encounter Chinese characters; these
    > characters just come out as garbage in the Internet Explorer browser.
    >
    > Setting the charset on the .aspx page, in the web.config and in
    > the .xsl file used for the transform has no effect.
    >
    > Using a simple transform in classic ASP, we can correctly display the
    > text as it's meant to be seen; however, getting the same output in C#
    > seems a lot trickier.
    >
    > After trying various 'fixes' posted on several developer sites,
    > nothing has worked and the problem is still there.
    > We overrode the StringWriter object to allow changing the
    > Encoding type to force UTF-8, but to no avail.
    >
    > When the transform is complete, we return the result of the
    > StringWriter object's .ToString() method. This is where the error
    > seems to lie: checking .Encoding.EncodingName just prior to returning,
    > it's labelled 'Unicode (UTF-8)', yet when output
    > to screen via a Text Literal, all we see is garbage.
    >
    > Some of the characters are replaced with ???????. We know our
    > browser is functioning correctly because we can see these types of text
    > on http://www.yahoo.com.hk


    Characters and strings in .NET are always Unicode and use UTF-16 as their
    internal representation. This means
    a) a UTF-8 StringWriter is an oxymoron
    b) truly character-based operations aren't susceptible to encoding
    problems
    c) encodings are only relevant when you need to transport strings using
    a byte representation, e.g. when rendering a string on a web page. Make
    sure that your web application uses UTF-8 (or any other UTF that suits
    your needs) as response encoding.
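
    As a rough illustration of point c), here is a minimal sketch of what
    can happen at the string-to-bytes boundary. The class and method names
    (EncodingBoundaryDemo, RoundTrip, ToAsciiAndBack) are made up for this
    example and are not taken from the original poster's code. It only
    assumes the standard System.Text.Encoding API.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingBoundaryDemo
{
    // Encodes the text as UTF-8 bytes, then decodes those bytes with the
    // given encoding. If decodeWith is not UTF-8, non-ASCII characters
    // come back as garbage (mojibake).
    public static string RoundTrip(string text, Encoding decodeWith)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return decodeWith.GetString(utf8Bytes);
    }

    // Encodes the text with an encoding that cannot represent Chinese
    // characters. Each unrepresentable character is replaced with '?'.
    public static string ToAsciiAndBack(string text)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(asciiBytes);
    }
}
```

    Calling RoundTrip("中文", Encoding.UTF8) gives back the original string,
    while decoding the same bytes with Encoding.GetEncoding("iso-8859-1")
    yields six unrelated characters, one per byte. ToAsciiAndBack("中文")
    returns question marks, which resembles the ??????? symptom described
    above. The string itself is never "in UTF-8"; only the bytes are.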

    Cheers,
    --
    http://www.joergjooss.de
    Joerg Jooss, May 27, 2005
    #2

  3. Joerg,

    Thanks. A developer wrote this question...

    > We currently have a StringWriter object that reads this in and
    > transforms against an XslTransform object.

    Sorry, to clarify: this means that the result of transforming an
    XmlDocument object is written to a StringWriter.

    My Webform does use UTF-8 response and request encoding, and I have tried
    several other encodings to get it to work.

    I can get Chinese characters to display, but some of the content is still
    broken. Could the fact that my transformation results in a mixture of HTML
    code + English text + Chinese text be part of the problem?

    I get something like "藛鈥犆bå“鈥?/P>" (notice the question mark and half
    a </p> tag). I have disabled output escaping in my XSLT, but still to no
    avail.

    Your help is appreciated.
    Thanks,
    Matt

    Matt Hollingworth, May 31, 2005
    #3
  4. Joerg Jooss (Guest)

    Matt Hollingworth wrote:

    > Joerg,
    >
    > Thanks. A developer wrote this question...
    >
    > > We currently have a StringWriter object that reads this in and
    > > transforms against an XslTransform object.
    >
    > Sorry, to clarify: this means that the result of transforming an
    > XmlDocument object is written to a StringWriter.
    >
    > My Webform does use UTF-8 response and request encoding, and I have
    > tried several other encodings to get it to work.
    >
    > I can get Chinese characters to display, but some of the content is
    > still broken. Could the fact that my transformation results in a
    > mixture of HTML code + English text + Chinese text be part of the
    > problem?


    Only if you were not using Unicode. But since you use UTF-8 as response
    encoding, and assuming you don't mistreat any string objects in your
    code, that should not be a problem.

    > It seems I get something like "藛鈥犆bå“鈥?/P>" notice the
    > question mark and half a </p> tag.


    What characters are missing in this string? Is it only the opening '<'?

    Cheers,
    --
    http://www.joergjooss.de
    Joerg Jooss, May 31, 2005
    #4
  5. "Joerg Jooss" wrote:

    > What characters are missing in this string? Is it only the opening '<'?


    Yes. Although if I disable output escaping in my XSL, I can see that ?lt; is
    in the code, as if the & has been replaced with a ?


    Here is the code for your reference:
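    // Load the source document and the stylesheet.
    XmlDocument oDoc = new XmlDocument();
    XslTransform oXsl = new XslTransform();

    oDoc.Load(Server.MapPath(""));
    oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));

    // Transform into an in-memory StringWriter (a plain .NET string).
    StringWriter oSw = new StringWriter();

    oXsl.Transform(oDoc, null, oSw);

    // Hand the resulting markup to the Literal control on the page.
    litTestText.Text = oSw.ToString();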


    Thanks
    Matt
    Matt Hollingworth, Jun 1, 2005
    #5
  6. "Matt Hollingworth" wrote:

    > Yes. Although if I disable output escaping in my XSL, I can see that ?lt; is
    > in the code, as if the & has been replaced with a ?
    >
    >
    > Here is the code for your reference:
    > XmlDocument oDoc = new XmlDocument();
    > XslTransform oXsl = new XslTransform();
    >
    > oDoc.Load(Server.MapPath(""));
    > oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));
    >
    > StringWriter oSw = new StringWriter();
    >
    > oXsl.Transform(oDoc,null,oSw);
    >
    > litTestText.Text = oSw.ToString();
    >
    >
    > Thanks
    > Matt



    Having investigated further, I forgot to say that I only see the output
    described above when I change the browser's encoding to Simplified Chinese.
    If I choose UTF-8, it is all still garbled, just as it appears in Notepad
    when you click View Source.

    I did the same page in ASP and it all displays correctly without issue.


    Matt Hollingworth, Jun 1, 2005
    #6
  7. Joerg Jooss (Guest)

    Matt Hollingworth wrote:

    > Yes. Although if I disable output escaping in my XSL, I can see that
    > ?lt; is in the code, as if the & has been replaced with a ?
    >
    >
    > Here is the code for your ref:
    > XmlDocument oDoc = new XmlDocument();
    > XslTransform oXsl = new XslTransform();
    >
    > oDoc.Load(Server.MapPath(""));
    > oXsl.Load(Server.MapPath("xsl/x_language_test.xsl"));
    >
    > StringWriter oSw = new StringWriter();
    >
    > oXsl.Transform(oDoc,null,oSw);
    >
    > litTestText.Text = oSw.ToString();


    Save for the weird Server.MapPath(""), there seems to be nothing wrong
    here. I can only imagine that there's something wrong with the XSL
    itself. Maybe somebody over in the XML group can help out.
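
    One way to check that the transform step on its own preserves Chinese
    text is to run it in memory, away from the page and the response
    encoding. The sketch below is only an illustration: TransformSketch, the
    inline stylesheet and the <doc>/<text> element names are hypothetical
    stand-ins, not the poster's files, and it uses XslCompiledTransform (the
    newer replacement for XslTransform) rather than the class in the quoted
    code.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper class, for illustration only.
public static class TransformSketch
{
    // Stand-in stylesheet; the real xsl/x_language_test.xsl is not shown
    // in this thread. The XML declaration is omitted because a StringWriter
    // would otherwise label the output as UTF-16.
    private const string Stylesheet =
        @"<xsl:stylesheet version=""1.0""
              xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
            <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
            <xsl:template match=""/doc"">
              <p><xsl:value-of select=""text""/></p>
            </xsl:template>
          </xsl:stylesheet>";

    // Transforms the given XML string and returns the result as a string.
    // No bytes are involved, so no encoding is applied at any point.
    public static string Transform(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xslReader);
        }

        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

    If TransformSketch.Transform("<doc><text>中文 and English</text></doc>")
    returns a string that still contains 中文, the characters survive the
    transform, and the place to look is where bytes are read or written:
    the encoding the source files are saved in and declared as, and the
    encoding of the response.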

    Cheers,
    --
    http://www.joergjooss.de
    Joerg Jooss, Jun 3, 2005
    #7