UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

Discussion in 'HTML' started by mrdecav@gmail.com, Feb 1, 2009.

  1. Guest

    Hey all,
    I have a bizarre problem with a piece of mail (most likely sent by
    Outlook) that is in UTF-8 format.

    There is a character, coming after spaces, which from looking at a
    hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
    documentation I can find, this is an accent circumflex.

    In browsers (IE, FF, Safari), this character shows up as an unknown
    character, or as the accent circumflex. In a mail browser, however
    (Outlook, Apple Mail), the character appears as a "NO-BREAK
    WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".

    Some documentation I have found shows this is a NO-BREAK WHITESPACE,
    and it is clearly what the intent is. The HTML header and MIME type
    of the body part both claim UTF-8 encoding.

    Is there something I am missing here? Why does this show up
    incorrectly in browsers, or why do mail clients feel compelled to
    replace this character, but browsers don't? Is there an easy fix to
    this? I am concerned that if I actually strip the CA, I'll break
    emails that actually are supposed to have the accent.

    The following hex is an example of the issue:
    00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|
    00000260  65 20 61 20 66 65 77 20  6d 69 6e 6f 72 20 64 65  |e a few minor de|

    design. <offending character>I have


    Thanks in advance,
    Andre de Cavaignac
     
    , Feb 1, 2009
    #1

  2. Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    wrote:

    > I have a bizarre problem with a piece of mail (most likely sent by
    > Outlook) that is in UTF-8 format.


    This sounds like an e-mail problem, not an HTML issue. If the e-mail is in
    HTML format or contains an HTML part, then that side of the matter could
    relate to HTML, but it can hardly be the primary problem.

    To solve the e-mail problem, it's best to consult someone who knows the
    e-mail program you are using and give him full access to the e-mail. Of
    course he should be someone you really trust, if the message may contain
    confidential information.

    Without primary data, one can only present speculations.

    > There is a character, coming after spaces, which from looking at a
    > hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
    > documentation I can find, this is an accent circumflex.


    It seems that the secondary data, namely your conclusions drawn from some
    work on something that might be primary data, is inherently unreliable. Your
    understanding of UTF-8 is all wrong. In UTF-8, no octet > 7F as such means
    any character; such octets only appear as part of a multi-octet
    representation of a character.
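
    A minimal Python sketch of this point (the sample characters and the
    helper name utf8_octets are chosen for illustration only): an ASCII
    character is a single octet up to 7F, and every other character becomes
    a sequence of two or more octets, all of them above 7F.

```python
# Illustrative samples: one ASCII letter and three non-ASCII characters.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_octets(ch):
    """Return the UTF-8 octets of a single character as a list of integers."""
    return list(ch.encode("utf-8"))

for ch in samples:
    octets = utf8_octets(ch)
    print(f"U+{ord(ch):04X}: {' '.join(f'{o:02X}' for o in octets)}")
    if len(octets) == 1:
        # A single-octet encoding only ever occurs for ASCII.
        assert octets[0] <= 0x7F
    else:
        # In a multi-octet sequence, every octet is above 7F.
        assert all(o > 0x7F for o in octets)
```

    Under this rule, a lone CA followed by an ASCII octet cannot be a
    complete UTF-8 character.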

    > In browsers (IE, FF, Safari), this character shows up as an unknown
    > character, or as the accent circumflex.


    Why would you use a web browser to display an e-mail? Anyway, it seems that
    you used them so that they interpreted the data as ISO-8859-1 encoded, or
    something like that.

    > In a mail browser, however
    > (Outlook, Apple Mail), the character appears as a "NO-BREAK
    > WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".


    It's NO-BREAK SPACE. But how can you distinguish it from SPACE just by
    looking at it?

    > The HTML header and MIME type
    > of the body part both claim UTF-8 encoding.


    So what?

    > Is there something I am missing here?


    Yes. And we are missing a description of the real situation, the primary
    data.

    > The following hex is an example of the issue:
    > 00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|


    It looks like the data is e.g. ISO-8859-1 encoded. But you are not
    describing how you got that dump. It's quite possible that some software you
    used performed a character encoding conversion. This means you would not be
    looking at the primary data.

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
     
    Jukka K. Korpela, Feb 1, 2009
    #2

  3. Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    On Feb 1, 2:28 am, "Jukka K. Korpela" <> wrote:
    > wrote:
    > > I have a bizarre problem with a piece of mail (most likely sent by
    > > Outlook) that is in UTF-8 format.

    >
    > This sounds like an e-mail problem, not an HTML issue. If the e-mail is in
    > HTML format or contains an HTML part, then that side of the matter could
    > relate to HTML, but it can hardly be the primary problem.
    >
    > To solve the e-mail problem, it's best to consult someone who knows the
    > e-mail program you are using and give him full access to the e-mail. Of
    > course he should be someone you really trust, if the message may contain
    > confidential information.
    >
    > Without primary data, one can only present speculations.
    >
    > > There is a character, coming after spaces, which from looking at a
    > > hexdump of the file, seems to be a CA (decimal: 202). From most UTF-8
    > > documentation I can find, this is an accent circumflex.

    >
    > It seems that the secondary data, namely your conclusions drawn from some
    > work on something that might be primary data, is inherently unreliable. Your
    > understanding of UTF-8 is all wrong. In UTF-8, no octet > 7F as such means
    > any character; such octets only appear as part of a multi-octet
    > representation of a character.
    >
    > > In browsers (IE, FF, Safari), this character shows up as an unknown
    > > character, or as the accent circumflex.

    >
    > Why would you use a web browser to display an e-mail? Anyway, it seems that
    > you used them so that they interpreted the data as ISO-8859-1 encoded, or
    > something like that.
    >
    > > In a mail browser, however
    > > (Outlook, Apple Mail), the character appears as a "NO-BREAK
    > > WHITESPACE" (just a space visually), or the equivalent of an "&nbsp;".

    >
    > It's NO-BREAK SPACE. But how can you distinguish it from SPACE just by
    > looking at it?
    >
    > > The HTML header and MIME type
    > > of the body part both claim UTF-8 encoding.

    >
    > So what?
    >
    > > Is there something I am missing here?

    >
    > Yes. And we are missing a description of the real situation, the primary
    > data.
    >
    > > The following hex is an example of the issue:
    > > 00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|

    >
    > It looks like the data is e.g. ISO-8859-1 encoded. But you are not
    > describing how you got that dump. It's quite possible that some software you
    > used performed a character encoding conversion. This means you would not be
    > looking at the primary data.
    >
    > --
    > Yucca, http://www.cs.tut.fi/~jkorpela/


    Hi Yucca,
    I appreciate the response.

    The email body is in fact in HTML, and although HTML is not in itself
    the problem, the way it is interpreted by clients (such as a browser)
    is the issue.

    I am using the web browser to display the email because I am writing
    an application that supports email integration, and embedding a
    browser in my application was the easiest way to render an HTML
    formatted message.

    I understand that the first octet in a UTF-8 formatted message can
    describe the length of the data for the entire character, and did some
    reading in the UTF-8 RFC. It appears, from the hex in the previous
    email, that the character is a space (20) followed by a NO-BREAK SPACE
    (CA, or E with a circumflex, depending on who you consult), followed
    by an I. This happens in every instance there is more than one space
    after a space (20). It makes sense, because two consecutive spaces
    (20 20) in HTML would only render as one space. (20 &nbsp;) would
    render as two spaces. It appears that the &nbsp; was encoded as a
    character.

    I've consulted many UTF-8 and ASCII format guides. One that I found
    claims that the ASCII equivalent of 202 is "NO-BREAK SPACE". This is
    how both Outlook and Apple Mail (Mail.app) render 202. Web browsers
    render it as the accented E.

    I considered the ISO 8859-1 character set. This character set
    reference also states that it is the accented E:
    http://htmlhelp.com/reference/charset/iso192-223.html
    In this UTF-8 reference, 202 is also the accented E:
    http://www.tony-franks.co.uk/UTF-8.htm
    This reference mentions 202 as being NO-BREAK SPACE in, from what I
    can tell, ASCII: http://www1.tip.nl/~t876506/utf8tbl.html
    But this says ASCII 202 is not a NO-BREAK SPACE: http://www.asciitable.com/

    My confusion here is not with a single message, but a whole suite of
    messages from different sources.

    The hex above was found by taking the raw, base-64 encoded MIME part,
    and decoding it -- into HTML. That HTML, according to the MIME header
    and the HTML header is UTF-8 formatted. I have used two base64
    decoders (.NET on Windows and Java on OSX) to decode it -- same
    result. From there, I saved the output and ran "hexdump -C file.txt"
    to get the hex values. The data has been pulled by both JavaMail and
    the Apple Mail client (Apple mail renders it correctly). There is no
    doubt that the message in question is correct, and has not been
    corrupted by the code used to retrieve it.
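
    The decoding step described above can be reproduced in a few lines of
    Python. This is only a sketch: the real message is not available, so
    mime_body is a stand-in built from the eleven octets shown in the dump,
    and hex_line is a made-up helper, not part of any mail library.

```python
import base64

# Stand-in for the base64 body of the MIME part (assumption: constructed
# here from the octets in the hexdump, not taken from the real message).
mime_body = base64.b64encode(bytes.fromhex("2064657369676e2e20ca49"))

def hex_line(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex, like hexdump -C."""
    return " ".join(f"{b:02x}" for b in data)

# Base64 decoding yields raw octets; no character encoding is applied yet.
raw = base64.b64decode(mime_body)
print(hex_line(raw))
```

    The result of the base64 step is bytes, not text. Which character
    encoding those bytes are in is a separate question that base64 does not
    answer.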
     
    Andre de Cavaignac, Feb 1, 2009
    #3
  4. Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    Andre de Cavaignac wrote:

    > I appreciate the response.


    Before that statement, you quoted my entire message, even including the sig,
    and then quoted it again.

    > I am using the web browser to display the email because I am writing
    > an application that supports email integration,


    Seriously, stop doing that. You lack the prerequisites. You can't even use a
    newsreader decently, and you are totally confused with character encoding
    issues.

    > I understand that the first octet in a UTF-8 formatted message can
    > describe the length of the data for the entire character,


    At best, that's a very odd way of describing things. If you replace "can
    describe" by "implies", it makes much better sense.
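
    As a sketch of what "implies" means here (the function names are
    invented for this example; the octet ranges follow the UTF-8 definition
    in RFC 3629):

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 lead octet.

    Returns 0 for octets that cannot start a sequence: continuation
    octets (80-BF) and the never-valid values C0, C1 and F5-FF.
    """
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0

def is_continuation(octet: int) -> bool:
    """Continuation octets have the bit pattern 10xxxxxx."""
    return (octet & 0xC0) == 0x80

# CA announces a two-octet sequence, but 49 is not a continuation octet.
print(utf8_sequence_length(0xCA), is_continuation(0x49))
```

    So in the dump, CA promises one more octet in the range 80-BF and gets
    49 instead, which is why the sequence is not valid UTF-8.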

    > I've consulted many UTF-8 and ASCII format guides.


    But you obviously cannot distinguish the rubbish from reliable sources.

    > One that I found
    > claims that the ASCII equivalent of 202 is "NO-BREAK SPACE".


    That's nonsense. ASCII has nothing corresponding to 202 decimal, and ASCII
    does not contain NO-BREAK SPACE at all.

    > The hex above was found by taking the raw, base-64 encoded MIME part,
    > and decoding it -- into HTML.


    "Into HTML"? Base64 is a transfer encoding of characters and has nothing to
    do with any markup.

    > There is no
    > doubt that the message in question is correct, and has not been
    > corrupted by the code used to retrieve it.


    It surely isn't correct, in the very technical sense of the word, if it
    claims to be UTF-8 encoded and yet isn't and specifically contains octet
    sequences that are not allowed in UTF-8 data. But lacking the primary data,
    we have a big "if" here.

    ObHTML: Your conjecture that the data contains instances of a space followed
    by a no-break space in order to create two visible spaces is plausible, but
    we have no way of actually testing whether it is actually true. People have
    been observed to do such things, and the method works for some values of
    "work". It sounds odd that someone would write e-mail that way, but perhaps
    some software used to compose e-mail creates such data by default.

    --
    Yucca, http://www.cs.tut.fi/~jkorpela/
     
    Jukka K. Korpela, Feb 1, 2009
    #4
  5. Guest

    Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    On Feb 1, 4:14 am, Ben C <> wrote:
    > On 2009-02-01, Andre de Cavaignac <> wrote:
    > [...]
    >
    > >> > The following hex is an example of the issue:
    > >> > 00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|

    > [...]
    > > I understand that the first octet in a UTF-8 formatted message can
    > > describe the length of the data for the entire character, and did some
    > > reading in the UTF-8 RFC.  It appears, from the hex in the previous
    > > email, that the character is a space (20) followed by a NO-BREAK SPACE
    > > (CA, or E with a circumflex, depending on who you consult), followed
    > > by an I.  This happens in every instance there is more than one space
    > > after a space (20).  It makes sense, because two consecutive spaces
    > > (20 20) in HTML would only render as one space.  (20 &nbsp;) would
    > > render as two spaces.  It appears that the &nbsp; was encoded as a
    > > character.

    >
    > In UTF-8, NO-BREAK SPACE should appear as 0xC2 0xA0. E with circumflex
    > should appear as 0xC3 0x8A.
    >
    > 0xCA is what E with circumflex looks like in ISO-8859-1.
    >
    > 0xCA 0x49 is invalid as UTF-8. So it looks to me like the program
    > displaying this is trying to treat it as UTF-8, but then falling back to
    > ISO-8859-1 when it finds to its disappointment that it isn't actually
    > UTF-8. Lots of data incorrectly identifies itself so many programs
    > employ a bit of guesswork. If it did do that, you'd see the E with a
    > circumflex.
    >
    > > I've consulted many UTF-8 and ASCII format guides.  One that I found
    > > claims that the ASCII equivalent of 202 is "NO-BREAK SPACE". This is
    > > how both Outlook and Apple Mail (Mail.app) render 202.  Web browsers
    > > render it as the accented E.

    >
    > 202 is definitely the circumflexed E in ISO-8859-1, and the unicode
    > character 202 is also the circumflexed E. But it may be the NO-BREAK
    > SPACE in some other encoding. If so I don't know which one. But this is
    > one way to explain what is happening.
    >
    > > I considered the ISO 8859-1 character set.  This character set
    > > reference also states that it is the accented E:
    > > http://htmlhelp.com/reference/charset/iso192-223.html
    > > In this UTF-8 reference, 202 is also the accented E:
    > > http://www.tony-franks.co.uk/UTF-8.htm
    > > This reference mentions 202 as being NO-BREAK SPACE in, from what I
    > > can tell, ASCII: http://www1.tip.nl/~t876506/utf8tbl.html

    >
    > Not ASCII-- ASCII only goes up to 127. But it may be that 202 is the
    > NO-BREAK SPACE in _something_. That guide may just be wrong, but it's a
    > bit of a coincidence if you're sure Apple Mail and Outlook are rendering
    > a no-break space. Maybe they're just rendering a gap because they don't
    > know what to do with the error.
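
    The byte values quoted above are easy to check. A short Python sketch
    (the helper name hexed is made up for this example):

```python
NBSP = "\u00a0"          # NO-BREAK SPACE
E_CIRCUMFLEX = "\u00ca"  # LATIN CAPITAL LETTER E WITH CIRCUMFLEX

def hexed(data: bytes) -> str:
    """Render bytes in the '0xC2 0xA0' notation used in the quoted reply."""
    return " ".join(f"0x{b:02X}" for b in data)

print(hexed(NBSP.encode("utf-8")))               # NO-BREAK SPACE in UTF-8
print(hexed(E_CIRCUMFLEX.encode("utf-8")))       # E circumflex in UTF-8
print(hexed(E_CIRCUMFLEX.encode("iso-8859-1")))  # E circumflex in ISO-8859-1

# A strict UTF-8 decoder rejects 0xCA 0x49.
try:
    b"\xca\x49".decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```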


    Thank you Ben for a useful, productive response.

    Unfortunately, some people on this board haven't seen daylight from
    their mother's basement in a while and have the need to show off their
    1337 knowledge of character sets by insulting others :).


    **I actually found the cause of the problem I was having; a brief
    description is below:**

    Clearly, from what I described, the input data looked to be corrupt.
    Given that I don't have intricate knowledge of character sets (just
    know the basics), I figured I may have been missing something.

    As it turns out, the problem is not with the encoding, but with the
    headers that define the character set. Both headers (MIME and HTML)
    define the character set as UTF-8; however, the document is actually
    encoded in Mac-Roman. In the Mac-Roman character set, 202 (0xCA) is
    in fact the "NO-BREAK SPACE".

    When opened in a normal text editor, which tries to determine the type
    of encoding from the byte stream itself (rather than a header), it is
    properly opened as Mac-Roman. Browsers are looking at the HTML header
    (<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
    while normal text editors look at the raw file. I suppose mail
    clients are determining the encoding from the raw file, before
    rendering it as HTML, and that is why it renders properly there.

    There is undoubtedly a bug in one or more mail clients, which mark
    text bodies as UTF-8, rather than their real encoding, Mac-Roman.
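
    For reference, here is a small Python sketch of how the single octet
    0xCA decodes under a few single-byte encodings. The list of encodings is
    an illustrative choice, not an exhaustive one, and describe is a made-up
    helper name.

```python
import unicodedata

# Encodings to compare (illustrative selection only).
CANDIDATES = ["mac_roman", "iso-8859-1", "cp1252"]

def describe(octet: int, encoding: str) -> str:
    """Decode one octet under a single-byte encoding and name the character."""
    ch = bytes([octet]).decode(encoding)
    return unicodedata.name(ch)

for enc in CANDIDATES:
    print(f"{enc:12} 0xCA -> {describe(0xCA, enc)}")
```

    All three decode without error, so the bytes alone do not say which
    8-bit encoding the sender intended.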
     
    , Feb 1, 2009
    #5
  6. Guest

    Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    On Feb 1, 4:48 pm, Ben C <> wrote:
    > On 2009-02-01, <> wrote:
    >
    >
    >
    > > On Feb 1, 4:14 am, Ben C <> wrote:
    > >> On 2009-02-01, Andre de Cavaignac <> wrote:
    > >> [...]

    >
    > >> >> > The following hex is an example of the issue:
    > >> >> > 00000250  20 64 65 73 69 67 6e 2e  20 ca 49 0d 0a 68 61 76  | design. ?I..hav|

    > [...]
    > >> 202 is definitely the circumflexed E in ISO-8859-1, and the unicode
    > >> character 202 is also the circumflexed E. But it may be the NO-BREAK
    > >> SPACE in some other encoding. If so I don't know which one. But this is
    > >> one way to explain what is happening.

    > [...]
    > > As it turns out, the problem is not with the encoding, but with the
    > > headers that define the character set.  Both headers (MIME and HTML)
    > > define the character set as UTF-8, however the document is actually
    > > encoded in Mac-Roman.  In the Mac-Roman character set, 202 (0xCA) is
    > > in fact the "NO-BREAK SPACE".

    >
    > Ah, that explains it. The headers say it's UTF-8, but the bytes are not
    > valid UTF-8. So the text editor falls back on its default. You would
    > expect the default to be ISO-8859-1 for most tools (giving you an E with
    > a circumflex), but evidently it's Mac-Roman for some.
    >
    > You're probably using a Mac. Actually I can tell you are from the
    > headers on your message:
    >
    >     X-HTTP-UserAgent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
    >     en-us)
    >
    > > When opened in a normal text editor, which tries to determine the type
    > > of encoding from the byte stream itself (rather than a header), it is
    > > properly opened as Mac-Roman.

    >
    > I would think it's practically impossible in most cases to guess that
    > something is Mac-Roman rather than one of the other 8-bit encodings.
    > Your editor is just falling back on its default.
    >
    > > Browsers are looking at the HTML header
    > > (<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">),
    > > while normal text editors look at the raw file.  I suppose mail
    > > clients are determining the encoding from the raw file, before
    > > rendering it as HTML, and that is why it renders properly there.

    >
    > > There is undoubtedly a bug in one or more mail clients, which mark
    > > text bodies as UTF-8, rather than their real encoding, Mac-Roman.

    >
    > Certainly. Mac-Roman is rather a strange encoding to be using anyway. If
    > I were fixing that bug I'd make the contents UTF-8 rather than change
    > the header to Mac-Roman.


    Yeah, originally I was saving the raw bytes of the message to storage
    and then pulling it back out. I'm going to convert any text-based
    body I get to UTF-8 before saving.
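
    A rough sketch of that conversion in Python, assuming a strict decode
    with the declared charset first and a fixed fallback second. The function
    name to_utf8 and the choice of Mac-Roman as fallback are assumptions for
    this example; nothing in the bytes proves which 8-bit encoding was meant.

```python
def to_utf8(raw: bytes, declared: str = "utf-8", fallback: str = "mac_roman") -> bytes:
    """Decode with the declared charset, fall back if that fails, re-encode as UTF-8.

    The fallback is an assumption: it is simply the encoding these
    particular messages turned out to use.
    """
    try:
        text = raw.decode(declared)
    except (UnicodeDecodeError, LookupError):
        # The declared charset is wrong for these bytes, or unknown to Python.
        text = raw.decode(fallback)
    return text.encode("utf-8")

stored = to_utf8(b" design. \xcaI")
print(stored)
```

    Once the body is stored as real UTF-8, the charset=UTF-8 declaration in
    the HTML becomes true and an embedded browser can take it at face value.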

    Thanks again,
    Andre
     
    , Feb 1, 2009
    #6
  7. Guest

    Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    Interestingly, Windows Mail and Outlook also render it
    "correctly" (I'm guessing using Mac-Roman). There must be a bit more
    to it than a default fallback...
     
    , Feb 1, 2009
    #7
  8. Guest

    Re: UTF-8 Character Encodings and "NO-BREAK SPACE" (dec: 202, hex: CA) Character

    On Feb 1, 5:25 pm, Ben C <> wrote:
    > On 2009-02-01, <> wrote:
    >
    >
    >
    > > Interestingly, Windows Mail and Outlook also render it
    > > "correctly" (I'm guessing using Mac-Roman).  There must be a bit more
    > > to it than a default fallback...

    >
    > They may just be displaying nothing at all. They try to decode UTF-8,
    > find an octet sequence they don't like, and just move on. Are you sure
    > they're really showing a no-break space?


    Well, they should be showing an E with an accent circumflex if they
    are truly following UTF-8, so they must be handling that 0xCA
    somehow...

    Oddly enough, both Notepad and some simple .NET code
    (File.ReadAllText) will try to use UTF-8, so it's not a
    platform-specific behavior.

    If you look at the hex I displayed earlier, which is the raw text,
    taken using different methods, you see this:
    20 ca 49
    which corresponds to:
    <space>?I

    This is both clear from the hexdump output above, as well as just
    manually looking it up in the UTF-8 character tables. 20 is a space,
    49 is an "I" and CA is most certainly between them. If mail was
    decoding as UTF-8, you would expect an accent circumflex.

    They may just be ignoring it (they shouldn't if they are just decoding
    as UTF-8), but they are definitely adding space where the character
    belongs. A single "20" looks different than "20 CA" in the mail
    readers.
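
    What a UTF-8 decoder does with the stray CA depends on its error
    policy. In the sketch below, Python's codec error handlers stand in for
    whatever the mail clients actually do, which this thread does not
    establish.

```python
raw = bytes.fromhex("20ca49")  # the octets 20 CA 49 from the dump

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for the bad octet.
replaced = raw.decode("utf-8", errors="replace")

# "ignore" silently drops the bad octet.
ignored = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(ignored))
```

    Neither policy produces an E with a circumflex or a no-break space. A
    decoder that really treats the data as UTF-8 yields a replacement
    character or nothing at that position.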
     
    , Feb 1, 2009
    #8