Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

Discussion in 'Java' started by Joshua Cranmer, Jul 12, 2012.

  1. On 7/10/2012 3:45 PM, lbrt chx _ gemale wrote:
    >> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
    >
    >>> How can you get the number of bytes you "get()"?
    >
    >> Well, UTF-8 always encodes the same char to the same (number of) bytes,
    >> doesn't it?
    > ~
    > What about files whose authors claim they are UTF-8 encoded but which aren't, and/or which get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages
    > ~
    > Sometimes you must double-check every file you keep in a text bank/corpus because, through associations, one mistake may propagate and create other kinds of problems
    > ~


    I don't see how knowing the char -> length mapping is going to help you
    in this case. If your input is a blob of bytes which someone claims is
    UTF-8 but isn't, you can set up decoders to throw an error, or at least
    to insert the replacement char (U+FFFD), which makes it detectable that
    someone screwed up.
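
A minimal sketch of both decoder setups, assuming Java 7 or later for StandardCharsets. The class name StrictUtf8 and its helper methods are made up for illustration; only the java.nio.charset calls are standard library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
class StrictUtf8 {
    /** Decodes bytes as UTF-8 and throws on any malformed sequence. */
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Decodes leniently: malformed sequences are replaced with U+FFFD. */
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** True if the bytes decode cleanly under the strict decoder. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // The same text in ISO-8859-1: a lone 0xE9 byte is not valid UTF-8.
        byte[] bad = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("good is valid UTF-8: " + isValidUtf8(good));
        System.out.println("bad is valid UTF-8:  " + isValidUtf8(bad));
        System.out.println("lenient decode contains U+FFFD: "
                + (decodeLenient(bad).indexOf('\uFFFD') >= 0));
    }
}
```

The strict variant suits a corpus check where a bad file should be rejected outright; the lenient one keeps going and leaves U+FFFD markers you can search for afterwards.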

    The problem also is, if it's not UTF-8, what is it then? The heuristics
    for this kind of stuff are incredibly squirrelly, and it more or less
    turns out that the most reliable way to fix it is to know the default
    charset of the computer spitting data out at you. Even then, there's
    still a possibility that its input was screwed up in a similar fashion:
    I've seen one message undergo the standard
    I-thought-your-UTF8-was-ISO-8859-1 mix-up twice, so that every accented
    character ended up as 4 gibberish characters.
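
The double mix-up described above can be reproduced in a few lines. This is only a sketch: the class DoubleMojibake and its methods are invented for illustration, and it assumes the wrong charset really was ISO-8859-1 (with windows-1252, which is common in practice, the damage is not always reversible).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class DoubleMojibake {
    /** Simulates one round of "UTF-8 bytes read as ISO-8859-1". */
    static String misdecodeOnce(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    /** Undoes one round, assuming the mistake was exactly the one above. */
    static String repairOnce(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";              // e with acute accent, 2 bytes in UTF-8
        String once = misdecodeOnce(original);   // 2 gibberish chars
        String twice = misdecodeOnce(once);      // 4 gibberish chars

        System.out.println("after one round:  " + once.length() + " chars");
        System.out.println("after two rounds: " + twice.length() + " chars");
        System.out.println("repaired twice equals original: "
                + repairOnce(repairOnce(twice)).equals(original));
    }
}
```

Undoing the damage only works here because the exact sequence of mistakes is known, which is the point made above: without knowing the charset the sender actually used, you are left guessing.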

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
    Joshua Cranmer, Jul 12, 2012
    #1