Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

Discussion in 'Java' started by Daniele Futtorovic, Jul 10, 2012.

  1. On 10/07/2012 21:45, lbrt chx _ gemale allegedly wrote:
    >> On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:

    >
    >>> How can you get the number of bytes you "get()"?

    >
    >> Well, UTF-8 always encodes the same char to the same (number of) bytes,
    >> doesn't it?

    > ~
    > What about files whose authors claim they are UTF-8 encoded when they aren't, and/or that somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages
    > ~
    > Sometimes you must double check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
    > ~
    >> So you could just build a map char -> size /a priori/.

    > ~
    > ...
    > ~
    >> But really, what's the use? ...

    > ~
    > To you there is none, but I am trying to pinpoint as closely as I possibly can:
    > ~
    > .onMalformedInput(CodingErrorAction.REPORT);
    > .onUnmappableCharacter(CodingErrorAction.REPORT);
    > ~
    > errors
    > ~
    > There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction measures


    And what's that knowledge about the mapping size going to tell you?

    Assume the file is corrupted. Then you can't know the original character
    (since it's corrupted). Hence even if you know to how many bytes each
    character maps, you can't tell whether the size you're seeing is wrong
    or right.

    At least that's how it seems to me.
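
    For reference, here is a minimal sketch of the size mapping discussed above. It relies only on the fact that the UTF-8 length of a code point is fixed by the range it falls in, so no lookup table is needed. The class and method names (Utf8Lengths, encodedLength, lengthFromLeadByte) are made up for illustration, and the demo assumes Java 8 or later for String.codePoints().

```java
import java.nio.charset.StandardCharsets;

final class Utf8Lengths {
    /** Number of bytes UTF-8 uses for one Unicode scalar value. */
    static int encodedLength(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + codePoint);
        }
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    /**
     * Sequence length announced by a lead byte, or -1 if the byte cannot
     * start a sequence. This does not validate the continuation bytes.
     */
    static int lengthFromLeadByte(byte b) {
        int u = b & 0xFF;
        if (u < 0x80) return 1;
        if (u >= 0xC2 && u <= 0xDF) return 2;
        if (u >= 0xE0 && u <= 0xEF) return 3;
        if (u >= 0xF0 && u <= 0xF4) return 4;
        return -1; // continuation byte (0x80..0xBF) or invalid lead (0xC0, 0xC1, 0xF5..0xFF)
    }

    public static void main(String[] args) {
        String s = "a\u00E9\u20AC\uD83D\uDE00"; // U+0061, U+00E9, U+20AC, U+1F600
        s.codePoints().forEach(cp -> {
            // Cross-check the computed length against what the JDK encoder produces.
            int actual = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X -> computed %d, encoded %d%n", cp, encodedLength(cp), actual);
        });
    }
}
```

    Note that this only tells you how long a sequence should be for a given code point. It says nothing about whether the bytes you actually received are the ones that were sent.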

    Even the malformedness is no reliable indicator. Your data might get
    corrupted and the outcome be well-formed, as far as the character
    encoding is concerned.
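
    To illustrate the point, here is a small sketch using the same REPORT settings quoted above. One flipped bit turns 'c' into 'b' and the strict decoder accepts it without complaint, whereas overwriting a continuation byte is caught. The class name StrictUtf8 and the sample string are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    /** True if the bytes decode as UTF-8 with both error actions set to REPORT. */
    static boolean isWellFormed(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or unmappable input was reported
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9 encodes to the bytes 63 61 66 C3 A9.
        byte[] original = "caf\u00E9".getBytes(StandardCharsets.UTF_8);

        // Corruption the decoder cannot see: 0x63 ('c') becomes 0x62 ('b').
        byte[] silent = original.clone();
        silent[0] ^= 0x01;

        // Corruption the decoder does see: the continuation byte A9 is replaced
        // by 0x41 ('A'), leaving the lead byte C3 without its continuation.
        byte[] detected = original.clone();
        detected[4] = 0x41;

        System.out.println("original well-formed: " + isWellFormed(original));
        System.out.println("silent   well-formed: " + isWellFormed(silent));
        System.out.println("detected well-formed: " + isWellFormed(detected));
    }
}
```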

    I have to agree with Lew. Only the transmission layer can reliably
    tackle this problem. Just pass a checksum and be done with it.
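
    A rough sketch of what "pass a checksum" could look like on the receiving side, assuming the sender publishes a SHA-256 hex digest alongside the file. The class and method names (TransferChecksum, sha256Hex, arrivedIntact) are made up for illustration. A CRC would also do against accidental corruption. SHA-256 is just the choice made here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TransferChecksum {
    /** Lower-case hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the digest of the received bytes with the digest the sender published. */
    static boolean arrivedIntact(byte[] received, String expectedHex) {
        return sha256Hex(received).equalsIgnoreCase(expectedHex);
    }

    public static void main(String[] args) {
        byte[] sent = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        String published = sha256Hex(sent); // computed by the sender before transfer

        byte[] received = sent.clone();
        received[0] ^= 0x01; // still valid UTF-8, but no longer the same bytes

        System.out.println("intact copy ok:    " + arrivedIntact(sent, published));
        System.out.println("corrupted copy ok: " + arrivedIntact(received, published));
    }
}
```

    The check works on raw bytes, before any decoding, so it catches corruption regardless of whether the damaged data happens to be well-formed UTF-8.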

    --
    DF.
     
    Daniele Futtorovic, Jul 10, 2012
    #1

  2. Daniele Futtorovic

    Lew Guest

    Daniele Futtorovic wrote:
    > lbrt chx _ gemale allegedly wrote:
    > lbrt chx _ gemale allegedly wrote:
    > >
    > >>> How can you get the number of bytes you "get()"?
    > >
    > >> Well, UTF-8 always encodes the same char to the same (number of) bytes,
    > >> doesn't it?
    > > ~
    > > What about files whose authors claim they are UTF-8 encoded when they aren't, and/or that somehow get corrupted in transit? There are quite a few "monkeys" (us) messing with the metadata headers of HTML pages
    > > ~
    > > Sometimes you must double check every file you keep in a text bank/corpus, because, through associations, one mistake may propagate and create other kinds of problems
    > > ~
    > >> So you could just build a map char -> size /a priori/.
    > > ~
    > > ...
    > > ~
    > >> But really, what's the use? ...
    > > ~
    > > To you there is none, but I am trying to pinpoint as closely as I possibly can:
    > > ~
    > > .onMalformedInput(CodingErrorAction.REPORT);
    > > .onUnmappableCharacter(CodingErrorAction.REPORT);
    > > ~
    > > errors
    > > ~
    > > There should be a way to get sizes as you get UTF-8 encoded sequences from a file. Also, I have found that quite a few files get corrupted in transmission, and sometimes I wonder how safe that naive mapping you mention is, since those file formats don't have any kind of built-in error correction measures
    >
    > And what's that knowledge about the mapping size going to tell you?
    >
    > Assume the file is corrupted. Then you can't know the original character
    > (since it's corrupted). Hence even if you know to how many bytes each
    > character maps, you can't tell whether the size you're seeing is wrong
    > or right.
    >
    > At least that's how it seems to me.
    >
    > Even the malformedness is no reliable indicator. Your data might get
    > corrupted and the outcome be well-formed, as far as the character
    > encoding is concerned.
    >
    > I have to agree with Lew. Only the transmission layer can reliably
    > tackle this problem. Just pass a checksum and be done with it.


    Moreover, a corrupt file has no bearing on the correctness of the Java
    code. The file may well be corrupt while the Java code works
    perfectly.

    --
    Lew
     
    Lew, Jul 10, 2012
    #2