Re: number of bytes for each (uni)code point while using utf-8 as encoding ...

Discussion in 'Java' started by Daniele Futtorovic, Jul 10, 2012.

  1. On 10/07/2012 12:21, lbrt chx _ gemale allegedly wrote:
    > number of bytes for each (uni)code point while using utf-8 as encoding ...
    > <snip />
    > each time you get() a Unicode code point from the buffer, you will get from 1 to 4 bytes, and the sum of all "lengths" should equal the file length in bytes, right?
    > ~
    > I am using the (new) NIO in Java 7 and I wonder if Sun made changes which make it hard to get the number of bytes a Unicode code point needs
    > ~
    > How can you get the number of bytes you "get()"?


    Well, UTF-8 always encodes the same code point to the same (number of)
    bytes, doesn't it? So you could just build a map code point -> size
    /a priori/.
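
A map is not strictly needed, since the UTF-8 length depends only on the range the code point falls in. Here is a minimal sketch of that idea; the class and method names (Utf8Length, bytesFor, encodedLength) are made up for illustration, and it assumes well-formed text with no unpaired surrogates (an encoder would replace those, so the totals would differ).

```java
final class Utf8Length {
    // Number of bytes UTF-8 uses for one code point, by value range.
    static int bytesFor(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    // Sum of the per-code-point lengths for a whole text.
    // Assumption: the text contains no unpaired surrogates.
    static long encodedLength(CharSequence text) {
        long total = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            total += bytesFor(cp);
            i += Character.charCount(cp);    // 2 chars for supplementary code points
        }
        return total;
    }

    public static void main(String[] args) {
        String sample = "na\u00EFve";
        System.out.println(encodedLength(sample));
    }
}
```

For text read from a file, summing bytesFor over the decoded code points should match the file length only if the file is valid UTF-8 and has no byte order mark that the decoder drops.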

    But really, what's the use? Knowing how big in bytes your text will be?
    It is probably just as cheap to write the text to a Writer backed by a
    counting /dev/null-style OutputStream.
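
The counting-sink idea could look roughly like the sketch below. CountingSink and encodedSize are hypothetical names, not part of the JDK; the stream discards every byte and only keeps a running total, and the Writer is an ordinary OutputStreamWriter configured for UTF-8.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// An OutputStream that throws the data away and counts the bytes it was given.
final class CountingSink extends OutputStream {
    private long count;

    @Override
    public void write(int b) {
        count++;
    }

    @Override
    public void write(byte[] buf, int off, int len) {
        count += len;
    }

    long count() {
        return count;
    }

    // Encodes the text as UTF-8 into the sink and returns the byte total.
    static long encodedSize(CharSequence text) {
        CountingSink sink = new CountingSink();
        // Closing the writer flushes any bytes still buffered by the encoder.
        try (Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            writer.append(text);
        } catch (IOException e) {
            // The sink itself never fails, so this is not expected to happen.
            throw new IllegalStateException("in-memory sink failed", e);
        }
        return sink.count();
    }
}
```

This gives the total size only. It does not report a length per code point, and malformed input such as an unpaired surrogate is counted as whatever replacement the encoder writes.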

    --
    DF.
    Daniele Futtorovic, Jul 10, 2012
