Using unpack on a UTF-8 string

Greg Hurrell · Feb 24, 2007

On my system:

'€'.unpack('U*')

Produces:

=> [8364]

I would have expected this:

=> [342, 202, 254]

In fact, I could have sworn that things used to work this way... Am I
going crazy? The following seems to confirm that the string is indeed
using a UTF-8 representation internally.

'€'.collect
=> ["\342\202\254"]

I get exactly the same results whether $KCODE is set to 'NONE' or 'u'.

Cheers,
Greg

Carlos · Feb 24, 2007

The Unicode codepoint for the euro sign is 8364. In your string you have
that number encoded as the byte sequence [226, 130, 172] (342, 202 and 254
are those same bytes written in octal). That encoding is known as UTF-8.
#unpack('U*') decodes that byte sequence and gives you the number.

As an analogy, suppose you had the string "\254 \000\000" and called
#unpack("I"). The byte sequence [172, 32, 0, 0] also represents the
number 8364, but this time encoded in the internal format my computer uses
(a little-endian 32-bit integer). #unpack retrieves that number. The fact
that UTF-8 is used for encoding Unicode codepoints is incidental to this.
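
The analogy can be tried out with a small sketch. It assumes the 'V' directive (32-bit unsigned, little-endian) as a portable stand-in for the native 'I', and adds 'N' (big-endian) for contrast; the variable names are illustrative only.

```ruby
# Assumption: 'V' (32-bit unsigned, little-endian) stands in for the native
# 'I' directive so the result is the same on every machine.
number = 8364                          # Unicode codepoint of the euro sign

little_endian = [number].pack('V')     # 4 bytes, least significant first
big_endian    = [number].pack('N')     # 4 bytes, most significant first
utf8          = [number].pack('U')     # 3 bytes, UTF-8

p little_endian.unpack('C*')   # => [172, 32, 0, 0]
p big_endian.unpack('C*')      # => [0, 0, 32, 172]
p utf8.unpack('C*')            # => [226, 130, 172]

# Each directive reverses its own encoding and yields the same number.
p little_endian.unpack('V')    # => [8364]
p big_endian.unpack('N')       # => [8364]
p utf8.unpack('U')             # => [8364]
```

In each case the bytes differ but the number is the same; the directive only tells #unpack which encoding to undo.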

To unpack the bytes from a string use #unpack("C*").
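
A minimal sketch of the two directives side by side. The euro sign is written with octal byte escapes so the example does not depend on the source file encoding; the variable names are illustrative.

```ruby
# The euro sign as its three UTF-8 bytes, written with octal escapes.
euro = "\342\202\254"

codepoints = euro.unpack('U*')   # decode UTF-8 into Unicode codepoints
bytes      = euro.unpack('C*')   # read each byte as an unsigned 8-bit integer

p codepoints   # => [8364]
p bytes        # => [226, 130, 172]

# The same bytes printed in octal match the escapes Ruby shows for the string.
octal = bytes.map { |b| b.to_s(8) }
p octal        # => ["342", "202", "254"]
```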

HTH.

Greg Hurrell · Feb 25, 2007

Thanks a million, Carlos. I never would have figured that out for
myself. I misunderstood the documentation for String#unpack:

C | Fixnum | extract a character as an unsigned integer
U | Integer | UTF-8 characters as unsigned integers

unpack('C*') does indeed give me what I want...

Cheers,
Greg

Clifford Heath · Feb 25, 2007

Greg said:
Thanks a million, Carlos. I never would have figured that out for
myself. I misunderstood the documentation for String#unpack:
C | Fixnum | extract a character as an unsigned integer
U | Integer | UTF-8 characters as unsigned integers

The problem here is the inconsistent use of "character" in the
documentation. A character is *not* a byte. The documentation
should be revised to use the two words only in their correct
contexts, with annotations to remind people of the distinction.
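
The distinction is easy to see in a short sketch. It assumes Ruby 1.9 or later, where strings carry an encoding and String#length counts characters (in the Ruby 1.8 of this thread it counted bytes).

```ruby
# Assumption: Ruby 1.9 or later, where strings carry an encoding.
euro = "\u20AC"                # one character, stored as UTF-8

char_count = euro.length       # counts characters
byte_count = euro.bytesize     # counts bytes

p char_count                   # => 1
p byte_count                   # => 3

# 'U*' yields one integer per character, 'C*' one integer per byte.
p euro.unpack('U*').length     # => 1
p euro.unpack('C*').length     # => 3
```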

Clifford Heath.
