convert raw bytes to Unicode strings

brad

Does standard C++ have any methods to do this? I'd like to convert raw
bytes to utf-8. Thanks for any tips.
 
brad

Victor said:
What is the difference between "raw bytes" and "utf-8"?

V

raw bytes are not character streams. They do not conform to the concept
of a char. grep a binary file for a string, then grep a text file for a
string to gain a better understanding of this difference.
 
Pascal J. Bourguignon

brad said:
raw bytes are not character streams. They do not conform to the
concept of a char. grep a binary file for a string, then grep a text
file for a string to gain a better understanding of this difference.

But when you take a string containing characters, and you encode it
into a sequence of UTF-8 bytes, you don't get a string, but a sequence
of bytes.

What is the difference between these bytes and your "raw" bytes?

Do you know what UTF-8 is? (At least read the Wikipedia article about it.)

Anyway, there's no standard C++ function to do what you want. You
could use an external library like libiconv, or just write the UTF-8
encoding/decoding algorithm in C++ yourself.
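
As a rough illustration of the "write it yourself" route, encoding a single
Unicode code point as UTF-8 could look something like the sketch below. The
name `encode_utf8` is made up for this example, C++11 or later is assumed
(for `char32_t`), and invalid input (surrogates, values above U+10FFFF) is
handled by simply returning an empty string.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Returns an empty string for values that are not valid scalar values.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx + 2 continuation bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;                            // empty if cp > 0x10FFFF
}
```

For example, `encode_utf8(0x20AC)` (the euro sign) yields the three bytes
E2 82 AC.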
 
Juha Nieminen

brad said:
Does standard C++ have any methods to do this? I'd like to convert raw
bytes to utf-8. Thanks for any tips.

I think you are confusing two different (although related) concepts.

A "Unicode string" and a "UTF-8 string" are two different things.

The former is a string where each character represents a unicode
character. Usually that means that every character must be 4 bytes long
in order to be able to store any unicode value. (Although I'm not sure
if there's an existing convention for this. I'm not exactly sure what's
the "standard" width for a unicode wide character.)

A "UTF-8 string" is a string which has been UTF-8-encoded. This means
that each "character" in the string is of variable length. Each
character may be between 1 and 4 bytes in size. (This means, among other
things, that random access is not possible. That is, you can't get the
nth character in constant time, but you must traverse the string from
the beginning if you want to do so.)
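
To make the random-access point concrete, here is a minimal sketch of
finding the nth character by walking the string from the start. It assumes
the input is already valid UTF-8 and treats "character" as "code point";
`utf8_offset_of` is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// string assumed to hold valid UTF-8. Returns std::string::npos if the
// string has fewer than n + 1 code points. Runs in O(length), not O(1).
std::size_t utf8_offset_of(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new code point.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

With a fixed-width representation such as UTF-32 the same lookup would be a
plain index operation.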

You should be clearer about what it is that you want. Let me guess:
you have UTF-8-encoded input, and you want to decode it to a string
containing wide characters.
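
If that guess is right, a hand-written decoder might look roughly like the
sketch below. It is only an illustration: `decode_utf8` is a made-up name,
`std::u32string` (C++11) stands in for "a string containing wide
characters", and throwing `std::invalid_argument` on malformed input is
just one possible error policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how long the sequence is.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("bad UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        out += cp;
        i += len;
    }
    return out;
}
```

The result has one element per code point, so indexing it is constant time,
at the cost of four bytes per element.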
 
James Kanze

Juha Nieminen said:
I think you are confusing two different (although related)
concepts.

I think he's confusing a lot of things. Raw bytes underlie all
data in the computer; UTF-8 strings sit in raw bytes, as do
doubles and anything else you can think of. The idea of
"converting" raw bytes into anything is patently absurd.

Juha Nieminen said:
A "Unicode string" and a "UTF-8 string" are two different
things.

The first is less precise than the second. UTF-8 is only one
possible encoding form of Unicode.

Juha Nieminen said:
The former is a string where each character represents a
unicode character. Usually that means that every character
must be 4 bytes long in order to be able to store any unicode
value. (Although I'm not sure if there's an existing
convention for this. I'm not exactly sure what's the
"standard" width for a unicode wide character.)

Strictly speaking, "Unicode" is a mapping between "characters"
and integral values. Unicode also defines several encoding
formats, ways of encoding these integral values in machine words
of various lengths: UTF-8, UTF-16 and UTF-32. In contexts where
byte order matters (e.g. byte-oriented transmission media),
you can append LE or BE to UTF-16 or UTF-32 to be more
precise.
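
To show that these are just different ways of storing the same integral
value, here is a small sketch of the UTF-16 side. The name `encode_utf16`
is invented for the example, and the input is assumed to be a valid
Unicode scalar value (no checking is done).

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value (not checked here).
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // one 16-bit unit
    } else {
        const char32_t v = cp - 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

So U+20AC is the single unit 0x20AC in UTF-16 and the single unit
0x000020AC in UTF-32, while U+1F600 needs the pair 0xD83D 0xDE00 in UTF-16.
The LE/BE suffix only says in which order the bytes of each unit are written
out: 0x20AC is the bytes 20 AC in UTF-16BE and AC 20 in UTF-16LE.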

Juha Nieminen said:
A "UTF-8 string" is a string which has been UTF-8-encoded.

And thus, is a Unicode string.

Juha Nieminen said:
This means that each "character" in the string is of variable
length.

That's more or less true of every encoding format.

Juha Nieminen said:
Each character may be between 1 and 4 bytes in size.

Each code point may be between 1 and 4 bytes. A character
may use several code points in its representation.
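
A quick way to see the difference between bytes, code points and
characters is to count them. The sketch below assumes valid UTF-8 input;
`count_code_points` is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string assumed to be valid
// UTF-8 by counting every byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char b : s)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

For the string `"e\xCC\x81"` (the letter e followed by U+0301, COMBINING
ACUTE ACCENT) this reports 2 code points in 3 bytes, yet a reader sees a
single accented character.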
 
