support for UTF-8 in C language standard?


David Mathog

Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.
For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes. So even though the number of characters is the same
they couldn't just be swapped byte for byte.

Thanks,

David Mathog
 

Mathias Gaunard

David said:
Does any standard C function support reading or writing UTF-8?

No.
UTF-8 is pretty simple though, and C code is available everywhere.
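To give an idea of how little code is involved, here is a minimal sketch of a hand-written decoder that turns a UTF-8 string into an array of 32-bit code points. The function name utf8_decode and its error convention are made up for this illustration; it is not part of any standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard function): decode a NUL-terminated
   UTF-8 string into 32-bit code points. Returns the number of code points
   stored, or (size_t)-1 if the input is malformed or out[] is too small. */
size_t utf8_decode(const char *s, uint32_t *out, size_t cap)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    while (*p) {
        uint32_t cp;
        int extra;  /* number of continuation bytes that must follow */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or invalid lead byte */
        p++;

        for (int i = 0; i < extra; i++, p++) {
            /* A continuation byte looks like 10xxxxxx; this test also
               catches a string that ends in the middle of a sequence. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        if (n == cap)
            return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

Once the text is in a uint32_t array, "character 1002" is simply out[1002], which is the indexing convenience the original question is after.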

For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes.

And how would having a standard function change that?
 

J. J. Farrell

David said:
Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.
For instance, even a simple operation like "swap characters 1002->1005
with 2007->2010" would be a pain, you'd pretty much have to
parse from the beginning of the UTF-8 string
just to find the specified ranges, and then they might be different
numbers of bytes. So even though the number of characters is the same
they couldn't just be swapped byte for byte.

Yes. Assuming your environment has a locale which supports UTF-8 and
whatever format you want the result in (UCS-4, presumably), then the
multibyte and wide character functions should do what you want. See
mbtowc() and mbstowcs() for starters.
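A rough sketch of that approach follows. The wrapper name utf8_to_wide is invented for the example, and the locale names "C.UTF-8" and "en_US.UTF-8" are only guesses that happen to exist on many POSIX systems; the C standard does not define them, so the function reports failure if neither can be selected.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper (not a standard function): convert a UTF-8 string to
   wide characters using the standard multibyte functions. Returns the number
   of wide characters stored (excluding the terminator), or (size_t)-1 if no
   UTF-8 locale could be selected or the input is not valid in that locale.
   Note: this changes the program's LC_CTYPE locale as a side effect. */
size_t utf8_to_wide(const char *s, wchar_t *out, size_t cap)
{
    /* Assumed locale names; adjust for your system. */
    if (!setlocale(LC_CTYPE, "C.UTF-8") && !setlocale(LC_CTYPE, "en_US.UTF-8"))
        return (size_t)-1;

    /* mbstowcs() decodes according to the current LC_CTYPE locale and
       stores at most cap wide characters in out[]. */
    return mbstowcs(out, s, cap);
}
```

Whether each wchar_t then holds a full UCS-4 value depends on the implementation. Where wchar_t is only 16 bits wide, characters outside the Basic Multilingual Plane cannot be represented in a single element.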
 

Stephen Sprunk

David Mathog said:
Does any standard C function support reading or writing UTF-8?
I'm not talking about the trivial case where the text is just the
ASCII subset of UTF-8. Rather, I'm referring to a hypothetical
function that could read UTF-8 when 2, 3, or even 4 byte encodings are
present and store the final unencoded character in, I guess, an array
of 32 bit integers.

I'm guessing that there _might_ be functions for this somewhere
in the C standard because trying to apply typical text manipulations
on a UTF-8 string directly seems to be quite messy and slow.

The locale support somewhat addresses this; unfortunately, locale names
are not standardized so your program still won't be portable in practice
even if the code is technically portable. However, if you can find the
right locale on your system, it's possible to use C's standard functions
to turn an input stream into an array of wchar_t's, manipulate them as
desired, and output them again as UTF-8.
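As an illustration of that round trip, applied to the "swap two character ranges" example from the original question, something along these lines should work. The function name swap_chars_utf8 and the locale names are assumptions made for this sketch, and it relies on wchar_t being wide enough to hold every character in the text.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): swap len characters starting
   at index a with len characters starting at index b (0-based, ranges must
   not overlap), reading UTF-8 from in and writing UTF-8 to out.
   Returns 0 on success, -1 on any failure. */
int swap_chars_utf8(const char *in, char *out, size_t outcap,
                    size_t a, size_t b, size_t len)
{
    /* Assumed locale names; they are not standardized. */
    if (!setlocale(LC_CTYPE, "C.UTF-8") && !setlocale(LC_CTYPE, "en_US.UTF-8"))
        return -1;

    /* A string never has more characters than bytes, so this is enough room. */
    size_t cap = strlen(in) + 1;
    wchar_t *w = malloc(cap * sizeof *w);
    if (!w)
        return -1;

    size_t n = mbstowcs(w, in, cap);          /* UTF-8 -> wide characters */
    size_t gap = a < b ? b - a : a - b;
    if (n == (size_t)-1 || len > n || a > n - len || b > n - len || gap < len) {
        free(w);
        return -1;
    }

    /* With one array element per character, the swap is plain indexing. */
    for (size_t i = 0; i < len; i++) {
        wchar_t t = w[a + i];
        w[a + i] = w[b + i];
        w[b + i] = t;
    }

    size_t r = wcstombs(out, w, outcap);      /* wide characters -> UTF-8 */
    free(w);
    /* r == outcap means the output was not NUL-terminated. */
    return (r == (size_t)-1 || r == outcap) ? -1 : 0;
}
```

The output needs the same number of bytes as the input, so a buffer of strlen(in) + 1 bytes is sufficient even when the swapped characters have different encoded lengths.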

<OT>There are a number of third-party libraries that provide a specific
set of conversions including UTF-8, such as libiconv. However, those
libraries are not part of the C Standard itself and thus not portable
either.</OT>

S
 
