Unicode characters

abhi147

Hi,

I want to pass a string of Unicode characters to a function.
The string is a 4 bit Unicode character string like "\xab\x0a\x0c\x0d".
These chars get converted to their ASCII equivalent. Hence \x0a or
\x0d is getting converted to line feed/carriage return etc.
So the function which accepts the string is reading just the 4 bits and
ignoring the rest, hence returning failure.
Is there any other way to send these Unicode characters?

Thanks!
 
Phlip

abhi147 said:
I want to pass a string of Unicode characters to a function.
The string is a 4 bit Unicode character string like "\xab\x0a\x0c\x0d".

UTF-8? If so, \x0a is a line feed.

UTF-16? If so, the string looks 8-bit to me, not 16-bit. It should have an L
on the front, and the \x codes should be 4 hexies long.
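
For instance, such a literal might be written like this, assuming the four
values really are meant as individual code units (wchar_t's width and
encoding are implementation-defined, so this is only a sketch):

#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* Four values written as 4-hex-digit escapes in a wide string
       literal; how wchar_t stores them is implementation-defined. */
    wchar_t s[] = L"\x00ab\x000a\x000c\x000d";

    /* four code units plus the terminating L'\0' */
    printf("%lu code units\n", (unsigned long)(sizeof s / sizeof s[0] - 1));
    return 0;
}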

Also, I always guessed that UTF-16 would not allow any extra ASCII control
code to appear. The UTFs all use the 8th (or 16th, or whatever) bit, so no
extended code would accidentally resemble an ASCII code.

You need to step back from the situation and evaluate how you treat this
string before it gets to this juncture. Maybe it belongs inside a
std::wstring. Maybe you have too many typecasts.

If you write the string to a file and view it in MS's Notepad, what does
that say? It's very good about getting an encoding right.
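
One rough way to do that dump, assuming the data sits in a plain array of
known length (the file name here is made up, and binary mode matters so
0x0a and 0x0d are written untouched):

#include <stdio.h>

int main(void)
{
    /* the four bytes from the original post */
    const unsigned char buf[] = { 0xab, 0x0a, 0x0c, 0x0d };
    FILE *fp = fopen("dump.bin", "wb");   /* "wb": no newline translation */

    if (fp == NULL)
        return 1;
    fwrite(buf, 1, sizeof buf, fp);
    fclose(fp);
    return 0;
}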
 
Lew Pitcher


Phlip said:
> UTF-8? If so, \x0a is a line feed.

Can't be UTF-8, as 0xab isn't a proper UTF-8 "introducer" value.
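
The check behind that statement just looks at the top bits of the byte; a
sketch (it ignores the finer UTF-8 rules, such as 0xC0, 0xC1 and anything
above 0xF4 being invalid):

#include <stdio.h>

/* 0 means "not a valid lead byte"; otherwise the sequence length. */
static int utf8_sequence_length(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx - plain ASCII       */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx                     */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx                     */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx                     */
    return 0;                           /* 10xxxxxx - continuation byte */
}

int main(void)
{
    /* 0xAB is 10101011, a continuation byte, so it cannot start a sequence. */
    printf("length for 0xAB: %d\n", utf8_sequence_length(0xAB));  /* 0 */
    return 0;
}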

[snip]

 
Thomas Lumley

Phlip said:
> UTF-8? If so, \x0a is a line feed.

This is not a valid UTF-8 string, since only the first octet has its
high bit set.

> UTF-16? If so, the string looks 8-bit to me, not 16-bit. It should have an L
> on the front, and the \x codes should be 4 hexies long.

The encoding for wide string literals specified by L"" is
implementation-defined.

> Also, I always guessed that UTF-16 would not allow any extra ASCII control
> code to appear. The UTFs all use the 8th (or 16th, or whatever) bit, so no
> extended code would accidentally resemble an ASCII code.

This turns out not to be the case. UTF-16 stores Unicode code points
as 16-bit unsigned integers. You can't put UTF-16 into a C string [on
systems where char is 8-bit] because it tends to have lots of zero
bytes [when bytes are 8-bit], especially in English, but even in other
languages.
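
A tiny demonstration, assuming 8-bit char and big-endian UTF-16 holding the
two characters "AB":

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-16BE for "AB": 00 41 00 42 -- half the bytes are zero */
    const char utf16be[] = { 0x00, 0x41, 0x00, 0x42 };

    /* any str* function stops at the first zero byte */
    printf("strlen sees %lu of %lu bytes\n",
           (unsigned long)strlen(utf16be),
           (unsigned long)sizeof utf16be);   /* prints: 0 of 4 */
    return 0;
}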

> You need to step back from the situation and evaluate how you treat this
> string before it gets to this juncture. Maybe it belongs inside a
> std::wstring.

Not in this newsgroup, it doesn't. It might belong in an array of
wchar_t, though.
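
Along these lines, say (the receiving function is invented purely for the
sketch):

#include <stdio.h>
#include <wchar.h>

/* hypothetical receiver -- stands in for whatever the original
   poster's function does; here it only counts code units */
static void take_wide(const wchar_t *ws)
{
    printf("got %lu code units\n", (unsigned long)wcslen(ws));
}

int main(void)
{
    /* the four values, one wchar_t each, with a terminating zero
       (safe here because none of the four values is itself zero) */
    wchar_t data[] = { 0x00ab, 0x000a, 0x000c, 0x000d, 0 };

    take_wide(data);   /* prints: got 4 code units */
    return 0;
}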

-thomas
 
Dik T. Winter

Phlip said:
> UTF-8? If so, \x0a is a line feed.
>
> UTF-16? If so, the string looks 8-bit to me, not 16-bit. It should have an L
> on the front, and the \x codes should be 4 hexies long.
>
> Also, I always guessed that UTF-16 would not allow any extra ASCII control
> code to appear. The UTFs all use the 8th (or 16th, or whatever) bit, so no
> extended code would accidentally resemble an ASCII code.

Nope. In UTF-16, 0x0000 to 0xD7FF and 0xE000 to 0xFFFF represent themselves
(0xD800 to 0xDFFF are special).
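
Those reserved values are the surrogates used for code points above 0xFFFF;
a sketch of how one such code point splits into two 16-bit units:

#include <stdio.h>

/* Sketch: encode one code point as UTF-16.  Values up to 0xFFFF
   (outside the 0xD800-0xDFFF range) are stored as themselves;
   anything above 0xFFFF becomes a surrogate pair. */
static int utf16_encode(unsigned long cp, unsigned int out[2])
{
    if (cp <= 0xFFFF) {
        out[0] = (unsigned int)cp;
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = 0xD800 | (unsigned int)(cp >> 10);    /* high surrogate */
    out[1] = 0xDC00 | (unsigned int)(cp & 0x3FF);  /* low surrogate  */
    return 2;
}

int main(void)
{
    unsigned int u[2] = { 0, 0 };
    int n = utf16_encode(0x1F600UL, u);    /* a code point beyond 0xFFFF */

    printf("%d unit(s): %04X %04X\n", n, u[0], u[1]);  /* D83D DE00 */
    return 0;
}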
 
J. J. Farrell

abhi147 said:
I want to pass a string of Unicode characters to a function.
The string is a 4 bit Unicode character string like "\xab\x0a\x0c\x0d".
These chars get converted to their ASCII equivalent. Hence \x0a or
\x0d is getting converted to line feed/carriage return etc.
So the function which accepts the string is reading just the 4 bits and
ignoring the rest, hence returning failure.
Is there any other way to send these Unicode characters?

Your question doesn't make any sense. Where does "4 bits" come into it?
Unicode characters are 21 bits, and can be represented in a number of
different ways - either individually in an object of at least 21 bits,
or in various multi-word formats (where a "word" is 7, 8 or 16 bits).
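
As an illustration of one of those multi-word formats, here is roughly how a
code point of up to 21 bits maps onto 8-bit units (UTF-8); the sketch skips
validity checks such as rejecting surrogates or values above 0x10FFFF:

#include <stdio.h>

/* Sketch: encode one code point (up to 21 bits) as 1 to 4 UTF-8 bytes.
   Returns how many bytes were written; no range or surrogate checks. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0xABUL, buf);   /* U+00AB from the post */

    for (i = 0; i < n; i++)
        printf("%02X ", (unsigned)buf[i]); /* prints: C2 AB */
    putchar('\n');
    return 0;
}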

What format is the "string" you are starting with, and where did it
come from? What representation format is the function expecting to
receive? What C type does the function expect to receive?

As well as clearly defining the problem, if you post the code which
doesn't work we may be able to help you fix it. We won't be able to
help until both you and we understand precisely what you are trying to
do, though.
 
