Unicode characters

abhi147

Hi,

I want to pass a string of Unicode characters to a function.
The string is a 4 bit Unicode character string like "\xab\x0a\x0c\x0d".
These chars get converted to their ASCII equivalent. Hence \x0a or
\x0d is getting converted to line feed/carriage return etc.
So the function which accepts the string is reading just the 4 bits and
ignoring the rest, hence returning failure.
Is there any other way to send these Unicode characters?

Thanks!
 
Phlip

abhi147 said:
I want to pass a string of Unicode characters to a function.
The string is a 4 bit Unicode character string like "\xab\x0a\x0c\x0d".

UTF-8? If so, \x0a is a line feed.

UTF-16? If so, the string looks 8-bit to me, not 16-bit. It should have an L
on the front, and the \x codes should be 4 hexies long.
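
For instance, such a literal might be written like this, assuming the four
values really are meant as individual code units (wchar_t's width and
encoding are implementation-defined, so this is only a sketch):

#include <stddef.h>
#include <stdio.h>

int main(void)
{
    /* Four values written as 4-hex-digit escapes in a wide string
       literal; how wchar_t stores them is implementation-defined. */
    wchar_t s[] = L"\x00ab\x000a\x000c\x000d";

    /* four code units plus the terminating L'\0' */
    printf("%lu code units\n", (unsigned long)(sizeof s / sizeof s[0] - 1));
    return 0;
}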

Also, I always guessed that UTF-16 would not allow any extra ASCII control
code to appear. The UTFs all use the 8th (or 16th, or whatever) bit, so no
extended code would accidentally resemble an ASCII code.

You need to step back from the situation and evaluate how you treat this
string before it gets to this juncture. Maybe it belongs inside a
std::wstring. Maybe you have too many typecasts.

If you write the string to a file and view it in MS's Notepad, what does
that say? It's very good about getting an encoding right.
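
One rough way to do that dump, assuming the data sits in a plain array of
known length (the file name here is made up, and binary mode matters so
0x0a and 0x0d are written untouched):

#include <stdio.h>

int main(void)
{
    /* the four bytes from the original post */
    const unsigned char buf[] = { 0xab, 0x0a, 0x0c, 0x0d };
    FILE *fp = fopen("dump.bin", "wb");   /* "wb": no newline translation */

    if (fp == NULL)
        return 1;
    fwrite(buf, 1, sizeof buf, fp);
    fclose(fp);
    return 0;
}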
 
Lew Pitcher


Phlip said:
> UTF-8? If so, \x0a is a line feed.

Can't be UTF-8, as 0xab isn't a proper UTF-8 "introducer" value.
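
The check behind that statement just looks at the top bits of the byte; a
sketch (it ignores the finer UTF-8 rules, such as 0xC0, 0xC1 and anything
above 0xF4 being invalid):

#include <stdio.h>

/* 0 means "not a valid lead byte"; otherwise the sequence length. */
static int utf8_sequence_length(unsigned char c)
{
    if (c < 0x80)           return 1;   /* 0xxxxxxx - plain ASCII       */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx                     */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx                     */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx                     */
    return 0;                           /* 10xxxxxx - continuation byte */
}

int main(void)
{
    /* 0xAB is 10101011, a continuation byte, so it cannot start a sequence. */
    printf("length for 0xAB: %d\n", utf8_sequence_length(0xAB));  /* 0 */
    return 0;
}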

[snip]

 
Thomas Lumley

Phlip said:
> UTF-8? If so, \x0a is a line feed.

This is not a valid UTF-8 string, since only the first octet has its
high bit set.

> UTF-16? If so, the string looks 8-bit to me, not 16-bit. It should have an L
> on the front, and the \x codes should be 4 hexies long.

The encoding for wide string literals specified by L"" is
implementation-defined.

> Also, I always guessed that UTF-16 would not allow any extra ASCII control
> code to appear. The UTFs all use the 8th (or 16th, or whatever) bit, so no
> extended code would accidentally resemble an ASCII code.

This turns out not to be the case. UTF-16 stores Unicode code points
as 16-bit unsigned integers. You can't put UTF-16 into a C string [on
systems where char is 8-bit] because it tends to have lots of zero
bytes [when bytes are 8-bit], especially in English, but even in other
languages.
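
A tiny demonstration, assuming 8-bit char and big-endian UTF-16 holding the
two characters "AB":

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-16BE for "AB": 00 41 00 42 -- half the bytes are zero */
    const char utf16be[] = { 0x00, 0x41, 0x00, 0x42 };

    /* any str* function stops at the first zero byte */
    printf("strlen sees %lu of %lu bytes\n",
           (unsigned long)strlen(utf16be),
           (unsigned long)sizeof utf16be);   /* prints: 0 of 4 */
    return 0;
}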

> You need to step back from the situation and evaluate how you treat this
> string before it gets to this juncture. Maybe it belongs inside a
> std::wstring.

Not in this newsgroup, it doesn't. It might belong in an array of
wchar_t, though.
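
Along these lines, say (the receiving function is invented purely for the
sketch):

#include <stdio.h>
#include <wchar.h>

/* hypothetical receiver -- stands in for whatever the original
   poster's function does; here it only counts code units */
static void take_wide(const wchar_t *ws)
{
    printf("got %lu code units\n", (unsigned long)wcslen(ws));
}

int main(void)
{
    /* the four values, one wchar_t each, with a terminating zero
       (safe here because none of the four values is itself zero) */
    wchar_t data[] = { 0x00ab, 0x000a, 0x000c, 0x000d, 0 };

    take_wide(data);   /* prints: got 4 code units */
    return 0;
}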

-thomas
 
Dik T. Winter

Phlip said:
> UTF-8? If so, \x0a is a line feed.
>
> UTF-16? If so, the string looks 8-bit to me, not 16-bit. It should have an L
> on the front, and the \x codes should be 4 hexies long.
>
> Also, I always guessed that UTF-16 would not allow any extra ASCII control
> code to appear. The UTFs all use the 8th (or 16th, or whatever) bit, so no
> extended code would accidentally resemble an ASCII code.

Nope. In UTF-16, 0x0000 to 0xD7FF and 0xE000 to 0xFFFF represent themselves
(0xD800 to 0xDFFF are special).
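
Those reserved values are the surrogates used for code points above 0xFFFF;
a sketch of how one such code point splits into two 16-bit units:

#include <stdio.h>

/* Sketch: encode one code point as UTF-16.  Values up to 0xFFFF
   (outside the 0xD800-0xDFFF range) are stored as themselves;
   anything above 0xFFFF becomes a surrogate pair. */
static int utf16_encode(unsigned long cp, unsigned int out[2])
{
    if (cp <= 0xFFFF) {
        out[0] = (unsigned int)cp;
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = 0xD800 | (unsigned int)(cp >> 10);    /* high surrogate */
    out[1] = 0xDC00 | (unsigned int)(cp & 0x3FF);  /* low surrogate  */
    return 2;
}

int main(void)
{
    unsigned int u[2] = { 0, 0 };
    int n = utf16_encode(0x1F600UL, u);    /* a code point beyond 0xFFFF */

    printf("%d unit(s): %04X %04X\n", n, u[0], u[1]);  /* D83D DE00 */
    return 0;
}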
 
J. J. Farrell

abhi147 said:
I want to pass a string of Unicode characters to a function.
The string is a 4 bit Unicode character string like "\xab\x0a\x0c\x0d".
These chars get converted to their ASCII equivalent. Hence \x0a or
\x0d is getting converted to line feed/carriage return etc.
So the function which accepts the string is reading just the 4 bits and
ignoring the rest, hence returning failure.
Is there any other way to send these Unicode characters?

Your question doesn't make any sense. Where does "4 bits" come into it?
Unicode characters are 21 bits, and can be represented in a number of
different ways - either individually in an object of at least 21 bits,
or in various multi-word formats (where a "word" is 7, 8 or 16 bits).
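
As an illustration of one of those multi-word formats, here is roughly how a
code point of up to 21 bits maps onto 8-bit units (UTF-8); the sketch skips
validity checks such as rejecting surrogates or values above 0x10FFFF:

#include <stdio.h>

/* Sketch: encode one code point (up to 21 bits) as 1 to 4 UTF-8 bytes.
   Returns how many bytes were written; no range or surrogate checks. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

int main(void)
{
    unsigned char buf[4];
    int i, n = utf8_encode(0xABUL, buf);   /* U+00AB from the post */

    for (i = 0; i < n; i++)
        printf("%02X ", (unsigned)buf[i]); /* prints: C2 AB */
    putchar('\n');
    return 0;
}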

What format is the "string" you are starting with, and where did it
come from? What representation format is the function expecting to
receive? What C type does the function expect to receive?

As well as clearly defining the problem, if you post the code which
doesn't work we may be able to help you fix it. We won't be able to
help until both you and we understand precisely what you are trying to
do, though.
 
