Questions on various string literals in C++0x

I am reading the standard, but I'm getting quite confused by the
following questions. I hope someone can clarify them for me; many
thanks.

1. In 2.14.5 of the standard, a raw string is defined as

raw-string:
  " d-char-sequence_opt ( r-char-sequence_opt ) d-char-sequence_opt "

where
r-char:
any member of the source character set, except a right parenthesis )
followed by the initial d-char-sequence (which may be empty) followed
by a double quote ".

For comparison, a normal string character is defined as
s-char:
any member of the source character set except the double-quote ",
backslash \, or new-line character
escape-sequence
universal-character-name

This means that a raw string excludes any "universal-character-name".
I searched the whole document, and I could only find the definition of
the "basic source character set", which contains 96 characters; I
could not find anything on "source character set".

Anyway, I understand why a raw string cannot contain a
"universal-character-name". But how can Unicode be expressed inside a
raw string? Directly, in the form UR"(♠♣♥♦)"? Or do raw strings
exclude Unicode altogether?
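
For what it's worth, this is how I imagine it would look (only a
sketch: it assumes the source file is saved as UTF-8 and a compiler
with raw-string support, e.g. a recent GCC with -std=c++0x; the suit
characters are typed directly into the source, with no
universal-character-names involved):

#include <cstdio>

int main() {
    const char*     s8  = u8R"(♠♣♥♦)";  // raw UTF-8 string literal
    const char32_t* s32 = UR"(♠♣♥♦)";   // raw UTF-32 string literal
    std::printf("%s\n", s8);                                // the four suits
    std::printf("U+%04X\n", static_cast<unsigned>(s32[0])); // U+2660 (black spade)
    return 0;
}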

2. From Wikipedia:

The Microsoft Windows application programming interfaces Win32 and
Win64, as well as the Java and .Net Framework platforms, require that
wide character variables be defined as 16-bit values, and that
characters be encoded using UTF-16 (due to former use of UCS-2), while
modern Unix-like systems generally require 32-bit values encoded using
UTF-32.

Does this mean that wchar_t is equivalent to char16_t on Windows and
to char32_t on Unix-like systems?
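
A quick way to check would be something like the following (a sketch
only; the standard merely makes wchar_t a distinct type, so the sizes
in the comments are just what I would expect from typical compilers,
i.e. MSVC on Windows and GCC on Linux):

#include <cstdio>

int main() {
    std::printf("wchar_t : %u bytes\n", (unsigned)sizeof(wchar_t));   // expect 2 on Windows, 4 on Linux
    std::printf("char16_t: %u bytes\n", (unsigned)sizeof(char16_t));  // at least 16 bits, typically 2 bytes
    std::printf("char32_t: %u bytes\n", (unsigned)sizeof(char32_t));  // at least 32 bits, typically 4 bytes
    return 0;
}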

3. What's the difference between a normal string and one with the u8
prefix?

From my understanding, they are exactly the same in bytes. Is it just
that the u8 prefix tells the compiler the string is encoded in UTF-8,
while a normal string doesn't enforce such an encoding?
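
If I understand correctly, they can differ once the execution
character set is not UTF-8. A sketch of how I would test it (assuming
GCC with -fexec-charset=latin1, so the ordinary literal is encoded in
Latin-1 while the u8 literal stays UTF-8):

#include <cstdio>

int main() {
    // One universal-character-name, U+00E4, written in two narrow literals.
    std::printf("plain: %u bytes\n", (unsigned)sizeof("\u00E4"));   // 2 with a Latin-1 exec charset (e4 00)
    std::printf("u8   : %u bytes\n", (unsigned)sizeof(u8"\u00E4")); // 3, always UTF-8 (c3 a4 00)
    return 0;
}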


4. (2.14.5/15) In a narrow string literal, a universal-character name
may map to more than one char element due to multibyte encoding.

A narrow string literal is either an ordinary string or a u8-prefixed
string. From my understanding,
"\UFFFFFFFF" (just an example; it may not be a valid Unicode
character) has exactly 4 bytes plus 1 byte of '\0' at the end,

but for
u8"\UFFFFFFFF" the compiler will convert it into UTF-8 encoding, so
the binary form will be completely different from the above normal
string?

Is that correct?
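
A sketch of how the bytes could be inspected (assuming GCC or Clang in
C++0x mode, and using the valid code point U+1F600 in place of
\UFFFFFFFF, which is not a valid Unicode scalar value):

#include <cstdio>

static void dump(const char* label, const char* s) {
    std::printf("%s:", label);
    for (; *s; ++s)
        std::printf(" %02x", (unsigned)(unsigned char)*s);  // print each char element in hex
    std::printf("\n");
}

int main() {
    dump("plain", "\U0001F600");   // bytes depend on the execution character set
    dump("u8   ", u8"\U0001F600"); // always UTF-8: f0 9f 98 80 (4 char elements)
    return 0;
}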

5. (2.14.5/15) The size of a char32_t or wide string literal is the
total number of escape sequences, universal-character-names, and other
characters, plus one for the terminating U’\0’ or L’\0’. The size of a
char16_t string literal is the total number of escape sequences,
universal-character-names, and other characters, plus one for each
character requiring a surrogate pair, plus one for the terminating
u’\0’.

I am getting quite confused here. The standard apparently gives
char32_t and wchar_t string literals the same definition. But on
Windows, wchar_t is effectively char16_t (as in my question 2), so
does this mean that, on Windows, although wchar_t and char16_t strings
are both encoded in UTF-16, their literal sizes are computed
differently?

For example:
U"\UFFFFFFFF": by the definition, the size is 1 universal-character-name
+ 1 terminating \0 = 2.
L"\UFFFFFFFF": though encoded in UTF-16, its size is 2.
u"\UFFFFFFFF": 1 universal-character-name + 1 for the surrogate pair (I
assume it requires a surrogate pair because the value exceeds 16 bits;
I hope I haven't made a wrong assumption about Unicode) + 1 for \0 = 3.

But the binary forms of L"\UFFFFFFFF" and u"\UFFFFFFFF" are exactly
the same (on Windows).

Am I correct?
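
A sketch that would show the element counts (again using the valid
code point U+1F600 in place of \UFFFFFFFF, and assuming typical
implementations, i.e. 16-bit wchar_t on Windows and 32-bit wchar_t on
Linux):

#include <cstdio>

int main() {
    char32_t u32[] = U"\U0001F600";  // 1 code unit + terminating \0 -> 2 elements
    char16_t u16[] = u"\U0001F600";  // surrogate pair + terminating \0 -> 3 elements
    wchar_t  w[]   = L"\U0001F600";  // 3 on Windows (UTF-16), 2 on typical Linux (UTF-32)

    std::printf("U: %u  u: %u  L: %u\n",
                (unsigned)(sizeof(u32) / sizeof(u32[0])),
                (unsigned)(sizeof(u16) / sizeof(u16[0])),
                (unsigned)(sizeof(w)   / sizeof(w[0])));
    return 0;
}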

Thanks again.
 

I played around with GCC for some time (VC++ doesn't support the u,
U, or u8 prefixes, nor universal-character-names).

With GCC, the prefixes u8, u, and U tell the compiler how to encode
the string, i.e.,
char *p1 = (char *)u8"\U000E0005"; // UTF-8:  f3 a0 80 85
char *p2 = (char *)u"\U000E0005";  // UTF-16: db40 dc05 (a surrogate pair)
char *p3 = (char *)U"\U000E0005";  // UTF-32: 000e0005

and the individual byte contents pointed to by p1, p2, and p3 will be
completely different.

I guess wchar_t and the L prefix on Windows will follow p2; I haven't
verified it yet.
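
A sketch of how that guess could be verified (assuming a compiler that
accepts both L"" and u"" literals; on Windows, where sizeof(wchar_t)
== 2, the two literals should come out byte-identical):

#include <cstdio>
#include <cstring>

int main() {
    const wchar_t  w[] = L"\U000E0005";
    const char16_t u[] = u"\U000E0005";

    if (sizeof(w) == sizeof(u) && std::memcmp(w, u, sizeof(w)) == 0)
        std::printf("L\"...\" matches u\"...\" byte for byte\n");
    else
        std::printf("L\"...\" differs (wchar_t is %u bytes here)\n",
                    (unsigned)sizeof(wchar_t));
    return 0;
}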
 
