Unicode: ugh!

Ben Pfaff · Mar 13, 2006

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

Artie Gold · Mar 13, 2006

Ben said:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

....and your C question is...

;-) ;-)

--ag

osmium · Mar 13, 2006

Ben Pfaff said:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

What's your complaint? That the ASCII null should be spelled NUL?

Ben Pfaff · Mar 13, 2006

osmium said:
What's your complaint? That the ASCII null should be spelled NUL?

Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

It's amazing how much they managed to get wrong in a single
sentence.
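
As a rough illustration of the distinction drawn above, here is a minimal sketch in C. The helper names holds_string, string_size and check_examples are made up for this example; only memchr, strlen and assert come from the standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the first n bytes of buf contain a null character, i.e.
   buf holds a string in the C standard's sense; returns 0 otherwise.
   (holds_string is a hypothetical helper name.) */
int holds_string(const char *buf, size_t n)
{
    return memchr(buf, '\0', n) != NULL;
}

/* Size of the string that s points to, counting the terminating null
   character, which the definition treats as part of the string. */
size_t string_size(const char *s)
{
    return strlen(s) + 1;
}

/* Self-checks on a few sample objects. */
void check_examples(void)
{
    char text[] = "abc";              /* 4 chars: 'a' 'b' 'c' '\0' */
    char raw[3] = { 'a', 'b', 'c' };  /* no null character: not a string */
    const char *p = text;             /* a pointer to a string, not a string */

    assert(sizeof text == 4);
    assert(holds_string(text, sizeof text));
    assert(!holds_string(raw, sizeof raw));
    assert(string_size(p) == 4);
}
```

The array raw is a perfectly good array of char, but passing it to strlen would be undefined behavior because it contains no null character.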

osmium · Mar 13, 2006

Ben Pfaff said:
Here is the definition of a string:

A string is a contiguous sequence of characters terminated
by and including the first null character.

A string is not a pointer to char: it is a sequence of
characters. It is not "conventionally" terminated by a null
character, it is always terminated by one (otherwise it is not a
string). In C, the null terminator is not a NULL character (NULL
is a null pointer constant); it is not the NUL character either,
because that assumes an ASCII character set; the null terminator
is in fact the "null character", as quoted above.

I glossed over the word "conventionally"; that is not a good basis for a
definition. As for the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it. But I don't claim to have seen all the UTFs that exist.

It's amazing how much they managed to get wrong in a single
sentence.

I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.

Keith Thompson · Mar 13, 2006

osmium said:
I glossed over the word "conventionally"; that is not a good basis for a
definition. As for the ASCII component, I figured that was justified
somewhere in the thicket of documents. Every UTF I have seen embeds ASCII
in it. But I don't claim to have seen all the UTFs that exist.

I just read it again and I now agree with you. I thought earlier you were
nit-picking on the extra 'L'.

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

Richard G. Riley · Mar 13, 2006

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

In retrospect it was a bit silly to have a NULL and a "null"
character, '\0', and then to compound it all with a "null pointer". Two
seconds with Google turns up generations of confusion and standards
abuse.

Jordan Abel · Mar 13, 2006

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.
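
A small sketch of why that works in C, assuming a conforming C (not C++) compiler. The two function names are hypothetical and exist only for this example; some compilers warn about the initialization, though it is valid C.

```c
#include <assert.h>
#include <stddef.h>

/* In C a character constant has type int, so '\0' is an integer constant
   expression with value 0, which makes it a null pointer constant. */
int nul_char_converts_to_null_pointer(void)
{
    char *p = '\0';   /* legal C, but poor style */
    return p == NULL;
}

/* Character constants have type int in C (in C++ they have type char,
   and the initialization above is not accepted there). */
int char_constant_has_int_size(void)
{
    return sizeof '\0' == sizeof(int);
}
```

Legal is not the same as advisable: writing '\0' where a pointer is meant obscures intent, which is the confusion this thread is complaining about.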

Keith Thompson · Mar 13, 2006

Jordan Abel said:
Though, '\0' is incidentally a null pointer constant... so #define NULL
'\0' would be legal.

Yes, of course; that's the "very little" I was referring to.

Richard Tobin · Mar 13, 2006

Even if that were the only problem, it would be enough of a basis to
criticize it. NULL has a very well-defined meaning in C, and it has
very little to do with the '\0' character.

However, the text is in the Unicode standard, and there NULL means the
character with code 0.

-- Richard

lawrence.jones · Mar 14, 2006

Ben Pfaff said:
You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

Not after you've worked in the standards world for a while, you
wouldn't.

Of course, some committees are better than others.

-Larry Jones

Wow, how existential can you get? -- Hobbes

Richard Bos · Mar 14, 2006

Ben Pfaff said:
The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.

Richard

Ben Pfaff · Mar 14, 2006

[NULL is correct for Unicode.]

There's a lot more wrong with it than misspelling "null".

Jordan Abel · Mar 14, 2006

Richard Bos said:
Ben Pfaff said:

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.

Well, yeah. That's the English word/phrase for which NUL is an
abbreviation, just like we have START OF TEXT for STX, and so on.

Dik T. Winter · Mar 15, 2006

> The Unicode
> name for character 0, ASCII NUL, the null character, is (and has been
> for a while, don't know how long) NULL.

That name was already present in Unicode 1.1.5 (July 1995) (the earliest
reference that is available online).

Ben Pfaff · Mar 15, 2006

Richard Bos said:
Ben Pfaff said:

The Unicode standard says this in section 3.9:

"For example, a string is defined as a pointer to char in the
C language, and is conventionally terminated with a NULL
character."

You'd think folks writing standards would bother to properly read
and understand the other standards that they reference.

The Unicode Standard (at least, the PDFs I have) also says this in its
first character chart (chapter 16, I believe):

"0000 [NUL] <control>
= NULL"

So _within the bounds of Unicode_ that comment is correct. The Unicode
name for character 0, ASCII NUL, the null character, is (and has been
for a while, don't know how long) NULL. This choice was ill-advised,
yes, but it having been made, your quotation is wrong in a general C
context, but correct in a Unicode context.

I took another look at the text I quoted. At a second look, it
is clearly *not* referring to the Unicode character called NULL,
because Unicode character names in the Unicode standard are
expressed in small capital letters. The NULL in the paragraph
above is in full-size capital letters, so it does not refer to a
Unicode character name.

Richard Bos · Mar 16, 2006

They are in small caps when written as (for example) "U+004B LATIN
CAPITAL LETTER K", but they also appear (without the U+XXXX) in plain
capitals (e.g. in the character tables themselves), lower case, and
italics. So I don't think you can deduce that they are not referring
to the Unicode character. Though they might well be using it in a
more generic sense of a null character without reference to Unicode in
particular (which would be more accurate in a sense, because as far as
I can see nothing guarantees that C's string-terminating character
maps to U+0000).

The null character in C must be a character with value zero. U+0000
trivially also has value zero. If an implementation manages not to map
the one onto the other, I would say that that implementation does not
have Unicode as its character set, but at most Unicode-rearranged.
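
One practical consequence, sketched in C under the assumption that the text is stored as char bytes in an encoding such as UTF-8, where U+0000 is the single byte 0: an embedded U+0000 is the C null character, so it ends the string. The function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The null character always compares equal to zero. */
int null_char_is_zero(void)
{
    return '\0' == 0;
}

/* Counts the bytes in buf[0..n) that come after the first null character,
   i.e. data that string functions such as strlen never see. Returns 0 if
   buf contains no null character. */
size_t bytes_hidden_after_null(const char *buf, size_t n)
{
    const char *end = memchr(buf, '\0', n);
    return end ? (size_t)(buf + n - end) - 1 : 0;
}
```

For the array initialized from "ab\0cd", which occupies 6 bytes, strlen reports 2 and the remaining 3 bytes ('c', 'd' and the final null character) are not part of the string.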

Richard
