Zero terminated strings


Nobody

My guess is that higher-level code parses a domain name into components
using a rule along the lines of "any sequence of characters except a dot",
when it should have tested for membership in a specific set.

This is a common flaw in regexp-based parsers; occurrences of "." or
"[^...]" in a regexp almost invariably match things that they shouldn't be
matching. This may not be a problem if you're doing search-and-replace
on your own files, but it's often disastrous for validating untrusted data.

Personally I'm not sure I would use something as complex as a regexp
parser for parsing domain names. The requirements are so simple I think
it is easier to express them in simple code (with maybe a few useful
little library functions similar to ones I already have).

I'm not saying that they're actually using regexps; it's more likely that
it's something like "while (c != '.') ..." or strchr(str, '.').

The comment on regexp-based parsers was just that this is a particularly
common class of bug. Regexps make it easy to match "anything" or "anything
except ...", although doing so is usually wrong. If the programmer was
forced to use e.g. [\0--/-ÿ] instead of [^.], they might stop to think
whether they really wanted to match all of those characters.
 

Beej Jorgensen

Tech07 said:
I've always used 'ad hominem' as synonymous with "personal attack"

In casual conversation, yes. But technically when it comes to
arguments, it's more than just an insult. It's subtle at times, but
there has to be an argumentative assertion, and an insult isn't
necessarily required:

"Joe can solve this equation for X."
"No, he can't--he's too tall."

That doesn't fool anyone, though. Usually it's more barbed than that:

"You can trust Joe."
"No, you can't--he's a police officer."

These examples, even if the particular conclusions are correct, are ad
hominem arguments.

Anyway, it's not clear in this case that Kenny is making an argument;
this would appear to merely be a collection of Kenny's opinions of Eric.
He could have said:

"Eric's argument is wrong because he is a total loon."

In that case, it would clearly be ad hominem (and an insult, as well.)

All we have here is that Kenny posted a follow-up to a posting by Eric.
Is the mere act of following-up enough to imply that he's making a
counter-argument? Or is he just making a general statement in the
middle of the thread? I don't know--it's a matter of interpretation.

-Beej
 

Beej Jorgensen

Tech07 said:
When talking about ASCII zero at the end of a character string, I call
it a 'terminator'.

OIC. Well, I do, too, so I guess that's fair. Maybe I'd say, in
general, the terminator is a sentinel value, or somesuch. In any case,
I'm guessing Eric was trying to use the same terminology to tie into his
string-as-linked-list example.

-Beej
 

Flash Gordon

Nobody said:
My guess is that higher-level code parses a domain name into components
using a rule along the lines of "any sequence of characters except a dot",
when it should have tested for membership in a specific set.

This is a common flaw in regexp-based parsers; occurrences of "." or
"[^...]" in a regexp almost invariably match things that they shouldn't be
matching. This may not be a problem if you're doing search-and-replace
on your own files, but it's often disastrous for validating untrusted data.
Personally I'm not sure I would use something as complex as a regexp

I'm not saying that they're actually using regexps;

<snip>

OK, I started writing a long reply, but no point as I think we basically
agree. The SW the cert authority used should have accepted only requests
explicitly detected as correct and rejected any other request.
 

Flash Gordon

Tech07 said:
Flash Gordon said:
Moi said:
On Sun, 02 Aug 2009 11:16:04 +0000, bartc wrote:

Gareth Owen wrote:

strcat is a MUCH faster operation if you do NOT seek the terminating
zero.
That depends. It's probably true for long strings. If your string
implementation looks roughly like

struct string
{
unsigned len;
char * str;
}

then for short strings, looking up the length could easily cause a
cache miss -- or even a page fault -- depending on memory access
pattern and string accesses require an extra level of indirection.
NULL terminated strings have guaranteed memory locality.
A more likely structure would be...

struct string
{
unsigned len; /* possibly size_t rather than unsigned */
char str[];
}

So the length is guaranteed to be a few bytes before the start of the
character data. So memory locality should not be a problem unless your
pages are very short indeed!

If you have lots of short strings (say you are trying to load a massive
dictionary into memory, one string per word) the memory overhead of
the length could become more significant.
Then you use a short length of one byte, and you don't need the
terminator so the overheads are the same.
Thus limiting the length of the string quite severely, or requiring
multiple different string types.
Or length could be variable.
Adding complexity, if you mean what I think you mean.
If access is indirect (via functions or operator extensions), then the
string type can do anything it likes.
The complexity is still there, and unless you do all the changes to the
language required to make strings a first class type (which I think would
be large changes) the complexity will show itself in various places.
If the strings are allocated individually, then those overheads
per-string will probably outweigh the length ones.
There are other ways to do it. Allocate one block in which you store
multiple strings separated by the null termination. Then you have only one
allocation overhead for however many strings you have.
An alternative would be to misuse utf-8 encoding to store the length.
This would keep the alignment to 'char' and still allow a string length
of up to 16K, (IIRC).
That kind of scheme is, I think, what bartc meant by "or length could be
variable". However, it does add complexity. It may well be valuable and
worth while complexity, but it is still there.

On "complexity": You can't really fairly compare the complexity/simplicity
of an incorrect solution with that of a correct one.

That's good, because I'm not. I'm comparing the complexity of two valid
approaches.
On "correctness": It's very subjective (as evidenced again by this thread)
whether the "one style fits all" compromise of the null-terminated C string
design is "palatable" or even correct.

It was not incorrect. It was an appropriate trade-off, as were lots of
other things.
One could inductively reason about
the list of requirements that led to the null-terminated design, but there
probably wasn't any such formal or thorough analysis. The analysis is easier
done in retrospect. I'm not sure what portion of the C/C++ programming
languages crowd has either opinion, but I have an inkling that most would
opt for a different solution if it were C language creation time right now.

For doing major string handling you want a language where strings are a
first class type, at which point whether the strings are counted, use a
sentinel or whatever becomes largely irrelevant to the person using the
language (there are ways of embedding a sentinel in a
sentinel-terminated blob of data).
OK, like me, you think the null-terminated design is not adequate.

Oh, it is adequate for a lot of software (and no problem at all for a
fair bit), and the reason I use other languages for major string
handling is nothing to do with the null termination.
I'm not
ready to say whether a library-only approach can adequately remedy the
issues, but of course, an ISO Standard remedy is much more difficult to do
than an in-house one.

A library approach will ALWAYS have major disadvantages compared to a
fully built in string type (built in the same way int is in C). However,
building it in to C properly would be a massive change (even C++ has not
done it as I understand, just faked it with classes). Even then for
untrusted data you have to be very careful. You still can't simply read
a line of untrusted text in to a string variable without a length
limitation or someone can grab all your memory!

C is still a very good language for all sorts of things, since a lot of
problems don't need serious string handling, not even the problem that
started this thread (which would have been avoided by simple validation
at the certificate authority and made visible by simple filtering
(making the null a visible but invalid character) on the browser).

Train people by making them write comms software over poor serial links
where you get significant amounts of corruption, they will soon learn
simple "tricks" (so simple they can easily be implemented in assembler
even) that will deal with bad data.
 

Øyvind Røtvold

[ ... ]
That doesn't fool anyone, though. Usually it's more barbed than that:

"You can trust Joe."
"No, you can't--he's a police officer."

I don't know if that was intended, but both those statements are in
fact ad hominem arguments.

[ ... ]
"Eric's argument is wrong because he is a total loon."

In that case, it would clearly be ad hominem (and an insult, as well.)

Yes, but if someone was to say (unrelated to current thread):

"Jawaharlal's argument is wrong because he aggregates information from
dubious sources and interprets them to fit his preconceived notions on
the issue."

Would that be considered an ad hominem argument?

[...]
 

Nick Keighley

[zero terminated strings]

That's good, because I'm not. I'm comparing the complexity of two valid
approaches.

no not really. "Correct" means "compliant with specification". Zero
terminated strings work fine within their limitations.

For doing major string handling you want a language where strings are a
first class type, at which point whether the strings are counted, use a
sentinel or whatever becomes largely irrelevant to the person using the
language (there are ways of embedding a sentinel in a
sentinel-terminated blob of data).

by first class you mean assignable and capable of being compared
for equality?

string s;
string t = "hello";
s = "hello";
if (s == t) f (s);

You can write code like that in C++

A library approach will ALWAYS have major disadvantages compared to a
fully built in string type (built in the same way int is in C).

a language that provides the right hooks can add first class types.
Eg. C++ (and maybe Navia-C)
However,
building it in to C properly would be a massive change (even C++ has not
done it as I understand, just faked it with clases).

in what way has C++ "faked it"? You can do some pretty clever things
with C++ strings.

Train people by making them write comms software over poor serial links
where you get significant amounts of corruption, they will soon learn
simple "tricks" (so simple they can easily be implemented in assembler
even) that will deal with bad data.

I'm not sure I'd like to rely on an application that relied
on "simple tricks" to recover from comms errors!

Why not use a decent link-level protocol?
 

spinoza1111

Beej Jorgensen said:



<grin> And when talking about the null pointer at the end of a list, I
don't call it a sentinel (although it is one) - I call it a
"terminator". In fact, I call it a "NULL terminator" (as opposed to
the null terminator that appears at the end of a string - C's case
sensitivity comes to the fore!).

Terminology is important in the absence of understanding:
But terminology is not Master Kong's "rectification of names".
Terminology is a matter of tokens and not a matter of ideas
It is the silken robe that hideth the cadaverous corpse.
Pedants admonish us on specific words they have committed to memory
But the true scholar sees the mountain having ascended beyond.
The common creature takes the safest way and uses rote,
Putting words in his mouth by repetition in houses of ill-fame.
He writes them unthinking on examination day and may even pass
But his lack of understanding makes the common people groan.
Unlike the fumes of wine which are dearer to him than is learning,
The words have not ascended to his mind, nor descended to his heart.
Whereas the true scholar is known to the true scholar,
And one of them may have failed the examination,
But shall pass into the house of the other, being welcomed
After many years, to discourse of old times and new learning.

Edward G. Nilges, Hong Kong, 4 Aug 2009: Moral rights asserted by the
author, so there.
 

Eric Sosman

Paul said:
Paul said:
jacob navia wrote:
Lew Pitcher wrote:
[snip]
What the OP complains about (his direct complaint) is the result of a
failure to validate, and that can happen in any language.
Yes bugs can happen in any language.
I believe misinterpreted NUL termination is unique to the C language.
Prevalent, but not unique. I seem to recall an .ASCIZ directive
or equivalent in more than one assembler. And remember the CP/M I/O
functions that used '$'-terminated strings? (Ghastly, they were.)

Are you doing a Colbert/Borat here? That doesn't even need a response
as it clearly makes *MY* case, not yours. (Assembly is not a
"language" and I have never seen an assembler that came with a
standard library that told you how to encode strings.)

"Assembly is not a `language'" suggests that you have a
notion of "language" that is foreign to me. People write
assembly code, read assembly code, and debug assembly code
exactly as they write, read, and debug C or Lisp.

As for the existence of a library, no: Assembly languages
themselves seldom come with such libraries. Fine. But why
did the assembler creators expend the effort to implement and
document an .ASCIZ directive? They did it because they thought
zero-terminated strings would be common enough that it was
worth while to offer a convenient shortcut for creating them,
that's why. In other words, the libraries that used zero-
terminated strings were not part of the assembler and its
language per se, but were anticipated to be a significant part
of the environment.

The existence of an artifact proves that the artificer
thought -- rightly or wrongly -- that the artifact would
serve a purpose. The existence of an .ASCIZ directive proves
that somebody thought -- rightly or wrongly -- that programmers
would use it, and this proves that somebody thought -- rightly
or wrongly -- that his assembler would be used in conjunction
with code that manipulated zero-terminated strings. And not in
C, which was the point of the digression: To refute the assertion
that zero-terminated strings (and mistakes made therewith) are
somehow C-specific.
[...]
If you decided to make strings a linked list of characters as you
suggest, you would not have a pointer to a dynamically allocated node
which contained a \0 (unless you are idiot) but instead just a pointer
to NULL (or some other sentinel value.) You would also be laughed at
for doing so; the overhead cost is way too high.

It wasn't a suggestion or a f'rinstance, it was a report
of an actual SNOBOL3 implementation I used around 1970. It
struck me as wasteful, too -- and this was in the days when
memory was scarce and waste could not be ignored as we so
casually do today. But it ran, it worked -- and it was real,
not some kind of thought experiment.
[...]
A sentinel can be and is best used as meta-data, not data. ASCII and
UNICODE both list \0 as the well defined control character NUL.
Neither standard demands how such a character is to be used. In fact
Unicode sees absolutely no distinction between the characters 0
through 8 inclusive. C's imposition of \0 as data with meta-data
meanings is just that -- an imposition.

C's insistence that pointer value 0 (not to be confused
with all-bits-zero) be given special treatment is equally an
imposition, and equally artificial. Much hardware that runs
C is perfectly capable of using pointer-0 to access memory;
it's a perfectly valid memory location and the hardware can
form perfectly valid pointers to it. The hardware is often
set up in such a way that the attempted access will trap (not
always, as witness the recent thread about gcc optimizations
and the Linux kernel), but this is just to help debug buggy
C code. (Non-buggy C code never attempts the access, and so
doesn't care what would happen if it did.)

Pointer-0 is no less an "imposition" than character-0:
both are examples of C attaching special meaning to a value
that is not inherently special. A pointer-0 at the end of
a list and a character-0 at the end of a string are morally
equivalent, both matters of convention and not of necessity.
 

Beej Jorgensen

Øyvind Røtvold said:
I don't know if that was intended, but both those statements are in
fact ad hominem arguments.

Ha! It wasn't, but that's funny. "You can trust Joe because he is
Joe." :)
Yes, but if someone was to say (unrelated to current thread):

"Jawaharlal's argument is wrong because he aggregates information from
dubious sources and interprets them to fit his preconceived notions on
the issue."

Would that be considered an ad hominem argument?

Good question, and I think I would--but only because it's not apparent
that in this case he has done these things. I don't think there's much
difference between your example and:

"Jawaharlal's argument is wrong because he usually lies."

which sure seems like ad hominem to me. Perhaps Jawaharlal has come to
the wrong conclusion in this case, and perhaps it was because he was
lying, but it's not necessarily true in all cases so it doesn't make a
strong argument.

On the other hand:

"Jawaharlal's argument is wrong because he is lying about premise X."

seems like another matter, and not ad hominem.

-Beej
--
"All non-denial denials. They doubt our ancestors, but they won't
say the story isn't true."
"What I want to know is, what's a real denial?"
"Well, when they start calling us 'God-damned liars', I guess we
start circling the wagons."
--Ben Bradlee and Bob Woodward, All The President's Men
 

Tech07

Nick said:
[zero terminated strings]

That's good, because I'm not. I'm comparing the complexity of two
valid approaches.

no not really. "Correct" means "compliant with specification". Zero
terminated strings work fine within their limitations.

'correct' was a poor choice of word, but contextually, most should get the
gist of what I was trying to say: something along the lines that the
compromises made by the original implementors look bad to some in
retrospect. To some, even incorrect, given a hypothetical list of
requirements/constraints that would render it so.
 

Tech07

Flash said:
Tech07 said:
Flash Gordon said:
Moi wrote:
On Sun, 02 Aug 2009 11:16:04 +0000, bartc wrote:

Gareth Owen wrote:

strcat is a MUCH faster operation if you do NOT seek the
terminating zero.
That depends. It's probably true for long strings. If your
string implementation looks roughly like

struct string
{
unsigned len;
char * str;
}

then for short strings, looking up the length could easily
cause a cache miss -- or even a page fault -- depending on
memory access pattern and string accesses require an extra
level of indirection. NULL terminated strings have guaranteed
memory locality.
A more likely structure would be...

struct string
{
unsigned len; /* possibly size_t rather than unsigned */
char str[];
}

So the length is guaranteed to be a few bytes before the start
of the character data. So memory locality should not be a
problem unless your pages are very short indeed!

If you have lots of short strings (say you are trying to load a
massive dictionary into memory, one string per word) the memory
overhead of the length could become more significant.
Then you use a short length of one byte, and you don't need the
terminator so the overheads are the same.
Thus limiting the length of the string quite severely, or requiring
multiple different string types.

Or length could be variable.
Adding complexity, if you mean what I think you mean.

If access is indirect (via functions or operator extensions),
then the string type can do anything it likes.
The complexity is still there, and unless you do all the changes to
the language required to make strings a first class type (which I
think would be large changes) the complexity will show itself in
various places.
If the strings are allocated individually, then those overheads
per-string will probably outweigh the length ones.
There are other ways to do it. Allocate one block in which you store
multiple strings separated by the null termination. Then you have
only one allocation overhead for however many strings you have.

An alternative would be to misuse utf-8 encoding to store the
length. This would keep the alignment to 'char' and still allow a
string length of up to 16K, (IIRC).
That kind of scheme is, I think, what bartc meant by "or length
could be variable". However, it does add complexity. It may well be
valuable and worth while complexity, but it is still there.

On "complexity": You can't really fairly compare the
complexity/simplicity of an incorrect solution with that of a
correct one.

That's good, because I'm not. I'm comparing the complexity of two
valid approaches.

You'd have to reference requirements and constraints for that to hold;
otherwise the context is nebulous. But no need for more on that, I
believe you: no need to prove anything.
It was not incorrect. It was an appropriate trade-off, as were lots of
other things.

That's your opinion and you are entitled to it. If I had written the
requirements/constraints (been there then with what I know today), the
current implementation would indeed be an incorrect solution or a partial
one at best.
For doing major string handling you want a language where strings are
a first class type,

Maybe. But maybe adequacy can be had with a library-level implementation. I
use C++ so there is more "design bandwidth" available for the solution.
Indeed, a library-level solution is what I currently use and is what C++
offers also.
at which point whether the strings are counted,
use a sentinel or whatever becomes largely irrelevant to the person
using the language (there are ways of embedding a sentinel in a
sentinel-terminated blob of data).


Oh, it is adequate for a lot of software (an no problem at all for a
fair bit), and the reason I use other languages for major string
handling is nothing to do with the null termination.


A library approach will ALWAYS have major disadvantages compared to a
fully built in string type (built in the same way int is in C).

As there is no hardware-level correspondence for a string, as there is for
an int, I do not see any real opportunity for "a fully built in string
type". A string is a composed type, rather than a primitive. At the compiler
level, there is some opportunity to "build in a string type", but that
concept may be an obsolete idea also.
However, building it in to C properly would be a massive change

And "properly" is very subjective. Hand something like that to a committee
and you'd be asking for trouble/stalemate. And it would fundamentally make
the language a new one as the null-terminated string concept is a major
characteristic of C, so it's not even plausible. The most one can hope for
is an addition to the standard, but then, most people have library
implementations already (read most are not limited by what the standard
provides).
 

Flash Gordon

Nick said:
[zero terminated strings]
For doing major string handling you want a language where strings are a
first class type, at which point whether the strings are counted, use a
sentinel or whatever becomes largely irrelevant to the person using the
language (there are ways of embedding a sentinel in a
sentinel-terminated blob of data).

by first class you mean assignable and capable of being compared
for equality?

string s;
string t = "hello";
s = "hello";
if (s == t) f (s);

You can write code like that in C++

And then if f is defined as

void f(string str)
{
/* Modify str */
}

Then str is passed by value (if other built in types are) so the
modification to str does not affect s in the caller...

All the memory management handled for you, just as it is for type int...
including when you do things like string concatenation etc

The type being available without having to take extra measures...
a language that provides the right hooks can add first class types.
Eg. C++ (and maybe Navia-C)


in what way has C++ "faked it"? You can do some pretty clever things
with C++ strings.

Well, I can't say I've learned C++, so I don't know how closely it fits.
I'm not sure I'd like to rely on an application that relied
on "simple tricks" to recover from comms errors!

Why not use a decent link-level protocol?

Simple "tricks" are all you need to implement a decent link level
protocol. I know, I've done it. I did it so well it worked with the
data going across 4 serial links, one of which was un-shielded 3-wire
(data-in, data-out and signal ground) in an electrically noisy environment.

Wait for a valid first byte of a header (anything else you receive is
obviously invalid)

Check the header bytes (including message type and length, for which you
know the valid range) as they are coming in and assume that you have
dropped at least one byte if any of them are wrong. Check the checksum
at the end.

If there is a byte that is invalid during the header, check to see if it
is the first byte of the header in case you have dropped the entirety of
the rest of the message (a length or command type with the same value
as the first byte of the header is invalid, as would be that particular
byte occurring at any other point in the header). If the byte
you received is not the first byte, go back to waiting for the first
byte again...

If the checksum was invalid go back to the start of the data (i.e. just
after the header) and scan your buffer for the first byte of the header.
If found, keep going through checking the message from that point, if
not found go back to waiting for a first byte of a header.

When you have a complete valid message (at the final destination,
however many hops down the line it is) the final destination
acknowledges it, and keeps sending ACKs for it until it receives the
next message.

The sender keeps trying to send it until it gets the acknowledgment
(which is of similar format and again fully validated in the same way
described above, except for not doing an ACK for the ACK, as the next
valid message effectively is its ACK).

Oh, and the message has a unique identity in the header so the final
destination knows whether it has received a duplicate (due to delays, or
the ACK getting corrupted and so rejected). The ACK, obviously,
specifies which message it is an ACK for.

OK, so the above is bandwidth hungry, but I'm talking serial links here
which are dedicated to getting this specific data over. This worked 100%
reliably (given sufficient time) on a link where corruption was the
norm, not the exception.

As I say, just a few simple tricks. There were a few other oddities in
buffer management etc to balance processor loading and minimise buffer
size requirement, but they are not as relevant.

What is relevant to my original point (giving people experience of
working on systems with highly unreliable data transfer, and having them
solve the problem) is that seeing the interface fail abysmally (before I
rewrote it as described) because there was *not* a link-level protocol
seriously hammered home to me how unreliable data is when it comes from
elsewhere, *even* when the "elsewhere" is another computer sitting
within about four feet of the final destination, and that with me having
complete control over all systems involved!

In other conditions (e.g. where I had to worry about bandwidth so just
repeatedly throwing the packet was not appropriate) I would want
something more sophisticated.
 

Phil Carmody

Beej Jorgensen said:
You and Eric are looking at this from entirely different levels. From
his correct theoretical level, there is no difference between a linked
list with a sentinel value terminator and a string with a NUL
terminator.

There is a huge difference; the two sets are certainly not isomorphic,
as there are multiple representations of the same string in the linked
list version.

Phil
 

jameskuyper

Flash said:
And then if f is defined as

void f(string str)
{
/* Modify str */
}

Then str is passed by value (if other built in types are) so the
modification to str does not affect s in the caller...

All the memory management handled for you, just as it is for type int...
including when you do things like string concatenation etc

All of the above descriptions fit C++ std::string.
The type being available without having to take extra measures...

This is the only thing not met: you have to #include <string>.
However, in that regard it's no worse than wchar_t in C. Also, if it's
important to you to be able to use "string" rather than "std::string",
you have to put in a statement "using namespace std", though I
personally prefer not to - I like to keep my code's dependencies on
the C++ standard library explicit.
Well, I can't say I've learned C++, so I don't know how closely it fits.

It meets almost all of the specifications you've provided so far.
 

Beej Jorgensen

Phil Carmody said:
There is a huge difference; the two sets are certainly not isomorphic,
as there are multiple representations of the same string in the linked
list version.

What I'm saying is: if you imagine that the string implementation is
simply a conforming black box, then you are still subject to the same
terminator-related security issues for every possible implementation,
both of arrays and linked lists. Eric provided an example of this.

-Beej
 

Flash Gordon

jameskuyper said:
All of the above descriptions fit C++ std::string.

Including automatically destroying all the created objects?
f(str1 . str2);
(where . is the concatenation operator, although I'm not bothered *what*
the operator is)
This is the only thing not met: you have to #include <string>.
However, in that regard it's no worse than wchar_t in C. Also, if it's
important to you to be able to use "string" rather than "std::string",
you have to put in a statement "using namespace std", though I
personally prefer not to - I like to keep my code's dependencies on
the C++ standard library explicit.

One reason for wanting it built in as part of the language is so that
when you see string, it is definitely the type built in! Also, and maybe
more importantly, it would mean that all functions expecting a "string"
expect the built in string type!

There is also that it should be the "one and only" string type (OK, I'll
accept small_string, medium_string and long_string, or whatever, but not
another "string type" that is vastly different such as null terminated
strings).

I can keep on thinking up requirements ;-)
It meets almost all of the specifications you've provided so far.

Well, that it is *the* string type, rather than having two fundamentally
different ones, is harder to work around.

It may well be that C++ can provide 99% of my requirements, and maybe if
you did not have null terminated strings as well...


In response to what Tech07 said about there being no hardware level
correspondence, that is true for floating point (i.e. float, double and
long double) on some processors which are commonly programmed in C
(including a processor I liked which I programmed in C). It's even been
true of some of the standard integer types on some processors (and with
C99s long long this is more common again). In any case, the language is
there to allow you to do things more easily that when programming the
hardware directly, and that can include string handling.


In any case, I think we've drifted far off topic for comp.lang.c now.
 

jameskuyper

Flash said:
Including automatically destroying all the created objects?
f(str1 . sr2);
(where . is the concatenation operator, although I'm not bothered *what*
the operator is)

Yes, temporary objects are automatically destroyed and the memory they
were built in is automatically deallocated when you reach the end of
the full expression that caused the temporaries to be created - this
is true even if evaluation of that expression is interrupted by an
exception (C++ 12.2p3). You only have to worry about explicitly
deallocating objects that you explicitly allocate - and that's just as
true for int as for std::string.
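The lifetime rule cited there (C++ 12.2p3) can be observed directly. Here is a minimal sketch; the Tracked class, the operator+, and the helper names are illustrative inventions, not anything from the thread. An instance counter shows that the temporary created by concatenation is gone once the full expression ends:

```cpp
#include <string>

// Count live Tracked instances to observe when temporaries die.
static int live = 0;

struct Tracked {
    std::string s;
    explicit Tracked(const std::string& v) : s(v) { ++live; }
    Tracked(const Tracked& o) : s(o.s) { ++live; }
    ~Tracked() { --live; }
};

// Concatenation returns a temporary, analogous to f(str1 . str2).
Tracked operator+(const Tracked& a, const Tracked& b) {
    return Tracked(a.s + b.s);
}

std::string use(const Tracked& t) { return t.s; }

// Run one full expression, then report how many objects outlived it.
int leaked_after_call() {
    int before = live;
    std::string r = use(Tracked("foo") + Tracked("bar"));
    (void)r;
    // Every temporary was destroyed at the end of the full expression,
    // so the live count is back where it started.
    return live - before;
}
```

Whether copy elision kicks in or not, each constructor call is matched by a destructor call by the end of the full expression, so no explicit deallocation is ever needed.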

[...]
One reason for wanting it built in as part of the language is so that
when you see string, it is definitely the type built in! Also, and maybe
more importantly, it would mean that all functions expecting a "string"
expect the built in string type!

If you want that, you should insist (in C++ code) on "std::string"
rather than simply "string", in which case there's no need for "using
namespace std;".
There is also that it should be the "one and only" string type (OK, I'll
accept small_string, medium_string and long_string, or whatever, but not
another "string type" that is vastly different such as null terminated
strings).

It's C++ - you can't stop people from writing new string classes.
However, if you don't want your algorithm to be affected by the
existence of such classes, that's trivial to arrange.
I can keep on thinking up requirements ;-)

Keep going. :)
Well, that it is *the* string type, rather than having two fundamentally
different ones, is harder to work around.

I initially thought that the two different string types you were
talking about were char* and wchar_t* in C; on the basis of that
assumption, I wrote:

Well, std::string is just a typedef for std::basic_string<char,
std::char_traits<char> >; there's also std::wstring, which is a
typedef for std::basic_string<wchar_t, std::char_traits<wchar_t> >.
The std::basic_string<charT, traits> template defines a doubly-
infinite family of classes, which can be instantiated with defined
behavior for any types, built-in or user-defined, meeting certain
specified requirements. If you don't like having two different string
types, std::basic_string<charT, traits> sounds like it would be a real
nightmare for you. However, you can just restrict yourself to
std::string, if you wish.
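To make that concrete (the variable names and the char32_t instantiation are just illustrative choices): std::string and std::wstring are the same template instantiated with different character types, and nothing stops a third instantiation.

```cpp
#include <string>

// std::string and std::wstring are two instantiations of one template;
// with the default traits and allocator arguments these types are
// identical to the standard typedefs.
std::basic_string<char>     narrow = "hello";   // same type as std::string
std::basic_string<wchar_t>  wide   = L"hello";  // same type as std::wstring
std::basic_string<char32_t> u32    = U"hello";  // a third instantiation

bool same_as_std_string(const std::string& s) {
    // narrow binds here directly because the types are identical.
    return s == "hello";
}
```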
It may well be that C++ can provide 99% of my requirements, and maybe if
you did not have null terminated strings as well...

However, here it sounds like you're referring to char* and
std::string. In that case, keep in mind that support for char* is
mandated by the fact that one of the design objectives for C++ has
always been to remain backwards compatible with C. That objective has
occasionally been sacrificed in order to meet other objectives, but
there's no compelling need to delete support for C-style null-
terminated strings.
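That backwards compatibility is easy to demonstrate in a small sketch (the function names here are hypothetical): std::string accepts a char* on the way in and exposes a NUL-terminated char* via c_str() on the way out, so C library calls still apply.

```cpp
#include <cstring>
#include <string>

// std::string coexists with C-style strings: it can be built from a
// char* and hands one back via c_str(), so C APIs keep working.
std::size_t c_style_length(const std::string& s) {
    const char* p = s.c_str();  // NUL-terminated view of the data
    return std::strlen(p);      // a plain C library call
}

std::string from_c(const char* p) {
    return std::string(p);      // copies up to the terminating NUL
}
```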
 
P

Paul Hsieh

Paul said:
Paul Hsieh wrote:
jacob navia wrote:
Lew Pitcher wrote:
[snip]
What the OP complains about (his direct complaint) is the result of a
failure to validate, and that can happen in any language.
Yes bugs can happen in any language.
I believe misinterpreted NUL termination is unique to the C language.
     Prevalent, but not unique.  I seem to recall an .ASCIZ directive
or equivalent in more than one assembler.  And remember the CP/M I/O
functions that used '$'-terminated strings?  (Ghastly, they were.)
Are you doing a Colbert/Borat here?  That doesn't even need a response
as it clearly makes *MY* case, not yours. (Assembly is not a
"language" and I have never seen an assembler that came with a
standard library that told you how to encode strings.)

     "Assembly is not a `language'" suggests that you have a
notion of "language" that is foreign to me.  People write
assembly code, read assembly code, and debug assembly code
exactly as they write, read, and debug C or Lisp.

Exactly as? I think not. In assembly, you don't have a "model" for
your machine, you have *your machine*. You think of your machine as a
massive bit-state, in which you simply drive the bit-state instruction
by instruction. Because that's what it actually is. Your level of
abstraction is to think of collections of these bits to hold values
that you will eventually use to interact with I/O devices.

In HLLs and to some degree in C, you think in terms of data
structures. And in fact, there are many parts of the machine you just
have no clue about because you are not supposed to know about them.
For example, at what point do you run out of memory in C or Lisp?
There is no sensible way to even address the question, whereas in
assembler it's pretty much always clear how much memory is available
to you, how much you are using, and thus when you will run out.
     As for the existence of a library, no: Assembly languages
themselves seldom come with such libraries.  Fine.  But why
did the assembler creators expend the effort to implement and
document an .ASCIZ directive?

Maybe because the C language reared its head and some people decided
that assemblers should be functionally inter-operable with it? You
might also ask why Intel made the 286's version of the loadall
instruction (now deprecated) or the laughable instruction: "POP
SS" (which will retrieve a new stack segment but will not reset the
stack pointer offset; D'oh!) They are engineers, not gods.
[...] They did it because they thought
zero-terminated strings would be common enough that it was
worth while to offer a convenient shortcut for creating them,
that's why.

You are arguing the fact that an assembler *can* support zero-
terminated strings (when they typically are never used by assembly
programs at all), versus C which practically *requires* that you do.
An assembler doesn't have to justify all its support extensions
because it doesn't dictate what the programmer is really going to do.
The C language library is the standard interface to I/O and it encodes
strings in only one way (and even then inconsistently: fgets()).
[...] In other words, the libraries that used zero-
terminated strings were not part of the assembler and its
language per se, but were anticipated to be a significant part
of the environment.

Assembly language designers do not get to "anticipate" anything. The
CPU vendor makes an instruction set based on the market, academic
feedback, customer or some crazy designer ideas (read: itanium) and
the assembler just gets to encode it, with a few extra bells and
whistles to help the programmer.
     The existence of an artifact proves that the artificer
thought -- rightly or wrongly -- that the artifact would
serve a purpose.

No, it proves that it *MIGHT* serve a purpose, or is *AVAILABLE* to serve one.
They don't get to tell developers when or if they should use it. As an
example: Intel learned this quite dramatically with the 80386 ISA
which had massive instruction set support for multi-tasking. The
instructions went largely unused beyond the very minimum support
required to build multitasking models totally in software. (AMD64
dropped support for these extraneous instructions in 64 bit mode
altogether.)
[...] The existence of an .ASCIZ directive proves
that somebody thought -- rightly or wrongly -- that programmers
would use it, and this proves that somebody thought -- rightly
or wrongly -- that his assembler would be used in conjunction
with code that manipulated zero-terminated strings.  And not in
C, which was the point of the digression: To refute the assertion
that zero-terminated strings (and mistakes made therewith) are
somehow C-specific.

You argue in a space of pure delusion. In your delusion you require
that assembler is a language and the potential uses of zero terminated
strings in assembler are *ACTUAL* uses of zero terminated string in
real programs written in that assembler. Without, of course, any need
to produce an example of any such program (which would vacate any
pedantic point about your lack of discernment; but you've goose egged
there too.)
[...]
If you decided to make strings a linked list of characters as you
suggest, you would not have a pointer to a dynamically allocated node
which contained a \0 (unless you are an idiot) but instead just a pointer
to NULL (or some other sentinel value.)  You would also be laughed at
for doing so; the overhead cost is way too high.

     It wasn't a suggestion or a f'rinstance, it was a report
of an actual SNOBOL3 implementation I used around 1970.  It
struck me as wasteful, too -- and this was in the days when
memory was scarce and waste could not be ignored as we so
casually do today.  But it ran, it worked -- and it was real,
not some kind of thought experiment.

And did they allocate special nodes at the end of each string with the
contents of a 0 or some other terminator in it?
[...]
A sentinel can be and is best used as meta-data, not data.  ASCII and
Unicode both list \0 as the well-defined control character NUL.
Neither standard demands how such a character is to be used.  In fact
Unicode sees absolutely no distinction between the characters 0
through 8 inclusive.  C's imposition of \0 as data with meta-data
meanings is just that -- an imposition.

     C's insistence that pointer value 0 (not to be confused
with all-bits-zero) be given special treatment is equally an
imposition, and equally artificial.

Equally???? What should the default or "no-value" contents of a
pointer be? Alternate systems don't make any sense. With C strings,
length delimiting is an extremely obvious alternative. Furthermore
having and using a value of '\0' makes just as much sense for a
character as any other. With pointers, you cannot *USE* the value on
the other side of NULL unless only *ONE* pointer uses it. Which is
insane for anything but the very simplest programs.
[...] Much hardware that runs
C is perfectly capable of using pointer-0 to access memory;
it's a perfectly valid memory location and the hardware can
form perfectly valid pointers to it. The hardware is often
set up in such a way that the attempted access will trap (not
always, as witness the recent thread about gcc optimizations
and the Linux kernel), but this is just to help debug buggy
C code.  (Non-buggy C code never attempts the access, and so
doesn't care what would happen if it did.)

Ah. You have Heathfield disease. You never have to care about buggy
code. Given your level of analysis shown here, that must be wishful
thinking on your part.
     Pointer-0 is no less an "imposition" than character-0:
both are examples of C attaching special meaning to a value
that is in not inherently special.  A pointer-0 at the end of
a list and a character-0 at the end of a string are morally
equivalent, both matters of convention and not of necessity.

Tech07 has characterized your understanding of computer programming
quite accurately I see. Is this just senility or have you always been
this shallow?

Write up the code for a string as a linked list and see if you can't
tell the difference between a 0-character-terminator, and a NULL-link
terminator.
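A sketch of that exercise follows; the node type and helper names are illustrative, not from the thread. The list's terminator is a null *link* in the metadata, so a character value like '\0' can sit in the payload as ordinary data, which is exactly the distinction being argued.

```cpp
#include <string>

// A string as a singly linked list of characters: the end of the
// string is marked by a null link, not by a reserved character,
// so '\0' may appear as ordinary data in the payload.
struct CharNode {
    char c;
    CharNode* next;
};

CharNode* make_list(const std::string& s) {
    CharNode* head = nullptr;
    for (auto it = s.rbegin(); it != s.rend(); ++it)
        head = new CharNode{*it, head};  // prepend, building front-to-back
    return head;
}

std::size_t list_length(const CharNode* n) {
    std::size_t len = 0;
    for (; n != nullptr; n = n->next)  // a null link, not '\0', ends it
        ++len;
    return len;
}

void free_list(CharNode* n) {
    while (n) { CharNode* next = n->next; delete n; n = next; }
}

// Build, measure, and free in one step.
std::size_t length_via_list(const std::string& s) {
    CharNode* l = make_list(s);
    std::size_t n = list_length(l);
    free_list(l);
    return n;
}
```

Note that a string containing an embedded NUL keeps its full length here, whereas strlen() on a char* would stop at the first '\0'.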
 
L

Lew Pitcher

Paul said:
Paul Hsieh wrote:
jacob navia wrote:
Lew Pitcher wrote:
[snip]
What the OP complains about (his direct complaint) is the result
of a failure to validate, and that can happen in any language.
Yes bugs can happen in any language.
I believe misinterpreted NUL termination is unique to the C language.
Prevalent, but not unique.  I seem to recall an .ASCIZ directive
or equivalent in more than one assembler.  And remember the CP/M I/O
functions that used '$'-terminated strings?  (Ghastly, they were.)
Are you doing a Colbert/Borat here?  That doesn't even need a response
as it clearly makes *MY* case, not yours. (Assembly is not a
"language" and I have never seen an assembler that came with a
standard library that told you how to encode strings.)

"Assembly is not a `language'" suggests that you have a
notion of "language" that is foreign to me.  People write
assembly code, read assembly code, and debug assembly code
exactly as they write, read, and debug C or Lisp.

Exactly as? I think not. In assembly, you don't have a "model" for
your machine, you have *your machine*.

Nonsense!

Agreed that, in the simplest of Assembly languages, you have an almost
one-for-one model of the machine, but the machine /doesn't/ execute CPIR
statements, it executes binary values. The ability to use human-readable
mnemonics instead of binary values takes Assembly language one step away
from the bare metal of "your machine".

Now, throw in the language designer's penchant for simplifying mnemonics,
and you take a few more steps away from the machine. The hardware doesn't
know
LD <dest>,<source>
but instead knows
LD HL,DE
and
LD HL,(SP)
and many more variants as single, unique instructions. So, now we are two
steps away from bare metal.

Add to all of this the ability to linguistically separate data from
instructions:
JP $+14
DB 'This is a test'

and to arbitrarily label instructions with human-readable names
JP PAST
DB 'This is a test'
PAST:

and to place code and data in arbitrary places

ORG 0x0100
JP PAST
DB 'This is a test'
PAST: ORG $

and we are now several steps away from bare metal.

Finally, add in MACROS and other Assembly pseudo-operations, and we've got a
language that is verifiably *not* "the machine".

FUNCTION1: BEGIN "Function 1 - LDP - 2009/08/04"
PUSHINT 100
PUSHSTRING "%d\n"
CALL PRINTF
POPSTRING
POPINT
RETURN


--
Lew Pitcher

Master Codewright & JOAT-in-training | Registered Linux User #112576
http://pitcher.digitalfreehold.ca/ | GPG public key available by request
---------- Slackware - Because I know what I'm doing. ------
 
