Re: Null-terminated strings: the final analysis.

Discussion in 'C Programming' started by Bartc, Mar 7, 2009.

  1. Bartc

    Bartc Guest

    "Tony" <> wrote in message
    news:vsnsl.13862$...
    > Null terminated strings are premature optimization when considered in
    > modern times. Null-terminated strings are baggage and C suffers from
    > holding on to this obsolete implementation. (No question here, just
    > wondering now why a computer programming language should age like a
    > person?). Comments welcome of course.


    This was discussed here recently, quite extensively too.

    And they are not just a C thing; they were and are used ubiquitously
    elsewhere, very successfully.

    Your alternatives I recall would not fit easily into the low-level C model.
    If you were using C as an intermediate language (unlikely as this might be),
    you would want things kept simple and transparent.

    --
    Bartc
    Bartc, Mar 7, 2009
    #1

  2. Bartc

    Tony Guest

    "Bartc" <> wrote in message
    news:CFusl.4007$...
    >
    > "Tony" <> wrote in message
    > news:vsnsl.13862$...
    >> Null terminated strings are premature optimization when considered in
    >> modern times. Null-terminated strings are baggage and C suffers from
    >> holding on to this obsolete implementation. (No question here, just
    >> wondering now why a computer programming language should age like a
    >> person?). Comments welcome of course.

    >
    > This was discussed here recently, quite extensively too.
    >
    > And they are not just a C thing; they were and are used ubiquitously
    > elsewhere, very successfully.
    >
    > Your alternatives I recall would not fit easily into the low-level C
    > model.


    "easily" is a subjective term, so your thought needs qualification to be
    understood. I have an R&D design where a one-character string is one byte
    (read: no overhead), small (Pascal) strings have one byte of overhead (the
    length), and larger strings have a few bytes more (but so what? they're
    larger strings!). All of them are used through one abstraction.

    As criteria for string implementation analysis, I see: space efficiency,
    time efficiency, practical usage (usage patterns are much different now than
    then, and the standard implementation may be impeding, or may have impeded,
    progress), _maybe_ "one technique to cover all string types" (literals,
    for example), character set requirements (Unicode is highly overrated IMO),
    and some decidedly less important issues.

    Tony
    Tony, Apr 12, 2009
    #2

  3. Bartc

    Mark Wooding Guest

    Mark McIntyre <> writes:

    > Some of the above may be true - C++ and Java use a form of
    > byte-counted string. However note that C's model is likely to be the
    > most space-efficient for non-trivial sized string, as the storage need
    > be only one byte longer than the string, whereas byte-counted strings
    > must be at least one byte longer.


    Of course, the relative overhead for large counted strings is
    negligible, since the overhead is constant and the string is large.
    What may be of more concern is the overhead for short strings, since
    implementations typically use the same sized length field for all
    strings. But even then the difference is unlikely to be particularly
    significant in practice.

    And none of this addresses the major defect of null-terminated strings,
    which is that they can't represent strings containing a zero byte.
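
    To make the comparison concrete, here is a minimal sketch of a counted
    string in C. The struct cstr type and the cstr_equal helper are
    hypothetical names made up for illustration; nothing like them is part of
    the standard library. The length field costs sizeof(size_t) bytes per
    string instead of one terminator byte, and in exchange the data may contain
    zero bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string view: the length is stored explicitly,
   so the data may contain '\0' bytes. Not a standard C type. */
struct cstr {
    size_t len;        /* number of bytes in data */
    const char *data;  /* need not be null-terminated */
};

/* Compare two counted strings for equality over their full length. */
static int cstr_equal(struct cstr a, struct cstr b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

    With this, { 7, "foo\0bar" } and { 7, "foo\0baz" } compare unequal, whereas
    strcmp() on the same two buffers stops at the first zero byte and reports
    them as equal.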

    -- [mdw]
    Mark Wooding, Apr 12, 2009
    #3
  4. In article <>,
    Mark Wooding <> wrote:
    >Mark McIntyre <> writes:
    >
    >> Some of the above may be true - C++ and Java use a form of
    >> byte-counted string. However note that C's model is likely to be the
    >> most space-efficient for non-trivial sized string, as the storage need
    >> be only one byte longer than the string, whereas byte-counted strings
    >> must be at least one byte longer.

    >
    >Of course, the relative overhead for large counted strings is
    >negligible, since the overhead is constant and the string is large.
    >What may be of more concern is the overhead for short strings, since
    >implementations typically use the same sized length field for all
    >strings. But even then the difference is unlikely to be particularly
    >significant in practice.
    >
    >And none of this addresses the major defect of null-terminated strings,
    >which is that they can't represent strings containing a zero byte.
    >
    >-- [mdw]


    Note that the observation that "Mark McIntyre"'s logic is all screwed up
    is not exactly news these days. It is generally buried among the
    classifieds (if it is reported at all).
    Kenny McCormack, Apr 12, 2009
    #4
  5. Bartc

    CBFalconer Guest

    Mark Wooding wrote:
    >

    ... snip ...
    >
    > And none of this addresses the major defect of null-terminated
    > strings, which is that they can't represent strings containing
    > a zero byte.


    Bearing in mind that strings contain only printable characters,
    what possible use can you have for a zero byte?

    --
    [mail]: Chuck F (cbfalconer at maineline dot net)
    [page]: <http://cbfalconer.home.att.net>
    Try the download section.
    CBFalconer, Apr 12, 2009
    #5
  6. CBFalconer <> writes:
    > Mark Wooding wrote:
    >>

    > ... snip ...
    >>
    >> And none of this addresses the major defect of null-terminated
    >> strings, which is that they can't represent strings containing
    >> a zero byte.

    >
    > Bearing in mind that strings contain only printable characters,
    > what possible use can you have for a zero byte?


    Strings most certainly can contain non-printable characters.

    "\a"

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Apr 12, 2009
    #6
  7. Bartc

    Mark Wooding Guest

    CBFalconer <> writes:

    > Bearing in mind that strings contain only printable characters,


    Says who? Many strings contain nonprinting characters, e.g., escape
    sequences for controlling terminals or printers.

    > what possible use can you have for a zero byte?


    It's a perfectly legitimate control character.

    -- [mdw]
    Mark Wooding, Apr 12, 2009
    #7
  8. Bartc

    Mark Wooding Guest

    Joe Wright <> writes:

    > I believe strings can contain tabs, and other things that don't print but I
    > agree strings cannot contain the NUL character.


    ... by a trivial consequence of C's definition of a string, no less.

    > Further, a text file is corrupted by the NUL character in a line.


    Only because C strings can't represent lines of text containing a zero
    byte. Using this to justify C's representative inadequacy is circular.

    -- [mdw]
    Mark Wooding, Apr 12, 2009
    #8
  9. Bartc

    Lew Pitcher Guest

    On April 12, 2009 15:22, in comp.lang.c, Joe Wright ()
    wrote:

    > Mark Wooding wrote:
    >> CBFalconer <> writes:
    >>
    >>> Bearing in mind that strings contain only printable characters,

    >>
    >> Says who? Many strings contain nonprinting characters, e.g., escape
    >> sequences for controlling terminals or printers.
    >>
    >>> what possible use can you have for a zero byte?

    >>
    >> It's a perfectly legitimate control character.
    >>

    >
    > OK, I'll bite. Where exactly and how is NUL used as a control character?


    As a "padding" character in many serial communications disciplines,
    primarily (think, CR,LF,NUL,NUL,NUL for slow printers, BSC comms, etc.)

    Also as a data character for many types of external device (think, ANSI
    terminal control character for cursor positioning, ToS byte in IP packets,
    etc.)

    Granted that the C standard says nothing of these uses; they are valid uses
    outside of the C standard.

    --
    Lew Pitcher

    Master Codewright & JOAT-in-training | Registered Linux User #112576
    http://pitcher.digitalfreehold.ca/ | GPG public key available by request
    ---------- Slackware - Because I know what I'm doing. ------
    Lew Pitcher, Apr 12, 2009
    #9
  10. Bartc

    Ben Pfaff Guest

    Mark Wooding <> writes:

    > Joe Wright <> writes:
    >
    >> I believe strings can contain tabs, and other things that don't print but I
    >> agree strings cannot contain the NUL character.

    >
    > ... by a trivial consequence of C's definition of a string, no less.


    On the contrary, a string always contains exactly one zero byte
    ("NUL character"):

    A string is a contiguous sequence of characters terminated
    by and including the first null character.
    --
    "I'm not here to convince idiots not to be stupid.
    They won't listen anyway."
    --Dann Corbit
    Ben Pfaff, Apr 12, 2009
    #10
  11. Mark McIntyre <> writes:
    > On 12/04/09 20:16, Mark Wooding wrote:
    >> Joe Wright<> writes:
    >>
    >>> I believe strings can contain tabs, and other things that don't print but I
    >>> agree strings cannot contain the NUL character.

    >>
    >> ... by a trivial consequence of C's definition of a string, no less.

    >
    > Exactly - definitionally.
    >
    >>> Further, a text file is corrupted by the NUL character in a line.

    >>
    >> Only because C strings can't represent lines of text containing a zero
    >> byte.

    >
    > Again we're into definitions: my definition of a text file is one that
    > doesn't contain non-alphanumeric characters. So if you send a null into
    > such a file, it's corrupted.


    I presume you meant non-printable, not non-alphanumeric; surely a text
    file can contain spaces and punctuation characters.

    Tab and newline characters are non-printable; can a text file contain
    those?

    I'm sure you can construct a rigorous definition of "text file" that
    excludes null characters. But I don't think there's any universal
    definition.

    On the systems I use, if I write a '\a' character (ASCII BEL) to a
    text file, I can reasonably expect to see a '\a' character when I read
    it back. The same is not true of '\0' if I use fgets() to read it
    (though I think I can see the '\0' if I use fgetc()).

    >> Using this to justify C's representative inadequacy is circular.

    >
    > But then so is the counter-argument that is being made. C defines a
    > string as a null-terminated array of characters, therefore it's circular
    > to complain that a string can't contain a null.
    >
    > And anyway, if you want char arrays containing nulls, C can do those, no
    > problem.


    Yes, but you can't store a null character in the middle of a string,
    which makes char arrays containing nulls more difficult to deal with.
    I'm not saying it's a fatal flaw in the language, but it is a slight
    inconvenience.

    And there are languages whose native strings *can* contain embedded
    null characters. In C, strlen("foo\0bar") returns 3; in Perl,
    length("foo\0bar") returns 7, and there's nothing particularly special
    about the 4th character.
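
    The C half of that comparison can be checked with a short sketch. The names
    sample, visible_length and stored_length are made up for this example.
    The array itself holds all eight bytes; only the string functions stop at
    the first zero byte.

```c
#include <stddef.h>
#include <string.h>

/* The literal occupies 8 bytes: 'f','o','o','\0','b','a','r','\0'. */
static const char sample[] = "foo\0bar";

/* Length as the string functions see it: stops at the first '\0'. */
static size_t visible_length(void)
{
    return strlen(sample);
}

/* Number of bytes stored in the array, not counting the final terminator. */
static size_t stored_length(void)
{
    return sizeof sample - 1;
}
```

    Here visible_length() gives 3 and stored_length() gives 7, so the data is
    all there; it is the string functions that cannot reach it.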

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Apr 12, 2009
    #11
  12. On Sun, 12 Apr 2009 13:32:39 -0700, Keith Thompson wrote:
    > On the systems I use, if I write a '\a' character (ASCII BEL) to a text
    > file, I can reasonably expect to see a '\a' character when I read it
    > back. The same is not true of '\0' if I use fgets() to read it (though
    > I think I can see the '\0' if I use fgetc()).


    If you use fgets, you can see any '\0' that you had previously written,
    but you've got to be careful to make sure you don't treat it as a
    terminator. You cannot reliably determine whether the '\0' is a
    terminator, but you can reliably detect many instances where it is not: if
    the bytes following '\0' have been altered by fgets, then they were read
    from the file. If the bytes following '\0' have not been altered by fgets,
    then you cannot be sure where they came from.
    Harald van Dijk, Apr 12, 2009
    #12
  13. Bartc

    Ben Pfaff Guest

    Mark McIntyre <> writes:

    > On 12/04/09 20:16, Mark Wooding wrote:
    >> Joe Wright<> writes:
    >>
    >>> I believe strings can contain tabs, and other things that don't print but I
    >>> agree strings cannot contain the NUL character.

    >>
    >> ... by a trivial consequence of C's definition of a string, no less.

    >
    > Exactly - definitionally.


    It is apparent that neither of you is aware of the actual
    definition of a C string:

    A string is a contiguous sequence of characters terminated
    by and including the first null character.

    A string includes the null terminator.
    --
    "Programmers have the right to be ignorant of many details of your code
    and still make reasonable changes."
    --Kernighan and Plauger, _Software Tools_
    Ben Pfaff, Apr 12, 2009
    #13
  14. Bartc

    Flash Gordon Guest

    Harald van Dijk wrote:
    > On Sun, 12 Apr 2009 13:32:39 -0700, Keith Thompson wrote:
    >> On the systems I use, if I write a '\a' character (ASCII BEL) to a text
    >> file, I can reasonably expect to see a '\a' character when I read it
    >> back. The same is not true of '\0' if I use fgets() to read it (though
    >> I think I can see the '\0' if I use fgetc()).

    >
    > If you use fgets, you can see any '\0' that you had previously written,
    > but you've got to be careful to make sure you don't treat it as a
    > terminator. You cannot reliably determine whether the '\0' is a
    > terminator,


    I think you can almost all the time, but it takes a little work...

    Fill buf with '\n'
    if (fgets(buf, siz, file) != NULL) {
        if (no '\n' in buf) {
            All '\0' in buf before buf[siz-1] were read from the file
        }
        else {
            All '\0' in buf before the '\n' were read from the file
        }
    }
    else {
        if ferror(file) {
            something went wrong and buffer is indeterminate
        }
        else {
            if buf[0]=='\n' and buf[1]=='\n' {
                end-of-file encountered and no characters read
            }
            else if there is a '\n' in buf {
                all characters before the first "\0\n" sequence were read
            }
            else {
                don't think this should happen!
            }
        }
    }

    Any holes in my C-ish pseudo-code?

    > but you can reliably detect many instances where it is not: if
    > the bytes following '\0' have been altered by fgets, then they were read
    > from the file. If the bytes following '\0' have not been altered by fgets,
    > then you cannot be sure where they came from.


    My idea is more convoluted but, I think, more reliable.
    --
    Flash Gordon
    Flash Gordon, Apr 12, 2009
    #14
  15. On Sun, 12 Apr 2009 22:57:30 +0100, Flash Gordon wrote:
    > Harald van Dijk wrote:
    >> On Sun, 12 Apr 2009 13:32:39 -0700, Keith Thompson wrote:
    >>> On the systems I use, if I write a '\a' character (ASCII BEL) to a
    >>> text file, I can reasonably expect to see a '\a' character when I read
    >>> it back. The same is not true of '\0' if I use fgets() to read it
    >>> (though I think I can see the '\0' if I use fgetc()).

    >>
    >> If you use fgets, you can see any '\0' that you had previously written,
    >> but you've got to be careful to make sure you don't treat it as a
    >> terminator. You cannot reliably determine whether the '\0' is a
    >> terminator,

    >
    > I think you can almost all the time, but it takes a little work...
    >[snip pseudo-code]


    I stand corrected. It may even be simpler than you suggested: ignoring the
    possibilities of EOF and errors (which you've already handled), after
    prefilling the buffer and calling fgets, you can scan the buffer backwards
    to find the last '\0' byte. Everything before it, including any other '\0'
    bytes, was read from the file.
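
    A minimal sketch of that backward scan follows. The function name
    fgets_len is hypothetical, and the sketch assumes that fgets leaves the
    bytes after its terminator untouched, which is how implementations behave
    in practice but is not something I can point to a guarantee for. The buffer
    is prefilled with a non-zero byte so that the last '\0' found must be the
    terminator.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line with fgets and returns the number of
   bytes actually read from the file, counting embedded '\0' bytes.
   Returns (size_t)-1 on end-of-file, error, or an unusable buffer size.
   Assumes fgets does not modify bytes after the terminator it writes. */
static size_t fgets_len(char *buf, size_t siz, FILE *fp)
{
    size_t i;

    if (siz < 2 || siz > INT_MAX)
        return (size_t)-1;

    memset(buf, 0xFF, siz);              /* prefill with a non-zero byte */
    if (fgets(buf, (int)siz, fp) == NULL)
        return (size_t)-1;

    /* Scan backwards: the last '\0' in the buffer is the terminator. */
    for (i = siz; i > 0; i--) {
        if (buf[i - 1] == '\0')
            return i - 1;                /* bytes before it came from the file */
    }
    return (size_t)-1;                   /* no terminator found; unexpected */
}
```

    The caller then treats buf as an array of that many bytes rather than as a
    string, for example by passing the length to memchr() or fwrite().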
    Harald van Dijk, Apr 12, 2009
    #15
  16. Mark McIntyre <> writes:
    > On 12/04/09 21:32, Keith Thompson wrote:
    >> Tab and newline characters are non-printable; can a text file contain
    >> those?

    >
    > Indeed, I left that as so obvious it was unsaid - I'd forgotten I was
    > in the land of the pedants!


    Subtle distinctions are at the core of what we're discussing here.
    Let's not ignore such distinctions for the sake of avoiding pedantry.

    [...]

    >> On the systems I use, if I write a '\a' character (ASCII BEL) to a
    >> text file, I can reasonably expect to see a '\a' character when I read
    >> it back. The same is not true of '\0' if I use fgets() to read it

    >
    > What would you expect to "see"? I would hope that nothing is displayed
    > on your VDU or printed on paper for instance. So in the context of
    > "text", how can it be meaningful?


    By "see", I meant that I could write something like:

    c = fgetc(my_file);
    if (c == '\a') {
        puts("Yes, it's a '\\a' character");
    }

    with the expectation that the puts statement would be executed
    sometimes.

    I left that as so obvious it was unsaid. 8-)}

    Incidentally, I do have at least one text file with an embedded ASCII
    BEL character. I have a perfectly valid reason for doing this, and
    it's never been a serious problem.

    In any case, the distinction between text files and non-text files is
    irrelevant to a discussion of C strings. Clearly C strings can
    contain any characters other than '\0', including non-printable
    characters. If I want to construct a sequence of characters
    containing a control sequence for a VT100-style terminal, for example,
    a string is a perfectly sensible thing to use. And if any such
    sequences include null characters (I don't know whether they do or
    not), then the fact that I can't store embedded null characters in
    strings is an inconvenience.

    >>> And anyway, if you want char arrays containing nulls, C can do those, no
    >>> problem.

    >>
    >> Yes, but you can't store a null character in the middle of a string,

    >
    > But again that's a circular argument.


    Not at all.

    If, because of some requirement outside the C language, I want to
    store arbitrary character sequences, I can use C strings only if I can
    guarantee that I don't need to store any null characters.

    [...]

    > So I contend that it's not a useful point. If you want to transport
    > elephants, use a crate, not a box. If you want to transport nulls, use
    > an array, not a string - or use some language that allows internal
    > nulls in its string type.


    Right. So C strings impose a limitation, and I might have to work
    around that limitation in some circumstances. That seems to me to be
    a very useful thing to be aware of.

    For example, if I'm reading chunks of data from a binary file, I can
    store those chunks in character arrays, but I can't safely use the
    language's built-in string processing functions on them. For example,
    I can't use strstr() to search for a pattern in the data. If C had
    been designed differently, that wouldn't be an issue.
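
    The usual workaround is a search that takes explicit lengths. Below is a
    minimal sketch; find_bytes is a hypothetical helper written for this
    example (some systems offer a similar memmem() as an extension, but it is
    not part of standard C). It is a plain quadratic scan built on memcmp().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the first occurrence of needle (nlen bytes)
   in hay (hlen bytes). Zero bytes are treated like any other byte.
   Returns a pointer to the match, or NULL if there is none. */
static const void *find_bytes(const void *hay, size_t hlen,
                              const void *needle, size_t nlen)
{
    const unsigned char *h = hay;
    size_t i;

    if (nlen == 0)
        return hay;                      /* empty pattern matches at the start */
    if (nlen > hlen)
        return NULL;

    for (i = 0; i + nlen <= hlen; i++) {
        if (memcmp(h + i, needle, nlen) == 0)
            return h + i;
    }
    return NULL;
}
```

    Given the eight bytes of "ab\0cd\0ef", strstr() cannot find "cd" because it
    stops at the first zero byte, while find_bytes() with the full length
    locates patterns anywhere in the data, including ones that contain zero
    bytes themselves.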

    >> which makes char arrays containing nulls more difficult to deal with.
    >> I'm not saying it's a fatal flaw in the language, but it is a slight
    >> inconvenience.

    >
    > I can't recall /ever/ having found it so, in 20+ years of
    > programming. It's surely just a matter of interface design: if you
    > expect to be fed non-strings, then don't use a string to contain
    > them. Alternatively, document the interface appropriately.


    Ok, so it's a *potential* inconvenience.

    >> And there are languages whose native strings *can* contain embedded
    >> null characters. In C, strlen("foo\0bar") returns 3; in Perl,
    >> length("foo\0bar") returns 7, and there's nothing particularly special
    >> about the 4th character.

    >
    > Apart from being a nul, which isn't a common character in real-world
    > strings. For instance, find me a place or person with a nul in their
    > name, or a word in any language, including klingon.


    Strings aren't just used to store names of places or people. And if C
    strings *could* store embedded null characters, they might be
    *slightly* more useful than they are without that ability.

    In the design of the language, a tradeoff was made between the
    convenience of null termination vs. the *slightly* greater flexibility
    of being able to store embedded null characters. I do not suggest
    that the choice was the wrong one, merely that it was a tradeoff with
    a non-zero cost. And if you've never run into it, that doesn't change
    the point.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Apr 13, 2009
    #16
  17. Bartc

    BartC Guest

    "Mark McIntyre" <> wrote in message
    news:tKtEl.165936$1.easynews.com...
    > On 12/04/09 21:32, Keith Thompson wrote:



    >> And there are languages whose native strings *can* contain embedded
    >> null characters. In C, strlen("foo\0bar") returns 3; in Perl,
    >> length("foo\0bar") returns 7, and there's nothing particularly special
    >> about the 4th character.

    >
    > Apart from being a nul, which isn't a common character in real-world
    > strings. For instance, find me a place or person with a nul in their name,
    > or a word in any language, including klingon.


    Some Win32 functions use strings with embedded zeros (eg. GetOpenFileName),
    using a double zero to terminate.
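
    Walking such a list takes only standard C. The sketch below uses no Win32
    calls; multi_count is a hypothetical helper, and it assumes the layout
    described above: ordinary strings stored back to back, with an empty string
    (a second zero byte) marking the end of the list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the strings in a double-zero-terminated
   list such as "a.txt\0b.txt\0\0". An empty list is a single '\0'. */
static size_t multi_count(const char *list)
{
    size_t n = 0;

    while (*list != '\0') {
        list += strlen(list) + 1;        /* skip this string and its '\0' */
        n++;
    }
    return n;
}
```

    Each element is still an ordinary C string; the embedded zeros act only as
    separators, which is why the scheme works with the existing string
    functions.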

    --
    BartC
    BartC, Apr 13, 2009
    #17
  18. In article <>,
    Han from China <> wrote:
    >Kenny McCormack wrote:
    >> Note that the observation that "Mark McIntyre"'s logic is all screwed up
    >> is not exactly news these days. It is generally buried among the
    >> classifieds (if it is reported at all).

    >
    >McIntyre's posts in this thread show why he should remain a foul-mouthed
    >attack dog on the perimeter of Thompson, Heathfield, & Co. headquarters
    >instead of trying to gain a promotion to lead janitor.


    Yep.
    Kenny McCormack, Apr 13, 2009
    #18
  19. Bartc

    CBFalconer Guest

    Keith Thompson wrote:
    > CBFalconer <> writes:
    >> Mark Wooding wrote:
    >>>

    >> ... snip ...
    >>>
    >>> And none of this addresses the major defect of null-terminated
    >>> strings, which is that they can't represent strings containing
    >>> a zero byte.

    >>
    >> Bearing in mind that strings contain only printable characters,
    >> what possible use can you have for a zero byte?

    >
    > Strings most certainly can contain non-printable characters.


    My verbiage bytes again. Will you settle for 'actionable chars'?

    --
    [mail]: Chuck F (cbfalconer at maineline dot net)
    [page]: <http://cbfalconer.home.att.net>
    Try the download section.
    CBFalconer, Apr 13, 2009
    #19
  20. Bartc

    CBFalconer Guest

    Flash Gordon wrote:
    > Harald van Dijk wrote:
    >> Keith Thompson wrote:
    >>
    >>> On the systems I use, if I write a '\a' character (ASCII BEL)
    >>> to a text file, I can reasonably expect to see a '\a' character
    >>> when I read it back. The same is not true of '\0' if I use
    >>> fgets() to read it (though I think I can see the '\0' if I use
    >>> fgetc()).

    >>
    >> If you use fgets, you can see any '\0' that you had previously
    >> written, but you've got to be careful to make sure you don't
    >> treat it as a terminator. You cannot reliably determine whether
    >> the '\0' is a terminator,

    >
    > I think you can almost all the time, but it takes a little work...
    >
    > Fill buf with '\n'

    > ... snip code ...

    I don't think you need all that.

    int lastch; /* evil global */

    size_t countofzeroes(FILE *f) {
        size_t cnt = 0;

        while (0 == (lastch = getc(f))) cnt++;
        return cnt;
    }

    at which point we have read cnt copies of '\0' followed by lastch
    (which may be EOF). No strings involved.

    --
    [mail]: Chuck F (cbfalconer at maineline dot net)
    [page]: <http://cbfalconer.home.att.net>
    Try the download section.
    CBFalconer, Apr 13, 2009
    #20