Increasing efficiency in C

Discussion in 'C Programming' started by jacob navia, Mar 3, 2004.

  1. jacob navia

    jacob navia Guest

    As everybody knows, C uses a zero delimited unbounded
    pointer for its representation of strings.

    This is extremely inefficient because at each query of the
    length of the string, the computer starts an unbounded
    memory scan searching for a zero that ends the string.

    A more efficient representation is:

    struct string {
    size_t length;
    char data[];
    };

    The length operation becomes just a memory read.
    This would considerably speed the programs. The basic
    idea is to use a string type that is length prefixed and
    allows run-time checking against UB: undefined
    behavior.

    Comparing strings is speeded up also because when
    testing for equality, the first length comparison tells
    maybe the whole story with just a couple of
    memory reads.

    A string like the one described above is not able to
    resize itself. Any pointers to it would cease to be valid
    when it is resized if the memory allocator is forced to
    move memory around. The block where that string was
    allocated is bounded by another blocks in memory, and
    it is not possible to resize it.

    A pointer ( an indirect representation) costs a sizeof(void *)
    but allows to resize strings without invalidating the pointers
    to them.

    struct string {
    size_t length;
    char *data;
    };

    There is no compelling reason to choose one or the other.
    It depends on the application. In any case, the standard
    library could be complemented by
    Strcmp
    Strcpy
    etc., all using length prefixed strings.

    Syntactic sugar.

    I have added some sugar to this coffee. I always liked coffee
    with a bit of sugar. I feel that is too acid without it.

    Current strings are used using the [ ] notation. This strings
    could have the same privilege isn't it?

    The language extension I propose is that the user has the right to
    define the operation [ ] for any data type he/she wishes.

    Not a big deal for today's compilers.

    Length checked strings can then use:

    String s;
    ....
    s[2] = 'a';

    I think I am proposing the obvious.

    Do you agree?

    jacob
     
    jacob navia, Mar 3, 2004
    #1
    1. Advertising

  2. jacob navia

    Mike Wahler Guest

    "jacob navia" <> wrote in message
    news:c25kuu$rae$...

    > The language extension I propose is that the user has the right to
    > define the operation [ ] for any data type he/she wishes.
    >
    > Not a big deal for today's compilers.
    >
    > Length checked strings can then use:
    >
    > String s;
    > ...
    > s[2] = 'a';
    >
    > I think I am proposing the obvious.


    I think you're proposing C++. Rather than try to 'reinvent' it,
    I just use it.

    -Mike
     
    Mike Wahler, Mar 3, 2004
    #2
    1. Advertising

  3. jacob navia

    jacob navia Guest

    "Mike Wahler" <> a écrit dans le message de
    news:LMs1c.17672$...
    > "jacob navia" <> wrote in message
    > news:c25kuu$rae$...
    >
    >
    > I think you're proposing C++. Rather than try to 'reinvent' it,
    > I just use it.


    Well I can't use it Mike.

    Just too complex.

    Default instantiated template traits?

    No thanks. Just characters. What I like of C is that it is not
    "object oriented".

    It is not oriented at all. It is the programmer that puts
    the orientation of the program.

    C++ has good ideas, but the complexity of the whole is
    so staggering, that actually it is a reminder where that leads,
    not knowing when to stop.

    The crux of the matter is knowing when to stop. When a feature
    becomes a nuisance, and doesn't simplify the task it is better
    to drop it.

    Syntactic sugar can lead to caries in the teeths. I said I like
    a *bit* of sugar. Not five spoonfuls you see?

    I want a bit of sugar in my coffee and not some coffee in my
    sugar.

    jacob
     
    jacob navia, Mar 3, 2004
    #3
  4. jacob navia <> wrote:
    > As everybody knows, C uses a zero delimited unbounded
    > pointer for its representation of strings.


    > This is extremely inefficient because at each query of the
    > length of the string, the computer starts an unbounded
    > memory scan searching for a zero that ends the string.


    Correction, strcmp is inefficient, I'd like to see someone produce a more
    efficient model for strings using even assembly

    > A more efficient representation is:


    > struct string {
    > size_t length;
    > char data[];
    > };


    I'm not really sure what the question is, you seem to have asked and answered
    all in one...

    > The language extension I propose is that the user has the right to
    > define the operation [ ] for any data type he/she wishes.


    foo[x] is the value held in the array foo at location x, not string foo at
    location x. strings don't even exist in c, just memory.

    > Length checked strings can then use:


    > String s;
    > ...
    > s[2] = 'a';


    > I think I am proposing the obvious.


    See C++

    --
    Harrison Caudill | .^ www.hypersphere.org
    Computer Science & Physics Double Major | | Me*Me=1
    Georgia Institute of Technology | v' I'm just a normal guy
     
    Charles Harrison Caudill, Mar 3, 2004
    #4
  5. jacob navia

    Mike Wahler Guest

    "jacob navia" <> wrote in message
    news:c25lvs$pev$...
    >
    > "Mike Wahler" <> a écrit dans le message de
    > news:LMs1c.17672$...
    > > "jacob navia" <> wrote in message
    > > news:c25kuu$rae$...
    > >
    > >
    > > I think you're proposing C++. Rather than try to 'reinvent' it,
    > > I just use it.

    >
    > Well I can't use it Mike.


    Whatever.

    > Just too complex.


    You needn't use all of it. You seem to be wanting
    a 'real' string type, which C++ has, and it's not
    at all difficult to use. Actually, if such a type
    is needed, imo that's a good enough reason to use C++
    (even if everything else is only the common subset
    of the two languages).

    > Default instantiated template traits?


    So don't use 'em.

    > No thanks. Just characters. What I like of C is that it is not
    > "object oriented".


    C++ does not require OO design. This is a very common misconception.

    > It is not oriented at all. It is the programmer that puts
    > the orientation of the program.


    Right. Which is why I find C++ useful for very many things.
    (and C as well, and other languages too).

    > C++ has good ideas, but the complexity of the whole is
    > so staggering, that actually it is a reminder where that leads,
    > not knowing when to stop.


    I use the parts I find useful, discard the rest.

    > The crux of the matter is knowing when to stop. When a feature
    > becomes a nuisance, and doesn't simplify the task it is better
    > to drop it.


    Or ignore it. Simple, huh?

    > Syntactic sugar can lead to caries in the teeths. I said I like
    > a *bit* of sugar. Not five spoonfuls you see?


    Ever hear of self-discipline? Anything can be abused.

    > I want a bit of sugar in my coffee and not some coffee in my
    > sugar.


    Who's got control of the spoon, you or someone else? :)

    I'll stop now, I don't want to be accused of language
    advocacy in a group about a different language. (It's
    probably too late, though :))

    -Mike
     
    Mike Wahler, Mar 3, 2004
    #5
  6. jacob navia

    Leor Zolman Guest

    On Wed, 3 Mar 2004 23:07:25 +0100, "jacob navia" <>
    wrote:

    >As everybody knows, C uses a zero delimited unbounded
    >pointer for its representation of strings.


    zero terminated, anyway.

    >
    >This is extremely inefficient because at each query of the
    >length of the string, the computer starts an unbounded
    >memory scan searching for a zero that ends the string.


    It is "extremely inefficient" only if you're continuously recalculating the
    length. For applications where you're not, it is extremely efficient.

    >
    >A more efficient representation is:
    >
    >struct string {
    > size_t length;
    > char data[];
    >};


    What is that [] about? That's not a legal definition. Are you implying that
    a fixed-length array implementation (with an actual size in there) is an
    improvement in any significant way over a simple char *? I don't think so.
    >
    >The length operation becomes just a memory read.
    >This would considerably speed the programs. The basic
    >idea is to use a string type that is length prefixed and
    >allows run-time checking against UB: undefined
    >behavior.


    Now it is starting to sound like Java.

    >
    >Comparing strings is speeded up also because when
    >testing for equality, the first length comparison tells
    >maybe the whole story with just a couple of
    >memory reads.


    Perhaps a bit; but on average, inequality is determined pretty quick the
    conventional way, and equality would actually take /more/ time to
    determine. But yes, you might net a teeny bit of an improvement.

    >
    >A string like the one described above is not able to
    >resize itself.


    Are we talking about the one with the fixed-length array, or the version
    with the mysterious empty brackets? Either way, /nothing/ in C can "resize
    itself"...

    >Any pointers to it would cease to be valid
    >when it is resized if the memory allocator is forced to
    >move memory around. The block where that string was
    >allocated is bounded by another blocks in memory, and
    >it is not possible to resize it.
    >
    >A pointer ( an indirect representation) costs a sizeof(void *)
    >but allows to resize strings without invalidating the pointers
    >to them.
    >
    >struct string {
    > size_t length;
    > char *data;
    >};


    This is the classic first C++ class implementation exercise. Thinking about
    it yields some good fundamental principles about class design. But, to
    achieve a true performance benefit in a string service, it ultimately
    requires tailoring the string implementation to the specific circumstances
    in which it will be used. There's no magic bullet; the "irrational
    exuberance" surrounding the rise-and-fall of reference counted Standard C++
    string implementations is a case in point.

    >
    >There is no compelling reason to choose one or the other.
    >It depends on the application. In any case, the standard
    >library could be complemented by
    >Strcmp
    >Strcpy
    >etc., all using length prefixed strings.
    >
    >Syntactic sugar.
    >
    >I have added some sugar to this coffee. I always liked coffee
    >with a bit of sugar. I feel that is too acid without it.
    >
    >Current strings are used using the [ ] notation. This strings
    >could have the same privilege isn't it?
    >
    >The language extension I propose is that the user has the right to
    >define the operation [ ] for any data type he/she wishes.
    >
    >Not a big deal for today's compilers.


    Might be a bigger deal for the C Standards committee ;-)

    >
    >Length checked strings can then use:
    >
    >String s;
    >...
    >s[2] = 'a';
    >
    >I think I am proposing the obvious.
    >
    >Do you agree?


    Mike's right: use C++.
    -leor

    >
    >jacob
    >


    Leor Zolman
    BD Software

    www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
    C++ users: Download BD Software's free STL Error Message
    Decryptor at www.bdsoft.com/tools/stlfilt.html
     
    Leor Zolman, Mar 3, 2004
    #6
  7. jacob navia

    Leor Zolman Guest

    On Wed, 03 Mar 2004 22:54:20 GMT, "Mike Wahler" <>
    wrote:

    >"jacob navia" <> wrote in message
    >news:c25lvs$pev$...
    >>
    >> "Mike Wahler" <> a écrit dans le message de
    >> news:LMs1c.17672$...
    >> > "jacob navia" <> wrote in message
    >> > news:c25kuu$rae$...
    >> >
    >> >
    >> > I think you're proposing C++. Rather than try to 'reinvent' it,
    >> > I just use it.

    >>
    >> Well I can't use it Mike.

    >
    >Whatever.
    >
    >> Just too complex.

    >
    >You needn't use all of it. You seem to be wanting
    >a 'real' string type, which C++ has, and it's not
    >at all difficult to use. Actually, if such a type
    >is needed, imo that's a good enough reason to use C++
    >(even if everything else is only the common subset
    >of the two languages).
    >


    Jaocb: There's even a common term used to describe C++ when used only for
    those features that are a direct "clean-up" of messiness left over from C's
    need to be backward-compatible with earlier incarnations of itself: "A
    Better C". I don't think the C++ string class is typically lumped in with
    those features (which I'm reluctant to go into here since this isn't a C++
    group), but I'd think some more about it before discarding the idea of
    using C++ as "C plus strings." I even wrote a CUJ article to help folks
    with this very issue:
    http://www.bdsoft.com/resources/thinking.html
    (The title has "STL" in it, but the article is really about migration of
    char *-based code to using strings)
    -leor

    Leor Zolman
    BD Software

    www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
    C++ users: Download BD Software's free STL Error Message
    Decryptor at www.bdsoft.com/tools/stlfilt.html
     
    Leor Zolman, Mar 3, 2004
    #7
  8. jacob navia

    jacob navia Guest

    "Leor Zolman" <> a écrit dans le message de
    news:...
    > On Wed, 3 Mar 2004 23:07:25 +0100, "jacob navia" <>
    > wrote:
    >
    > >As everybody knows, C uses a zero delimited unbounded
    > >pointer for its representation of strings.

    >
    > zero terminated, anyway.


    Yes Sir!
    Zero terminated and surely NOT zero delimited. What a deep
    difference :)

    >
    > >
    > >This is extremely inefficient because at each query of the
    > >length of the string, the computer starts an unbounded
    > >memory scan searching for a zero that ends the string.

    >
    > It is "extremely inefficient" only if you're continuously recalculating

    the
    > length.


    Obviously. And this is a very common use, haven't you
    notice it?


    > For applications where you're not, it is extremely efficient.


    Sorry but this string was once constructed, and the
    length was known. Why not keeping this information?

    What about the security?

    What about the failure modes of unbounded pointers,

    >
    > >
    > >A more efficient representation is:
    > >
    > >struct string {
    > > size_t length;
    > > char data[];
    > >};

    >
    > What is that [] about? That's not a legal definition.


    C99 introduces variable length arrays. This is standard
    notation.


    > Are you implying that
    > a fixed-length array implementation (with an actual size in there) is an
    > improvement in any significant way over a simple char *?


    Yes.

    1 Length operation is trivial
    2 Comparisons for equality are cheaper when the length
    of the strings differ. You never know this in C strings
    and you have to start scanning for that zero...
    3 Bounds checked strings can be implemented.

    > I don't think so.


    Well. I think so for the reasons above. Can you
    maybe go to those reasons in detail?

    > >
    > >The length operation becomes just a memory read.
    > >This would considerably speed the programs. The basic
    > >idea is to use a string type that is length prefixed and
    > >allows run-time checking against UB: undefined
    > >behavior.

    >
    > Now it is starting to sound like Java.
    >


    In matters of languages I do not despise any. I am
    sorry, I like C but I am not a zealot, and see
    C's problems and weakness. A bad string type
    is the reason for many bugs we could really get
    rid of.

    > >
    > >Comparing strings is speeded up also because when
    > >testing for equality, the first length comparison tells
    > >maybe the whole story with just a couple of
    > >memory reads.

    >
    > Perhaps a bit; but on average, inequality is determined pretty quick the
    > conventional way, and equality would actually take /more/ time to
    > determine. But yes, you might net a teeny bit of an improvement.
    >


    And also net a big security improvement...

    > >
    > >A string like the one described above is not able to
    > >resize itself.

    >
    > Are we talking about the one with the fixed-length array, or the version
    > with the mysterious empty brackets? Either way, /nothing/ in C can "resize
    > itself"...
    >

    Sorry, I thought realloc was part of C...

    > This is the classic first C++ class implementation exercise. Thinking

    about
    > it yields some good fundamental principles about class design.


    Maybe but I do not want any class design. There are no classes
    in C. I want strings for holding text. As I said, no
    default instantiated template traits. Just chars please.

    > But, to
    > achieve a true performance benefit in a string service, it ultimately
    > requires tailoring the string implementation to the specific circumstances
    > in which it will be used. There's no magic bullet; the "irrational
    > exuberance" surrounding the rise-and-fall of reference counted Standard

    C++
    > string implementations is a case in point.
    >


    Yes, each application has its own needs. That's why I would
    propose that the user writes many specialized string
    structures, that share a common description.

    Length delimited strings are infinitely extensible with other
    features.

    > Mike's right: use C++.


    I answered that to Mike. See my answer in a parallel thread.
    I think C is the last not object oriented language around.
    That makes it very interesting.

    jacob
    http://www.cs.virginia.edu/~lcc-win32
     
    jacob navia, Mar 4, 2004
    #8
  9. jacob navia

    Leor Zolman Guest

    On Thu, 4 Mar 2004 01:19:26 +0100, "jacob navia" <>
    wrote:

    >
    >"Leor Zolman" <> a écrit dans le message de
    >news:...
    >> On Wed, 3 Mar 2004 23:07:25 +0100, "jacob navia" <>
    >> wrote:
    >>
    >> >As everybody knows, C uses a zero delimited unbounded
    >> >pointer for its representation of strings.

    >>
    >> zero terminated, anyway.

    >
    >Yes Sir!
    >Zero terminated and surely NOT zero delimited. What a deep
    >difference :)


    I think of delimiters as a matched set, terminated as asymmetric. Just
    seemed off to use it there, but yes, I'm sure everyone knew what you meant.

    >
    >>
    >> >
    >> >This is extremely inefficient because at each query of the
    >> >length of the string, the computer starts an unbounded
    >> >memory scan searching for a zero that ends the string.

    >>
    >> It is "extremely inefficient" only if you're continuously recalculating

    >the
    >> length.

    >
    >Obviously. And this is a very common use, haven't you
    >notice it?


    Knowing something about how the strings are going to be used is precisely
    what drives the design decision of which flavor to use. When there's going
    to be a lot of repeated length testing, that fact may contribute to a
    decision against my using plain old char *'s /in that application/.

    >
    >
    >> For applications where you're not, it is extremely efficient.

    >
    >Sorry but this string was once constructed, and the
    >length was known. Why not keeping this information?


    Keeping and maintaining it has a spacial and temporal cost. Is it always
    justified? Sometimes, probably. Usually? Always?

    >
    >What about the security?
    >
    >What about the failure modes of unbounded pointers,


    C doesn't provide any automatic protection for these things. The spirit of
    C is to let the programmer program them if they're needed. Period.

    >
    >>
    >> >
    >> >A more efficient representation is:
    >> >
    >> >struct string {
    >> > size_t length;
    >> > char data[];
    >> >};

    >>
    >> What is that [] about? That's not a legal definition.

    >
    >C99 introduces variable length arrays. This is standard
    >notation.


    Darn, I'm really going to actually have to write a piece of code using VLAs
    some day, so I can at least recognize them when they get used (blush). But
    the problem is, I don't like them ;-)

    >
    >
    >> Are you implying that
    >> a fixed-length array implementation (with an actual size in there) is an
    >> improvement in any significant way over a simple char *?

    >
    >Yes.
    >
    >1 Length operation is trivial
    >2 Comparisons for equality are cheaper when the length
    > of the strings differ. You never know this in C strings
    > and you have to start scanning for that zero...


    ....or the first mismatch. If you happen to know that enough of your strings
    will be identical for their first several characters /and/ be of different
    lengths for this to make a significant difference, you'd have good reason
    to use your implementation /in that application/.

    >3 Bounds checked strings can be implemented.


    They can, but lots of things /can/ be implemented, it is just that C has no
    pretense of supporting such things at the core language level. Neither does
    C++, for that matter.

    >
    >> I don't think so.

    >
    >Well. I think so for the reasons above. Can you
    >maybe go to those reasons in detail?


    I'm not compelled to, no.

    >
    >> >
    >> >The length operation becomes just a memory read.
    >> >This would considerably speed the programs. The basic
    >> >idea is to use a string type that is length prefixed and
    >> >allows run-time checking against UB: undefined
    >> >behavior.

    >>
    >> Now it is starting to sound like Java.
    >>

    >
    >In matters of languages I do not despise any. I am
    >sorry, I like C but I am not a zealot, and see
    >C's problems and weakness. A bad string type
    >is the reason for many bugs we could really get
    >rid of.


    Nor do I despise Java (I've even written an article, still available on
    line somewhere, outlining why I believe Java makes a great "first"
    programming language.) But hand-holding features are just /not/ in C's job
    description, I'm sorry.

    >
    >> >
    >> >Comparing strings is speeded up also because when
    >> >testing for equality, the first length comparison tells
    >> >maybe the whole story with just a couple of
    >> >memory reads.

    >>
    >> Perhaps a bit; but on average, inequality is determined pretty quick the
    >> conventional way, and equality would actually take /more/ time to
    >> determine. But yes, you might net a teeny bit of an improvement.
    >>

    >
    >And also net a big security improvement...


    Which you may or may not want to pay for.

    >
    >> >
    >> >A string like the one described above is not able to
    >> >resize itself.

    >>
    >> Are we talking about the one with the fixed-length array, or the version
    >> with the mysterious empty brackets? Either way, /nothing/ in C can "resize
    >> itself"...
    >>

    >Sorry, I thought realloc was part of C...


    What I'm saying is that nothing "resizes itself", there has to be user code
    to recognize the need, dispatch to the appropriate functions, etc. At any
    given point in a design, a C programmer can choose whether or not to do
    that stuff. She may choose not to, for reasons that make all the sense in
    world for that application. She may not want that overhead forced upon her.

    >
    >> This is the classic first C++ class implementation exercise. Thinking

    >about
    >> it yields some good fundamental principles about class design.

    >
    >Maybe but I do not want any class design. There are no classes
    >in C. I want strings for holding text. As I said, no
    >default instantiated template traits. Just chars please.


    I'm not trying to force class design down your throat, I'm just saying
    that "black-box" string management is always going to be either prejudicial
    to some quality of the data being operated upon, or middle-of-the road and
    thus probably not optimal for /your/ situation, whatever that may be; it
    can't be sufficiently general-purpose and really efficient for some special
    case...

    >
    >> But, to
    >> achieve a true performance benefit in a string service, it ultimately
    >> requires tailoring the string implementation to the specific circumstances
    >> in which it will be used. There's no magic bullet; the "irrational
    >> exuberance" surrounding the rise-and-fall of reference counted Standard

    >C++
    >> string implementations is a case in point.
    >>

    >
    >Yes, each application has its own needs. That's why I would
    >propose that the user writes many specialized string
    >structures, that share a common description.
    >
    >Length delimited strings are infinitely extensible with other
    >features.
    >
    >> Mike's right: use C++.

    >
    >I answered that to Mike. See my answer in a parallel thread.
    >I think C is the last not object oriented language around.
    >That makes it very interesting.
    >
    >jacob
    >http://www.cs.virginia.edu/~lcc-win32
    >


    Okay. Good luck in your quest,
    -leor



    Leor Zolman
    BD Software

    www.bdsoft.com -- On-Site Training in C/C++, Java, Perl & Unix
    C++ users: Download BD Software's free STL Error Message
    Decryptor at www.bdsoft.com/tools/stlfilt.html
     
    Leor Zolman, Mar 4, 2004
    #9
  10. jacob navia

    Mike Wahler Guest

    "jacob navia" <> wrote in message
    news:c25smg$pqs$...
    > > Are we talking about the one with the fixed-length array, or the version
    > > with the mysterious empty brackets? Either way, /nothing/ in C can

    "resize
    > > itself"...
    > >

    > Sorry, I thought realloc was part of C...


    Ever notice that the return value from 'realloc()' is
    often not the same as returned by the original 'malloc()' or
    'calloc()'? 'realloc()' often (in my experience almost
    always, except in the cases of 'trivial' size increases)
    does a new allocation followed by a copy and a deallocation.
    Not really a 'resizing'.


    -Mike
     
    Mike Wahler, Mar 4, 2004
    #10
  11. On Wed, 3 Mar 2004, jacob navia wrote:
    >
    > As everybody knows, C uses a [null-terminated string model]
    > for its representation of strings.


    > A more efficient representation is:

    <snip>
    > struct string {
    > size_t length;
    > char *data;
    > };
    >
    > There is no compelling reason to choose one or the other.


    Except the aforementioned efficiency reasons, of course. ;-)
    So far, so good; but here you start to go downhill.

    > It depends on the application. In any case, the standard
    > library could be complemented by
    > Strcmp
    > Strcpy
    > etc., all using length prefixed strings.


    Except for the fact that nobody in the world would accept that
    kind of library bloat in C0x. I don't even like the dozens of
    transcendental and date-manipulation functions in C90. :-D Besides,
    if there's one thing a "simple" language *doesn't* need, it's two
    different and incompatible implementations of one fundamental
    concept (i.e., "string").

    > Syntactic sugar.


    Not if it's accomplished by making the programmer do the bookkeeping
    on those new library functions, it's not. I don't want to have to
    remember what kind of string I'm using! Let the computer do it!
    (This is one of the reasons I hate Java's library model: I don't care
    whether I'm using a TreeFoo or a ListFoo or a HashFoo, I just want
    some kind of Foo. They're all equally capable; why should I be
    forced to keep books on which one I'm using at the moment?)

    > The language extension I propose is that the user has the right to
    > define the operation [ ] for any data type he/she wishes.


    I.e., you're proposing to take a C++ compiler and strip it of
    most of the goodies. Okay, but that won't be the C language any
    more. Consider simplicity and orthogonality: you really think you
    should be able to re-define the semantics of [] but not * or ->?
    That's foolishness waiting to happen.

    > Do you agree?


    Of course not!

    But remember where I said "so far, so good"? What you *should*
    have concluded, there, was that since the "Pascal-style" string
    model is so superior to the "C-style" string model, that wouldn't
    it be neat if somebody implemented a C compiler that used the
    Pascal model internally?!
    That is, the *programmer* would still see a completely conforming
    standard C implementation; but when he writes

    a = strlen(a);

    where a typical C compiler would assemble the equivalent of
    [completely untested and not-real code, but you get the point:]

    ; strlen(a)
    MOV AX, 0
    L1: MOV BX, a[AX]
    INC AX
    JNZ L1
    SUB AX, a
    DEC AX
    ; a
    MOV BX, a
    ADD BX, i
    ; =
    MOV [BX], AX

    this hypothetical "Pascal-style" implementation could do the
    much faster

    ; strlen(a)
    MOV AX, [internal_a]
    ; a
    MOV BX, [internal_a+4]
    ADD BX, i
    ; =
    MOV [BX], AX

    Of course, this would require a *lot* of thinking-out ahead of
    time, and a *lot* of compiler support (including clever workarounds
    for users' trying to memcpy() over strings, or storing strings in
    unions, or simply using 'malloc' in creative ways)... but it would
    be really neat IMHO if you could get it to work.
    It would certainly be a bigger and more widely interesting challenge
    than simply re-implementing half of C++.

    -Arthur
     
    Arthur J. O'Dwyer, Mar 4, 2004
    #11
  12. jacob navia

    Chris Torek Guest

    In article <news:c25smg$pqs$>
    jacob navia <> writes:
    >In matters of languages I do not despise any. I am
    >sorry, I like C but I am not a zealot, and see
    >C's problems and weakness. A bad string type
    >is the reason for many bugs we could really get
    >rid of.


    I do not think C's "string" data format is necessarily "bad", merely
    "limited". The counted-strings other languages have used have their
    own advantages and drawbacks.

    Perhaps the biggest problem (as it were) with C is that it provides
    such a limited built-in syntax for *generating* these anonymous
    arrays. Anything inside double quotes is, ignoring the exception
    for initializers, one of these special anonymous arrays. In C99
    we at least can create anonymous "struct"s:

    struct counted_string { size_t len; char *data; };
    ...
    func((struct counted_string){sizeof "foo" - 1, "foo"});

    Of course, as shown here, you cannot even use flexible array members
    (initializing a variant of counted_string with "char data[]" instead
    of "char *data" appears to be invalid). It might be nice if one
    could get the above without resorting to preprocessor tricks.

    (On the other hand, if you want Lisp, you know where to find it. :) )
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
     
    Chris Torek, Mar 4, 2004
    #12
  13. in comp.lang.c i read:

    >since the "Pascal-style" string
    >model is so superior to the "C-style" string model, that wouldn't
    >it be neat if somebody implemented a C compiler that used the
    >Pascal model internally?!
    > That is, the *programmer* would still see a completely conforming
    >standard C implementation;


    > Of course, this would require a *lot* of thinking-out ahead of
    >time, and a *lot* of compiler support (including clever workarounds
    >for users' trying to memcpy() over strings, or storing strings in
    >unions, or simply using 'malloc' in creative ways)...


    or even the bog simple: a[strlen(a)-2] = 0;

    --
    a signature
     
    those who know me have no need of my name, Mar 4, 2004
    #13
  14. jacob navia

    Old Wolf Guest

    > As everybody knows, C uses a zero delimited unbounded
    > pointer for its representation of strings.


    Pointers cannot be bounded or unbounded, as such.
    String constants are represented by an array. C arrays
    are bounded (but it is not required for these bounds
    to ever be checked).
    C library functions expect a pointer into an array
    or some other piece of allocated memory, which contains
    a zero-terminator. Calling a C library function with
    anything else is a programming error.
    >
    > This is extremely inefficient because at each query of the
    > length of the string, the computer starts an unbounded
    > memory scan searching for a zero that ends the string.


    It is inefficient to repeatedly calculate the length of the
    string, as you say. However most of us are clever enough to
    not do that.

    >
    > A more efficient representation is:
    >
    > struct string {
    > size_t length;
    > char data[];
    > };


    Efficient? Apart from requiring C99 (harder to get a compiler
    for than C++) , it uses a lot more memory and is slower at
    runtime because you have to do extra checks and updates every
    time you modify the string. Also it cannot be declared
    statically or allocated using malloc (if my understanding of
    C99 VLA is correct). This is not my idea of efficient.

    If you meant "char data[N]" for some N, then it wastes
    even more memory and limits your string size.

    > The length operation becomes just a memory read.
    > This would considerably speed the programs.


    Programs which check length a lot and do not do much else
    with the string, possibly. This is a small subset of programs.

    > The basic idea is to use a string type that is length
    > prefixed and allows run-time checking against UB: undefined
    > behavior.


    How slow and inefficient. I prefer to prevent UB by coding
    correctly. Also, how do you propose to check against UB
    this way. Anyone can introduce UB by modifying the string
    directly, unless you also enforce all modifications to it
    to be via a library. (how slow).

    > Comparing strings is speeded up also because when
    > testing for equality, the first length comparison tells
    > maybe the whole story with just a couple of
    > memory reads.


    And if they were equal length you have wasted a comparison.
    If your application is the sort where you test string equality
    so much that it is important, you would probably introduce
    some other method of equality checking (eg. a hash).

    Most instances of "comparing strings" are actually interested
    in which one comes first alphabetically, for which your
    counted string is slower.

    > A string like the one described above is not able to
    > resize itself. Any pointers to it would cease to be valid
    > when it is resized if the memory allocator is forced to
    > move memory around. The block where that string was
    > allocated is bounded by another blocks in memory, and
    > it is not possible to resize it.


    This seems to be a problem with any string representation
    (including C's builtin one)

    > A pointer ( an indirect representation) costs a sizeof(void *)
    > but allows to resize strings without invalidating the pointers
    > to them.
    >
    > struct string {
    > size_t length;
    > char *data;
    > };


    How does this allow resizing without invalidating pointers?
    I presume you have something in mind like this:

    char *ptr = str.data + 6;
    resize_string(&str);
    /* keep using ptr... */

    To make a string bigger you have to get new memory (There is no
    portable way of getting more memory at the same location because,
    as you pointed out before, there might be something else already
    using the desired memory).

    > There is no compelling reason to choose one or the other.
    > It depends on the application.


    Congratulations, some sense. Wisely, the standard library
    does not make either choice, leaving it up to the programmer
    to do what is most efficient for his/her program.

    > In any case, the standard
    > library could be complemented by
    > Strcmp
    > Strcpy
    > etc., all using length prefixed strings.


    What bloat. The C library is big enough as it is.

    > Syntactic sugar.
    >
    > I have added some sugar to this coffee. I always liked coffee
    > with a bit of sugar. I feel that is too acid without it.
    >
    > Current strings are used using the [ ] notation. This strings
    > could have the same privilege isn't it?
    >
    > The language extension I propose is that the user has the right to
    > define the operation [ ] for any data type he/she wishes.
    >
    > Not a big deal for today's compilers.
    >
    > Length checked strings can then use:
    >
    > String s;
    > ...
    > s[2] = 'a';
    >
    > I think I am proposing the obvious.


    I think you are proposing bounds-checked arrays. The standard
    allows you to implement this already. Why not go and do it
    and then advertise your compiler as supporting bounds checking.
    In fact, why not write a library as you have proposed, and package
    it with your compiler. Then people can choose if it suits them or not.

    > Do you agree?


    Not really, no. You seem to be making big assumptions about
    the rest of the world's programming requirements.
     
    Old Wolf, Mar 4, 2004
    #14
  15. jacob navia

    Richard Bos Guest

    "jacob navia" <> wrote:

    > As everybody knows, C uses a zero delimited unbounded
    > pointer for its representation of strings.
    >
    > This is extremely inefficient because at each query of the
    > length of the string, the computer starts an unbounded
    > memory scan searching for a zero that ends the string.
    >
    > A more efficient representation is:
    >
    > struct string {
    > size_t length;
    > char data[];
    > };


    Possibly efficient in time, for certain programs, but almost certainly
    inefficient in storage. The problem is two-pronged: either your strings
    are large relative to your size_t, or they aren't. In the first case,
    you will, sooner or later, run into a string that simply won't fit
    inside a size_t.
    Ok, so you need a size_t that is large enough to contain every possible
    string length on the system, large enough to address all of your memory.
    But in that case, you'll hit the second case: now, you're using four
    bytes, maybe eight, to address a single string, AOT C's one-byte null
    terminator. Not a problem, perhaps, if all your strings are dozens of
    kilobytes, but most applications seem to use lots and lots of small
    strings, every single one potentially costing you seven bytes extra, and
    only a few large ones.
    You might get away with this on systems where sizeof(size_t) ==
    sizeof('\0'). IOW, an embedded device, most likely. But how many strings
    does a microwave oven need, anyway? The only application I can think of
    that makes this worthwhile is a mobile phone, where you may need quite a
    few strings, most of them static.

    > The length operation becomes just a memory read.
    > This would considerably speed the programs. The basic
    > idea is to use a string type that is length prefixed and
    > allows run-time checking against UB: undefined
    > behavior.


    How? What if I want a 30-char array, of which the first 20 to 29 chars
    contain a string, and the last char contains a checksum? Modifying the
    checksum wouldn't be undefined behaviour at all, but it would write
    beyond data[length-1].

    > Comparing strings is speeded up also because when
    > testing for equality, the first length comparison tells
    > maybe the whole story with just a couple of
    > memory reads.


    Erm... no. No, it doesn't at all. Because, you see, "aaa" < "zzzzz", but
    "zzz" < "aaaaa". Length has _nothing_ to do with the lexicographical
    ordering of strings, except when you've already compared the contents of
    the strings and found them identical up to the length of the shortest.
    In fact, I think you'll be hard pushed indeed to beat the efficiency of

    strcmp(const char *s1, const char *s2)
    {
    while (*s1==*s2 && *s1) { s1++; s2++; }
    return *s1-*s2;
    }

    using your length-indicated string.

    > Syntactic sugar.


    Syntactic sugar causes cancer of the semi-colon.

    > I have added some sugar to this coffee. I always liked coffee
    > with a bit of sugar. I feel that is too acid without it.


    YM bitter, I suspect. And that shows that truly civilised people drink
    tea, without sugar, and program in C, without ++ :)

    > Current strings are used using the [ ] notation. This strings
    > could have the same privilege isn't it?


    No, they couldn't. Puzzle for you: how would you extend the
    interoperation between strings and char pointers, using your
    length-strings? You can't point a char * inside one, because if you do,
    you've lost track of its length and there isn't a null terminator to
    help you find it...
    And of course, without pointer arithmetic, array indexing is impossible
    in the current Standard. You'd need to define an explicit exception for
    your length-strings.

    Richard
     
    Richard Bos, Mar 4, 2004
    #15
  16. jacob navia

    Richard Bos Guest

    Leor Zolman <> wrote:

    > Jaocb: There's even a common term used to describe C++ when used only for
    > those features that are a direct "clean-up" of messiness left over from C's
    > need to be backward-compatible with earlier incarnations of itself: "A
    > Better C".


    Common, and used, by whom? C++ programmers, I suspect. _My_ term for
    such a hybrid monstrum would be "A Bloody Mess".

    If you want C++, be a man and program in C++. Don't go pretend that
    you're almost using C.

    Richard
     
    Richard Bos, Mar 4, 2004
    #16
  17. In article <> "Arthur J. O'Dwyer" <> writes:
    ....
    > But remember where I said "so far, so good"? What you *should*
    > have concluded, there, was that since the "Pascal-style" string
    > model is so superior to the "C-style" string model, that wouldn't
    > it be neat if somebody implemented a C compiler that used the
    > Pascal model internally?!


    Eh? The "Pascal-style"? In the only Pascal I ever used (on the CDC
    Cyber, the original Pascal from ETH Zuerich), a string was implemented
    as a sequence of characters, and that was it. Nearly the same as in C,
    except that there was no terminator.
    --
    dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
    home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
     
    Dik T. Winter, Mar 4, 2004
    #17
  18. jacob navia

    jacob navia Guest

    "Richard Bos" <> a écrit dans le message de
    news:...
    > "jacob navia" <> wrote:
    >
    > > As everybody knows, C uses a zero delimited unbounded
    > > pointer for its representation of strings.
    > >
    > > This is extremely inefficient because at each query of the
    > > length of the string, the computer starts an unbounded
    > > memory scan searching for a zero that ends the string.
    > >
    > > A more efficient representation is:
    > >
    > > struct string {
    > > size_t length;
    > > char data[];
    > > };

    >
    > Possibly efficient in time, for certain programs, but almost certainly
    > inefficient in storage. The problem is two-pronged: either your strings
    > are large relative to your size_t, or they aren't. In the first case,
    > you will, sooner or later, run into a string that simply won't fit
    > inside a size_t.


    In lcc-win32 a size_t can contain up to 4GB strings.
    I think that searching for the terminating zero in a 1GB string would
    not be very efficient anyway... :)


    > Ok, so you need a size_t that is large enough to contain every possible
    > string length on the system, large enough to address all of your memory.


    Yes, normally size_t would do it.

    > But in that case, you'll hit the second case: now, you're using four
    > bytes, maybe eight, to address a single string, AOT C's one-byte null
    > terminator. Not a problem, perhaps, if all your strings are dozens of
    > kilobytes, but most applications seem to use lots and lots of small
    > strings, every single one potentially costing you seven bytes extra, and
    > only a few large ones.


    There is no free lunch.
    Yes, it will cost at least size_t bytes more than a zero terminated string.
    So what?
    Many applications can afford this.

    > > The length operation becomes just a memory read.
    > > This would considerably speed the programs. The basic
    > > idea is to use a string type that is length prefixed and
    > > allows run-time checking against UB: undefined
    > > behavior.

    >
    > How? What if I want a 30-char array, of which the first 20 to 29 chars
    > contain a string, and the last char contains a checksum? Modifying the
    > checksum wouldn't be undefined behaviour at all, but it would write
    > beyond data[length-1].
    >


    You can implement check sums in my schema. Anyway, I am proposing
    string handling, where normal zero terminated strings are assumed.


    > > Comparing strings is speeded up also because when
    > > testing for equality, the first length comparison tells
    > > maybe the whole story with just a couple of
    > > memory reads.

    >
    > Erm... no. No, it doesn't at all. Because, you see, "aaa" < "zzzzz", but
    > "zzz" < "aaaaa". Length has _nothing_ to do with the lexicographical
    > ordering of strings, except when you've already compared the contents of
    > the strings and found them identical up to the length of the shortest.


    Please read carefully. I said
    "When comparing strings for equality"

    and not strcmp !!!

    strcmp gives as a result a lexicographical ordering. This is NOT
    NEEDED when I just want to know if a == b.

    > > Syntactic sugar.

    >
    > Syntactic sugar causes cancer of the semi-colon.
    >

    In great doses YES. (See my answer as to why I do not use C++ in a
    parallel thread)

    I small doses sugar is useful.

    It comes to the dosage you see?

    It is a question of knowing when to stop.

    > > Current strings are used using the [ ] notation. This strings
    > > could have the same privilege isn't it?

    >
    > No, they couldn't. Puzzle for you: how would you extend the
    > interoperation between strings and char pointers, using your
    > length-strings?


    I have started a library in lcc-win32 that makes exactly that.
    It can be done.


    > You can't point a char * inside one, because if you do,
    > you've lost track of its length and there isn't a null terminator to
    > help you find it...


    All the strings are null terminated to keep interoperability
    with existing code.

    > And of course, without pointer arithmetic, array indexing is impossible
    > in the current Standard. You'd need to define an explicit exception for
    > your length-strings.


    Yes. I have done that.

    jacob
     
    jacob navia, Mar 4, 2004
    #18
  19. Dik T. Winter wrote:
    > In article <> "Arthur J. O'Dwyer" <> writes:
    > ...
    > > But remember where I said "so far, so good"? What you *should*
    > > have concluded, there, was that since the "Pascal-style" string
    > > model is so superior to the "C-style" string model, that wouldn't
    > > it be neat if somebody implemented a C compiler that used the
    > > Pascal model internally?!

    >
    > Eh? The "Pascal-style"? In the only Pascal I ever used (on the CDC
    > Cyber, the original Pascal from ETH Zuerich), a string was implemented
    > as a sequence of characters, and that was it. Nearly the same as in C,
    > except that there was no terminator.


    Meaning you had to keep track of the length in a separate variable?
    Sounds unlikely...
     
    =?ISO-8859-1?Q?Tor_Husab=F8?=, Mar 4, 2004
    #19
  20. jacob navia

    jacob navia Guest

    "Old Wolf" <> a écrit dans le message de
    news:...
    > > As everybody knows, C uses a zero delimited unbounded
    > > pointer for its representation of strings.

    >
    > Pointers cannot be bounded or unbounded, as such.
    > String constants are represented by an array. C arrays
    > are bounded (but it is not required for these bounds
    > to ever be checked).


    A bounded pointer has size/limits information associated with it. For
    example:

    fread(buffer,1,100,file);

    the buffer pointer is implicitely bounded by the 1,100 arguments.

    Or

    int process(int datalength, char *data);


    > C library functions expect a pointer into an array
    > or some other piece of allocated memory, which contains
    > a zero-terminator. Calling a C library function with
    > anything else is a programming error.


    Yes. I do not discuss that this is an error. My proposition goes
    into avoiding it.

    > >
    > > This is extremely inefficient because at each query of the
    > > length of the string, the computer starts an unbounded
    > > memory scan searching for a zero that ends the string.

    >
    > It is inefficient to repeatedly calculate the length of the
    > string, as you say. However most of us are clever enough to
    > not do that.
    >


    Then you keep the length and the string in separate variables.
    You have to name each string length that you use more than
    once, and never mix them up. This is very error prone and
    doesn't fit into structured programming.

    Why do we use:
    struct s{
    int age;
    bool sex;
    char *Name;
    } Employee;

    instead of
    int ageEmployee1, int sexEmployee1, char *NameEmployee1???
    Having a structure easies the way you program and avoids errors!


    > >
    > > A more efficient representation is:
    > >
    > > struct string {
    > > size_t length;
    > > char data[];
    > > };

    >
    > Efficient? Apart from requiring C99 (harder to get a compiler
    > for than C++) , it uses a lot more memory and is slower at
    > runtime because you have to do extra checks and updates every
    > time you modify the string.


    This is important for security reasons. A length check is just 4 assembly
    instructions at most!


    >
    > > The length operation becomes just a memory read.
    > > This would considerably speed the programs.

    >
    > Programs which check length a lot and do not do much else
    > with the string, possibly. This is a small subset of programs.
    >


    Sorry but the length operations is used VERY often.

    > > The basic idea is to use a string type that is length
    > > prefixed and allows run-time checking against UB: undefined
    > > behavior.

    >
    > How slow and inefficient. I prefer to prevent UB by coding
    > correctly.


    And you never make mistakes of course. You are Superman.

    > Also, how do you propose to check against UB
    > this way. Anyone can introduce UB by modifying the string
    > directly, unless you also enforce all modifications to it
    > to be via a library. (how slow).
    >


    Strings should not be modified directly. Very simple.
    Maybe 0.0000000001 seconds slower, but much faster to
    develop.

    > > Comparing strings is speeded up also because when
    > > testing for equality, the first length comparison tells
    > > maybe the whole story with just a couple of
    > > memory reads.

    >
    > And if they were equal length you have wasted a comparison.


    This is around 2-3 assembly instructions...
    At 2GHZ it is 0.000000000something seconds

    >


    > This seems to be a problem with any string representation
    > (including C's builtin one)
    >
    > > A pointer ( an indirect representation) costs a sizeof(void *)
    > > but allows to resize strings without invalidating the pointers
    > > to them.
    > >
    > > struct string {
    > > size_t length;
    > > char *data;
    > > };

    >
    > How does this allow resizing without invalidating pointers?


    This is a misunderstanding. I suppose you have pointers to the string
    not to the middle of teh data.

    >
    > > In any case, the standard
    > > library could be complemented by
    > > Strcmp
    > > Strcpy
    > > etc., all using length prefixed strings.

    >
    > What bloat. The C library is big enough as it is.


    I am sorry but there are functions for calculating the
    complex square root but not for using a reasonable
    string type.

    C string type is completely outdated. Yes, you can use
    it in small machines but it is too ERROR PRONE!

    > I think you are proposing bounds-checked arrays. The standard
    > allows you to implement this already. Why not go and do it
    > and then advertise your compiler as supporting bounds checking.


    Well I am doing that, but the intent here is to make people
    aware that this problems should be solved in a general fashion.

    I do not want to just add a "special" solution but to see if
    we can solve a general problem in a collective way i.e.
    in the way of standards.

    > In fact, why not write a library as you have proposed, and package
    > it with your compiler. Then people can choose if it suits them or not.
    >


    Yes, I am doing that as a "proof of concept"

    > > Do you agree?

    >
    > Not really, no. You seem to be making big assumptions about
    > the rest of the world's programming requirements.


    Yes.
    I assume that security is important, that extreme efficiency is not
    that important, and that you can spare size_t bytes for the length

    jacob
     
    jacob navia, Mar 4, 2004
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. anil
    Replies:
    2
    Views:
    602
    Jeremy Stringer
    Jun 17, 2005
  2. Coco
    Replies:
    0
    Views:
    311
  3. tshad
    Replies:
    6
    Views:
    407
    James Thomas
    Jan 5, 2005
  4. =?Utf-8?B?YXNob2s=?=

    increasing the RPS of any page

    =?Utf-8?B?YXNob2s=?=, Aug 29, 2005, in forum: ASP .Net
    Replies:
    3
    Views:
    471
    John Rivers
    Aug 30, 2005
  5. xisco

    Session count always increasing

    xisco, Mar 7, 2006, in forum: ASP .Net
    Replies:
    2
    Views:
    567
    xisco
    Mar 7, 2006
Loading...

Share This Page