Is C an unsuitable choice as a string parser?

Discussion in 'C Programming' started by Gallagher Polyn, Dec 13, 2013.

  1. Hey all,

    (My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)

    I am considering C as a candidate for implementing a string parser.

    + first specialized on English, but can be extended to parse arbitrarily any character encoding
    + tokenizes strings which may then be used to search a data store
    + allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power

    My task feels simple and limited -- not bad for a C program -- but I keep running across comments about C like 'prefer a language with first class string support' (e.g., stackoverflow.com/a/8465083/3097472)

    I could prefer something like Go Lang, which seems to have good string processing, but C appeals because of the performance offered for the relatively reduced complexity of the task. Moreover, it seems that a library like ICU may help...

    Can readers suggest any prima facie reason(s) not to use C as a string parser?

    Thanks,

    G-man
    Gallagher Polyn, Dec 13, 2013
    #1
    1. Advertising

  2. Gallagher Polyn

    Eric Sosman Guest

    On 12/12/2013 9:20 PM, Gallagher Polyn wrote:
    > Hey all,
    >
    > (My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)
    >
    > I am considering C as a candidate for implementing a string parser.
    >
    > + first specialized on English, but can be extended to parse arbitrarily any character encoding
    > + tokenizes strings which may then be used to search a data store
    > + allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power
    >
    > My task feels simple and limited -- not bad for a C program -- but I keep running across comments about C like 'prefer a language with first class string support' (e.g., stackoverflow.com/a/8465083/3097472)
    >
    > I could prefer something like Go Lang, which seems to have good string processing, but C appeals because of the performance offered for the relatively reduced complexity of the task. Moreover, it seems that a library like ICU may help...
    >
    > Can readers suggest any prima facie reason(s) not to use C as a string parser?


    One reason: C cannot do what you ask. Other than that, ...

    (No programming language implementation I've ever encountered,
    nor heard of, nor even imagined "can be extended to parse arbitrarily
    any character encoding." Your problem statement is a specification
    for vaporware. Perhaps if you were to offer a few specifics...?)

    --
    Eric Sosman
    d
    Eric Sosman, Dec 13, 2013
    #2
    1. Advertising

  3. Gallagher Polyn

    Clarus Baron Guest

    Gallagher Polyn <> writes:

    > Hey all,
    >
    > (My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)
    >
    > I am considering C as a candidate for implementing a string parser.
    >
    > + first specialized on English, but can be extended to parse arbitrarily any character encoding
    > + tokenizes strings which may then be used to search a data store
    > + allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power
    >
    > My task feels simple and limited -- not bad for a C program -- but I
    > keep running across comments about C like 'prefer a language with
    > first class string support' (e.g.,
    > stackoverflow.com/a/8465083/3097472)
    >
    > I could prefer something like Go Lang, which seems to have good string
    > processing, but C appeals because of the performance offered for the
    > relatively reduced complexity of the task. Moreover, it seems that a
    > library like ICU may help...
    >
    > Can readers suggest any prima facie reason(s) not to use C as a string parser?
    >
    > Thanks,
    >
    > G-man


    C is a viable choice, but so are many others. It's a question of, "what
    trade-offs are [you] willing to make?" A good C implementation will most
    likely be among the fastest of all possible implementations, but is
    optimizing for absolute run-time speed your number one consideration? Or
    would using a language that's a little bit slower but doesn't make you
    do everything yourself, ideally leading to a shorter development time, be a
    better choice? Only you can decide. Is this a serious project to help
    you get something done, or is it a learning project to help you learn a
    new language? Or a bit of both?

    Since you haven't said much else about your project (e.g. is it an
    end-user tool? a library? a daemon? all of the above? target
    platform(s)?), it's difficult to say a lot. My biggest concerns with
    using C for this task, based on what I know, are (1) I am not convinced
    the extra performance you might get is worth the added complexity of
    having to do manual memory management and having to worry about various
    cross-platform issues that disappear or are lessened with other choices,
    and (2) It sounds like you want to eventually support many, many
    languages. Presumably, you will not be writing the support for all of
    them yourself. Depending on how you support extensibility, you may have
    a more difficult time finding people who are experts in both a
    particular human language and C. If they're not C experts, or even if
    they are experts, they're more likely to shoot themselves in the foot,
    potentially causing problems for your software. If you choose a path to
    extensibility that doesn't require C for language "definitions," this
    problem goes away.

    I haven't done much with the space you're working in, but if I was
    starting a project like this I'd look a lot into what good libraries are out
    there that I can incorporate in my own code, to save time and reduce bugs
    (again, good libraries).


    Clarus
    Clarus Baron, Dec 13, 2013
    #3
  4. Gallagher Polyn

    BartC Guest

    "Gallagher Polyn" <> wrote in message
    news:...

    > Can readers suggest any prima facie reason(s) not to use C as a string
    > parser?


    I think C is fine for implementing things, not so good for open-ended
    development.

    So if you know exactly what your project will do, and how it will work, then
    go for C. Otherwise, you might try it in an easier, more dynamic language
    first (which will all have excellent 'first-class' string handling), then
    perhaps come back to C if you want it a bit faster.

    --
    Bartc
    BartC, Dec 13, 2013
    #4
  5. On Friday, December 13, 2013 2:20:42 AM UTC, Gallagher Polyn wrote:
    >
    > (My recent post on this question on stackoverflow put on hold as
    > being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-
    > unsuitable-choice-as-a-string-parser)
    >
    > I am considering C as a candidate for implementing a string parser.
    >
    > + first specialized on English, but can be extended to parse arbitrarily
    > any character encoding
    >
    > + tokenizes strings which may then be used to search a data store
    >
    > + allows optional embedding of tools like Natural Language Tool Kit (python)
    > for more lexical analytic power
    >

    As Stack Overflow suggested, there isn't a clear "yes/no" answer to this.

    Any program can be written in any language. C doesn't have any facilities
    for sophisticated string handling, however it can be extended very easily
    with either your own or third party libraries, and ten it will usually have
    better performance than competing languages, but be rather more fiddly to
    use.
    You can't easily toggle between encodings (I can't think off hand of any
    language that allows this, but I'm sure someone has tried). But if you settle
    on UTF8 you can handle pretty much any known character set whilst still
    treating English as plain ascii, which is usually what you want.

    To tokenise strings, you will need to pull an open source regular expression
    parser from the web. Or you might be able to call one in a library provided
    by your platform. (However then you can't easily port the program). It's
    not sensible to try to write regular expression parser of your own.
    Languages like Perl with built in expression parsers are obviously easier
    to set up and use.

    However if you want to use the Python Natural Language toolkit, use Python.
    Whilst there's probably a way of calling it from C, you don't want to
    mess about like that. Use the language the toolkit is designed for.
    If you want to embed lots of toolkits written in different languages,
    that's inherently not an easy thing to do, and there's no language which
    won't cause problems.
    Malcolm McLean, Dec 13, 2013
    #5
  6. Eric Sosman <> wrote:

    (snip)
    > One reason: C cannot do what you ask. Other than that, ...


    > (No programming language implementation I've ever encountered,
    > nor heard of, nor even imagined "can be extended to parse arbitrarily
    > any character encoding." Your problem statement is a specification
    > for vaporware. Perhaps if you were to offer a few specifics...?)


    The category codes of TeX have initial values, but can be changed with
    TeX commands. The category codes give meaning to the character encoding.

    You can, for example, change which characters are letters. Latex uses
    @ as a letter in many internal macros, intentionally, to avoid conflict
    with user macros. At the end of each macro file it changes @ back.

    -- glen
    glen herrmannsfeldt, Dec 13, 2013
    #6
  7. Gallagher Polyn

    David Brown Guest

    On 13/12/13 03:20, Gallagher Polyn wrote:
    > Hey all,
    >
    > (My recent post on this question on stackoverflow put on hold as
    > being 'opinion-based':
    > stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)
    >
    > I am considering C as a candidate for implementing a string parser.


    C /can/ certainly be used - but you will have to write (and test, and
    debug, and document) far more source code for a C implementation than
    for a higher level language with good native string handling and
    libraries for things like parsers.

    >
    > + first specialized on English, but can be extended to parse
    > arbitrarily any character encoding


    Again, C /can/ be used - but it will be far more effort than for
    languages with native or good library support for other encodings.

    > + tokenizes strings which may then
    > be used to search a data store
    > + allows optional embedding of tools
    > like Natural Language Tool Kit (python) for more lexical analytic
    > power


    If you are going to connect to Python code, use Python. While it is
    certainly possible to mix Python and C code, it is a lot of effort
    (although it's quite straightforward to get Python code to call
    functions defined in an so/dll file).

    >
    > My task feels simple and limited -- not bad for a C program -- but I
    > keep running across comments about C like 'prefer a language with
    > first class string support' (e.g.,
    > stackoverflow.com/a/8465083/3097472)


    Simple and limited - not bad for a Python program. Or any other language.

    >
    > I could prefer something like Go Lang, which seems to have good
    > string processing, but C appeals because of the performance offered
    > for the relatively reduced complexity of the task. Moreover, it seems
    > that a library like ICU may help...


    Ignore performance issues unless you have very good reasons to believe
    speed will be a problem, or you have actually /measured/ the running
    code and seen that it is too slow.

    In most cases, ease of development is more important than getting the
    maximal speeds. In /all/ cases, correctness is more important than speed.

    >
    > Can readers suggest any prima facie reason(s) not to use C as a
    > string parser?


    Obviously there can be many reasons for choosing a language -
    familiarity and experience with the language is important.

    But if I were asked to pick a language for a project, and the only thing
    I knew about it was that it involved string parsing, Python would be my
    first choice. I would not even consider C unless it was forced upon me.

    >
    > Thanks,
    >
    > G-man
    >
    David Brown, Dec 13, 2013
    #7
  8. > > Can readers suggest any prima facie reason(s) not to use C as a
    >
    > > string parser?


    First, thanks to all for their thoughtful comments. By way of further background, and with a nod to Claus's comment...

    >Is this a serious project to help
    >you get something done, or is it a learning project to help you learn a
    >new language? Or a bit of both?


    ....I would be approaching C fresh for this task, so the many cautions from you and other respondents about the level of care and additional expense involved with a C implementation are duly noted.

    I somehow omitted from my original post here my lead motivation for considering C: SPEED.

    I intend to return results to users of the string parsing service with mostkeystrokes, much like, say Google's Places API (demo here: https://developers.google.com/maps/documentation/javascript/examples/places-autocomplete)

    Now...

    from Eric

    >No programming language implementation I've ever encountered,
    >nor heard of, nor even imagined "can be extended to parse arbitrarily
    >any character encoding." Your problem statement is a specification
    >for vaporware.


    ....sorry, I meant character set, as in how native users type their language.. Does that change at all your verdict?

    from Claus,

    >Since you haven't said much else about your project (e.g. is it an
    >end-user tool? a library? a daemon? all of the above?


    ....an end user tool, the program is hosted as a web API...

    > My biggest concerns with
    >using C for this task, based on what I know, are (1) I am not convinced
    >the extra performance you might get is worth the added complexity of
    >having to do manual memory management and having to worry about various
    >cross-platform issues that disappear or are lessened with other choices,


    ....I note your caution on the challenge of doing C well; I can choose my own platform for the web API...

    > It sounds like you want to eventually support many, many
    > languages. Presumably, you will not be writing the support for all of
    > them yourself. Depending on how you support extensibility, you may have
    > a more difficult time finding people who are experts in both a
    > particular human language and C.


    ....yes, I will think more about whether my envisioned approach to tokenizing strings (in English) based roughly on word boundaries (and not, yet, say,the more complex lexical categories that the Python NLTK can determine) islikely to be easily repeated to handle strings expressed in different character sets.

    from Malcolm

    >if you want to use the Python Natural Language toolkit, use Python.
    >Whilst there's probably a way of calling it from C, you don't want to
    >mess about like that. Use the language the toolkit is designed for.
    >If you want to embed lots of toolkits written in different languages,
    >that's inherently not an easy thing to do, and there's no language which
    >won't cause problems.


    ....OK!

    ....and from David

    >If you are going to connect to Python code, use Python. While it is
    >certainly possible to mix Python and C code, it is a lot of effort
    >(although it's quite straightforward to get Python code to call
    >functions defined in an so/dll file).


    ....OK! (Maybe I'll ask about C + NLTK on the NLTK discussion board to see what has been the experience of any user there.)

    > if I were asked to pick a language for a project, and the only thing
    > I knew about it was that it involved string parsing, Python would be my
    > first choice. I would not even consider C unless it was forced upon me.


    ....alright, I'll start thinking of Python, though it would deprecate somewhat my aforementioned speed priority.

    In sum, though I've here qualified my original post to add speed in processing as an important consideration for my web API string parser, I still take a slightly negative verdict, especially as regards my hope to leave a space for the inclusion of NLTK for lexical parsing -- that surprises me.

    The weight of the responses moves me to consider just what level of responsiveness i) I can expect from searches on the data store using tokenized strings and ii) what 'headroom' I can add to that from the single threaded processing portion of the service's work, i.e., the parsing and mapping of tokenized strings to data queries.

    Thanks again to all,

    G
    Gallagher Polyn, Dec 13, 2013
    #8
  9. On Friday, December 13, 2013 2:02:36 PM UTC, Gallagher Polyn wrote:
    >
    > The weight of the responses moves me to consider just what level
    > of responsiveness i) I can expect from searches on the data store using
    > tokenized strings and ii) what 'headroom' I can add to that from the
    > single threaded processing portion of the service's work, i.e., the
    > parsing and mapping of tokenized strings to data queries.
    >

    It all depends on your throughput.
    A computer can do string handling of almost any complexity within just
    noticeable time, that is to say, the user presses the return key, or whatever,
    and accepts the result as instantaneous.
    But it can't necessarily handle thousands of queries in just noticeable time.
    That doesn't always mean thousands of users, it can also mean one user
    submitting a file containing many lines for processing.
    Malcolm McLean, Dec 13, 2013
    #9
  10. Gallagher Polyn

    Eric Sosman Guest

    On 12/13/2013 9:02 AM, Gallagher Polyn wrote:
    >>> Can readers suggest any prima facie reason(s) not to use C as a

    >>
    >>> string parser?

    >
    > First, thanks to all for their thoughtful comments. By way of further background, and with a nod to Claus's comment...
    >
    >> Is this a serious project to help
    >> you get something done, or is it a learning project to help you learn a
    >> new language? Or a bit of both?

    >
    > ...I would be approaching C fresh for this task, so the many cautions from you and other respondents about the level of care and additional expense involved with a C implementation are duly noted.
    >
    > I somehow omitted from my original post here my lead motivation for considering C: SPEED.
    >
    > I intend to return results to users of the string parsing service with most keystrokes, much like, say Google's Places API (demo here: https://developers.google.com/maps/documentation/javascript/examples/places-autocomplete)
    >
    > Now...
    >
    > from Eric
    >
    >> No programming language implementation I've ever encountered,
    >> nor heard of, nor even imagined "can be extended to parse arbitrarily
    >> any character encoding." Your problem statement is a specification
    >> for vaporware.

    >
    > ...sorry, I meant character set, as in how native users type their language. Does that change at all your verdict?


    Sorry, but I still don't see anything that looks like a
    specification. You seem to be choosing an implementation language
    in advance of defining the problem to be solved -- or, at least, in
    advance of describing the problem to those whose advice you seek.

    > from Claus,
    >
    >> Since you haven't said much else about your project (e.g. is it an
    >> end-user tool? a library? a daemon? all of the above?

    >
    > ...an end user tool, the program is hosted as a web API...


    In that case, the speed of the implementation language is not
    likely to be an issue. If hand-crafted assembly language coding
    delivers the answer in 0.1 microseconds while an interpreted
    scripting language takes 10, you can rejoice in a hundredfold
    speedup. But if both answers then take 10 *milli*seconds to get
    across a network and through an HTML parser and onto a screen ...
    Well, the difference between 0.01001 seconds and 0.0100001 is
    unlikely to be of much concern.

    Write your application in whatever language and with whatever
    tools give you the most assistance toward the more important goal
    of getting the problem solved. Once you've got it working right,
    then you can measure the performance. If it's adequate, you're
    done. If not, maybe you can tweak it. And if tweaking doesn't
    help and you need to rewrite part or all in a "faster" language,
    the early work will not have been wasted: You'll have developed
    and debugged your solution method, and recoding will be fairly
    straightforward. As an old manager of mine used to say: "First
    make it work, then make it robust, and only then make it fast."

    --
    Eric Sosman
    d
    Eric Sosman, Dec 13, 2013
    #10
  11. Gallagher Polyn

    BartC Guest

    "Gallagher Polyn" <> wrote in message
    news:...

    > I somehow omitted from my original post here my lead motivation for
    > considering C: SPEED.


    > ...alright, I'll start thinking of Python, though it would deprecate
    > somewhat my aforementioned speed priority.


    I once ran a string-processing benchmark that was faster in Python than in
    C. IIRC due to the Python's use of counted strings, versus C's
    zero-terminated ones. I suppose the C version could have been adapted to use
    counted strings too, or a special library used, but that's an example of the
    extra effort needed with C.

    --
    Bartc
    BartC, Dec 13, 2013
    #11
  12. Gallagher Polyn

    Les Cargill Guest

    BartC wrote:
    > "Gallagher Polyn" <> wrote in message
    > news:...
    >
    >> I somehow omitted from my original post here my lead motivation for
    >> considering C: SPEED.

    >
    >> ...alright, I'll start thinking of Python, though it would deprecate
    >> somewhat my aforementioned speed priority.

    >
    > I once ran a string-processing benchmark that was faster in Python than
    > in C. IIRC due to the Python's use of counted strings, versus C's
    > zero-terminated ones.


    Yet the *opposite* should be true. There are lies, damned lies and
    benchmarks.

    No knock on Python, but counted strings should always be more overhead
    than zero-terminated strings.

    > I suppose the C version could have been adapted to
    > use counted strings too, or a special library used, but that's an
    > example of the extra effort needed with C.
    >


    it's not extra, it's just different.

    --
    Les Cargill
    Les Cargill, Dec 13, 2013
    #12
  13. Gallagher Polyn

    Seebs Guest

    On 2013-12-13, Les Cargill <> wrote:
    > Yet the *opposite* should be true.


    Not really.

    > No knock on Python, but counted strings should always be more overhead
    > than zero-terminated strings.


    Maybe more overhead, but less computation, since so many tasks
    involve walking to the end of the string.

    http://www.joelonsoftware.com/articles/fog0000000319.html

    It turns out that, by the time you do all the extra work to keep track
    of how much space you have allocated, and whether you need to allocate
    more, counted strings can win.

    >> I suppose the C version could have been adapted to
    >> use counted strings too, or a special library used, but that's an
    >> example of the extra effort needed with C.


    > it's not extra, it's just different.


    It's extra in terms of programmer time, though. And if you need to spend
    a whole heck of a lot of programmer time just to catch up with the "slower"
    language...

    -s
    --
    Copyright 2013, all wrongs reversed. Peter Seebach /
    http://www.seebs.net/log/ <-- lawsuits, religion, and funny pictures
    Autism Speaks does not speak for me. http://autisticadvocacy.org/
    I am not speaking for my employer, although they do rent some of my opinions.
    Seebs, Dec 13, 2013
    #13
  14. Gallagher Polyn

    BartC Guest

    "Les Cargill" <> wrote in message
    news:l8fj43$r4e$...
    > BartC wrote:
    >> "Gallagher Polyn" <> wrote in message
    >> news:...
    >>
    >>> I somehow omitted from my original post here my lead motivation for
    >>> considering C: SPEED.

    >>
    >>> ...alright, I'll start thinking of Python, though it would deprecate
    >>> somewhat my aforementioned speed priority.

    >>
    >> I once ran a string-processing benchmark that was faster in Python than
    >> in C. IIRC due to the Python's use of counted strings, versus C's
    >> zero-terminated ones.

    >
    > Yet the *opposite* should be true. There are lies, damned lies and
    > benchmarks.
    >
    > No knock on Python, but counted strings should always be more overhead
    > than zero-terminated strings.


    Always? Take two identical million-character strings. Now add one extra
    character to one, and pass both strings to the strcmp() function. It will
    have to check a million pairs of characters before it can conclude whether
    they're equal or not. With counted strings, it's only necessary to compare
    two numbers.

    (An unlikely scenario, as most strings will be short. Nevertheless my own
    counted-string language can do this compare some 25,000 times faster than C.
    Provided the strings are different lengths of course...)

    Counted strings have other advantages: take one of those million-character
    strings, and pass the first 10 characters of it to a function. With a zero
    terminator, you'd have to copy the string first; with a counted string you
    don't.

    --
    Bartc
    BartC, Dec 14, 2013
    #14
  15. On Friday, December 13, 2013 4:38:25 AM UTC-6, Malcolm McLean wrote:
    > On Friday, December 13, 2013 2:20:42 AM UTC, Gallagher Polyn wrote:
    >
    > > (My recent post on this question on stackoverflow put on hold as
    > > being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-
    > > unsuitable-choice-as-a-string-parser)
    > >
    > > I am considering C as a candidate for implementing a string parser.
    > >
    > > + first specialized on English, but can be extended to parse arbitrarily
    > > any character encoding
    > >
    > > + tokenizes strings which may then be used to search a data store
    > >
    > > + allows optional embedding of tools like Natural Language Tool Kit (python)
    > > for more lexical analytic power
    > >

    > As Stack Overflow suggested, there isn't a clear "yes/no" answer to this.
    >
    > Any program can be written in any language. C doesn't have any facilities
    > for sophisticated string handling, however it can be extended very easily
    > with either your own or third party libraries, and ten it will usually have
    > better performance than competing languages, but be rather more fiddly to
    > use.
    >
    > You can't easily toggle between encodings (I can't think off hand of any
    > language that allows this, but I'm sure someone has tried). But if you settle
    > on UTF8 you can handle pretty much any known character set whilst still
    > treating English as plain ascii, which is usually what you want.
    >
    > To tokenise strings, you will need to pull an open source regular expression
    > parser from the web. Or you might be able to call one in a library provided
    > by your platform. (However then you can't easily port the program). It's
    > not sensible to try to write regular expression parser of your own.


    I agree with everything you say until here. If the OP is
    truly intent upon writing language parsers in C, writing
    a RE parser would seem to be an appropriate first step.

    Then move on up the Chomsky Hierarchy, perhaps using the
    RE tools to implement the LL(0) tools.

    --
    luser
    $.02
    luser- -droog, Dec 14, 2013
    #15
  16. On Saturday, December 14, 2013 12:19:13 AM UTC, Bart wrote:
    > "Les Cargill" <> wrote in message
    >
    > Always? Take two identical million-character strings. Now add one extra
    > character to one, and pass both strings to the strcmp() function. It will
    > have to check a million pairs of characters before it can conclude whether
    > they're equal or not. With counted strings, it's only necessary to compare
    > two numbers.
    >

    But if you're expecting million character strings, it probably makes sense
    to wrap them up with control structures. Which can include a hash. So
    you can detect for inequality by comparing the hashes. Often the chance of
    a collision is so remote that it's acceptable to take hash equality as
    a test for equality as well.
    Malcolm McLean, Dec 14, 2013
    #16
  17. Gallagher Polyn

    BartC Guest

    "Malcolm McLean" <> wrote in message
    news:...

    > To tokenise strings, you will need to pull an open source regular
    > expression
    > parser from the web. Or you might be able to call one in a library
    > provided
    > by your platform.


    Tokenising strings is fairly straightforward and needn't involve any special
    tools. (In a compiler, the tokeniser is probably the easiest bit.) Using a
    library (especially involving regular expressions) is probably overkill.

    --
    Bartc
    BartC, Dec 14, 2013
    #17
  18. Gallagher Polyn

    Les Cargill Guest

    BartC wrote:
    >
    >
    > "Les Cargill" <> wrote in message
    > news:l8fj43$r4e$...
    >> BartC wrote:
    >>> "Gallagher Polyn" <> wrote in message
    >>> news:...
    >>>
    >>>> I somehow omitted from my original post here my lead motivation for
    >>>> considering C: SPEED.
    >>>
    >>>> ...alright, I'll start thinking of Python, though it would deprecate
    >>>> somewhat my aforementioned speed priority.
    >>>
    >>> I once ran a string-processing benchmark that was faster in Python than
    >>> in C. IIRC due to the Python's use of counted strings, versus C's
    >>> zero-terminated ones.

    >>
    >> Yet the *opposite* should be true. There are lies, damned lies and
    >> benchmarks.
    >>
    >> No knock on Python, but counted strings should always be more overhead
    >> than zero-terminated strings.

    >
    > Always?


    *Very nearly* always. Lazy typing on my part.

    My point is that it takes some context to know
    which approach will be an improvement.

    But I have little doubt that there will be cases in which
    either will outperform the other. "I use 'C' because it's
    faster" is pretty weak tea and is possibly a signal
    of premature optimization. "I use 'C' because it fits
    the runtime environment better" is much better reasoning.

    Same for Python.

    > Take two identical million-character strings.


    There are no *three thousand* character, single strings in the wild.
    There are vanishingly few *one thousand* character strings in the wild.

    Most are more like 80 characters.

    Now, if you put an entire webpage or document into one string, I'd
    consider that more complex object *than* a string and build
    scaffolding to deal with it. For one, you'll need to unmarshal
    several read() calls against a socket or file.

    I say that; Python may well hide all that for you. Tcl does not;
    you have to concatenate them.

    > Now add one extra
    > character to one, and pass both strings to the strcmp() function. It
    > will have to check a million pairs of characters before it can conclude
    > whether they're equal or not.


    Nope. You will have to count some number which is asymptotic with
    respect to one million, unless you always compare two identical strings.

    strcmp() stops when the two strings don't agree any more.

    And if you know you're comparing million-byte strings, use something
    better than strcmp(). :)

    > With counted strings, it's only necessary
    > to compare two numbers.
    >


    1) Somebody still has to set the length byte to begin with.
    2) If the strings are initialized from patches of memory,
    it'll have to count this at least once.
    3) If the string gets its contents from a file or socket, then you
    still have to set the length.

    > (An unlikely scenario, as most strings will be short.


    Yep.

    > Nevertheless my
    > own counted-string language can do this compare some 25,000 times faster
    > than C. Provided the strings are different lengths of course...)
    >


    Ah, but 'C' can also use counted strings. The principal difference is
    that, sfaik, in Python you cannot do that.

    > Counted strings have other advantages: take one of those
    > million-character strings, and pass the first 10 characters of it to a
    > function. With a zero terminator, you'd have to copy the string first;
    > with a counted string you don't.
    >


    I'd rather pass a pointer to the whole thing, and have the function
    know about the 10 character comparison/use limit.


    --
    Les Cargill
    Les Cargill, Dec 14, 2013
    #18
  19. Gallagher Polyn

    Eric Sosman Guest

    On 12/14/2013 9:59 AM, BartC wrote:
    > "Malcolm McLean" <> wrote in message
    > news:...
    >
    >> To tokenise strings, you will need to pull an open source regular
    >> expression
    >> parser from the web. Or you might be able to call one in a library
    >> provided
    >> by your platform.

    >
    > Tokenising strings is fairly straightforward and needn't involve any
    > special tools. (In a compiler, the tokeniser is probably the easiest
    > bit.) Using a library (especially involving regular expressions) is
    > probably overkill.


    The O.P. hasn't explained his problem in much detail, but
    I get the impression he's interested in parsing natural language
    strings. Tokenizing natural language is far from straightforward --
    for example, consider some uses of the apostrophe:

    - Possessive: "England's green and pleasant shore."

    - Possessive without s: "Cyprus' Mediterranean climate."

    - Plural possessive: "C programmers' blind spots."

    - Contraction: "Why won't it compile?"

    - Elision: "Nothin' doin'!"

    - Quotation: "Stop calling me 'Shirley'!"

    - Combinations of the above: "His exact words were 'It'd be
    to everyone's benefit if you'd stop callin' me Shirley'."

    - Mistakes: "Every dog has it's day."

    For problems of this kind I suspect a regular expression
    library would be underkill, not overkill.

    --
    Eric Sosman
    d
    Eric Sosman, Dec 14, 2013
    #19
  20. On Saturday, December 14, 2013 3:35:27 PM UTC, Les Cargill wrote:
    >
    > But I have little doubt that there will be cases in which
    > either will outperform the other. "I use 'C' because it's
    > faster" is pretty weak tea and is possibly a signal
    > of premature optimization. "I use 'C' because it fits
    > the runtime environment better" is much better reasoning.
    >

    You frequently know that the program is going to run into severe time
    constraints, for instance if you're displaying 3D graphics for a video game,
    you'll want to add more and more geometry until you hit the frame rate and
    the programs starts skipping frames. The more triangles on screen, the
    better it looks. So in cases like that, using C right from the word go, at
    least in the processor-intensive parts of the program, is a sensible choice.
    Malcolm McLean, Dec 14, 2013
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Mickey Segal
    Replies:
    0
    Views:
    870
    Mickey Segal
    Feb 2, 2004
  2. Bernd Oninger
    Replies:
    0
    Views:
    756
    Bernd Oninger
    Jun 9, 2004
  3. ZOCOR

    XML Parser VS HTML Parser

    ZOCOR, Oct 3, 2004, in forum: Java
    Replies:
    11
    Views:
    811
    Paul King
    Oct 5, 2004
  4. Bernd Oninger
    Replies:
    0
    Views:
    810
    Bernd Oninger
    Jun 9, 2004
  5. miles.jg
    Replies:
    16
    Views:
    868
    Alf P. Steinbach
    Nov 14, 2007
Loading...

Share This Page