Is C an unsuitable choice as a string parser?

Discussion in 'C Programming' started by Gallagher Polyn, Dec 13, 2013.

  1. Hey all,

    (My recent post on this question on stackoverflow put on hold as being 'opinion-based':

    I am considering C as a candidate for implementing a string parser.

    + first specialized on English, but can be extended to parse arbitrarily any character encoding
    + tokenizes strings which may then be used to search a data store
    + allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power

    My task feels simple and limited -- not bad for a C program -- but I keep running across comments about C like 'prefer a language with first class string support' (e.g.,

    I could prefer something like Go Lang, which seems to have good string processing, but C appeals because of the performance offered for the relatively reduced complexity of the task. Moreover, it seems that a library like ICU may help...

    Can readers suggest any prima facie reason(s) not to use C as a string parser?


    Gallagher Polyn, Dec 13, 2013
    1. Advertisements

  2. Gallagher Polyn

    Eric Sosman Guest

    One reason: C cannot do what you ask. Other than that, ...

    (No programming language implementation I've ever encountered,
    nor heard of, nor even imagined "can be extended to parse arbitrarily
    any character encoding." Your problem statement is a specification
    for vaporware. Perhaps if you were to offer a few specifics...?)
    Eric Sosman, Dec 13, 2013
    1. Advertisements

  3. Gallagher Polyn

    Clarus Baron Guest

    C is a viable choice, but so are many others. It's a question of, "what
    trade-offs are [you] willing to make?" A good C implementation will most
    likely be among the fastest of all possible implementations, but is
    optimizing for absolute run-time speed your number one consideration? Or
    would using a language that's a little bit slower but doesn't make you
    do everything yourself, ideally leading to a shorter development time, be a
    better choice? Only you can decide. Is this a serious project to help
    you get something done, or is it a learning project to help you learn a
    new language? Or a bit of both?

    Since you haven't said much else about your project (e.g. is it an
    end-user tool? a library? a daemon? all of the above? target
    platform(s)?), it's difficult to say a lot. My biggest concerns with
    using C for this task, based on what I know, are (1) I am not convinced
    the extra performance you might get is worth the added complexity of
    having to do manual memory management and having to worry about various
    cross-platform issues that disappear or are lessened with other choices,
    and (2) It sounds like you want to eventually support many, many
    languages. Presumably, you will not be writing the support for all of
    them yourself. Depending on how you support extensibility, you may have
    a more difficult time finding people who are experts in both a
    particular human language and C. If they're not C experts, or even if
    they are experts, they're more likely to shoot themselves in the foot,
    potentially causing problems for your software. If you choose a path to
    extensibility that doesn't require C for language "definitions," this
    problem goes away.

    I haven't done much with the space you're working in, but if I was
    starting a project like this I'd look a lot into what good libraries are out
    there that I can incorporate in my own code, to save time and reduce bugs
    (again, good libraries).

    Clarus Baron, Dec 13, 2013
  4. Gallagher Polyn

    BartC Guest

    I think C is fine for implementing things, not so good for open-ended

    So if you know exactly what your project will do, and how it will work, then
    go for C. Otherwise, you might try it in an easier, more dynamic language
    first (which will all have excellent 'first-class' string handling), then
    perhaps come back to C if you want it a bit faster.
    BartC, Dec 13, 2013
  5. As Stack Overflow suggested, there isn't a clear "yes/no" answer to this.

    Any program can be written in any language. C doesn't have any facilities
    for sophisticated string handling, however it can be extended very easily
    with either your own or third party libraries, and ten it will usually have
    better performance than competing languages, but be rather more fiddly to
    You can't easily toggle between encodings (I can't think off hand of any
    language that allows this, but I'm sure someone has tried). But if you settle
    on UTF8 you can handle pretty much any known character set whilst still
    treating English as plain ascii, which is usually what you want.

    To tokenise strings, you will need to pull an open source regular expression
    parser from the web. Or you might be able to call one in a library provided
    by your platform. (However then you can't easily port the program). It's
    not sensible to try to write regular expression parser of your own.
    Languages like Perl with built in expression parsers are obviously easier
    to set up and use.

    However if you want to use the Python Natural Language toolkit, use Python.
    Whilst there's probably a way of calling it from C, you don't want to
    mess about like that. Use the language the toolkit is designed for.
    If you want to embed lots of toolkits written in different languages,
    that's inherently not an easy thing to do, and there's no language which
    won't cause problems.
    Malcolm McLean, Dec 13, 2013
  6. (snip)
    The category codes of TeX have initial values, but can be changed with
    TeX commands. The category codes give meaning to the character encoding.

    You can, for example, change which characters are letters. Latex uses
    @ as a letter in many internal macros, intentionally, to avoid conflict
    with user macros. At the end of each macro file it changes @ back.

    -- glen
    glen herrmannsfeldt, Dec 13, 2013
  7. Gallagher Polyn

    David Brown Guest

    C /can/ certainly be used - but you will have to write (and test, and
    debug, and document) far more source code for a C implementation than
    for a higher level language with good native string handling and
    libraries for things like parsers.
    Again, C /can/ be used - but it will be far more effort than for
    languages with native or good library support for other encodings.
    If you are going to connect to Python code, use Python. While it is
    certainly possible to mix Python and C code, it is a lot of effort
    (although it's quite straightforward to get Python code to call
    functions defined in an so/dll file).
    Simple and limited - not bad for a Python program. Or any other language.
    Ignore performance issues unless you have very good reasons to believe
    speed will be a problem, or you have actually /measured/ the running
    code and seen that it is too slow.

    In most cases, ease of development is more important than getting the
    maximal speeds. In /all/ cases, correctness is more important than speed.
    Obviously there can be many reasons for choosing a language -
    familiarity and experience with the language is important.

    But if I were asked to pick a language for a project, and the only thing
    I knew about it was that it involved string parsing, Python would be my
    first choice. I would not even consider C unless it was forced upon me.
    David Brown, Dec 13, 2013
  8. Can readers suggest any prima facie reason(s) not to use C as a
    First, thanks to all for their thoughtful comments. By way of further background, and with a nod to Claus's comment...
    ....I would be approaching C fresh for this task, so the many cautions from you and other respondents about the level of care and additional expense involved with a C implementation are duly noted.

    I somehow omitted from my original post here my lead motivation for considering C: SPEED.

    I intend to return results to users of the string parsing service with mostkeystrokes, much like, say Google's Places API (demo here:


    from Eric
    ....sorry, I meant character set, as in how native users type their language.. Does that change at all your verdict?

    from Claus, end user tool, the program is hosted as a web API...
    ....I note your caution on the challenge of doing C well; I can choose my own platform for the web API...
    ....yes, I will think more about whether my envisioned approach to tokenizing strings (in English) based roughly on word boundaries (and not, yet, say,the more complex lexical categories that the Python NLTK can determine) islikely to be easily repeated to handle strings expressed in different character sets.

    from Malcolm

    ....and from David
    ....OK! (Maybe I'll ask about C + NLTK on the NLTK discussion board to see what has been the experience of any user there.)
    ....alright, I'll start thinking of Python, though it would deprecate somewhat my aforementioned speed priority.

    In sum, though I've here qualified my original post to add speed in processing as an important consideration for my web API string parser, I still take a slightly negative verdict, especially as regards my hope to leave a space for the inclusion of NLTK for lexical parsing -- that surprises me.

    The weight of the responses moves me to consider just what level of responsiveness i) I can expect from searches on the data store using tokenized strings and ii) what 'headroom' I can add to that from the single threaded processing portion of the service's work, i.e., the parsing and mapping of tokenized strings to data queries.

    Thanks again to all,

    Gallagher Polyn, Dec 13, 2013
  9. It all depends on your throughput.
    A computer can do string handling of almost any complexity within just
    noticeable time, that is to say, the user presses the return key, or whatever,
    and accepts the result as instantaneous.
    But it can't necessarily handle thousands of queries in just noticeable time.
    That doesn't always mean thousands of users, it can also mean one user
    submitting a file containing many lines for processing.
    Malcolm McLean, Dec 13, 2013
  10. Gallagher Polyn

    Eric Sosman Guest

    Sorry, but I still don't see anything that looks like a
    specification. You seem to be choosing an implementation language
    in advance of defining the problem to be solved -- or, at least, in
    advance of describing the problem to those whose advice you seek.
    In that case, the speed of the implementation language is not
    likely to be an issue. If hand-crafted assembly language coding
    delivers the answer in 0.1 microseconds while an interpreted
    scripting language takes 10, you can rejoice in a hundredfold
    speedup. But if both answers then take 10 *milli*seconds to get
    across a network and through an HTML parser and onto a screen ...
    Well, the difference between 0.01001 seconds and 0.0100001 is
    unlikely to be of much concern.

    Write your application in whatever language and with whatever
    tools give you the most assistance toward the more important goal
    of getting the problem solved. Once you've got it working right,
    then you can measure the performance. If it's adequate, you're
    done. If not, maybe you can tweak it. And if tweaking doesn't
    help and you need to rewrite part or all in a "faster" language,
    the early work will not have been wasted: You'll have developed
    and debugged your solution method, and recoding will be fairly
    straightforward. As an old manager of mine used to say: "First
    make it work, then make it robust, and only then make it fast."
    Eric Sosman, Dec 13, 2013
  11. Gallagher Polyn

    BartC Guest

    I once ran a string-processing benchmark that was faster in Python than in
    C. IIRC due to the Python's use of counted strings, versus C's
    zero-terminated ones. I suppose the C version could have been adapted to use
    counted strings too, or a special library used, but that's an example of the
    extra effort needed with C.
    BartC, Dec 13, 2013
  12. Gallagher Polyn

    Les Cargill Guest

    Yet the *opposite* should be true. There are lies, damned lies and

    No knock on Python, but counted strings should always be more overhead
    than zero-terminated strings.
    it's not extra, it's just different.
    Les Cargill, Dec 13, 2013
  13. Gallagher Polyn

    Seebs Guest

    Not really.
    Maybe more overhead, but less computation, since so many tasks
    involve walking to the end of the string.

    It turns out that, by the time you do all the extra work to keep track
    of how much space you have allocated, and whether you need to allocate
    more, counted strings can win.
    It's extra in terms of programmer time, though. And if you need to spend
    a whole heck of a lot of programmer time just to catch up with the "slower"

    Seebs, Dec 13, 2013
  14. Gallagher Polyn

    BartC Guest

    Always? Take two identical million-character strings. Now add one extra
    character to one, and pass both strings to the strcmp() function. It will
    have to check a million pairs of characters before it can conclude whether
    they're equal or not. With counted strings, it's only necessary to compare
    two numbers.

    (An unlikely scenario, as most strings will be short. Nevertheless my own
    counted-string language can do this compare some 25,000 times faster than C.
    Provided the strings are different lengths of course...)

    Counted strings have other advantages: take one of those million-character
    strings, and pass the first 10 characters of it to a function. With a zero
    terminator, you'd have to copy the string first; with a counted string you
    BartC, Dec 14, 2013
  15. I agree with everything you say until here. If the OP is
    truly intent upon writing language parsers in C, writing
    a RE parser would seem to be an appropriate first step.

    Then move on up the Chomsky Hierarchy, perhaps using the
    RE tools to implement the LL(0) tools.
    luser- -droog, Dec 14, 2013
  16. But if you're expecting million character strings, it probably makes sense
    to wrap them up with control structures. Which can include a hash. So
    you can detect for inequality by comparing the hashes. Often the chance of
    a collision is so remote that it's acceptable to take hash equality as
    a test for equality as well.
    Malcolm McLean, Dec 14, 2013
  17. Gallagher Polyn

    BartC Guest

    Tokenising strings is fairly straightforward and needn't involve any special
    tools. (In a compiler, the tokeniser is probably the easiest bit.) Using a
    library (especially involving regular expressions) is probably overkill.
    BartC, Dec 14, 2013
  18. Gallagher Polyn

    Les Cargill Guest

    *Very nearly* always. Lazy typing on my part.

    My point is that it takes some context to know
    which approach will be an improvement.

    But I have little doubt that there will be cases in which
    either will outperform the other. "I use 'C' because it's
    faster" is pretty weak tea and is possibly a signal
    of premature optimization. "I use 'C' because it fits
    the runtime environment better" is much better reasoning.

    Same for Python.
    There are no *three thousand* character, single strings in the wild.
    There are vanishingly few *one thousand* character strings in the wild.

    Most are more like 80 characters.

    Now, if you put an entire webpage or document into one string, I'd
    consider that more complex object *than* a string and build
    scaffolding to deal with it. For one, you'll need to unmarshal
    several read() calls against a socket or file.

    I say that; Python may well hide all that for you. Tcl does not;
    you have to concatenate them.
    Nope. You will have to count some number which is asymptotic with
    respect to one million, unless you always compare two identical strings.

    strcmp() stops when the two strings don't agree any more.

    And if you know you're comparing million-byte strings, use something
    better than strcmp(). :)
    1) Somebody still has to set the length byte to begin with.
    2) If the strings are initialized from patches of memory,
    it'll have to count this at least once.
    3) If the string gets its contents from a file or socket, then you
    still have to set the length.
    Ah, but 'C' can also use counted strings. The principal difference is
    that, sfaik, in Python you cannot do that.
    I'd rather pass a pointer to the whole thing, and have the function
    know about the 10 character comparison/use limit.
    Les Cargill, Dec 14, 2013
  19. Gallagher Polyn

    Eric Sosman Guest

    The O.P. hasn't explained his problem in much detail, but
    I get the impression he's interested in parsing natural language
    strings. Tokenizing natural language is far from straightforward --
    for example, consider some uses of the apostrophe:

    - Possessive: "England's green and pleasant shore."

    - Possessive without s: "Cyprus' Mediterranean climate."

    - Plural possessive: "C programmers' blind spots."

    - Contraction: "Why won't it compile?"

    - Elision: "Nothin' doin'!"

    - Quotation: "Stop calling me 'Shirley'!"

    - Combinations of the above: "His exact words were 'It'd be
    to everyone's benefit if you'd stop callin' me Shirley'."

    - Mistakes: "Every dog has it's day."

    For problems of this kind I suspect a regular expression
    library would be underkill, not overkill.
    Eric Sosman, Dec 14, 2013
  20. You frequently know that the program is going to run into severe time
    constraints, for instance if you're displaying 3D graphics for a video game,
    you'll want to add more and more geometry until you hit the frame rate and
    the programs starts skipping frames. The more triangles on screen, the
    better it looks. So in cases like that, using C right from the word go, at
    least in the processor-intensive parts of the program, is a sensible choice.
    Malcolm McLean, Dec 14, 2013
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.