Is C an unsuitable choice as a string parser?

Discussion in 'C Programming' started by Gallagher Polyn, Dec 13, 2013.

  1. Joe keane Guest

    It is; it's in octal.
    Joe keane, Dec 19, 2013

  2. Jorgen Grahn Guest

    I would have said "in most programs there is a limit to the string
    lengths you have to tolerate". But I suspect that's C's fault more
    than anything else.

    People want to say something like 'char buf[5000]' and get away with
    it. That includes me -- I don't want to optimize for rare and silly
    scenarios every time I read a string.
    Indeed, and that's why UDP-based protocols tend to try to keep the
    datagram sizes down. (And TCP does it for you.)
    Reassembly in a /different/ sense though. You get whatever happens to
    have queued up at that point in time, and that may be less /or/ more
    than what you need to act.
    Yes, in actuality. It's just that vaguely similar problems will pop
    up in the application layer.
    Well, take an HTTP server for example. It sits waiting for something
    like this on a TCP socket:

    GET HTTP/1.1
    Lots more things: ...

    and only when the final empty line arrives may it act on it.

    I don't think the HTTP RFC puts a limit on the line lengths, or the
    total size of the request -- but in reality it would be foolish to
    allow a client to sit for hours feeding in more and more data; the
    only valid reason to do so is a DoS attack.

    So yes, I agree that it's usually silly to handle a multi-megabyte
    string. But the lower layers are not the reason.
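
    A rough sketch of what that looks like in C, assuming the server
    collects header bytes from the socket in arbitrary chunks. The names
    header_buf and feed, and the 8192-byte cap, are made up for
    illustration and are not taken from the HTTP RFC. A real server would
    also need a deadline on the connection, which is not shown here.

```c
#include <stddef.h>
#include <string.h>

#define MAX_HEADER_BYTES 8192   /* assumed cap, chosen for illustration */

enum feed_status { NEED_MORE, HEADERS_COMPLETE, HEADERS_TOO_LARGE };

/* Hypothetical accumulator for the header part of one request. */
struct header_buf {
    char   data[MAX_HEADER_BYTES];
    size_t len;                 /* bytes stored so far */
};

/* Append one chunk as it arrives from the socket and report whether
 * the blank line that ends the headers has been seen yet. */
static enum feed_status feed(struct header_buf *hb, const char *chunk, size_t n)
{
    if (n > sizeof hb->data - hb->len)
        return HEADERS_TOO_LARGE;       /* refuse instead of overflowing */

    memcpy(hb->data + hb->len, chunk, n);

    /* The terminator may straddle two chunks, so start the scan three
     * bytes before the newly added data. */
    size_t start = hb->len >= 3 ? hb->len - 3 : 0;
    hb->len += n;

    for (size_t i = start; i + 4 <= hb->len; i++)
        if (memcmp(hb->data + i, "\r\n\r\n", 4) == 0)
            return HEADERS_COMPLETE;

    return NEED_MORE;
}
```

    The caller zeroes len, calls feed() after every read, and drops the
    connection on HEADERS_TOO_LARGE.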

    Jorgen Grahn, Dec 19, 2013

  3. You have to be sure you're not opening a security hole for an exploit.
    In a lot of programming environments, it's not an issue. But where it is,
    the consequences can be serious.
    Malcolm McLean, Dec 19, 2013
  4. Seebs Guest


    Uploads. I have used more than one page which allows file uploads,
    and those are implemented as HTTP requests. Pretty sure that can in
    at least some cases imply an HTTP request which is in fact going
    to be feeding in data for a long time, and if there's a slow link,
    that could be minutes, certainly.
    The key word is "usually".

    Seebs, Dec 19, 2013
  5. Other than local integration issues, such as the ability to build a library written in C into programs where the top-level language is something else, C serves as a fine language with which to build a string parser. strtok(), strpbrk(), and strspn() and their various updated variants are all designed to help make string parsing easy. And you can write your own adaptations with a bunch of boolean tables, and they can be made to perform very fast.
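
    A minimal sketch of the boolean-table idea, assuming single-byte
    delimiters. The helper names build_delim_table and next_token are
    invented for this example. Unlike strtok(), it does not write into
    the input string, so it hands back a pointer and a length instead of
    a NUL-terminated token.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: mark every byte in `delims` as a delimiter. */
static void build_delim_table(bool table[256], const char *delims)
{
    memset(table, 0, 256 * sizeof table[0]);
    for (; *delims != '\0'; delims++)
        table[(unsigned char)*delims] = true;
}

/* Hypothetical helper: find the next token at or after *pos.
 * Returns a pointer to its first byte and stores its length in *len,
 * or returns NULL when only delimiters (or nothing) remain.
 * *pos is advanced past the token so the call can be repeated. */
static const char *next_token(const char **pos, const bool table[256],
                              size_t *len)
{
    const char *p = *pos;

    /* Skip leading delimiters: one table lookup per byte. */
    while (*p != '\0' && table[(unsigned char)*p])
        p++;
    if (*p == '\0') {
        *pos = p;
        return NULL;
    }

    const char *start = p;
    while (*p != '\0' && !table[(unsigned char)*p])
        p++;

    *len = (size_t)(p - start);
    *pos = p;
    return start;
}
```

    Build the table once, then call next_token() in a loop until it
    returns NULL. The cast to unsigned char matters on platforms where
    plain char is signed.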

    But everything depends on your goals for the eventual solution and the implementation.

    If, for instance, you are going to write your tokenized strings into a database file or put every token on a separate line of a file or write a CSV file and then process that file, C won't much help the speed or clarity of your overall solution. You might as well use some language like Perl or awk or even PHP to do your parsing, if parsing is just the first step and having the result in memory when you are done won't be a big advantage.

    Basically, the speed you gain from C will be FAR overshadowed by I/O considerations, unless you can work with the result of the parsing in memory.
    Michael Angelo Ravera, Dec 19, 2013
  6. It depends. Sometimes, you really do need to optimize, often
    right from the start. Other posters in this thread have given
    some real examples.

    And frankly, string parser is a pretty good example. A typical
    use for a string parser is in processing inputs to databases.
    Possibly large databases. Possibly *very* large databases.

    In fact, take out the word "possibly" here. If your parser is
    going to receive any kind of broad distribution, you can pretty
    much guarantee that it's eventually going to be used on a big

    I once worked at a small web startup that had written everything
    in Ruby. When the database passed ten million records, things started
    to get bogged down pretty badly, and throwing more servers at the
    problem was getting expensive.

    We did a rough cost-benefit analysis, and the rule of thumb we
    came up with was that once you passed 100 servers, it was better
    to re-code in C than to keep adding more and more servers.
    Edward A. Falk, Dec 19, 2013
  7. Jorgen Grahn Guest

    Yes, of course. I'm assuming an interface where you (a) are explicit
    about the length of your buffer and (b) can detect if it wasn't really
    long enough. And that (c) you have an explicit plan for what to do in
    that rare case.
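
    Something like the following would satisfy (a) and (b), as a sketch
    built on fgets(). The name read_line and the status enum are made up
    here, and the "plan" for (c) is simply to discard the rest of the
    overlong line and tell the caller. The buffer size is assumed to fit
    in an int, since that is what fgets() takes.

```c
#include <stdio.h>
#include <string.h>

enum line_status { LINE_OK, LINE_TOO_LONG, LINE_EOF };

/* Hypothetical helper: read one line into buf (size bytes, size >= 2),
 * strip the trailing newline, and report whether the line fit. */
static enum line_status read_line(FILE *in, char *buf, size_t size)
{
    if (size < 2 || fgets(buf, (int)size, in) == NULL)
        return LINE_EOF;

    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n') {
        buf[n - 1] = '\0';
        return LINE_OK;
    }
    if (n < size - 1)
        return LINE_OK;         /* last line of the input, no newline */

    /* The buffer is full and no newline was seen. Peek at the next
     * character to tell "exactly fits" from "too long". */
    int c = getc(in);
    if (c == EOF || c == '\n')
        return LINE_OK;

    /* Too long: throw away the rest so the next call starts clean. */
    while (c != '\n' && c != EOF)
        c = getc(in);
    return LINE_TOO_LONG;
}
```

    On LINE_TOO_LONG the buffer holds only the start of the line, so the
    caller should report the error rather than parse it.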

    Jorgen Grahn, Dec 20, 2013
  8. Jorgen Grahn Guest

    Of course. I was oversimplifying. There's a difference between the
    payload of the request (which doesn't really have record boundaries,
    and can be pipelined) and the contents of the HTTP headers (which have
    to be stored until they are complete, more or less, and the actual
    data transfer may begin).

    It's easy to write an HTTP client which establishes a TCP connection,
    sends part of a request, then disappears without a trace. Multiply
    that by 1000 or more, and you have a nice low-cost denial of service
    attack. (Admittedly, you don't need long strings for that.)

    Jorgen Grahn, Dec 20, 2013
  9. Yes; I like that the GNU compiler will warn you about some unsafe
    practices. Buffer overflow is insidious.

    Case in point: I was once the subject of a CERT advisory when the
    San Diego Supercomputer Center discovered an exploit in a simple
    configuration utility I had written. (In my defense, the vulnerability
    was in some code I had copy-and-pasted from someone else's configuration
    utility.) After that, I started to take security seriously, and even
    attended DefCon once to see what I could learn.

    Another case in point: I was once tasked with hardening security on a
    friend's web site after it had, Once Again, been broken into by script kiddies.
    The vulnerabilities I found in the ftp daemon made me blanch.

    If I'm reviewing someone else's code, and I see something like
    "char buf[5000]", alarm bells go off.

    Buffer overflows. Not even once.
    Edward A. Falk, Dec 23, 2013
  10. wpihughes Guest

    In my experience embarrassingly parallel problems are quite common.
    I find the application of a simple process pool neither expensive nor

    William Hughes
    wpihughes, Dec 24, 2013
  11. Jorgen Grahn Guest

    That's certainly a place in the code you need to examine, but what I'm
    arguing is it doesn't have to be a bug. If e.g. you document "input
    lines may not be larger than 4999 characters or the program will abort
    with an error message" it's fair and sane and no one will complain.
    (Assuming of course that you don't introduce an overflow.)
    Yes, but not accepting infinite inputs and buffer overflows are
    separate issues.

    Jorgen Grahn, Dec 29, 2013
  12. It's better to get into the way of thinking that the program will perform the
    calculation unless it runs out of memory. However sometimes you have to worry
    about resource denial - less of an issue with C programs because if the user
    can run an arbitrary C program he can also easily hog every resource the OS
    allocates to him, but still maybe a problem if programs are being run from
    automatic processes. Then sometimes legitimate over-sized input is so unlikely
    that it's better to throw it out as obviously either malicious or corrupt.

    But generally I'd use a "getline" function rather than a really big buffer.
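
    A sketch of such a function in portable C, assuming a growing heap
    buffer plus a caller-supplied cap so that absurdly long input can
    still be rejected. The name read_line_alloc is invented here. POSIX
    systems already provide getline(), which grows the buffer in a
    similar way but has no cap.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read one line of up to max_len bytes.
 * Returns a malloc'd string without the newline; the caller frees it.
 * Returns NULL at end of input, on allocation failure, or when the
 * line exceeds max_len (the rest of that line is left unread). */
static char *read_line_alloc(FILE *in, size_t max_len)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    int c;
    while ((c = getc(in)) != EOF && c != '\n') {
        if (len >= max_len) {           /* over the cap: give up */
            free(buf);
            return NULL;
        }
        if (len + 1 >= cap) {           /* keep room for the '\0' */
            size_t new_cap = cap * 2;
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {         /* nothing left to read */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

    Folding three failure cases into one NULL keeps the sketch short; a
    real version would probably report them separately.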
    Malcolm McLean, Dec 29, 2013
  13. Sure. In some situations letters with accents are the same letter as letters
    without, in other situations they are considered to be different letters.
    It's a kind of inherent difficulty. English just happens to be quite computer
    friendly, also it's the language the standards were originally designed for,
    so conventions (e.g. how to represent capitals, are 0 and O the same or
    different, are open and close quotes the same or different, are double
    quotes characters in their own right or concatenated single quotes etc)
    are quite well established.
    Malcolm McLean, Jan 13, 2014