Is C an unsuitable choice as a string parser?

Discussion in 'C Programming' started by Gallagher Polyn, Dec 13, 2013.

  1. Gallagher Polyn

    Les Cargill Guest

    That's a pretty specialized sort of constraint. So other than the
    three* people who have to actually deal with that.

    *hyperbolically chosen figure used to illustrate the narrow
    bandwidth of such a filter...
    See "fits the runtime environment better".
     
    Les Cargill, Dec 14, 2013
    #21
    1. Advertisements

  2. As the OP, I had been concerned not to digress too much from C into more detail on my purpose for the parser. Of course, I have been reading the ensuing responses with as much profit as my understanding allows. In hopes that it may resolve the basis for any irresolvable claims by the participants, let me say a little more...

    I "23 tabby cats 2pm tomorrow Tarrytown"
    II "23 teal teacups 2pm tomorrow Tarrytown"

    Descriptions of cats and teacups matching the expressions above are encodedin a data store. For example, there could be descriptions of teacups and cats that distinguish them by how many are available or by different sets ofcolors or locations or available times. The parser's job is to to interpret strings like those above as they are entered and suggest matches as the user types. So if the user has typed "23 t", she may then be prompted to consider selecting between I and II with "23 tabby cats" and "23 teacups". In a sense, I seek to elicit form data from users with strings.

    In one version of handling strings, which I would call the 'tokenizer' version, splitting the strings on word boundaries is a reasonable way to searchthe data store for matches.

    In another version of handling strings, which I'll call the 'lexical' version, more grammatical versions of I and II could be handled, for example, "23 tabby cats tomorrow at 2pm in Tarrytown".

    I anticipate the tokenizer version will be the best initial version, but I would like to retain the option of the lexical version. This is why I thought C, besides other possible benefits, could be reasonable: because it seems the easiest way to incorporate other resources like Python's NLTK or any other developed resources for lexical analysis not developed in C. (Of course, I have been warned against this in the preceding discussion.)

    I'd like to retain the option of interpreting strings enterered in non-english, too. That is, I should presume internationalization to as many spoken languages as possible.

    G
     
    Gallagher Polyn, Dec 14, 2013
    #22
    1. Advertisements

  3. Gallagher Polyn

    Seebs Guest

    What on earth do you mean?
    How is it not "a string"? Sure, it might be assembled from other things,
    but there are reasons to, at least sometimes, want to treat a huge hunk
    of data as "a string".

    -s
     
    Seebs, Dec 14, 2013
    #23
  4. Gallagher Polyn

    Les Cargill Guest

    Most things in real life are constrained by things like Ethernet
    MTU limitations...

    Q: When is a string not a string?
    A: When it's a buffer.
     
    Les Cargill, Dec 14, 2013
    #24
  5. Gallagher Polyn

    James Harris Guest

    Not quite. Ethernet is a data link technology. IP runs on top of data link
    systems such as Ethernet and adds unreliable datagram communication. This
    permits sizes up to 64k. More importantly, TCP runs on top of IP and adds
    both reliability and a stream model. The stream model makes apps communicate
    octet streams without regard to packet or datagram sizes. So any app which
    uses TCP can pass long strings to other processes. It is not limited by the
    1500 or 64k limits.

    Note the Total Length field in the IP header

    https://en.wikipedia.org/wiki/IPv4_header#Header

    and the 32-bit sizes of the Syn and Ack fields in the TCP header

    https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_segment_structure

    The point is that Ethernet MTUs do not limit the sizes of strings that
    programs can send to one another. UDP will not but TCP will split a very
    long string into datagrams as needed.

    James
     
    James Harris, Dec 14, 2013
    #25
  6. Gallagher Polyn

    Les Cargill Guest

    *Like* Ethernet MTU constraints... or UDP, or whatever. I could
    have worded that better.
    Badly. I would avoid this if I were you. Google for "Fragmentation
    considered harmful"

    I am sure there are PHY layers such that 64 k byte as an MTU
    isn't a problem. At that point, go for it.
    But this is *also* problematic. If you use blocking sockets, then
    you are subject to the whims of whatever it is that is
    unreliable between you and the far end.

    If you use nonblocking sockets, you get a piece at a time
    and get to do your own reassembly.

    And when you use *blocking* sockets, you may well *hang*
    on a read() ioctl() at embarrassing times. This may well
    require a full-on reboot.

    Hence "when it's a buffer"....
    Not in actuality.
    SO you have two choices - either treat "unlimited stream size" as a
    natural right, and then have to go fix it when this assumption fails,
    or understand the lower layers and plan accordingly.

    I know which one I do...
    This entire line of argument *ignores* that I still have to wonder
    why anybody needs 1 megabyte strings - that subsequently require
    some sort of string manipulation - in the first place.
     
    Les Cargill, Dec 15, 2013
    #26
  7. Gallagher Polyn

    BartC Guest

    Why not? Once you have a comfortable way of dealing efficiently with strings
    of any length (counted strings for example) then you will find all sorts of
    uses for them.

    And without the need for a zero-byte, they can also be used to store any
    binary data.

    And 1-million-character strings might be used to represent file contents, as
    you mentioned, or to serialise large date structures, whatever, before
    writing to a file, or perhaps adding them to something else, or just doing
    whatever needs doing!

    In fact the most difficult aspect is when you need to send the string to a
    function that expects a zero-terminated one!
     
    BartC, Dec 15, 2013
    #27
  8. Gallagher Polyn

    Les Cargill Guest

    "You're looney" - Graham Chapman. :)
     
    Les Cargill, Dec 15, 2013
    #28
  9. Gallagher Polyn

    BartC Guest

    By coincidence, someone posts a message about representing million-digit
    numbers ("How to approach large numbers in programming puzzles"). My first
    attempts at such a library, used strings, with a million-character string
    representing a million-digit number.
     
    BartC, Dec 15, 2013
    #29
  10. The two common situations is that the computer performs calculations which
    are their core are relatively trivial (e.g. performs income tax calculations
    for a few thousand or even a few million people), and wraps that in nice
    interfaces. Or the other common situation is that we have a problem which
    can't really be solved by computer, e.g. computing shadows, but we can have a
    nice stab at it (trace say five reflections from up to three light sources).
    There are lots of problems in the second class, though they tend to require
    a higher level of ability to program. It's not a job for that being a
    Microsoft Certified Professional will qualify you for.

    If the problem is in the second class, you almost certainly will run into
    severe processing time constraints. Sometimes nowadays you can solve it
    by throwing parallel processors at it, but that tends to be an expensive
    and awkward solution. C is the obvious choice of programming language
    whether you go for a parallel solution or not.
     
    Malcolm McLean, Dec 15, 2013
    #30
  11. I disagree. I've written dozens of basic parsers in C, but
    I've never written an RE parser.
     
    Edward A. Falk, Dec 15, 2013
    #31
  12. IMHO, C is a *superb* language for a string parser -- if
    you care about efficiency. It has all the right pieces for
    easy scanning through strings and doing comparisons.

    You pay a higher price during the intial coding, as you have to
    build the data structures to hold your input buffers, readers,
    tokens, and parser state. But once you have the basic infrastructure
    for your parser down, the rest really flies.

    If you want quick-n-dirty, or built-in string handling, then go with
    Python, or maybe even Java. But for real production work, go with C.
     
    Edward A. Falk, Dec 15, 2013
    #32
  13. Gallagher Polyn

    Les Cargill Guest

    That's weird. I mean no snark, I was just having a failure of
    imagination as to how that would be used.
     
    Les Cargill, Dec 15, 2013
    #33
  14. Gallagher Polyn

    Les Cargill Guest

    It's most certainly more specialized.
     
    Les Cargill, Dec 15, 2013
    #34
  15. Not really. How many times have you seen people sat at a computerised device
    which shows a 3D scene? Granted, most of them will have faked up shadows
    with the current state of technology, but that's very rapidly changing.
     
    Malcolm McLean, Dec 15, 2013
    #35
  16. C allows you to read, assign to and test a letter with very straightforwards
    syntax. A lot of more sophisticated languages have lost sight of those
    elementary requirements. Though to be fair, if you move to UFT8 encoding,
    it's no longer so easy in C.
     
    Malcolm McLean, Dec 15, 2013
    #36
  17. Gallagher Polyn

    J. Clarke Guest

    "What is the longest palindromic sequence in the first million digits of
    pi?"

    I admit that it's a contrived problem.
     
    J. Clarke, Dec 16, 2013
    #37
  18. Gallagher Polyn

    Phil Carmody Guest

    And RE's can't even match brackets.
    Isn't the first non-trivial grammer you learn to parse the well-nested brackets one?

    Phil
     
    Phil Carmody, Dec 16, 2013
    #38
  19. Gallagher Polyn

    James Kuyper Guest

    On 12/16/2013 11:18 AM, Phil Carmody wrote:
    ....
    Characters that have a special meaning for an RE engines are necessarily
    more complicated to search for, but I don't know of any RE engine that
    is completely incapable of doing so. Could you give an example? The ones
    I know best can match any such character by preceding it with a '\'.
     
    James Kuyper, Dec 16, 2013
    #39
  20. Gallagher Polyn

    Phil Carmody Guest

    Then chose 'p' and 'q' as the bracketting characters, and try again.
    No RE can parse the language of well-nested brackets, as that language
    *isn't regular*.
    http://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages

    Phil
     
    Phil Carmody, Dec 16, 2013
    #40
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.