Is C an unsuitable choice as a string parser?

G

Gallagher Polyn

Hey all,

(My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)

I am considering C as a candidate for implementing a string parser.

+ first specialized on English, but can be extended to parse arbitrarily any character encoding
+ tokenizes strings which may then be used to search a data store
+ allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power

My task feels simple and limited -- not bad for a C program -- but I keep running across comments about C like 'prefer a language with first class string support' (e.g., stackoverflow.com/a/8465083/3097472)

I could prefer something like Go Lang, which seems to have good string processing, but C appeals because of the performance offered for the relatively reduced complexity of the task. Moreover, it seems that a library like ICU may help...

Can readers suggest any prima facie reason(s) not to use C as a string parser?

Thanks,

G-man
 
E

Eric Sosman

Hey all,

(My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)

I am considering C as a candidate for implementing a string parser.

+ first specialized on English, but can be extended to parse arbitrarily any character encoding
+ tokenizes strings which may then be used to search a data store
+ allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power

My task feels simple and limited -- not bad for a C program -- but I keep running across comments about C like 'prefer a language with first class string support' (e.g., stackoverflow.com/a/8465083/3097472)

I could prefer something like Go Lang, which seems to have good string processing, but C appeals because of the performance offered for the relatively reduced complexity of the task. Moreover, it seems that a library like ICU may help...

Can readers suggest any prima facie reason(s) not to use C as a string parser?

One reason: C cannot do what you ask. Other than that, ...

(No programming language implementation I've ever encountered,
nor heard of, nor even imagined "can be extended to parse arbitrarily
any character encoding." Your problem statement is a specification
for vaporware. Perhaps if you were to offer a few specifics...?)
 
C

Clarus Baron

Gallagher Polyn said:
Hey all,

(My recent post on this question on stackoverflow put on hold as being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)

I am considering C as a candidate for implementing a string parser.

+ first specialized on English, but can be extended to parse arbitrarily any character encoding
+ tokenizes strings which may then be used to search a data store
+ allows optional embedding of tools like Natural Language Tool Kit (python) for more lexical analytic power

My task feels simple and limited -- not bad for a C program -- but I
keep running across comments about C like 'prefer a language with
first class string support' (e.g.,
stackoverflow.com/a/8465083/3097472)

I could prefer something like Go Lang, which seems to have good string
processing, but C appeals because of the performance offered for the
relatively reduced complexity of the task. Moreover, it seems that a
library like ICU may help...

Can readers suggest any prima facie reason(s) not to use C as a string parser?

Thanks,

G-man

C is a viable choice, but so are many others. It's a question of, "what
trade-offs are [you] willing to make?" A good C implementation will most
likely be among the fastest of all possible implementations, but is
optimizing for absolute run-time speed your number one consideration? Or
would using a language that's a little bit slower but doesn't make you
do everything yourself, ideally leading to a shorter development time, be a
better choice? Only you can decide. Is this a serious project to help
you get something done, or is it a learning project to help you learn a
new language? Or a bit of both?

Since you haven't said much else about your project (e.g. is it an
end-user tool? a library? a daemon? all of the above? target
platform(s)?), it's difficult to say a lot. My biggest concerns with
using C for this task, based on what I know, are (1) I am not convinced
the extra performance you might get is worth the added complexity of
having to do manual memory management and having to worry about various
cross-platform issues that disappear or are lessened with other choices,
and (2) It sounds like you want to eventually support many, many
languages. Presumably, you will not be writing the support for all of
them yourself. Depending on how you support extensibility, you may have
a more difficult time finding people who are experts in both a
particular human language and C. If they're not C experts, or even if
they are experts, they're more likely to shoot themselves in the foot,
potentially causing problems for your software. If you choose a path to
extensibility that doesn't require C for language "definitions," this
problem goes away.

I haven't done much with the space you're working in, but if I was
starting a project like this I'd look a lot into what good libraries are out
there that I can incorporate in my own code, to save time and reduce bugs
(again, good libraries).


Clarus
 
B

BartC

Can readers suggest any prima facie reason(s) not to use C as a string
parser?

I think C is fine for implementing things, not so good for open-ended
development.

So if you know exactly what your project will do, and how it will work, then
go for C. Otherwise, you might try it in an easier, more dynamic language
first (which will all have excellent 'first-class' string handling), then
perhaps come back to C if you want it a bit faster.
 
M

Malcolm McLean

(My recent post on this question on stackoverflow put on hold as
being 'opinion-based': stackoverflow.com/questions/20556729/is-c-an-
unsuitable-choice-as-a-string-parser)

I am considering C as a candidate for implementing a string parser.

+ first specialized on English, but can be extended to parse arbitrarily
any character encoding

+ tokenizes strings which may then be used to search a data store

+ allows optional embedding of tools like Natural Language Tool Kit (python)
for more lexical analytic power
As Stack Overflow suggested, there isn't a clear "yes/no" answer to this.

Any program can be written in any language. C doesn't have any facilities
for sophisticated string handling, however it can be extended very easily
with either your own or third party libraries, and ten it will usually have
better performance than competing languages, but be rather more fiddly to
use.
You can't easily toggle between encodings (I can't think off hand of any
language that allows this, but I'm sure someone has tried). But if you settle
on UTF8 you can handle pretty much any known character set whilst still
treating English as plain ascii, which is usually what you want.

To tokenise strings, you will need to pull an open source regular expression
parser from the web. Or you might be able to call one in a library provided
by your platform. (However then you can't easily port the program). It's
not sensible to try to write regular expression parser of your own.
Languages like Perl with built in expression parsers are obviously easier
to set up and use.

However if you want to use the Python Natural Language toolkit, use Python.
Whilst there's probably a way of calling it from C, you don't want to
mess about like that. Use the language the toolkit is designed for.
If you want to embed lots of toolkits written in different languages,
that's inherently not an easy thing to do, and there's no language which
won't cause problems.
 
G

glen herrmannsfeldt

(snip)
One reason: C cannot do what you ask. Other than that, ...
(No programming language implementation I've ever encountered,
nor heard of, nor even imagined "can be extended to parse arbitrarily
any character encoding." Your problem statement is a specification
for vaporware. Perhaps if you were to offer a few specifics...?)

The category codes of TeX have initial values, but can be changed with
TeX commands. The category codes give meaning to the character encoding.

You can, for example, change which characters are letters. Latex uses
@ as a letter in many internal macros, intentionally, to avoid conflict
with user macros. At the end of each macro file it changes @ back.

-- glen
 
D

David Brown

Hey all,

(My recent post on this question on stackoverflow put on hold as
being 'opinion-based':
stackoverflow.com/questions/20556729/is-c-an-unsuitable-choice-as-a-string-parser)

I am considering C as a candidate for implementing a string parser.

C /can/ certainly be used - but you will have to write (and test, and
debug, and document) far more source code for a C implementation than
for a higher level language with good native string handling and
libraries for things like parsers.
+ first specialized on English, but can be extended to parse
arbitrarily any character encoding

Again, C /can/ be used - but it will be far more effort than for
languages with native or good library support for other encodings.
+ tokenizes strings which may then
be used to search a data store
+ allows optional embedding of tools
like Natural Language Tool Kit (python) for more lexical analytic
power

If you are going to connect to Python code, use Python. While it is
certainly possible to mix Python and C code, it is a lot of effort
(although it's quite straightforward to get Python code to call
functions defined in an so/dll file).
My task feels simple and limited -- not bad for a C program -- but I
keep running across comments about C like 'prefer a language with
first class string support' (e.g.,
stackoverflow.com/a/8465083/3097472)

Simple and limited - not bad for a Python program. Or any other language.
I could prefer something like Go Lang, which seems to have good
string processing, but C appeals because of the performance offered
for the relatively reduced complexity of the task. Moreover, it seems
that a library like ICU may help...

Ignore performance issues unless you have very good reasons to believe
speed will be a problem, or you have actually /measured/ the running
code and seen that it is too slow.

In most cases, ease of development is more important than getting the
maximal speeds. In /all/ cases, correctness is more important than speed.
Can readers suggest any prima facie reason(s) not to use C as a
string parser?

Obviously there can be many reasons for choosing a language -
familiarity and experience with the language is important.

But if I were asked to pick a language for a project, and the only thing
I knew about it was that it involved string parsing, Python would be my
first choice. I would not even consider C unless it was forced upon me.
 
G

Gallagher Polyn

Can readers suggest any prima facie reason(s) not to use C as a
First, thanks to all for their thoughtful comments. By way of further background, and with a nod to Claus's comment...
Is this a serious project to help
you get something done, or is it a learning project to help you learn a
new language? Or a bit of both?

....I would be approaching C fresh for this task, so the many cautions from you and other respondents about the level of care and additional expense involved with a C implementation are duly noted.

I somehow omitted from my original post here my lead motivation for considering C: SPEED.

I intend to return results to users of the string parsing service with mostkeystrokes, much like, say Google's Places API (demo here: https://developers.google.com/maps/documentation/javascript/examples/places-autocomplete)

Now...

from Eric
No programming language implementation I've ever encountered,
nor heard of, nor even imagined "can be extended to parse arbitrarily
any character encoding." Your problem statement is a specification
for vaporware.

....sorry, I meant character set, as in how native users type their language.. Does that change at all your verdict?

from Claus,
Since you haven't said much else about your project (e.g. is it an
end-user tool? a library? a daemon? all of the above?

....an end user tool, the program is hosted as a web API...
My biggest concerns with
using C for this task, based on what I know, are (1) I am not convinced
the extra performance you might get is worth the added complexity of
having to do manual memory management and having to worry about various
cross-platform issues that disappear or are lessened with other choices,

....I note your caution on the challenge of doing C well; I can choose my own platform for the web API...
It sounds like you want to eventually support many, many
languages. Presumably, you will not be writing the support for all of
them yourself. Depending on how you support extensibility, you may have
a more difficult time finding people who are experts in both a
particular human language and C.

....yes, I will think more about whether my envisioned approach to tokenizing strings (in English) based roughly on word boundaries (and not, yet, say,the more complex lexical categories that the Python NLTK can determine) islikely to be easily repeated to handle strings expressed in different character sets.

from Malcolm
if you want to use the Python Natural Language toolkit, use Python.
Whilst there's probably a way of calling it from C, you don't want to
mess about like that. Use the language the toolkit is designed for.
If you want to embed lots of toolkits written in different languages,
that's inherently not an easy thing to do, and there's no language which
won't cause problems.

....OK!

....and from David
If you are going to connect to Python code, use Python. While it is
certainly possible to mix Python and C code, it is a lot of effort
(although it's quite straightforward to get Python code to call
functions defined in an so/dll file).

....OK! (Maybe I'll ask about C + NLTK on the NLTK discussion board to see what has been the experience of any user there.)
if I were asked to pick a language for a project, and the only thing
I knew about it was that it involved string parsing, Python would be my
first choice. I would not even consider C unless it was forced upon me.

....alright, I'll start thinking of Python, though it would deprecate somewhat my aforementioned speed priority.

In sum, though I've here qualified my original post to add speed in processing as an important consideration for my web API string parser, I still take a slightly negative verdict, especially as regards my hope to leave a space for the inclusion of NLTK for lexical parsing -- that surprises me.

The weight of the responses moves me to consider just what level of responsiveness i) I can expect from searches on the data store using tokenized strings and ii) what 'headroom' I can add to that from the single threaded processing portion of the service's work, i.e., the parsing and mapping of tokenized strings to data queries.

Thanks again to all,

G
 
M

Malcolm McLean

The weight of the responses moves me to consider just what level
of responsiveness i) I can expect from searches on the data store using
tokenized strings and ii) what 'headroom' I can add to that from the
single threaded processing portion of the service's work, i.e., the
parsing and mapping of tokenized strings to data queries.
It all depends on your throughput.
A computer can do string handling of almost any complexity within just
noticeable time, that is to say, the user presses the return key, or whatever,
and accepts the result as instantaneous.
But it can't necessarily handle thousands of queries in just noticeable time.
That doesn't always mean thousands of users, it can also mean one user
submitting a file containing many lines for processing.
 
E

Eric Sosman

First, thanks to all for their thoughtful comments. By way of further background, and with a nod to Claus's comment...


...I would be approaching C fresh for this task, so the many cautions from you and other respondents about the level of care and additional expense involved with a C implementation are duly noted.

I somehow omitted from my original post here my lead motivation for considering C: SPEED.

I intend to return results to users of the string parsing service with most keystrokes, much like, say Google's Places API (demo here: https://developers.google.com/maps/documentation/javascript/examples/places-autocomplete)

Now...

from Eric


...sorry, I meant character set, as in how native users type their language. Does that change at all your verdict?

Sorry, but I still don't see anything that looks like a
specification. You seem to be choosing an implementation language
in advance of defining the problem to be solved -- or, at least, in
advance of describing the problem to those whose advice you seek.
from Claus,


...an end user tool, the program is hosted as a web API...

In that case, the speed of the implementation language is not
likely to be an issue. If hand-crafted assembly language coding
delivers the answer in 0.1 microseconds while an interpreted
scripting language takes 10, you can rejoice in a hundredfold
speedup. But if both answers then take 10 *milli*seconds to get
across a network and through an HTML parser and onto a screen ...
Well, the difference between 0.01001 seconds and 0.0100001 is
unlikely to be of much concern.

Write your application in whatever language and with whatever
tools give you the most assistance toward the more important goal
of getting the problem solved. Once you've got it working right,
then you can measure the performance. If it's adequate, you're
done. If not, maybe you can tweak it. And if tweaking doesn't
help and you need to rewrite part or all in a "faster" language,
the early work will not have been wasted: You'll have developed
and debugged your solution method, and recoding will be fairly
straightforward. As an old manager of mine used to say: "First
make it work, then make it robust, and only then make it fast."
 
B

BartC

I somehow omitted from my original post here my lead motivation for
considering C: SPEED.
...alright, I'll start thinking of Python, though it would deprecate
somewhat my aforementioned speed priority.

I once ran a string-processing benchmark that was faster in Python than in
C. IIRC due to the Python's use of counted strings, versus C's
zero-terminated ones. I suppose the C version could have been adapted to use
counted strings too, or a special library used, but that's an example of the
extra effort needed with C.
 
L

Les Cargill

BartC said:
I once ran a string-processing benchmark that was faster in Python than
in C. IIRC due to the Python's use of counted strings, versus C's
zero-terminated ones.

Yet the *opposite* should be true. There are lies, damned lies and
benchmarks.

No knock on Python, but counted strings should always be more overhead
than zero-terminated strings.
I suppose the C version could have been adapted to
use counted strings too, or a special library used, but that's an
example of the extra effort needed with C.

it's not extra, it's just different.
 
S

Seebs

Yet the *opposite* should be true.

Not really.
No knock on Python, but counted strings should always be more overhead
than zero-terminated strings.

Maybe more overhead, but less computation, since so many tasks
involve walking to the end of the string.

http://www.joelonsoftware.com/articles/fog0000000319.html

It turns out that, by the time you do all the extra work to keep track
of how much space you have allocated, and whether you need to allocate
more, counted strings can win.
it's not extra, it's just different.

It's extra in terms of programmer time, though. And if you need to spend
a whole heck of a lot of programmer time just to catch up with the "slower"
language...

-s
 
B

BartC

Les Cargill said:
Yet the *opposite* should be true. There are lies, damned lies and
benchmarks.

No knock on Python, but counted strings should always be more overhead
than zero-terminated strings.

Always? Take two identical million-character strings. Now add one extra
character to one, and pass both strings to the strcmp() function. It will
have to check a million pairs of characters before it can conclude whether
they're equal or not. With counted strings, it's only necessary to compare
two numbers.

(An unlikely scenario, as most strings will be short. Nevertheless my own
counted-string language can do this compare some 25,000 times faster than C.
Provided the strings are different lengths of course...)

Counted strings have other advantages: take one of those million-character
strings, and pass the first 10 characters of it to a function. With a zero
terminator, you'd have to copy the string first; with a counted string you
don't.
 
L

luser- -droog

As Stack Overflow suggested, there isn't a clear "yes/no" answer to this.

Any program can be written in any language. C doesn't have any facilities
for sophisticated string handling, however it can be extended very easily
with either your own or third party libraries, and ten it will usually have
better performance than competing languages, but be rather more fiddly to
use.

You can't easily toggle between encodings (I can't think off hand of any
language that allows this, but I'm sure someone has tried). But if you settle
on UTF8 you can handle pretty much any known character set whilst still
treating English as plain ascii, which is usually what you want.

To tokenise strings, you will need to pull an open source regular expression
parser from the web. Or you might be able to call one in a library provided
by your platform. (However then you can't easily port the program). It's
not sensible to try to write regular expression parser of your own.

I agree with everything you say until here. If the OP is
truly intent upon writing language parsers in C, writing
a RE parser would seem to be an appropriate first step.

Then move on up the Chomsky Hierarchy, perhaps using the
RE tools to implement the LL(0) tools.
 
M

Malcolm McLean

Always? Take two identical million-character strings. Now add one extra
character to one, and pass both strings to the strcmp() function. It will
have to check a million pairs of characters before it can conclude whether
they're equal or not. With counted strings, it's only necessary to compare
two numbers.
But if you're expecting million character strings, it probably makes sense
to wrap them up with control structures. Which can include a hash. So
you can detect for inequality by comparing the hashes. Often the chance of
a collision is so remote that it's acceptable to take hash equality as
a test for equality as well.
 
B

BartC

To tokenise strings, you will need to pull an open source regular
expression
parser from the web. Or you might be able to call one in a library
provided
by your platform.

Tokenising strings is fairly straightforward and needn't involve any special
tools. (In a compiler, the tokeniser is probably the easiest bit.) Using a
library (especially involving regular expressions) is probably overkill.
 
L

Les Cargill

BartC said:

*Very nearly* always. Lazy typing on my part.

My point is that it takes some context to know
which approach will be an improvement.

But I have little doubt that there will be cases in which
either will outperform the other. "I use 'C' because it's
faster" is pretty weak tea and is possibly a signal
of premature optimization. "I use 'C' because it fits
the runtime environment better" is much better reasoning.

Same for Python.
Take two identical million-character strings.

There are no *three thousand* character, single strings in the wild.
There are vanishingly few *one thousand* character strings in the wild.

Most are more like 80 characters.

Now, if you put an entire webpage or document into one string, I'd
consider that more complex object *than* a string and build
scaffolding to deal with it. For one, you'll need to unmarshal
several read() calls against a socket or file.

I say that; Python may well hide all that for you. Tcl does not;
you have to concatenate them.
Now add one extra
character to one, and pass both strings to the strcmp() function. It
will have to check a million pairs of characters before it can conclude
whether they're equal or not.

Nope. You will have to count some number which is asymptotic with
respect to one million, unless you always compare two identical strings.

strcmp() stops when the two strings don't agree any more.

And if you know you're comparing million-byte strings, use something
better than strcmp(). :)
With counted strings, it's only necessary
to compare two numbers.

1) Somebody still has to set the length byte to begin with.
2) If the strings are initialized from patches of memory,
it'll have to count this at least once.
3) If the string gets its contents from a file or socket, then you
still have to set the length.
(An unlikely scenario, as most strings will be short.
Yep.

Nevertheless my
own counted-string language can do this compare some 25,000 times faster
than C. Provided the strings are different lengths of course...)

Ah, but 'C' can also use counted strings. The principal difference is
that, sfaik, in Python you cannot do that.
Counted strings have other advantages: take one of those
million-character strings, and pass the first 10 characters of it to a
function. With a zero terminator, you'd have to copy the string first;
with a counted string you don't.

I'd rather pass a pointer to the whole thing, and have the function
know about the 10 character comparison/use limit.
 
E

Eric Sosman

Tokenising strings is fairly straightforward and needn't involve any
special tools. (In a compiler, the tokeniser is probably the easiest
bit.) Using a library (especially involving regular expressions) is
probably overkill.

The O.P. hasn't explained his problem in much detail, but
I get the impression he's interested in parsing natural language
strings. Tokenizing natural language is far from straightforward --
for example, consider some uses of the apostrophe:

- Possessive: "England's green and pleasant shore."

- Possessive without s: "Cyprus' Mediterranean climate."

- Plural possessive: "C programmers' blind spots."

- Contraction: "Why won't it compile?"

- Elision: "Nothin' doin'!"

- Quotation: "Stop calling me 'Shirley'!"

- Combinations of the above: "His exact words were 'It'd be
to everyone's benefit if you'd stop callin' me Shirley'."

- Mistakes: "Every dog has it's day."

For problems of this kind I suspect a regular expression
library would be underkill, not overkill.
 
M

Malcolm McLean

But I have little doubt that there will be cases in which
either will outperform the other. "I use 'C' because it's
faster" is pretty weak tea and is possibly a signal
of premature optimization. "I use 'C' because it fits
the runtime environment better" is much better reasoning.
You frequently know that the program is going to run into severe time
constraints, for instance if you're displaying 3D graphics for a video game,
you'll want to add more and more geometry until you hit the frame rate and
the programs starts skipping frames. The more triangles on screen, the
better it looks. So in cases like that, using C right from the word go, at
least in the processor-intensive parts of the program, is a sensible choice.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,884
Messages
2,569,953
Members
46,284
Latest member
TyrellKlim

Latest Threads

Top