Substrings and so on

Vicent · Jan 26, 2010

I posted this also in comp.lang.c++, so sorry for multiposting.

I would like to ask you about standard or usual ways to manage with
files or strings, specially when getting input data and writing output
from an algorithm.

I mean: which structures, data types or classes? Which standard ways
to read/write from/on files?

I've read some tutorials that deal with the standard C I/O and string
(string.h) libraries, but specially when managing strings, I am a bit
lost: Are there methods or functions to get substrings from a string,
or to take "spaces" ("blanks") away (a typical "wrap" function)??

About reading data from a text file, I think this is called "parsing".
Is there any "parsing" library???

Sorry if my questions are too naive, but I am a beginner.

Thank you very much in advance!

Tom St Denis · Jan 26, 2010

Sounds more like you have a comp.sci problem than a C problem, as in
learn how to manipulate data first then pick a language to express it.

Also, pick a single language and go with it. There is no C/C++ or
whatever. If you want to learn how to manipulate strings in C++
that's fine but at what I'm guessing is your level I'd stick to one or
another, specially since they're not related.

Tom

Flash Gordon · Jan 26, 2010

Vicent said:
I posted this also in comp.lang.c++, so sorry for multiposting.

If you are using C++ then I believe C++ provides a lot more
string facilities than C.

I would like to ask you about standard or usual ways to manage with
files or strings, specially when getting input data and writing output
from an algorithm.

I mean: which structures, data types or classes? Which standard ways
to read/write from/on files?

C only has one string data structure (which is not a type), and that is
the nul terminated string. For input/output you have the functions in
stdio.h

A number of people have string libraries which they have written
themselves, which use more complex structures. However, these are not
standard.

I've read some tutorials that deal with the standard C I/O and string
(string.h) libraries,

That's what you get.

but specially when managing strings, I am a bit
lost: Are there methods or functions to get substrings from a string,
or to take "spaces" ("blanks") away (a typical "wrap" function)??

Not in C. You either have to write your own or get a non-standard
library that someone else wrote.
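
For what it's worth, writing your own "trim" is only a few lines. The following is a minimal sketch, not a standard function: the name trim_in_place is made up for illustration, and it assumes the string lives in writable memory (not a string literal).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not part of the C library): removes leading and
   trailing whitespace from s in place and returns s. */
char *trim_in_place(char *s)
{
    char *start = s;
    size_t len;

    assert(s != NULL);

    /* Skip leading whitespace; the cast avoids undefined behaviour
       for negative char values. */
    while (*start && isspace((unsigned char)*start))
        ++start;

    /* Find the length without trailing whitespace. */
    len = strlen(start);
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        --len;

    /* Shift the kept characters to the front and terminate. */
    memmove(s, start, len);
    s[len] = '\0';
    return s;
}
```

memmove() is used rather than memcpy() because the source and destination ranges overlap.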

About reading data from a text file, I think this is called "parsing".
Is there any "parsing" library???

Reading the text and parsing it are different tasks. Reading is getting
it in to memory, parsing is breaking it apart in to useful chunks. Some
functions, e.g. fscanf, do both tasks. Generally, in my opinion, the best
way is often to read a line at a time in to memory and then parse the
line entirely in memory.

Sorry if my questions are too naive, but I am a beginner.

The first thing you need to do is decide which language you are trying
to learn. If it is C++ then the best answers are likely to be very
different and could include classes or templates or something else which
C does not have.

Ersek, Laszlo · Jan 26, 2010

I posted this also in comp.lang.c++, so sorry for multiposting.

I would like to ask you about standard or usual ways to manage with
files or strings, specially when getting input data and writing output
from an algorithm.

I mean: which structures, data types or classes? Which standard ways
to read/write from/on files?

For C++, I guess you'd look first at std::string and std::iostream.
Google them, or obtain a draft or an actual edition of the ISO C++
standard (ISO/IEC 14882). A draft might be available at the C++
Standards Committee's site:

http://www.open-std.org/jtc1/sc22/wg21/

Chapter 21, Strings library
Chapter 27, Input/output library

The Qt or Boost libraries may prove helpful as well.

http://doc.trolltech.com/4.6-snapshot/qstring.html
http://www.boost.org/doc/libs/1_41_0/libs/libraries.htm#String

For C: don't start with it. Low-level string manipulation is one of the
most error-prone tasks in general, leading to countless security
vulnerabilities.

I've read some tutorials that deal with the standard C I/O and string
(string.h) libraries, but specially when managing strings, I am a bit
lost: Are there methods or functions to get substrings from a string,
or to take "spaces" ("blanks") away (a typical "wrap" function)??

(That would be a typical "trim" function I guess.) In my opinion, the
"string interface" provided by standard C (or by versions of the Single
Unix Specification) is much lower-level than you'd need; definitely
not for a beginner with higher abstraction needs. I suggest you switch
to another language supporting high-level string manipulation (Perl,
Python, Ruby etc) or grab a strings library. A discussion on them
occurred on Reddit some time ago; several libraries were mentioned:

http://www.reddit.com/r/programming/comments/abh76/what_string_type_should_i_use_for_a_c_project

When posting to that topic, I stumbled upon the following comparison
page:

http://www.and.org/vstr/comparison

About reading data from a text file, I think this is called "parsing".
Is there any "parsing" library???

Especially in relation to parsing: don't start writing parsers in C. If
you must, stick to whole-line input (with bounded length) and regular
expressions. One such regex library is PCRE:

http://www.pcre.org/

But the Single Unix Specification defines a regex facility too.

http://www.opengroup.org/onlinepubs/007908775/xsh/regex.h.html

If you insist on consuming lines of arbitrary length, consider the
getline() GNU libc extension.

http://www.gnu.org/s/libc/manual/html_node/Line-Input.html#Line-Input

Localized low-level text processing (put very crudely: anything
non-ASCII) requires even more caution, so don't start with that either.
Some C and C++ libraries should support it transparently, though.

HTH,
lacos

Vicent · Jan 26, 2010

Sounds more like you have a comp.sci problem than a C problem, as in
learn how to manipulate data first then pick a language to express it.

Also, pick a single language and go with it. There is no C/C++ or
whatever. If you want to learn how to manipulate strings in C++
that's fine but at what I'm guessing is your level I'd stick to one or
another, specially since they're not related.

Tom

Tom,

Thank you for your answer.

I've chosen C++, because I need to program some algorithms and I think
it is a good choice for that purpose.

So, my problem is about how to read files in C++, I think.

Vicent · Jan 26, 2010

If you are using C++ then I believe C++ provides a lot more
string facilities than C.

Yes, I realize that...

A number of people have string libraries which they have written
themselves, which use more complex structures. However, these are not
standard.

OK. I didn't know that, although I was suspecting it.

Reading the text and parsing it are different tasks. Reading is getting
it in to memory, parsing is breaking it apart in to useful chunks. Some
functions, e.g. fscanf, do both tasks. Generally, in my opinion, the best
way is often to read a line at a time in to memory and then parse the
line entirely in memory.

Yes, that was my idea, in fact --First, I read a line. Then, I try to
get the information from that line into some variables in my
algorithm.

The first thing you need to do is decide which language you are trying
to learn. If it is C++ then the best answers are likely to be very
different and could include classes or templates or something else which
C does not have.

I guess I'll stay with C++.

Thank you!

Eric Sosman · Jan 26, 2010

[...]
I've chosen C++, because I need to program some algorithms and I think
it is a good choice for that purpose.

So, my problem is about how to read files in C++, I think.

Perhaps the kind people on the comp.lang.c++ forum
would be better able to assist you with that language.
Since the I/O features of C++ differ quite a lot from
those of C, and since I/O is what you're interested in ...

Vicent · Jan 26, 2010

For C++, I guess you'd look first at std::string and std::iostream.

Thank you! That's a point to start.

Google them, or obtain a draft or an actual edition of the ISO C++
standard (ISO/IEC 14882). A draft might be available at the C++
Standards Committee's site:

http://www.open-std.org/jtc1/sc22/wg21/

Chapter 21, Strings library
Chapter 27, Input/output library

That's a great link!! Thanks.

The Qt or Boost libraries may prove helpful as well.

http://doc.trolltech.com/4.6-snapsh...org/doc/libs/1_41_0/libs/libraries.htm#String

Good links, also.

For C: don't start with it. Low-level string manipulation is one of the
most error-prone tasks in general, leading to countless security
vulnerabilities.

OK. Everyone tells me to avoid using C-strings, so...

(That would be a typical "trim" function I guess.)

Yes, yes, sorry, I meant "trim", not "wrap". I miss those simple
"trim" functions at Visual Basic and PL/SQL Oracle...

In my opinion, the
"string interface" provided by standard C (or by versions of the Single
Unix Specification) is much lower-level than you'd need; definitely
not for a beginner with higher abstraction needs. I suggest you switch
to another language supporting high-level string manipulation (Perl,
Python, Ruby etc) or grab a strings library. A discussion on them
occurred on Reddit some time ago; several libraries were mentioned:

http://www.reddit.com/r/programming/comments/abh76/what_string_type_s...

When posting to that topic, I stumbled upon the following comparison
page:

http://www.and.org/vstr/comparison

OK, that's all very interesting. I see that other people had the same
problem before me!

Especially in relation to parsing: don't start writing parsers in C. If
you must, stick to whole-line input (with bounded length) and regular
expressions. One such regex library is PCRE:

http://www.pcre.org/

But the Single Unix Specification defines a regex facility too.

http://www.opengroup.org/onlinepubs/007908775/xsh/regex.h.html

If you insist on consuming lines of arbitrary length, consider the
getline() GNU libc extension.

http://www.gnu.org/s/libc/manual/html_node/Line-Input.html#Line-Input

Localized low-level text processing (put very crudely: anything
non-ASCII) requires even more caution, so don't start with that either.
Some C and C++ libraries should support it transparently, though.

What I exactly need to do is the following:

While there are still new lines:
(1) Get one line from a given text file.
(2) In that line, detect a "first" part and a "second part", which are
separated by a "=" symbol.
(3) Take away the possible "blanks" (like a "trim" function would do)
from those parts.
(4) Detect which variable in my program is being referred by the
"first part".
(5) Translate the second part (it is still a "string") into a number.

- About #1 : It can be done by means of standard I/O C libraries. I
guess that there are also ways to do it with C++ libraries.

- About #2 : It would be as simple as: detecting the position of "="
and then get two substrings. I don't understand why this step is so
difficult to perform in C!!!! I mean: there IS a C standard function
for getting the position of a character (it is "strchr"), but not a
function for substring (unless it is a substring that starts at
position 1, which can be done with "strncpy_s"). Is it easier at C++??

- About #3 : I would only need an equivalent of VB's "trim"
function... Is there anything like that at C++?

- About #4 : I can do this by using a "case" or an "if" statement. No
problem at all with this step, provided that "first part" has been
successfully extracted and trimmed.

- About #5 : I hope that a proper casting statement will be enough.

So, do you think that C++ std::string and std::iostream classes are
the right choice for me??

Thank you in advance for your feed-back!!!

Ersek, Laszlo · Jan 26, 2010

On 26 Jan 2010 15:20:07 +0100

I thought that was comp.lang.c...

Please elaborate.

Thanks,
lacos

santosh · Jan 26, 2010

Vicent said:
Thank you! That's a point to start.

That's a great link!! Thanks.

Good links, also.

OK. Everyone tells me to avoid using C-strings, so...

Yes, yes, sorry, I meant "trim", not "wrap". I miss those simple
"trim" functions at Visual Basic and PL/SQL Oracle...

OK, that's all very interesting. I see that other people had the same
problem before me!

Ersek, Laszlo · Jan 26, 2010

(5) Translate the second part (it is still a "string") into a number.

- About #5 : I hope that a proper casting statement will be enough.

Please read an introductory book or tutorial on C, preferably one not
contradicting the ISO C standard(s). I hope others will name such works.
Reddit had a similar discussion recently. I obviously can't vouch for
the pieces of advice given there.

http://www.reddit.com/r/programming/comments/au1fg/dear_proggit_what_book_would_you_recommend_for_a/

So, do you think that C++ std::string and std::iostream classes are
the right choice for me??

I don't know. For the stated purpose, in (not standard) C I'd likely use
fgets() with a 32,767 byte buffer, then call regexec() in order to
identify the trimmed parts via parenthesized subexpressions, then call
strtol() to convert the decimal sequence to a long int.
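
A rough sketch of that regexec() plus strtol() combination might look like the following. It assumes a POSIX system (regex.h is not ISO C), assumes the name contains neither blanks nor '=', and the function name parse_assignment and the exact pattern are illustrative choices rather than anything given in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <regex.h>   /* POSIX, not ISO C */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parses "name = 123" from line. Copies the
   trimmed name into name (at most name_size - 1 characters) and stores
   the number in *value. Returns 0 on success, -1 otherwise. */
int parse_assignment(const char *line, char *name, size_t name_size,
                     long *value)
{
    /* Subexpression 1 is the name, subexpression 2 the decimal digits. */
    static const char pattern[] =
        "^[[:space:]]*([^=[:space:]]+)[[:space:]]*="
        "[[:space:]]*([-+]?[0-9]+)[[:space:]]*$";
    regex_t re;
    regmatch_t m[3];
    size_t len;
    long v;
    int rc = -1;

    assert(line && name && value && name_size > 0);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 3, m, 0) == 0) {
        len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < name_size) {
            memcpy(name, line + m[1].rm_so, len);
            name[len] = '\0';
            errno = 0;
            v = strtol(line + m[2].rm_so, NULL, 10);
            if (errno == 0) {       /* ERANGE means the value overflowed */
                *value = v;
                rc = 0;
            }
        }
    }
    regfree(&re);
    return rc;
}
```

Compiling the pattern on every call keeps the sketch short; a real program would probably compile it once and reuse it.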

Cheers,
lacos

Ersek, Laszlo · Jan 26, 2010

On 26 Jan 2010 19:04:22 +0100

C is not perfect I know that, but saying "C: don't start with it" in a
newsgroup with this name, it sounds a bit strange to me. That's all.

Thanks for answering.

I love C (even though most of the time this love is unrequited). I
didn't intend to point out C's perceived "shortcomings" -- I hope not to
have an ego that big. I tried to signal that C (and especially
manipulation of character arrays for parsing purposes) might not be the
best choice for the *original poster*, following completely from what I
perceived to be the OP's understanding of C.

Someone advising against me operating a sawbench would be completely
justified. A sawbench is a wonderful tool. It's not the sawbench, it's
me. I should start with introductory woodworking lessons first.

(Yes, I just compared C to a sawbench, please forgive me. And for the
record, I can "operate" a hand saw.)

Cheers,
lacos

santosh · Jan 26, 2010

Vicent wrote:
[...]

What I exactly need to do is the following:

While there are still new lines:
(1) Get one line from a given text file.
(2) In that line, detect a "first" part and a "second part", which are
separated by a "=" symbol.
(3) Take away the possible "blanks" (like a "trim" function would do)
from those parts.
(4) Detect which variable in my program is being referred by the
"first part".
(5) Translate the second part (it is still a "string") into a number.

- About #1 : It can be done by means of standard I/O C libraries. I
guess that there are also ways to do it with C++ libraries.

Yes. For C, fgets() is the obvious choice, but if you want to read in
lines of arbitrary length, then you might have to write your own
function which uses dynamically allocated memory.
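
Such a function could be sketched as below. This is only one possible shape: the name read_line, the initial capacity of 64 and the doubling strategy are assumptions made for illustration, and embedded NUL bytes in the input are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one line of any length from fp.
   Returns a malloc'ed string without the trailing newline, or NULL at
   end of file or on allocation failure. The caller frees the result. */
char *read_line(FILE *fp)
{
    size_t cap = 64, len = 0;
    char *buf, *tmp;

    assert(fp != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    /* fgets() fills the free tail of the buffer; grow and repeat until
       a newline or end of file is seen. */
    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[len - 1] = '\0';
            return buf;
        }
        tmp = realloc(buf, cap * 2);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (len == 0) {          /* nothing read: end of file or error */
        free(buf);
        return NULL;
    }
    return buf;              /* last line had no newline */
}
```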

- About #2 : It would be as simple as: detecting the position of "="
and then get two substrings. I don't understand why this step is so
difficult to perform in C!!!! I mean: there IS a C standard function
for getting the position of a character (it is "strchr"), but not a
function for substring (unless it is a substring that starts at
position 1, which can be done with "strncpy_s"). Is it easier at C++??

Your point #2 is not clear. Do you simply need to locate the first
occurrence of a '=' character? For that purpose strchr() would be fine.

[...]

- About #5 : I hope that a proper casting statement will be enough.

At least for C, no. Casting is not appropriate. Depending on what type
of number the "string" represents (i.e., integer or real), you'll want
to use one of the strto*() family of functions, like strtol(),
strtoul() and strtod(), to name three.
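
As an illustration of why these functions are preferred over a cast, here is a minimal sketch of a checked conversion built on strtol(). The wrapper name parse_long and its decision to accept trailing blanks are assumptions, not something prescribed by the standard.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts the decimal string s to a long.
   Returns 0 and stores the result in *out on success; returns -1 if s
   has no digits, has trailing junk, or is out of range for long. */
int parse_long(const char *s, long *out)
{
    char *end;
    long v;

    assert(s != NULL && out != NULL);

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)               /* no digits were consumed */
        return -1;
    if (errno == ERANGE)        /* value does not fit in a long */
        return -1;
    while (isspace((unsigned char)*end))   /* allow trailing blanks */
        ++end;
    if (*end != '\0')           /* anything else is an error */
        return -1;
    *out = v;
    return 0;
}
```

The same pattern (clear errno, call, inspect the end pointer and errno) carries over to strtoul() and strtod().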

Here's a good online reference to Standard C library functions (among
others):

<http://www.dinkumware.com/manuals/>

[...]

santosh · Jan 26, 2010

Please read an introductory book or tutorial on C, preferably one not
contradicting the ISO C standard(s). I hope others will name such works.

[...]

One online tutorial for complete beginners might be the one by Steve
Summit:

<http://www.eskimo.com/~scs/cclass/cclass.html>

Since Mr. Summit was apparently involved in the standardisation
process of C90, one might trust his tutorial not to contradict
Standard C.

[...]

Robert Latest · Jan 26, 2010

Depends. It can be fun and educative. See below.

What I exactly need to do is the following:

While there are still new lines:
(1) Get one line from a given text file.

Use fgets() in a while loop.

(2) In that line, detect a "first" part and a "second part", which are
separated by a "=" symbol.
(3) Take away the possible "blanks" (like a "trim" function would do)
from those parts.

That's the fun part. You need a few simple loops to do this. I used to
do a lot of those string-walking exercises, so I just typed this into
the newsreader untested. It really helps you develop a sense of what
goes on behind the curtain.

for (p = buffer; *p && isspace(*p); ++p) ; /* skip initial WS */
first_part = p; /* save pointer */
for (; *p && *p != '='; ++p) ; /* find '=' */
for (q = p+1; *q && isspace(*q); ++q) ; /* skip more WS */
second_part = q; /* save pointer */
for (--p; isspace(*p); --p) ; /* skip trailing WS */
*(p+1) = 0; /* mark end of 1st */
for (p = second_part; *p; ++p) ; /* find \0 char */
for (--p; isspace(*p); --p) ; /* skip trailing WS */
*(p+1) = 0; /* mark end of 2nd */

now first_part and second_part should be nicely trimmed, NUL-terminated
C strings. This thing will probably segfault when fed invalid strings,
so some input validity checks are in order. This method can be driven to
the extreme; the nice thing is that everything happens in a single chunk
of memory ('buffer') which gets pointed into and peppered with zeroes.
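
One way to add the validity checks mentioned above is sketched here. It keeps the same idea of pointing into a single buffer and writing zeroes into it, but the function name split_key_value, its return convention, and the choice to accept an empty value are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: splits "key = value" in place. On success it
   returns 0 and points *key and *value at trimmed, NUL-terminated
   pieces inside line (the value may be empty). Returns -1 if there is
   no '=' or the key is empty. */
int split_key_value(char *line, char **key, char **value)
{
    char *eq, *p;

    assert(line && key && value);

    eq = strchr(line, '=');
    if (eq == NULL)
        return -1;

    /* Left part: skip leading blanks, then cut trailing blanks. */
    while (isspace((unsigned char)*line))
        ++line;
    p = eq;
    while (p > line && isspace((unsigned char)p[-1]))
        --p;
    if (p == line)                 /* empty key */
        return -1;
    *p = '\0';
    *key = line;

    /* Right part: same idea, starting just after the '='. */
    p = eq + 1;
    while (isspace((unsigned char)*p))
        ++p;
    *value = p;
    p += strlen(p);
    while (p > *value && isspace((unsigned char)p[-1]))
        --p;
    *p = '\0';
    return 0;
}
```

Because the backward scans are bounded by the start of each part, they cannot run off the front of the buffer on input such as "=5" or an all-blank line.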

If your first and second part can't contain whitespace, it boils down to
a sscanf() one-liner:

#include <stdio.h>
int main(void)
{
char *str = "abcd = 100 "; /* test string */
char first[20];
int second;
int r;

r = sscanf(str, " %[^ =] = %d", first, &second);
if (r == 2) {
printf("%s=%d\n", first, second);
} else {
fprintf(stderr, "Couldn't parse string (r=%d)\n", r);
}
return 0;
}

It would be wise to check for the position of the '=' sign first to make
sure that the buffer 'first' doesn't overflow.
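
Another way to guard the buffer, shown here only as a sketch, is to put a maximum field width in the conversion itself. The wrapper name scan_assignment is hypothetical, and the 20-byte buffer size is taken from the example above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parses "name = 123" where the name contains no
   blanks. name must have room for at least 20 bytes. Returns 0 on
   success, -1 if the line does not match. */
int scan_assignment(const char *line, char name[20], int *value)
{
    assert(line && name && value);
    /* %19[^ =] stores at most 19 characters plus the terminating NUL,
       so a long name cannot overrun the 20-byte buffer. */
    return sscanf(line, " %19[^ =] = %d", name, value) == 2 ? 0 : -1;
}
```

A name longer than 19 characters then fails to match the '=' that the format expects next, so the call reports an error instead of overflowing. Note that this still does not detect trailing junk after the number, nor a number too large for an int.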

(4) Detect which variable in my program is being referred by the
"first part".

A bsearch()-based solution comes to mind
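
A bsearch()-based lookup could be sketched roughly as follows. The struct param layout and the set_param name are invented for illustration, and the sketch assumes every configurable variable is a long.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One table entry: a parameter name and where its value is stored. */
struct param {
    const char *name;
    long *target;
};

/* Comparison callback for bsearch(): key is the name being looked up. */
static int compare_param(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct param *)elem)->name);
}

/* Hypothetical helper: stores value in the variable registered under
   name. table must be sorted by name (strcmp order), as bsearch()
   requires. Returns 0 on success, -1 if the name is unknown. */
int set_param(const struct param *table, size_t count,
              const char *name, long value)
{
    const struct param *hit;

    assert(table && name);
    hit = bsearch(name, table, count, sizeof table[0], compare_param);
    if (hit == NULL)
        return -1;
    *hit->target = value;
    return 0;
}
```

A table for three made-up variables would be written as { "iterations", &iterations }, { "seed", &seed }, { "width", &width }, in that (alphabetical) order. For a handful of names a plain chain of strcmp() tests works just as well.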

(5) Translate the second part (it is still a "string") into a number.

strtol(), or automatically done by sscanf()

- About #2 : It would be as simple as: detecting the position of "="
and then get two substrings. I don't understand why this step is so
difficult to perform in C!!!!

I mean: there IS a C standard function
for getting the position of a character (it is "strchr"), but not a
function for substring

strtok() can also be your friend. For index-based substrings, use
strdup() and pointer arithmetics. All one-liners.
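
Since strdup() comes from POSIX and is not part of C90 or C99, an index-based substring can also be sketched with malloc() and memcpy() alone. The function name substring and its clipping behaviour are assumptions for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'ed copy of the len characters
   of s starting at index start (0-based), clipped to the end of s.
   Returns NULL on allocation failure. The caller frees the result. */
char *substring(const char *s, size_t start, size_t len)
{
    size_t total;
    char *out;

    assert(s != NULL);
    total = strlen(s);
    if (start > total)
        start = total;
    if (len > total - start)
        len = total - start;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, s + start, len);   /* pointer arithmetic picks the start */
    out[len] = '\0';
    return out;
}
```

With strchr() supplying the position of '=', the two halves of a line are then substring(line, 0, pos) and substring(line, pos + 1, strlen(line)).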

So, do you think that C++ std::string and std::iostream classes are
the right choice for me??

It really depends on what the rest of your application does. If breaking
up a string into two parts overwhelms you complexity-wise, it probably
does very little.

That said, I nowadays greatly prefer Python over C for many things,
although I enjoy coding in C more. Especially when dealing with
undefined input, the necessary overhead of error-checking and -handling
in C (and C++) can be bothersome.

robert

Stefan Ram · Jan 26, 2010

Vicent said:
About reading data from a text file, I think this is called "parsing".

I am just teaching about binary trees in C. So I started with:

struct tree { struct tree * left; int value; struct tree * right; };

To print a tree:

void print( struct tree const * const tree )
{ if( tree ){ putchar( '(' ); print( tree->left );
putchar( '0' + tree->value ); print( tree->right ); putchar( ')' ); }}

(The code is simplified insofar as it assumes one-digit
numbers only.)

An example output is:

(((0)1(2))3(4))

For the tree

      3
     / \
    /   \
   1     4
  / \
 /   \
0     2

Now, how do we parse this in again?

Two steps:

1.) Write a grammar:

<number> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'.

<entry> ::= '(' <tree> <number> <tree> ')'.

<tree> ::= [<entry>].

2.) Write the parser in analogy with the grammar:

int number( void ){ return get( 1 )- '0'; }

struct tree * entry( void )
{ TREE left, right; int value;
get( '(' );
left = tree();
value = number();
right = tree();
get( ')' );
return newtree( left, value, right ); }

struct tree * tree( void )
{ return get( 0 )== '(' ? entry() : 0; }

This assumes a »get« function that returns the current
character from the source and advances to the next character
when called with any non-zero argument. The code is
simplified insofar as it does not handle any run-time errors.
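
To make the parser above self-contained, the two assumed helpers could be filled in along these lines. The post names get and newtree but does not show them, so their bodies here are guesses; the string-based source variable and the parse_tree entry point are additions made purely for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct tree { struct tree *left; int value; struct tree *right; };

static const char *source;      /* assumed input: a string being parsed */

/* Returns the current character; advances past it if advance != 0. */
static int get(int advance)
{
    int c = (unsigned char)*source;
    if (advance && c != '\0')
        ++source;
    return c;
}

/* Allocates a node; returns NULL if malloc fails. */
static struct tree *newtree(struct tree *left, int value, struct tree *right)
{
    struct tree *t = malloc(sizeof *t);
    if (t != NULL) {
        t->left = left;
        t->value = value;
        t->right = right;
    }
    return t;
}

/* entry() and tree() call each other, so tree() is declared first. */
static struct tree *tree(void);

static int number(void) { return get(1) - '0'; }

static struct tree *entry(void)
{
    struct tree *left, *right;
    int value;
    get('(');
    left = tree();
    value = number();
    right = tree();
    get(')');
    return newtree(left, value, right);
}

static struct tree *tree(void)
{
    return get(0) == '(' ? entry() : 0;
}

/* Parses text such as "(((0)1(2))3(4))" into a tree. */
struct tree *parse_tree(const char *text)
{
    assert(text != NULL);
    source = text;
    return tree();
}
```

As in the post, malformed input is not diagnosed; the sketch only shows how the grammar maps onto the three functions.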

Thus, we are now able to round-trip serialize (write) and
de-serialize (read) binary trees with essentially 14 lines
of C code.

.------------------------------------------------------.
| Now, observe that during the whole serialization and |
| de-serialization we create and process strings of |
| symbols, but never actually build a 0-terminated |
| C-string in memory! |
'------------------------------------------------------'

I thought it would be nice if a tree in the source code
also would look like a tree. So the above tree

      3
     / \
    /   \
   1     4
  / \
 /   \
0     2

is being defined using:

extern struct tree t1, t0, t2, t4;      struct tree t3 =
                                        { &t1, 3, &t4 },

                      t1 ={ &t0, 1, &t2 },       t4 ={ 0, 4, 0 },

            t0 ={ 0, 0, 0 },      t2 ={ 0, 2, 0 };

Stefan Ram · Jan 26, 2010

{ TREE left, right; int value;

Oops, this should read:

{ struct tree *left, *right; int value;

Ersek, Laszlo · Jan 26, 2010

(Not to contradict, but to complement.)

If your first and second part can't contain whitespace, it boils down to
a sscanf() one-liner:

#include <stdio.h>
int main(void)
{
char *str = "abcd = 100 "; /* test string */
char first[20];
int second;
int r;

r = sscanf(str, " %[^ =] = %d", first, &second);
if (r == 2) {
printf("%s=%d\n", first, second);
} else {
fprintf(stderr, "Couldn't parse string (r=%d)\n", r);
}
return 0;
}

It would be wise to check for the position of the '=' sign first to make
sure that the buffer 'first' doesn't overflow.
[...]

(5) Translate the second part (it is still a "string") into a number.

strtol(), or automatically done by sscanf()

abcd=99999999999999999999999999999999999999999999999999999999

%d -> implementation-defined behavior ("signed overflow")

%u -> silent truncation ("unsigned overflow")

%*d -> assignment suppressed, not applicable here

%9ld -> file position indicator will advance until after the ninth nine
(I think), the stored long int value (999,999,999) won't reflect the
actual decimal string, a matching failure will follow only in the next
cycle. Full range of long int not available to decimal strings.
Magnitude of smallest negative value is about one tenth of the greatest
positive value.

strtol() is better.

When writing my previous post in the thread, I've tried to create a
scanf() format string that (a) relies only on completely defined
behavior, (b) is correct: parses what the OP needs (pre-set limits on
the lengths of the trimmed parts are allowed), (c) is complete: refuses
anything else. I gave up after a while and decided to wait for other
submissions and try to break them, or if I can't, learn from them.

Cheers,
lacos

Ersek, Laszlo · Jan 26, 2010

abcd=99999999999999999999999999999999999999999999999999999999

%d -> implementation-defined behavior ("signed overflow")

I apologize, that would hold for a conversion from eg. unsigned int; the
fscanf() spec says (C99 7.19.6.2 The fscanf function, p10):

----v----
Unless assignment suppression was indicated by a *, the result of the
conversion is placed in the object pointed to by the first argument
following the format argument that has not already received a conversion
result. If this object does not have an appropriate type, or if the
result of the conversion cannot be represented in the object, the
behavior is undefined.
----^----

See also

http://groups.google.com/group/comp.lang.c.moderated/msg/700a797a716cf74a

Cheers,
lacos

Ersek, Laszlo · Jan 26, 2010

One online tutorial for complete beginners might be the one by Steve
Summit:

<http://www.eskimo.com/~scs/cclass/cclass.html>

Since Mr. Summit was apparently involved in the standardisation
process of C90, one might trust his tutorial not to contradict
Standard C.

Bookmarked, thank you!
lacos
